SGMCMC samplers

There are several SGMCMC samplers available, each with its own set of tradeoffs. Every sampler is built from a diffusion together with an estimator for the gradient. Below we briefly describe each diffusion and gradient estimator along with its pros and cons. You can see them in action in this notebook.

Diffusions:

Each of these diffusions requires a gradient estimate, denoted \(\hat{\nabla} \log \pi(x_n)\); this can be the standard estimator, the control variates estimator, or the SVRG estimator. Throughout we write \(\xi \sim \mathcal{N}(0,1)\).

SGLD:

Stochastic gradient Langevin dynamics (SGLD) is the most basic SGMCMC algorithm: it solves the overdamped Langevin diffusion using an Euler solver. The update is given by:

\[x_{n+1} = x_n + dt\hat{\nabla} \log \pi(x_n) + \sqrt{2dt}\xi\]
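As an illustration, a minimal NumPy sketch of this update (not the library's API; `grad_log_post` stands for whichever gradient estimator is plugged in, and `rng` is a NumPy random generator):

```python
import numpy as np

def sgld_step(x, grad_log_post, dt, rng):
    """One SGLD step: an Euler step of overdamped Langevin with a stochastic gradient."""
    xi = rng.standard_normal(x.shape)                 # xi ~ N(0, 1)
    return x + dt * grad_log_post(x) + np.sqrt(2 * dt) * xi
```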

Pros: simple to understand and code up, and fast to run.

Cons: does not sample as efficiently as the more sophisticated samplers below.

SGHMC:

Stochastic gradient HMC (SGHMC): Euler discretisation of underdamped Langevin dynamics with momentum resampling and stochastic gradients.

\[\begin{split}\begin{cases} x_{n+1} &= x_n + v_n \\ v_{n+1} &= v_n + dt\hat{\nabla} \log \pi(x_{n+1}) - \alpha v_n + \sqrt{2(\alpha - \hat{\beta_n})dt}\xi \end{cases}\end{split}\]

The tuning parameters in the update equation are \(dt\), \(\alpha\), and \(\hat{\beta}\). The original paper recommends a small value for \(\alpha\) (such as 0.01) and \(\hat{\beta}=0\). The number of leapfrog steps \(L\) must also be tuned.
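A rough NumPy sketch of one outer step, assuming the momentum is resampled as \(v \sim \mathcal{N}(0, dt\,I)\) at the start of each step (hypothetical names, not the library's API):

```python
import numpy as np

def sghmc_step(x, grad_log_post, dt, L, alpha, beta, rng):
    """One SGHMC step: resample the momentum, then run L inner updates with
    friction and injected noise (gradient evaluated at the new position)."""
    v = np.sqrt(dt) * rng.standard_normal(x.shape)    # assumed resampling: v ~ N(0, dt I)
    for _ in range(L):
        x = x + v
        xi = rng.standard_normal(x.shape)
        v = v + dt * grad_log_post(x) - alpha * v + np.sqrt(2 * (alpha - beta) * dt) * xi
    return x
```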

Pros: The momentum term can provide faster sampling compared to SGLD.

Cons: There are several additional hyperparameters to tune: the number of leapfrog steps \(L\) and the friction coefficient \(\alpha\).

SGNHT:

Stochastic gradient Nosé-Hoover thermostat (SGNHT): extends SGHMC with a third variable, the thermostat \(\alpha\), and uses an Euler solver.

\[\begin{split}\begin{cases} v_{n+1} &= v_n + dt\hat{\nabla} \log \pi(x_n) - \alpha_n v_n + \sqrt{2a dt}\xi \\ x_{n+1} &= x_n + v_{n+1} \\ \alpha_{n+1} &= \alpha_n + \frac{1}{D}v_{n+1}^Tv_{n+1} - dt \end{cases}\end{split}\]

Here \(D\) is the dimension of the parameter. The tunable parameters are \(dt\) and \(a\) (default \(a=0.01\)).
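A minimal NumPy sketch of this update (hypothetical names, not the library's API):

```python
import numpy as np

def sgnht_step(x, v, alpha, grad_log_post, dt, a, rng):
    """One SGNHT step (Euler solver): momentum, position, then thermostat update."""
    D = x.size
    xi = rng.standard_normal(x.shape)
    v = v + dt * grad_log_post(x) - alpha * v + np.sqrt(2 * a * dt) * xi
    x = x + v                                 # position update with the new momentum
    alpha = alpha + v @ v / D - dt            # thermostat tracks the kinetic energy
    return x, v, alpha
```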

Pros: The friction term adapts to the amount of noise in the gradient estimate.

Cons: The performance of the sampler is quite sensitive to the step size, as the Euler solver is not as accurate and stable as a splitting scheme such as BADODAB. In addition, the sampler takes time to adapt the friction term to match the noise of the gradients.

pSGLD:

Preconditioned SGLD: SGLD with an adaptive (diagonal) preconditioner; this is essentially RMSProp merged with SGLD. Note that we set \(\Gamma(x_n)=0\) as recommended in the paper.

\[\begin{split}\begin{cases} v_{n+1} &= \alpha v_n + (1-\alpha) \hat{\nabla} \log \pi(x_n)^2 \\ G_{n+1} &= 1/ (\sqrt{v_{n+1}}+\epsilon) \\ x_{n+1} &= x_n + \frac{dt}{2} \left(G_{n+1} \hat{\nabla} \log \pi(x_n) + \Gamma(x_{n}) \right) + \sqrt{dt G_{n+1}}\xi \end{cases}\end{split}\]
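A minimal NumPy sketch of this update, again with hypothetical names rather than the library's API:

```python
import numpy as np

def psgld_step(x, v, grad_log_post, dt, alpha, eps, rng):
    """One pSGLD step: RMSProp-style diagonal preconditioner applied to SGLD,
    with the Gamma(x) term set to zero."""
    g = grad_log_post(x)
    v = alpha * v + (1 - alpha) * g**2        # running average of squared gradients
    G = 1.0 / (np.sqrt(v) + eps)              # diagonal preconditioner G_{n+1}
    xi = rng.standard_normal(x.shape)
    x = x + 0.5 * dt * G * g + np.sqrt(dt * G) * xi
    return x, v
```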

Pros: The preconditioner can help with poorly scaled posteriors. This algorithm was designed for training deep neural networks.

Cons: Unlike RMSProp, the step size is not tuned automatically; a good value of \(dt\) is necessary for good performance.

BAOAB:

BAOAB: This splitting scheme is a numerical method for solving underdamped Langevin dynamics. It was originally derived for exact gradients.

\[\begin{split}\begin{cases} v_{n+1/2} &= v_n + \frac{dt}{2} \hat{\nabla} \log \pi(x_n) \\ x_{n+1/2} &= x_n + \frac{dt}{2}v_{n+1/2} \\ \tilde{v}_{n+1/2} &= e^{-\gamma dt}v_{n+1/2} + \sqrt{\tau(1 - e^{-2\gamma dt}) }\xi \\ x_{n+1} &= x_{n+1/2} + \frac{dt}{2}\tilde{v}_{n+1/2} \\ v_{n+1} &= \tilde{v}_{n+1/2} + \frac{dt}{2} \hat{\nabla} \log \pi(x_{n+1}) \\ \end{cases}\end{split}\]

The tuning parameters are the step size \(dt\), the friction coefficient \(\gamma\), and the temperature \(\tau\). By default we set \(\tau=1\).
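A minimal NumPy sketch of one BAOAB step (hypothetical names, not the library's API):

```python
import numpy as np

def baoab_step(x, v, grad_log_post, dt, gamma, tau, rng):
    """One BAOAB step: B (half kick), A (half drift), O (Ornstein-Uhlenbeck), A, B."""
    v = v + 0.5 * dt * grad_log_post(x)                    # B
    x = x + 0.5 * dt * v                                   # A
    xi = rng.standard_normal(x.shape)
    c = np.exp(-gamma * dt)
    v = c * v + np.sqrt(tau * (1 - c**2)) * xi             # O
    x = x + 0.5 * dt * v                                   # A
    v = v + 0.5 * dt * grad_log_post(x)                    # B
    return x, v
```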

BADODAB:

BADODAB scheme for SGNHT: This splitting scheme is a numerical method to solve the SGNHT equations:

\[\begin{split}\begin{cases} v_{n+1/2} &= v_n + \frac{dt}{2} \hat{\nabla} \log \pi(x_n) \\ x_{n+1/2} &= x_n + \frac{dt}{2}v_{n+1/2} \\ \alpha_{n+1/2} &= \alpha_n + \frac{dt}{2\mu} \left( v_{n+1/2}^Tv_{n+1/2} - D \right) \\ \text{if } \alpha_{n+1/2} \neq 0 : \tilde{v}_{n+1/2} &= e^{-\alpha_{n+1/2} dt}v_{n+1/2} + \sigma\sqrt{(1 - e^{-2\alpha_{n+1/2} dt})/ 2\alpha_{n+1/2} } \xi \\ \text{else }: \tilde{v}_{n+1/2} &= v_{n+1/2} + \sigma \sqrt{dt} \xi\\ \alpha_{n+1} &= \alpha_{n+1/2} + \frac{dt}{2\mu} \left( \tilde{v}_{n+1/2}^T \tilde{v}_{n+1/2} - D \right) \\ x_{n+1} &= x_{n+1/2} + \frac{dt}{2}\tilde{v}_{n+1/2} \\ v_{n+1} &= \tilde{v}_{n+1/2} + \frac{dt}{2} \hat{\nabla} \log \pi(x_{n+1}) \\ \end{cases}\end{split}\]

The tuning parameters are \(dt\) and \(a\) (the initial value of \(\alpha\) with default: \(a=0.01\)). The two other parameters are fixed: \(\mu=1\) and \(\sigma=1\).
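A minimal NumPy sketch of one BADODAB step, again with hypothetical names:

```python
import numpy as np

def badodab_step(x, v, alpha, grad_log_post, dt, mu, sigma, rng):
    """One BADODAB step: a splitting scheme for the SGNHT equations (B, A, D, O, D, A, B)."""
    D = x.size
    v = v + 0.5 * dt * grad_log_post(x)                    # B
    x = x + 0.5 * dt * v                                   # A
    alpha = alpha + 0.5 * dt / mu * (v @ v - D)            # D
    xi = rng.standard_normal(x.shape)
    if alpha != 0:                                         # O: exact OU update of the momentum
        c = np.exp(-alpha * dt)
        v = c * v + sigma * np.sqrt((1 - c**2) / (2 * alpha)) * xi
    else:
        v = v + sigma * np.sqrt(dt) * xi
    alpha = alpha + 0.5 * dt / mu * (v @ v - D)            # D
    x = x + 0.5 * dt * v                                   # A
    v = v + 0.5 * dt * grad_log_post(x)                    # B
    return x, v, alpha
```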

Pros: The friction term adapts to the amount of noise in the gradient estimate, and the splitting scheme is more accurate and stable than the Euler method in SGNHT. This allows a larger range of step sizes and smaller minibatches.

Cons: The sampler takes time to adapt the friction term to match the noise of the gradients.

Gradient estimators:

Each of these estimators can be plugged into one of the diffusions defined above.
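As a toy end-to-end illustration of this plug-and-play structure, the sketch below combines the SGLD update from above with the standard minibatch estimator described next. The model, dataset, and all names are made up for the example and are not the library's API:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)     # toy dataset: y_i ~ N(theta, 1)
N, batch_size, dt = len(data), 32, 1e-5

def grad_log_prior(theta):                           # N(0, 10^2) prior on theta
    return -theta / 100.0

def grad_log_lik(theta, y):                          # gradient of log N(y | theta, 1)
    return y - theta

theta, samples = np.zeros(1), []
for n in range(5000):
    # standard minibatch estimate of grad log pi(theta)
    idx = rng.choice(N, size=batch_size, replace=False)
    grad = grad_log_prior(theta) + (N / batch_size) * sum(grad_log_lik(theta, data[i]) for i in idx)
    # SGLD update (the diffusion defined above)
    theta = theta + dt * grad + np.sqrt(2 * dt) * rng.standard_normal(theta.shape)
    samples.append(theta.copy())
```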

Standard estimator:

This is simply the sample mean of the gradients over a minibatch of data (rescaled to be an unbiased estimate of the full-data gradient); it is the estimator used in the original SGLD paper.

Pros: Easy to understand and code up.

Cons: The variance becomes high as the minibatch size decreases, which results in poor-quality samples.
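A minimal sketch of one common form of this estimator, assuming hypothetical per-datum gradient functions `grad_log_prior` and `grad_log_lik` and an indexable dataset `data`:

```python
import numpy as np

def standard_grad_estimator(x, data, batch_size, grad_log_prior, grad_log_lik, rng):
    """Unbiased minibatch estimate of grad log pi(x): the gradient of the log-prior
    plus the sum of log-likelihood gradients over a random minibatch, rescaled by N/n."""
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)
    grad_lik = sum(grad_log_lik(x, data[i]) for i in idx)
    return grad_log_prior(x) + (N / batch_size) * grad_lik
```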

Control Variates:

SGLD with control variates: This estimator uses a centering value to lower the variance of the gradient estimator.

Pros: The gradient estimate is much more accurate for log-concave posteriors, at only a small added computational cost.

Cons: The gradient estimate loses accuracy for posteriors that are not log-concave. You also need to obtain the centering value (by optimising the posterior) before running the sampler.
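A minimal sketch under the same assumptions as above, where `x_hat` is the centering value and `full_loglik_grad_at_x_hat` is its full-data log-likelihood gradient, computed once before sampling:

```python
import numpy as np

def cv_grad_estimator(x, x_hat, full_loglik_grad_at_x_hat, data, batch_size,
                      grad_log_prior, grad_log_lik, rng):
    """Control-variates estimate of grad log pi(x): exact gradient at the centering
    value plus a rescaled minibatch correction evaluated at both x and x_hat."""
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)
    correction = sum(grad_log_lik(x, data[i]) - grad_log_lik(x_hat, data[i]) for i in idx)
    return grad_log_prior(x) + full_loglik_grad_at_x_hat + (N / batch_size) * correction
```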

SVRG:

SGLD with SVRG: The same control variates estimator but where the centering value is updated regularly.

Pros: As with control variates, the gradient estimate is much more accurate. Furthermore, there is no need to find a good centering value beforehand, as it is updated regularly. Finally, this gradient estimator works better for posteriors that are not log-concave.

Cons: The algorithm requires another tuning parameter: the rate at which the centering value is updated. This update can also be expensive, as it requires calculating the full-batch gradient.
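A minimal sketch along the same lines; `update_rate` is a hypothetical parameter controlling how often the centering value is refreshed:

```python
import numpy as np

def svrg_grad_estimator(x, centering_state, iteration, update_rate, data, batch_size,
                        grad_log_prior, grad_log_lik, rng):
    """SVRG estimate: the control-variates estimator, with the centering value and its
    full-batch log-likelihood gradient refreshed every `update_rate` iterations."""
    x_hat, full_grad = centering_state
    if iteration % update_rate == 0:
        x_hat = np.array(x, copy=True)
        full_grad = sum(grad_log_lik(x_hat, d) for d in data)   # full-batch pass
    N = len(data)
    idx = rng.choice(N, size=batch_size, replace=False)
    correction = sum(grad_log_lik(x, data[i]) - grad_log_lik(x_hat, data[i]) for i in idx)
    grad = grad_log_prior(x) + full_grad + (N / batch_size) * correction
    return grad, (x_hat, full_grad)
```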