In this note, we take a look at the reparameterization trick, an idea that forms the basis of the Variational Autoencoder. My material here comes from the fantastic paper by Ruiz et al. [1]. The main idea is that the reparameterization trick [4, 5] gives us a lower-variance estimator than the one obtained from the score function gradient, but suffers because the class of distributions to which it can be applied is somewhat limited: Kingma and Welling [4] discuss Bernoulli and Gaussian variants. Nevertheless, it is possible to remedy this defect, as is done in papers [1], [2], [3]. The paper by Ruiz et al. is especially instructive. It walks us through the machinery involved in reparameterization and discusses variance reduction ([Casella and Berger], [Robert and Casella]) for variational inference (see also [BBVI]). I originally intended to write a much more complete summary of the papers [1], [2], [3] but ran out of juice, leaving it for the future as might transpire (or not). If I may insert inappropriately (also, perhaps to acknowledge a decade that just ended):

*What might have been is an abstraction*

*Remaining a perpetual possibility*

*Only in a world of speculation.*

*What might have been and what has been*

*Point to one end, which is always present.*

Notwithstanding such tangential musings, it behooves us to study [Casella and Berger]. The authors note that it is a 22-month-long course to learn statistics the hard way. Incidentally, the style of books bearing George Casella’s imprint, either actual or inspirational ([Robert and Casella], [Robert]), very much reminds me of [Bender and Orszag], with engaging quotes from Holmes and their straightforward slant on equations.

## The Variational Objective

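Recall that in variational problems we are interested in obtaining the ELBO. We briefly derive this using Jensen’s inequality ($\log \mathbb{E}[X] \geq \mathbb{E}[\log X]$):

$$
\log p(x) = \log \int p(x, z)\, dz = \log \mathbb{E}_{q(z; v)}\left[\frac{p(x, z)}{q(z; v)}\right] \geq \mathbb{E}_{q(z; v)}\left[\log p(x, z) - \log q(z; v)\right]
$$

Take the joint form presented in Ruiz et al. [1]:

$$
\mathcal{L}(v) = \mathbb{E}_{q(z; v)}\left[\log p(x, z) - \log q(z; v)\right]
$$

Or

$$
\mathcal{L}(v) = \mathbb{E}_{q(z; v)}\left[\log p(x, z)\right] + \mathbb{H}\left[q(z; v)\right]
$$

Or

$$
\mathcal{L}(v) = \mathbb{E}_{q(z; v)}\left[f(z)\right] + \mathbb{H}\left[q(z; v)\right], \qquad f(z) = \log p(x, z)
$$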

The equation written in this form is quite instructive. We want to optimize the expected log joint (plus the entropy of the surrogate) by varying the variational parameters $v$ through the surrogate distribution $q(z; v)$.
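As a quick numerical illustration of the bound, here is a minimal sketch in plain Python. The toy model ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$), the Gaussian surrogate $q(z; v) = \mathcal{N}(m, s^2)$, and all function names are assumptions made for illustration; none of them come from [1]. For this model the evidence is $\mathcal{N}(x; 0, 2)$ and the posterior is $\mathcal{N}(x/2, 1/2)$, so we can check that the Monte Carlo ELBO stays below $\log p(x)$ and becomes tight when $q$ is the exact posterior.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log density of a univariate normal N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_joint(x, z):
    """Toy model (an assumption): z ~ N(0, 1) and x | z ~ N(z, 1)."""
    return log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)


def elbo_estimate(x, m, s, num_samples=5000, seed=0):
    """Monte Carlo estimate of E_q[log p(x, z) - log q(z; v)] with q = N(m, s^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(m, s)
        total += log_joint(x, z) - log_normal_pdf(z, m, s)
    return total / num_samples


x_obs = 1.5
log_evidence = log_normal_pdf(x_obs, 0.0, math.sqrt(2.0))          # marginal of x is N(0, 2)
elbo_exact_q = elbo_estimate(x_obs, x_obs / 2.0, math.sqrt(0.5))   # q equals the true posterior
elbo_poor_q = elbo_estimate(x_obs, -1.0, 2.0)                      # a deliberately poor q

print(f"log p(x) = {log_evidence:.4f}")
print(f"ELBO with exact posterior = {elbo_exact_q:.4f}")
print(f"ELBO with poor q = {elbo_poor_q:.4f}")
```

When $q$ is the exact posterior, every sample of $\log p(x, z) - \log q(z; v)$ equals $\log p(x)$, so the estimate has no Monte Carlo noise; the gap for the poor $q$ is the KL divergence to the posterior.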

## Gradient estimation with score function

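Our aim is to obtain Monte Carlo estimates of the gradient (by taking expectations), but this is not possible as is, because the gradient acts on the very distribution $q(z; v)$ that we sample from. However, writing $f(z) = \log p(x, z)$ and using the identity $\nabla_v q = q \nabla_v \log q$, it is possible to arrange this as follows:

$$
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] = \int f(z)\, \nabla_v q(z; v)\, dz = \int f(z)\, q(z; v)\, \nabla_v \log q(z; v)\, dz = \mathbb{E}_{q(z; v)}\left[f(z)\, \nabla_v \log q(z; v)\right]
$$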

This is known as the score function estimator, or the REINFORCE gradient. We can view it as a discrete gradient that allows us to take gradients of non-differentiable functions by taking samples.
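To make the estimator concrete, here is a minimal sketch in plain Python. The choices are assumptions for illustration only: a Gaussian $q(z; v) = \mathcal{N}(\mu, \sigma^2)$, a stand-in integrand $f(z) = z^2$ in place of a real log joint, and the function name `score_function_grad_mu`. For this choice $\mathbb{E}_q[f(z)] = \mu^2 + \sigma^2$, so the exact gradient with respect to $\mu$ is $2\mu$, which gives us something to compare against.

```python
import random


def f(z):
    """Stand-in for the log joint (an assumption); only evaluations of f are needed."""
    return z * z


def score_function_grad_mu(mu, sigma, num_samples, rng):
    """Estimate d/dmu E_q[f(z)] as the average of f(z) * d/dmu log q(z; mu, sigma)."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)
        score = (z - mu) / sigma**2   # d/dmu log N(z; mu, sigma^2)
        total += f(z) * score
    return total / num_samples


rng = random.Random(0)
mu, sigma = 1.0, 1.0
estimate = score_function_grad_mu(mu, sigma, 200_000, rng)
exact = 2.0 * mu   # since E_q[z^2] = mu^2 + sigma^2

print(f"score function estimate = {estimate:.3f}, exact = {exact:.3f}")
```

Note that `f` is never differentiated here, which is why the approach carries over to non-differentiable integrands; the price is that many samples are needed before the average settles down.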

However, the estimate obtained tends to be noisy, and needs additional variance reduction techniques, e.g. Rao-Blackwellization and control variates.

## Gradient of expectation, expectation of gradient

A principal contribution of the VAE approach is that we have an alternative way to derive the estimator, one that is generally of lower variance than the score function method described above. That being said, it has the drawback that the method is not as widely applicable as the score function approach.

Recall that we would like to take gradients of the term containing the log joint in the ELBO, with $f(z) = \log p(x, z)$:

$$
\nabla_v \mathbb{E}_{q(z; v)}[f(z)]
$$

In this equation, we can take samples $z \sim q(z; v)$, but as the sampling distribution contains the variational parameters $v$ within it, we cannot simply move the differentiation with respect to $v$ inside the expectation, which is what we need in order to estimate gradients from samples. The reparameterization trick gets around this problem.

We derive an alternative estimator by transforming $z$ to a variable $\epsilon$ whose distribution does not depend on $v$, so that we can now take the gradient operator inside the expectation. For this to work, we rely on what Ruiz et al. [1] call a ‘standardization’ operation, $\epsilon = T^{-1}(z; v)$, to transform $q(z; v)$ into another distribution $q_\epsilon(\epsilon)$ independent of $v$ (and of other terms containing $v$). In the end, we want to have something like this:

$$
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] = \mathbb{E}_{q_\epsilon(\epsilon)}\left[\nabla_v f(T(\epsilon; v))\right]
$$

We have pushed the gradient operator inside the expectation, which allows us to draw samples of $\epsilon$ and take gradients through them.

That’s a lot of words. Let us derive this estimator to put it more concretely.

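We assume that there exists an invertible transformation $z = T(\epsilon; v)$, $\epsilon = T^{-1}(z; v)$, with pdfs $q(z; v)$ and $q_\epsilon(\epsilon; v)$. In some cases, it is possible to find a transformation such that the reparameterized distribution $q_\epsilon$ is independent of the variational parameters $v$. For example, consider the standard normal distribution:

$$
q_\epsilon(\epsilon) = \mathcal{N}(\epsilon; 0, 1) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\epsilon^2}{2}\right)
$$

We can consider this as the standardized version of a normal distribution $\mathcal{N}(z; \mu, \sigma^2)$, with $\epsilon = (z - \mu)/\sigma$ and $v = (\mu, \sigma)$.

Transform (pretend 1D for now):

$$
q_\epsilon(\epsilon; v)\, d\epsilon = q(z; v)\, dz \quad \Longrightarrow \quad q_\epsilon(\epsilon; v) = q(T(\epsilon; v); v) \left|\frac{\partial T(\epsilon; v)}{\partial \epsilon}\right|
$$

For more general cases, the differential is replaced by a Jacobian:

$$
q_\epsilon(\epsilon; v) = q(T(\epsilon; v); v)\, J(\epsilon, v), \qquad J(\epsilon, v) = \left|\det \nabla_\epsilon T(\epsilon; v)\right|
$$

After standardization, we lose dependence on $v$: $q_\epsilon(\epsilon; v) = q_\epsilon(\epsilon)$, so that

$$
\mathbb{E}_{q(z; v)}[f(z)] = \mathbb{E}_{q_\epsilon(\epsilon)}\left[f(T(\epsilon; v))\right]
$$

Now we can take the gradient and move it inside the expectation:

$$
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] = \mathbb{E}_{q_\epsilon(\epsilon)}\left[\nabla_v f(T(\epsilon; v))\right] = \mathbb{E}_{q_\epsilon(\epsilon)}\left[\nabla_z f(z)\big|_{z = T(\epsilon; v)}\, \nabla_v T(\epsilon; v)\right]
$$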
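The variance claim can be checked numerically with a small sketch in plain Python. As before, the setup is an assumption chosen for illustration: $q(z; v) = \mathcal{N}(\mu, \sigma^2)$, a differentiable stand-in $f(z) = z^2$, and the transformation $z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. Both single-sample estimators of the gradient with respect to $\mu$ are computed from the same draws, so their means and variances can be compared directly.

```python
import random


def f(z):
    """Stand-in for the log joint (an assumption)."""
    return z * z


def df_dz(z):
    """Derivative of f; the reparameterization gradient needs this."""
    return 2.0 * z


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    n = len(xs)
    m = sum(xs) / n
    return m, sum((x - m) ** 2 for x in xs) / (n - 1)


def single_sample_gradients(mu, sigma, rng):
    """Return (score function, reparameterization) estimates of d/dmu E_q[f(z)]."""
    eps = rng.gauss(0.0, 1.0)          # eps ~ N(0, 1), independent of (mu, sigma)
    z = mu + sigma * eps               # z = T(eps; v)
    score_grad = f(z) * (z - mu) / sigma**2
    reparam_grad = df_dz(z) * 1.0      # chain rule: dT/dmu = 1
    return score_grad, reparam_grad


rng = random.Random(0)
mu, sigma = 1.0, 1.0
samples = [single_sample_gradients(mu, sigma, rng) for _ in range(100_000)]
score_mean, score_var = mean_and_variance([s for s, _ in samples])
reparam_mean, reparam_var = mean_and_variance([r for _, r in samples])

print(f"score function:     mean = {score_mean:.3f}, variance = {score_var:.2f}")
print(f"reparameterization: mean = {reparam_mean:.3f}, variance = {reparam_var:.2f}")
```

Both averages target the exact gradient $2\mu = 2$. For this particular toy choice the per-sample variances work out analytically to 30 for the score function estimator and 4 for the reparameterization estimator; the size of the gap will differ for other integrands.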

## Reparameterization in more general cases

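The standardization procedure is now extended so that $\epsilon$ is only weakly dependent on $v$: for example, it can be standardized to have zero mean, so that its first moment does not depend on $v$. Nevertheless, $q_\epsilon(\epsilon; v)$ retains some dependence on the variational parameters. In this case, the gradient of the expectation has to be evaluated term by term, with the product rule and the chain rule:

$$
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] = \nabla_v \int q_\epsilon(\epsilon; v)\, f(T(\epsilon; v))\, d\epsilon = \underbrace{\mathbb{E}_{q_\epsilon(\epsilon; v)}\left[\nabla_z f(z)\big|_{z = T(\epsilon; v)}\, \nabla_v T(\epsilon; v)\right]}_{g^{\mathrm{rep}}} + \underbrace{\mathbb{E}_{q_\epsilon(\epsilon; v)}\left[f(T(\epsilon; v))\, \nabla_v \log q_\epsilon(\epsilon; v)\right]}_{g^{\mathrm{corr}}}
$$

As we can see, the first term is the regular reparameterization gradient. The second term has the form of a score function estimator, and acts as a correction term for this version of the standardization setup.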

In the case of the normal distribution, the second term vanishes.
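As a worked instance of the Gaussian case, take $v = (\mu, \sigma)$ and $z = T(\epsilon; v) = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ (the notation $g^{\mathrm{corr}}$ for the correction term and $f'$ for the derivative of $f$ is assumed here). Since the density of $\epsilon$ contains no $v$ at all, its log has zero gradient with respect to $v$, and only the reparameterization term survives:

```latex
\begin{aligned}
\nabla_v \log q_\epsilon(\epsilon) &= 0 \quad \Longrightarrow \quad g^{\mathrm{corr}} = 0, \\
\nabla_\mu \mathbb{E}_{q(z; v)}[f(z)] &= \mathbb{E}_{\mathcal{N}(\epsilon; 0, 1)}\left[ f'(\mu + \sigma \epsilon) \right], \\
\nabla_\sigma \mathbb{E}_{q(z; v)}[f(z)] &= \mathbb{E}_{\mathcal{N}(\epsilon; 0, 1)}\left[ f'(\mu + \sigma \epsilon)\, \epsilon \right].
\end{aligned}
```

The factors $1$ and $\epsilon$ are just $\partial T / \partial \mu$ and $\partial T / \partial \sigma$.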

## Interpretation

The terms are massaged so as to look like control variates, an idea used in Monte Carlo variance reduction. The authors note that while Rao-Blackwellization is not used in the paper, it is perfectly reasonable to use the setup in conjunction with it, as is done in Black Box Variational Inference [BBVI], where both Rao-Blackwellization and control variates are used to reduce the variance of the estimator.

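The basic idea of control variates is as follows ([Casella and Berger], Chapter 7 on “Point Estimation”). Given an estimator $W$ satisfying $\mathbb{E}_\theta[W] = \tau(\theta)$, we seek to find another estimator of lower variance, using an estimator $U$ with $\mathbb{E}_\theta[U] = 0$:

$$
\phi_a = W + aU
$$

The variance for this estimator is (Casella and Berger):

$$
\mathrm{Var}_\theta(\phi_a) = \mathrm{Var}_\theta(W) + 2a\, \mathrm{Cov}_\theta(W, U) + a^2\, \mathrm{Var}_\theta(U)
$$

We would get lower variance for $\phi_a$ than $W$ if we could find $a$ such that $2a\, \mathrm{Cov}_\theta(W, U) + a^2\, \mathrm{Var}_\theta(U) < 0$.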
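A small numerical sketch of the idea, in plain Python. The example is an assumption chosen for simplicity and has nothing to do with variational inference yet: we estimate $\int_0^1 e^t\, dt = e - 1$ from uniform draws, with $W = e^t$ as the unbiased estimator and $U = t - 1/2$ as the zero-mean companion. The coefficient $a$ is set to the minimizer of the variance expression, $a = -\mathrm{Cov}(W, U) / \mathrm{Var}(U)$, estimated here from the same samples (which introduces a small bias that a careful implementation would account for).

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Unbiased sample covariance; covariance(xs, xs) is the sample variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


rng = random.Random(0)
draws = [rng.random() for _ in range(50_000)]   # Uniform(0, 1) samples

w = [math.exp(t) for t in draws]   # W: unbiased for e - 1
u = [t - 0.5 for t in draws]       # U: known zero mean, correlated with W

# Minimizer of Var(W) + 2 a Cov(W, U) + a^2 Var(U) over a.
a_hat = -covariance(w, u) / covariance(u, u)

phi = [wi + a_hat * ui for wi, ui in zip(w, u)]   # phi_a = W + a U

var_w = covariance(w, w)
var_phi = covariance(phi, phi)
mean_phi = mean(phi)

print(f"a = {a_hat:.3f}")
print(f"Var(W) = {var_w:.4f}, Var(phi_a) = {var_phi:.4f}")
print(f"estimate = {mean_phi:.4f}, exact = {math.e - 1:.4f}")
```

Because $W$ and $U$ are positively correlated here, the chosen $a$ comes out negative, which is exactly the regime in which $2a\, \mathrm{Cov}(W, U) + a^2\, \mathrm{Var}(U)$ is below zero.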

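To get back to our variational estimator, rewrite it as follows for it to be interpretable as control variates (see [1]), where $h(\epsilon; v) = \nabla_v T(\epsilon; v)$, $u(\epsilon; v) = \nabla_v \log J(\epsilon, v)$ and $J(\epsilon, v) = \left|\det \nabla_\epsilon T(\epsilon; v)\right|$:

$$
\begin{aligned}
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] &= \mathbb{E}_{q(z; v)}\left[f(z)\, \nabla_v \log q(z; v)\right] \\
&\quad + \mathbb{E}_{q(z; v)}\left[\nabla_z f(z)\, h(T^{-1}(z; v); v)\right] \\
&\quad + \mathbb{E}_{q(z; v)}\left[f(z) \left(\nabla_z \log q(z; v)\, h(T^{-1}(z; v); v) + u(T^{-1}(z; v); v)\right)\right]
\end{aligned}
$$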

In the first line, we have the score function expression, which is modified in subsequent lines. It is not entirely clear to me how the expectation of the terms that correct the noisy gradient is zero, but I suppose we will take it in the spirit with which it was intended.
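One possible way to see it, offered as a sketch rather than as an argument quoted from [1]: write $g^{\mathrm{score}}$ for the score function expression and $c$ for the sum of all the expectations added to it (both symbols are assumed here). The rewritten expression is an exact identity for the gradient, and so is the score function expression on its own, which leaves no room for $c$ to be anything but zero:

```latex
\begin{aligned}
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] &= g^{\mathrm{score}} + c,
  \qquad g^{\mathrm{score}} = \mathbb{E}_{q(z; v)}\left[ f(z)\, \nabla_v \log q(z; v) \right], \\
\nabla_v \mathbb{E}_{q(z; v)}[f(z)] &= g^{\mathrm{score}}
  \qquad \text{(log-derivative identity)}, \\
\Longrightarrow \quad c &= 0 .
\end{aligned}
```

The individual sample-based terms inside $c$ are of course not zero; they are correlated with the noise in the score function samples, which is what lets them play the role of $aU$ in the control variate construction.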

## References

[1] Ruiz et al.: The Generalized Reparameterization Gradient: https://arxiv.org/abs/1610.02287

[2] Naesseth et al.: Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms: https://arxiv.org/abs/1610.05683

[3] Figurnov et al.: Implicit Reparameterization Gradients: https://arxiv.org/abs/1805.08498

[4] Kingma and Welling: Auto-Encoding Variational Bayes: https://arxiv.org/abs/1312.6114

[5] Rezende, Mohamed and Wierstra: Stochastic Backpropagation and Approximate Inference in Deep Generative Models: https://arxiv.org/abs/1401.4082

[BBVI] Ranganath, Gerrish and Blei: Black Box Variational Inference: https://arxiv.org/abs/1401.0118

[Casella and Berger] Statistical Inference: https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126

[Robert and Casella] Monte Carlo Statistical Methods: https://www.amazon.com/Monte-Statistical-Methods-Springer-Statistics/dp/1441919392

[Robert] The Bayesian Choice: https://www.amazon.com/Bayesian-Choice-Decision-Theoretic-Computational-Implementation/dp/0387715983

[Bender and Orszag] Advanced Mathematical Methods for Scientists and Engineers: https://www.amazon.com/Advanced-Mathematical-Methods-Scientists-Engineers/dp/0387989315