当前位置: 首页 > 文档资料 > Edward 中文文档 >

KL(q||p) Minimization

优质
小牛编辑
130浏览
2023-12-01

$(\text{KL}(q\|p))$ Minimization

One form of variational inference minimizes the Kullback-Leibler divergence from $(q(\mathbf{z}\;;\;\lambda))$ to $(p(\mathbf{z} \mid \mathbf{x}))$, \[\begin{aligned} \lambda^* &= \arg\min_\lambda \text{KL}( q(\mathbf{z}\;;\;\lambda) \;\|\; p(\mathbf{z} \mid \mathbf{x}) )\\ &= \arg\min_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log q(\mathbf{z}\;;\;\lambda) - \log p(\mathbf{z} \mid \mathbf{x}) \big].\end{aligned}\] The KL divergence is a non-symmetric, information theoretic measure of similarity between two probability distributions (Hinton & Camp, 1993; Jordan, Ghahramani, Jaakkola, & Saul, 1999; Waterhouse, MacKay, & Robinson, 1996).

The Evidence Lower Bound

The above optimization problem is intractable because it directly depends on the posterior $(p(\mathbf{z} \mid \mathbf{x}))$. To tackle this, consider the property \[\begin{aligned} \log p(\mathbf{x}) &= \text{KL}( q(\mathbf{z}\;;\;\lambda) \;\|\; p(\mathbf{z} \mid \mathbf{x}) )\\ &\quad+\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big]\end{aligned}\] where the left hand side is the logarithm of the marginal likelihood $(p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \text{d}\mathbf{z})$, also known as the model evidence. (Try deriving this using Bayes’ rule!)

The evidence is a constant with respect to the variational parameters $(\lambda)$, so we can minimize $(\text{KL}(q\|p))$ by instead maximizing the Evidence Lower BOund, \[\begin{aligned} \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big].\end{aligned}\] In the ELBO, both $(p(\mathbf{x}, \mathbf{z}))$ and $(q(\mathbf{z}\;;\;\lambda))$ are tractable. The optimization problem we seek to solve becomes \[\begin{aligned} \lambda^* &= \arg \max_\lambda \text{ELBO}(\lambda).\end{aligned}\] As per its name, the ELBO is a lower bound on the evidence, and optimizing it tries to maximize the probability of observing the data. What does maximizing the ELBO do? Splitting the ELBO reveals a trade-off \[\begin{aligned} \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\mathbf{z} \;;\; \lambda)}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z} \;;\; \lambda)}[\log q(\mathbf{z}\;;\;\lambda)],\end{aligned}\] where the first term represents an energy and the second term (including the minus sign) represents the entropy of $(q)$. The energy encourages $(q)$ to focus probability mass where the model puts high probability, $(p(\mathbf{x}, \mathbf{z}))$. The entropy encourages $(q)$ to spread probability mass to avoid concentrating to one location.

Edward uses two generic strategies to obtain gradients for optimization.

  • Score function gradient;
  • Reparameterization gradient.

Score function gradient

Gradient descent is a standard approach for optimizing complicated objectives like the ELBO. The idea is to calculate its gradient \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &= \nabla_\lambda\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big],\end{aligned}\] and update the current set of parameters proportional to the gradient.

The score function gradient estimator leverages a property of logarithms to write the gradient as \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\mathbf{z}\;;\;\lambda)} \big[ \nabla_\lambda \log q(\mathbf{z}\;;\;\lambda) \: \big( \log p(\mathbf{x}, \mathbf{z}) - \log q(\mathbf{z}\;;\;\lambda) \big) \big].\end{aligned}\] The gradient of the ELBO is an expectation over the variational model $(q(\mathbf{z}\;;\;\lambda))$; the only new ingredient it requires is the score function $(\nabla_\lambda \log q(\mathbf{z}\;;\;\lambda))$ (Paisley, Blei, & Jordan, 2012; Ranganath, Gerrish, & Blei, 2014).

We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:

  1. draw $(S)$ samples $(\{\mathbf{z}_s\}_1^S \sim q(\mathbf{z}\;;\;\lambda))$,
  2. evaluate the argument of the expectation using $(\{\mathbf{z}_s\}_1^S)$, and
  3. compute the empirical mean of the evaluated quantities.

A Monte Carlo estimate of the gradient is then \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &\approx\; \frac{1}{S} \sum_{s=1}^{S} \big[ \big( \log p(\mathbf{x}, \mathbf{z}_s) - \log q(\mathbf{z}_s\;;\;\lambda) \big) \: \nabla_\lambda \log q(\mathbf{z}_s\;;\;\lambda) \big].\end{aligned}\] This is an unbiased estimate of the actual gradient of the ELBO.

Reparameterization gradient

If the model has differentiable latent variables, then it is generally advantageous to leverage gradient information from the model in order to better traverse the optimization space. One approach to doing this is the reparameterization gradient (Kingma & Welling, 2014; Rezende, Mohamed, & Wierstra, 2014).

Some variational distributions $(q(\mathbf{z}\;;\;\lambda))$ admit useful reparameterizations. For example, we can reparameterize a normal distribution $(\mathbf{z} \sim \text{Normal}(\mu, \Sigma))$ as $(\mathbf{z} \sim \mu + L \text{Normal}(0, I))$ where $(\Sigma = LL^\top)$. In general, write this as \[\begin{aligned} \epsilon &\sim q(\epsilon)\\ \mathbf{z} &= \mathbf{z}(\epsilon \;;\; \lambda),\end{aligned}\] where $(\epsilon)$ is a random variable that does not depend on the variational parameters $(\lambda)$. The deterministic function $(\mathbf{z}(\cdot;\lambda))$ encapsulates the variational parameters instead, and following the process is equivalent to directly drawing $(\mathbf{z})$ from the original distribution.

The reparameterization gradient leverages this property of the variational distribution to write the gradient as \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &=\; \mathbb{E}_{q(\epsilon)} \big[ \nabla_\lambda \big( \log p(\mathbf{x}, \mathbf{z}(\epsilon \;;\; \lambda)) - \log q(\mathbf{z}(\epsilon \;;\; \lambda) \;;\;\lambda) \big) \big].\end{aligned}\] The gradient of the ELBO is an expectation over the base distribution $(q(\epsilon))$, and the gradient can be applied directly to the inner expression.

We can use Monte Carlo integration to obtain noisy estimates of both the ELBO and its gradient. The basic procedure follows these steps:

  1. draw $(S)$ samples $(\{\epsilon_s\}_1^S \sim q(\epsilon))$,
  2. evaluate the argument of the expectation using $(\{\epsilon_s\}_1^S)$, and
  3. compute the empirical mean of the evaluated quantities.

A Monte Carlo estimate of the gradient is then \[\begin{aligned} \nabla_\lambda\; \text{ELBO}(\lambda) &\approx\; \frac{1}{S} \sum_{s=1}^{S} \big[ \nabla_\lambda \big( \log p(\mathbf{x}, \mathbf{z}(\epsilon_s \;;\; \lambda)) - \log q(\mathbf{z}(\epsilon_s \;;\; \lambda) \;;\;\lambda) \big) \big].\end{aligned}\] This is an unbiased estimate of the actual gradient of the ELBO. Empirically, it exhibits lower variance than the score function gradient, leading to faster convergence in a large set of problems.

For more details, see the API as well as its implementation in Edward’s code base.

References

Hinton, G. E., & Camp, D. van. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Conference on learning theory. ACM.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.

Kingma, D., & Welling, M. (2014). Auto-encoding variational Bayes. In International conference on learning representations.

Paisley, J., Blei, D. M., & Jordan, M. I. (2012). Variational bayesian inference with stochastic search. In International conference on machine learning.

Ranganath, R., Gerrish, S., & Blei, D. (2014). Black box variational inference. In Artificial intelligence and statistics.

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML (pp. 1278–1286).

Waterhouse, S., MacKay, D., & Robinson, T. (1996). Bayesian methods for mixtures of experts. Advances in Neural Information Processing Systems, 351–357.