Classes of Inference
Inference falls broadly into three classes: variational inference, Monte Carlo, and exact inference. We highlight how to use inference algorithms from each class.
As an example, we assume a mixture model with latent mixture assignments z, latent cluster means beta, and observations x: \[p(\mathbf{x}, \mathbf{z}, \beta) = \text{Normal}(\mathbf{x} \mid \beta_{\mathbf{z}}, \mathbf{I}) ~ \text{Categorical}(\mathbf{z}\mid \pi) ~ \text{Normal}(\beta\mid \mathbf{0}, \mathbf{I}).\]
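A minimal sketch of this model in Edward follows; the sizes N, K, D, the uniform mixing proportions, and the placeholder data x_train are assumptions added for illustration and are reused by the snippets below.
import edward as ed
import numpy as np
import tensorflow as tf
from edward.models import Categorical, Normal

N = 500  # number of data points (assumed)
K = 3    # number of mixture components (assumed)
D = 2    # data dimensionality (assumed)

beta = Normal(loc=tf.zeros([K, D]), scale=tf.ones([K, D]))  # cluster means
z = Categorical(logits=tf.zeros([N, K]))                    # uniform mixture assignments
x = Normal(loc=tf.gather(beta, z), scale=tf.ones([N, D]))   # observations

x_train = np.random.randn(N, D).astype(np.float32)  # placeholder data for illustration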
Variational Inference
In variational inference, the idea is to posit a family of approximating distributions and to find the closest member in the family to the posterior (Jordan, Ghahramani, Jaakkola, & Saul, 1999). We write an approximating family, \[\begin{aligned} q(\beta;\mu,\sigma) &= \text{Normal}(\beta; \mu,\sigma), \\[1.5ex] q(\mathbf{z};\pi) &= \text{Categorical}(\mathbf{z};\pi),\end{aligned}\] using TensorFlow variables to represent its parameters ($\lambda=\{\pi,\mu,\sigma\}$).
from edward.models import Categorical, Normal
qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))
inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_train})
Given an objective function, variational inference optimizes the family with respect to the tf.Variables.
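For instance, ed.KLqp is one such algorithm; as a minimal sketch (reusing qbeta, qz, and x_train from above), calling run optimizes the variational parameters.
# KLqp minimizes a divergence-based objective by stochastic optimization
# of the tf.Variables inside qbeta and qz.
inference = ed.KLqp({beta: qbeta, z: qz}, data={x: x_train})
inference.run(n_iter=1000)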
Specific variational inference algorithms inherit from the VariationalInference class to define their own methods, such as a loss function and gradient. For example, we represent MAP estimation with an approximating family (qbeta and qz) of PointMass random variables, i.e., with all probability mass concentrated at a point.
from edward.models import PointMass
qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = PointMass(params=tf.Variable(tf.zeros(N)))
inference = ed.MAP({beta: qbeta, z: qz}, data={x: x_train})
MAP inherits from VariationalInference and defines a loss function and update rules; it uses existing optimizers inside TensorFlow.
Monte Carlo
Monte Carlo approximates the posterior using samples (Robert & Casella, 1999). Monte Carlo can be viewed as inference where the approximating family is an empirical distribution, \[\begin{aligned} q(\beta; \{\beta^{(t)}\}) &= \frac{1}{T}\sum_{t=1}^T \delta(\beta, \beta^{(t)}), \\[1.5ex] q(\mathbf{z}; \{\mathbf{z}^{(t)}\}) &= \frac{1}{T}\sum_{t=1}^T \delta(\mathbf{z}, \mathbf{z}^{(t)}).\end{aligned}\] Its parameters are the samples, $\lambda=\{\beta^{(t)},\mathbf{z}^{(t)}\}$.
from edward.models import Empirical
T = 10000 # number of samples
qbeta = Empirical(params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(params=tf.Variable(tf.zeros([T, N])))
inference = ed.MonteCarlo({beta: qbeta, z: qz}, data={x: x_train})
Monte Carlo algorithms proceed by updating one sample $(\beta^{(t)},\mathbf{z}^{(t)})$ at a time in the empirical approximation. Markov chain Monte Carlo does this sequentially, updating the current sample (index $t$ of the tf.Variables) conditional on the last sample (index $t-1$ of the tf.Variables). Specific Monte Carlo samplers determine the update rules; they can use gradients, as in Hamiltonian Monte Carlo (Neal, 2011), or graph structure, as in sequential Monte Carlo (Doucet, De Freitas, & Gordon, 2001).
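For instance, ed.HMC updates the empirical approximation using Hamiltonian dynamics. The following is only a sketch under the assumptions above: z is discrete and has no gradients, so the sketch infers beta alone; in practice one would marginalize z out of the model or update it with a separate sampler.
# Hamiltonian Monte Carlo over the continuous latent variable beta,
# reusing the Empirical approximation qbeta defined above.
inference = ed.HMC({beta: qbeta}, data={x: x_train})
inference.run(step_size=0.01, n_steps=10)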
Non-Bayesian Methods
As a library for probabilistic modeling (not necessarily Bayesian modeling), Edward is agnostic to the paradigm for inference. This means Edward can use frequentist (population-based) inference, strict point estimation, and alternative foundations for parameter uncertainty.
For example, Edward supports non-Bayesian methods such as generative adversarial networks (GANs) (Goodfellow et al., 2014). For more details, see the GAN tutorial.
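As a rough sketch of the pattern that tutorial describes (the network architectures, sizes, and names below are illustrative assumptions, not Edward's API beyond ed.GANInference itself): a generator maps noise to data, and ed.GANInference pairs the generated data with observations and a user-supplied discriminator.
M = 128       # batch size (assumed)
d_noise = 10  # noise dimension (assumed)

def generative_network(eps):
  # Illustrative architecture: map noise to fake data of dimension D.
  h = tf.layers.dense(eps, 64, activation=tf.nn.relu)
  return tf.layers.dense(h, D)

def discriminative_network(x):
  # Illustrative architecture: map data to a real-valued logit of "real".
  h = tf.layers.dense(x, 64, activation=tf.nn.relu)
  return tf.layers.dense(h, 1)

eps = Normal(loc=tf.zeros([M, d_noise]), scale=tf.ones([M, d_noise]))
x_fake = generative_network(eps)
x_ph = tf.placeholder(tf.float32, [M, D])

inference = ed.GANInference(data={x_fake: x_ph},
                            discriminator=discriminative_network)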
In general, we think opening the door to non-Bayesian approaches is a crucial feature for probabilistic programming. It lets advances in other fields such as deep learning be complementary: all of them are in service of probabilistic models, so it makes sense to combine efforts.
Exact Inference
In order to uncover conjugacy relationships between random variables (if they exist), we use symbolic algebra on nodes in the computational graph. Users can then integrate out variables to automatically derive classical Gibbs (Gelfand & Smith, 1990), mean-field updates (Bishop, 2006), and exact inference.
For example, we can calculate a conjugate posterior analytically by using the ed.complete_conditional function:
import edward as ed
import numpy as np
from edward.models import Bernoulli, Beta

# Beta-Bernoulli model
pi = Beta(1.0, 1.0)
x = Bernoulli(probs=pi, sample_shape=10)
# Beta posterior; it conditions on the sample tensor associated to x
pi_cond = ed.complete_conditional(pi)
# Generate samples from p(pi | x = NumPy array)
sess = ed.get_session()
sess.run(pi_cond, {x: np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 1])})
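For this data (two ones out of ten), Beta-Bernoulli conjugacy gives the analytic posterior $\text{Beta}(1 + 2, 1 + 8) = \text{Beta}(3, 9)$, which is the distribution that pi_cond follows when x is fed that array.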
References
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer New York.
Doucet, A., De Freitas, N., & Gordon, N. (2001). An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice (pp. 3–14). Springer.
Gelfand, A. E., & Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410), 398–409.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Neural Information Processing Systems.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37(2), 183–233.
Neal, R. M. (2011). MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo.
Robert, C. P., & Casella, G. (1999). Monte Carlo statistical methods. Springer.