Composing Inferences
Compositionality is core to Edward’s design: it enables fine control of inference, where we can write inference as a collection of separate inference programs.
We outline how to write popular classes of compositional inferences using Edward: hybrid algorithms and message passing algorithms. We use the running example of a mixture model with latent mixture assignments z, latent cluster means beta, and observations x.
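For concreteness, here is a minimal sketch of that running example written as an Edward model. The constants N, K, and D and the imports are illustrative assumptions, chosen to mirror the message passing example later in this section.

import edward as ed
import tensorflow as tf
from edward.models import Categorical, Normal

N = 1000  # number of data points (assumed for illustration)
K = 5     # number of clusters (assumed for illustration)
D = 2     # data dimension (assumed for illustration)

beta = Normal(loc=tf.zeros([K, D]), scale=tf.ones([K, D]))  # latent cluster means
z = Categorical(logits=tf.zeros([N, K]))                    # latent mixture assignments
x = Normal(loc=tf.gather(beta, z), scale=tf.ones([N, D]))   # observations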
Hybrid algorithms
Hybrid algorithms leverage different inferences for each latent variable in the posterior. As an example, we demonstrate variational EM, with an approximate E-step over local variables and an M-step over global variables. We alternate with one update of each (Neal & Hinton, 1993).
from edward.models import Categorical, PointMass

# Point mass over the global cluster means; categorical over the local assignments.
qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

# E-step over the local z with beta held fixed; M-step over the global beta
# with z held fixed. x_data holds the observed data of shape [N, D].
inference_e = ed.VariationalInference({z: qz}, data={x: x_data, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_data, z: qz})
...
for _ in range(10000):
  inference_e.update()
  inference_m.update()
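Here ed.VariationalInference stands in for any concrete variational algorithm. As a sketch of one substitution, assuming the variables above and Edward’s usual manual-update pattern, KL(q || p) variational inference could serve as the approximate E-step:

# One concrete choice: KLqp for the E-step; the M-step is unchanged.
inference_e = ed.KLqp({z: qz}, data={x: x_data, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_data, z: qz})

inference_e.initialize()
inference_m.initialize()

sess = ed.get_session()
tf.global_variables_initializer().run()

for _ in range(10000):
  inference_e.update()
  inference_m.update()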
In data, we include bindings of prior latent variables (z or beta) to posterior latent variables (qz or qbeta). This performs conditional inference, where only a subset of the posterior is inferred while the rest are fixed using other inferences.
This extends to many algorithms: for example, exact EM for exponential families; contrastive divergence (Hinton, 2002); pseudo-marginal and ABC methods (Andrieu & Roberts, 2009); Gibbs sampling within variational inference (Wang & Blei, 2012); Laplace variational inference (Wang & Blei, 2013); and structured variational auto-encoders (Johnson, Duvenaud, Wiltschko, Datta, & Adams, 2016).
Message passing algorithms
Message passing algorithms operate on the posterior distribution using a collection of local inferences (Koller & Friedman, 2009). As an example, we demonstrate expectation propagation. We split the mixture model over two random variables x1 and x2, along with their latent mixture assignments z1 and z2.
import edward as ed
import tensorflow as tf
from edward.models import Categorical, Normal
N1 = 1000 # number of data points in first data set
N2 = 2000 # number of data points in second data set
D = 2 # data dimension
K = 5 # number of clusters
# MODEL
beta = Normal(loc=tf.zeros([K, D]), scale=tf.ones([K, D]))
z1 = Categorical(logits=tf.zeros([N1, K]))
z2 = Categorical(logits=tf.zeros([N2, K]))
x1 = Normal(loc=tf.gather(beta, z1), scale=tf.ones([N1, D]))
x2 = Normal(loc=tf.gather(beta, z2), scale=tf.ones([N2, D]))
# INFERENCE
qbeta = Normal(loc=tf.Variable(tf.zeros([K, D])),
               scale=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qz1 = Categorical(logits=tf.Variable(tf.zeros([N1, K])))
qz2 = Categorical(logits=tf.Variable(tf.zeros([N2, K])))
# Each local inference sees only its partition of the data (x1_train or x2_train)
# and shares the global factor qbeta.
inference_z1 = ed.KLpq({beta: qbeta, z1: qz1}, data={x1: x1_train})
inference_z2 = ed.KLpq({beta: qbeta, z2: qz2}, data={x2: x2_train})
...
for _ in range(10000):
  inference_z1.update()
  inference_z2.update()
We alternate updates for each local inference, where the global posterior factor $q(\beta)$ is shared across both inferences (Gelman et al., 2017).
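To make the sharing explicit, here is a sketch of the factorization this scheme exploits, read directly off the model above:

$p(\beta, z_1, z_2 \mid x_1, x_2) \propto p(\beta)\,\big[p(z_1)\,p(x_1 \mid z_1, \beta)\big]\,\big[p(z_2)\,p(x_2 \mid z_2, \beta)\big].$

Each local inference targets one bracketed factor: it sees only its partition’s data $x_k$ and local assignments $z_k$, while both inferences read and update the shared global factor $q(\beta)$.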
With TensorFlow’s distributed training, compositionality enables distributed message passing over a cluster with many workers. The computation can be further sped up with the use of GPUs via data and model parallelism.
This extends to many algorithms: for example, classical message passing, which performs exact local inferences; Gibbs sampling, which draws samples from conditionally conjugate inferences (Geman & Geman, 1984); expectation propagation, which locally minimizes $\text{KL}(p \| q)$ over exponential families (Minka, 2001); integrated nested Laplace approximation, which performs local Laplace approximations (Rue, Martino, & Chopin, 2009); and all the instantiations of EP-like algorithms in Gelman et al. (2017).
In the above, we perform local inferences split over individual random variables. At the moment, Edward does not support local inferences within a random variable itself: we cannot perform local inferences when all data points and their cluster memberships are represented by single random variables x and z rather than x1, x2, z1, and z2.
References
Andrieu, C., & Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, 697–725.
Gelman, A., Vehtari, A., Jylänki, P., Sivula, T., Tran, D., Sahai, S., … Robert, C. (2017). Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data. arXiv Preprint arXiv:1412.4869v2.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6), 721–741.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Johnson, M. J., Duvenaud, D., Wiltschko, A. B., Datta, S. R., & Adams, R. P. (2016). Composing graphical models with neural networks for structured representations and fast inference. arXiv Preprint arXiv:1603.06277.
Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT Press.
Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Uncertainty in artificial intelligence.
Neal, R. M., & Hinton, G. E. (1993). A new view of the EM algorithm that justifies incremental and other variants. In Learning in graphical models (pp. 355–368).
Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319–392.
Wang, C., & Blei, D. M. (2012). Truncation-free online variational inference for Bayesian nonparametric models. In Neural information processing systems.
Wang, C., & Blei, D. M. (2013). Variational inference in nonconjugate models. Journal of Machine Learning Research, 14, 1005–1031.