Learning Bayesian Networks With Hidden Variables

Learning Bayesian networks with hidden variables: To learn a Bayesian network with hidden variables, we apply the same insights that worked for mixtures of Gaussians. Figure 20.10 represents a situation in which there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not.

Figure 20.9 Graphs showing the log-likelihood of the data, L, as a function of the EM iteration. The horizontal line shows the log-likelihood according to the true model. (a) Graph for the Gaussian mixture model in Figure 20.8. (b) Graph for the Bayesian network in Figure 20.10(a).

Figure 20.10 (a) A mixture model for candy. The proportions of different flavors, wrappers, and numbers of holes depend on the bag, which is not observed. (b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the component C.

The distribution of candies in each bag is described by a naive Bayes model: the features are independent, given the bag, but the conditional probability distribution for each feature depends on the bag. The parameters are as follows:  is the prior probability that a candy comes from Bag 1; θF1 and θF2 are the probabilities that the flavor is cherry, given that the candy comes from Bag 1 and Bag 2 respectively; W1 and W2 give the probabilities that the wrapper is red; and θH1 and θH2 give the probabilities that the candy has a hole. Notice that the overall model is a mixture model. (In fact, we can also model the mixture of Gaussians as a Bayesian network, as shown in Figure 20.10(b).) In the figure, the bag is is a hidden variable because, once the candies have been mixed together, we no longer know which bag each candy came from. In such a case, can we recover the descriptions of the two bags by observing candies from the mixture?
Let us work through an iteration of EM for this problem. First, let’s look at the data.
We generated 1000 samples from a model whose true parameters are

That is, the candies are equally likely to come from either bag; the first is mostly cherries with red wrappers and holes; the second is mostly limes with green wrappers and no holes. The counts for the eight possible kinds of candy are as follows:

We start by initializing the parameters. For numerical simplicity, we will choose

First, let us work on the  parameter. In the fully observable case, we would estimate this directly from the observed counts of candies from bags 1 and 2. Because the bag is a hidden variable, we calculate the expected counts instead. The expected count ^N(Bag =1) is the sum, over all candies, of the probability that the candy came from bag 1:

These probabilities can be computed by any inference algorithm for Bayesian networks. For a naive Bayes model such as the one in our example, we can do the inference “by hand,” using Bayes’ rule and applying conditional independence:

(Notice that the normalizing constant also depends on the parameters.) Applying this formula to, say, the 273 red-wrapped cherry candies with holes, we get a contribution of

Continuing with the other seven kinds of candy in the table of counts, we obtain (1) =0:6124.