Bayesian Parameter Learning
Maximum-likelihood learning gives rise to some very simple procedures, but it has some serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., θ = 1.0). Unless one’s hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. The Bayesian approach to parameter learning places a hypothesis prior over the possible values of the parameters and updates this distribution as data arrive.
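To make the contrast concrete, here is a minimal Python sketch (our own illustration, not from the text) comparing the two estimates after a single cherry candy, assuming the uniform prior introduced as beta[1, 1] below:

```python
# Sketch: maximum likelihood vs. Bayesian posterior mean after one candy,
# assuming a uniform beta[1, 1] prior (the choice of prior is illustrative).

cherry, lime = 1, 0          # observed counts: one cherry candy, no limes

# Maximum likelihood: fraction of cherries seen so far.
theta_ml = cherry / (cherry + lime)                    # = 1.0

# Bayesian posterior mean under beta[1, 1]: (a + cherry) / (a + b + N).
a, b = 1, 1                                            # uniform prior
theta_bayes = (a + cherry) / (a + b + cherry + lime)   # = 2/3

print(theta_ml, theta_bayes)   # 1.0 vs. 0.666...
```

The Bayesian estimate is pulled toward the prior mean, avoiding the extreme conclusion that the bag is all cherry.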
The candy example in Figure 20.2(a) has one parameter, θ: the probability that a randomly selected piece of candy is cherry flavored. In the Bayesian view, θ is the (unknown) value of a random variable Θ; the hypothesis prior is just the prior distribution P(Θ). Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry candies.
If the parameter can be any value between 0 and 1, then P(Θ) must be a continuous distribution that is nonzero only between 0 and 1 and that integrates to 1. The uniform density P(θ) = U[0, 1](θ) is one candidate. It turns out that the uniform density is a member of the family of beta distributions. Each beta distribution is defined by two hyperparameters a and b such that

beta[a, b](θ) = α θ^(a−1) (1 − θ)^(b−1)
for θ in the range [0, 1]. The normalization constant α depends on a and b. Figure 20.5 shows what the distribution looks like for various values of a and b. The mean value of the distribution is a/(a + b), so larger values of a suggest a belief that Θ is closer to 1 than to 0. Larger values of a + b make the distribution more peaked, suggesting greater certainty about the value of Θ. Thus, the beta family provides a useful range of possibilities for the hypothesis prior.
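The following short sketch, using scipy.stats (an assumed dependency), checks the two claims above: the mean of beta[a, b] is a/(a + b), and a larger a + b gives a more peaked, lower-variance distribution.

```python
# Sketch: how the beta hyperparameters a and b control mean and spread.
from scipy.stats import beta

for a, b in [(1, 1), (2, 2), (5, 5), (6, 2)]:
    d = beta(a, b)
    print(f"beta[{a}, {b}]: mean = {d.mean():.3f}, std = {d.std():.3f}")

# The first three all have mean 0.5 but shrinking spread as a + b grows;
# beta[6, 2] has mean 0.75, reflecting a belief that Theta is closer to 1.
```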
Besides its flexibility, the beta family has another wonderful property: if Θ has a prior beta[a, b], then, after a data point is observed, the posterior distribution for Θ is also a beta distribution. The beta family is called the conjugate prior for the family of distributions for a Boolean variable. Let’s see how this works. Suppose we observe a cherry candy; then

P(θ | D1 = cherry) = α P(D1 = cherry | θ) P(θ)
                   = α′ θ · beta[a, b](θ)
                   = α′ θ · θ^(a−1) (1 − θ)^(b−1)
                   = α′ θ^a (1 − θ)^(b−1)
                   = beta[a + 1, b](θ).
Thus, after seeing a cherry candy, we simply increment the a parameter to get the posterior; similarly, after seeing a lime candy, we increment the b parameter. Thus, we can view the a and b hyperparameters as virtual counts, in the sense that a prior beta[a, b] behaves exactly as if we had started out with a uniform prior beta[1, 1] and seen a - 1 actual cherry candies and b - 1 actual lime candies.
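A minimal sketch of this update rule (the function name is our own): each observation just bumps one hyperparameter, so the posterior after any data sequence is still a beta distribution.

```python
# Sketch: conjugate updating of a beta prior, one candy at a time.
def update(a, b, candy):
    """Return posterior hyperparameters after observing one candy."""
    return (a + 1, b) if candy == "cherry" else (a, b + 1)

a, b = 1, 1                                # uniform prior beta[1, 1]
for candy in ["cherry", "cherry", "lime", "cherry"]:
    a, b = update(a, b, candy)

print(a, b)              # beta[4, 2]: as if 3 cherries and 1 lime were seen
print(a / (a + b))       # posterior mean for theta = 0.666...
```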
By examining a sequence of beta distributions for increasing values of a and b, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter Θ changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence beta[3, 1], beta[6, 2], beta[30, 10]. Clearly, the distribution is converging to a narrow peak around the true value of Θ. For large data sets, then, Bayesian learning (at least in this case) converges to give the same results as maximum-likelihood learning.
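The convergence claim can be checked by simulation, as in this sketch (numpy is an assumed dependency; the seed and sample sizes are arbitrary choices): draw candies from a bag that is 75% cherry and watch the posterior mean approach 0.75 while its spread shrinks.

```python
# Sketch: posterior over theta narrowing around the true value 0.75.
import numpy as np

rng = np.random.default_rng(0)
a, b = 1, 1                                   # start from the uniform prior
for n in [4, 40, 400]:
    draws = rng.random(n) < 0.75              # True = cherry candy
    a += int(draws.sum())
    b += int(n - draws.sum())
    mean = a / (a + b)
    std = (a * b / ((a + b) ** 2 * (a + b + 1))) ** 0.5   # beta std. dev.
    print(f"after {a + b - 2} candies: mean = {mean:.3f}, std = {std:.3f}")
```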
The network in Figure 20.2(b) has three parameters, θ, θ1, and θ2, where θ1 is the probability of a red wrapper on a cherry candy and θ2 is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must cover all three parameters—that is, we need to specify P(Θ, Θ1, Θ2). Usually, we assume parameter independence:

P(Θ, Θ1, Θ2) = P(Θ) P(Θ1) P(Θ2).
With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive.
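The following sketch illustrates parameter independence in this network (the helper name and data are our own): Θ, Θ1, and Θ2 each keep their own beta posterior, and each (flavor, wrapper) observation updates exactly two of them.

```python
# Sketch: three independent beta posteriors for the wrapper network.
def observe(posteriors, flavor, wrapper):
    """Update the relevant beta[a, b] counts for one observed candy."""
    theta, theta1, theta2 = posteriors
    if flavor == "cherry":
        theta = (theta[0] + 1, theta[1])
        # The wrapper color of a cherry candy informs Theta_1 only.
        theta1 = (theta1[0] + 1, theta1[1]) if wrapper == "red" \
            else (theta1[0], theta1[1] + 1)
    else:
        theta = (theta[0], theta[1] + 1)
        # ...and that of a lime candy informs Theta_2 only.
        theta2 = (theta2[0] + 1, theta2[1]) if wrapper == "red" \
            else (theta2[0], theta2[1] + 1)
    return theta, theta1, theta2

posteriors = ((1, 1), (1, 1), (1, 1))      # uniform priors on all three
for flavor, wrapper in [("cherry", "red"), ("lime", "green"), ("cherry", "red")]:
    posteriors = observe(posteriors, flavor, wrapper)
print(posteriors)                          # ((3, 2), (3, 1), (1, 2))
```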
Once we have the idea that unknown parameters can be represented by random variables such as Θ, it is natural to incorporate them into the Bayesian network itself. To do this, we also need to make copies of the variables describing each instance. For example, if we have observed three candies then we need Flavor1, Flavor2, Flavor3 and Wrapper1, Wrapper2, Wrapper3. The parameter variable Θ determines the probability of each Flavori variable:

P(Flavori = cherry | Θ = θ) = θ.
Similarly, the wrapper probabilities depend on Θ1 and Θ2. For example,

P(Wrapperi = red | Flavori = cherry, Θ1 = θ1) = θ1.
Now, the entire Bayesian learning process can be formulated as an inference problem in a suitably constructed Bayes net, as shown in Figure 20.6. Prediction for a new instance is done simply by adding new instance variables to the network, some of which are queried. This formulation of learning and prediction makes it clear that Bayesian learning requires no extra “principles of learning.” Furthermore, there is, in essence, just one learning algorithm: the inference algorithm for Bayesian networks.
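For the single-parameter model, querying a fresh Flavor variable in this network integrates θ out against its beta posterior, which reduces to the posterior mean. A sketch of that prediction (the function name is our own):

```python
# Sketch: posterior predictive P(next candy is cherry | data) under a
# beta[a, b] prior; integrating theta out yields the posterior mean.
def predict_cherry(a, b, n_cherry, n_lime):
    return (a + n_cherry) / (a + b + n_cherry + n_lime)

print(predict_cherry(1, 1, 3, 1))   # uniform prior, 3 cherries, 1 lime -> 0.666...
```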