Artificial Intelligence

Maximum-likelihood Parameter Learning: Continuous Models

Continuous probability models such as the linear-Gaussian model were introduced in Section 14.3. Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn continuous models from data. The principles for maximum-likelihood learning are identical to those of the discrete case.
Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, the data are generated as follows:

$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

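The parameters of this model are the mean µ and the standard deviation σ. (Notice that the normalizing “constant” depends on σ, so we cannot ignore it.) Let the observed values be $x_1, \ldots, x_N$. Then the log likelihood is

$$L = \sum_{j=1}^{N} \log\left(\frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x_j-\mu)^2}{2\sigma^2}}\right) = N\left(-\log\sqrt{2\pi} - \log\sigma\right) - \sum_{j=1}^{N} \frac{(x_j-\mu)^2}{2\sigma^2}$$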

Setting the derivatives to zero as usual, we obtain

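$$\frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{j=1}^{N}(x_j-\mu) = 0 \quad\Rightarrow\quad \mu = \frac{\sum_{j} x_j}{N}$$

$$\frac{\partial L}{\partial \sigma} = -\frac{N}{\sigma} + \frac{1}{\sigma^3}\sum_{j=1}^{N}(x_j-\mu)^2 = 0 \quad\Rightarrow\quad \sigma = \sqrt{\frac{\sum_{j}(x_j-\mu)^2}{N}} \qquad (4)$$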

That is, the maximum-likelihood value of the mean is the sample average and the maximum-likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm “commonsense” practice.
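
The result above can be checked numerically. The following is a minimal sketch, not a definitive implementation: the function names `gaussian_mle` and `log_likelihood` and the sample values in `data` are hypothetical, chosen only for illustration. It computes the two estimates and confirms that nearby parameter settings give a lower log likelihood.

```python
import math


def gaussian_mle(xs):
    """Return the maximum-likelihood (mu, sigma) for a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n  # sample average
    # Divide by N (not N - 1): this is the ML estimate, not the unbiased one.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma


def log_likelihood(xs, mu, sigma):
    """Gaussian log likelihood L of the data xs under (mu, sigma)."""
    n = len(xs)
    const = n * (-math.log(math.sqrt(2 * math.pi)) - math.log(sigma))
    return const - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2)


# Hypothetical observations, chosen only for illustration.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_hat, sigma_hat = gaussian_mle(data)
print(mu_hat, sigma_hat)  # 5.0 2.0

best = log_likelihood(data, mu_hat, sigma_hat)
# Perturbing either parameter should never increase the log likelihood.
for d_mu, d_sigma in [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]:
    assert log_likelihood(data, mu_hat + d_mu, sigma_hat + d_sigma) <= best
```

Note that the estimate of σ divides by N rather than N - 1, which is the same convention as the standard library's `statistics.pstdev`.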
Now consider a linear Gaussian model with one continuous parent X and a continuous child Y. As explained on page 502, Y has a Gaussian distribution whose mean depends linearly on the value of X and whose standard deviation is fixed. To learn the conditional distribution P(Y | X), we can maximize the conditional likelihood

$$P(y \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(y - (\theta_1 x + \theta_2))^2}{2\sigma^2}}$$

Figure 20.4 (a) A linear Gaussian model described as $y = \theta_1 x + \theta_2$ plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model.

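Because the noise is Gaussian with fixed σ, the log of the conditional likelihood of the data pairs $(x_j, y_j)$ depends on the parameters $\theta_1$ and $\theta_2$ only through the term $-E/(2\sigma^2)$, where

$$E = \sum_{j=1}^{N} \left(y_j - (\theta_1 x_j + \theta_2)\right)^2$$

Maximizing the log likelihood with respect to $\theta_1$ and $\theta_2$ is therefore the same as minimizing E. The quantity $(y_j - (\theta_1 x_j + \theta_2))$ is the error for $(x_j, y_j)$, that is, the difference between the actual value $y_j$ and the predicted value $(\theta_1 x_j + \theta_2)$, so E is the well-known sum of squared errors. This is the quantity that is minimized by the standard linear regression procedure.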

Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
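
The equivalence can be illustrated with a short sketch. It generates 50 points from a line plus Gaussian noise, fits the line with the closed-form least-squares solution, and checks that moving away from the fitted parameters both raises the sum of squared errors and lowers the conditional log likelihood. The generating values (`TRUE_THETA1`, `TRUE_THETA2`, `SIGMA`), the random seed, and all function names are assumptions made for illustration; the text does not state the values used for its figure.

```python
import math
import random


def fit_line(xs, ys):
    """Least-squares estimates (theta1, theta2) for y = theta1 * x + theta2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta2 = y_bar - theta1 * x_bar
    return theta1, theta2


def sum_squared_errors(xs, ys, theta1, theta2):
    """E: the sum over j of (y_j - (theta1 * x_j + theta2)) squared."""
    return sum((y - (theta1 * x + theta2)) ** 2 for x, y in zip(xs, ys))


def conditional_log_likelihood(xs, ys, theta1, theta2, sigma):
    """Log of the product of P(y_j | x_j) under the linear Gaussian model."""
    n = len(xs)
    e = sum_squared_errors(xs, ys, theta1, theta2)
    const = n * (-math.log(math.sqrt(2 * math.pi)) - math.log(sigma))
    return const - e / (2 * sigma ** 2)


# Assumed generating parameters; the text does not give the figure's values.
random.seed(0)
TRUE_THETA1, TRUE_THETA2, SIGMA = 1.0, 0.5, 0.1
xs = [random.random() for _ in range(50)]
ys = [TRUE_THETA1 * x + TRUE_THETA2 + random.gauss(0.0, SIGMA) for x in xs]

theta1_hat, theta2_hat = fit_line(xs, ys)
e_min = sum_squared_errors(xs, ys, theta1_hat, theta2_hat)
ll_max = conditional_log_likelihood(xs, ys, theta1_hat, theta2_hat, SIGMA)

# Moving away from the least-squares line raises E and lowers the likelihood.
for d1, d2 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    t1, t2 = theta1_hat + d1, theta2_hat + d2
    assert sum_squared_errors(xs, ys, t1, t2) > e_min
    assert conditional_log_likelihood(xs, ys, t1, t2, SIGMA) < ll_max
```

The noise level σ is held fixed in the comparison, matching the fixed-variance condition stated above; only under that condition do the two objectives rank all candidate lines identically.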