Neural Nets For Regression

Introduction:

Neural nets fascinate many people because their structure is somewhat reminiscent of the way that the human brain generates predictions based on data. Also, these methods are useful for analyzing data with categorical responses.

 

Neural Nets for Regression Type Problems:

These methods also offer potentially great flexibility with respect to the ability to approximate a wide variety of functions. However, this flexibility is also a concern, since it creates much potential for misuse. This section focuses on neural nets used for situations involving continuous responses.

These neural nets constitute alternatives to linear regression and kriging models. Here, a simple, spreadsheet-based neural network modeling technique is proposed and illustrated with an example.

This method is based on results in Ribardo (2000).

Kohonen (1989) and Chambers (2000) review neural net modeling in the context of predicting continuous responses such as undercut in millimeters.

Neural nets have also been proposed for classification problems involving discrete responses.

Here also, an attempt is made to avoid neural net terminology as much as possible, to facilitate comparison with the other methods.

Basically, neural net modeling is a curve-fitting procedure that, like regression, involves estimating several coefficients (or “weights”) by solving an optimization problem to minimize the sum of squares error.

In linear regression, this minimization is relatively trivial from the numerical standpoint, since formulas are readily available for the solution, i.e.,

 

βest = (X′X)⁻¹X′y
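
As a quick illustration, the sketch below evaluates this formula directly in Python with numpy; the design matrix X and response vector y are made up purely for illustration.

    import numpy as np

    X = np.array([[1.0, 0.5],
                  [1.0, 1.5],
                  [1.0, 2.5],
                  [1.0, 3.5]])   # design matrix with a constant column
    y = np.array([1.1, 2.0, 2.9, 4.2])

    # beta_est = (X'X)^(-1) X'y
    beta_est = np.linalg.inv(X.T @ X) @ X.T @ y
    print(beta_est)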

 

In neural net modeling, however, this minimization is typically far less trivial and involves using a formal optimization solver. In the implementation here, the Excel solver is used. In more standard treatments, however, the solvers involve methods that are to some degree tailored to the specific functional form (or “architecture”) of the model (or “net”) being fit.
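
As a sketch of this idea, the code below fits a small nonlinear model by minimizing the sum of squares error with a general-purpose solver (scipy.optimize here, standing in for the Excel solver); the data and model form are made up for illustration.

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # made-up training data
    y = np.array([0.1, 0.4, 1.2, 1.9, 2.1])

    def sse(w):
        # simple nonlinear model: a scaled, shifted sigmoid curve
        pred = w[0] / (1.0 + np.exp(-(w[1] * x + w[2])))
        return np.sum((y - pred) ** 2)

    result = minimize(sse, x0=np.array([2.0, 1.0, -2.0]))
    print(result.x, result.fun)   # fitted weights and remaining SSE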

A solver algorithm called “back-propagation” is the most commonly used method for estimating the model parameters (or “training on the training set”) for the model forms that were selected.

An architecture called a “single hidden layer, feed-forward neural network based on sigmoidal transfer functions” was selected, with five randomly chosen data points in the test set and the simplest “training termination criterion” chosen arbitrarily.
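
For concreteness, the following sketch (with illustrative names and made-up weights) shows the prediction form this architecture implies: each hidden node applies a sigmoid transfer function to a weighted sum of the m factors, and the final prediction node combines the l hidden-node outputs with the constant node and an overall scale factor.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def predict(x, hidden_w, hidden_b, out_w, out_b, scale):
        # x: (m,) factor settings; hidden_w: (l, m); hidden_b, out_w: (l,)
        z = sigmoid(hidden_w @ x + hidden_b)   # outputs of the l hidden nodes
        return scale * (out_b + out_w @ z)     # constant node plus weighted sum

    # toy evaluation with m = 2 factors and l = 3 hidden nodes
    rng = np.random.default_rng(0)
    print(predict(np.array([0.5, -1.0]),
                  rng.normal(size=(3, 2)), rng.normal(size=3),
                  rng.normal(size=3), 0.1, 1.0))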

One reason for selecting this architecture type is that substantial literature exists on this type of network, e.g., see Chambers (2000) for a review.

Also, it has been demonstrated rigorously that, with a large enough number of nodes, this type of network can approximate any continuous function to any desired degree of accuracy (Cybenko 1989). However, it may be possible to obtain a relatively accurate net with fewer total nodes using a different type of architecture.

The choice of the number of nodes and the other specific architectural decisions is largely determined by the accepted compromise between explaining the observed variation (achieving a high adjusted R²) and avoiding what is referred to as “over-fitting”.

In the selected feed-forward architecture, for each of the l nodes in the hidden layer (not including the constant node, which always equals 1.0), the number of coefficients (or “weights”) equals the number of factors, m, plus two. Therefore, the total number of weights is

w = l(m + 2) + 2.

The additional two weights derive from the weight multiplying the constant node in the final prediction node and the (optional) overall scale factor, which can help in achieving realistic weights. Several rules of thumb for selecting w and l exist and are discussed in Chambers (2000). If w equals or exceeds the number of data points, n, then the sum of squares error can be driven to zero and the neural net passes through all the points, as shown in the figure below.
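
The following small sketch (with an illustrative n) tabulates this weight count and flags the w ≥ n condition:

    def num_weights(l, m):
        # total weights in the single-hidden-layer architecture
        return l * (m + 2) + 2

    n = 30   # illustrative number of data points
    for l in range(1, 6):
        w = num_weights(l, m=5)
        print(l, w, "w >= n: over-fit risk" if w >= n else "ok")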

The simple heuristic method for selecting the number of nodes described in Ribardo (2000) will be adopted to address this over-fitting issue in the response surface context.

This approach involves choosing the number of nodes so that the number of weights approximately equals the number of terms in a quadratic Taylor series or, equivalently, a response surface model. The number of such terms is (m + 1)(m + 2)/2. In the case study, m = 5 and the number of terms in the RSM polynomial is 21.

This suggests using l = 3 nodes in the hidden layer, so that the number of weights is 3 × (5 + 2) + 2 = 23.
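
The sketch below (function name illustrative) applies this heuristic and reproduces the case-study values for m = 5:

    def choose_nodes(m):
        # number of terms in the quadratic RSM polynomial
        rsm_terms = (m + 1) * (m + 2) // 2
        # integer l >= 1 making l(m + 2) + 2 closest to rsm_terms
        l = max(1, round((rsm_terms - 2) / (m + 2)))
        return l, l * (m + 2) + 2, rsm_terms

    print(choose_nodes(5))   # -> (3, 23, 21), matching the case study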