Variance Inflation Factors And Correlation Matrices
- The least squares estimation formula reveals that the coefficient estimates can be written as β_{est} = Ay, where A = (X′X)^{-1}X′ is the “alias” matrix.
- The alias matrix is a function of the model form being fitted and the input factor settings in the data. If that combination is poor, any random error, ε_{i}, influencing a response in y is inflated and can greatly change the coefficient estimates.
- The term “input data quality” refers to the ability of the input pattern to support accurate fitting of a model of interest. Define the following for the quantitative evaluation of input data quality:
1) Ds is the input pattern in the flat file.
2) H and L are the highs and lows respectively of the numbers in each column of the input data, Ds.
3) D is the input data in coded units that range between –1 and 1.
4) X is the design matrix.
5) Xs is the scaled design matrix.
6) n is the number of data points or rows in the flat file.
7) m is the number of factors in the regression model being fitted.
8) k is the number of terms in the regression model being fitted.
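To illustrate how these quantities relate, here is a minimal Python sketch. It assumes the usual linear coding D = -1 + 2(Ds - L)/(H - L), which is consistent with the definitions above but is not spelled out in them. It also assumes a first-order model form (a constant plus one term per factor, so k = m + 1). The function names `code_inputs`, `first_order_design_matrix`, and `alias_matrix`, and the example data, are hypothetical and chosen only for illustration.

```python
import numpy as np

def code_inputs(Ds):
    """Map each column of Ds linearly onto [-1, 1] (assumed coding convention).

    Columns with H == L (no variation) would cause a division by zero
    and are not handled in this sketch.
    """
    Ds = np.asarray(Ds, dtype=float)
    L = Ds.min(axis=0)  # column lows
    H = Ds.max(axis=0)  # column highs
    return -1.0 + 2.0 * (Ds - L) / (H - L)

def first_order_design_matrix(D):
    """Design matrix for an assumed first-order model: constant + one column per factor."""
    n = D.shape[0]
    return np.column_stack([np.ones(n), D])

def alias_matrix(X):
    """A = (X'X)^{-1} X', computed with a linear solve instead of an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T)

# Hypothetical flat-file inputs: n = 4 rows, m = 2 factors.
Ds = np.array([[10.0, 100.0],
               [20.0, 300.0],
               [30.0, 200.0],
               [20.0, 100.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical responses

D = code_inputs(Ds)
X = first_order_design_matrix(D)
A = alias_matrix(X)
beta_est = A @ y  # least squares coefficient estimates
print(beta_est)
```

Because β_{est} = Ay, a perturbation ε added to y shifts the estimates by Aε, so large entries in A indicate that random errors will be strongly inflated.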
- The following procedure is useful for quantitative evaluation of the extent to which errors are inflated and coefficient estimates are unstable.
- Note that in Step 4, finding a VIF greater than 10 or an r_{i,j} greater than 0.5 does not imply that the model form fails to describe nature. Rather, the conclusion is that the model form cannot be fitted accurately because of limitations in the pattern of the input data settings, i.e., the input data quality.
- More and better-quality data would be needed to fit that model. Note also that most statistical software packages do not include the optional Step 1 in their automatic calculations, so they perform only a single scaling, and the interpretation of their output in Step 4 is correspondingly less credible. In general, the assessment of input data quality is an active area of research, and the above procedure can sometimes prove misleading. In some cases, the procedure might signal that the input data quality is poor while the model has acceptable accuracy. In other cases, the procedure might suggest that the input data quality is acceptable, but the model does not predict well and results in poor inference.
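The individual steps of the procedure are not reproduced in this text. As a rough sketch of the kind of calculation that Step 4 refers to, the Python below uses one widely used construction: standardize each non-constant column of the design matrix, form the correlation matrix R = Xs′Xs/(n − 1), and take the VIFs as the diagonal of R^{-1}. Whether this matches the original procedure in every detail is an assumption. The function name `vifs_and_correlations` and both example designs are hypothetical. The thresholds of 10 and 0.5 are the ones quoted above.

```python
import numpy as np

def vifs_and_correlations(X):
    """Return (VIFs, correlation matrix) for the non-constant columns of X.

    Assumes the first column of X is the constant term. This is one common
    construction, not necessarily the exact procedure of the original text.
    A perfectly collinear design makes R singular and the inverse fails.
    """
    Z = np.asarray(X, dtype=float)[:, 1:]               # drop the constant column
    n = Z.shape[0]
    Xs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardized columns
    R = Xs.T @ Xs / (n - 1)                             # correlation matrix
    vifs = np.diag(np.linalg.inv(R))                    # VIFs on the diagonal of R^{-1}
    return vifs, R

def flag_poor_quality(vifs, R, vif_limit=10.0, r_limit=0.5):
    """True if any VIF or any off-diagonal |r_ij| exceeds the quoted limits."""
    off_diag = R - np.diag(np.diag(R))
    return bool((vifs > vif_limit).any() or (np.abs(off_diag) > r_limit).any())

# Hypothetical two-level factorial pattern: columns are orthogonal.
X_good = np.array([[1, -1, -1],
                   [1,  1, -1],
                   [1, -1,  1],
                   [1,  1,  1]], dtype=float)

# Hypothetical pattern in which the two factors move almost together.
x1 = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
x2 = np.array([-1.0, -0.5, 0.1, 0.5, 1.0])
X_poor = np.column_stack([np.ones(5), x1, x2])

for name, X in [("factorial", X_good), ("near-collinear", X_poor)]:
    vifs, R = vifs_and_correlations(X)
    print(name, vifs, flag_poor_quality(vifs, R))
```

As the notes above caution, a flag from a check like this describes the input pattern, not whether the model form is correct.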