Data Mining & Data Warehousing

Categorical Variables

Introduction:

A categorical variable is a generalization of the binary variable in that it can take on more than two states. For example, map color is a categorical variable that may have, say, five states: red, yellow, green, pink, and blue.

Let the number of states of a categorical variable be M. The states can be denoted by letters, symbols, or a set of integers, such as 1, 2, …, M. Note that such integers are used only for data handling and do not represent any specific ordering.

The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:

d(i, j) = (p - m) / p        (7.12)

where m is the number of matches (i.e., the number of variables for which i and j are in the same state), and p is the total number of variables. Weights can be assigned to increase the effect of m or to assign greater weight to the matches in variables having a larger number of states.
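The ratio of mismatches can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name categorical_dissimilarity, the optional weights parameter, and the sample attribute values (color, size, shape) are assumptions introduced here. The weighting scheme shown, one weight per variable, is only one possible reading of the remark about weights; with no weights given, the function reduces to d(i, j) = (p - m) / p.

```python
from typing import Optional, Sequence


def categorical_dissimilarity(
    obj_i: Sequence[str],
    obj_j: Sequence[str],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Return the ratio of mismatches between two objects.

    Each object is a sequence of categorical states, one per variable.
    Without weights this is d(i, j) = (p - m) / p.
    """
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of variables")
    p = len(obj_i)
    if p == 0:
        raise ValueError("at least one variable is required")

    # Assumed scheme: one weight per variable; all 1.0 in the unweighted case.
    if weights is None:
        weights = [1.0] * p
    if len(weights) != p:
        raise ValueError("one weight per variable is required")

    total = sum(weights)  # equals p when unweighted
    matched = sum(w for a, b, w in zip(obj_i, obj_j, weights) if a == b)  # equals m when unweighted
    return (total - matched) / total


# Hypothetical objects described by three categorical variables.
obj_a = ["red", "small", "round"]
obj_b = ["red", "large", "round"]

d_plain = categorical_dissimilarity(obj_a, obj_b)                # p = 3, m = 2
d_weighted = categorical_dissimilarity(obj_a, obj_b, [1, 2, 1])  # mismatch carries weight 2
```

In the unweighted call, two of the three variables match, so d = (3 - 2) / 3. In the weighted call, the single mismatching variable carries half of the total weight, so the dissimilarity rises to 0.5.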

Example: Dissimilarity between categorical variables. Suppose that we have the sample data of Table 7.3, except that only the object-identifier and the variable (or attribute) test-1 are available, where test-1 is categorical. (We will use test-2 and test-3 in later examples.) Let’s compute the dissimilarity matrix (7.2), that is,

Since here we have one categorical variable, test-1, we set p = 1 in Equation (7.12) so that d(i, j) evaluates to 0 if objects i and j match, and 1 if the objects differ. Thus, we get

Categorical variables can be encoded as asymmetric binary variables by creating a new binary variable for each of the M states. For an object with a given state value, the binary variable representing that state is set to 1, while the remaining binary variables are set to 0. For example, to encode the categorical variable map color, a binary variable can be created for each of the five colors listed above. For an object having the color yellow, the yellow variable is set to 1, while the remaining four variables are set to 0.
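The map color encoding can be sketched as follows. The helper name encode_as_binary and the constant MAP_COLORS are illustrative names chosen here, not part of the original text; the five states are the colors from the example above.

```python
from typing import Dict, Sequence


def encode_as_binary(value: str, states: Sequence[str]) -> Dict[str, int]:
    """Create one binary variable per state; only the object's own state is 1."""
    if value not in states:
        raise ValueError(f"unknown state: {value!r}")
    return {state: int(state == value) for state in states}


# The five states of the categorical variable map color.
MAP_COLORS = ["red", "yellow", "green", "pink", "blue"]

encoded = encode_as_binary("yellow", MAP_COLORS)
print(encoded)
# {'red': 0, 'yellow': 1, 'green': 0, 'pink': 0, 'blue': 0}
```

Exactly one of the M binary variables is 1 for any object, so the encoding keeps the states unordered, in line with the earlier remark that integer labels carry no ordering.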