Data Mining & Data Warehousing

Dimensionality Reduction For Text

Document Classification Analysis: Automated document classification is an important text mining task because, with theexistence of a tremendous number of on-line documents, it is tedious yet essential tobe able to automatically organize such documents into classes to facilitate documentretrieval and subsequent analysis. Document classification has been used in automatedtopic tagging (i.e., assigning labels to documents), topic directory construction, identificationof the document writing styles (which may help narrow down the possible authorsof anonymous documents), and classifying the purposes of hyperlinks associated with aset of documents.

A general procedure is as follows: First, a set of pre-classified documents is taken as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. The so-derived classification scheme can be used for classification of other on-line documents.

This process appears similar to the classification of relational data. However, there is a fundamental difference. Relational data are well structured: each tuple is defined by a set of attribute-value pairs. For example, in the tuple {sunny; warm; dry; not windy; play tennis}, the value “sunny” corresponds to the attribute weather outlook, “warm” corresponds to the attribute temperature, and so on. The classification analysis decides which set of attribute-value pairs has the greatest discriminating power in determining whether a person is going to play tennis. On the other hand, document databases are not structured according to attribute-value pairs. That is, a set of keywords associated with a set of documents is not organized into a fixed set of attributes or dimensions. If we view each distinct keyword, term, or feature in the document as a dimension, there may be thousands of dimensions in a set of documents. Therefore, commonly used relational data-oriented classification methods, such as decision tree analysis, may not be effective for the classification of document databases.

According to the vector-space model, two documents are similar if they share similar document vectors. This model motivates the construction of the k-nearest-neighbor classifier, based on the intuition that similar documents are expected to be assigned the same class label. We can simply index all of the training documents, each associated with its corresponding class label. When a test document is submitted, we can treat it as a query to the IR system and retrieve from the training set k documents that are most similar to the query, where k is a tunable constant. The class label of the test document can be determined based on the class label distribution of its k nearest neighbors. Such class label distribution can also be refined, such as based on weighted counts instead of raw counts, or setting aside a portion of labeled documents for validation. By tuning k and incorporating the suggested refinements, this kind of classifier can achieve accuracy comparable with the best classifier. However, since the method needs nontrivial space to store (possibly redundant) training information and additional time for inverted index lookup, it has additional space and time overhead in comparison with other kinds of classifiers.

The vector-space model may assign large weight to rare items disregarding its class distribution characteristics. Such rare items may lead to ineffective classification. Let’s examine an example in the TF-IDF measure computation. Suppose there are two terms t1 and t2 in two classes C1 and C2, each having 100 training documents. Termt1 occurs in five documents in each class (i.e.,5%of the overall corpus), but t2 occurs in 20 documents in class C1 only (i.e., 10% of the overall corpus). Termt1 will have a higher TF-IDF value because it is rarer, but it is obvious t2 has stronger discriminative power in this case.

A feature selection2 process can be used to remove terms in the training documents that are statistically uncorrelated with the class labels. This will reduce the set of terms to be used in classification, thus improving both efficiency and accuracy.

After feature selection, which removes non feature terms, the resulting “cleansed” training documents can be used for effective classification. Bayesian classification is one of several popular techniques that can be used for effective document classification. Since document classification can be viewed as the calculation of the statistical distribution of documents in specific classes, a Bayesian classifier first trains the model by calculating a generative document distribution P(djc) to each class c of document d and then tests which class is most likely to generate the test document. Since both methods handle high-dimensional data sets, they can be used for effective document classification. Other classification methods have also been used in documentation classification. For example, if we represent classes by numbers and construct a direct mapping function from term space to the class variable, support vector machines can be used to perform effective classification since they work well in high-dimensional space. The least-square linear regression method is also used as a method for discriminative classification.

Finally, we introduce association-based classification, which classifies documents based on a set of associated, frequently occurring text patterns. Notice that very frequent terms are likely poor discriminators. Thus only those terms that are not very frequent and that have good discriminative power will be used in document classification. Such an association-based classification method proceeds as follows: First, keywords and terms can be extracted by information retrieval and simple association analysis techniques. Second, concept hierarchies of keywords and terms can be obtained using available term classes, such as Word Net, or relying on expert knowledge, or some keyword classification systems. Documents in the training set can also be classified into class hierarchies. A term association mining method can then be applied to discover sets of associated terms that can be used to maximally distinguish one class of documents from others. This derives a set of association rules associated with each document class. Such classification rules can be ordered based on their discriminative power and occurrence frequency, and used to classify new documents. Such kind of association-based document classifier has been proven effective.