Data Mining & Data Warehousing

Multi-relational Data Mining

Introduction: Multi-relational data mining (MRDM) methods search for patterns that involve multiple tables (relations) from a relational database. Each table or relation represents an entity or a relationship, described by a set of attributes. Links between relations represent the relationships among them. Traditional data mining methods assume that the data reside in a single table. One way to apply them to relational data is propositionalization, which converts multiple relations into a single flat relation using joins and aggregations. This, however, can generate a huge, undesirable “universal relation” (involving all of the attributes). Furthermore, it can lose information, including essential semantic information represented by the links in the database design.
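As a rough illustration of propositionalization, the sketch below flattens two relations into one using a join and aggregations. The customer and purchase tables, their columns, and the function name propositionalize are hypothetical and chosen only for this example; they do not come from any particular dataset.

```python
import sqlite3

def propositionalize():
    """Flatten two hypothetical relations into one row per customer."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical schema: purchase.cust_id is a foreign key to customer.cust_id.
    cur.executescript("""
        CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, age INTEGER);
        CREATE TABLE purchase (purchase_id INTEGER PRIMARY KEY,
                               cust_id INTEGER REFERENCES customer(cust_id),
                               amount REAL);
        INSERT INTO customer VALUES (1, 34), (2, 51);
        INSERT INTO purchase VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 5.0);
    """)
    # Join along the foreign key, then aggregate so that each customer
    # is described by exactly one flat tuple.
    rows = cur.execute("""
        SELECT c.cust_id, c.age,
               COUNT(p.purchase_id) AS n_purchases,
               COALESCE(SUM(p.amount), 0.0) AS total_amount
        FROM customer c
        LEFT JOIN purchase p ON p.cust_id = c.cust_id
        GROUP BY c.cust_id, c.age
        ORDER BY c.cust_id
    """).fetchall()
    conn.close()
    return rows

flat = propositionalize()
```

In the flat result, customer 1 is summarized by a purchase count and a total amount; the two individual purchase amounts are no longer recoverable, which is a small instance of the information loss described above.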

Multi-relational data mining aims to discover knowledge directly from relational data. Its tasks include multi-relational classification, clustering, and frequent pattern mining. Multi-relational classification aims to build a classification model that utilizes information in different relations. Multi-relational clustering aims to group tuples into clusters using their own attributes as well as tuples related to them in different relations. Multi-relational frequent pattern mining aims to find patterns involving interconnected items in different relations. We first use multi-relational classification as an example to illustrate the purpose and procedure of multi-relational data mining. We then introduce multi-relational classification and multi-relational clustering in detail in the following sections.

In a database for multi-relational classification, there is one target relation, Rt, whose tuples are called target tuples and are associated with class labels. The other relations are nontarget relations. Each relation may have one primary key (which uniquely identifies tuples in the relation) and several foreign keys (each of which refers to the primary key of another relation and thereby links the two relations). If we assume a two-class problem, then we pick one class as the positive class and the other as the negative class. The most important task for building an accurate multi-relational classifier is to find relevant features in different relations that help distinguish positive and negative target tuples.
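To make the idea of a relevant feature in another relation concrete, here is a minimal sketch that follows a foreign key from each target tuple to a second relation and counts positive and negative tuples per attribute value. The loans and accounts relations, their attributes, and the function class_counts_by_feature are hypothetical, and this is a toy illustration rather than any specific multi-relational classification algorithm.

```python
# Hypothetical target relation: loan_id is the primary key, account_id is a
# foreign key, and label is the class ("+" positive, "-" negative).
loans = [
    {"loan_id": 1, "account_id": 100, "label": "+"},
    {"loan_id": 2, "account_id": 101, "label": "-"},
    {"loan_id": 3, "account_id": 102, "label": "+"},
    {"loan_id": 4, "account_id": 103, "label": "-"},
]

# Hypothetical second relation: account_id is its primary key.
accounts = [
    {"account_id": 100, "frequency": "monthly"},
    {"account_id": 101, "frequency": "weekly"},
    {"account_id": 102, "frequency": "monthly"},
    {"account_id": 103, "frequency": "weekly"},
]

def class_counts_by_feature(target, other, key, attribute):
    """Count positive and negative target tuples for each value of an
    attribute reached by following the key link into another relation."""
    index = {row[key]: row for row in other}  # primary-key lookup
    counts = {}
    for t in target:
        linked = index.get(t[key])
        if linked is None:
            continue  # target tuple has no linked tuple; skip it
        value = linked[attribute]
        per_class = counts.setdefault(value, {"+": 0, "-": 0})
        per_class[t["label"]] += 1
    return counts

counts = class_counts_by_feature(loans, accounts, "account_id", "frequency")
```

In this toy data every loan linked to a "monthly" account is positive and every loan linked to a "weekly" account is negative, so the frequency attribute, although it is stored outside the target relation, would be a relevant feature for separating the two classes.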