Data Mining For Biological Data Analysis

• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide/protein sequences: Various biological sequence alignment methods have been developed in the past two decades. BLAST and FASTA, in particular, are tools for the systematic analysis of genomic and proteomic data. Biological sequence analysis methods differ from many sequential pattern analysis algorithms proposed in data mining research: they must allow for gaps and mismatches between a query sequence and the sequence data being searched in order to deal with insertions, deletions, and mutations. Moreover, for protein sequences, two amino acids should also be considered a “match” if one can be derived from the other by substitutions that are likely to occur in nature. Sophisticated statistical analysis and dynamic programming methods often play a key role in the development of alignment algorithms. Indices can be constructed on such data sets so that both exact and similarity searches can be performed efficiently.
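
The role of dynamic programming in gap-tolerant alignment can be illustrated with a minimal sketch of global alignment scoring in the style of Needleman-Wunsch. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration, not parameters taken from BLAST or FASTA. For protein sequences, the simple match/mismatch test would typically be replaced by a lookup in a substitution matrix, which is how "likely" amino acid substitutions are rewarded.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b (illustrative scoring values)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            # Align a[i-1] with b[j-1] (match or mismatch).
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Align a[i-1] with a gap (a deletion relative to a).
            up = score[i - 1][j] + gap
            # Align b[j-1] with a gap (an insertion relative to a).
            left = score[i][j - 1] + gap
            score[i][j] = max(diag, up, left)
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

With these scores, aligning "ACGT" against "AGT" keeps three matches and pays for one gap. The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths, which is one reason indexing and heuristic search matter for large databases.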

The number of ways to approximately align multiple sequences is combinatorial, so multiple sequence alignment is considered a more challenging task than pairwise alignment. Methods that can help include (1) reducing a multiple alignment to a series of pairwise alignments and then combining the results, and (2) using Hidden Markov Models, or HMMs (Chapter 8). However, the efficient and systematic alignment of multiple biological sequences remains an active research topic. Multiple sequence alignments can be used to identify highly conserved residues among genomes, and such conserved regions can be used to build phylogenetic trees to infer evolutionary relationships among species. Moreover, such alignments may help disclose the secrets of evolution at the genomic level.
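
Once a multiple alignment is available, spotting conserved residues is straightforward. The sketch below assumes the alignment is given as equal-length strings with "-" marking gaps; the function name and the toy sequences are hypothetical, and the strict "identical in every sequence" criterion is a simplification of the conservation scores used in practice.

```python
def conserved_columns(aligned, gap="-"):
    """Return 0-based indices of columns where all sequences share one non-gap residue."""
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    conserved = []
    for col in range(length):
        # Collect the distinct symbols appearing in this column.
        residues = {seq[col] for seq in aligned}
        if len(residues) == 1 and gap not in residues:
            conserved.append(col)
    return conserved


if __name__ == "__main__":
    # Toy alignment; the sequences are made up for illustration.
    alignment = ["ACG-TA", "ACGATA", "TCG-TA"]
    print(conserved_columns(alignment))
```

For this toy alignment, the first column differs across sequences and the fourth contains gaps, so only the remaining four columns are reported as conserved.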

From the point of view of medical sciences, genomic and proteomic sequences isolated from diseased and healthy tissues can be compared to identify critical differences between them. Sequences occurring more frequently in the diseased samples may indicate the genetic factors of the disease. Those occurring more frequently in the healthy samples may indicate mechanisms that protect the body from the disease. Although genetic analysis requires similarity search, the technique needed here is different from that used for time-series data (Chapter 8). The analysis of time series typically uses data transformation methods such as scaling, normalization, and window stitching, which are ineffective for genetic data because such data are nonnumeric. These transformation methods also do not consider the interconnections between nucleotides, which play an important role in biological function. It is therefore important to further develop efficient sequential pattern analysis methods for comparative analysis of biological sequences.
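
The comparison of diseased and healthy samples can be sketched, under strong simplifying assumptions, by counting how many samples in each group contain a given short subsequence (k-mer). The function names, the choice of k, and the threshold `min_diff` are all hypothetical; a real study would use alignment-aware methods and proper statistical testing rather than a raw frequency difference.

```python
from collections import Counter


def kmer_sample_counts(sequences, k):
    """For each k-mer, count how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def enriched_kmers(diseased, healthy, k=3, min_diff=0.5):
    """k-mers whose sample frequency in `diseased` exceeds `healthy` by at least min_diff."""
    if not diseased or not healthy:
        raise ValueError("both sample groups must be non-empty")
    d_counts = kmer_sample_counts(diseased, k)
    h_counts = kmer_sample_counts(healthy, k)
    result = {}
    for kmer, count in d_counts.items():
        # Counter returns 0 for k-mers never seen in the healthy group.
        diff = count / len(diseased) - h_counts[kmer] / len(healthy)
        if diff >= min_diff:
            result[kmer] = diff
    return result


if __name__ == "__main__":
    # Toy sequences, made up for illustration.
    diseased = ["ACGTT", "GACGA"]
    healthy = ["ATTTA", "CCCTA"]
    print(enriched_kmers(diseased, healthy, k=3, min_diff=1.0))
```

Swapping the two arguments gives the complementary question: which subsequences are more frequent in the healthy group and might point to protective mechanisms.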

• Discovery of structural patterns and analysis of genetic networks and protein pathways: In biology, protein sequences are folded into three-dimensional structures, and such structures interact with each other based on their relative positions and the distances between them. Such complex interactions form the basis of sophisticated genetic networks and protein pathways. It is crucial to discover structural patterns and regularities among such huge and complex biological networks. To this end, it is important to develop powerful and scalable data mining methods to discover approximate and frequent structural patterns and to study the regularities and irregularities among such interconnected biological networks.

• Association and path analysis: identifying co-occurring gene sequences and linking genes to different stages of disease development: Many studies have focused on the comparison of one gene to another. However, most diseases are not triggered by a single gene but by a combination of genes acting together. Association analysis methods can be used to help determine the kinds of genes that are likely to co-occur in target samples. Such analysis would facilitate the discovery of groups of genes and the study of interactions and relationships between them.
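
A minimal sketch of association analysis over gene co-occurrence follows. It assumes each sample is represented simply as the set of genes observed (for example, expressed or mutated) in that sample, and it reports gene pairs whose support, the fraction of samples containing both genes, meets a threshold. The function name, the gene labels, and the threshold are hypothetical; full association mining would extend this to larger gene sets and to rule confidence.

```python
from collections import Counter
from itertools import combinations


def frequent_gene_pairs(samples, min_support=0.5):
    """Return {(gene_a, gene_b): support} for pairs co-occurring in enough samples."""
    if not samples:
        return {}
    pair_counts = Counter()
    for genes in samples:
        # Sort so each pair has one canonical ordering regardless of input order.
        for pair in combinations(sorted(set(genes)), 2):
            pair_counts[pair] += 1
    total = len(samples)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


if __name__ == "__main__":
    # Placeholder gene labels, not real gene names.
    samples = [
        {"G1", "G2", "G3"},
        {"G1", "G2"},
        {"G2", "G3"},
        {"G1", "G2", "G4"},
    ]
    print(frequent_gene_pairs(samples, min_support=0.5))
```

In this toy data, G1 and G2 appear together in three of four samples and G2 and G3 in two of four, so only those two pairs pass the 0.5 threshold.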

While a group of genes may contribute to a disease process, different genes may become active at different stages of the disease. If the sequence of genetic activities across the different stages of disease development can be identified, it may be possible to develop pharmaceutical interventions that target the different stages separately, thereby achieving more effective treatment of the disease. Such path analysis is expected to play an important role in genetic studies.

• Visualization tools in genetic data analysis: Alignments among genomic or proteomic sequences and the interactions among complex biological structures are most effectively presented in graphical form, transformed into various kinds of easy-to-understand visual displays. Such visual structures and patterns facilitate pattern understanding, knowledge discovery, and interactive data exploration. Visualization and visual data mining therefore play an important role in biological data analysis.
