Published by Marcia Grace Miles. Modified over 8 years ago.
1
Survey of Biodata Analysis from a Data Mining Perspective
Peter Bajcsy, Jiawei Han, Lei Liu, Jiong Yang
2
Introduction
- How to bridge data mining and bioinformatics for successful data mining of biological data?
- Three major themes:
  - Data cleaning, data preprocessing, and semantic integration of heterogeneous, distributed biomedical databases
  - Exploration of existing data mining tools for biodata analysis
  - Development of advanced, effective, and scalable data mining methods in biodata analysis
3
Research Topics on Advanced Data Mining Methods for Biodata Analysis
- Analysis of frequent patterns, sequential patterns, and structured patterns: identification of co-occurring or correlated biosequences or biostructure patterns
- Effective classification and comparison of biodata
4
- Various kinds of cluster analysis methods
- Discovering pairwise frequent patterns and clustering biodata based on such frequent patterns
- Computational modeling of biological networks
- Identifying the sequence of genetic activities across different stages of disease development
- Data visualization and visual data mining
5
Data Cleaning, Data Preprocessing, and Data Integration
- Biomedical data are stored in multiple distributed databases, so automated preprocessing techniques are needed
- Data cleaning: ensuring data quality (data interpretability)
- How do the data enter the system?
  - Minimum Information About a Microarray Experiment (MIAME)
  - MicroArray and Gene Expression (MAGE)
6
Data Cleaning (continued)
- How are the data delivered?
  - Verifying checksums or relationships between data streams
  - Using reliable transmission protocols
- Where do the data go after being received?
  - Hardware and software constraints
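The checksum step above can be sketched in a few lines. This is a minimal illustration, assuming the sender publishes a SHA-256 digest alongside each file; the function names and the tiny FASTA-style payload are hypothetical and not taken from the slides.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(data).hexdigest()


def verify_transfer(received: bytes, expected_digest: str) -> bool:
    """True when the received bytes hash to the digest published by the sender."""
    return sha256_hex(received) == expected_digest


# Hypothetical payload: a tiny FASTA-style record standing in for a delivered file.
original = b">seq1\nACGTACGT\n"
published_digest = sha256_hex(original)

# An intact copy verifies; a copy with one altered base does not.
intact = verify_transfer(original, published_digest)
corrupted = verify_transfer(b">seq1\nACGTACGA\n", published_digest)
```

A matching digest only shows that the bytes arrived unchanged; it says nothing about whether the content was correct when it entered the system.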
7
Data Cleaning (continued)
- Are the data combined with other data sets?
- How are the data retrieved?
- How are the data analyzed?
  - Computer science models and biomedical models have to come together
8
Data Preprocessing
- Multidisciplinary efforts are needed:
  - Process management: supporting standardization of content and format, and automation of preprocessing
  - Documentation of biomedical domain expertise: establishing metadata standards (MAGE-ML), creating annotation files, developing text-mining software
  - Statistical and database analyses: data cleaning, integration, transformation, and reduction
9
Semantic Integration of Heterogeneous Data
- Combining multiple sources into a coherent data store, and matching up semantically equivalent real-world entities across several biomedical sources
- Semantic integration is still an open problem because of the complexity of bio-ontologies and the heterogeneous, distributed nature of the recorded high-dimensional data
10
Semantic Integration of Heterogeneous Data
- Two approaches:
  - Construction of integrated biodata warehouses or biodatabases: requires a common ontology and terminology, and sophisticated data mapping rules
  - Construction of a federation of heterogeneous distributed biodatabases: builds up mapping rules or semantic ambiguity resolution rules across multiple databases
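As a rough illustration of what a mapping rule can look like, the sketch below merges records from two sources through a synonym table. Everything here is an assumption made for the example: the canonical identifiers, the source names `db_a` and `db_b`, and the record fields are hypothetical, and real integration relies on curated ontologies rather than a hand-written dictionary.

```python
# Hypothetical mapping rules: lower-cased source names -> one canonical identifier.
SYNONYMS = {
    "tp53": "GENE:0001",
    "p53": "GENE:0001",
    "mdm2": "GENE:0002",
}


def integrate(sources, synonyms):
    """Merge records that refer to the same entity; collect names no rule covers."""
    merged = {}
    unresolved = []
    for source_name, records in sources.items():
        for record in records:
            key = synonyms.get(record["name"].strip().lower())
            if key is None:
                # No mapping rule applies, so the record is flagged, not guessed.
                unresolved.append((source_name, record["name"]))
                continue
            fields = {
                f"{source_name}.{field}": value
                for field, value in record.items()
                if field != "name"
            }
            merged.setdefault(key, {}).update(fields)
    return merged, unresolved


# Toy records from two hypothetical databases that name the same gene differently.
sources = {
    "db_a": [{"name": "TP53", "chromosome": "17"}],
    "db_b": [{"name": "p53", "pathway": "apoptosis"},
             {"name": "unknownX", "pathway": "n/a"}],
}
merged, unresolved = integrate(sources, SYNONYMS)
```

Prefixing each field with its source name keeps provenance visible, which matters when two databases disagree about the same entity.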
11
Exploration of Existing Data Mining Tools for Biodata Analysis
DNA and Protein Sequence Analysis
- Three basic approaches: sequence comparison, similarity search, pattern finding
- Tools:
  - Pairwise alignment tools: the Basic Local Alignment Search Tool (BLAST)
  - Multiple sequence alignment tools: ClustalW
- Challenging problems: promoter search, protein functional motif search
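To make pairwise local alignment concrete, here is a minimal sketch of Smith-Waterman scoring. This is not how BLAST works internally (BLAST is a faster heuristic); it is the exact dynamic-programming method that such tools approximate. The scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
shared_core = smith_waterman_score("TTACGTTT", "GGACGGG")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

Only the score is returned here; recovering the aligned region itself would require keeping the full table and tracing back from the best cell.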
12
Genome Analysis
- How is the whole genome put together from many small pieces of sequences?
- Where are the genes located on a chromosome?
- Challenging problem: prediction of gene structures
Macromolecule Structure Analysis
- Prediction of secondary structure of RNA and proteins
- Comparison of protein structures
- Protein structure classification
- Visualization of protein structures
- Structure prediction is still an unsolved problem
13
Pathway Analysis
- To build, model, and visualize biological processes among gene products
Microarray Analysis
- Algorithms: hierarchical clustering, k-means, self-organizing maps, support vector machines, association rules, neural networks
- Software: GeneSpring, Spotfire
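Of the algorithms listed, k-means is the easiest to show in full. The sketch below is a plain-Python version run on a made-up expression matrix (rows are genes, columns are conditions); the data, the fixed iteration count, and the choice to seed centroids with the first k rows are all assumptions made to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Deterministic initialization: the first k points act as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up expression profiles: two low-expression and two high-expression genes.
profiles = [
    [1.0, 1.1, 0.9],
    [5.0, 5.2, 4.9],
    [0.9, 1.0, 1.2],
    [5.1, 4.8, 5.0],
]
labels, centroids = kmeans(profiles, k=2)
```

In practice the result depends on initialization, so libraries typically restart from several random seeds and keep the best run.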
14
Discovery of Frequent Sequential and Structured Patterns
- Most biodata patterns contain a substantial amount of noise or faults
Mining Sequential Patterns
- BLAST: for a protein or DNA sequence S, BLAST finds all similar sequences S’ in the database such that the aggregate mutation score from S to S’ is above some user-specified threshold
- Tandem repeat detection: a segment that occurs more than a certain number of times within a DNA sequence
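Tandem repeat detection can be sketched for the simplest case: exact, back-to-back copies of a unit of fixed length. This is a simplification, since the slide notes that real patterns are noisy and practical detectors tolerate mismatches. The function name, the test sequence, and the `min_copies` threshold (applied as "at least") are assumptions for the example.

```python
def find_tandem_repeats(seq, unit_len, min_copies):
    """Return (start, unit, copies) for runs of consecutive exact copies of a unit."""
    hits = []
    i = 0
    while i + unit_len <= len(seq):
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend the run while the next block repeats the same unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i = j  # skip past the run that was just reported
        else:
            i += 1
    return hits


repeats = find_tandem_repeats("GGCAGCAGCAGTT", unit_len=3, min_copies=3)
```

The scan reports the first phase of the repeat it encounters, so the same region could equally be described with a rotated unit.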
15
Mining Structured Patterns
- Apriori-like candidate generation and test approach: FSG
- Frequent pattern growth approach: gSpan
- Mining closed subgraph patterns rather than all subgraph patterns: a subgraph G is closed if there exists no supergraph G’ such that G ⊂ G’ and support(G) = support(G’)
16
Classification Methods
- Example task: normal cells vs. cancer cells
- The support vector machine (SVM) is considered the most accurate classification tool for many bioinformatics applications
- Drawback of SVM: the complexity of training an SVM is O(N^2), where N is the number of training examples
17
Cluster Analysis Methods
Clustering microarray data by biclustering or p-clustering
- In a microarray gene expression dataset, each column represents a condition, whereas each row represents a gene
- A bicluster is a subset of genes and conditions such that the subset of genes exhibits similar fluctuations under the given subset of conditions
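"Similar fluctuations" can be made precise in more than one way. The sketch below uses the pScore criterion commonly given for the pCluster model: for genes x, y and conditions a, b, the score is |(d_xa - d_xb) - (d_ya - d_yb)|, and a gene/condition subset is a delta-pCluster when every such 2x2 score is at most delta. The slide does not spell this definition out, so treat it as an assumption, and the expression values are made up.

```python
from itertools import combinations


def p_score(matrix, x, y, a, b):
    """pScore of the 2x2 submatrix formed by genes x, y and conditions a, b."""
    return abs((matrix[x][a] - matrix[x][b]) - (matrix[y][a] - matrix[y][b]))


def is_delta_pcluster(matrix, genes, conditions, delta):
    """True when every 2x2 submatrix of the selection has pScore <= delta."""
    return all(
        p_score(matrix, x, y, a, b) <= delta
        for x, y in combinations(genes, 2)
        for a, b in combinations(conditions, 2)
    )


# Made-up expression matrix: rows are genes, columns are conditions.
# Gene 1 is gene 0 shifted up by 2, so the two rise and fall together.
expression = [
    [1, 4, 2],
    [3, 6, 4],
    [5, 1, 9],
]
coherent = is_delta_pcluster(expression, genes=[0, 1], conditions=[0, 1, 2], delta=0.5)
incoherent = is_delta_pcluster(expression, genes=[0, 1, 2], conditions=[0, 1, 2], delta=0.5)
```

Genes 0 and 1 have very different absolute values yet form a perfect cluster, which is the point of pattern-based clustering: it compares shapes, not distances.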
18
Clustering sequential biodata
- The functionality of a gene depends largely on its layout, that is, the sequential order of its amino acids or nucleotides
- If two genes or proteins have similar components, their functionality may be similar
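One simple way to quantify "similar components" is to compare the sets of short subsequences (k-mers) two sequences contain. The sketch below computes a Jaccard similarity over k-mer sets; this particular measure and the choice k=3 are assumptions for illustration, not the method described in the slides. A clustering algorithm could then group sequences using such pairwise similarities.

```python
def kmers(seq, k):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Jaccard similarity of the k-mer sets of two sequences (0.0 to 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0  # both sequences are shorter than k
    return len(ka & kb) / len(union)


same = kmer_jaccard("ACGTAC", "ACGTAC")
partial = kmer_jaccard("ACGTAC", "ACGTTT")
different = kmer_jaccard("AAAAAA", "CCCCCC")
```

Set-based k-mer comparison ignores where the shared pieces occur, so it captures composition but only part of the sequential order the slide emphasizes.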
19
Computational Modeling of Biological Networks
- Molecular interactions in a cell can be represented using graphs of network connections
- A set of connected molecular interactions can be considered a pathway
- Three subsystems: metabolic network or pathway, protein network, genetic or gene regulatory network
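The graph view above maps directly onto an adjacency list, and a pathway then corresponds to a connected chain of interactions. The sketch below finds one shortest such chain with breadth-first search. The molecule names are placeholders invented for the example and do not describe a real pathway.

```python
from collections import deque


def find_pathway(network, source, target):
    """Return a shortest chain of interactions from source to target, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in network.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical directed interaction network (placeholder molecule names).
network = {
    "receptor": ["kinase1", "kinase2"],
    "kinase1": ["factor"],
    "kinase2": ["factor"],
    "factor": ["target_gene"],
}
pathway = find_pathway(network, "receptor", "target_gene")
no_pathway = find_pathway(network, "target_gene", "receptor")
```

Edges are treated as directed here, which suits regulatory relationships; an undirected protein interaction network would list each edge in both directions.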
20
Data Visualization and Visual Data Mining
- Three types of visualization tools:
  - Generic data visualization tools
  - Knowledge discovery in databases and model visualization tools
  - Interactive visualization environments for integrating data mining and visualization processes
21
Emerging Frontiers
Text Mining in Bioinformatics
- To find all the related literature and publications studying the same genes and proteins from different aspects
- Automated mining of biochemical knowledge from digital repositories of scientific literature
- Two approaches for recognizing interactions between proteins and other molecules:
  - Using occurrence statistics of gene names in MEDLINE documents to predict the connections among genes
  - Using specific linguistic structures to extract protein interaction information from MEDLINE documents
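The first approach, occurrence statistics, can be sketched as counting how often two gene names appear in the same abstract. The abstracts and gene names below are invented placeholders, not MEDLINE content, and the tokenization is deliberately naive; real systems must also handle gene-name synonyms and ambiguity.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts, gene_names):
    """Count, for each pair of gene names, the abstracts that mention both."""
    counts = Counter()
    for text in abstracts:
        # Naive tokenization: strip commas and periods, compare case-insensitively.
        tokens = set(text.upper().replace(",", " ").replace(".", " ").split())
        present = sorted(g for g in gene_names if g.upper() in tokens)
        counts.update(combinations(present, 2))
    return counts


# Placeholder abstracts with placeholder gene names.
abstracts = [
    "geneA and geneB were measured together.",
    "Expression of geneB, geneA, and geneC changed.",
    "geneC was unchanged.",
]
counts = cooccurrence_counts(abstracts, ["geneA", "geneB", "geneC"])
```

Raw counts favor frequently mentioned genes, so a practical system would normalize them, for example against how often each name appears on its own.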
22
Emerging Frontiers
Systems Biology
- To understand a system’s structure and dynamics
- Four key properties:
  - System structures: the network of gene interactions and biochemical pathways
  - System dynamics: how a system behaves over time under various conditions
  - The control method: the mechanisms that systematically control the state of the cell
  - The design method: strategies to modify and construct biological systems having desired properties
23
Open Research Problems
- Data quality maintenance
- Visualization difficulties with high-dimensional data
- File standards, data storage, access, data mining, and information retrieval
- How to integrate biological knowledge into the design and development of data mining models and algorithms
- Finding the rules or regularities that may disclose the mystery of the “dark matter” of a genome