Data Mining and Decision Trees: 1. Data Mining and Biological Information; 2. Data Mining and Machine Learning Techniques; 3. Decision Trees and C5; 4. Applications


1 Data Mining and Decision Trees
1. Data Mining and Biological Information; 2. Data Mining and Machine Learning Techniques; 3. Decision Trees and C5; 4. Applications

2 Data Mining and Biological Information
Any result in bioinformatics should answer a biological question. For your results to be useful, they must be interpretable. Data mining is the process of finding, interpreting, and evaluating patterns in large sets of data; for us, in the context of bioinformatics applications.

3 Data Mining and Machine Learning Techniques
Most data mining methods are also machine learning techniques from AI. Machine learning programs adapt their behavior with experience. To “learn” is to be trained on data by a set of well-defined instructions, that is, by machine learning algorithms. Data mining tools are supplements to, rather than substitutes for, human knowledge and intuition. The objective of running a learning algorithm on the data is to find patterns or trends that aid in understanding the data.

4 Challenges for Knowledge Discovery in Biology (Russ Altman)
Bioinformatics is the study of information flow in biology, from genotype to phenotype: sequence, structure, and function analysis. Challenges: computational models of physiology; design of new computer algorithms; engineering new biological pathways; data mining for new science.

5 Types of Models
Narrative: textual description of a formal system (theory of evolution); Physical (aircraft model, DNA molecule); Analog (similarity or parallelism); Mathematical (spreadsheet, decision trees, neural networks); Heuristic (machine intelligence and expert systems).

6 Model Classification by Outcome
Outcome types: Predictive, Classifier.
Knowledge based: expert systems, fuzzy systems, evolutionary programs, neural network, expert systems, genetic algorithm.
Mathematical: regression analysis, correlation, adaptive learning, cluster analysis, classification and decision trees (CART, C5, QUEST), self-organizing maps.

7 Classification Problem
Given a dataset D and a class label C, find a classifier d such that the misclassification rate of d is minimized. Goal: to produce an accurate classifier and to understand the structure of the problem. Requirements: high accuracy, interpretability, and fast construction for very large training data.
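The misclassification rate named on this slide can be computed as the fraction of records whose predicted class differs from the true class. The following is a minimal sketch; the function name, the toy `threshold_classifier`, and the `expression` field are illustrative assumptions, not part of the original slides.

```python
from typing import Callable, Mapping, Sequence


def misclassification_rate(
    classifier: Callable[[Mapping[str, float]], str],
    records: Sequence[Mapping[str, float]],
    labels: Sequence[str],
) -> float:
    """Fraction of records for which the classifier's prediction is wrong."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    if not records:
        raise ValueError("at least one record is required")
    errors = sum(1 for rec, lab in zip(records, labels) if classifier(rec) != lab)
    return errors / len(records)


# Hypothetical classifier d: calls a record "high" above a fixed threshold.
def threshold_classifier(record: Mapping[str, float]) -> str:
    return "high" if record["expression"] > 1.0 else "low"


# Hypothetical dataset D with class labels C.
records = [{"expression": 0.2}, {"expression": 1.5},
           {"expression": 2.4}, {"expression": 0.9}]
labels = ["low", "high", "low", "low"]

rate = misclassification_rate(threshold_classifier, records, labels)
print(rate)  # one of four records is misclassified: 0.25
```

In practice the rate would be measured on data held out from training, since the rate on the training data itself tends to be optimistic.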

8 Decision Trees
A decision tree T encodes a classifier d in the form of a tree. Internal node: a binary or k-ary split. Leaf node: labeled with one class label.
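One way to picture the structure described on this slide is a pair of node types: internal nodes that test an attribute and branch on its value, and leaves that carry a class label. This is a hedged sketch; the class names `Leaf` and `Internal`, the attribute names, and the class labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Leaf:
    """Leaf node: labeled with exactly one class label."""
    label: str


@dataclass
class Internal:
    """Internal node: a k-ary split on one attribute (binary when k = 2)."""
    attribute: str
    children: Dict[str, "Node"] = field(default_factory=dict)


Node = Union[Leaf, Internal]


def classify(node: Node, record: Dict[str, str]) -> str:
    """Follow the branch matching the record's value until a leaf is reached."""
    while isinstance(node, Internal):
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical tree with two binary splits.
tree = Internal("signal_peptide", {
    "yes": Leaf("secreted"),
    "no": Internal("tm_helix", {
        "yes": Leaf("membrane"),
        "no": Leaf("cytoplasmic"),
    }),
})

predicted = classify(tree, {"signal_peptide": "no", "tm_helix": "yes"})
print(predicted)  # membrane
```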

9 Decision Tree Construction
Top-down tree construction schema: (1) examine the training data and find the best splitting attribute for the root node; (2) partition the training data; (3) recurse on each child node.

10 Decision Tree Construction (contd.)
BuildTree(Node t, Training data D, Split Selection Method S)
(1) Apply S to D to find the splitting criterion
(2) If (t is not a leaf node)
(3)     create children nodes of t
(4)     partition D into children partitions
(5)     recurse on each partition
(6) Endif
Three algorithmic components: split selection (C5, CART, QUEST, …), pruning, and data access.
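The BuildTree schema can be sketched as a short recursive function that takes the split selection method as a parameter. This is an illustrative sketch only: trees are represented as nested dictionaries, the stopping rule (pure node or no attributes left) is an assumption, and `fewest_errors_split` is a deliberately simple placeholder, not the criterion used by C5, CART, or QUEST. Pruning and data access are left out.

```python
from collections import Counter


def majority_label(rows, target):
    """Most frequent class label among the rows."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def build_tree(rows, attributes, target, select_split):
    """Top-down construction: choose a split, partition, recurse."""
    labels = {r[target] for r in rows}
    # Assumed stopping rule: the node is a leaf when it is pure
    # or when no attributes remain to split on.
    if len(labels) == 1 or not attributes:
        return {"leaf": majority_label(rows, target)}

    attr = select_split(rows, attributes, target)      # step (1)
    partitions = {}
    for r in rows:                                     # step (4)
        partitions.setdefault(r[attr], []).append(r)
    if len(partitions) < 2:                            # split separates nothing
        return {"leaf": majority_label(rows, target)}

    remaining = [a for a in attributes if a != attr]
    return {
        "split": attr,
        "children": {                                  # steps (3) and (5)
            value: build_tree(part, remaining, target, select_split)
            for value, part in partitions.items()
        },
    }


def fewest_errors_split(rows, attributes, target):
    """Placeholder S: pick the attribute whose partitions, each labeled by
    majority vote, misclassify the fewest training rows."""
    def errors(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], Counter())[r[target]] += 1
        return sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return min(attributes, key=errors)


# Hypothetical training data D.
rows = [
    {"gc_content": "high", "has_motif": "yes", "cls": "coding"},
    {"gc_content": "high", "has_motif": "no", "cls": "coding"},
    {"gc_content": "low", "has_motif": "yes", "cls": "noncoding"},
    {"gc_content": "low", "has_motif": "no", "cls": "noncoding"},
]

tree = build_tree(rows, ["has_motif", "gc_content"], "cls", fewest_errors_split)
print(tree)
```

Passing the split selection method as a function mirrors the slide's signature, where S is an argument of BuildTree, so a different criterion can be swapped in without touching the recursion.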

11 Split Selection Methods
Impurity-based split selection: CART, C5 (most common in today’s data mining tools). Model-based split selection: QUEST (Quick, Unbiased, Efficient Statistical Tree; Loh and Shih, 1997; freeware, available at www.stat.wisc.edu/~loh).
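Impurity-based split selection scores each candidate attribute by how much it reduces the impurity of the class labels. The sketch below shows two common impurity measures, entropy and the Gini index, and the resulting impurity reduction. The Gini index is the measure usually associated with CART, and entropy-based gain with Quinlan's family of tree learners; the slides do not state the exact criterion C5 applies, so treat this as a generic illustration with hypothetical data and function names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(rows, attr, target, impurity):
    """Parent impurity minus the size-weighted impurity of the partitions."""
    parent = impurity([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    weighted = sum(len(g) / len(rows) * impurity(g) for g in groups.values())
    return parent - weighted


def best_split(rows, attributes, target, impurity):
    """Attribute with the largest impurity reduction."""
    return max(attributes,
               key=lambda a: impurity_reduction(rows, a, target, impurity))


# Hypothetical training data.
rows = [
    {"gc_content": "high", "has_motif": "yes", "cls": "coding"},
    {"gc_content": "high", "has_motif": "no", "cls": "coding"},
    {"gc_content": "low", "has_motif": "yes", "cls": "noncoding"},
    {"gc_content": "low", "has_motif": "no", "cls": "noncoding"},
]

gain_gc = impurity_reduction(rows, "gc_content", "cls", entropy)
gain_motif = impurity_reduction(rows, "has_motif", "cls", entropy)
chosen = best_split(rows, ["has_motif", "gc_content"], "cls", gini)
print(gain_gc, gain_motif, chosen)
```

Here `gc_content` separates the two classes perfectly, so it removes all of the parent's impurity, while `has_motif` removes none.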

12 Decision Trees and C5
Decision trees are one of the data mining methods commonly reported in the bioinformatics literature. C5 is a software package based on the decision tree method of J. R. Quinlan. One major advantage of decision trees over other machine learning techniques is that they produce models (rules) that can be interpreted by humans. To learn more about Rule Induction …
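The interpretability claim can be made concrete: every path from the root to a leaf reads as one IF-THEN rule. The sketch below walks a tree stored as nested dictionaries and prints one rule per leaf. The tree, the attribute names, and the rule wording are hypothetical; this is not the rule format that C5 itself produces.

```python
def tree_to_rules(node, conditions=()):
    """Return one human-readable rule per root-to-leaf path."""
    if "leaf" in node:
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node['leaf']}"]
    rules = []
    for value, child in node["children"].items():
        # Extend the path with the test taken on this branch.
        rules.extend(tree_to_rules(child, conditions + ((node["split"], value),)))
    return rules


# Hypothetical tree in nested-dictionary form.
tree = {
    "split": "signal_peptide",
    "children": {
        "yes": {"leaf": "secreted"},
        "no": {
            "split": "tm_helix",
            "children": {
                "yes": {"leaf": "membrane"},
                "no": {"leaf": "cytoplasmic"},
            },
        },
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```

A biologist can check each printed rule against domain knowledge, which is harder to do with models such as neural networks.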

13 CSUS Access to C5
(1) Log in to quad. (2) Change directory to /opt/C50Release1. (3) Read the “ReadMe” file for an example and the format requirements. You are then ready to use C5. An example of a C5 application.

14 Applications: Getting better by using it
BIOKDD 01 (a case study); Lawrence Hunter’s home page; Interface01 (a bigger picture); ISMB 2001; PSB 2002; O’Reilly Bioinformatics Tech. Conf. 2002

15 Extracting Knowledge from Gene Expression Data: A Case Study of Batten Disease
S. M. Lin (Duke University Medical Center) proposed a prototype KDD system to enable scientists to analyze massive microarray data, form hypotheses, and gain insight into the underlying mechanisms of diseases.
Data → microarray database → data mining → patterns → human experts → genomics knowledge base → discoveries

16 Data Mining Lab Assignment (150 to 200 words, due 12/7/01)
Choose one of the following and post your work both to the WebCT discussion for grading and on your web page: 1. Select a paper from the on-line proceedings of BIOKDD 2001 and write a paper review. 2. Write a report on whatever interested you most from KDD CUP 2001.

