Data Mining and Decision Trees
1. Data Mining and Biological Information
2. Data Mining and Machine Learning Techniques
3. Decision Trees and C5
4. Applications

Data Mining and Biological Information
Any result in bioinformatics should answer a biological question.
For your results to be useful, they must be interpretable.
Data mining: the process of finding, interpreting, and evaluating patterns in large sets of data; for us, in the context of some bioinformatics applications.

Data Mining and Machine Learning Techniques
Most data mining methods are also machine learning techniques from AI.
Machine learning programs adapt their behavior with experience.
To “learn” is to be trained on data by a set of well-defined instructions: a machine learning algorithm.
Data mining tools are supplements to, rather than substitutes for, human knowledge and intuition.
The objective of running a learning algorithm on the data is to find patterns or trends that aid in understanding the data.

Challenges for Knowledge Discovery in Biology (Russ Altman)
Bioinformatics is the study of information flow in biology, from genotype to phenotype: sequence, structure, and function analysis.
Challenges:
  Computational models of physiology
  Design of new computer algorithms
  Engineering new biological pathways
  Data mining for new science

Types of Models
  Narrative, textual description of a formal system (theory of evolution)
  Physical (aircraft model, DNA molecule)
  Analog (similarity or parallelism)
  Mathematical (spreadsheet, decision trees, neural networks)
  Heuristic (machine intelligence and expert systems)

Model Classification by Outcome
Outcome types: Predictive, Classifier
Knowledge based: Expert systems, Fuzzy systems, Evolutionary programs, Neural network, Expert systems, Genetic algorithm
Mathematical: Regression analysis, Correlation, Adaptive learning, Cluster analysis, Classification and Decision Trees (CART, C5, QUEST), Self-Organizing maps

Classification Problem
Given a dataset D and a class label C, find a classifier d such that the misclassification rate of d is minimized.
Goal: to produce an accurate classifier and to understand the structure of the problem.
Requirements: high accuracy, interpretability, and fast construction for very large training data.
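The misclassification rate mentioned above can be made concrete with a small sketch. The toy records, the labels, and the `threshold_classifier` function are hypothetical and only illustrate the definition: the fraction of records for which the classifier d predicts the wrong class label.

```python
from typing import Callable, Sequence, Tuple

Record = Tuple[float, ...]


def misclassification_rate(
    classifier: Callable[[Record], str],
    records: Sequence[Record],
    labels: Sequence[str],
) -> float:
    """Fraction of records whose predicted label differs from the true label."""
    if len(records) != len(labels):
        raise ValueError("records and labels must have the same length")
    if not records:
        raise ValueError("dataset is empty")
    errors = sum(1 for x, c in zip(records, labels) if classifier(x) != c)
    return errors / len(records)


# Hypothetical toy dataset D: one numeric attribute per record.
records = [(1.0,), (2.0,), (3.0,), (4.0,)]
labels = ["low", "low", "high", "low"]


def threshold_classifier(x: Record) -> str:
    """Hypothetical classifier d: a single threshold on the first attribute."""
    return "high" if x[0] > 2.5 else "low"


rate = misclassification_rate(threshold_classifier, records, labels)
print(rate)  # one of four records is misclassified
```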

Decision Trees
A decision tree T encodes d (a classifier) in the form of a tree.
Internal node: a binary or k-ary split.
Leaf node: labeled with one class label.
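One possible way to represent such a tree in code is sketched below. The class names `Leaf` and `Internal`, the attribute names, and the example tree are all assumptions made for illustration; the slides do not prescribe a data structure.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Leaf node: carries exactly one class label."""
    label: str


@dataclass
class Internal:
    """Internal node: splits on one attribute, one child per attribute value."""
    attribute: str
    children: Dict[str, "Node"]


Node = Union[Leaf, Internal]


def classify(tree: Node, record: Dict[str, str]) -> str:
    """Follow the splits from the root until a leaf is reached."""
    node = tree
    while isinstance(node, Internal):
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical tree with two binary splits (attribute names are made up).
tree = Internal(
    attribute="expression",
    children={
        "high": Internal(
            attribute="mutation",
            children={"yes": Leaf("disease"), "no": Leaf("healthy")},
        ),
        "low": Leaf("healthy"),
    },
)

print(classify(tree, {"expression": "high", "mutation": "yes"}))
```

A k-ary split is simply an `Internal` node whose `children` dictionary has k entries.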

Decision Tree Construction
Top-down tree construction schema:
1. Examine the training data and find the best splitting attribute for the root node.
2. Partition the training data.
3. Recurse on each child node.

Decision Tree Construction (contd.)
BuildTree(Node t, Training data D, Split Selection Method S)
(1) Apply S to D to find the splitting criterion
(2) If (t is not a leaf node)
(3)     create children nodes of t
(4)     partition D into children partitions
(5)     recurse on each partition
(6) Endif
Three algorithmic components:
  Split selection (C5, CART, QUEST, …)
  Pruning
  Data access
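The BuildTree schema above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the C5, CART, or QUEST implementation: attributes are categorical, the split selection method `min_error_split` is a hypothetical stand-in for S, the toy data is made up, and pruning and data access are left out.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Data = List[Tuple[Record, str]]  # (attribute values, class label)


def majority_label(data: Data) -> str:
    return Counter(label for _, label in data).most_common(1)[0][0]


def partition(data: Data, attribute: str) -> Dict[str, Data]:
    """Group the training data by the value of the splitting attribute."""
    parts: Dict[str, Data] = {}
    for record, label in data:
        parts.setdefault(record[attribute], []).append((record, label))
    return parts


def min_error_split(data: Data) -> Optional[str]:
    """Hypothetical split selection method S: pick the attribute whose
    partitions have the fewest majority-vote errors; None means 'make a leaf'."""
    if len({label for _, label in data}) <= 1:
        return None  # all records share one class
    best_attr, best_errors = None, None
    for attr in sorted(data[0][0]):
        parts = partition(data, attr)
        if len(parts) < 2:
            continue  # attribute does not split this data
        errors = sum(
            len(p) - Counter(l for _, l in p).most_common(1)[0][1]
            for p in parts.values()
        )
        if best_errors is None or errors < best_errors:
            best_attr, best_errors = attr, errors
    return best_attr


def build_tree(data: Data, select_split: Callable[[Data], Optional[str]]) -> dict:
    attribute = select_split(data)            # (1) apply S to D
    if attribute is None:                     # (2) leaf: stop here
        return {"label": majority_label(data)}
    children = {
        value: build_tree(part, select_split)  # (3)-(5) create, partition, recurse
        for value, part in partition(data, attribute).items()
    }
    return {"attribute": attribute, "children": children}


# Hypothetical training data D.
data = [
    ({"expression": "high", "mutation": "yes"}, "disease"),
    ({"expression": "high", "mutation": "no"}, "healthy"),
    ({"expression": "low", "mutation": "yes"}, "healthy"),
    ({"expression": "low", "mutation": "no"}, "healthy"),
]
tree = build_tree(data, min_error_split)
print(tree)
```

Passing the split selection method as a function mirrors the schema's parameter S, so a different criterion can be swapped in without touching `build_tree`.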

Split Selection Methods
Impurity-based split selection: CART, C5 (most common in today’s data mining tools)
Model-based split selection: QUEST (quick, unbiased, efficient, statistical tree; Loh and Shih, 1997; freeware)
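To illustrate what "impurity-based" means, the sketch below computes two widely used impurity measures, the Gini index and entropy, and the impurity reduction achieved by a candidate split. The Gini index is commonly associated with CART and entropy-based gain with Quinlan's family of algorithms; the slides do not state the exact criterion used by C5, so treat this as a generic illustration with made-up labels.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def impurity_reduction(
    parent: Sequence[str],
    children: List[Sequence[str]],
    impurity: Callable[[Sequence[str]], float],
) -> float:
    """Parent impurity minus the size-weighted impurity of the child partitions."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical class labels at a node, and a split that separates them perfectly.
parent = ["a", "a", "b", "b"]
perfect_split = [["a", "a"], ["b", "b"]]
useless_split = [["a", "b"], ["a", "b"]]

gini_gain = impurity_reduction(parent, perfect_split, gini)
info_gain = impurity_reduction(parent, perfect_split, entropy)
no_gain = impurity_reduction(parent, useless_split, gini)
print(gini_gain, info_gain, no_gain)
```

An impurity-based method evaluates candidate splits this way and keeps the one with the largest reduction.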

Decision Trees and C5
One of the data mining methods most commonly reported in the bioinformatics literature.
C5 is a software package based on the decision tree method of J. R. Quinlan.
One major advantage of decision trees over other machine learning techniques is that they produce models (rules) that can be interpreted by humans.
To learn more about Rule Induction …
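The interpretability claim can be illustrated by reading a tree as a set of if-then rules, one per root-to-leaf path. The sketch below is a simplification with a hypothetical dictionary-based tree; it is not how C5 itself derives its rulesets.

```python
from typing import List, Tuple


def tree_to_rules(node: dict, conditions: Tuple[Tuple[str, str], ...] = ()) -> List[str]:
    """Return one human-readable rule for every path from the root to a leaf."""
    if "label" in node:
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {node['label']}"]
    rules: List[str] = []
    for value, child in node["children"].items():
        rules.extend(tree_to_rules(child, conditions + ((node["attribute"], value),)))
    return rules


# Hypothetical tree (attribute names and labels are made up).
tree = {
    "attribute": "expression",
    "children": {
        "high": {
            "attribute": "mutation",
            "children": {"yes": {"label": "disease"}, "no": {"label": "healthy"}},
        },
        "low": {"label": "healthy"},
    },
}

rules = tree_to_rules(tree)
for rule in rules:
    print(rule)
```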

CSUS Access to C5
1. Log in to quad.
2. Change directory to /opt/C50Release1.
3. Read the “ReadMe” file for an example and the format requirements.
4. You are ready to use C5.
An example of a C5 application

Applications: Getting better by using it
  BIOKDD 01: a case study
  Lawrence Hunter’s home page
  Interface01: a bigger picture
  ISMB 2001
  PSB 2002
  O’Reilly Bioinformatics Tech. Conf. 2002

Extracting Knowledge from Gene Expression Data: A Case Study of Batten Disease (S. M. Lin, Duke University Medical Center)
Proposes a prototype KDD system that enables scientists to analyze massive microarray data, form hypotheses, and gain direct insight into the underlying mechanisms of diseases.
Data → Microarray database → data mining → patterns → human experts → Genomics knowledge base → discoveries

Data Mining Lab Assignment ( words due 12/7/01)
Choose one of the following and post your work both to the WebCT discussion for grading and on your web page:
1. Select a paper from the online proceedings of BIOKDD 2001 and write a paper review.
2. Write a report on whatever interests you most from KDD CUP 2001.