Data Mining in Microarray Analysis

Presentation transcript:

Data Mining in Microarray Analysis

Classification (Supervised Learning)
- Finding models (functions) that describe and distinguish classes or concepts for future prediction, e.g. predicting disease from gene expression profiles
- Similar to prediction, but predicts an unknown or missing categorical value rather than a numerical value
- Presentation: decision tree, classification rules, neural network

Cluster Analysis (Unsupervised Learning)
- Class labels are unknown: group the data to form new classes, e.g. cluster genes to find distribution patterns
- Clustering follows the principle of maximizing intra-class similarity and minimizing inter-class similarity
- E.g. group genes based on their gene expression profiles
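To make the distinction concrete, here is a minimal sketch (assuming scikit-learn; the expression values and labels are made-up placeholders, not data from the slides): a supervised classifier is fitted on labelled samples, and an unsupervised clustering is run on the same samples without using the labels.

```python
# Minimal sketch: supervised classification vs. unsupervised clustering
# on a tiny, made-up gene-expression matrix (values are illustrative only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Rows = samples, columns = genes (G1..G4)
X = np.array([
    [11.1,  1.3, 2.0, 11.0],
    [13.1,  1.3, 1.3,  2.0],
    [13.3, 11.1, 1.4,  2.2],
    [11.3, 14.2, 1.1,  1.2],
])
y = np.array(["Healthy", "Diseased", "Diseased", "Healthy"])  # known class labels

# Supervised: learn a mapping from expression profile to class label
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[12.0, 1.5, 1.8, 10.5]]))   # classify a new sample

# Unsupervised: group the same samples without using the labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                               # cluster index for each sample
```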

Supervised vs. Unsupervised Learning

Supervised (Classification):
- known number of classes
- based on a training set
- used to classify future observations

Unsupervised (Clustering):
- unknown number of classes
- no prior knowledge
- used to understand (explore) the data

Speaker notes: As a third method, a few words on pattern recognition, which deals with decision-making processes: the aim is first to understand these processes and then to automate them with computers. Pattern recognition divides into two classes, supervised and unsupervised. Supervised pattern recognition assumes a known number of classes and is based on a training set, i.e. a set of observations whose class assignment is already known; using this known assignment, the actual observations of unknown class are then assigned to classes. Unsupervised pattern recognition assumes no a priori knowledge, and the number of classes is unknown. Supervised pattern recognition is used to classify future observations into predefined classes, whereas cluster analysis, as just presented, is a form of unsupervised pattern recognition, so unsupervised pattern recognition is not discussed further; the following presents supervised pattern recognition.

Supervised vs. Unsupervised Learning
[Figure: two scatter plots of income vs. debt, one labelled "Supervised Learning" with points marked by class (* and o) and one labelled "Unsupervised Learning" with unlabelled points (+)]

Classification
[Workflow: a training set (data with known classes) is fed to a classification technique, which produces a classifier; the classifier is then applied to data with unknown classes to produce a class assignment]
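A minimal sketch of this workflow (assuming scikit-learn; all data below are made-up placeholders): the training set yields a classifier, which is checked on a held-out split and then applied to data with unknown classes.

```python
# Sketch of the classification workflow: training set -> classifier ->
# class assignment for new data, with a held-out split to check accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                # data with known classes (features)
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # placeholder class labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

classifier = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Held-out accuracy:", accuracy_score(y_test, classifier.predict(X_test)))

X_unknown = rng.normal(size=(3, 4))          # data with unknown classes
print("Class assignment:", classifier.predict(X_unknown))
```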

Types of Classifiers
[Figure: two scatter plots of income vs. debt with two classes of points (* and o); a linear classifier separates them with a straight line, a non-linear classifier with a curved boundary]
Linear decision rule: a*income + b*debt < t  =>  No loan!
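A minimal sketch of that linear decision rule; the weights a, b and the threshold t below are arbitrary placeholders, not values from the slide.

```python
# Linear classifier decision rule from the slide: a*income + b*debt < t => "No loan"
def loan_decision(income, debt, a=0.5, b=-1.0, t=10.0):
    """Return 'No loan' if the weighted score falls below the threshold t."""
    score = a * income + b * debt
    return "No loan" if score < t else "Loan"

print(loan_decision(income=30.0, debt=2.0))   # score = 13.0  -> "Loan"
print(loan_decision(income=15.0, debt=8.0))   # score = -0.5  -> "No loan"
```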

Predictive Modelling
- Predict categorical class labels
- Classify data (construct a model) based on the training set and the values (class labels) of a classifying attribute, and use the model to classify new data

Training set (Play Tennis):

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
 1   Sunny     Hot          High      Weak    No
 2   Sunny     Hot          High      Strong  No
 3   Overcast  Hot          High      Weak    Yes
 4   Rain      Mild         High      Weak    Yes
 5   Rain      Cool         Normal    Weak    Yes
 6   Rain      Cool         Normal    Strong  No
 7   Overcast  Cool         Normal    Strong  Yes
 8   Sunny     Mild         High      Weak    No
 9   Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

Classification
Learning: induce classifiers from training data
- Task: determine which of a fixed set of classes an example belongs to
- Input: a training set of examples annotated with class values
- Output: induced hypotheses (model / concept description / classifier)
[Diagram: Training Data -> Inductive Learning System -> Classifiers (Derived Hypotheses)]
Prediction: use the hypothesis to classify any example described in the same manner
[Diagram: Data to be classified -> Classifier -> Decision on class assignment]

Decision Tree: Example
Induced from the Play Tennis training set on the previous slide (how the root attribute is chosen is sketched below):

Outlook?
- Sunny    -> Humidity? (High -> No, Normal -> Yes)
- Overcast -> Yes
- Rain     -> Wind? (Strong -> No, Weak -> Yes)
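The slide does not show how the root attribute is chosen; one standard criterion is ID3-style information gain (an assumption here, not something stated on the slide). A minimal sketch that computes it for the Play Tennis data:

```python
# Sketch: ID3-style information gain on the Play Tennis data,
# showing why Outlook ends up as the root of the tree.
from collections import Counter
from math import log2

data = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis)
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

for i, name in enumerate(attributes):
    print(f"{name}: gain = {information_gain(data, i):.3f}")
# Outlook has the largest gain (about 0.247), so it becomes the root split.
```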

Classification: Relevant Gene Identification
Goal: identify a subset of genes that distinguishes between treatments, tissues, etc.
Method (sketched in code below):
- Collect several samples grouped by treatment (e.g. Diseased vs. Healthy)
- Use genes as "features"
- Build a classifier to distinguish the treatments
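A minimal sketch of this method (assuming scikit-learn; the expression values, gene names, and labels are made-up placeholders): fit a decision tree on treatment-labelled samples and rank the genes by how much the tree's splits relied on them.

```python
# Sketch: find genes that distinguish Diseased vs. Healthy samples by
# fitting a decision tree and inspecting its feature importances.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

genes = ["G1", "G2", "G3", "G4"]           # genes used as "features"
X = np.array([                             # rows = samples, columns = genes
    [11.1,  1.3, 2.0, 11.0],
    [12.3,  2.0, 1.2, 11.1],
    [13.1,  1.3, 1.3,  2.0],
    [13.3, 11.1, 1.4,  2.2],
    [14.1, 13.1, 1.1,  2.4],
    [11.3, 14.2, 1.1,  1.2],
])
y = np.array(["Healthy", "Healthy", "Diseased", "Diseased", "Diseased", "Healthy"])

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
ranking = sorted(zip(genes, tree.feature_importances_), key=lambda p: -p[1])
for gene, importance in ranking:
    print(f"{gene}: importance {importance:.2f}")  # genes with 0 were never used
```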

Gene Expression Example

ID  G1     G2     G3    G4     Cancer
 1  11.12   1.34  1.97  11.0   No
 2  12.34   2.01  1.22  11.1   No
 3  13.11   1.34  1.34   2.0   Yes
 4  13.34  11.11  1.38   2.23  Yes
 5  14.11  13.10  1.06   2.44  Yes
 6  11.34  14.21  1.07   1.23  No
 7  21.01  12.32  1.97   1.34  Yes
 8  66.11  33.3   1.97   1.34  Yes
 9  33.11  44.1   1.96  11.23  Yes
10  11.54  11.1   1.97  10.01  Yes
11  12.00  15.1   1.98   9.01  Yes
12  15.23   1.11  1.89  12.48  No
13  31.22   2.0   1.99  13.51  Yes
14  11.33  11.1   1.01  11.01  No
15  ...    ...    ...   ...    ...

[Decision tree over the gene features: root split on G1 (<=22 vs. >22), with further splits on G3 and G4 thresholds (<=12 / >12, <=52 / >52) leading to Yes/No leaves]

Problem: with a large number of genes (~10,000), feature selection/reduction techniques are needed (see the sketch below).
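A minimal sketch of such a feature-selection step (assuming scikit-learn; the synthetic matrix below stands in for a real expression table with ~10,000 genes): keep only the k genes most associated with the class label, then fit the classifier on that reduced set.

```python
# Sketch: feature selection before classification for high-dimensional
# expression data (synthetic stand-in for a ~10,000-gene matrix).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_samples, n_genes = 40, 10_000
X = rng.normal(size=(n_samples, n_genes))    # expression matrix (samples x genes)
y = rng.integers(0, 2, size=n_samples)       # Cancer label: 0 = No, 1 = Yes
X[y == 1, :5] += 2.0                         # make the first 5 genes informative

# Keep the 20 genes most associated with the label, then fit the classifier
model = make_pipeline(SelectKBest(f_classif, k=20),
                      DecisionTreeClassifier(random_state=0))
model.fit(X, y)

selected = model.named_steps["selectkbest"].get_support(indices=True)
print("Selected gene indices:", selected[:10], "...")
```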