Data Mining in Microarray Analysis

Classification (Supervised Learning)
  Finding models (functions) that describe and distinguish classes or concepts for future prediction
  E.g., predict disease based on gene expression profiles
  Similar to prediction, but predicts an unknown or missing categorical value rather than a numerical value
  Presentation: decision tree, classification rule, neural network

Cluster Analysis (Unsupervised Learning)
  Class label is unknown: group data to form new classes, e.g., cluster genes to find distribution patterns
  Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity
  E.g., group genes based on their gene expression profiles
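The clustering principle (high intra-class similarity, low inter-class similarity) can be checked numerically for a given grouping. The sketch below is only an illustration: it assumes Euclidean distance as the inverse of similarity, and the expression profiles, cluster names, and function names are hypothetical values made up for the example, not data from the slides.

```python
from itertools import combinations
from math import dist
from statistics import mean

# Hypothetical expression profiles (three measurements per gene), grouped
# into two candidate clusters. These numbers are illustrative only.
clusters = {
    "A": [(1.0, 2.0, 3.0), (1.1, 2.1, 2.9)],
    "B": [(5.0, 1.0, 0.5), (5.2, 0.9, 0.4)],
}


def mean_intra_distance(clusters):
    """Average Euclidean distance between profiles in the same cluster."""
    return mean(
        dist(p, q)
        for members in clusters.values()
        for p, q in combinations(members, 2)
    )


def mean_inter_distance(clusters):
    """Average Euclidean distance between profiles in different clusters."""
    return mean(
        dist(p, q)
        for (_, a), (_, b) in combinations(clusters.items(), 2)
        for p in a
        for q in b
    )


intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(f"intra-cluster: {intra:.2f}, inter-cluster: {inter:.2f}")
```

A grouping that follows the principle should give a small intra-cluster distance relative to the inter-cluster distance; other similarity measures (e.g., correlation) could be substituted for the distance function.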
Supervised vs Unsupervised Learning

Supervised (Classification)
  known number of classes
  based on a training set
  used to classify future observations

Unsupervised (Clustering)
  unknown number of classes
  no prior knowledge
  used to understand (explore) data

Notes: As a third method, I will say something here about pattern recognition (PR). This method deals with decision-making processes. The aim is first to understand these processes and then to automate them with the help of computers. Pattern recognition can be divided into two classes, supervised and unsupervised. Supervised pattern recognition assumes a known number of classes; in unsupervised PR the number of classes is unknown. Supervised PR is based on a so-called training set: a series of observations whose class assignments are already known. On the basis of these known assignments, the actual observations, whose class membership is unknown, are assigned to the classes. Unsupervised PR assumes no a priori knowledge. Supervised PR is used to classify future observations into given classes. Cluster analysis, as just presented, counts as a form of unsupervised PR, so I will not go into unsupervised PR any further. The following slides present supervised PR.
Supervised vs. Unsupervised Learning
[Figure: two scatter plots of debt against income. Supervised Learning panel: points carry class symbols (* and o). Unsupervised Learning panel: all points share one symbol (+).]
Classification
  Training Set (data with known classes) -> Classification Technique -> Classifier
  Data with unknown classes -> Classifier -> Class Assignment
Types of Classifiers
  Linear Classifier: e.g., a*income + b*debt < t => No loan!
  Nonlinear Classifier
[Figure: two scatter plots of debt against income with points of two classes (* and o), one panel per classifier type.]
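The linear rule on this slide, a*income + b*debt < t => No loan, can be written as a short function. This is a minimal sketch: the weights a and b, the threshold t, and the "Loan" outcome for the opposite case are assumptions chosen for illustration, since the slide only states the "No loan" side of the rule.

```python
def loan_decision(income, debt, a, b, t):
    """Apply the linear rule a*income + b*debt < t => "No loan".

    The "Loan" result for the other side of the boundary is an assumption;
    the slide only specifies when no loan is granted.
    """
    score = a * income + b * debt
    return "No loan" if score < t else "Loan"


# Hypothetical parameters: income counts positively, debt negatively.
A, B, T = 1.0, -2.0, 10.0

print(loan_decision(income=50, debt=10, a=A, b=B, t=T))  # score 30.0
print(loan_decision(income=20, debt=10, a=A, b=B, t=T))  # score 0.0
```

The decision boundary a*income + b*debt = t is a straight line in the income/debt plane, which is what makes the classifier linear; a nonlinear classifier would separate the classes with a curved or piecewise boundary instead.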
Predictive Modelling:

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

  Predict categorical class labels
  Classify data (construct a model) based on the training set and the values (class labels) in a classifying attribute, and use it in classifying new data
Classification
  Task: determine which of a fixed set of classes an example belongs to
  Input: training set of examples annotated with class values
  Output: induced hypotheses (model / concept description / classifiers)

Learning: induce classifiers from training data
  Training Data -> Inductive Learning System -> Classifiers (Derived Hypotheses)

Prediction: use the hypothesis to classify any example described in the same manner
  Data to be classified -> Classifier -> Decision on class assignment
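The two phases, learning and prediction, can be sketched with a deliberately simple learner. The example below assumes a one-attribute rule (each attribute value is mapped to its majority class), which is a stand-in chosen for brevity and not a method named on the slides; the function name induce is hypothetical. The training tuples are the Outlook, Wind, and Play Tennis columns of the Play Tennis table.

```python
from collections import Counter, defaultdict

# Training set: (outlook, wind, play_tennis) for days 1-14 of the table.
TRAINING = [
    ("Sunny", "Weak", "No"), ("Sunny", "Strong", "No"),
    ("Overcast", "Weak", "Yes"), ("Rain", "Weak", "Yes"),
    ("Rain", "Weak", "Yes"), ("Rain", "Strong", "No"),
    ("Overcast", "Strong", "Yes"), ("Sunny", "Weak", "No"),
    ("Sunny", "Weak", "Yes"), ("Rain", "Weak", "Yes"),
    ("Sunny", "Strong", "Yes"), ("Overcast", "Strong", "Yes"),
    ("Overcast", "Weak", "Yes"), ("Rain", "Strong", "No"),
]


def induce(examples, attribute_index):
    """Learning phase: derive a hypothesis from annotated examples.

    The hypothesis maps each value of one attribute to the majority class
    seen with that value; unseen values fall back to the overall majority.
    """
    counts = defaultdict(Counter)
    for example in examples:
        counts[example[attribute_index]][example[-1]] += 1
    default = Counter(e[-1] for e in examples).most_common(1)[0][0]
    rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda example: rule.get(example[attribute_index], default)


# Prediction phase: classify examples described in the same manner.
classifier = induce(TRAINING, attribute_index=0)  # learn from Outlook
print(classifier(("Overcast", "Strong")))
print(classifier(("Sunny", "Weak")))
```

The returned function plays the role of the classifier in the diagram: it takes data to be classified and returns a decision on the class assignment.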
Decision Tree: Example

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
1    Sunny     Hot          High      Weak    No
2    Sunny     Hot          High      Strong  No
3    Overcast  Hot          High      Weak    Yes
4    Rain      Mild         High      Weak    Yes
5    Rain      Cool         Normal    Weak    Yes
6    Rain      Cool         Normal    Strong  No
7    Overcast  Cool         Normal    Strong  Yes
8    Sunny     Mild         High      Weak    No
9    Sunny     Cool         Normal    Weak    Yes
10   Rain      Mild         Normal    Weak    Yes
11   Sunny     Mild         Normal    Strong  Yes
12   Overcast  Mild         High      Strong  Yes
13   Overcast  Hot          Normal    Weak    Yes
14   Rain      Mild         High      Strong  No

Outlook
  Sunny    -> Humidity
                High   -> No
                Normal -> Yes
  Overcast -> Yes
  Rain     -> Wind
                Strong -> No
                Weak   -> Yes
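The tree on this slide can be read as nested tests on Outlook, Humidity, and Wind. The sketch below encodes it as a function and checks it against the 14 training rows; the function name play_tennis and the tuple layout are assumptions made for the example, and the leaf labels are the ones implied by the table.

```python
# Training rows: (outlook, temperature, humidity, wind, play_tennis).
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]


def play_tennis(outlook, humidity, wind):
    """Walk the decision tree: Outlook at the root, then Humidity or Wind."""
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError(f"unknown outlook: {outlook!r}")


# Count how many training rows the tree classifies correctly.
correct = sum(
    play_tennis(outlook, humidity, wind) == label
    for outlook, _temperature, humidity, wind, label in ROWS
)
print(f"{correct} of {len(ROWS)} training rows classified correctly")
```

Temperature never appears in the tree, so the function ignores it; the tree separates the training data using three of the four attributes.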
Classification: Relevant Gene Identification
  Goal: identify a subset of genes that distinguishes between treatments, tissues, etc.
  Method:
    Collect several samples grouped by treatment (e.g., Diseased vs. Healthy)
    Use genes as “features”
    Build a classifier to distinguish treatments
Gene Expression Example

ID  G1     G2     G3    G4     Cancer
1   11.12  1.34   1.97  11.0   No
2   12.34  2.01   1.22  11.1   No
3   13.11  1.34   1.34  2.0    Yes
4   13.34  11.11  1.38  2.23   Yes
5   14.11  13.10  1.06  2.44   Yes
6   11.34  14.21  1.07  1.23   No
7   21.01  12.32  1.97  1.34   Yes
8   66.11  33.3   1.97  1.34   Yes
9   33.11  44.1   1.96  11.23  Yes
10  11.54  11.1   1.97  10.01  Yes
11  12.00  15.1   1.98  9.01   Yes
12  15.23  1.11   1.89  12.48  No
13  31.22  2.0    1.99  13.51  Yes
14  11.33  11.1   1.01  11.01  No
15  …..    …      ..    ..     ..

[Figure: decision tree with root G1 (branches <=22 and >22), internal nodes G3 and G4 (branch thresholds <=12 / >12 and <=52 / >52), and leaves No / Yes.]

Problem: with a large number of genes (~10000), feature selection/reduction techniques are needed.
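One simple form of feature selection is to score each gene by how well it separates the two classes and keep only the top-ranked genes. The sketch below uses a signal-to-noise score (absolute difference of class means divided by the sum of class standard deviations); this score is an assumption chosen for illustration, as the slides do not name a specific technique, and the function names are hypothetical. The data are the 14 complete rows of the table.

```python
from statistics import mean, stdev

GENES = ("G1", "G2", "G3", "G4")

# (expression values for G1..G4, cancer label) for samples 1-14.
SAMPLES = [
    ((11.12, 1.34, 1.97, 11.0), "No"), ((12.34, 2.01, 1.22, 11.1), "No"),
    ((13.11, 1.34, 1.34, 2.0), "Yes"), ((13.34, 11.11, 1.38, 2.23), "Yes"),
    ((14.11, 13.10, 1.06, 2.44), "Yes"), ((11.34, 14.21, 1.07, 1.23), "No"),
    ((21.01, 12.32, 1.97, 1.34), "Yes"), ((66.11, 33.3, 1.97, 1.34), "Yes"),
    ((33.11, 44.1, 1.96, 11.23), "Yes"), ((11.54, 11.1, 1.97, 10.01), "Yes"),
    ((12.00, 15.1, 1.98, 9.01), "Yes"), ((15.23, 1.11, 1.89, 12.48), "No"),
    ((31.22, 2.0, 1.99, 13.51), "Yes"), ((11.33, 11.1, 1.01, 11.01), "No"),
]


def signal_to_noise(values_yes, values_no):
    """Class separation score: |mean difference| / (sum of sample stdevs)."""
    spread = stdev(values_yes) + stdev(values_no)
    if spread == 0:
        return 0.0  # no variation in either class, so no usable score
    return abs(mean(values_yes) - mean(values_no)) / spread


def rank_genes(samples, genes):
    """Return (gene, score) pairs sorted from most to least discriminative."""
    scores = {}
    for i, gene in enumerate(genes):
        yes = [values[i] for values, label in samples if label == "Yes"]
        no = [values[i] for values, label in samples if label == "No"]
        scores[gene] = signal_to_noise(yes, no)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_genes(SAMPLES, GENES)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

With ~10000 genes the same ranking would be computed over all columns and only the highest-scoring genes passed on to the classifier. A per-gene score like this ignores interactions between genes, which is one reason other selection or reduction techniques may be preferred.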