Supervised Learning I BME 230.

Supervised (hypothesis-driven) learning In clustering we only had the expression data available. In supervised learning, a “response” or “label” or “class” is also known (e.g. group of related genes, tissue of origin, drug treatment, concentration, temperature, ...)

Supervised Learning in gene expression analysis Use information about known biology. Usually we predict arrays based on their expression vectors (columns of the matrix): samples of different cell types, healthy versus sick patients, drug treatments. Find genes that are informative, then use the expression of those informative genes to predict new samples.

Example – Predicting Cancer Prognosis Somatic chromosomal alterations frequently occur in cancer. Some may be neutral while others contribute to pathogenesis. The goal is to identify aberrations that recur across tumors.

Array Comparative Genomic Hybridization

Chin et al. (2007): 171 primary breast tumors and 41 breast cancer cell lines

Cancer Classification Challenge Given genomic features such as CNV (x), predict clinical outcomes (y). Classification if y is discrete (cancer vs. normal, HER2 status); regression if y is continuous (genome instability index, GII).

Cluster the CNV Data Patients (columns) are clustered on their CNV predictors x; the responses y may be discrete (e.g. tumor grade) or continuous (e.g. GII).

Classifying leukemia (Golub et al. 1999) To target a cancer with the right drug, leukemias are usually classified by morphology. Acute leukemia is known to have variable clinical outcome, and nuclear morphologies differ. History of subtype markers: 1960s, enzyme assays (myeloperoxidase+); 1970s, antibodies to lymphoid vs. myeloid markers; 1995, specific chromosomal translocations. Golub et al. (1999)

Classifying leukemia (Golub et al. 1999) 38 bone marrow samples taken at time of diagnosis: 27 patients with acute lymphoblastic leukemia (ALL) and 11 patients with acute myeloblastic leukemia (AML). Class discovery: can clustering “find” the distinction between the leukemias? Golub et al. (1999)

Leukemia: Neighborhood analysis Golub et al (1999)

Classifying leukemia (Golub et al 1999)

Supervised Learning Train a model on labelled examples x1, ..., xn with labels y1, ..., yn. Find a function h(x) → y, so that the class y can be predicted for new observations.
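As a concrete illustration of this train/predict interface, here is a minimal Python sketch using a toy nearest-centroid rule; the data, function names, and choice of classifier are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def train(X, y):
    """Fit a classifier on labelled examples (x_1, y_1), ..., (x_n, y_n).
    Here: a toy nearest-centroid rule, just to illustrate the interface."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(model, X_new):
    """h(x): assign each new observation to the class with the closest centroid."""
    classes, centroids = model
    d = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

# toy usage: feature vectors (e.g. two genes per sample) with known labels
X = np.array([[0.4, -0.1], [0.3, 0.5], [1.5, 0.1], [0.9, 0.5]])
y = np.array(["Normal", "Normal", "Cancer", "Cancer"])
print(predict(train(X, y), np.array([[1.2, 0.2]])))   # -> ['Cancer']
```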

Copy Number Variation Data (patient genomic samples)

y:          Normal    Normal    Normal    Cancer    Cancer
Genes (x)   sample1   sample2   sample3   sample4   sample5   …
    1         0.46      0.30      0.80      1.51      0.90    ...
    2        -0.10      0.49      0.24      0.06      0.46    ...
    3         0.15      0.74      0.04      0.10      0.20    ...
    4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
    5        -0.06      1.06      1.35      1.09     -1.09    ...

A value below zero indicates fewer copies of gene i in sample j.

Tumor Classification from Genomic Data Three main types of statistical problems are associated with high-throughput genomic data: (1) identification of “marker” genes that characterize the different tumor classes (feature or variable selection); (2) identification of new/unknown tumor classes using gene expression profiles (unsupervised learning – clustering); (3) classification of samples into known classes (supervised learning – classification). These problems are relevant to other types of classification, not just tumors.

Classification

Y:       Normal    Normal    Normal    Cancer    Cancer              unknown = Y_new
         sample1   sample2   sample3   sample4   sample5   …         New sample
  1        0.46      0.30      0.80      1.51      0.90    ...         0.34
  2       -0.10      0.49      0.24      0.06      0.46    ...         0.43
  3        0.15      0.74      0.04      0.10      0.20    ...        -0.23
  4       -0.45     -1.03     -0.79     -0.56     -0.32    ...        -0.91
  5       -0.06      1.06      1.35      1.09     -1.09    ...         1.23
                              (X)                                     (X_new)

Each object (e.g. an array, or column) is associated with a class label (response) Y ∈ {1, 2, …, K} and a feature vector of G predictor measurements X = (X1, …, XG). Aim: predict Y_new from X_new.

Supervised Learning methods covered: neighbor-based methods – k-nearest neighbors (KNN), Parzen windows & kernel density estimation; discriminating hyperplanes – linear discriminants (LDA/QDA/PDA), support vector machines (SVM), neural nets and perceptrons (ANNs); decision trees (CART); aggregating classifiers.

Neighbor-based methods (guilt by association) The function of a gene should be similar to the functions of its neighbors. Neighbors are found in the predictor space X, and the neighbors vote for the function of the gene.

Nearest Neighbor Classification Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation). The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows: find the k closest observations in the training data and predict the class by majority vote, i.e. choose the class that is most common among those k neighbors. k is a parameter whose value is chosen by minimizing the cross-validation error (discussed later). E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
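A minimal Python sketch of this rule, assuming Euclidean distance and a simple majority vote; the function names and toy data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """k-NN rule: find the k closest training observations (Euclidean
    distance here) and predict the class by majority vote."""
    dist = np.linalg.norm(X_train - x_new, axis=1)
    neighbors = np.argsort(dist)[:k]
    return Counter(y_train[neighbors]).most_common(1)[0][0]

# toy usage on the kind of data shown earlier (samples as rows, genes as columns)
X_train = np.array([[0.46, -0.10, 0.15], [0.30, 0.49, 0.74],
                    [1.51, 0.06, 0.10], [0.90, 0.46, 0.20]])
y_train = np.array(["Normal", "Normal", "Cancer", "Cancer"])
print(knn_predict(X_train, y_train, np.array([1.2, 0.1, 0.2]), k=3))
```

In practice k itself would be chosen by cross-validation, e.g. looping over candidate values and counting leave-one-out misclassifications.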

Neighbor-based methods A “new” unclassified gene, with its expression vector, is compared against genes already labelled as involved in degradation (1) or not (0). Does it encode a protein involved in degradation?

Neighbor-based methods 1. Find its closest neighbors: rank the labelled genes by similarity to the query, from most similar to least similar (e.g. 1, 0.8, 0.7, 0.65, 0.6, 0.54, 0.3, 0.1, 0.05, 0.03, 0.01).

Neighbor-based methods 2. Let the closest neighbors vote on function: among the most similar genes, count the number labelled degradation (1) versus the number labelled not (0).

k-nearest neighbors The k closest neighbors get to vote, no matter how far away they are; with k = 3, the majority label among the 3 nearest neighbors becomes the predicted function of the gene.

k-nearest neighbors with k = 5: of the 5 most similar genes, 4/5 say degradation, so the query gene is predicted to be involved in degradation.

Parzen windows Neighbors within distance d get to vote, no matter how many there are; their majority label becomes the predicted function of the gene.

Parzen windows with similarity > 0.1: nine genes fall inside the window and 6/9 say degradation, so the query gene is predicted to be involved in degradation.
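A corresponding sketch of the fixed-window rule, assuming a Euclidean radius d rather than the correlation-style similarity threshold used on the slide; the default label for an empty window is an assumption.

```python
import numpy as np
from collections import Counter

def parzen_predict(X_train, y_train, x_new, d=1.0, default="unknown"):
    """Fixed-window rule: every training point within distance d of x_new
    votes, however many (or few) neighbours that turns out to be."""
    dist = np.linalg.norm(X_train - x_new, axis=1)
    inside = dist <= d
    if not inside.any():
        return default            # empty window: no vote is cast
    return Counter(y_train[inside]).most_common(1)[0][0]
```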

KNN for missing data imputation Microarrays have *tons* of missing data, and some methods won't work with NAs... What can you do? Troyanskaya et al. 2002

Use k-nearest neighbors for missing data Troyanskaya et al. 2002
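A rough sketch of k-NN based imputation in the spirit of the method cited on the slide (not the authors' code): for each missing entry, the k genes with the most similar profiles on the shared observed arrays supply the average replacement value.

```python
import numpy as np

def knn_impute(X, k=10):
    """Fill NaNs in a genes x arrays matrix: for each gene with a missing
    value, average that column's values over the k most similar genes
    (Euclidean distance on commonly observed arrays)."""
    X = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for g in range(X.shape[0]):
            if g == i or np.isnan(X[g, j]):
                continue
            shared = ~np.isnan(X[i]) & ~np.isnan(X[g])
            if shared.sum() < 2:
                continue
            d = np.sqrt(np.mean((X[i, shared] - X[g, shared]) ** 2))
            candidates.append((d, X[g, j]))
        candidates.sort(key=lambda t: t[0])
        if candidates:
            X[i, j] = np.mean([v for _, v in candidates[:k]])
    return X
```

For routine use, a library implementation such as scikit-learn's KNNImputer does essentially this with more care.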

Hyperplane discriminants Search for a single partition that places all positives on one side and all negatives on the other side.

Hyperplane discriminants, Y discrete: a discrete response Y (e.g. cancer vs. non-cancer) plotted against a predictor X (e.g. PCNA expression).

Hyperplane discriminants, Y discrete: the “decision boundary” (discriminant line / separating hyperplane) splits the predictor axis X (e.g. PCNA expression) into the two classes (e.g. cancer vs. non-cancer); training points that fall on the wrong side are training misclassifications.

Hyperplane discriminants, Y discrete: applying the same decision boundary to test data; test points that fall on the wrong side are testing misclassifications.

Hyperplanes, Y continuous: the discriminant line / separating hyperplane now relates a continuous response Y (e.g. survival) to a predictor X (e.g. PCNA expression).

Hyperplanes with X a set of predictors: with two predictors X1 (e.g. PCNA) and X2 (e.g. a cell-surface marker) and response Y (e.g. survival), the fit is a hyperplane rather than a line.

Classify new cases with selected features f(Xi) → y, where the selected features vote through a linear score f(Xi) = ∑_j w_j X_ij = X_i β. We want a β that minimizes error. On the training data the least-squares error is ∑_i (f(X_i) − Y_i)² = (Xβ − y)ᵀ(Xβ − y), so β* = argmin_β (Xβ − y)ᵀ(Xβ − y) = (XᵀX)⁻¹Xᵀy.
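A short numerical check of this closed-form least-squares solution; the toy data and the 0.5 classification threshold are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])  # intercept + two toy features
y = np.array([0., 0., 0., 1., 1., 1.])                      # 0 = normal, 1 = cancer

# beta* = argmin (X beta - y)'(X beta - y) = (X'X)^{-1} X'y, via the normal equations
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = (X @ beta > 0.5).astype(int)   # threshold the linear score to get class labels
print(beta, y_hat)
```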

Fisher Linear Discriminant Analysis In a two-class classification problem we are given n samples in a d-dimensional feature space, n1 in class 1 and n2 in class 2. Goal: find a vector w and project the n samples onto the axis y = w'x so that the projected samples are well separated (a poorly chosen w1 gives poor separation; a well-chosen w2 gives good separation).

Fisher Linear Discriminant Analysis The sample mean vector for the ith class is m_i and the sample covariance matrix for the ith class is S_i. The between-class scatter matrix is S_B = (m1 − m2)(m1 − m2)'. The within-class scatter matrix is S_W = S1 + S2. The sample mean of the projected points in the ith class is m̃_i = w'm_i, and the scatter (variance) of the projected points in the ith class is s̃_i² = ∑_{y in class i} (y − m̃_i)².

Fisher Linear Discriminant Analysis Fisher's criterion chooses the w that maximizes J(w) = |m̃_1 − m̃_2|² / (s̃_1² + s̃_2²) = (w'S_B w) / (w'S_W w), i.e. the between-class distance should be as large as possible while the within-class scatter is as small as possible.
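A sketch of the classical closed-form solution to this criterion, w ∝ S_W⁻¹(m1 − m2), assuming the two classes are stored as separate arrays; names are illustrative.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Direction maximizing J(w) = (w'S_B w) / (w'S_W w) for two classes,
    via the closed form w ∝ S_W^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)     # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)     # within-class scatter of class 2
    Sw = S1 + S2
    return np.linalg.solve(Sw, m1 - m2)
```

Projecting each sample onto this w (y = w'x) and thresholding the projection gives the two-class discriminant.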

Maximum Likelihood Discriminant Rule A maximum likelihood (ML) classifier chooses the class that makes the observations most probable: the ML discriminant rule predicts the class of an observation X as the one that gives the largest likelihood to X, i.e. h(X) = argmax_k P(X | Y = k).

Gaussian ML Discriminant Rules Assume the conditional density for each class is multivariate Gaussian (normal), P(X | Y = k) ~ N(μ_k, Σ_k). The ML discriminant rule is then h(X) = argmin_k {(X − μ_k) Σ_k⁻¹ (X − μ_k)' + log|Σ_k|}. In general this is a quadratic rule (quadratic discriminant analysis, or QDA in R). In practice, the population mean vectors μ_k and covariance matrices Σ_k are estimated from the training set.
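A direct transcription of this quadratic rule into Python, assuming the per-class means and covariance matrices have already been estimated from the training set.

```python
import numpy as np

def qda_predict(x, means, covs, classes):
    """h(x) = argmin_k (x - mu_k)' Sigma_k^{-1} (x - mu_k) + log|Sigma_k|."""
    scores = []
    for mu, S in zip(means, covs):
        diff = x - mu
        scores.append(diff @ np.linalg.solve(S, diff) + np.log(np.linalg.det(S)))
    return classes[int(np.argmin(scores))]
```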

Gaussian ML Discriminant Rules When all class densities have the same covariance matrix, Σ_k = Σ, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2 classes): h(x) = argmin_k (x − μ_k) Σ⁻¹ (x − μ_k)'. In practice, the population mean vectors μ_k and the common covariance matrix Σ are estimated from the learning set L.

Gaussian ML Discriminant Rules When the class densities have diagonal covariance matrices, the discriminant rule is given by additive quadratic contributions from each variable (diagonal quadratic discriminant analysis, or DQDA). When all class densities share the same diagonal covariance matrix Σ = diag(σ₁², …, σ_G²), the rule is again linear (diagonal linear discriminant analysis, or DLDA in R).
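The slide points at the R implementation; below is a minimal Python sketch of DLDA, which with a shared diagonal covariance reduces to a variance-weighted nearest-centroid rule. The small variance floor is an assumption to avoid division by zero.

```python
import numpy as np

def dlda_fit_predict(X_train, y_train, X_new):
    """Diagonal LDA: pooled diagonal covariance diag(sigma_1^2, ..., sigma_G^2),
    so each new sample goes to the class with the nearest centroid under a
    per-feature, variance-weighted squared distance."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    resid = np.concatenate([X_train[y_train == c] - centroids[i]
                            for i, c in enumerate(classes)])
    var = resid.var(axis=0) + 1e-8                      # pooled per-feature variance (+ floor)
    d2 = (((X_new[:, None, :] - centroids[None, :, :]) ** 2) / var).sum(axis=2)
    return classes[d2.argmin(axis=1)]
```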

Application of the ML Discriminant Rule The weighted gene voting method (Golub et al. 1999) was one of the first applications of an ML discriminant rule to gene expression data. The method turns out to be a minor variant of the sample diagonal linear discriminant (DLDA) rule. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531-537.

Support Vector Machines Find the separating hyperplane with maximum margin. In practice, classes can overlap, so an error term measures how far onto the wrong side a point may fall. Hastie, Tibshirani, Friedman (2001)

Support Vector Machines To find the hyperplane, maximize the dual Lagrangian L_D = ∑_i α_i − ½ ∑_i ∑_j α_i α_j y_i y_j x_i'x_j subject to 0 ≤ α_i ≤ C and ∑_i α_i y_i = 0, which gives the solution β = ∑_i α_i y_i x_i. Any point with α_i > 0 is part of the support (a support vector).
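A hedged sketch using scikit-learn (an assumption; the slides do not specify software) to fit a linear SVM on toy data and inspect which training points end up with α > 0, i.e. the support vectors.

```python
import numpy as np
from sklearn.svm import SVC   # assumes scikit-learn is installed

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(20, 2)),
               rng.normal(+1.0, 1.0, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# Only the points with nonzero alpha define the maximum-margin boundary.
print(len(clf.support_), "support vectors out of", len(X))
```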

Kernels in SVMs The dot product x_i'x_j measures similarity between observations i and j and can be replaced with any appropriate similarity measure; that measure is called a kernel. Changing the kernel effectively changes the space in which we search for the separating hyperplane. Hastie et al. (2001)
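Continuing the same sketch, swapping the kernel is a one-argument change; the polynomial degree and gamma settings here are illustrative defaults, not values from the slides.

```python
from sklearn.svm import SVC   # assumes scikit-learn is installed

linear = SVC(kernel="linear")
poly4  = SVC(kernel="poly", degree=4, coef0=1)   # 4th-degree polynomial kernel (next slide)
rbf    = SVC(kernel="rbf", gamma="scale")        # Gaussian / radial basis kernel
# Each can be .fit(X, y) on the same data; only the implicit feature space changes.
```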

SVMs

SVMs – 4th-degree polynomial kernel. Hastie et al. (2001)

SVMs in microarray analysis The first application tried to predict gene functional categories from the Eisen yeast expression compendium. SVMs were trained to distinguish: ribosome vs. not, TCA cycle vs. not, respiration vs. not, proteasome vs. not, histones vs. not, helix-turn-helix vs. not. Brown et al. (2000)

Predicting TCA w/ SVM Brown et al. (2000)

Predicting Ribosomes w/ SVM Brown et al. (2000)

Predicting HTH w/ SVM Brown et al. (2000)

“False” predictions FP: genes newly predicted to be in a functional group, but previously thought to belong to another, may in fact be coregulated with the new group. FN: genes thought to belong to a functional group may not be coregulated with that group. Inspecting “errors” often leads to the most interesting findings!

Looking closely at “false” predictions w/ SVM RPN1: a regulatory particle subunit that interacts with the DNA-damage protein Rad23 (Elsasser S et al. 2002), so arguably it shouldn't have been in the list of errors. Brown et al. (2000)

Example of a “mis-annotated” gene: EGD1's expression profile closely resembles that of the ribosomal subunits. Brown et al. (2000)

New functional annotations Brown et al. (2000)