1
Supervised Learning I BME 230
2
Supervised (hypothesis-driven) learning
In clustering, only the expression data were available. In supervised learning, a "response" or "label" or "class" is also known (e.g. group of related genes, tissue of origin, drug treatment, concentration, temperature, ...).
3
Supervised Learning in gene expression analysis
Use information about known biology.
Usually predict arrays based on their expression vectors (columns of the matrix): samples of different cell types, healthy patients versus sick, drug treatments.
Find genes that are informative.
Use the expression of informative genes to predict new samples.
4
Example – Predicting Cancer Prognosis
Somatic chromosomal alterations frequently occur in cancer. Some may be neutral while others contribute to pathogenesis. Goal: identify aberrations that recur across tumors.
5
Array Comparative Genome Hybridization
6
Chin et al. (2007): 171 primary breast tumors and 41 breast cancer cell lines.
7
Cancer Classification Challenge
Given genomic features like CNV (x), predict clinical outcomes (y).
Classification if y is discrete: cancer vs. normal, HER2 status.
Regression if y is continuous: genome instability index (GII).
8
Cluster the CNV data (patients × genes). Discrete response y: grade. Continuous response y: GII. Predictors x: CNV.
9
Classifying leukemia (Golub et al 1999)
Target cancer with the right drug.
Leukemias are usually classified by morphology; acute leukemia is known to have variable clinical outcome, and the nuclear morphologies differ.
1960s: enzyme assays (myeloperoxidase+)
1970s: antibodies to lymphoid vs. myeloid markers
1995: specific chromosomal translocations
Golub et al (1999)
10
Classifying leukemia (Golub et al 1999)
38 bone marrow samples taken at time of diagnosis: 27 patients with acute lymphoblastic leukemia (ALL) and 11 patients with acute myeloblastic leukemia (AML).
Class discovery: can clustering "find" the distinction between the leukemias?
Golub et al (1999)
11
Leukemia: Neighborhood analysis
Golub et al (1999)
12
Classifying leukemia (Golub et al 1999)
14
Supervised Learning
Train a model with x1, ..., xn examples of labelled data. Labels are y1, ..., yn.
Find a function h(x) → y, so that the class y can be predicted on new observations.
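For concreteness, here is a minimal Python sketch of the h(x) → y pattern: fit on labelled examples, then predict on a new observation. The nearest-centroid rule and the toy data are illustrative, not from the slides.

```python
import numpy as np

def train_nearest_centroid(X, y):
    """Fit h by storing one mean feature vector (centroid) per class."""
    classes = np.unique(y)
    return {c: X[y == c].mean(axis=0) for c in classes}

def predict(centroids, x_new):
    """h(x): return the class whose centroid is closest to x_new."""
    return min(centroids, key=lambda c: np.linalg.norm(x_new - centroids[c]))

# Toy labelled data: 4 samples x 3 features, labels y (made up)
X = np.array([[1.0, 0.2, 0.1],
              [0.9, 0.1, 0.3],
              [0.1, 1.1, 0.9],
              [0.2, 0.9, 1.0]])
y = np.array(["normal", "normal", "cancer", "cancer"])

h = train_nearest_centroid(X, y)
print(predict(h, np.array([0.15, 1.0, 0.95])))  # -> "cancer"
```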
15
Copy Number Variation Data
Figure: a genes × patient-samples copy-number matrix (sample1, sample2, ..., sample5, ...) with labels y = (Normal, Normal, Normal, Cancer, Cancer); the genes are the predictors x, and the annotated entry marks fewer copies of gene i in sample j.
16
Tumor Classification Genomic Data
Three main types of statistical problems are associated with high-throughput genomic data:
1. Identification of "marker" genes that characterize the different tumor classes (feature or variable selection).
2. Identification of new/unknown tumor classes using gene expression profiles (unsupervised learning: clustering).
3. Classification of samples into known classes (supervised learning: classification).
These problems are relevant to other types of classification problems, not just tumors.
17
Classification
Figure: training samples sample1 ... sample5 with labels Y = (Normal, Normal, Normal, Cancer, Cancer) and feature matrix X, plus a new sample with features X_new and unknown label Y_new.
Each object (e.g. array or column) is associated with a class label (or response) Y ∈ {1, 2, ..., K} and a feature vector (vector of predictor variables) of G measurements: X = (X1, ..., XG).
Aim: predict Y_new from X_new.
18
Supervised Learning
Neighbor-based methods: k-nearest neighbors (KNN); Parzen windows & kernel density estimation.
Discriminating hyperplanes: linear discriminants (LDA/QDA/PDA); Support Vector Machines (SVM); neural nets and perceptrons (ANNs).
Decision trees (CART).
Aggregating classifiers.
19
Neighbor based methods (guilt by association)
The function of a gene should be similar to the functions of its neighbors.
Neighbors are found in the predictor space X.
The neighbors vote for the function of the gene.
20
Nearest Neighbor Classification
Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation).
The k-nearest neighbor rule (Fix and Hodges, 1951) classifies an observation X as follows: find the k closest observations in the training data and predict the class by majority vote, i.e. choose the class that is most common among those k neighbors.
k is a parameter whose value is determined later by minimizing the cross-validation error.
E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Tech. Report 4, USAF School of Aviation Medicine, Randolph Field, Texas, 1951.
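A minimal Python sketch of the k-NN rule just described (majority vote among the k closest training observations under Euclidean distance); the toy data and labels are made up.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """k-nearest neighbor rule: majority vote among the k closest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each training sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest observations
    votes = Counter(y_train[nearest])                 # count class labels among the neighbors
    return votes.most_common(1)[0][0]                 # the most common class wins

# Toy example
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
y_train = np.array(["ALL", "ALL", "AML", "AML", "AML"])
print(knn_predict(X_train, y_train, np.array([1.0, 1.0]), k=3))  # -> "AML"
```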
21
Neighbor-based methods
Does it encode a protein involved in degradation?
Figure: expression vectors for a set of genes, each labelled 1 (involved in degradation) or 0 (not); a "new" unclassified gene is marked "?".
22
Neighbor-based methods
1. Find its closest neighbors.
Figure: genes ranked by similarity to the new gene (1.0, 1.0, 0.8, 0.7, 0.65, 0.6, 0.54, 0.3, 0.1, 0.05, 0.03, 0.01), from most similar to least similar, each labelled for degradation.
23
Neighbor-based methods
2. Let the closest neighbors vote on function: among the neighbors, count the number of 1's (degradation) versus the number of 0's.
24
k-nearest neighbors: the k closest neighbors get to vote, no matter how far away they are (e.g. k = 3 in the figure); the majority vote gives the function of the gene.
25
k-nearest neighbors with k = 5: of the 5 most similar genes (similarities 1.0, 1.0, 0.8, 0.7, 0.65), 4/5 say degradation, so the new gene is predicted to be involved in degradation.
26
Parzen windows: neighbors within distance d get to vote, no matter how many there are; the majority vote within the window gives the function of the gene.
27
Parzen windows with similarity > 0.1: 6/9 of the genes inside the window say degradation, so the new gene is predicted to be involved in degradation.
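A minimal Python sketch of the Parzen-window vote (every training gene whose similarity to the query exceeds a threshold votes), using Pearson correlation as the similarity; the threshold and toy profiles are illustrative.

```python
import numpy as np

def parzen_vote(X_train, y_train, x_new, sim_threshold=0.1):
    """Vote among all training genes whose correlation with x_new exceeds the threshold."""
    sims = np.array([np.corrcoef(x, x_new)[0, 1] for x in X_train])
    in_window = sims > sim_threshold
    if not in_window.any():
        return None                          # no neighbors fall inside the window
    labels = y_train[in_window]
    return int(labels.sum() > len(labels) / 2)   # majority of 1's (degradation) vs 0's

# Toy example: 4 expression profiles, labels 1 = degradation, 0 = not
X_train = np.array([[1.0, 2.0, 3.0], [1.1, 2.1, 2.9], [3.0, 2.0, 1.0], [2.9, 1.9, 1.2]])
y_train = np.array([1, 1, 0, 0])
print(parzen_vote(X_train, y_train, np.array([1.0, 2.2, 3.1])))  # -> 1
```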
28
KNN for missing data imputation
Microarrays have *tons* of missing data, and some methods won't work with NAs... What can you do?
Troyanskaya et al. 2002
29
Use k-nearest neighbors for missing data
Troyanskaya et al. 2002
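A minimal sketch of KNN-style imputation in the spirit of the cited approach: for each missing entry, average that column over the k genes most similar on the shared observed columns. This is an illustrative re-implementation, not the authors' code.

```python
import numpy as np

def knn_impute(E, k=5):
    """Fill NaNs in a genes x arrays matrix E by averaging over the k most similar genes."""
    E = E.copy()
    for i, j in zip(*np.where(np.isnan(E))):
        obs = ~np.isnan(E[i])                      # columns observed for gene i
        dists = []
        for g in range(E.shape[0]):
            if g == i or np.isnan(E[g, j]):
                dists.append(np.inf)               # gene g cannot donate a value for column j
            else:
                ok = obs & ~np.isnan(E[g])         # columns observed in both genes
                dists.append(np.linalg.norm(E[i, ok] - E[g, ok]) if ok.any() else np.inf)
        donors = np.argsort(dists)[:k]             # the k most similar donor genes
        values = [E[g, j] for g in donors if np.isfinite(dists[g])]
        if values:
            E[i, j] = np.mean(values)
    return E

# Toy genes x arrays matrix with one missing value
E = np.array([[1.0, 2.0, np.nan],
              [1.1, 2.1, 3.0],
              [0.9, 1.9, 2.8],
              [5.0, 5.0, 5.0]])
print(knn_impute(E, k=2))   # the NaN becomes ~2.9, the mean over the two most similar genes
```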
30
Hyperplane discriminants
Search for a single partition (a separating hyperplane) that places all positives on one side and all negatives on the other.
31
Hyperplane discriminants
Figure: Y is discrete (e.g. cancer vs. non-cancer) plotted against a single predictor X (e.g. PCNA expression).
32
Hyperplane discriminants
Figure: Y is discrete; the "decision boundary" (discriminant line / separating hyperplane) divides the X axis (e.g. PCNA expression) into e.g. non-cancer and cancer, and training points falling on the wrong side are misclassified.
33
Hyperplane discriminants
Figure: the same decision boundary (discriminant line / separating hyperplane) separating e.g. non-cancer from cancer along X (e.g. PCNA expression), with some training points misclassified.
34
Hyperplane discriminants
Figure: Y is discrete; test data plotted against the same decision boundary, with test points on the wrong side misclassified.
35
Separating hyperplane
Figure: hyperplanes when Y is continuous (e.g. survival) plotted against X (e.g. PCNA expression); the fitted line plays the role of the "decision boundary" / discriminant line / separating hyperplane.
36
Hyperplanes
Figure: with a set of predictors X = (X1, X2) (e.g. X1 = PCNA expression, X2 = a cell-surface marker), Y (e.g. survival) is modeled by a hyperplane over (X1, X2).
37
Classify new cases with selected features
f(Xi) → y: the selected features vote.
f(Xi) = ∑j wj Xij = Xi β
We want a β that minimizes error. On the training data, the least-squares error is:
∑i (f(Xi) − Yi)² = (Xβ − y)ᵀ(Xβ − y)
β* = argminβ (Xβ − y)ᵀ(Xβ − y)
β* = (XᵀX)⁻¹Xᵀy
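A minimal numpy sketch of the least-squares fit above; β* is computed with a solver (equivalent to solving the normal equations (XᵀX)β = Xᵀy), and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # 50 samples, 3 selected features
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# beta* = argmin (Xb - y)'(Xb - y), i.e. the least-squares solution
beta_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_star)            # close to [2.0, -1.0, 0.5]

y_hat = X @ beta_star       # f(X_i) = X_i beta; threshold this for a discrete class label
```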
38
Fisher Linear Discriminant Analysis
In a two-class classification problem, we are given n samples in a d-dimensional feature space: n1 in class 1 and n2 in class 2.
Goal: find a vector w and project the n samples onto the axis y = w′x so that the projected samples are well separated.
Figure: w1 gives poor separation; w2 gives good separation.
39
Fisher Linear Discriminant Analysis
The sample mean vector for the ith class is mi and the sample covariance matrix for the ith class is Si.
The between-class scatter matrix is: SB = (m1 − m2)(m1 − m2)′
The within-class scatter matrix is: SW = S1 + S2
The sample mean of the projected points in the ith class is: m̃i = w′mi
The variance of the projected points in the ith class is: s̃i² = w′Si w
40
Fisher Linear Discriminant Analysis
Fisher linear discriminant analysis chooses the w that maximizes J(w) = (w′SB w) / (w′SW w), i.e. the between-class distance should be as large as possible while the within-class scatter should be as small as possible.
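A minimal sketch of the Fisher criterion: the maximizing direction is w ∝ SW⁻¹(m1 − m2), computed here on synthetic two-class data.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return w maximizing J(w) = (w'SB w)/(w'SW w): w = SW^{-1} (m1 - m2)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)             # within-class scatter of class 1
    S2 = (X2 - m2).T @ (X2 - m2)             # within-class scatter of class 2
    SW = S1 + S2
    return np.linalg.solve(SW, m1 - m2)

rng = np.random.default_rng(1)
X1 = rng.normal(loc=[0, 0], size=(30, 2))    # synthetic class 1
X2 = rng.normal(loc=[2, 1], size=(30, 2))    # synthetic class 2
w = fisher_direction(X1, X2)
proj1, proj2 = X1 @ w, X2 @ w                # projected samples y = w'x are well separated
print(proj1.mean(), proj2.mean())
```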
41
Maximum Likelihood Discriminant Rule
A maximum likelihood (ML) classifier chooses the class that makes the observations most probable: the ML discriminant rule predicts the class of an observation X as the class that gives the largest likelihood to X, i.e.
h(X) = argmaxk P(X | Y = k)
42
Gaussian ML Discriminant Rules
Assume the conditional density for each class is multivariate Gaussian (normal), P(X | Y = k) ~ N(μk, Σk).
Then the ML discriminant rule is
h(X) = argmink {(X − μk) Σk⁻¹ (X − μk)′ + log|Σk|}
In general this is a quadratic rule (quadratic discriminant analysis, or QDA in R).
In practice, the population mean vectors μk and covariance matrices Σk are estimated from the training set.
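A minimal sketch of the Gaussian ML (QDA) rule above: estimate μk and Σk for each class from training data, then assign a new x to the class minimizing (x − μk) Σk⁻¹ (x − μk)′ + log|Σk|; the data are synthetic.

```python
import numpy as np

def qda_fit(X, y):
    """Estimate per-class mean vectors and covariance matrices from the training set."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = (Xk.mean(axis=0), np.cov(Xk, rowvar=False))
    return params

def qda_predict(params, x):
    """h(x) = argmin_k (x - mu_k) Sigma_k^{-1} (x - mu_k)' + log|Sigma_k|."""
    def score(mu, Sigma):
        d = x - mu
        return d @ np.linalg.solve(Sigma, d) + np.log(np.linalg.det(Sigma))
    return min(params, key=lambda k: score(*params[k]))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], [1.0, 1.0], size=(40, 2)),
               rng.normal([3, 3], [0.5, 2.0], size=(40, 2))])
y = np.array(["normal"] * 40 + ["cancer"] * 40)
params = qda_fit(X, y)
print(qda_predict(params, np.array([2.8, 3.2])))   # -> "cancer"
```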
43
Gaussian ML Discriminant Rules
When all class densities have the same covariance matrix, Σk = Σ, the discriminant rule is linear (linear discriminant analysis, or LDA; FLDA for K = 2 classes):
h(x) = argmink (x − μk) Σ⁻¹ (x − μk)′
In practice, the population mean vectors μk and the common covariance matrix Σ are estimated from the learning set L.
44
Gaussian ML Discriminant Rules
When the class densities have diagonal covariance matrices, Δk = diag(σk1², ..., σkG²), the discriminant rule is given by additive quadratic contributions from each variable (diagonal quadratic discriminant analysis, or DQDA).
When all class densities have the same diagonal covariance matrix Δ = diag(σ1², ..., σG²), the discriminant rule is again linear (diagonal linear discriminant analysis, or DLDA in R).
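A minimal sketch of DLDA under the assumption of a single shared diagonal covariance diag(σ1², ..., σG²): the rule reduces to choosing the class whose mean is closest in variance-scaled squared distance. The pooled variance here is a simple average of per-class variances, an approximation chosen for brevity.

```python
import numpy as np

def dlda_fit(X, y):
    """Estimate class means and one shared per-gene variance (a diagonal covariance)."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    # simple average of per-class variances as the shared diagonal (an approximation)
    pooled_var = np.mean([X[y == k].var(axis=0, ddof=1) for k in classes], axis=0)
    return means, pooled_var

def dlda_predict(means, pooled_var, x):
    """h(x) = argmin_k sum_g (x_g - mu_kg)^2 / sigma_g^2 (additive contribution per gene)."""
    return min(means, key=lambda k: np.sum((x - means[k]) ** 2 / pooled_var))

# Toy usage
X = np.array([[1.0, 5.0], [1.2, 4.8], [0.8, 5.2], [3.0, 1.0], [3.2, 0.9], [2.8, 1.1]])
y = np.array(["normal", "normal", "normal", "cancer", "cancer", "cancer"])
means, var = dlda_fit(X, y)
print(dlda_predict(means, var, np.array([2.9, 1.0])))   # -> "cancer"
```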
45
Application of ML discriminant Rule
Weighted gene voting method (Golub et al. 1999): one of the first applications of an ML discriminant rule to gene expression data. The method turns out to be a minor variant of the sample diagonal linear discriminant (DLDA) rule.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. Oct 15;286(5439).
46
Support Vector Machines
Find the separating hyperplane with maximum margin.
In practice, classes can overlap, so an error (slack) term measures how far on the wrong side a point may be.
Hastie, Tibshirani, Friedman (2001)
47
Support Vector Machines
To find the hyperplane, maximize the (dual) Lagrangian
LD = ∑i αi − ½ ∑i ∑j αi αj yi yj xiᵀxj, subject to 0 ≤ αi ≤ C and ∑i αi yi = 0,
which gives the solution β = ∑i αi yi xi.
Any point with αi > 0 is part of the support (a support vector).
48
Kernels in SVMs
The dot product xiᵀxj measures how close points i and j are; it can be replaced with any appropriate similarity (distance) measure. That measure is called a kernel. Changing the kernel effectively changes the space in which we search for the hyperplane.
Hastie et al. (2001)
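A minimal sketch of the kernel idea: replace the dot product xiᵀxj with a kernel K(xi, xj), here a 4th-degree polynomial kernel (matching the next slide), and build the decision function from dual coefficients. The α values and bias b are placeholders that would normally come from solving the SVM optimization.

```python
import numpy as np

def poly_kernel(xi, xj, degree=4, c=1.0):
    """Polynomial kernel: replaces the plain dot product xi.xj."""
    return (xi @ xj + c) ** degree

def decision_function(X_train, y_train, alphas, b, x_new, kernel=poly_kernel):
    """SVM decision value f(x) = sum_i alpha_i y_i K(x_i, x) + b; sign(f) is the predicted class."""
    return sum(a * yi * kernel(xi, x_new)
               for a, yi, xi in zip(alphas, y_train, X_train)) + b

# Gram (kernel) matrix on toy data -- this is what the optimizer uses instead of X X'
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])
print(K)
```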
49
SVMs
50
SVMs – 4th degree polynomial kernel
Hastie et al. (2001)
51
SVMs in microarray analysis
The first application tried to predict gene categories from the Eisen yeast compendium.
Train to distinguish between: ribosome vs. not, TCA cycle vs. not, respiration vs. not, proteasome vs. not, histones vs. not, helix-turn-helix vs. not.
Brown et al. (2000)
52
Predicting TCA w/ SVM Brown et al. (2000)
53
Predicting Ribosomes w/ SVM
Brown et al. (2000)
54
Predicting HTH w/ SVM Brown et al. (2000)
55
"False" predictions
FP: genes newly predicted to be in a functional group but previously thought to belong to another may in fact be coregulated with the new group.
FN: genes that were thought to belong to a functional group may not be coregulated with that group.
Inspecting "errors" often leads to the most interesting findings!
56
Looking closely at “false” predictions w/ SVM
RPN1: a regulatory-particle subunit that interacts with the DNA damage protein Rad23 (Elsasser S et al. 2002); it shouldn't have been in the list.
Brown et al. (2000)
57
E.g. of “mis-annotated” gene
EGD1's expression profile closely resembles those of the ribosomal subunits. Brown et al. (2000)
58
New functional annotations
Brown et al. (2000)