Supervised Learning and Classification
Xiaole Shirley Liu and Jun Liu
Outline
Dimension reduction
– Principal Component Analysis (PCA)
– Other approaches such as MDS, SOM, etc.
Unsupervised learning for classification
– Clustering and KNN
Supervised learning for classification
– CART, SVM
Expression and genome resources
Dimension Reduction
High dimensional data points are difficult to visualize
Always good to plot data in 2D
– Easier to detect or confirm the relationship among data points
– Catch careless mistakes (e.g. in clustering)
Two ways to reduce:
– By genes: some experiments are similar or have little information
– By experiments: some genes are similar or have little information
Principal Component Analysis
An optimal linear transformation that chooses a new coordinate system for the data set: the data are projected onto new axes (the principal components) in order of decreasing variance captured
Components are orthogonal (mutually uncorrelated)
A few PCs may capture most of the variation in the original data
E.g. reduce 2D data into 1D
Principal Component Analysis (PCA)
Example: human SNP marker data
PCA for 800 randomly selected SNPs
PCA for 400 randomly selected SNPs
PCA for 200 randomly selected SNPs
PCA for 100 randomly selected SNPs
PCA for 50 randomly selected SNPs
Interpretations and Insights
PCA can discover aggregated subtle effects/differences in high-dimensional data
PCA finds linear directions onto which to project the data; it is purely unsupervised, so it is indifferent to the "importance" of particular directions and simply looks for the most "variable" ones
There are generalizations for supervised PCA
Principal Component Analysis
Achieved by singular value decomposition (SVD): X = U D Vᵀ
X is the original N × p data
– E.g. N genes, p experiments
V is the p × p matrix of projection directions
– Orthogonal matrix: VᵀV = I_p
– v1 is the direction of the first projection
– Each direction is a linear combination (relative importance) of the experiments (or of the genes, if PCA is done on samples)
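A minimal NumPy sketch of PCA via SVD (an illustration only; the toy data and variable names are assumptions, not from the slides):

```python
# Illustrative sketch: PCA via SVD with NumPy.
# X is assumed to be an N x p matrix, e.g. N genes (rows) by p experiments (columns).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))            # toy data standing in for expression values

Xc = X - X.mean(axis=0)                  # center each column (experiment)
U, d, Vt = np.linalg.svd(Xc, full_matrices=False)   # Xc = U * diag(d) * Vt

scores = U * d                           # principal components: projection of each row onto v1, v2, ...
loadings = Vt.T                          # columns are the projection directions v1, v2, ...
explained = d**2 / np.sum(d**2)          # fraction of variance captured by each component

print(explained[:2].sum())               # variance captured by the first two PCs
```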
Quick Linear Algebra Review
N × p matrix: N rows, p columns
Matrix multiplication:
– An N × k matrix multiplied by a k × p matrix gives an N × p matrix
Diagonal matrix
Identity matrix I_p
Orthogonal matrix:
– rows = columns, UᵀU = I_p
Orthonormal (column-orthonormal) matrix:
– rows ≥ columns, UᵀU = I_p
Quick Linear Algebra Review: worked examples (orthogonal matrix, transformation, multiplication, identity matrix)
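A tiny NumPy check of the facts above (an assumed illustration; the slide's own worked matrices are not in the transcript):

```python
# Illustrative sketch: checking the linear-algebra facts above.
import numpy as np

A = np.random.default_rng(1).normal(size=(5, 3))   # an N x k matrix (N=5, k=3)
B = np.random.default_rng(2).normal(size=(3, 4))   # a  k x p matrix (k=3, p=4)
print((A @ B).shape)                               # (5, 4): N x k times k x p gives N x p

Q, _ = np.linalg.qr(A)                             # Q is 5 x 3 with orthonormal columns
print(np.allclose(Q.T @ Q, np.eye(3)))             # True: Q^T Q = I_p for orthonormal columns
```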
Some basic mathematics
Goal of PCA step 1: find a direction that has the "maximum variation"
– Let the direction be a (column) unit vector c (p-dimensional), subject to ||c|| = 1
– The projection of x onto c is cᵀx
– So we want to maximize Var(cᵀx) = cᵀ Σ̂ c, where Σ̂ is the sample covariance matrix
– Algebra: this is an eigenvalue problem (see below)
– So the maximum is achieved at the largest eigenvalue of the sample covariance matrix
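The argument in display form (reconstructed from the standard derivation; the slide's formula images are not in the transcript):

```latex
\max_{\|c\|=1} \operatorname{Var}(c^{\top}x)
  = \max_{\|c\|=1} c^{\top}\hat{\Sigma}\,c ,
\qquad
\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^{\top}
```

Setting the gradient of the Lagrangian cᵀΣ̂c − λ(cᵀc − 1) to zero gives Σ̂c = λc, so the optimal c is the leading eigenvector of Σ̂ and the maximum variance equals its largest eigenvalue λ1.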
A simple illustration
PCA and SVD
PCA: eigen decomposition of XᵀX; SVD: X = U D Vᵀ
U is N × p, the relative projection of the points
D is the p × p scaling factor
– Diagonal matrix, d1 ≥ d2 ≥ d3 ≥ … ≥ dp ≥ 0
u_i1 d1 is the distance along v1 from the origin (the first principal component)
– Expression value projected onto v1
– v2 is the 2nd projection direction, u_i2 d2 is the 2nd principal component, and so on
Variance captured by the first m principal components: (d1² + … + dm²) / (d1² + … + dp²)
PCA
(N × p original data) × (p × p projection directions) = (N × p projected values, scaled by D)
X11·V11 + X12·V21 + X13·V31 + … = X'11 = U11·D11
X21·V11 + X22·V21 + X23·V31 + … = X'21 = U21·D11   (1st principal component)
X11·V12 + X12·V22 + X13·V32 + … = X'12 = U12·D22
X21·V12 + X22·V22 + X23·V32 + … = X'22 = U22·D22   (2nd principal component)
PCA: illustration of the projection directions v1 and v2 on scatter plots (figure)
PCA on Genes Example
Cell cycle genes, 13 time points, reduced to 2D
Genes: 1: G1; 4: S; 2: G2; 3: M
PCA Example
Variance in the data explained by the first n principal components
PCA Example
The coefficients of the first 8 principal directions (v1, v2, v3, v4, …)
This is an example of PCA to reduce samples
Can do PCA to reduce the genes as well
– Using the first 2-3 PCs to plot samples, and giving more weight to the more differentially expressed genes, one can often see the sample classification
Microarray Classification?
Supervised Learning
Abounds in statistical methods, e.g. regression
A special case is the classification problem: Y_i is the class label, and the X_i's are the covariates
We learn the relationship between the X_i's and Y from training data, and predict Y for future data where only the X_i's are known
Supervised Learning Example
Use gene expression data to distinguish and predict
– Tumor vs. normal samples, or sub-classes of tumor samples
– Long-survival patients vs. short-survival patients
– Metastatic vs. non-metastatic tumors
Clustering for Classification
Which known samples does the unknown sample cluster with?
No guarantee that the unknown sample will cluster with a known class
Try different clustering methods (semi-supervised)
– E.g. change the linkage, use a subset of genes
K Nearest Neighbor
For an observation X with unknown label, find the K observations in the training data closest to X (e.g. by correlation)
Predict the label of X by majority vote among the K nearest neighbors
K can be determined by the predictability of known samples (semi-supervised again!)
Offers little insight into mechanism
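A minimal sketch of the KNN rule just described (assumed toy data and names; correlation distance is one of the choices mentioned above):

```python
# Illustrative KNN sketch (assumed toy example, not the lecture's code).
# Distance here is 1 - Pearson correlation.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # correlation of x_new with every training sample (rows of X_train)
    corr = np.array([np.corrcoef(x, x_new)[0, 1] for x in X_train])
    dist = 1.0 - corr
    nearest = np.argsort(dist)[:k]                 # indices of the k closest samples
    votes = Counter(y_train[i] for i in nearest)   # majority vote among neighbors
    return votes.most_common(1)[0][0]

# toy usage: 6 samples x 4 genes, two classes
X_train = np.array([[1, 2, 3, 4], [2, 3, 4, 5], [1, 1, 2, 2],
                    [4, 3, 2, 1], [5, 4, 3, 2], [3, 3, 2, 1]], dtype=float)
y_train = np.array(["normal", "normal", "normal", "tumor", "tumor", "tumor"])
print(knn_predict(X_train, y_train, np.array([4.5, 3.5, 2.0, 1.0]), k=3))  # likely "tumor"
```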
Extensions of the Nearest Neighbor Rule
Class prior weights. Votes may be weighted according to neighbor class.
Distance weights. Assign weights to the observed neighbors ("evidences") that are inversely proportional to their distance from the test sample.
Differential misclassification costs. Votes may be adjusted based on the class to be called.
Other Well-known Classification Methods
Linear Discriminant Analysis (LDA)
Logistic Regression
Classification and Regression Trees (CART)
Neural Networks (NN)
Support Vector Machines (SVM)
The following presentations of linear methods for classification, LDA and Logistic Regression, are mainly based on Hastie, Tibshirani and Friedman (2001), The Elements of Statistical Learning
A general framework: the Bayes Classifier
Consider a two-class problem (can be any number of classes)
Training data: (X_i, Y_i) -- Y_i is the class label, and the X_i's are the covariates
Learn the conditional distributions
– P(X|Y=1) and P(X|Y=0)
Learn (or impose) the prior weight on Y
Use the Bayes rule: P(Y=1|X) = P(X|Y=1) P(Y=1) / [P(X|Y=1) P(Y=1) + P(X|Y=0) P(Y=0)], and assign the class with the larger posterior
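A minimal sketch of this framework with Gaussian class-conditional densities (an assumed illustration; SciPy is assumed available):

```python
# Illustrative sketch: a two-class Bayes classifier with Gaussian
# class-conditional densities estimated from training data.
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_bayes(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = dict(prior=len(Xc) / len(X),      # P(Y=c)
                         mean=Xc.mean(axis=0),        # class-conditional mean
                         cov=np.cov(Xc, rowvar=False))
    return params

def predict(params, x):
    # posterior is proportional to prior * likelihood; pick the class with the larger product
    post = {c: p["prior"] * multivariate_normal.pdf(x, p["mean"], p["cov"])
            for c, p in params.items()}
    return max(post, key=post.get)
```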
Supervised Learning Performance Assessment
If the error rate is estimated from the whole learning data set, it could overfit the data (do well now, but poorly on future observations)
Divide observations into L1 and L2
– Build the classifier using L1
– Compute the classifier error rate using L2
– Requirement: L1 and L2 are iid (independent & identically distributed)
N-fold cross validation
– Divide the data into N subsets (equal size), build the classifier on N-1 subsets, compute the error rate on the left-out subset
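A sketch of N-fold cross-validation as described above (assumed helper names; any classifier exposed as fit/predict functions will do):

```python
# Illustrative sketch: N-fold cross-validation error for a classifier
# given as a pair of functions fit(X, y) -> model and predict(model, X) -> labels.
import numpy as np

def cross_val_error(fit, predict, X, y, n_folds=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for i in range(n_folds):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(X[train], y[train])                     # build classifier on N-1 subsets
        err = np.mean(predict(model, X[test]) != y[test])   # error on the left-out subset
        errors.append(err)
    return float(np.mean(errors))
```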
Fisher's Linear Discriminant Analysis
First collect differentially expressed genes
Find the linear projection that maximizes class separability (the ratio of between-group to within-group sum of squares)
Can be used for dimension reduction as well
LDA
1D: find the two group means, cut at some point (the middle, say)
2D: connect the two group means with a line, then cut with a line parallel to the "main direction" of the data passing through some point between them; i.e., project onto w ∝ Σ⁻¹(μ1 − μ2)
Limitations:
– Does not consider non-linear relationships
– Assumes the class means capture most of the information
Weighted voting: a variation of LDA
– Informative genes are given different weights based on how informative each is at classifying samples (e.g. t-statistic)
In practice: estimating Gaussian Distributions
Prior probabilities: π_k = n_k / n (the class proportions)
Class center: μ_k = (1/n_k) Σ_{y_i = k} x_i
Covariance matrix (pooled within-class): W = Σ_k Σ_{y_i = k} (x_i − μ_k)(x_i − μ_k)ᵀ / (n − K)
Decision boundary for (y, x): find the k that maximizes δ_k(x) = xᵀ W⁻¹ μ_k − ½ μ_kᵀ W⁻¹ μ_k + log π_k
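A sketch of these estimates in code (an assumed illustration following the formulas above, with a pooled covariance):

```python
# Illustrative LDA sketch: class priors, class means, pooled covariance,
# then the linear discriminant score delta_k(x).
import numpy as np

def lda_fit(X, y):
    classes = np.unique(y)
    n, p = X.shape
    priors, means = {}, {}
    pooled = np.zeros((p, p))
    for k in classes:
        Xk = X[y == k]
        priors[k] = len(Xk) / n
        means[k] = Xk.mean(axis=0)
        pooled += (Xk - means[k]).T @ (Xk - means[k])
    pooled /= (n - len(classes))
    return classes, priors, means, np.linalg.inv(pooled)

def lda_predict(model, x):
    classes, priors, means, Winv = model
    # delta_k(x) = x' Winv mu_k - 0.5 mu_k' Winv mu_k + log(prior_k)
    scores = {k: x @ Winv @ means[k] - 0.5 * means[k] @ Winv @ means[k] + np.log(priors[k])
              for k in classes}
    return max(scores, key=scores.get)
```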
Logistic Regression
Data: (y_i, x_i), i = 1, …, n (binary responses)
Model: P(y_i = 1 | x_i) = exp(β0 + βᵀx_i) / (1 + exp(β0 + βᵀx_i))
In practice, one estimates the β's using the training data (can use R)
The decision boundary is determined by the linear predictor, i.e., classify y_i = 1 if β0 + βᵀx_i > 0 (equivalently, P(y_i = 1 | x_i) > 0.5)
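The slide suggests R for fitting; a Python equivalent via statsmodels is sketched below (the toy data and coefficients are assumptions):

```python
# Illustrative sketch: fitting the logistic model above by maximum likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # toy covariates
p = 1 / (1 + np.exp(-(0.5 + X @ np.array([1.0, -2.0, 0.0]))))
y = rng.binomial(1, p)                              # toy binary responses

fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)   # estimates of beta_0, ..., beta_3
print(fit.params)
pred = (fit.predict(sm.add_constant(X)) > 0.5).astype(int)   # classify y=1 if P > 0.5
print((pred == y).mean())                                    # training accuracy
```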
Diabetes Data Set
Connections with LDA
Remarks
Simple methods such as nearest neighbor classification are competitive with more complex approaches, such as aggregated classification trees or support vector machines (Dudoit and Fridlyand, 2003)
Screening genes down to G = 10 to 100 is advisable
Models may include other predictor variables (such as age and sex)
Outcomes may be continuous (e.g., blood pressure, cholesterol level, etc.)
Classification And Regression Tree
Split the data using a set of binary (or multi-valued) decisions
The root node (all data) has a certain impurity; we split the data to reduce the impurity
CART
Measures of impurity
– Entropy
– Gini index impurity
Example with Gini: multiply the impurity by the number of samples in the node (worked out in the sketch below)
– Root node (e.g. 8 normal & 14 cancer)
– Try a split by gene x_i (x_i ≥ 0: 13 cancer; x_i < 0: 1 cancer & 8 normal)
– Split at the gene with the biggest reduction in impurity
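The Gini arithmetic for that example (a sketch; the counts come from the slide, the size-weighting follows the rule stated above):

```python
# Illustrative sketch of the Gini bookkeeping: impurity of each node
# multiplied by the number of samples in the node.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

root = [8, 14]                      # 8 normal, 14 cancer
left, right = [0, 13], [8, 1]       # x_i >= 0: 13 cancer; x_i < 0: 8 normal & 1 cancer

before = gini(root) * sum(root)                            # ~10.18
after = gini(left) * sum(left) + gini(right) * sum(right)  # 0 + ~1.78
print(before - after)               # impurity reduction for this candidate split (~8.4)
```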
CART
Assuming independence of partitions, the same level may split on different genes
Stop splitting
– When the impurity is small enough
– When the number of samples in a node is small
Pruning to reduce overfitting
– Use the training set to split, a test set for pruning
– Each split has a cost, compared against the gain at that split
Boosting
Boosting is a method of improving the effectiveness of predictors
Boosting relies on the existence of weak learners
A weak learner is a "rough and moderately inaccurate" predictor, but one that can predict better than chance
Boosting shows the strength of weak learnability
The Rules for Boosting
Set all weights of training examples equal
Train a weak learner on the weighted examples
See how well the weak learner performs on the data and give it a weight based on how well it did
Re-weight the training examples and repeat
When done, predict by weighted voting
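A minimal AdaBoost-style sketch of those rules (the slides do not name a specific boosting algorithm; a depth-1 tree from scikit-learn stands in for the weak learner, and the names are assumptions):

```python
# Illustrative AdaBoost-style sketch of the boosting rules above.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=10):           # y is a NumPy array with entries -1/+1
    n = len(y)
    w = np.full(n, 1.0 / n)                # 1. equal weights on training examples
    learners, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)  # 2. weighted fit
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)             # 3. how well did it do?
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # learner weight
        w *= np.exp(-alpha * y * pred)                     # 4. re-weight examples
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    # 5. predict by weighted voting
    return lambda Xnew: np.sign(sum(a * l.predict(Xnew) for a, l in zip(alphas, learners)))
```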
Artificial Neural Network
ANN: models neurons (feedforward NN)
Perceptron: the simplest ANN
– x_i input (e.g. expression values of different genes)
– w_i weight (e.g. how much each gene contributes, + or −)
– y output (e.g. y = 1 if Σ w_i x_i exceeds a threshold, else 0; see the sketch below)
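A sketch of a perceptron with the classic weight-update rule (an assumed illustration, not the lecture's code):

```python
# Illustrative perceptron sketch: weighted sum of inputs, thresholded output,
# weights nudged toward the correct label after each mistake.
import numpy as np

def perceptron_train(X, y, epochs=20, lr=0.1):     # y in {0, 1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            out = 1 if xi @ w + b > 0 else 0       # y = 1 if sum(w_i * x_i) exceeds threshold
            w += lr * (yi - out) * xi              # adjust weights on errors
            b += lr * (yi - out)
    return w, b
```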
ANN: Multi-Layered Perceptron
A 3-layer ANN can approximate any continuous nonlinear function
– Picking the number of layers and nodes per layer is not easy
Weight training:
– Back propagation
– Minimize the error between observed and predicted outputs
– Black box
Support Vector Machine (SVM)
Which hyperplane is the best?
Support Vector Machine
SVM finds the hyperplane that maximizes the margin
The margin is determined by the support vectors (samples lying on the class edge); the other samples are irrelevant
Support Vector Machine
SVM finds the hyperplane that maximizes the margin
The margin is determined by the support vectors; other samples are irrelevant
Extensions:
– Soft margin: support vectors get different weights
– Non-separable case: slack variables > 0; maximize the margin minus a penalty for the misclassified points
Nonlinear SVM
Project the data into a higher-dimensional space with a kernel function, so the classes can be separated by a hyperplane
A few kernel functions are implemented in Matlab & BioConductor; the choice is usually trial and error and personal experience
Example polynomial kernel: K(x, y) = (x·y)²
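The slide points to Matlab & BioConductor implementations; the sketch below is an assumed scikit-learn equivalent showing a soft-margin linear SVM next to the degree-2 polynomial kernel K(x, y) = (x·y)²:

```python
# Illustrative sketch: linear vs. polynomial-kernel SVM on a toy problem
# that no straight line can separate.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)    # circular class boundary

linear = SVC(kernel="linear", C=1.0).fit(X, y)       # soft margin controlled by C
poly2 = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0).fit(X, y)  # K(x,y) = (x.y)^2

print(linear.score(X, y), poly2.score(X, y))         # the degree-2 kernel should fit far better
print(poly2.support_vectors_.shape)                  # samples on the margin define the boundary
```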
Most Widely Used Sequence IDs
GenBank: all submitted sequences
EST: Expressed Sequence Tags (mRNA); some redundancy, might have contaminations
UniGene: computationally derived gene-based transcribed sequence clusters
Entrez Gene: comprehensive catalog of genes and associated information, roughly the traditional concept of "gene"
RefSeq: reference sequences for mRNAs and proteins, one per individual transcript (splice variant)
UCSC Genome Browser Can display custom tracks
Entrez: Main NCBI Search Engine
Public Microarray Databases
SMD: Stanford Microarray Database, most Stanford and collaborators' cDNA arrays
GEO: Gene Expression Omnibus, an NCBI repository for gene expression and hybridization data, growing quickly
Oncomine: Cancer Microarray Database
– Published cancer-related microarrays
– Raw data all processed, nice interface
Outline
Gene ontology
– Check differential expression and clustering, GSEA
Microarray clustering:
– Unsupervised: clustering, KNN, PCA
– Supervised learning for classification: CART, SVM
Expression and genome resources
Acknowledgment Kevin Coombes & Keith Baggerly Darlene Goldstein Mark Craven George Gerber Gabriel Eichler Ying Xie Terry Speed & Group Larry Hunter Wing Wong & Cheng Li Ping Ma, Xin Lu, Pengyu Hong Mark Reimers Marco Ramoni Jenia Semyonov