Supervised Learning and Classification Xiaole Shirley Liu and Jun Liu

Outline Dimension reduction –Principal Component Analysis (PCA) –Other approaches such as MDS, SOM, etc. Unsupervised learning for classification –Clustering and KNN Supervised learning for classification –CART, SVM Expression and genome resources

Dimension Reduction High-dimensional data points are difficult to visualize Always good to plot data in 2D –Easier to detect or confirm relationships among data points –Catch stupid mistakes (e.g. in clustering) Two ways to reduce: –To plot genes (each gene a point in experiment space): some experiments are similar or carry little information –To plot experiments/samples (each sample a point in gene space): some genes are similar or carry little information

Principal Component Analysis Optimal linear transformation that chooses a new coordinate system for the data: the data are projected onto new axes (the principal components), ordered so that each successive axis captures the maximum remaining variance Components are orthogonal (mutually uncorrelated) A few PCs may capture most of the variation in the original data E.g. reduce 2D data to 1D
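As a minimal sketch (software choice and data are assumptions, not from the slides), scikit-learn's PCA illustrates the 2D-to-1D reduction mentioned above:

```python
# Minimal PCA sketch: reduce made-up 2D data to 1D (illustration only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Correlated 2D points: most of the variance lies along one direction
x = rng.normal(size=200)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=200)])

pca = PCA(n_components=1)
projected = pca.fit_transform(data)      # 1D coordinates along the first PC
print(pca.explained_variance_ratio_)     # fraction of variance captured by PC1
```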

Principal Component Analysis (PCA)

Example: human SNP marker data

PCA for 800 randomly selected SNPs

PCA for 400 randomly selected SNPs

PCA for 200 randomly selected SNPs

PCA for 100 randomly selected SNPs

PCA for 50 randomly selected SNPs

Interpretations and Insights PCA can discover aggregated subtle effects/differences in high-dimensional data PCA finds linear directions to project the data and is purely unsupervised, so it is indifferent to the “importance” of particular directions; it simply seeks the most “variable” directions. There are generalizations such as supervised PCA.

Principal Component Analysis Achieved by singular value decomposition (SVD): X = UDV^T X is the original N × p data –E.g. N genes, p experiments V is the p × p matrix of projection directions –Orthogonal matrix: V^T V = I_p –v_1 is the direction of the first projection –Each direction is a linear combination (relative importance) of the experiments (or of the genes, if PCA is done on samples)
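As a minimal sketch (not from the slides), the SVD-based computation above can be reproduced with NumPy; the data matrix here is made up for illustration:

```python
# Sketch of PCA via SVD on a centered data matrix (made-up data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))            # N = 100 "genes", p = 5 "experiments"
Xc = X - X.mean(axis=0)                  # center each column before the SVD

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * d                           # principal components: U D = Xc V
var_explained = d**2 / np.sum(d**2)      # fraction of variance per component
print(var_explained)
```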

Quick Linear Algebra Review N × p matrix: N rows, p columns Matrix multiplication: –an N × k matrix multiplied by a k × p matrix gives an N × p matrix Diagonal matrix Identity matrix I_p Orthogonal matrix: –square (r = c), U^T U = I Matrix with orthonormal columns: –r ≥ c, U^T U = I_c

Quick Linear Algebra Review Example: orthogonal matrix, transformation, multiplication, identity matrix

Some basic mathematics Goal of PCA step 1: find a direction with the “maximum variation” –Let the direction be a (column) unit vector c (p-dim) –The projection of x onto c is c^T x –So we want to maximize Var(c^T x) = c^T S c, subject to ||c|| = 1, where S is the sample covariance matrix S = (1/n) Σ_i (x_i − x̄)(x_i − x̄)^T –Algebra (Lagrange multipliers): the maximum is the largest eigenvalue of S, achieved when c is the corresponding eigenvector
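A small numerical check of this result (not from the slides; made-up data): the leading eigenvector of the sample covariance matrix gives the maximum-variance direction.

```python
# The leading eigenvector of the sample covariance matrix maximizes c^T S c.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(500, 3))
x[:, 1] += 2 * x[:, 0]                   # inject a dominant direction of variation

S = np.cov(x, rowvar=False)              # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)     # ascending order for symmetric matrices
c = eigvecs[:, -1]                       # unit vector maximizing c^T S c
print(eigvals[-1], np.var(x @ c, ddof=1))   # both equal the largest eigenvalue
```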

A simple illustration

PCA and SVD PCA: eigen-decomposition of X^T X; SVD: X = UDV^T U is N × p, the relative projections of the points D is the p × p scaling factor –Diagonal matrix, d_1 ≥ d_2 ≥ d_3 ≥ … ≥ d_p ≥ 0 u_i1 d_1 is the distance along v_1 from the origin (first principal component) –Expression value projected on v_1 –v_2 is the 2nd projection direction, u_i2 d_2 is the 2nd principal component, and so on Variance captured by the first m principal components: (d_1² + … + d_m²) / (d_1² + … + d_p²)

PCA as matrix multiplication: (N × p original data) × (p × p projection directions) = (N × p projected values, scaled) 1st principal component: X_11 V_11 + X_12 V_21 + X_13 V_31 + … = X'_11 = U_11 D_11; X_21 V_11 + X_22 V_21 + X_23 V_31 + … = X'_21 = U_21 D_11 2nd principal component: X_11 V_12 + X_12 V_22 + X_13 V_32 + … = X'_12 = U_12 D_22; X_21 V_12 + X_22 V_22 + X_23 V_32 + … = X'_22 = U_22 D_22

PCA (figure: data plotted against the first two projection directions v_1 and v_2)

PCA on Genes Example Cell cycle genes, 13 time points, reduced to 2D Genes: 1: G1; 4: S; 2: G2; 3: M

PCA Example Variance in the data explained by the first n principal components

PCA Example The coefficients of the first 8 principal directions This is an example of PCA to reduce the sample dimension Can do PCA to reduce the genes as well –Use the first 2-3 PCs to plot samples, which gives more weight to the more differentially expressed genes; one can often see the sample classes separate

Microarray Classification ?

Supervised Learning Statistical methods abound, e.g. regression A special case is the classification problem: Y_i is the class label, and the X_i's are the covariates We learn the relationship between the X_i's and Y from training data, and predict Y for future data where only the X_i's are known

Supervised Learning Example Use gene expression data to distinguish and predict –Tumor vs. normal samples, or sub-classes of tumor samples –Long-survival vs. short-survival patients –Metastatic vs. non-metastatic tumors

Clustering for Classification Which known samples does the unknown sample cluster with? No guarantee that the unknown sample will cluster cleanly with one known class Try different clustering methods (semi-supervised) –E.g. change the linkage, use a subset of genes

K Nearest Neighbor For an observation X with unknown label, find the K observations in the training data closest to X (e.g. by correlation) Predict the label of X by majority vote among the K nearest neighbors K can be chosen by how well it predicts the known samples; semi-supervised again! Offers little insight into mechanism
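A minimal KNN sketch with scikit-learn (a software choice the slides do not prescribe), using a made-up expression matrix and the default Euclidean distance rather than correlation:

```python
# Minimal KNN sketch (made-up expression matrix, for illustration).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 100))     # 40 samples x 100 genes
y_train = np.array([0] * 20 + [1] * 20)  # e.g. 0 = normal, 1 = tumor
X_train[y_train == 1, :10] += 1.5        # shift 10 genes in the "tumor" class

knn = KNeighborsClassifier(n_neighbors=5)   # majority vote among 5 nearest neighbors
knn.fit(X_train, y_train)

X_new = rng.normal(size=(1, 100))
print(knn.predict(X_new))                # predicted class label for the new sample
```

Passing weights="distance" to the constructor gives the distance-weighted variant described on the next slide.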

Extensions of the Nearest Neighbor Rule Class prior weights: votes may be weighted according to the neighbor's class Distance weights: assign weights to the observed neighbors (“evidence”) inversely proportional to their distance from the test sample Differential misclassification costs: votes may be adjusted based on the class to be called

Other Well-known Classification Methods Linear Discriminant Analysis (LDA) Logistic Regression Classification and Regression Trees Neural Networks (NN) Support Vector Machines (SVM) The following presentations of linear methods for classification, LDA and logistic regression are mainly based on Hastie, Tibshirani and Friedman (2001) The Elements of Statistical Learning

A general framework: the Bayes Classifier Consider a two-class problem (can be any number of classes) Training data: (X_i, Y_i) -- Y_i is the class label, and the X_i's are the covariates Learn the conditional distributions –P(X|Y=1) and P(X|Y=0) Learn (or impose) the prior weight on Y Use the Bayes rule: P(Y=1|X) = P(X|Y=1) P(Y=1) / [P(X|Y=1) P(Y=1) + P(X|Y=0) P(Y=0)]; classify as 1 if this posterior exceeds 1/2
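A minimal sketch of the Bayes classifier, assuming Gaussian class-conditional densities (an assumption not made on the slide) and made-up data:

```python
# Bayes classifier sketch with Gaussian class-conditional densities (illustration only).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X0 = rng.normal(loc=0.0, size=(50, 2))   # class 0 training data
X1 = rng.normal(loc=1.5, size=(50, 2))   # class 1 training data

prior1 = 0.5                              # prior weight on Y = 1
d0 = multivariate_normal(X0.mean(0), np.cov(X0, rowvar=False))  # learn P(X|Y=0)
d1 = multivariate_normal(X1.mean(0), np.cov(X1, rowvar=False))  # learn P(X|Y=1)

x_new = np.array([1.0, 1.0])
post1 = d1.pdf(x_new) * prior1 / (d1.pdf(x_new) * prior1 + d0.pdf(x_new) * (1 - prior1))
print("P(Y=1 | x) =", post1, "-> predict", int(post1 > 0.5))
```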

Supervised Learning Performance Assessment If the error rate is estimated from the whole learning data set, we could overfit the data (do well now, but poorly on future observations) Divide observations into L1 and L2 –Build the classifier using L1 –Compute the classifier error rate using L2 –Requirement: L1 and L2 are iid (independent & identically distributed) N-fold cross validation –Divide the data into N subsets (equal size), build the classifier on (N-1) subsets, compute the error rate on the left-out subset
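A minimal N-fold cross-validation sketch with scikit-learn (software choice and data are assumptions, not from the slides):

```python
# 5-fold cross-validation sketch (made-up data, illustration only).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50))
y = np.array([0] * 30 + [1] * 30)
X[y == 1, :5] += 1.0                     # make the classes partially separable

# Train on 4 folds, estimate the error rate on the held-out fold, repeat 5 times
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print("estimated error rate:", 1 - scores.mean())
```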

Fisher’s Linear Discriminant Analysis First collect differentially expressed genes Find the linear projection that maximizes class separability (ratio of between-group to within-group sum of squares) Can be used for dimension reduction as well

LDA 1D: find the two group means, cut at some point (the middle, say) 2D: connect the two group means with a line, use a cut parallel to the “main direction” of the data that passes through somewhere in between; i.e., project onto the discriminant direction w ∝ Σ^(-1)(μ_1 − μ_2) Limitations: –Does not consider non-linear relationships –Assumes class means capture most of the information Weighted voting: a variation of LDA –Informative genes are given different weights based on how informative they are at classifying samples (e.g. t-statistic)

In practice: estimating Gaussian Distributions Prior probabilities: π̂_k = N_k / N Class centers: μ̂_k = Σ_{y_i = k} x_i / N_k Covariance matrix (pooled): Σ̂ = Σ_k Σ_{y_i = k} (x_i − μ̂_k)(x_i − μ̂_k)^T / (N − K) Decision boundary for (y, x): find the k that maximizes δ_k(x) = x^T Σ̂^(-1) μ̂_k − ½ μ̂_k^T Σ̂^(-1) μ̂_k + log π̂_k
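A minimal LDA sketch with scikit-learn, which estimates the priors, class centers and pooled covariance above internally (made-up data; the software choice is an assumption):

```python
# LDA sketch for two classes (made-up data, illustration only).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 20)),
               rng.normal(0.8, 1.0, size=(50, 20))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()       # estimates priors, class means, pooled covariance
lda.fit(X, y)
print(lda.predict(X[:3]))                # predicted labels
print(lda.transform(X).shape)            # 1-D Fisher projection for a 2-class problem
```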

Logistic Regression Data: (y_i, x_i), i = 1, …, n (binary responses) Model: P(y_i = 1 | x_i) = exp(β_0 + β^T x_i) / (1 + exp(β_0 + β^T x_i)) In practice, one estimates the β's using the training data (can use R) The decision boundary is determined by the linear predictor, i.e., classify y_i = 1 if β_0 + β^T x_i > 0
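The slide suggests R for fitting; as an alternative sketch, scikit-learn's LogisticRegression estimates β_0 and β (with L2 regularization by default) on made-up data:

```python
# Logistic regression sketch for binary classification (made-up data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (0.5 + 1.5 * X[:, 0] - X[:, 1] + rng.normal(size=100) > 0).astype(int)

model = LogisticRegression()             # maximum likelihood fit (L2-regularized by default)
model.fit(X, y)
print(model.intercept_, model.coef_)     # estimated beta_0 and beta
print(model.predict_proba(X[:2])[:, 1])  # P(y = 1 | x); classify as 1 if > 0.5
```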

Diabetes Data Set

Connections with LDA

Remarks Simple methods, such as nearest neighbor classification, are competitive with more complex approaches, such as aggregated classification trees or support vector machines (Dudoit and Fridlyand, 2003) Screening genes down to G = 10 to 100 is advisable Models may include other predictor variables (such as age and sex) Outcomes may be continuous (e.g., blood pressure, cholesterol level, etc.)

Classification And Regression Tree Split the data using a set of binary (or multi-valued) decisions The root node (all data) has a certain impurity; split the data to reduce the impurity

CART Measures of impurity –Entropy –Gini index Example with Gini: multiply the impurity by the number of samples in the node –Root node (e.g. 8 normal & 14 cancer) –Try a split by gene x_i (x_i ≥ 0: 13 cancer; x_i < 0: 1 cancer & 8 normal) –Split at the gene with the biggest reduction in impurity
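To make the example concrete, a short sketch (not from the slides) that computes the impurity costs from the counts given above; the numeric results are simple arithmetic from those counts:

```python
# Gini impurity example using the counts on the slide above
# (8 normal & 14 cancer at the root; split: x_i >= 0 -> 13 cancer, x_i < 0 -> 1 cancer & 8 normal).

def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

root = [8, 14]                           # normal, cancer
left, right = [8, 1], [0, 13]            # x_i < 0 and x_i >= 0 after the split

root_cost = gini(root) * sum(root)       # impurity weighted by node size: ~10.2
split_cost = gini(left) * sum(left) + gini(right) * sum(right)   # ~1.8
print(root_cost, split_cost, root_cost - split_cost)             # reduction ~8.4
```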

CART Assume independence of partitions; the same level may split on different genes Stop splitting –When the impurity is small enough –When the number of samples in a node is small Pruning to reduce overfitting –Use a training set to split and a test set to prune –Each split has a cost, compared against the gain from that split
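A minimal classification-tree sketch with scikit-learn (made-up data; stopping via min_samples_leaf is an illustrative choice, not from the slides):

```python
# Classification tree sketch (made-up data, illustration only).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(22, 5))
y = np.array([0] * 8 + [1] * 14)         # 8 normal, 14 cancer as in the example above
X[y == 1, 0] += 2.0                      # gene 0 carries the signal

tree = DecisionTreeClassifier(criterion="gini", min_samples_leaf=3)  # stop when nodes get small
tree.fit(X, y)
print(export_text(tree))                 # the learned binary splits
```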

Boosting Boosting is a method of improving the effectiveness of predictors. Boosting relies on the existence of weak learners. A weak learner is a “rough and moderately inaccurate” predictor, but one that can predict better than chance. Boosting shows the strength of weak learnability

The Rules for Boosting –Set all weights of the training examples equal –Train a weak learner on the weighted examples –See how well the weak learner performs on the data and give it a weight based on how well it did –Re-weight the training examples and repeat –When done, predict by weighted voting
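AdaBoost is one concrete instance of this scheme; a minimal sketch with scikit-learn on made-up data (the slides do not name a specific boosting algorithm):

```python
# AdaBoost sketch: boosted decision stumps combined by weighted voting (made-up data).
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

boost = AdaBoostClassifier(n_estimators=50)  # default weak learner is a depth-1 decision stump
boost.fit(X, y)
print(boost.score(X, y))                     # training accuracy of the boosted ensemble
```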

Artificial Neural Network ANN: models neurons (feedforward NN) Perceptron: the simplest ANN –x_i: inputs (e.g. expression values of different genes) –w_i: weights (e.g. how much each gene contributes, +/–) –y: output (e.g. y = 1 if Σ_i w_i x_i exceeds a threshold, 0 otherwise)

ANN Multi-Layered Perceptron A 3-layer ANN can approximate any continuous nonlinear function –Picking the number of layers and nodes per layer is not easy Weight training: –Back propagation –Minimize the error between observed and predicted outputs –Black box
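A minimal multilayer-perceptron sketch with scikit-learn, trained by backpropagation, on made-up data (software choice and architecture are assumptions):

```python
# Small MLP sketch (made-up nonlinear data, illustration only).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = ((X[:, 0] ** 2 + X[:, 1]) > 1).astype(int)   # a nonlinear decision rule

# One hidden layer of 10 units; weights trained by backpropagation (gradient descent)
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.score(X, y))                   # training accuracy
```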

Support Vector Machine SVM –Which hyperplane is the best?

Support Vector Machine SVM finds the hyperplane that maximizes the margin The margin is determined by the support vectors (samples that lie on the class edge); the others are irrelevant

Support Vector Machine SVM finds the hyperplane that maximizes the margin The margin is determined by the support vectors; the others are irrelevant Extensions: –Soft edge: support vectors get different weights –Non-separable case: slack variables ξ_i > 0; maximize (margin − λ × # bad)

Nonlinear SVM Project the data into a higher-dimensional space with a kernel function, so the classes can be separated by a hyperplane A few kernel functions are implemented in Matlab & BioConductor; the choice is usually trial and error and personal experience E.g. K(x, y) = (x · y)²
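A minimal sketch comparing a linear (soft-margin) SVM with a kernel SVM, using scikit-learn rather than the Matlab/BioConductor implementations mentioned above (data made up for illustration):

```python
# Linear vs. kernel SVM sketch (made-up data, illustration only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # classes not linearly separable

linear = SVC(kernel="linear", C=1.0).fit(X, y)  # soft margin: C trades margin vs. violations
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)        # kernel trick: separate in a higher-dimensional space
print(linear.score(X, y), rbf.score(X, y))
print(rbf.support_vectors_.shape)               # the support vectors determine the boundary
```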

Most Widely Used Sequence IDs GenBank: all submitted sequences EST: Expressed Sequence Tags (mRNA), some redundancy, might have contamination UniGene: computationally derived gene-based transcribed sequence clusters Entrez Gene: comprehensive catalog of genes and associated information, ~ the traditional concept of a “gene” RefSeq: reference sequences for mRNAs and proteins, at the level of individual transcripts (splice variants)

UCSC Genome Browser Can display custom tracks

Entrez: Main NCBI Search Engine

Public Microarray Databases SMD: Stanford Microarray Database, most Stanford and collaborators’ cDNA arrays GEO: Gene Expression Omnibus, an NCBI repository for gene expression and hybridization data, growing quickly Oncomine: Cancer Microarray Database –Published cancer-related microarrays –Raw data all processed, nice interface

Outline Gene ontology –Check differential expression and clustering, GSEA Microarray clustering and classification –Unsupervised: clustering, KNN, PCA –Supervised learning for classification: CART, SVM Expression and genome resources

Acknowledgment Kevin Coombes & Keith Baggerly Darlene Goldstein Mark Craven George Gerber Gabriel Eichler Ying Xie Terry Speed & Group Larry Hunter Wing Wong & Cheng Li Ping Ma, Xin Lu, Pengyu Hong Mark Reimers Marco Ramoni Jenia Semyonov