Gene Set Enrichment Analysis

Presentation transcript:

Gene Set Enrichment Analysis Xiaole Shirley Liu STAT115/STAT215

Gene Set Enrichment Analysis In some microarray experiments comparing two conditions, no single gene may be significantly differentially expressed, but a group of genes may each be slightly differentially expressed. Check a set of genes sharing an annotation (e.g. a GO term) and examine their expression values together. Kolmogorov-Smirnov test. GSEA at the Broad Institute.

Gene Set Enrichment Analysis Mootha et al., PNAS 2003. Kolmogorov-Smirnov test on the cumulative fraction function: what fraction of genes fall below this fold change (or t-statistic)?
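As a rough illustration (not from the slides), a minimal R sketch of the KS idea on toy data: compare the fold-change distribution of an annotated gene set against the remaining genes. The variable names (all_fc, in_set) are made up for this example.
set.seed(1)
all_fc <- rnorm(10000)                    # fold changes of all genes (toy data)
in_set <- sample(length(all_fc), 50)      # indices of genes in the annotated set
ks.test(all_fc[in_set], all_fc[-in_set])  # two-sample KS test comparing the two distributions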

Gene Set Enrichment Analysis Alternative to KS: one-sample z-test. Assume the population of all genes follows a normal distribution N(μ, σ²). For the average X̄ of the m genes carrying a specific annotation, z = (X̄ − μ) / (σ / √m).
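A minimal R sketch of this one-sample z-test on toy data (variable names are invented for illustration):
set.seed(1)
all_fc <- rnorm(10000)                # fold changes of all genes (toy population)
in_set <- sample(length(all_fc), 50)  # indices of genes with the annotation
mu    <- mean(all_fc)                 # population mean over all genes
sigma <- sd(all_fc)                   # population standard deviation
xbar  <- mean(all_fc[in_set])         # average of the annotated gene set
z     <- (xbar - mu) / (sigma / sqrt(length(in_set)))
2 * pnorm(-abs(z))                    # two-sided p-value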

Gene Set Enrichment Analysis A set of genes with a specific annotation may be involved in coordinated down-regulation. The set must be defined before looking at the data. The significance can only be seen by looking at the whole set.

Expanded Gene Sets Subramanian et al., PNAS 2005

Examples of GSEA Break

Microarray Classification Xiaole Shirley Liu STAT115/STAT215

Microarray Classification

Classification Equivalent to machine learning methods. Task: assign an object to a class based on measurements on the object, e.g. is a sample normal or cancer based on its expression profile? Unsupervised learning: ignore known class labels; sometimes even the known classes cannot be separated. Supervised learning: extract useful features based on known class labels to best separate the classes; can overfit the data, so training and test sets need to be kept separate.

Clustering Classification Which known samples does the unknown sample cluster with? There is no guarantee that the known samples will cluster by class; try batch removal or different clustering methods, change the linkage, or select a subset of genes (semi-supervised).
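A minimal R sketch of this idea on toy data (the sample names and shared "signature" are invented for illustration): hierarchically cluster known and unknown samples with a correlation distance and see which branch the unknown sample joins.
set.seed(1)
sig     <- rnorm(100)                           # genes elevated in tumors (toy signature)
normal  <- matrix(rnorm(100 * 3), 100)          # 3 normal samples: pure noise
tumor   <- sig + matrix(rnorm(100 * 3), 100)    # 3 tumor samples share the signature
unknown <- sig + rnorm(100)                     # unknown sample resembles the tumors
expr <- cbind(normal, tumor, unknown)
colnames(expr) <- c(paste0("normal", 1:3), paste0("tumor", 1:3), "unknown")
hc <- hclust(as.dist(1 - cor(expr)), method = "average")  # correlation distance
plot(hc)                                        # which branch does "unknown" join?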

K Nearest Neighbor Also used in missing value estimation. For an observation X with unknown label, find the K observations in the training data closest to X (e.g. by correlation). Predict the label of X by majority vote among the K nearest neighbors. K can be chosen by the predictability of known samples (semi-supervised again!).
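A minimal R sketch using knn() from the class package on toy data (group labels and sizes are arbitrary):
library(class)
set.seed(1)
train  <- rbind(matrix(rnorm(20 * 50), 20), matrix(rnorm(20 * 50, 2), 20))  # 40 samples x 50 genes
labels <- factor(rep(c("normal", "cancer"), each = 20))
test   <- matrix(rnorm(50, 2), 1)            # one new sample
knn(train, test, cl = labels, k = 5)         # majority vote among the 5 nearest neighbors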

KNN Example KNN can be extended by assigning weights to the neighbors by inverse distance from the test sample. It offers little insight into mechanism.

MDS Multidimensional scaling is based on the distances between data points in the high-dimensional space (e.g. correlations). It gives a 2-3D representation approximating the pairwise distance relationships as closely as possible. It is a non-linear projection. A new sample can be predicted directly based on its distances. Break
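A minimal R sketch of non-metric MDS with isoMDS() from the MASS package on toy data (cmdscale() would give the classical, metric variant; all data here are simulated):
library(MASS)
set.seed(1)
expr <- matrix(rnorm(200 * 10), 200, 10)   # 200 genes x 10 samples (toy data)
d    <- as.dist(1 - cor(expr))             # pairwise sample distances from correlations
fit  <- isoMDS(d, k = 2)                   # 2-D non-metric MDS embedding
plot(fit$points, xlab = "MDS1", ylab = "MDS2")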

Principal Component Analysis A linear transformation that projects the data onto a new coordinate system (linear combinations of the original variables) to capture as much of the variation in the data as possible. The first principal component accounts for the greatest possible variance in the dataset. The second principal component accounts for the next highest variance and is uncorrelated with (orthogonal to) the first principal component.

Finding the Projections We look for a linear combination that transforms the original data matrix X into Y = aᵀX = a1 X1 + a2 X2 + … + ap Xp, where a = (a1, a2, …, ap)ᵀ is a column vector of weights with a1² + a2² + … + ap² = 1. Maximize the variance of the projection of the observations onto Y.

Finding the Projections (Figure: "Good" vs. "Better" projection directions.) The direction of a is given by the eigenvector e1 corresponding to the largest eigenvalue of the covariance matrix C. The second direction, orthogonal (uncorrelated) to the first, is the one with the second highest variance, which turns out to be the eigenvector corresponding to the second largest eigenvalue.
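A minimal R sketch on toy data checking that the first eigenvector of the covariance matrix C points in the direction of maximal variance (names are illustrative):
set.seed(1)
X  <- matrix(rnorm(100 * 5), 100, 5)   # 100 observations, 5 variables
C  <- cov(X)
e  <- eigen(C)
a1 <- e$vectors[, 1]                   # direction of maximal variance
var(X %*% a1)                          # variance of the projection ...
e$values[1]                            # ... equals the largest eigenvalue (up to rounding)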

Principal Component Analysis Achieved by singular value decomposition (SVD): X = U D Vᵀ. X is the original data; U (N × N) gives the relative projections of the points; V gives the projection directions. v1 is a unit vector, the direction of the first projection: the eigenvector with the largest eigenvalue, a linear combination (relative importance) of each gene (if PCA is on samples).

PCA D contains the scaling factors (singular values): a diagonal matrix with d1 ≥ d2 ≥ d3 ≥ … ≥ 0, and dm² measures the variance captured by the mth principal component. u1 d1 is the distance along v1 from the origin (the first principal component), i.e. the expression values projected onto v1; u1 d1 captures the largest variance of the original X. v2 is the second projection direction, orthogonal to PC1; u2 d2 is the second principal component and captures the second largest variance of X.
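A minimal R sketch on toy data relating SVD to PCA (prcomp() is used only as a cross-check; the data are simulated):
set.seed(1)
X  <- scale(matrix(rnorm(100 * 5), 100, 5), center = TRUE, scale = FALSE)  # center columns first
s  <- svd(X)                     # X = U D V^T
pc <- s$u %*% diag(s$d)          # principal components: projections u_m * d_m
s$d^2 / sum(s$d^2)               # fraction of variance captured by each component
head(prcomp(X)$x)                # same as head(pc), up to sign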

PCA for Classification Blood transcriptomes of healthy individuals (HI), individuals with cardiovascular risk factors (RF), individuals with asymptomatic left ventricular dysfunction (ALVD), and chronic heart failure patients (CHF). The new sample is predicted to be CHF.
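A minimal R sketch of this kind of prediction on toy data (the HI/CHF labels are placeholders, not the study data): fit PCA on the known samples, then project a new sample onto the same PCs.
set.seed(1)
known <- rbind(matrix(rnorm(10 * 50), 10), matrix(rnorm(10 * 50, 2), 10))  # 20 known samples x 50 genes
grp   <- factor(rep(c("HI", "CHF"), each = 10))
fit   <- prcomp(known)
plot(fit$x[, 1:2], col = grp, pch = 19)           # known samples in PC1/PC2 space
newsample <- matrix(rnorm(50, 2), 1)              # one new sample
predict(fit, newdata = newsample)[, 1:2]          # its coordinates on PC1/PC2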

Interpretation of components Look at the weights of the variables in each component. If Y1 = 0.41 X1 + 0.15 X2 − 0.38 X3 + 0.03 X4 + …, then X1 and X3 are more important than X2 and X4 in PC1, which offers some biological insight. PCA and MDS are both good dimension-reduction methods. PCA is a good clustering method and can be conducted on genes or on samples. PCA is only powerful if the biological question is related to the highest variance in the dataset.

PCA for Batch Effect Detection PCA can identify batch effects: with an obvious batch effect, the early PCs separate samples by batch. (Figure panels: un-normalized, quantile-normalized, and COMBAT-corrected data; Brezina et al., Microarray 2015.) Break
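A minimal R sketch on toy data with an artificial batch shift, showing how the early PCs expose a batch effect (two batches of 5 samples each are assumed):
set.seed(1)
expr <- matrix(rnorm(10 * 1000), 10, 1000)   # 10 samples x 1000 genes
expr[6:10, ] <- expr[6:10, ] + 1             # shift samples 6-10 to mimic a second batch
batch <- factor(rep(1:2, each = 5))
pc <- prcomp(expr)
plot(pc$x[, 1:2], col = batch, pch = 19)     # do the batches separate along PC1?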

Supervised Learning

Supervised Learning Performance Assessment If the error rate is estimated from the whole learning data set, we could overfit the data (do well now, but poorly on future observations). Cross-validation is needed to assess performance. Leave-one-out cross-validation on n data points: build the classifier on (n−1) points and test on the one left out. N-fold cross-validation: divide the data into N subsets of equal size, build the classifier on (N−1) subsets, and compute the error rate on the left-out subset.
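A minimal R sketch of N-fold cross-validation (N = 5, KNN as the classifier, simulated data; all names are illustrative):
library(class)
set.seed(1)
X <- rbind(matrix(rnorm(50 * 20), 50), matrix(rnorm(50 * 20, 1), 50))  # 100 samples x 20 genes
y <- factor(rep(c("normal", "cancer"), each = 50))
folds <- sample(rep(1:5, length.out = nrow(X)))      # random fold assignment
err <- sapply(1:5, function(f) {
  pred <- knn(X[folds != f, ], X[folds == f, ], cl = y[folds != f], k = 5)
  mean(pred != y[folds == f])                        # error rate on the left-out fold
})
mean(err)                                            # cross-validated error rate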

Logistic Regression Data: (yi, xi), i = 1, …, n, with a binary dependent variable (0/1). Model the logit: the natural log of the odds of Pr(Y=1) over Pr(Y=0), log[Pr(Y=1)/Pr(Y=0)] = b0 + b1 X. When b0 + b1 X is very large, Y = 1; when it is very small, Y = 0; but the change in probability is not linear in X. b0 is the intercept, b1 is the regression slope, and b0 + b1 X = 0 is the decision boundary where Pr(Y=1) = 0.5.

Example (wiki) Effect of hours studied on the probability of passing an exam: significant association, P-value 0.0167. Pr(pass) = 1/[1 + exp(−b0 − b1 X)] = 1/[1 + exp(4.0777 − 1.5046 × Hours)]. At 4.0777/1.5046 = 2.71 hours of study, Pr(pass) = 0.5.
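A minimal R sketch plotting the fitted curve from the quoted coefficients (the coefficients come from the slide; the grid of hours is arbitrary):
hours  <- seq(0, 6, by = 0.5)
p_pass <- plogis(-4.0777 + 1.5046 * hours)   # 1 / (1 + exp(-(b0 + b1*x)))
plot(hours, p_pass, type = "b", ylab = "Pr(pass)")
plogis(-4.0777 + 1.5046 * 2.71)              # ~0.5 at the decision boundary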

Logistic Regression Sample classification: Y = 1 for cancer, 0 for normal. Find a subset of p genes whose expression values x collectively predict the classification of a new sample. The b's are estimated from the training data (in R). The decision boundary is determined by the linear combination of predictors, i.e., classify yi = 1 if b0 + b1 xi1 + … + bp xip > 0. More later in the semester. Example R code from the slide, lightly cleaned so it runs as written:
x <- c(rnorm(50), rnorm(50) + 2)             # covariate: two groups shifted by 2
y <- rep(c(0, 1), each = 50)                 # binary response
plot(x, y)
data1 <- data.frame(covariate = x, response = y)
g.test <- glm(response ~ covariate, family = binomial(), data = data1)
x1 <- cbind(rep(1, 100), x)                  # adding a constant (intercept) term
g.test <- glm.fit(x1, y, family = binomial())
plot(g.test$fitted.values)                   # fitted probabilities of class 1

Support Vector Machine SVM Which hyperplane is the best?

Support Vector Machine SVM finds the hyperplane that maximizes the margin. The margin is determined by the support vectors (samples that lie on the class edge); the others are irrelevant.

Support Vector Machine SVM finds the hyperplane that maximizes the margin; the margin is determined by the support vectors, and the other samples are irrelevant. Extensions: soft margin, where support vectors get different weights; for non-separable data, slack variables ξ > 0 are allowed and we maximize (margin − λ × number of misclassified points).

Nonlinear SVM Project the data into a higher-dimensional space with a kernel function, so that the classes can be separated by a hyperplane. A few kernel functions are implemented and available in BioConductor; the choice is usually trial and error and personal experience. Example polynomial kernel: K(x, y) = (x·y)².
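A minimal R sketch of a nonlinear SVM using the e1071 package (an assumption here; the slide only mentions kernels available in BioConductor), with a radial kernel on toy data:
library(e1071)
set.seed(1)
X <- rbind(matrix(rnorm(40 * 2), 40), matrix(rnorm(40 * 2, 3), 40))  # two shifted clouds
y <- factor(rep(c("normal", "cancer"), each = 40))
fit <- svm(X, y, kernel = "radial", cost = 1)   # soft margin controlled by 'cost'
predict(fit, matrix(c(3, 3), 1))                # classify a new sample near the second cloud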

Outline GSEA for activities of groups of genes. Dimension reduction techniques: MDS, PCA. Unsupervised learning methods: clustering, KNN, MDS, PCA, batch effect. Supervised learning for classification: logistic regression, SVM, cross-validation.