Gene Set Enrichment Analysis


1 Gene Set Enrichment Analysis
Xiaole Shirley Liu STAT115/STAT215

2 Gene Set Enrichment Analysis
In some microarray experiments comparing two conditions, no single gene may be significantly differentially expressed, but a group of genes may be slightly differentially expressed
- Check a set of genes with similar annotation (e.g. GO) and examine their expression values
- Kolmogorov-Smirnov test
- GSEA at the Broad Institute

3 Gene Set Enrichment Analysis
Mootha et al., PNAS 2003
- Kolmogorov-Smirnov test (R sketch below)
- Cumulative fraction function: what fraction of genes are below this fold change?
(Figure: cumulative fraction curves, ranked by fold change (FC) and by t-test statistic.)
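A minimal R sketch of this comparison on simulated data (all object names hypothetical): the fold changes of an annotated gene set are compared against the remaining genes with a two-sample Kolmogorov-Smirnov test, and the cumulative fraction curves are plotted.

set.seed(1)
fold.change <- rnorm(10000)                          # simulated fold changes for all genes
names(fold.change) <- paste0("gene", 1:10000)
gene.set <- paste0("gene", 1:50)                     # hypothetical annotated gene set
in.set <- fold.change[names(fold.change) %in% gene.set]
rest   <- fold.change[!names(fold.change) %in% gene.set]
ks.test(in.set, rest)                                # D statistic and p-value
plot(ecdf(rest), main = "Cumulative fraction", xlab = "Fold change")  # fraction of genes below each FC
lines(ecdf(in.set), col = "red")                     # curve for the annotated gene set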

4 Gene Set Enrichment Analysis
Alternative to KS: one-sample z-test
- Population with all the genes follows a normal distribution ~ N(μ, σ²)
- Average of the m genes (X̄) with a specific annotation: X̄ ~ N(μ, σ²/m), so z = (X̄ − μ) / (σ/√m) (sketched in R below)
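A minimal sketch of this z-test in R on simulated data (the numbers are hypothetical): under the null, the m annotated genes come from the same N(μ, σ²) as all genes, so their average follows N(μ, σ²/m).

set.seed(1)
expr <- rnorm(10000)                        # simulated values for all genes
mu <- mean(expr); sigma <- sd(expr)         # population parameters
set.expr <- rnorm(25, mean = -0.4)          # 25 annotated genes, slightly down-regulated
m <- length(set.expr)
z <- (mean(set.expr) - mu) / (sigma / sqrt(m))
2 * pnorm(-abs(z))                          # two-sided p-value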

5 Gene Set Enrichment Analysis
- A set of genes with a specific annotation may be involved in coordinated down-regulation
- Need to define the set before looking at the data
- Can only see the significance by looking at the whole set

6 Expanded Gene Sets
Subramanian et al., PNAS 2005

7 Examples of GSEA Break

8 Microarray Classification
Xiaole Shirley Liu STAT115/STAT215

9 Microarray Classification
?

10 Classification
- Equivalent to machine learning methods
- Task: assign an object to a class based on measurements of the object
  - E.g. is a sample normal or cancer, based on its expression profile?
- Unsupervised learning
  - Ignores known class labels
  - Sometimes can't separate even the known classes
- Supervised learning
  - Extracts useful features based on known class labels to best separate the classes
  - Can overfit the data, so need to separate training and test sets

11 Clustering Classification
- Which known samples does the unknown sample cluster with?
- No guarantee that the unknown sample will cluster with the known samples
  - Try batch removal or different clustering methods
  - Change linkage, select a subset of genes (semi-supervised)

12 K Nearest Neighbor
- Also used in missing value estimation
- For an observation X with unknown label, find the K observations in the training data closest to X (e.g. by correlation)
- Predict the label of X by majority vote of the K nearest neighbors (R sketch below)
- K can be chosen by predictability of the known samples: semi-supervised again!
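A minimal sketch with the class package on simulated data (sample and label names are hypothetical). Note that knn() uses Euclidean distance; a correlation-based neighbor search, as mentioned above, would need to be coded separately.

library(class)
set.seed(1)
train <- rbind(matrix(rnorm(50 * 20), nrow = 50),            # 50 "normal" samples x 20 genes
               matrix(rnorm(50 * 20, mean = 1), nrow = 50))  # 50 "cancer" samples
labels <- factor(rep(c("normal", "cancer"), each = 50))
test <- matrix(rnorm(5 * 20, mean = 1), nrow = 5)            # 5 unknown samples
knn(train, test, cl = labels, k = 3)                         # majority vote of 3 nearest neighbors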

13 KNN Example
- Can extend KNN by assigning weights to the neighbors by inverse distance from the test sample
- Offers little insight into mechanism

14 MDS
Multidimensional scaling
- Based on distances between data points in high-dimensional space (e.g. correlations)
- Gives a 2-3D representation approximating the pairwise distance relationships as much as possible (R sketch below)
- Non-linear projection
- Can directly place a new sample based on its distances
Break
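A minimal sketch in R on simulated data (hypothetical sample names): classical MDS via cmdscale() on a 1 − correlation distance; MASS::isoMDS() would give the non-metric (non-linear) variant.

set.seed(1)
expr <- cbind(matrix(rnorm(1000 * 5), ncol = 5),             # 5 samples of one type x 1000 genes
              matrix(rnorm(1000 * 5, mean = 2), ncol = 5))   # 5 samples of another type
colnames(expr) <- paste0("sample", 1:10)
d <- as.dist(1 - cor(expr))              # distance between samples from correlations
mds <- cmdscale(d, k = 2)                # 2D coordinates approximating the pairwise distances
plot(mds, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(mds, labels = colnames(expr))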

15 Principal Component Analysis
- Linear transformation that projects the data onto a new coordinate system (linear combinations of the original variables) to capture as much of the variation in the data as possible
- The first principal component accounts for the greatest possible variance in the dataset
- The second principal component accounts for the next highest variance and is uncorrelated with (orthogonal to) the first principal component

16 Finding the Projections
- Looking for a linear combination to transform the original data matrix X to: Y = aᵀX = a1·X1 + a2·X2 + … + ap·Xp
- where a = (a1, a2, …, ap)ᵀ is a column vector of weights with a1² + a2² + … + ap² = 1
- Maximize the variance of the projection of the observations onto the Y variable

17 Finding the Projections
(Figure: two candidate projection directions, labeled Good and Better.)
- The direction of a is given by the eigenvector a1 corresponding to the largest eigenvalue of the covariance matrix C
- The second vector, orthogonal to (uncorrelated with) the first, is the one with the second highest variance, which turns out to be the eigenvector corresponding to the second largest eigenvalue

18 Principal Component Analysis
- Achieved by singular value decomposition (SVD): X = U D Vᵀ
- X is the original data matrix
- U (N × N) gives the relative projections of the points
- V gives the projection directions
  - v1 is a unit vector, the direction of the first projection
  - The eigenvector with the largest eigenvalue
  - A linear combination (relative importance) of each gene (if PCA is on samples)

19 PCA
- D is the scaling factor (eigenvalues)
- Diagonal matrix with d1 ≥ d2 ≥ d3 ≥ … ≥ 0
- dm² measures the variance captured by the mth principal component
- u1·d1 is the distance along v1 from the origin (the first principal component)
  - The expression values projected onto v1
  - u1·d1 captures the largest variance of the original X
- v2 is the 2nd projection direction, orthogonal to PC1; u2·d2 is the 2nd principal component and captures the 2nd largest variance of X (R sketch below)
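A minimal sketch on simulated data (hypothetical matrix): PCA on samples computed directly from the SVD X = U D Vᵀ of the gene-centered matrix; prcomp() gives the same result, and its $rotation slot holds the gene weights (V).

set.seed(1)
X <- rbind(matrix(rnorm(5 * 100), nrow = 5),             # 5 samples of one class x 100 genes
           matrix(rnorm(5 * 100, mean = 1), nrow = 5))   # 5 samples of another class
Xc <- scale(X, center = TRUE, scale = FALSE)             # center each gene
s <- svd(Xc)                                             # Xc = U D V^T
pc.scores <- s$u %*% diag(s$d)                           # u_m * d_m: coordinates along each PC
s$d^2 / sum(s$d^2)                                       # d_m^2: fraction of variance captured
plot(pc.scores[, 1], pc.scores[, 2], xlab = "PC1", ylab = "PC2")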

20 PCA for Classification
Blood transcriptome of healthy individuals (HI), individuals with cardiovascular risk factors (RF), individuals with asymptomatic left ventricular dysfunction (ALVD), and chronic heart failure patients (CHF). The new sample is predicted to be CHF.

21 Interpretation of components
- Look at the weights of the variables in each component
- E.g. if in Y1 the weights of X1 and X3 (e.g. 0.41) are much larger than those of X2 and X4 (e.g. 0.03), then X1 and X3 are more important in PC1, which offers some biological insight
- PCA and MDS are both good dimension reduction methods
- PCA is a good clustering method and can be conducted on genes or on samples
- PCA is only powerful if the biological question is related to the highest variance in the dataset

22 PCA for Batch Effect Detection
- PCA can identify batch effects
- Obvious batch effect: early PCs separate samples by batch (R sketch below)
(Figure panels: Un-normalized, Qnorm, COMBAT.)
Brezina et al., Microarray 2015
Break
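A minimal sketch of this diagnostic on simulated data (hypothetical batch labels): plot PC1 vs PC2 colored by batch; if the early PCs separate batches rather than biology, a batch effect is likely.

set.seed(1)
expr <- matrix(rnorm(20 * 200), nrow = 20)                     # 20 samples x 200 genes
batch <- rep(c("batch1", "batch2"), each = 10)
expr[batch == "batch2", ] <- expr[batch == "batch2", ] + 0.8   # simulated batch shift
pca <- prcomp(expr)
plot(pca$x[, 1], pca$x[, 2], col = factor(batch), pch = 19, xlab = "PC1", ylab = "PC2")
legend("topright", legend = unique(batch), col = 1:2, pch = 19)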

23 Supervised Learning

24 Supervised Learning Performance Assessment
- If the error rate is estimated from the whole learning data set, we could overfit the data (do well now, but poorly on future observations)
- Need cross validation to assess performance
- Leave-one-out cross validation on n data points
  - Build the classifier on (n−1) points, test on the one left out
- N-fold cross validation (sketched below)
  - Divide the data into N subsets of equal size, build the classifier on (N−1) subsets, compute the error rate on the left-out subset
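A minimal N-fold cross-validation sketch in R on simulated data (KNN used as the classifier; all names hypothetical):

library(class)
set.seed(1)
X <- rbind(matrix(rnorm(50 * 10), nrow = 50),
           matrix(rnorm(50 * 10, mean = 1), nrow = 50))
y <- factor(rep(c("normal", "cancer"), each = 50))
N <- 5
fold <- sample(rep(1:N, length.out = nrow(X)))        # random fold assignment
errs <- sapply(1:N, function(i) {
  pred <- knn(X[fold != i, ], X[fold == i, ], cl = y[fold != i], k = 3)
  mean(pred != y[fold == i])                          # error on the left-out fold
})
mean(errs)                                            # cross-validated error rate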

25 Logistic Regression
- Data: (yi, xi), i = 1, …, n
- Dependent variable is binary: 0, 1
- Model (logit): the natural log of the odds Pb(1) over Pb(0), ln[Pb(1)/Pb(0)] = b0 + b1X
- b0 + b1X really big: Y = 1; b0 + b1X really small: Y = 0
- But the change in probability is not linear in X
- b0 is the intercept, b1 is the regression slope
- b0 + b1X = 0 is the decision boundary, where Pb(1) = 0.5

26 Example (wiki)
- Hours of study vs. probability of passing an exam
- Significant association (P-value from the fit)
- Pb(pass) = 1/[1 + exp(−b0 − b1·Hours)]
- 4.0777 / b1 ≈ 2.71 hours of study gives Pb(pass) = 0.5 (reproduced in R below)
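The arithmetic can be reproduced in a few lines of R; the slope value 1.5046 is filled in here from the Wikipedia example the slide cites, and is consistent with 4.0777 / b1 ≈ 2.71 above.

b0 <- -4.0777; b1 <- 1.5046                  # coefficients from the Wikipedia example
hours <- seq(0, 6, by = 0.5)
p.pass <- 1 / (1 + exp(-(b0 + b1 * hours)))  # logistic curve for Pb(pass)
round(cbind(hours, p.pass), 3)
-b0 / b1                                     # about 2.71 hours, where Pb(pass) = 0.5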

27 Logistic Regression
- Sample classification: Y = 1 for cancer, Y = 0 for normal
- Find a subset of p genes whose expression values x collectively predict a new sample's class
- The b's are estimated from the training data (in R)
- The decision boundary is determined by the linear part of the model, i.e. classify yi = 1 if b0 + b1·xi1 + … + bp·xip > 0
- More later in the semester
R demo:

x <- c(rnorm(50), rnorm(50) + 2)                  # covariate: two groups, shifted by 2
y <- c(rep(0, 50), rep(1, 50))                    # binary response
plot(x, y)
data1 <- data.frame(covariate = x, response = y)
g.test <- glm(response ~ covariate, family = binomial(), data = data1)
x1 <- cbind(rep(1, 100), x)                       # adding a constant (intercept) term
g.test <- glm.fit(x1, y, family = binomial())
plot(g.test$fitted.values)                        # fitted probabilities near 0 and 1 for the two groups

28 Support Vector Machine
SVM Which hyperplane is the best?

29 Support Vector Machine
- SVM finds the hyperplane that maximizes the margin
- The margin is determined by the support vectors (samples lying on the class edges); other samples are irrelevant

30 Support Vector Machine
- SVM finds the hyperplane that maximizes the margin
- The margin is determined by the support vectors; other samples are irrelevant
- Extensions:
  - Soft edge: support vectors get different weights
  - Non-separable case: slack variables ξ > 0
  - Maximize (margin − penalty × #bad)

31 Nonlinear SVM
- Project the data into a higher-dimensional space with a kernel function, so the classes can be separated by a hyperplane
- A few kernel functions are implemented in BioConductor packages; the choice is usually trial and error and personal experience
- E.g. K(x, y) = (x·y)² (R sketch below)
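A minimal sketch with the e1071 package (a CRAN package, used here as a stand-in since the slide does not name a specific BioConductor implementation): a homogeneous polynomial kernel of degree 2, proportional to (x·y)², separates two simulated concentric rings that no linear hyperplane can.

library(e1071)
set.seed(1)
theta <- runif(200, 0, 2 * pi)
r <- rep(c(1, 3), each = 100)                                  # two concentric rings
X <- cbind(r * cos(theta), r * sin(theta)) + rnorm(400, sd = 0.2)
y <- factor(rep(c("inner", "outer"), each = 100))
fit <- svm(X, y, kernel = "polynomial", degree = 2, coef0 = 0) # (gamma * x.y)^2 kernel
mean(predict(fit, X) == y)                                     # training accuracy, near 1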

32 Outline
- GSEA for activities of groups of genes
- Dimension reduction techniques: MDS, PCA
- Unsupervised learning methods: clustering, KNN, MDS, PCA, batch effect
- Supervised learning for classification: logistic regression, SVM, cross validation

