Gene Set Enrichment Analysis Microarray Classification STAT115 Jun S. Liu and Xiaole Shirley Liu.


Gene Set Enrichment Analysis Microarray Classification STAT115 Jun S. Liu and Xiaole Shirley Liu

Outline
Gene ontology
– Check differential expression and clustering results
– Gene set enrichment analysis
Unsupervised learning for classification
– Clustering and KNN
– PCA (dimension reduction)
Supervised learning for classification
– CART, SVM
Expression and genome resources

GO Relationships
– Subclass: is_a
– Membership: part_of
– Topological: adjacent_to; Derivation: derives_from
– E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript
The same term can be annotated at multiple branches
Directed acyclic graph

Evaluate Differentially Expressed Genes
NetAffx mapped GO terms for all probesets

            Whole genome   Up genes
GO term X   100            80
Total       20K            200

Statistical significance? Binomial proportion test:
– p = 100 / 20K = 0.005
– Check z table
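The binomial proportion test above can be sketched in Python (a minimal illustration restating the slide's numbers; the function name is mine, and the normal approximation replaces looking the value up in a z table):

```python
import math

def binomial_proportion_z(k, n, p0):
    """z-statistic for observing k successes out of n against a null
    proportion p0 (normal approximation to the binomial)."""
    p_hat = k / n
    se = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / se

# GO term X covers 100 of 20,000 genes (p0 = 0.005), but 80 of the
# 200 up-regulated genes (p_hat = 0.4): is the excess significant?
z = binomial_proportion_z(80, 200, 100 / 20000)
print(round(z, 1))  # 79.2 -- far beyond any z-table cutoff
```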

Evaluate Differentially Expressed Genes
Chi-sq test (expected counts in parentheses):

        Up          !Up               Total
GO      80 (1)      20 (99)           100
!GO     120 (199)   19,780 (19,701)   19,900
Total   200         19,800            20K

– Check Chi-sq table
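A sketch of the same 2×2 test using `scipy.stats.chi2_contingency` (assuming SciPy is available; `correction=False` gives the plain chi-square statistic rather than the Yates-corrected one):

```python
from scipy.stats import chi2_contingency

# 2x2 table from the slide: rows = GO term X / not GO,
# columns = up-regulated / not up-regulated
observed = [[80, 20],
            [120, 19780]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 1), dof)  # ~6335.7 with 1 degree of freedom
print(expected[0])          # expected GO counts of 1 and 99, matching the slide
print(p < 1e-100)           # True: overwhelming enrichment
```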

GO Tools for Microarray Analysis: ~40 tools available

GO on Clustering
Evaluate and refine clustering
– Check GO terms for members in the cluster
– Are GO terms significantly enriched?
– Can we summarize what the genes in this cluster do?
– Are there conflicting members in the cluster?
Annotate unknown genes
– After clustering, check GO terms
– Can we infer an unknown gene's function from the GO terms of its cluster members?

Gene Set Enrichment Analysis
In some microarray experiments comparing two conditions, no single gene may be significantly differentially expressed, but a group of genes may be slightly differentially expressed
Check a set of genes with similar annotation (e.g. GO) and examine their expression values
– Kolmogorov-Smirnov test
– One-sample z-test
GSEA at the Broad Institute

Gene Set Enrichment Analysis
Kolmogorov-Smirnov test
– Determines whether two distributions differ significantly
– Based on the cumulative fraction function: what fraction of genes fall below this fold change?
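A minimal illustration of the KS idea on simulated fold changes (the data here are made up; `ks_2samp` from SciPy computes the maximum gap between the two cumulative fraction curves):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Simulated log fold changes: background genes centered at 0, while a
# 300-gene set is coordinately shifted down by half a standard deviation.
background = rng.normal(loc=0.0, scale=1.0, size=5000)
gene_set = rng.normal(loc=-0.5, scale=1.0, size=300)

# The KS statistic is the maximum distance between the two cumulative
# fraction curves ("what fraction of genes are below this fold change?")
stat, p = ks_2samp(gene_set, background)
print(stat > 0.1, p < 0.01)  # the coordinated shift is detectable
```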

Gene Set Enrichment Analysis
– A set of genes with a specific annotation may be involved in coordinated down-regulation
– Need to define the set before looking at the data
– The significance is only visible when looking at the whole set

Gene Set Enrichment Analysis
Alternative to KS: one-sample z-test
– Under the null, all genes follow N(μ, σ²)
– For the average X̄ of the m genes with a specific annotation: z = (X̄ − μ) / (σ / √m)
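The one-sample z-test can be written out directly (the function name and example numbers are illustrative, not from the slides):

```python
import math

def one_sample_z(xbar, mu, sigma, m):
    """z = (xbar - mu) / (sigma / sqrt(m)): tests the average of m annotated
    genes against the null that all genes come from N(mu, sigma^2)."""
    return (xbar - mu) / (sigma / math.sqrt(m))

# Illustrative numbers: 25 genes in the set average a fold change of 0.5
# on an array whose genes overall have mean 0 and standard deviation 1.
z = one_sample_z(0.5, 0.0, 1.0, 25)
print(round(z, 4))  # 2.5: the set shifts together even if no single gene stands out
```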

Dimension Reduction
High-dimensional data points are difficult to visualize
Always good to plot data in 2D
– Easier to detect or confirm relationships among data points
– Catch stupid mistakes (e.g. in clustering)
Two ways to reduce:
– By genes: some experiments are similar or have little information
– By experiments: some genes are similar or have little information

Principal Component Analysis
An optimal linear transformation that chooses a new coordinate system for the data set, projecting onto new axes (the principal components) ordered by the variance they capture
Components are orthogonal (mutually uncorrelated)
A few PCs may capture most of the variation in the original data
E.g. reduce 2D data to 1D

Principal Component Analysis
Achieved by singular value decomposition (SVD): X = U D Vᵀ
X is the original N × p data
– E.g. N genes, p experiments
V is p × p, the projection directions
– Orthogonal matrix: VᵀV = I_p
– v1 is the direction of the first projection
– Each direction is a linear combination (relative importance) of the experiments (or genes, if PCA is on samples)

PCA
U is N × p, the relative projections of the points
D is p × p, a scaling factor
– Diagonal matrix, d1 ≥ d2 ≥ d3 ≥ … ≥ dp ≥ 0
u_i1 · d1 is the distance along v1 from the origin (the first principal component)
– Expression value projected onto v1
– v2 is the 2nd projection direction; u_i2 · d2 is the 2nd principal component, and so on
Variance captured by the first m principal components: (d1² + … + dm²) / (d1² + … + dp²)

PCA
X (N × p) = U (N × p) · D (p × p) · Vᵀ (p × p)
Original data = projected values × scale × projection directions; the projected values are X V = U D:
– X11·V11 + X12·V21 + X13·V31 + … = X′11 = U11·D11 (1st principal component)
– X21·V11 + X22·V21 + X23·V31 + … = X′21 = U21·D11
– X11·V12 + X12·V22 + X13·V32 + … = X′12 = U12·D22 (2nd principal component)
– X21·V12 + X22·V22 + X23·V32 + … = X′22 = U22·D22
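The SVD view of PCA above can be sketched with NumPy (a toy matrix, not real expression data; `np.linalg.svd` returns U, the diagonal of D, and Vᵀ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy expression matrix: N = 6 genes (rows) x p = 4 experiments (columns)
X = rng.normal(size=(6, 4))
X = X - X.mean(axis=0)             # center each column before PCA

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Rows of Vt are the projection directions v1, v2, ...; U * D gives the
# principal components, i.e. the data projected onto those directions.
pcs = U * d                        # identical to X @ Vt.T
print(np.allclose(pcs, X @ Vt.T))  # True

# d comes sorted d1 >= d2 >= ... >= 0, so the variance captured by the
# first m components is sum(d[:m]**2) / sum(d**2).
var_explained = d**2 / (d**2).sum()
print(var_explained.round(2))
```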

PCA [figure: scatter plots with projection directions v1 and v2 drawn for three example data sets]

PCA on Genes Example Cell cycle genes, 13 time points, reduced to 2D Genes: 1: G1; 4: S; 2: G2; 3: M

PCA Example Variance in the data explained by the first n principal components

PCA Example
The weights of the first 8 principal directions
This is an example of PCA to reduce samples
Can do PCA to reduce the genes as well
– Use the first 2–3 PCs to plot samples; giving more weight to the more differentially expressed genes can often reveal the sample classification

Microarray Classification [figure: to which class does the unknown sample belong?]

Classification
Equivalent to machine learning methods
Task: assign an object to a class based on measurements of the object
– E.g. is a sample normal or cancer, based on its expression profile?
Unsupervised learning
– Ignores known class labels, e.g. cluster analysis or KNN
– Sometimes cannot separate even the known classes
Supervised learning
– Extracts useful features based on known class labels to best separate the classes
– Can overfit the data, so need to separate training and test sets (e.g. cross-validation)

Clustering Classification
Which known samples does the unknown sample cluster with?
No guarantee that the known samples will cluster together
Try different clustering methods (semi-supervised)
– E.g. change the linkage, or use a subset of genes

K Nearest Neighbor
Also used in missing value estimation
For an observation X with unknown label, find the K observations in the training data closest to X (e.g. by correlation)
Predict the label of X by majority vote among the K nearest neighbors
K can be determined by the predictability of known samples (semi-supervised again!)
Offers little insight into mechanism
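A minimal KNN sketch with scikit-learn (the two-class expression data are simulated; `KNeighborsClassifier` implements the majority vote described above, with its default Euclidean distance standing in for the correlation distance the slide mentions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)

# Simulated training profiles: 20 "normal" and 20 "cancer" samples,
# each measured on 5 genes, with the cancer class shifted by +2.
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 5)),
               rng.normal(2.0, 1.0, size=(20, 5))])
y = ["normal"] * 20 + ["cancer"] * 20

knn = KNeighborsClassifier(n_neighbors=5)  # label by majority vote of K = 5
knn.fit(X, y)

# An unknown sample close to the cancer profile gets the cancer label
print(knn.predict([[1.8, 2.1, 1.9, 2.2, 2.0]])[0])  # cancer
```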

Supervised Learning Performance Assessment
If the error rate is estimated from the whole learning data set, it will be over-optimistic (do well now, but poorly on future observations)
Divide observations into L1 and L2
– Build the classifier using L1
– Compute the classifier error rate using L2
– Requirement: L1 and L2 are iid (independent & identically distributed)
N-fold cross validation
– Divide the data into N subsets of equal size; build the classifier on (N−1) subsets, compute the error rate on the left-out subset
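N-fold cross validation as described above, sketched with scikit-learn's `cross_val_score` (simulated data; `cv=5` yields five held-out error estimates instead of one over-optimistic resubstitution estimate):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

# Two simulated classes of 30 samples x 5 genes each
X = np.vstack([rng.normal(0.0, 1.0, size=(30, 5)),
               rng.normal(2.0, 1.0, size=(30, 5))])
y = np.array([0] * 30 + [1] * 30)

# 5-fold CV: fit on 4/5 of the samples, score on the held-out fold,
# repeated so every sample is held out exactly once.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(len(scores), scores.mean() > 0.9)  # five held-out accuracy estimates
```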

Classification And Regression Tree
Split the data using a set of binary (or multi-valued) decisions
The root node (all data) has a certain impurity; split the data to reduce the impurity

CART
Measures of impurity
– Entropy
– Gini index impurity
Example with Gini: multiply the impurity by the number of samples in the node
– Root node (e.g. 8 normal & 14 cancer): 22 × (1 − (8/22)² − (14/22)²) ≈ 10.2
– Try splitting by gene x_i (x_i ≥ 0: 13 cancer; x_i < 0: 1 cancer & 8 normal): 13 × 0 + 9 × (1 − (1/9)² − (8/9)²) ≈ 1.8
– Split at the gene with the biggest reduction in impurity
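The Gini computation on this slide can be checked directly (a small helper of my own, assuming the slide's convention of weighting each node's impurity by its sample count):

```python
def gini(counts):
    """Gini impurity 1 - sum(p_k^2), weighted by the node's sample count
    (the convention used on the slide)."""
    n = sum(counts)
    return n * (1 - sum((c / n) ** 2 for c in counts))

# Root node: 8 normal and 14 cancer samples
root = gini([8, 14])               # 22 * (1 - (8/22)^2 - (14/22)^2) ~ 10.18

# Split on gene x_i: x_i >= 0 -> 13 cancer (pure); x_i < 0 -> 1 cancer, 8 normal
split = gini([13]) + gini([1, 8])  # 0 + 9 * (1 - (1/9)^2 - (8/9)^2) ~ 1.78

print(round(root - split, 2))  # 8.4: the impurity reduction for this split
```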

CART
Assumes independence of partitions; the same level may split on different genes
Stop splitting
– When the impurity is small enough
– When the number of samples in the node is small
Pruning to reduce overfitting
– Training set to split, test set for pruning
– Each split has a cost, compared against the gain from that split

Support Vector Machine
SVM: which separating hyperplane is the best?

Support Vector Machine
SVM finds the hyperplane that maximizes the margin
The margin is determined by the support vectors (samples lying on the class edge); other samples are irrelevant

Support Vector Machine
SVM finds the hyperplane that maximizes the margin; the margin is determined by the support vectors, other samples are irrelevant
Extensions:
– Soft edge: support vectors get different weights
– Non-separable case: slack variables ξ > 0; maximize (margin − λ · #bad)
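A sketch of a soft-margin linear SVM with scikit-learn, illustrating that only the support vectors define the margin (simulated 2D data; the `C` parameter plays the role of the slack penalty above):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Two linearly separable classes in 2D
X = np.vstack([rng.normal(-2.0, 0.5, size=(25, 2)),
               rng.normal(2.0, 0.5, size=(25, 2))])
y = [0] * 25 + [1] * 25

# C is the soft-margin penalty: large C punishes slack variables heavily
svm = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors on the class edges determine the hyperplane;
# the remaining samples could be deleted without moving it.
print(len(svm.support_vectors_), "support vectors out of", len(X))
```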

Nonlinear SVM
Project the data into a higher-dimensional space with a kernel function, so the classes can be separated by a hyperplane
A few kernel functions are implemented in Matlab & BioConductor; the choice is usually trial and error and personal experience
E.g. K(x, y) = (x · y)²
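The kernel trick can be illustrated with scikit-learn rather than the Matlab/BioConductor implementations the slide mentions (the concentric-circles data set is a standard example that no hyperplane separates in the original space):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: no hyperplane in the original 2D space separates them
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # cannot do much better than chance
rbf = SVC(kernel="rbf").fit(X, y)        # implicit higher-dimensional mapping

print(round(linear.score(X, y), 2), round(rbf.score(X, y), 2))
```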

Most Widely Used Sequence IDs
– GenBank: all submitted sequences
– EST: Expressed Sequence Tags (mRNA); some redundancy, might have contaminations
– UniGene: computationally derived gene-based transcribed sequence clusters
– Entrez Gene: comprehensive catalog of genes and associated information, ~ the traditional concept of "gene"
– RefSeq: reference sequences for mRNAs and proteins, one per individual transcript (splice variant)

UCSC Genome Browser Can display custom tracks

Entrez: Main NCBI Search Engine

Public Microarray Databases
– SMD: Stanford Microarray Database, most Stanford and collaborators' cDNA arrays
– GEO: Gene Expression Omnibus, an NCBI repository for gene expression and hybridization data, growing quickly
– Oncomine: Cancer Microarray Database
  – Published cancer-related microarrays
  – Raw data all processed, nice interface

Outline
Gene ontology
– Check differential expression and clustering results, GSEA
Microarray classification:
– Unsupervised: clustering, KNN, PCA
– Supervised learning for classification: CART, SVM
Expression and genome resources

Acknowledgment Kevin Coombes & Keith Baggerly Darlene Goldstein Mark Craven George Gerber Gabriel Eichler Ying Xie Terry Speed & Group Larry Hunter Wing Wong & Cheng Li Ping Ma, Xin Lu, Pengyu Hong Mark Reimers Marco Ramoni Jenia Semyonov