More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab.



Outline
Gene Expression vs. DNA applications
A little more normalization (missing values)
Unsupervised Analysis
–Basic Clustering
–Statistical Enrichment
–PCA/SVD
–Advanced Clustering
–Search-based Approaches

Expression / DNA
Analysis concepts overlap, but the goals are often very different
Expression – clustering, guilt by association, functional enrichment
DNA – signal processing, spatial relationships, motif finding
Visualized differently (heat maps vs. karyoscope)

The missing value problem
Microarrays can have systematic or random missing values
Some algorithms can’t deal with missing values (PCA/SVD in particular)
Instead of hoping missing values won’t bias the analysis, it is better to estimate them accurately

Spatial Defects

KNN Impute
Idea: use genes with similar expression profiles to estimate missing values
Before imputation:
Gene X: 2 | 4 | 5 | 7 | 3 | 2
Gene A: 2 |   | 5 | 7 | 3 | 1
Gene B: 8 | 9 | 2 | 1 | 4 | 9
Gene C: 3 | 5 | 6 | 7 | 3 | 2
After imputation:
Gene X: 2 | 4 | 5 | 7 | 3 | 2
Gene A: 2 |4.3| 5 | 7 | 3 | 1
Gene B: 8 | 9 | 2 | 1 | 4 | 9
Gene C: 3 | 5 | 6 | 7 | 3 | 2
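
The idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: Euclidean distance over the observed columns, inverse-distance weighting of the k nearest complete genes, and k=2. The function name `knn_impute` and these choices are hypothetical and not necessarily what the KNNimpute tool does.

```python
import numpy as np

def knn_impute(data, k=2):
    """Fill NaNs in each row using the k most similar complete rows (genes)."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    for i, row in enumerate(data):
        missing = np.isnan(row)
        if not missing.any():
            continue
        observed = ~missing
        candidates = []
        for j, other in enumerate(data):
            # Neighbors must be other genes with no missing values.
            if j == i or np.isnan(other).any():
                continue
            dist = np.sqrt(np.sum((row[observed] - other[observed]) ** 2))
            candidates.append((dist, j))
        candidates.sort()
        nearest = candidates[:k]
        if not nearest:
            continue  # no complete neighbor available; leave the gap
        # Assumed weighting: closer genes count more (inverse distance).
        weights = np.array([1.0 / (d + 1e-9) for d, _ in nearest])
        neighbors = data[[j for _, j in nearest]]
        filled[i, missing] = weights @ neighbors[:, missing] / weights.sum()
    return filled

# The four expression rows from the slide; NaN marks the missing value.
expr = [
    [2, 4, 5, 7, 3, 2],
    [2, np.nan, 5, 7, 3, 1],
    [8, 9, 2, 1, 4, 9],
    [3, 5, 6, 7, 3, 2],
]
imputed = knn_impute(expr, k=2)
print(round(imputed[1, 1], 2))
```

Under these assumptions the estimate falls between the values of the two nearest genes (4 and 5), near the 4.3 shown on the slide; the exact figure depends on the weighting used.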

Imputation affects downstream analysis
[Figure panels: complete data set; data set with missing values estimated by the KNNimpute algorithm; data set with 30% of entries missing and filled with zeros (zero values appear black)]

Unsupervised Analysis
Supervised techniques are great if you have starting information (e.g. labels)
–But we often don’t know enough beforehand to apply these methods
Unsupervised techniques are exploratory
–Let the data organize itself, then try to find biological meaning
–Approaches to understand the whole data set
–Visualization is often helpful

Clustering
Let the data organize itself
Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups)
Identify subsets of genes (or experiments) that are related by some measure

Quick Example
[Figure; axes: Genes, Conditions]

Why cluster?
“Guilt by association” – if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway
Visualization – datasets are too large to get information out of without reorganizing the data

Clustering Techniques
Algorithm (Method)
–Hierarchical
–K-means
–Self Organizing Maps
–QT-Clustering
–NNN
–…
Distance Metric
–Euclidean (L2)
–Pearson Correlation
–Spearman Correlation
–Manhattan (L1)
–Kendall’s τ
–…

Distance Metrics
Choice of distance measure is important for most clustering techniques
Pair-wise metrics – compare vectors of numbers
–e.g. genes x & y, each with n measurements
Euclidean Distance
Pearson Correlation
Spearman Correlation

Distance Metrics Euclidean Distance Pearson Correlation Spearman Correlation
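
The formulas for these three metrics did not survive in the transcript, so here is a small sketch using the standard textbook definitions for two genes x and y with n measurements each. Turning a correlation r into a distance as 1 - r is a common convention assumed here, not something stated on the slide, and the simplified `ranks` helper does not average tied values.

```python
import numpy as np

def euclidean(x, y):
    # sqrt( sum_i (x_i - y_i)^2 )
    return float(np.sqrt(np.sum((x - y) ** 2)))

def pearson_distance(x, y):
    # r = sum((x - mean_x)(y - mean_y)) / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return float(1.0 - r)  # assumed conversion from correlation to distance

def ranks(v):
    # Rank positions 0..n-1; ties are NOT averaged in this simplified version.
    return np.argsort(np.argsort(v)).astype(float)

def spearman_distance(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson_distance(ranks(x), ranks(y))

# Hypothetical profiles: y rises with x, but the last value is an outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 20.0])

print(euclidean(x, y), pearson_distance(x, y), spearman_distance(x, y))
```

In this example the two profiles are far apart in Euclidean terms, imperfectly correlated by Pearson because of the outlier, and identical by Spearman because the ordering of values matches, which is one way to see why the choice of metric matters.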

Hierarchical clustering
Imposes (pair-wise) hierarchical structure on all of the data
Often good for visualization
Basic Method (agglomerative):
1. Calculate all pair-wise distances
2. Join the closest pair
3. Calculate pair’s distance to all others
4. Repeat from 2 until all joined
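
The four steps above can be sketched as a naive loop. This is an illustration only, assuming Euclidean distance and average linkage; the function name `agglomerative` and the toy data are hypothetical, and real tools use much faster update schemes than recomputing cluster distances on every pass.

```python
import numpy as np

def agglomerative(data):
    """Naive average-linkage clustering; returns the list of merges."""
    data = np.asarray(data, dtype=float)
    # Step 1: all pair-wise Euclidean distances between rows.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    clusters = {i: [i] for i in range(len(data))}  # cluster id -> member rows
    merges = []
    next_id = len(data)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Step 3: cluster-to-cluster distance is the mean over
                # all member pairs (average linkage, recomputed each pass).
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 2: join the closest pair under a new cluster id.
        d, a, b = best
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, float(d)))
        next_id += 1  # Step 4: repeat until everything is joined
    return merges

# Hypothetical toy data: two tight pairs that are far from each other.
points = [[0, 0], [0, 1], [5, 5], [5, 6]]
merges = agglomerative(points)
for a, b, d in merges:
    print(f"join {a} and {b} at distance {d:.2f}")
```

The list of merges, read in order, is the tree: early merges are the leaves of the dendrogram and the final merge is its root.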

Hierarchical clustering

HC – Interior Distances
Three typical variants to calculate interior distances within the tree
–Average linkage: mean/median over all possible pair-wise values
–Single linkage: minimum pair-wise distance
–Complete linkage: maximum pair-wise distance
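
A short sketch of how the three variants differ for the same pair of clusters, assuming Euclidean distance between members and the mean (rather than the median) for average linkage. The function name `linkage_distance` and the example points are hypothetical.

```python
import numpy as np

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters, given the rows belonging to each."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # Every member of a against every member of b.
    pairwise = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2))
    if method == "average":
        return float(pairwise.mean())
    if method == "single":
        return float(pairwise.min())
    if method == "complete":
        return float(pairwise.max())
    raise ValueError(f"unknown linkage method: {method}")

left = [[0, 0], [0, 2]]
right = [[3, 0], [3, 2]]
single = linkage_distance(left, right, "single")
average = linkage_distance(left, right, "average")
complete = linkage_distance(left, right, "complete")
print(single, average, complete)
```

Single linkage always gives the smallest value and complete linkage the largest, with average linkage in between, so the choice changes which clusters look "closest" at each join.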

Hierarchical clustering: problems
Hard to define distinct clusters
Genes assigned to clusters on the basis of all experiments
Optimizing node ordering is hard (finding the optimal solution is NP-hard)
Can be driven by one strong cluster – a problem for gene expression because data in row space is often highly correlated

HC: Real Example
Demo in JavaTreeView & HIDRA
–Spellman et al., 1998: yeast alpha-factor sync cell cycle timecourse

HC: Another Example Expression of tumors hierarchically clustered Expression groups by clinical class Garber et al.

K-means Clustering
Groups genes into a pre-defined number of independent clusters
Basic algorithm:
1. Define k = number of clusters
2. Randomly initialize each cluster with a seed (often with a random gene)
3. Assign each gene to the cluster with the most similar seed
4. Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster
5. Repeat 3 & 4 until convergence (e.g. no genes move, means don’t change much, etc.)
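
The basic algorithm above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the similarity measure, means (not medians) for the seeds, and "no gene changes cluster" as the convergence test; the function name `kmeans`, the `seed` parameter, and the toy data are hypothetical.

```python
import numpy as np

def kmeans(data, k, max_iter=100, seed=0):
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 2: seed each cluster with a randomly chosen gene (row).
    centers = data[rng.choice(len(data), size=k, replace=False)]
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Step 3: assign each gene to the cluster with the nearest seed.
        dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dist.argmin(axis=1)
        # Step 5: stop once no gene moves to a different cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recalculate each seed as the mean of its assigned genes.
        for c in range(k):
            members = data[labels == c]
            if len(members):  # keep the old seed if a cluster is empty
                centers[c] = members.mean(axis=0)
    return labels, centers

# Hypothetical toy data: two well-separated groups of three genes.
genes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers = kmeans(genes, k=2)
print(labels)
```

Because the seeds are random, different runs can number the clusters differently, and on less clearly separated data they can also converge to different groupings.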

K-means example

K-means: problems
Have to set k ahead of time
–Ways to choose “optimal” k: minimize within-cluster variation compared to random data or held out data
Each gene belongs to exactly 1 cluster
One cluster has no influence on the others (one dimensional clustering)
Genes assigned to clusters on the basis of all experiments

K-means: Real Example
Demo in TIGR MeV
–Spellman et al. alpha-factor cell cycle

Clustering “Tweaks”
Fuzzy clustering – allows genes to be “partially” in different clusters
Dependent clusters – consider between-cluster distances as well as within-cluster
Bi-clustering – look for patterns across subsets of conditions
–Very hard problem (NP-complete)
–Practical solutions use heuristics/simplifications that may affect biological interpretation

Cluster Evaluation
Mathematical consistency
–Compare coherency of clusters to background
Look for functional consistency in clusters
–Requires a gold standard, often based on GO, MIPS, etc.
Evaluate likelihood of enrichment in clusters
–Hypergeometric distribution, etc.
–Several tools available

Gene Ontology
Organization of curated biological knowledge
–3 branches: biological process, molecular function, cellular component

Hypergeometric Distribution
Probability of observing x or more genes in a cluster of n genes with a common annotation
–N = total number of genes in genome
–M = number of genes with annotation
–n = number of genes in cluster
–x = number of genes in cluster with annotation
Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.)
Additional genes in clusters with strong enrichment may be related
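
As a worked illustration, the tail probability can be computed directly from the standard hypergeometric formula using the slide's symbols N, M, n and x. The function names `enrichment_pvalue` and `bonferroni` and the small numbers in the example are hypothetical; the enrichment tools mentioned in this deck may compute and correct p-values differently.

```python
from math import comb

def enrichment_pvalue(N, M, n, x):
    """P(X >= x): chance of seeing x or more annotated genes in the cluster.

    Sum over i = x .. min(n, M) of C(M, i) * C(N - M, n - i) / C(N, n).
    """
    upper = min(n, M)
    tail = sum(comb(M, i) * comb(N - M, n - i) for i in range(x, upper + 1))
    return tail / comb(N, n)

def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

# Hypothetical toy genome: 10 genes, 4 annotated, cluster of 3 with 2 annotated.
p = enrichment_pvalue(N=10, M=4, n=3, x=2)
print(p)                          # (6*6 + 4*1) / 120 = 1/3
print(bonferroni([0.01, 0.5]))    # correction when two functions are tested
```

Bonferroni is the simplest of the corrections the slide lists; FDR-based methods are less conservative when many functions are tested.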

GO term Enrichment Tools
SGD’s & Princeton’s GoTermFinder
GOLEM
HIDRA
Sealfon et al., 2006

More Unsupervised Methods
Search-based approaches
–Starting with a query gene/condition, find most related group
Singular Value Decomposition (SVD) & Principal Component Analysis (PCA)
–Decomposition of data matrix into “patterns”, “weights” and “contributions”
–Real names are “principal components”, “singular values” and “left/right singular vectors”
–Used to remove noise, reduce dimensionality, identify common/dominant signals
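
The simplest form of a search-based approach can be sketched as ranking every gene by its similarity to a query gene. The version below assumes Pearson correlation as the similarity measure; the function name `rank_by_correlation` and the toy matrix are hypothetical, and it is not a description of how any particular tool implements search. Rows with constant expression would need special handling, since their correlation is undefined.

```python
import numpy as np

def rank_by_correlation(expr, query_index):
    """Return (gene index, correlation) pairs, most similar to the query first."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    # Pearson correlation of every gene with the query gene.
    corr = centered @ centered[query_index] / (norms * norms[query_index])
    order = np.argsort(-corr)
    return [(int(i), float(corr[i])) for i in order if i != query_index]

# Hypothetical matrix: rows are genes, columns are conditions.
expr = [
    [1.0, 2.0, 3.0, 4.0],   # gene 0: the query
    [2.0, 4.0, 6.0, 8.5],   # gene 1: rises with the query
    [4.0, 3.0, 2.0, 1.0],   # gene 2: opposite pattern
    [1.0, 3.0, 2.0, 4.0],   # gene 3: loosely similar
]
hits = rank_by_correlation(expr, query_index=0)
print(hits)
```

The top of the ranked list is the "most related group" for the query; where to cut the list is a separate decision.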

SVD (& PCA)
SVD is the method, PCA is performing SVD on centered data
Projects data into another orthonormal basis
New basis ordered by variance explained
$X = U \Sigma V^{t}$
–$X$: original data matrix
–$U$: “eigen-conditions”
–$\Sigma$: singular values
–$V^{t}$: “eigen-genes”
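
A brief sketch of the decomposition with NumPy, using `numpy.linalg.svd`. The random matrix stands in for real expression data, and centering each condition (column) before the SVD is an assumption about how the data would be centered for PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))      # hypothetical data: 6 genes x 4 conditions

# PCA = SVD on centered data (here: subtract each column's mean).
Xc = X - X.mean(axis=0)

# Xc = U * diag(s) * Vt, with singular values s sorted largest first.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of variance explained by each component of the new basis.
explained = s ** 2 / np.sum(s ** 2)

# Keeping only the top k components reduces dimensionality and noise.
k = 2
X_k = (U[:, :k] * s[:k]) @ Vt[:k]

print(explained)
```

The rows of `Vt` are the orthonormal patterns across conditions, `s` gives their weights, and `U` holds each gene's contribution to every pattern.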

SVD

SVD: Real Example
Demo in TIGR MeV
–Spellman et al., 1998 cell cycle time courses: alpha-factor sync, cdc15 sync

DNA arrays / Sequence-based Analysis
Methods so far focused on expression data
Other uses of microarrays are often sequence based: CGH, ChIP-chip, SNP scanner
–Data has important, inherent order
–Most analysis methods developed from signal processing techniques (e.g. sound)
–View data in chromosomal order (karyoscope)
Tools: JavaTreeView, IGB, Chippy

CGH Example Demo in JavaTreeView

Aneuploidy affects expression too
(data from Hughes et al. (2000))
[Figure labels: rpl20a, Chromosome XV]

Software Tools
JavaTreeView – viz, karyoscope
HIDRA – viz, mult. datasets, search
Cluster (Eisen lab) – clustering
TIGR MeV – clustering, viz
IGB – Affy’s CGH browser
ChIPpy – ChIP-chip analysis

Summary
Unsupervised Analysis
–Let the data organize itself, find patterns
–Clustering: Distance Metric + Algorithm
–SVD/PCA – auto find dominant patterns
Impute missing values (KNN)
CGH – Karyoscope view
Questions?