Presentation is loading. Please wait.

Presentation is loading. Please wait.

More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab.

Similar presentations


Presentation on theme: "More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab."— Presentation transcript:

1 More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab

2 Outline Gene Expression vs. DNA applications A little more normalization (missing values) Unsupervised Analysis –Basic Clustering –Statistical Enrichment –PCA/SVD –Advanced Clustering –Search-based Approaches

3 Expression / DNA Some similar concepts to analysis, but often very different goals Expression – clustering, guilt by association, functional enrichment DNA – signal processing, spatial relationships, motif finding Visualized differently (Heat maps vs. karyoscope)

4 The missing value problem Microarrays can have systematic or random missing values Some algorithms can’t deal with missing values (PCA/SVD in particular) Instead of hoping missing values won’t bias the analysis, better to estimate them accurately

5 Spatial Defects

6 KNN Impute Idea: use genes with similar expression profiles to estimate missing values 2 | 4 | 5 | 7 | 3 | 2 2 | | 5 | 7 | 3 | 1 8 | 9 | 2 | 1 | 4 | 9 Gene X Gene A Gene B 3 | 5 | 6 | 7 | 3 | 2 Gene C 2 | 4 | 5 | 7 | 3 | 2 2 |4.3| 5 | 7 | 3 | 1 8 | 9 | 2 | 1 | 4 | 9 Gene X Gene A Gene B 3 | 5 | 6 | 7 | 3 | 2 Gene C

7 Complete data setData set with missing values estimated by KNNimpute algorithm Data set with 30% entries missing and filled with zeros (zero values appear black) Imputation affects downstream analysis

8 Unsupervised Analysis Supervised techniques great if you have starting information (e.g. labels) –But, we often we don’t know enough beforehand to apply these methods Unsupervised techniques are exploratory –Let the data organize itself, then try to find biological meaning –Approaches to understand whole data –Visualization often helpful

9 Clustering Let the data organize itself Reordering of genes (or conditions) in the dataset so that similar patterns are next to each other (or in separate groups) Identify subsets of genes (or experiments) that are related by some measure

10 Quick Example Genes Conditions

11 Why cluster? “Guilt by association” – if unknown gene X is similar in expression to known genes A and B, maybe they are involved in the same/related pathway Visualization: datasets are too large to be able to get information out without reorganizing the data

12 Clustering Techniques Algorithm (Method) –Hierarchical –K-means –Self Organizing Maps –QT-Clustering –NNN –. Distance Metric –Euclidean (L 2 ) –Pearson Correlation –Spearman Correlation –Manhattan (L 1 ) –Kendall’s  –.

13 Distance Metrics Choice of distance measure is important for most clustering techniques Pair-wise metrics – compare vectors of numbers –e.g. genes x & y, ea. with n measurements Euclidean Distance Pearson Correlation Spearman Correlation

14 Distance Metrics Euclidean Distance Pearson Correlation Spearman Correlation

15 Hierarchical clustering Imposes (pair-wise) hierarchical structure on all of the data Often good for visualization Basic Method (agglomerative): 1.Calculate all pair-wise distances 2.Join the closest pair 3.Calculate pair’s distance to all others 4.Repeat from 2 until all joined

16 Hierarchical clustering

17

18

19

20

21

22 HC – Interior Distances Three typical variants to calculate interior distances within the tree –Average linkage: mean/median over all possible pair-wise values –Single linkage: minimum pair-wise distance –Complete linkage: maximum pair-wise distance

23 Hierarchical clustering: problems Hard to define distinct clusters Genes assigned to clusters on the basis of all experiments Optimizing node ordering hard (finding the optimal solution is NP-hard) Can be driven by one strong cluster – a problem for gene expression b/c data in row space is often highly correlated

24 HC: Real Example Demo in JavaTreeView & HIDRA –Spellman et al., 1998: yeast alpha-factor sync cell cycle timecourse

25 HC: Another Example Expression of tumors hierarchically clustered Expression groups by clinical class Garber et al.

26 K-means Clustering Groups genes into a pre-defined number of independent clusters Basic algorithm: 1.Define k = number of clusters 2.Randomly initialize each cluster with a seed (often with a random gene) 3.Assign each gene to the cluster with the most similar seed 4.Recalculate all cluster seeds as means (or medians) of genes assigned to the cluster 5.Repeat 3 & 4 until convergence (e.g. No genes move, means don’t change much, etc.)

27 K-means example

28

29

30 K-means: problems Have to set k ahead of time –Ways to choose “optimal” k: minimize within- cluster variation compared to random data or held out data Each gene only belongs to exactly 1 cluster One cluster has no influence on the others (one dimensional clustering) Genes assigned to clusters on the basis of all experiments

31 K-means: Real Example Demo in TIGR MeV –Spellman et al. alpha-factor cell cycle

32 Clustering “Tweaks” Fuzzy clustering – allows genes to be “partially” in different clusters Dependent clusters – consider between- cluster distances as well as within-cluster Bi-clustering – look for patterns across subsets of conditions –Very hard problem (NP-complete) –Practical solutions use heuristics/simplifications that may affect biological interpretation

33 Cluster Evaluation Mathematical consistency –Compare coherency of clusters to background Look for functional consistency in clusters –Requires a gold standard, often based on GO, MIPS, etc. Evaluate likelihood of enrichment in clusters –Hypergeometric distribution, etc. –Several tools available

34 Gene Ontology Organization of curated biological knowledge –3 branches: biological process, molecular function, cellular component

35 Hypergeometric Distribution Probability of observing x or more genes in a cluster of n genes with a common annotation –N = total number of genes in genome –M = number of genes with annotation –n = number of genes in cluster –x = number of genes in cluster with annotation Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.) Additional genes in clusters with strong enrichment may be related

36 GO term Enrichment Tools SGD’s & Princeton’s GoTermFinder –http://go.princeton.edu GOLEM (http://function.princeton.edu/GOLEM) HIDRA Sealfon et al., 2006

37 More Unsupervised Methods Search-based approaches –Starting with a query gene/condition, find most related group Singular Value Decomposition (SVD) & Principal Component Analysis (PCA) –Decomposition of data matrix into “patterns” “weights” and “contributions” –Real names are “principal components” “singular values” and “left/right eigenvectors” –Used to remove noise, reduce dimensionality, identify common/dominant signals

38 SVD is the method, PCA is performing SVD on centered data Projects data into another orthonormal basis New basis ordered by variance explained XU  VtVt = SVD (& PCA) Original Data matrix “Eigen-conditions” Singular values “Eigen-genes”

39 SVD

40 SVD: Real Example Demo in TIGR MeV –Spellman et al., 1998 cell cycle time courses alpha-factor sync cdc15 sync

41 DNA arrays / Sequence-based Analysis Methods so far focused on expression data Other uses of microarrays often sequence based: CGH, ChIP-chip, SNP scanner –Data has important, inherent order –Most analysis methods developed from signal processing techniques (e.g. sound) –View data in chromosomal order (karyoscope) Tools: JavaTreeView, IGB, Chippy

42 CGH Example Demo in JavaTreeView

43 (data from Hughes et al. (2000)) Aneuploidy affects expression too rpl20a  rpl20a , Chromosome XV

44 Software Tools JavaTreeView – viz, karyoscope HIDRA – viz, mult. datasets, search Cluster (Eisen lab) – clustering TIGR MeV – clustering, viz IGB – Affy’s CGH browser ChIPpy – ChIP-chip analysis

45 Summary Unsupervised Analysis –Let the data organize itself, find patterns –Clustering: Distance Metric + Algorithm –SVD/PCA – auto find dominant patterns Impute missing values (KNN) CGH – Karyoscope view Questions?


Download ppt "More Microarray Analysis: Unsupervised Approaches Matt Hibbs Troyanskaya Lab."

Similar presentations


Ads by Google