Microarray Clustering Xiaole Shirley Liu STAT115/215
Profiles Heat map display Gene expression profile refers to individual rows or columns as profiles Row is profile of a gene Column is profile of a sample / condition Tongji 2009
Clustering Can cluster genes or samples Genes: have similar expression profile over different samples or conditions Samples: have similar expression profile over all the genes
Clustering Motivation for clustering: Visualizing data; Understand general characteristics of data; Make generalizations about gene behavior; Classify samples. Clustering quality: “Distance”: 1 – corr; Maximize inter (between)-cluster distance; Minimize intra (within)-cluster distance
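The "1 – corr" distance and the intra/inter-cluster criterion can be sketched as follows. This is a minimal illustration, assuming Pearson correlation and made-up toy profiles; the function names are hypothetical and not from the lecture.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_distance(x, y):
    """Distance = 1 - corr: 0 for perfectly correlated profiles, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)

def mean_intra_inter(profiles, labels):
    """Average pairwise distance within clusters and between clusters."""
    intra, inter = [], []
    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            d = corr_distance(profiles[i], profiles[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy data (invented): two rising genes and two falling genes over four samples.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 5.9, 8.0],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 3.9, 2.0],
]
labels = [0, 0, 1, 1]
intra, inter = mean_intra_inter(profiles, labels)
print(intra, inter)
```

A good clustering should give a small intra value and a large inter value.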
Cluster Stability Mask out some data (e.g. only sample a subset of genes or samples) to perform clustering Introduce a little noise to the data, then perform clustering May select only diff expr genes for clustering, or use genes on the same pathway, etc.
Missing Value Estimation Missing values affect clustering, e.g. when calculating distances. Naïve solutions: ignore the gene/sample, or fill in 0 or the average. K nearest neighbors: Gene i has a missing value at sample j; Calculate corr between gene i and all other genes using all samples except j; Find the k genes (e.g. 10) with highest corr to i; Use the k genes’ avg expression at sample j as the estimate
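The K-nearest-neighbor steps above can be sketched as follows. This is a simplified illustration, assuming gene i has only one missing value (marked `None`) and the other genes are complete outside sample j; `knn_impute` and the toy matrix are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (assumes non-zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def knn_impute(data, i, j, k=10):
    """Estimate data[i][j] from the k genes most correlated with gene i."""
    # Use every sample except j when comparing profiles.
    cols = [c for c in range(len(data[i])) if c != j]
    target = [data[i][c] for c in cols]
    scored = []
    for g, row in enumerate(data):
        if g == i or row[j] is None:
            continue  # skip gene i itself and genes that also lack sample j
        scored.append((pearson(target, [row[c] for c in cols]), g))
    scored.sort(reverse=True)  # highest correlation first
    neighbors = [g for _, g in scored[:k]]
    # Average the neighbors' expression at sample j.
    return sum(data[g][j] for g in neighbors) / len(neighbors)

# Toy matrix (invented): rows are genes, columns are samples.
data = [
    [1.0, 2.0, 3.0, None],  # gene 0 is missing sample 3
    [1.1, 2.1, 2.9, 4.0],
    [0.9, 1.9, 3.2, 4.2],
    [4.0, 3.0, 2.0, 1.0],
]
estimate = knn_impute(data, i=0, j=3, k=2)
print(estimate)
```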
Hierarchical Clustering [Figure: dendrogram built step by step over Gene 1 to Gene 8; leaf order after clustering: Gene 7, Gene 1, Gene 2, Gene 4, Gene 5, Gene 3, Gene 8, Gene 6]
Hierarchical Clustering Repeatedly: Merge the two nodes (either a gene or a cluster) that are closest to each other; Re-calculate the distance from the newly formed node to all other nodes. Branch length represents distance. Linkage: how the distance from a newly formed node to all other nodes is defined
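Hierarchical Clustering Linkage Single: minimum of the pairwise distances between two clusters; Complete: maximum of the pairwise distances; Average: mean of the pairwise distances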
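A naive sketch of the merge loop with the three linkage rules is shown below. It assumes a precomputed symmetric distance matrix and uses invented 1-D toy points; it is written for clarity rather than speed, and the function names are hypothetical.

```python
def linkage_distance(a, b, dist, method):
    """Distance between clusters a and b (tuples of item indices)."""
    pairwise = [dist[i][j] for i in a for j in b]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + method)

def hierarchical(dist, method="average"):
    """Repeatedly merge the two closest nodes; return the list of merges."""
    clusters = [(i,) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage_distance(clusters[x], clusters[y], dist, method)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((clusters[x], clusters[y], d))  # d is the branch height
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return merges

# Toy example (invented): four points on a line, distance = absolute difference.
points = [0.0, 1.0, 5.0, 6.5]
dist = [[abs(p - q) for q in points] for p in points]
for method in ("single", "complete", "average"):
    print(method, hierarchical(dist, method))
```

The first two merges are the same under all three rules here; only the height of the final merge differs, which is where the linkage choice shows up.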
Some Brain Teasers If we have n data points: How many internal nodes are in the H-cluster? How many possible ways are there to draw the H-cluster? Break
Partitional Clustering Disjoint groups From hierarchical clustering: Cut the tree with a horizontal line; By varying the cut height, we can produce an arbitrary number of clusters
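Cutting the tree at a given height can be sketched by stopping the merge loop once the closest pair is farther apart than the cut. This assumes average linkage (whose merge heights never decrease, so stopping early matches cutting the full tree); the function names and toy points are hypothetical.

```python
def average_link(a, b, dist):
    """Mean pairwise distance between clusters a and b (lists of indices)."""
    return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

def clusters_at_height(dist, height):
    """Return the disjoint groups obtained by cutting the tree at `height`."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        d, x, y = min(
            (average_link(clusters[x], clusters[y], dist), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > height:
            break  # every remaining merge lies above the cut line
        merged = clusters[x] + clusters[y]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Toy example (invented): four points on a line.
points = [0.0, 1.0, 5.0, 6.5]
dist = [[abs(p - q) for q in points] for p in points]
for h in (0.5, 1.2, 2.0, 10.0):
    print(h, clusters_at_height(dist, h))
```

Raising the cut height from 0.5 to 10.0 moves from four singleton clusters down to one cluster.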
K-means Algorithm Choose K centroids at random; Assign each object i to the closest centroid; Recalculate each centroid based on the current cluster assignment; Repeat until the assignments stabilize [Figure: scatter plot with axis "Expression in Sample1", shown at iterations 0 to 3]
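The four steps can be sketched in plain Python as below. This is a minimal illustration, assuming squared Euclidean distance; the optional `init` argument, the seed, and the toy points are additions for reproducibility and are not part of the lecture.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, init=None, seed=0, max_iter=100):
    """Return (assignment, centroids) after the assignments stabilize."""
    rng = random.Random(seed)
    # Step 1: choose K centroids at random (or use the supplied ones).
    centroids = [list(c) for c in (init if init is not None else rng.sample(points, k))]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate each centroid from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy data (invented): two well-separated groups in two samples.
points = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Different seeds can give different label numbering and, on harder data, different final clusters.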
K-means Algorithm Example (K=2) Arbitrarily choose K objects as initial cluster centers; Assign each object to the most similar center; Update the cluster means; Reassign; Update the cluster means; Reassign [Figure: scatter plots with both axes running from 1 to 10, one panel per step]
K-means Algorithm Deterministic once the initial centers are fixed, so the initial cluster centers are important; Can be trapped in a local optimum. How to pick initial cluster centers: Run hierarchical clustering and find a cut line; Random start many times with different initial centers. Not tolerant to outliers or noise
Problem With Outliers x1 x2 x3
Partition Around Medoids Pick one real data point (closest to all) instead of average as the centroid of the cluster More robust in the presence of noise and outliers
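The difference between a mean centroid and a medoid can be sketched as follows. This shows only the medoid choice for one cluster, not the full PAM search over swaps; the toy cluster with one outlier is invented.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise average; need not coincide with any real data point."""
    return [sum(col) / len(points) for col in zip(*points)]

def medoid(points):
    """The real data point with the smallest total distance to all others."""
    return min(points, key=lambda p: sum(euclid(p, q) for q in points))

# Toy cluster (invented): four nearby points plus one outlier.
cluster = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0], [10.0, 10.0]]
print("mean:  ", mean_point(cluster))  # pulled toward the outlier
print("medoid:", medoid(cluster))      # stays among the four nearby points
```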
How to Pick K K = 2, gradually increase Improvement: reduce within cluster distance and increase between cluster distance Cost: cost with each increase in K Compare the cost with improvement, stop when not worth it
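How to Pick K W(k) = total sum of squares within clusters; B(k) = sum of squares between cluster means; n = total number of data points. Calinski & Harabasz, 1974: choose k to maximize CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)]. Hartigan, 1975: H(k) = [W(k) / W(k+1) - 1] (n - k - 1); stop when H(k) < 10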
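Both indices can be computed from a clustering as sketched below. The formulas are the standard published forms, CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)] and H(k) = [W(k)/W(k+1) - 1](n-k-1), and B(k) is assumed here to weight each cluster mean by its cluster size; the function names and toy points are hypothetical.

```python
def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_ss(points, labels):
    """W(k): squared distances from each point to its own cluster mean."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        center = centroid(members)
        total += sum(sq_dist(p, center) for p in members)
    return total

def between_ss(points, labels):
    """B(k): size-weighted squared distances from cluster means to the grand mean."""
    grand = centroid(points)
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        total += len(members) * sq_dist(centroid(members), grand)
    return total

def calinski_harabasz(points, labels):
    n, k = len(points), len(set(labels))
    return (between_ss(points, labels) / (k - 1)) / (within_ss(points, labels) / (n - k))

def hartigan(points, labels_k, labels_k1):
    """H(k), comparing a k-cluster solution with a (k+1)-cluster solution."""
    n, k = len(points), len(set(labels_k))
    return (within_ss(points, labels_k) / within_ss(points, labels_k1) - 1.0) * (n - k - 1)

# Toy data (invented): two tight pairs on a line.
points = [[0.0], [1.0], [10.0], [11.0]]
one_cluster = [0, 0, 0, 0]
two_clusters = [0, 0, 1, 1]
print(calinski_harabasz(points, two_clusters))
print(hartigan(points, one_cluster, two_clusters))
```

Here H(1) is far above 10, so going from one cluster to two is worth the cost.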
Practical Considerations Only cluster genes that are variable across samples The magic of 7! Break
Batch Effect Removal
Motivation Batch Effect: Non-biological variation that makes samples not directly comparable. Caused by differences in: Array platform; Lab protocol or experimenter; Time of day or hybridization; Atmospheric ozone level (Rhodes et al. 2004)
Example of Batch Effect Knock down expression using different cells and different siRNA in 3 batches
Batch vs Normalization Shouldn’t normalization take care of this? NO! Normalization and expression indexing procedures only adjust sample and probe effects Usually these are orthogonal to the batch effects, so batch effects are left in the data
Importance of Experimental Design Record run date, labs, personnel, environmental variables, etc. When possible, balance the groups of interest across batches, or at least include some controls in each batch. Avoid “perfect confounding”, where batch and group are perfectly correlated, e.g. Treatments in one batch, controls in another; Treatment 1 in Batch 1, Treatment 2 in Batch 2
Decomposing Data Heterogeneity Expression data (Genes x Samples) = Baseline expression + Primary variables (T & C) + Random variations (Error)
Decomposing Data Heterogeneity Expression data (Genes x Samples) = Baseline expression + Primary variables (T & C) + Dependence variables (Batch) + Independent variations (Error). Intuitively similar to LIMMA’s linear model, consider batch as some kind of treatment Break
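The additive decomposition can be illustrated for a single gene by estimating each batch's shift and subtracting it. This is a deliberately simple sketch, assuming a balanced design (equal numbers of treatments and controls in every batch) and a purely additive batch term; it is not a full batch-correction method, and the function name and toy values are hypothetical.

```python
def remove_batch_means(values, batches):
    """Subtract each batch's mean shift (batch mean minus grand mean) for one gene."""
    grand = sum(values) / len(values)
    adjusted = list(values)
    for b in set(batches):
        idx = [i for i, x in enumerate(batches) if x == b]
        shift = sum(values[i] for i in idx) / len(idx) - grand
        for i in idx:
            adjusted[i] = values[i] - shift
    return adjusted

# Toy gene (invented): baseline 5, treatment effect +2, batch 2 adds +3.
groups  = ["C", "C", "T", "T", "C", "C", "T", "T"]
batches = [1, 1, 1, 1, 2, 2, 2, 2]
values  = [5.0, 5.0, 7.0, 7.0, 8.0, 8.0, 10.0, 10.0]

adjusted = remove_batch_means(values, batches)
print(adjusted)
```

After adjustment the two batches agree and the treatment effect of 2 is preserved. With perfect confounding (all treatments in one batch), the same subtraction would remove the treatment effect as well.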
Gene Ontology
Gene Annotation How to report differentially expressed genes or gene clusters? Enriched for certain pathways, certain functions, or proteins localized in the same complex, etc? Gene Ontology Consortium Ashburner et al 1998 Annotate gene function in the human genome Now extended to many model organisms Why do we care? Effectively communicate biomedical knowledge Organize and summarize annotations in structured way Allow effective and meaningful computation on gene annotations
GO Categories Molecular function: Describe gene’s jobs or abilities, e.g. transporters, transcription factor. Biological process: Events or pathways, e.g. cell differentiation, maturation, development. Cellular component: Describe locations (subcellular structures, macromolecular complexes), e.g. nucleus, cell membrane, protein complexes
GO
GO Relationships: Subclass: Is_a; Membership: Part_of; Topological: adjacent_to; Derivation: derives_from. E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript. Same term could be annotated at multiple branches: Directed acyclic graph
Evaluate Differentially Expressed Genes NetAffx mapped GO terms for all probesets. Counts (whole genome / up genes): GO term X: 100 / 80; Total: 20K / 200. Statistical significance? Binomial proportion test: background p = 100 / 20K = 0.005, compared with the observed 80 / 200; Check z table
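The proportion test on these counts can be sketched as a one-sample z-test, using the normal approximation that the "check z table" step implies. The function name is hypothetical, and the one-sided p-value is an assumption about which tail is of interest.

```python
import math

def proportion_z_test(hits, draws, p0):
    """One-sample z-test of an observed proportion against background p0."""
    p_hat = hits / draws
    se = math.sqrt(p0 * (1 - p0) / draws)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return z, p_value

p0 = 100 / 20000  # background rate of GO term X in the whole genome
z, p_value = proportion_z_test(hits=80, draws=200, p0=p0)
print(z, p_value)
```

The expected count under the null is only 200 x 0.005 = 1, so the normal approximation is rough here, which is one reason to prefer an exact test on small expected counts.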
Evaluate Differentially Expressed Genes Counts (whole genome / up genes): GO term X: 100 / 80; Total: 20K / 200. Chi sq test or Fisher’s exact test, observed (expected):
        Up          !Up                Total
GO:     80 (1)      20 (99)            100
!GO:    120 (199)   20K-220 (19701)    20K-100
Total:  200         20K-200            20K
Check Chi-sq table
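Both tests on the 2x2 table can be sketched with the standard library alone. The expected counts come from row total x column total / grand total, and the exact test is computed here as a one-sided hypergeometric tail; function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math

def expected_counts(table):
    """Expected cell counts under independence."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]

def chi_square(table):
    """Pearson chi-square statistic (1 degree of freedom for a 2x2 table)."""
    exp = expected_counts(table)
    return sum(
        (table[i][j] - exp[i][j]) ** 2 / exp[i][j]
        for i in range(len(table))
        for j in range(len(table[0]))
    )

def fisher_upper_tail(go_up, go_total, up_total, genome_total):
    """P(at least go_up of the up genes carry the GO term), hypergeometric."""
    denom = math.comb(genome_total, up_total)
    numer = sum(
        math.comb(go_total, x) * math.comb(genome_total - go_total, up_total - x)
        for x in range(go_up, min(go_total, up_total) + 1)
    )
    return numer / denom  # exact integers; converted to float only at the end

# Rows: GO, !GO. Columns: Up, !Up.
table = [[80, 20], [120, 19780]]
print(expected_counts(table))
print(chi_square(table))
print(fisher_upper_tail(80, 100, 200, 20000))
```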
GO Tools for Microarray Analysis Hundreds of tools are listed at http://neurolex.org/wiki/Category:Resource:Gene_Ontology_Tools, e.g. DAVID
Summary Clustering by distance; Missing value estimation; Hierarchical clustering, linkage; K-means clustering, how to determine K; Batch effect considerations; ComBat for batch effect removal; GO categories, directed acyclic graph; GO enrichment statistics