Presentation is loading. Please wait.

Presentation is loading. Please wait.

Microarray Clustering

Similar presentations


Presentation on theme: "Microarray Clustering"— Presentation transcript:

1 Microarray Clustering
Xiaole Shirley Liu STAT115/215

2 Profiles Heat map display
Gene expression profile refers to individual rows or columns as profiles Row is profile of a gene Column is profile of a sample / condition Tongji 2009

3 Clustering Can cluster genes or samples
Genes: have similar expression profile over different samples or conditions Samples: have similar expression profile over all the genes

4 Clustering Motivation for clustering: Clustering quality:
Visualizing data Understand general characteristics of data Make generalizations about gene behavior Classify samples Clustering quality: “Distance”: 1 – corr Maximize inter (between)-cluster distance Minimize intra (within)-cluster distance

5 Cluster Stability Mask out some data (e.g. only sample a subset of genes or samples) to perform clustering Introduce a little noise to the data, then perform clustering May select only diff expr genes for clustering, or use genes on the same pathway, etc.

6 Missing Value Estimation
Missing values affects clustering E.g. when calculating distances Naïve solution: Ignore gene/sample, fill in 0 or Avg K nearest neighbors Gene i has missing value at sample j Calculate corr between gene i and all other genes using all samples except j Find k genes (e.g. 10) with highest corr to i Calculate k genes’ avg expression at sample j

7 Hierarchical Clustering
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8

8 Hierarchical Clustering
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8

9 Hierarchical Clustering
Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6

10 Hierarchical Clustering
Repeatedly Merge two nodes (either a gene or a cluster) that are closest to each other Re-calculate the distance from newly formed node to all other nodes Branch length represents distance Linkage: distance from newly formed node to all other nodes Tongji 2009

11 Hierarchical Clustering Linkage
Single Complete Average: pairwise distances

12 Some Brain Teasers If we have n data points
How many internal nodes are in the H-cluster? How many possible ways to draw the H-cluster Break

13 Partitional Clustering
Disjoint groups From hierarchical clustering: Cut a line from hierarchical clustering By varying the cut height, we could produce arbitrary number of clusters

14 K-means Algorithm Choose K centroids at random Expression in Sample1
Iteration = 0

15 K-means Algorithm Choose K centroids at random
Assign object i to closest centroid Iteration = 1

16 K-means Algorithm Choose K centroids at random
Assign object i to closest centroid Recalculate centroid based on current cluster assignment Iteration = 2

17 K-means Algorithm Choose K centroids at random
Assign object i to closest centroid Recalculate centroid based on current cluster assignment Repeat until assignment stabilize Iteration = 3

18 K-means Algorithm Example
Assign each objects to most similar center 1 2 3 4 5 6 7 8 9 10 Update the cluster means 1 2 3 4 5 6 7 8 9 10 K=2 Arbitrarily choose K object as initial cluster center reassign reassign Update the cluster means

19 K-means Algorithm Deterministic method
Initial cluster centers important Can be trapped in local optimal How to pick initial cluster centers Run hierarchical cluster, find cut line Random start many times with different initial centers Not tolerant to outliers or noise

20 Problem With Outliers x1 x2 x3

21 Partition Around Medoids
Pick one real data point (closest to all) instead of average as the centroid of the cluster More robust in the presence of noise and outliers

22 How to Pick K K = 2, gradually increase
Improvement: reduce within cluster distance and increase between cluster distance Cost: cost with each increase in K Compare the cost with improvement, stop when not worth it

23 How to Pick K W(k) = total sum of squares within clusters
B(k) = sum of squares between cluster means n = total number of data points Calinski & Harabasz, 1974 Hartigan, 1975 stop when H(k) < 10

24 How to Pick K

25 Practical Considerations
Only cluster genes that are variable across samples The magic of 7! Break

26 Batch Effect Removal

27 Motivation Batch Effect: Non-biological variation
Make samples not directly comparable Caused by differences: Array platform Lab protocol or experimenter Time of day or hybridization Atmospheric ozone level (Rhodes et al. 2004)

28 Example of Batch Effect
Knock down expression using different cells and different siRNA in 3 batches

29 Batch vs Normalization
Shouldn’t normalization take care of this? NO! Normalization and expression indexing procedures only adjust sample and probe effects Usually these are orthogonal to the batch effects, so batch effects are left in the data

30 Importance of Experimental Design
Record run date, labs, personnel, environmental variables, etc. When possible, balance groups of interests, at least include some controls in each batch Avoid “perfect confounding” when batch and group are perfectly correlated. E.g. Treatments in one batch, controls in another Treatment 1 in Batch 1, Treatment 2 in Batch 2

31 Decomposing Data Heterogeneity
Primary variables (T & C) Random variations (Error) Baseline expression Samples Genes =

32 Decomposing Data Heterogeneity
Primary variables (T & C) Dependence variables (Batch) Independent variations (Error) Baseline expression Samples Genes = Intuitively similar to LIMMA’s linear model, consider batch as some kind of treatment Break

33 Gene Ontology

34 Gene Annotation How to report differentially expressed genes or gene clusters? Enriched for certain pathways, certain functions, or proteins localized in the same complex, etc? Gene Ontology Consortium Ashburner et al 1998 Annotate gene function in the human genome Now extended to many model organisms Why do we care? Effectively communicate biomedical knowledge Organize and summarize annotations in structured way Allow effective and meaningful computation on gene annotations

35 GO Categories Molecular function Biological process Cellular component
Describe gene’s jobs or abilities E.g. transporters, transcription factor Biological process Events or pathways E.g. cell differentiation, maturation, development Cellular component Describe locations (subcellular structures, macromolecular complexes) E.g. nucleus, cell membrane, protein complexes

36 GO

37 GO Relationships: Same term could be annotated at multiple branches
Subclass: Is_a Membership: Part_of Topological: adjacent_to; Derivation: derives_from E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript Same term could be annotated at multiple branches Directed acyclic graph

38 Evaluate Differentially Expressed Genes
NetAffx mapped GO terms for all probesets Whole genome Up genes GO term X Total 20K Statistical significance? Binomial proportional test p = 100 / 20 K = 0.005 Check z table

39 Evaluate Differentially Expressed Genes
Whole genome Up genes GO term X Total 20K Chi sq test or Fisher’s exact test: Up !Up Total GO: 80 (1) 20 (99) 100 !GO: 120 (199) 20K-120 (19701) 20K-100 Total: K K Check Chi-sq table

40 GO Tools for Microarray Analysis
Hundreds DAVID

41 Summary Clustering by distance Missing value estimation
Hierarchical clustering, linkage K-mean clustering, how to determine K Batch effect considerations COMBAT for batch effect removal GO categories, directed acyclic GO enrichment statistics


Download ppt "Microarray Clustering"

Similar presentations


Ads by Google