Microarray Clustering

Microarray Clustering
Xiaole Shirley Liu STAT115/215

Profiles Heat map display
Gene expression profile refers to individual rows or columns as profiles Row is profile of a gene Column is profile of a sample / condition Tongji 2009

Clustering Can cluster genes or samples
Genes: have similar expression profile over different samples or conditions Samples: have similar expression profile over all the genes

Clustering Motivation for clustering: Clustering quality:
Visualizing data Understand general characteristics of data Make generalizations about gene behavior Classify samples Clustering quality: “Distance”: 1 – corr Maximize inter (between)-cluster distance Minimize intra (within)-cluster distance

Cluster Stability Mask out some data (e.g. only sample a subset of genes or samples) to perform clustering Introduce a little noise to the data, then perform clustering May select only diff expr genes for clustering, or use genes on the same pathway, etc.

Missing Value Estimation
Missing values affects clustering E.g. when calculating distances Naïve solution: Ignore gene/sample, fill in 0 or Avg K nearest neighbors Gene i has missing value at sample j Calculate corr between gene i and all other genes using all samples except j Find k genes (e.g. 10) with highest corr to i Calculate k genes’ avg expression at sample j

Hierarchical Clustering
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Gene 6 Gene 7 Gene 8

Gene 7 Gene 1 Gene 2 Gene 4 Gene 5 Gene 3 Gene 8 Gene 6

Repeatedly Merge two nodes (either a gene or a cluster) that are closest to each other Re-calculate the distance from newly formed node to all other nodes Branch length represents distance Linkage: distance from newly formed node to all other nodes Tongji 2009

Hierarchical Clustering Linkage
Single Complete Average: pairwise distances

Some Brain Teasers If we have n data points
How many internal nodes are in the H-cluster? How many possible ways to draw the H-cluster Break

Partitional Clustering
Disjoint groups From hierarchical clustering: Cut a line from hierarchical clustering By varying the cut height, we could produce arbitrary number of clusters

K-means Algorithm Choose K centroids at random Expression in Sample1
Iteration = 0

K-means Algorithm Choose K centroids at random
Assign object i to closest centroid Iteration = 1

Assign object i to closest centroid Recalculate centroid based on current cluster assignment Iteration = 2

Assign object i to closest centroid Recalculate centroid based on current cluster assignment Repeat until assignment stabilize Iteration = 3

K-means Algorithm Example
Assign each objects to most similar center 1 2 3 4 5 6 7 8 9 10 Update the cluster means 1 2 3 4 5 6 7 8 9 10 K=2 Arbitrarily choose K object as initial cluster center reassign reassign Update the cluster means

K-means Algorithm Deterministic method
Initial cluster centers important Can be trapped in local optimal How to pick initial cluster centers Run hierarchical cluster, find cut line Random start many times with different initial centers Not tolerant to outliers or noise

Problem With Outliers x1 x2 x3

Partition Around Medoids
Pick one real data point (closest to all) instead of average as the centroid of the cluster More robust in the presence of noise and outliers

How to Pick K K = 2, gradually increase
Improvement: reduce within cluster distance and increase between cluster distance Cost: cost with each increase in K Compare the cost with improvement, stop when not worth it

How to Pick K W(k) = total sum of squares within clusters
B(k) = sum of squares between cluster means n = total number of data points Calinski & Harabasz, 1974 Hartigan, 1975 stop when H(k) < 10

How to Pick K

Practical Considerations
Only cluster genes that are variable across samples The magic of 7! Break

Batch Effect Removal

Motivation Batch Effect: Non-biological variation
Make samples not directly comparable Caused by differences: Array platform Lab protocol or experimenter Time of day or hybridization Atmospheric ozone level (Rhodes et al. 2004)

Example of Batch Effect
Knock down expression using different cells and different siRNA in 3 batches

Batch vs Normalization
Shouldn’t normalization take care of this? NO! Normalization and expression indexing procedures only adjust sample and probe effects Usually these are orthogonal to the batch effects, so batch effects are left in the data

Importance of Experimental Design
Record run date, labs, personnel, environmental variables, etc. When possible, balance groups of interests, at least include some controls in each batch Avoid “perfect confounding” when batch and group are perfectly correlated. E.g. Treatments in one batch, controls in another Treatment 1 in Batch 1, Treatment 2 in Batch 2

Decomposing Data Heterogeneity
Primary variables (T & C) Random variations (Error) Baseline expression Samples Genes =

Decomposing Data Heterogeneity
Primary variables (T & C) Dependence variables (Batch) Independent variations (Error) Baseline expression Samples Genes = Intuitively similar to LIMMA’s linear model, consider batch as some kind of treatment Break

Gene Ontology

Gene Annotation How to report differentially expressed genes or gene clusters? Enriched for certain pathways, certain functions, or proteins localized in the same complex, etc? Gene Ontology Consortium Ashburner et al 1998 Annotate gene function in the human genome Now extended to many model organisms Why do we care? Effectively communicate biomedical knowledge Organize and summarize annotations in structured way Allow effective and meaningful computation on gene annotations

GO Categories Molecular function Biological process Cellular component
Describe gene’s jobs or abilities E.g. transporters, transcription factor Biological process Events or pathways E.g. cell differentiation, maturation, development Cellular component Describe locations (subcellular structures, macromolecular complexes) E.g. nucleus, cell membrane, protein complexes

GO Relationships: Same term could be annotated at multiple branches
Subclass: Is_a Membership: Part_of Topological: adjacent_to; Derivation: derives_from E.g. 5_prime_UTR is part_of a transcript, and mRNA is_a kind of transcript Same term could be annotated at multiple branches Directed acyclic graph

Evaluate Differentially Expressed Genes
NetAffx mapped GO terms for all probesets Whole genome Up genes GO term X Total 20K Statistical significance? Binomial proportional test p = 100 / 20 K = 0.005 Check z table

Evaluate Differentially Expressed Genes
Whole genome Up genes GO term X Total 20K Chi sq test or Fisher’s exact test: Up !Up Total GO: 80 (1) 20 (99) 100 !GO: 120 (199) 20K-120 (19701) 20K-100 Total: K K Check Chi-sq table

GO Tools for Microarray Analysis
Hundreds DAVID

Summary Clustering by distance Missing value estimation
Hierarchical clustering, linkage K-mean clustering, how to determine K Batch effect considerations COMBAT for batch effect removal GO categories, directed acyclic GO enrichment statistics

Microarray Clustering

Similar presentations

Presentation on theme: "Microarray Clustering"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Microarray Clustering

Similar presentations

Presentation on theme: "Microarray Clustering"— Presentation transcript:

Similar presentations

About project

Feedback