An Overview of Clustering Methods Michael D. Kane, Ph.D.

Topics
What is clustering?
Clustering mechanics (how the computer does it).
Parameter choices and their effect.
Examples.

What is clustering? Grouping by similarity.

Similar genes. Group genes that have similar expression profiles when observed over multiple samples.
[Figure: genes-by-samples expression matrix, gene clustering.]

Similar samples. Group samples that are similar when observed over multiple genes.
[Figure: genes-by-samples expression matrix, sample clustering.]

Why cluster?
Similar gene expression suggests common biology.
The function of uncharacterized genes may be deduced from co-expression with known genes.
Associate expression patterns with response to environmental change and with disease pathology/progression.

Clustering Mechanics
For gene clustering, we must measure similarity between genes.
[Figure: genes a to f plotted by their expression in two experiments, E1 and E2.]

Distance (similarity) measure
Euclidean distance: in the E1/E2 plot, d_be is the distance between genes b and e, whose plotted coordinates are (4.6, 0.5) and (1.0, 1.7).
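As a worked example of the Euclidean distance on this slide, the sketch below computes the distance between the two plotted points. The function name euclidean and the variable names are illustrative and not from the slides, and the slide does not say which coordinate pair belongs to gene b and which to gene e.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Coordinates copied from the slide; the gene-to-point mapping is not stated there.
point_1 = (4.6, 0.5)
point_2 = (1.0, 1.7)

d_be = euclidean(point_1, point_2)
print(round(d_be, 2))  # 3.79
```

The same function works unchanged for profiles measured over more than two experiments.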

Distance measure
Pearson correlation: the similarity S ranges from -1 to +1.
Used in “Eisen” clustering.
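A minimal sketch of the Pearson correlation between two expression profiles is given below. The function name pearson and the example profiles are made up for illustration. The point it tries to show is that two genes with the same shape of response but different magnitudes score near +1, while a mirrored profile scores near -1.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance term and the two standard-deviation terms (unnormalized).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical profiles over four samples.
gene_up = [1.0, 2.0, 3.0, 4.0]
gene_up_stronger = [2.0, 4.0, 6.0, 8.0]   # same shape, twice the magnitude
gene_down = [4.0, 3.0, 2.0, 1.0]          # mirrored shape

s_same = pearson(gene_up, gene_up_stronger)   # approximately +1
s_opposite = pearson(gene_up, gene_down)      # approximately -1
```

Note that this sketch fails with a division by zero for a completely flat profile, which has no variation to correlate.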

Hierarchical Clustering
[Figure: genes a to f in the E1/E2 plot, joined into a tree with leaves a, b, c, d, e, f.]

Measuring distance between clusters
Single linkage: the minimum distance between clusters. May form loose, “chained” clusters.
Complete linkage: the maximum distance between clusters. Tends to form compact clusters.
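The two linkage rules can be sketched as follows, assuming Euclidean distance between members. The function names and the two toy clusters are illustrative, not from the slides.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(cluster_a, cluster_b):
    """Distance between the closest pair of members, one from each cluster."""
    return min(euclidean(p, q) for p in cluster_a for q in cluster_b)

def complete_linkage(cluster_a, cluster_b):
    """Distance between the farthest pair of members, one from each cluster."""
    return max(euclidean(p, q) for p in cluster_a for q in cluster_b)

# Two hypothetical clusters lying on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)      # 2.0, from (1, 0) to (3, 0)
d_complete = complete_linkage(cluster_a, cluster_b)  # 5.0, from (0, 0) to (5, 0)
```

Because single linkage looks only at the nearest pair, one close neighbor is enough to join two clusters, which is how chains form.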

Methods for joining clusters
UPGMA, the unweighted pair group method (average linkage): the average distance between clusters, taken over all pairs of members.
Weighted pair group method (WPGMA): same as UPGMA, but when two clusters merge, each contributes equally to the new distances regardless of its size. Consider it when clusters are expected to be significantly uneven in size.
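A minimal sketch of bottom-up (agglomerative) clustering with average linkage is shown below, assuming Euclidean distance. The function names, the stopping rule (a target number of clusters), and the example points are assumptions made for illustration. Averaging over all member pairs, as done here, corresponds to UPGMA.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_linkage(cluster_a, cluster_b):
    """Mean distance over every pair of members, one from each cluster."""
    total = sum(euclidean(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points, n_clusters):
    """Start with one cluster per point and merge the closest pair until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical two-experiment profiles.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
three = agglomerate(points, 3)
two = agglomerate(points, 2)
```

Recording the order and distance of each merge, instead of stopping at a fixed count, would give the information needed to draw the tree.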

Effect of distance measure
[Figure: clustering results for Euclidean distance with single linkage and with complete linkage.]

Effect of distance measure
[Figure: clustering results for Euclidean distance with UPGMA and with Ward’s method.]

Alternatives to hierarchical clustering
k-means: the number of clusters is specified by the user. Good when prior knowledge is available.

k-means clustering
1. Number of clusters specified by user.
2. Genes randomly assigned to clusters.
3. Assess inter- and intra-cluster similarity.
4. Move genes to an alternative cluster if distance is reduced.
5. Repeat steps 3 and 4.
[Figure: genes a to f in the E1/E2 plot, regrouped at each iteration.]
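The steps above can be sketched as follows. This is one common way to realize them, not necessarily the exact procedure behind the slides: similarity is assumed to be Euclidean distance to each cluster's mean profile, the loop stops when no gene changes cluster, and the function name, seed, and example data are made up.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_profile(members):
    return tuple(sum(values) / len(members) for values in zip(*members))

def k_means(genes, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: random assignment, dealt round-robin so no cluster starts empty.
    order = list(range(len(genes)))
    rng.shuffle(order)
    labels = [0] * len(genes)
    for position, index in enumerate(order):
        labels[index] = position % k
    for _ in range(max_iter):
        # Step 3: summarize each cluster by its mean profile.
        centers = []
        for c in range(k):
            members = [g for g, label in zip(genes, labels) if label == c]
            # If a cluster has emptied, restart it at a randomly chosen gene.
            centers.append(mean_profile(members) if members else rng.choice(genes))
        # Step 4: move each gene to the cluster whose center is closest.
        new_labels = [min(range(k), key=lambda c: euclidean(g, centers[c])) for g in genes]
        if new_labels == labels:
            break  # no gene moved
        labels = new_labels
    return labels

# Two hypothetical, well-separated groups of genes.
genes = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels = k_means(genes, k=2)
```

On less clearly separated data the result can depend on the random starting assignment, so running with several seeds and comparing is a reasonable precaution.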

Alternatives to hierarchical clustering
Self-organizing maps (SOM): the number of clusters is specified by the user. Good when prior knowledge is available.

SOM
1. User-specified number of clusters. Each is initially given a random expression representation.
2. For a gene, find the most similar cluster representation.
3. Increase the similarity by adjusting the cluster representation (“training”).
4. Iteratively train the cluster representations.
5. After training, assign each gene to the most similar cluster.
[Figure: genes a to f in the E1/E2 plot, with expression representations for cluster 1, cluster 2, and cluster 3.]
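A minimal sketch of this training loop is given below, assuming a one-dimensional map, Euclidean distance as the similarity measure, and a learning rate and neighborhood radius that shrink linearly over training. The neighborhood update (nodes next to the winner are also nudged) is part of standard SOM training but is not spelled out on the slides. All names, parameter values, and data are illustrative.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(gene, reps):
    """Index of the cluster representation most similar to the gene."""
    return min(range(len(reps)), key=lambda c: euclidean(gene, reps[c]))

def train_som(genes, n_clusters, epochs=200, seed=0):
    rng = random.Random(seed)
    dim = len(genes[0])
    # Each cluster starts with a random expression representation.
    reps = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_clusters)]
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        rate = 0.5 * remaining
        radius = max(n_clusters / 2.0 * remaining, 0.01)
        for gene in rng.sample(genes, len(genes)):  # visit genes in random order
            winner = nearest(gene, reps)
            for c in range(n_clusters):
                # The winner moves most; map neighbors move less as the radius shrinks.
                influence = math.exp(-((c - winner) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    reps[c][d] += rate * influence * (gene[d] - reps[c][d])
    return reps

def assign(genes, reps):
    """After training, give each gene the index of its most similar cluster."""
    return [nearest(gene, reps) for gene in genes]

# Hypothetical two-experiment profiles.
genes = [(2.0, 2.0), (2.1, 1.9), (-2.0, -2.0), (-1.9, -2.1), (2.0, -2.0), (2.1, -2.1)]
reps = train_som(genes, n_clusters=3)
labels = assign(genes, reps)
```

The fixed seed only makes the run repeatable; a different seed can give a different but equally valid arrangement of clusters.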

Gene clustering
Eisen et al., Cluster analysis and display of genome-wide expression patterns. PNAS 95, 14863-14868, 1998.
24-hour time course after re-introduction of serum to serum-deprived human fibroblasts.
Pearson correlation, average linkage.
[Figure: clustered expression data with labeled clusters: cholesterol biosynthesis, cell cycle, immediate-early response, signaling, wound healing.]

Sample clustering
Ross et al., Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24, 227-235, 2000.
64 cancer cell lines clustered. 8,000 genes.
Clustering performed with 2 different subsets of genes. Similar results.
Pearson correlation, average linkage.
Note the breast cancer cell lines derived from the same patient.

Summary
Different methods often provide different clusters.
No overall “best” clustering method.
Clustering applied to unrelated data will still provide clusters.
Use biological insight in method selection and interpretation.
