MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Armstrong et al., Nature Genetics 30 (2002)
Blank slide/colon data
gene Hsa ' UTR 2a MYOSIN HEAVY CHAIN, NONMUSCLE (Gallus gallus) tumor: mean = 0.73, std = 0.4; normal: mean = 2.41, std = 1.05
histograms HISTOGRAM, BINS OF 0.5
NORMALIZED (FREQUENCIES) mean = 0.73, std = 0.4; mean = 2.41, std = 1.05
t-test T = P = 10 e-14
gene Hsa ' UTR 2a EUKARYOTIC INITIATION FACTOR 4B (Homo sapiens) tumor: mean = std = normal: mean = std =
histograms
NORMALIZED (FREQUENCIES)
t-test T = P = %
gene2000 Hsa.1829 gene 1 Human mRNA fragment for class II histocompatibility antigen beta-chain (pII-beta-4) tumor: normal: mean = std = mean = std = 1.536
histograms
NORMALIZED (FREQUENCIES)
t-test T = P =
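The per-gene t-test on the slides above can be sketched from the summary statistics alone. This is a minimal sketch, not the analysis used for the slides: it assumes Welch's unequal-variance form of the test, it reads the first mean/std pair on the myosin slide as tumor and the second as normal, and the sample sizes `N_TUMOR` and `N_NORMAL` are hypothetical because the slides do not state them.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    var_a = std_a ** 2 / n_a  # squared standard error of mean a
    var_b = std_b ** 2 / n_b  # squared standard error of mean b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df

# Means and stds from the myosin slide; the sample sizes are assumed, not from the slides.
N_NORMAL, N_TUMOR = 22, 40
t, df = welch_t(2.41, 1.05, N_NORMAL, 0.73, 0.4, N_TUMOR)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning `t` and `df` into a p-value needs the Student t distribution, which the standard library does not provide; a statistics package would be used for that step.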
E, C&N_log2E colon data: expression matrix E; log2 E, center, normalize
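The "log2 E, center, normalize" step can be sketched as below. The slide does not say along which axis the centering is done or which norm is used, so this sketch assumes each row is a gene, centers every row to mean zero, and scales it to unit Euclidean norm; the function name `log2_center_normalize` and the toy matrix are illustrative only.

```python
import math

def log2_center_normalize(matrix):
    """Return log2(E) with every row (gene) centered to mean 0 and scaled to unit norm."""
    out = []
    for row in matrix:
        logged = [math.log2(x) for x in row]  # requires strictly positive values
        mean = sum(logged) / len(logged)
        centered = [x - mean for x in logged]
        norm = math.sqrt(sum(x * x for x in centered))
        # A constant row has zero norm; leave it as zeros instead of dividing by zero.
        out.append([x / norm for x in centered] if norm > 0 else centered)
    return out

E = [[1.0, 2.0, 4.0], [8.0, 8.0, 8.0]]  # toy matrix: genes x samples
for row in log2_center_normalize(E):
    print([round(x, 3) for x in row])
```

With rows scaled this way, the dot product of two rows equals their Pearson correlation, which is one common reason for choosing this normalization.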
genes ordered by p-value: 726 genes with p < 0.05, ordered by difference of means (normal - tumor)
after t-test (p < 0.05), ordered by difference of means: genes with p < 0.05, RANDOM DATA
sorted p Q=0.15 I=758
how many out of 726 are false? FDR = 0.14: 726 * 0.14 ≈ 101 false separating genes
how many genes at FDR = 0.05? 516 genes: 516 * 0.05 ≈ 26 false separating genes
26 out of 516 are false
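The "sorted p" plot with a level Q and a cutoff index I reads like the Benjamini-Hochberg step-up procedure, so a minimal sketch of that rule follows. This is an assumption about what the slides used: sort the p-values, find the largest rank i with p_(i) <= i * q / N, and call the first i genes separating. The p-values in the example are made up.

```python
def benjamini_hochberg(p_values, q):
    """Indices of hypotheses rejected by the Benjamini-Hochberg rule at FDR level q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0  # largest 1-based rank with p_(rank) <= rank * q / n
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            cutoff = rank
    return sorted(order[:cutoff])

# Hypothetical p-values, one per gene.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p, q=0.05)
expected_false = len(selected) * 0.05  # same arithmetic as 516 * 0.05 on the slide
print(selected, expected_false)
```

The expected number of false separating genes is the number selected times the FDR level, which is the calculation the slides carry out for 726 and 516 genes.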
random data
100 separating (p < 0.001), 1900 random
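The random-data experiment can be imitated directly at the level of p-values. The sketch below is an illustration under stated assumptions, not the simulation behind the slide: the 100 separating genes get p-values drawn below 0.001, and the 1900 random genes get p-values uniform on [0, 1], which is what a valid test produces when there is no real difference. The seed and the 0.05 threshold are arbitrary choices.

```python
import random

def simulate_p_values(n_true=100, n_null=1900, seed=0):
    """Draw p-values for truly separating genes and for purely random genes."""
    rng = random.Random(seed)
    # Separating genes: p-values below 0.001, as stated on the slide.
    true_p = [rng.uniform(0.0, 0.001) for _ in range(n_true)]
    # Random genes: under the null hypothesis p-values are uniform on [0, 1].
    null_p = [rng.uniform(0.0, 1.0) for _ in range(n_null)]
    return true_p, null_p

true_p, null_p = simulate_p_values()
threshold = 0.05
true_hits = sum(p < threshold for p in true_p)
false_hits = sum(p < threshold for p in null_p)
print(true_hits, false_hits, false_hits / (true_hits + false_hits))
```

About 1900 * 0.05 = 95 random genes are expected to pass p < 0.05, so roughly half of the genes called separating at that threshold would be false in this setup.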
separation (axes E1, E2; ALL vs. MLL): LINE E1 - 2E2 = 0; ONE SIDE E1 - 2E2 < 0, OTHER SIDE E1 - 2E2 > 0
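Classifying a sample by this line only needs the sign of E1 - 2E2. The sketch below shows that test; the sample values are hypothetical, and the slide text does not say which side holds ALL and which holds MLL, so the function returns only the sign.

```python
def side_of_line(e1, e2):
    """Sign of E1 - 2*E2: +1 on one side of the line, -1 on the other, 0 on the line."""
    value = e1 - 2.0 * e2
    return (value > 0) - (value < 0)

samples = [(3.0, 1.0), (1.0, 2.0), (4.0, 2.0)]  # hypothetical (E1, E2) pairs
print([side_of_line(e1, e2) for e1, e2 in samples])  # → [1, -1, 0]
```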
projection 1 (axes E1, E2; ALL vs. MLL; +/- sides of w): PROJECTIONS ON w DO SEPARATE ALL FROM MLL
projection 2 (axes E1, E2; ALL vs. MLL; +/- sides of w): PROJECTIONS ON w DO NOT SEPARATE ALL FROM MLL
projection 3 (axes E1, E2): WELL SEPARATED CENTERS OF MASS, NO SEPARATION OF THE TWO CLOUDS
projection 4 (axes E1, E2): WEAK SEPARATION OF CENTERS OF MASS, GOOD SEPARATION OF THE TWO CLOUDS
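The contrast between projections 3 and 4 is what Fisher's criterion measures: the squared distance between the projected centers of mass divided by the within-class scatter of the projections. The sketch below computes that ratio for two directions; the two point clouds are hypothetical and were chosen so that the centers are farther apart along E1 while the clouds only separate along E2.

```python
import math

def project(points, w):
    """Project 2-D points onto the unit vector along w."""
    norm = math.hypot(w[0], w[1])
    return [(x * w[0] + y * w[1]) / norm for x, y in points]

def fisher_score(class_a, class_b, w):
    """(difference of projected means)^2 / (sum of projected within-class scatters)."""
    a, b = project(class_a, w), project(class_b, w)
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    scatter_a = sum((v - mean_a) ** 2 for v in a)
    scatter_b = sum((v - mean_b) ** 2 for v in b)
    return (mean_a - mean_b) ** 2 / (scatter_a + scatter_b)

# Hypothetical clouds: elongated along E1, thin along E2.
all_cloud = [(-4.0, 0.0), (0.0, 0.1), (4.0, 0.0), (8.0, 0.1)]
mll_cloud = [(-2.0, 1.0), (2.0, 1.1), (6.0, 1.0), (10.0, 1.1)]

score_e1 = fisher_score(all_cloud, mll_cloud, (1.0, 0.0))  # centers 2 apart, clouds overlap
score_e2 = fisher_score(all_cloud, mll_cloud, (0.0, 1.0))  # centers 1 apart, clouds separate
print(score_e1, score_e2)
```

The direction with the smaller distance between centers gets the much larger score, because the scatter along it is tiny.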
Fisher to perceptron (axes E1, E2; ALL vs. MLL): OPTIMAL LINE TO PROJECT ON, FISHER vs. PERCEPTRON
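For the perceptron side of this comparison, a minimal sketch of the classic perceptron learning rule is given below. It is the textbook update (add y * x to the weights whenever a point is misclassified), not necessarily the variant used for the slide, and the six training points and their labels are hypothetical.

```python
def train_perceptron(points, labels, epochs=100):
    """Classic perceptron rule; labels are +1 / -1. Returns (w1, w2, bias)."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:  # misclassified or on the boundary
                w1 += y * x1
                w2 += y * x2
                b += y
                mistakes += 1
        if mistakes == 0:  # every point is strictly on its correct side
            break
    return w1, w2, b

# Hypothetical, linearly separable (E1, E2) samples.
points = [(0.0, 0.0), (1.0, 0.2), (0.2, 1.0), (3.0, 3.0), (4.0, 3.2), (3.2, 4.0)]
labels = [-1, -1, -1, 1, 1, 1]
w1, w2, b = train_perceptron(points, labels)
predictions = [1 if w1 * x1 + w2 * x2 + b > 0 else -1 for x1, x2 in points]
print((w1, w2, b), predictions)
```

Unlike Fisher's criterion, the rule uses only misclassified points and stops at the first line that separates the training data, so the line it finds depends on the order of the samples.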
UNSUPERVISED ANALYSIS: CLUSTERING. GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS. GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.
Giraffe DEFINITION OF THE CLUSTERING PROBLEM
Dendrogram 1: CLUSTER ANALYSIS YIELDS DENDROGRAM; AXIS T (RESOLUTION)
Giraffe + Okapi BUT WHAT ABOUT THE OKAPI?
STATEMENT OF THE PROBLEM: GIVEN DATA POINTS X_i, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA. AIMS: PARTITION THE DATA INTO M CLUSTERS, WITH POINTS OF THE SAME CLUSTER "MORE SIMILAR" (M ALSO TO BE DETERMINED!); GENERATE A DENDROGRAM; IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS. "ILL POSED": WHAT IS "MORE SIMILAR"? RESOLUTION?
Dendrogram 2: CLUSTER ANALYSIS YIELDS DENDROGRAM; AXIS T; LINEAR ORDERING OF DATA, YOUNG TO OLD
CLUSTERING METHODS
AGGLOMERATIVE HIERARCHICAL
- AVERAGE LINKAGE (GENES: EISEN ET AL., PNAS 1998)
CENTROID (REPRESENTATIVE)
- SELF ORGANIZED MAPS (KOHONEN 1997; GENES: GOLUB ET AL., SCIENCE 1999)
- K-MEANS (GENES: TAMAYO ET AL., PNAS 1999)
PHYSICALLY MOTIVATED
- DETERMINISTIC ANNEALING (ROSE ET AL., PRL 1990; GENES: ALON ET AL., PNAS 1999)
- SUPER-PARAMAGNETIC CLUSTERING (SPC) (BLATT ET AL.; GENES: GETZ ET AL., PHYSICA 2000, PNAS 2000)
Agglomerative Hierarchical Clustering. Distance between joined clusters: need to define the distance between the new cluster and the other clusters. Single Linkage: distance between closest pair. Complete Linkage: distance between farthest pair. Average Linkage: average distance between all pairs, or distance between cluster centers. Dendrogram: the dendrogram induces a linear ordering of the data points.
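The three linkage rules and the induced leaf ordering can be sketched in a few lines. This is a deliberately naive illustration that recomputes all pairwise distances at every step, not an efficient or reference implementation; the function names and the four 1-D points are made up, and "average" here means the average over all pairs, not the distance between cluster centers.

```python
import math
from itertools import product

def linkage_distance(a, b, points, method):
    """Distance between clusters a and b (lists of point indices) under a linkage rule."""
    dists = [math.dist(points[i], points[j]) for i, j in product(a, b)]
    if method == "single":
        return min(dists)  # closest pair
    if method == "complete":
        return max(dists)  # farthest pair
    if method == "average":
        return sum(dists) / len(dists)  # average over all pairs
    raise ValueError(f"unknown method: {method}")

def agglomerate(points, method="average"):
    """Greedily merge the two closest clusters until one is left.

    Returns (merges, leaf_order); each merge is (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        pairs = [(linkage_distance(clusters[i], clusters[j], points, method), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]  # concatenation keeps one dendrogram leaf order
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges, clusters[0]

points = [(0.0,), (1.0,), (5.0,), (6.5,)]
merges, order = agglomerate(points, "single")
print(merges[-1][2], order)
```

On this toy data the last merge happens at a different height for each rule, which is the dependence on the distance update method; the leaf order is one of the orderings the dendrogram allows, since each merge can be drawn either way round.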
Hierarchical Clustering - Summary: results depend on distance update method; greedy iterative process; NOT robust against noise; no inherent measure to identify stable clusters.
2 good clouds COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS
2 flat clouds 2 FLAT CLOUDS - SINGLE LINKAGE WORKS
filament SINGLE LINKAGE SENSITIVE TO NOISE
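The filament effect can be reproduced numerically. Single linkage merges clusters in order of the shortest remaining inter-point distance, so the height at which two points first share a cluster can be found by adding edges in sorted order and tracking connected components. The sketch below does this with a small union-find; the two clouds and the four filament points are hypothetical.

```python
import math

def single_linkage_merge_height(points, a, b):
    """Dendrogram height at which points a and b first share a single-linkage cluster."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i in range(len(points)) for j in range(i + 1, len(points)))
    for d, i, j in edges:  # joining in order of distance mirrors single linkage
        parent[find(i)] = find(j)
        if find(a) == find(b):
            return d
    return None

left = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
right = [(6.0, 0.0), (6.0, 1.0), (7.0, 0.0), (7.0, 1.0)]
filament = [(2.0, 0.5), (3.0, 0.5), (4.0, 0.5), (5.0, 0.5)]  # a few noise points

clouds = left + right
print(single_linkage_merge_height(clouds, 0, 4))             # clouds alone
print(single_linkage_merge_height(clouds + filament, 0, 4))  # clouds plus filament
```

Without the filament the two clouds merge only at a distance of 5; with four noise points between them they merge at about 1.12, barely above the spacing inside each cloud, so the dendrogram no longer shows two distinct groups.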
Average linkage. Distance between joined clusters: need to define the distance between the new cluster and the other clusters. Average Linkage: average distance between all pairs. Dendrogram