UNSUPERVISED ANALYSIS. GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS. GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE. CLUSTERING
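As a rough illustration of what "correlated expression profiles" could mean in practice, the sketch below computes the Pearson correlation between gene profiles. The slides do not name a specific similarity measure, so Pearson correlation is an assumption here, and the gene names and values are hypothetical toy data.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical toy data: one row per gene, one column per tissue.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],  # same shape as gene_a, rescaled
    "gene_c": [4.0, 3.0, 2.0, 1.0],  # opposite trend to gene_a
}

r_ab = pearson(expression["gene_a"], expression["gene_b"])
r_ac = pearson(expression["gene_a"], expression["gene_c"])
print(r_ab, r_ac)
```

For Goal B the same function could be applied to columns (tissues) instead of rows (genes).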
Giraffe DEFINITION OF THE CLUSTERING PROBLEM
CLUSTER ANALYSIS YIELDS DENDROGRAM Dendrogram1 T (RESOLUTION)
Giraffe + Okapi BUT WHAT ABOUT THE OKAPI?
STATEMENT OF THE PROBLEM: GIVEN DATA POINTS X_i, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA. AIMS: PARTITION THE DATA INTO M CLUSTERS, POINTS OF THE SAME CLUSTER BEING "MORE SIMILAR" (M ALSO TO BE DETERMINED!); GENERATE A DENDROGRAM; IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS. "ILL POSED": WHAT IS "MORE SIMILAR"? RESOLUTION.
CLUSTER ANALYSIS YIELDS DENDROGRAM Dendrogram2 T LINEAR ORDERING OF DATA YOUNG OLD
CLUSTERING METHODS
AGGLOMERATIVE HIERARCHICAL
- AVERAGE LINKAGE (GENES: EISEN ET AL., PNAS 1998)
CENTROID (REPRESENTATIVE)
- SELF-ORGANIZING MAPS (KOHONEN 1997; GENES: GOLUB ET AL., SCIENCE 1999)
- K-MEANS (GENES: TAMAYO ET AL., PNAS 1999)
PHYSICALLY MOTIVATED
- DETERMINISTIC ANNEALING (ROSE ET AL., PRL 1990; GENES: ALON ET AL., PNAS 1999)
- SUPER-PARAMAGNETIC CLUSTERING (SPC) (BLATT ET AL.; GENES: GETZ ET AL., PHYSICA 2000, PNAS 2000)
Agglomerative Hierarchical Clustering. Distance between joined clusters: need to define the distance between the new cluster and the other clusters. Single Linkage: distance between closest pair. Complete Linkage: distance between farthest pair. Average Linkage: average distance between all pairs, or distance between cluster centers. Dendrogram: the dendrogram induces a linear ordering of the data points.
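The three linkage rules can be sketched in a few lines of plain Python. This is a minimal, unoptimized illustration assuming Euclidean distance (the slides do not fix a metric); the function names and the 1-D toy points are hypothetical. The list of merge distances corresponds to the heights at which branches join in the dendrogram, and the leaf order of the final cluster is one linear ordering induced by it.

```python
import math
from itertools import product

def single_linkage(a, b):
    """Distance between the closest pair, one point from each cluster."""
    return min(math.dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b):
    """Distance between the farthest pair, one point from each cluster."""
    return max(math.dist(p, q) for p, q in product(a, b))

def average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def agglomerate(points, linkage):
    """Greedily merge the two closest clusters until one remains.

    Returns the merge distances and the leaf order of the final cluster.
    """
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage([points[k] for k in clusters[i]],
                            [points[k] for k in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return heights, clusters[0]

# Hypothetical toy data: five points on a line.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
single_heights, single_order = agglomerate(points, single_linkage)
complete_heights, _ = agglomerate(points, complete_linkage)
average_heights, _ = agglomerate(points, average_linkage)
print(single_heights, complete_heights, average_heights, single_order)
```

On this toy set the first two merges agree across all three rules, and only the later merge distances differ, which is one way to see that results depend on the distance update method.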
Hierarchical Clustering - Summary: Results depend on distance update method. Greedy iterative process. NOT robust against noise. No inherent measure to identify stable clusters.
2 good clouds: COMPACT, WELL-SEPARATED CLOUDS - EVERYTHING WORKS
2 flat clouds: 2 FLAT CLOUDS - SINGLE LINKAGE WORKS
filament: SINGLE LINKAGE IS SENSITIVE TO NOISE
Average linkage. Distance between joined clusters: need to define the distance between the new cluster and the other clusters. Average Linkage: average distance between all pairs. Mean Linkage: distance between centroids. Dendrogram.
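Average linkage and mean (centroid) linkage generally give different numbers for the same pair of clusters. The sketch below compares the two on a small example, assuming Euclidean distance; the function names and the two toy clusters are hypothetical.

```python
import math
from itertools import product

def average_linkage(a, b):
    """Average distance over all cross-cluster pairs."""
    return sum(math.dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def mean_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical toy clusters: two vertical pairs, 3 units apart.
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 0.0), (3.0, 2.0)]

avg_d = average_linkage(cluster_a, cluster_b)
mean_d = mean_linkage(cluster_a, cluster_b)
print(avg_d, mean_d)
```

Here the centroids are exactly 3 units apart, while the average over all pairs is larger because the diagonal pairs are farther apart than the centroids.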
nature 2002 breast cancer
STATEMENT OF THE PROBLEM: GIVEN DATA POINTS X_i, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA. AIMS: PARTITION THE DATA INTO M CLUSTERS, POINTS OF THE SAME CLUSTER BEING "MORE SIMILAR" (M ALSO TO BE DETERMINED!); GENERATE A DENDROGRAM; IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS. "ILL POSED": WHAT IS "MORE SIMILAR"? RESOLUTION.
how many clusters? 3 LARGE, MANY small (SPC). toy problem SPC
other methods
Centroid methods - K-means: PARTITIONS THE DATA POINTS INTO K SUBSETS. FINDS POSITIONS OF K CENTROIDS. DATA POINTS ARE ASSIGNED TO THE CLOSEST CENTROID. FINDS A LOCAL MINIMUM OF THE COST: SUM OF SQUARED DISTANCES BETWEEN DATA POINTS AND THEIR ASSOCIATED CENTROID. CLUSTERS ARE CONVEX AND COMPACT.
K-means, Iteration = 0: Start with random positions of centroids.
K-means, Iteration = 1: Start with random positions of centroids. Assign data points to centroids.
K-means, Iteration = 1: Start with random positions of centroids. Assign data points to centroids. Move centroids to center of assigned points.
K-means, Iteration = 3: Start with random positions of centroids. Assign data points to centroids. Move centroids to center of assigned points. Iterate until the cost is minimal.
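The assign/move/iterate loop can be sketched as follows. This is a minimal plain-Python illustration, not a reference implementation: the function names and toy points are hypothetical, the initial centroids are passed in explicitly (rather than drawn at random) so the run is reproducible, and the loop stops when the assignments no longer change.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged to a local minimum of the cost
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, centroids, cost

# Hypothetical toy data: two compact clouds; both initial centroids start in the first one.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, cost = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
print(labels, centroids, cost)
```

With these starting positions the second centroid is first pulled toward the far cloud, and the assignments settle by the third pass. A different starting configuration can end in a different local minimum.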
K-means - Summary: Result depends on initial centroids' positions. Fast algorithm: compute distances from data points to centroids. Must preset K. Fails for non-spherical distributions.
TSS vs K
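One common way to pick K is to run K-means for several values of K and look at how the cost drops. The sketch below assumes that "TSS" stands for the total sum of squared distances to the assigned centroid (the K-means cost); that reading, the 1-D toy values, the function name, and the evenly spaced initialization are all assumptions made for illustration.

```python
def kmeans_tss_1d(values, k, max_iter=100):
    """Run 1-D K-means and return the total sum of squared distances.

    Initial centroids are taken at evenly spaced positions in the (sorted)
    input, a hypothetical choice that keeps the run deterministic.
    """
    n = len(values)
    centroids = [float(values[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return sum((v - centroids[lab]) ** 2 for v, lab in zip(values, labels))

# Hypothetical toy data: three well-separated groups on a line, already sorted.
values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
tss = {k: kmeans_tss_1d(values, k) for k in range(1, 5)}
for k, cost in tss.items():
    print(k, cost)
```

On this toy set the cost falls steeply up to K = 3 and only marginally afterwards, which is the kind of "elbow" such a plot is usually read for.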
Irises: Iris setosa, Iris versicolor, Iris virginica. 50 specimens from each group. 4 numbers for each flower. 150 data points in 4-dimensional space.
150 points in d=4: 3 large clusters
Output of SPC Stable clusters “live” for large T
Choosing a value for T
Same data - Average Linkage No analog for
Same data - Average Linkage Examining this cluster