UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL.


1 UNSUPERVISED ANALYSIS: CLUSTERING
GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL PROCESS.
GOAL B: DIVIDE TISSUES INTO GROUPS WITH SIMILAR GENE EXPRESSION PROFILES. THESE TISSUES ARE EXPECTED TO BE IN THE SAME BIOLOGICAL (CLINICAL) STATE.

2 Giraffe DEFINITION OF THE CLUSTERING PROBLEM

3 CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; axis: T (RESOLUTION)]

4 Giraffe + Okapi BUT WHAT ABOUT THE OKAPI?

5 STATEMENT OF THE PROBLEM
GIVEN DATA POINTS X_i, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.
AIMS:
PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR". M ALSO TO BE DETERMINED!
GENERATE DENDROGRAM, IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL POSED": WHAT IS "MORE SIMILAR"? RESOLUTION

6 CLUSTER ANALYSIS YIELDS DENDROGRAM
[Figure: dendrogram; axis: T; LINEAR ORDERING OF DATA, labeled YOUNG and OLD]

7 CLUSTERING METHODS
AGGLOMERATIVE HIERARCHICAL
– AVERAGE LINKAGE (GENES: EISEN ET AL., PNAS 1998)
CENTROID (REPRESENTATIVE)
– SELF-ORGANIZING MAPS (KOHONEN 1997; GENES: GOLUB ET AL., SCIENCE 1999)
– K-MEANS (GENES: TAMAYO ET AL., PNAS 1999)
PHYSICALLY MOTIVATED
– DETERMINISTIC ANNEALING (ROSE ET AL., PRL 1990; GENES: ALON ET AL., PNAS 1999)
– SUPER-PARAMAGNETIC CLUSTERING (SPC) (BLATT ET AL.; GENES: GETZ ET AL., PHYSICA 2000, PNAS 2000)

8 Agglomerative Hierarchical Clustering
[Figure: dendrogram over data points 1-5; vertical axis: distance between joined clusters]
Need to define the distance between the new cluster and the other clusters.
Single Linkage: distance between closest pair.
Complete Linkage: distance between farthest pair.
Average Linkage: average distance between all pairs or distance between cluster centers
The dendrogram induces a linear ordering of the data points
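The linkage rules above can be sketched in a few lines of pure Python. This is a naive illustration, not the implementation used for the slides: the function name `agglomerate`, the Euclidean distance, and the toy points are all assumptions made for the example. Average linkage is taken here in its "average over all pairs" sense.

```python
from itertools import combinations
import math

def agglomerate(points, linkage="average"):
    """Naive agglomerative clustering (illustrative sketch).

    Returns the list of merges as (members_a, members_b, distance),
    where members are tuples of point indices.
    """
    def cluster_distance(a, b):
        # All pairwise Euclidean distances between the two clusters.
        pair_distances = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":      # closest pair
            return min(pair_distances)
        if linkage == "complete":    # farthest pair
            return max(pair_distances)
        if linkage == "average":     # average over all pairs
            return sum(pair_distances) / len(pair_distances)
        raise ValueError(f"unknown linkage: {linkage}")

    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Greedy step: join the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=lambda pair: cluster_distance(*pair))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical toy data: two tight pairs and one outlying point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0), (11.0, 0.0)]
merges = agglomerate(points, linkage="single")

# Concatenating the two sides of the last merge gives one linear
# ordering of the leaves induced by the dendrogram.
leaf_order = merges[-1][0] + merges[-1][1]
```

Switching `linkage` to `"complete"` or `"average"` changes the merge heights, which is the dependence on the distance update method noted in the summary slide.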

9 Hierarchical Clustering - Summary
Results depend on distance update method
Greedy iterative process
NOT robust against noise
No inherent measure to identify stable clusters

10 2 good clouds COMPACT WELL SEPARATED CLOUDS – EVERYTHING WORKS

11 2 flat clouds 2 FLAT CLOUDS - SINGLE LINKAGE WORKS

12 filament SINGLE LINKAGE SENSITIVE TO NOISE

13 start here

14 Average linkage
[Figure: dendrogram over data points 1-5; vertical axis: distance between joined clusters]
Need to define the distance between the new cluster and the other clusters.
Average Linkage: average distance between all pairs
Mean Linkage: distance between centroids
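Written out, the two definitions on this slide differ as follows. The symbols are assumptions introduced for illustration: A and B are two clusters, d is the chosen point-to-point distance, and \mu_A, \mu_B are the cluster centroids.

```latex
% Average linkage: mean of all pairwise distances between the two clusters
d_{\mathrm{avg}}(A, B) = \frac{1}{|A|\,|B|} \sum_{i \in A} \sum_{j \in B} d(x_i, x_j)

% Mean (centroid) linkage: distance between the two cluster centroids
d_{\mathrm{mean}}(A, B) = d(\mu_A, \mu_B),
\qquad
\mu_A = \frac{1}{|A|} \sum_{i \in A} x_i
```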

15 nature 2002 breast cancer

16

17 STATEMENT OF THE PROBLEM
GIVEN DATA POINTS X_i, i = 1, 2, ..., N, EMBEDDED IN D-DIMENSIONAL SPACE, IDENTIFY THE UNDERLYING STRUCTURE OF THE DATA.
AIMS:
PARTITION THE DATA INTO M CLUSTERS, POINTS OF SAME CLUSTER - "MORE SIMILAR". M ALSO TO BE DETERMINED!
GENERATE DENDROGRAM, IDENTIFY SIGNIFICANT, "STABLE" CLUSTERS
"ILL POSED": WHAT IS "MORE SIMILAR"? RESOLUTION

18 How many clusters? 3 LARGE, MANY small (SPC toy problem)

19 other methods

20 Centroid methods – K-means
PARTITIONS THE DATA POINTS INTO K SUBSETS
FINDS POSITION OF K CENTROIDS
DATA POINTS ARE ASSIGNED TO THE CLOSEST CENTROID
FINDS LOCAL MINIMA OF COST: SUM OF SQUARED DISTANCES BETWEEN DATA POINTS AND THEIR ASSOCIATED CENTROID.
CLUSTERS ARE CONVEX AND COMPACT
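The cost described in words above can be written as a formula. The notation is an assumption added for illustration: C_k is the set of points assigned to centroid k, and \mu_k is that centroid's position.

```latex
E = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^{2},
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i
```

For a fixed assignment, placing each centroid at the mean of its points minimizes E, which is why the update step moves centroids to the center of their assigned points.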

21 K-means Iteration = 0 Start with random positions of centroids.

22 K-means Iteration = 1 Start with random positions of centroids. Assign data points to centroids

23 K-means Iteration = 1 Start with random positions of centroids. Assign data points to centroids Move centroids to center of assigned points

24 K-means Iteration = 3 Start with random positions of centroids. Assign data points to centroids Move centroids to center of assigned points Iterate till minimal cost
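The iteration shown on these slides can be sketched as below. This is a minimal illustration under stated assumptions, not the code behind the slides: the function name `kmeans` and the toy points are hypothetical, the initial centroids are drawn from the data points themselves (one common way to get "random positions"), and the loop stops when the assignment no longer changes.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means (illustrative sketch). Returns (centroids, assignment, cost)."""
    rng = random.Random(seed)
    # Assumption: random start = k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign every data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the cost cannot decrease further
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Cost: sum of squared distances to the associated centroid.
    cost = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return centroids, assignment, cost

# Hypothetical toy data: two compact, well separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment, cost = kmeans(points, k=2)
```

On less cleanly separated data, different seeds can end in different local minima, so in practice the run is usually repeated from several starts and the lowest-cost result kept.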

25 K-means - Summary
Result depends on initial centroids' position
Fast algorithm: compute distances from data points to centroids
Must preset K
Fails for non-spherical distributions

26 TSS vs K
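One way to produce such a curve is sketched below, assuming that TSS here denotes the total sum of squared distances between data points and their centroids (the K-means cost); the slide does not spell this out. The helper `kmeans_cost`, the 1-D toy values, and the number of restarts are hypothetical choices for illustration.

```python
import random

def kmeans_cost(values, k, restarts=20, seed=0):
    """Lowest K-means cost found over several random starts, for 1-D data (sketch)."""
    rng = random.Random(seed)
    best = float("inf")
    for _ in range(restarts):
        centroids = rng.sample(values, k)  # assumption: start from k data points
        for _ in range(100):
            # Assign each value to its nearest centroid.
            groups = [[] for _ in range(k)]
            for v in values:
                nearest = min(range(k), key=lambda c: (v - centroids[c]) ** 2)
                groups[nearest].append(v)
            # Move centroids to the mean of their group (empty groups stay put).
            updated = [sum(g) / len(g) if g else centroids[c]
                       for c, g in enumerate(groups)]
            if updated == centroids:
                break
            centroids = updated
        cost = sum(min((v - c) ** 2 for c in centroids) for v in values)
        best = min(best, cost)
    return best

# Hypothetical toy data with three visible groups.
values = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
tss_by_k = {k: kmeans_cost(values, k) for k in range(1, 6)}
for k, tss in tss_by_k.items():
    print(k, round(tss, 3))
```

The cost keeps shrinking as K grows, so the curve alone cannot fix K; what is usually looked for is the value of K beyond which the decrease becomes small.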

27 Irises: Iris setosa, Iris versicolor, Iris virginica
50 specimens from each group
4 numbers for each flower
150 data points in 4-dimensional space

28 150 points in d=4; 3 large clusters

29 Output of SPC
Stable clusters "live" for large ΔT

30 Choosing a value for T

31 Same data - Average Linkage
No analog for χ

32 Same data - Average Linkage Examining this cluster

33

34

