Published by Ellen Barber; modified over 9 years ago.
1
Basic methodologies of analysis
Supervised analysis: hypothesis testing using clinical information (MLL vs. no translocation) to identify differentiating genes. Supervised methods can only validate or reject hypotheses; they cannot lead to the discovery of unexpected partitions.
Unsupervised analysis: exploratory; no prior knowledge is used. Explore the structure of the data on the basis of correlations and similarities.
2
Aims:
Assign patients to groups on the basis of their expression profiles.
Identify differences between tumors at different stages.
Identify genes that play central roles in disease progression.
Each patient is described by 30,000 numbers: their expression profile.
3
Unsupervised analysis
Goal A: find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process.
Goal B: divide tissues into groups with similar gene expression profiles. These tissues are expected to be in the same biological (clinical) state.
(Clustering, sorting)
4
Definition of the clustering problem [figure: giraffes]
5
Cluster analysis yields a dendrogram [figure: dendrogram; vertical axis T = resolution; leaves give a linear ordering of the data]
6
But what about the okapi? [figure: giraffe + okapi]
7
Statement of the problem
Given data points x_i, i = 1, 2, ..., N, embedded in D-dimensional space, identify the underlying structure of the data.
Aims: partition the data into M clusters, so that points of the same cluster are "more similar"; M is also to be determined. Generate a dendrogram; identify significant, "stable" clusters.
"Ill posed": what is "more similar"? Resolution.
8
Cluster analysis yields a dendrogram [figure: dendrogram; stability = the range of T over which a cluster "lives" ("young" vs. "old" clusters); leaves give a linear ordering of the data]
9
Clustering methods
Centroid (representative) methods:
- Self-organizing maps (Kohonen 1997; genes: Golub et al., Science 1999)
- K-means (genes: Tamayo et al., PNAS 1999)
Agglomerative hierarchical:
- Average linkage (genes: Eisen et al., PNAS 1998)
Physically motivated:
- Deterministic annealing (Rose et al., PRL 1990; genes: Alon et al., PNAS 1999)
- Super-paramagnetic clustering (SPC) (Blatt et al.; genes: Getz et al., Physica 2000, PNAS 2000)
- Coupled maps (Angelini et al., PRL 2000)
10
Clustering methods (cont.)
Information theory:
- Agglomerative information bottleneck (Tishby et al.)
Linear algebra:
- Spectral methods (Malik et al.)
Multigrid-based methods (Brandt et al.)
11
Centroid methods – K-means
Data points i = 1, ..., N, at positions x_i; centroids α = 1, ..., K, at positions y_α.
Assign data point i to centroid α: S_i = α.
Cost E: E(S_1, S_2, ..., S_N; y_1, ..., y_K) = Σ_i |x_i − y_{S_i}|²
Minimize E over the assignments S_i and the centroid positions y_α.
12
K-means: "guess" K = 3
13
K-means Iteration = 0 Start with random positions of centroids.
14
K-means Iteration = 1 Start with random positions of centroids. Assign each data point to closest centroid
15
K-means Iteration = 1 Start with random positions of centroids. Assign each data point to closest centroid Move centroids to center of assigned points
16
K-means: an algorithm to find minima. Iteration = 3.
Start with random positions of centroids.
Assign each data point to the closest centroid.
Move centroids to the center of the assigned points.
Iterate until the cost stops decreasing (a local minimum).
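The iteration described on these slides can be sketched in a few lines of numpy. This is a minimal illustrative implementation (the function name `kmeans` and its parameters are mine, not from the slides):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: the K-means iteration described on the slides."""
    rng = np.random.default_rng(seed)
    # start with random positions of centroids (here: k distinct data points)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each data point to the closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the center of its assigned points
        new_centroids = np.array([
            points[labels == a].mean(axis=0) if np.any(labels == a) else centroids[a]
            for a in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # cost can no longer decrease
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration only decreases the cost E, so the loop converges, but only to a local minimum that depends on the initial centroids, as the summary slide below notes.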
17
[figure: E = total sum of squares vs. K]
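The E-vs-K ("elbow") curve shown here can be reproduced with a short script. This sketch uses scikit-learn's `KMeans` (my choice of library; the slides name none) on synthetic data with three true clusters:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated clouds of 50 points each
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                  for c in (0.0, 5.0, 10.0)])
# inertia_ is exactly E, the total sum of squared distances to centroids
costs = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
         for k in range(1, 7)}
# E drops sharply until K reaches the true number of clusters (3 here),
# then flattens: the "elbow" heuristic for choosing K
```

The elbow is a heuristic, not a criterion: E always decreases with K, so one looks for the K past which the decrease becomes marginal.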
18
K-means – summary
- Result depends on the initial centroid positions.
- Fast algorithm: compute distances from data points to centroids, O(N) operations (vs. O(N²) for all pairs).
- Must preset K.
- Fails for non-spherical distributions.
19
Agglomerative hierarchical clustering [figure: example dendrogram over points 1–5]
Initially, each point is a cluster; at each step, merge the pair of nearest clusters. The distance between the joined clusters is recorded as the merge height; the dendrogram induces a linear ordering of the data points.
Need to define the distance between the new (merged) cluster and the other clusters:
- Single linkage: distance between the closest pair.
- Complete linkage: distance between the farthest pair.
- Average linkage: average distance between all pairs, or distance between cluster centers.
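The merge loop and the three linkage rules above can be sketched directly (a naive O(N³) version for clarity; the function name `agglomerate` is mine):

```python
import numpy as np

def agglomerate(points, n_clusters, link="average"):
    # initially, each point is its own cluster
    clusters = [[i] for i in range(len(points))]
    D = np.linalg.norm(points[:, None] - points[None, :], axis=2)

    def cluster_dist(a, b):
        pair = D[np.ix_(a, b)]          # all pairwise distances between a and b
        if link == "single":
            return pair.min()           # closest pair
        if link == "complete":
            return pair.max()           # farthest pair
        return pair.mean()              # average linkage: mean over all pairs

    # at each step, merge the pair of nearest clusters
    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

In practice one would use an optimized library (e.g. `scipy.cluster.hierarchy.linkage`), which also records the merge heights needed to draw the dendrogram.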
21
Compact, well-separated clouds – everything works.
22
Two flat clouds – single linkage works.
23
Average linkage (same data).
25
Filament: single linkage is sensitive to noise.
27
Filament with one point removed: single linkage is sensitive to noise.
28
Hierarchical clustering – summary
- Results depend on the distance-update method.
- Greedy iterative process; NOT robust against noise.
- No inherent measure to identify stable clusters.
- Average linkage is the most widely used clustering method in gene expression analysis.
29
Example: breast cancer gene expression (Nature, 2002).
30
Super-paramagnetic clustering (SPC) – toy problem. How many clusters? 3 large, many small.
31
Other methods
32
Graph-based clustering
Undirected graph: vertices (nodes), edges with weights J_ij, and a cut [figure].
33
Graph-based clustering (cont.)
i = 1, 2, ..., N data points = vertices (nodes) of the graph.
J_ij – weight associated with edge (i, j); J_ij depends on the distance D_ij [figure: edge J_5,8 between nodes 5 and 8; J_ij vs. D_ij].
A cut in the graph represents a clustering solution (partition).
34
Cost of a cut (i.e., of a partition) = sum of the weights of all cut edges.
A high-cost cut corresponds to high resolution (many edges cut, many small clusters); a low-cost cut to low resolution [figure: cut edges marked].
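The cost of a cut can be made concrete with a small sketch. The Gaussian weight J_ij = exp(−(D_ij/scale)²) is my own assumption; the slides only require that J_ij depends on D_ij:

```python
import numpy as np

def cut_cost(points, labels, scale=1.0):
    # J_ij: edge weight decaying with distance (one common choice)
    D = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    J = np.exp(-(D / scale) ** 2)
    np.fill_diagonal(J, 0.0)            # no self-edges
    # an edge (i, j) is cut when its endpoints fall in different clusters
    cut = labels[:, None] != labels[None, :]
    return J[cut].sum() / 2.0           # each undirected edge counted once
```

Cutting only long (low-weight) edges is cheap; cutting short (high-weight) edges is expensive, which is exactly why a sensible partition separates distant groups of points.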
35
Highest cost = sum of all edge weights: each point is its own cluster. Lowest cost = 0: a single cluster. Conclusion: straightforward minimization/maximization of the cost is meaningless.
36
Clustering: the SPC spirit (M. Blatt, S. Wiseman and E. Domany, 1996)
SPC's idea: consider ALL cuts, i.e., partitions {S}. Each partition appears with probability p({S}). Measure the correlation between points i, j connected by an edge, over all partitions:
C_ij = probability that the edge i–j was NOT cut.
[figure: four partitions {S}_1, ..., {S}_4 with probabilities p({S}_1), ..., p({S}_4); the edge i–j is cut only in {S}_1]
C_ij = p({S}_2) + p({S}_3) + p({S}_4)
37
Clustering: the SPC spirit (cont.)
We have a graph whose edge values are the correlations C_ij [figure: example graph with edge values 0.2, 0.45, 0.7, 0.75, 0.8, 0.85, 0.9, 1]. Create the clustering solution by deleting the edges for which C_ij < 0.5; the connected components that remain are the clusters.
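The thresholding step is simple enough to sketch directly: drop the weak edges, then read off the connected components (the function name and the plain depth-first search are mine):

```python
import numpy as np

def clusters_from_correlations(C, threshold=0.5):
    """Delete edges with C_ij < threshold; the remaining connected
    components of the graph are the clusters (depth-first search)."""
    C = np.asarray(C)
    n = len(C)
    keep = C >= threshold
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        comp, stack = [], [start]
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if j != i and keep[i, j] and j not in seen:
                    seen.add(j)
                    stack.append(j)
        components.append(sorted(comp))
    return components
```

Note that the output depends only on which correlations exceed the threshold, not on their exact values, which is part of what makes the procedure robust.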
38
What is p({S})?
The cost of {S}, H({S}), corresponds to the resolution. It sounds reasonable to find a solution for each value E of the cost/resolution: fix H = E, and generate partitions for which H({S}) = E.
p({S}) = 1 / (number of partitions with H({S}) = E), if H({S}) = E; 0 otherwise.
39
What is p({S})? (cont.)
For computational reasons it is easier to generate partitions with an AVERAGE cost E: instead of finding partitions with H({S}) = E, find partitions with ⟨H⟩ = E.
p({S}) = exp[−H({S})/T] / Z — the Boltzmann distribution, where the temperature T is the resolution parameter.
40
Outline of SPC
Map the data to a graph G. Then go over resolutions T (minT to maxT in steps of deltaT):
- Generate thousands (Cycles) of partitions with an average cost that corresponds to the current resolution.
- Calculate the pair correlations C_ij(T).
- Clusters(T): connected components of the edges with C_ij > 0.5.
41
Super-paramagnetic clustering (SPC). Example: N = 4800 points in D = 2.
42
Output of SPC
- Size of the largest clusters as a function of T.
- Dendrogram: stable clusters "live" over a large range of T.
- A function χ(T) that peaks when stable clusters break.
43
Identify the stable clusters
44
Same data – average linkage. Examining this cluster: no analog to χ(T).
45
Advantages of SPC
- Scans all resolutions (T).
- Robust against noise and initialization – calculates collective correlations.
- Identifies "natural" (χ) and stable (large range of T) clusters.
- No need to pre-specify the number of clusters.
- Clusters can be of any shape.
- Can use a distance matrix as input (instead of coordinates).