Basic methodologies of analysis
- Supervised analysis: hypothesis testing using clinical information (e.g. MLL translocation vs. no translocation) to identify differentiating genes.
- Supervised methods can only validate or reject hypotheses; they cannot lead to the discovery of unexpected partitions.
- Unsupervised analysis: exploratory; no prior knowledge is used; the structure of the data is explored on the basis of correlations and similarities.
Aims
- Assign patients to groups on the basis of their expression profiles.
- Identify differences between tumors at different stages.
- Identify genes that play central roles in disease progression.
Each patient is described by ~30,000 numbers: their expression profile.
Unsupervised analysis
- Goal A: find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process.
- Goal B: divide the tissues into groups with similar gene expression profiles. These tissues are expected to be in the same biological (clinical) state.
Keywords: clustering, sorting.
Definition of the clustering problem. [figure: giraffes]
Cluster analysis yields a dendrogram. [figure: dendrogram; the vertical axis is T (the resolution); the leaves give a linear ordering of the data]
But what about the okapi? [figure: giraffes plus an okapi]
Statement of the problem
Given data points x_i, i = 1, 2, ..., N, embedded in D-dimensional space, identify the underlying structure of the data.
Aims:
- Partition the data into M clusters, so that points of the same cluster are "more similar" (M is also to be determined!).
- Generate a dendrogram; identify significant, "stable" clusters.
The problem is "ill posed": what does "more similar" mean? The answer depends on the resolution.
Cluster analysis yields a dendrogram. [figure: dendrogram vs. T, illustrating stability (the range of T over which a cluster lives) and the linear ordering of the data; leaf groups labelled "young" and "old"]
Clustering methods
- Centroid (representative) methods:
  - Self-organizing maps (Kohonen 1997; genes: Golub et al., Science 1999)
  - K-means (genes: Tamayo et al., PNAS 1999)
- Agglomerative hierarchical:
  - Average linkage (genes: Eisen et al., PNAS 1998)
- Physically motivated:
  - Deterministic annealing (Rose et al., PRL 1990; genes: Alon et al., PNAS 1999)
  - Superparamagnetic clustering (SPC) (Blatt et al.; genes: Getz et al., Physica 2000, PNAS 2000)
  - Coupled maps (Angelini et al., PRL 2000)
Clustering methods (cont.)
- Information theory:
  - Agglomerative information bottleneck (Tishby et al.)
- Linear algebra:
  - Spectral methods (Malik et al.)
- Multigrid-based methods (Brandt et al.)
Centroid methods – K-means
- i = 1, ..., N data points, at x_i
- α = 1, ..., K centroids, at y_α
- Assign data point i to centroid α: s_i = α
- Cost E: E(s_1, s_2, ..., s_N; y_1, ..., y_K) = Σ_{i=1}^{N} |x_i − y_{s_i}|²
- Minimize E over the assignments s_i and the centroid positions y_α
K-means: "guess" K = 3. [figure: data points]
K-means, iteration 0: start with random positions of the centroids.
K-means, iteration 1: assign each data point to the closest centroid.
K-means, iteration 1 (cont.): move each centroid to the center of its assigned points.
K-means is an algorithm to find minima of E. Iteration 3: repeat the assign/move steps until the cost stops decreasing (a local minimum).
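The assign/move iteration above can be written as a minimal NumPy sketch (illustrative only; not code from any of the cited papers):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimize E = sum_i |x_i - y_{s_i}|^2 by alternating two steps."""
    rng = np.random.default_rng(seed)
    # Start with random positions of centroids (here: k random data points).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each data point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the center of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # cost can no longer decrease
            break
        centroids = new
    cost = ((X - centroids[labels]) ** 2).sum()   # E = total sum of squares
    return labels, centroids, cost
```

Because the result depends on the initial centroid positions, a common remedy is to restart from several random seeds and keep the lowest-cost solution.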
[figure: E = total sum of squares as a function of K]
K-means – summary
- The result depends on the initial centroid positions.
- Fast algorithm: computes distances from data points to centroids only, O(N) operations per iteration (vs. O(N²) for all-pairs methods).
- K must be preset.
- Fails for non-spherical distributions.
Agglomerative hierarchical clustering
- Initially, each point is its own cluster.
- At each step, merge the pair of nearest clusters.
- The distance at which clusters are joined defines the dendrogram; the dendrogram induces a linear ordering of the data points.
- We need to define the distance between the newly merged cluster and the other clusters:
  - Single linkage: distance between the closest pair.
  - Complete linkage: distance between the farthest pair.
  - Average linkage: average distance between all pairs, or the distance between cluster centers.
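The three linkage rules can be made concrete with a naive sketch (for exposition only; practical implementations update the distance matrix incrementally instead of rescanning all pairs):

```python
import numpy as np

def hierarchical(X, linkage="average"):
    """Agglomerative clustering: start with singletons, repeatedly merge the
    nearest pair of clusters; return the merge distances (dendrogram heights)."""
    clusters = {i: [i] for i in range(len(X))}
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)   # pairwise distances
    merges = []
    while len(clusters) > 1:
        keys, best = list(clusters), None
        for ai, a in enumerate(keys):
            for b in keys[ai + 1:]:
                pd = D[np.ix_(clusters[a], clusters[b])]
                if linkage == "single":       # distance between closest pair
                    d = pd.min()
                elif linkage == "complete":   # distance between farthest pair
                    d = pd.max()
                else:                         # average distance over all pairs
                    d = pd.mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
            # greedy: always merge the globally nearest pair of clusters
        d, a, b = best
        merges.append((d, (a, b)))
        clusters[a] += clusters[b]
        del clusters[b]
    return merges
```

On two tight pairs of points far apart, the first merges happen at the small within-pair distance and the final merge at the large between-pair distance; single and average linkage differ only in that final height.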
Compact, well-separated clouds: everything works. [figure]
Two flat clouds: single linkage works. [figure]
[figure: average linkage on the same two flat clouds]
Single linkage is sensitive to noise. [figure: filament]
[figure: the same filament with one point removed]
Hierarchical clustering – summary
- Results depend on the distance-update (linkage) method.
- Greedy iterative process; NOT robust against noise.
- No inherent measure to identify stable clusters.
- Average linkage is the most widely used clustering method in gene expression analysis.
[figure: breast cancer expression data, Nature 2002]
Superparamagnetic clustering (SPC): a toy problem. How many clusters are there? Three large ones and many small ones. [figure]
Other methods. [figure: results on the same toy problem]
Graph-based clustering
An undirected graph consists of vertices (nodes) and edges; a cut separates the nodes into groups. [figure: graph with an edge of weight J_ij between nodes i and j]
Graph-based clustering (cont.)
- The i = 1, 2, ..., N data points are the vertices (nodes) of the graph.
- J_ij is the weight associated with edge (i, j); J_ij depends on the distance D_ij.
- A cut in the graph represents a clustering solution (partition).
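One common way to turn distances into edge weights is a Gaussian kernel on a k-nearest-neighbour graph. The kernel shape and the scale `a` below are illustrative assumptions; the slides only require that J_ij decreases with D_ij:

```python
import numpy as np

def edge_weights(X, k=2):
    """J_ij = exp(-D_ij^2 / (2 a^2)) on a k-nearest-neighbour graph,
    with a = average k-NN distance (one conventional choice of scale)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]         # k nearest neighbours (self excluded)
    a = D[np.arange(len(X))[:, None], nn].mean()   # characteristic length scale
    J = np.zeros_like(D)
    for i, neigh in enumerate(nn):
        J[i, neigh] = np.exp(-D[i, neigh] ** 2 / (2 * a ** 2))
    return np.maximum(J, J.T)                      # symmetrize: undirected graph
```

By construction, edges between nearby points carry weight close to 1 and edges between distant points carry weight close to 0, which is what makes the cut cost meaningful.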
The cost of a cut, i.e. of a partition, is the sum of the weights of all cut edges. [figure: two cuts of the same graph, one of high cost (high resolution) and one of low cost (low resolution)]
- Highest cost = sum of all edge weights: each point is its own cluster.
- Lowest cost = 0: one cluster.
- Conclusion: straightforward minimization or maximization of the cost is meaningless.
Clustering: the SPC spirit (M. Blatt, S. Wiseman and E. Domany, 1996)
SPC's idea: consider ALL cuts, i.e. all partitions {S}. Each partition appears with probability p({S}). Measure the correlation between points i, j connected by an edge, over all partitions:
C_ij = probability that the edge i–j was NOT cut.
[figure: four partitions {S}_1, ..., {S}_4 with probabilities p({S}_1), ..., p({S}_4); the edge i–j is cut only in {S}_1, so C_ij = p({S}_2) + p({S}_3) + p({S}_4)]
Clustering: the SPC spirit (cont.)
We now have a graph whose edge values are the correlations C_ij. Create the clustering solution by deleting every edge for which C_ij < 0.5.
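Reading off the clusters after deleting the weak edges is just a connected-components computation; a plain-Python sketch:

```python
def clusters_from_correlations(C, threshold=0.5):
    """Keep only edges with C_ij > threshold; return a component label per point.
    C is the symmetric matrix of pair correlations (P(edge i-j not cut))."""
    n = len(C)
    labels = [-1] * n          # -1 = not yet visited
    comp = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # flood-fill a new connected component from `start`
        stack = [start]
        labels[start] = comp
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and C[i][j] > threshold:
                    labels[j] = comp
                    stack.append(j)
        comp += 1
    return labels
```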
What is p({S})?
- The cost of {S} is H({S}); it corresponds to the resolution.
- It sounds reasonable to find a solution for each value E of the cost/resolution: fix H = E and generate partitions for which H({S}) = E:

p({S}) = 1 / (number of partitions with H = E)   if H({S}) = E
p({S}) = 0                                        otherwise
What is p({S})? (cont.)
For computational reasons it is easier to generate partitions with an AVERAGE cost E: instead of finding partitions with H({S}) = E, find partitions with ⟨H⟩ = E:

p({S}) = exp[−H({S})/T] / Z   (the Boltzmann distribution)

T, the temperature, is the resolution parameter.
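A toy calculation shows how T acts as the resolution knob: at low T the Boltzmann distribution concentrates on the cheapest partitions, at high T all partitions are nearly equally likely. The four costs below are made up purely for illustration:

```python
import numpy as np

def boltzmann(costs, T):
    """p({S}) proportional to exp(-H({S})/T); return probabilities and <H>."""
    w = np.exp(-np.asarray(costs, dtype=float) / T)
    p = w / w.sum()                       # normalize: divide by Z
    return p, float((p * costs).sum())    # <H> = sum_S p({S}) H({S})

costs = [0.0, 1.0, 2.0, 3.0]              # H({S}) for four hypothetical partitions
_, e_low = boltzmann(costs, T=0.1)        # low T: only cheap partitions matter
_, e_high = boltzmann(costs, T=100.0)     # high T: nearly uniform over partitions
```

Scanning T therefore scans the average cost ⟨H⟩, i.e. the resolution.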
Outline of SPC
1. Map the data to a graph G.
2. Go over resolutions T (from minT to maxT in steps of deltaT):
   - Generate thousands (cycles) of partitions with the average cost that corresponds to the current resolution.
   - Calculate the pair correlations C_ij(T).
   - Clusters(T): connected components of the graph with edges C_ij(T) > 0.5.
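For a graph small enough to enumerate, the whole pipeline can be written down exactly, with no sampling: weight every q-state labelling (partition) by exp(−H/T), where H is the total weight of the cut edges, accumulate C_ij, and threshold. This brute-force sketch is only for intuition; real SPC generates partitions by Monte Carlo, and the values of q, T and the threshold here are illustrative choices:

```python
import numpy as np
from itertools import product

def spc_exact(weights, n, q=3, T=1.0, threshold=0.5):
    """Exact toy SPC. `weights` maps edges (i, j), i < j, to J_ij.
    Returns (component label per point, C_ij per edge)."""
    Z = 0.0
    C = {e: 0.0 for e in weights}
    for s in product(range(q), repeat=n):            # all q**n partitions
        H = sum(w for (i, j), w in weights.items() if s[i] != s[j])
        p = np.exp(-H / T)                           # unnormalized Boltzmann weight
        Z += p
        for (i, j) in C:
            if s[i] == s[j]:                         # edge i-j was NOT cut
                C[(i, j)] += p
    C = {e: c / Z for e, c in C.items()}             # normalize to probabilities
    # connected components of the graph keeping edges with C_ij > threshold
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for (i, j), c in C.items():
        if c > threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)], C
```

On two tight triangles joined by a single weak edge, the strong internal edges survive the 0.5 threshold while the weak bridge does not, so two clusters emerge.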
Superparamagnetic clustering (SPC). Example: N = 4800 points in D = 2. [figure]
Output of SPC
- Size of the largest clusters as a function of T; a dendrogram.
- Stable clusters "live" over a large range of T.
- A function χ(T) (the susceptibility) that peaks when stable clusters break up.
Identify the stable clusters
Same data, average linkage. Examining this cluster: there is no analog of χ(T).
Advantages of SPC
- Scans all resolutions (T).
- Robust against noise and initialization: it calculates collective correlations.
- Identifies "natural" (peaks of χ) and stable (large ΔT range) clusters.
- No need to pre-specify the number of clusters.
- Clusters can have any shape.
- Can take a distance matrix as input (instead of coordinates).