Unsupervised analysis. Goal A: find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process.

Presentation transcript:

Unsupervised analysis
- Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process.
- Goal B: Divide tissues into groups with similar gene expression profiles. These tissues are expected to be in the same biological (clinical) state.
Approach: clustering.
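Goal A speaks of "correlated expression profiles" without naming a correlation measure. The sketch below assumes the Pearson correlation coefficient and turns it into a distance (1 minus the correlation) that a clustering method could use. The gene names and expression values are hypothetical toy data, not taken from the slides.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_distance(x, y):
    """Assumed dissimilarity: 0 for perfectly correlated, 2 for anti-correlated."""
    return 1.0 - pearson(x, y)


# Hypothetical profiles: one value per experiment (tissue).
profiles = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],  # same shape as gene_a, larger amplitude
    "gene_c": [4.0, 3.0, 2.0, 1.0],  # anti-correlated with gene_a
}

for name in ("gene_b", "gene_c"):
    d = correlation_distance(profiles["gene_a"], profiles[name])
    print(f"gene_a vs {name}: distance = {d:.3f}")
```

With this choice, genes whose profiles have the same shape are close even when their absolute expression levels differ, which is one common reading of "correlated".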

Definition of the clustering problem (figure: giraffe)

Cluster analysis yields a dendrogram (figure: dendrogram 1; axis T = resolution)

But what about the okapi? (figure: giraffe and okapi)

Statement of the problem
Given data points X_i, i = 1, 2, ..., N, embedded in D-dimensional space, identify the underlying structure of the data.
Aims:
- Partition the data into M clusters, with points of the same cluster being "more similar". M is also to be determined!
- Generate a dendrogram and identify significant, "stable" clusters.
The problem is "ill posed": what is "more similar"? It depends on the resolution.

Cluster analysis yields a dendrogram (figure: dendrogram 2; axis T; linear ordering of the data, from young to old)

Clustering methods
Agglomerative hierarchical
- Average linkage (genes: Eisen et al., PNAS 1998)
Centroid (representative)
- Self-organized maps (Kohonen 1997; genes: Golub et al., Science 1999)
- K-means (genes: Tamayo et al., PNAS 1999)
Physically motivated
- Deterministic annealing (Rose et al., PRL 1990; genes: Alon et al., PNAS 1999)
- Super-paramagnetic clustering (SPC) (Blatt et al.; genes: Getz et al., Physica 2000, PNAS 2000)

Agglomerative hierarchical clustering
Distance between joined clusters: after each merge, the distance between the new cluster and the other clusters must be defined.
- Single linkage: distance between the closest pair.
- Complete linkage: distance between the farthest pair.
- Average linkage: average distance between all pairs, or distance between cluster centers.
Dendrogram: the dendrogram induces a linear ordering of the data points.
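The linkage rules above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance between points and a naive search for the closest pair of clusters at every step. The function names and the toy points are hypothetical, and the "distance between cluster centers" variant is left out here.

```python
import math
from itertools import combinations


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cluster_distance(points, c1, c2, linkage):
    """Distance between two clusters, each given as a list of point indices."""
    pair_dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(pair_dists)  # closest pair
    if linkage == "complete":
        return max(pair_dists)  # farthest pair
    if linkage == "average":
        return sum(pair_dists) / len(pair_dists)  # mean over all pairs
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, linkage="average"):
    """Greedy bottom-up merging; returns the list of (cluster_a, cluster_b, distance)."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the two clusters that are closest under the chosen linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(
                points, clusters[ab[0]], clusters[ab[1]], linkage
            ),
        )
        d = cluster_distance(points, clusters[a], clusters[b], linkage)
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical toy data: two pairs of nearby points.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
for linkage in ("single", "average", "complete"):
    last = agglomerate(points, linkage)[-1]
    print(linkage, "final merge distance:", round(last[2], 3))
```

The sequence of merges is what a dendrogram draws: each merge becomes a node at a height given by its distance, so the three linkage rules can produce different trees from the same data.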

Hierarchical clustering: summary
- Results depend on the distance update method.
- Greedy iterative process.
- NOT robust against noise.
- No inherent measure to identify stable clusters.

Compact, well-separated clouds: everything works (figure: 2 good clouds)

2 flat clouds: single linkage works (figure: 2 flat clouds)

Single linkage is sensitive to noise (figure: filament)

Average linkage
Distance between joined clusters: the distance between the new cluster and the other clusters must be defined.
- Average linkage: average distance between all pairs.
- Mean linkage: distance between centroids.
Figure: dendrogram.
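The two definitions give different numbers in general. The small sketch below compares them on hypothetical points, assuming Euclidean distance; the function names are illustrative only.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def average_linkage(c1, c2):
    """Average of all pairwise distances between the two clusters."""
    total = sum(euclidean(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def mean_linkage(c1, c2):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(c1), centroid(c2))


# Hypothetical clusters: a vertical pair and a single point to its right.
left = [(0.0, -1.0), (0.0, 1.0)]
right = [(3.0, 0.0)]

print("average linkage:", average_linkage(left, right))
print("mean linkage:   ", mean_linkage(left, right))
```

Here the centroid of the left cluster sits at the origin, so mean linkage sees a distance of 3, while average linkage also feels the vertical spread of the left cluster and reports a larger value.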

nature 2002 breast cancer

How many clusters? 3 large or many small (figure: SPC toy problem)

other methods

Centroid methods: K-means
- Partitions the data points into K subsets.
- Finds the positions of K centroids.
- Data points are assigned to the closest centroid.
- Finds a local minimum of the cost: the sum of squared distances between data points and their associated centroids.
- Clusters are convex and compact.
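The cost described in words above can be written out as follows. The symbols C_k (the set of points assigned to centroid k) and \mu_k (the position of centroid k) are notation introduced here for illustration; X_i, N and K follow the slides.

```latex
E(\mu_1, \dots, \mu_K; C_1, \dots, C_K)
  = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert X_i - \mu_k \rVert^2 ,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} X_i .
```

For a fixed assignment, E is minimized by placing each \mu_k at the mean of its points, and for fixed centroids it is minimized by assigning each X_i to its closest centroid. Alternating the two steps never increases E, which is why the procedure settles in a local minimum.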

K-means, iteration = 0
Start with random positions of centroids.

K-means, iteration = 1
Start with random positions of centroids. Assign data points to centroids.

K-means, iteration = 1
Start with random positions of centroids. Assign data points to centroids. Move centroids to the center of their assigned points.

K-means, iteration = 3
Start with random positions of centroids. Assign data points to centroids. Move centroids to the center of their assigned points. Iterate until the cost is minimal.
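The steps on these slides can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes squared Euclidean distance, draws the random initial centroids from the data points themselves, and stops when the assignment no longer changes. The names `kmeans`, `max_iter` and `seed`, and the toy points, are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Start with random positions of centroids (here: k distinct data points).
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each data point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the cost cannot decrease further
        assignment = new_assignment
        # Move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    cost = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, cost


# Hypothetical toy data: two compact, well-separated clouds.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
]
centroids, assignment, cost = kmeans(points, k=2)
print("centroids:", centroids)
print("assignment:", assignment)
print("cost:", cost)
```

On data this well separated the two clouds are recovered regardless of the starting centroids; on harder data different seeds can end in different local minima.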

K-means: summary
- Result depends on the initial positions of the centroids.
- Fast algorithm: only the distances from data points to centroids are computed.
- K must be preset.
- Fails for non-spherical distributions.

TSS vs K
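A curve of this kind can be produced by running K-means for several values of K and recording the final cost. The sketch below assumes that TSS denotes the total within-cluster sum of squared distances. To keep the curve reproducible it swaps the random start for a deterministic farthest-first initialization, which is an assumption of this example and not something the slides describe. All names and the toy data are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_init(points, k):
    """Deterministic start: first point, then repeatedly the point farthest
    from all centroids chosen so far (assumed here instead of a random start)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans_tss(points, k, max_iter=100):
    """Run K-means and return the total within-cluster sum of squares."""
    centroids = farthest_first_init(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Hypothetical toy data: three compact clouds of four points each.
offsets = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (ox + dx, oy + dy)
    for ox, oy in offsets
    for dx in (0.0, 1.0)
    for dy in (0.0, 1.0)
]

tss = {k: kmeans_tss(points, k) for k in range(1, 6)}
for k, value in tss.items():
    print(f"K = {k}: TSS = {value:.3f}")
```

On this toy set the cost drops steeply up to K = 3 and only marginally afterwards, which is the kind of "elbow" such a plot is usually read for.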

Irises
- Three groups: Iris setosa, Iris versicolor, Iris virginica.
- 50 specimens from each group.
- 4 numbers for each flower.
- 150 data points in 4-dimensional space.

150 points in d = 4: 3 large clusters

Output of SPC: stable clusters "live" for large ΔT

Choosing a value for T

Same data - Average Linkage No analog for 

Same data - Average Linkage Examining this cluster