PCluster: Probabilistic Agglomerative Clustering of Gene Expression Profiles. Nir Friedman. Presenting: Inbar Matarasso, 09/05/2005, The School of Computer Science, Tel-Aviv University

Outline • A little about clustering • Mathematics background • Introduction • The problem • Notation • Scoring Method • Agglomerative clustering • Double clustering • Conclusion

A little about clustering • Partition entities (genes) into groups called clusters (according to similarity in their expression profiles across the probed conditions). • Clusters are homogeneous and well-separated. • Clustering problems arise in numerous disciplines, including biology, medicine, psychology, and economics.

Clustering – why? • Reduce the dimensionality of the problem – identify the major patterns in the dataset • Pattern recognition • Image processing • Economic science (especially market research) • WWW: document classification; clustering weblog data to discover groups of similar access patterns

Examples of Clustering Applications • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs • Insurance: identifying groups of motor insurance policy holders with a high average claim cost • Earthquake studies: observed earthquake epicenters should be clustered along continent faults

Types of clustering methods • How to choose a particular method? 1. The type of output desired 2. The known performance of the method with particular types of data 3. The hardware and software facilities available 4. The size of the dataset • In general, clustering methods may be divided into two categories based on the cluster structure they produce: partitioning methods and hierarchical agglomerative methods

Partitioning Methods • Partition the objects into a prespecified number of groups K • Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within-cluster sums of squares) • Examples: k-means, partitioning around medoids (PAM), self-organizing maps (SOM), model-based clustering
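For instance, a partitioning method such as k-means can be applied directly to a genes-by-conditions expression matrix. The sketch below uses scikit-learn on synthetic data; the random matrix, K = 3, and the library choice are illustrative assumptions, not part of the PCluster method.

```python
# Minimal k-means sketch on a synthetic genes-by-conditions matrix.
# Illustrative only; the matrix and K = 3 are arbitrary choices.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expression = rng.normal(size=(100, 8))   # 100 genes, 8 conditions

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(expression)
print(km.labels_[:10])                   # cluster assignment of the first 10 genes
```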

Partitioning Methods • Result: M clusters, each object belonging to one cluster • Single pass: 1. Make the first object the centroid for the first cluster. 2. For the next object, calculate its similarity, S, with each existing cluster centroid, using some similarity coefficient. 3. If the highest calculated S is greater than some specified threshold value, add the object to the corresponding cluster and redetermine the centroid; otherwise, use the object to initiate a new cluster. 4. If any objects remain to be clustered, return to step 2.
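A minimal sketch of the single-pass procedure above, assuming cosine similarity and an arbitrary threshold (both illustrative choices):

```python
# Single-pass (leader) clustering: one pass over the data, assigning each
# object to the most similar existing centroid if the similarity exceeds a
# threshold, otherwise starting a new cluster.
import numpy as np

def single_pass_cluster(objects, threshold=0.8):
    centroids, clusters = [], []          # running centroids and member lists
    for x in objects:
        if centroids:
            sims = [np.dot(x, c) / (np.linalg.norm(x) * np.linalg.norm(c))
                    for c in centroids]
            best = int(np.argmax(sims))
        if not centroids or sims[best] <= threshold:
            centroids.append(x.astype(float))                   # start a new cluster
            clusters.append([x])
        else:
            clusters[best].append(x)                            # join the best cluster
            centroids[best] = np.mean(clusters[best], axis=0)   # redetermine centroid
    return clusters

clusters = single_pass_cluster(np.random.default_rng(1).normal(size=(50, 8)))
print(len(clusters), "clusters found")
```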

Partitioning Methods • This method requires only one pass through the dataset. • The time requirements are typically of order O(N log N) for order O(log N) clusters. • A disadvantage is that the resulting clusters are not independent of the order in which the objects are processed, with the first clusters formed usually being larger than those created later in the clustering run.

Hierarchical Clustering • Produce a dendrogram • Avoid prespecification of the number of clusters K • The tree can be built in two distinct ways: bottom-up (agglomerative clustering) or top-down (divisive clustering)

Hierarchical Clustering • Organize the genes in the structure of a hierarchical tree. • Initial step: each gene is regarded as a cluster with one item. • Find the two most similar clusters and merge them into a common node. • The length of the branch is proportional to the distance. • Iterate on merging nodes until all genes are contained in one cluster – the root of the tree. • Example merge order for genes g1–g5: {1,2}, {4,5}, {1,2,3}, {1,2,3,4,5}
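A minimal sketch of bottom-up clustering of genes using SciPy's standard hierarchical clustering; the random data, average linkage, and Euclidean distance are illustrative assumptions, not the probabilistic score introduced later.

```python
# Bottom-up (agglomerative) clustering of genes with SciPy, producing the
# linkage structure behind a dendrogram.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
expression = rng.normal(size=(30, 6))        # 30 genes, 6 conditions

Z = linkage(expression, method="average", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
```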

Partitioning vs. Hierarchical • Partitioning – Advantage: provides clusters that satisfy some optimality criterion (approximately); Disadvantages: need an initial K, long computation time • Hierarchical – Advantage: fast computation (agglomerative); Disadvantage: rigid, cannot correct later for erroneous decisions made earlier

Mathematical evaluation of clustering solution • Merits of a ‘good’ clustering solution: • Homogeneity: genes inside a cluster are highly similar to each other. Average similarity between a gene and the center (average profile) of its cluster. • Separation: genes from different clusters have low similarity to each other. Weighted average similarity between centers of clusters. • These are conflicting features: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation.
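A small sketch of how these two measures can be computed for a given clustering; Pearson correlation as the similarity and cluster-size products as the separation weights are assumptions for illustration.

```python
# Homogeneity: average similarity of each gene to its cluster centre.
# Separation: weighted average similarity between cluster centres.
import numpy as np

def homogeneity_and_separation(X, labels):
    ks = np.unique(labels)
    centers = {k: X[labels == k].mean(axis=0) for k in ks}
    sizes = {k: int(np.sum(labels == k)) for k in ks}

    homog = np.mean([np.corrcoef(x, centers[k])[0, 1]
                     for x, k in zip(X, labels)])

    sep_num, sep_den = 0.0, 0.0
    for i, a in enumerate(ks):
        for b in ks[i + 1:]:
            w = sizes[a] * sizes[b]
            sep_num += w * np.corrcoef(centers[a], centers[b])[0, 1]
            sep_den += w
    separation = sep_num / sep_den if sep_den else 0.0
    return homog, separation

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
labels = rng.integers(0, 3, size=40)
print(homogeneity_and_separation(X, labels))
```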

Gaussian Distribution Function • Large number of events • Describes physical events • Approximates the exact binomial distribution of events • Functional form: p(x) = (1/(σ√(2π))) · exp(−(x − a)² / (2σ²)); mean: a; standard deviation: σ

Bayes' Theorem • p(A|X) = p(X|A)·p(A) / [p(X|A)·p(A) + p(X|~A)·p(~A)] • 1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

Bayes' Theorem • The correct answer is 7.8%, obtained as follows: Out of 10,000 women, 100 have breast cancer; 80 of those 100 have positive mammographies. From the same 10,000 women, 9,900 will not have breast cancer, and of those 9,900 women, 950 will also get positive mammographies. This makes the total number of women with positive mammographies 80 + 950 = 1,030. Of those 1,030 women with positive mammographies, 80 will have cancer. Expressed as a proportion, this is 80/1,030 ≈ 0.078, or 7.8%.

Bayes' Theorem
• p(cancer) = 0.01 (Group 1: 100 women with breast cancer)
• p(~cancer) = 0.99 (Group 2: 9,900 women without breast cancer)
• p(positive|cancer) = 80.0% (80% of women with breast cancer have positive mammographies)
• p(~positive|cancer) = 20.0% (20% of women with breast cancer have negative mammographies)
• p(positive|~cancer) = 9.6% (9.6% of women without breast cancer have positive mammographies)
• p(~positive|~cancer) = 90.4% (90.4% of women without breast cancer have negative mammographies)
• p(cancer & positive) = 0.008 (Group A: 80 women with breast cancer and positive mammographies)
• p(cancer & ~positive) = 0.002 (Group B: 20 women with breast cancer and negative mammographies)
• p(~cancer & positive) = 0.095 (Group C: 950 women without breast cancer and positive mammographies)
• p(~cancer & ~positive) = 0.895 (Group D: 8,950 women without breast cancer and negative mammographies)
• p(positive) = 0.103 (1,030 women with positive results)
• p(~positive) = 0.897 (8,970 women with negative results)
• p(cancer|positive) = 7.80% (chance you have breast cancer if the mammography is positive)
• p(~cancer|positive) = 92.20% (chance you are healthy if the mammography is positive)
• p(cancer|~positive) = 0.22% (chance you have breast cancer if the mammography is negative)
• p(~cancer|~positive) = 99.78% (chance you are healthy if the mammography is negative)

Bayes' Theorem • To find the chance that a woman with a positive mammography has breast cancer, we computed: p(positive|cancer)·p(cancer) / [p(positive|cancer)·p(cancer) + p(positive|~cancer)·p(~cancer)] 1. which is p(positive & cancer) / [p(positive & cancer) + p(positive & ~cancer)] 2. which is p(positive & cancer) / p(positive) 3. which is p(cancer|positive)

Bayes' Theorem • The original proportion of patients with breast cancer is known as the prior probability. The chance that a patient with breast cancer gets a positive mammography, and the chance that a patient without breast cancer gets a positive mammography, are known as the two conditional probabilities. Collectively, this initial information is known as the priors. The final answer - the estimated probability that a patient has breast cancer, given that we know she has a positive result on her mammography - is known as the revised probability or the posterior probability.

Bayes' Theorem
p(A|X) = p(X & A) / p(X)
p(A|X) = p(X & A) / [p(X & A) + p(X & ~A)]
p(A|X) = p(X|A)·p(A) / [p(X|A)·p(A) + p(X|~A)·p(~A)]
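A quick numerical check of the mammography example using the last form above:

```python
# Posterior probability of cancer given a positive mammography,
# using the prior and the two conditional probabilities from the slides.
p_cancer = 0.01
p_pos_given_cancer = 0.80
p_pos_given_no_cancer = 0.096

numerator = p_pos_given_cancer * p_cancer
denominator = numerator + p_pos_given_no_cancer * (1 - p_cancer)
print(numerator / denominator)   # ~0.078, i.e. about 7.8%
```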

Introduction • A central problem in the analysis of gene expression data is clustering of genes with similar expression profiles. • We are going to get familiar with a hierarchical clustering procedure that is based on a simple probabilistic model. • Genes that are expressed similarly in each group of conditions are clustered together.

The problem • The goal of clustering is to identify groups of genes with “similar” expression patterns. • A group of genes is clustered together if their measured expression values could have been sampled from the same stochastic source with high probability. • The user specifies in advance a partition of the experimental conditions.

Clustering Gene Expression Data • Cluster genes, e.g. to (attempt to) identify groups of co-regulated genes • Cluster samples, e.g. to identify tumors based on profiles • Cluster both at the same time • Can be helpful for identifying patterns in time or space • Useful (essential?) when seeking new subclasses of samples • Can be used for exploratory purposes

Notation • A matrix of gene expression measurements: D = {e_{g,c} : g ∈ Genes, c ∈ Conds} • Genes is a set of genes, and Conds is a set of conditions

Scoring Method • A partition C = {C_1, …, C_m} of the conditions in Conds and a partition G = {G_1, …, G_n} of the genes in Genes. • We want to score the combined partition. • Assumption: if g and g' are in the same gene cluster, and c and c' are in the same condition cluster, then the expression values e_{g,c} and e_{g',c'} are sampled from the same distribution.

Scoring Method • Likelihood function: L(G, C, θ : D) = ∏_{i,k} ∏_{g ∈ G_i, c ∈ C_k} p(e_{g,c} | θ_{i,k}) • where θ_{i,k} are the parameters that describe the expression of genes in G_i under the conditions in C_k. • L(G, C, θ : D) = L(G, C, θ : D') for any choice of G and θ.

Scoring Method • The parameterization for expression uses a Gaussian distribution.

Scoring Method • Using the previous parameterization, for each partition of the data we choose the best-fitting parameter sets. • This maximum-likelihood fit overestimates the quality of the partition; to compensate for this overestimate we use the Bayesian approach and average the likelihood over all parameter values.

Scoring Method - Summary • Local score of a particular cell:
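A minimal sketch of such a local cell score, assuming a Gaussian model with a conjugate Normal-Inverse-Gamma prior and averaging the likelihood over all parameter values as described above; the prior hyperparameters are illustrative assumptions, not those used in the paper.

```python
# Bayesian local score for one (gene-cluster, condition-cluster) cell:
# the log marginal likelihood of its expression values under a Gaussian
# with a Normal-Inverse-Gamma prior (likelihood averaged over parameters
# rather than maximised). Hyperparameters are illustrative.
import numpy as np
from scipy.special import gammaln

def cell_score(values, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    x = np.asarray(values, dtype=float).ravel()
    n = x.size
    xbar = x.mean()
    ss = np.sum((x - xbar) ** 2)

    # posterior hyperparameters of the conjugate update
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)

    # log p(x): standard marginal likelihood of the Gaussian / NIG pair
    return (gammaln(alpha_n) - gammaln(alpha0)
            + alpha0 * np.log(beta0) - alpha_n * np.log(beta_n)
            + 0.5 * (np.log(kappa0) - np.log(kappa_n))
            - 0.5 * n * np.log(2.0 * np.pi))

print(cell_score(np.random.default_rng(0).normal(size=20)))
```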

Agglomerative Clustering • Given a partition C = {C_1, …, C_m} of the conditions. • One approach to learning a clustering of the genes is to use an agglomerative procedure.

Agglomerative Clustering • G^(1) = {G_1, …, G_{|Genes|}}, where each G_i is a singleton. • While t < |Genes|, i.e. until G^(t) contains a single cluster: • Compute the change in the score that results from merging the clusters G_i and G_j

Agglomerative Clustering • Choose (i_t, j_t) to be the pair of clusters whose merger is the most beneficial according to the score. • Define G^(t+1) as the partition obtained from G^(t) by merging G_{i_t} and G_{j_t}. • Complexity: O(|Genes|²·|C|)
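A naive sketch of this greedy loop, assuming a generic per-cluster score (the Bayesian cell score sketched above would be one choice); it recomputes every pairwise merge score on each round rather than caching, so it is simpler but slower than the O(|Genes|²·|C|) procedure described.

```python
# Greedy agglomeration: repeatedly merge the pair of gene clusters whose
# merger improves the score the most, recording every intermediate partition.
import numpy as np

def agglomerate(X, score):
    clusters = [[g] for g in range(X.shape[0])]          # start: all singletons
    history = [list(map(tuple, clusters))]
    while len(clusters) > 1:
        best_delta, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                delta = (score(X[merged]) -
                         score(X[clusters[i]]) - score(X[clusters[j]]))
                if best_delta is None or delta > best_delta:
                    best_delta, best_pair = delta, (i, j)
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]           # perform the best merge
        del clusters[j]
        history.append(list(map(tuple, clusters)))
    return history   # sequence of partitions G^(1), ..., G^(|Genes|)

# Example usage with a trivial score: negative within-cluster sum of squares.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
hist = agglomerate(X, lambda block: -np.sum((block - block.mean(axis=0)) ** 2))
print(len(hist), "partitions recorded")
```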

Double Clustering • We want the procedure to select the best partition for us: 1. Track the sequence of partitions G^(1), …, G^(|Genes|). 2. Select the partition with the highest score. • In theory: the maximum-likelihood score should select G^(1). • In practice: it selects a partition from a much later stage. • Intuition: the best-scoring partition strikes a tradeoff between finding groups of genes such that each is homogeneous, and having distinct differences between them.

Double Clustering • Cluster both genes and conditions at the same time: 1. Start with some partition of the conditions (say the one where each condition is a singleton). 2. Perform gene agglomeration. 3. Select the “best”-scoring gene partition. 4. Fix this gene partition. 5. Perform agglomeration on the conditions. • Intuitively, each step improves the score, and thus this procedure should converge.
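A skeleton of this alternating procedure, assuming hypothetical helpers agglomerate_rows (returning the sequence of row partitions, as in the sketch above) and partition_score; the fixed number of rounds is an illustrative stopping rule, not part of the original procedure.

```python
# Skeleton of the double-clustering loop: alternate gene agglomeration
# (given the current condition partition) with condition agglomeration
# (given the fixed gene partition), keeping the best-scoring partition
# at each stage. `agglomerate_rows` and `partition_score` are hypothetical
# placeholders; the round count is an assumption.
def double_cluster(X, agglomerate_rows, partition_score, rounds=3):
    gene_part = [(g,) for g in range(X.shape[0])]   # singleton gene clusters
    cond_part = [(c,) for c in range(X.shape[1])]   # singleton condition clusters
    for _ in range(rounds):
        # steps 2-3: agglomerate genes, keep the best-scoring gene partition
        gene_part = max(agglomerate_rows(X, cond_part),
                        key=lambda p: partition_score(X, p, cond_part))
        # steps 4-5: fix the gene partition, agglomerate the conditions on X.T
        cond_part = max(agglomerate_rows(X.T, gene_part),
                        key=lambda p: partition_score(X.T, p, gene_part))
    return gene_part, cond_part
```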

Particular features of our algorithm • We can measure a large number of genes. • The agglomerative clustering algorithm returns a hierarchical partition that describes similarities at different scales. • We use a likelihood function rather than a measure of similarity. • The user specifies in advance a partition of the experimental conditions.

Conclusion • Partition entities into groups called clusters. • Clusters are homogeneous and well-separated. • Bayes' Theorem: p(A|X) = p(X|A)·p(A) / [p(X|A)·p(A) + p(X|~A)·p(~A)] • Partitions: C = {C_1, …, C_m}, G = {G_1, …, G_n}; we want to score the combined partition. • Likelihood function: L(G, C, θ : D) = ∏_{i,k} ∏_{g ∈ G_i, c ∈ C_k} p(e_{g,c} | θ_{i,k})

Conclusion • Agglomerative Clustering • The main advantage of this procedure is that it can take as input the “relevant” distinctions among the conditions.

Questions?

References
[1] N. Friedman. PCluster: Probabilistic Agglomerative Clustering of Gene Expression Profiles.
[2] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. J. Comp. Bio., 6(3-4):281–297, 1999.
[3] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25):14863–14868, 1998.
[4] E. Yudkowsky. An Intuitive Explanation of Bayesian Reasoning, 2003.