Probabilistic Techniques for the Clustering of Gene Expression Data Speaker: Yujing Zeng Advisor: Javier Garcia-Frias Department of Electrical and Computer Engineering, University of Delaware


Probabilistic Techniques for the Clustering of Gene Expression Data Speaker: Yujing Zeng Advisor: Javier Garcia-Frias Department of Electrical and Computer Engineering University of Delaware

Contents Introduction –Problem of interest –Introduction to clustering Integrating application-specific knowledge in clustering –Gene expression time-series data –Profile-HMM clustering Integrating different clustering results –Meta-clustering Conclusion

Gene Expression Data DNA (gene) → [transcription] → messenger RNA (mRNA) → [translation] → protein, all under regulation; microarrays measure mRNA levels The pattern behind these measurements reflects the function and behavior of proteins

Gene Expression Data (cont.)

Gene Expression Data (cont.)

What Is Clustering? Clustering can be loosely defined as the process of organizing objects into groups whose members are similar in some way All clustering algorithms assume the pre-existence of groupings among the objects to be clustered; random noise and other uncertainties have obscured these groupings

Advantages of Clustering Unsupervised learning –No pre-knowledge required –Suitable for applications with large databases Well-developed techniques –Many approaches developed –Vast literature available

Problem of Interest –Difficult to integrate information resources other than the data itself Pre-knowledge from particular applications Clustering results from other clustering analyses

Profile-HMM Clustering: exploiting the temporal dependencies existing in gene expression time-series data

Gene Expression Time-Series Data Gene expression time-series data –Collected by a series of microarray experiments implemented at consecutive time points –Each time sequence represents the behavior of one particular gene along the time axis Special property –Horizontal dependencies: dependence exists between observations taken at subsequent time points –Similarity between a pair of series is decided by their patterns across the time axis

Hidden Markov Models to Model Temporal Dependencies Hidden Markov models (HMMs) are one of the most popular ways to model temporal dependencies in stochastic processes (e.g., in speech recognition) Characterized by the following parameters: –Set of possible (hidden) states –Transition probabilities among states –Emission probability in each state –Initial state probabilities Doubly stochastic structure allows flexibility in the modeling of temporal dependencies
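As a concrete sketch of these parameters, here is a minimal discrete HMM together with the forward algorithm for evaluating an observation sequence. The numeric values are illustrative assumptions, not values from the talk:

```python
import numpy as np

# Toy two-state, two-symbol HMM; all numbers are illustrative assumptions.
A = np.array([[0.7, 0.3],    # transition probabilities among states
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # emission probability of each symbol in each state
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])    # initial state probabilities

def likelihood(obs):
    """Forward algorithm: P(obs | model), summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]            # joint prob. of first symbol and each state
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate through A, then emit next symbol
    return float(alpha.sum())
```

A quick sanity check on the recursion is that the likelihoods of all observation sequences of a fixed length sum to one.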

Previous Work Generate one HMM per gene –HMM-based distance [Smyth 97] –HMM-based features [Panuccio et al 02] Generate one HMM per cluster –Autoregressive models (CAGED) [Ramoni et al 02] –HMM-based EM clustering [Schliep et al 03] Drawbacks –Stationary assumption on the temporal dependencies –Limited quality of the resulting HMM because of the small training set (one series for each HMM) –Lack of a model for the whole data structure –Separate training for the model of each cluster –Requirement of an additional technique to predict the number of clusters

Profile-HMM Clustering Left-to-right model with each group of states associated with a time point Only transitions among consecutive layers are allowed Time dependencies at different times modeled separately For each state, emission defined by a Gaussian density ⇒ Each path describes a pattern in a probabilistic way
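The layered left-to-right constraint can be encoded as a block-structured transition matrix. The sketch below uses a uniform initialization within each allowed block, an assumption standing in for the trained values:

```python
import numpy as np

def layered_transitions(m, T):
    """Transition matrix for a profile HMM with m states per time point and T
    time points: transitions are allowed only from the states of layer t to
    the states of layer t+1; everything else is structurally zero."""
    n = m * T
    A = np.zeros((n, n))
    for t in range(T - 1):
        A[t*m:(t+1)*m, (t+1)*m:(t+2)*m] = 1.0 / m   # uniform init within block
    return A
```

Because the zero entries are structural, they stay zero under Baum-Welch re-estimation, so the left-to-right shape survives training.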

Profile-HMM Clustering (cont.) Similarity between two time series defined according to the probability that they are related to the same stochastic pattern –Training (Baum-Welch): find the most likely set of patterns characterizing all the observed time series –Clustering (Viterbi): group together the time series (genes) that are most likely to be related with the same pattern (each pattern corresponds to a cluster)
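Under the simplifying assumption of uniform transitions between consecutive layers and equal emission variances, the Viterbi assignment reduces to picking, at each time point, the state whose Gaussian mean best matches the observation; genes sharing a path form one cluster. A hand-rolled sketch with hypothetical data:

```python
import numpy as np
from collections import defaultdict

def viterbi_path(series, means):
    """Most likely state path; with uniform layer-to-layer transitions and
    equal variances (assumptions for this sketch) this is just the nearest
    state mean at each time point."""
    return tuple(int(np.argmin((x - np.asarray(means[t])) ** 2))
                 for t, x in enumerate(series))

def cluster_by_path(data, means):
    """Group the series (genes) whose most likely paths coincide; each shared
    path corresponds to one cluster, i.e. one stochastic pattern."""
    clusters = defaultdict(list)
    for gene, series in data.items():
        clusters[viterbi_path(series, means)].append(gene)
    return dict(clusters)
```

With non-uniform trained transitions the argmax would instead run as a full dynamic program over the layered transition blocks, but the grouping step is the same.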

Profile-HMM Clustering (cont.) Single HMM models the overall distribution of the data, so that the representative patterns (clusters) are selected simultaneously –As opposed to other HMM approaches, each stochastic pattern is built according to both positive and negative samples Number of clusters is obtained automatically –Proposed model can be seen as a high-dimensional self-organized network –Number of clusters is relatively stable with respect to the number of states Training and clustering procedures are standard techniques ⇒ Easy implementation

Experiment Results: Dataset Study on the transcriptional program of sporulation in budding yeast [Chu et al 98] –Measures at 7 uneven intervals –Subset of 477 genes with over-expression behavior during sporulation –Original paper distinguishes 7 temporal patterns by visual inspection and prior studies

Experiment Results: Number of Clusters from Proposed HMM Same number of states, m, at each time point # of clusters is automatically determined by the HMM Resulting # of clusters (and clustering structure) is relatively stable with respect to the number of states in the model –m=3 ⇒ 3^7 = 2187 possible patterns, but 12 resulting clusters –m=50 ⇒ 50^7 ≈ 7.8×10^11 possible patterns, but 19 resulting clusters (Plot: number of resulting clusters as a function of m)
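The path-count arithmetic on this slide can be checked directly: with m states at each of the 7 time points, the layered left-to-right model admits m^7 distinct paths:

```python
# m states per time point over T = 7 time points gives m**7 candidate paths.
assert 3 ** 7 == 2187                   # m = 3
assert f"{50 ** 7:.1e}" == "7.8e+11"    # m = 50, about 7.8 x 10^11 paths
```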

Clustering Validation
Name | Basic criterion | Best value
Homogeneity | Homogeneity | 0
Separation | Separation | ∞
DB index | Both | 0
Silhouette | Both | 1
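As an example of the indices that capture both criteria, the silhouette can be hand-rolled in a few lines; this is a generic sketch of the standard definition, not the talk's implementation:

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette: per point, (b - a) / max(a, b), where a is the mean
    distance to the point's own cluster and b the smallest mean distance to
    any other cluster. Best value is 1 (compact, well-separated clusters)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    scores = []
    for i, l in enumerate(labels):
        own = (labels == l) & (np.arange(len(X)) != i)
        if not own.any():
            scores.append(0.0)   # convention: singleton clusters score 0
            continue
        a = D[i, own].mean()
        b = min(D[i, labels == other].mean()
                for other in set(labels.tolist()) - {l})
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Homogeneity and separation follow the same pattern (mean within-cluster and between-cluster distances, respectively) but each captures only one side of the trade-off.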

Experiment Results: Comparison with Original Model HMM increases the number of clusters from the original 7 to 16 HMM identifies patterns mixed in the same original group and assigns them to different clusters –Original metabolism group shows some inconsistent profiles –HMM refines this subset into 2 more consistent clusters (Table: homogeneity, separation, DB index, and silhouette for the HMM clustering vs. the original grouping)

Experiment Results: Comparison with Other Clustering Methods Compare with K-means and single-linkage with #clusters = 16 (Table: homogeneity, separation, DB index, and silhouette for HMM, K-means, single-linkage, and the original grouping) Most of the 16 single-linkage clusters are singletons ⇒ Despite favorable DB and separation indices, real patterns are not described by the single-linkage clusters

Summary for HMM Clustering A novel HMM clustering approach proposed to exploit the temporal dependencies in microarray dynamic data HMM performance evaluated using data studying the transcriptional program of sporulation in budding yeast –HMM capable of identifying a reasonable number of clusters, stable with respect to model complexity, without any a priori information –Evaluation indices show that HMM provides a better description of the data distribution than other clustering techniques –Biological interpretation of the HMM results provides meaningful insights

Problem of Interest –Difficult to integrate information resources other than the data itself Pre-knowledge from particular applications Clustering results from other clustering analyses

Meta-Clustering: integrating different clustering results

Facing Various Clustering Approaches… There is no single best approach for obtaining a partition, because no precise and workable definition of 'cluster' exists Clusters can be of arbitrary shapes and sizes in a multidimensional pattern space Each clustering approach imposes a certain assumption on the structure of the data; if the data happens to conform to that structure, the true clusters are recovered

Example of Clustering (Figure: results of K-means, SOM, and single-linkage on the same data set)

Example of Clustering (cont.) (Figure: pairwise comparison of the K-means, SOM, and single-linkage results)

Problem of Interest Difficult to evaluate, compare, and combine different clustering results –Different cluster sizes, boundaries, … –High dimensionality –Large amount of data Although many clustering tools are available, few extract information by comparing or combining two or more clustering results

Proposed Approach An adaptive meta-clustering approach –Extracting the information from the results of different clustering techniques –Combining them into a single clustering structure, so that a better interpretation of the data distribution can be obtained

Adaptive Meta-clustering Algorithm (Pipeline: Alignment → Combination → Meta-clustering)

Dc Matrix n × n matrix, where n is the size of the input data set Each entry Dc(i,j) is the cluster-based distance between data points i and j The cluster-based distance, which we define, shows the dissimilarity between every two points
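A minimal construction of Dc for a single clustering result, under the simplest choice of cluster-based distance (0 for points sharing a cluster, 1 otherwise; the distance actually defined in the talk may be richer):

```python
import numpy as np

def dc_matrix(labels):
    """n x n cluster-based distance matrix for one clustering result:
    Dc(i, j) = 0 if points i and j share a cluster, 1 otherwise."""
    l = np.asarray(labels)
    return (l[:, None] != l[None, :]).astype(float)
```

Note that Dc depends only on cluster memberships, so results with different cluster counts and boundaries all map into the same n × n representation.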

Cluster-Based Distance (Illustration: points X1–X7 assigned to clusters I–IV; the cluster-based distance is computed from the points' cluster-membership vectors)

Combination Assume that C* is the clustering structure that we want to discover from the input dataset, and let Dc* denote its corresponding matrix of cluster-based distances Given a pool of clustering results, we can estimate Dc* from the Dc matrices of the individual results

Meta-Clustering Uses an agglomerative hierarchical approach –Merging criterion defined on the combined cluster-based distances
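Putting the pieces together, here is a sketch of the combination and meta-clustering steps: average the per-result cluster-based distances into a consensus matrix, then merge agglomeratively. Averaging and average linkage are simple stand-ins for the talk's estimator and merging criterion:

```python
import numpy as np

def consensus_dc(all_labels):
    """Average the 0/1 cluster-based distance matrices over the pool of
    clustering results (a simple stand-in for the talk's estimator of Dc*)."""
    mats = [np.not_equal.outer(l, l).astype(float)
            for l in map(np.asarray, all_labels)]
    return np.mean(mats, axis=0)

def meta_cluster(D, k):
    """Agglomerative average-linkage on the consensus matrix D: repeatedly
    merge the pair of clusters with the smallest mean pairwise distance
    until k clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([D[i, j] for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    return [sorted(c) for c in clusters]
```

Entries of the consensus matrix near 0 mean the pool consistently groups two points together, so the merging order naturally favors the structure the input clusterings agree on.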

Simulation Results

Simulation Results (cont.)

(Figure: results of single-linkage, K-means, SOM, and the proposed meta-clustering)

Simulation Results (cont.) Yeast cell-cycle data [Karen M. Bloch and Gonzalo Arce, "Nonlinear Correlation for the Analysis of Gene Expression Data", ISMB 2002]

Simulation Results (cont.) (Table: for average-linkage, SOM, K-means, and meta-clustering, the cluster sizes, the percentage of profiles in each group that are from a given function class, and the percentage of profiles in each function class contained in the group, for the classes chromatin structure, glycolysis, protein degradation, and spindle pole)

(Figure: expression profiles of the four function classes: chromatin structure, glycolysis, protein degradation, spindle pole)

Summary for Meta-Clustering The evaluation and combination of different clustering results is an important open problem The problem is addressed by –Defining a special distance measure, called Dc, to represent the statistical "signal" of each cluster –Combining the information in a statistical way to form a new clustering structure The simulations show the robustness of the proposed algorithm

Conclusion We are interested in analyzing gene expression data sets and inferring biological interactions from them The study focuses on clustering –Including pre-knowledge in the clustering process –Integrating different clustering results Future work will give more emphasis to real applications

Questions?