
Clustering of Gene Expression Time Series with Conditional Random Fields
Yinyin Yuan and Chang-Tsun Li, Computer Science Department

Microarray and Gene Expression
Microarray is a high-throughput technique that can assay the expression levels of a large number of genes in a tissue. A gene expression level is the relative amount of mRNA produced at a specific time point under certain experimental conditions. Microarrays thus provide a means to decipher the logic of gene regulation by monitoring the expression of all genes in a tissue. Analysis and learning from these data at the molecular level are revolutionary in medicine because the data are highly informative. Innovative model systems are needed, rather than straightforward adaptations of existing methodologies.

Gene Expression
Gene expression data are obtained from microarrays and organized into a gene expression matrix for analysis by various methodologies for medical and biological purposes. Data acquisition comprises microarray image processing for data extraction and the transformation of the extracted data into a gene expression matrix for further processing. After image processing, image analysis software normally transforms the data into a gene expression matrix by organizing the data from multiple hybridizations. Each column describes the expression levels of one gene under a series of experimental conditions; in other words, each position in the matrix characterizes the expression level of one gene in a certain experiment. Obtaining the gene expression matrix is non-trivial: data normalization and the treatment of replicate measurements are needed before values relating to the same gene can be compared.

Gene Expression Time Series
A gene expression time series is a sequence of gene expression values measured at successive time points, at either uniform or uneven intervals. Static experiments record only snapshots of gene expression, while time series experiments measure a temporal process: microarray experiments are performed at consecutive time points to record a time series of gene expression data. Whereas static data are assumed to be independent, time series data have strong correlations between successive points, and so reveal more information. The time series experiment should therefore be designed carefully according to the available resources and the objective of the experiment; parameters such as the sampling rate and the number of time points must be decided with the gene regulation in mind. Because gene expression is a temporal process, measuring a time series is necessary to determine the set of genes expressed under certain conditions, their expression levels and the interactions between these genes. This fully exploits the information the experiments can yield, since it reveals the pathway leading from one state to the next, not just the stable state under a new condition. The underlying assumption in clustering gene expression data is that co-expression indicates co-regulation, so clustering should identify genes that share similar functions.

Probabilistic models
A key challenge of gene expression time series research is the development of efficient and reliable probabilistic models that:
- allow measurements of uncertainty
- give an analytical measure of the confidence of the clustering result
- indicate the significance of a data point
- reflect temporal dependencies between the data points
In response, we propose an unsupervised conditional random fields (CRFs) model for clustering gene expression time series.

Goal
- Identify highly informative genes
- Cluster the genes in the dataset
- GO (Gene Ontology) analysis of the biological function of each cluster

HMMs and CRFs
HMMs are trained to maximize the joint probability of a set of observed data and their corresponding labels. Independence assumptions are needed for them to be computationally tractable, and representing long-range dependencies between genes and gene interactions is computationally infeasible. As a popular method for probabilistic sequence data modelling, Dynamic Bayesian Networks (DBNs) are trained to maximize the joint probability of a set of observed data and their corresponding labels \cite{dojer06applying, husmeier03sensitivity}. Hidden Markov Models (HMMs), a special case of DBNs, have previously been applied to sequence data modelling in many fields such as speech recognition. Both have been applied to gene expression time series clustering \cite{schliep05analyzing, ji03mining}. As generative models, both DBNs and HMMs assign a joint probability distribution $P(X,Y)$, where $X$ and $Y$ are random variables ranging over observation and label sequences respectively. To define such a probability they have to make independence assumptions, which can be problematic, in order to remain computationally tractable. Furthermore, in gene expression data modelling it is impossible for generative models to represent gene interactions and long-range dependencies between genes, as enumerating all possibilities is intractable.
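To make the distinction concrete, here is the standard generative-versus-discriminative formulation (textbook material, not taken from the slides):

```latex
% Generative models (DBNs, HMMs) fit the joint distribution and
% recover labels via Bayes' rule, which forces independence
% assumptions on X. Discriminative models (CRFs) fit the
% conditional directly and never have to model P(X).
\[
  P(X, Y) = P(Y)\,P(X \mid Y)
  \qquad\text{vs.}\qquad
  P(Y \mid X) = \frac{P(X, Y)}{\sum_{Y'} P(X, Y')}
\]
```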

Conditional Random Fields
CRFs are undirected graphical models that define a probability distribution over label sequences, globally conditioned on a set of observed features. Let $X$ be a random variable over the observations and $Y$ a random variable over the corresponding labels. When the nodes corresponding to the elements of $Y$ form a linear chain, the cliques are the edges and vertices of that chain. In the case of time series clustering, $X=\{x_{1}, x_{2}, \ldots, x_{n}\}$ is the set of observed sequences and $Y=\{y_{1}, y_{2}, \ldots, y_{n}\}$ is the set of corresponding labels; each sample $x_{i}$ has a voting pool $N_{i}$ containing observed data $x_{j}$ and class labels $y_{j}$. Let $G = (V, E)$ be an undirected graph such that $Y=(Y_{v}),\, v\in V$; if $Y$ obeys the Markov property with respect to the graph when conditioned on $X$, then $(X,Y)$ is a conditional random field.
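The slide's equation is not reproduced in the transcript; for reference, the standard CRF distribution from Lafferty, McCallum and Pereira (2001) is:

```latex
% Standard CRF distribution: f_k are feature functions over edges e,
% g_k over vertices v, with weights lambda_k and mu_k; Z(X) is the
% normalizing constant over all labellings.
\[
  P(Y \mid X) \;=\; \frac{1}{Z(X)}
  \exp\!\Big( \sum_{e \in E,\,k} \lambda_k\, f_k(e, Y|_e, X)
            + \sum_{v \in V,\,k} \mu_k\, g_k(v, Y|_v, X) \Big)
\]
```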

CRFs Model
The CRFs model of Eq. (1) can also be expressed in a Gibbs form in terms of cost functions $U^{c}_{i}(x_{i},x_{N_{i}}\mid y_{i},y_{N_{i}})$ and $U^{p}_{i}(y_{i}\mid y_{N_{i}})$, which are associated with the conditional probability and the prior of Eq. (1), respectively. Since the two cost functions depend on the same set of variables, properly integrating them yields a new model.

Cost function
The combined cost function $U_{i}$ is obtained by integrating the conditional cost $U^{c}_{i}$ and the prior cost $U^{p}_{i}$ above.
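The equations themselves are missing from the transcript; the following is only a plausible reconstruction of the Gibbs form from the cost-function names given above, not the paper's exact equation:

```latex
% Hedged reconstruction: a Gibbs form consistent with the slides'
% cost functions. U^c_i scores label y_i against the observations in
% the voting pool N_i, U^p_i acts as a label prior; Z_i is a local
% normalizing constant.
\[
  P(y_i \mid x_i, x_{N_i}, y_{N_i}) \;=\;
  \frac{1}{Z_i}\, \exp\!\big\{ -\,U_i(y_i) \big\},
  \qquad
  U_i(y_i) \;=\; U^{c}_{i}(x_i, x_{N_i} \mid y_i, y_{N_i})
             \;+\; U^{p}_{i}(y_i \mid y_{N_i})
\]
```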

Potential function
Real-valued potential functions are obtained and used to form the cost function. $D$ is an estimated threshold dividing the set of Euclidean distances into intra- and inter-class distances. Inferred from the graphical structure of the conditional random field, the potential function factorizes the joint distribution over $y$ by operating on pairs of dependent variables, that is, on the edges in $G$; the potential function $W_{i,j}$ is defined based on the Euclidean distance between samples $i$ and $j$.
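The exact form of $W_{i,j}$ is likewise not reproduced in the transcript; one simple form consistent with the description, offered purely as an assumption, is the signed distance relative to the threshold $D$:

```python
import numpy as np

def potential(x_i, x_j, D):
    """Hypothetical potential W_ij: the slide only says it is based on
    the Euclidean distance between samples i and j and a threshold D
    separating intra- from inter-class distances. A signed distance is
    one simple choice: negative (attractive) for pairs closer than D,
    positive (repulsive) for pairs further apart."""
    return np.linalg.norm(np.asarray(x_i) - np.asarray(x_j)) - D
```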

Finding the optimal labels
The optimal label $\hat{y}_{i}$ for a sample $i$ can be selected either stochastically or deterministically according to Eq. (2). In our work we adopt deterministic selection: picking a label corresponding to a large value of $P(\cdot)$ is equivalent to picking a label corresponding to a small value of $U_{i}(\cdot)$.
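Concretely, since maximizing $P(\cdot)$ is equivalent to minimizing $U_{i}(\cdot)$, the deterministic update is:

```latex
% Deterministic selection: maximizing the Gibbs probability is the
% same as minimizing the combined cost, so the update for sample i is
\[
  \hat{y}_i \;=\; \operatorname*{arg\,min}_{y_i}\; U_i(y_i)
\]
```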

Pre-processing
Linear warping is used for data alignment, and $\tau$-time-point data are transformed into a $(\tau-1)$-dimensional feature space: the differences between consecutive time points, scaled inversely to the time intervals, are used as features, as they reflect the temporal structure of the series. Pre-processing such as alignment and smoothing is necessary to remove variability in the timing of biological processes and random variation; alignment transforms gene expression time series that start at different cell cycle phases and occur on different time scales into comparable data. Before the first iteration of the labelling process, each sample is assigned a label randomly picked from the integer range $[1, n]$, where $n$ is the number of samples, so the algorithm starts with $n$ singleton clusters without the user specifying the number of clusters. Each sample then interacts with its randomly formed voting pool, which keeps one most similar sample, one most different sample and $k-2$ randomly selected samples, to find its own identity progressively. The rationale supporting this design is that the randomness of the voting pool facilitates global interactions in order to make local decisions.
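A minimal sketch of the feature transform as described, assuming the features are simply the consecutive differences divided by the interval lengths; the function name and example values are hypothetical:

```python
import numpy as np

def to_features(series, times):
    """Map a tau-point expression series to tau-1 features:
    differences between consecutive points scaled by the
    (possibly uneven) sampling intervals."""
    series = np.asarray(series, dtype=float)
    times = np.asarray(times, dtype=float)
    return np.diff(series) / np.diff(times)

# e.g. 4 time points at uneven intervals -> 3 features
print(to_features([0.1, 0.5, 0.4, 0.9], [0, 2, 3, 7]))
```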

Process
Initialization: each sample is assigned a random label, and voting pools are formed randomly. Samples then interact with each other via their voting pools progressively: update the labels, update the voting pools, and repeat until steady.
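A toy sketch of this loop under stated assumptions: the cost is a stand-in built from the signed-distance potential above (the paper's exact $U_i$ is not reproduced in the transcript), and `cluster_crf`, `k` and the median threshold are my own choices:

```python
import numpy as np

def cluster_crf(X, k=10, max_iter=100, seed=0):
    """Toy sketch of the voting-pool labelling loop (assumes n > k).
    The cost below is a simple stand-in for the paper's U_i."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.arange(n)                          # n singleton clusters
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    D = np.median(dist)                            # assumed intra/inter threshold
    for _ in range(max_iter):
        changed = False
        for i in range(n):
            others = np.delete(np.arange(n), i)
            # voting pool: most similar, most different, k-2 random
            pool = {others[np.argmin(dist[i, others])],
                    others[np.argmax(dist[i, others])]}
            pool |= set(rng.choice(others, size=k - 2, replace=False))
            pool = list(pool)
            best, best_cost = labels[i], np.inf
            for y in set(labels[pool]):
                members = [j for j in pool if labels[j] == y]
                cost = sum(dist[i, j] - D for j in members)  # stand-in U_i
                if cost < best_cost:
                    best, best_cost = y, cost
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:                            # steady: no label changed
            break
    return labels
```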

Experimental Validation
Both a biological dataset and simulated datasets are used, with the adjusted Rand index as the similarity measure between two partitions. It is widely accepted that an algorithm's accuracy should be tested on both biological and simulated datasets: simulated datasets are necessary because the biological meaning of real datasets is often unclear, and they provide more controllable conditions and a standard for benchmarking, though they have the disadvantage of overlooking the real process and losing important biological features. The yeast galactose dataset, consisting of gene expression measurements of galactose utilization in {\it Saccharomyces cerevisiae}, is used in our experiment: a subset of measurements of 205 genes whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings \cite{ashburner00gene}, with four repeated measurements across 20 time points. We evaluate the algorithm's accuracy by comparing the clustering result with the four functional categories as the ground truth.
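For instance, with scikit-learn (the labels here are made up for illustration; the adjusted Rand index is 1 for identical partitions and near 0 for random ones):

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: GO functional categories as ground truth
truth = [0, 0, 1, 1, 2, 2, 3, 3]
pred  = [1, 1, 0, 0, 2, 2, 3, 3]   # same partition, different label ids
print(adjusted_rand_score(truth, pred))  # 1.0 -- ARI ignores label naming
```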

Results for the yeast galactose dataset
[Figures: the four functional categories of the yeast galactose dataset; experimental results on the yeast galactose dataset.] The algorithm converges in 17 iterations. We obtained an average adjusted Rand index of 0.943 over 10 experiments, higher than the 0.7 reported by Tjaden et al. (2006).

Simulated Dataset
Following \cite{schliep05analyzing, medvedovic04bayesian}, we generate data for $n=400$ genes across $\tau=20$ time points from six artificial patterns modelling periodic, up-regulated and down-regulated gene expression profiles: four classes are generated from sine waves with frequency and phase randomness relative to each other, and two classes are generated from linear functions as in Eq. (7). High Gaussian noise $\varepsilon_{i}$ is added to all of the data. Perfect partitions are obtained within 10 iterations.
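A sketch of such a generator, with assumed frequencies, phase-jitter range and noise level, since Eq. (7) is not reproduced in the transcript:

```python
import numpy as np

rng = np.random.default_rng(0)
tau, per_class = 20, 400 // 6          # ~400 genes, 20 time points
t = np.linspace(0, 2 * np.pi, tau)

profiles, labels = [], []
# four periodic classes: sine waves with random phase jitter; the base
# frequencies and jitter range are assumptions
for c, f in enumerate((1.0, 1.5, 2.0, 2.5)):
    for _ in range(per_class):
        profiles.append(np.sin(f * t + rng.uniform(0, np.pi / 2)))
        labels.append(c)
# two monotone classes: up- and down-regulated linear trends
for c, slope in enumerate((1.0, -1.0), start=4):
    for _ in range(per_class):
        profiles.append(slope * np.linspace(-1, 1, tau))
        labels.append(c)

noise_sd = 0.5                          # "high" noise level, assumed
X = np.array(profiles) + rng.normal(0, noise_sd, (len(profiles), tau))
```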

Conclusions
A novel unsupervised Conditional Random Fields model for efficient and accurate gene expression time series clustering. All data points are randomly initialized, and the randomness of the voting pool facilitates global interactions.

Future work
- Various similarity measures
- Taking advantage of the information in repeated measurements
- Training and testing procedures