Analyzing Time Series Gene Expression Data Ziv Bar-Joseph Center for Automated Learning and Discovery Carnegie Mellon University 06/2004 Analyzing Time Series Expression Data
Expression Experiments Time series: Multiple arrays at various temporal intervals Static: Snapshot of the activity in the cell 06/2004 Analyzing Time Series Expression Data
Abundance of time series expression datasets Over 30% of the 170 papers perform time series experiments. A total of 220 time series datasets. More arrays used for time series than for static expression experiments. 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data 06/2004 Analyzing Time Series Expression Data
Unique features of time series expression experiments Autocorrelation between successive points. Can identify complete set of acting genes. Allows to infer causality. 06/2004 Analyzing Time Series Expression Data
Time Series Examples: Development Development of fruit flies [Arbeitman, Science 02] 06/2004 Analyzing Time Series Expression Data
Time Series Examples (cont) Function Infectious diseases, response to external stimulus Interactions and Systems Transcription factors knockouts 06/2004 Analyzing Time Series Expression Data
Time Series Examples: Systems The cell cycle system in yeast [Simon et al, Cell 01] 06/2004 Analyzing Time Series Expression Data
Computational challenges Biological Networks dynamic regulatory networks information fusion function, response programs clustering temporal signals Pattern Recognition alignment, diff. expressed genes continuous representation, missing values, Data Analysis transcription, decay rates sampling rates, duration Experimental Design 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Sampling Rates Non uniform Differ between experiments 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Cell Cycle Datasets Dataset Method of arrest Duration Cell cycle length Sampling Repeats alpha (Spellman 98) alpha mating factor 0-119m 64m every 7 minutes 1 cdc15 (Spellman 98) temp. sensitive cdc15 10-290m 112m ev. 20m for 1 hr, ev. 10m for 3 hr, ev. 20m for final hr cdc28 (Cho98) temp. sensitive cdc28 0-160m 85m every 10 minutes fkh1/fkh2 knockout (Zhu00) 0-215m 105m every 15m until 165m then after 45m 2 yox1/yhp1 knockout (Pramila02) 0-120m 60m 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Networks Pattern Recognition Data Analysis Experimental Design 06/2004 Analyzing Time Series Expression Data
Representing time series expression data We are capturing a continuous process with a few samples. We need a way to convert our samples for each gene to an expression profile. Some simple techniques: - Linear interpolation - Spline interpolation - Functional assignment 06/2004 Analyzing Time Series Expression Data
Standard interpolation If we have missing values and noise linear interpolation will fail to reproduce an accurate representation. 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Splines Instead of linear interpolation, we can use splines: piecewise polynomials. Still, will overfit when faced with missing values and noise. 06/2004 Analyzing Time Series Expression Data
The power of co-expression We can modify our splines to take into account the fact that many genes are co-expressed. 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Avoiding overfitting Require that for each gene ~N(0, j) Add noise term 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Class Assignment In some cases the biological classes are known in advance. The algorithm can be modified and combined with a Gaussian mixture algorithm to perform clustering of the continuous representation of the expression data. Unlike previous algorithms, works on the continuous representation of the expression profile Performs much better than k-means on non uniformly sampled data 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Missing values 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Interpolation 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Alignment FKH1 cdc15 alpha cdc28 Difference in the timing of similar biological processes 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Continuous Alignment Using the estimated splines, we continuously align two expression datasets by minimizing a global error function Can be solved using numeric methods. Hard problem due to spline segments and the different durations. We also have an algorithm for removing genes that should not align RECOMB 2002 06/2004 Analyzing Time Series Expression Data
Identifying differentially expressed genes Wild Type Knockout Hard to perform manual comparison. Sampling rates and different timing prevent direct comparison. Local vs. global methods: Using the original samples is infisable since not on the same scale. Sampling the curves might lead to errors since does not take into account systematic biases Need a global measure and probability distribution Zhu et al, Nature 2000 06/2004 Analyzing Time Series Expression Data
Using Global Error to Determine Significance Key idea: Combine individual noise model with a global error (area between curves) that correctly captures the temporal difference between the two profiles. 06/2004 Analyzing Time Series Expression Data
Comparing the continuous representation WT Knockout 06/2004 Analyzing Time Series Expression Data
Enrichment for the Cell Cycle Factors 06/2004 Analyzing Time Series Expression Data
Overcoming population effects Smc3: observed values Microarray experiments profile population of cells. Initially cells are synchronized, but they lose their synchronization over time. Need to compensate for synchronization loss in order to recover single cell values. 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Networks Pattern Recognition Individual Gene Experimental Design 06/2004 Analyzing Time Series Expression Data
Pattern recognition and clustering Identifying relationships between genes based on expression profiles. Handling non uniform sampling rates. Determining relationships between clusters. 06/2004 Analyzing Time Series Expression Data
Time Shifted and Inverted Profiles Qian et al Journal of Molecular Biology 2001 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Results Simultaneous expression profile relationships: Inverted expression profile relationships: Time delayed expression profile relationships 06/2004 Analyzing Time Series Expression Data
Hierarchical clustering For n leaves there are n-1 internal nodes Each flip in an internal node creates a new linear ordering There are 2n-1 possible linear ordering of the leafs of the tree 3 4 5 4 5 3 1 2 06/2004 Analyzing Time Series Expression Data
Determine Relations Between Clusters Optimal leaf ordering: selects the ordering that maximizes the sum of the similarities of adjacent leaves in the clustering tree. Initial structure Permuted Hierarchical clustering 06/2004 Analyzing Time Series Expression Data
Results – Synthetic Data Hierarchical clustering Input Optimal ordering Input Optimal ordering Hierarchical clustering 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data 24 cell cycle experiments 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Short Time Series 60% of the time series datasets are short (<7). Over 40000 signals are measured, data is very noisy and experiments are compared across all time points. Most clustering algorithms will miss small sets, and in addition, cannot be used to compare datasets. 06/2004 Analyzing Time Series Expression Data
Taking advantage of the small number of points 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Networks Pattern Recognition Individual Gene Experimental Design 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Systems Biology Different types of data provide partial information about the activity in the cell. By integrating these data sources we can obtain a better picture of the activity in the cell. A lot of current interest – though relatively few methods construct temporal models. 06/2004 Analyzing Time Series Expression Data
Dynamic Bayesian Networks Bayesian networks are graphical models which can account for the stochastisity in the data. Can be extended to handle time series data (dynamic Bayesian networks). So far have been used for small scale modeling. 06/2004 Analyzing Time Series Expression Data
Modeling tryptophan metabolism on E. coli Ong et al Bioinformatics 2002 06/2004 Analyzing Time Series Expression Data
Genetic RegulAtory Modules (GRAM) Gene Modules: Set of genes that are co-regulated and co-expressed. Functional Module: Collection of gene modules with related function. 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Assembly of the Cell Cycle Transcriptional Regulatory Network Blue boxes: gene modules We combine GRAM with our continuous alignment algorithms to construct a dynamic model for a sub-network 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Assembly of the Cell Cycle Transcriptional Regulatory Network Blue boxes: gene modules Individual regulators: ovals, connected to their modules Dashed line: extends from module encoding a regulator to the regulator protein oval 06/2004 Analyzing Time Series Expression Data
Comparing the Continuous Representation WT Knockout 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Assembly of the Cell Cycle Transcriptional Regulatory Network Blue boxes: gene modules Individual regulators: ovals, connected to their modules Dashed line: extends from module encoding a regulator to the regulator protein oval 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Summary Time series expression data can be used to answer important biological questions. Pros: Autocorrelation, allows for casual inference, provides a better view of cellular activity Cons: Large number of signals but small number of time points, noise, lack of repeats By using methods specifically developed for this data we can overcome the above problems and take advantage of its unique properties 06/2004 Analyzing Time Series Expression Data
Analyzing Time Series Expression Data Want to know more ? Z. Bar-Joseph, “Analyzing time series gene expression data” Bioinformatics, in press. www.cs.cmu.edu/~zivbj zivbj@cs.cmu.edu 06/2004 Analyzing Time Series Expression Data