Predicting Gene Expression from Sequence. Michael A. Beer and Saeed Tavazoie. Cell 117, 185-198 (16 April 2004)

The Authors: Saeed Tavazoie (middle), Professor, Dept. of Molecular Biology. Mike Beer, Postdoctoral Researcher; Ph.D., Princeton (1995).

The Question: Transcription factor (TF) binding sites are relatively well characterized in Saccharomyces cerevisiae. However, the presence of a TF binding site alone is not sufficient to predict the expression of a gene, and multiple regulatory factors are often involved. How do you identify the elaborate rules for gene regulation?

Simple regulatory structures: Each possible combination of TFs must be tested in the lab, which is a hugely time-consuming task.

Problems with predicting gene regulation: Numerous transcription factors can bind to any one motif. Regulatory motif sequences have low consensus; e.g., the well-known “TATA box” has a consensus of TATA(A/T)A(A/T)(A/G). Many genes have multiple known motifs upstream of the ATG.
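
To make the "low consensus" point concrete, the TATA box consensus quoted above can be written as a regular expression, with each (X/Y) position becoming a character class. This is a minimal sketch; the function name and the example sequence are hypothetical and are not taken from the paper.

```python
import re

# TATA(A/T)A(A/T)(A/G) written as a regex. The lookahead (?=...) lets
# overlapping matches be reported as well.
TATA_CONSENSUS = re.compile(r"(?=TATA[AT]A[AT][AG])")

def find_tata_boxes(sequence):
    """Return the 0-based start position of every consensus match."""
    return [m.start() for m in TATA_CONSENSUS.finditer(sequence.upper())]

# Hypothetical promoter fragment, for illustration only.
example = "GGCTATAAAAGGCCTATATATGC"
print(find_tata_boxes(example))  # [3, 14]
```

The consensus admits 2 x 2 x 2 = 8 distinct 8-mers, which is one reason an exact-string search for a single "canonical" site misses real binding sites.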

Example of cis-regulatory logic: From Yuh et al. (1998), Science 279, 1896-1902

The Approach: 1. Using microarray expression data, the authors built clusters of genes with similar expression patterns. From brain expression data in Wen et al. (1998), PNAS 95, 334-339
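
One common way to decide that two genes have "similar expression patterns" is the Pearson correlation between their expression profiles. The sketch below is only an illustration of that similarity measure and is not the authors' clustering procedure; the gene names, the profiles, and the 0.8 threshold are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles.

    Raises ZeroDivisionError if either profile has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def genes_like(seed, profiles, threshold=0.8):
    """Return genes whose profile correlates with the seed gene's profile."""
    return sorted(
        gene for gene, profile in profiles.items()
        if gene != seed and pearson(profiles[seed], profile) >= threshold
    )

# Hypothetical log-ratio profiles across four conditions.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(genes_like("geneA", profiles))
```

Correlation ignores the absolute scale of the profiles, so geneB groups with geneA even though its values are roughly twice as large, while the anti-correlated geneC does not.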

The Approach, cont.: 2. Within each group of genes with similar expression patterns, the authors searched for consensus sequence motifs in the 800 bp upstream of the ATG.
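
The scanning half of this step can be sketched as follows: take up to 800 bp upstream of the ATG and record where a known motif occurs on either strand. This assumes a forward-strand gene and uses exact string matching as a simplification, whereas real motif finders score degenerate matches. The function names, the motif, and the sequence are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def upstream_region(chromosome, atg_index, length=800):
    """Up to `length` bases immediately 5' of the ATG (forward-strand gene)."""
    return chromosome[max(0, atg_index - length):atg_index]

def motif_hits(region, motif):
    """Return sorted (distance_to_ATG_bp, strand) for each exact occurrence."""
    region, motif = region.upper(), motif.upper()
    patterns = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # a palindromic motif would otherwise be counted twice
        patterns.append(("-", rc))
    hits = []
    for strand, pattern in patterns:
        pos = region.find(pattern)
        while pos != -1:
            # Distance is measured from the motif's 5' end to the ATG.
            hits.append((len(region) - pos, strand))
            pos = region.find(pattern, pos + 1)
    return sorted(hits)

# Hypothetical sequence with the ATG at index 18.
seq = "CCGATAAGCCCTTATCCCATG"
print(motif_hits(upstream_region(seq, 18), "GATAAG"))  # [(8, '-'), (16, '+')]
```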

The Approach, cont.:
3. The authors built a Bayesian network model using the TF sequence motifs as parent nodes and the expression data as data values.
4. The model can be applied to a gene of interest by identifying the upstream TF motifs for that gene and finding the model(s) that best fit those motifs.
5. If the gene's expression data fall within the parameters predicted by the model, there is a decent chance that the associated gene regulatory structure can be verified experimentally.
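
The idea of motifs as parent nodes can be illustrated with a single conditional probability table: for each presence/absence pattern of the parent motifs, estimate the probability that a gene belongs to an expression cluster. This is a deliberately reduced sketch, not the authors' model or their structure-learning procedure; the motif names, the training data, and the add-one smoothing are assumptions made for illustration.

```python
from collections import defaultdict

def learn_cpt(genes, motifs):
    """Estimate P(in cluster | motif presence pattern) from labelled genes.

    genes:  list of (set_of_motifs_present, in_cluster) pairs.
    motifs: ordered list of parent motif names.
    Add-one smoothing keeps estimates away from exactly 0 or 1.
    """
    counts = defaultdict(lambda: [0, 0])  # pattern -> [not in cluster, in cluster]
    for present, in_cluster in genes:
        pattern = tuple(m in present for m in motifs)
        counts[pattern][int(in_cluster)] += 1
    return {p: (c[1] + 1) / (c[0] + c[1] + 2) for p, c in counts.items()}

def predict(cpt, present, motifs, default=0.5):
    """Look up the probability for a new gene; unseen patterns get `default`."""
    return cpt.get(tuple(m in present for m in motifs), default)

# Hypothetical training genes: (motifs found upstream, member of the cluster?)
motifs = ["motif_A", "motif_B"]
training = (
    [({"motif_A", "motif_B"}, True)] * 3
    + [({"motif_A", "motif_B"}, False)]
    + [({"motif_A"}, False)] * 2
)
cpt = learn_cpt(training, motifs)
print(predict(cpt, {"motif_A", "motif_B"}, motifs))  # both motifs present
print(predict(cpt, {"motif_A"}, motifs))             # only motif_A present
```

In this toy data, having both motifs raises the estimated probability of cluster membership well above that for motif_A alone, which is the kind of combinatorial rule the model is meant to capture.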

Two examples from yeast: Both clusters have at least 10 genes each, and there is some confidence that genes with the same upstream TF motifs will exhibit the same expression pattern as these clusters.

Constructing the models: Using expression data from 30 microarrays, the authors identified 5547 genes with “significant” expression levels in yeast, and these data were used to construct 49 models of expression patterns.

Predictive accuracy: These 49 models were applied to five test sets of expression data, using only the upstream 800 bp region as input. The expression pattern was correctly predicted for 1898 of the 2587 genes in the test set(s). This amounts to 73% accuracy (random assignment would give 1/49, or about 2%).
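
The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers quoted on the slide; the function name is hypothetical, and the chance baseline assumes a uniform random guess among the 49 patterns.

```python
def accuracy_summary(correct, total, n_patterns):
    """Return (accuracy, chance level, fold improvement over chance)."""
    accuracy = correct / total
    chance = 1 / n_patterns  # uniform guess among the expression patterns
    return accuracy, chance, accuracy / chance

accuracy, chance, fold = accuracy_summary(1898, 2587, 49)
print(f"accuracy: {accuracy:.1%}")
print(f"chance:   {chance:.1%}")
print(f"fold improvement over chance: {fold:.1f}")
```

By this calculation the reported accuracy is roughly 36 times the uniform-guess baseline.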

Application to C. elegans: Given the larger amount of regulatory sequence in higher organisms, and the potential for more complex regulation, the authors had low expectations for applying this model to C. elegans. Using 2000 bp of upstream sequence and microarray expression data including Hill (2000), they were surprised to find that they could predict expression patterns for roughly half of the genes in the C. elegans dataset.

An example from C. elegans

Is it really so simple? Gene regulation involves a complex combinatorial dance of numerous factors aside from the presence or absence of TF binding sites. The authors have deliberately limited their scope to cis-acting upstream factors, ignoring regulatory elements in introns or downstream regions, as well as the effects of operons, alternative splicing, histone modifications, methylation, et cetera.

Model constraints: Several pieces of information were found to significantly improve the predictive accuracy of the models:
A. Motif orientation
B. Distance from the start codon
C. The particular order of the various TF binding sites
D. The presence of multiple copies of the same TF binding site
All of those factors were included in the model as priors.
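
The four constraints above can be read as features computed from the motif hits in a gene's upstream region. The sketch below shows one possible encoding, not the representation used in the paper; the function name, the 200 bp "near" threshold, and the example hits are hypothetical.

```python
from collections import Counter

def constraint_features(hits, near=200):
    """Summarise motif hits as orientation, distance, order, and copy number.

    hits: list of (motif_name, distance_to_ATG_bp, strand) tuples.
    near: hypothetical cutoff (bp) for calling a motif "close" to the ATG.
    """
    copies = Counter(name for name, _, _ in hits)
    closest = {}  # motif -> (distance, strand) of the copy nearest the ATG
    for name, distance, strand in hits:
        if name not in closest or distance < closest[name][0]:
            closest[name] = (distance, strand)
    return {
        "orientation": {n: closest[n][1] for n in closest},         # A
        "within_near": {n: closest[n][0] <= near for n in closest}, # B
        "order_from_atg": sorted(closest, key=lambda n: closest[n][0]),  # C
        "copies": dict(copies),                                     # D
    }

# Hypothetical hits for one gene.
hits = [("motif_A", 150, "+"), ("motif_A", 400, "-"), ("motif_B", 90, "-")]
features = constraint_features(hits)
print(features["order_from_atg"])  # ['motif_B', 'motif_A']
print(features["copies"])          # {'motif_A': 2, 'motif_B': 1}
```

Orientation and distance are taken from the copy nearest the ATG here, which is a design choice for the sketch; other encodings (for example, per-copy features) are equally plausible.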

Why is distance from the start codon significant? From Harbison et al (2004), Nature 431, 99-104

The number of copies of a TF binding site is relevant. From Molecular Biology of the Cell, 4th edition

Motif combinatorics and predictive accuracy: The order of the various TF binding sites is significant. Combinatoric models are more accurate than single-TF models (unless a gene is under the control of only one TF).

Future directions: Because of the sensitivity of the model(s), even a very small amount of ambiguity can yield junk results. For this reason, SAGE data are not particularly suitable, as only unique SAGE tags can be said to be unambiguous; this in turn excludes all sorts of potentially useful data. However, we could use the microarray-based predictions to pick gene regulatory structures to investigate.
