Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University

Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University http:/genetics.med.harvard.edu/~patrik

Beyond Co-expression Clustering approaches rely on co-expression of genes under different conditions Assumes co-expression is caused by co-regulation We would like to do better than that: –Causal inference –What is regulating what?

Gene Network Inference

Overview Modeling Issues: –Level of biochemical detail –Boolean or continuous? –Deterministic or stochastic? –Spatial or non-spatial? Data Requirements Linear Models Nonlinear models Conclusions

Level of Biochemical Detail Detailed models require lots of data! Highly detailed biochemical models are only feasible for very small systems which are extensively studied Example: Arkin et al. (1998), Genetics 149(4):1633-48 lysis-lysogeny switch in Lambda: 5 genes, 67 parameters based on 50 years of research, stochastic simulation required supercomputer

Example: Lysis-Lysogeny Arkin et al. (1998), Genetics 149(4):1633-48

Level of Biochemical Detail In-depth biochemical simulation of e.g. a whole cell is infeasible (so far) Less detailed network models are useful when data is scarce and/or network structure is unknown Once network structure has been determined, we can refine the model

Boolean or Continuous? Boolean Networks ( Kauffman (1993), The Origins of Order ) assumes ON/OFF gene states. Allows analysis at the network-level Provides useful insights in network dynamics Algorithms for network inference from binary data A B C C = A AND B 0 1 0

Boolean or Continuous? Boolean abstraction is poor fit to real data Cannot model important concepts: –amplification of a signal –subtraction and addition of signals –compensating for smoothly varying environmental parameter (e.g. temperature, nutrients) –varying dynamical behavior (e.g. cell cycle period) Feedback control: negative feedback is used to stabilize expression  causes oscillation in Boolean model

Deterministic or Stochastic? Use of concentrations assumes individual molecules can be ignored Known examples (in prokaryotes) where stochastic fluctuations play an essential role (e.g. lysis- lysogeny in lambda) Requires stochastic simulation ( Arkin et al. (1998), Genetics 149(4):1633-48), or modeling molecule counts (e.g. Petri nets, Goss and Peccoud (1998), PNAS 95(12):6750-5) Significantly increases model complexity

Deterministic or Stochastic? Eukaryotes: larger cell volume, typically longer half- lives. Few known stochastic effects. Yeast: 80% of the transcriptome is expressed at 0.1-2 mRNA copies/cell Holstege, et al.(1998), Cell 95:717-728. Human: 95% of transcriptome is expressed at <5 copies/cell Velculescu et al.(1997), Cell 88:243-251

Spatial or Non-Spatial Spatiality introduces additional complexity: –intercellular interactions –spatial differentiation –cell compartments –cell types Spatial patterns also provide more data e.g. stripe formation in Drosophila: Mjolsness et al. (1991), J. Theor. Biol. 152: 429-454. Few (no?) large-scale spatial gene expression data sets available so far.

Overview Modeling Issues: –Level of biochemical detail –Boolean or continuous? –Deterministic or stochastic? –Spatial or non-spatial? Data Requirements Linear Models Nonlinear models Conclusions

Overview Modeling Issues Data Requirements: –Lower bounds from information theory –Effect of limited connectivity –Comparison with clustering –Variety of data points needed Linear Models Nonlinear models Conclusions

Lower Bounds from Information Theory How many bits of information are needed just to specify the connection pattern of a network? N 2 possible connections between N nodes  N 2 bits needed to specify which connections are present or absent O (N) bits of information per “data point”  O (N) data points needed

Effect of Limited Connectivity Assume only K inputs per gene (on average)  NK connections out of N 2 possible: possible connection patterns Number of bits needed to fully specify the connection pattern:  O (Klog(N/K)) data points needed

Comparison with clustering Use pairwise correlation comparisons as a stand-in for clustering As number of genes increases, number of false positives will increase as well  need to use more stringent correlation test If we want to use the same correlation cutoff value r, we need to increase the number of data points as N increases  O (log(N)) data points needed

Summary Fully connectedN(thousands) Connectivity KKlog(N/K)(hundreds?) Clusteringlog(N)(tens) Additional constraints reduce data requirements: –limited connectivity –choice of regulatory functions Network inference is feasible, but does require much more data than clustering

Variety of Data Points Needed To unravel regulation of a gene, need to sample many different combinations of its regulatory inputs (different environmental conditions and perturbations) Time series data yields dynamics, but all data points are related Steady-state data yields attractors, can sample state space more efficiently Both types of data will be needed, and multiple data sets of each

Overview Modeling Issues Data Requirements Linear Models: –Formulation –Underdetermined problem! –Solution 1: reduce N –Solution 2 Nonlinear models Conclusions

Linear Models Basic model: weighted sum of inputs Simple network representation: Only first-order approximation Parameters of the model: weight matrix containing NxN interaction weights “Fitting” the model: find the parameters w ji, b i such that model best fits available data w 23 g1 g2 g3 g4 g5 w 12 w 55 or

Underdetermined problem! Assumes fully connected network: need at least as many data points (arrays, conditions) as variables (genes)! Underdetermined (underconstrained, ill-posed) model: we have many more parameters than data values to fit No single solution, rather infinite number of parameter settings that will all fit the data equally well

Solution 1: reduce N Rather than trying to model all genes, we can reduce the dimensionality of the problem: Network of clusters: construct a linear model based on the cluster centroids –rat CNS data (4 clusters): Wahde and Hertz (2000), Biosystems 55, 1-3:129-136. –yeast cell cycle (15-18 clusters): Mjolsness et al.(2000), Advances in Neural Information Processing Systems 12; van Someren et al.(2000) ISMB2000, 355-366. Network of Principle Components: linear model between “characteristic modes” of the data Holter et al.(2001), PNAS 98(4):1693-1698.

Solution 2: Take advantage of additional information: –replicates –accuracy of measurements –smoothness of time series –… Most likely, the network will still be poorly constrained.  Need a method to identify and extract those parts of the model that are well-determined and robust

What’s next? Regulatory motifs: once we have identified the corresponding DNA binding proteins (transcription factors), we can start building the gene network from there Integration with other data: –transcription factors –functional annotation –known interactions in the literature –protein-protein interactions –protein expression levels –genetic data –...

Linking Regulatory Motifs to Expression Data Patrik D’haeseleer Harvard University http:/genetics.med.harvard.edu/~patrik

Introduction Gene expression is regulated by Transcription Factors (TFs), that bind to specific regulatory motifs in the promoter region of the gene. Synonyms: regulatory element, regulatory sequence, promoter elements, promoter motifs, (TF) binding site, operator (in prokaryotes), … Question: Do genes with similar expression patterns share regulatory motifs? TF regulatory motif DNA gene

1: Systematic Determination of Genetic Network Architecture Time-point 1 Time-point 3 Time-point 2 -1.8 -1.3 -0.8 -0.3 0.2 0.7 1.2 123 -2 -1.5 -0.5 0 0.5 1 1.5 123 -1.5 -0.5 0 0.5 1 1.5 123 Time -point Normalized Expression Normalized Expression Normalized Expression Tavazoie et al., Nature Genetics 22, 281 – 285 (1999)

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae. Search for Motifs in Promoter Regions

AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 Best Motif Found by AlignACE

MCB CLUSTER Number of ORFs Distance from ATG (b.p.) Number of sites MIPS Functional category (# ORFs) ORFs within category DNA synthesis and replication (82) Cell cycle control and mitosis (312) Recombination and DNA repair (84) Nuclear organization (720) 23 30 11 40 N=182

Systematic Determination of Genetic Network Architecture Tavazoie et al., Nature Genetics 22, 281–285 (1999) Most motifs found are highly selective for the cluster they were found in. Can find many known binding sites for transcription factors. Also finds many novel regulatory motifs, associated with specific functional categories. 1) cluster 2) identify regulatory motifs in clustered genes 3) identify TF’s that bind to those motifs  Gene regulation network

2: Regulatory Element Detection Using Correlation with Expression What is the contribution of each regulatory motif (or the TF that binds to that motif) to the expression level of the genes containing the motif? Given a set of known or putative regulatory motifs, identify all genes that contain the motif in their promoter region. For a single expression experiment (e.g. single point in a time series), is the presence of the motif correlated with the expression level of the genes? Perform multiple regression of (log) expression level on the presence/absence of the motifs. Plot contribution of motif throughout time series. Bussemaker et al., Nature Genetics 27, 167 – 174 (2001)

: the presence of motif 1 is correlated with the expression levels of the genes in which it appears : motif 2 is not correlated with expression levels of the genes in which it appears : motif 3 is negatively correlated with expression levels of the genes in which it appears Contribution of Motifs to Expression Levels

Linear Combination of Motif Contributions Find the most highly correlated motif. Determine its contribution F i to expression level by linear regression. Subtract its contribution from the expression levels. Find the next highest correlated motif. Repeat until no more significantly correlated motifs. Repeat this entire analysis for each time point of a time series  weights F i for the individual motifs will change throughout he time course.

Time Courses of Regulatory Signals We can think of the time-varying contributions F i of each motif as the Regulatory Signals of the transcription factors that bind to these motifs Time (minutes)

Regulatory Element Detection Using Correlation with Expression Bussemaker et al., Nature Genetics 27, 167–174 (2001) Can be used with known regulatory motifs, sets of putative motifs, and even exhaustively on the set of all motifs up to a certain length (n=7). Known motifs generally have high statistical significance. Allows us to infer regulatory inputs of (possibly unknown) transcription factors. Accounts for only 30% of total signal present in genome- wide expression patterns. Purely linear model: no synergistic effects between TF’s, cooperative binding, etc.

3: Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements Most transcription factors are thought to work in concert with other TF’s.  Synergistic effects Clustering: –a motif may occur in more than one cluster, because it may give rise to different expression patterns depending on its interaction partners. –several motifs may occur in the same cluster. Correlation with expression pattern: –by itself, a motif may not show a clear expression pattern. –contributions of multiple motifs may not be simply additive. Pilpel et al., Nature Genetics 29, 153–159 (2001)

51015 -2 0 2 EC=0.05 51015 -2 0 2 EC=0.05 Time Expression level SFF but not Mcm1Mcm1 but not SFF Mcm1 and SFF were not detected in Tavazoie et al Yet TFs that bind these motifs are known to interact in control of G2-genes (Nature. 2000 406:90-4.) Bussemaker et al found that these motifs are antagonistic. Synergy between Mcm1 and SFF in Cell Cycle Data Set

Expression Coherence and Synergy Expression Coherence (EC) score indicates how tightly clustered the expression profiles of a set of genes are. For every combination of N=2,3 motifs: 1) Calculate the expression coherence score of the genes that have the N motifs 2) Calculate the expression coherence score of genes that have every possible subset of N-1 motifs 3 )Test (statistically) the hypothesis that the score of the orfs with N motifs is significantly higher than that of orfs that have any sub set of N-1 motifs

The “Combinogram” Ho et al. Nature. 2002 Highly synergistic interaction between MCB and SFF Previously unknown Subsequently predicted via chromatin immuno- precipitation (ChIP) (cell cycle data)

Identifying Regulatory Networks by Combinatorial Analysis of Promoter Elements Pilpel et al., Nature Genetics 29, 153–159 (2001) Found several known and novel interactions between regulatory elements active in cell cycle, sporulation and stress response. Doesn’t assume a specific (e.g. linear) model of TF interactions. Combined with TF expression patterns, may allow us to infer a model of interaction.

Protein Networks Patrik D’haeseleer Harvard University http:/genetics.med.harvard.edu/~patrik

Yeast 2-Hybrid Assays “bait” fusion: “prey” fusion: AD BD Prot1 AD BD Prot2 Binding siteReporter gene AD BD Prot1 AD BD Prot2 AD BD Transcription Factor (e.g. Gal4) Binding site Reporter gene AD BD Prot1 AD BD Prot2 + Fields and Song, Nature 340:245-246 (1989) MATa MAT  Reconstituted active TF

Large-Scale 2-Hybrid Data Sets Uetz et al, Nature 403:623-627 (2000) –6000 x 192 protein pairs screened using protein array –nearly all 6000 x 6000 pairs, using pooled prey libraries –total of 957 putative interactions between 1004 proteins Ito et al, PNAS 98:4569-4574 (2001) –nearly all 6000 x 6000 pairs, using bait and prey pools –total of 4549 putative interactions between 3278 proteins –core set of 841 interactions between 797 proteins Surprisingly little overlap between the data sets, possibly indicating a large number of missed interactions (false negative).

MIPS 1546 Ito full 4475 Uetz 947 28 49 54 156 4242 1415 709 MIPS 1546 1436 756 648 21 28 61 109 Uetz 947 Ito core 806 Intersections between Protein Interaction Data Sets

Causes of False Positives Bait acts as activator Bait interacts with endogenous activator Prey binds to DNA Prey interacts with endogenous transcription factors Bait interacts with Activation Domain Prey interacts with DNA Binding Domain “Sticky” proteins (nonspecific binding) Changes in plasmid copy number Various other artifacts... AD BD Prot1 AD BD Prot2 Binding siteReporter gene

Yeast Protein-Protein Interaction Map Uetz, Schwikowski, Fields and co-workers; Ito and co-workers Each node is a protein Each line is an interaction 5560 putative interactions 3725 different proteins ~ 3 interactions / protein

Membrane Proteins Transcription Factors

- membrane protein - DNA-binding protein - all other yeast proteins - physical interaction between two proteins

Possible Paths from Ste3: 1045 different paths to 143 transcription factors Problem: How to Rank Possible Pathways? Ste12 Ste2/3

Average Pairwise Correlation Coefficient Among Pathway Members * Microarray Data Downloaded from Rosetta Inpharmatics STE3  AKR1  STE5  STE4  FAR1  CDC24  SOH1 0.190 STE3  AKR1  IQG1  CDC42  BEM4  RHO1  SKN7 0.059 STE3  AKR1  STE4  FAR1  FUS3  DIG2  STE12 0.281 STE3  AKR1  GCS1  YGL198W  SAS10  NET1-0.106 Rank Predicted Paths by Degree of Expression Correlation from Microarray Expreriments Known pathways often show correlated expression Known interacting proteins often show correlated expression

Classical View of MAPK Pathways adapted from C.Roberts, et al., Science, 287, 873 (2000)

The Protein Network View Highly interconnected, not just a linear pathway! Some proteins are missing from the protein interaction data sets (Cdc42, Ste20). Includes several additional proteins (especially Akr1, Kss1).

Conclusions Protein interaction data and expression data are both noisy. Combining them increases the accuracy. Can estimate protein interaction error rates by looking at consistency between data sets  probabilistic interaction model (work in progress). Pathways are far more interconnected than often portrayed. Can integrate various other forms of data: –co-localization of proteins –homology with known interacting proteins –“Rosetta Stone” method

Acknowledgments Roland Somogyi Stefanie Fuhrman Xiling Wen UNM: Stephanie Forrest Andreas Wagner David Peabody Barak Pearlmutter NCGR: Jason Stewart Pedro Mendes Harvard: Tzachi Pilpel Martin Steffen Allegra Petti John Aach George Church The Santa Fe Institute

Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University

Similar presentations

Presentation on theme: "Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University

Similar presentations

Presentation on theme: "Beyond Co-expression: Gene Network Inference Patrik D’haeseleer Harvard University"— Presentation transcript:

Similar presentations

About project

Feedback