Modeling Dependencies in Protein-DNA Binding Sites 1 School of Computer Science & Engineering 2 Hadassah Medical School The Hebrew University, Jerusalem,

1 Modeling Dependencies in Protein-DNA Binding Sites
Yoseph Barash (1), Gal Elidan (1), Nir Friedman (1), Tommy Kaplan (1,2)
(1) School of Computer Science & Engineering
(2) Hadassah Medical School
The Hebrew University, Jerusalem, Israel

2 Dependent positions in binding sites
Most approaches assume position independence.
To model or not to model dependencies? [Man & Stormo 2001, Bulyk et al, 2002, Benos et al, 2002]
Pros: Biology suggests dependencies
- Single amino acid interacts with two nucleotides
- Change in conformation of protein or DNA
Cons: Modeling dependencies is harder
- Additional parameters
- Requires more data, not as robust
[Figure: a promoter upstream of a gene, with a binding site marked in the promoter]

3 Data-driven approach
- Can we learn dependencies from available genomic data?
- Do dependency models perform better?
Yes
Outline
- Flexible models of dependencies
- Learning from (un)aligned sequences
- Systematic evaluation
- Biological insights

4 How to model binding sites?
Each model represents a distribution of binding sites:
- Profile: Independence model
- Tree: Direct dependencies
- Mixture of Profiles: Global dependencies
- Mixture of Trees: Both types of dependencies
[Figure: the four models drawn as graphs over positions X1 to X5; the two mixture models have an additional node T]
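
The difference between the Profile and Tree models can be sketched in code. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, the toy two-position probabilities, and the use of log base 2 are all assumptions made for the example.

```python
import math

def profile_log_likelihood(site, profile):
    """Log2-probability of a site when every position is independent.

    profile[i][nt] is P(X_i = nt).
    """
    return sum(math.log2(profile[i][nt]) for i, nt in enumerate(site))

def tree_log_likelihood(site, parents, root_probs, cond_probs):
    """Log2-probability of a site under a tree of direct dependencies.

    parents[i] is the parent position of i, or None for a root.
    root_probs[i][nt] is P(X_i = nt) for root positions.
    cond_probs[i][parent_nt][nt] is P(X_i = nt | X_parent = parent_nt).
    """
    total = 0.0
    for i, nt in enumerate(site):
        parent = parents[i]
        if parent is None:
            total += math.log2(root_probs[i][nt])
        else:
            total += math.log2(cond_probs[i][site[parent]][nt])
    return total

# Hypothetical toy models over two positions (illustrative numbers only).
column = {"A": 0.4, "C": 0.4, "G": 0.1, "T": 0.1}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
profile = [column, column]
parents = [None, 0]                      # position 1 depends on position 0
root_probs = [column, None]
cond_probs = [None, {
    "A": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "C": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    "G": uniform,
    "T": uniform,
}]

profile_ll = profile_log_likelihood("AA", profile)
tree_ll = tree_log_likelihood("AA", parents, root_probs, cond_probs)
print(profile_ll, tree_ll)
```

In this toy setting the tree assigns a higher likelihood to "AA" because it can express that the second position tends to copy the first, which a profile cannot.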

5 Learning models: Aligned binding sites
Learning based on methods for probabilistic graphical models (Bayesian networks)
Aligned binding sites:
GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC
Learning machinery: select maximum likelihood model
[Figure: the aligned binding sites feed the learning machinery, which outputs the four model types]
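
One standard way to pick a maximum likelihood tree structure from aligned sites is the Chow-Liu procedure: compute the mutual information between every pair of positions and keep a maximum-weight spanning tree. The slide does not say which procedure was used, so the sketch below is an assumption, as are the function names and the pseudocount value. The sites are the ones listed on the slide.

```python
import math
from collections import Counter
from itertools import combinations

ALPHABET = "ACGT"

SITES = [
    "GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCGGGGT", "AAAGGGCCGGGC", "GGGAGGCCGGGA", "GCGGGGCGGGGC",
    "GAGGGGACGAGT", "CCGGGGCGGTCC", "ATGGGGCGGGGC",
]

def mutual_information(sites, i, j, pseudocount=0.5):
    """Mutual information (bits) between positions i and j.

    Pseudocounts smooth the joint table; marginals are derived from the
    same smoothed table, so the result is never negative.
    """
    joint = Counter((s[i], s[j]) for s in sites)
    total = len(sites) + pseudocount * len(ALPHABET) ** 2
    p = {(a, b): (joint[(a, b)] + pseudocount) / total
         for a in ALPHABET for b in ALPHABET}
    p_i = {a: sum(p[(a, b)] for b in ALPHABET) for a in ALPHABET}
    p_j = {b: sum(p[(a, b)] for a in ALPHABET) for b in ALPHABET}
    return sum(p[(a, b)] * math.log2(p[(a, b)] / (p_i[a] * p_j[b]))
               for a in ALPHABET for b in ALPHABET)

def chow_liu_tree(sites):
    """Edges (i, j, mi) of a maximum-weight spanning tree (Kruskal)."""
    length = len(sites[0])
    candidates = sorted(
        ((mutual_information(sites, i, j), i, j)
         for i, j in combinations(range(length), 2)),
        reverse=True,
    )
    root = list(range(length))           # union-find forest

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    edges = []
    for mi, i, j in candidates:
        ri, rj = find(i), find(j)
        if ri != rj:                     # edge joins two components
            root[ri] = rj
            edges.append((i, j, mi))
    return edges

edges = chow_liu_tree(SITES)
for i, j, mi in edges:
    print(f"X{i + 1} - X{j + 1}: {mi:.3f} bits")
```

With only eleven sites the estimated mutual information is noisy, which echoes the earlier caveat that dependency models need more data than profiles.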

6 Evaluation using aligned data
Estimate generalization of each model using cross-validation.
Test: how probable is the site given the model?
Data: 95 TFs with 20 binding sites from TRANSFAC database [Wingender et al, 2001]
[Figure: a data set of aligned sites is split into a training set and a test set; each test site is scored by its test log-likelihood (values shown range from -18.07 to -23.54); Test avg. LL = -20.77]
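
The evaluation loop can be sketched as follows for the simplest model, the profile. This is an illustrative sketch under assumptions: the number of folds `k`, the pseudocount, the fold assignment, and the use of log base 2 are not stated on the slide, and the example sites are the ones shown on the earlier slide about aligned binding sites.

```python
import math

SITES = [
    "GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCGGGGT", "AAAGGGCCGGGC", "GGGAGGCCGGGA", "GCGGGGCGGGGC",
    "GAGGGGACGAGT", "CCGGGGCGGTCC", "ATGGGGCGGGGC",
]

def learn_profile(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies with pseudocounts."""
    profile = []
    for i in range(len(sites[0])):
        counts = {nt: pseudocount for nt in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        profile.append({nt: c / total for nt, c in counts.items()})
    return profile

def log_likelihood(site, profile):
    """Log2-probability of one site under the profile."""
    return sum(math.log2(profile[i][nt]) for i, nt in enumerate(site))

def cross_validated_ll(sites, k=5):
    """Average test log-likelihood per site over k folds."""
    folds = [sites[i::k] for i in range(k)]
    scores = []
    for held_out in range(k):
        train = [s for f in range(k) if f != held_out for s in folds[f]]
        profile = learn_profile(train)
        scores.extend(log_likelihood(s, profile) for s in folds[held_out])
    return sum(scores) / len(scores)

avg_ll = cross_validated_ll(SITES, k=5)
print(f"Test avg. LL = {avg_ll:.2f}")
```

Swapping `learn_profile` and `log_likelihood` for another model's learning and scoring functions gives the same comparison for trees or mixtures.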

7 Arabidopsis ABA binding factor 1
Profile: Test LL per instance -19.93
Mixture of Profiles (components weighted 76% and 24%): Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)
Tree (figure shows positions X4 to X12): Test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
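
The fold-change figures follow from the log-likelihood differences if the logarithms are taken in base 2. The base is not stated on the slide; it is an assumption here, chosen because it agrees with the "> 2-fold" and "> 2.5-fold" annotations. A short check:

```python
def fold_change(delta_log2_likelihood):
    """Likelihood ratio implied by a difference of log2-likelihoods."""
    return 2.0 ** delta_log2_likelihood

profile_ll = -19.93   # test LL per instance, Profile
mixture_ll = -18.70   # test LL per instance, Mixture of Profiles
tree_ll = -18.47      # test LL per instance, Tree

mixture_fold = fold_change(mixture_ll - profile_ll)   # delta = +1.23
tree_fold = fold_change(tree_ll - profile_ll)         # delta = +1.46
print(round(mixture_fold, 2), round(tree_fold, 2))
```

Both values clear the thresholds quoted on the slide: the mixture exceeds 2-fold and the tree exceeds 2.5-fold.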

8 Likelihood improvement over profiles
TRANSFAC: 95 aligned data sets
Significant improvement in generalization: data often exhibits dependencies.
[Figure: fold-change in likelihood (log scale, 0.5 to 128) for each of the data sets, marked as significant or not significant by a paired t-test]

9 Evaluation for unaligned data
Motif finding problem
Input: A set of potentially co-regulated genes
Output: A common motif in their promoters
Sources of data:
- Gene annotation (e.g. Hughes et al, 2000)
- Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
- ChIP (e.g. Simon et al, 2001; Lee et al, 2002)

10 Learning models: unaligned data
Use EM algorithm to simultaneously
- Identify binding site positions
- Learn a dependency model
[Figure: EM loop over the unaligned data, alternating between "Learn a model" and "Identify binding sites", for each of the four model types]
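
The alternation between identifying sites and learning a model can be sketched with a small EM loop. To stay short, this sketch learns a plain profile rather than a dependency model, and it assumes exactly one site per sequence, a uniform background, and random initialization; none of these choices comes from the slides, and the function name and example sequences are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"

def em_motif(seqs, width, iters=50, pseudocount=0.5, seed=0):
    """EM for one motif occurrence per sequence (iters must be >= 1).

    Returns the learned profile and the most probable start per sequence.
    """
    rng = random.Random(seed)
    profile = []
    for _ in range(width):               # random starting profile
        weights = [rng.random() + 0.5 for _ in ALPHABET]
        norm = sum(weights)
        profile.append({nt: w / norm for nt, w in zip(ALPHABET, weights)})

    for _ in range(iters):
        # E-step: posterior over start positions in each sequence,
        # using the motif-to-background (0.25) likelihood ratio.
        posteriors = []
        for seq in seqs:
            log_scores = [
                sum(math.log(profile[i][seq[start + i]] / 0.25)
                    for i in range(width))
                for start in range(len(seq) - width + 1)
            ]
            top = max(log_scores)
            weights = [math.exp(x - top) for x in log_scores]
            norm = sum(weights)
            posteriors.append([w / norm for w in weights])

        # M-step: re-estimate the profile from expected counts.
        counts = [{nt: pseudocount for nt in ALPHABET} for _ in range(width)]
        for seq, post in zip(seqs, posteriors):
            for start, weight in enumerate(post):
                for i in range(width):
                    counts[i][seq[start + i]] += weight
        profile = [{nt: c / sum(col.values()) for nt, c in col.items()}
                   for col in counts]

    best = [max(range(len(post)), key=post.__getitem__)
            for post in posteriors]
    return profile, best

# Hypothetical promoter fragments, for illustration only.
seqs = ["ACGTTGGGGCGGAT", "TTGGGGCGGACGTA", "CATGGGGCGGTTAC"]
profile, starts = em_motif(seqs, width=7)
print(starts)
```

EM only finds a local optimum, so practical motif finders typically restart from several initial profiles and keep the best-scoring run.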

11 ChIP location analysis [Lee et al, 2002]
Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments
[Table: genes (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W; # genes ~ 6000) against TFs, each entry marking a gene as a target (+) or not (–). Rows shown: ABF1 Targets: + – + – ... + – – –; ZAP1 Targets: – + – – ... – + + –]

12 Example: Models learned for ABF1 (YPD)
ABF1: Autonomously replicating sequence-binding factor 1
[Figure: the known profile (from TRANSFAC), the learned profile, and the learned Mixture of Profiles, whose components are labeled 43 and 492]

13 Evaluating Performance
Detect target genes on a genomic scale:
[Figure: a promoter sequence ACGTAT…AGGGATGCGAGC; position labels shown: -10000 -473]

14 Evaluating Performance
Detect target genes on a genomic scale:
Example: Gal4 regulates Gal80
[Figure: p-value (log scale) of candidate sites against promoter position (-180 to -60) for the Profile and Mixture of Trees models, with a line at the Bonferroni corrected p-value of 0.01; the biologically verified site is marked]

15 Evaluation using ChIP location data [Lee et al, 2002]
Evaluate using a 5-fold cross-validation test:
[Figure: the data set of genes (YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W) with +/– target labels is split into a training part and a test set; the learned model gives a +/– prediction for each test gene]

16 Evaluation using ChIP location data [Lee et al, 2002]
Evaluate using a 5-fold cross-validation test:
[Figure: the +/– prediction for each gene in the data set is compared with its true label; disagreements are marked as false negatives (FN) and false positives (FP)]

17 Example: ROC curve of HSF1
[Figure: ROC curves for Profile, Mixture of Profiles, Tree and Mixture of Trees; y-axis: True Positive Rate (Sensitivity), 0% to 90%; x-axis: False Positive Rate, 0% to 5%; annotation: ~60 FP]

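18 Improvement in sensitivity & specificity
105 unaligned data sets from Lee et al.
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: overlapping True and Predicted gene sets with intersection TP; a plot of the improvement with counts shown as: 30 615 3]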
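
The two measures defined on this slide are simple set ratios. Note that "specificity" as defined here (TP / Predicted) is the quantity more commonly called precision or positive predictive value. A minimal sketch, in which the function name and the example gene sets are hypothetical:

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Return (TP / True, TP / Predicted) for two collections of genes."""
    true_set = set(true_targets)
    predicted_set = set(predicted_targets)
    tp = len(true_set & predicted_set)
    sensitivity = tp / len(true_set) if true_set else 0.0
    specificity = tp / len(predicted_set) if predicted_set else 0.0
    return sensitivity, specificity

# Hypothetical labels, for illustration only.
true_targets = {"YAL001C", "YAL003W", "YAL005C"}
predicted_targets = {"YAL001C", "YAL002W"}
sens, spec = sensitivity_specificity(true_targets, predicted_targets)
print(sens, spec)
```

Here one of three true targets is recovered and one of two predictions is correct.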

19 Improvement in sensitivity & specificity: Mixture of Profiles vs. Profile
105 unaligned data sets from Lee et al.
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: Δ specificity against Δ sensitivity for each data set, with counts shown as: 52 1718 0]

20 Improvement in sensitivity & specificity: Mixture of Trees vs. Profile
105 unaligned data sets from Lee et al.
Sensitivity = TP / True
Specificity = TP / Predicted
[Figure: Δ specificity against Δ sensitivity for each data set, with counts shown as: 84 162 1]

21 Is it worthwhile to model dependencies?
Evaluation clearly supports this.
What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)

22 Distance between dependent positions
Tree models learned from the aligned data sets
[Figure: distribution of the distance between dependent positions; annotation: < 1/3 of the dependencies]

23 Structural families
Dependency models vs. Profile on aligned data sets
[Figure: fold-change in likelihood (log scale, 0.5 to 128) per data set, marked as significant or not significant by a paired t-test, and grouped by structural family: Zinc finger, bZIP, bHLH, Helix Turn Helix, β Sheet, others, ???]

24 Conclusions
- Flexible framework for learning dependencies
- Dependencies are found in many cases
- It is worthwhile to model them: better learning and binding site prediction
http://compbio.cs.huji.ac.il/TFBN
Future work
- Link to the underlying structural biology
- Incorporate as part of other regulatory mechanism models

