1
Drosophila modENCODE Data Integration
Manolis Kellis, on behalf of:
modENCODE Analysis Working Group (AWG)
modENCODE Data Analysis Center (DAC)
Broad Institute of MIT and Harvard
MIT Computer Science & Artificial Intelligence Laboratory
2
mod/ENCODE: (aka. everything you wanted to know about gene regulation but were afraid to ask)
This talk
[Figure placeholder on the slide: "Organism goes here"]
3
The challenge ahead
Understand regulatory logic specifying development
Binding sites of every developmental regulator
Sequence motifs for every regulator: GAF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel; CTCF, check
Annotations & images for all expression patterns
Expression domain primitives (Dorsal-Ventral, Anterior-Posterior) reveal underlying logic
4
The components of genomes and gene regulation
Goal: A systems-level understanding of genomes and gene regulation
  The regulators: TFs, GFs, miRNAs, their specificities
  The regions: enhancers, promoters, insulators
  The targets: individual regulatory motif instances
  The grammars: combinations predictive of tissue-specific activity
The parts list = Building blocks of gene regulation
Our tools: Comparative genomics & large-scale experimental datasets
  Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
  Chromatin signatures for integrating histone modification datasets
  Sequence signatures associated with TF binding, chromatin, dynamics
  Infer regulatory networks, their temporal and spatial dynamics
Integrate diverse datasets
5
Outline
Annotate regulatory regions
  Promoters, enhancers, insulators
Annotate chromatin states
  De novo learning of chromatin mark combinations
Predict TF/Chromatin binding
  Sequence -> TFs -> Chromatin -> Expression
Infer regulatory networks
  Integrate motifs, expression, chromatin
Predictive models of gene expression
  Chromatin/expression time-course
  Embryo expression domains
6
Annotate Regulatory Regions
Promoters, enhancers, insulators
7
1. Predict and classify promoter regions
Features: shape and intensity information
Datasets: positive, negative
Classification performance (AUC):

Time         Array   Seq     Array & Seq
E0-4hr       0.941   0.907   0.949
E4-8hr       0.908   0.924   0.935
E8-12hr      0.872   0.889   0.909
E12-16hr     0.912   0.923   0.936
E16-20hr     0.871, 0.903 (see note)
E20-24hr     0.876   0.892   0.913
L1           0.804   0.818   0.869
L2           0.844   0.832   0.877
L3           0.855   0.851   0.886
Pupae        0.847   0.850   0.883
AdultMale    0.843   0.806   0.866
AdultFemale  0.853, 0.901 (see note)

Note: for E16-20hr and AdultFemale only two of the three values survive in the source, and their column assignment is unclear.

Higher in earlier stages, lower for later stages
Predictions confirmed w/ TSS expression
[Figure: example prediction at a gene start, Score = 0.983; axis ticks 209, 3830, 5329, 6808, 8694, 11253]
Higher scores: broad promoters, Inr motif
[Figure panel labels: n = 3983, n = 1988, n = 300; p = 1.1e-33, p = 8.2e-07, p = 3.2e-17; n = 2889, n = 332]
Application: microRNAs, low-expression genes, new stages
Understand relationship between chromatin and expression
Chris Bristow
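For readers unfamiliar with the AUC figures reported on this slide, the following is a minimal sketch of how an AUC can be computed from classifier scores. It is not the classifier used in the talk; the function name `auc` and the score lists are hypothetical and chosen only for illustration.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen positive outscores a randomly chosen negative
    (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical promoter scores; these are NOT data from the talk.
positives = [0.9, 0.8, 0.7, 0.4]   # scores of known promoters
negatives = [0.6, 0.3, 0.2, 0.1]   # scores of background regions

print(auc(positives, negatives))
```

An AUC of 1.0 means every positive region outscores every negative one, and 0.5 corresponds to random ranking.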
8
2. Enhancer prediction from TFs/GFs/Chromatin
Combinations of features improve performance
Enrichment in individual features
Combinations of features across classes even stronger
Logistic regression classifier recovers known CRMs
Validation: in situ expression / motif enrichment
  Enhancers more likely near patterned genes
  Motifs strongly enriched in predicted enhancers
Rachel Sealfon, Chris Bristow
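The slide names a logistic regression classifier over combined feature classes. The sketch below shows the general technique on toy data and is not the pipeline used in the talk. The three binary features (TF binding, chromatin mark, motif), the labels, and the training settings are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=1.0, epochs=5000):
    """Fit weights and bias by full-batch gradient descent on the
    average logistic loss."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, x):
    """Predicted probability that region x is an enhancer."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical regions: [TF bound, chromatin mark present, motif present].
# Labels are invented: a region is "enhancer" when at least two features agree.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1],
     [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_logistic(X, y)
print(predict(w, b, [1, 1, 0]), predict(w, b, [1, 0, 0]))
```

In this toy setting no single feature separates the classes, but their combination does, which is the point the slide makes about combining features.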
9
3. Identify and classify insulator regions
Two classes of insulator regions with different proteins, different functions
A. Gene Boundaries. Class I insulators segregate gene promoters but not enhancer/promoter.
B. Chromatin Boundaries. Class I and II insulators are enriched at chromatin boundaries (here defined by H3K27me3 domains). But only Class I insulators are enriched at syntenic breakpoints, supporting their role as gene boundaries.
[Figure panels: H3K27 boundaries; divergent promoters; adjacent promoters; each comparing Class I and Class II]
Nicolas Negre, Casey Brown, Kevin White
10
Annotate Chromatin States
De novo learning of mark combinations
11
De novo chromatin states from mark combinations
Learn de novo significant combinations of chromatin marks
Reveal functional elements, even without looking at sequence
Use for genome annotation
Use for studying regulation dynamics in different cell types
[Figure: state groups: Promoter states; Transcribed states; Active Intergenic; Repressed]
Jason Ernst
12
Cartoon Illustration of ChromHMM
[Cartoon: DNA divided into 200bp intervals, with a Transcription Start Site, an Enhancer and a Transcribed Region; observed chromatin marks (K4me3, K36me3, K4me1, K27ac) above each interval; the most likely hidden state (1 to 6) below; example probabilities 0.8, 0.9, 0.7 for the high-probability chromatin marks in each state]
Observed chromatin marks: called based on a Poisson distribution
Each state: vector of emissions, vector of transitions
All probabilities are learned from the data
Jason Ernst
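The cartoon describes a hidden Markov model over 200bp intervals in which each state has a vector of emission probabilities (one per mark) and a vector of transition probabilities. The following is a minimal sketch of decoding the most likely state path with the Viterbi algorithm under those assumptions. It is not ChromHMM itself: the state names, all probabilities, and the choice of independent present/absent emissions per mark are hypothetical, and the parameters are fixed by hand here rather than learned from data.

```python
import math

MARKS = ["K4me3", "K4me1", "K36me3"]

# Hypothetical emission vectors: P(mark is present | state), one entry per mark.
EMIT = {
    "promoter":    [0.90, 0.30, 0.05],
    "enhancer":    [0.10, 0.80, 0.05],
    "transcribed": [0.05, 0.10, 0.80],
}
STATES = list(EMIT)
STAY = 0.8  # hypothetical probability of staying in the same state


def log_emission(state, obs):
    """Log-probability of a binary mark vector, marks treated as independent."""
    return sum(math.log(p if present else 1.0 - p)
               for p, present in zip(EMIT[state], obs))


def log_transition(prev, cur):
    if prev == cur:
        return math.log(STAY)
    return math.log((1.0 - STAY) / (len(STATES) - 1))


def viterbi(observations):
    """Most likely hidden state for each interval."""
    score = {s: math.log(1.0 / len(STATES)) + log_emission(s, observations[0])
             for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES,
                            key=lambda prev: score[prev] + log_transition(prev, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(cur, obs))
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# One binary vector per 200bp interval: (K4me3, K4me1, K36me3) present or absent.
observations = [(0, 1, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 1)]
print(viterbi(observations))
```

Working in log space avoids numerical underflow when the sequence of intervals is long.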
13
Each chromatin state associated w/ distinct function
Tentative annotations
Reveals several classes of promoters, enhancers
Distinct marks in transcripts, exons/introns, 5’/3’ UTRs
Distinguish inactive, repressed, heterochromatin
Jason Ernst, Gary Karpen
14
Transcriptional unit enrichment
Jason Ernst, Gary Karpen
15
States show distinct functional properties
[Figure panels: Chromatin marks; DV enhancers; AP enhancers; General TFs; Insulators; Replication; Motifs]
Jason Ernst, Gary Karpen
16
Predictive models of TF/Chromatin
Sequence -> TFs -> Chromatin -> Expression
17
1. TF binding prediction highly combinatorial
Many motifs enriched in binding of corresponding TF (diagonal)
However, extensive cross-enrichment suggests extensive cross-talk across binding of factors
Indeed, predictive power for binding increases with motif combinations
Both synergistic and antagonistic effects
[Heatmap: motif enrichment against transcription factor binding; fold-enrichment scale labels read "2-4" and "24" in the source, likely 2^-4 to 2^4]
Pouya Kheradpour, Rachel Sealfon
18
2. Combinations of TFs predictive of chromatin states
Apply ChromHMM to reveal TF combinations
Highly enriched in distinct chromatin states
  AP-state 60-fold enriched in enhancers
  Trx in enhancer states
  Polycomb states enriched for enhancers
  Ubiquitous genes enriched for multiple states
  BEAF/Chro in TSS for ubiquitous genes
  Strong Su(Hw) in Negative outside promoter states
[Heatmap: enrichment values per chromatin state; the cell values (ranging from 0.0 to 18.2) survive only as an unordered list and cannot be reassembled into a table]
Jason Ernst, Chris Bristow
19
3. Chromatin marks strong predictors of gene expression
Features: quantile levels; shape parameters: 5’, 3’ enrichment
SVM predictors
Gene expression level distribution largely bimodal
Task 1: Predict presence/absence: very strong
Task 2: Predict expression magnitude: somewhat
Peter Kharchenko, Peter Park
20
Inferring regulatory networks
Integrate motifs, expression, chromatin
21
1. Motif discovery pipeline for each TF / mark ChIP
Take top 400 ChIP-chip peaks by intensity
Randomly split regions into two partitions (random partition of regions #1, random partition of regions #2)
Motif discovery in peak centers ±200bp: Weeder, MEME, AlignACE, MDscan
Compendium of discovered motifs
Discovered motifs ranked by enrichment
  Enrichment of region #2 motifs in region #1
  Motif preferential conservation
Pipeline outperforms all methods (found in top 1, found in top 5)
Examples of motifs discovered: GAF, check; Mod(mdg4), novel; CP190, novel; CTCF, check
Pouya Kheradpour
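The held-out evaluation idea on this slide (discover motifs in one random partition, rank them by enrichment in the other) can be sketched as follows. This is an illustration only, not the pipeline from the talk. Motif discovery itself is left to external tools, so the candidate list is given. Exact string matching stands in for real motif scanning, and the sequences, the pseudocount, and all function names are hypothetical.

```python
import random


def split_regions(regions, seed=0):
    """Randomly split regions into two partitions of (nearly) equal size."""
    shuffled = regions[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def fraction_with_motif(sequences, motif):
    """Fraction of sequences containing the motif (exact match, a simplification)."""
    return sum(motif in s for s in sequences) / len(sequences)


def enrichment(motif, bound, control, pseudocount=0.01):
    """Fold enrichment of a motif in bound regions over control regions.
    The pseudocount (an assumption) avoids division by zero."""
    return ((fraction_with_motif(bound, motif) + pseudocount)
            / (fraction_with_motif(control, motif) + pseudocount))


def rank_motifs(candidates, held_out, control):
    """Rank candidate motifs by enrichment in regions not used for discovery."""
    return sorted(candidates,
                  key=lambda m: enrichment(m, held_out, control),
                  reverse=True)


# Toy sequences, NOT data from the talk.
bound = ["ACGAGAGTT", "GAGAGCCA", "TTGAGAGA", "CCCGAGAG"]
control = ["ACGTACGT", "TTTTTACG", "GGCCGGCC", "ATATATAT"]

discovery_set, held_out = split_regions(bound)
# In a real pipeline the candidates would come from running discovery
# tools on discovery_set; here they are simply given.
print(rank_motifs(["TTTTT", "GAGAG"], held_out, control))
```

Ranking on the held-out partition guards against motifs that only fit the regions they were discovered in.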
22
2. Motif target identification pipeline
Evolutionary signature of motif target
  Increased phylogenetic conservation
  Non-random compared to control motifs
Allow for motif movements
  Sequencing/alignment errors
  Loss, movement, divergence
Measure branch-length score (BLS)
  Sum evidence along branches
  Close species contribute little
Aim: from BLS to confidence
  Promoter/intron enrichment
  Recover in vivo bound sites
[Figure axes: Motif Confidence; # of motif instances]
Pouya Kheradpour, Alex Stark
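A branch-length score sums evidence along the branches of the species tree, so closely related species add little. One common formulation, assumed here, takes the total length of the minimal subtree connecting the species in which the motif instance is found, divided by the total length of the tree. The sketch below implements that formulation on a toy tree; the species labels and branch lengths are hypothetical and are not real evolutionary distances.

```python
# Hypothetical toy tree: child -> (parent, branch length).
TREE = {
    "dmel": ("anc1", 0.1),
    "dsim": ("anc1", 0.1),
    "anc1": ("root", 0.2),
    "dpse": ("root", 0.5),
}


def path_to_root(node):
    """Branches from a node up to the root, each named by its child node."""
    branches = []
    while node in TREE:
        branches.append(node)
        node = TREE[node][0]
    return branches


def branch_length_score(species_with_motif):
    """Length of the minimal subtree connecting the given species,
    as a fraction of the total tree length."""
    present = set(species_with_motif)
    total = sum(length for _, length in TREE.values())
    connected = 0.0
    for branch, (_, length) in TREE.items():
        # A branch lies on the connecting subtree when it separates
        # some of the present species from the others.
        below = sum(branch in path_to_root(sp) for sp in present)
        if 0 < below < len(present):
            connected += length
    return connected / total


print(branch_length_score({"dmel", "dsim"}))          # two close species
print(branch_length_score({"dmel", "dsim", "dpse"}))  # all three species
```

In this toy tree, conservation across the two close species alone covers only a small part of the tree, while adding the distant species covers all of it.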
23
Initial regulatory network for an animal genome
ChIP-grade quality
  Similar functional enrichment
  High sens., high spec.
Systems-level
  81% of Transc. Factors
  86% of microRNAs
  8k + 2k targets
  46k connections
Lessons learned
  Pre- and post- are correlated (hihi/lolo)
  Regulators are heavily targeted, feedback loop
Pouya Kheradpour, Sushmita Roy, Alex Stark
24
3. Data integration for improved network prediction
TF -> Target
Input features used:
  Conserved TF motif in target
  ChIP binding of TF in target
  TF/target co-chromatin marks
  TF/target co-expression
Training set: Edges found in REDfly network
Test set: Cross-validation
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias
25
Integration improves precision and recall
Comparison of integration methods
  Linear/logistic regression best, similar to each other; use logistic regression
Comparison of individual features
  Best: Evolutionarily-conserved motifs
  Next: chromatin time-course, ChIP-chip for TFs
  Next: chromatin cell-lines, expression data (RNA-seq and microarrays)
[Figure annotations: ~10% recovery at ~40% precision; ~60% recovery at ~20% precision]
Conclusion: Experimental datasets together dramatically improve performance
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias
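The precision and recovery (recall) figures quoted on this slide compare predicted regulatory edges against a set of known edges. A minimal sketch of that computation follows. The edge names, scores, and threshold are placeholders invented for illustration, not results from the talk.

```python
def precision_recall(scored_edges, known_edges, threshold):
    """Precision and recall of the edges scoring at or above the threshold."""
    predicted = {edge for edge, score in scored_edges.items() if score >= threshold}
    true_positives = len(predicted & known_edges)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known_edges) if known_edges else 0.0
    return precision, recall


# Hypothetical (TF, target) edges with classifier scores.
scored = {
    ("TF1", "geneA"): 0.9,
    ("TF2", "geneB"): 0.8,
    ("TF3", "geneC"): 0.7,
    ("TF4", "geneD"): 0.4,
    ("TF5", "geneE"): 0.2,
}
# Hypothetical gold-standard edges.
known = {("TF1", "geneA"), ("TF2", "geneB"), ("TF4", "geneD"), ("TF6", "geneF")}

precision, recall = precision_recall(scored, known, threshold=0.5)
print(precision, recall)
```

Lowering the threshold trades precision for recall, which is the trade-off behind the two operating points annotated on the slide.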
26
Predictive models of gene regulation
Chromatin/expression time-course; embryo expression domains
27
1. Chromatin time-course reveals stage regulators
abd-A motif is enriched in new H3K27me3 regions at L2
Coincides with a drop in the expression of abd-A
Model: sites gain H3K27me3 as abd-A binding is lost
Additional intriguing stories found, to be explored
[Figure: H3K27me3 time-course; axis: fold enrichment or over expression]
Pouya Kheradpour
28
2. Predicting changes in time-series expression
Integrate TF-target motif associations with time-course
Predict positive/negative regulators at each split
Notice: Adf1 targets appear positively then negatively regulated. Consistent with changes in Adf1 expression (not an input to model)
  Adf1 activator is ON (targets induced)
  Adf1 activator is OFF (targets not induced)
[Figure labels: Adf1 Trl Vnd Tin Abd-A Hmx CG11085 CG34031 En Mad Grh Btd Abd-B Ftz Antp … E2F gt3 Dref gt sna trl esg adf1 byn tin Inv Twi Kr vnd exex en h]
Jason Ernst
29
Predictive power of inferred network
Predict target expression as linear combination of TFs, fit w_i
[Figure: target Snail, stages 4 to 6, shown with its prediction and coefficients; labels: Embryo w0, bap w1, en w2, w3, w4, w5, Snail, Mef2, tin, twi]
White = correlation of individual TF image with target gene. Black = weight. Multi-variate regression coefficient for that target. White one for after regression combination.
Graphs: AUC. Also available L2 distance between reconstruction & Target. Intersection / Union.
Future: can motif grammars predict weights directly?
Charlie Frogner, Tom Morgan, Lorenzo Rosasco
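Predicting a target's expression as a linear combination of TF expression values, with weights w_0 (intercept) through w_n, amounts to a least-squares fit. The sketch below solves the normal equations directly on toy data. It is an illustration under assumptions, not the method used in the talk: the "pixel" intensities are invented, and the function name `fit_linear` is hypothetical.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.
    Returns [w0, w1, ..., wn], where w0 is the intercept."""
    rows = [[1.0] + list(x) for x in X]          # prepend intercept column
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(k)]


# Toy data: each row holds the intensities of two TFs at one embryo position.
# The target was generated as 0.5 + 2.0 * tf1 - 1.0 * tf2 (an invented rule).
tf_values = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
target = [0.5, 2.5, -0.5, 1.5, 3.5]

weights = fit_linear(tf_values, target)
print(weights)
```

A positive fitted weight corresponds to a TF whose expression pattern adds to the target pattern, and a negative weight to one that subtracts from it.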
30
Additional examples: striped, changing coeffs
[Figure: target slp1, stages 4 to 6, shown with its prediction and coefficients; labels: Embryo w0, w1 to w5, Trl, sna, hb, Mef2, prd, slp1]
[Figure: target Hunchback, stages 4 to 6, shown with its prediction and coefficients; labels: Embryo w0, w1 to w6, Adf1, sna, cad, twi, bcd, hb, pan]
White = correlation of individual TF image with target gene. Black = weight. Multi-variate regression coefficient for that target. White one for after regression combination.
Graphs: AUC. Also available L2 distance between reconstruction & Target. Intersection / Union.
Charlie Frogner, Tom Morgan, Lorenzo Rosasco
31
Outline
Annotate regulatory regions
  Promoters, enhancers, insulators
Annotate chromatin states
  De novo learning of chromatin mark combinations
Predict TF/Chromatin binding
  Sequence -> TFs -> Chromatin -> Expression
Infer regulatory networks
  Integrate motifs, expression, chromatin
Predictive models of gene expression
  Chromatin/expression time-course
  Embryo expression domains
32
The challenge ahead
Understand regulatory logic specifying development
Binding sites of every developmental regulator
Sequence motifs for every regulator: GAF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel; CTCF, check
Annotations & images for all expression patterns
Expression domain primitives (Dorsal-Ventral, Anterior-Posterior) reveal underlying logic
33
Drosophila modENCODE Analysis Group
AWG Fly modEncode Sue Celniker Brenton Graveley Steve Brenner Michael Brent Gary Karpen Sarah Elgin Mitzi Kuroda Vince Pirrotta Peter Park Peter Kharchenko Michael Tolstorukov Eric Bishop Kevin White Casey Brown Nicolas Negre Nick Bild Bob Grossman Eric Lai Nicolas Robine David MacAlpine Matthew Eaton Steve Henikoff Peter Bickel Ben Brown Lincoln Stein Group Suzanna Lewis Gos Micklem Nicole Washington EO Stinson Marc Perry Peter Ruzanov Chris Bristow Pouya Kheradpour Rachel Sealfon Jason Ernst Mike Lin Stefan Washietl Networks group Rogerio Candeias Daniel Marbach Patrick Meyer Sushmita Roy Image analysis Tom Morgan Charlie Frogner Lorenzo Rosasco
34
Worm Integrative analysis
Mark Gerstein and… A Agarwal, P Alves, B Arshinoff, R Auerbach, B Brown, A Carr, A Chateigner, C Cheng, N Cheung, T Down, X Feng, L Habegger, L Hillier, A Kanapin, T Liu, L Lochovsky, Z Lu, R Lyne, S Mackowiak, R Robilotto, J Rozowsky, H Shin, C Shou, S Taing, K Yan, K Yip, Z Zheng PIs & co-PIs who participated in the worm calls: J Ahringer, P Bickel, M Gerstein, K Gunsalus, S Kim, S Lewis, J Lieb, S Liu, G Micklem, D Miller, F Piano, N Rajewsky, V Reinke, M Snyder, L Stein, S Strome, R Waterston
35
Large-scale data accumulation
[Figure: number of datasets, annotated with: DAC creation; DAC meeting Boston; Integrative analysis]
Lincoln Stein, Gos Micklem, Data Coordination Center