Drosophila modENCODE Data Integration


Drosophila modENCODE Data Integration
Manolis Kellis, on behalf of the modENCODE Analysis Working Group (AWG) and the modENCODE Data Analysis Center (DAC)
Broad Institute of MIT and Harvard; MIT Computer Science & Artificial Intelligence Laboratory

mod/ENCODE (a.k.a. everything you wanted to know about gene regulation but were afraid to ask)

The challenge ahead
Understand regulatory logic specifying development
Binding sites of every developmental regulator
Sequence motifs for every regulator
Annotations & images for all expression patterns
Motif examples: GAF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel; CTCF, check
Expression domain primitives reveal underlying logic (Dorsal-Ventral, Anterior-Posterior)

The components of genomes and gene regulation
Goal: a systems-level understanding of genomes and gene regulation
The regulators: TFs, GFs, miRNAs, and their specificities
The regions: enhancers, promoters, insulators
The targets: individual regulatory motif instances
The grammars: combinations predictive of tissue-specific activity
-> The parts list = building blocks of gene regulation
Our tools: comparative genomics and large-scale experimental datasets
Evolutionary signatures for promoter/enhancer/3' UTR motif annotation
Chromatin signatures for integrating histone modification datasets
Sequence signatures associated with TF binding, chromatin, dynamics
Infer regulatory networks and their temporal and spatial dynamics
-> Integrate diverse datasets

Outline
Annotate regulatory regions: promoters, enhancers, insulators
Annotate chromatin states: de novo learning of chromatin mark combinations
Predict TF/Chromatin binding: Sequence -> TFs -> Chromatin -> Expression
Infer regulatory networks: integrate motifs, expression, chromatin
Predictive models of gene expression: chromatin/expression time-course; embryo expression domains

Annotate Regulatory Regions Promoters, enhancers, insulators

1. Predict and classify promoter regions
Features: shape and intensity information
Classification performance (AUC) by stage and dataset:

Time         Array   Seq     Array & Seq
E0-4hr       0.941   0.907   0.949
E4-8hr       0.908   0.924   0.935
E8-12hr      0.872   0.889   0.909
E12-16hr     0.912   0.923   0.936
E16-20hr     0.871, 0.903 (only two of the three values survive; columns unclear)
E20-24hr     0.876   0.892   0.913
L1           0.804   0.818   0.869
L2           0.844   0.832   0.877
L3           0.855   0.851   0.886
Pupae        0.847   0.850   0.883
AdultMale    0.843   0.806   0.866
AdultFemale  0.853, 0.901 (only two of the three values survive; columns unclear)

Higher in earlier stages, lower for later stages
Predictions confirmed with TSS expression (example: score = 0.983 at a gene start)
Higher scores: broad promoters, Inr motif (figure annotations: n = 3983, 1988, 300, 2889, 332; p = 1.1e-33, 8.2e-07, 3.2e-17)
Application: microRNAs, low-expression genes, new stages
Understand relationship between chromatin and expression
Chris Bristow
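
The slide reports classifier performance as AUC. As a reminder of what that number measures, here is a minimal sketch that computes AUC as the probability that a randomly chosen positive region outscores a randomly chosen negative one. The scores and labels are invented for illustration and are not modENCODE data; the function name `auc` is also just a placeholder.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive outscores a random
    negative, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical promoter scores (1 = annotated promoter, 0 = background).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the per-stage values above should be read.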

2. Enhancer prediction from TFs/GFs/Chromatin
Combinations of features improve performance
Enrichment in individual features
Validation: in situ expression / motif enrichment
Logistic regression classifier recovers known CRMs
Combinations of features across classes even stronger
Enhancers more likely near patterned genes
Motifs strongly enriched in predicted enhancers
Rachel Sealfon, Chris Bristow

3. Identify and classify insulator regions
Two classes of insulator regions, with different proteins and different functions
Gene boundaries: Class I insulators segregate gene promoters but not enhancer/promoter pairs.
Chromatin boundaries: Class I and Class II insulators are both enriched at chromatin boundaries (here defined by H3K27me3 domains), but only Class I insulators are enriched at syntenic breakpoints, supporting their role as gene boundaries.
Figure panels (Class I vs. Class II): H3K27 boundaries; divergent promoters; adjacent promoters
Nicolas Negre, Casey Brown, Kevin White

Annotate Chromatin States De novo learning of mark combinations

De novo chromatin states from mark combinations
Learn de novo significant combinations of chromatin marks
Reveal functional elements, even without looking at sequence
Use for genome annotation
Use for studying regulation dynamics in different cell types
State groups shown: Promoter states; Transcribed states; Active Intergenic; Repressed
Jason Ernst

Cartoon illustration of ChromHMM
DNA is divided into 200 bp intervals
Observed chromatin marks (K4me3, K36me3, K4me1, K27ac), called based on a Poisson distribution
Most likely hidden state for each interval (cartoon regions: transcription start site, enhancer, transcribed region)
Each state: vector of emissions, vector of transitions
High-probability chromatin marks in each state (example probabilities in the cartoon: 0.8, 0.9, 0.7)
All probabilities are learned from the data
Jason Ernst
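
The cartoon describes a hidden Markov model in which each state has a vector of emission probabilities (one per mark) and a vector of transition probabilities, and each 200 bp interval is assigned its most likely hidden state. A minimal sketch of that decoding step is given below, assuming binary mark calls and independent Bernoulli emissions per mark. This is not ChromHMM's code: the function name, the two-state model and every probability are made up for illustration, whereas in the real setting the parameters are learned from the data.

```python
import math


def viterbi(obs, init, trans, emit):
    """Most likely state path for a sequence of binary mark vectors.

    obs[t]      tuple of 0/1 calls, one per mark, for interval t
    init[k]     P(first interval is in state k)
    trans[j][k] P(next state is k | current state is j)
    emit[k][m]  P(mark m is present | state k), strictly between 0 and 1
    """
    n_states = len(init)

    def log_emit(k, marks):
        total = 0.0
        for m, present in enumerate(marks):
            p = emit[k][m]
            total += math.log(p if present else 1.0 - p)
        return total

    score = [math.log(init[k]) + log_emit(k, obs[0]) for k in range(n_states)]
    back = []  # back[t][k] = best predecessor of state k at step t + 1
    for marks in obs[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: score[j] + math.log(trans[j][k]))
            pointers.append(best_j)
            new_score.append(score[best_j] + math.log(trans[best_j][k])
                             + log_emit(k, marks))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical two-state model over two marks (K4me3, K36me3).
# State 0 is promoter-like, state 1 is transcribed-like.
toy_init = [0.5, 0.5]
toy_trans = [[0.8, 0.2], [0.2, 0.8]]
toy_emit = [[0.9, 0.1], [0.1, 0.8]]
toy_obs = [(1, 0), (1, 0), (0, 1), (0, 1)]  # four 200 bp intervals
toy_path = viterbi(toy_obs, toy_init, toy_trans, toy_emit)
```

Working in log space avoids numerical underflow on long chromosomes, which is why the sketch sums logarithms rather than multiplying probabilities.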

Each chromatin state associated with a distinct function
Tentative annotations
Reveals several classes of promoters and enhancers
Distinct marks in transcripts, exons/introns, 5'/3' UTRs
Distinguish inactive, repressed, heterochromatin
Jason Ernst, Gary Karpen

Transcriptional unit enrichment
Jason Ernst, Gary Karpen

States show distinct functional properties
Figure panels: chromatin marks; DV enhancers; AP enhancers; general TFs; insulators; replication; motifs
Jason Ernst, Gary Karpen

Predictive models of TF/Chromatin
Sequence -> TFs -> Chromatin -> Expression

1. TF binding prediction highly combinatorial
Many motifs enriched in binding of corresponding TF (diagonal)
However, extensive cross-enrichment suggests extensive cross-talk across binding of factors
Indeed, predictive power for binding increases with motif combinations
Both synergistic and antagonistic effects
Figure: transcription factor binding vs. motif enrichment (color scale: fold enrichment)
Pouya Kheradpour, Rachel Sealfon

2. Combinations of TFs predictive of chromatin states
Apply ChromHMM to reveal TF combinations
Highly enriched in distinct chromatin states
AP-state 60-fold enriched in enhancers
Trx in enhancer states
Polycomb states enriched for enhancers
Ubiquitous genes enriched for multiple states
BEAF/Chro in TSS for ubiquitous genes
Strong Su(Hw) in Negative outside promoter states
(Figure: heatmap of enrichment values, not reproduced in this transcript)
Jason Ernst, Chris Bristow

3. Chromatin marks strong predictors of gene expression
Quantile levels; shape parameters: 5', 3' enrichment
SVM predictors
Gene expression level distribution largely bimodal
Task 1: Predict presence/absence: very strong
Task 2: Predict expression magnitude: somewhat
Peter Kharchenko, Peter Park

Inferring regulatory networks Integrate motifs, expression, chromatin

1. Motif discovery pipeline for each TF / mark ChIP
Take top 400 ChIP-chip peaks by intensity
Randomly split regions into two partitions (#1 and #2)
Motif discovery in peak centers ±200bp (Weeder, MEME, AlignACE, MDscan)
Compendium of discovered motifs
Discovered motifs ranked by enrichment: enrichment of region #2 motifs in region #1
Motif preferential conservation
Pipeline outperforms all methods (found in top 1, found in top 5)
Examples of motifs discovered: GAF, check; Mod(mdg4), novel; CP190, novel; CTCF, check
Pouya Kheradpour
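
The key idea in this pipeline is that motifs found in one random partition of the peaks are scored by their enrichment in the other partition, so a motif cannot rank well merely by overfitting the regions it was discovered in. The sketch below illustrates only that split-and-score step. It assumes a motif can be represented as a regular expression, uses a pseudocount that is not taken from the slides, and does not call any of the discovery tools named above; all sequences are invented.

```python
import random
import re


def split_regions(regions, seed=0):
    """Randomly split a list of regions into two halves."""
    rng = random.Random(seed)
    shuffled = list(regions)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def fraction_with_motif(pattern, sequences):
    """Fraction of sequences containing at least one match."""
    regex = re.compile(pattern)
    return sum(1 for s in sequences if regex.search(s)) / len(sequences)


def fold_enrichment(pattern, bound, control, pseudo=0.01):
    """Fold enrichment of a motif in bound vs. control sequences.

    The pseudocount (an assumption of this sketch) keeps the ratio
    finite when the motif never occurs in the control set.
    """
    return ((fraction_with_motif(pattern, bound) + pseudo)
            / (fraction_with_motif(pattern, control) + pseudo))


# Invented held-out peak sequences and control sequences.
toy_bound = ["ACGAGAGT", "TTGAGAGC", "GAGAGAAA", "CCCCGAGAG"]
toy_control = ["ACGTACGT", "TTTTCCCC", "GAGAGTTT", "AAAACCCC"]
toy_fold = fold_enrichment("GAGAG", toy_bound, toy_control)

part1, part2 = split_regions(toy_bound)
```

In the pipeline described on the slide, discovery would run on one partition and `fold_enrichment` would be evaluated on the other, with the discovered motifs then ranked by that held-out score.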

2. Motif target identification pipeline
Evolutionary signature of motif targets
Increased phylogenetic conservation
Non-random compared to control motifs
Allow for motif movements
Sequencing/alignment errors
Loss, movement, divergence
Measure branch-length score (BLS)
Sum evidence along branches
Close species contribute little
From BLS to confidence
Promoter/intron enrichment
Recover in vivo bound sites
Figure axes: motif confidence vs. number of motif instances
Pouya Kheradpour, Alex Stark
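
The slide says the branch-length score sums evidence along branches, so that closely related species add little. One way to make that concrete, assumed here rather than taken from the slides, is to define the score as the branch length of the smallest subtree connecting all species in which the motif instance is found, divided by the total branch length of the tree. The sketch below implements that definition on a toy tree; the topology is the usual ((D. melanogaster, D. simulans), D. yakuba) arrangement, but the branch lengths are invented, and the motif movement and confidence steps of the pipeline are not modeled.

```python
def branch_length_score(tree, present):
    """Fraction of total branch length spanned by species with a match.

    tree     dict mapping each non-root node to (parent, branch_length)
    present  set of leaf names where the motif instance was found
    """
    total = sum(length for _, length in tree.values())
    if len(present) < 2 or total == 0:
        return 0.0

    # Count how many matching leaves sit at or below each node.
    below = {}
    for leaf in present:
        node = leaf
        below[node] = below.get(node, 0) + 1
        while node in tree:
            node = tree[node][0]
            below[node] = below.get(node, 0) + 1

    # A branch lies on the connecting subtree exactly when some, but not
    # all, of the matching leaves are beneath it.
    spanned = sum(length for child, (_, length) in tree.items()
                  if 0 < below.get(child, 0) < len(present))
    return spanned / total


# Toy tree with invented branch lengths.
toy_tree = {
    "anc_mel_sim": ("root", 0.1),
    "D.mel": ("anc_mel_sim", 0.2),
    "D.sim": ("anc_mel_sim", 0.2),
    "D.yak": ("root", 0.3),
}
bls_close = branch_length_score(toy_tree, {"D.mel", "D.sim"})
bls_far = branch_length_score(toy_tree, {"D.mel", "D.yak"})
bls_all = branch_length_score(toy_tree, {"D.mel", "D.sim", "D.yak"})
```

In this toy tree, conservation between the two close species spans half of the total branch length, while conservation that reaches the more distant species spans three quarters.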

Initial regulatory network for an animal genome
ChIP-grade quality: similar functional enrichment; high sensitivity, high specificity
Systems-level: 81% of transcription factors; 86% of microRNAs; 8k + 2k targets; 46k connections
Lessons learned: pre- and post- are correlated (hi/hi, lo/lo); regulators are heavily targeted, feedback loop
Pouya Kheradpour, Sushmita Roy, Alex Stark

3. Data integration for improved network prediction
Input features used (TF -> target):
Conserved TF motif in target
ChIP binding of TF in target
TF/target co-chromatin marks
TF/target co-expression
Training set: edges found in REDfly network
Test set: cross-validation
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias

Integration improves precision and recall
Comparison of integration methods: linear/logistic regression best, similar to each other -> use logistic regression
~10% recovery at ~40% precision
~60% recovery at ~20% precision
Comparison of individual features, by predictive power:
Best: evolutionarily conserved motifs
Next: chromatin time-course, ChIP-chip for TFs
Next: chromatin cell lines, expression data (RNA-seq and microarrays)
Conclusion: experimental datasets together dramatically improve performance
Daniel Marbach, Sushmita Roy, Patrick Meyer, Rogerio Candeias
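
The slides settle on logistic regression to combine the four evidence types for each candidate TF -> target edge and report results as precision and recall. A minimal sketch of that combination is shown below, fitted by plain gradient descent on a tiny invented data set with binary features in the order (conserved motif, ChIP binding, co-chromatin marks, co-expression). The learning rate, step count and data are assumptions of this sketch. The numbers are computed on the training edges themselves, whereas the slides evaluate by cross-validation against REDfly edges.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * k, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(k):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return w, b


def predict(w, b, X):
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def precision_recall(scores, y, threshold=0.5):
    """Precision and recall of edges scoring at or above the threshold."""
    called = [s >= threshold for s in scores]
    tp = sum(1 for c, yi in zip(called, y) if c and yi == 1)
    fp = sum(1 for c, yi in zip(called, y) if c and yi == 0)
    fn = sum(1 for c, yi in zip(called, y) if not c and yi == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented candidate edges: 1 = known edge, 0 = not a known edge.
toy_X = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 1],
         [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
toy_y = [1, 1, 1, 0, 0, 0]
toy_w, toy_b = fit_logistic(toy_X, toy_y)
toy_precision, toy_recall = precision_recall(predict(toy_w, toy_b, toy_X), toy_y)
```

Sweeping `threshold` from high to low traces out the precision/recall trade-off, which is how operating points such as those quoted on the slide would be read off.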

Predictive models of gene regulation Chromatin/expression timecourse Embryo expression domains

1. Chromatin time-course reveals stage regulators
abd-A motif is enriched in new H3K27me3 regions at L2
Coincides with a drop in the expression of abd-A
Model: sites gain H3K27me3 as abd-A binding is lost
Additional intriguing stories found, to be explored
Figure axis: fold enrichment or over expression
Pouya Kheradpour

2. Predicting changes in time-series expression
Integrate TF-target motif associations with time-course
Predict positive/negative regulators at each split
Notice: Adf1 targets appear positively then negatively regulated, consistent with changes in Adf1 expression (not an input to model)
Adf1 activator is ON (targets induced); Adf1 activator is OFF (targets not induced)
Regulators shown: Adf1, Trl, Vnd, Tin, Abd-A, Hmx, CG11085, CG34031, En, Mad, Grh, Btd, Abd-B, Ftz, Antp, …, E2F, gt3, Dref, gt, sna, trl, esg, adf1, byn, tin, Inv, Twi, Kr, vnd, exex, en, h
Jason Ernst

Predictive power of inferred network
Predict target expression as a linear combination of TFs, fit w_i
Example target: Snail, stages 4 to 6; inputs: embryo (w0) and TF images bap, en, Mef2, tin, twi (w1 to w5)
Panels: target, prediction, coefficients
White = correlation of individual TF image with target gene. Black = weight: multivariate regression coefficient for that target. White one for after regression combination.
Graphs: AUC. Also available: L2 distance between reconstruction and target; intersection / union.
Future: can motif grammars predict weights directly?
Charlie Frogner, Tom Morgan, Lorenzo Rosasco
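
The model on this slide predicts a target gene's expression image as a weighted sum of TF expression images plus a constant embryo term, with the weights fitted by regression. The sketch below shows one plain way to fit such weights: ordinary least squares through the normal equations, with each image flattened to a list of pixel intensities. The solver choice, the function name and the tiny images are assumptions for illustration; the slides do not say how the fit was actually carried out.

```python
def fit_linear(tf_images, target):
    """Least-squares weights [w0, w1, ...] with target ~ w0 + sum(w_i * tf_i).

    tf_images  list of flattened TF images (equal-length lists of floats)
    target     flattened target image
    """
    rows = [[1.0] + [img[p] for img in tf_images] for p in range(len(target))]
    k = len(rows[0])

    # Augmented normal equations: (X^T X | X^T y).
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(rows, target))]
           for i in range(k)]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("TF images are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Invented six-pixel images for two hypothetical TFs.
tf_a = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
tf_b = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0]
# Target built as 0.5 + 2 * tf_a - 1 * tf_b, so tf_b acts as a repressor.
toy_target = [0.5 + 2.0 * a - 1.0 * b for a, b in zip(tf_a, tf_b)]
toy_weights = fit_linear([tf_a, tf_b], toy_target)
```

A negative fitted weight, as for the second TF here, is what a repressive input would look like in the coefficient panels.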

Additional examples: striped, changing coeffs
Target slp1, stages 4 to 6; inputs: embryo (w0) and TF images Trl, sna, hb, Mef2, prd (w1 to w5)
Target Hunchback, stages 4 to 6; inputs: embryo (w0) and TF images Adf1, sna, cad, twi, bcd, pan (w1 to w6)
Panels: target, prediction, coefficients
White = correlation of individual TF image with target gene. Black = weight: multivariate regression coefficient for that target. White one for after regression combination.
Graphs: AUC. Also available: L2 distance between reconstruction and target; intersection / union.
Charlie Frogner, Tom Morgan, Lorenzo Rosasco

Outline
Annotate regulatory regions: promoters, enhancers, insulators
Annotate chromatin states: de novo learning of chromatin mark combinations
Predict TF/Chromatin binding: Sequence -> TFs -> Chromatin -> Expression
Infer regulatory networks: integrate motifs, expression, chromatin
Predictive models of gene expression: chromatin/expression time-course; embryo expression domains

The challenge ahead
Understand regulatory logic specifying development
Binding sites of every developmental regulator
Sequence motifs for every regulator
Annotations & images for all expression patterns
Motif examples: GAF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel; CTCF, check
Expression domain primitives reveal underlying logic (Dorsal-Ventral, Anterior-Posterior)

Drosophila modENCODE Analysis Group AWG Fly modEncode Sue Celniker Brenton Graveley Steve Brenner Michael Brent Gary Karpen Sarah Elgin Mitzi Kuroda Vince Pirrotta Peter Park Peter Kharchenko Michael Tolstorukov Eric Bishop Kevin White Casey Brown Nicolas Negre Nick Bild Bob Grossman Eric Lai Nicolas Robine David MacAlpine Matthew Eaton Steve Henikoff Peter Bickel Ben Brown Lincoln Stein Group Suzanna Lewis Gos Micklem Nicole Washington EO Stinson Marc Perry Peter Ruzanov Chris Bristow Pouya Kheradpour Rachel Sealfon Jason Ernst Mike Lin Stefan Washietl Networks group Rogerio Candeias Daniel Marbach Patrick Meyer Sushmita Roy Image analysis Tom Morgan Charlie Frogner Lorenzo Rosasco

Worm Integrative analysis Mark Gerstein and… A Agarwal, P Alves, B Arshinoff, R Auerbach, B Brown, A Carr, A Chateigner, C Cheng, N Cheung, T Down, X Feng, L Habegger, L Hillier, A Kanapin, T Liu, L Lochovsky, Z Lu, R Lyne, S Mackowiak, R Robilotto, J Rozowsky, H Shin, C Shou, S Taing, K Yan, K Yip, Z Zheng PIs & co-PIs who participated in the worm calls: J Ahringer, P Bickel, M Gerstein, K Gunsalus, S Kim, S Lewis, J Lieb, S Liu, G Micklem, D Miller, F Piano, N Rajewsky, V Reinke, M Snyder, L Stein, S Strome, R Waterston

Large-scale data accumulation
Timeline milestones: DAC creation; DAC meeting, Boston; integrative analysis (y-axis: number of datasets)
Lincoln Stein, Gos Micklem, Data Coordination Center