Fly modENCODE data integration update
Manolis Kellis, MIT
MIT Computer Science & Artificial Intelligence Laboratory
Broad Institute of MIT and Harvard

modENCODE integration goals
Annotate all functional elements
–Enhancers, promoters, insulators, silencers
–Protein-coding genes, RNA genes, alternative splice forms
Understand their dynamics
–Tissue- and stage-specific activity of each type of element
Mechanisms
–Relative roles of histones, chromatin, specific/general TFs
–Sequence specificity, regulatory motifs and grammars
Community involvement will be key
–Seeking both computational and experimental partners
–Large-scale: Complementary datasets / computation
–Small-scale: Directed follow-up studies / genes, pathways
Drosophila 2009 modENCODE workshop discussion

Each dataset is supported by all others
Each type of element requires multiple data types
–Protein genes
–RNA genes
–Promoters
–Enhancers
–Transcripts
–Heterochromatin
–Initiation sites
[Figure: Data Integration efforts. Data types: Replication, Chromatin, Nucleosomes, Small RNAs, Transcripts, TFs/Chromatin. Groups: Karpen, Henikoff, Celniker, White, Lai, MacAlpine. Legend: Already presented / Underway]

modENCODE is not alone
Community data types
–Boundaries
–DNase HS sites, low buoyant density (protein binding)
–Evolutionary properties (correlations with conserved/non-conserved properties)
–Dam mapping
–Small RNAs
Techniques and functional genomics
–Gene Disruption projects
–RNAi collection
–Recombineering
–Computational analyses
[Figure: Data types: Replication, Chromatin, Nucleosomes, Small RNAs, Transcripts, TFs/Chromatin. Groups: Karpen, Henikoff, Celniker, White, Lai, MacAlpine. Community resources: Boundaries, DNase HS, 12 flies (+8 flies), Dam mapping, etc.]

Comparative resources for Drosophila genomes
Identify functional elements by their evolutionary signatures: complement experimental studies
[Figure legend: done / priority 1 / priority 2]

New species       Dist
D. ficusphila     0.80
D. biarmipes      0.70
D. elegans        0.72
D. kikkawai       0.89
D. eugracilis     0.76
D. takahashii     0.65
D. rhopaloa       0.66
D. bipectinata    0.99

Evolutionary signatures for diverse functions
Protein-coding genes
- Codon Substitution Frequencies
- Reading Frame Conservation
RNA structures
- Compensatory changes
- Silent G-U substitutions
microRNAs
- Shape of conservation profile
- Structural features: loops, pairs
- Relationship with 3’UTR motifs
Regulatory motifs
- Mutations preserve consensus
- Increased Branch Length Score
- Genome-wide conservation
Stark et al, Nature 2007; Clark et al, Nature 2007

Functional annotation of novel transcripts using evolutionary signatures
[Figure: two histograms. x-axis: CSF Score (best 30 aa window); y-axes: Fraction, Frequency]
73 putative protein-coding
57 putative non-coding
CSF = heuristic metric for codon substitution frequency
Mike Lin, Jane Landolin, Sue Celniker
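The "best 30 aa window" statistic on this slide can be read as a sliding-window maximum over per-codon scores. The sketch below assumes the per-codon scores are already computed (the slide does not say how CSF itself is derived), and `best_window_score` is a hypothetical helper name, not something from the presentation.

```python
from typing import Sequence


def best_window_score(codon_scores: Sequence[float], window: int = 30) -> float:
    """Return the largest summed score over any `window` consecutive codons.

    Assumption: `codon_scores` holds one precomputed score per codon.
    If the transcript is shorter than `window`, the whole transcript is used.
    """
    if not codon_scores:
        raise ValueError("need at least one codon score")
    w = min(window, len(codon_scores))
    current = sum(codon_scores[:w])      # score of the first window
    best = current
    for i in range(w, len(codon_scores)):
        # Slide the window one codon to the right.
        current += codon_scores[i] - codon_scores[i - w]
        best = max(best, current)
    return best
```

A transcript would then be called putatively coding or non-coding by comparing this value with a threshold, which the slide does not state.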

Discover motifs associated with binding
Ability to discover full dictionary of regulatory motifs de novo
Stark et al, Nature, 2007

Rank  Consensus    MCS   Matches to known              Tissue-specific target expression (Promoters / Enhancers; column not recoverable)
1     CTAATTAAA    65.6  engrailed (en)
      TTKCAATTAA   57.3  reversed-polarity (repo)
      WATTRATTK    54.9  araucan (ara)
      AAATTTATGCK  54.4  paired (prd)
      GCAATAAA     51    ventral veins lacking (vvl)
      DTAATTTRYNR  46.7  Ultrabithorax (Ubx)
      TGATTAAT     45.7  apterous (ap)
      YMATTAAAA    43.1  abdominal A (abd-A)           72.2
9     AAACNNGTT
      RATTKAATT
      GCACGTGT     39.5  fushi tarazu (ftz)
      AACASCTG     38.8  broad-Z3 (br-Z3)
      AATTRMATTA
      TATGCWAAT
      TAATTATG     37.5  Antennapedia (Antp)
      CATNAATCA
      TTACATAA
      RTAAATCAA
      AATKNMATTT
      ATGTCAAHT
      ATAAAYAAA
      YYAATCAAA
      WTTTTATG     33.8  Abdominal B (Abd-B)
      TTTYMATTA    33.6  extradenticle (exd)
      TGTMAATA
      TAAYGAG
      AAAKTGA
      AAANNAAA
      RTAAWTTAT    32.9  gooseberry-neuro (gsb-n)
      TTATTTAYR    32.9  Deformed (Dfd)                30.7
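The consensus strings in this table use IUPAC degenerate nucleotide codes (K = G or T, W = A or T, and so on). As an illustration of how such a consensus can be located in a sequence, here is a minimal sketch that expands the codes into a regular expression. `IUPAC_CLASSES` and `find_instances` are hypothetical names, and only the forward strand is scanned; this is not the discovery method from Stark et al.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC_CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_instances(consensus: str, sequence: str) -> List[int]:
    """Return 0-based start positions of all forward-strand matches.

    A zero-width lookahead is used so overlapping matches are reported too.
    """
    body = "".join(IUPAC_CLASSES[c] for c in consensus.upper())
    pattern = re.compile("(?=(" + body + "))")
    return [m.start() for m in pattern.finditer(sequence.upper())]


# Example with a made-up sequence containing one match to TTKCAATTAA.
print(find_instances("TTKCAATTAA", "GGTTGCAATTAACC"))
```

Scanning the reverse complement as well would be needed for a strand-independent search.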

Initial regulatory network for an animal genome
ChIP-grade quality
–Similar functional enrichment
–High sensitivity, high specificity
Systems-level
–81% of transcription factors
–86% of microRNAs
–8k + 2k targets
–46k connections
Lessons learned
–Pre- and post-transcriptional regulation are correlated (hi/hi, lo/lo)
–Regulators are heavily targeted, feedback loop
Kheradpour et al, Genome Research, 2007
Sushmita Roy

Temporal latencies in regulatory networks
TF-specific latencies, coherent with TF function
Latencies associated with network motifs
Extensions to tissue-specific networks
Rogerio Candeias

Incorporating ENCODE functional datasets
Pouya Kheradpour, Jason Ernst, Chris Bristow, Rachel Sealfon

modENCODE and gene regulation
Goal: Understand the DNA elements responsible for gene regulation:
–The regulators: TFs, GFs, miRNAs, their specificities
–The regions: enhancers, promoters, insulators
–The targets: individual regulatory motif instances
–The grammars: combinations predictive of tissue-specific activity
→ Building blocks of gene regulation
Our tools: Comparative genomics & large-scale experimental datasets.
–Evolutionary signatures for promoter/enhancer/3’UTR motif annotation
–Chromatin signatures for integrating histone modification datasets
–TFs, GFs, motifs, instances associated with tissue-specific activity
–Infer regulatory networks, their temporal and spatial dynamics
→ Integrate diverse datasets

Sequence motifs predictive of insulators
Understand specificity of each factor
How predictive are these of binding
Motif combinations and grammars
Motifs specific to each insulator: GAF, check; CTCF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel
Pouya Kheradpour

Motif instances correlate with ChIP peaks
CTCF motif instances correlate strongly with narrow peak calls from multiple peak callers, even at a 40bp window
Correlation extends down the rank list (to all 50,000 peaks)
Implications for peak calling and for motif discovery
[Figure: SPP, 40bp window. x-axis: Narrow Peak Interval Rank (x10^4); y-axis: Fraction overlapping CTCF motif instances]
[Figure: Performance (higher is better) vs. peak size: Recovery of CTCF instances at 90% confidence]
Pouya Kheradpour, Ben Brown
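The quantity plotted on this slide, the fraction of ranked peaks that overlap a motif instance, can be sketched as follows. This is an illustration under stated assumptions, not the analysis code behind the figure: peaks are reduced to single summit coordinates on one chromosome, the "40bp window" is read as 20bp on either side of the summit, and `fraction_overlapping` is a hypothetical name.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def fraction_overlapping(
    peak_summits: Sequence[int],
    motif_positions: Sequence[int],
    window: int = 40,
    top_n: Optional[int] = None,
) -> float:
    """Fraction of the top-ranked peaks with a motif instance inside the window.

    Assumptions: `peak_summits` is ordered best-ranked first, and all
    coordinates are on the same chromosome.
    """
    motifs = sorted(motif_positions)
    peaks = peak_summits if top_n is None else peak_summits[:top_n]
    if not peaks:
        return 0.0
    half = window // 2
    hits = 0
    for summit in peaks:
        # First motif at or to the right of the window's left edge.
        i = bisect_left(motifs, summit - half)
        if i < len(motifs) and motifs[i] <= summit + half:
            hits += 1
    return hits / len(peaks)
```

Evaluating this for increasing `top_n` gives a curve of overlap fraction against peak rank, which is the shape of plot the slide describes.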

Motifs and tissue-specific chromatin marks
The NF-κB motif is enriched in H3K4me2 regions found uniquely in GM12878 cells
It is likewise enriched in the uniquely bound regions for other active marks
Conversely, it is enriched in the uniquely unbound regions for the repressive mark H3K27me3
We find that NF-κB is also overexpressed in GM12878, suggesting a causative explanation
[Figure: NF-κB motif. y-axis: Fold enrichment or overexpression; panels: Active marks, Repressive mark]
Pouya Kheradpour
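The slide reports fold enrichment without defining it. One common definition, assumed here purely for illustration, compares the share of motif instances that fall in a region set with the share of the background that the region set covers. `fold_enrichment` and its parameter names are hypothetical.

```python
def fold_enrichment(
    hits_in_regions: int,
    total_hits: int,
    region_bp: int,
    background_bp: int,
) -> float:
    """Observed / expected share of motif instances inside a region set.

    Assumed definition (not taken from the slide):
      observed = hits_in_regions / total_hits
      expected = region_bp / background_bp
    A value above 1 means the motif is over-represented in the regions.
    """
    if total_hits <= 0 or region_bp <= 0 or background_bp <= 0:
        raise ValueError("counts and sizes must be positive")
    observed = hits_in_regions / total_hits
    expected = region_bp / background_bp
    return observed / expected


# Made-up numbers: half of all instances fall in regions covering a quarter
# of the background.
print(fold_enrichment(50, 100, 2_500, 10_000))
```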

Motifs and stage-specific chromatin marks
abd-A motif is enriched in new H3K27me3 regions at L2
–Coincides with a drop in the expression of abd-A
–Model: sites gain H3K27me3 as abd-A binding lost
Additional intriguing stories found, to be explored
[Figure: H3K27me3. y-axis: Fold enrichment or overexpression]

What about combinations of chromatin marks? Jason Ernst

A hidden Markov model for chromatin state
[Figure: DNA with an Enhancer, a Transcription Start Site and a Transcribed Region; rows: Observed Histone Modifications, Most likely Hidden State; states numbered 1 to 6, with the highly likely modifications listed for each state]
Even when a modification is not observed at a location, the correct state can still be inferred from neighboring locations, because a location is likely to be in the same state as its neighbors
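The point of this slide, that a bin with a missing mark still receives the state of its neighbors, can be demonstrated with a toy Viterbi decoder. Everything in the sketch is an assumption made for illustration: two states, two generic marks, independent presence/absence emissions per mark, and invented probabilities. It is not the model or the parameters used in the presentation.

```python
import math
from typing import Dict, List, Sequence, Tuple

Obs = Tuple[int, ...]  # presence (1) / absence (0) of each mark in one bin


def emission_logp(obs: Obs, mark_probs: Sequence[float]) -> float:
    """Log-probability of a bin, treating marks as independent Bernoullis.

    Probabilities must lie strictly between 0 and 1 to avoid log(0).
    """
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, mark_probs))


def viterbi(
    observations: Sequence[Obs],
    states: Sequence[str],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    emit: Dict[str, Sequence[float]],
) -> List[str]:
    """Return the most likely hidden state for each bin."""
    scores = {s: math.log(start[s]) + emission_logp(observations[0], emit[s])
              for s in states}
    backpointers = []
    for obs in observations[1:]:
        prev, scores, pointers = scores, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            scores[s] = (prev[best] + math.log(trans[best][s])
                         + emission_logp(obs, emit[s]))
            pointers[s] = best
        backpointers.append(pointers)
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (invented): each state favors one of two marks.
STATES = ["enhancer", "transcribed"]
START = {"enhancer": 0.5, "transcribed": 0.5}
TRANS = {"enhancer": {"enhancer": 0.9, "transcribed": 0.1},
         "transcribed": {"enhancer": 0.1, "transcribed": 0.9}}
EMIT = {"enhancer": (0.9, 0.1), "transcribed": (0.1, 0.9)}

# The third bin has no observed mark, yet it is decoded like its neighbors.
bins = [(1, 0), (1, 0), (0, 0), (1, 0), (0, 1), (0, 1)]
print(viterbi(bins, STATES, START, TRANS, EMIT))
```

The high self-transition probability is what carries the "enhancer" label across the empty bin: its emission is equally likely under both states, so the neighboring bins decide.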

20 distinct chromatin states, combinations of marks
Combinations of chromatin marks
–More informative than individual marks (A&B ≠ A&C)
–Small number of states (20 instead of all 2^21 ≈ 2 million)
–Allow study of co-occurrence patterns, independence…

Each chromatin state associated w/ distinct function
Reveals active/repressed promoters & enhancers
Distinct enrichments for 5’UTR/3’UTR/transcripts
Distinct chromatin properties of exons / introns
Tentative annotations

Transcriptional unit enrichment

Transcription start site (TSS) enrichment

Transcription termination site (TTS) enrichment

Transcriptional unit enrichment

Chromatin signatures as context for TF analysis
TF role in establishing chromatin states
Chromatin role in modulating TF function

Specific enrichment for DV and AP factors

Functions of 20 distinct chromatin states in fly
[Figure panels: Chromatin marks, DV enhancers, AP enhancers, General TFs, Insulators, Replication, Motifs]

The grand challenge ahead
Annotations & images for all expression patterns
[Figure axes: Anterior-Posterior, Dorsal-Ventral]
Expression domain primitives reveal underlying logic
Binding sites of every developmental regulator
Sequence motifs for every regulator: GAF, check; Su(Hw), check; BEAF-32, variant; Mod(mdg4), novel; CP190, novel; CTCF, check
Understand regulatory logic specifying development

Summary of our lab’s experience in (mod)ENCODE
Protein-coding genes (Mike Lin)
–Hubbard: Predict new genes, evaluate novel genes
–Celniker: Distinguish coding/non-coding transcripts
Chromatin domains (Jason Ernst)
–Karpen: Chromatin states in Drosophila
–Bernstein: Chromatin states in Human
Motif and grammar discovery (Pouya Kheradpour)
–White: Motifs associated with insulator proteins
–Bernstein: Tissue-specific chromatin states
–White: Expression and Binding Time-course
Tissue-specific gene expression (Chris Bristow)
–Celniker: Embryo expression domains
–All: Predictive models of gene expression

Acknowledgements
Alex Stark
TFs/Insul.: Kevin White, Bing Ren, Nicolas Negre, Par Shah, Jim Posakony
12+8 flies: Andy Clark, Mike Eisen, Bill Gelbart, Doug Smith, Peter Cherbas
Chromatin: Gary Karpen, Aki Minoda, Nicole Riddle, Peter Park + Kharchenko
Prot. genes: BDGP: Sue Celniker, Jane Landolin; FlyBase: Bill Gelbart
Pouya Kheradpour, Mike Lin, Jason Ernst, Chris Bristow
Funding: ENCODE, modENCODE, NHGRI, NSF, Sloan Foundation