Integrative analysis of genomic and epigenomic data

Presentation transcript:

Integrative analysis of genomic and epigenomic data Manolis Kellis, RC1

Jason Ernst. Acknowledgements: Brad Bernstein, Pouya Kheradpour, Noam Shoresh, Chuck Epstein, Tarjei Mikkelsen

Integrative analysis of genomic / epigenomic data
Defining chromatin states
Biologically-meaningful mark combinations
Characterizing chromatin states
Application to genome annotation
Chromatin state dynamics
Cluster genes of common behavior
Infer cell type activators / repressors
Linking enhancers to promoter regions

Chromatin signatures for genome annotation
Challenges: dozens of marks; complex combinatorics; diversity and dynamics.
Histone code hypothesis: distinct function for distinct combinations of marks? Both additive and combinatorial effects. How do we find biologically relevant ones?
Unsupervised approach: probabilistic model; explicit combinatorics.

Cartoon illustration of ChromHMM
[Figure: DNA divided into 200bp intervals across an enhancer, a transcription start site, and a transcribed region, showing the observed chromatin marks (K4me3, K36me3, K4me1, K27ac), the most likely hidden state (1-6) for each interval, and the high-probability chromatin marks in each state (e.g. 0.8, 0.9, 0.7).]
Observed chromatin marks are called based on a Poisson distribution.
All probabilities are learned from the data.
Each state: vector of emissions, vector of transitions.
Slide note: we had talked about adding the H3K4 etc. labels within the shapes.
Ernst et al., in preparation
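The model in the cartoon can be sketched in a few lines of Python. This is a minimal illustration, not the ChromHMM implementation: the function names (`poisson_tail`, `binarize`, `viterbi`), the 1e-4 threshold, the background rate, and the toy two-state parameters are all assumptions made for the example. A mark is called present in a 200bp bin when its read count is unlikely under a Poisson background, and the most likely hidden state path is then decoded with independent Bernoulli emissions per mark.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def binarize(counts, lam, alpha=1e-4):
    """Call a mark present (1) in every bin whose count is unlikely under the background rate."""
    return [1 if poisson_tail(c, lam) < alpha else 0 for c in counts]

def log_emission(obs, probs):
    """Log-probability of a binary mark vector under independent Bernoulli emissions."""
    return sum(math.log(p if o else 1.0 - p) for o, p in zip(obs, probs))

def viterbi(observations, init, trans, emit):
    """Most likely state path; all probabilities must lie strictly between 0 and 1."""
    n_states = len(init)
    score = [math.log(init[s]) + log_emission(observations[0], emit[s])
             for s in range(n_states)]
    back = []
    for obs in observations[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + math.log(trans[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][s]) + log_emission(obs, emit[s]))
        back.append(pointers)
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example (hypothetical numbers): two marks (K4me3, K36me3) and two states,
# 0 = promoter-like and 1 = transcribed-like.
calls_k4me3 = binarize([12, 11, 1, 0], lam=2.0)
calls_k36me3 = binarize([0, 1, 10, 12], lam=2.0)
observations = list(zip(calls_k4me3, calls_k36me3))
emit = [[0.9, 0.1], [0.1, 0.9]]      # per state: P(mark present) for each mark
trans = [[0.8, 0.2], [0.2, 0.8]]     # row = state from, column = state to
print(viterbi(observations, [0.5, 0.5], trans, emit))  # [0, 0, 1, 1]
```

In practice the emission and transition probabilities would be learned from the data rather than set by hand, as the slide states.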

Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
State classes: Promoter, Transcribed, Active intergenic, Repressed, Repetitive.
Data sources: chromatin marks from (Barski et al., Cell 2007; Wang et al., Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al., Cell 2008); expression data from (Su et al., PNAS 2004); lamina data from (Guelen et al., Nature 2008).

State transition matrix
The full transition matrix of the Hidden Markov Model. Each row corresponds to the state being transitioned from, and each column to the state being transitioned to. Each entry is the probability, when in the row's state, of transitioning to the column's state. The grid shows that the transition matrix is relatively sparse.
Enables separation of distinct sub-groups within each class.
Reveals transitions between different groups.
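As a rough illustration of the row/column convention and of what "relatively sparse" means, the sketch below builds a transition matrix by counting transitions along an already-decoded state path. This is a simplification assumed for the example (an HMM would normally estimate these probabilities during training), and the names `transition_matrix` and `sparsity` and the 0.01 cutoff are hypothetical.

```python
def transition_matrix(path, n_states):
    """Row = state transitioned from, column = state transitioned to; rows sum to 1."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(path, path[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def sparsity(matrix, eps=0.01):
    """Fraction of entries with probability below eps."""
    cells = [p for row in matrix for p in row]
    return sum(p < eps for p in cells) / len(cells)

# Toy path over three states (hypothetical data).
matrix = transition_matrix([0, 0, 0, 1, 1, 2, 2, 2, 0], n_states=3)
for row in matrix:
    print([round(p, 2) for p in row])
print("sparsity:", round(sparsity(matrix), 2))
```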

(1) Promoter Associated States: Positional and functional properties
[Figure: fold enrichment by distance to nearest TSS.]
Fold enrichment (corrected p-value) by GO category, states 3, 4, 5, 6, 7, 8 in order:
Cell Cycle Phase: 2.10 (2x10^-7); 0.57 (1); 1.61 (0.001); 1.45 (1); 1.15 (1); 1.51 (1)
Embryonic Development: 1.24 (1); 2.82 (9x10^-23); 1.07 (1); 0.85 (1); 0.54 (1); 1.00 (1)
RNA Processing: 0.49 (1); 0.26 (1); 1.31 (1); 1.91 (4.2x10^-11); 2.64 (8.7x10^-24); 2.45 (3.0x10^-4)
T cell Activation: 0.77 (1); 0.88 (1); 1.27 (1); 0.70 (1); 0.79 (1); 4.72 (2x10^-7)
Chromatin (four values recovered): 1.20 (1); 0.48 (1); 2.2 (1.4x10^-7); 1.64 (1)
Response to DNA Damage Stimulus (five values recovered): 0.35 (1); 1.55 (0.074); 2.13 (6.5x10^-11); 1.97 (1.0x10^-4); 0.84 (1)
Distinct positional enrichments: marks can recruit initiation factors; act of transcription reinforces marks.
Distinct functional enrichments: epigenetic memory of activation history; much richer epigenetic vocabulary.
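The "fold enrichment (corrected p-value)" entries can be illustrated with a small sketch. The slides do not say which test or correction was used, so the hypergeometric tail and the Bonferroni correction below are assumptions chosen for the example, as are the function names and the gene counts.

```python
from math import comb

def fold_enrichment(k, n, K, N):
    """k of the n genes in a state are in the category; K of the N background genes are."""
    return (k / n) / (K / N)

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N, of which K are in the category."""
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / comb(N, n)

def bonferroni(p, n_tests):
    """Multiply by the number of tests and cap at 1."""
    return min(1.0, p * n_tests)

# Hypothetical counts: 20 of 100 genes in a state fall in a category that
# covers 500 of 10,000 background genes.
fe = fold_enrichment(20, 100, 500, 10000)
p = bonferroni(hypergeom_tail(20, 100, 500, 10000), n_tests=6)
print(f"{fe:.2f} ({p:.1e})")
```

Capping the corrected value at 1 would explain why many non-significant cells in such a table read "(1)".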

(2) Actively Transcribed States: Diverse marks, expression/position biases
[Figure: number of genes and fold enrichment for TSS-associated states, transcription elongation states, and exon-associated states; fold enrichment around the transcription end site.]
No single mark uniquely defines transcribed states.
Associated with active/repressed, expression levels.
Distinct for start/elongation, short/long, exon/intron.
Specific combination defines transcription end sites.
Highly specific combinations mark ZNF genes.

(2) Actively Transcribed States: Recovery of highly specific KAP1 combinations
“The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20 …”

(3) Active intergenic states: Distinct TF/motif enrichments
[Figure: conserved motif enrichments/depletions (Pouya).]

(3) Active intergenic states: Long-range predictive power
Enhancer state predictive of expression level.
Different intergenic states, different dist. activity.
Distinguish active from less active enhancer.
[Figure: pairwise state enrichments after 10kb gap.]
Enhancer states indeed distant from promoters.
Overlap between promoters / transcribed.

(4 & 5) Intergenic and Large Scale Repressed States
[Figure: repeat family enrichments for repetitive states; transition matrix for large scale repressed states.]
Distinct enrichments with lamina-associated regions.
Constitutive vs. facultative heterochromatin.
Distinct response to HDAC inhibitors: State 44 acetylated, suggesting active acetylation turnover.
Distinct sequence signatures: State 46 CAn/TGn/CATGn low-complexity repeat.
Distinct enrichments for distinct classes of repeat elements, distinct epigenetic marks.
Importance of jointly observing entire vector of marks → repetitive would overwhelm other’s signal.

Functional enrichments enable annotation of 51 distinct states

Apply genome wide to find novel genes, enhancers, insulators
[Figure: state enrichments per chromosome band for chromosomes 1-22, X, and Y.]
The enrichments of the states in each chromosome band, the coordinates of which were obtained from the UCSC genome browser (Kent et al, 2002). In this figure one can observe that: the satellite enriched states (47-51) are enriched in centromere regions of the chromosome; there are specific chromosome bands where states 41 and 42 have the dominant enrichment signal; the zinc finger enriched state (state 28) enriches on chromosome 19; and the unmappable state (state 40) enriches on gapped regions at the beginning of several chromosomes.

Discovery power for promoters, transcripts
[Figure: two ROC plots, TSS (left) and transcribed genes (right); axes: false positive rate, true positive rate.]
(Left) The blue curve shows a “receiver operating characteristic” (ROC) for coverage of bins with a TSS when states are ordered by their fold enrichment for a TSS (5, 7, 6, 4, 8, 3, 1, 2, 9, 10, 11, 21, 45, 20, etc.). The green curve is based on ordering the k-means clusters. The red triangles are based on the individual input marks. The purple curve is based on a logistic regression classifier. The features given to the classifier were ln(x+1) transformed values of the raw number of tags in a bin for each mark. No spatial information was given. Results for the classifier are based on five-fold cross validation. The TR-IRLS implementation of logistic regression was used with the default settings, except that the cgdeveps parameter was set to 0.0001.
(Right) The same plot as the left side, but for RefSeq transcribed regions as opposed to RefSeq TSS.
Significantly outperforms single marks.
Similar power to supervised learning approach.
CAGE experiments give possible upper bound.
References:
Komarek, P. and Moore, A.W. Making Logistic Regression A Core Data Mining Tool With TR-IRLS. IEEE ICDM 2005, pages 685-688.
Carninci, P. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38: 626-635 (2006).
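The state-ordering ROC described for the blue curve can be sketched as follows. This is an illustrative reconstruction under assumptions, not the code behind the figure: the function name `state_roc` and the toy bins are hypothetical. States are added one at a time in order of decreasing TSS enrichment, and each addition yields one (false positive rate, true positive rate) point.

```python
def state_roc(states, is_tss):
    """Return the state order and the ROC points obtained by adding states in that order."""
    total_pos = sum(is_tss)
    total_neg = len(is_tss) - total_pos
    pos, size = {}, {}
    for s, t in zip(states, is_tss):
        pos[s] = pos.get(s, 0) + t
        size[s] = size.get(s, 0) + 1
    # Ranking by the fraction of TSS bins in a state gives the same order as ranking
    # by fold enrichment, since the genome-wide TSS fraction is a shared constant.
    order = sorted(size, key=lambda s: pos[s] / size[s], reverse=True)
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for s in order:
        tp += pos[s]
        fp += size[s] - pos[s]
        points.append((fp / total_neg, tp / total_pos))
    return order, points

# Toy data: the state assigned to each bin and whether the bin contains a TSS.
states = [5, 5, 5, 7, 7, 1, 1, 1, 1, 1]
is_tss = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
order, points = state_roc(states, is_tss)
print(order)   # [5, 7, 1]
print(points)  # [(0.0, 0.0), (0.0, 0.6), (0.2, 0.8), (1.0, 1.0)]
```

The same routine applies to the right-hand panel by swapping the TSS labels for transcribed-region labels.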

State annotation reveals new protein-coding genes
Transcribed/promoter states enriched in novel protein-coding exons.
Likely to represent short single-exon genes (promoter states).
Likely to represent low-expression genes (repressed states).

When novel transcribed regions lack protein signatures: → 2,000 large intergenic non-coding RNAs (lincRNAs)
[Figure: H3K4me3 - H3K36me3 signal.]
Computational signal: chromatin signature of promoter and transcribed; evolutionary signature is not protein-coding.
Experimental confirmation: produce RNA molecules; exon/intron structures.
Evolutionary confirmation: exons are conserved; promoters are conserved; regulation is conserved.
Experimental follow-up: they play diverse roles in chromatin regulation.
Mikkelsen et al. 2007; Guttman, Lin, Kellis, Regev, Rinn, Lander, Nature, Feb 2009

Combine chromatin signatures and regulatory motifs → New developmental enhancers in human and fly
Visel, Pennacchio, Rubin, Ren, Nature 2008; Zeitlinger et al, Genes & Development 2007
Chromatin signatures and evolutionary signature predictive of enhancers.
Experimental techniques developed for inferring expression domains.
Large-scale databases mapping every element to its expression pattern.
Ability to test new patterns / artificial elements in fly, mouse embryos.

Shedding light on GWAS disease SNPs
State enrichments for SNPs and meta-study database of GWAS hits.
rs12619285 in Chr2 intergenic region 40kb 3’ of IKZF2 (lymphocyte development).
Strongest disease association with numerous inflammations (Gudbjartsson’09).
Strong hit for State 33, while surrounding region unenriched (37 and 41-43).

Application to ENCODE datasets
with Brad Bernstein, Noam Shoresh, Chuck Epstein
[Figure: chromatin modification marks (Bernstein); cell-type specific genome annotation; TF binding data (Snyder); interpretation.]
ENCODE reference cell types.
11 chromatin marks.
8 cell types.
Diverse additional functional datasets.

Assessing predictive value for subset of marks
Recovery of states with increasing number of marks, using a greedy ordering of marks.
[Figure: state inferred with subset of marks vs. state inferred with all 41 marks.]
[Figure: state confusion matrix with 11 ENCODE marks.]

Comparing chromatin states across cell types
[Figure: pairwise state fold enrichments and proportion of genome for K562, HUVEC, and NHEK.]
CTCF island state (State 9) highly stable across cell types.
Slide note: it’d be nice to point to a more variable mark too. CTCF is a pretty obvious thing, it shouldn’t be the only thing you point to.
TODO: Add interpretation lines to the 10-state model!

Comparing chromatin states across cell types
[Figure: state comparison for K562, HUVEC, and NHEK.]
GO enrichment for TSS in active promoter state (1) in NHEK and unmodified state (7) in HUVEC:
GO category: P-value
ectoderm development: 2.90E-09
epidermis development: 1.80E-08
keratinocyte differentiation: 3.00E-06
tissue development: 3.20E-06
cell adhesion: 1.90E-05

Comparing chromatin states across cell types
[Figure: state comparison for K562, HUVEC, and NHEK.]
GO enrichment for TSS in active promoter state (1) in HUVEC and unmodified state (7) in NHEK:
GO category: P-value
blood vessel development: 2.60E-05
vasculature development: 3.00E-05
angiogenesis: 3.50E-05
blood vessel morphogenesis: 1.20E-04

Clusters of genes with coherent marks across entire length
[Figure: gene clusters (Cluster 10, Cluster 14) across the marks CTCF, H3K27ac, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K9me1, H4K20me1, Pol2, K36me3, K27me3, with HMM 0, HMM 1, HMM 2, HMM 3.]
GO category: P-value
Immune response: 2x10^-63
Leukocyte activation: 5x10^-32
Lymphocyte activation: 6x10^-32
GO category: P-value
Cell adhesion: 1x10^-13
Ectoderm development: 1x10^-9
Extracellular region part: 3x10^-8

Signatures of activators and repressors
[Figure: panels for active states and repressed states, each plotting TF expression (2^-2 to 2^2) against motif enrichment (2^-4 to 2^4).]
Activator signature (+): +TF expressed → motif enriched; -TF expressed → motif depleted.
Repressor signature (-): +TF expressed → motif depleted; -TF expressed → motif enriched.
TF expr → no motif. If motif → no expr.
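One way to read the two signatures is as a quadrant test on the log-scaled axes of the figure. The sketch below is an assumption-laden illustration, not the method used for the slides: it assumes that relative TF expression and motif fold enrichment in active states are given per cell type as positive ratios centered on 1, and the names `signature_score` and `classify_tf` and the 0.5 cutoff are hypothetical.

```python
from math import log2

def signature_score(tf_expression, motif_enrichment):
    """Mean product of log2 relative expression and log2 motif fold enrichment.

    Positive when the two agree in sign across cell types (expressed and enriched,
    or not expressed and depleted); negative when they disagree.
    """
    products = [log2(e) * log2(m) for e, m in zip(tf_expression, motif_enrichment)]
    return sum(products) / len(products)

def classify_tf(tf_expression, motif_enrichment, min_score=0.5):
    """Label a TF from its score; min_score is an arbitrary illustrative cutoff."""
    score = signature_score(tf_expression, motif_enrichment)
    if score >= min_score:
        return "activator"
    if score <= -min_score:
        return "repressor"
    return "unclear"

# Hypothetical values for three cell types (ratios relative to the mean).
print(classify_tf([4.0, 0.25, 2.0], [8.0, 0.5, 2.0]))  # activator
print(classify_tf([4.0, 0.25], [0.25, 4.0]))           # repressor
```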

Example of activator and repressor
[Figure: two panels plotting expression (2^-2 to 2^2) against fold enrichment (2^-4 to 2^4). State legend: 0: “Off” state; 5,6: “Enhancer” states; 9: “On” state.]
HNF: HepG2 activator.
CREB: GM repressor.

Linking candidate enhancers to correlated target genes
1. Search for coherent changes between: gene expression; chromatin marks at distant loci (10kb).
2. Combine two vectors: expression vector for each gene; vector of mark intensities at distant locus (combine marks based on enhancer emissions).
3. High correlation → enhancer/target link.
[Figure: candidate TM4SF1 enhancer, 10kb.]
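The three steps can be sketched as a correlation across cell types. This is a minimal illustration under assumptions: Pearson correlation, the 0.8 threshold, the emission-based weights, and all names (`pearson`, `combined_intensity`, `link_enhancers`, `enhA`, `enhB`) are choices made for the example and are not taken from the slides.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def combined_intensity(marks_by_cell, weights):
    """One value per cell type: mark intensities weighted by enhancer emission weights."""
    return [sum(w * m for w, m in zip(weights, marks)) for marks in marks_by_cell]

def link_enhancers(expression, candidates, weights, threshold=0.8):
    """Return (candidate, correlation) pairs above threshold, strongest first."""
    links = []
    for name, marks_by_cell in candidates.items():
        r = pearson(expression, combined_intensity(marks_by_cell, weights))
        if r > threshold:
            links.append((name, r))
    return sorted(links, key=lambda item: item[1], reverse=True)

# Hypothetical gene expression across four cell types and two candidate loci,
# each with two mark intensities per cell type.
expression = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "enhA": [[1, 0], [2, 0], [3, 0], [4, 0]],  # rises with expression
    "enhB": [[4, 0], [3, 0], [2, 0], [1, 0]],  # falls with expression
}
print(link_enhancers(expression, candidates, weights=[1.0, 0.5]))
```

Only the locus whose combined mark intensity tracks expression is reported as a candidate link; the anti-correlated locus is dropped.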

Predictive power of distal enhancer regions
[Figure: correlation of individual regions (sorted by rank); mark intensity correlation with expression for 10kb upstream, 100kb upstream, and 10kb/100kb controls.]
At least 100 regions with >80% correlation.

Where to next?
Technology development
Data production
Data dissemination and visualization
Overlaying and combining datasets
Integrative data analysis
Biological discovery and understanding