Integration methods and analysis

Presentation transcript:

Integration methods and analysis Manolis Kellis Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory

The good news: ever-expanding dimensions
Dimensions: environment, genotype, disease, gender, stage, age, chromatin marks, cell types
Now: cell-type and chromatin-mark dimensions
Next: references for each background
All clearly needed, and increasingly available

Difficulty of interpreting increasing # marks Challenge: simplify Learn combinations Interpret function Prioritize marks Study dynamics

Overview
Learning chromatin states: ChromHMM captures combinatorics / spatial info
Interpreting chromatin states: distinct functions, power for genome annotation
Selecting number of marks / prioritizing: greedy ordering, tunable to states of interest
Chromatin dynamics in human cell lines: activity profiles, linking enhancers, activators/repressors
Selecting number of states: interpreting genome at increasing resolution

1. Learn combinations

Challenge of data integration in many marks/cells
Epigenetic modifications: dozens of marks, encode epigenetic state
Histone code hypothesis: distinct function for distinct combinations of marks?
Hundreds of histone marks, astronomical number of histone mark combinations
How do we find biologically relevant ones? Unsupervised approach, probabilistic model, explicit combinatorics
Ernst et al, in preparation

Chromatin states for genome annotation
Learn de novo significant combinations of chromatin marks
Reveal functional elements, even without looking at sequence
Use for genome annotation
Use for studying regulation dynamics in different cell types
State classes: promoter states, transcribed states, active intergenic, repressed

ChromHMM: learning ‘hidden’ chromatin states
Observed chromatin marks (K4me3, K36me3, K4me1, K27ac) are called in 200bp intervals based on a Poisson distribution.
Each interval along the DNA has a most likely hidden state (figure labels: transcription start site, enhancer, transcribed region; states 1-6).
Each state: vector of emissions (the high-probability chromatin marks in that state, e.g. 0.8, 0.9, 0.7), vector of transitions.
All probabilities are learned de novo from chromatin data alone (Baum-Welch, a.k.a. EM).
Ernst et al, in preparation
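To make the "most likely hidden state" idea concrete, here is a minimal sketch of Viterbi decoding over binary mark calls. It assumes that marks are independent given the state and that the emission and transition probabilities have already been learned. The two-state toy model and all numbers are illustrative assumptions, not values from the slides, and this is not the ChromHMM implementation.

```python
import math

def viterbi(obs, init, trans, emit):
    """Most likely state path for binary mark observations.

    obs:   list of tuples, one per 200bp bin, each holding a 0/1 call per mark
    init:  init[k] = probability of starting in state k
    trans: trans[j][k] = probability of moving from state j to state k
    emit:  emit[k][m] = probability that mark m is present in state k
    All probabilities must lie strictly between 0 and 1 (logs are taken).
    """
    n_states = len(init)

    def log_emission(k, calls):
        # Marks are treated as independent given the state (an assumption).
        return sum(math.log(emit[k][m] if c else 1.0 - emit[k][m])
                   for m, c in enumerate(calls))

    score = [math.log(init[k]) + log_emission(k, obs[0]) for k in range(n_states)]
    back = []
    for calls in obs[1:]:
        prev, score, pointers = score, [], []
        for k in range(n_states):
            best_j = max(range(n_states),
                         key=lambda j: prev[j] + math.log(trans[j][k]))
            pointers.append(best_j)
            score.append(prev[best_j] + math.log(trans[best_j][k])
                         + log_emission(k, calls))
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: state 0 is promoter-like, state 1 is transcribed-like.
MARKS = ("K4me3", "K36me3")
INIT = [0.5, 0.5]
TRANS = [[0.8, 0.2], [0.1, 0.9]]
EMIT = [[0.9, 0.1], [0.05, 0.8]]
BINS = [(1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
PATH = viterbi(BINS, INIT, TRANS, EMIT)
print(PATH)
```

In this toy run the first two bins (K4me3 only) decode to state 0 and the remaining bins (K36me3 only) to state 1. Baum-Welch training, which the slide names as the learning step, is not shown.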

Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
State classes: promoter, transcribed, active intergenic, repressed, repetitive
Chromatin marks from (Barski et al, Cell 2007; Wang et al, Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al, Cell 2008); expression data from (Su et al, PNAS 2004); lamina data from (Guelen et al, Nature 2008)

2. Interpreting chromatin states As learned in the IMR90 REMC datasets

IMR90: 24 marks + DNase + CpG methylation. Emission parameters, transition parameters. Interpreting a 40-state model as a basis for analysis

Promoter associated states 1-5

State 1 Bivalent Repressed Promoter

State 3 TSS Specific State

States 6-19: transcribed regions

States 15-19: ends of genes and exons

State 19 is 60-fold enriched for ZNF genes

States 20-30 associated with active intergenic regions

20-24 candidate strong enhancers: increased accessibility, lower methylation, greater conservation

States 30-40: large scale repressive states

State 34 - Strong H3K27me3 silenced state

39-40: H3K9me3 repressive domains / experimental nuclear lamina

Specific functional annotations for each of 51 chromatin states Promoter states Transcribed states Active Intergenic Repressed Repetitive

Example applications for genome annotation
New protein-coding genes
Chromatin signature: promoter / transcribed. Evolutionary signature: not protein-coding
[Figure labels: lincRNAs, known coding, evolutionary CSF score]
Long intergenic non-coding RNAs (lincRNAs), in promoter (short) / low-expression states
Assign candidate functions to intergenic SNPs from genome-wide association studies
New developmental enhancer regions

Examples of distinct properties of chromatin states
GO category enrichment by state. Each row lists values for State 3, State 4, State 5, State 6, State 7, State 8 in that order, with the p-value in parentheses:
Cell Cycle Phase: 2.10 (2x10^-7), 0.57 (1), 1.61 (0.001), 1.45 (1), 1.15 (1), 1.51 (1)
Embryonic Development: 1.24 (1), 2.82 (9x10^-23), 1.07 (1), 0.85 (1), 0.54 (1), 1.00 (1)
Chromatin [entries incomplete in transcript]: 1.20 (1), 0.48 (1), 2.2 (1.4x10^-7), 1.64 (1)
Response to DNA Damage Stimulus [entries incomplete in transcript]: 0.35 (1), 1.55 (0.074), 2.13 (6.5x10^-11), 1.97 (1.0x10^-4), 0.84 (1)
RNA Processing: 0.49 (1), 0.26 (1), 1.31 (1), 1.91 (4.2x10^-11), 2.64 (8.7x10^-24), 2.45 (3.0x10^-4)
T cell Activation: 0.77 (1), 0.88 (1), 1.27 (1), 0.70 (1), 0.79 (1), 4.72 (2x10^-7)
Transcription end state: State 27
ZNF repressed state: State 28, 112-fold ZNF enrichment
“The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20 …” MCB 2006.
Promoter state → gene GO function
Promoter vs. enhancer regulation: TF binding, motif enrichment (enhancers, promoters)
State 10kb away predictive of expression
States 30, 29, 34, 42, 35
Distinct types of repression: chromatin bands / HDAC response, repeat family / composition

Quantifying discovery power for promoters, transcripts
(Left) The blue curve shows a receiver operating characteristic (ROC) for coverage of bins with a TSS when states are ordered by their fold enrichment for a TSS (5, 7, 6, 4, 8, 3, 1, 2, 9, 10, 11, 21, 45, 20, etc.). The green curve is based on ordering the k-means clusters. The red triangles are based on the individual input marks. The purple curve is based on a logistic regression classifier. The features given to the classifier were ln(x+1)-transformed values of the raw number of tags in a bin for each mark. No spatial information was given. Results for the classifier are based on five-fold cross-validation. The TR-IRLS implementation of logistic regression was used with the default settings, except that the cgdeveps parameter was set to 0.0001.
(Right) The same plot as the left side, but for RefSeq transcribed regions as opposed to RefSeq TSS.
Komarek, P. and Moore, A.W. Making Logistic Regression A Core Data Mining Tool With TR-IRLS. IEEE ICDM 2005, pages 685-688.
Carninci, P. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38: 626-635 (2006).
Significantly outperform individual chromatin marks
For transcripts, no single mark is a sufficient signature
CAGE/EST experiments give a possible upper bound
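The state-ordered ROC curve described for the left panel can be sketched as follows: states are sorted by fold enrichment for TSS bins, and each state adds one point to the curve. The per-state counts and state names below are made up for illustration, and the function name is hypothetical.

```python
def roc_from_state_ordering(bins_per_state, tss_bins_per_state):
    """Order states by TSS fold enrichment and accumulate ROC points.

    bins_per_state:     {state: number of bins assigned to that state}
    tss_bins_per_state: {state: number of those bins that contain a TSS}
    Returns (state order, list of (false positive rate, true positive rate)).
    """
    total_bins = sum(bins_per_state.values())
    total_tss = sum(tss_bins_per_state.values())
    background = total_tss / total_bins  # genome-wide fraction of TSS bins

    def fold(state):
        return (tss_bins_per_state[state] / bins_per_state[state]) / background

    order = sorted(bins_per_state, key=fold, reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for state in order:
        tp += tss_bins_per_state[state]
        fp += bins_per_state[state] - tss_bins_per_state[state]
        points.append((fp / (total_bins - total_tss), tp / total_tss))
    return order, points

# Hypothetical counts for three states.
BINS = {"A": 100, "B": 300, "C": 600}
TSS = {"A": 50, "B": 30, "C": 20}
ORDER, POINTS = roc_from_state_ordering(BINS, TSS)
print(ORDER)
print(POINTS)
```

The logistic regression baseline (TR-IRLS with five-fold cross-validation) is a separate method and is not reproduced here.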

3. Prioritize marks Select marks based on state recovery Select appropriate number of states

Recovery of 40 chromatin states with 6 marks Increasing marks show increasing resolution in state recovery

Extending IMR90 set beyond initial 22 marks
22 marks common with CD4T data: H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2
19 marks only in CD4T data: H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2

Selecting marks based on specific states of interest
Greedy ordering of marks: recovery of states with increasing number of marks
State confusion matrix with 11 ENCODE marks: state inferred with all 41 marks vs. state inferred with subset of marks
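A greedy ordering of marks can be sketched as forward selection: at each step, add the mark that most improves a recovery score. The slides do not spell out the score used, so the one below (number of state pairs told apart by at least one selected mark) is a simple stand-in, and the emission values are invented for illustration.

```python
from itertools import combinations

def separated_pairs(emissions, marks, threshold=0.5):
    """Stand-in recovery score (an assumption, not the slides' measure):
    the number of state pairs whose emission probabilities differ by at
    least `threshold` in at least one of the selected marks."""
    states = list(emissions)
    return sum(
        any(abs(emissions[a][m] - emissions[b][m]) >= threshold for m in marks)
        for a, b in combinations(states, 2)
    )

def greedy_mark_order(emissions, all_marks, score=separated_pairs):
    """Repeatedly add the mark that gives the best score with those chosen so far."""
    chosen, remaining = [], list(all_marks)
    while remaining:
        best = max(remaining, key=lambda m: score(emissions, chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Hypothetical emission probabilities for four states and three marks.
EMISSIONS = {
    "promoter":    {"H3K4me3": 0.9, "H3K36me3": 0.1, "H3K27me3": 0.1},
    "transcribed": {"H3K4me3": 0.1, "H3K36me3": 0.9, "H3K27me3": 0.1},
    "repressed":   {"H3K4me3": 0.1, "H3K36me3": 0.1, "H3K27me3": 0.9},
    "bivalent":    {"H3K4me3": 0.9, "H3K36me3": 0.1, "H3K27me3": 0.9},
}
ORDER = greedy_mark_order(EMISSIONS, ["H3K36me3", "H3K4me3", "H3K27me3"])
print(ORDER)
```

To tune the ordering toward states of interest, the score could be restricted to pairs involving those states. That variation is a design choice left open here.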

4. Study dynamics Initial methods: ENCODE

Emerging large-scale genomic/epigenomic datasets
Multiple cell types, diverse experiments, developmental time-course
Reference Epigenome Mapping Centers: used to study many disease epigenomes
ENCODE Chromatin Group (PI: Bernstein)
15-state model learned jointly: insulator, enhancer, promoter, transcribed, repressed, repetitive
9 chromatin marks + WCE: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H4K20me1, H3K36me3, CTCF, +WCE, +RNA
9 human cell types: HUVEC (umbilical vein endothelial), NHEK (keratinocytes), GM12878 (lymphoblastoid), K562 (myelogenous leukemia), HepG2 (liver carcinoma), NHLF (normal human lung fibroblast), HMEC (mammary epithelial cell), HSMM (skeletal muscle myoblasts), H1 (embryonic)
Cell type concatenation approach (NHEK, HUVEC, H1, …): ensures common emission parameters, verified with independent learning

Chromatin states consistent across cell types
Clustering of independently learned 15-state models: promoter, candidate enhancer, insulator, transcribed, repressive, repetitive
State definitions are cell type invariant
State locations are cell type specific
Study dynamic changes in state assignments
Reveal logic of chromatin regulation

Chromatin state changes across pairs of cell types (K562, HUVEC, NHEK)
Pairwise state fold enrichments, proportion of genome
CTCF island state (State 9) highly stable across cell types

Chromatin state changes across pairs of cell types (K562, HUVEC, NHEK)
NHEK OFF → HUVEC ON. GO enrichment for TSS in HUVEC: active promoter (st1), NHEK: unmodified (st7)
blood vessel development: P = 2.60E-05
vasculature development: P = 3.00E-05
angiogenesis: P = 3.50E-05
blood vessel morphogenesis: P = 1.20E-04
HUVEC OFF → NHEK ON. GO enrichment for TSS in NHEK: active promoter (st1), HUVEC: unmodified (st7)
ectoderm development: P = 2.90E-09
epidermis development: P = 1.80E-08
keratinocyte differentiation: P = 3.00E-06
tissue development: P = 3.20E-06
cell adhesion: P = 1.90E-05

Correlations between multi-cell activity profiles
Profiles across cell types (HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1): gene expression, chromatin states, active TF motif enrichment, TF regulator expression, dip-aligned motif biases
[Figure labels: TF on, TF off; motif aligned, flat profile; ON, OFF; active enhancer, repressed; motif enrichment, motif depletion]

(1) Linking enhancer states to correlated target genes
1. Search for coherent changes between gene expression and chromatin marks at distant loci (10kb)
2. Combine two vectors: the expression vector for each gene, and the vector of mark intensities at the distant locus (combine marks based on enhancer emissions)
3. High correlation → enhancer/target link
Example: candidate TM4SF1 enhancer, 10kb
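The linking steps above can be sketched as a correlation across cell types between a gene's expression vector and a combined mark signal at a candidate locus. Weighting each mark by its enhancer-state emission probability is one plausible reading of "combine marks based on enhancer emissions". The 0.8 cutoff, locus names, and all data values are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def enhancer_signal(mark_intensities, enhancer_emissions):
    """One value per cell type: marks weighted by enhancer-state emissions (assumed scheme)."""
    return [sum(enhancer_emissions[m] * v for m, v in cell.items())
            for cell in mark_intensities]

def link_enhancers(expression, candidates, enhancer_emissions, min_r=0.8):
    """Return (locus, correlation) for candidate loci that track the gene's expression."""
    links = []
    for locus, intensities in candidates.items():
        r = pearson(expression, enhancer_signal(intensities, enhancer_emissions))
        if r >= min_r:
            links.append((locus, r))
    return links

# Hypothetical data: one gene, two candidate loci, four cell types.
EXPRESSION = [1.0, 5.0, 2.0, 8.0]
EMISSIONS = {"H3K4me1": 0.8, "H3K27ac": 0.6}
CANDIDATES = {
    "locus_A": [{"H3K4me1": 1, "H3K27ac": 0}, {"H3K4me1": 4, "H3K27ac": 3},
                {"H3K4me1": 2, "H3K27ac": 1}, {"H3K4me1": 6, "H3K27ac": 5}],
    "locus_B": [{"H3K4me1": 5, "H3K27ac": 4}, {"H3K4me1": 1, "H3K27ac": 0},
                {"H3K4me1": 4, "H3K27ac": 3}, {"H3K4me1": 0, "H3K27ac": 0}],
}
LINKS = link_enhancers(EXPRESSION, CANDIDATES, EMISSIONS)
print(LINKS)
```

With only a handful of cell types, correlations are noisy, so any cutoff would in practice need a significance check against a background. That check is omitted here.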

(3) Signatures of activators and repressors from activity profiles
“Enhancer” states, “On” states, “Off” state
STAT1 (STAT1 motif): activator of GM12878
STAT5 (STAT5 motif): activator for K562
CREB (CREB motif): repressor of GM12878
[Figure: motif enrichment and TF expression scales]

Summary
Learning chromatin states: ChromHMM captures combinatorics / spatial info
Interpreting chromatin states: distinct functions, power for genome annotation
Selecting number of marks / prioritizing: greedy ordering, tunable to states of interest
Chromatin dynamics in human cell lines: activity profiles, linking enhancers, activators/repressors
Selecting number of states: interpreting genome at increasing resolution

Selecting number of states

Step 1: Learn a larger model that captures ‘all’ relevant states
Comparison of BIC score vs. number of states for random and nested initialization. This figure shows the Bayesian Information Criterion (BIC) score for the models, both when the parameters were randomly initialized and when they were based on the nested initialization scheme (see methods). The BIC score is the log-likelihood minus a penalty term, computed as the number of parameters in the model divided by two, times the natural log of the number of data points (defined as the number of 200bp bins). The BIC scores of the models based on the nested initialization strategy are better than or close to the best BIC scores for each number of states based on random initialization. The figure also shows that the BIC score is not a sufficient criterion for selecting a model with a relatively small number of states for this data, as it continued to increase past 70 states.
Step 2: Prune down the model, greedily eliminating the least informative states
Step 3: Select an arbitrary cutoff based on biological interpretation
Result: a 51-state model that captures most biology in least complexity
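The BIC definition in the caption can be written out directly. The score follows the text (log-likelihood minus half the parameter count times the natural log of the number of 200bp bins, so higher is better). The parameter-count helper is an assumption about how a model with binary emissions would be counted, since the slides do not give the formula.

```python
import math

def hmm_parameter_count(n_states, n_marks):
    """Assumed free-parameter count for an HMM with one Bernoulli emission
    per (state, mark): K*M emissions, K*(K-1) transitions, K-1 initial."""
    return n_states * n_marks + n_states * (n_states - 1) + (n_states - 1)

def bic_score(log_likelihood, n_parameters, n_bins):
    """BIC as described in the caption: log-likelihood minus (k / 2) * ln(n)."""
    return log_likelihood - (n_parameters / 2.0) * math.log(n_bins)

# Hypothetical comparison of two models on the same number of bins.
N_BINS = 1_000_000
small = bic_score(-5.0e6, hmm_parameter_count(10, 41), N_BINS)
large = bic_score(-4.9e6, hmm_parameter_count(20, 41), N_BINS)
print(small, large)
```

With millions of bins the likelihood gain from extra states can outweigh the penalty for a long time, which is consistent with the caption's remark that the score kept increasing past 70 states.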

Recovery of 79-state model in random vs. nested initialization
Random initialization: states appear and disappear. Nested initialization: selected 51-state model, states consistently recovered.
This illustrates the challenge of comparing models with different numbers of states obtained from training based on different random initializations of the model parameters. Each column shows, for one model, the best correlation value of the emission parameters with each state of the 79-state model of supplementary Figure 7. There is a column for each model with the best likelihood from three different random initializations, for models with between 2 and 80 states. The figure shows that some states (e.g. 46, 64, 72) are recovered with very high correlation (>0.96) under some random initializations, but are not recovered under other random initializations for models with more states.
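The "best correlation of emission parameters" measure behind each column of the figure can be sketched as: for every state of the reference model, take the highest Pearson correlation between its emission vector and the emission vector of any state in the other model. The state numbers and emission vectors below are invented for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def best_match_correlations(reference, model):
    """For each reference state, the best correlation with any state of `model`.

    reference: {state id: emission vector}
    model:     list of emission vectors (one per state of the compared model)
    """
    return {state: max(pearson(vec, other) for other in model)
            for state, vec in reference.items()}

# Hypothetical emission vectors over three marks.
REFERENCE = {46: [0.9, 0.1, 0.1], 64: [0.1, 0.9, 0.8]}
MODEL = [[0.85, 0.15, 0.05], [0.5, 0.5, 0.5]]
RECOVERY = best_match_correlations(REFERENCE, MODEL)
print(RECOVERY)
```

In this toy case state 46 is recovered (correlation above 0.96) while state 64 is not, mirroring the "appear and disappear" behaviour the caption describes for random initialization.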

States capture mark dependencies: Expected vs. Observed Mark Co-Occurrence
[Figure: one panel per state, states 1-51]
Evaluation of conditional independence assumption. Each plot corresponds to one state and each plotted blue point corresponds to a pair of marks. The x-axis is how often a pair of marks is expected to be observed together based on the model, that is, the product of the emission probabilities of the two marks. The y-axis is how often the pair of marks is actually observed together in that state. The red line is the y=x line, where the expected count agrees exactly with the observed count. The plot confirms the accuracy of our model assumption that, conditioned on a state, pairs of marks are independent.
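The expected-versus-observed check for one state can be sketched as below. As a simplification, the per-mark frequencies among the bins assigned to the state stand in for the learned emission probabilities, and results are reported as frequencies rather than counts. Both choices are assumptions, and the bin calls are toy data.

```python
from itertools import combinations

def cooccurrence_check(bins):
    """Compare expected and observed co-occurrence for every pair of marks.

    bins: list of 0/1 tuples (one call per mark) for bins assigned to one state.
    Returns {(i, j): (expected, observed)} as frequencies, where expected is
    the product of the two per-mark frequencies (independence given the state).
    """
    n = len(bins)
    n_marks = len(bins[0])
    freq = [sum(b[m] for b in bins) / n for m in range(n_marks)]
    result = {}
    for i, j in combinations(range(n_marks), 2):
        expected = freq[i] * freq[j]
        observed = sum(1 for b in bins if b[i] and b[j]) / n
        result[(i, j)] = (expected, observed)
    return result

# Toy state where the two marks are independent: the point falls on y = x.
INDEPENDENT = cooccurrence_check([(1, 1), (1, 0), (0, 1), (0, 0)])
# Toy state where the marks always co-occur: the point falls above y = x.
DEPENDENT = cooccurrence_check([(1, 1), (1, 1), (0, 0), (0, 0)])
print(INDEPENDENT, DEPENDENT)
```

A state whose points sit well off the diagonal, like the second toy case, is mixing distinct mark combinations and is a natural candidate for splitting into more states.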

Increasing numbers of states lead to increasing mark independence
Increasing numbers of states capture observed mark dependencies. Expected vs. observed pairwise counts for models with 5, 10, 20, 30, 40, and 51 states based on the nested initialization. The state shown is state 6 in the 51-state model, together with the most correlated state (in terms of emission parameters) in each of the 5-40 state models. The plot shows that as more states are added, the points move closer to the y=x line, meaning that as the number of states increases, pairs of marks effectively become independent conditioned on a state.

Other desirable features of the resulting model Power appropriately spent Chromatin states show distinct mark combinations

Selecting number of states

State Assigned in CD4T using 22-common marks State Assigned in CD4T using all marks A value in a cell indicates the percentage of locations assigned to the state of the row with the full set of marks that would be assigned to the state of the column using the subset of marks.
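The confusion matrix described above (rows are states assigned with the full set of marks, columns are states assigned with the subset, cells are row percentages) can be sketched as follows. The state labels and per-location assignments are hypothetical.

```python
from collections import Counter, defaultdict

def state_confusion(full_states, subset_states):
    """Row-normalised confusion matrix between two state assignments.

    full_states[i]   = state of location i using the full set of marks (row)
    subset_states[i] = state of location i using the subset of marks (column)
    Returns {row state: {column state: percentage of the row's locations}}.
    """
    counts = defaultdict(Counter)
    for full, sub in zip(full_states, subset_states):
        counts[full][sub] += 1
    return {
        full: {sub: 100.0 * c / sum(row.values()) for sub, c in row.items()}
        for full, row in counts.items()
    }

# Hypothetical assignments for five locations.
FULL = ["satellite", "satellite", "satellite", "satellite", "heterochromatin"]
SUBSET = ["heterochromatin", "heterochromatin", "heterochromatin",
          "satellite", "heterochromatin"]
MATRIX = state_confusion(FULL, SUBSET)
print(MATRIX)
```

A row with most of its mass off the diagonal, like the "satellite" row in this toy example, marks a state that the reduced mark set fails to recover.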

State Assigned in CD4T using 22-common marks State Assigned in CD4T using all marks Many locations assigned to a satellite repeat state with the full set of marks are assigned to a large H3K9me3 heterochromatin state using the set of 22 marks.

State Assigned in CD4T using 22-common marks + H4K20me3 State Assigned in CD4T using all marks With just data on the location of H4K20me3 almost all these locations are assigned to a satellite repeat state.

State Assigned in CD4T using 22-common marks State Assigned in CD4T using all marks State 38 is primarily associated with H2AZ in distal locations

State Assigned in CD4T using 22-common marks + H2AZ State Assigned in CD4T using all marks Adding H2AZ substantially improves the recovery of this state.

State Assigned in CD4T using 22-common marks State Assigned in CD4T using all marks Various expressed transcribed states State 46 is strongly associated with simple repeats (maybe an artifact)

State Assigned in CD4T using 22-common marks + H2BK5me1 State Assigned in CD4T using all marks Various expressed transcribed states State 46 is strongly associated with simple repeats (maybe an artifact)