Integration methods and analysis Manolis Kellis Broad Institute of MIT and Harvard MIT Computer Science & Artificial Intelligence Laboratory
The good news: ever-expanding dimensions. Dimensions: Environment, Genotype, Disease, Gender, Stage, Age, Chromatin marks, Cell types. Now: cell-type and chromatin-mark dimensions. Next: references for each background. All clearly needed, and increasingly available.
Difficulty of interpreting an increasing number of marks. Challenge: simplify. (1) Learn combinations. (2) Interpret function. (3) Prioritize marks. (4) Study dynamics.
Overview. Learning chromatin states: ChromHMM captures combinatorics / spatial info. Interpreting chromatin states: distinct functions, power for genome annotation. Selecting number of marks / prioritizing: greedy ordering, tunable to states of interest. Chromatin dynamics in human cell lines: activity profiles, linking enhancers, activators/repressors. Selecting number of states: interpreting genome at increasing resolution.
1. Learn combinations
Challenge of data integration in many marks/cells. Epigenetic modifications: dozens of marks encode epigenetic state. Histone code hypothesis: distinct function for distinct combinations of marks? Hundreds of histone marks give an astronomical number of histone mark combinations. How do we find biologically relevant ones? Unsupervised approach; probabilistic model; explicit combinatorics. Ernst et al, in preparation
Chromatin states for genome annotation. Learn de novo significant combinations of chromatin marks. Reveal functional elements, even without looking at sequence. Use for genome annotation. Use for studying regulation dynamics in different cell types. State classes: Promoter states, Transcribed states, Active Intergenic, Repressed.
ChromHMM: learning ‘hidden’ chromatin states. [Figure: a stretch of DNA with a transcription start site, an enhancer, and a transcribed region, divided into 200bp intervals; the observed chromatin marks (K4me3, K36me3, K4me1, K27ac) in each interval; the most likely hidden state (1-6) for each interval; and the high-probability chromatin marks in each state, with example emission probabilities 0.8, 0.9, 0.7.] Observed chromatin marks are called based on a Poisson distribution. Each state: vector of emissions, vector of transitions. All probabilities are learned de novo from chromatin data alone (Baum-Welch, a.k.a. EM). Ernst et al, in preparation
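The mark-calling step on this slide can be sketched as follows. This is a minimal illustration, not the ChromHMM implementation: it assumes the background rate is simply the mean read count per 200bp bin, and the function names and the 1e-4 significance threshold are illustrative choices that do not come from the slides.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):      # accumulate P(X = 1) ... P(X = k - 1)
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)

def binarize(counts, threshold=1e-4):
    """Call a mark present (1) in a bin when its count is unlikely under the background.

    counts: read counts per 200bp bin for one mark.
    threshold: hypothetical significance cutoff (an assumption, not from the slides).
    """
    lam = sum(counts) / len(counts)  # assumed background: mean count per bin
    return [1 if poisson_sf(c, lam) < threshold else 0 for c in counts]

calls = binarize([0, 1, 0, 2, 1, 0, 30, 1, 0, 1])
```

The resulting 0/1 vector per bin, one entry per mark, is the kind of observation sequence from which emission and transition probabilities would then be learned.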
Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08). State classes: Promoter, Transcribed, Active intergenic, Repressed, Repetitive. Chromatin marks from (Barski et al, Cell 2007; Wang et al, Nature Genetics 2008); DNaseI hypersensitivity from (Boyle et al, Cell 2008); expression data from (Su et al, PNAS 2004); lamina data from (Guelen et al, Nature 2008)
2. Interpreting chromatin states As learned in the IMR90 REMC datasets
IMR90: 24 marks + DNase + CpG methylation. Emission parameters and transition parameters. Interpreting a 40-state model as a basis for analysis
Promoter associated states 1-5
State 1 Bivalent Repressed Promoter
State 3 TSS Specific State
States 6-19: transcribed regions
States 15-19: ends of genes and exons
State 19 is 60-fold enriched for ZNF genes
States 20-30 associated with active intergenic regions
States 20-24: candidate strong enhancers. Increased accessibility; lower methylation; greater conservation
States 30-40: large scale repressive states
State 34 - Strong H3K27me3 silenced state
39-40: H3K9me3 repressive domains / experimental nuclear lamina
Specific functional annotations for each of 51 chromatin states Promoter states Transcribed states Active Intergenic Repressed Repetitive
Example applications for genome annotation. New protein-coding genes. Chromatin signature: promoter / transcribed. Evolutionary signature: not protein-coding. [Figure labels: lincRNAs; known coding; evolutionary CSF score.] Long intergenic non-coding RNAs (lincRNAs): in promoter (short) / low-expression states. Assign candidate functions to intergenic SNPs from genome-wide association studies. New developmental enhancer regions.
Examples of distinct properties of chromatin states
Promoter state vs. gene GO function. Entries are fold enrichment (p-value). Columns: State 3, State 4, State 5, State 6, State 7, State 8. Rows with fewer than six entries are incomplete in the source, so their column alignment is not recoverable.
Cell Cycle Phase: 2.10 (2x10^-7), 0.57 (1), 1.61 (0.001), 1.45 (1), 1.15 (1), 1.51 (1)
Embryonic Development: 1.24 (1), 2.82 (9x10^-23), 1.07 (1), 0.85 (1), 0.54 (1), 1.00 (1)
Chromatin: 1.20 (1), 0.48 (1), 2.2 (1.4x10^-7), 1.64 (1)
Response to DNA Damage Stimulus: 0.35 (1), 1.55 (0.074), 2.13 (6.5x10^-11), 1.97 (1.0x10^-4), 0.84 (1)
RNA Processing: 0.49 (1), 0.26 (1), 1.31 (1), 1.91 (4.2x10^-11), 2.64 (8.7x10^-24), 2.45 (3.0x10^-4)
T cell Activation: 0.77 (1), 0.88 (1), 1.27 (1), 0.70 (1), 0.79 (1), 4.72 (2x10^-7)
Transcription end state: State 27. ZNF repressed state recovery: State 28, 112-fold ZNF enrichment. “The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20 …” MCB 2006.
Promoter vs. enhancer regulation: TF binding, motif enrichment (enhancers, promoters). State 10kb away predictive of expression.
Distinct types of repression (States 30, 29, 34, 42, 35): Chrom bands / HDAC resp; repeat family / composition.
Quantifying discovery power for promoters, transcripts
Significantly outperform individual chromatin marks. For transcripts, no single mark is a sufficient signature. CAGE/EST experiments give a possible upper bound.
(Left) The blue curve shows a receiver operating characteristic (ROC) curve for coverage of bins with a TSS when states are ordered by their fold enrichment for a TSS (5, 7, 6, 4, 8, 3, 1, 2, 9, 10, 11, 21, 45, 20, etc.). The green curve is based on ordering the k-means clusters. The red triangles are based on the individual input marks. The purple curve is based on a logistic regression classifier. The features given to the classifier were the ln(x+1) transformed values of the raw number of tags in a bin for each mark. No spatial information was given. Results for the classifier are based on five-fold cross validation. The TR-IRLS implementation of logistic regression was used with the default settings, except that the cgdeveps parameter was set to 0.0001.
(Right) The same plot as the left side, but for RefSeq transcribed regions as opposed to RefSeq TSS.
Komarek, P. and Moore, A.W. Making Logistic Regression A Core Data Mining Tool With TR-IRLS. IEEE ICDM 2005, pages 685-688.
Carninci, P. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38: 626-635 (2006).
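The state-ordering ROC described in the caption can be sketched as below. This is only an illustration under stated assumptions: the input is a pair of hypothetical per-state tallies (total bins and bins containing a TSS), and the function name `state_roc` and the toy numbers are invented for the example.

```python
def state_roc(state_bins, state_tss_bins):
    """Order states by TSS fold enrichment and trace the cumulative ROC points.

    state_bins: dict state -> number of bins assigned to the state.
    state_tss_bins: dict state -> number of those bins that contain a TSS.
    Returns (ordered states, list of (false positive rate, true positive rate)).
    """
    total_bins = sum(state_bins.values())
    total_tss = sum(state_tss_bins.values())
    total_neg = total_bins - total_tss

    def fold(state):
        # Fraction of TSS bins in the state relative to the genome-wide fraction.
        return (state_tss_bins[state] / state_bins[state]) / (total_tss / total_bins)

    order = sorted(state_bins, key=fold, reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for state in order:
        # Adding a state predicts all of its bins as TSS bins.
        tp += state_tss_bins[state]
        fp += state_bins[state] - state_tss_bins[state]
        points.append((fp / total_neg, tp / total_tss))
    return order, points

order, points = state_roc(
    {"A": 100, "B": 300, "C": 600},   # toy bin counts per state
    {"A": 40, "B": 8, "C": 2},        # toy TSS-containing bin counts
)
```

Each point on the curve corresponds to predicting a TSS in every bin of the top-ranked states so far, which is why the curve has one vertex per state.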
3. Prioritize marks Select marks based on state recovery Select appropriate number of states
Recovery of 40 chromatin states with 6 marks Increasing marks show increasing resolution in state recovery
Extending IMR90 set beyond initial 22 marks
22 marks common with CD4T data: H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2
19 marks only in CD4T data: H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2
Selecting marks based on specific states of interest. Greedy ordering of marks: recovery of states with increasing number of marks. [Figure: state confusion matrix with 11 ENCODE marks; axes: state inferred with all 41 marks vs. state inferred with subset of marks.]
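A greedy ordering of marks by state recovery can be sketched as follows. This is a deliberately simplified stand-in, not the procedure used for the slides: instead of re-running the model on a subset of marks, it assumes each bin is re-assigned to the most common full-model state among bins sharing the same pattern on the chosen marks. The names `recovery` and `greedy_order` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def recovery(bins, states, marks):
    """Fraction of bins whose full-model state is recovered from `marks` alone.

    bins: list of dicts mark -> 0/1; states: full-model state per bin.
    Simplifying assumption: a subset pattern maps to its most common state.
    """
    groups = defaultdict(Counter)
    for obs, state in zip(bins, states):
        groups[tuple(obs[m] for m in marks)][state] += 1
    correct = sum(c.most_common(1)[0][1] for c in groups.values())
    return correct / len(bins)

def greedy_order(bins, states, all_marks):
    """Repeatedly add the mark that most improves recovery; return (mark, recovery) steps."""
    chosen, remaining, trace = [], list(all_marks), []
    while remaining:
        best = max(remaining, key=lambda m: recovery(bins, states, chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
        trace.append((best, recovery(bins, states, chosen)))
    return trace

marks = ["H3K4me3", "H3K36me3", "H3K27me3"]
def make_bin(a, b, c):
    return dict(zip(marks, (a, b, c)))

bins = ([make_bin(1, 0, 0)] * 3 + [make_bin(0, 1, 0)] * 3
        + [make_bin(0, 0, 1)] * 2 + [make_bin(0, 0, 0)] * 2)
states = ["promoter"] * 3 + ["transcribed"] * 3 + ["repressed"] * 2 + ["quiescent"] * 2
trace = greedy_order(bins, states, marks)
```

To tune the ordering toward states of interest, one option would be to compute `recovery` only over bins whose full-model state is in the set of interest.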
4. Study dynamics Initial methods: ENCODE
Emerging large-scale genomic/epigenomic datasets: multiple cell types, diverse experiments, developmental time-course. Reference Epigenome Mapping Centers. Used to study many disease epigenomes.
ENCODE Chromatin Group (PI: Bernstein): 15-state model learned jointly from 9 chromatin marks + WCE in 9 human cell types. State classes: Insulator, Enhancer, Promoter, Transcribed, Repressed, Repetitive.
Cell types: HUVEC (umbilical vein endothelial), NHEK (keratinocytes), GM12878 (lymphoblastoid), K562 (myelogenous leukemia), HepG2 (liver carcinoma), NHLF (normal human lung fibroblast), HMEC (mammary epithelial cell), HSMM (skeletal muscle myoblasts), H1 (embryonic).
Data: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H4K20me1, H3K36me3, CTCF, + WCE, + RNA.
Cell type concatenation approach (NHEK, HUVEC, H1, …): ensures common emission parameters; verified with independent learning.
Chromatin states consistent across cell types. Clustering of independently learned 15-state models. State classes: Promoter, Candidate enhancer, Insulator, Transcribed, Repressive, Repetitive. State definitions are cell type invariant. State locations are cell type specific. Study dynamic changes in state assignments. Reveal logic of chromatin regulation.
Chromatin state changes across pairs of cell types (K562, HUVEC, NHEK). [Figure: pairwise state fold enrichments; proportion of genome.] CTCF island state (State 9) highly stable across cell types.
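One way to compute a pairwise state fold enrichment between two cell types is sketched below. The slides do not define the statistic, so this assumes a common choice: the observed fraction of bins in state i in one cell type and state j in the other, divided by the fraction expected if the two assignments were independent. The function name and toy inputs are hypothetical.

```python
from collections import Counter

def pairwise_fold_enrichment(states_a, states_b):
    """Fold enrichment of each (state in cell type A, state in cell type B) pair.

    states_a, states_b: state assignment per bin, aligned bin by bin.
    Assumed definition: P(i, j) / (P_A(i) * P_B(j)).
    """
    n = len(states_a)
    joint = Counter(zip(states_a, states_b))
    freq_a = Counter(states_a)
    freq_b = Counter(states_b)
    return {
        (i, j): (count / n) / ((freq_a[i] / n) * (freq_b[j] / n))
        for (i, j), count in joint.items()
    }

# Toy example: a state that never changes between cell types is enriched on the diagonal.
stable = pairwise_fold_enrichment([1, 1, 7, 7], [1, 1, 7, 7])
shuffled = pairwise_fold_enrichment([1, 1, 7, 7], [1, 7, 1, 7])
```

Under this definition a highly stable state, such as the CTCF island state mentioned on the slide, would show a large diagonal value and values near zero off the diagonal.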
Chromatin state changes across pairs of cell types (K562, HUVEC, NHEK)
GO enrichment for TSS in HUVEC: active promoter (st1), NHEK: unmodified (st7) [HUVEC ON, NHEK OFF]. GO term: P-value
blood vessel development: 2.60E-05
vasculature development: 3.00E-05
angiogenesis: 3.50E-05
blood vessel morphogenesis: 1.20E-04
GO enrichment for TSS in NHEK: active promoter (st1), HUVEC: unmodified (st7) [HUVEC OFF, NHEK ON]. GO term: P-value
ectoderm development: 2.90E-09
epidermis development: 1.80E-08
keratinocyte differentiation: 3.00E-06
tissue development: 3.20E-06
cell adhesion: 1.90E-05
Correlations between multi-cell activity profiles. Profiles compared: gene expression, chromatin states, active TF motif enrichment, TF regulator expression, dip-aligned motif biases. [Figure: profiles across HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1; legend: TF on / TF off; motif aligned / flat profile; ON / OFF; active enhancer / repressed; motif enrichment / motif depletion.]
(1) Linking enhancer states to correlated target genes. 1. Search for coherent changes between gene expression and chromatin marks at distant loci (10kb). 2. Combine two vectors: the expression vector for each gene, and the vector of mark intensities at the distant locus (combine marks based on enhancer emissions). 3. High correlation gives an enhancer/target link. Example: candidate TM4SF1 enhancer, 10kb away.
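The linking steps on this slide can be sketched as below, assuming Pearson correlation across cell types as the measure of coherence. The weighting of marks by enhancer-state emission probabilities follows the slide's wording, but the exact formula, the function names, the 0.9 cutoff, and the toy numbers are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def enhancer_signal(mark_intensity, emissions):
    """Combine mark intensities at a locus, weighted by enhancer-state emissions."""
    return sum(emissions[mark] * value for mark, value in mark_intensity.items())

def link_enhancer(expression, marks_by_cell, emissions, min_r=0.9):
    """Correlate a gene's expression with a distal locus across cell types.

    expression: expression of the gene, one value per cell type.
    marks_by_cell: per cell type, dict mark -> intensity at the distal locus.
    min_r: hypothetical correlation cutoff for calling a link.
    """
    signal = [enhancer_signal(cell, emissions) for cell in marks_by_cell]
    r = pearson(expression, signal)
    return r, r >= min_r

emissions = {"H3K4me1": 0.8, "H3K27ac": 0.6}          # toy enhancer emissions
locus = [{"H3K4me1": 1, "H3K27ac": 1},
         {"H3K4me1": 2, "H3K27ac": 2},
         {"H3K4me1": 3, "H3K27ac": 3}]                # toy intensities in 3 cell types
r_up, linked_up = link_enhancer([10, 20, 30], locus, emissions)
r_down, linked_down = link_enhancer([30, 20, 10], locus, emissions)
```

With only nine cell types, a correlation cutoff alone is a weak filter, so in practice a significance estimate against shuffled or random loci would likely be needed.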
(3) Signatures of activators and repressors from activity profiles. STAT1: activator of GM12878. STAT5: activator for K562. CREB: repressor of GM12878. [Figure: motif enrichment and TF expression for the STAT1, STAT5, and CREB motifs across “Enhancer” states, “On” states, and “Off” state.]
Summary. Learning chromatin states: ChromHMM captures combinatorics / spatial info. Interpreting chromatin states: distinct functions, power for genome annotation. Selecting number of marks / prioritizing: greedy ordering, tunable to states of interest. Chromatin dynamics in human cell lines: activity profiles, linking enhancers, activators/repressors. Selecting number of states: interpreting genome at increasing resolution.
Selecting number of states
Step 1: Learn a larger model that captures ‘all’ relevant states. Step 2: Prune down the model, greedily eliminating the least informative states. Step 3: Select an arbitrary cutoff based on biological interpretation. Result: a 51-state model that captures most biology in least complexity.
Comparison of BIC score vs. number of states for random and nested initialization. This figure shows the Bayesian Information Criterion (BIC) score for the models, both when the parameters were randomly initialized and when they were based on the nested initialization scheme (see methods). The BIC score is the log likelihood score minus a penalty term, computed as the number of parameters in the model divided by two, times the natural log of the number of data points (defined as the number of 200bp bins). The figure shows that the BIC scores of the models based on the nested initialization strategy are better than or close to the best BIC scores for each number of states based on random initialization. The figure also shows that the BIC score is not a sufficient criterion for selecting a model with a relatively small number of states for this data, as it continued to increase past 70 states.
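The BIC definition in the caption, log likelihood minus (number of parameters / 2) times ln(number of 200bp bins), can be written out as a short sketch. The parameter count used here (one emission probability per state and mark, plus free transition and initial probabilities) is an assumption about how parameters are tallied, and the function names are hypothetical.

```python
import math

def hmm_param_count(n_states, n_marks):
    """Assumed free-parameter count for an HMM with binary, independent mark emissions."""
    emissions = n_states * n_marks          # one Bernoulli probability per state and mark
    transitions = n_states * (n_states - 1) # each row of the transition matrix sums to 1
    initial = n_states - 1                  # initial state distribution sums to 1
    return emissions + transitions + initial

def bic_score(log_likelihood, n_params, n_bins):
    """BIC as defined in the caption: higher is better."""
    return log_likelihood - (n_params / 2.0) * math.log(n_bins)

# Toy comparison: a larger model must gain enough likelihood to offset its penalty.
small = bic_score(-1000.0, hmm_param_count(2, 3), 10000)
large = bic_score(-1000.0, hmm_param_count(4, 3), 10000)
```

Because the penalty grows only with the log of the number of bins, a genome-scale data set can keep rewarding extra states, which is consistent with the caption's observation that the score kept increasing past 70 states.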
Recovery of 79-state model in random vs. nested initialization. Random initialization: states appear and disappear. Nested initialization: selected 51-state model, states consistently recovered. This illustrates the challenge of comparing models with different numbers of states obtained from training based on different random initializations of the model parameters. Each column shows, for one model, the best correlation value of the emission parameters with each state of the 79-state model of supplementary Figure 7. There is a column for each model with the best likelihood from three different random initializations, for models with between 2 and 80 states. The figure shows that some states (e.g. 46, 64, 72) are recovered with very high correlation (>0.96) under some random initializations, but are not recovered under other random initializations for models with more states.
States capture mark dependencies: Expected vs. Observed Mark Co-Occurrence. [Figure: one panel per state, states 1-51.] Evaluation of the conditional independence assumption. Each plot corresponds to one state and each plotted blue point corresponds to a pair of marks. The x-axis is how often a pair of marks is expected to be observed together based on the model, that is, the product of the emission probabilities of the two marks. The y-axis is how often a pair of marks in a state is actually observed together. The red line is the y=x line, where the expected count agrees exactly with the observed count. The plot confirms the accuracy of our model assumption that, conditioned on a state, the pairs of marks are independent.
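The expected vs. observed comparison in this caption can be sketched for a single state as below. For simplicity it assumes the emission probability of a mark equals its empirical frequency among the bins assigned to that state, which is an approximation to the learned model parameters; the function name and toy data are hypothetical.

```python
from itertools import combinations

def cooccurrence(bins, marks):
    """Expected vs. observed co-occurrence for every pair of marks within one state.

    bins: list of dicts mark -> 0/1, restricted to bins assigned to the state.
    Returns dict (mark_a, mark_b) -> (expected fraction, observed fraction).
    """
    n = len(bins)
    # Assumption: per-state emission probability approximated by empirical frequency.
    freq = {m: sum(b[m] for b in bins) / n for m in marks}
    result = {}
    for a, b in combinations(marks, 2):
        expected = freq[a] * freq[b]   # product of emissions under independence
        observed = sum(1 for x in bins if x[a] and x[b]) / n
        result[(a, b)] = (expected, observed)
    return result

marks = ["H3K4me1", "H3K27ac"]
independent = cooccurrence(
    [{"H3K4me1": 1, "H3K27ac": 1}, {"H3K4me1": 1, "H3K27ac": 0},
     {"H3K4me1": 0, "H3K27ac": 1}, {"H3K4me1": 0, "H3K27ac": 0}], marks)
dependent = cooccurrence(
    [{"H3K4me1": 1, "H3K27ac": 1}, {"H3K4me1": 1, "H3K27ac": 1},
     {"H3K4me1": 0, "H3K27ac": 0}, {"H3K4me1": 0, "H3K27ac": 0}], marks)
```

Points for the first toy state would fall on the y=x line, while the second toy state, which mixes two distinct mark patterns, would sit above it and suggests the state should be split.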
Increasing numbers of states lead to increasing mark independence. An increasing number of states captures the observed mark dependencies. Expected vs. observed pairwise counts for models with 5, 10, 20, 30, 40, and 51 states based on the nested initialization. The state shown is state 6 in the 51-state model, together with the most correlated states, in terms of the emission parameters, in the 5-40 state models. The plot shows that as more states are added, the points move closer to the y=x line, meaning that as the number of states increases, pairs of marks effectively become independent when conditioned on a state.
Other desirable features of the resulting model Power appropriately spent Chromatin states show distinct mark combinations
Selecting number of states
[Axes: state assigned in CD4T using 22 common marks vs. state assigned in CD4T using all marks.] A value in a cell indicates the percentage of locations assigned to the state of the row with the full set of marks that would be assigned to the state of the column using the subset of marks.
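The row-normalized confusion matrix described here can be sketched as follows, assuming two aligned lists of per-location state assignments. The function name and the toy state labels are hypothetical.

```python
from collections import Counter

def confusion_percent(full_states, subset_states):
    """Percentage of each full-model state's locations assigned to each subset-model state.

    full_states: state per location using the full set of marks (rows).
    subset_states: state per location using the subset of marks (columns).
    Returns dict (row state, column state) -> percentage of the row's locations.
    """
    row_totals = Counter(full_states)
    pairs = Counter(zip(full_states, subset_states))
    return {
        (row, col): 100.0 * count / row_totals[row]
        for (row, col), count in pairs.items()
    }

# Toy example: most "satellite" locations fall into "heterochromatin" with fewer marks.
matrix = confusion_percent(
    ["satellite", "satellite", "satellite", "satellite", "heterochromatin"],
    ["heterochromatin", "heterochromatin", "heterochromatin", "satellite", "heterochromatin"],
)
```

Each row sums to 100, so a state that is well recovered by the subset of marks shows a value near 100 on the diagonal.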
[Axes: state assigned in CD4T using 22 common marks vs. state assigned in CD4T using all marks.] Many locations assigned to a satellite repeat state with the full set of marks are assigned to a large H3K9me3 heterochromatin state using the set of 22 marks.
[Axes: state assigned in CD4T using 22 common marks + H4K20me3 vs. state assigned in CD4T using all marks.] With just data on the location of H4K20me3, almost all these locations are assigned to a satellite repeat state.
[Axes: state assigned in CD4T using 22 common marks vs. state assigned in CD4T using all marks.] State 38 is primarily associated with H2AZ in distal locations.
[Axes: state assigned in CD4T using 22 common marks + H2AZ vs. state assigned in CD4T using all marks.] Adding H2AZ substantially improves the recovery of this state.
[Axes: state assigned in CD4T using 22 common marks vs. state assigned in CD4T using all marks.] Various expressed transcribed states. State 46 is strongly associated with simple repeats (maybe an artifact).
[Axes: state assigned in CD4T using 22 common marks + H2BK5me1 vs. state assigned in CD4T using all marks.] Various expressed transcribed states. State 46 is strongly associated with simple repeats (maybe an artifact).