Integration methods and analysis


1 Integration methods and analysis
Manolis Kellis. Broad Institute of MIT and Harvard; MIT Computer Science & Artificial Intelligence Laboratory

2 The good news: ever-expanding dimensions
Dimensions: environment, genotype, disease, gender, stage, age, chromatin marks, cell types.
Now: cell-type and chromatin-mark dimensions.
Next: references for each background.
All clearly needed, and increasingly available.

3 Difficulty of interpreting increasing # marks
Challenge: simplify. Learn combinations; interpret function; prioritize marks; study dynamics.

4 Overview
Learning chromatin states: ChromHMM captures combinatorics / spatial info.
Interpreting chromatin states: distinct functions, power for genome annotation.
Selecting number of marks / prioritizing: greedy ordering, tunable to states of interest.
Chromatin dynamics in human cell lines: activity profiles, linking enhancers, activators/repressors.
Selecting number of states: interpreting genome at increasing resolution.

5 1. Learn combinations

6 Challenge of data integration in many marks/cells
Epigenetic modifications: dozens of marks; encode epigenetic state.
Histone code hypothesis: distinct function for distinct combinations of marks?
Hundreds of histone marks; astronomical number of histone mark combinations. How do we find biologically relevant ones?
Unsupervised approach; probabilistic model; explicit combinatorics.
Ernst et al., in preparation

7 Chromatin states for genome annotation
Learn de novo significant combinations of chromatin marks. Reveal functional elements, even without looking at sequence. Use for genome annotation. Use for studying regulation dynamics in different cell types.
State classes: promoter states, transcribed states, active intergenic, repressed.

8 ChromHMM: learning ‘hidden’ chromatin states
Figure: a stretch of DNA containing an enhancer, a transcription start site, and a transcribed region, divided into 200bp intervals.
Observed chromatin marks (K4me3, K36me3, K4me1, K27ac): called based on a Poisson distribution.
Most likely hidden state (states 1-6) for each interval; each state has its own high-probability chromatin marks (example probabilities shown: 0.8, 0.9, 0.7).
Each state: vector of emissions, vector of transitions.
All probabilities are learned de novo from chromatin data alone (Baum-Welch, aka EM).
[Slide note: We had talked about adding the H3K4 etc. labels within the shapes.]
Ernst et al., in preparation
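The mark-calling and emission steps named on this slide can be sketched as follows. This is a minimal illustration, not the ChromHMM code itself: the function names, the background rate `lam`, the `1e-4` threshold, and the pairing of the example probabilities with particular marks are all assumptions made for the example. It also assumes that marks are independent given the state, so a state's emission vector holds one "mark present" probability per mark.

```python
import math


def poisson_sf(count, lam):
    """Return P(X >= count) for X ~ Poisson(lam)."""
    cdf_below = sum(math.exp(-lam) * lam ** k / math.factorial(k)
                    for k in range(count))
    return 1.0 - cdf_below


def call_mark(count, lam, threshold=1e-4):
    """Call a mark 'present' in a 200bp bin if the tag count is unlikely
    under a Poisson background with mean lam (threshold is an assumption)."""
    return poisson_sf(count, lam) < threshold


def emission_prob(observed, emit):
    """Probability of a binary mark vector under one state's emission vector,
    treating marks as independent given the state."""
    p = 1.0
    for present, e in zip(observed, emit):
        p *= e if present else (1.0 - e)
    return p


# Hypothetical state emitting K4me3 with probability 0.8 and K36me3 with 0.9.
example = emission_prob([1, 0], [0.8, 0.9])  # K4me3 present, K36me3 absent
```

With these per-bin emission probabilities and a transition vector per state, the most likely hidden state sequence would then be found with a standard HMM decoding pass.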

9 Application of ChromHMM to 41 chromatin marks in CD4+ T-cells (Barski’07, Wang’08)
State classes: promoter, transcribed, active intergenic, repressed, repetitive.
Chromatin marks from Barski et al., Cell 2007 and Wang et al., Nature Genetics 2008; DNase I hypersensitivity from Boyle et al., Cell 2008; expression data from Su et al., PNAS 2004; lamina data from Guelen et al., Nature 2008.

10 2. Interpreting chromatin states
As learned in the IMR90 REMC datasets

11 IMR90: 24 marks + DNAse + CpGmethyl
Emission parameters; transition parameters. Interpreting a 40-state model as a basis for analysis.

12 Promoter associated states 1-5

13 State 1 Bivalent Repressed Promoter

14 State 3 TSS Specific State

15 States 6-19: transcribed regions

16 States 15-19: ends of genes and exons

17 State 19 is 60-fold enriched for ZNF genes

18 States 20-30 associated with active intergenic regions

19 States 20-24: candidate strong enhancers
Increased accessibility; lower methylation; greater conservation

20 States 30-40: large scale repressive states

21 State 34 - Strong H3K27me3 silenced state

22 States 39-40: H3K9me3 repressive domains / experimental nuclear lamina

23 Specific functional annotations for each of 51 chromatin states
State classes: promoter states, transcribed states, active intergenic, repressed, repetitive.

24 Example applications for genome annotation
New protein-coding genes.
Chromatin signature: promoter / transcribed. Evolutionary signature: not protein-coding. (Figure: evolutionary CSF score for lincRNAs vs. known coding.)
Long intergenic non-coding RNAs (lincRNAs): in promoter (short) / low-expression states.
Assign candidate functions to intergenic SNPs from genome-wide association studies.
New developmental enhancer regions.

25 Examples of distinct properties of chromatin states
Promoter state → gene GO function. GO category values, given as enrichment (p-value), for State 3, State 4, State 5, State 6, State 7, State 8:
Cell Cycle Phase: 2.10 (2x10^-7) (1) 1.61 (0.001) (1) 1.15 (1) 1.51 (1)
Embryonic Development: 1.24 (1) (9x10^-23) 1.07 (1) 0.85 (1) 0.54 (1) 1.00 (1)
Chromatin: 1.20 (1) 0.48 (1) 2.2 (1.4x10^-7) 1.64 (1)
Response to DNA Damage Stimulus: 0.35 (1) 1.55 (0.074) (6.5x10^-11) (1.0x10^-4) 0.84 (1)
RNA Processing: 0.49 (1) 0.26 (1) 1.31 (1) 1.91 (4.2x10^-11) (8.7x10^-24) (3.0x10^-4)
T cell Activation: 0.77 (1) 0.88 (1) 1.27 (1) 0.70 (1) 0.79 (1) (2x10^-7)
Transcription end state: State 27. ZNF repressed state recovery: State 28, 112-fold ZNF enrichment.
“The achievement of the repressed state by wild-type KAP1 involves decreased recruitment of RNA polymerase II, reduced levels of histone H3 K9 acetylation and H3K4 methylation, an increase in histone occupancy, enrichment of trimethyl histone H3K9, H3K36, and histone H4K20 …” MCB 2006.
Promoter vs. enhancer regulation: TF binding; motif enrichment (enhancers, promoters). State 10kb away predictive of expression.
Distinct types of repression (States 30, 29, 34, 42, 35): Chrom bands / HDAC resp; repeat family / composition.

26 Quantifying discovery power for promoters, transcripts
Significantly outperform individual chromatin marks. For transcripts, no single mark is a sufficient signature. CAGE/EST experiments give a possible upper bound.
(Left) The blue curve in the figure shows a receiver operating characteristic (ROC) for coverage of bins with a TSS when states are ordered by their fold enrichment for a TSS (5, 7, 6, 4, 8, 3, 1, 2, 9, 10, 11, 21, 45, 20, etc.). The green curve is based on ordering the k-means clusters. The red triangles are based on the individual input marks. The purple curve is based on a logistic regression classifier. The features given to the classifier were ln(x+1)-transformed values of the raw number of tags in a bin for each mark. No spatial information was given. Results for the classifier are based on five-fold cross-validation. The TR-IRLS implementation of logistic regression was used with the default settings, except that the cgdeveps parameter was changed.
(Right) The same plot as the left side, but for RefSeq transcribed regions as opposed to RefSeq TSS.
Komarek, P. and Moore, A.W. Making Logistic Regression A Core Data Mining Tool With TR-IRLS. IEEE ICDM 2005.
Carninci, P. Genome-wide analysis of mammalian promoter architecture and evolution. Nature Genetics 38 (2006).
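The state-ordering ROC described in this caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' evaluation code: the function name and the per-state counts in the usage note are hypothetical, and every state is assumed to cover at least one bin.

```python
def roc_from_states(state_bins, state_tss_bins):
    """Build ROC points by adding states in order of decreasing TSS fold enrichment.

    state_bins:     {state: number of bins assigned to the state}
    state_tss_bins: {state: number of those bins that contain a TSS}
    Returns (state order, list of (false positive rate, true positive rate)).
    """
    total_bins = sum(state_bins.values())
    total_pos = sum(state_tss_bins.values())
    total_neg = total_bins - total_pos

    def fold_enrichment(state):
        # Fraction of the state's bins with a TSS, relative to the genome-wide fraction.
        return (state_tss_bins[state] / state_bins[state]) / (total_pos / total_bins)

    order = sorted(state_bins, key=fold_enrichment, reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for state in order:
        tp += state_tss_bins[state]
        fp += state_bins[state] - state_tss_bins[state]
        points.append((fp / total_neg, tp / total_pos))
    return order, points
```

For example, with hypothetical counts `{"A": 100, "B": 100, "C": 800}` bins and `{"A": 50, "B": 10, "C": 0}` TSS bins, the states are added in the order A, B, C and the curve ends at (1.0, 1.0).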

27 3. Prioritize marks
Select marks based on state recovery. Select appropriate number of states.

28 Recovery of 40 chromatin states with 6 marks
Increasing marks show increasing resolution in state recovery

29 Extending IMR90 set beyond initial 22 marks
22 marks common with CD4T data: H2AK5ac, H3K27ac, H3K27me3, H3K9me3, H2BK120ac, H3K4ac, H3K36me3, H4K20me1, H2BK12ac, H3K9ac, H3K4me1, H2BK20ac, H4K5ac, H3K4me2, H3K14ac, H4K8ac, H3K4me3, H3K18ac, H4K91ac, H3K79me1, H3K23ac, H3K79me2.
19 marks only in CD4T data: H2AK9ac, H2BK5me1, H3K9me2, CTCF, H2BK5ac, H3K27me1, H3R2me1, H2AZ, H3K36ac, H3K27me2, H3R2me2, PolII, H4K12ac, H3K36me1, H4K20me3, H4K16ac, H3K79me3, H4R3me2.

30 Selecting marks based on specific states of interest
Recovery of states with increasing number of marks, following a greedy ordering of marks (axes: state inferred with all 41 marks vs. state inferred with subset of marks).
State confusion matrix with 11 ENCODE marks (axis: state inferred with all 41 marks).
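A greedy ordering of marks can be sketched as a forward selection loop. The slide does not give the recovery criterion, so the `score` callback here is a placeholder, and the toy `SEPARATES` table and `toy_score` function are hypothetical stand-ins for "how well a subset of marks recovers the states of interest".

```python
def greedy_order(marks, score):
    """Order marks by repeatedly adding the mark that maximizes score(subset)."""
    chosen, remaining = [], list(marks)
    while remaining:
        best = max(remaining, key=lambda m: score(chosen + [m]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical toy data: which states each mark helps to distinguish.
SEPARATES = {
    "H3K4me3": {"promoter"},
    "H3K36me3": {"transcribed", "gene_end"},
    "H3K27me3": {"repressed"},
    "H3K4me1": {"enhancer", "weak_enhancer", "promoter"},
}


def toy_score(subset):
    """Toy recovery score: number of states distinguished by the subset."""
    covered = set()
    for mark in subset:
        covered |= SEPARATES[mark]
    return len(covered)


order = greedy_order(SEPARATES, toy_score)
```

Restricting the score to a few states of interest is one way to make the ordering tunable: marks that only help recover other states then fall to the end of the list.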

31 4. Study dynamics
Initial methods: ENCODE

32 Emerging large-scale genomic/epigenomic datasets
Multiple cell types; diverse experiments; developmental time-course.
Reference Epigenome Mapping Centers; used to study many disease epigenomes.
ENCODE Chromatin Group (PI: Bernstein): 15-state model learned jointly from 9 chromatin marks + WCE in 9 human cell types.
Cell types: HUVEC (umbilical vein endothelial), NHEK (keratinocytes), GM12878 (lymphoblastoid), K562 (myelogenous leukemia), HepG2 (liver carcinoma), NHLF (normal human lung fibroblast), HMEC (mammary epithelial cell), HSMM (skeletal muscle myoblasts), H1 (embryonic).
Marks: H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K9ac, H3K27me3, H4K20me1, H3K36me3, CTCF, + WCE, + RNA.
State classes: insulator, enhancer, promoter, transcribed, repressed, repetitive.
Cell type concatenation approach (e.g. NHEK, HUVEC, H1): ensures common emission parameters; verified with independent learning.

33 Chromatin states consistent across cell types
Clustering of independently learned 15-state models. State classes: promoter, candidate enhancer, insulator, transcribed, repressive, repetitive.
State definitions are cell type invariant. State locations are cell type specific.
Study dynamic changes in state assignments. Reveal logic of chromatin regulation.

34 Chromatin state changes across pairs of cell types
Pairwise state fold enrichments and proportion of genome for K562, HUVEC, and NHEK.
CTCF island state (State 9) highly stable across cell types.
[Slide note: It’d be nice to point to a more variable mark too. CTCF is a pretty obvious thing, it shouldn’t be the only thing you point to. TODO: Add interpretation lines to the 10-state model!]

35 Chromatin state changes across pairs of cell types
Cell types shown: K562, HUVEC, NHEK.
NHEK OFF → HUVEC ON. GO enrichment for TSS in: HUVEC: active promoter (st1); NHEK: unmodified (st7).
blood vessel development: P-value 2.60E-05
vasculature development: P-value 3.00E-05
angiogenesis: P-value 3.50E-05
blood vessel morphogenesis: P-value 1.20E-04
HUVEC OFF → NHEK ON. GO enrichment for TSS in: NHEK: active promoter (st1); HUVEC: unmodified (st7).
ectoderm development: P-value 2.90E-09
epidermis development: P-value 1.80E-08
keratinocyte differentiation: P-value 3.00E-06
tissue development: P-value 3.20E-06
cell adhesion: P-value 1.90E-05

36 Correlations between multi-cell activity profiles
Activity profiles across cell types (HUVEC, NHEK, GM12878, K562, HepG2, NHLF, HMEC, HSMM, H1): gene expression; chromatin states; active TF motif enrichment; TF regulator expression; dip-aligned motif biases.
Legend: TF on / TF off; motif aligned / flat profile; ON (active enhancer) / OFF (repressed); motif enrichment / motif depletion.

37 (1) Linking enhancer states to correlated target genes
1. Search for coherent changes between gene expression and chromatin marks at distant loci (10kb).
2. Combine two vectors: the expression vector for each gene, and the vector of mark intensities at the distant locus (combine marks based on enhancer emissions).
3. High correlation → enhancer/target link.
Example: candidate TM4SF1 enhancer (10kb).
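The linking idea on this slide can be sketched as a correlation between two vectors across cell types. This is a simplified illustration, not the method as implemented by the authors: Pearson correlation, the `min_r` cutoff, the function names, and the use of emission probabilities as weights for combining marks are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def enhancer_signal(mark_intensity, weights):
    """Combine per-mark intensities at a distal locus into one value per cell type.

    mark_intensity: {mark: [intensity in each cell type]}
    weights:        {mark: weight, e.g. the mark's emission probability in enhancer states}
    """
    n_cells = len(next(iter(mark_intensity.values())))
    return [sum(weights[m] * mark_intensity[m][c] for m in mark_intensity)
            for c in range(n_cells)]


def is_linked(expression, mark_intensity, weights, min_r=0.9):
    """Call an enhancer/target link when expression tracks the enhancer signal."""
    return pearson(expression, enhancer_signal(mark_intensity, weights)) >= min_r
```

A gene whose expression rises and falls across cell types together with the combined enhancer signal at a locus 10kb away would pass this test; one with a flat or opposite profile would not.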

38 (3) Signatures of activators and repressors from activity profiles
Examples: STAT1, activator of GM12878; STAT5, activator for K562; CREB, repressor of GM12878.
(Figure: motif enrichment and TF expression for the STAT1, STAT5, and CREB motifs across “Enhancer” states, “On” states, and the “Off” state.)

39 Summary
Learning chromatin states: ChromHMM captures combinatorics / spatial info.
Interpreting chromatin states: distinct functions, power for genome annotation.
Selecting number of marks / prioritizing: greedy ordering, tunable to states of interest.
Chromatin dynamics in human cell lines: activity profiles, linking enhancers, activators/repressors.
Selecting number of states: interpreting genome at increasing resolution.

40 Selecting number of states

41 Step 1: Learn a larger model that captures ‘all’ relevant states
Step 1: Learn a larger model that captures ‘all’ relevant states.
Step 2: Prune down the model, greedily eliminating the least informative states.
Step 3: Select an arbitrary cutoff based on biological interpretation.
Result: a 51-state model that captures most biology in least complexity.
Figure: Comparison of BIC score vs. number of states for random and nested initialization. This figure shows the Bayesian Information Criterion (BIC) score for the models both when the parameters were randomly initialized and when they were based on the nested initialization scheme (see methods). The BIC score is the log likelihood score minus a penalty term, computed as half the number of parameters in the model multiplied by the natural log of the number of data points (defined as the number of 200bp bins). The figure shows that the BIC scores of the models based on the nested initialization strategy are better than or close to the best BIC scores for each number of states based on random initialization. The figure also shows that the BIC score is not a sufficient criterion for selecting a model with a relatively small number of states for this data, as it continued to increase past 70 states.
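The BIC score as defined in this caption (log likelihood minus a penalty, higher is better) can be written out directly. The parameter count below is an assumption, since the slide does not spell it out: it counts one emission probability per state and mark, the free transition probabilities, and the free initial-state probabilities.

```python
import math


def n_params(n_states, n_marks):
    """Assumed free-parameter count for an HMM with binary, independent emissions."""
    emissions = n_states * n_marks
    transitions = n_states * (n_states - 1)
    initial = n_states - 1
    return emissions + transitions + initial


def bic(log_likelihood, k_params, n_points):
    """Log likelihood minus (number of parameters / 2) * ln(number of data points)."""
    return log_likelihood - 0.5 * k_params * math.log(n_points)
```

Under this count, a 51-state model on 41 marks has `n_params(51, 41) == 4691` parameters. Because the number of 200bp bins is very large, the likelihood gain from extra states can keep outweighing the penalty, which is consistent with the score still rising past 70 states.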

42 Recovery of 79-state model in random vs. nested initialization
Panels: random initialization (states appear and disappear); nested initialization (states consistently recovered), with the selected 51-state model marked.
This illustrates the challenge of comparing models with different numbers of states obtained from training based on different random initializations of the model parameters. Each column shows, for one model, the best correlation value of the emission parameters with each state of the 79-state model of supplementary Figure 7. There is a column for each model with the best likelihood from three different random initializations, for models with between 2 and 80 states. The figure shows that some states (e.g. 46, 64, 72) are recovered with very high correlation (>0.96) under some random initializations, but for other random initializations of models with more states these states are not recovered.
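The "best correlation of the emission parameters" used in this figure can be sketched as follows. This is a minimal illustration assuming Pearson correlation between emission vectors; the function names and the dictionary layout are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def best_match_correlations(reference, other):
    """For each reference state, the highest emission correlation with any state
    of another model.

    reference, other: {state: [emission probability per mark]}, same mark order.
    """
    return {ref_state: max(pearson(ref_emit, emit) for emit in other.values())
            for ref_state, ref_emit in reference.items()}
```

Computing this for every model size gives one column per model; a reference state counts as recovered in a model when its best correlation is high (the slide mentions values above 0.96).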

43 States capture mark dependencies: Expected vs. Observed Mark Co-Occurrence
Evaluation of the conditional independence assumption (one panel per state, states 1-51). Each plot corresponds to one state and each plotted blue point corresponds to a pair of marks. The x-axis is how often a pair of marks is expected to be observed together based on the model, that is, the product of the emission probabilities of the two marks. The y-axis is how often the pair of marks is actually observed together in the state. The red line is the y=x line, where the expected count agrees exactly with the observed count. The plot confirms the accuracy of our model assumption that, conditioned on a state, pairs of marks are independent.
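The expected-versus-observed comparison in this figure can be sketched for a single state as follows. As a simplifying assumption, the example uses each mark's empirical frequency among the bins assigned to the state in place of the learned emission probability; the function name and data layout are hypothetical.

```python
from itertools import combinations


def cooccurrence(bins, marks):
    """Expected vs. observed co-occurrence for every pair of marks in one state.

    bins:  list of {mark: 0 or 1} for the bins assigned to the state
    marks: list of mark names
    Returns {(mark_a, mark_b): (expected fraction, observed fraction)}.
    """
    n = len(bins)
    freq = {m: sum(b[m] for b in bins) / n for m in marks}
    result = {}
    for a, b in combinations(marks, 2):
        expected = freq[a] * freq[b]  # product rule under independence
        observed = sum(1 for x in bins if x[a] and x[b]) / n
        result[(a, b)] = (expected, observed)
    return result
```

Pairs for which the two values agree fall on the y=x line; a pair that always occurs together within a state (observed well above expected) suggests the state is still mixing distinct mark combinations.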

44 Increasing numbers of states lead to increasing mark independence
An increasing number of states captures the observed mark dependencies. Expected vs. observed pairwise counts for models with 5, 10, 20, 30, 40, and 51 states based on the nested initialization. The state shown is State 6 in the 51-state model, together with the most correlated states, in terms of the emission parameters, in the 5- to 40-state models. The plot shows that as more states are added, the points move closer to the y=x line, meaning that as the number of states increases, pairs of marks effectively become independent when conditioned on a state.

45 Other desirable features of the resulting model
Power appropriately spent. Chromatin states show distinct mark combinations.

46 Selecting number of states

47 State Assigned in CD4T using 22-common marks
Rows: state assigned in CD4T using all marks. A value in a cell indicates the percentage of locations assigned to the state of the row with the full set of marks that would be assigned to the state of the column using the subset of marks.
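The confusion matrix described here can be computed from two parallel lists of per-location state assignments. This is a minimal sketch; the function name and the list-based input are assumptions for illustration.

```python
from collections import Counter, defaultdict


def confusion_percent(full_states, subset_states):
    """Row-normalized confusion matrix, in percent.

    full_states:   state assigned to each location using all marks (rows)
    subset_states: state assigned to the same location using a subset of marks (columns)
    Returns {row_state: {column_state: percentage of the row's locations}}.
    """
    counts = defaultdict(Counter)
    for full, subset in zip(full_states, subset_states):
        counts[full][subset] += 1
    return {
        full: {subset: 100.0 * c / sum(row.values()) for subset, c in row.items()}
        for full, row in counts.items()
    }
```

A state is well recovered by the subset of marks when most of its row's percentage sits on the diagonal; a row that spills into another column shows which state absorbs its locations when marks are removed.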

48 State Assigned in CD4T using 22-common marks
Rows: state assigned in CD4T using all marks. Many locations assigned to a satellite repeat state with the full set of marks are assigned to a large H3K9me3 heterochromatin state using the set of 22 marks.

49 State Assigned in CD4T using 22-common marks + H4K20me3
Rows: state assigned in CD4T using all marks. With the addition of just the H4K20me3 location data, almost all of these locations are assigned to a satellite repeat state.

50 State Assigned in CD4T using 22-common marks
Rows: state assigned in CD4T using all marks. State 38 is primarily associated with H2AZ in distal locations.

51 State Assigned in CD4T using 22-common marks + H2AZ
Rows: state assigned in CD4T using all marks. Adding H2AZ substantially improves the recovery of this state.

52 State Assigned in CD4T using 22-common marks
Rows: state assigned in CD4T using all marks. Various expressed transcribed states. State 46 is strongly associated with simple repeats (maybe an artifact).

53 State Assigned in CD4T using 22-common marks + H2BK5me1
Rows: state assigned in CD4T using all marks. Various expressed transcribed states. State 46 is strongly associated with simple repeats (maybe an artifact).

