Published by Christiana Baker. Modified over 9 years ago.
1
Statistics: Chromatin Immunoprecipitation (ChIP) data
Achim Tresch, UoC / MPIPZ Cologne
treschgroup.de/OmicsModule1415.html
tresch@mpipz.mpg.de
2
Regulation of gene expression: transcriptional and post-transcriptional
[Figure labels: DNA, Pol II, mRNA, 3'UTR, protein; activation, repression, translation, localization, stability]
3
Regulation of gene expression
- At which loci does a protein bind the DNA?
- Are there cell-type or environment-specific variations of binding affinity?
- Which histone modifications determine chromatin structure?
- To which motifs does a transcription factor bind?
- What is the “cis-regulatory code” of a gene?
[Figure labels: DNA, activation, repression]
4
Chromatin Immunoprecipitation (ChIP)
[Figure labels: DNA-binding protein of interest, antibody, sequencing]
5
Chromatin Immunoprecipitation (ChIP)
Control: input DNA
[Figure label: sequencing]
6
Chromatin Immunoprecipitation (ChIP): ChIP-Seq Analysis Workflow
- Alignment: ELAND, Bowtie, SOAP, SeqMap, …
- Peak Detection: SISSRs, QuEST, MACS, CisGenome, …
- Annotation: STAN, chromHMM, …
- Visualization: IGV, Ensembl GB, UCSC GB, …
- Motif Analysis: cERMIT, HMMer, Xxmotif, …
7
Chromatin Immunoprecipitation (ChIP)
[Figure: protein of interest bound at a crosslink site on the sequence ACCAATAATCAGCTAAGCCGTTAGCCACAGATGGAA; sonication]
8
Read Alignment
9
Read Alignment
Expected read count = total number of reads × extended fragment length / chromosome length
[Figure: read count along the genome compared with the expected read count]
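The expected read count formula can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the example numbers are hypothetical and do not come from the slides.

```python
def expected_read_count(total_reads, extended_fragment_length, chromosome_length):
    """Expected number of reads covering one position under a uniform background."""
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    return total_reads * extended_fragment_length / chromosome_length


# Hypothetical numbers, chosen only for illustration.
lam = expected_read_count(
    total_reads=1_000_000,
    extended_fragment_length=200,
    chromosome_length=100_000_000,
)
print(lam)  # 2.0
```

The result plays the role of the expected count against which the observed read count at each position is compared.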
10
Read Alignment
Read direction provides extra information (Hongkai Ji et al., Nature Biotechnology 26: 1293-1300, 2008)
11
Peak Detection
We need to correct for input DNA reads (control):
- non-uniformly distributed (they form peaks too)
- vastly different numbers of reads between ChIP and input
Calculate the read count at each position (bp) in the genome
Determine if the read count is greater than expected
12
Peak Detection: the Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = observed read count
λ = expected read count
[Figure axes: read count, frequency]
13
Peak Detection: the Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed), λ = 0.5 reads (expected)
$P(X \ge 10) = 1.7 \times 10^{-10}$
$\log_{10} P(X \ge 10) = -9.77$
$-\log_{10} P(X \ge 10) = 9.77$
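The numbers on this slide can be checked with a short sketch that sums the Poisson upper tail directly. The function names are illustrative, not part of any package mentioned in the slides; only the Python standard library is used.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_upper_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    if x <= 0:
        return 1.0
    if x <= lam:
        # Below the mean the tail is large, so the complement is accurate.
        return 1.0 - sum(poisson_pmf(k, lam) for k in range(x))
    # Above the mean, sum the (rapidly shrinking) upper-tail terms directly.
    term = poisson_pmf(x, lam)
    total = 0.0
    k = x
    while term > total * 1e-17:
        total += term
        k += 1
        term *= lam / k
    return total


p = poisson_upper_tail(10, 0.5)
print(f"{p:.1e}")                # 1.7e-10
print(f"{-math.log10(p):.2f}")   # 9.77
```

Summing the tail term by term avoids the loss of precision that `1 - CDF` would suffer for p-values this small.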
14
Peak Detection
Expected read count = total number of reads × extended fragment length / chromosome length
[Figure tracks: read count, expected read count, -log(p)]
15
Peak Detection
Expected read count = total number of reads × extended fragment length / chromosome length
[Figure tracks: read count, expected read count, input reads, -log(p)]
16
Peak Detection
[Figure: for INPUT and ChIP, tracks of read count, expected read count, -Log(P c) and -Log(P i) along genome positions (bp); the difference Log(P c) - Log(P i) is compared against a threshold]
17
Peak Detection: normalized peak score (at each bp)
$R = -\log_{10} \dfrac{P(X_{\mathrm{ChIP}})}{P(X_{\mathrm{input}})}$
This will detect peaks with high read counts in ChIP and low read counts in input.
- Determine all genomic regions with R >= 15
- Merge peaks separated by less than 100 bp
- Output all peaks with length >= 100 bp
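The three peak-calling steps on this slide (threshold, merge, length filter) can be sketched as follows. This is an illustration under stated assumptions, not the implementation of any peak caller named in the slides: the score is taken as -log10 of the ChIP p-value divided by the input p-value (an assumed orientation of the ratio, chosen so that the score is high where ChIP is enriched and input is not), and all function names and example values are hypothetical.

```python
import math


def normalized_score(p_chip, p_input):
    """-log10(p_chip / p_input); the ratio orientation is an assumption."""
    return math.log10(p_input) - math.log10(p_chip)


def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """Return half-open (start, end) intervals of merged, length-filtered peaks."""
    # 1. Collect maximal runs of positions with score >= threshold.
    regions = []
    start = None
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(scores)))

    # 2. Merge regions separated by fewer than max_gap positions.
    merged = []
    for region_start, region_end in regions:
        if merged and region_start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], region_end)
        else:
            merged.append((region_start, region_end))

    # 3. Keep only peaks of at least min_length positions.
    return [(s, e) for s, e in merged if e - s >= min_length]


# Hypothetical per-bp score track: two nearby runs and one short isolated run.
scores = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 70 + [0] * 200 + [20] * 40 + [0] * 50
print(call_peaks(scores))  # [(50, 210)]
```

In the example, the first two runs are 30 bp apart and are merged into one 160 bp peak, while the isolated 40 bp run is discarded by the length filter.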
18
Visualization
80% of the detected peaks are within 20 kb of a known gene
[Figure tracks: ChIP reads, input reads, detected peaks]
19
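Motif Search
Target regions (true TF binding peak: yes) are compared with random regions (true TF binding peak: no).
For each motif, a table relates the motif (absent / present) to the true TF peak (no / yes).
Dependence is quantified using the mutual information.
[Table values as extracted: 0.40 0.10 0.33 / 0.10 0.40 0.00]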
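As a rough illustration of how mutual information quantifies the dependence between motif presence and peak status, the sketch below computes it from a contingency table of counts. The logarithm base (bits) and the example counts are assumptions; the slides do not state them.

```python
import math


def mutual_information(table):
    """Mutual information (in bits) of a contingency table of counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                # p(x, y) * log2( p(x, y) / (p(x) * p(y)) )
                ratio = count * total / (row_sums[i] * col_sums[j])
                mi += (count / total) * math.log2(ratio)
    return mi


# Rows: true TF peak (yes / no); columns: motif (present / absent).
# Hypothetical counts for two extreme cases.
print(mutual_information([[25, 25], [25, 25]]))  # 0.0 (motif is uninformative)
print(mutual_information([[50, 0], [0, 50]]))    # 1.0 (motif determines peak status)
```

A motif that occurs equally often in target and random regions scores zero, while a motif found only in target regions reaches the maximum for a balanced binary label.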
20
Motif Search
k-mer     MI
Highly informative:
CTCATCG   0.0618
TCATCGC   0.0485
AAAATTT   0.0438
GATGAGC   0.0434
AAAAATT   0.0383
ATGAGCT   0.0334
TTGCCAC   0.0322
TGCCACC   0.0298
ATCTCAT   0.0265
...
Not informative:
ACGCGCG   0.0018
CGACGCG   0.0012
TACGCTA   0.0011
ACCCCCT   0.0010
CCACGGC   0.0009
TTCAAAA   0.0005
AGACGCG   0.0004
CGAGAGC   0.0003
CTTATTA   0.0002
...
[Figure labels: MI=0.081, MI=0.045, MI=0.040]
21
Motif Search: optimizing k-mers into more informative degenerate motifs
Example: ATCCGTACA → ATCC[C/G]TACA
Which character increases the mutual information by the largest amount?
Candidate character classes: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T
[Figure: motif occurrence in target regions (true TF binding peak: yes) versus random regions (true TF binding peak: no)]
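One greedy step of this optimization can be sketched as below: every single-character extension of the motif is tried, and the one that raises the mutual information the most is kept. This is an illustrative sketch, not the algorithm of any tool named in the slides; the sequences, function names and the choice of bits as unit are hypothetical.

```python
import math


def mutual_information(table):
    """Mutual information (in bits) of a contingency table of counts."""
    total = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    return sum(
        (c / total) * math.log2(c * total / (rows[i] * cols[j]))
        for i, row in enumerate(table)
        for j, c in enumerate(row)
        if c > 0
    )


def occurs(motif, sequence):
    """True if the motif (a list of allowed-character sets) matches anywhere."""
    width = len(motif)
    return any(
        all(sequence[start + i] in allowed for i, allowed in enumerate(motif))
        for start in range(len(sequence) - width + 1)
    )


def motif_mi(motif, targets, randoms):
    """MI between motif occurrence and region type (target vs random)."""
    hit_t = sum(occurs(motif, s) for s in targets)
    hit_r = sum(occurs(motif, s) for s in randoms)
    return mutual_information(
        [[hit_t, len(targets) - hit_t], [hit_r, len(randoms) - hit_r]]
    )


def best_single_extension(motif, targets, randoms):
    """Add one nucleotide at one position if that increases the MI the most."""
    best_motif, best_mi = motif, motif_mi(motif, targets, randoms)
    for i, allowed in enumerate(motif):
        for base in "ACGT":
            if base in allowed:
                continue
            candidate = motif[:i] + [allowed | {base}] + motif[i + 1:]
            mi = motif_mi(candidate, targets, randoms)
            if mi > best_mi:
                best_motif, best_mi = candidate, mi
    return best_motif, best_mi


def motif_to_string(motif):
    """Render single characters as-is and character classes as [C/G]."""
    return "".join(
        next(iter(a)) if len(a) == 1 else "[" + "/".join(sorted(a)) + "]"
        for a in motif
    )


# Hypothetical toy data: two targets carry ATCCGTACA, two carry ATCCCTACA.
targets = ["TTATCCCTACATT", "GGATCCGTACAGG", "AAATCCCTACAAA", "CCATCCGTACACC"]
randoms = ["TTTTTTTTTTTTT", "ACGTACGTACGTA", "GGGGCCCCGGGGC", "ATATATATATATA"]
start_motif = [{c} for c in "ATCCGTACA"]
best, best_mi = best_single_extension(start_motif, targets, randoms)
print(motif_to_string(best), best_mi)  # ATCC[C/G]TACA 1.0
```

Repeating the step until no extension improves the score would give a simple hill-climbing search; a real implementation would also need to guard against overfitting on small region sets.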
22
change Motif Search
23
The ENCODE Project
Goal: define all functional elements in the human genome
How:
- Lots of groups
- Lots of assays
- Lots of cell lines
- Lots of communication/consortium analysis
- Standardization of methods, reagents, analysis
- Genome-wide
- A lot of money
24
The ENCODE Project
2 Tier 1 cell lines
- GM12878 (B cell)
- K562 (CML cells)
5 Tier 2 cells
- HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC
Many Tier 3 cells
RNA profiling (Scott Tenenbaum): inter-cell line differences are greater than inter-lab differences
25
Lots of data and data types generated by The ENCODE Project
- RNA-seq
- RNA-array
- TF ChIP-seq
- Histone modif ChIP-seq
- DNaseHS-seq
- Methyl-seq
- Methyl27-bisulfite
- 1M SNP genotyping
26
Integrative Data Analysis
Data: open chromatin, transcription factor ChIP-seq, histone modification ChIP-seq, RNA
Std. calls: peaks, region calls, active regions, …
Methods: dynamic Bayesian networks, HMM segmentation, PCA analysis
Result: biological interpretation
27
Integrative Data Analysis: 25-state HMM
Input: 12 histone modifications and 2 transcription factors in GM12878 and K562
Method: “standard” EM training, posterior probability decoding, Viterbi path along the genome
[Figure labels: State F, State I, State A, State C, State E]
Data: entire ENCODE Consortium. Analysis: Jason Ernst / Manolis Kellis
28
Example: Pol II transcription cycle (initiation)
ChIP metagene profiles, averaged across ~300 genes of average length and expression (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)
[Figure labels: Pol II, F, B, H, Kin28, CTD, pA]
29
Example: Pol II transcription cycle (initiation, promoter escape)
[Figure labels: Pol II, F, B, H, Kin28, CTD with P marks (S5P, S7P), 5' end of nascent RNA, pA]
Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012.
30
Example: Pol II transcription cycle (initiation, promoter escape, elongation)
[Figure labels: Pol II, F, B, H, Kin28, CTD with P marks (S5P, S7P, S2P), CE, m7G, CBP 20 / 80, Spt4/5, Spt6, Elf1, Ctk1, nascent RNA, pA]
Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012.
31
Example: Pol II transcription cycle (initiation, promoter escape, elongation, termination)
[Figure labels: Pol II, F, B, H, Kin28, CTD with P marks (S5P, S7P, S2P), CE, m7G, CBP 20 / 80, Spt4/5, Spt6, Elf1, Ctk1, Pcf11, nascent RNA, pA]
32
Example: Pol II transcription cycle (initiation, promoter escape, elongation, termination)
[Figure labels: Pol II, F, B, H, Kin28, CTD with P marks (S5P, S7P, S2P), CE, m7G, CBP 20 / 80, Spt4/5, Spt6, Elf1, Ctk1, Pcf11, nascent RNA, pA]
Metagene analysis:
- is biased towards the genes selected for metagene construction
- analyses only annotated regions and cannot detect new regions with interesting behavior
- may hide variation in ChIP profiles, i.e., aberrant behavior of a subset of genes
- does not detect transitions that occur at variable distance to the TSS
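How averaging can hide variation is easy to see in a toy example. The sketch below assumes that all profiles have already been aligned and rescaled to the same number of bins, which is a simplification; the function name and the numbers are hypothetical.

```python
def metagene_profile(profiles):
    """Position-wise mean of equally long, aligned ChIP profiles."""
    if not profiles:
        raise ValueError("at least one profile is required")
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must have the same length")
    n = len(profiles)
    return [sum(p[i] for p in profiles) / n for i in range(length)]


# Two hypothetical genes with opposite occupancy patterns.
gene_a = [0, 0, 4, 4]   # factor binds late in the gene
gene_b = [4, 4, 0, 0]   # factor binds early in the gene
print(metagene_profile([gene_a, gene_b]))  # [2.0, 2.0, 2.0, 2.0]
```

The flat average suggests uniform occupancy, although neither gene shows it: the behavior of the two subsets is lost in the metagene.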
33
Single position profiles (along genomic position)
Is there a universal sequence of transcription-related events for all genes?
34
Hidden Markov Models (HMMs) genomic position
35
Hidden Markov Models (HMMs) ChIP-chip occupancy vectors
36
Hidden Markov Models (HMMs)
[Figure: states 1 to 5, each with typical occupancy vector(s); 5 × 5 transition matrix between states 1 to 5; Viterbi path]
37
Hidden Markov Models (HMMs)
State annotation = Viterbi path (maximum likelihood path), i.e. the state path that maximizes the likelihood function $\Pr(\,\cdot\,; \Theta)$, where $\Theta$ denotes the HMM parameters
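The Viterbi path can be computed by dynamic programming. The sketch below uses a two-state HMM with discrete emissions, which is a simplifying assumption: the slides work with multivariate ChIP occupancy vectors, and all state names, observation symbols and probabilities here are hypothetical.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most likely hidden state path of a discrete-emission HMM (log space)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t.
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical model: U = untranscribed, T = transcribed.
states = ["U", "T"]
log_start = {"U": math.log(0.5), "T": math.log(0.5)}
log_trans = logs({"U": {"U": 0.9, "T": 0.1}, "T": {"U": 0.1, "T": 0.9}})
log_emit = logs({"U": {"low": 0.9, "high": 0.1}, "T": {"low": 0.2, "high": 0.8}})

signal = ["low", "low", "high", "high", "high", "low", "low"]
path = viterbi(signal, states, log_start, log_trans, log_emit)
print("".join(path))  # UUTTTUU
```

Working with log-probabilities keeps the recursion numerically stable on genome-length sequences, where products of probabilities would underflow.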
38
Results on the S. cerevisiae data set
[Figure: transition matrix and transition graph]
States: initiation, initiation-elongation, early elongation, productive elongation, termination, intergenic/untranscribed
39
Bidirectional HMMs: idea
States: U - Untranscribed, E - Early stage, L - Late stage, each with +/- direction
[Figure panels: fitted bdHMM transitions; simulated profile and HMM fit]
40
Bidirectional HMM: defining property
Definition: A bdHMM is an HMM that satisfies the bidirectionality condition
[Figure: forward process and reverse process between positions t - 1 and t, shown for the observable layer (x, y) and the hidden layer (i, j), with conjugate / twin states $j$ and $\bar{j}$]
41
The bidirectional Hidden Markov Model
States: “Watson” transcription states $\Psi_1, \Psi_2, \dots, \Psi_k$; “Crick” transcription states $\bar{\Psi}_k, \dots, \bar{\Psi}_2, \bar{\Psi}_1$; intergenic state
Constraint 1: Corresponding Watson and Crick states have identical emission distributions
Constraint 2: $\Gamma_{12} = P(X_{t+1} = \Psi_2 \mid X_t = \Psi_1) = P(X_t = \Psi_1 \mid X_{t+1} = \Psi_2) = \Gamma_{21} \, P(X_t = \Psi_1) / P(X_t = \Psi_2)$
Constraint 3: $\pi_k = \pi_{\bar{k}}$
42
Bidirectional HMM: defining property
Theorem: An HMM satisfies the bidirectionality condition if and only if the following three conditions hold:
- Generalized detailed balance
- Initiation symmetry
- Observation symmetry
bdHMM parameter learning is at first sight a non-convex optimization problem (difficult in general). We found an exact and efficient solution!
STAN package: Benedikt Zacher
43
Strand-specific state annotation in yeast
44
Fine structure of the transcription cycle
[Figure labels: intensity; promoter escape (PE): PE1, PE2]
45
Fine structure of the transcription cycle
[Figure labels: intensity; promoter escape (PE): PE1, PE2]
46
Fine structure of the transcription cycle Alternative promoter escapes
47
Fine structure of the transcription cycle: alternative promoter escapes
[Figure labels: intensity; promoter escape (PE): PE1, PE2]
48
Variations of the transcription cycle Clusters HMM transcription states
49
Variations of the transcription cycle 43 genes 694 genes 147 genes Clusters HMM transcription states
50
Variations of the transcription cycle: canonical cluster vs attenuated cluster
- Similar promoter escape, different elongation
- Evidence for a checkpoint after early elongation: Spt5, Spn1, Bur1, Spt16 are recruited in cluster 32, but not Paf1 and Ctk1
- Nrd1 attenuates cluster 32 genes
[Figure axis: ChIP signal]
51
Conclusion
Is there a universal sequence of transcription-related events for all genes?
There seem to be distinct variations of the transcription cycle. They mainly differ in their promoter escape mechanisms.
52
Targeted identification of genomic features
A regular expression search on the state sequence found 1076 bidirectional promoters.
While the nucleosome-free region can vary in size, the positioning of the +1, +2, … nucleosomes is constant.
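A regular expression search over a state sequence might look like the sketch below. The one-letter state alphabet and the pattern are hypothetical, chosen only to illustrate the idea; the actual states and expression used for the 1076 bidirectional promoters are not given on the slide.

```python
import re

# Hypothetical one-letter encoding of the Viterbi path, one letter per bin:
#   U = untranscribed, N = nucleosome-free region,
#   P / E = promoter / elongation state on the + strand,
#   p / e = promoter / elongation state on the - strand.
state_sequence = "UUeeepNNPEEEUUUPEEEUU"

# A bidirectional promoter, in this toy encoding: a - strand promoter state,
# a nucleosome-free region of variable size, then a + strand promoter state.
bidirectional_promoter = re.compile(r"pN+P")

hits = [m.span() for m in bidirectional_promoter.finditer(state_sequence)]
print(hits)  # [(5, 9)]
```

The `N+` quantifier lets the nucleosome-free region vary in size, while the flanking states stay fixed; the second `P` in the example is not reported because it is not preceded by a - strand promoter.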
53
State transitions triggered by sequence motifs
State-specific motifs are searched with Xxmotif (Hartmann et al., Genome Res. 2013).
[Figure: state sequence with a positive set (50bp, 50bp) and a negative set (150bp); motifs?]
54
Annotation of 45 unknown transcripts
[Figure: two new transcripts; Viterbi sequence from directional HMM; stable transcripts on the - strand; cryptic transcripts on the - strand]
Strand-specific transcription data from Xu et al., Nature 2009
55
Outlook: Application to ENCODE data
~30 ChIP-Seq tracks of various histone marks; bdHMM compared with chromHMM (Ernst and Kellis, Nat. Biotech 2012)
There is much more “junk” (regions declared as intergenic / untranscribed) than claimed by ENCODE.
56
Application to ENCODE data Directionality score + directionality assignment
57
Application to ENCODE data: chromHMM flux diagram
58
Application to ENCODE data: bdHMM flux diagram
59
Outlook: Combination of histone marks + RNA-seq
Application to RNA-Seq + histone mark ChIP-Seq data from Nir Friedman's and Steve Jacobsen's labs
- There are only few distinct histone patterns.
- The histone modification pattern alone contains directionality information. It can tell, e.g., which of two overlapping genes is transcribed.
60
Conclusion
- bdHMMs give an unsupervised, strand-specific annotation of the genome using ChIP and RNA expression data.
- bdHMMs are unbiased: no need to predefine gene sets or regions of interest.
- bdHMMs reveal variations of the Pol II transcription cycle by clustering of state sequences.
- Regular expression search can be used to identify new genomic features.
- bdHMM states are enriched in functional DNA motifs and can be used for improved motif discovery.
61
Acknowledgements
Benedikt Zacher, MPI Cologne + LMU Munich
Julien Gagneur, LMU Munich
Michael Lidschreiber, MPI Göttingen
Patrick Cramer, MPI Göttingen