Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.

1 Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data

2 Regulation of gene expression. Transcriptional (DNA, Pol II): Activation, Repression. Post-transcriptional (mRNA, 3’UTR, protein): Stability, Localization, Translation.

3 Regulation of gene expression (DNA: Activation, Repression)
- Where does each transcription factor bind in the genome, in each cell type, at a given time?
- Near which genes?
- What is the “cis-regulatory code” of each factor?
- Does it require any co-factors?

4 Chromatin Immunoprecipitation (ChIP): Transcription factor of interest, Antibody, Sequencing

5 Chromatin Immunoprecipitation (ChIP): Sequencing. Control: input DNA

6 Chromatin Immunoprecipitation (ChIP): Sonication. [Figure: double-stranded DNA sequence; labels: average length ~250 bp, 25-40 bp]

7 Chromatin Immunoprecipitation (ChIP): Sonication. [Figure as on slide 6]

8 ChIP-Seq Analysis Workflow: Chromatin Immunoprecipitation (ChIP), Alignment, Peak Detection, Annotation, Visualization, Motif Analysis
- Alignment tools: ELAND, Bowtie, SOAP, SeqMap, …
- Peak detection tools: FindPeaks, CHiPSeq, BS-Seq, SISSRs, QuEST, MACS, CisGenome, …

9 Read Alignment: Read direction provides extra information (Hongkai Ji et al., Nature Biotechnology 26: 1293-1300, 2008)

10 Read Alignment: read count along the genome compared with the expected read count. Expected read count = total number of reads * extended fragment length / chr length
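The expected-count formula on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the presenter's code: the function name `expected_read_count` and the example numbers are assumptions chosen for illustration only.

```python
def expected_read_count(total_reads: int, fragment_length: int, chromosome_length: int) -> float:
    """Expected number of reads covering one position under a uniform background.

    Implements: total number of reads * extended fragment length / chr length.
    """
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    return total_reads * fragment_length / chromosome_length


# Illustrative numbers only (assumed, not taken from the slides).
lam = expected_read_count(total_reads=1_000_000,
                          fragment_length=250,
                          chromosome_length=500_000_000)
print(lam)
```

The result plays the role of the Poisson rate that the observed read count is compared against in the peak detection slides.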

12 Peak Detection
- Calculate read count at each position (bp) in genome
- Determine if read count is greater than expected
We need to correct for input DNA reads (control):
- non-uniformly distributed (form peaks too)
- vastly different numbers of reads between ChIP and input

13 Peak Detection: The Poisson distribution. Is the observed read count at a given genomic position greater than expected? $x$ = observed read count, $\lambda$ = expected read count. [Plot: frequency against read count]

14 Peak Detection: The Poisson distribution. Is the observed read count at a given genomic position greater than expected? $x = 10$ reads (observed), $\lambda = 0.5$ reads (expected): $P(X \ge 10) = 1.7 \times 10^{-10}$, $\log_{10} P(X \ge 10) = -9.77$, $-\log_{10} P(X \ge 10) = 9.77$
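The numbers on this slide can be checked with a short Python sketch. The function name `poisson_upper_tail` is an assumption for illustration; it sums the upper tail directly instead of computing 1 minus the cumulative distribution, because the latter loses all precision for probabilities this small. It is intended for small rates such as the one on the slide.

```python
import math


def poisson_upper_tail(x: int, lam: float) -> float:
    """P(X >= x) for X ~ Poisson(lam), summing tail terms until they stop mattering."""
    if x <= 0:
        return 1.0
    term = math.exp(-lam) * lam ** x / math.factorial(x)  # P(X = x)
    total = 0.0
    k = x
    while term > 0.0 and term > total * 1e-17:
        total += term
        k += 1
        term *= lam / k  # P(X = k) from P(X = k - 1)
    return total


p = poisson_upper_tail(10, 0.5)
print(p)                # approximately 1.7e-10
print(-math.log10(p))   # approximately 9.77
```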

15 Peak Detection. [Tracks: read count, expected read count, -Log(p)] Expected read count = total number of reads * extended frag len / chr len

16 Peak Detection. [Tracks: read count, expected read count, input reads, -Log(p)] Expected read count = total number of reads * extended frag len / chr len

17 Peak Detection. [Tracks along genome positions (bp). ChIP: read count, expected read count, $-\log(P_c)$. INPUT: read count, expected read count, $-\log(P_i)$. Combined track: $\log(P_c) - \log(P_i)$ with threshold]

18 Peak Detection: Normalized peak score (at each bp)
$R = -\log_{10} \frac{P(X_{\mathrm{ChIP}})}{P(X_{\mathrm{input}})}$
Will detect peaks with high read counts in ChIP, low in Input
- Determine all genomic regions with R >= 15
- Merge peaks separated by less than 100 bp
- Output all peaks with length >= 100 bp
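The three post-processing steps on this slide (threshold, merge, length filter) can be sketched as follows. This is a minimal sketch assuming the per-bp scores R are already available as a list; the function name `call_peaks`, the half-open interval convention, and the toy scores are assumptions for illustration, not the behaviour of any particular peak caller.

```python
from typing import List, Tuple


def call_peaks(scores: List[float], threshold: float = 15.0,
               max_gap: int = 100, min_length: int = 100) -> List[Tuple[int, int]]:
    """Return half-open (start, end) intervals of called peaks."""
    # 1. Collect maximal runs of positions with score >= threshold.
    runs = []
    start = None
    for pos, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    # 2. Merge runs separated by fewer than max_gap positions.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # 3. Keep only peaks that are at least min_length long.
    return [(s, e) for s, e in merged if e - s >= min_length]


# Toy score track: two nearby runs that merge, and one short isolated run.
toy = [0.0] * 50 + [20.0] * 60 + [0.0] * 30 + [20.0] * 60 + [0.0] * 200 + [20.0] * 40
peaks = call_peaks(toy)
print(peaks)
```

In the toy track the first two runs are 30 bp apart and are merged into one 150 bp peak, while the isolated 40 bp run is dropped by the length filter.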

19 Peak Detection: The constant rate assumption does not hold! Negative binomial model fits the data better! (Hongkai Ji et al., Nature Biotechnology 26: 1293-1300, 2008)

20 Visualization: ChIP reads, input reads, detected peaks. 80% of detected peaks are within 20 kb of a known gene

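21 Motif Search. Sequences are labelled by whether they are a true TF binding peak (Yes: target regions; No: random regions) and by whether the motif is present or absent. Dependence is quantified using the mutual information. [Table of motif (Absent / Present) against true TF peak (No / Yes); extracted values: 0.40 0.10 0.33 / 0.10 0.40 0.00]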
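As a sketch of how mutual information quantifies the dependence between motif presence and peak status, the following Python function computes it from a contingency table. The function name `mutual_information`, the use of base-2 logarithms (bits), and the arrangement of the 0.40/0.10 values into a 2x2 table are assumptions for illustration; the slide does not state the logarithm base or the exact table layout.

```python
import math


def mutual_information(table):
    """Mutual information in bits of a contingency table of joint counts or probabilities."""
    total = sum(sum(row) for row in table)
    row_marginals = [sum(row) / total for row in table]
    col_marginals = [sum(col) / total for col in zip(*table)]
    mi = 0.0
    for i, row in enumerate(table):
        for j, cell in enumerate(row):
            p = cell / total
            if p > 0:  # empty cells contribute nothing
                mi += p * math.log2(p / (row_marginals[i] * col_marginals[j]))
    return mi


# Assumed layout: rows = true TF peak (No, Yes), columns = motif (Absent, Present).
dependent = mutual_information([[0.40, 0.10], [0.10, 0.40]])
independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
print(dependent, independent)
```

A table whose cells equal the product of its marginals gives zero, and the value grows as motif presence becomes more predictive of peak status.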

22 Motif Search: k-mers ranked by mutual information (MI)
k-mer     MI
CTCATCG   0.0618   (highly informative)
TCATCGC   0.0485
AAAATTT   0.0438
GATGAGC   0.0434
AAAAATT   0.0383
ATGAGCT   0.0334
TTGCCAC   0.0322
TGCCACC   0.0298
ATCTCAT   0.0265
...
ACGCGCG   0.0018
CGACGCG   0.0012
TACGCTA   0.0011
ACCCCCT   0.0010
CCACGGC   0.0009
TTCAAAA   0.0005
AGACGCG   0.0004
CGAGAGC   0.0003
CTTATTA   0.0002   (not informative)
Figure labels: MI=0.081, MI=0.045, MI=0.040

23 Motif Search: Optimizing k-mers into more informative degenerate motifs, e.g. ATCCGTACA to ATCC[C/G]TACA. Which character increases the mutual information by the largest amount? Candidate characters: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T. True TF binding peak? Yes: target regions; No: random regions.
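One greedy step of this optimization can be sketched in Python. This is a simplified illustration under stated assumptions, not the method used in the talk: a motif is represented as a list of allowed-letter strings, presence is tested with a regular expression on one strand only, and all function names and toy sequences are hypothetical.

```python
import math
import re


def motif_regex(motif):
    """Turn a motif (list of allowed-letter strings) into a compiled regex."""
    return re.compile("".join(f"[{letters}]" for letters in motif))


def presence_mi(motif, targets, randoms):
    """MI (bits) between motif presence and the target/random label."""
    rx = motif_regex(motif)
    n = len(targets) + len(randoms)
    counts = {}
    for label, seqs in ((1, targets), (0, randoms)):
        for seq in seqs:
            key = (label, bool(rx.search(seq)))
            counts[key] = counts.get(key, 0) + 1
    mi = 0.0
    for (label, present), c in counts.items():
        p_joint = c / n
        p_label = (len(targets) if label else len(randoms)) / n
        p_present = sum(v for (_, pr), v in counts.items() if pr == present) / n
        mi += p_joint * math.log2(p_joint / (p_label * p_present))
    return mi


def best_single_extension(motif, targets, randoms):
    """Try adding one letter at one position; return the best (mi, motif)."""
    best = (presence_mi(motif, targets, randoms), motif)
    for i, letters in enumerate(motif):
        for base in "ACGT":
            if base in letters:
                continue
            candidate = motif[:i] + ["".join(sorted(letters + base))] + motif[i + 1:]
            mi = presence_mi(candidate, targets, randoms)
            if mi > best[0]:
                best = (mi, candidate)
    return best


# Toy data (assumed): half of the targets carry ATCCGTACA, half ATCCCTACA.
targets = ["TTATCCGTACATT", "GGATCCCTACAGG", "ATCCGTACA", "CCATCCCTACACC"]
randoms = ["TTTTTTTTTTTT", "ACGTACGTACGT", "GGGGCCCCAAAA", "CATCATCATCAT"]
start = list("ATCCGTACA")
start_mi = presence_mi(start, targets, randoms)
best_mi, best_motif = best_single_extension(start, targets, randoms)
print(start_mi, best_mi, best_motif)
```

On the toy data the step widens the fifth position from G to [CG], which is the ATCC[C/G]TACA example on the slide. Repeating the step until no extension increases the mutual information gives a simple greedy search.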

24 change Motif Search

25 Motif Analysis: Discovered motifs; motif co-occurrence analysis (Enrichment, Depletion)

26 The ENCODE Project
Goal: Define all functional elements in the human genome
How:
- Lots of groups
- Lots of assays
- Lots of cell lines
- Lots of communication/consortium analysis
- Standardization of methods, reagents, analysis
- Genome-wide
- A lot of money

27 The ENCODE Project
- 2 Tier 1 cell lines: GM12878 (B cell), K562 (CML cells)
- 5 Tier 2 cells: HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC
- Many Tier 3 cells
- RNA profiling (Scott Tenenbaum): Inter-cell line differences are greater than inter-lab differences

28 Lots of data and data types generated by The ENCODE Project: RNA-seq, RNA-array, TF ChIP-seq, Histone modif ChIP-seq, DNaseHS-seq, Methyl-seq, Methyl27-bisulfite, 1M SNP genotyping

29 Integrative Data Analysis. Data: Open Chromatin, Trans. Factor ChIP-seq, Histone Mod. ChIP-seq, RNA. Std. outputs: Peaks, Region calls, Active regions, … Methods: Dynamic Bayesian Networks, HMM segmentation, PCA analysis. Result: Biological interpretation

30 Integrative Data Analysis: 25-state HMM. 12 Histone modifications and 2 Transcription factors in GM12878 and K562. “Standard” EM Training; Posterior Probability Decoding; Viterbi Path along the genome (State F, State I, State A, State C, State E). Data: Entire ENCODE Consortium. Analysis: Jason Ernst/Manolis Kellis

31 Metagene Analysis of RNA transcription: ChIP-chip profiles, averaged across ~300 expressed genes of medium length. [Figure: initiation; Pol II, F, B, H, Kin28, CTD, pA] Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012.
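The averaging step behind a metagene profile can be sketched as a position-wise mean over per-gene occupancy profiles. This is a minimal sketch that assumes the profiles have already been aligned and rescaled to a common length; the function name `metagene_profile` and the toy values are hypothetical.

```python
def metagene_profile(profiles):
    """Position-wise mean of equally long per-gene occupancy profiles."""
    if not profiles:
        raise ValueError("need at least one profile")
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("profiles must be rescaled to a common length first")
    return [sum(values) / len(profiles) for values in zip(*profiles)]


# Toy occupancy values for two genes (assumed, for illustration only).
average = metagene_profile([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(average)
```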

32 Metagene Analysis of RNA transcription. [Figure: initiation, promoter escape; Pol II, F, B, H, Kin28, CTD (S5P, S7P), nascent RNA, 5‘, pA] Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012.

33 Metagene Analysis of RNA transcription. [Figure: initiation, promoter escape, elongation; Pol II, F, B, H, Kin28, CTD (S5P, S7P, S2P), CE, m7G, CBP 20/80, Spt4/5, Spt6, Elf1, Ctk1, nascent RNA, pA] Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012.

34 Metagene Analysis of RNA transcription. [Figure: initiation, promoter escape, elongation, termination; Pol II, F, B, H, Kin28, CTD (S5P, S7P, S2P), CE, m7G, CBP 20/80, Spt4/5, Spt6, Elf1, Ctk1, Pcf11, nascent RNA, pA] Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012.

35 Metagene Analysis of RNA transcription. [Figure: initiation, promoter escape, elongation, termination; Pol II, F, B, H, Kin28, CTD (S5P, S7P, S2P), CE, m7G, CBP 20/80, Spt4/5, Spt6, Elf1, Ctk1, Pcf11, pA] Is the sequence of binding, dissociation and modification events universal?

36 HMM Analysis of RNA transcription: ChIP-chip occupancy profiles along genomic position. Ernst and Kellis (2012): ChromHMM: automating chromatin state discovery and characterization

37 HMM Analysis of RNA transcription ChIP-chip occupancy vectors

38 HMM Analysis of RNA transcription: state 1, state 2, state 3, state 4, state 5, each with typical occupancy vector(s); transition matrix

39 Textbook: Hidden Markov Models (HMMs)
[Graph along genomic position: hidden chain $X_1 \to X_2 \to X_3$ with transitions $\Gamma_{X_1 X_2}$, $\Gamma_{X_2 X_3}$; each $X_t$ emits $D_t$ with emission distribution $\Psi_{X_t}$]
- $X$: Hidden (transcription) states
- $\Gamma$: Transition probabilities
- $D$: Data (occupancy vectors)
- $\Psi$: Emission distributions
- [less important: $P(X_1)$: Initial state distribution]
Likelihood: $P(D, X) = P(X_1)\,\Psi_{X_1}(D_1) \prod_{t \ge 2} \Gamma_{X_{t-1} X_t}\,\Psi_{X_t}(D_t)$
Decoding: Viterbi algorithm
Parameter Learning: Baum-Welch algorithm
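The decoding step named on this slide can be sketched as a textbook Viterbi algorithm in log space. This is a generic illustration, not the STAN package's implementation: the two state names, the one-dimensional Gaussian emissions (the talk uses multivariate Gaussian emissions), and every parameter value are assumptions chosen for a toy example.

```python
import math


def viterbi(observations, states, log_init, log_trans, log_emit):
    """Most probable hidden state path of an HMM, computed in log space.

    log_init[s] and log_trans[r][s] are log-probabilities;
    log_emit(s, obs) returns the log emission density of obs in state s.
    """
    # score[s] = best log-probability of any path that ends in state s.
    score = {s: log_init[s] + log_emit(s, observations[0]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + log_trans[r][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit(s, obs)
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


def gaussian_logpdf(x, mean, sd):
    """Log density of a univariate normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


# Toy model (all values assumed): two states with different mean occupancy.
states = ["untranscribed", "transcribed"]
means = {"untranscribed": 0.0, "transcribed": 3.0}
log_init = {s: math.log(0.5) for s in states}
log_trans = {
    "untranscribed": {"untranscribed": math.log(0.9), "transcribed": math.log(0.1)},
    "transcribed": {"untranscribed": math.log(0.1), "transcribed": math.log(0.9)},
}
occupancy = [0.1, -0.2, 0.3, 2.9, 3.2, 2.7, 0.0]
path, log_prob = viterbi(occupancy, states, log_init, log_trans,
                         lambda s, x: gaussian_logpdf(x, means[s], 1.0))
print(path)
```

Working with log-probabilities avoids numerical underflow on long genomic sequences, where products of many small probabilities would round to zero.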

40 Results on the S.cerevisiae data set: Viterbi paths, shown with genes and transcription start site

41 Results on the S.cerevisiae data set
- Initiation state: TFIIB high; Nucl., Spt5, Ser2P, Elf1 low
- Initiation-elongation transition: Nucl. high, Ser2P low
- Productive elongation: Elf1, Ser2P high
- Termination: Pcf11 high
- Untranscribed regions: all low except Nucl.
State numbers shown on the figure: 2 1 5 8 8

42 Results on the S.cerevisiae data set: transition matrix and transition graph. States: initiation, initiation-elongation, early elongation, productive elongation, termination, intergenic/untranscribed. Observation: The transition matrix is almost symmetric, due to transcription in forward and reverse direction

43 Sense vs. antisense transcription. ChIP-chip tracks (multivariate Gaussian emissions) with transcript annotation: transcription on Watson strand, transcription on Crick strand. [Graph: hidden states $X_1, \dots, X_6$ with data $D_1, \dots, D_6$]

44 The bidirectional Hidden Markov Model. “Watson” transcription states $\Psi_1, \Psi_2, \dots, \Psi_k$; “Crick” transcription states $\Psi_k, \dots, \Psi_2, \Psi_1$; intergenic state.
Additional constraint 1: Corresponding Watson and Crick states have identical emission distributions
Additional constraint 2: $\Gamma_{12} = P(X_{t+1} = \Psi_2 \mid X_t = \Psi_1) = P(X_t = \Psi_1 \mid X_{t+1} = \Psi_2) = \Gamma_{21}\, P(X_t = \Psi_1) / P(X_t = \Psi_2)$

45 State transitions reflect biochemical transitions. Standard transcription: 2147598. Untranscribed genes: 10. Mayer et al. (2010): transition from initiation to elongation at +150 bp

46 Different transcription cycles?! Standard transcription (stepwise recruitment): 2147598. Highly transcribed genes (immediate recruitment): 2138

47 A grammar of transcription
- very low synthesis rate, high decay rate: Enrichment of stress response genes
- low synthesis rate, very high decay rate: Enrichment of genes involved in epigenetic regulation of gene expression, cell cycle
- medium synthesis rate, medium decay rate: Enrichment of genes involved in reproduction
- high synthesis rate, low decay rate: Enrichment of genes involved in ribosome biogenesis, rRNA processing
State labels on the figure: P I PE EE1 EE2 E1 T P PE EE1 EE2 E2 T P PPE E3 T P

48 A grammar of transcription
- very high synthesis rate, very low decay rate: Enrichment of ribosomal protein genes, intron containing genes
- medium synthesis rate, medium decay rate: Enrichment of genes involved in G1 phase of cell cycle
State labels on the figure: PE EE1EE2 T P P PPE T

49 Annotation of bidirectional promoters: Text search (Regular Expression) on the Viterbi sequence from directional HMM. 696 bidirectional promoters. Figure labels: PE- P- P+ PE+. [Example Viterbi sequence, long runs abbreviated: …-pE-pE-pE-P-P-…-P-pE-pE-…-pE-eE1-eE1-…-eE1-T-T-…-T-eE2-eE2-…-eE2-eE1-eE1-…-eE1-E3-E3-…-E3-I-I-…-I-…]
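A regular-expression search over a Viterbi state path can be sketched as follows. This is an illustration under assumptions, not the presenter's code: it reads the figure labels PE- P- P+ PE+ as a pattern of consecutive state runs, and the state names, the one-character encoding, and the toy path are all hypothetical.

```python
import re


def find_bidirectional_promoters(path):
    """Return half-open (start, end) index pairs of runs PE- ... P- ... P+ ... PE+."""
    # Encode each state as one character so that one character equals one genomic bin.
    code = {"PE-": "a", "P-": "b", "P+": "c", "PE+": "d"}
    text = "".join(code.get(state, ".") for state in path)
    # One or more bins of each state, in this order along the genome.
    pattern = re.compile(r"a+b+c+d+")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


# Toy Viterbi path (assumed): intergenic bins, a divergent promoter, then elongation.
toy_path = ["I"] * 3 + ["PE-"] * 2 + ["P-"] * 2 + ["P+"] * 3 + ["PE+"] * 2 + ["E1"] * 4
hits = find_bidirectional_promoters(toy_path)
print(hits)
```

Encoding each state as a single character keeps match offsets equal to positions in the state path, so a hit can be mapped straight back to genomic bins.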

50 Annotation of 45 unknown transcripts. Viterbi sequence from directional HMM: two new transcripts; stable transcripts on the - strand; cryptic transcripts on the - strand. Strand-specific transcription data from Xu et al., Nature 2009

51 Acknowledgements: Benedikt Zacher, Julien Gagneur, Patrick Cramer, Michael Lidschreiber, Andreas Mayer, Daniel Schulz, Björn Schwalb. STAN package

