Achim Tresch Computational Biology ‘Omics’ - Analysis of high dimensional Data.

Regulation of gene expression: transcriptional and post-transcriptional. [Figure labels: DNA, mRNA, protein, Pol II, 3’UTR; activation, repression, translation, localization, stability]

Regulation of gene expression
- Where does each transcription factor bind in the genome, in each cell type, at a given time?
- Near which genes?
- What is the “cis-regulatory code” of each factor?
- Does it require any co-factors?
[Figure labels: DNA, activation, repression]

Chromatin Immunoprecipitation (ChIP). [Figure labels: transcription factor of interest, antibody, sequencing]

Chromatin Immunoprecipitation (ChIP). Control: input DNA. [Figure label: sequencing]

Chromatin Immunoprecipitation (ChIP): Sonication
Average length ~ 250bp; 25-40bp
[Figure: double-stranded DNA fragment]
ACCAATAACCGAGGCTCATGCTAAGGCGTTAGCCACAGATGGAAGTCCGACGGCTTGATCCAGAATGGTGTGTGGATTGCCTTGGAACTGA TTAGTGAATTC
TGGTTATTGGCTCCGAGTACGATTCCGCAATCGGTGTCTACCTTCAGGCTGCCGAACTAGGTCTTACCACACACCTAACGGAACCTTGACTA ATCACTTAAG

ChIP-Seq Analysis Workflow
Steps: Chromatin Immunoprecipitation (ChIP), Alignment, Peak Detection, Annotation, Motif Analysis, Visualization
Alignment tools: ELAND, Bowtie, SOAP, SeqMap, …
Peak detection tools: FindPeaks, CHiPSeq, BS-Seq, SISSRs, QuEST, MACS, CisGenome, …

Read Alignment. Read direction provides extra information (Hongkai Ji et al., Nature Biotechnology 26).

Read Alignment. Expected read count = total number of reads * extended fragment length / chr length. [Figure: read count along the genome compared with the expected read count]
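
The expected read count formula on this slide can be written out as a short sketch. The function name and the example numbers (2 million reads, 250 bp extended fragments, a 1 Gb chromosome) are hypothetical and chosen only for illustration.

```python
def expected_read_count(total_reads, extended_fragment_length, chr_length):
    """Expected number of extended fragments covering one position,
    assuming reads fall uniformly along the chromosome."""
    return total_reads * extended_fragment_length / chr_length


# Hypothetical numbers, not taken from the slides.
lam = expected_read_count(
    total_reads=2_000_000,
    extended_fragment_length=250,
    chr_length=1_000_000_000,
)
print(lam)
```

The result plays the role of the expected count λ that the observed read count is compared against at each position.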

Peak Detection
Calculate read count at each position (bp) in genome.
Determine if read count is greater than expected.
We need to correct for input DNA reads (control):
- non-uniformly distributed (form peaks too)
- vastly different numbers of reads between ChIP and input
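
A minimal sketch of "calculate read count at each position", assuming each aligned read is given as a hypothetical (5' position, strand) pair and is extended to the fragment length in its read direction. The input format and the function name are assumptions for illustration, not the format of any particular aligner.

```python
def read_count_per_position(reads, chr_length, fragment_length):
    """Count extended fragments covering each base of one chromosome.

    reads: iterable of (pos, strand), pos is the 0-based 5' end of the read,
           strand is "+" or "-" (assumed input format).
    """
    diff = [0] * (chr_length + 1)  # difference array: +1 at start, -1 past end
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, min(pos + fragment_length, chr_length)
        else:
            start, end = max(pos - fragment_length + 1, 0), pos + 1
        diff[start] += 1
        diff[end] -= 1
    counts, running = [], 0
    for delta in diff[:chr_length]:
        running += delta
        counts.append(running)
    return counts


counts = read_count_per_position([(2, "+"), (6, "-")], chr_length=10, fragment_length=3)
print(counts)
```

The same function can be run on the ChIP reads and on the input reads, which gives the two tracks that the following slides compare.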

Peak Detection: The Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = observed read count
λ = expected read count
[Figure: Poisson distribution, frequency against read count]

Peak Detection: The Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed)
λ = 0.5 reads (expected)
$P(X \ge 10) = 1.7 \times 10^{-10}$
$\log_{10} P(X \ge 10) = -9.77$
$-\log_{10} P(X \ge 10) = 9.77$
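
The numbers on this slide can be checked with a few lines of code. This is a sketch that sums the upper Poisson tail directly with the standard library; the function name is an assumption.

```python
import math


def poisson_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), with lam > 0 and integer x >= 0.

    The upper tail is summed term by term, which avoids the loss of
    precision of computing 1 - CDF for very small p-values.
    """
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))  # P(X = x)
    total, k = 0.0, x
    while True:
        total += term
        k += 1
        term *= lam / k  # P(X = k) from P(X = k - 1)
        if term <= total * 1e-16:
            return total


p = poisson_tail(10, 0.5)
print(p, -math.log10(p))
```

With x = 10 and λ = 0.5 this gives a tail probability of about 1.7e-10, that is a -log10 p-value of about 9.77.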

Peak Detection. Expected read count = total number of reads * extended frag len / chr len. [Figure: read count, expected read count and -Log(p) along the genome]

Peak Detection. Expected read count = total number of reads * extended frag len / chr len. [Figure: read count, expected read count and -Log(p) along the genome, now also for the input reads]

Peak Detection. [Figure: INPUT and ChIP tracks along genome positions (bp), each with read count, expected read count and -Log(P): -Log(P c) and -Log(P i). The difference Log(P c) - Log(P i) is compared with a threshold.]

Peak Detection
Normalized peak score (at each bp): $R = -\log_{10} \frac{P(X_{\mathrm{ChIP}})}{P(X_{\mathrm{input}})}$
Will detect peaks with high read counts in ChIP, low in Input.
- Determine all genomic regions with R>=15
- Merge peaks separated by less than 100bp
- Output all peaks with length >= 100bp
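
The three steps on this slide (threshold, merge, length filter) can be sketched as follows. The score is taken to be the ChIP p-value relative to the input p-value on a -log10 scale, which is the reading consistent with "high in ChIP, low in Input". Function names and the half-open interval convention are assumptions.

```python
import math


def peak_score(p_chip, p_input):
    """R = -log10(P_ChIP / P_input), computed as a difference of logs."""
    return math.log10(p_input) - math.log10(p_chip)


def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """Return peaks as half-open (start, end) intervals over per-bp scores."""
    # 1. All maximal regions with R >= threshold.
    regions, start = [], None
    for pos, r in enumerate(scores):
        if r >= threshold and start is None:
            start = pos
        elif r < threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(scores)))
    # 2. Merge regions separated by less than max_gap bp.
    merged = []
    for s, e in regions:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    # 3. Keep only peaks of at least min_length bp.
    return [(s, e) for s, e in merged if e - s >= min_length]


# Toy score track: two nearby regions that merge, one short isolated region.
track = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 60 + [0] * 500 + [20] * 40 + [0] * 10
print(call_peaks(track))
```

In the toy track the first two regions are 30 bp apart and merge into one 150 bp peak, while the isolated 40 bp region is discarded by the length filter.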

Peak Detection. The constant rate assumption does not hold! A negative binomial model fits the data better (Hongkai Ji et al., Nature Biotechnology 26).
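
To illustrate why the choice of model matters, the sketch below compares the upper tail of a Poisson distribution with that of a negative binomial distribution of the same mean. The dispersion value (size = 0.5) is purely hypothetical and is not an estimate from any data set; the parameterisation by mean and size is one common convention.

```python
import math


def poisson_tail(x, lam):
    """P(X >= x) for X ~ Poisson(lam), lam > 0."""
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total, k = 0.0, x
    while True:
        total += term
        k += 1
        term *= lam / k
        if term <= total * 1e-16:
            return total


def negative_binomial_tail(x, mean, size):
    """P(X >= x) for a negative binomial with given mean and size (dispersion).

    Variance is mean + mean**2 / size, so small size means strong overdispersion.
    """
    q = mean / (size + mean)
    term = math.exp(
        math.lgamma(x + size) - math.lgamma(size) - math.lgamma(x + 1)
        + size * math.log(1 - q) + x * math.log(q)
    )  # P(X = x)
    total, k = 0.0, x
    while True:
        total += term
        term *= (k + size) / (k + 1) * q  # P(X = k + 1) from P(X = k)
        k += 1
        if term <= total * 1e-16:
            return total


p_pois = poisson_tail(10, 0.5)
p_nb = negative_binomial_tail(10, mean=0.5, size=0.5)  # hypothetical dispersion
print(p_pois, p_nb)
```

With the same mean, the overdispersed model assigns a much larger probability to observing 10 reads, so a Poisson-based p-value would overstate the significance of such a position.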

Visualization. 80% of the detected peaks are within 20kb of a known gene. [Figure tracks: ChIP reads, input reads, detected peaks]

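Motif Search. Target regions (true TF binding peak? yes) are compared with random regions (true TF binding peak? no). For each region the motif is either absent or present. The dependence between motif (absent / present) and true TF peak (no / yes) is quantified using the mutual information.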
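
A minimal sketch of the mutual information between "motif present" and "true TF peak", computed from the four counts of the 2x2 table. The function and argument names are assumptions, and the result is in bits (log base 2); the slides do not state which base is used.

```python
import math


def mutual_information(target_with, target_without, random_with, random_without):
    """Mutual information (bits) of a 2x2 table: region type x motif presence."""
    table = [[target_with, target_without], [random_with, random_without]]
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [(table[0][j] + table[1][j]) / total for j in range(2)]
    mi = 0.0
    for i in range(2):
        for j in range(2):
            p = table[i][j] / total
            if p > 0:  # empty cells contribute nothing
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi


print(mutual_information(25, 25, 25, 25))  # motif independent of region type
print(mutual_information(50, 0, 0, 50))    # motif perfectly predicts region type
```

An uninformative motif gives a value of 0, and a motif that occurs in all target regions and in no random region gives the maximum of 1 bit for equally sized sets.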

Motif Search. Table of k-mers with their mutual information (MI), ranging from highly informative to not informative. MI values shown: MI=0.081, MI=0.045, MI=0.040, ...
k-mers listed: CTCATCG, TCATCGC, AAAATTT, GATGAGC, AAAAATT, ATGAGCT, TTGCCAC, TGCCACC, ATCTCAT, ACGCGCG, CGACGCG, TACGCTA, ACCCCCT, CCACGGC, TTCAAAA, AGACGCG, CGAGAGC, CTTATTA

Motif Search. Optimizing k-mers into more informative degenerate motifs: ATCCGTACA becomes ATCC[C/G]TACA. Which character increases the mutual information by the largest amount? Candidates: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T. [Figure: target regions (true TF binding peak? yes) and random regions (no)]
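
One step of this optimization can be sketched as a greedy search: try adding one nucleotide to one motif position at a time and keep the change that raises the mutual information most. The representation of a motif as a list of allowed-character strings, the helper names and the toy sequences are all assumptions for illustration.

```python
import math
import re


def motif_regex(motif):
    """['A', 'CG', 'T'] -> compiled regex 'A[CG]T'."""
    return re.compile("".join(c if len(c) == 1 else "[" + c + "]" for c in motif))


def mi_bits(motif, targets, randoms):
    """Mutual information (bits) between region type and motif presence."""
    rx = motif_regex(motif)
    counts = {}
    for label, seqs in ((1, targets), (0, randoms)):
        for seq in seqs:
            key = (label, bool(rx.search(seq)))
            counts[key] = counts.get(key, 0) + 1
    n = sum(counts.values())
    mi = 0.0
    for (label, present), c in counts.items():
        p_label = sum(v for (l, _), v in counts.items() if l == label) / n
        p_present = sum(v for (_, q), v in counts.items() if q == present) / n
        mi += (c / n) * math.log2((c / n) / (p_label * p_present))
    return mi


def best_single_change(motif, targets, randoms):
    """Try every one-nucleotide widening; return (best MI, best motif)."""
    best = (mi_bits(motif, targets, randoms), motif)
    for i, chars in enumerate(motif):
        for nt in "ACGT":
            if nt in chars:
                continue
            candidate = motif[:i] + ["".join(sorted(chars + nt))] + motif[i + 1:]
            score = mi_bits(candidate, targets, randoms)
            if score > best[0]:
                best = (score, candidate)
    return best


# Toy data (hypothetical): half of the targets carry C instead of G at position 5.
targets = ["GGATCCGTACAGG", "TTATCCCTACATT", "ATCCGTACA", "CCATCCCTACA"]
randoms = ["GGGGGGGGGGGGG", "ATATATATATATA", "CCCCCCCCCCCC", "TTTTGGGGAAAA"]
score, motif = best_single_change(list("ATCCGTACA"), targets, randoms)
print(score, motif)
```

On the toy data the search turns ATCCGTACA into ATCC[C/G]TACA, the same kind of change shown on the slide. Repeating the step until no change improves the score would give a full greedy optimization.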

Motif Search. [Figure; surviving label: change]

Motif Analysis. Motif co-occurrence analysis of the discovered motifs. [Figure legend: enrichment, depletion]

The ENCODE Project
Goal: Define all functional elements in the human genome
How:
- Lots of groups
- Lots of assays
- Lots of cell lines
- Lots of communication/consortium analysis
- Standardization of methods, reagents, analysis
- Genome-wide
- A lot of money

The ENCODE Project
2 Tier 1 cell lines
- GM12878 (B cell)
- K562 (CML cells)
5 Tier 2 cells
- HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC
Many Tier 3 cells
RNA profiling (Scott Tenenbaum): Inter-cell line differences are greater than inter-lab differences

Lots of data and data types generated by The ENCODE Project
- RNA-seq
- RNA-array
- TF ChIP-seq
- Histone modif ChIP-seq
- DNaseHS-seq
- Methyl-seq
- Methyl27-bisulfite
- 1M SNP genotyping

Integrative Data Analysis
Data: Open Chromatin, Trans. Factor ChIP-seq, Histone Mod. ChIP-seq, RNA
Std. Peaks, Region calls, Active regions, …
Methods: Dynamic Bayesian Networks, HMM segmentation, PCA analysis
Biological interpretation

Integrative Data Analysis: 25-state HMM
12 Histone modifications, 2 Transcription factors; GM12878, K562
“Standard” EM Training, Posterior Probability Decoding
[Figure: Viterbi path along the genome, with states such as State F, State I, State A, State C, State E]
Data: Entire ENCODE Consortium
Analysis: Jason Ernst/Manolis Kellis

Metagene Analysis of RNA transcription. ChIP-chip profiles, averaged across ~300 expressed genes of medium length (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012). Stage shown: initiation. [Figure labels: Pol II, F, B, H, Kin28, CTD, pA]
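
The averaging step behind a metagene profile can be sketched in a few lines. The sketch assumes that every gene's ChIP-chip profile has already been aligned and rescaled to the same number of bins; how the cited studies aligned and scaled their genes is not stated on the slide, and the function name is hypothetical.

```python
def metagene_profile(profiles):
    """Position-wise mean over per-gene occupancy profiles of equal length."""
    if not profiles:
        raise ValueError("need at least one profile")
    length = len(profiles[0])
    if any(len(p) != length for p in profiles):
        raise ValueError("all profiles must have the same number of bins")
    n_genes = len(profiles)
    return [sum(p[i] for p in profiles) / n_genes for i in range(length)]


# Two toy genes with three bins each (hypothetical values).
print(metagene_profile([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))
```

Averaging over a few hundred genes suppresses gene-specific noise, at the price of hiding differences between genes, which is the limitation the HMM analysis later in the talk addresses.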

Metagene Analysis of RNA transcription (Lidschreiber et al., NSMB 2010; Mayer et al., Science). Stages shown: initiation, promoter escape. [Figure labels: Pol II, F, B, H, Kin28, CTD, S5P, S7P, P, 5‘, nascent RNA, pA]

Metagene Analysis of RNA transcription (Lidschreiber et al., NSMB 2010; Mayer et al., Science). Stages shown: initiation, promoter escape, elongation. [Figure labels: Pol II, F, B, H, Kin28, CTD, S5P, S7P, S2P, P, CE, m7G, CBP, Spt4/5, Spt6, Elf1, Ctk1, nascent RNA, pA]

Metagene Analysis of RNA transcription (Lidschreiber et al., NSMB 2010; Mayer et al., Science). Stages shown: initiation, promoter escape, elongation, termination. [Figure labels: Pol II, F, B, H, Kin28, CTD, S5P, S7P, S2P, P, CE, m7G, CBP, Spt4/5, Spt6, Elf1, Ctk1, Pcf11, nascent RNA, pA]

Metagene Analysis of RNA transcription. Stages shown: initiation, promoter escape, elongation, termination. Is the sequence of binding, dissociation and modification events universal? [Figure labels: Pol II, F, B, H, Kin28, CTD, S5P, S7P, S2P, P, CE, m7G, CBP, Spt4/5, Spt6, Elf1, Ctk1, Pcf11, pA]

HMM Analysis of RNA transcription. ChIP-chip occupancy profiles along the genomic position. See Ernst and Kellis (2012): ChromHMM: automating chromatin state discovery and characterization.

HMM Analysis of RNA transcription. ChIP-chip occupancy vectors.

HMM Analysis of RNA transcription. [Figure: states 1 to 5, each with its typical occupancy vector(s), and the transition matrix between states]

Textbook: Hidden Markov Models (HMMs)
X : Hidden (transcription) states
Γ : Transition probabilities
D : Data (occupancy vectors)
Ψ : Emission distributions
[less important: $P(X_1)$ : Initial state distribution]
Likelihood: [formula shown on the slide]
Decoding: Viterbi algorithm
Parameter Learning: Baum-Welch algorithm
[Figure: chain of hidden states $X_1, X_2, X_3$ along the genomic position, linked by transitions $\Gamma_{X_1 X_2}$ and $\Gamma_{X_2 X_3}$; each state $X_t$ emits the data $D_t$ through $\Psi_{X_t}$]
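
A minimal sketch of Viterbi decoding for the model on this slide, written in log space with the standard library only. The inputs are assumed to be precomputed: log initial probabilities, the log transition matrix Γ, and the log emission density Ψ of each state evaluated at each data point. Names and the toy numbers are hypothetical.

```python
import math


def viterbi(log_init, log_trans, log_emit):
    """Most probable hidden state path.

    log_init[k]     : log P(X_1 = k)
    log_trans[j][k] : log transition probability from state j to state k
    log_emit[t][k]  : log density of observation D_t under state k
    """
    n_states = len(log_init)
    score = [log_init[k] + log_emit[0][k] for k in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emit)):
        new_score, pointers = [], []
        for k in range(n_states):
            best_j = max(range(n_states), key=lambda j: score[j] + log_trans[j][k])
            pointers.append(best_j)
            new_score.append(score[best_j] + log_trans[best_j][k] + log_emit[t][k])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy two-state example with sticky transitions (hypothetical numbers).
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_emit = [[math.log(a), math.log(b)] for a, b in [(0.9, 0.1), (0.9, 0.1), (0.1, 0.9), (0.1, 0.9)]]
print(viterbi(log_init, log_trans, log_emit))
```

For occupancy vectors, `log_emit` would hold the log density of each vector under each state's emission distribution. Parameter learning with Baum-Welch is a separate step and is not shown here.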

Results on the S. cerevisiae data set. [Figure: Viterbi paths of genes, with the transcription start site marked]

Results on the S. cerevisiae data set
- Initiation state: TFIIB high; Nucl., Spt5, Ser2P, Elf1 low
- Initiation-elongation transition: Nucl. high, Ser2P low
- Productive elongation: Elf1, Ser2P high
- Termination: Pcf11 high
- Untranscribed regions: all low except Nucl.

Results on the S. cerevisiae data set. Transition matrix and transition graph over the states initiation, initiation-elongation, early elongation, productive elongation, termination, intergenic/untranscribed. Observation: The transition matrix is almost symmetric, due to transcription in forward and reverse direction.

Sense vs. antisense transcription. [Figure: ChIP-chip tracks (multivariate Gaussian emissions) and transcript annotation, with transcription on the Watson strand and transcription on the Crick strand; hidden states $X_1, \dots, X_6$ emit the data $D_1, \dots, D_6$]

The bidirectional Hidden Markov Model
“Watson” transcription states: $\Psi_1, \Psi_2, \dots, \Psi_k$
“Crick” transcription states: $\Psi_k, \dots, \Psi_2, \Psi_1$
Intergenic state
Additional constraint 1: Corresponding Watson and Crick states have identical emission distributions
Additional constraint 2: $\Gamma_{12} = P(X_{t+1} = \Psi_2 \mid X_t = \Psi_1) = P(X_t = \Psi_1 \mid X_{t+1} = \Psi_2) = \Gamma_{21} \, P(X_t = \Psi_1) / P(X_t = \Psi_2)$

State transitions reflect biochemical transitions. Mayer et al. (2010): transition from initiation to elongation at +150bp. [Figure labels: standard transcription, untranscribed genes, 10]

Different transcription cycles?! Standard transcription (stepwise recruitment) versus highly transcribed genes (immediate recruitment). [Figure label: 2138]

A grammar of transcription
- very low synthesis rate, high decay rate: enrichment of stress response genes
- low synthesis rate, very high decay rate: enrichment of genes involved in epigenetic regulation of gene expression, cell cycle
- medium synthesis rate, medium decay rate: enrichment of genes involved in reproduction
- high synthesis rate, low decay rate: enrichment of genes involved in ribosome biogenesis, rRNA processing
[Figure: state paths; surviving labels: P I; PE EE1 EE2 E1 T P; PE EE1 EE2 E2 T P; PPE E3 T P]

A grammar of transcription
- very high synthesis rate, very low decay rate: enrichment of ribosomal protein genes, intron-containing genes
- medium synthesis rate, medium decay rate: enrichment of genes involved in G1 phase of cell cycle
[Figure: state paths; surviving labels: PE EE1EE2 T P; P PPE T]

Annotation of bidirectional promoters. Text search (Regular Expression) on the Viterbi sequence from the directional HMM, with the pattern PE- P- P+ PE+: 696 bidirectional promoters. [Figure: example Viterbi sequence written out as text, one state label per position, with long runs of the states pE, P, pE, eE1, T, eE2, eE1, E3 and I, in that order]
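
The text search can be sketched with a regular expression over the Viterbi path. The sketch assumes that the path is a list with one state label per position and that the pattern PE- P- P+ PE+ means consecutive runs of these four states; the state labels in the toy path and the function name are hypothetical.

```python
import re

# One or more PE-, then P-, then P+, then PE+ tokens, each followed by a space.
# The lookbehind makes sure a match starts at a token boundary.
PATTERN = re.compile(r"(?<!\S)(?:PE- )+(?:P- )+(?:P\+ )+(?:PE\+ )+")


def find_bidirectional_promoters(path):
    """Return half-open (start, end) index ranges of matching state runs."""
    text = "".join(state + " " for state in path)
    hits = []
    for match in PATTERN.finditer(text):
        start = text.count(" ", 0, match.start())  # tokens before the match
        end = start + match.group().count(" ")     # tokens inside the match
        hits.append((start, end))
    return hits


# Toy Viterbi path (hypothetical); "I" stands for the intergenic state.
path = ["I", "I", "PE-", "PE-", "P-", "P+", "P+", "PE+", "I", "P-", "PE+"]
print(find_bidirectional_promoters(path))
```

Encoding the path as text keeps the search declarative: other arrangements of states can be annotated by changing only the pattern.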

Annotation of 45 unknown transcripts. Strand-specific transcription data from Xu et al., Nature 2009. [Figure: Viterbi sequence from directional HMM; two new transcripts; stable transcripts on the - strand; cryptic transcripts on the - strand]

Acknowledgements: Benedikt Zacher, Julien Gagneur, Patrick Cramer, Michael Lidschreiber, Andreas Mayer, Daniel Schulz, Björn Schwalb. STAN package.