Chromatin Immunoprecipitation (ChIP) data
Statistics, Achim Tresch, UoC / MPIPZ Cologne
treschgroup.de/OmicsModule1415.html



Regulation of gene expression: transcriptional and post-transcriptional
[Figure: DNA, Pol II, mRNA, 3’UTR, protein; activation, repression, translation, localization, stability.]

Regulation of gene expression
- At which loci does a protein bind the DNA?
- Are there cell-type or environment-specific variations of binding affinity?
- Which histone modifications determine chromatin structure?
- To which motifs does a transcription factor bind?
- What is the “cis-regulatory code” of a gene?

Chromatin Immunoprecipitation (ChIP)
[Figure: DNA-binding protein of interest, antibody, sequencing.]

Chromatin Immunoprecipitation (ChIP)
Control: input DNA, which is sequenced as well.

ChIP-Seq Analysis Workflow
- Alignment: ELAND, Bowtie, SOAP, SeqMap, …
- Peak Detection: SISSRs, QuEST, MACS, CisGenome, …
- Annotation: STAN, chromHMM, …
- Visualization: IGV, Ensembl GB, UCSC GB, …
- Motif Analysis: cERMIT, HMMer, XXmotif, …

Chromatin Immunoprecipitation (ChIP)
[Figure: protein of interest bound at the crosslink site on the sequence ACCAATAATCAGCTAAGCCGTTAGCCACAGATGGAA; sonication.]

Read Alignment

Read Alignment
Expected read count = total number of reads * extended fragment length / chromosome length
[Figure: read count along the genome compared with the expected read count.]
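The expected read count formula can be sketched in a few lines. This is a minimal illustration, not the code used for the slides; the function name and the example numbers (one million reads, 200 bp fragments, a 400 Mb chromosome) are assumptions chosen for illustration.

```python
def expected_read_count(total_reads: int, fragment_length: int, chromosome_length: int) -> float:
    """Expected number of extended reads covering one position,
    assuming reads are spread uniformly over the chromosome."""
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    return total_reads * fragment_length / chromosome_length


# Hypothetical numbers, chosen only for illustration.
lam = expected_read_count(total_reads=1_000_000,
                          fragment_length=200,
                          chromosome_length=400_000_000)
print(lam)  # 0.5
```

The result serves as the expected count λ against which the observed read count at each position is compared.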

Read Alignment
Read direction provides extra information (Hongkai Ji et al., Nature Biotechnology 26).

Peak Detection
We need to correct for input DNA reads (control), which are
- non-uniformly distributed (they form peaks too)
- vastly different in number between ChIP and input
Calculate the read count at each position (bp) in the genome.
Determine if the read count is greater than expected.

Peak Detection: The Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = observed read count
λ = expected read count
[Figure: Poisson distribution, frequency against read count.]

Peak Detection: The Poisson distribution
Is the observed read count at a given genomic position greater than expected?
x = 10 reads (observed)
λ = 0.5 reads (expected)
P(X>=10) = 1.7 x 10^-10
log10 P(X>=10) = -9.77
-log10 P(X>=10) = 9.77
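The tail probability for x = 10 and λ = 0.5 can be checked numerically. The sketch below sums the upper Poisson tail directly using only the standard library; the helper name `poisson_tail` and the fixed number of summed terms are assumptions made for this illustration.

```python
import math


def poisson_tail(x: int, lam: float, terms: int = 200) -> float:
    """P(X >= x) for X ~ Poisson(lam), with lam > 0.

    The upper tail is summed directly, which avoids the loss of
    precision of computing 1 - CDF when the p-value is tiny.
    """
    # First term P(X = x), computed in log space for stability.
    term = math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
    total = 0.0
    k = x
    for _ in range(terms):
        total += term
        k += 1
        term *= lam / k  # P(X = k) = P(X = k - 1) * lam / k
    return total


p = poisson_tail(10, 0.5)
print(f"{p:.2e}", round(-math.log10(p), 2))  # 1.71e-10 9.77
```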

Peak Detection
Expected read count = total number of reads * extended fragment length / chromosome length
[Figure: read count, expected read count and -Log(p) along the genome.]

Peak Detection
Expected read count = total number of reads * extended fragment length / chromosome length
[Figure: input reads; read count, expected read count and -Log(p) along the genome.]

Peak Detection
[Figure: for ChIP and for INPUT, read count, expected read count and -Log(P_c) respectively -Log(P_i) along genome positions (bp); the difference Log(P_c) - Log(P_i) is compared with a threshold.]

Peak Detection
Normalized peak score (at each bp): R = -log10 [ P(X_ChIP) / P(X_input) ]
This will detect peaks with high read counts in ChIP and low read counts in input.
- Determine all genomic regions with R >= 15
- Merge peaks separated by less than 100 bp
- Output all peaks with length >= 100 bp
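The three post-processing steps (threshold, merge, length filter) can be sketched as follows. This is a minimal illustration on a per-bp score list, not the implementation of any particular peak caller; the function name, the half-open interval convention and the toy score vector are assumptions.

```python
def call_peaks(scores, threshold=15.0, max_gap=100, min_length=100):
    """Return peaks as half-open (start, end) intervals in bp.

    scores: per-position normalized peak scores R.
    """
    # Step 1: contiguous runs of positions with R >= threshold.
    runs = []
    start = None
    for pos, r in enumerate(scores):
        if r >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    # Step 2: merge runs separated by fewer than max_gap bp.
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))

    # Step 3: keep only peaks of at least min_length bp.
    return [(s, e) for s, e in merged if e - s >= min_length]


# Toy score track: two nearby runs that get merged, one short isolated run.
toy = [0] * 50 + [20] * 60 + [0] * 30 + [20] * 60 + [0] * 200 + [20] * 40 + [0] * 10
print(call_peaks(toy))  # [(50, 200)]
```

In the toy track, the first two runs are 30 bp apart and are merged into one 150 bp peak, while the isolated 40 bp run is discarded by the length filter.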

Visualization
[Figure: tracks of ChIP reads, input reads and detected peaks.]
80% of the detected peaks lie within 20 kb of a known gene.

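Motif Search
Two sets of regions: target regions (true TF binding peak: yes) and random regions (true TF binding peak: no). For each region, the motif is either present or absent.
The dependence between "true TF peak" and "motif present" is quantified using the mutual information.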
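The mutual information between motif presence and region class can be computed from the 2x2 table of counts. The sketch below is one straightforward way to do this; the function name, the argument order and the use of base-2 logarithms (bits) are assumptions, since the slides do not state the logarithm base.

```python
import math


def mutual_information(target_with, target_without, random_with, random_without):
    """Mutual information (in bits) between region class (target/random)
    and motif occurrence (present/absent), from a 2x2 count table."""
    table = [[target_with, target_without],
             [random_with, random_without]]
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]

    mi = 0.0
    for i in range(2):
        for j in range(2):
            n = table[i][j]
            if n == 0:
                continue  # 0 * log(0) is taken as 0
            p_joint = n / total
            p_indep = (row_sums[i] / total) * (col_sums[j] / total)
            mi += p_joint * math.log2(p_joint / p_indep)
    return mi


print(mutual_information(50, 50, 50, 50))    # 0.0 (motif is uninformative)
print(mutual_information(100, 0, 0, 100))    # 1.0 (motif separates the sets perfectly)
```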

Motif Search
Each k-mer is scored by its mutual information (MI). Scores range from highly informative (MI=0.081, MI=0.045, MI=0.040, ...) to not informative.
k-mers shown: CTCATCG, TCATCGC, AAAATTT, GATGAGC, AAAAATT, ATGAGCT, TTGCCAC, TGCCACC, ATCTCAT, ACGCGCG, CGACGCG, TACGCTA, ACCCCCT, CCACGGC, TTCAAAA, AGACGCG, CGAGAGC, CTTATTA

Motif Search
Optimizing k-mers into more informative degenerate motifs, e.g. ATCCGTACA to ATCC[C/G]TACA.
Which character increases the mutual information by the largest amount? Candidates: A/G, T/G, C/G, A/C/G, A/T/G, C/G/T.
Motif occurrence is again compared between target regions (true TF binding peak: yes) and random regions (no).


The ENCODE Project
Goal: Define all functional elements in the human genome
How:
- Lots of groups
- Lots of assays
- Lots of cell lines
- Lots of communication/consortium analysis
- Standardization of methods, reagents, analysis
- Genome-wide
- A lot of money

The ENCODE Project
- 2 Tier 1 cell lines: GM12878 (B cell), K562 (CML cells)
- 5 Tier 2 cells: HeLa S3, HepG2, HUVEC, primary keratinocytes, hESC
- Many Tier 3 cells
RNA profiling (Scott Tenenbaum): Inter-cell line differences are greater than inter-lab differences.

Lots of data and data types generated by The ENCODE Project
- RNA-seq
- RNA-array
- TF ChIP-seq
- Histone modif ChIP-seq
- DNaseHS-seq
- Methyl-seq
- Methyl27-bisulfite
- 1M SNP genotyping

Integrative Data Analysis
Data: Open Chromatin, Trans. Factor ChIP-seq, Histone Mod. ChIP-seq, RNA
Standard processing: peaks, region calls, active regions, …
Methods: Dynamic Bayesian Networks, HMM segmentation, PCA analysis
Result: biological interpretation

Integrative Data Analysis
25-state HMM on 12 histone modifications and 2 transcription factors in GM12878 and K562.
“Standard” EM training, posterior probability decoding.
[Figure: Viterbi path along the genome through states such as State F, State I, State A, State C, State E.]
Data: Entire ENCODE Consortium. Analysis: Jason Ernst/Manolis Kellis.

Example: Pol II transcription cycle
ChIP metagene profiles, averaged across ~300 genes of average length and expression (Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012).
[Figure: initiation; Pol II with CTD, factors F, B, H, Kin28; pA site.]

Example: Pol II transcription cycle
(Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)
[Figure: initiation and promoter escape; Pol II with factors F, B, H, Kin28; CTD marks S5P, S7P; 5‘ end of the nascent RNA; pA site.]

Example: Pol II transcription cycle
(Lidschreiber et al., NSMB 2010; Mayer et al., Science 2012)
[Figure: initiation, promoter escape and elongation; Pol II with factors F, B, H, Kin28, CE, CBP, Spt4/5, Spt6, Elf1, Ctk1; CTD marks S5P, S7P, S2P; m7G cap on the nascent RNA; pA site.]

Example: Pol II transcription cycle
[Figure: initiation, promoter escape, elongation and termination; Pol II with factors F, B, H, Kin28, CE, CBP, Spt4/5, Spt6, Elf1, Ctk1, Pcf11; CTD marks S5P, S7P, S2P; m7G cap on the nascent RNA; pA site.]

Metagene Analysis is biased towards the genes selected for metagene construction. It analyses only annotated regions and cannot detect new regions with interesting behavior.
Metagene Analysis may hide variation in ChIP profiles, i.e., aberrant behavior of a subset of genes. It does not detect transitions that occur at a variable distance from the TSS.

Single position profiles
[Figure: single position profiles along the genomic position.]
Is there a universal sequence of transcription-related events for all genes?

Hidden Markov Models (HMMs) genomic position

Hidden Markov Models (HMMs) ChIP-chip occupancy vectors

Hidden Markov Models (HMMs)
[Figure: states 1 to 5 with their typical occupancy vector(s), the transition matrix and the Viterbi path.]

Hidden Markov Models (HMMs)
State annotation = Viterbi path (maximum likelihood path)
[Figure: likelihood function Pr( … ; Θ), with Θ the HMM parameters.]
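The Viterbi path can be computed by dynamic programming. The sketch below uses a toy HMM with discrete emissions, which is a simplifying assumption: the occupancy vectors on the slides are continuous, and the function name, the two-state model and its probabilities are hypothetical.

```python
import math


def viterbi(observations, initial, transition, emission):
    """Most likely hidden state path of a discrete-emission HMM.

    initial[s], transition[r][s] and emission[s][o] are probabilities;
    the recursion runs in log space to avoid numerical underflow.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(initial)
    score = [log(initial[s]) + log(emission[s][observations[0]])
             for s in range(n_states)]
    backpointers = []

    for obs in observations[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor state for ending in state s at this position.
            best = max(range(n_states),
                       key=lambda r: score[r] + log(transition[r][s]))
            new_score.append(score[best] + log(transition[best][s])
                             + log(emission[s][obs]))
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical model: state 0 = untranscribed, state 1 = transcribed;
# observation 0 = low signal, 1 = high signal.
path = viterbi([0, 0, 1, 1, 1, 0, 0],
               initial=[0.5, 0.5],
               transition=[[0.9, 0.1], [0.1, 0.9]],
               emission=[[0.9, 0.1], [0.2, 0.8]])
print(path)  # [0, 0, 1, 1, 1, 0, 0]
```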

Results on the S. cerevisiae data set
[Figure: transition matrix and transition graph.]
States: initiation, initiation-elongation, early elongation, productive elongation, termination, intergenic/untranscribed

Bidirectional HMMs: Idea
[Figure: simulated profile and HMM fit with +/- direction; fitted bdHMM transitions.]
States: U = Untranscribed, E = Early stage, L = Late stage

Bidirectional HMM – Defining property
[Figure: forward process and reverse process across the observable layer and the hidden layer, with conjugate / twin states.]
Definition: A bdHMM is an HMM that satisfies the bidirectionality condition.

The bidirectional Hidden Markov Model
States: “Watson” transcription states Ψ1, Ψ2, ..., Ψk; “Crick” transcription states Ψ̄k, ..., Ψ̄2, Ψ̄1; intergenic state.
Constraint 1: Corresponding Watson and Crick states have identical emission distributions
Constraint 2: Γ 12 = P(X t+1 = Ψ 2 | X t = Ψ 1 ) = P(X t = Ψ 1 |X t+1 = Ψ 2 ) = Γ 21 P(X t = Ψ 1 ) / P(X t = Ψ 2 )
Constraint 3: π_k = π_k̄

Bidirectional HMM – Defining property
Theorem: An HMM satisfies the bidirectionality condition if and only if the following three conditions hold:
- Generalized detailed balance
- Initiation symmetry
- Observation symmetry
bdHMM parameter learning is at first sight a non-convex optimization problem (difficult in general). We found an exact and efficient solution!
STAN package: Benedikt Zacher

Strand-specific state annotation in yeast

Fine structure of the transcription cycle
[Figure: intensity profiles of the promoter escape (PE) states PE1 and PE2.]


Fine structure of the transcription cycle Alternative promoter escapes

Fine structure of the transcription cycle: Alternative promoter escapes
[Figure: intensity profiles of the promoter escape (PE) states PE1 and PE2.]

Variations of the transcription cycle Clusters HMM transcription states

Variations of the transcription cycle
[Figure: clusters of HMM transcription states, with 43 genes, 694 genes and 147 genes.]

Variations of the transcription cycle
[Figure: ChIP signal of the canonical cluster and the attenuated cluster; similar promoter escape, different elongation.]
Evidence for a checkpoint after early elongation: Spt5, Spn1, Bur1, Spt16 are recruited in cluster 32, but not Paf1 and Ctk1.
Nrd1 attenuates cluster 32 genes.

Conclusion Is there a universal sequence of transcription-related events for all genes? There seem to be distinct variations of the transcription cycle. They mainly differ in their promoter escape mechanisms.

Targeted identification of genomic features
1076 bidirectional promoters found by a regular expression search on the state sequence.
While the nucleosome-free region can vary in size, the positioning of the +1,+2,… nucleosomes is constant.
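A regular expression search on a state sequence can be sketched as follows. Everything specific here is hypothetical: the one-letter state encoding (uppercase for Watson states, lowercase for Crick states, U for untranscribed), the pattern and the example string are made up for illustration and are not the encoding used for the 1076 promoters.

```python
import re

# Hypothetical encoding, one letter per genomic bin:
#   I/E = initiation/elongation on the Watson strand,
#   i/e = the same states on the Crick strand,
#   U   = untranscribed (e.g. a nucleosome-free region).
# A bidirectional promoter then reads, from left to right: a Crick
# transcript ending in its initiation state, a short untranscribed gap
# of variable size, and a Watson transcript starting with initiation.
BIDIRECTIONAL_PROMOTER = re.compile(r"e+i+U{1,5}I+E+")


def find_bidirectional_promoters(state_sequence: str):
    """Return (start, end) bin intervals matching the promoter pattern."""
    return [m.span() for m in BIDIRECTIONAL_PROMOTER.finditer(state_sequence)]


print(find_bidirectional_promoters("UUUeeeiiUUIIEEEUUU"))  # [(3, 15)]
```

The variable-length quantifier on U mirrors the observation that the nucleosome-free region can vary in size.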

State transitions triggered by sequence motifs
Search for state-specific motifs with XXmotif (Hartmann et al., Genome Res. 2013).
[Figure: state sequence with a positive set (50bp + 50bp windows) and a negative set (150bp) for the motif search.]

Annotation of 45 unknown transcripts
[Figure: Viterbi sequence from the directional HMM, with two new transcripts, stable transcripts on the - strand and cryptic transcripts on the - strand.]
Strand-specific transcription data from Xu et al., Nature 2009.

Outlook: Application to ENCODE data
~30 ChIP-Seq tracks of various histone marks, annotated with bdHMM and with chromHMM (Ernst and Kellis, Nat. Biotech 2012).
There is much more “junk” (regions declared as intergenic / untranscribed) than claimed by ENCODE.

Application to ENCODE data Directionality score + directionality assignment

Application to ENCODE data chromHMM flux diagram

Application to ENCODE data bdHMM flux diagram

Outlook: Combination of histone marks + RNA-seq
Application to RNA-Seq + histone marks ChIP-Seq data from Nir Friedman's and Steve Jacobsen's labs.
There are only a few distinct histone patterns.
The histone modification pattern alone contains directionality information. It can tell, e.g., which of two overlapping genes is transcribed.

Conclusion
- bdHMMs give an unsupervised, strand-specific annotation of the genome using ChIP and RNA expression data.
- bdHMMs are unbiased: No need to predefine gene sets or regions of interest.
- bdHMMs reveal variations of the Pol II transcription cycle by clustering of state sequences.
- Regular expression search can be used to identify new genomic features.
- bdHMM states are enriched in functional DNA motifs and can be used for improved motif discovery.

Acknowledgements
- Benedikt Zacher, MPI Cologne + LMU Munich
- Julien Gagneur, LMU Munich
- Michael Lidschreiber, MPI Göttingen
- Patrick Cramer, MPI Göttingen