ChIP-seq analysis 2/28/2018.

Acknowledgements. Much of the content of this lecture is from: Furey (2012), ChIP-seq and beyond; Park (2009), ChIP-seq: advantages + challenges; Landt et al. (2012), ChIP-seq guidelines + practices. Oldies but goodies.

Central Dogma of Biology. Some proteins can bind DNA to influence how genes are expressed.

ChIP-seq. Chromatin immunoprecipitation followed by high-throughput sequencing. Assays the genome-wide locations of a single protein (bound to DNA) or a single histone modification.

What is chromatin? A complex of macromolecules (DNA, protein, RNA) that packages DNA into a compact shape, prevents DNA damage, and controls gene expression and DNA replication. Here is the complete scale from chromosome to individual nucleotides.

What is immunoprecipitation? Antibodies are used to immunoprecipitate proteins. Antibodies bind in a (mostly) specific way to their antigen, and are used by the immune system to neutralize pathogens. ChIP-seq uses antibodies raised against the protein of interest.

ChIP-seq protocol (very brief). GOAL: Determine genomic DNA associated with a given protein or histone modification. Steps labeled on the slide: crosslink, (reverse crosslink), library prep + sequencing. Using antibodies raised against a protein or histone modification of interest, you can isolate the DNA associated with it.

Why study protein binding to DNA? Transcription factors (TFs) affect how genes are regulated. DNA binding proteins (such as CTCF and cohesin) regulate the 3D structure of DNA: CTCF and cohesin facilitate formation of chromatin loops. ChIP-seq can therefore be used to study both the structure of DNA and how the information encoded in DNA is regulated.

Why study histone modifications? Combinations of chemical modifications to the histone tails correlate with regulatory activities; this is referred to as the “histone code”. Shown here are 4 of the components of a histone octamer. The full octamer has 2 sets of these. The full octamer is what DNA coils around to create a nucleosome. The nucleosomes are packed together to form chromatin.

Integration of protein and histone ChIP-seq. ChIP-seq assays only 1 thing at a time. Integrating several proteins and histone modifications provides more insight. More on this in integrative genomics.

Research Questions (for proteins and histone mods). DNA motif discovery: which DNA sequences does my protein like to bind to, OR which binding motifs correlate with a histone modification? Conserved/differential protein binding OR histone modification across conditions (time points, cell types, species, treatments). Genes (and gene sets) under regulation by a given protein or histone modification.

ChIP-seq study example: Ostuni et al. (2013). Enhancer repertoire expanded during immune response. Enhancers did not return to original state post-stimulus (epigenetic memory). Response upon restimulation was stronger and faster. There will not be a project on ChIP-seq, so really you will only encounter it again if you work at a company/lab that does this kind of stuff. Here are a couple of examples of studies that used mostly ChIP-seq results. 115 total ChIP-seq experiments from this study alone were uploaded to GEO.

ChIP-seq Analysis Methods

Preliminary Analysis Goal. Define where your protein is binding or where histone modifications are occurring. This is inferred on a reference genome based on short reads. Fragment size is determined by beads and is typically a distribution. The minimum you can select for through beads is about 200bp (I think?). You end up with a distribution that is typically ~250-300bp.

Comparison to RNA-seq. Mapping is performed with short reads, as with RNA-seq. The same standard file formats are used (FASTQ, FASTA). Similar mapping software is generally used (Bowtie2, BWA, etc.), but without spliced alignment. The same base-calling and read-level QC applies (FastQC). TopHat, for example, is a splice-aware (gapped) aligner, which is useful for things like splice junctions. You don’t have to worry about this in ChIP-seq since you’re mapping genomic DNA back to itself.

A key difference. RNA-seq can use the same gene annotation for each experiment, so abundances are measured over genes. Proteins can bind anywhere in the genome, so ChIP-seq features are experiment-specific. Features (called peaks) are defined as part of the analysis pipeline.

Defining ChIP-seq peaks. Peaks are areas where read mapping is enriched compared to a control experiment. Software exists to automate peak finding; popular programs include MACS2, GEM, HOMER, SPP. Anyone want to hazard a guess as to why the distributions aren’t the same for the + and - strands? Peaks usually end up being a bit bigger than your fragment size (~350bp for a point-source TF), but this of course will depend on the software you use. HOMER, for example, enforces the same peak length for each peak, centered on the summit. MACS allows for more dynamic peak lengths.

Controls for ChIP-seq. Input DNA: a portion of the DNA sample removed before immunoprecipitation. Mock IP: DNA obtained from a fake IP performed without antibodies. IgG: DNA from a non-specific IP using an antibody against a protein not involved in DNA binding. Usually only 1 of these is performed; the most common is input, which accounts for technical biases.

ChIP-seq vs. input DNA. Input allows for correcting bias from variable solubility, shearing, and amplification during experiments.

How does a peak caller work? It walks along the genome to identify enriched regions, and estimates fragment size to extend reads into a profile.
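To make the two steps on this slide concrete, here is a minimal sketch of "extend reads, then walk the genome". It is not how MACS2 or any other real peak caller is implemented; the function names, the fixed depth threshold, and the toy coordinates are all assumptions made for illustration (real tools compare against a control and use a statistical model).

```python
from collections import Counter

def extend_reads(reads, fragment_size):
    """Extend each read from its 5' end to the estimated fragment size.

    reads: list of (five_prime_position, strand), strand is '+' or '-'.
    Returns half-open (start, end) fragment intervals.
    """
    fragments = []
    for pos, strand in reads:
        if strand == "+":
            start, end = pos, pos + fragment_size
        else:
            # Minus-strand reads extend toward lower coordinates.
            start, end = pos - fragment_size + 1, pos + 1
        fragments.append((max(start, 0), end))
    return fragments

def coverage_profile(fragments):
    """Per-base count of fragments covering each position."""
    cov = Counter()
    for start, end in fragments:
        for p in range(start, end):
            cov[p] += 1
    return cov

def call_enriched_regions(cov, genome_length, min_depth):
    """Walk along the genome, merging consecutive positions with depth >= min_depth."""
    regions, start = [], None
    for p in range(genome_length):
        if cov[p] >= min_depth:
            if start is None:
                start = p
        elif start is not None:
            regions.append((start, p))
            start = None
    if start is not None:
        regions.append((start, genome_length))
    return regions

# Toy example: two + reads and two - reads flanking one binding site.
reads = [(100, "+"), (110, "+"), (290, "-"), (300, "-")]
profile = coverage_profile(extend_reads(reads, fragment_size=200))
print(call_enriched_regions(profile, genome_length=500, min_depth=4))
```

Extending + reads to the right and - reads to the left is what makes the two strand distributions pile up on the same spot.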

Scoring peaks (general example). A Poisson model for the tag distribution accounts for the ChIP/control ratio as well as the absolute tag number. This is just a rough example and is not meant to say that small effect sizes between ChIP and control conditions are never significant; that will depend on the individual experiments. In the case of the blue sample, imagine that the control can go higher locally, which is a good transition into why correcting for multiple hypothesis testing is important. This is also where analyzing replicates comes in. If these low-signal events appear across distinct cell populations for a given type/condition (termed biological replicates) at a similar enrichment ratio, you might infer that this is actually a real binding event. IDR (irreproducible discovery rate) is a strategy that reports some of these low-signal events if they can be replicated.
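A rough sketch of why a Poisson model cares about absolute tag number and not just the ratio: the same 2x enrichment is far more convincing at 100 vs. 50 tags than at 10 vs. 5. The function names, the library-size scaling, and the floor of 1 control tag are assumptions for illustration, not the scoring used by any particular peak caller.

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def window_p_value(chip_tags, control_tags, chip_total, control_total):
    """p-value of seeing chip_tags or more, given the control as background.

    Assumption: the control count is scaled by library size and floored
    at 1 tag so that empty control windows do not give an expected value of 0.
    """
    scale = chip_total / control_total
    expected = max(control_tags, 1) * scale
    return poisson_upper_tail(chip_tags, expected)

# Same 2x ratio, different absolute tag numbers.
print(window_p_value(10, 5, 1_000_000, 1_000_000))
print(window_p_value(100, 50, 1_000_000, 1_000_000))
```

Note that subtracting the CDF from 1 loses precision for extremely small p-values; this is fine for a demonstration but not for production use.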

Significance of a Peak. Statistical significance is formally measured using the false discovery rate (FDR). FDR is the expected proportion of incorrectly identified sites among those found to be significant. It can be measured by swapping input with the ChIP sample and identifying false peaks. The q value of a peak is the minimum FDR at which the peak is deemed significant; it is analogous to the p value for a single hypothesis test.

FDR stats example. Using a local Poisson distribution, my peak finding algorithm identifies peak X with p value 0.00024. In the whole data set, there are 1,000 peaks whose p-value <= 0.00024. Swapping input for ChIP sample, 48 “peaks” are falsely identified, also with p-value <= 0.00024. The q value for peak X is 48/1,000 = 0.048. Using an FDR cut-off of 0.05, peak X is significant. Notice that the p-value is way below 0.05, a threshold many of you might have seen in papers. In the medical field, for a single hypothesis (statistical) test, the significance threshold applied to the p-value is typically 0.05.
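The arithmetic in this example can be written out as a short sketch. The function name and the way the p-value lists are built are assumptions for illustration; the 1,000 and 48 counts are the ones from the slide.

```python
def empirical_q_value(p_cutoff, chip_peak_pvalues, swapped_peak_pvalues):
    """Estimate the FDR at a p-value cutoff using a sample swap.

    chip_peak_pvalues: p-values of peaks called ChIP vs. input.
    swapped_peak_pvalues: p-values of "peaks" called input vs. ChIP.
    """
    n_chip = sum(1 for p in chip_peak_pvalues if p <= p_cutoff)
    n_false = sum(1 for p in swapped_peak_pvalues if p <= p_cutoff)
    if n_chip == 0:
        return None  # nothing called at this cutoff, FDR undefined
    return min(1.0, n_false / n_chip)

# Numbers from the slide: 1,000 ChIP peaks and 48 swapped "peaks"
# at or below p = 0.00024. The other p-values are filler.
chip = [0.00024] * 1000 + [0.01] * 500
swapped = [0.0001] * 48 + [0.02] * 100

q = empirical_q_value(0.00024, chip, swapped)
print(q, q <= 0.05)
```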

Peak calling for TF vs. histone mark. Histone ChIP-seq produces much broader regions of enrichment. Peak callers usually have a “histone” option or a set of “broad” parameters if needed. Models and strategies used to call peaks on histone marks are completely different from those for sequence-specific transcription factors. Breakdown of H3K27me3: H3 is the family of histones, K is the abbreviation for lysine (the amino acid), 27 is the position of the amino acid residue counting from the N-terminus, me is a methyl group, and 3 is the number of methyl groups added. It has to do with downregulation of nearby gene expression and is broadly associated with heterochromatin regions (heterochromatin is the densely packed type of chromatin).

Output from peak calling. Took a while to get here… A list of genomic loci where either your protein is bound OR your histone is modified, usually in BED format. Peak numbers vary wildly by protein, organism, etc., and also by library size, antibody and pull-down efficiency, etc.
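Since the peak list usually arrives as BED, here is a minimal sketch of reading it. BED is tab-separated with 0-based, half-open coordinates (chrom, start, end, then optional columns such as name). The peak names and coordinates below are made up, and real peak callers often emit extra columns that this sketch ignores.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end, name) tuples."""
    peaks = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        peaks.append((chrom, start, end, name))
    return peaks

def peak_widths(peaks):
    """Width of each peak; end - start because BED is half-open."""
    return [end - start for _, start, end, _ in peaks]

# Hypothetical peak calls.
example = [
    "track name=example_peaks",
    "chr1\t1000\t1350\tpeak_1",
    "chr1\t9000\t9300\tpeak_2",
    "chr2\t500\t820",
]
peaks = parse_bed(example)
print(len(peaks), peak_widths(peaks))
```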

Some ChIP-seq QC + rules of thumb. FRiP: fraction of reads in peaks (> 1%). Strand cross-correlations: the normalized strand coefficient (NSC) is the ratio between the fragment-length cross-correlation peak and the background cross-correlation (should be > 1.05); the relative strand coefficient (RSC) is the ratio between the fragment-length cc peak and the read-length cc peak, each measured above the background cc (should be > 0.8). Various packages can be used to compute the above (phantompeakqualtools, deeptools, MEME, HOMER). A lot of technical jargon. What does it actually mean?

Strand cross-correlation and FRiP. Shifting one strand's reads by k (roughly the fragment length) should result in a huge linear correlation between + and - strand reads that does not occur in background. This method is “peak agnostic”. A good ChIP-seq data set will typically have FRiP > 0.01 and NSC > 1.05.
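FRiP is simple enough to sketch directly: count the reads that land inside any peak and divide by the total. The function name and data layout are assumptions for illustration, and the sketch assumes the peaks on each chromosome do not overlap.

```python
from bisect import bisect_right

def frip(read_positions, peaks):
    """Fraction of reads in peaks.

    read_positions: list of (chrom, position).
    peaks: list of (chrom, start, end), half-open, non-overlapping.
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    starts = {c: [s for s, _ in ivs] for c, ivs in by_chrom.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom)
        if not intervals:
            continue
        # Last peak whose start is <= pos; the read is inside if pos < end.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < intervals[i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0

reads = [("chr1", 150), ("chr1", 500), ("chr2", 20), ("chr1", 199)]
peaks = [("chr1", 100, 200), ("chr2", 0, 10)]
print(frip(reads, peaks))
```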

Strand cross-correlation calculation. In practice, you won’t ever have to do these calculations yourself; just use something like phantompeakqualtools. The plot on the right is when you might want to have an awkward conversation with the wet lab.
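For the curious, the calculation can be sketched in a few lines. This is a toy version, not what phantompeakqualtools actually runs: it assumes per-base counts of read 5' ends for each strand as plain lists, and the NSC/RSC helper follows the usual definitions (fragment-length peak over background, and fragment-length peak over read-length peak after subtracting background). All names and toy numbers are assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def strand_cross_correlation(plus, minus, max_shift):
    """Correlation between + counts and - counts shifted left by k, for each k.

    plus, minus: equal-length per-base counts of read 5' ends; max_shift
    must be smaller than their length.
    """
    n = len(plus)
    return {k: pearson(plus[: n - k], minus[k:]) for k in range(max_shift + 1)}

def nsc_and_rsc(cc, fragment_shift, read_length):
    """NSC and RSC from a cross-correlation profile (dict of shift -> correlation)."""
    background = min(cc.values())
    nsc = cc[fragment_shift] / background
    rsc = (cc[fragment_shift] - background) / (cc[read_length] - background)
    return nsc, rsc

# Toy data: every + read start has a matching - read start 20 bp downstream.
plus, minus = [0] * 150, [0] * 150
for site in (10, 50, 90):
    plus[site] = 1
    minus[site + 20] = 1

cc = strand_cross_correlation(plus, minus, max_shift=40)
print(max(cc, key=cc.get))  # shift with the highest correlation
```

The shift with the highest correlation is the fragment-length estimate; on real data the background correlation is positive, which is what makes the NSC ratio meaningful.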

Typical analysis workflow. ChIP and input samples each go from short reads (FASTQ, FASTA) to mapped reads (SAM, BAM) using an aligner (Bowtie2, BWA, STAR). Peaks (BED) are then called from the ChIP and input mapped reads (MACS2, HOMER, GEM). In theory, given a data set in any of the above formats, you can jump in where appropriate.

Functional Characterization. The output from the previous step gives you a list of genomic coordinates/ranges and some associated information for each.

Where and how is my protein binding? Peaks are areas where read mapping is enriched compared to a control experiment (~300bp). Actual binding sites (for proteins) are 8-12bp. The binding site can be inferred using motifs and motif analysis.

What is a DNA-binding motif? Information content is a bit of a weird concept (especially if you don’t really care about DNA binding motifs). The gist is that 2 bits is the maximum because you can figure out which base is there by asking a maximum of 2 questions: “Is it a pyrimidine?” and, if yes, “Is it T?”. Regardless of the answers, you figure out the base. The y-axis is actually the maximum possible uncertainty (2 bits) minus the actual uncertainty at that position. So a big letter is essentially always there: we don’t need to ask any questions to figure out what’s there (0), and 2-0 is 2, which is why those letters are so big. Think of the height of the stack as conservation at that position. For the smaller stacks, we are not quite sure what’s there but we know the proportions.
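The "2 minus the number of questions" idea is just 2 bits minus the Shannon entropy of the column. A minimal sketch, with made-up base frequencies (real logo software may also apply a small-sample correction, which is ignored here):

```python
import math

def column_information(probs):
    """Information content (bits) of one motif position: 2 - entropy."""
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return 2.0 - entropy

def letter_heights(probs):
    """Height of each letter in the logo stack: frequency x information content."""
    info = column_information(probs)
    return {base: p * info for base, p in probs.items()}

# Hypothetical columns.
always_g = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}
a_or_g = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
anything = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

for column in (always_g, a_or_g, anything):
    print(column_information(column), letter_heights(column))
```

A fully conserved position gets the full 2 bits, a 50/50 position gets 1 bit, and a position that could be anything gets 0.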

Motif scanning (scoring). “Scan” for binding sites using a probability model. At each position in the peak, ask “how likely is it that this is a binding site and not some random sequence?” Motif occurrences are typically located near peak summits. Note: there’s a directionality to scanning. It is usually done 5’ to 3’, so you have to scan the reverse complement as well.

Motif scoring example. How likely is it that this is a binding site and not some random sequence? You arrive at a sequence that looks like this: T G G G G A A G T G. Pr (binding site) = 0.207 x 0.705 x 0.830 … Pr (random seq) = 0.250 x 0.250 x 0.250 … Dividing top by bottom gives you a likelihood ratio. The log version of the formula in the previous slide just allows you to add ratios instead of multiplying them (easier for computers to do). In practice you don’t really have to worry about all this; programs will do this for you. Significance is kind of baked into the likelihood ratio threshold you’re using.
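Here is a small sketch of the log-likelihood ratio scan, including the reverse complement. The 3-column motif below is hypothetical: only the T, G, G probabilities (0.207, 0.705, 0.830) come from the slide, the remaining entries are filled in so each column sums to 1. The threshold and test sequence are also made up. Real motif matrices often contain zeros, which need pseudocounts before taking logs.

```python
import math

BACKGROUND = 0.25  # uniform random-sequence model, as on the slide
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Hypothetical motif; see the note above on which values are from the slide.
PWM = [
    {"A": 0.300, "C": 0.250, "G": 0.243, "T": 0.207},
    {"A": 0.100, "C": 0.100, "G": 0.705, "T": 0.095},
    {"A": 0.060, "C": 0.060, "G": 0.830, "T": 0.050},
]

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def log_likelihood_ratio(site, pwm):
    """Sum of per-position log2(Pr(base | motif) / Pr(base | background))."""
    return sum(math.log2(pwm[i][base] / BACKGROUND) for i, base in enumerate(site))

def scan(sequence, pwm, threshold):
    """Score every window on both strands; report hits in forward coordinates."""
    width = len(pwm)
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for i in range(len(seq) - width + 1):
            score = log_likelihood_ratio(seq[i:i + width], pwm)
            if score >= threshold:
                start = i if strand == "+" else len(seq) - width - i
                hits.append((start, strand, score))
    return hits

for start, strand, score in scan("AATGGAACCAT", PWM, threshold=2.9):
    print(start, strand, round(score, 2))
```

Adding logs gives the same answer as multiplying the probabilities and dividing, which is the point made on the slide.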

What if I don’t know the binding site? There are 2 general approaches. Motif enrichment analysis: scan a library of known motifs against your peaks (and a background) to determine which motifs are most enriched. De novo motif finding: learn new motifs using expectation maximization (MEME) or k-mer based approaches (HOMER). If you are chipping a protein that has been done before, both motif analyses should yield similar results.

Example Homer report (enrichment)

Example Homer report (de novo)

Example questions motif analysis can answer. QC: if you’re chipping a protein with a known DNA binding motif, you should be able to find that motif. What are some co-occurring motifs in my peaks? Transcription factors often have binding partners. Which DNA binding motifs are found in peaks enriched with histone modifications? This can inform future ChIP-seq experiments. Which peaks do not contain the canonical motif?

What is my protein doing? Integration with RNA-seq data: you can do pathway/ontology enrichment analysis using the nearest gene (careful!). Differential binding: very similar to RNA-seq, and even uses the same software (DESeq2, DiffBind, etc.) with genomic loci instead of genes; it can be across time-points, conditions, etc. Integration with other ChIP-seq experiments: does my protein bind enhancers? Repressed regions? Co-bind with other proteins? Will talk more about these in integrative genomics.
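The "nearest gene" assignment mentioned above can be sketched with a sorted list of transcription start sites (TSSs) and a binary search. The gene names and coordinates are made up, the sketch handles a single chromosome, and it ignores strand. It also shows why the slide says "careful!": the nearest gene can be thousands of bases away and is not necessarily the regulated one.

```python
from bisect import bisect_left

def nearest_gene(peak_summit, tss_list):
    """Return (gene_name, distance) for the TSS closest to a peak summit.

    tss_list: non-empty list of (position, gene_name), sorted by position,
    all on the same chromosome as the peak.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_summit)
    # Only the TSS just before and just after the summit can be nearest.
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - peak_summit))
    return gene, abs(pos - peak_summit)

# Hypothetical annotation.
tss = [(1000, "geneA"), (5000, "geneB"), (20000, "geneC")]

for summit in (4200, 12000, 30000):
    print(summit, nearest_gene(summit, tss))
```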

What is my protein doing? Differential PU.1 binding at multiple steps: baseline, stimuli (IFNg, IL-4), time-points post-stimuli, re-stimulation. Integration with Stat1/6 binding pre- / post-stimulus. Uses H3K4me1 and H3K27ac histone modifications to annotate active regulatory elements. We can now start to parse out how this group used a bunch of ChIP-seq experiments to learn something completely new about biology.