Chromatin basics & ChIP-seq analysis
BS312 – Genome Bioinformatics, Lecture 5
Vladimir Teif
Next generation sequencing analysis
Chromatin basics -- reminder https://micro.magnet.fsu.edu/cells/nucleus/images/chromatinstructurefigure1.jpg
Transcription factor-centric view Transcription factor (TF) concentrations Protein assembly at regulatory regions Transcription start site Proteins produced (including TFs) Teif et al. (2013), Methods. 62, 26-38
Transcription factor-centric view Transcription factor (TF) concentrations Enhancer RNA polymerase: enzyme which makes RNA Promoter Proteins produced (including TFs) Teif et al. (2013), Methods. 62, 26-38
Histone modifications-centric view Turner B.M. (2005) Nature Structural & Molecular Biology, 12, 110 - 112
Histone modifications-centric view http://dev.biologists.org/content/139/6/1045
NGS METHODS AND THEIR APPLICATIONS Chromatin domains Hi-C Figure adapted from http://www.scienceinschool.org
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) 1. Crosslink Protein-DNA complexes in situ 2. Isolate nuclei and fragment DNA (sonication or digestion) 3. Immunoprecipitate with antibody against target nuclear protein and reverse crosslinks 4. Release DNA and submit for sequencing Adapted from www.VisiScience.com
MNase-seq (Micrococcal Nuclease digestion followed by sequencing) MNase = micrococcal nuclease (enzyme that cuts DNA between nucleosomes) Teif et al. (2013), Methods, 62, 26-38
FAIRE-seq (Formaldehyde-Assisted Isolation of Regulatory Elements) sequencing Giresi et al (2007), Genome Res. 17, 877–885
DNase-seq (DNase I digestion followed by sequencing) Wang et al. (2012), PLoS ONE 7, e42414
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) How transposase works: https://www.youtube.com/watch?v=XYZHMGUGq6o Buenrostro et al. (2013) Nat Methods. 10, 1213-1218
Methods for 1D genome mapping Meyer & Liu, Nature Reviews Genetics 15, 709–721 (2014)
Methods for 1D genome mapping Tsompana and Buck, Epigenetics & Chromatin 7, 33 (2014)
Timeline of NGS methods
Bulk methods that require many cells: Rivera and Ren (2013), Cell, 155, 39-55
Single-cell methods: Hu et al., Front. Cell Dev. Biol., 2018
Where to get NGS data?
Do your own experiment
Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo
Sequence Read Archive (SRA) https://www.ncbi.nlm.nih.gov/sra
European Nucleotide Archive (ENA) https://www.ebi.ac.uk/ena
The Cancer Genome Atlas (TCGA) https://tcga-data.nci.nih.gov/tcga
Exome Aggregation Consortium (ExAC) http://exac.broadinstitute.org/
You also have to upload your data!
How to analyze NGS data?
Ask a bioinformatician: you need to explain what you want, and for that you need to understand what can be done and how
Do it yourself:
Command line –> become a bioinformatician
Online wrappers –> simpler, but file size limits apply
Example of a convenient online tool: Galaxy http://galaxy.essex.ac.uk/
ChIP-seq (Chromatin ImmunoPrecipitation followed by sequencing) 1. Crosslink Protein-DNA complexes in situ 2. Isolate nuclei and fragment DNA (sonication or digestion) 3. Immunoprecipitate with antibody against target nuclear protein and reverse crosslinks 4. Release DNA and submit for sequencing Adapted from www.VisiScience.com
Experiment Data analysis http://www4.utsouthwestern.edu/mcdermottlab/NGS/index.html
ChIP-seq analysis workflow www.utsouthwestern.edu/labs.bioinformatics-core/analysis/chip-seq.png
NGS data after sequencing but before mapping (.fastq file, aka “raw” data)
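A FASTQ file stores each read as four lines: a header starting with `@`, the base calls, a separator starting with `+`, and one quality character per base. The sketch below reads such records in Python. The names `FastqRecord`, `read_fastq` and `phred_scores` are hypothetical, and the quality offset of 33 is an assumption (the common Phred+33 encoding); older data may use a different offset.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str    # header text after the leading '@'
    sequence: str   # base calls
    quality: str    # one ASCII-encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integer scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality]


# Toy input with a single read; real files hold millions of such records.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n")
for record in read_fastq(example):
    print(record.read_id, record.sequence, phred_scores(record.quality))
```

Reading by position (four lines at a time) rather than by searching for `@` matters because `@` is also a valid quality character.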
Mapping with Bowtie http://bowtie-bio.sourceforge.net/manual.shtml
-v <N> Allow no more than N mismatches, where N may be a number from 0 through 3
-p <N> Use N computer processors/cores in parallel
-m <N> Disregard reads with more than N possible alignments
Guess what this command does:
bowtie -v 2 -p 2 -m 1 mm9 filename.fastq filename.map
-v <N> Allow no more than N mismatches, where N may be a number from 0 through 3
-p <N> Use N computer processors/cores in parallel
-m <N> Disregard reads with more than N possible alignments
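To make the meaning of the mismatch limit (-v) and the uniqueness filter (-m) concrete, here is a toy Python sketch that applies the same two rules to a tiny reference string. It is not how Bowtie works internally (Bowtie uses an index of the genome and also searches the reverse strand); it is a brute-force forward-strand scan, and the function names are hypothetical.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_alignments(read: str, reference: str, max_mismatches: int) -> List[int]:
    """Return every 0-based start where the read fits with few enough mismatches."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(start)
    return hits


def map_reads(reads: Dict[str, str], reference: str,
              max_mismatches: int = 2, max_alignments: int = 1) -> Dict[str, List[int]]:
    """Keep reads that align at least once and at most max_alignments times."""
    mapped = {}
    for name, sequence in reads.items():
        hits = find_alignments(sequence, reference, max_mismatches)
        if 1 <= len(hits) <= max_alignments:
            mapped[name] = hits
    return mapped


reference = "AAAACCCCGGGGTTTTAAAACCCC"
reads = {
    "unique": "GGGGTTTT",   # occurs once in the reference
    "repeat": "AAAACCCC",   # occurs twice, so it is discarded when max_alignments=1
    "absent": "TGTGTGTG",   # does not occur at all
}
print(map_reads(reads, reference, max_mismatches=0, max_alignments=1))
```

With `max_alignments=1` only uniquely mappable reads survive, which is the effect of `-m 1` in the command above.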
NGS data after mapping: .bed files (BED format)
Bowtie, BWA, ELAND, Novoalign, BLAST, ClustalW, TopHat (for RNA-seq)
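In BED format each line describes one region with at least three columns: chromosome, start and end, where the start is 0-based and the end is not included in the region. A minimal Python sketch for reading one such line follows; `BedInterval` and `parse_bed_line` are hypothetical names, and only the name and strand are kept from the optional columns.

```python
from typing import NamedTuple, Optional


class BedInterval(NamedTuple):
    chrom: str
    start: int                    # 0-based, inclusive
    end: int                      # exclusive
    name: Optional[str] = None    # optional column 4
    strand: Optional[str] = None  # optional column 6


def parse_bed_line(line: str) -> BedInterval:
    """Parse one whitespace-separated BED line with at least three columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"BED line needs at least 3 columns: {line!r}")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError(f"invalid coordinates in BED line: {line!r}")
    name = fields[3] if len(fields) > 3 else None
    strand = fields[5] if len(fields) > 5 else None
    return BedInterval(chrom, start, end, name, strand)


interval = parse_bed_line("chr1\t100\t150\tread1\t0\t+")
# Because the end is exclusive, the region length is simply end - start.
print(interval.chrom, interval.end - interval.start, interval.strand)
```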
Reads can align to overlapping locations http://biocluster.ucr.edu/~rkaundal/workshops/R_feb2016/ChIPseq/ChIPseq.html We need to count all reads at each base pair
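Counting reads at each base pair can be sketched in a few lines of Python. The example assumes reads are given as (start, end) pairs on one chromosome with 0-based starts and exclusive ends, as in BED; the function name `coverage` is hypothetical. It marks where each read begins and ends and then takes a running sum, which avoids looping over every base of every read.

```python
from typing import List, Tuple


def coverage(reads: List[Tuple[int, int]], length: int) -> List[int]:
    """Per-base read count for a chromosome of the given length."""
    change = [0] * (length + 1)
    for start, end in reads:
        start, end = max(start, 0), min(end, length)  # clip to the chromosome
        if start < end:
            change[start] += 1   # one more read covers bases from here on
            change[end] -= 1     # ...and stops covering them here
    counts, running = [], 0
    for position in range(length):
        running += change[position]
        counts.append(running)
    return counts


# Three overlapping reads on a toy 8 bp chromosome.
print(coverage([(0, 4), (2, 6), (3, 5)], 8))  # [1, 1, 2, 3, 2, 1, 0, 0]
```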
From mapped reads to occupancy landscapes HOMER, BedTools, BamTools, NucTools Teif et al., Methods, 2012
Calculating occupancy with HOMER http://homer.ucsd.edu/homer/ngs/tagDir.html makeTagDirectory <Directory Name> [options] <alignment file>
Quality control (QC) http://homer.ucsd.edu/homer/ngs/tagDir.html
Quality control (QC) Good ChIP-seq Bad ChIP-seq http://homer.ucsd.edu/homer/ngs/tagDir.html
Data view in genome browsers Jung et al., NAR 2014 UCSC Genome Browser (online) IGV (install on a local computer)
UCSC Genome Browser https://genome.ucsc.edu/
Create UCSC files with HOMER http://homer.ucsd.edu/homer/ngs/ucsc.html makeUCSCfile <tag directory> -o auto
Peak shapes can be different Park P. J., Nature Reviews Genetics, 2009
Systematic analysis requires identifying all peaks in all datasets and comparing the differences between them Bardet et al. (2012) Nature Protocols, 7, 45-61
Peak calling is a method to identify areas in a genome enriched with aligned reads Wilbanks EG (2010) PLoS ONE 5, e11471.
Peak calling: finding the peaks Input: a control sample prepared in the same way as the ChIP-seq sample but without the antibody, so it has no specific enrichment of the protein of interest Pepke et al. (2009). Nature Methods, 6, S22–S32.
Peak calling: defining statistical significance
Is this peak statistically significant?
MACS (good for TFs), SICER (histones, etc.), HOMER (universal), PeakSeq, edgeR, CisGenome
Park P. J., Nature Reviews Genetics, 2009
Finding peaks with HOMER http://homer.ucsd.edu/homer/ngs/peaks.html
Guess what this command does:
findPeaks ChIPDirectory -style factor -i InputDirectory
We need to map our ChIP-seq sample and its Input (control), then create their HOMER tag directories (ChIPDirectory and InputDirectory), then find peaks using both directories.
Additional optional parameters:
-F <#> Enrichment ratio ChIP vs. Input (by default 4-fold)
-P <#> P-value cutoff (by default 0.0001)
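The two thresholds above (fold enrichment over Input and a P-value cutoff) can be illustrated with a toy Python test on a single candidate region. This is only a sketch of the idea, not HOMER's actual procedure: it assumes the ChIP and Input read counts are already scaled to the same sequencing depth, models the Input count as the mean of a Poisson distribution, and uses a pseudocount of 1 to avoid dividing by zero. The names `poisson_tail` and `is_peak` are hypothetical.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)        # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i          # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def is_peak(chip_reads: int, input_reads: int,
            min_fold: float = 4.0, max_p: float = 1e-4,
            pseudocount: float = 1.0) -> bool:
    """Toy significance test: enough fold enrichment AND a small enough P-value."""
    expected = max(float(input_reads), pseudocount)  # assumed background level
    fold = chip_reads / expected
    p_value = poisson_tail(chip_reads, expected)
    return fold >= min_fold and p_value <= max_p


print(is_peak(40, 5))   # strongly enriched region
print(is_peak(12, 5))   # only 2.4-fold over Input
print(is_peak(8, 2))    # 4-fold, but too few reads to reach P <= 0.0001
```

The last case shows why both criteria are useful: a region with very few reads can reach a high fold ratio by chance.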
ChIP-seq: reads to peaks/regions MACS, SICER, HOMER, PeakSeq, edgeR, DESeq, CisGenome
Peaks/regions in BED format
pos2bed.pl peakfile.txt > peakfile.bed
bed2pos.pl peakfile.bed > peakfile.txt
Intersecting genomic regions BedTools (command line) Galaxy (online)
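What an intersection of two region sets means can be shown with a small Python sketch. Regions are assumed to be (chromosome, start, end) tuples with 0-based starts and exclusive ends, as in BED. The function name `intersect` is hypothetical, and the pairwise comparison is chosen for clarity; tools such as BedTools use much faster algorithms on large files.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), end exclusive


def intersect(a: List[Region], b: List[Region]) -> List[Region]:
    """Return the overlapping part of every pair of regions from a and b."""
    b_by_chrom: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for chrom, start, end in b:
        b_by_chrom[chrom].append((start, end))

    overlaps = []
    for chrom, a_start, a_end in a:
        for b_start, b_end in b_by_chrom.get(chrom, []):
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:  # touching regions (end == start) do not overlap
                overlaps.append((chrom, start, end))
    return sorted(overlaps)


peaks = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
promoters = [("chr1", 150, 350), ("chr2", 80, 120)]
print(intersect(peaks, promoters))  # [('chr1', 150, 200), ('chr1', 300, 350)]
```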
Genomic features are also regions Mattout et al., Genome Biology, 2015
Let’s look at many similar regions Each horizontal line is one genomic region deepTools, NucTools https://github.com/fidelram/deepTools/wiki/Visualizations
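The data behind such a heat map is a matrix with one row per genomic region and one column per position relative to a common anchor (for example the TSS). Averaging the columns gives the aggregate profile. The Python sketch below builds both from a per-base signal on one chromosome; the function names are hypothetical, regions too close to the chromosome ends are simply skipped, and rows for minus-strand genes are reversed so that all rows read in the direction of transcription.

```python
from typing import List, Tuple


def signal_matrix(signal: List[float], anchors: List[Tuple[int, str]],
                  flank: int) -> List[List[float]]:
    """One row per anchor: the signal from anchor - flank to anchor + flank."""
    rows = []
    for position, strand in anchors:
        low, high = position - flank, position + flank + 1
        if low < 0 or high > len(signal):
            continue  # window would run off the chromosome
        row = signal[low:high]
        if strand == "-":
            row = row[::-1]  # orient by direction of transcription
        rows.append(row)
    return rows


def aggregate_profile(rows: List[List[float]]) -> List[float]:
    """Column-wise mean of the heat map matrix."""
    if not rows:
        return []
    return [sum(column) / len(rows) for column in zip(*rows)]


signal = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
rows = signal_matrix(signal, [(2, "+"), (7, "-")], flank=2)
print(rows)                     # [[0, 1, 2, 3, 4], [9, 8, 7, 6, 5]]
print(aggregate_profile(rows))  # [4.5, 4.5, 4.5, 4.5, 4.5]
```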
ChIP-seq heat maps for all genes, scaled with respect to their start (TSS) and end (TES) https://github.com/fidelram/deepTools/wiki/Visualizations
Cluster heatmaps deepTools 2.0 https://github.com/fidelram/deepTools/wiki/Visualizations
Comparing cluster heatmaps between two cell conditions NucTools
Histone modifications around TSS deepTools http://www.ie-freiburg.mpg.de/bioinformaticsfac
Motif enrichment analysis HOMER, MEME Pavlaki et al., 2017
Finding motifs with HOMER HOMER takes the coordinates of all ChIP-seq peaks, looks at the corresponding DNA sequences of each peak and finds the common consensus motifs that are encountered in many of these peaks. Then HOMER looks in a database and reports which motifs are similar to already known TF binding motifs, and which motifs are new.
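The core idea of finding sequences shared by many peaks can be sketched in Python by counting, for every short word of length k, how many peak sequences contain it. This is a deliberately simplified illustration and not HOMER's algorithm: it considers only exact words, ignores the reverse strand, and does not compare against background sequences or a database of known motifs. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def kmer_support(sequences: List[str], k: int) -> Counter:
    """For each k-mer, count the number of sequences that contain it."""
    support: Counter = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        # A set ensures each k-mer counts once per sequence, however often it repeats.
        seen = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        support.update(seen)
    return support


def top_kmers(sequences: List[str], k: int, n: int) -> List[Tuple[str, int]]:
    """The n k-mers found in the largest number of sequences."""
    return kmer_support(sequences, k).most_common(n)


peak_sequences = ["TTGACGTCAAA", "CCTGACGTCAG", "GGGTGACGTCA", "ACACACACACA"]
print(top_kmers(peak_sequences, k=8, n=1))  # [('TGACGTCA', 3)]
```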
http://meme-suite.org The MEME Suite is even more sophisticated and contains all tools that are needed for motif analysis
Summary of ChIP-seq analysis:
Map all reads
Occupancy calculation
Differential peak calling
Intersection of different signals
Correlation of different signals
Motif enrichment in peaks
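As an illustration of the "correlation of different signals" step, two occupancy tracks sampled at the same positions (for example, read counts in the same genomic bins) can be compared with the Pearson correlation coefficient. The Python sketch below is one common choice, assumed here for illustration; the function name `pearson` is hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long signal tracks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("tracks must have the same length (at least 2 values)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant track")
    return covariance / math.sqrt(var_x * var_y)


track_a = [1, 2, 3, 4]
track_b = [2, 4, 6, 8]   # same shape as track_a, twice the height
track_c = [4, 3, 2, 1]   # mirror image of track_a
print(round(pearson(track_a, track_b), 6), round(pearson(track_a, track_c), 6))
```

Values close to 1 mean the two signals rise and fall together along the genome, and values close to -1 mean one is high where the other is low.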
Take home message
Raw reads -> mapping -> peak calling
MUST KNOW:
Where NGS data is stored (GEO, etc.)
~100s of types of NGS experiments; we focus on chromatin
ChIP-seq data structure
RAW DATA; MAPPED READS; REGIONS; SITES
GENOME BROWSERS. PEAKS. PEAK CALLING
HEATMAP; AGGREGATE PROFILE; GENE ONTOLOGY (GO)
Optional video: https://www.youtube.com/watch?v=Ob9xGBPvr_s