Epigenetics System Biology Workshop: Introduction. Irina Shchukina, 11/27/2018
Outline. A very short intro to gene expression regulation. Some NGS technologies available to study regulation. Overview of a standard pipeline for ChIP-seq data processing. Next: practical session with ChIP-seq data.
Chromatin organization Heterochromatin – packed, unreachable DNA (6-8) Euchromatin – generally more active regions (1–5) https://www.nature.com/scitable/content/chromatin-has-highly-complex-structure-with-several-113743374
Nucleosomes. Each nucleosome is made of 8 histone proteins (two copies each of H2A, H2B, H3, and H4), plus the linker histone H1. About 147 bp of DNA wrap ~1.65 turns around the histone core. Histone tails stick out of the core and can be recognized by other proteins. Chemical modifications of histones influence DNA accessibility. Histone modifications are dynamic: they can be added, erased, and recognized. Examples: H3K4me1, H3K4me3, H3K27ac. http://en.wikipedia.org/wiki/Nucleosome
Transcription factors. Proteins that bind to DNA and control transcription of DNA into RNA. There are many interactions with other genome regions, along with the corresponding histone modifications. An enhancer may be located very far from the gene it regulates. http://www.assignmentpoint.com/science/biology/transcription-factor.html https://www.boundless.com/biology/gene-expression/eukaryotic-transcription-gene-regulation/transcriptional-enhancers-and-repressors/
Role of histone modifications. H3K27ac – distinguishes active enhancers from poised ones. H3K27me3 – repression of transcription; deposited mainly by the methyltransferase EZH2, which is a part of PRC2. H3K4me1 – enhancer mark. H3K4me3 – promoters, active transcription. H3K36me3 – gene body, active transcription.
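A minimal sketch of how the marks listed above could be combined into a rough label for a region. The function name `label_region` and the order in which the rules are checked are illustrative assumptions; real chromatin-state annotation is more nuanced than this lookup.

```python
def label_region(marks):
    """Return a rough functional label given the set of marks found in a region.

    Hypothetical helper: the rules only restate the slide and are checked
    in a fixed, arbitrary order.
    """
    marks = set(marks)
    if "H3K27me3" in marks:
        return "repressed"
    if "H3K4me3" in marks:
        return "active promoter"
    if "H3K4me1" in marks:
        # H3K27ac separates active enhancers from poised ones
        return "active enhancer" if "H3K27ac" in marks else "poised enhancer"
    if "H3K36me3" in marks:
        return "transcribed gene body"
    return "unclassified"


enhancer_label = label_region({"H3K4me1", "H3K27ac"})
```

Checking the repressive mark first is a design choice made for this sketch only; a region carrying both activating and repressive marks would need a more careful treatment.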
Chromatin immunoprecipitation: ChIP-seq. A good antibody is needed for clean data. Input sample – no specific IP; a control for the background noise level, required for normalization and peak calling. Can be used for both transcription factors and histone modifications. Different proteins → different types of data → different processing: TFs bind a narrow region of DNA (5–30 bp), while some histone modifications are very broad (H3K36me3 may cover the entire body of an actively transcribed gene). We will see some examples in the next part. http://www.bio.brandeis.edu/haberlab/jehsite/chIP.html
Ultra-low-input (ULI) ChIP-seq. The standard ChIP-seq protocol requires 1–5 million cells (or even more), a serious limitation for many studies (human samples, rare populations, etc.). Ultra-low-input ChIP-seq allows you to go as low as 100,000 cells per sample. The price is data quality and consistency: higher noise; variable signal-to-noise ratio between samples within one prep; hard to process (as you will see in the practical session). Major difference in protocol: no crosslinking step. Often MNase digestion is used instead of sonication to decrease the noise level and keep DNA-protein complexes intact without crosslinking.
Publicly available data. ENCODE: https://www.encodeproject.org – raw and processed data available. People actually care about this stuff and have generated tons of data. Blueprint: http://www.blueprint-epigenome.eu
Standard ChIP-seq pipeline. Raw data → Alignment → Peak calling → Interpretation. Heavy computational work: the machine-time- and memory-consuming part; GTAC runs their standard pipeline for you. The following slides have many keywords, details, tool names, links, etc.; they are there for googling. https://www.encodeproject.org/chip-seq/histone/
Raw sequencing data: FASTQ files. 4 lines per read. Line 1: @read ID. Line 2: actual sequence. Line 3: + and, optionally, the ID or any description. Line 4: encoded quality of each nucleotide.
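The four-line layout can be illustrated with a small reader. This is a sketch, not a replacement for a real parser: the function name `read_fastq` is made up for the example, and the quality decoding assumes the common Phred+33 encoding, which the slide does not specify.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) for each 4-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # line 3: '+' separator, ignored here
        quality_line = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, i.e. quality = ASCII code - 33
        qualities = [ord(char) - 33 for char in quality_line]
        yield header[1:], sequence, qualities  # drop the leading '@'


# Toy record, invented for illustration
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
```

In the toy record, `I` decodes to quality 40 and `#` to quality 2 under the Phred+33 assumption.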
Alignment. Finds the location in the genome for each read. Tools to use: bowtie or BWA (shouldn’t make much difference). NB: use consistent reference genome versions! Human: hg18 → hg19 (=GRCh37) → hg38 (=GRCh38). Mouse: mm8 → mm9 → mm10. If you want to compare positional data from different versions, either realign one of your datasets or use the LiftOver tool from UCSC (https://genome.ucsc.edu/cgi-bin/hgLiftOver). https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/mapping/tutorial.html
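A small sanity check along these lines can catch mixed genome versions before positional data are compared. The names `ASSEMBLY_ALIASES`, `canonical_assembly`, and `same_assembly` are hypothetical, and the alias table only covers the synonyms named on the slide.

```python
# Synonyms taken from the slide; extend as needed for other assemblies.
ASSEMBLY_ALIASES = {"GRCh37": "hg19", "GRCh38": "hg38"}


def canonical_assembly(name):
    """Map an assembly name to a single canonical label."""
    return ASSEMBLY_ALIASES.get(name, name)


def same_assembly(names):
    """True if all datasets were aligned to the same genome version."""
    return len({canonical_assembly(name) for name in names}) == 1


human_ok = same_assembly(["hg19", "GRCh37"])
mouse_ok = same_assembly(["mm9", "mm10"])
```

Here `human_ok` is true and `mouse_ok` is false, so the mouse datasets would need realignment or LiftOver before comparison.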
Alignment output: BAM/SAM files. Detailed format description: https://samtools.github.io/hts-specs/SAMv1.pdf Set of utilities: SAMtools.
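To give a feel for the SAM text format, the sketch below splits one alignment line into its 11 mandatory fields and decodes two FLAG bits. The helper name `parse_sam_line` and the example read are invented for illustration; in practice SAMtools or a library would be used instead.

```python
# The 11 mandatory SAM columns, in order
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Parse the mandatory columns of one SAM alignment line into a dict."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    # FLAG is a bit field: 0x4 = read unmapped, 0x10 = reverse strand
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


# Toy alignment line, invented for illustration
sam_line = "read1\t16\tchr1\t100\t42\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
sam_record = parse_sam_line(sam_line)
```

Optional tags after the eleventh column are ignored in this sketch.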
Visualization. Genome browsers (JBR, IGV, UCSC) work with the bigWig (=bw) format. BAM to bigWig conversion: the bamCoverage tool from the deepTools suite. You can normalize data before visualization: by sequencing depth (bamCoverage 1x normalization) or by input (bamCompare). https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html
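The idea behind 1x (sequencing depth) normalization can be sketched as follows: estimate the average coverage from the number of mapped reads, the fragment length, and the effective genome size, then scale the signal so that the average coverage becomes 1. This is a simplified illustration rather than bamCoverage's actual code, and the function names and the numbers in the example are made up.

```python
def one_x_scale_factor(mapped_reads, fragment_length, effective_genome_size):
    """Scale factor that brings the average genome coverage to 1x."""
    depth = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / depth


def normalize_bins(bin_values, scale):
    """Apply the scale factor to per-bin coverage values."""
    return [value * scale for value in bin_values]


# Illustrative round numbers, not taken from a real genome or experiment
scale = one_x_scale_factor(
    mapped_reads=20_000_000,
    fragment_length=200,
    effective_genome_size=2_000_000_000,
)
normalized = normalize_bins([4, 0, 10], scale)
```

With these numbers the average coverage is 2x, so every bin is multiplied by 0.5, which makes samples sequenced to different depths comparable in a genome browser.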
Visualization. Different types of ChIP-seq data actually look different! https://www.encodeproject.org/chip-seq/histone/ Peter Park. ChIP–seq: advantages and challenges of a maturing technology. Nature Reviews Genetics.
Peak calling. The most sensitive step; choosing an appropriate tool is very important! Gold-standard tools: MACS (narrow peaks: TFs and some histone modifications) and SICER (broad histone modifications). Data quality significantly affects results. ULI-ChIP-seq data requires special treatment: SPAN. More in the practical session. https://www.encodeproject.org/chip-seq/histone/
Peak calling output: BED. The BED format describes genomic intervals (not specific to ChIP-seq). 3 required fields: chromosome, start, end. Other fields depend on the tool used. Detailed description: https://genome.ucsc.edu/FAQ/FAQformat.html#format1 Toolkit: bedtools. Can be visualized in genome browsers together with bigWig files.
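A minimal sketch of reading BED intervals and testing two of them for overlap. The helper names `parse_bed` and `overlaps` are hypothetical; for real work bedtools is the tool named on the slide. BED coordinates are 0-based and half-open, so two intervals that merely touch do not overlap.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end, extra_fields) tuples."""
    intervals = []
    for line in lines:
        # Skip blank lines, comments, and browser/track header lines
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        intervals.append((chrom, start, end, fields[3:]))
    return intervals


def overlaps(a, b):
    """True if two intervals share at least one base (half-open coordinates)."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


# Toy BED content, invented for illustration
bed_lines = [
    "track name=example\n",
    "chr1\t100\t200\tpeak1\t50\n",
    "chr1\t200\t300\n",
]
peaks = parse_bed(bed_lines)
touching_overlap = overlaps(peaks[0], peaks[1])
```

The extra fields are kept as raw strings because their meaning depends on the tool that produced the file.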
Quality control: ENCODE ENCODE established a set of formal data quality standards and metrics: https://www.encodeproject.org/data-standards/ Raw data QC! Alignment QC! Peak calling QC! Interpretation
Quality control. Other things to look at. Raw data: FastQC – a report per FASTQ file; MultiQC – summarizes multiple outputs into a single report. Alignment: the aligner usually produces a report; look for % of aligned reads, % of multimappers, etc. These reports can be processed with MultiQC. Peak calling: no ultimate QC metric; you need to look at multiple scores. General advice: visualize your data! Check genes that you know are bound/active/repressed/… Comparing to an existing dataset may be a good sanity check.
Interpretation: binding motifs. Particularly useful for transcription factors; can be a part of the QC routine for TFs with a known binding motif. Major tool: the MEME suite (http://meme-suite.org). MEME-ChIP – specialized version for ChIP-seq. Use not too many sequences (e.g. 1,000–2,000) and not too wide ones (~100–200 bp); otherwise it may take forever to calculate. Input is a .fa file with the actual nucleotide sequences. It may be generated using bedtools: bedtools getfasta -fi GRCm38.genome.fa -fo peaks.seq.fa -bed peaks.bed Submit your jobs and be patient.
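One possible way to meet the "not too many, not too wide" advice is to keep only the best-scoring peaks and trim each to a fixed window around its center before extracting sequences. The sketch below assumes peaks are ranked by a score column, which the slide does not prescribe; `select_for_motifs` is a hypothetical helper.

```python
def select_for_motifs(peaks, top_n=1000, width=200):
    """Keep the top_n peaks by score and trim each to `width` bp around its center.

    peaks: list of (chrom, start, end, score) tuples.
    """
    best = sorted(peaks, key=lambda peak: peak[3], reverse=True)[:top_n]
    half = width // 2
    trimmed = []
    for chrom, start, end, score in best:
        center = (start + end) // 2
        # Clip at 0; clipping at the chromosome end is not handled here
        trimmed.append((chrom, max(0, center - half), center + half, score))
    return trimmed


# Toy peaks, invented for illustration
toy_peaks = [
    ("chr1", 1000, 2000, 5.0),
    ("chr1", 50, 90, 9.0),
    ("chr2", 300, 500, 1.0),
]
motif_input = select_for_motifs(toy_peaks, top_n=2, width=100)
```

The trimmed intervals could then be written to a BED file and passed to the bedtools command shown on the slide. If the peak caller reports summits, centering on the summit instead of the midpoint may be preferable.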
Interpretation: annotation. Match peaks and genes. Possible approaches: (1) Most common way: find the gene that is closest to the peak. Why might your favorite gene be missing from the resulting list? (2) Assign all genes located within a selected range from a peak. (3) Mostly for TFs: consider only peaks located near a TSS (e.g. [-10 kb, +3 kb] around the TSS).
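Two of the approaches above can be sketched in a few lines: nearest-gene assignment and a strand-aware window around the TSS. The sketch assumes a single chromosome and represents each gene only by its TSS position and strand; the function names and gene coordinates are made up.

```python
def nearest_gene(peak_center, genes):
    """Return the gene whose TSS is closest to the peak center.

    genes: dict mapping gene name -> (tss_position, strand).
    """
    return min(genes, key=lambda name: abs(genes[name][0] - peak_center))


def genes_near_tss(peak_center, genes, upstream=10_000, downstream=3_000):
    """Return genes whose [-upstream, +downstream] TSS window contains the peak."""
    hits = []
    for name, (tss, strand) in genes.items():
        # Positive offset means downstream of the TSS in the direction of transcription
        offset = peak_center - tss if strand == "+" else tss - peak_center
        if -upstream <= offset <= downstream:
            hits.append(name)
    return hits


# Toy annotation, invented for illustration
toy_genes = {"geneA": (10_000, "+"), "geneB": (30_000, "-")}
closest = nearest_gene(19_000, toy_genes)
in_window = genes_near_tss(33_000, toy_genes)
```

For the peak at 19,000, only `geneA` is reported by the nearest-gene rule even though `geneB` is almost as close, which is one way a gene of interest can drop out of the resulting list.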
Interpretation: pathway enrichment GREAT: http://great.stanford.edu/public/html/ Choose your genome version Upload your BED file May adjust association rules Submit Result includes: Enrichment analysis against multiple pathway databases Matched peaks and genes Some positional analysis (e.g. distribution of distances from TSS)
Interpretation: differential analysis. Not an established analysis; no gold-standard tools. Multiple biological replicates are strongly recommended! Good review: Steinhauser, Kurzawa, Eils and Herrmann. A comprehensive comparison of tools for differential ChIP-seq analysis.
Thank you! Next: practical session