BS222 – Genome Science Lecture 8 NGS applications. Part 1 Vladimir Teif
Module structure Genomes, sequencing projects and genomic databases (VT) (Oct 9, 2018) Sequencing technologies (VT) (Oct 11, 2018) Genome architecture I: protein coding genes (VT) (Oct 16, 2018) Genome architecture II: transcription regulation (VT) (Oct 18, 2018) Genome architecture III: 3D chromatin organisation (VT) (Oct 23, 2018) Epigenetics overview (PVW) (Oct 25, 2018) DNA methylation and other DNA modifications (VT) (Oct 30, 2018) NGS applications I: Experiments and basic analysis (VT) (Nov 1, 2018) NGS applications II: Data integration (VT) (Nov 8, 2018) Comparative genomics (JP, guest lecture) (Nov 13, 2018) SNPs, CNVs, population genomics (LS, guest lecture) (Nov 15, 2018) Histone modifications (PVW) (Nov 20, 2018) Non-coding RNAs (PVW) (Nov 22, 2018) Genome Stability (PVW) (Nov 27, 2018) Transcriptomics (PVW) (Nov 29, 2018) Year's best paper (PVW) (Dec 6, 2018) Revision lecture (all lecturers; spring term)
NGS techniques vs NGS applications NGS techniques: how to sequence DNA (or RNA) (covered in lecture 2; funny recap in this video https://www.youtube.com/watch?v=-7GK1HXwCtE) NGS applications: how to design experiments in order to answer a specific biological question
Examples of NGS applications Chromatin domains Hi-C Figure adapted from http://www.scienceinschool.org
Types of NGS applications RNA-seq, GRO-seq, CAGE, SAGE, CLIP-seq, Drop-seq: gene expression; non-coding RNA. ChIP-seq, MNase-seq, DNase-seq, ATAC-seq, etc.: protein binding; histone modifications; chromatin accessibility; nucleosome positioning. Bisulfite sequencing: DNA methylation. Hi-C, 3C, 4C, ChIA-PET, etc.: chromatin loops. Amplicon sequencing: targeted regions; phylogenomics; metagenomics. Whole Genome Sequencing (WGS): de novo assembly (new species or new analyses). A curated bibliography of *seq methods (~100 methods) can be found at https://liorpachter.wordpress.com/seq/
RNA-seq (RNA sequencing) https://en.wikipedia.org/wiki/RNA-Seq
ChIP-seq (Chromatin Immunoprecipitation followed by sequencing) 1. Crosslink Protein-DNA complexes in situ 2. Isolate nuclei and fragment DNA (sonication or digestion) 3. Immunoprecipitate with antibody against target nuclear protein and reverse crosslinks 4. Release DNA and submit for sequencing Adapted from www.VisiScience.com
MNase-seq (Micrococcal Nuclease digestion followed by sequencing) MNase = Micrococcal Nuclease (enzyme that cuts DNA between nucleosomes) Teif et al. (2012), Methods, 62, 26-38
FAIRE-seq (Formaldehyde-Assisted Isolation of Regulatory Elements followed by sequencing) Giresi et al. (2007), Genome Res. 17, 877–885
DNase-seq (DNase I digestion followed by sequencing) Wang et al. (2012), PLoS ONE 7, e42414
ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) How transposase works: https://www.youtube.com/watch?v=XYZHMGUGq6o Buenrostro et al. (2013) Nat Methods. 10, 1213-1218
Methods for 1D genome mapping Meyer & Liu, Nature Reviews Genetics 15, 709–721 (2014)
Methods for 1D genome mapping Tsompana and Buck, Epigenetics & Chromatin 7, 33 (2014)
NGS methods for DNA methylation Bisulfite sequencing Affinity purification (e.g. MeDIP)
Chromatin Conformation Capture methods to map locations of DNA-DNA loops Rao et al., Cell 159, 1665–1680 (2014)
Rivera and Ren (2013), Cell, 155, 39-55 Since 2017 DNA loops can be measured with 100-bp resolution (Bonev et al., Cell, 2017)
Timeline of NGS methods Bulk methods that require many cells Rivera and Ren (2013), Cell, 155, 39-55 Single-cell methods Hu et al., Front. Cell Dev. Biol., 2018
Where to get NGS data? Do your own experiment Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo Sequence read archive (SRA) https://www.ncbi.nlm.nih.gov/sra European Nucleotide Archive https://www.ebi.ac.uk/ena The Cancer Genome Atlas (TCGA) https://tcga-data.nci.nih.gov/tcga Exome Aggregation Consortium (ExAC) http://exac.broadinstitute.org/ You also have to upload your data!
Next generation sequencing analysis
How to analyze NGS data? Ask a bioinformatician: you need to explain what you want, and for that you need to understand what can be done and how. Do it yourself: Command line -> become a bioinformatician. Online wrappers -> simpler, but with file size limits. Example of a convenient online tool: Galaxy http://galaxy.essex.ac.uk/
ChIP-seq (Chromatin ImmunoPrecipitation followed by sequencing) 1. Crosslink Protein-DNA complexes in situ 2. Isolate nuclei and fragment DNA (sonication or digestion) 3. Immunoprecipitate with antibody against target nuclear protein and reverse crosslinks 4. Release DNA and submit for sequencing Adapted from www.VisiScience.com
Experiment Data analysis http://www4.utsouthwestern.edu/mcdermottlab/NGS/index.html
ChIP-seq data analysis www.utsouthwestern.edu/labs.bioinformatics-core/analysis/chip-seq.png
Unmapped sequenced reads (this is “raw”, primary data):
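Raw reads are commonly delivered as FASTQ text, with four lines per read: a header starting with "@", the base sequence, a "+" separator, and a quality string of the same length as the sequence. The sketch below is a minimal illustration of that layout, assuming uncompressed, strictly four-line records and the common Phred+33 quality encoding; the names read_fastq, phred_scores and the two example reads are hypothetical and not taken from the lecture.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per read from a strict 4-line-per-read FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no information here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


# Two made-up reads, only to show the record structure.
EXAMPLE_FASTQ = "@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nTTGACCA\n+\n#######\n"
records = list(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
for record in records:
    print(record.name, record.sequence, phred_scores(record.quality))
```

Real files are usually gzip-compressed and contain millions of such records, which is why they are processed as streams rather than loaded whole.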
Mapped reads are characterised by their locations in the genome Bowtie, BWA, ELAND, Novoalign, BLAST, ClustalW TopHat (for RNA-seq)
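Mappers typically report each aligned read as one tab-separated line in SAM format, where the location is given by the reference name, a 1-based leftmost position, a CIGAR string and a strand bit in the FLAG field. The sketch below shows how a genomic interval could be recovered from such a line; it is an illustration only, assuming a single well-formed alignment line, and parse_sam_line, MappedRead and the example read are hypothetical names and data.

```python
import re
from typing import NamedTuple, Optional

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the genome


class MappedRead(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    strand: str


def parse_sam_line(line: str) -> Optional[MappedRead]:
    """Return the genomic interval of one SAM alignment line, or None if unmapped."""
    fields = line.rstrip("\n").split("\t")
    name, flag, chrom = fields[0], int(fields[1]), fields[2]
    pos, cigar = int(fields[3]), fields[5]
    if flag & 0x4:  # FLAG bit 0x4 marks an unmapped read
        return None
    ref_length = sum(
        int(count) for count, op in CIGAR_OP.findall(cigar)
        if op in REFERENCE_CONSUMING
    )
    start = pos - 1  # SAM positions are 1-based
    strand = "-" if flag & 0x10 else "+"  # FLAG bit 0x10 marks the reverse strand
    return MappedRead(name, chrom, start, start + ref_length, strand)


# A made-up alignment: 36 bases with a 2-bp deletion, mapped to the reverse strand.
EXAMPLE_LINE = "\t".join(
    ["read1", "16", "chr1", "101", "42", "30M2D6M", "*", "0", "0", "A" * 36, "I" * 36]
)
print(parse_sam_line(EXAMPLE_LINE))
```

In practice the binary BAM form of these files is read with a dedicated library rather than by hand; the point here is only that a mapped read reduces to chromosome, start, end and strand.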
Reads can align to overlapping locations http://biocluster.ucr.edu/~rkaundal/workshops/R_feb2016/ChIPseq/ChIPseq.html We need to count all reads at each base pair
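Counting reads at each base pair can be sketched as follows. This is a minimal illustration, assuming reads are given as 0-based half-open (start, end) intervals on a single chromosome region; per_base_coverage and the three example reads are hypothetical. It uses a difference array, so each read costs two updates regardless of its length.

```python
from typing import Iterable, List, Tuple


def per_base_coverage(reads: Iterable[Tuple[int, int]], region_length: int) -> List[int]:
    """Number of reads covering each base of a region of the given length."""
    diff = [0] * (region_length + 1)
    for start, end in reads:
        start = max(start, 0)           # clip reads to the region
        end = min(end, region_length)
        if start < end:
            diff[start] += 1            # coverage rises where the read begins
            diff[end] -= 1              # and falls just past where it ends
    coverage = []
    depth = 0
    for delta in diff[:region_length]:
        depth += delta
        coverage.append(depth)
    return coverage


# Three overlapping made-up reads in an 8-bp region.
example_reads = [(0, 4), (2, 6), (3, 5)]
print(per_base_coverage(example_reads, 8))  # [1, 1, 2, 3, 2, 1, 0, 0]
```

The resulting profile is what a genome browser displays as a coverage track.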
ChIP-seq landscapes depend on the protein Park P. J., Nature Reviews Genetics, 2009
We can compare different experimental datasets for the same genomic region 5mC Gifford et al., Cell 2013
We can compare different experimental conditions in a genome browser Jung et al., NAR 2014 UCSC Genome Browser (online) IGV (install on a local computer)
Systematic analysis requires identifying all peaks in all datasets and comparing the differences Bardet et al. (2012) Nature Protocols, 7, 45-61
Peak calling is a method to identify areas in a genome enriched with aligned reads Wilbanks EG (2010) PLoS ONE 5, e11471.
Peak calling: finding the peaks Input: sample that was prepared in the same way as in the ChIP-seq, but no antibody was added, so it has no specific enrichment of our protein of interest Pepke et al. (2009). Nature Methods, 6, S22–S32.
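The role of the Input control can be illustrated by comparing ChIP and Input read counts window by window. The sketch below is a deliberately simplified illustration, not the procedure of any published peak caller: it assumes reads have already been counted in fixed windows, scales Input to the ChIP library size, and adds a pseudocount to avoid division by zero. The function names, the threshold of 2 and the counts are all hypothetical.

```python
from typing import List, Sequence


def fold_enrichment(chip_counts: Sequence[int], input_counts: Sequence[int],
                    pseudocount: float = 1.0) -> List[float]:
    """ChIP-over-Input ratio per window after library-size scaling."""
    input_total = sum(input_counts)
    if input_total == 0:
        raise ValueError("Input sample contains no reads")
    scale = sum(chip_counts) / input_total  # make the two libraries comparable
    return [
        (chip + pseudocount) / (control * scale + pseudocount)
        for chip, control in zip(chip_counts, input_counts)
    ]


def candidate_windows(chip_counts: Sequence[int], input_counts: Sequence[int],
                      min_fold: float = 2.0) -> List[int]:
    """Indices of windows whose scaled enrichment reaches the chosen threshold."""
    ratios = fold_enrichment(chip_counts, input_counts)
    return [i for i, ratio in enumerate(ratios) if ratio >= min_fold]


# Made-up read counts in five consecutive windows.
chip = [5, 6, 60, 4, 5]
control = [10, 12, 10, 8, 10]
print(candidate_windows(chip, control))  # [2]
```

A fixed fold threshold ignores sampling noise, which is why real peak callers attach a statistical significance to each candidate.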
Peak calling: defining statistical significance
Peak calling: defining statistical significance MACS (good for TFs) SICER (histones, etc.) HOMER (universal) PeakSeq edgeR CisGenome Is this peak statistically significant? Park P. J., Nature Reviews Genetics, 2009
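One common way to ask whether a peak is statistically significant is to model background read counts as Poisson-distributed and compute the probability of seeing at least the observed count by chance. The sketch below illustrates that idea only; it is not the implementation of any tool listed above, and the function name, the background rate of 5 reads per window and the observed counts are assumptions made for the example.

```python
import math


def poisson_upper_tail(observed: int, expected: float) -> float:
    """P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-expected)      # P(X = 0)
    cumulative = term
    for i in range(1, observed):    # add P(X = 1) ... P(X = observed - 1)
        term *= expected / i
        cumulative += term
    return max(0.0, 1.0 - cumulative)


# Hypothetical numbers: the background predicts 5 reads per window.
background_rate = 5.0
for reads_in_window in (5, 12, 30):
    p_value = poisson_upper_tail(reads_in_window, background_rate)
    print(reads_in_window, p_value)
```

Because every window in the genome is tested, the resulting p-values need a multiple-testing correction before peaks are reported; for extremely small p-values the subtraction from 1 also loses floating-point precision, so this form is only suitable for illustration.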
Important: peaks are just genomic regions
Genes are also genomic regions DESeq, edgeR, Cuffdiff
DNA methylation: also genomic regions Individual CpGs Differentially methylated regions DMRcaller BISMARK
Any genomic regions can be intersected BedTools (command line) Galaxy (online)
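Intersecting two sets of regions amounts to finding the overlapping parts of intervals on the same chromosome. The sketch below shows the idea with a two-pointer sweep; it assumes 0-based half-open (chromosome, start, end) tuples and that regions within each list do not overlap one another (for example, merged peaks). The function name and the example coordinates are hypothetical, and this is not how BedTools is implemented.

```python
from typing import List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def intersect_regions(a: List[Region], b: List[Region]) -> List[Region]:
    """Overlapping parts of two region lists, each internally non-overlapping."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    overlaps: List[Region] = []
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a != chrom_b:
            # Advance whichever list is still on the earlier chromosome.
            if chrom_a < chrom_b:
                i += 1
            else:
                j += 1
            continue
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start < end:
            overlaps.append((chrom_a, start, end))
        # Move past the region that ends first; it cannot overlap anything further.
        if end_a < end_b:
            i += 1
        else:
            j += 1
    return overlaps


# Made-up peaks and gene bodies.
peaks = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 150)]
genes = [("chr1", 150, 350), ("chr2", 0, 60)]
print(intersect_regions(peaks, genes))
# [('chr1', 150, 200), ('chr1', 300, 350), ('chr2', 50, 60)]
```

The same operation applies to any pair of region sets: peaks against genes, peaks against methylated regions, or peaks from two experiments.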
We can calculate distribution of TF binding sites among different genomic features Toropainen et al. (2016) Scientific Reports, 6, 33510
We can also calculate enrichments of binding sites of our TF in different genomic regions Mattout et al., Genome Biology, 2015
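The distribution of binding sites among genomic features, and their enrichment relative to chance, can be sketched as a ratio of observed to expected fractions. The example below assumes each peak has already been assigned one feature label and that the expected fraction is simply the share of the genome the feature occupies; the function names, the labels and all numbers are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def feature_distribution(peak_labels: Iterable[str]) -> Dict[str, float]:
    """Fraction of peaks falling into each annotated genomic feature."""
    counts = Counter(peak_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no peaks supplied")
    return {feature: n / total for feature, n in counts.items()}


def fold_over_expected(observed_fraction: float, feature_bp: int, genome_bp: int) -> float:
    """Observed fraction of peaks divided by the fraction of the genome covered."""
    expected_fraction = feature_bp / genome_bp
    return observed_fraction / expected_fraction


# Made-up annotation of 100 peaks.
labels = ["promoter"] * 30 + ["intron"] * 40 + ["intergenic"] * 30
fractions = feature_distribution(labels)

# Assumption for illustration: promoters cover 60 Mb of a 3 Gb genome (2%).
promoter_enrichment = fold_over_expected(fractions["promoter"], 60_000_000, 3_000_000_000)
print(fractions, promoter_enrichment)
```

With these numbers 30% of peaks lie in 2% of the genome, about a 15-fold enrichment; a permutation or binomial test would be needed to judge whether such a value is significant.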
…Or study the DNA sequence inside the peaks to find some common motifs HOMER, MEME Massie et al., EMBO J. (2011) 30, 2719–2733
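The simplest form of motif search is to count which short sequences (k-mers) recur across many peaks. The sketch below illustrates only that counting step and is far simpler than the statistical models used by HOMER or MEME; it ignores reverse complements and background composition, and the function name and the three peak sequences are made up for the example.

```python
from collections import Counter
from typing import Iterable


def kmer_support(sequences: Iterable[str], k: int) -> Counter:
    """For each k-mer, the number of sequences that contain it at least once."""
    support: Counter = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        distinct = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        support.update(distinct)  # each sequence contributes at most 1 per k-mer
    return support


# Three made-up peak sequences sharing one 8-bp element.
peak_sequences = ["TTGACGTCAAT", "CCTGACGTCAGG", "ATGACGTCATA"]
support = kmer_support(peak_sequences, 8)
print(support.most_common(1))  # [('TGACGTCA', 3)]
```

Counting presence per sequence rather than total occurrences keeps a single repetitive peak from dominating the result.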
What else can we do with peaks? Compare two experimental conditions to see which peaks appear/disappear (e.g. protein binding gained/lost); Compute associations of our protein with different genes (e.g. define which genes are regulated by this protein) Study the DNA sequence inside the peaks (e.g. to find which other TFs co-bind with our protein of interest) Look how our peaks are arranged with respect to other peaks (e.g. to check for interactions with other proteins) etc
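Associating peaks with genes is often approximated by assigning each peak to the nearest transcription start site (TSS). The sketch below illustrates that approach for one chromosome, assuming TSS coordinates are sorted in ascending order; nearest gene is only a proxy for regulation, and the function name, gene names and coordinates are hypothetical.

```python
import bisect
from typing import List, Sequence, Tuple


def nearest_gene(peak_center: int, tss_positions: Sequence[int],
                 gene_names: Sequence[str]) -> Tuple[str, int]:
    """Gene with the closest TSS to a peak centre, and the distance in bp.

    tss_positions must be sorted ascending and aligned with gene_names.
    """
    if not tss_positions:
        raise ValueError("no genes supplied")
    index = bisect.bisect_left(tss_positions, peak_center)
    candidates: List[int] = []
    if index > 0:
        candidates.append(index - 1)   # closest TSS to the left
    if index < len(tss_positions):
        candidates.append(index)       # closest TSS at or to the right
    best = min(candidates, key=lambda i: abs(tss_positions[i] - peak_center))
    return gene_names[best], abs(tss_positions[best] - peak_center)


# Made-up TSS coordinates on one chromosome.
tss = [1000, 5000, 12000]
names = ["geneA", "geneB", "geneC"]
for center in (5600, 9000):
    print(center, nearest_gene(center, tss, names))
```

Binary search keeps each lookup fast even with tens of thousands of genes; when a peak is equally far from two genes, this sketch returns the left-hand one.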
Take home message NGS data structure NGS data are very large text files. NGS analysis needs “large” computers MUST KNOW: NGS data structure Hundreds of types of NGS experiments exist; we focus on ChIP-seq here Where are NGS data stored? (GEO, etc.) RAW DATA; MAPPED READS; REGIONS; SITES GENOME BROWSERS. PEAKS. PEAK CALLING Optional video: https://www.youtube.com/watch?v=Ob9xGBPvr_s