Introduction to Data Processing and Variant Detection for NGS DNA Sequencing
EMC Galaxy Course, November 24-25, 2014
Youri Hoogstrate, David van Zessen, Saskia Hiltemann, Guido Jenster, Andrew Stubbs
How does next-gen sequencing work?
Instruments generate short reads that need to be mapped to the reference
High-level overview of NGS data processing
Aligned reads
In Galaxy, you can view your data in the built-in genome browser, Trackster.
Challenge: distinguishing variants from noise
Possible reasons for a mismatch:
- True SNP
- Error generated in library prep
- Base calling error
- Misalignment (mapping error)
- Error in reference genome
Genotyping
- What is the set of alleles at this locus? What are their frequencies?
- Genotypers begin with a model of prior knowledge about the likelihood (and types) of errors, and the likelihood of observing real variants.
- Error models depend on the sequencing technology.
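To make the idea of "prior knowledge plus an error model" concrete, here is a minimal sketch of a Bayesian genotyper for a single biallelic site in a diploid sample. It is an illustration only and not the model used by any of the callers in this course. The function name `genotype_posteriors`, the flat per-base error rate, and the prior values are all hypothetical choices.

```python
import math

def genotype_posteriors(ref_count, alt_count, error_rate=0.01, priors=None):
    """Posterior probability of each diploid genotype at one biallelic site.

    Assumptions (illustrative only): every read is independent, and every base
    has the same error probability `error_rate`, with 0 < error_rate < 1.
    """
    if priors is None:
        # Hypothetical priors: most sites match the reference.
        priors = {"ref/ref": 0.998, "ref/alt": 0.001, "alt/alt": 0.001}

    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {"ref/ref": error_rate, "ref/alt": 0.5, "alt/alt": 1.0 - error_rate}

    # Work in log space; the binomial coefficient is the same for every
    # genotype, so it cancels during normalisation and is left out.
    log_post = {}
    for genotype, p in p_alt.items():
        log_likelihood = alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        log_post[genotype] = math.log(priors[genotype]) + log_likelihood

    # Normalise so that the posteriors sum to 1.
    highest = max(log_post.values())
    unnormalised = {g: math.exp(v - highest) for g, v in log_post.items()}
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}

# One mismatch in 20 reads is best explained as noise, 10 in 20 as a heterozygous site.
for ref_count, alt_count in [(19, 1), (10, 10), (0, 20)]:
    posteriors = genotype_posteriors(ref_count, alt_count)
    best = max(posteriors, key=posteriors.get)
    print(ref_count, alt_count, best)
```

Real genotypers refine this in several ways, for example by using the per-base and mapping qualities of each read instead of one flat error rate.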
What we know about NGS technology
- Relatively high per-base error rate
- Reads are higher quality in the middle than at the ends
- Some technologies perform poorly on homopolymers and GC-rich regions
- Indels confuse alignment
- Sequence coverage is not uniform
- Alignments are probabilistic
Quality Control
- Local realignment
- Remove duplicate reads
- Filter low-quality reads
- Recalibrate base qualities
- Read trimming
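Because reads tend to be of lower quality at the ends, read trimming usually removes low-quality bases from both ends. The sketch below shows the simplest form of that idea. It is not the algorithm of any particular Galaxy tool; the function name `trim_read` and the threshold of Q20 are assumptions made for illustration.

```python
def trim_read(sequence, qualities, min_quality=20):
    """Remove bases below `min_quality` from both ends of a read.

    `qualities` holds one Phred score per base. Low-quality bases in the
    middle of the read are kept; only the ends are trimmed.
    """
    start = 0
    end = len(sequence)
    # Advance from the 5' end past low-quality bases.
    while start < end and qualities[start] < min_quality:
        start += 1
    # Retreat from the 3' end past low-quality bases.
    while end > start and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[start:end], qualities[start:end]

trimmed_seq, trimmed_qual = trim_read("ACGTACGT", [5, 12, 30, 35, 36, 31, 15, 2])
print(trimmed_seq)   # GTAC
print(trimmed_qual)  # [30, 35, 36, 31]
```

A read whose bases are all below the threshold comes back empty and would normally be discarded by the low-quality read filter.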
Quality Score
- FASTQ: raw reads with per-base quality scores
- Quality character = ASCII character with code (Phred score + 33), so that all characters are printable
- Q = -10 log10(P), where P is the base-calling error probability
- Q=10: error rate 10%
- Q=20: error rate 1%
- Q=30: error rate 0.1%
- etc.
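The encoding above can be checked with a few lines of Python. This is a small sketch that assumes the +33 offset stated on the slide; the function names are made up for this example.

```python
import math

PHRED_OFFSET = 33  # offset stated on the slide (Phred+33 encoding)

def decode_quality(quality_string):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - PHRED_OFFSET for character in quality_string]

def error_probability(phred_score):
    """Invert Q = -10 * log10(P) to get the base-calling error probability."""
    return 10 ** (-phred_score / 10)

def phred_from_probability(probability):
    """Compute Q = -10 * log10(P) for an error probability P."""
    return -10 * math.log10(probability)

print(decode_quality("!+5?I"))  # [0, 10, 20, 30, 40]
for q in (10, 20, 30):
    print(q, error_probability(q))
```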
Quality Control Tool: FastQC
Sequencing Depth
Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data. BMC Genomics 2012, 13(Suppl 2):S6
Tools
Popular tools:
- SAMtools mpileup (practical)
- GATK UnifiedGenotyper (practical, extra part)
- FreeBayes (practical, extra part)
- MAQ
- VarScan 2
All are available in the Galaxy Tool Shed.
There is always a trade-off between sensitivity and specificity, that is, between false negatives and false positives.
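The trade-off can be made concrete by comparing a call set against a set of known variants. The sketch below is illustrative: the function `caller_metrics`, the variant tuples, and the number of assessed sites are invented for the example and are not output from any of the tools listed.

```python
def caller_metrics(called, truth, n_sites):
    """Sensitivity and specificity of a variant call set.

    `called` and `truth` are collections of (chrom, pos, ref, alt) tuples;
    `n_sites` is the total number of positions that were assessed.
    """
    called, truth = set(called), set(truth)
    true_pos = len(called & truth)    # real variants that were called
    false_pos = len(called - truth)   # calls that are not real variants
    false_neg = len(truth - called)   # real variants that were missed
    true_neg = n_sites - true_pos - false_pos - false_neg
    sensitivity = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    specificity = true_neg / (true_neg + false_pos) if true_neg + false_pos else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity}

# Hypothetical truth set and two hypothetical call sets over 1000 sites.
truth = [("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr1", 300, "G", "A"), ("chr1", 400, "T", "C")]
strict_calls = truth[:2]                      # misses two real variants
lenient_calls = truth + [("chr1", 500, "A", "T"), ("chr1", 600, "G", "C")]

print(caller_metrics(strict_calls, truth, n_sites=1000))
print(caller_metrics(lenient_calls, truth, n_sites=1000))
```

In this made-up example the strict call set has no false positives but finds only half of the variants, while the lenient one finds all of them at the cost of two false calls.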
Practical
- Raw data (FASTQ files)
- QC with FastQC
- Map with BWA
- Visualize with Trackster
- Call variants with mpileup
- Annotate variants with ANNOVAR
Time permitting: call variants with FreeBayes and GATK UnifiedGenotyper, and compare the three callers.
Practical Session
Learn by doing it yourself!
Servers:
- galaxy-training1.trait-ctmm.cloudlet.sara.nl
- galaxy-training2.trait-ctmm.cloudlet.sara.nl
- galaxy-training3.trait-ctmm.cloudlet.sara.nl
- ..
Log in to your account.
All handouts and slides can be found under Shared Data → Data Libraries.
Manual: [Course Manual] EMC Galaxy Training 2: Introduction to Galaxy.pdf