1
From Reads to Results: Exome-seq Analysis at CCBR
Justin Lack, March 8, 2015
2
Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation
3
Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation
Read Generation:
- FASTQ format
- QC analysis
- Read trimming
4
FASTQ Data Format
- Sequence ID
- Sequence
- Quality score: Phred-scaled quality value, i.e., Q10 means a 1/10 error rate, Q20 means 1/100, etc.
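The Phred relationship on this slide can be sketched in a few lines. This is a minimal illustration, assuming the common Phred+33 ASCII encoding (the slide does not state the offset); the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred score: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual]


def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (sequence ID, sequence, quality scores)."""
    header, seq, _plus, qual = [line.rstrip("\n") for line in lines]
    # Drop the leading '@' from the header line to get the sequence ID.
    return header[1:], seq, decode_quality_string(qual)
```

For example, `phred_to_error_prob(20)` gives an error rate of about 1/100, matching the Q20 case on the slide.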
5
Read Quality Assessment
Read quality analysis
- Crucial to ensure high-quality data
- Can reveal issues in library preparation and sequence generation
6
Read Quality Assessment
Read trimming
- Trims reads for both adapter contamination and low-quality bases
- Absolutely essential for variant detection
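The two trimming goals above (adapter removal and low-quality removal) can be illustrated with a simplified sketch. This is not Trimmomatic's implementation: the window size, quality threshold, and exact-match adapter search are assumptions chosen for illustration, and the function names are hypothetical.

```python
def sliding_window_trim(seq, quals, window=4, min_mean_q=20):
    """Cut the read at the first window whose mean quality drops below min_mean_q."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    for start in range(0, max(len(seq) - window + 1, 0)):
        window_quals = quals[start:start + window]
        if sum(window_quals) / window < min_mean_q:
            # Keep everything before the failing window.
            return seq[:start], quals[:start]
    return seq, quals


def clip_adapter(seq, quals, adapter):
    """Remove an exact adapter match and everything after it (no mismatches allowed)."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, quals
    return seq[:idx], quals[:idx]
```

Real trimmers tolerate mismatches in the adapter and handle paired reads together; this sketch only shows the basic idea.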
7
Read Quality Assessment
FastQC and trimming (Trimmomatic)
8
Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation
Read Mapping:
- Map reads to reference genome
- Alignment QC
9
Read Mapping Challenge:
Compare billions of short sequence reads against the human genome (3 Gb)
10
Different Alignment Algorithms
- BWA – 2009
- BWA-SW – 2010
- BWA-MEM – 2013
- Bowtie – 2009
- Bowtie2 – 2012
- Gem – 2012
- Cushaw2 – 2014
- Novoalign
Li, arXiv: (2013)
12
SAM/BAM Format
SAM (Sequence Alignment/Map) format
- Single unified format for storing read alignments to a reference genome
BAM (Binary Alignment/Map) format
- Binary equivalent of SAM
- Advantages: supports indexing, compact size
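To make the SAM layout concrete, here is a minimal sketch that parses one tab-delimited SAM alignment line into its 11 mandatory fields. The helper names are hypothetical; in practice a library such as pysam or samtools itself would be used to read SAM/BAM files.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one alignment line into a dict of mandatory fields plus optional tags."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    record["TAGS"] = cols[11:]  # optional TAG:TYPE:VALUE fields, left unparsed
    return record


def is_duplicate(record):
    """FLAG bit 0x400 marks a PCR or optical duplicate."""
    return bool(record["FLAG"] & 0x400)


def is_unmapped(record):
    """FLAG bit 0x4 marks an unmapped read."""
    return bool(record["FLAG"] & 0x4)
```

Header lines (those starting with `@`) are not handled here and would need to be skipped before calling `parse_sam_line`.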
13
BAM File Format Header Data
14
Alignment QC
Crucial for examining and summarizing quality of alignment at exome targets
GATK DepthOfCoverage
15
Alignment QC
Crucial for examining and summarizing quality of alignment at exome targets
Qualimap
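The kind of summary these tools produce at exome targets can be sketched as follows. This is a simplified illustration, not the method used by GATK or Qualimap: it assumes reads are given as 0-based, half-open (start, end) intervals on a single chromosome, and the function name and `min_depth` default are hypothetical.

```python
def target_coverage(read_intervals, target, min_depth=10):
    """Return (mean depth, fraction of target bases with depth >= min_depth).

    Intervals are 0-based, half-open (start, end) on a single chromosome.
    """
    t_start, t_end = target
    depth = [0] * (t_end - t_start)
    for r_start, r_end in read_intervals:
        # Clip each read to the target before counting.
        lo = max(r_start, t_start)
        hi = min(r_end, t_end)
        for pos in range(lo, hi):
            depth[pos - t_start] += 1
    mean_depth = sum(depth) / len(depth)
    frac_covered = sum(d >= min_depth for d in depth) / len(depth)
    return mean_depth, frac_covered
```

Mean depth and the fraction of target bases above a depth threshold are two commonly reported exome QC numbers; the threshold itself depends on the application.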
16
BAM Visualization - IGV
Integrative Genomics Viewer (figure labels: Mismatches, Reference)
17
Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation
BAM Processing:
- BAM/SAM alignment improvement
18
BAM Improvement
- Short-read mappers are designed to balance accuracy and speed
- The algorithms can introduce errors, especially at challenging indels
- Tools are designed to target specific systematic errors:
  - Remove duplicates
  - Local realignment
  - Base quality recalibration
19
Library Duplicates
- Most next-generation sequencing platforms are NOT single-molecule sequencers
- PCR amplification step in library preparation
  - Can result in duplicate DNA fragments in the final library prep
  - PCR-free protocols do exist, but require large amounts of input DNA
- Can result in false SNP calls
  - Duplicates manifest themselves as high read depth support
20
Duplicates and False SNP Calls
21
Remove Duplicates
- Identify read pairs whose outer ends map to the same position on the genome and remove all but one copy
- Samtools: samtools rmdup or samtools rmdupse
- Picard/GATK: MarkDuplicates
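The duplicate-identification rule above can be sketched as a grouping problem. This is a simplified illustration, not the logic of samtools or Picard, whose real criteria are more detailed. The dict keys (`name`, `chrom`, `left`, `right`, `strand`, `base_qual_sum`) are hypothetical, and keeping the pair with the highest summed base quality is an assumption made for the example.

```python
from collections import defaultdict


def mark_duplicates(pairs):
    """Return the set of read-pair names flagged as duplicates.

    Pairs sharing chromosome, both outer coordinates, and strand form one
    group; the pair with the highest summed base quality is kept.
    """
    groups = defaultdict(list)
    for pair in pairs:
        key = (pair["chrom"], pair["left"], pair["right"], pair["strand"])
        groups[key].append(pair)
    duplicates = set()
    for members in groups.values():
        best = max(members, key=lambda p: p["base_qual_sum"])
        # Every other pair in the group is treated as a duplicate of `best`.
        duplicates.update(p["name"] for p in members if p is not best)
    return duplicates
```

Marking rather than deleting duplicates leaves the reads in the file so downstream tools can decide whether to ignore them.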
22
Local Realignment - indels
The trouble with mapping approaches
26
Local realignment in GATK
- Uses information from known SNPs/indels (dbSNP, 1000 Genomes)
- Uses information from other reads
- Smith-Waterman exhaustive alignment on select reads
- Similar to GATK HaplotypeCaller
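The Smith-Waterman step mentioned above is a dynamic-programming local alignment. Below is a minimal scoring-only sketch with a linear gap penalty. The scoring values are arbitrary assumptions, and this is not GATK's realignment code, which also draws on known indels and evidence from other reads.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between a and b (linear gap penalty)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Flooring at 0 is what makes the alignment local.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best
```

Because the cost grows with the product of the two sequence lengths, exhaustive alignment of this kind is practical only on selected reads and short windows, which is consistent with the slide's "select reads" wording.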
27
Quality scores issued by sequencers are inaccurate and biased
- Quality scores are critical for all downstream analysis
- Systematic biases are a major contributor to bad calls
28
Base Quality Recalibration
Sequence context refers to base composition skews
29
Base Quality Recalibration in GATK
- Align a subsample of reads from a lane to the human reference
- Exclude all known dbSNP sites
- Assume all other mismatches are sequencing errors
- Compute a new calibration table based on mismatch rates per position on the read
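The recalibration steps above can be sketched as a counting exercise. This is a simplified illustration, not GATK's BQSR model (which uses more covariates): the observation tuple layout and the add-one smoothing are assumptions made for the example.

```python
import math
from collections import defaultdict


def build_recalibration_table(observations, known_sites):
    """Tabulate empirical quality per (reported quality, read cycle).

    observations: iterable of (site, reported_q, cycle, is_mismatch) tuples,
    a hypothetical flattened view of aligned bases.
    known_sites: set of sites to skip (for example, known dbSNP positions).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total bases]
    for site, reported_q, cycle, is_mismatch in observations:
        if site in known_sites:
            continue  # known polymorphisms are not counted as errors
        entry = counts[(reported_q, cycle)]
        entry[0] += int(is_mismatch)
        entry[1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing avoids log10(0) when no mismatches are observed.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = -10 * math.log10(error_rate)
    return table
```

Comparing each table entry with its reported quality shows where the sequencer over- or under-estimates its own accuracy.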
30
Base Quality Recalibration
31
Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation
Variant Calling:
- Germline variant detection
- Somatic variant detection
- VCF files
32
Germline Variant Detection
Mutations are hidden in the noise!
33
Germline Variant Detection
- Mutations are hidden in the noise!
- Utilize GATK HaplotypeCaller
34
Germline Variant Detection
- Mutations are hidden in the noise!
- Utilize GATK HaplotypeCaller
- Genotype jointly to maximize information
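To illustrate why real variants must be separated from sequencing noise, here is a toy single-sample genotype likelihood calculation under a binomial read model. It is not HaplotypeCaller's model and does not reproduce joint genotyping; the fixed `error_rate` and the function names are assumptions. It requires Python 3.8 or later for `math.comb`.

```python
import math


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.01):
    """Log10 likelihoods for genotypes 0/0, 0/1, 1/1 under a binomial read model."""
    # Probability that a single read shows the ALT allele, per genotype.
    p_alt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1 - error_rate}
    n = ref_count + alt_count
    log_comb = math.log10(math.comb(n, alt_count))
    return {
        gt: log_comb + alt_count * math.log10(p) + ref_count * math.log10(1 - p)
        for gt, p in p_alt.items()
    }


def most_likely_genotype(ref_count, alt_count, error_rate=0.01):
    """Pick the genotype with the highest likelihood."""
    lls = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(lls, key=lls.get)
```

A single ALT read among 30 is best explained as an error under this model, while a balanced 10/10 split supports a heterozygous call.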
36
Somatic Variant Detection
Genes and chromosomes can mutate in either somatic or germline tissue
Mutation Detection
37
An Example of Germline Variants
Robinson et al. 2011
38
An Example of Somatic Variants
Normal Tumor
39
Somatic Variant Detection
But somatic variant detection can be EXTREMELY difficult
- Allelic fractions do not scale to ploidy
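One reason allelic fractions do not follow simple germline expectations (0.5 or 1.0) is that tumor purity and copy number both shift them. A minimal sketch of the expected fraction follows, assuming a clonal mutation, no sequencing bias, and a diploid normal; the function name and parameters are illustrative.

```python
def expected_allele_fraction(purity, tumor_copy_number, mutated_copies,
                             normal_copy_number=2):
    """Expected variant allele fraction for a clonal somatic mutation.

    purity: fraction of cells in the sample that are tumor (0 to 1).
    tumor_copy_number: total copies of the locus in tumor cells.
    mutated_copies: copies carrying the mutation in tumor cells.
    """
    if not 0 <= purity <= 1:
        raise ValueError("purity must be between 0 and 1")
    total_copies = purity * tumor_copy_number + (1 - purity) * normal_copy_number
    if total_copies == 0:
        raise ValueError("no copies of the locus in the sample")
    return purity * mutated_copies / total_copies
```

For instance, a heterozygous mutation in a diploid tumor at 50% purity is expected at a fraction of 0.25 rather than 0.5.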
40
Somatic Variant Detection
But somatic variant detection can be EXTREMELY difficult
- Multiple additional sources of errors
- Low depth and/or tumor-contaminated normal
- Noise vs. event
41
MuTect2
Somatic caller that attempts to account for and model all of these sources of error
42
Variant Call Format (VCF)
- VCF is a standardized format for storing DNA polymorphism data
  - SNPs, insertions, deletions and structural variants
  - With rich annotations
- Indexed for fast data retrieval of variants from a range of positions
- Store variant information across many samples
- Record meta-data about the site
  - dbSNP accession, filter status, validation status
- Very flexible format
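The points above map onto the fixed columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and per-sample columns). Here is a minimal parsing sketch; the function names are hypothetical, and a real pipeline would normally use an established VCF library rather than hand-rolled parsing.

```python
def parse_info(info):
    """Turn an INFO string such as 'DP=14;DB' into {'DP': '14', 'DB': True}."""
    fields = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flags have no '=' part
    return fields


def parse_vcf_record(line, sample_names):
    """Parse one tab-delimited VCF data line into a dict."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = cols[:8]
    record = {
        "CHROM": chrom, "POS": int(pos), "ID": vid, "REF": ref,
        "ALT": alt.split(","),
        "QUAL": None if qual == "." else float(qual),
        "FILTER": flt, "INFO": parse_info(info), "SAMPLES": {},
    }
    if len(cols) > 9:
        keys = cols[8].split(":")  # FORMAT column names the per-sample keys
        for name, value in zip(sample_names, cols[9:]):
            record["SAMPLES"][name] = dict(zip(keys, value.split(":")))
    return record
```

The `sample_names` list would come from the `#CHROM` header line, which this sketch does not parse.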
43
Example VCF
44
Workflow for Data Analysis
Read Generation → Read Mapping → BAM Processing → Variant Calling → Variant Annotation
Variant Annotation:
- Genome Annotation Databases
- AVIA…
45
Annotation and Functional Prediction
46
dbSNP
dbSNP is a free public archive, developed by NCBI, for genetic variation within and across different species
Sherry, Genome Res. 1999
47
1000 Genomes Project
- 15 million SNPs
- 1 million short insertions/deletions
- 20,000 structural variants
The 1000 Genomes Project Consortium, Nature 2010
48
COSMIC
COSMIC is the most comprehensive resource for exploring the impact of somatic mutations in human cancer
Forbes, Nucleic Acids Research 2015
49
COSMIC
50
Lots, lots more in AVIA!
51
Thank you! Any Questions?