Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.

Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data

This talk  Refresher: Illumina sequencing  QC  What can go wrong  Useful QC statistics  Read mapping  Comparison of popular read mappers  Stampy  Indel and SNP calling  (Some results: 1000 Genomes indel calls)

7x Illumina GA-II 2x Roche 454 1x Illumina HiSeq 2000

1. Refresher: Illumina sequencing

Illumina sequencing

 8 lanes… x 120 tiles x108 bp x 2 reads… = about 48 Gb raw bp

Quality issues  Bases are identified by their fluorescent tag  Overlapping emission spectra  Single base per cycle: reversible terminator chemistry  Not perfect: fraction lags, fraction runs ahead: “Phasing”  Limits read length  Optimizing yield: cluster density  Higher densities mean more errors  Above an optimum, yield decreases  Partly signal processing issue: software improvements  Low amounts of initial DNA  Linker-linker hybrids; duplicated reads

Overlapping fluorescence spectra  C/A and G/T overlap  (Most common mutations are transitions, A-G and C-T) Rougemont et al, 2008

Refresher: Phred scores  Phred score = 10 log 10 ( probability of error )  10: 10% error probability  20: 1% error probability  30: 0.1% error probability (one in 1,000) 3 = 50%, 7 = 20% 13 = 5%, 17 = 2% 23 = 0.5%, 27 = 0.2% 33 = 0.05%, 37 = 0.02%

Phasing August 2009 June 2010

Cluster density & other improvements June 2010: August 2009:

Library complexity, duplicate reads  Some sequences are read several times:  Low amount of initial material, many PCR copies  Optical duplicates; secondary cluster seeding  Problem for variant calling  Any PCR error will be seen twice: evidence for variant  Rate of duplicates is rarely >5%  Criterion: both ends of a PE read map to matching location  Can occur by chance, but low probability, except for very high coverage  Post processing: duplicate removal  Standard processing step (e.g. Samtools, Picard)  Useful statistic:  Duplicate fraction is approximately additive across lanes (same library)  2x duplication fraction ≈ fraction of the library that was sequenced

Library complexity, duplicate reads Fraction α of all molecules is sequenced Number of times a PCR copy is sequenced: Poisson( α ) Expected fraction of duplicates: e - α -1+ α As a fraction of all reads sequenced: (e - α -1+ α )/ α = ½ α + … n =0123… Poissone-αe-α α e-αα e-α α 2 e - α /2 α 3 e - α /6 … Duplicates:0012…

Sequencing QC

QC statistics

QC statistics - coverage

QC statistics – quality scores

GATK recalibration tool

3. Read mapping

Read mapping  First processing step after sequencing:  Read mapping (most times)  Assembly (no reference sequence; specialized analyses)  Quality of mapping determines downstream results  Accessible genome  Biases (ref vs. variant)  Sensitivity (divergent reference; snps, indels, SV)  Specificity (calibration of mapping quality)

Read mapper comparison  Read mappers: Maq BWA Eland Novoalign Stampy  Criteria:  Sensitivity (overall; divergent reference; variants)  Specificity (mapping quality calibration)  Speed

Sensitivity

Sensitivity - indels

Sensitivity – Divergent reference

Specificity – ROC curves ROC - indels

Performance on real data Proportion mapped to within 10kb of mate

Efficiency

Stampy – first part of algorithm read 15 bp subsequence Remove rev-comp symmetry 29 bit word 4 bytes x 2 29 entry (2 Gb) hash table candidate positions open addressing, cache-friendly

Second part: Fast candidate alignment Single-instruction- multiple-data (SIMD), parallel execution Affine gap penalties. Linear-time and constant memory algorithm: DP table in registers. Maximum indel size 15 bp.

Third part: Modeling mapping failures  Pseudo-bayesian posterior (using candidates, rather than all mapping positions)  Failure to find the correct candidate (2 or more mismatches in every 15bp subsequence)  Sequence not in reference (is sequence match better than expected best random match?)

4. SNP and indel calling

SNP calling  General idea:  Works quite well! Some caveats: Include mapping quality: P(read|g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map) Mapping errors are dependent: don’t include mapQ<10 Base errors are not uniform (A/C/G/T): assume worst case (all identical) Assumes no anomalies (seg dups; alignments; indel/SV; …)  Hard problem: be conservative Expected SNP rate (human): 10 -3 /nt. FPR of 10 -5 required for 1% FDR Filtering is required to achieve good FDR – or all data features must be adequately modeled

Indel calling  General idea:  Differences with SNP calling:  Pseudo-Bayes: cannot consider all possible variants/genotypes Generate large set of candidates Filter using goodness-of-fit test  Illumina reads do not have an explicit indel error model

Indel error model Homopolymer run length

Wrap up  GA-II produces large amounts of good data  Artefacts do occur, keep a look at QC statistics  Choice of mapper influences yield and quality  Variant calling:  Bayesian approaches work well  Some assumptions (independence) not met, hard to model  Filtering remains necessary

Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.

Similar presentations

Presentation on theme: "Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.

Similar presentations

Presentation on theme: "Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data."— Presentation transcript:

Similar presentations

About project

Feedback