Presentation is loading. Please wait.

Presentation is loading. Please wait.

Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.

Similar presentations


Presentation on theme: "Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data."— Presentation transcript:

1 Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data

2 This talk  Refresher: Illumina sequencing  QC  What can go wrong  Useful QC statistics  Read mapping  Comparison of popular read mappers  Stampy  Indel and SNP calling  (Some results: 1000 Genomes indel calls)

3 7x Illumina GA-II 2x Roche 454 1x Illumina HiSeq 2000

4 1. Refresher: Illumina sequencing

5 Illumina sequencing

6

7

8  8 lanes… x 120 tiles x108 bp x 2 reads… = about 48 Gb raw bp

9 2. QC

10 Quality issues  Bases are identified by their fluorescent tag  Overlapping emission spectra  Single base per cycle: reversible terminator chemistry  Not perfect: fraction lags, fraction runs ahead: “Phasing”  Limits read length  Optimizing yield: cluster density  Higher densities mean more errors  Above an optimum, yield decreases  Partly signal processing issue: software improvements  Low amounts of initial DNA  Linker-linker hybrids; duplicated reads

11 Overlapping fluorescence spectra  C/A and G/T overlap  (Most common mutations are transitions, A-G and C-T) Rougemont et al, 2008

12 Refresher: Phred scores  Phred score = 10 log 10 ( probability of error )  10: 10% error probability  20: 1% error probability  30: 0.1% error probability (one in 1,000) 3 = 50%, 7 = 20% 13 = 5%, 17 = 2% 23 = 0.5%, 27 = 0.2% 33 = 0.05%, 37 = 0.02%

13 Phasing August 2009 June 2010

14 Cluster density & other improvements June 2010: August 2009:

15 Library complexity, duplicate reads  Some sequences are read several times:  Low amount of initial material, many PCR copies  Optical duplicates; secondary cluster seeding  Problem for variant calling  Any PCR error will be seen twice: evidence for variant  Rate of duplicates is rarely >5%  Criterion: both ends of a PE read map to matching location  Can occur by chance, but low probability, except for very high coverage  Post processing: duplicate removal  Standard processing step (e.g. Samtools, Picard)  Useful statistic:  Duplicate fraction is approximately additive across lanes (same library)  2x duplication fraction ≈ fraction of the library that was sequenced

16 Library complexity, duplicate reads Fraction α of all molecules is sequenced Number of times a PCR copy is sequenced: Poisson( α ) Expected fraction of duplicates: e - α -1+ α As a fraction of all reads sequenced: (e - α -1+ α )/ α = ½ α + … n =0123… Poissone-αe-α α e-αα e-α α 2 e - α /2 α 3 e - α /6 … Duplicates:0012…

17 Sequencing QC

18 QC statistics

19 QC statistics - coverage

20 QC statistics – quality scores

21 GATK recalibration tool

22 3. Read mapping

23 Read mapping  First processing step after sequencing:  Read mapping (most times)  Assembly (no reference sequence; specialized analyses)  Quality of mapping determines downstream results  Accessible genome  Biases (ref vs. variant)  Sensitivity (divergent reference; snps, indels, SV)  Specificity (calibration of mapping quality)

24 Read mapper comparison  Read mappers: Maq BWA Eland Novoalign Stampy  Criteria:  Sensitivity (overall; divergent reference; variants)  Specificity (mapping quality calibration)  Speed

25 Sensitivity

26 Sensitivity - indels

27 Sensitivity – Divergent reference

28 Specificity – ROC curves ROC - indels

29 Performance on real data Proportion mapped to within 10kb of mate

30 Efficiency

31 Stampy – first part of algorithm read 15 bp subsequence Remove rev-comp symmetry 29 bit word 4 bytes x 2 29 entry (2 Gb) hash table candidate positions open addressing, cache-friendly

32 Second part: Fast candidate alignment Single-instruction- multiple-data (SIMD), parallel execution Affine gap penalties. Linear-time and constant memory algorithm: DP table in registers. Maximum indel size 15 bp.

33 Third part: Modeling mapping failures  Pseudo-bayesian posterior (using candidates, rather than all mapping positions)  Failure to find the correct candidate (2 or more mismatches in every 15bp subsequence)  Sequence not in reference (is sequence match better than expected best random match?)

34 4. SNP and indel calling

35 SNP calling  General idea:  Works quite well! Some caveats: Include mapping quality: P(read|g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map) Mapping errors are dependent: don’t include mapQ<10 Base errors are not uniform (A/C/G/T): assume worst case (all identical) Assumes no anomalies (seg dups; alignments; indel/SV; …)  Hard problem: be conservative Expected SNP rate (human): 10 -3 /nt. FPR of 10 -5 required for 1% FDR Filtering is required to achieve good FDR – or all data features must be adequately modeled

36 Indel calling  General idea:  Differences with SNP calling:  Pseudo-Bayes: cannot consider all possible variants/genotypes Generate large set of candidates Filter using goodness-of-fit test  Illumina reads do not have an explicit indel error model

37 Indel error model Homopolymer run length

38 Wrap up  GA-II produces large amounts of good data  Artefacts do occur, keep a look at QC statistics  Choice of mapper influences yield and quality  Variant calling:  Bayesian approaches work well  Some assumptions (independence) not met, hard to model  Filtering remains necessary


Download ppt "Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data."

Similar presentations


Ads by Google