Gerton Lunter Wellcome Trust Centre for Human Genetics From calling bases to calling variants: Experiences with Illumina data.


This talk
- Refresher: Illumina sequencing
- QC
  - What can go wrong
  - Useful QC statistics
- Read mapping
  - Comparison of popular read mappers
  - Stampy
- Indel and SNP calling
- (Some results: 1000 Genomes indel calls)

7x Illumina GA-II, 2x Roche 454, 1x Illumina HiSeq 2000

1. Refresher: Illumina sequencing

Illumina sequencing

- 8 lanes … x 120 tiles x 108 bp x 2 reads … = about 48 Gb of raw sequence

2. QC

Quality issues
- Bases are identified by their fluorescent tag
  - Overlapping emission spectra
- Single base per cycle: reversible terminator chemistry
  - Not perfect: a fraction lags, a fraction runs ahead ("phasing")
  - Limits read length
- Optimizing yield: cluster density
  - Higher densities mean more errors
  - Above an optimum, yield decreases
  - Partly a signal-processing issue: software improvements
- Low amounts of initial DNA
  - Linker-linker hybrids; duplicated reads

Overlapping fluorescence spectra
- C/A and G/T overlap
- (Most common mutations are transitions, A-G and C-T)
Rougemont et al, 2008

Refresher: Phred scores
- Phred score = -10 log10(probability of error)
- 10: 10% error probability
- 20: 1% error probability
- 30: 0.1% error probability (one in 1,000)

Intermediate values (approximate):
3 = 50%, 7 = 20%
13 = 5%, 17 = 2%
23 = 0.5%, 27 = 0.2%
33 = 0.05%, 37 = 0.02%
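The conversion between Phred scores and error probabilities can be sketched in a few lines. This is an illustrative helper, not part of the talk; the function names are made up for the example.

```python
import math

def phred_to_error_prob(q):
    """Error probability implied by Phred score q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred score for error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)

if __name__ == "__main__":
    # Print the error probability for the scores listed on the slide.
    for q in (3, 7, 10, 13, 17, 20, 23, 27, 30, 33, 37):
        print(f"Q{q}: {100 * phred_to_error_prob(q):.3g}% error")
```

The intermediate values on the slide are rounded: Q3, for example, corresponds to roughly a 50% error probability.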

Phasing (figure panels: August 2009, June 2010)

Cluster density & other improvements (figure panels: August 2009, June 2010)

Library complexity, duplicate reads
- Some sequences are read several times:
  - Low amount of initial material, many PCR copies
  - Optical duplicates; secondary cluster seeding
- Problem for variant calling
  - Any PCR error will be seen twice: spurious evidence for a variant
  - Rate of duplicates is rarely >5%
- Criterion: both ends of a paired-end (PE) read map to matching locations
  - Can occur by chance, but with low probability, except at very high coverage
- Post-processing: duplicate removal
  - Standard processing step (e.g. Samtools, Picard)
- Useful statistic:
  - Duplicate fraction is approximately additive across lanes (same library)
  - 2x duplicate fraction ≈ fraction of the library that was sequenced

Library complexity, duplicate reads
- Fraction α of all molecules is sequenced
- Number of times a PCR copy is sequenced: Poisson(α)

n:           0        1          2              3              ...
Poisson:     e^{-α}   α e^{-α}   α^2 e^{-α}/2   α^3 e^{-α}/6   ...
Duplicates:  0        0          1              2              ...

- Expected fraction of duplicates: e^{-α} - 1 + α
- As a fraction of all reads sequenced: (e^{-α} - 1 + α)/α = ½ α + ...
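As a rough check of the Poisson argument, the sketch below evaluates the duplicate fraction (e^{-α} - 1 + α)/α and inverts it numerically to estimate α from an observed duplicate fraction. The function names and the bisection bounds are assumptions made for this example.

```python
import math

def duplicate_fraction(alpha):
    """Expected duplicates as a fraction of all reads: (e^-a - 1 + a) / a."""
    # math.expm1(-alpha) equals e^-alpha - 1 and stays accurate for small alpha.
    return (math.expm1(-alpha) + alpha) / alpha

def estimate_alpha(dup_frac, lo=1e-9, hi=50.0):
    """Invert duplicate_fraction by bisection (it increases with alpha)."""
    if not 0.0 < dup_frac < duplicate_fraction(hi):
        raise ValueError("duplicate fraction outside the supported range")
    for _ in range(100):
        mid = (lo + hi) / 2
        if duplicate_fraction(mid) < dup_frac:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

if __name__ == "__main__":
    for observed in (0.01, 0.05, 0.20):
        print(observed, estimate_alpha(observed), 2 * observed)
```

For small duplicate fractions the estimate stays close to the rule of thumb α ≈ 2 x duplicate fraction; the two drift apart as the library approaches saturation.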

Sequencing QC

QC statistics

QC statistics - coverage

QC statistics – quality scores

GATK recalibration tool

3. Read mapping

Read mapping
- First processing step after sequencing:
  - Read mapping (most times)
  - Assembly (no reference sequence; specialized analyses)
- Quality of mapping determines downstream results
  - Accessible genome
  - Biases (ref vs. variant)
  - Sensitivity (divergent reference; SNPs, indels, SV)
  - Specificity (calibration of mapping quality)

Read mapper comparison
- Read mappers: Maq, BWA, Eland, Novoalign, Stampy
- Criteria:
  - Sensitivity (overall; divergent reference; variants)
  - Specificity (mapping quality calibration)
  - Speed

Sensitivity

Sensitivity - indels

Sensitivity – Divergent reference

Specificity – ROC curves (second panel: ROC for indels)

Performance on real data: proportion mapped to within 10 kb of mate

Efficiency

Stampy – first part of algorithm
read → 15 bp subsequence → remove rev-comp symmetry → 29-bit word → hash table → candidate positions
- Hash table: 4 bytes x 2^29 entries (2 GB); open addressing, cache-friendly
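One way to see why a 15 bp subsequence fits in a 29-bit word: 15 bases at 2 bits each give 30 bits, and because 15 is odd, no 15-mer equals its own reverse complement, so identifying each 15-mer with its reverse complement halves the space to exactly 2^29 classes. The sketch below is a hypothetical encoding chosen for illustration (pick the strand whose middle base is A or C, then store that base in one bit). The slides do not say how Stampy actually removes the symmetry, and all function names here are made up.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # complement of code x is 3 - x
K = 15

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def encode(kmer):
    """Pack a k-mer into an integer, 2 bits per base, first base highest."""
    word = 0
    for base in kmer:
        word = (word << 2) | CODE[base]
    return word

def revcomp_word(word, k=K):
    """Reverse complement of a packed k-mer."""
    rc = 0
    for _ in range(k):
        rc = (rc << 2) | (3 - (word & 3))
        word >>= 2
    return rc

def strand_free_word(kmer):
    """Map a 15-mer and its reverse complement to the same 29-bit word."""
    assert len(kmer) == K
    word = encode(kmer)
    mid_shift = 2 * (K // 2)               # bit offset of the middle base
    if (word >> mid_shift) & 3 >= 2:       # middle base is G or T: flip strand
        word = revcomp_word(word)
    low = word & ((1 << mid_shift) - 1)    # 7 bases after the middle (14 bits)
    mid = (word >> mid_shift) & 1          # middle base is A or C: 1 bit
    high = word >> (mid_shift + 2)         # 7 bases before the middle (14 bits)
    return (high << (mid_shift + 1)) | (mid << mid_shift) | low

if __name__ == "__main__":
    kmer = "ACGTACGTACGTACG"
    print(strand_free_word(kmer), strand_free_word(revcomp(kmer)))
```

A word produced this way could index a table of 2^29 four-byte entries directly, which matches the 2 GB figure on the slide.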

Second part: Fast candidate alignment
- Single-instruction-multiple-data (SIMD), parallel execution
- Affine gap penalties
- Linear-time and constant-memory algorithm: DP table in registers
- Maximum indel size 15 bp
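The idea of a banded affine-gap alignment can be illustrated with a scalar sketch. This is not Stampy's SIMD implementation: the scoring values are arbitrary assumptions, and for readability each row is allocated at full length, so only the number of evaluated cells (at most 2 x band + 1 per row) is linear in the read length. A gap of length L costs gap_open + (L - 1) x gap_extend.

```python
NEG = float("-inf")

def banded_affine_score(read, ref, band=15, match=1, mismatch=-3,
                        gap_open=-5, gap_extend=-2):
    """Global alignment score with affine gaps, restricted to |i - j| <= band."""
    n, m = len(read), len(ref)
    if abs(n - m) > band:
        return NEG  # the end cell lies outside the band
    # M: ends in a (mis)match; X: ends with a read base against a gap;
    # Y: ends with a reference base against a gap.
    prev_m, prev_x, prev_y = [NEG] * (m + 1), [NEG] * (m + 1), [NEG] * (m + 1)
    prev_m[0] = 0
    for j in range(1, min(m, band) + 1):
        prev_y[j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        cur_m, cur_x, cur_y = [NEG] * (m + 1), [NEG] * (m + 1), [NEG] * (m + 1)
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur_x[0] = gap_open + (i - 1) * gap_extend
                continue
            s = match if read[i - 1] == ref[j - 1] else mismatch
            cur_m[j] = max(prev_m[j - 1], prev_x[j - 1], prev_y[j - 1]) + s
            cur_x[j] = max(prev_m[j] + gap_open, prev_x[j] + gap_extend,
                           prev_y[j] + gap_open)
            cur_y[j] = max(cur_m[j - 1] + gap_open, cur_y[j - 1] + gap_extend,
                           cur_x[j - 1] + gap_open)
        prev_m, prev_x, prev_y = cur_m, cur_x, cur_y
    return max(prev_m[m], prev_x[m], prev_y[m])

if __name__ == "__main__":
    print(banded_affine_score("ACGTCGT", "ACGTACGT"))
```

Only two rows are alive at any time, which is the property that lets a production implementation keep the working set in registers.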

Third part: Modeling mapping failures
- Pseudo-Bayesian posterior (using candidates, rather than all mapping positions)
- Failure to find the correct candidate (2 or more mismatches in every 15 bp subsequence)
- Sequence not in reference (is sequence match better than expected best random match?)
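A minimal sketch of how the three terms might combine into a Phred-scaled mapping quality, assuming each candidate position comes with a likelihood P(read | position). The two extra mass terms stand in for "correct candidate missed" and "sequence not in reference"; how Stampy computes them is not given in the slides, so here they are plain hypothetical inputs, as is the cap of 60.

```python
import math

def pseudo_posterior_mapq(candidate_likelihoods, missed_mass=0.0,
                          not_in_ref_mass=0.0, cap=60.0):
    """Phred-scaled probability that the best candidate is NOT the true origin.

    candidate_likelihoods: P(read | candidate position), one per candidate.
    missed_mass, not_in_ref_mass: hypothetical likelihood mass assigned to
    failure modes outside the candidate list.
    """
    if not candidate_likelihoods:
        return 0.0
    ranked = sorted(candidate_likelihoods, reverse=True)
    best = ranked[0]
    wrong = sum(ranked[1:]) + missed_mass + not_in_ref_mass
    total = best + wrong
    if total <= 0:
        return 0.0
    p_wrong = wrong / total
    if p_wrong <= 0:
        return cap
    return min(cap, -10 * math.log10(p_wrong))

if __name__ == "__main__":
    print(pseudo_posterior_mapq([1e-3, 1e-5]))        # one clearly better hit
    print(pseudo_posterior_mapq([1e-3, 1e-3]))        # two equally good hits
    print(pseudo_posterior_mapq([1e-3], missed_mass=1e-4))
```

The normalisation runs over the candidate list only, which is what makes the posterior "pseudo"-Bayesian: a full posterior would sum over every position in the genome.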

4. SNP and indel calling

SNP calling
- General idea:
- Works quite well! Some caveats:
  - Include mapping quality:
    P(read | g) = P(read | wrong map) P(wrong map) + P(read | g, correct map) P(correct map)
  - Mapping errors are dependent: don't include mapQ<10
  - Base errors are not uniform (A/C/G/T): assume worst case (all identical)
  - Assumes no anomalies (seg dups; alignments; indel/SV; …)
- Hard problem: be conservative
  - Expected SNP rate (human): [value missing] /nt. FPR of [value missing] required for 1% FDR
  - Filtering is required to achieve good FDR, or all data features must be adequately modeled
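The mapping-quality mixture above can be dropped into a simple diploid genotype posterior. The sketch below assumes independent reads, a flat 1/4 likelihood for a mismapped read, and the "worst case" base-error model in which every error produces the same wrong base (probability e rather than e/3). The priors are supplied by the caller; the values in the usage lines are purely illustrative and are not the rates from the slide.

```python
import math

def phred_to_prob(q):
    return 10 ** (-q / 10)

def genotype_posteriors(reads, ref, alt, priors, min_mapq=10):
    """Posterior over (ref,ref), (ref,alt), (alt,alt) at one site.

    reads: list of (base, base_quality, mapping_quality) tuples.
    priors: dict mapping each genotype tuple to its prior probability.
    """
    genotypes = [(ref, ref), (ref, alt), (alt, alt)]
    log_post = {g: math.log(priors[g]) for g in genotypes}
    for base, base_q, map_q in reads:
        if map_q < min_mapq:          # caveat from the slide: drop mapQ < 10
            continue
        e = phred_to_prob(base_q)     # base error probability
        w = phred_to_prob(map_q)      # P(wrong map)
        for g in genotypes:
            # Each allele of the genotype is equally likely to be sampled.
            p_correct = sum((1 - e) if base == a else e for a in g) / 2
            # P(read|g) = P(read|wrong map) P(wrong) + P(read|g, correct) P(correct)
            log_post[g] += math.log(w * 0.25 + (1 - w) * p_correct)
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: wgt / total for g, wgt in weights.items()}

if __name__ == "__main__":
    # Illustrative priors only (hypothetical values).
    priors = {("A", "A"): 0.9985, ("A", "G"): 0.001, ("G", "G"): 0.0005}
    reads = [("A", 30, 60)] * 10 + [("G", 30, 60)] * 10
    print(genotype_posteriors(reads, "A", "G", priors))
```

Because the model treats reads as independent, it will be overconfident wherever that assumption fails, which is one reason filtering is still needed downstream.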

Indel calling
- General idea:
- Differences with SNP calling:
- Pseudo-Bayes: cannot consider all possible variants/genotypes
  - Generate large set of candidates
  - Filter using goodness-of-fit test
- Illumina reads do not have an explicit indel error model

Indel error model (figure axis: homopolymer run length)

Wrap up
- GA-II produces large amounts of good data
- Artefacts do occur; keep an eye on QC statistics
- Choice of mapper influences yield and quality
- Variant calling:
  - Bayesian approaches work well
  - Some assumptions (independence) not met, hard to model
  - Filtering remains necessary