1
Next-generation sequencing: informatics & software aspects. Gabor T. Marth, Boston College Biology Department
2
Next-gen data
3
Read length [figure: read length in bp, axis 0-400; ranges shown: ~200-450 (variable), 25-70 (fixed), 25-50 (fixed), 20-60 (variable)]
4
Paired fragment-end reads. Fragment amplification: fragment length 100 - 600 bp; fragment length limited by amplification efficiency. Circularization: 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007). Paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if the fragment length constraint is used to rescue non-uniquely mapping ends); instrumental for structural variation discovery
5
Representational biases: “dispersed” coverage distribution; this affects genome resequencing (deeper starting read coverage is needed); the major impact will be on counting applications
6
Amplification errors: many reads come from clonal copies of a single fragment; an early amplification error gets propagated into every clonal copy; early PCR errors in “clonal” read copies lead to false positive allele calls
7
Read quality
8
Error rate (Solexa)
9
Error rate (454)
10
Per-read errors (Solexa)
11
Per read errors (454)
12
Applications
13
Genome resequencing for variation discovery: SNPs, short INDELs, structural variations; the most immediate application area
14
Genome resequencing for mutational profiling: organismal reference sequence; likely to change “classical genetics” and mutational analysis
15
De novo genome sequencing (Lander et al. Nature 2001): difficult problem with short reads; promising, especially as reads get longer
16
Identification of protein-bound DNA (ChIP-seq): chromatin structure (Mikkelsen et al. Nature 2007); transcription factor binding sites (Robertson et al. Nature Methods, 2007); DNA methylation (Meissner et al. Nature 2008); natural applications for next-gen sequencers
17
Transcriptome sequencing: transcript discovery (Mortazavi et al. Nature Methods 2008; Ruby et al. Cell, 2006); high-throughput, but short reads pose challenges
18
Transcriptome sequencing: expression profiling (Jones-Rhoades et al. PLoS Genetics, 2007; Cloonan et al. Nature Methods, 2008); high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays
19
Analysis software (resequencing)
20
Individual resequencing: (i) base calling; (ii) read mapping; (iii) read assembly; (iv) SNP and short INDEL calling; (v) SV calling; (vi) data validation, hypothesis generation [figure labels: REF, IND]
21
The variation discovery “toolbox”: base callers, read mappers, SNP callers, SV callers, assembly viewers
22
1. Base calling: base sequence; base quality (Q-value) sequence; diverse chemistry & sequencing error profiles
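The base quality (Q-value) mentioned above is conventionally a Phred-scaled error probability, Q = -10 log10(P_error). A minimal sketch of the conversion follows; the function names are illustrative and not taken from any base caller named in the talk.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality value to a base-call error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p_error):
    """Convert a base-call error probability (0 < p <= 1) to a Phred quality value."""
    return -10 * math.log10(p_error)


# Q20 corresponds to 1 error in 100 calls, Q30 to 1 error in 1000 calls.
for q in (20, 30, 40):
    print(q, phred_to_error_prob(q))
```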
23
454 pyrosequencer error profile: multiple bases in a homo-polymeric run are incorporated in a single incorporation test; the number of bases must be determined from a single scalar signal; the majority of errors are INDELs
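To illustrate why pyrosequencing errors are mostly INDELs, here is a deliberately naive sketch that calls the homopolymer length in each flow by rounding the scalar signal to the nearest integer. This is not the native 454 base caller or PYROBAYES; the function name, the `TACG` flow order and the example signal values are assumptions made for illustration.

```python
def naive_homopolymer_calls(flow_signals, flow_order="TACG"):
    """Call bases from per-flow signals by rounding each signal to an integer.

    Each flow tests one nucleotide; the rounded signal is taken as the number
    of identical bases incorporated in that flow (0 means no incorporation).
    """
    called = []
    for i, signal in enumerate(flow_signals):
        nucleotide = flow_order[i % len(flow_order)]
        # Round half up and clamp at zero so noisy negative signals call 0 bases.
        run_length = max(0, int(signal + 0.5))
        called.append(nucleotide * run_length)
    return "".join(called)


# A clean signal of 2.9 in the C flow is called as three Cs ...
print(naive_homopolymer_calls([1.02, 0.05, 2.9, 0.1]))
# ... while a borderline signal near 2.5 can flip between two and three Cs,
# which shows up in the read as an insertion or deletion, not a substitution.
print(naive_homopolymer_calls([1.02, 0.05, 2.45, 0.1]))
```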
24
454 base quality values: the native 454 base caller assigns base quality values that are too low
25
PYROBAYES: determine base number
26
PYROBAYES: Performance: better correlation between assigned and measured quality values; higher fraction of high-quality bases
27
Base quality value calibration: raw Illumina reads (1000G data)
28
Recalibrated base quality values (Illumina): recalibrated Illumina reads (1000G data)
29
2. Read mapping. Read mapping is like doing a jigsaw puzzle… …you get the pieces… … and they give you the picture on the box. Unique pieces are easier to place than others…
30
Non-uniqueness of reads confounds mapping: reads from repeats cannot be uniquely mapped back to their true region of origin; RepeatMasker does not capture all micro-repeats, i.e. repeats at the scale of the read length
31
Strategies to deal with non-unique mapping: a read mapping to multiple loci requires the assignment of alignment probabilities (mapping qualities) [figure values: 0.8, 0.19, 0.01]. Non-unique read mapping: optionally either report only uniquely mapped reads or report all map locations for each read (mapping quality values for all mapped reads are being implemented)
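One common way to turn per-locus alignment probabilities into mapping qualities is to Phred-scale the probability that a given placement is wrong. The sketch below assumes the per-locus weights are already available (for example 0.8, 0.19 and 0.01); the function name and the quality cap are illustrative choices, not the scheme of any specific mapper.

```python
import math


def mapping_qualities(locus_weights, max_quality=60.0):
    """Return (posterior, mapping quality) for each candidate map location.

    locus_weights: non-negative alignment weights, one per candidate locus.
    The posterior of a locus is its weight divided by the sum of all weights;
    the mapping quality is -10 * log10(1 - posterior), capped at max_quality.
    """
    total = sum(locus_weights)
    results = []
    for weight in locus_weights:
        posterior = weight / total
        p_wrong = 1.0 - posterior
        if p_wrong <= 0.0:
            quality = max_quality  # unique placement: avoid log10(0)
        else:
            quality = min(max_quality, -10 * math.log10(p_wrong))
        results.append((posterior, quality))
    return results


for posterior, quality in mapping_qualities([0.8, 0.19, 0.01]):
    print(f"posterior={posterior:.2f} mapping quality={quality:.1f}")
```

Even the best placement in this example has a mapping quality of only about 7, which is why downstream callers often filter or down-weight such reads.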
32
Longer reads are easier to map 454 FLX (1000G data)
33
Paired-end reads help unique read placement. PE (fragment amplification): fragment length 100 - 600 bp; fragment length limited by amplification efficiency. MP (circularization): 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007). PE reads are now the standard for genome resequencing
34
MOSAIK
35
INDEL alleles/errors – gapped alignments 454
36
Aligning multiple read types together: ABI/capillary, 454 FLX, 454 GS20, Illumina. Alignment and co-assembly of multiple read types permits simultaneous analysis of data from multiple sources and error characteristics
37
Aligner speed
38
3. Polymorphism / mutation detection [figure labels: sequencing error; polymorphism]
39
Allele calling in “trad” sequences: capillary sequences are either clonal or diploid traces
40
Allele calling in next-gen data [figure labels: SNP, INS]: new technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection
41
Human genome polymorphism projects common SNPs
42
Human genome polymorphism discovery
43
The 1000 Genomes Project
44
New challenges for SNP calling: deep alignments of 100s / 1000s of individuals; trio sequences
45
Rare alleles in 100s / 1,000s of samples
46
Allele discovery is a multi-step sampling process: Population → Samples → Reads → Allele detection
47
Capturing the allele in the sample
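The first sampling step can be quantified with a standard binomial argument: an allele at population frequency f is present in a sample of n diploid individuals with probability 1 - (1 - f)^(2n), assuming chromosomes are drawn independently. The sketch below is an illustration under that assumption; the function name is hypothetical.

```python
def prob_allele_in_sample(allele_freq, n_individuals, ploidy=2):
    """Probability that at least one sampled chromosome carries the allele.

    Assumes the ploidy * n_individuals chromosomes are independent draws
    from a population in which the allele has frequency allele_freq.
    """
    n_chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


# A 1% allele is usually present among 100 individuals,
# while a 0.1% allele usually is not.
for freq in (0.01, 0.001):
    print(freq, round(prob_allele_in_sample(freq, 100), 3))
```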
48
Allele calling in deep sequence data [figure: aligned reads aatgtagtaAgtacctac / aatgtagtaCgtacctac / aatgtagtaAgtacctac; table with columns Q30, Q40, Q50, Q60 and rows 1, 2, 3; surviving values: 0.01, 0.1, 0.5, 0.82, 1.0]
49
Allele calling in the reads: base call; sample size; individual read coverage; base quality
50
More samples or deeper coverage / sample? Shallower read coverage from more individuals… …or deeper coverage from fewer samples? (simulation analysis by Aaron Quinlan)
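The trade-off can be illustrated with a toy analytic model; this is not the simulation by Aaron Quinlan referenced on the slide. The assumptions are: a fixed total sequencing budget, Poisson-distributed read depth on each chromosome, and an allele counted as detected when at least `min_reads` reads from a carrier chromosome show it. All names and parameter values are illustrative.

```python
import math


def prob_at_least_k(mean_depth, k):
    """P(X >= k) for X ~ Poisson(mean_depth)."""
    cdf = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - cdf


def detection_prob(allele_freq, n_individuals, coverage_per_individual, min_reads=2):
    """Probability that the allele is both sampled and seen in enough reads."""
    # Each of the two chromosomes of a diploid individual gets half the coverage.
    per_chromosome_depth = coverage_per_individual / 2.0
    seen = prob_at_least_k(per_chromosome_depth, min_reads)
    # A chromosome contributes a detection if it carries the allele and is seen.
    return 1.0 - (1.0 - allele_freq * seen) ** (2 * n_individuals)


# Fixed budget of 400x total coverage, split across different sample counts.
total_coverage = 400
for n in (400, 100, 40, 10):
    coverage = total_coverage / n
    print(n, coverage, round(detection_prob(0.01, n, coverage), 3))
```

Under these assumptions, detection of a 1% allele is best at an intermediate design: very shallow coverage rarely yields enough supporting reads, while very few samples rarely contain the allele at all.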
51
Analysis indicates a balance
52
SNP calling in trios: the child inherits one chromosome from each parent; there is a small probability for a mutation in the child
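The inheritance constraint can be written as a prior on the child's genotype given the parents' genotypes. The sketch below assumes a biallelic site, genotypes coded as the number of alternate alleles (0, 1 or 2), and a simple model in which each transmitted allele flips with a small probability; the function names and the default mutation rate are illustrative assumptions.

```python
def transmit_alt_prob(parent_genotype, mutation_rate=1e-8):
    """Probability that a parent transmits the alternate allele.

    parent_genotype: number of alternate alleles the parent carries (0, 1, 2).
    The transmitted allele is drawn at random from the parent's two alleles
    and then flips to the other allele with probability mutation_rate.
    """
    p_alt = parent_genotype / 2.0
    return p_alt * (1.0 - mutation_rate) + (1.0 - p_alt) * mutation_rate


def child_genotype_probs(mother, father, mutation_rate=1e-8):
    """Prior probability of each child genotype given both parental genotypes."""
    pm = transmit_alt_prob(mother, mutation_rate)
    pf = transmit_alt_prob(father, mutation_rate)
    return {
        0: (1.0 - pm) * (1.0 - pf),
        1: pm * (1.0 - pf) + (1.0 - pm) * pf,
        2: pm * pf,
    }


# Two heterozygous parents give the familiar 1:2:1 ratio;
# two homozygous-reference parents leave only a tiny de novo probability.
print(child_genotype_probs(1, 1))
print(child_genotype_probs(0, 0))
```

In a trio caller, a prior like this would be multiplied by the read-based genotype likelihoods of all three individuals, so that a child genotype inconsistent with the parents needs strong read evidence.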
53
SNP calling in trios [figure: aligned reads aatgtagtaAgtacctac and aatgtagtaCgtacctac for mother, father and child; probabilities shown: P=0.79, P=0.86]
54
Determining genotype directly from sequence [figure: reads AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA; individual 1, individual 2, individual 3; genotype labels A/C, C/C, C/C, A/A]
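A minimal sketch of calling a diploid genotype from the bases observed at one site, using a widely used likelihood model rather than the specific method behind the slide: each read samples one of the two alleles with probability 1/2, a base call is wrong with the probability implied by its Phred quality, and errors are spread evenly over the three other bases. A flat prior over the three genotypes is assumed, and all names are illustrative.

```python
def genotype_likelihoods(bases, quals, alleles=("A", "C")):
    """Likelihood of the observed bases under each diploid genotype."""
    first, second = alleles
    likelihoods = {}
    for genotype in ((first, first), (first, second), (second, second)):
        likelihood = 1.0
        for base, qual in zip(bases, quals):
            err = 10 ** (-qual / 10)  # Phred quality to error probability
            # Average over which of the two chromosomes the read came from.
            likelihood *= sum(
                0.5 * ((1.0 - err) if base == allele else err / 3.0)
                for allele in genotype
            )
        likelihoods["/".join(genotype)] = likelihood
    return likelihoods


def call_genotype(bases, quals, alleles=("A", "C")):
    """Return the most probable genotype and its posterior under a flat prior."""
    likelihoods = genotype_likelihoods(bases, quals, alleles)
    total = sum(likelihoods.values())
    posteriors = {g: l / total for g, l in likelihoods.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors[best]


print(call_genotype("AAAA", [30, 30, 30, 30]))
print(call_genotype("ACAC", [30, 30, 30, 30]))
print(call_genotype("CCCC", [30, 30, 30, 30]))
```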
55
4. Structural variation discovery
56
SV events from PE read mapping patterns
57
Deletion: Aberrant positive mapping distance
58
Copy number estimation from depth of coverage
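In its simplest form, depth-based copy number estimation scales the observed read depth in a window by the depth expected for a normal diploid region. The sketch below assumes uniform coverage and ignores the representational biases discussed earlier in the talk, which a real tool would have to correct for; the function name and example numbers are illustrative.

```python
def copy_number_estimates(window_depths, expected_diploid_depth, ploidy=2):
    """Estimate integer copy number per window from read depth.

    window_depths: mean read depth observed in each genomic window.
    expected_diploid_depth: mean depth in regions known to have normal ploidy.
    """
    return [
        round(ploidy * depth / expected_diploid_depth)
        for depth in window_depths
    ]


# With 30x expected depth: normal, heterozygous deletion,
# single-copy gain and homozygous deletion.
print(copy_number_estimates([30.0, 15.0, 45.0, 0.0], 30.0))
```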
59
Spanner – a hybrid SV/CNV detection tool [screenshot panels: navigation bar; fragment lengths in selected region; depth of coverage in selected region]
60
5. Data visualization: 1. aid software development: integration of trace data viewing, fast navigation, zooming/panning; 2. facilitate data validation (e.g. SNP validation): simultaneous viewing of multiple read types, quality value displays; 3. promote hypothesis generation: integration of annotation tracks
61
Data visualization
62
New analysis tools are needed: 1. Tailoring existing tools for specialized applications (e.g. read mappers for transcriptome sequencing); 2. Analysis pipelines and viewers that focus on the essential results, e.g. the few mutations in a mutant, or compare 1000 genome sequences (but hide most details); 3. Work-bench style tools to support downstream analysis
63
Data storage and data standards
64
What level of data to store? Images, traces, base quality values, base-called reads
65
Data standards: Sequence Read Format, SRF (Asim Siddiqui, UBC), ssrformat@ubc.ca; Assembly format working group, http://assembly.bc.edu; Genotype Likelihood Format (Richard Durbin, Sanger)
66
Summary
67
Conclusions: next-gen sequencing software. Next-generation sequencing is a boon for mass-scale human resequencing, whole-genome mutational profiling, expression analysis and epigenetic studies. Informatics tools are already effective for basic applications. There is a need both for “generic” analysis tools (e.g. flexible read aligners) and for specialized tools tailored to specific applications (e.g. expression profiling). Move toward tools that focus on biological analysis. Most challenges are technical in nature (e.g. data storage, useful data formats, fast read mapping)
68
Roche / 454 system: pyrosequencing technology; variable read-length; the only new technology with >100bp reads
69
Illumina / Solexa Genome Analyzer: fixed-length short-read sequencer; very high throughput; read properties are very close to traditional capillary sequences; low INDEL error rate
70
AB / SOLiD system: fixed-length short reads; very high throughput; 2-base encoding system; color-space informatics [figure: 2-base color-code matrix, 1st base by 2nd base over A, C, G, T, with colors 0-3]
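The 2-base encoding can be sketched compactly: in the standard SOLiD color code each color describes a pair of adjacent bases, and with A=0, C=1, G=2, T=3 the color equals the bitwise XOR of the two base codes (identical bases give color 0, complementary bases give color 3). The function names below are illustrative.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Encode a base sequence as colors, one color per adjacent base pair."""
    return [
        BASE_TO_CODE[first] ^ BASE_TO_CODE[second]
        for first, second in zip(sequence, sequence[1:])
    ]


def decode_colors(first_base, colors):
    """Decode colors back to bases, starting from a known first base."""
    bases = [first_base]
    for color in colors:
        # XOR is its own inverse: previous base code XOR color = next base code.
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ color])
    return "".join(bases)


colors = encode_colors("ATGGC")
print(colors)
print(decode_colors("A", colors))
```

Because every base is decoded from the one before it, a single wrong color changes all downstream bases when translated naively, which is one reason analysis is often carried out in color space.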
71
Helicos / Heliscope system: short-read sequencer; single molecule sequencing; no amplification; variable read-length; error rate reduced with 2-pass template sequencing
72
Data characteristics
73
Data standards: different data storage needs (archival, transfer, processing) often pose contradictory requirements (e.g. normalized vs. non-normalized storage of assembly, alignment, read, image data); even different analysis goals often call for different optimal storage / data access strategies (e.g. paired-end read analysis for SV detection vs. SNP calling); requirements include binary formats, fast sequential and / or random access, and flexible indexing (e.g. an entire genome assembly can no longer reside in RAM)