1
Next-generation sequencing – the informatics angle. Gabor T. Marth, Boston College Biology Department. AGBT 2008, Marco Island, FL, February 6, 2008.
2
T1. Roche / 454 FLX system: pyrosequencing technology; variable read length; the only new technology with >100bp reads; tested in many published applications; supports paired-end read protocols with up to 10kb separation size.
3
T2. Illumina / Solexa Genome Analyzer: fixed-length short-read sequencer; read properties are very close to those of traditional capillary sequences; very low INDEL error rate; tested in many published applications; paired-end read protocols support short (<600bp) separation.
4
T3. AB / SOLiD system: fixed-length short-read sequencer; employs a 2-base encoding system that can be used for error reduction and improving SNP calling accuracy; requires color-space informatics; published applications underway / in review; paired-end read protocols support up to 10kb separation size. [Figure: 2-base encoding matrix, 1st base (A, C, G, T) by 2nd base (A, C, G, T), each dinucleotide assigned one of the colors 0, 1, 2, 3.]
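To make the 2-base encoding concrete, here is a minimal sketch of color-space encoding and decoding. It assumes the commonly described scheme in which each base gets a 2-bit code (A=0, C=1, G=2, T=3) and the color of a dinucleotide is the XOR of the two codes; the function names are illustrative and not taken from any vendor software.

```python
# Assumed 2-bit base codes; a dinucleotide's color is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = {code: base for base, code in BASE_CODE.items()}


def encode_colors(seq):
    """Translate a base sequence into its list of colors (one per adjacent pair)."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(seq, seq[1:])]


def decode_colors(first_base, colors):
    """Rebuild the base sequence from a known first base and a list of colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_BASE[BASE_CODE[bases[-1]] ^ color])
    return "".join(bases)


reference = encode_colors("ACGT")
variant = encode_colors("ATGT")  # single-base substitution at the second base
changed = [i for i, (r, v) in enumerate(zip(reference, variant)) if r != v]
```

In this sketch a true single-base substitution changes two adjacent colors (`changed` holds two neighboring indices), whereas an isolated color change cannot come from a single-base substitution and is therefore more likely a measurement error. This is the property that can be exploited for error reduction.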
5
T4. Helicos / Heliscope system: experimental short-read sequencer system; single-molecule sequencing; no amplification; variable read length; error rate reduced with 2-pass template sequencing.
6
A1. Variation discovery: SNPs and short-INDELs. Steps: (1) sequence alignment; (2) dealing with non-unique mapping; (3) looking for allelic differences.
7
A2. Structural variation detection: structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations; copy number (for amplifications, deletions) from depth of read coverage.
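The depth-of-coverage idea can be sketched as follows. This is a simplified illustration, not the method used in the talk: it assumes reads are summarized by their start positions, bins them into fixed windows, and reports each window's depth relative to the median window. The window size and function names are hypothetical.

```python
from collections import Counter
from statistics import median


def window_depth(read_starts, window, genome_length):
    """Count read starts falling into consecutive fixed-size windows."""
    n_windows = (genome_length + window - 1) // window
    counts = Counter(pos // window for pos in read_starts)
    return [counts[i] for i in range(n_windows)]


def depth_ratio(depths):
    """Depth of each window relative to the median window depth."""
    typical = median(depths)
    if typical == 0:
        raise ValueError("median depth is zero; coverage too low to normalize")
    return [d / typical for d in depths]


# Toy data: five 100bp windows with 10, 10, 5, 10 and 20 read starts.
toy_counts = [10, 10, 5, 10, 20]
toy_starts = [i * 100 + j for i, c in enumerate(toy_counts) for j in range(c)]
ratios = depth_ratio(window_depth(toy_starts, 100, 500))
```

A ratio near 0.5 would suggest a heterozygous deletion and a ratio near 2 an amplification, provided coverage is otherwise even. Representational biases make real data much noisier than this toy case.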
8
A3. Identification of protein-bound DNA: chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007); transcription factor binding sites (Robertson et al. Nature Methods 2007). [Figure: reads aligned to the genome sequence.]
9
A4. Novel transcript discovery (genes): novel genes / exons; novel transcripts in known genes. [Figure labels: inferred exon 1, inferred exon 2 (novel genes / exons); known exon 1, known exon 2 (novel transcripts in known genes).]
10
A5. Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006
11
A6. Expression profiling by tag counting (Jones-Rhoades et al. PLoS Genetics 2007). [Figure: aligned reads over a gene.]
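Tag counting reduces to counting aligned reads per gene and normalizing by the total number of reads. The sketch below assumes reads are represented by a single mapped position and genes by half-open intervals on one sequence; the names `count_tags` and `tags_per_million` are illustrative.

```python
def count_tags(read_positions, genes):
    """Count reads whose mapped position falls inside each gene interval."""
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:  # half-open interval [start, end)
                counts[name] += 1
    return counts


def tags_per_million(counts, total_reads):
    """Scale raw tag counts to tags per million sequenced reads."""
    return {name: 1e6 * c / total_reads for name, c in counts.items()}


genes = {"geneA": (100, 200), "geneB": (300, 400)}
reads = [100, 150, 199, 200, 350, 500]
counts = count_tags(reads, genes)
normalized = tags_per_million(counts, total_reads=len(reads))
```

The nested loop is fine for a toy example; a real data set would call for an interval index or sorted sweep instead.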
12
A7. De novo organismal genome sequencing (Lander et al. Nature 2001). [Figure: short reads, longer reads and read pairs; assembled sequence contigs.]
13
C1. Read length. [Figure: read length in bp, axis 0 to 300; values shown: ~250 (variable), 25-40 (fixed), 25-35 (fixed), 20-35 (variable).]
14
When does read length matter? Short reads are often sufficient where the entire read length can be used for mapping: SNPs, short-INDELs, SVs; CHIP-SEQ; short RNA discovery; counting (mRNA, miRNA). Longer reads are needed where one must use parts of reads for mapping: de novo sequencing; novel transcript discovery. [Figure: overlapping reads aacttagacttaca and gacttacatacgta; read accgattactatacta across known exon 1 and known exon 2.]
15
C2. Read error rate: the error rate dictates how many errors the aligner should tolerate; the error rate is typically 0.4 - 1%; the more errors the aligner must tolerate, the lower the fraction of reads that can be uniquely aligned; in applications where, in addition, specific alleles are essential, the error rate is even more important.
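How many errors an aligner needs to tolerate can be estimated with a simple binomial model. The sketch below assumes independent errors at a uniform per-base rate, which is a simplification; the function name and the 35bp read length are illustrative.

```python
from math import comb


def prob_at_most_k_errors(read_length, error_rate, k):
    """Binomial probability that a read contains at most k sequencing errors."""
    return sum(
        comb(read_length, i) * error_rate**i * (1 - error_rate) ** (read_length - i)
        for i in range(k + 1)
    )


# Fraction of 35bp reads covered when the aligner tolerates 0, 1 or 2 errors.
coverage_at_1pct = [prob_at_most_k_errors(35, 0.01, k) for k in range(3)]
```

Under these assumptions, tolerating more mismatches captures a larger share of the reads, at the cost of fewer unique placements, which is the trade-off described on the slide.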
16
C3. Error rate grows with each cycle: this phenomenon limits useful read length.
17
C4. Substitutions vs. INDEL errors. Where substitution errors dominate: SNP discovery may require higher coverage for allele confirmation; INDELs can be discovered with very high confidence! Where INDEL errors dominate: gapped alignment necessary; good SNP discovery accuracy; short-INDEL discovery difficult.
18
C5. Quality values are important for allele calling: PHRED base quality values represent the estimated likelihood of sequencing error and help us pick out true alternate alleles; inaccurate or poorly calibrated base quality values hinder allele calling; Q-values should be accurate … and high!
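For reference, the standard PHRED relationship between a quality value Q and the estimated error probability P is Q = -10 log10(P). A minimal sketch of the two conversions, with illustrative function names:

```python
import math


def phred_to_error_prob(q):
    """Estimated probability that a base with PHRED quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """PHRED quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)
```

So Q20 corresponds to an estimated 1 error in 100 bases and Q30 to 1 in 1000, which is why high and accurate Q-values make it easier to separate true alternate alleles from sequencing errors.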
19
Quality values should be well calibrated: the assigned base quality value should match the actual (observed) base quality in every sequencing cycle.
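One way to check calibration is to compare, per sequencing cycle, the error rate predicted by the assigned quality values with the error rate actually observed against a trusted reference. The sketch below assumes observations are available as `(cycle, assigned_q, is_error)` tuples; that input format and the function name are hypothetical.

```python
from collections import defaultdict


def calibration_by_cycle(observations):
    """Return {cycle: (predicted_error_rate, observed_error_rate)}.

    observations: iterable of (cycle, assigned_q, is_error) tuples, where
    is_error says whether the base call disagreed with a trusted reference.
    """
    totals = defaultdict(lambda: [0, 0, 0.0])  # bases, errors, summed predicted P
    for cycle, assigned_q, is_error in observations:
        entry = totals[cycle]
        entry[0] += 1
        entry[1] += int(is_error)
        entry[2] += 10 ** (-assigned_q / 10)
    return {
        cycle: (predicted / n, errors / n)
        for cycle, (n, errors, predicted) in sorted(totals.items())
    }


# Toy data: all bases assigned Q20; cycle 1 has 1 error in 100, cycle 2 has 5.
toy = [(1, 20, i < 1) for i in range(100)] + [(2, 20, i < 5) for i in range(100)]
report = calibration_by_cycle(toy)
```

In the toy data, cycle 1 is well calibrated (predicted and observed rates both about 1%), while cycle 2 is overconfident: the assigned Q20 predicts 1% but 5% of the bases are wrong.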
20
C6. Representational biases / library complexity: fragmentation biases; amplification (PCR) biases; sequencing biases. [Figure: fragmentation, PCR and sequencing steps; regions with low/no representation versus high representation.]
21
Dispersal of read coverage: this affects variation discovery (deeper starting read coverage is needed); it has a major impact on counting applications.
22
Amplification errors: many reads come from clonal copies of a single fragment; an early amplification error gets propagated onto every clonal copy; early PCR errors in “clonal” read copies lead to false positive allele calls.
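A common heuristic for limiting the effect of clonal copies, shown here as an illustration rather than as the approach used in the talk, is to keep only one read per identical mapping start and strand. The read representation (dictionaries with `chrom`, `start`, `strand` keys) is an assumption of this sketch.

```python
def collapse_clonal_reads(reads):
    """Keep the first read seen for each (chrom, start, strand) combination."""
    seen = set()
    kept = []
    for read in reads:
        key = (read["chrom"], read["start"], read["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(read)
    return kept


reads = [
    {"chrom": "chr1", "start": 100, "strand": "+", "name": "r1"},
    {"chrom": "chr1", "start": 100, "strand": "+", "name": "r2"},  # likely clonal copy of r1
    {"chrom": "chr1", "start": 100, "strand": "-", "name": "r3"},
    {"chrom": "chr1", "start": 140, "strand": "+", "name": "r4"},
]
unique_reads = collapse_clonal_reads(reads)
```

After collapsing, an allele supported only by clonal copies counts once instead of many times. The price is that genuinely independent fragments starting at the same position are also discarded, which matters at very deep coverage.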
23
C7. Paired-end reads. Fragment amplification: fragment length 100 - 600 bp; fragment length limited by amplification efficiency. Circularization: 500bp - 10kb (sweet spot ~3kb); fragment length limited by library complexity (Korbel et al. Science 2007). Paired-end reads can improve read mapping accuracy (if unique map positions are required for both ends) or efficiency (if a fragment length constraint is used to rescue non-uniquely mapping ends).
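The rescue of a non-uniquely mapping end can be sketched as follows. This is a deliberately simplified illustration: it ignores read orientation and read length, treats positions as single coordinates on one sequence, and uses a hypothetical function name.

```python
def rescue_mate(anchor_pos, candidate_positions, min_frag, max_frag):
    """Pick the mate position consistent with the expected fragment length.

    anchor_pos: position of the uniquely mapped end.
    candidate_positions: all map positions of the non-unique end.
    Returns the single consistent position, or None if zero or several remain.
    """
    consistent = [
        pos for pos in candidate_positions
        if min_frag <= abs(pos - anchor_pos) <= max_frag
    ]
    return consistent[0] if len(consistent) == 1 else None


rescued = rescue_mate(1000, [1300, 50000, 900000], min_frag=100, max_frag=600)
ambiguous = rescue_mate(1000, [1300, 1450], min_frag=100, max_frag=600)
```

Here `rescued` is 1300 because only that candidate lies within the 100 - 600bp window of the anchor, while `ambiguous` is `None` because two candidates satisfy the constraint.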
24
Paired-end reads for SV discovery: longer fragments increase the chance of spanning SV breakpoints and/or entire events; SV breakpoint detection sensitivity & resolution depend on the width of the fragment length distribution (most 2kb deletions would be detected at 10% std but missed at 30% std); longer fragments tend to have wider fragment length distributions.
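The 2kb deletion example can be worked through numerically. The sketch below assumes a mean fragment length of 3kb and a detection cutoff of 3 standard deviations for a single read pair; both numbers are assumptions chosen for illustration, not values stated on the slide.

```python
def deletion_z_score(deletion_size, mean_fragment, std_fraction):
    """How many standard deviations a spanned deletion shifts the mapped distance.

    A read pair spanning a deletion maps deletion_size further apart on the
    reference than the fragment length distribution predicts.
    """
    return deletion_size / (std_fraction * mean_fragment)


def is_detectable(deletion_size, mean_fragment, std_fraction, z_threshold=3.0):
    """True if a single spanning pair exceeds the assumed z-score cutoff."""
    return deletion_z_score(deletion_size, mean_fragment, std_fraction) >= z_threshold


narrow = deletion_z_score(2000, 3000, 0.10)  # std = 300bp
wide = deletion_z_score(2000, 3000, 0.30)    # std = 900bp
```

With a 10% standard deviation the 2kb deletion shifts the apparent fragment length by more than 6 standard deviations, while at 30% the shift is only a little over 2, below the assumed cutoff. Multiple spanning pairs would improve sensitivity in practice.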
25
C8. Technologies / properties / applications
Technology                    Roche/454             Illumina/Solexa   AB/SOLiD
Read properties
  Read length                 250bp                 20-40bp           25-35bp
  Error rate                  <0.5%                 <1.0%             <0.5%
  Dominant error type         INDEL                 SUB               SUB
  Paired-end reads available  yes                   yes               yes
  Paired-end separation       < 10kb (3kb optimal)  100 - 600bp       500bp - 10kb (3kb optimal)
Applications
  SNP discovery               ○                     ●                 ●
  short-INDEL discovery                             ●                 ○
  SV discovery                ○                     ○                 ●
  CHIP-SEQ                    ○                     ●                 ●
  small RNA/gene discovery    ○                     ●                 ●
  mRNA Xcript discovery       ●                     ○                 ○
  Expression profiling        ○                     ●                 ●
  De novo sequencing          ●                                       ??
26
Thanks. http://bioinformatics.bc.edu/marthlab: Derek Barnett, Eric Tsung, Aaron Quinlan, Damien Croteau-Chonka, Weichun Huang, Michael Stromberg, Chip Stewart, Michele Busby. MOSAIK talk: Thursday, 7:40PM. Michael Egholm, David Bentley, Francisco de la Vega, Kristen Stoops, Ed Thayer, Clive Brown, Elaine Mardis.