Published by Chad Gray. Modified over 9 years ago.
1
GNUMAP-SNP
Nathan Clement
The University of Texas
Austin, TX, USA
2
Outline
○ Motivation
○ NGS Issues and Requirements
○ Pair-HMM
○ Memory Optimizations
○ Results
○ Conclusion
3
Motivation
○ Mutation detection: SNP discovery
○ HapMap and resequencing
○ Species identification
○ Bisulfite sequencing
○ Epigenetic influences
○ RNA editing
4
Error Rates*
Instrument | Run Time | Mb/run | Bases/read | Primary Error Type | Error Rate (%)
3730xl (Capillary) | 2 h | 0.06 | 650 | Substitution | 0.1-1
454 FLX+ | 18-20 h | 900 | 700 | Indel | 1
Illumina HiSeq2000 | 10 days | ≤ 600,000 | 100+100 | Substitution | ≥0.1
Ion Torrent – 318 chip | 2 h | >1000 | >100 | Indel | ~1
PacBio RS | 0.5-2 h | 5-10 | 860-1100 | CG Deletions | 16
* Data current as of May 2011: Glenn, Travis C, “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol 11, pp 759-769, 2011
5
Pair-HMM
6
Pair-HMM (Mathematics)
○ Match
○ Gap (in both directions)
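The equations from this slide did not survive in the transcript. As a rough illustration of what the match and gap recurrences of a Pair-HMM generally look like, here is a minimal sketch of the textbook forward algorithm with one match state (M) and two gap states (X, Y). It is not taken from GNUMAP-SNP: the function name `pair_hmm_forward` and the parameters `delta` (gap open), `epsilon` (gap extend), `p_match`, `p_mismatch`, and `q` (gap emission) are assumptions chosen for illustration, and the end-state transition is omitted for brevity.

```python
def pair_hmm_forward(x, y, delta=0.1, epsilon=0.3,
                     p_match=0.22, p_mismatch=0.01, q=0.25):
    """Sum the probability of all alignments of x and y (forward algorithm).

    All parameter values are illustrative assumptions, not values from the slides.
    With a 4-letter alphabet, 4 * p_match + 12 * p_mismatch = 1.
    """
    n, m = len(x), len(y)
    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]  # match state
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]  # gap in y (x emits alone)
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]  # gap in x (y emits alone)
    f_m[0][0] = 1.0  # start in the match state before any symbol is read

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                emit = p_match if x[i - 1] == y[j - 1] else p_mismatch
                f_m[i][j] = emit * (
                    (1 - 2 * delta) * f_m[i - 1][j - 1]
                    + (1 - epsilon) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                f_x[i][j] = q * (delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q * (delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    return f_m[n][m] + f_x[n][m] + f_y[n][m]


same = pair_hmm_forward("acgt", "acgt")
one_mismatch = pair_hmm_forward("acgt", "aggt")
```

Under these assumed parameters, an identical pair of sequences scores higher than a pair with one substitution. Per-cell posterior probabilities such as those on the following slides would additionally need a backward pass.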
7
Pair-HMM (M)
  a t a c g a c t
a 1.00 0.00
g 0.68 0.00
t 0.32 0.68 0.00
a 0.32 0.68 0.00
g 1.00 0.00
a 1.00 0.00
c 1.00 0.00
c 1.00
8
Pair-HMM (X)
  a t a c g a c t
a 0.00
g 0.31 0.00
t
a
g
a
c
c
9
Pair-HMM (Y)
  a t a c g a c t
a 0.00
g
t
a 0.31 0.00
g
a
c
c
10
Pair-HMM
  A C G T
a 1.00 0.00
g 0.68 0.31
t 0.32 0.00 0.68
a 0.99 0.00
g 1.00 0.00
a 1.00 0.00
c 1.00 0.00
c 1.00 0.00
11
Expected Results
CHR POS TOT A C G T SNP? PVAL
chrX 1755234 17.00 0.00 17 0.00 N
chrX 1755235 18.00 0.00 18.00 0.00 N
chrX 1755236 19.00 9.99 0.00 9.00 0.01 Y:g->a/g 2.54e-08
chrX 1755237 19.50 0.00 19.50 N
chrX 1755238 19.50 0.00 19.50 0.00 N
chrX 1755239 46.00 0.01 19.49 0.00 N
12
Why Inline SNP Calling?
Post-Processing
○ Requires more disk space, less memory
Inline
○ Requires more memory
○ Less disk space
○ Can include specific probabilities for each read
13
Previous Optimizations
Two methods for speeding up mapping:
1. Entire genome on one machine
2. Split memory among different machines
   ○ Must normalize across all genome portions
   ○ MPI reduction
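The normalization step in method 2 can be sketched without an MPI installation. Below, each inner list stands for the alignment scores one machine found for a single read in its portion of the genome. The function name `normalize_across_partitions` and the data layout are assumptions for illustration. In a real MPI program the global sum would come from a sum reduction such as MPI_Allreduce with MPI_SUM rather than from a local loop.

```python
def normalize_across_partitions(local_scores):
    """Normalize one read's alignment scores over all genome portions.

    local_scores: list with one list of scores per machine (hypothetical layout).
    """
    # Each machine first sums the scores it found locally.
    local_sums = [sum(scores) for scores in local_scores]
    # Stand-in for the MPI sum reduction that combines the local sums.
    global_sum = sum(local_sums)
    if global_sum == 0:
        return [[0.0 for _ in scores] for scores in local_scores]
    # Every machine divides its own scores by the shared global sum.
    return [[score / global_sum for score in scores] for scores in local_scores]


# Three machines: two found hits for the read, one found none.
partitions = [[3.0, 1.0], [4.0], []]
normalized = normalize_across_partitions(partitions)
```

Without the shared sum, each machine would normalize against only its own hits, and a read mapping to several portions of the genome would be over-counted.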
14
Previous Optimizations
15
Memory Requirements
Human genome (3 Gbp):
○ HashMap ≈ 12 GB
○ Sequence at 4 bits/character = 1.5 GB
○ 5 floating-point values per base (A, C, G, T, plus N) = sizeof(float) * 5 * 3 G bases = 60 GB
○ Also stores a total per base for easy computation = sizeof(float) * 3 G bases = 12 GB
Total of ≈ 90 GB per run
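The arithmetic on this slide can be checked in a few lines. The sketch assumes a 4-byte float and decimal gigabytes (10^9 bytes). The 12 GB hash map figure is taken from the slide as given, not derived.

```python
GENOME_BASES = 3_000_000_000   # human genome, about 3 billion bases
SIZEOF_FLOAT = 4               # bytes, assuming a 32-bit float
GB = 1_000_000_000             # decimal gigabyte

hash_map_gb = 12.0                                    # figure quoted on the slide
sequence_gb = GENOME_BASES * 4 / 8 / GB               # 4 bits per character
per_base_gb = SIZEOF_FLOAT * 5 * GENOME_BASES / GB    # A, C, G, T, N as floats
totals_gb = SIZEOF_FLOAT * GENOME_BASES / GB          # one running total per base

total_gb = hash_map_gb + sequence_gb + per_base_gb + totals_gb
print(f"{sequence_gb} + {per_base_gb} + {totals_gb} + {hash_map_gb} = {total_gb} GB")
```

The itemized figures sum to 85.5 GB, which the slide rounds up to roughly 90 GB. The five per-base floats account for about 70% of that, which is why the optimizations that follow target them.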
16
Three Memory Optimizations
○ Normal (no optimization)
○ Integer discretization
○ Centroid discretization
17
Integer Discretization
○ Only need one floating-point value (for the total) and 1 byte per nucleotide
○ “Parts per 255”
○ Biggest hit: going into and out of “integer space”
18
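Integer Discretization
Added from r_i:
Total | A | C | G | T | N
1.0 | 0.00 | 0.68 | 0.31 | 0.01 | 0.00
Genome (stored in integer space):
Total | A | C | G | T | N
12.0 | 3 | 231 | 7 | 12 | 3
Step 1: Convert from integer space
Total | A | C | G | T | N
12.0 | 0.15 | 10.9 | 0.33 | 0.56 | 0.15
Step 2: Add from r_i to genome
Total | A | C | G | T | N
13.0 | 0.15 | 11.6 | 0.64 | 0.57 | 0.15
Step 3: Convert back to integer space
Total | A | C | G | T | N
13.0 | 2 | 228 | 13 | 11 | 2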
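The three steps on this slide can be sketched as follows, using the slide's own numbers as input. The function names and the choice of round-to-nearest are assumptions. The slide does not say how values are rounded, so the resulting bytes may differ from the slide's by one count.

```python
SCALE = 255  # "parts per 255": each nucleotide count is stored in one byte


def from_integer_space(total, parts):
    """Step 1: expand byte-sized parts back into floating-point counts."""
    return [total * part / SCALE for part in parts]


def to_integer_space(total, values):
    """Step 3: compress floating-point counts into parts per 255."""
    if total <= 0:
        return [0] * len(values)
    # Rounding to nearest is an assumption; clamp so a part always fits a byte.
    return [min(SCALE, round(SCALE * value / total)) for value in values]


def add_read(total, parts, read_total, read_values):
    """Add one read's contribution (A, C, G, T, N) to a genome position."""
    values = from_integer_space(total, parts)                   # step 1
    values = [g + r for g, r in zip(values, read_values)]       # step 2
    new_total = total + read_total
    return new_total, to_integer_space(new_total, values)       # step 3


# Genome position and read contribution r_i from the slide.
new_total, new_parts = add_read(12.0, [3, 231, 7, 12, 3],
                                1.0, [0.00, 0.68, 0.31, 0.01, 0.00])
```

Every update pays for two conversions, which matches the slide's remark that moving into and out of integer space is the biggest cost. Repeated round trips also accumulate rounding error, since each position keeps only 255 levels per nucleotide.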
19
Centroid Discretization
○ Many states not used: [255, 255, 255, 255, 255], [0, 0, 0, 0, 0]
○ Many states not biologically relevant
   ○ SNP transition (common) vs. transversion (not likely)
○ MSA uses this compression to perform fast one-to-many alignment
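One way to read this idea is as a small codebook of plausible nucleotide distributions, with each genome position storing only the index of its nearest entry. The sketch below is an assumption about how such a scheme could look, not the GNUMAP-SNP implementation: the `CENTROIDS` table, the Euclidean distance, and the function names are all hypothetical.

```python
import math

# Hypothetical codebook over (A, C, G, T, N); a real one would be larger.
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # homozygous A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # homozygous C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # homozygous G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # homozygous T
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G heterozygote (transition)
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T heterozygote (transition)
]


def encode(counts):
    """Return the index of the centroid closest to the normalized counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("cannot encode a position with no coverage")
    probs = [count / total for count in counts]
    return min(range(len(CENTROIDS)),
               key=lambda k: math.dist(probs, CENTROIDS[k]))


def decode(index, total):
    """Expand a centroid index back into floating-point counts."""
    return [p * total for p in CENTROIDS[index]]


# A position with about half A and half G maps to the A/G centroid.
snp_index = encode([9.99, 0.0, 9.0, 0.01, 0.0])
restored = decode(snp_index, 19.0)
```

With at most 256 centroids, a position needs one byte plus its total, instead of five bytes or five floats. The cost is that every position is snapped to the nearest codebook entry, so evidence that falls between centroids is lost.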
20
Centroid Discretization (cont)
21
Benefits:
○ Doesn’t waste impossible or infrequently used space
○ Much smaller memory footprint
Drawbacks:
○ Slight overhead in converting between centroid and floating-point spaces
○ Rounding error (how significant?)
22
Speed Comparison
23
Optimization Stats (chrX)
Optimization | Memory | Mem % | Wallclock | TPFP
Normal | 4.76 GB | 100% | 04:25:55 | 1309127
CharDisc | 2.58 GB | 54.2% | 04:36:58 | 6770
CentDisc | 2.01 GB | 42.2% | 04:27:29 | 1669058
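The Mem % column reports memory used relative to the unoptimized run, not memory saved. A short check with the memory figures from this slide makes the distinction explicit (the variable names are illustrative):

```python
memory_gb = {"Normal": 4.76, "CharDisc": 2.58, "CentDisc": 2.01}
baseline = memory_gb["Normal"]

# Memory used as a percentage of the unoptimized run (the slide's "Mem %").
percent_of_baseline = {
    name: round(100 * gb / baseline, 1) for name, gb in memory_gb.items()
}
# Memory saved relative to the unoptimized run.
percent_saved = {
    name: round(100 - pct, 1) for name, pct in percent_of_baseline.items()
}

for name in memory_gb:
    print(f"{name}: {percent_of_baseline[name]}% used, {percent_saved[name]}% saved")
```

So the 42.2% figure for CentDisc corresponds to a saving of 57.8%, and the 54.2% figure for CharDisc to a saving of 45.8%.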
24
Conclusion
○ For high error rates, the HMM approach is ideal but requires more memory
○ Distributing the genome across processors doesn’t scale linearly
○ Discretization methods provide good memory reductions (down to 42% of the original footprint)
○ Centroid discretization performs poorly
○ Integer discretization can be used when available memory is low
25
Questions