GNUMAP-SNP Nathan Clement The University of Texas Austin, TX, USA.


1 GNUMAP-SNP Nathan Clement The University of Texas Austin, TX, USA

2 Outline  Motivation NGS Issues and Requirements Pair-HMM  Memory Optimizations  Results  Conclusion

3 Motivation Mutation Detection:  SNP discovery HapMap and resequencing  Species Identification  Bisulfite Sequencing Epigenetic influences RNA editing

4 Error Rates*
Instrument               Run Time   Mb/run      Bases/read   Primary Error Type   Error Rate (%)
3730xl (Capillary)       2 h        0.06        650          Substitution         0.1-1
454 FLX+                 18-20 h    900         700          Indel                1
Illumina HiSeq2000       10 days    ≤ 600,000   100+100      Substitution         ≥0.1
Ion Torrent – 318 chip   2 h        >1000       >100         Indel                ~1
PacBio RS                0.5-2 h    5-10        860-1100     CG Deletions         16
* Data current as of May 2011: Glenn, Travis C., “Field guide to next-generation DNA sequencers,” Molecular Ecology Resources, vol. 11, pp. 759-769, 2011

5 Pair-HMM

6 Pair-HMM (Mathematics)  Match  Gap (in both directions)
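
The equations for this slide did not survive in the transcript. As a rough illustration of the match and gap states named above, here is a minimal sketch of the textbook pair-HMM forward recursion. It is not necessarily the formulation GNUMAP-SNP uses: the function name and the parameters `delta` (gap open), `epsilon` (gap extend), `tau` (termination) and `err` (substitution error) are assumptions chosen for the example.

```python
def pair_hmm_forward(x, y, delta=0.05, epsilon=0.3, tau=0.01, err=0.02):
    """Forward probability of sequences x and y under a three-state pair-HMM.

    M = match state, X = gap in y (emit from x only), Y = gap in x.
    All parameter values are illustrative assumptions, not GNUMAP-SNP's.
    """
    n, m = len(x), len(y)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    X = [[0.0] * (m + 1) for _ in range(n + 1)]
    Y = [[0.0] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 1.0  # start in the match state before any symbol is emitted

    q = 0.25  # background emission probability for a single nucleotide
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                # Joint emission: agree with prob (1 - err), else one of 3 others
                same = x[i - 1] == y[j - 1]
                p = q * (1 - err) if same else q * err / 3
                M[i][j] = p * (
                    (1 - 2 * delta - tau) * M[i - 1][j - 1]
                    + (1 - epsilon - tau) * (X[i - 1][j - 1] + Y[i - 1][j - 1])
                )
            if i > 0:
                # Gap in y: consume one symbol of x
                X[i][j] = q * (delta * M[i - 1][j] + epsilon * X[i - 1][j])
            if j > 0:
                # Gap in x: consume one symbol of y
                Y[i][j] = q * (delta * M[i][j - 1] + epsilon * Y[i][j - 1])
    return tau * (M[n][m] + X[n][m] + Y[n][m])


identical = pair_hmm_forward("acgt", "acgt")
mismatch = pair_hmm_forward("acgt", "aggt")
print(identical > mismatch)
```

Combining this forward pass with a matching backward pass would give per-cell posterior probabilities, which is the kind of quantity the M, X and Y matrices on the following slides display.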

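7 Pair-HMM (M)
Columns: a t a c g a c t
Rows (surviving cell values):
a  1.00 0.00
g  0.68 0.00
t  0.32 0.68 0.00
a  0.32 0.68 0.00
g  1.00 0.00
a  1.00 0.00
c  1.00 0.00
c  1.00

8 Pair-HMM (X)
Columns: a t a c g a c t
Rows (surviving cell values):
a  0.00
g  0.31 0.00
t
a
g
a
c
c

9 Pair-HMM (Y)
Columns: a t a c g a c t
Rows (surviving cell values):
a  0.00
g
t
a  0.31 0.00
g
a
c
c

10 Pair-HMM
Columns: A C G T
Rows (surviving cell values):
a  1.00 0.00
g  0.68 0.31
t  0.32 0.00 0.68
a  0.99 0.00
g  1.00 0.00
a  1.00 0.00
c  1.00 0.00
c  1.00 0.00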

11 Expected Results
CHR POS TOT A C G T SNP? PVAL
chrX 1755234 17.00 0.00 17 0.00 N
chrX 1755235 18.00 0.00 18.00 0.00 N
chrX 1755236 19.00 9.99 0.00 9.00 0.01 Y:g->a/g 2.54e-08
chrX 1755237 19.50 0.00 19.50 N
chrX 1755238 19.50 0.00 19.50 0.00 N
chrX 1755239 46.00 0.01 19.49 0.00 N

12 Why Inline SNP Calling?  Post-Processing Requires more disk space, less memory  Inline Requires more memory Less disk space Can include specific probabilities for each read

13 Previous Optimizations  Two methods for speeding up mapping: 1. Entire genome on one machine 2. Split memory among different machines ○ Must normalize across all genome portions ○ MPI reduction
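
The normalization step for the split-genome method can be sketched as follows. This is a single-process simulation, assuming each machine holds unnormalized alignment scores for one read against its own genome portion; the function name and data layout are hypothetical, and the plain `sum` over per-node sums only stands in for an MPI sum-reduction.

```python
def normalize_across_nodes(partial_scores):
    """Normalize one read's alignment scores over all genome portions.

    partial_scores: one list per (simulated) node, each holding the
    unnormalized scores that node found in its portion of the genome.
    """
    # Each node first sums its own scores locally.
    local_sums = [sum(scores) for scores in partial_scores]
    # Stand-in for an MPI sum-reduction across all nodes.
    global_sum = sum(local_sums)
    if global_sum == 0:
        return [[0.0] * len(scores) for scores in partial_scores]
    # Every node divides by the same global sum, so all portions agree.
    return [[s / global_sum for s in scores] for scores in partial_scores]


normalized = normalize_across_nodes([[0.2, 0.1], [0.7]])
print(normalized)
```

Without the shared global sum, each node would normalize only against its own portion and overstate the weight of its local hits.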

14 Previous Optimizations

15 Memory Requirements  Human Genome (3 Gbp) HashMap ≈ 12GB 4 bits/character = 1.5GB 5 floating point values per base (A, C, G, T plus N) = sizeof(float) * 5 * 3G bases = 60GB Also stores a per-base total for easy computation = sizeof(float) * 3G bases = 12GB  Total of ≈ 90GB per run
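
The arithmetic behind this budget can be checked with a short sketch. It assumes a 4-byte `float`, decimal gigabytes, and a genome of exactly 3 billion bases; the hash map figure is taken from the slide rather than derived.

```python
GENOME_BASES = 3_000_000_000  # assumed round figure for the human genome
FLOAT_BYTES = 4               # assumed sizeof(float)
GB = 1e9                      # decimal gigabytes

budget_gb = {
    "hash map (as stated on the slide)": 12.0,
    "genome at 4 bits/character": GENOME_BASES * 0.5 / GB,
    "5 floats per base (A, C, G, T, N)": FLOAT_BYTES * 5 * GENOME_BASES / GB,
    "per-base total": FLOAT_BYTES * GENOME_BASES / GB,
}

for item, size in budget_gb.items():
    print(f"{item}: {size:.1f} GB")

total_gb = sum(budget_gb.values())
print(f"sum of listed items: {total_gb:.1f} GB")
```

The listed items sum to somewhat less than the slide's rounded total of about 90GB; the remainder is presumably other runtime overhead not itemized here.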

16 Three Memory Optimizations  Normal (no optimization)  Integer discretization  Centroid discretization

17 Integer Discretization  Only need one floating point value (for total) and 1 byte/nucleotide.  “Parts per 255”  Biggest hit: Going into and out of “integer space”

18 Integer Discretization
Added from r_i:
Total  A     C     G     T     N
1.0    0.00  0.68  0.31  0.01  0.00
Genome (integer space):
Total  A     C     G     T     N
12.0   3     231   7     12    3
Step 1: Convert from Integer Space
Total  A     C     G     T     N
12.0   0.15  10.9  0.33  0.56  0.15
Step 2: Add from r_i to Genome
Total  A     C     G     T     N
13.0   0.15  11.6  0.64  0.57  0.15
Step 3: Convert back to Integer Space
Total  A     C     G     T     N
13.0   2     228   13    11    2
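
The three steps above can be sketched in a few lines. This is an illustration under assumptions, not the GNUMAP-SNP source: the function names are hypothetical, each position is assumed to store one float total plus five "parts per 255" bytes (A, C, G, T, N), and round-to-nearest is an assumed rounding rule.

```python
SCALE = 255  # "parts per 255": each nucleotide share fits in one byte


def from_integer_space(total, counts):
    """Step 1: turn parts-per-255 bytes back into absolute float weights."""
    return [c / SCALE * total for c in counts]


def to_integer_space(total, values):
    """Step 3: turn absolute float weights into parts-per-255 bytes."""
    if total == 0:
        return [0] * len(values)
    return [min(SCALE, round(v / total * SCALE)) for v in values]


def add_read(total, counts, read_total, read_values):
    """Add one read's contribution to a genome position stored as bytes."""
    values = from_integer_space(total, counts)            # step 1
    values = [v + r for v, r in zip(values, read_values)]  # step 2
    new_total = total + read_total
    return new_total, to_integer_space(new_total, values)  # step 3


new_total, new_counts = add_read(
    12.0, [3, 231, 7, 12, 3],            # genome position: total, A C G T N
    1.0, [0.00, 0.68, 0.31, 0.01, 0.00]  # contribution added from the read
)
print(new_total, new_counts)
```

Because the rounding rule here is an assumption, individual counts can differ by a unit or so from the values on the slide. The round trip through float space on every update is the cost the previous slide calls the "biggest hit".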

19 Centroid Discretization  Many states not used: [255, 255, 255, 255, 255] [0, 0, 0, 0, 0]  Many states not biologically relevant SNP transition (common) vs transversion (not likely)  MSA uses this compression to perform fast one-to-many alignment

20 Centroid Discretization (cont)

21  Benefits Doesn’t waste impossible or infrequently used space Much smaller memory footprint  Drawbacks: Slight overhead in converting from centroid to floating point spaces Rounding error (how significant?)
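
A minimal sketch of the centroid idea, assuming each genome position stores a single index into a small codebook of representative nucleotide distributions. The codebook below is invented for illustration (the pure states plus the two transition mixtures); the slides do not list the actual centroids, and the function names are hypothetical.

```python
# Hypothetical codebook: each centroid is a distribution over (A, C, G, T, N).
CENTROIDS = [
    (1.0, 0.0, 0.0, 0.0, 0.0),  # pure A
    (0.0, 1.0, 0.0, 0.0, 0.0),  # pure C
    (0.0, 0.0, 1.0, 0.0, 0.0),  # pure G
    (0.0, 0.0, 0.0, 1.0, 0.0),  # pure T
    (0.0, 0.0, 0.0, 0.0, 1.0),  # pure N
    (0.5, 0.0, 0.5, 0.0, 0.0),  # A/G mix (transition)
    (0.0, 0.5, 0.0, 0.5, 0.0),  # C/T mix (transition)
]


def encode(fractions):
    """Return the index of the centroid nearest to `fractions` (squared Euclidean)."""
    def dist(k):
        return sum((a - b) ** 2 for a, b in zip(fractions, CENTROIDS[k]))
    return min(range(len(CENTROIDS)), key=dist)


def decode(index, total):
    """Expand a centroid index back into absolute float weights."""
    return [total * f for f in CENTROIDS[index]]


index = encode((0.52, 0.0, 0.47, 0.01, 0.0))
print(index, decode(index, 19.0))
```

The example shows where the rounding-error drawback comes from: the observed 0.52/0.47 split is snapped to the 0.5/0.5 centroid, and any distribution far from every centroid loses more.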

22 Speed Comparison

23 Optimization Stats (chrX)
Optimization   Memory   Mem %    Wallclock   TPFP
Normal         4.76GB   100%     04:25:55    1309127
CharDisc       2.58GB   54.2%    04:36:58    6770
CentDisc       2.01GB   42.2%    04:27:29    1669058
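
The Mem % column is each run's memory as a share of the unoptimized run, which a quick calculation confirms. The helper name below is hypothetical; the memory figures are the ones in the table.

```python
memory_gb = {"Normal": 4.76, "CharDisc": 2.58, "CentDisc": 2.01}


def mem_percent(name, baseline="Normal"):
    """Memory of `name` as a percentage of the baseline run, to one decimal."""
    return round(100 * memory_gb[name] / memory_gb[baseline], 1)


for name in memory_gb:
    pct = mem_percent(name)
    print(f"{name}: {pct}% of baseline, {round(100 - pct, 1)}% saved")
```

Read this way, 42.2% is the memory that remains under centroid discretization, so the saving relative to the normal run is the complement of that figure.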

24 Conclusion  For high error rates, the HMM approach is ideal, but requires more memory Distributing the genome across processors doesn’t scale linearly  Discretization methods provide good memory reductions (down to 42% of the unoptimized footprint) Centroid discretization performs poorly Integer discretization can be used when available memory is low

25 Questions

