Published by Norma Moody. Modified over 9 years ago.
1
SHRiMP: Accurate Mapping of Short Reads in Letter- and Colour-spaces
Stephen Rumble, Phil Lacroute, …, Arend Sidow, Michael Brudno
2
How SHRiMP works
- Stage 1: Map reads to the target genome
- Stage 2: Compute statistics
3
Read Mapping: Three phases
- Very fast k-mer scan (index the reads, scan the genome)
- Fast, vectorized Smith-Waterman to confirm
- Slow, complete backtracking S-W for the top ‘n’ hits
4-9
Read Mapping: Phase 1
- Create a hash table of size 4^(k-mer length)
- 4 bases only: ignore all else (‘N’, ‘X’, wobble codes…)
- This becomes our k-mer-to-read index
- Example: slide a window of length 6 along the read AACTGTACCAGTGAG
    AACTGTaccagtgag   AACTGT
    aACTGTAccagtgag   ACTGTA
    aaCTGTACcagtgag   CTGTAC
    aacTGTACCagtgag   TGTACC
    …
- Each index entry (AACTGT, ACTGTA, CTGTAC, TGTACC, …) points to the reads that contain that k-mer
[Figure: index entries with their read lists; the reads shown are Read 7, Read 32, Read 18, Read 12, Read 13, Read 12, Read 7, Read 15]
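The indexing step above can be sketched in a few lines of Python. This is a minimal illustration, not SHRiMP's own code: the names `kmer_to_int` and `build_index`, the 2-bit base encoding and the list-of-lists table are all assumptions made for the example.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding of the 4 bases

def kmer_to_int(kmer):
    """Map a k-mer to a slot in a table of size 4**k, or None if it
    contains anything other than A, C, G, T (e.g. 'N', 'X')."""
    value = 0
    for base in kmer:
        code = CODE.get(base)
        if code is None:
            return None
        value = value * 4 + code
    return value

def build_index(reads, k):
    """Return a table with 4**k slots; each slot lists (read_id, offset)
    pairs for the reads that contain that k-mer."""
    index = [[] for _ in range(4 ** k)]
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            slot = kmer_to_int(read[pos:pos + k].upper())
            if slot is not None:
                index[slot].append((read_id, pos))
    return index

reads = ["AACTGTACCAGTGAG", "CTGTACNNACTGTA"]
index = build_index(reads, k=6)
print(index[kmer_to_int("CTGTAC")])
```

With k = 6 the table has 4096 slots, most of them empty for a small read set; a k-mer overlapping an ‘N’ is simply never inserted.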
10
Read Mapping: Phase 1
- Once we’ve indexed all the reads, just scan the genome k-mer by k-mer
[Figure: the genome and the reads]
11
Read Mapping: Phase 1
- Remember the k-mer hits within a given interval (window)
- When there are sufficient hits, look more closely
- “Look more closely” means calculating a fast Smith-Waterman score
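A rough sketch of the scan-and-count idea follows. The function names, the dictionary-based index, and the parameters `window` and `min_hits` are hypothetical choices for illustration; the slides do not give the actual window size or hit threshold.

```python
from collections import defaultdict, deque

def index_reads(reads, k):
    """Map each k-mer to the set of read ids that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(read_id)
    return index

def scan_genome(genome, reads, k=6, window=30, min_hits=3):
    """Scan the genome k-mer by k-mer and report (read_id, start, end)
    whenever a read has at least min_hits k-mer hits inside the window."""
    index = index_reads(reads, k)
    recent = defaultdict(deque)   # read_id -> genome positions of recent hits
    candidates = []
    for pos in range(len(genome) - k + 1):
        for read_id in sorted(index.get(genome[pos:pos + k], ())):
            hits = recent[read_id]
            hits.append(pos)
            while hits[0] <= pos - window:   # forget hits that left the window
                hits.popleft()
            if len(hits) >= min_hits:
                candidates.append((read_id, hits[0], pos + k))
    return candidates

reads = ["AACTGTACCAGTGAG", "TTTTTTTTTTTTTTT"]
genome = "GGGGGGGG" + reads[0] + "GGGGGGGG"
cands = scan_genome(genome, reads)
print(cands[0])
```

Each reported region would then be handed to the fast Smith-Waterman pass; a real implementation would merge the overlapping candidates instead of reporting one per hit.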
12
Technicalities
- We don’t always use full k-mers (q-grams)
- We actually support ‘spaced seeds’, but the algorithm doesn’t change much
- For each spaced seed, ‘compress out’ the k-mer and use it as the hash index
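As an illustration of "compressing out" a spaced seed, the sketch below keeps only the bases under the `1` positions of a seed mask. The mask `11110111` (length 8, weight 7) and the function name `apply_seed` are assumptions chosen for the example, not SHRiMP's defaults.

```python
def apply_seed(seq, pos, seed):
    """Return the bases of seq[pos:pos+len(seed)] that sit under a '1'
    in the seed mask; positions under '0' are don't-care and dropped."""
    return "".join(seq[pos + i] for i, bit in enumerate(seed) if bit == "1")

seed = "11110111"                                 # hypothetical spaced seed
key_a = apply_seed("AACTGTACC", 0, seed)          # window AACTGTAC
key_b = apply_seed("AACTATACC", 0, seed)          # differs only at the '0' position
print(key_a, key_b)
```

Both windows compress to the same key, so a single mismatch at the don't-care position still produces a hit. The compressed key can then be hashed exactly like a contiguous k-mer.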
13
Read Mapping: Phase 2
- Smith-Waterman is very expensive
- The NxM matrix isn’t too big for short reads and windows, but…
- We call the vectorized code millions of times
- We don’t want a bottleneck: aim for no more than 50% of the total runtime
- We only want one score, as quickly as possible
14
Read Mapping: Phase 2
[Figure: S-W matrix with the sequences A C T A G A C T T G and TCCAGTTCCAGT on its axes, marking the cell being computed and the previously computed cells]
15
Read Mapping: Phase 2
- Each forward-facing diagonal in the S-W matrix depends on:
  - A small constant number of previous diagonals
  - A small constant number of scalars
- We can compute entire diagonals in parallel
- Our speed-up is proportional to the diagonal size
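The dependency structure above can be made concrete with a score-only Smith-Waterman that walks the matrix one anti-diagonal at a time. This is a scalar Python sketch under simplifying assumptions (linear gap penalty, made-up scores, hypothetical function name); it shows the evaluation order only, whereas a real implementation would compute each diagonal with SIMD instructions.

```python
def sw_score_by_diagonals(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b, filled diagonal by diagonal.
    Diagonals are stored as lists indexed by row i; cell (i, j) lies on
    diagonal d = i + j."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0
    prev2 = [0] * (n + 1)   # diagonal d - 2
    prev1 = [0] * (n + 1)   # diagonal d - 1
    best = 0
    for d in range(2, n + m + 1):
        lo, hi = max(1, d - m), min(n, d - 1)   # rows with 1 <= j <= m
        cur = [0] * (n + 1)
        # Every cell in this loop depends only on prev1 and prev2, never on
        # another cell of cur, so the iterations are independent.
        for i in range(lo, hi + 1):
            j = d - i
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[i] = max(0,
                         prev2[i - 1] + sub,   # diagonal neighbour (i-1, j-1)
                         prev1[i - 1] + gap,   # neighbour above (i-1, j)
                         prev1[i] + gap)       # neighbour to the left (i, j-1)
        best = max(best, max(cur[lo:hi + 1]))
        prev2, prev1 = prev1, cur
    return best

print(sw_score_by_diagonals("ACTAGACTTG", "TCCAGT"))
```

Only three diagonals are alive at any time (current, previous, penultimate), and only the best score is kept, which fits the goal of getting one score as quickly as possible.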
16
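Read Mapping: Phase 2
[Figure: S-W matrix for A C T A G A C T T G and TCCAGTTCCAGT with the current, previous and penultimate diagonals marked; the vector T G A C C T is compared element-wise, giving the signs + - - - +]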
17
Read Mapping: Phase 2
- Most commodity processors have vector instructions
- Remember the MMX brouhaha?
- SIMD: Single Instruction, Multiple Data
- Example: (4, 12, 8, 7) += (2, 9, 15, 3) gives (6, 21, 23, 10)
18
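Read Mapping: Phase 2
[Figure: S-W matrix for A C T A G A C T T G and TCCAGTTCCAGT with the current, previous and penultimate diagonals marked; the vector T G A C C T is compared element-wise, giving the signs + - - - +]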
19
Read Mapping: Phase 2
- Match scores typically use a scoring matrix: ScoringMatrix[SeqA[i]][SeqB[j]]
- But this doesn’t scale: individual cell scores become a bottleneck
- We can precompute a ‘query profile’ (expensive), or…
- If we only care about strict match/mismatch, we can use logical bit-wise operations
- SIMD instructions work here (fully parallel)
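To show what "logical bit-wise operations" for strict match/mismatch can look like, here is a small sketch that packs bases into 2 bits each and compares whole words with XOR. The packing scheme and the names `pack` and `mismatch_mask` are assumptions for illustration; the slides do not specify how SHRiMP encodes the comparison.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding

def pack(seq):
    """Pack a sequence into one integer, 2 bits per base (base k at bits 2k, 2k+1)."""
    word = 0
    for k, base in enumerate(seq):
        word |= CODE[base] << (2 * k)
    return word

def mismatch_mask(x, y, length):
    """Return a word with bit 2k set exactly when base k differs.
    One XOR compares all bases at once; no scoring-matrix lookup is needed."""
    diff = x ^ y
    low_bits = (4 ** length - 1) // 3      # 0b0101...01, one bit per base
    return (diff | (diff >> 1)) & low_bits

mask = mismatch_mask(pack("ACGT"), pack("ACCT"), 4)
print(bin(mask), bin(mask).count("1"))
```

The same XOR, OR and AND steps map directly onto vector registers, which is why the comparison stays fully parallel.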
20
Read Mapping: Phase 2
- Results:
  - Our vectorized S-W is as fast as, or faster than, other very complicated SIMD implementations
  - 500 million+ matrix cells/second on Core 2 machines
  - Even with small seeds, S-W accounts for at most half of the total run time
21
Read Mapping: Phase 3
- Recap:
  - K-mer scan selects areas of reasonable similarity
  - Vectorized S-W (dis)confirms similarity
  - Best ‘n’ hits per read are given a full alignment with backtrace
22
Read Mapping: Phase 3
- Letter-space alignments are simple: k-mer scan, vectorized S-W, full S-W in letters, give the user pretty output
- What about AB SOLiD colour-space?
  - Biologists want to see A, C, G, T, not 0, 1, 2, 3…
  - Dealing with strange SOLiD properties…
- Our solution: k-mer scan and vectorized S-W in colour-space, then full S-W in letter-space, but we can’t just convert
23
AB Di-base Reads
- We think in terms of nucleotides: A, C, G and T
- AB’s NGS machine outputs 4 colours
- One colour per pair of bases:
    Bases:  T T G A G C G T T C
    Read:   T 0 1 2 2 3 3 1 0 2
- Colour for each pair of adjacent bases:
        T  G  C  A
    T   0  1  2  3
    G   1  0  3  2
    C   2  3  0  1
    A   3  2  1  0
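The colour table above has a convenient property: if the bases are numbered A=0, C=1, G=2, T=3, the colour of a pair is the XOR of the two numbers. The sketch below uses that observation to encode the example sequence; the numbering and the function name `encode` are choices made for this illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # numbering chosen so that colour = XOR

def encode(bases):
    """Return the colour string for a base sequence: one colour per adjacent pair."""
    return "".join(str(CODE[x] ^ CODE[y]) for x, y in zip(bases, bases[1:]))

print(encode("TTGAGCGTTC"))
```

Ten bases give nine colours, which matches the example read on the slide (the leading T is kept as a letter in the read itself).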
24
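AB Di-base Reads
[Figure: the four bases A, G, C and T drawn as nodes, with transitions between them labelled by colour (0 stays on the same base; 1, 2 and 3 move to another base), next to the same colour table as on the previous slide]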
25
SOLiD Translations
- Given the following read, there are 4 translations (we need an initial base):
    012233102
    AACTCGCAAG
    CCAGATACCT
    GGTCTATGGA
    TTGAGCGTTC
26
SOLiD Translations
- Reads begin with a known primer (‘T’), which picks out the translation starting with T:
    012233102
    AACTCGCAAG
    CCAGATACCT
    GGTCTATGGA
    TTGAGCGTTC
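The four translations can be reproduced with a short decoder. As before, this is only a sketch: it assumes the numbering A=0, C=1, G=2, T=3 (under which each colour is the XOR of its two bases), and the name `decode` is made up for the example.

```python
BASES = "ACGT"                      # index = assumed 2-bit code of the base

def decode(colours, first_base):
    """Translate a colour string to letters, starting from first_base.
    Each colour, XORed with the previous base, gives the next base."""
    code = BASES.index(first_base)
    out = [first_base]
    for c in colours:
        code ^= int(c)
        out.append(BASES[code])
    return "".join(out)

translations = [decode("012233102", b) for b in BASES]
for t in translations:
    print(t)
```

Only the translation that starts with the primer base, `decode("012233102", "T")`, corresponds to the sequenced molecule.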
27
SOLiD Translations
- What happens if a read error occurs?
- The right translation was: TTGAGCGTTC
    010233102
    AACCTATGGA
    CCAAGCGTTC
    GGTTCGCAAG
    TTGGATACCT
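A quick check of the example shows why a single colour error is so disruptive. The sketch below reuses the same hypothetical `decode` helper (assuming A=0, C=1, G=2, T=3) and compares the translations of the correct and the erroneous read.

```python
BASES = "ACGT"                      # index = assumed 2-bit code of the base

def decode(colours, first_base):
    """Translate a colour string to letters, starting from first_base."""
    code = BASES.index(first_base)
    out = [first_base]
    for c in colours:
        code ^= int(c)
        out.append(BASES[code])
    return "".join(out)

good = decode("012233102", "T")     # correct read, primer T
bad = decode("010233102", "T")      # third colour misread as 0 instead of 2
wrong_positions = [i for i, (x, y) in enumerate(zip(good, bad)) if x != y]
print(good, bad, wrong_positions)
print(decode("010233102", "C")[3:], good[3:])
```

Every base after the error is wrong in the primer-based translation, but the correct letters reappear in one of the other three translations (here the one starting with C). That is the "change of frame" the next slides exploit.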
28
Colour-space Smith-Waterman
- There are four unique translations for every read
- An error will cause us to change frames (different translation)
- Why not do an S-W across all four letter-space translations, with some error penalty?
29
Colour-space Smith-Waterman
- Think of 4 S-W matrices stacked above one another
- If we have 1 read error, but an otherwise perfect match, we’ll use 2 matrices
[Figure: four stacked S-W matrices labelled Frame 1 to Frame 4, with Genome and Read axes and a further label “Letter”]
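One way to read the "stacked matrices" idea is as a Smith-Waterman with an extra frame dimension: each of the four translations gets its own matrix, and moving from one matrix to another costs an error penalty. The sketch below is an interpretation under stated assumptions (linear gaps, invented scores, frame switches allowed only on diagonal moves, hypothetical function names), not SHRiMP's actual recurrence.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding: colour = XOR of bases

def frame_translations(colours, primer="T"):
    """Four letter-space translations of the read (as base codes), one per frame.
    Frame 0 follows the primer; frame f is frame 0 with every base XORed by f."""
    frames = []
    for offset in range(4):
        code = CODE[primer] ^ offset
        seq = []
        for c in colours:
            code ^= int(c)
            seq.append(code)
        frames.append(seq)
    return frames

def colour_sw_score(colours, genome, match=2, mismatch=-3, gap=-4, error=-3):
    """Best local score over four stacked matrices H[frame][i][j]."""
    frames = frame_translations(colours)
    g = [CODE[b] for b in genome]
    n, m = len(colours), len(g)
    H = [[[0] * (m + 1) for _ in range(n + 1)] for _ in range(4)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            for f in range(4):
                sub = match if frames[f][i - 1] == g[j - 1] else mismatch
                # Arrive diagonally from any frame; switching frames means
                # we assume a colour error at read position i.
                diag = max(H[p][i - 1][j - 1] + (0 if p == f else error)
                           for p in range(4))
                score = max(0, diag + sub,
                            H[f][i - 1][j] + gap, H[f][i][j - 1] + gap)
                H[f][i][j] = score
                best = max(best, score)
    return best

print(colour_sw_score("012233102", "TGAGCGTTC"))   # clean read
print(colour_sw_score("010233102", "TGAGCGTTC"))   # one colour error
```

With these scores the clean read aligns entirely in one frame, while the read with one colour error uses two frames and pays the error penalty once, in line with the "we’ll use 2 matrices" remark.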
30
Colour-space Smith-Waterman
- End result:
    G: 1123724  TA-ACCACGGTCACACTTGCATCAC  1123701
                || |||||||||| |||X|||||||
    T:          TACACCACGGTCAGACTtGCATCAC
    R:       0 T0311101130121221211313211  24
- The read colour under the X (a ‘1’) should be ‘0’
31
Statistics
- After reads are mapped, mull over the results
- For each read:
  - P(hit by pure chance, i.e. not a valid hit)
  - P(hit generated by genome, i.e. a valid hit)
  - P(hit is best of all for this particular read)
32
Results: Speed
- Simple k-mer scan is very fast
  - Important when seeds are bigger (less S-W)
- Vectorized S-W is fast
  - Important when seeds are smaller (more S-W)
- Generally well-balanced run time
  - Big seeds make the k-mer scan the bottleneck (this is good: it’s really fast)
- Easily parallelised: just divide the reads over CPUs
33
Results
- C. savignyi: 22M 25bp reads, 173Mb genome
- S-W alone would take at least a few thousand CPU days
- SHRiMP runs in about 50 CPU days with fairly small seeds (length 8, weight 7)
- SNP, indel and error rates correspond well to known averages for this organism
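The "few thousand CPU days" figure can be sanity-checked with back-of-the-envelope arithmetic. The calculation below assumes a full S-W of every read against the whole genome on a single strand, at the 500 million cells/second quoted earlier for the vectorized code; these assumptions are ours, not stated on the slide.

```python
reads = 22_000_000            # 22M reads
read_length = 25              # bp
genome = 173_000_000          # 173Mb genome
cells_per_second = 500_000_000

cells = reads * read_length * genome          # one S-W cell per (read base, genome base)
cpu_days = cells / cells_per_second / 86_400  # 86,400 seconds per day
print(round(cpu_days))
```

That comes to roughly 2,200 CPU days, and about twice that if both strands are searched, so the estimate on the slide is of the right order.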