Published by Berniece Dennis. Modified over 8 years ago.
1
Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA
Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig
Presenter: Erkan Okuyan
2
Motivation
- Massive amounts of sequencing data (Illumina, 454, SOLiD): short reads with a high error rate
- Assembly is sensitive to errors in reads, so sequencing errors need to be corrected
- At this data volume, error correction is computationally demanding
3
Definitions
- Let $R = \{r_1, r_2, \ldots, r_k\}$ be a set of $k$ reads with $|r_i| = L$
- Let $r_i \in \{A, C, G, T\}^L$ for all $1 \le i \le k$
- Let $m$ (multiplicity) and $l$ (length) satisfy $m > 1$ and $l < L$
Definition 1 (Solid and Weak): An l-tuple (a DNA string of length $l$) is called solid with respect to $R$ and $m$ if it is a substring of at least $m$ reads in $R$, and weak otherwise.
– An l-tuple replicated in $m$ reads is probably a correct l-tuple
Definition 2 (Spectrum): The spectrum of $R$ with respect to $m$ and $l$, denoted $T_{m,l}(R)$, is the set of all solid l-tuples with respect to $R$ and $m$.
– The spectrum $T_{m,l}(R)$ is taken to be the set of all correct l-tuples
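Definitions 1 and 2 can be illustrated with a small sketch. It is written in Python for readability and is not the authors' implementation; the function name `spectrum` and the toy reads are assumptions made for this example.

```python
from collections import Counter

def spectrum(reads, l, m):
    """Return the set of solid l-tuples: those that occur in at least m reads."""
    counts = Counter()
    for read in reads:
        # A set ensures an l-tuple repeated inside one read is counted once,
        # matching "substring of at least m reads".
        tuples_in_read = {read[i:i + l] for i in range(len(read) - l + 1)}
        counts.update(tuples_in_read)
    return {t for t, c in counts.items() if c >= m}

# Toy input (hypothetical): three reads of length L = 5.
reads = ["ACGTA", "ACGTT", "CCGTA"]
solid = spectrum(reads, l=3, m=2)
```

Here `GTT` and `CCG` each occur in only one read, so they are weak and stay out of the spectrum.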
4
Definitions
- Let $R = \{r_1, r_2, \ldots, r_k\}$ be a set of $k$ reads with $|r_i| = L$
- Let $r_i \in \{A, C, G, T\}^L$ for all $1 \le i \le k$
- Let $m$ (multiplicity) and $l$ (length) satisfy $m > 1$ and $l < L$
Definition 3 (T-string): A DNA string $s$ is called a $T_{m,l}(R)$-string if every l-tuple in $s$ is an element of $T_{m,l}(R)$.
Definition 4 (SAP, the Spectral Alignment Problem): Given a DNA string $s$ and a spectrum $T_{m,l}(R)$, find a $T_{m,l}(R)$-string $s^*$ in the set of all $T_{m,l}(R)$-strings that minimizes the distance function $d(s, s^*)$.
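A minimal sketch of Definitions 3 and 4 follows. The slides leave the distance function $d$ open; this example assumes Hamming distance and only searches candidates that differ from $s$ by a single substitution, so it is a brute-force illustration, not a general SAP solver. The function names are hypothetical.

```python
def is_t_string(s, spectrum, l):
    """True if every l-tuple of s is an element of the spectrum."""
    return all(s[i:i + l] in spectrum for i in range(len(s) - l + 1))

def closest_t_string_single_sub(s, spectrum, l):
    """Return s if it is already a T-string, otherwise the first T-string
    found at Hamming distance 1, otherwise None (assumed restriction)."""
    if is_t_string(s, spectrum, l):
        return s
    for pos in range(len(s)):
        for base in "ACGT":
            if base == s[pos]:
                continue
            candidate = s[:pos] + base + s[pos + 1:]
            if is_t_string(candidate, spectrum, l):
                return candidate
    return None

# Toy spectrum (hypothetical) with l = 3.
toy_spectrum = {"ACG", "CGT", "GTA"}
fixed = closest_t_string_single_sub("ACGTT", toy_spectrum, l=3)
```

In this toy case the weak tuple `GTT` is repaired by changing the last base, giving `ACGTA`.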
5
CUDA (Compute Unified Device Architecture)
Serial Code (host)
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);
Integrated host+device application program
– Serial or modestly parallel parts in host C code
– Highly parallel parts in device SPMD kernel C code
6
CUDA Execution
- A GPU device
– Is a coprocessor to the CPU (the host)
– Has its own DRAM (device memory)
– Runs many threads in parallel
- Data-parallel portions of an application are expressed as device kernels, which run on many threads
- Differences between GPU and CPU threads
– GPU threads are extremely lightweight
– Very little creation overhead
– A GPU needs thousands of threads for full efficiency
7
Parallel Error Correction with CUDA
- Each kernel thread is responsible for correcting a single read $r_i$
- Voting-based algorithm
– First step: calculation of the voting matrix
– Second step: single-mutation fixing, trimming, or discarding
8
Step1: Voting Matrix Calculation
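The slide's figure did not survive, so the following is only one plausible reading of the voting step, chosen to be consistent with the $3 \cdot p \cdot l$ term in the membership-test count given under Fast Membership Tests: each weak l-tuple tries the 3 alternative bases at each of its $l$ positions and casts a vote whenever the mutated tuple is solid. The function name, the matrix layout, and the toy spectrum are assumptions, and Python stands in for the per-thread CUDA kernel.

```python
BASES = "ACGT"

def voting_matrix(read, spectrum, l):
    """votes[pos][b] counts how often substituting BASES[b] at read position
    pos turns a weak l-tuple covering pos into a solid one."""
    votes = [[0] * len(BASES) for _ in read]
    for start in range(len(read) - l + 1):
        tup = read[start:start + l]
        if tup in spectrum:
            continue  # solid tuples cast no votes
        for offset in range(l):
            for b, base in enumerate(BASES):
                if base == tup[offset]:
                    continue
                mutated = tup[:offset] + base + tup[offset + 1:]
                if mutated in spectrum:  # one membership test per candidate
                    votes[start + offset][b] += 1
    return votes

# Toy spectrum (hypothetical) with l = 3.
toy_spectrum = {"ACG", "CGT", "GTA"}
votes = voting_matrix("ACGTT", toy_spectrum, l=3)
```

For this read the only vote is for base `A` at the last position, which is the substitution that would make the read a T-string.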
9
Step2: Fixing/Trimming/Discarding Reads
10
Fast Membership Tests
- The first algorithm (kernel) dominates the runtime
– $(L - l) \cdot (l + 3 \cdot p \cdot l)$ membership tests are required, where $p$ is the number of l-tuples that do not belong to the spectrum
– A space-efficient Bloom filter speeds up membership tests against the spectrum
- Compute the Bloom filter on the CPU and store it in texture memory (a fast read-only cache) on the device
11
Bloom Filter
- Probabilistic data structure
– No false negatives
– Small percentage of false positives
– Space efficient and fast
- Uses a bit array $B$ of length $m$ and $d$ hash functions $h_1, \ldots, h_d$
– To insert $x$, set $B[h_i(x)] = 1$ for $i = 1, \ldots, d$
– To query $y$, check whether $B[h_i(y)] = 1$ for all $i = 1, \ldots, d$
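The insert and query rules above can be sketched as follows. The hash family (SHA-256 salted with the hash index) is an assumption made to keep the example self-contained; the slides do not say which hash functions were used, and a GPU implementation would use something much cheaper.

```python
import hashlib

class BloomFilter:
    """Bit array B of length m with d hash functions."""

    def __init__(self, m, d):
        self.m = m
        self.d = d
        self.bits = [0] * m

    def _indexes(self, item):
        # Hypothetical hash family h_1..h_d: SHA-256 salted with i.
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def query(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter(m=1024, d=3)
for tup in ["ACG", "CGT", "GTA"]:
    bf.insert(tup)
```

Every inserted l-tuple is guaranteed to query as present; a tuple that was never inserted usually, but not always, queries as absent.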
12
Bloom Filter Example
- a and b are inserted into a Bloom filter with $m = 10$ bits, $n = 2$ inserted elements, and $d = 3$ hash functions
- A query for c returns false, since some of its bits are 0
- A query for d returns true, since all of its bits are 1 (a false positive)
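How likely is a false positive such as the query for d? The slides give no figure; the sketch below uses the standard approximation $(1 - e^{-dn/m})^d$, which assumes independent, uniformly distributed hash functions. The function name is hypothetical.

```python
import math

def false_positive_rate(m, n, d):
    """Approximate Bloom filter false-positive probability for m bits,
    n inserted elements, and d hash functions."""
    return (1.0 - math.exp(-d * n / m)) ** d

# Parameters of the example above: m = 10, n = 2, d = 3.
rate = false_positive_rate(m=10, n=2, d=3)
```

For the toy parameters this comes to roughly 9%. A practical filter keeps the rate low by making $m$ much larger relative to $n$.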
13
Overall Algorithm
1) Pre-computation on the CPU: program the Bloom filter bit-vector (via a counting Bloom filter) by hashing each l-tuple present in the reads of R.
2) Data transfer from CPU to GPU: allocate device memory and transfer the Bloom filter and the reads.
3) Execute the CUDA kernel.
4) Data transfer from GPU to CPU: transfer the set of corrected/trimmed reads.
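Step 1 can be sketched as a two-pass procedure. The slides do not detail how the counting Bloom filter becomes the final bit-vector, so this is an assumed reading: counters first estimate in how many reads each l-tuple occurs, then only tuples whose estimate reaches the multiplicity threshold are written into the plain bit-vector. All names and the SHA-256 hash family are hypothetical, and steps 2 to 4 (the GPU transfers and the kernel) are not modeled.

```python
import hashlib

def hash_indexes(item, num_bits, d):
    """Hypothetical hash family: SHA-256 salted with the hash index."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % num_bits
        for i in range(d)
    ]

def l_tuples(read, l):
    return {read[i:i + l] for i in range(len(read) - l + 1)}

def build_spectrum_filter(reads, l, multiplicity, num_bits, d):
    # Pass 1: counting Bloom filter; each read counts once per distinct l-tuple.
    counters = [0] * num_bits
    for read in reads:
        for tup in l_tuples(read, l):
            for idx in set(hash_indexes(tup, num_bits, d)):
                counters[idx] += 1
    # Pass 2: set bits only for tuples whose estimated count is high enough.
    # The estimate never undercounts, so no solid tuple is lost.
    bits = [0] * num_bits
    for read in reads:
        for tup in l_tuples(read, l):
            idxs = hash_indexes(tup, num_bits, d)
            if min(counters[i] for i in idxs) >= multiplicity:
                for i in idxs:
                    bits[i] = 1
    return bits

def query(bits, tup, d):
    return all(bits[i] for i in hash_indexes(tup, len(bits), d))

reads = ["ACGTA", "ACGTT", "CCGTA"]  # toy input (hypothetical)
bits = build_spectrum_filter(reads, l=3, multiplicity=2, num_bits=4096, d=3)
```

The resulting `bits` list plays the role of the bit-vector that would be copied into texture memory in step 2.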
14
Performance Evaluation
- System parameters
– NVIDIA GeForce GTX 280 with 1 GB memory
– AMD Opteron dual-core 2.2 GHz CPU with 2 GB memory
- Datasets
– Artificial sets (1%, 2%, 3% error rates): yeast chromosomes (S.cer5, S.cer7) and bacterial genomes (H.inf, E.col)
– Real set: Staphylococcus aureus strain MW2 (H.Aci) (error rate ~1%)
15
Performance Evaluation
17
Discussion/Conclusion (GOOD)
- Runtime savings of 10 to 19 times are reported.
- Bigger datasets are not an issue as long as the Bloom filter fits in texture memory (the reads can be handled in more than one load/correct round).
- Further parallelization across distributed-memory GPU farms is possible.
18
Discussion/Conclusion (BAD)
- Does not exploit the fast shared memory within thread blocks: each read $r_i$ does not have to be handled by a single thread, and the voting matrix can be constructed in parallel, so further speed-up is possible.
- The predetermined read length $L$ is somewhat restrictive.
19
Thank You