Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA
Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig


Accelerating Error Correction in High-Throughput Short-Read DNA Sequencing Data with CUDA
Haixiang Shi, Bertil Schmidt, Weiguo Liu, Wolfgang Müller-Wittig
Presenter: Erkan Okuyan

Motivation
Massive amounts of sequencing data from high-throughput platforms (Illumina, SOLiD): short reads with relatively high error rates.
Assembly is sensitive to errors in the reads, so sequencing errors need to be corrected before assembly.
The size of the error-correction problem makes it computationally demanding.

Definitions
- Let R = {r_1, r_2, …, r_k} be a set of k reads with |r_i| = L.
- Let r_i be in {A, C, G, T}^L for all 1 ≤ i ≤ k.
- Let m (multiplicity) and l (length) satisfy m > 1 and l < L.
Definition 1 (Solid and Weak): An l-tuple (a DNA string of length l) is called solid with respect to R and m if it is a substring of at least m reads in R, and weak otherwise.
–An l-tuple that occurs in at least m reads is presumed to be correct.
Definition 2 (Spectrum): The spectrum of R with respect to m and l, denoted T_{m,l}(R), is the set of all solid l-tuples with respect to R and m.
–The spectrum T_{m,l}(R) serves as the set of presumed-correct l-tuples.
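As a small illustrative example (not from the slides): take l = 3, m = 2, and R = {ACGT, CGTA, TTAC} with L = 4. The 3-tuple CGT is a substring of both ACGT and CGTA, so it is solid and belongs to T_{2,3}(R); the 3-tuple TTA occurs only in TTAC, so it is weak.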

Definitions
- Let R = {r_1, r_2, …, r_k} be a set of k reads with |r_i| = L.
- Let r_i be in {A, C, G, T}^L for all 1 ≤ i ≤ k.
- Let m (multiplicity) and l (length) satisfy m > 1 and l < L.
Definition 3 (T-string): A DNA string s is called a T_{m,l}(R)-string if every l-tuple in s is an element of T_{m,l}(R).
Definition 4 (SAP, the spectral alignment problem): Given a DNA string s and the spectrum T_{m,l}(R), find a T_{m,l}(R)-string s* that minimizes the distance function d(s, s*).

CUDA (Compute Unified Device Architecture)
Serial Code (host)
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);
Integrated host+device application program
–Serial or modestly parallel parts in host C code
–Highly parallel parts in device SPMD kernel C code
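To make the host/device split concrete, here is a minimal, self-contained CUDA sketch (not from the paper; the kernel and variable names are made up): serial host code surrounds the launch of a data-parallel device kernel.

// Minimal illustrative sketch (hypothetical names, not the paper's code).
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
    if (i < n) data[i] *= factor;                    // data-parallel device work
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));                    // device memory
    // ... serial host code: read input, copy it to d_data with cudaMemcpy ...
    scaleKernel<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);    // parallel kernel (device)
    cudaDeviceSynchronize();                                   // back to serial host code
    cudaFree(d_data);
    return 0;
}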

CUDA Execution
A GPU device
–Is a coprocessor to the CPU (host)
–Has its own DRAM (device memory)
–Runs many threads in parallel
Data-parallel portions of an application are expressed as device kernels, which run on many threads.
Differences between GPU and CPU threads
–GPU threads are extremely lightweight
–Very little creation overhead
–The GPU needs thousands of threads for full efficiency

Parallel Error Correction with CUDA
Each kernel thread is responsible for the correction of a single read r_i.
Voting-based algorithm
–First step: calculation of the voting matrix
–Second step: single-mutation fixing/trimming/discarding

Step 1: Voting Matrix Calculation
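The slide's original figure is not reproduced in this transcript. As a rough, purely illustrative sketch of a per-thread voting step consistent with the definitions above (the names, the isSolid placeholder, and the MAX_L bound are assumptions, not the paper's kernel): every weak l-tuple in a read votes for the single-base substitutions that would turn it into a solid tuple.

// Illustrative sketch only: one thread votes for single-base fixes in its read.
#define MAX_L 64   // assumed maximum read length

// Placeholder for the spectrum membership test; the real version would query the
// Bloom filter described on later slides. This stub exists only so the sketch compiles.
__device__ bool isSolid(const char *tuple, int l) {
    (void)tuple; (void)l;
    return false;
}

__device__ void voteRead(const char *read, int L, int l,
                         unsigned char vote[MAX_L][4]) {
    const char bases[4] = {'A', 'C', 'G', 'T'};
    char tuple[MAX_L];
    for (int j = 0; j + l <= L; ++j) {                // every l-tuple of the read
        for (int t = 0; t < l; ++t) tuple[t] = read[j + t];
        if (isSolid(tuple, l)) continue;              // solid tuples need no votes
        for (int p = 0; p < l; ++p) {                 // try every single-base mutation
            char original = tuple[p];
            for (int b = 0; b < 4; ++b) {
                if (bases[b] == original) continue;
                tuple[p] = bases[b];
                if (isSolid(tuple, l))                // mutation makes the tuple solid:
                    vote[j + p][b]++;                 // vote for base b at read position j+p
            }
            tuple[p] = original;                      // restore before the next position
        }
    }
}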

Step 2: Fixing/Trimming/Discarding Reads
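Again, the slide's figure is missing from the transcript. The fragment below only sketches the kind of decision this step makes, reusing the hypothetical voting matrix from the previous sketch; the names and the trimming fallback are assumptions for illustration, not the paper's logic.

// Illustrative continuation of the previous sketch (hypothetical names):
// apply the highest-voted single mutation if it turns the read into a
// T_{m,l}(R)-string; otherwise the read would be trimmed or discarded.
__device__ bool isTString(const char *read, int L, int l) {   // every l-tuple solid?
    for (int j = 0; j + l <= L; ++j)
        if (!isSolid(read + j, l)) return false;              // isSolid: placeholder from above
    return true;
}

__device__ int fixRead(char *read, int L, int l,
                       unsigned char vote[MAX_L][4]) {
    const char bases[4] = {'A', 'C', 'G', 'T'};
    int bestPos = -1, bestBase = -1, bestVotes = 0;
    for (int p = 0; p < L; ++p)                       // find the strongest vote
        for (int b = 0; b < 4; ++b)
            if (vote[p][b] > bestVotes) {
                bestVotes = vote[p][b]; bestPos = p; bestBase = b;
            }
    if (bestPos >= 0) {
        char saved = read[bestPos];
        read[bestPos] = bases[bestBase];              // apply the single-mutation fix
        if (isTString(read, L, l)) return 1;          // read fixed
        read[bestPos] = saved;                        // undo if it does not help
    }
    // Fallback (not shown): trim the read to a longest solid prefix, or discard it.
    return 0;
}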

Fast Membership Tests
The first algorithm (kernel) dominates the runtime:
–(L-l)·(l + 3·p·l) membership tests are required per read, where p is the number of l-tuples that do not belong to the spectrum.
A space-efficient Bloom filter speeds up the spectrum membership test.
The Bloom filter is computed on the CPU and stored in texture memory (a fast read-only cache) on the device.

Bloom Filter
Probabilistic data structure
–No false negatives
–A small percentage of false positives
–Space efficient and fast
Uses a bit array B of length m and d hash functions
–To insert x, set B[h_i(x)] = 1 for i = 1, …, d
–To query y, check whether B[h_i(y)] = 1 for all i = 1, …, d
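A minimal host-side sketch of the insert/query operations just described is given below; the bit-array layout and the FNV-style hash family are assumptions for illustration, not the hash functions used in the paper.

// Minimal Bloom filter sketch matching the description above (host-side C++).
#include <vector>
#include <cstdint>
#include <string>

struct BloomFilter {
    std::vector<uint8_t> B;   // bit array of length m (one byte per bit, for simplicity)
    int d;                    // number of hash functions

    BloomFilter(size_t m, int d) : B(m, 0), d(d) {}

    size_t h(const std::string &x, int i) const {     // i-th hash function (assumed family)
        uint64_t v = 1469598103934665603ull ^ (uint64_t)i;
        for (char c : x) { v ^= (uint8_t)c; v *= 1099511628211ull; }   // FNV-1a style mixing
        return v % B.size();
    }
    void insert(const std::string &x) {               // set B[h_i(x)] = 1 for all i
        for (int i = 0; i < d; ++i) B[h(x, i)] = 1;
    }
    bool query(const std::string &y) const {          // all B[h_i(y)] == 1 ?
        for (int i = 0; i < d; ++i) if (!B[h(y, i)]) return false;
        return true;                                  // "maybe present": false positives possible
    }
};

A query can still return true for an element that was never inserted when all of its d probed bits happen to have been set by other elements; that is the false-positive case illustrated on the next slide.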

Bloom Filter Example
a and b are inserted into a Bloom filter with m = 10 bits, n = 2 inserted elements, and d = 3 hash functions.
A query for c returns false, since some of its bits are 0.
A query for d returns true, since all of its bits are 1 (a false positive).

Overall Algorithm
1) Pre-computation on the CPU: program the Bloom filter (a counting Bloom filter) bit-vector by hashing each l-tuple present in the read set R.
2) Data transfer from CPU to GPU: allocate memory and transfer the Bloom filter and the reads.
3) Execute the CUDA kernel.
4) Data transfer from GPU to CPU: transfer the set of corrected/trimmed reads.
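A rough host-side sketch of steps 2 to 4 is shown below. The function and variable names, the flat read layout, and the use of the legacy texture-reference API (the texture binding mechanism available when the paper was written; deprecated in recent CUDA releases) are assumptions for illustration, not the authors' code.

// Illustrative host-side flow for steps 2-4 (hypothetical names and layout).
#include <cuda_runtime.h>
#include <cstdint>

texture<uint32_t, 1, cudaReadModeElementType> bloomTex;   // spectrum Bloom filter, read-only cache

__global__ void correctReadsKernel(char *reads, int numReads, int L, int l);  // defined elsewhere

void correctOnGpu(const uint32_t *bloomBits, size_t bloomWords,
                  char *reads, int numReads, int L, int l) {
    // 2) Allocate device memory and transfer the Bloom filter and the reads.
    uint32_t *d_bloom;  char *d_reads;
    cudaMalloc(&d_bloom, bloomWords * sizeof(uint32_t));
    cudaMalloc(&d_reads, (size_t)numReads * L);
    cudaMemcpy(d_bloom, bloomBits, bloomWords * sizeof(uint32_t), cudaMemcpyHostToDevice);
    cudaMemcpy(d_reads, reads, (size_t)numReads * L, cudaMemcpyHostToDevice);
    cudaBindTexture(0, bloomTex, d_bloom, bloomWords * sizeof(uint32_t));

    // 3) Execute the CUDA kernel, one thread per read.
    correctReadsKernel<<<(numReads + 255) / 256, 256>>>(d_reads, numReads, L, l);

    // 4) Transfer the corrected/trimmed reads back to the host.
    cudaMemcpy(reads, d_reads, (size_t)numReads * L, cudaMemcpyDeviceToHost);
    cudaUnbindTexture(bloomTex);
    cudaFree(d_reads);
    cudaFree(d_bloom);
}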

Performance Evaluation
System parameters
–NVIDIA GeForce GTX 280 with 1 GB memory
–AMD Opteron dual-core 2.2 GHz CPU with 2 GB memory
Datasets
–Artificial sets (1%, 2%, 3% error rates): yeast chromosomes (S.cer5, S.cer7), bacterial genomes (H.inf, E.col)
–Real set: Staphylococcus aureus strain MW2 (H.Aci), error rate ~1%

Performance Evaluation

Discussion/Conclusion (GOOD)
Runtime savings (speed-ups) of 10 to 19 times are reported.
Larger datasets are not an issue as long as the Bloom filter fits in texture memory (reads can be processed in more than one load/correct round).
It is possible to parallelize even further on distributed-memory GPU farms.

Discussion/Conclusion (BAD)
The implementation does not exploit the fast shared memory within thread blocks (i.e., each read r_i does not really have to be handled by a single thread; the voting matrix could be constructed in parallel), so further speed-up is possible.
The predetermined read length L is somewhat restrictive.

Thank You