GPU Acceleration of Pyrosequencing Noise Removal
Yang Gao, Jason D. Bakos
Dept. of Computer Science and Engineering, University of South Carolina
Heterogeneous and Reconfigurable Computing Lab (HeRC)


GPU Acceleration of Pyrosequencing Noise Removal Dept. of Computer Science and Engineering University of South Carolina Yang Gao, Jason D. Bakos Heterogeneous and Reconfigurable Computing Lab (HeRC) SAAHPC’12

Agenda
- Background
- Needleman-Wunsch
- GPU Implementation
- Optimization steps
- Results

Symposium on Application Accelerators in High-Performance Computing

Roche 454 GS FLX Titanium XL+
- Typical throughput: 700 Mb
- Run time: 23 hours
- Read length: up to 1,000 bp
- Reads per run: ~1,000,000 shotgun

From Genomics to Metagenomics

Why AmpliconNoise?
Pyrosequencing in metagenomics has no consensus sequences => overestimation of the number of operational taxonomic units (OTUs).

C. Quince, A. Lanzén, T. Curtis, R. Davenport, N. Hall, I. Head, L. Read, and W. Sloan, "Accurate determination of microbial diversity from 454 pyrosequencing data," Nature Methods, vol. 6, no. 9, pp. 639–641, 2009.

SeqDist
Clustering method to "merge" sequences with minor differences.
- How to define the distance between two potential sequences?
- Pairwise Needleman-Wunsch, and why?

Sequence alignment between two short sequences:
sequence 1: A G G T C C A G C A T
sequence 2: A C C T A G C C A A T

With C short sequences: pairwise sequence distance computation.

Agenda
- Background
- Needleman-Wunsch
- GPU Implementation
- Optimization steps
- Results

Needleman-Wunsch
Based on penalties for:
- Adding gaps to sequence 1
- Adding gaps to sequence 2
- Character substitutions (based on table)

sequence 1: A _ _ _ _ G G T C C A G C A T
sequence 2: A C C T A G C C A A T

sequence 1: A G G T C C A G C A T
sequence 2: A _ _ _ C C T A G C C A A T

sequence 1: A G G T C C A G C A T
sequence 2: A C C T A G C C A A T

Substitution score table (rows/columns A, G, C, T):
A:
G:  7 -5 -3
C: -5  9  0
T: -4 -3  0  8

Needleman-Wunsch
Construct a score matrix in which each cell (i,j) represents the score for a partial alignment state. Sequence 1 (A G G T C C A G C A T) runs along the columns; sequence 2 (A C C T A G C C A A T) runs down the rows.

For a 2x2 neighborhood
  A B
  C D
D is the best score among:
1. Add a gap to sequence 1 from the B state
2. Add a gap to sequence 2 from the C state
3. Substitute from the A state
The final score is in the lower-right cell.
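The recurrence above can be sketched on the CPU as follows. This is an illustrative reference implementation, not the AmpliconNoise kernel: it uses a simple match/mismatch scheme instead of the substitution table, and the function name `nwScore` and its default penalties are invented for the example.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Needleman-Wunsch score-matrix fill as described above. Cell (i,j)
// holds the best score for aligning prefixes of the two sequences;
// the final score lands in the lower-right cell. The match, mismatch,
// and gap values are placeholders, not AmpliconNoise's parameters.
int nwScore(const std::string& s1, const std::string& s2,
            int match = 2, int mismatch = -1, int gap = -2) {
    int n = (int)s1.size(), m = (int)s2.size();
    std::vector<std::vector<int>> H(m + 1, std::vector<int>(n + 1, 0));
    for (int j = 1; j <= n; ++j) H[0][j] = j * gap;  // leading gaps in sequence 2
    for (int i = 1; i <= m; ++i) H[i][0] = i * gap;  // leading gaps in sequence 1
    for (int i = 1; i <= m; ++i) {
        for (int j = 1; j <= n; ++j) {
            int sub = (s1[j - 1] == s2[i - 1]) ? match : mismatch;
            H[i][j] = std::max({H[i - 1][j - 1] + sub,  // substitute from A (diagonal)
                                H[i - 1][j] + gap,      // gap in sequence 1 from B (upper)
                                H[i][j - 1] + gap});    // gap in sequence 2 from C (left)
        }
    }
    return H[m][n];  // final score, lower-right cell
}
```

With the classic parameters match = 1, mismatch = -1, gap = -1, `nwScore("GATTACA", "GCATGCU", 1, -1, -1)` yields 0.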

Needleman-Wunsch
Compute a move matrix, which records which option was chosen for each cell (L: left, U: upper, D: diagonal). Trace back through it to get the alignment length; AmpliconNoise divides the score by the alignment length.

     A G G T C C A G C A T
   D L L L L L L L L L L L
A  U D D L L L L L L L L L
C  U U D D D L L L L L L L
C  U U D D D D L L L L L L
T  U U U D D D L L L L L L
A  U U U U U U L L L L L L
G  U U U U U U D L L L L L
C  U U U U D D D D L L L L
C  U U U U U D D D L L L L
A  U U U U U D D D L L D L
A  U U U U U U U D L L D D
T  U U U U U U U U U D D D

Along the traced path, D cells are matches or substitutions, while U and L cells are gaps in sequence 1 or sequence 2, respectively.
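The trace-back step can be sketched as a short walk over that move matrix. A minimal CPU illustration; the L/U/D encoding follows the slide, while the function name and the caller-supplied matrix layout are assumptions for the example.

```cpp
#include <string>
#include <vector>

// Trace back over a move matrix (L = left, U = upper, D = diagonal),
// starting from the lower-right cell and walking to the origin while
// counting steps. The returned count is the alignment length that the
// score is divided by. One row per character of sequence 2 plus a
// leading gap row/column, mirroring the matrix shown above; the first
// row is all L and the first column all U, so the walk terminates.
int alignmentLength(const std::vector<std::string>& moves) {
    int i = (int)moves.size() - 1;     // row index (sequence 2)
    int j = (int)moves[0].size() - 1;  // column index (sequence 1)
    int len = 0;
    while (i > 0 || j > 0) {
        ++len;
        char m = moves[i][j];
        if (m == 'D')      { --i; --j; }  // diagonal: match or substitution
        else if (m == 'U') { --i; }       // upper: gap in sequence 1
        else               { --j; }       // left: gap in sequence 2
    }
    return len;
}
```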

Needleman-Wunsch
Computation workload:
- N-W matrix construction: ~(800 x 800) cells per matrix
- N-W matrix trace back: ~400 to 1,600 steps
- About (100,000 x 100,000)/2 matrices in total

Agenda
- Background
- Needleman-Wunsch
- GPU Implementation
- Optimization steps
- Results

Previous Work
Finely parallelize a single alignment across multiple threads: one thread per cell on the diagonal.
Disadvantages:
- Complex kernel
- Unusual memory access pattern
- Hard to trace back

Our Method: One Thread per Alignment
(Figure: block and stride layout of threads over alignments.)

Grid Organization
Example:
- 320 sequences
- Block size = 32
- n = 320/32 = 10 sequence groups
- (n^2 - n)/2 = 45 blocks (Block 0 through Block 44), each covering one pair of sequence groups

Objective: evaluate different block sizes.
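The block-count arithmetic in the example can be written out directly. A minimal sketch; the mapping (one block per unordered pair of distinct sequence groups) is inferred from the slide's numbers, and `numBlocks` is a hypothetical helper name.

```cpp
// Grid sizing for the all-pairs alignment workload: sequences are
// split into groups of `groupSize`, and one block handles each
// unordered pair of distinct groups, giving (n^2 - n)/2 blocks.
int numBlocks(int numSequences, int groupSize) {
    int n = numSequences / groupSize;  // number of sequence groups
    return (n * n - n) / 2;            // one block per group pair
}
```

With the slide's example, `numBlocks(320, 32)` gives 45.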

Agenda
- Background
- Needleman-Wunsch
- GPU Implementation
- Optimization steps
- Results

Optimization Procedure
Optimization aim: build more matrices concurrently.
Available variables:
- Kernel size (in registers)
- Block size (in threads)
- Grid size (in blocks or in warps)
Constraints:
- SM resources (max schedulable warps, registers, shared memory)
- GPU resources (SMs, on-board memory size, memory bandwidth)

Kernel Size
Make the kernel as simple as possible to decrease register usage; the final kernel uses 40 registers.

Fixed parameters:
- Kernel size: 40 registers
- Block size: -
- Grid size: -

Block Size
Block size alternatives were evaluated; 64 threads per block was chosen.

Fixed parameters:
- Kernel size: 40 registers
- Block size: 64 threads
- Grid size: -

Grid Size
The ideal warp count per SM is 12; with 30 SMs this gives 360 warps. In our grouped-sequence design the warp count has to be a multiple of 32.

Fixed parameters:
- Kernel size: 40 registers
- Block size: 64 threads
- Grid size: 160 blocks

Stream Kernel for Trace Back
Aim: have matrix construction and trace back run simultaneously, without losing performance to transfers.
This strategy was not adopted due to lack of memory; trace back is performed on the GPU instead.
(Figure: timeline of matrix construction (MC), transfer (TR), and trace back (TB) interleaved across two streams on the GPU, bus, and CPU.)

Register Usage Optimization
How to decrease register usage: the ptxas --maxrregcount option.
- Kernel size: 40 => 32 registers
- Grid size: 160 => 192 blocks
- Occupancy: 37.5% => 50%
- Performance improvement: <2%
Reason for the low improvement: overhead of register spilling.

Fixed parameters:
- Kernel size: 32 registers
- Block size: 64 threads
- Grid size: 192 blocks

Other Optimizations
Multiple GPUs:
- 4-GPU implementation
- MPI-flavor multi-GPU compatibility
Shared memory:
- Save the previous "move" of the cell to the left
- Replaces one global memory read with a shared memory read

Agenda
- Background
- Needleman-Wunsch
- GPU Implementation
- Optimization steps
- Results

Results
CPU: Core i7 980

Results
Cluster: 40 Gb/s InfiniBand, Xeon X5660
FCUPs: floating-point cell updates per second
(Plot: performance versus number of ranks in our cluster.)

Conclusion
- GPUs are a good match for performing high-throughput batch alignments for metagenomics
- Three Fermi GPUs achieve performance equivalent to a 16-node cluster, where each node contains 16 processors
- Performance is bounded by memory bandwidth
- Global memory size limits us to 50% SM utilization

Thank you!
Yang Gao
Jason D. Bakos