A massively parallel solution to a massively parallel problem HPC4NGS Workshop, May 21-22, 2012 Principe Felipe Research Center, Valencia Brian Lam, PhD.


 Next-generation sequencing and current challenges
 What is GPGPU computing?
 The use of GPGPU in NGS bioinformatics
 How it works – the BarraCUDA project
 System requirements for GPU computing
 What if I want to develop my own GPU code?
 What to look out for in the near future?
 Conclusions

“DNA sequencing includes several methods and technologies that are used for determining the order of the nucleotide bases A, G, C, T in a molecule of DNA.” - Wikipedia

 Commercialised in 2005 by 454 Life Sciences
 Sequencing in a massively parallel fashion

 An example: whole-genome sequencing
 Extract genomic DNA from blood/mouth swabs
 Break into small DNA fragments
 Attach the DNA fragments to a surface (flow cells/slides/microtitre plates) at high density
 Perform concurrent “cyclic sequencing reactions” to obtain the sequence of each attached fragment
An Illumina HiSeq 2000 can interrogate 825K spots/mm²

(Example reads: GTCCTGA, TATTTTT, ATTCNGG – not to scale)

Billions of short DNA sequences, also called sequence reads, ranging from 25 to 400 bp

Image Analysis → Base Calling → Sequence Alignment → Variant Calling / Peak Calling

Source: Genologics

What is Many-core/GPGPU computing?

 A physical die that contains a ‘large number’ of processing cores
 i.e. computation can be done in a massively parallel manner
 Modern graphics cards (GPUs) consist of hundreds to thousands of computing cores

 GPUs are fast, but there is a catch:
 SIMD – single instruction, multiple data
versus
 CPUs are powerful multi-purpose processors:
 MIMD – multiple instructions, multiple data

 The very same (low-level) instruction is applied to multiple data elements at the same time
 e.g. a GTX 680 can perform an addition on 1536 data points at a time, versus 16 on a 16-core CPU
 Branching results in serialisation
 The ALUs on a GPU are usually much more primitive than their CPU counterparts
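To make the SIMD point concrete, here is a minimal Python sketch (plain CPU code, purely illustrative – not GPU code) of predicated execution: when lanes of the same vector unit diverge on a branch, both paths are computed for every lane and a mask selects which result each lane keeps.

```python
def simd_select(mask, then_vals, else_vals):
    """Emulate predicated SIMD execution: both branch paths have already been
    computed for every lane; the mask picks which result each lane keeps."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]

data = [1, 2, 3, 4, 5, 6, 7, 8]          # one value per SIMD lane
mask = [x % 2 == 0 for x in data]        # lanes diverge on this condition

then_path = [x * 10 for x in data]       # executed for ALL lanes
else_path = [x + 100 for x in data]      # also executed for ALL lanes

result = simd_select(mask, then_path, else_path)
print(result)  # [101, 20, 103, 40, 105, 60, 107, 80]
```

This is why branch-heavy code maps poorly onto SIMD hardware: the cost of a divergent branch is the sum of both paths, not whichever path a given lane actually takes.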

 Scientific computing often deals with large amounts of data and, on many occasions, applies the same set of instructions to those data
 Examples:
▪ Monte Carlo simulations
▪ Image analysis
▪ Next-generation sequencing data analysis

 Low capital cost and energy efficient
 Dell 12-core workstation: £5,000, ~1 kW
 Dell 40-core computing cluster: £20,000+, ~6 kW
 NVIDIA GeForce GTX 680 (1536 cores): £400, <0.2 kW
 NVIDIA Tesla C2070 (448 cores): £1,000, 0.2 kW
 Many supercomputers also contain multiple GPU nodes for parallel computations

 Examples:
 CUDASW++ – 6.3×
 MUMmerGPU – 3.5×
 GPU-HMMer

The use of GPGPU in NGS bioinformatics

 The Ion Torrent server (a T7500 workstation) uses GPUs for base calling
 MUMmerGPU – comparisons among genomes
 BarraCUDA, SOAP3 – short-read alignment

How it works – the BarraCUDA project

Image Analysis → Base Calling → Sequence Alignment → Variant Calling / Peak Calling

Sequence alignment is a crucial step in the bioinformatics pipeline for downstream analyses
This step often takes many CPU hours to perform
Usually done on HPC clusters

 The main objective of the BarraCUDA project is to develop software that runs on GPU/many-core architectures
 i.e. to map sequence reads in a massively parallel fashion, the same way as they come off the NGS instrument

Genome + read library (CPU) → copy genome and read library to GPU → alignment on GPU → copy alignment results back to CPU → write to disk

 The Burrows–Wheeler transform (BWT) was originally intended for data compression; it performs a reversible transformation of a string
 In 2000, Ferragina and Manzini introduced a BWT-based index data structure for fast substring matching in O(n) time
 Substring matching is performed in a tree-traversal-like manner
 Used in major sequencing-read mapping programs, e.g. BWA, Bowtie, SOAP2

(Diagram: matching the substring ‘banan’)

Modified from Li & Durbin, Bioinformatics 2009, 25(14)

BWT_exactmatch(READ, i, k, l) {
  if (i < 0) then return RESULTS;
  k = C(READ[i]) + O(READ[i], k-1) + 1;
  l = C(READ[i]) + O(READ[i], l);
  if (k <= l) then BWT_exactmatch(READ, i-1, k, l);
}

main() {
  Calculate reverse BWT string B from reference string X
  Calculate arrays C(.) and O(.,.) from B
  Load READS from disk
  For every READ in READS do {
    i = |READ|;   Position
    k = 0;        Lower bound
    l = |X|;      Upper bound
    BWT_exactmatch(READ, i, k, l);
  }
  Write RESULTS to disk
}

Modified from Li & Durbin, Bioinformatics 2009, 25(14)
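The pseudocode above can be sketched as runnable Python. This is an illustrative reimplementation, not BarraCUDA or BWA code: it builds the BWT of a small reference naively, precomputes the C(.) and O(.,.) tables, and runs the same backward exact-match loop (iteratively here, over a 1-based row interval).

```python
def bwt(ref):
    """Naive BWT via sorted rotations of ref + sentinel (fine for tiny refs)."""
    s = ref + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def build_c_and_o(b):
    """C[a] = number of chars in the BWT smaller than a (sentinel included);
    O[a][i] = number of occurrences of a in b[:i]."""
    c, total = {}, 0
    for a in sorted(set(b)):
        c[a] = total
        total += b.count(a)
    o = {a: [0] * (len(b) + 1) for a in c}
    for i, ch in enumerate(b):
        for a in c:
            o[a][i + 1] = o[a][i] + (1 if ch == a else 0)
    return c, o

def exact_match(read, b, c, o):
    """Backward search: return the 1-based BWT row interval of read, or None."""
    sp, ep = 1, len(b)
    for ch in reversed(read):
        if ch not in c:
            return None
        sp = c[ch] + o[ch][sp - 1] + 1   # same shape as k = C + O(., k-1) + 1
        ep = c[ch] + o[ch][ep]           # same shape as l = C + O(., l)
        if sp > ep:
            return None
    return (sp, ep)

b = bwt("GTCCTGATATTTTT")               # reference built from the example reads
c, o = build_c_and_o(b)
print(exact_match("TTT", b, c, o))      # an interval: "TTT" occurs 3 times
print(exact_match("GGG", b, c, o))      # None: "GGG" is absent
```

The interval width (ep − sp + 1) equals the number of occurrences of the query, which is exactly why backward search answers “how many hits?” in time proportional to the read length.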

 Simple data parallelism
 The GPU is used mainly for the matching step

__device__ BWT_GPU_exactmatch(W, i, k, l) {
  if (i < 0) then return RESULTS;
  k = C(W[i]) + O(W[i], k-1) + 1;
  l = C(W[i]) + O(W[i], l);
  if (k <= l) then BWT_GPU_exactmatch(W, i-1, k, l);
}

__global__ GPU_kernel() {
  W = READS[thread_no];
  i = |W|;   Position
  k = 0;     Lower bound
  l = |X|;   Upper bound
  BWT_GPU_exactmatch(W, i, k, l);
}

main() {
  Calculate reverse BWT string B from reference string X
  Calculate arrays C(.) and O(.,.) from B
  Load READS from disk
  Copy B, C(.) and O(.,.) to GPU
  Copy READS to GPU
  Launch GPU_kernel with thousands of concurrent threads
  Copy alignment results back from GPU
  Write RESULTS to disk
}
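The kernel launch above is, in essence, a parallel map over the read library: each read is aligned independently of the others. A hedged CPU stand-in in Python (the reads are the example sequences from the earlier slide; `align_one` is a naive substring search standing in for the BWT kernel, and the thread pool stands in for the one-thread-per-read GPU launch):

```python
from concurrent.futures import ThreadPoolExecutor

def align_one(read, reference):
    """Stand-in for the per-thread GPU kernel: naive exact substring search.
    (The real kernel would do BWT backward search instead.)"""
    pos = reference.find(read)
    return (read, pos)  # -1 means unmapped

reference = "GTCCTGATATTTTTATTCNGGGTCCTGA"
reads = ["GTCCTGA", "TATTTTT", "ATTCNGG", "AAAAAAA"]

# One task per read, like one GPU thread per read.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda r: align_one(r, reference), reads))
print(results)  # [('GTCCTGA', 0), ('TATTTTT', 7), ('ATTCNGG', 14), ('AAAAAAA', -1)]
```

Because no read depends on another, the problem is embarrassingly parallel – which is precisely what makes it a good fit for thousands of GPU threads.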

 Very fast indeed: using a Tesla C2050, we can match 25 million 100-bp reads to the BWT in just over 1 minute
 But… is this relevant?

(Diagram: matching the substring ‘anb’, where ‘b’ is substituted with an ‘a’)

Search space complexity = O(9^n)! Klus et al., BMC Res Notes 2012, 5:27
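A rough sketch of why the search space explodes once mismatches and gaps are allowed: enumerate everything reachable within k edits of a read over the {A, C, G, T} alphabet. The exact counts are illustrative; the point is the multiplicative growth with every extra edit allowed.

```python
def neighbours(word, alphabet="ACGT"):
    """All strings exactly one edit (substitution/insertion/deletion) away."""
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i+1:])                 # deletion
        for a in alphabet:
            if a != word[i]:
                out.add(word[:i] + a + word[i+1:])     # substitution
    for i in range(len(word) + 1):
        for a in alphabet:
            out.add(word[:i] + a + word[i:])           # insertion
    out.discard(word)
    return out

def k_edit_space(read, k):
    """Everything reachable within k edits: the candidate search space."""
    seen = {read}
    frontier = {read}
    for _ in range(k):
        frontier = set().union(*(neighbours(w) for w in frontier)) - seen
        seen |= frontier
    return seen

read = "ACGTACGT"
sizes = [len(k_edit_space(read, k)) for k in (0, 1, 2)]
print(sizes)  # candidate space multiplies with each extra allowed edit
```

Even for a toy 8-mer, the two-edit space is orders of magnitude larger than the one-edit space; for real reads and edit budgets, exhaustive enumeration is hopeless, hence the bounded tree traversal used by BWT-based aligners.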

 It partially worked!
 10% faster than 8× X5472 3 GHz
 BWA uses a greedy breadth-first search approach (taking up to 40 MB per thread)
 Not enough workspace for thousands of concurrent kernel threads (~4 KB each) – i.e. reduced accuracy – NOT GOOD ENOUGH!
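The workspace problem in concrete terms: a toy comparison (not BWA's actual data structures) of breadth-first versus depth-first traversal of the same branching search tree. BFS must hold an entire frontier in memory, which grows with the branching factor, while DFS only ever holds one root-to-leaf path plus pending siblings – small enough to fit in a few kilobytes of per-thread GPU workspace.

```python
from collections import deque

BRANCH, DEPTH = 3, 8   # toy search tree: 3 choices per base, 8 bases deep

def bfs_peak_frontier():
    """Peak queue length of a breadth-first traversal."""
    q, peak = deque([0]), 1
    while q:
        depth = q.popleft()
        if depth < DEPTH:
            q.extend([depth + 1] * BRANCH)
        peak = max(peak, len(q))
    return peak

def dfs_peak_stack():
    """Peak stack length of a depth-first traversal of the same tree."""
    stack, peak = [0], 1
    while stack:
        depth = stack.pop()
        if depth < DEPTH:
            stack.extend([depth + 1] * BRANCH)
        peak = max(peak, len(stack))
    return peak

print(bfs_peak_frontier(), dfs_peak_stack())  # 6561 vs 17
```

The BFS frontier peaks at BRANCH^DEPTH entries, while the DFS stack peaks at roughly (BRANCH − 1) × DEPTH + BRANCH – linear in the read length rather than exponential.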


Recall SIMD: the very same instruction is applied to multiple data elements at the same time, and branching results in serialisation.

(Diagram: branches A/B and the CPU thread queue – Thread 1 splits into Thread 1.1 and Thread 1.2)

Klus et al., BMC Res Notes 2012, 5:27

(Chart: time taken (min) – Klus et al., BMC Res Notes 2012, 5:27)

 Simple data parallelism
 The GPU is used mainly for the matching step

Klus et al., BMC Res Notes 2012, 5:27

System requirements for GPU computing

 Hardware
 The system must have at least one decent GPU
▪ an NVIDIA GeForce 210 will not work!
 One or more PCIe x16 slots
 A decent power supply with the appropriate power connectors
▪ 550 W for one Tesla C2075, plus 225 W for each additional card
▪ don’t use any Molex converters!
 Ideally, dedicate cards to computation and use a separate cheap card for display
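The power figures above fold into a quick rule-of-thumb calculator (numbers taken from this slide; real PSU sizing also depends on the rest of the system):

```python
def min_psu_watts(n_cards, base=550, per_extra_card=225):
    """Rule of thumb from the slide: 550 W covers the system plus one
    Tesla C2075; each additional card needs another 225 W."""
    if n_cards < 1:
        raise ValueError("need at least one card")
    return base + (n_cards - 1) * per_extra_card

print(min_psu_watts(1))  # 550
print(min_psu_watts(3))  # 1000
```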

 Software
 CUDA
▪ CUDA toolkit
▪ appropriate NVIDIA drivers
 OpenCL
▪ CUDA toolkit (NVIDIA)
▪ AMD APP SDK (AMD)
▪ appropriate AMD/NVIDIA drivers

For example: CUDA toolkit v4.2

 DEMO

What if I want to develop my own GPU code?

 In the end, how much effort have we put in so far?
 3,300 lines of new code for the alignment core
▪ compared to 1,000 lines in BWA
▪ and still ongoing!
 1,000 lines for the SA-to-linear-space conversion

 CUDA
 Is (in general) faster and more complete
 Tied to NVIDIA hardware
 OpenCL
 Still new; AMD and Apple are pushing hard on it
 Can run on a variety of hardware, including CPUs

 NVIDIA developer zone
 OpenCL developer zone

 Different hardware has different capabilities
 The GT200 is very different from the GF100 in terms of cache arrangement and concurrent kernel execution
 The GF100 also differs from the GK110, where the latter can do dynamic parallelism
 Low-level data management is often required
 e.g. earmarking data for a memory type, coalescing memory accesses
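One example of what “coalescing memory access” means in practice: the addresses adjacent threads touch under an array-of-structures layout versus a structure-of-arrays layout. This is plain Python address arithmetic, not GPU code; `FIELDS` and the field names are hypothetical.

```python
N_THREADS, FIELDS = 8, 4   # e.g. 4 fields (k, l, i, state) per alignment job

def aos_address(tid, field):
    """Array-of-structures: one thread's fields are contiguous, so
    neighbouring threads hit strided addresses."""
    return tid * FIELDS + field

def soa_address(tid, field):
    """Structure-of-arrays: one field across all threads is contiguous, so
    neighbouring threads hit consecutive addresses."""
    return field * N_THREADS + tid

# Addresses touched by threads 0..7 when all read field 0 at the same time:
aos = [aos_address(t, 0) for t in range(N_THREADS)]   # strided
soa = [soa_address(t, 0) for t in range(N_THREADS)]   # coalesced
print(aos)  # [0, 4, 8, 12, 16, 20, 24, 28]
print(soa)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

On a GPU, the SoA pattern lets the hardware serve one warp's loads with a single wide memory transaction, whereas the AoS pattern scatters the same loads across several transactions.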

 OpenACC – the easy way out!
 A set of compiler directives
 It allows programmers to use ‘accelerators’ without explicitly writing GPU code
 Supported by CAPS, Cray, NVIDIA and PGI

Source: NVIDIA

What to look out for in the near future?

 The game is changing quickly
 CPUs are also getting more cores: AMD 16-core Opteron 6200 series
 Many-core platforms are evolving
 Intel MIC processors (50 x86 cores)
 NVIDIA Kepler platform (1536 cores)
 AMD Radeon 7900 series (2048 cores)
 SIMD nodes are becoming more common in supercomputers
 OpenCL

Conclusions

 Little software is available for NGS bioinformatics yet; more is to come
 BarraCUDA is one of the first attempts to accelerate the NGS bioinformatics pipeline
 A significant amount of coding is usually required, but more programming tools are becoming available
 Many-core is still evolving rapidly, and things could be very different in the next 2–3 years

IMS-MRL, Cambridge: Giles Yeo, Petr Klus, Simon Lam
NIHR-CBRC, Cambridge: Ian McFarlane, Cassie Ragnauth
Whittle Lab, Cambridge: Graham Pullan, Tobias Brandvik
Microbiology, University College Cork: Dag Lyberg
Gurdon Institute, Cambridge: Nicole Cheung
HPCS, Cambridge: Stuart Rankin
NVIDIA Corporation: Thomas Bradley, Timothy Lanfear