High-throughput sequence alignment using Graphics Processing Units
Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney (UMD)
Presented by Steve Rumble

Motivation
- NGS technologies produce a ton of data. AB SOLiD: 22e6 25-mers; others are even worse… how does 200e6 50-mers sound?
- Algorithms have been pushed hard, but typically assume a single workstation CPU
- Wozniak and others showed Smith-Waterman could be well parallelised on special hardware. What of other algorithms/hardware?

Motivation
- GPUs have recently evolved general-purpose programmability (GPGPU)
- E.g. nVidia 8800 GTX: 16 multiprocessors of 8 processors each => 128 stream processors; 768MB onboard; 1.35GHz clock; almost a year old now…

Short GPU Overview
- Highly parallel execution (hundreds of simultaneous operations)
- Hundreds of gigaflops per chip!
- Large on-board memories (up to 2GB)
- Limitations:
  - No recursion (no stacks)
  - Each multiprocessor's constituent processors execute the same instruction; thread divergence due to conditionals hurts…
  - No direct host memory access
  - Small caches (locality is key)
  - High memory latency
  - No dynamic memory allocation (why one would ever do that, I don't know)

Short GPU Overview
- GPGPU environments: previously had to reduce problems to graphics primitives… no more
- Simplified C-like programming
  - The paper has very little detail, but they make it sound enticingly simple…
- Each processor runs the same 'kernel'

Muh-muh-muh… MUMmer!
- Maximal Unique Match
- Find the longest match for each substring of a read (of reasonable length)
- Employs suffix trees

MUMmerGPU
- Plug-and-play replacement for MUMmer
- MUMmer is not 'arithmetic intensive'. Is the GPU a good fit?
- Six-step process:
  1) Build the suffix tree of the reference genome (Ukkonen's alg., O(n)) on the host CPU
  2) Copy the suffix tree to GPU memory
  3) Copy the queries to GPU memory
  4) Kick off the GPU…
  5) Copy the results to host memory
  6) Final processing on the host CPU

Suffix Trees
- We want to find the longest match for each suffix of a string (query) quickly
- Suffix trees permit O(m) string search, m = pattern length
- Space complexity is O(n), but the constants are apparently pretty big

Suffix Trees
- Definition:
  - Each edge carries an edge label: a non-empty substring of S (but it can be just the terminating character)
  - A path label is the string formed by concatenating edge labels from the root down to a node
  - There is a 1-1 correspondence between the suffixes of S and the root-to-leaf path labels
  - Internal nodes have at least 2 children
  - n leaf nodes, one for each suffix of S

Suffix Trees
- O(n) space:
  - n leaf nodes => at most n – 1 internal nodes
  - => n + (n – 1) + 1 (root) = 2n nodes (worst case)
  - Example: n = 3 gives 3 leaves + 2 internal + 1 root = 6 nodes
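The 2n bound can be checked empirically. Below is a minimal, naive O(n^2) builder (not the Ukkonen O(n) construction MUMmerGPU uses): each suffix is inserted by walking from the root and splitting an edge at the first mismatch.

```python
# Naive suffix tree: insert each suffix, splitting edges at mismatches.
# Illustrative only; MUMmerGPU builds the tree with Ukkonen's algorithm.

class Node:
    def __init__(self):
        self.children = {}  # edge label (string) -> child Node

def insert_suffix(root, suffix):
    node = root
    while suffix:
        # Find the (unique) outgoing edge sharing a first character.
        label = next((l for l in node.children if l[0] == suffix[0]), None)
        if label is None:
            node.children[suffix] = Node()  # new leaf edge
            return
        # Length of the common prefix of the edge label and the suffix.
        i = 0
        while i < len(label) and i < len(suffix) and label[i] == suffix[i]:
            i += 1
        if i == len(label):
            node = node.children[label]     # consume the whole edge
            suffix = suffix[i:]
        else:
            # Split the edge at the mismatch point.
            child = node.children.pop(label)
            mid = Node()
            node.children[label[:i]] = mid
            mid.children[label[i:]] = child
            mid.children[suffix[i:]] = Node()
            return

def build(s):
    root = Node()
    for i in range(len(s)):
        insert_suffix(root, s[i:])
    return root

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children.values())

print(count_nodes(build("TORONTO$")))  # -> 11
```

For "TORONTO$" (n = 8 suffixes) the tree has 11 nodes (1 root, 2 internal, 8 leaves), within the 2n = 16 bound.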

Suffix Trees
- Example: TORONTO$ ('$' is the terminating character)
- [Tree diagram: root edges T, O, RONTO$, NTO$, $; the T node has child edges ORONTO$ and O$; the O node has child edges RONTO$, NTO$, $]

Suffix Trees
- Example: TORONTO$, searching for 'ONT'
- [Tree diagram as in the previous slide; the walk matches 'O' on the O edge, then 'NT' along the NTO$ edge]
- 'ONT' found at position 3 in S
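As a sketch of that walk, here is the slide's TORONTO$ tree hand-coded as nested dicts (a toy stand-in for the paper's texture-packed node layout; leaf values are 0-based suffix start positions), with an O(m) descent:

```python
# The TORONTO$ suffix tree from the slide, as nested dicts.
# Leaf value = 0-based starting position of the suffix in "TORONTO$".
tree = {
    "T": {"ORONTO$": 0, "O$": 5},
    "O": {"RONTO$": 1, "NTO$": 3, "$": 6},
    "RONTO$": 2,
    "NTO$": 4,
    "$": 7,
}

def leftmost_leaf(node):
    # Any leaf below this node gives one occurrence position.
    while isinstance(node, dict):
        node = next(iter(node.values()))
    return node

def find(node, pattern):
    """Descend edge labels; O(m) for a pattern of length m."""
    while pattern:
        for label, child in node.items():
            if label.startswith(pattern):   # pattern ends inside this edge
                return leftmost_leaf(child)
            if pattern.startswith(label):   # consume the edge, keep descending
                pattern = pattern[len(label):]
                node = child
                break
        else:
            return None                     # mismatch: no occurrence
    return leftmost_leaf(node)

print(find(tree, "ONT"))  # -> 3, as on the slide
```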

Suffix Trees
- MUMmer wants to find all maximal unique matches for all suffixes of the query
- E.g., for query ACCGTGCGTC, we want matches for:
  - ACCGTGCGTC
  - CCGTGCGTC
  - CGTGCGTC
  - GTGCGTC
  - …
- Up to some reasonable limit…
- Don't want to go back to the root of the tree each time…
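Conceptually, the per-query work is: for every suffix of the query, find its longest prefix occurring in the reference. A brute-force Python equivalent for reference (the suffix tree with suffix links computes the same answers without restarting each search; `min_len` is an illustrative stand-in for MUMmer's minimum match length parameter):

```python
def longest_matches(reference, query, min_len):
    """For each query suffix, the longest prefix found in the reference."""
    out = []
    for i in range(len(query)):
        suffix = query[i:]
        # Shrink from the right until the prefix occurs (brute force).
        for j in range(len(suffix), min_len - 1, -1):
            pos = reference.find(suffix[:j])
            if pos != -1:
                out.append((i, pos, j))  # (query offset, ref offset, length)
                break
    return out

print(longest_matches("TTACCGTT", "ACCG", min_len=2))
# -> [(0, 2, 4), (1, 3, 3), (2, 4, 2)]
```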

Suffix Trees
- Suffix links
  - All internal, non-root nodes have a suffix link to another node
  - If x is a single character and a is a (possibly empty) string, then the node v whose path label is xa has a suffix link to the node v' whose path label is a
  - Got that?

Suffix Trees
- Example: TORONTO$
- Suffix links… don't backtrack (a bad example)
- [TORONTO$ tree diagram as before]

Suffix Trees
- Example: BANANA$, a better example of suffix links
- [Tree diagram: root edges A, BANANA$, NA, $; the A node has child edges NA and $; NA nodes split again on NA$ and $; leaves are labelled with suffix start positions, e.g. 2 and 4 under the root's NA node]

Suffix Trees
- Example: BANANA$, searching for suffixes of 'ANANA'
- [Animation over the BANANA$ tree: after matching ANANA, each shorter suffix (NANA, ANA, NA, A) is found by following a suffix link rather than restarting at the root]

Memory Limitations
- Suffix trees take up a fair bit of memory
- GPUs have hundreds of MBs, but this is still small
- Divide the reference sequence into k segments with overlaps
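A minimal sketch of that segmentation, assuming the overlap is chosen as (maximum match length minus one) so a match spanning a segment boundary still lies wholly inside some segment (the sizes here are illustrative, not the paper's):

```python
def segment(reference, seg_len, overlap):
    """Split into overlapping segments; each is searched independently."""
    step = seg_len - overlap
    return [(start, reference[start:start + seg_len])
            for start in range(0, len(reference) - overlap, step)]

print(segment("ABCDEFGHIJ", seg_len=4, overlap=2))
# -> [(0, 'ABCD'), (2, 'CDEF'), (4, 'EFGH'), (6, 'GHIJ')]
```

Match positions reported against a segment are then shifted by the segment's start offset to recover coordinates in the full reference.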

Cache Optimisation
- Memory latency is high, so cache performance is crucial: we're walking a tree here, not crunching numbers down an array
- Read-only data can be stored in 2D textures; nVidia's caching scheme optimises access
- Re-order and pack tree nodes into 'texel blocks' such that:
  - Nodes near the root are level-ordered (BFS)
  - Nodes further down are stored with their descendants

Cache Optimisation
- Texture cache is organised in 2x2 blocks
- Try to place all children of a node in the same cache block
- Shamelessly cribbed from:

Cache Optimisation
- Reference sequence stored in 4x2 blocks of a 2D array
- E.g., the sequence A B C D E F G H is stored interleaved as A E B F C G D H
- Why? It worked well.
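The A E B F C G D H ordering can be reproduced with a small reordering function: split the sequence into row stripes, then emit column by column, so a character and the one a stripe-length ahead share a cache block (the two-row shape mirrors the slide's example; the slide's only stated justification for the layout is that it worked well):

```python
def interleave(seq, rows=2):
    """Reorder a linear sequence into column-interleaved row stripes."""
    cols = len(seq) // rows
    stripes = [seq[r * cols:(r + 1) * cols] for r in range(rows)]
    # Emit column by column: stripe 0, stripe 1, ... for each column.
    return "".join(stripes[r][c] for c in range(cols) for r in range(rows))

print(interleave("ABCDEFGH"))  # -> "AEBFCGDH", as in the slide's layout
```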

Cache Optimisation
- Memory layouts were heuristically determined; nVidia cache details are not public
- Cache optimisation improves execution speed 'by several fold'

Conclusions
- GPGPU isn't just good for 'arithmetic intensive' applications
- 5-11x speed-up for NGS data

Conclusions
- Fine print: the 5-11x is for the suffix tree kernel alone on the GPU
- Reality is different! 3.5x speed-up for real data in terms of total application runtime, pretty constant across read lengths ( bp)
- Careful management of memory layout is crucial: the authors claim a several-fold performance increase (it could be the difference between some improvement and none)

Conclusions  Runtime dominated by serial parts of MUMmer
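Amdahl's law makes the kernel-vs-total gap concrete. Assuming the accelerated kernel accounts for roughly 80% of the original runtime (my guess, not a figure from the paper), a 10x kernel speedup caps the end-to-end speedup near the reported 3.5x:

```python
def overall_speedup(p, kernel_speedup):
    """Amdahl's law: p = accelerated fraction of the original runtime."""
    return 1.0 / ((1.0 - p) + p / kernel_speedup)

print(round(overall_speedup(0.8, 10), 2))  # -> 3.57
```

The remaining 20% serial fraction dominates no matter how fast the kernel gets, which is consistent with the slide's point.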

Food for Thought
- An 8800 GTX costs ~$400, uses watts
- A quad Core 2 chip runs ~$250, uses watts
- Each core is approx. 2x faster than their test CPU
- MUMmerGPU is at most 3.5x faster than the test CPU
- What have we won here?

Food for Thought
- Confusing reports: "Fast Exact String Matching on the GPU" (Schatz, Trapnell) claims up to 35x improvement
  - An earlier course paper (early/mid-2007)
- Why from 35x down to 5-11x with MUMmerGPU?

My Impressions…
- (…whatever they're worth)
- The GPU is not a clear win (in this case)
- Suffix trees seem unsuited:
  - Cache locality trouble
  - O(n) footprint, but the multiplicative constants are still substantial
- Host CPUs seem to be as good or better (in $ and watts)

My Impressions…
- GPGPUs aren't a great fit here, at least for this algorithm
- MUMmerGPU isn't the order-of-magnitude win it claims to be
- But this is a first-generation, general-purpose chip geared toward number-crunching, not pointer-traversing
- I don't think we've seen the last (nor the best) of GPUs…