High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by.

High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by Steve Rumble

Motivation  NGS technologies produce a ton of data AB SOLiD: 22e6 25-mers Others are even worse…  How does 200e6 50-mers sound?  Algorithms have been pushed hard, but typically assume same workstation CPU  Wozniak and others showed S-W could be well-parallelised on special H/W. What of other algorithms/hardware?

Motivation  GPUs have recently evolved general purpose programmability (GPGPU)  E.g.: nVidia 8800 GTX 16 multiprocessors  8 processors each  => 128 stream processors 768MB onboard 1.35GHz clock Almost a year old now…

Short GPU Overview  Highly parallel execution (hundreds of simultaneous operations)  Hundreds of gigaflops per chip!  Large on-board memories (up to 2GB)  Limitations: No recursion (no stacks) Each multiprocessor’s constituent processors execute same instruction  Thread Divergence due to conditionals hurts… No direct host memory access Small caches (locality is key) High memory latency No dynamic memory allocation (why one would ever do that, I don’t know)

Short GPU Overview  GPGPU environments Previously had to reduce problems to graphics primitives… no more Simplified C-like programming  Paper has very little detail, but they make it sound enticingly simple… Each processor runs the same ‘kernel’

Muh-muh-muh… MUMmer!  Maximal Unique Match  Find longest match for each subsequence of a read (of reasonable length)  Employs Suffix Trees

MUMmerGPU  Plug-and-play replacement for MUMmer  MUMmer is not ‘arithmetic intensive’ Is the GPU a good fit?  Six-step process 1) Build Suffix Tree of reference genome (Ukkonen’s alg. – O(n)) on host CPU 2) Suffix Tree -> GPU Memory 3) Queries -> GPU Memory 4) Kick off the GPU… 5) Results -> Host Memory 6) Final processing on Host CPU

Suffix Trees  We want to find the longest subsequence of a string (query) quickly Suffix Trees permit O(m) string search, m = string length Space complexity is O(n)  But constants are apparently pretty big

Suffix Trees  Definition: Node edges have a node label  A string subsequence  Non-empty (but can be terminating) A path label is the sequence formed by traversing from root to leaf 1-1 correspondence of suffixes of S to path labels Internal nodes have at least 2 children n leaf nodes – one for each suffix of S

Suffix Trees  O(n) space n leaf nodes => at most n – 1 internal nodes => n + (n – 1) + 1 = 2n nodes (worst case) n = 3 n – 1 = 2 3 + 2 + root = 6 nodes

Suffix Trees  Example: TORONTO$ ‘$’ is terminating character T ORONTO$ O$ NTO$ RONTO$ 6 4 05 2 31 O $ NTO$

Suffix Trees  Example: TORONTO$ Searching for ‘ONT’ T ORONTO$ O$ NTO$ RONTO$ 6 4 05 2 31 O $ NTO$

Suffix Trees  Example: TORONTO$ Searching for ‘ONT’ T ORONTO$ O$ NTO$ RONTO$ 6 4 05 2 31 O $ NTO$ ‘ONT’ at position 3 in S

Suffix Trees  MUMmer wants to find all maximal unique matches for all suffixes: E.g., for query ACCGTGCGTC, we want:  ACCGTGCGTC  CCGTGCGTC  CGTGCGTC  GTGCGTC  …  Up to some reasonable limit… Don’t want to go back to root of tree each time…

Suffix Trees  Suffix Links All internal, non-root nodes have a suffix link to another node  If x is a single character and a is a (possibly empty) string (subsequence), then the path from the root to a node v spelling ax (path-label is ax) has a suffix link to node v’, whose path-label is a.  Got that?

Suffix Trees  Example: TORONTO$ Suffix Links… Don’t backtrack (bad ex.) T ORONTO$ O$ NTO$ RONTO$ 6 4 05 2 31 O $ NTO$

Suffix Trees  Example: BANANA$ Better example of Suffix Links A $ NA 1 0 5 3 BANANA$ NA$ $ 24 $

Suffix Trees  Example: BANANA$ Searching for suffixes of ‘ANANA’ A $ NA 1 0 5 3 BANANA$ NA$ $ 24 $

Memory Limitations  Suffix trees take up a fair bit of memory  GPUs have 100’s of MBs, but this is still small  Divide the target sequence into ‘k’ segments with overlaps

Cache Optimisation  Memory latency high, cache performance crucial We’re walking a tree here, not crunching numbers down an array  Can store read-only data in 2D textures; nVidia caching scheme optimises access  Re-order and squish tree nodes into ‘texel blocks’ such that: Nodes near root are level-ordered (BFS) Nodes further down are ordered with descendants

Cache Optimisation 1 2345 678910111213 14151617181921232022242526272829 02468101214 13579111315 1618202224262830 1719212325272931 Texture cache organized in 2x2 blocks. Try to place all children of a node are in the same cache block Shamelessly cribbed from: http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt

Cache Optimisation  Reference Sequence stored in 4x2 16 blocks of a 2D array Sequence: A B C D E F G H … ………. A E B F C G D H ………. α Φ β Χ Γ Ψ Δ Ω Why? It worked well.

Cache Optimisation  Memory layouts heuristically determined nVidia cache details not public  Cache optimisation improves execution speed ‘by several fold’.

Conclusions  GPGPU isn’t just good for ‘arithmetic intensive’ applications  5-11x speed-up for NGS data

Conclusions  Fine Print: 5-11x is for the Suffix Tree kernel on the GPU Reality is different! 3.5x speed-up for real data in terms of total application runtime. Pretty constant across read lengths (35-700+ bp)  Careful management of memory layout is crucial Authors claim several-fold performance increase (could be difference between some improvement and none)

Conclusions  Runtime dominated by serial parts of MUMmer

Food for Thought  8800 GTX costs ~$400, uses 100-150 watts  Quad Core 2 chip runs ~$250, uses 100-130 watts  Each core approx. 2x faster than their test CPU  MUMmerGPU maximally 3.5x faster than test CPU  What have we won here?

Food for Thought  Confusing reports “Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement  Earlier course paper (early/mid-2007) Why from 35x down to 5-11x with MUMmerGPU?

My Impressions…  (…whatever they’re worth)  GPU is not a clear win (in this case) Suffix trees seem unsuited:  Cache locality trouble  O(n) footprint, but multiplicative constants are still substantial Host CPUs seem to be as good or better (in $ and watts)

My Impressions…  GPGPU’s aren’t a great fit here At least for this algorithm… MUMmerGPU isn’t the order-of-magnitude win it claims to be  But this is a first-generation, general- purpose chip geared toward number-crunching, not pointer- traversing I don’t think we’ve seen the last (nor the best) of GPUs…

High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by.

Similar presentations

Presentation on theme: "High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by.

Similar presentations

Presentation on theme: "High-throughput sequence alignment using Graphics Processing Units Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney UMD Presented by."— Presentation transcript:

Similar presentations

About project

Feedback