Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
May 9, 2006
Outline
1. Introduction: LLCS Computation · The BSP Model
2. Problem Definition and Algorithms: Standard Algorithm · Parallel Algorithm
3. Experiments: Experiment Setup · Predictions · Speedup
Motivation
Computing the (length of the) longest common subsequence is representative of a broad class of dynamic programming algorithms. Hence, we want to:
▫ Examine the suitability of high-level BSP programming for such problems
▫ Compare different BSP libraries on different systems
▫ See what happens to parallel speedup when sequential performance is already good
▫ Examine performance predictability
Related Work
▫ Hirschberg: sequential dynamic programming algorithm (1975)
▫ Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the Longest Common Subsequence problem (2001)
▫ Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
▫ Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)
Our Work
▫ Combination of bit-parallel algorithms and fast BSP-style communication
▫ A BSP performance model and predictions
▫ Comparison of different libraries on different systems
▫ Estimation of the block size parameter before the computation, for better speedup
The BSP Model
▫ p identical processor/memory pairs (computing nodes)
▫ Computation speed f on every node
▫ Arbitrary interconnection network with latency l and bandwidth gap g
BSP Programs
▫ SPMD execution, organized in supersteps
▫ Communication may be delayed until the end of the superstep
▫ Time/cost formula: T = f·W + g·H + l·S, where W is the local work, H the communication volume per processor, and S the number of supersteps (illustrated by the sketch below)
▫ Bytes are used as the base unit for communication size
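To make the cost terms concrete, here is a minimal BSPlib-style superstep in C (a sketch, not from the talk): each processor performs some local work (counted in W), issues one-sided bsp_put operations (counted in H), and ends the superstep with bsp_sync (counted in S).

    #include <bsp.h>   /* BSPlib; assumes a BSPlib implementation is installed */

    int main(int argc, char **argv) {
        bsp_begin(bsp_nprocs());
        int p = bsp_nprocs(), s = bsp_pid();

        double local = 0.0, incoming = 0.0;
        bsp_push_reg(&incoming, sizeof(double));
        bsp_sync();                    /* registration takes effect */

        /* Superstep: local work (counted in W)... */
        for (int i = 0; i < 1000000; ++i) local += 1e-6;

        /* ...one-sided communication (counted in H): send to the next processor... */
        bsp_put((s + 1) % p, &local, &incoming, 0, sizeof(double));

        /* ...and the barrier ends the superstep (counted in S). */
        bsp_sync();

        bsp_pop_reg(&incoming);
        bsp_end();
        return 0;
    }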
Problem Definition
▫ Let X = x_1 x_2 ... x_m and Y = y_1 y_2 ... y_n be two strings over a finite alphabet.
▫ A subsequence U of a string X is any string that can be obtained by deleting zero or more elements from X, i.e. U = x_{i_1} x_{i_2} ... x_{i_k} with i_q < i_{q+1} for all q with 1 ≤ q < k.
▫ An LCS(X, Y) is any string that is a subsequence of both X and Y and has maximum possible length; LLCS(X, Y) denotes this length. For example, with X = abcb and Y = bdcab, one LCS is bcb, so LLCS(X, Y) = 3.
Sequential Algorithm
▫ Dynamic programming matrix L_{0..m, 0..n} with L_{i,j} = LLCS(x_1 x_2 ... x_i, y_1 y_2 ... y_j)
▫ Recurrence: L_{i,j} = L_{i-1,j-1} + 1 if x_i = y_j, and max(L_{i-1,j}, L_{i,j-1}) otherwise
▫ The values in this matrix can be computed in O(mn) time and space (see the sketch below)
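A minimal C implementation of this recurrence (a generic sketch, not the talk's code): storing the full matrix takes O(mn) space, but when only the length is needed, two rows suffice, since row i depends on row i-1 alone.

    #include <stdlib.h>
    #include <string.h>

    /* LLCS(X, Y) by the standard O(mn) dynamic program.
       curr[j] holds L[i][j]; prev[j] holds L[i-1][j]. */
    int llcs(const char *x, const char *y) {
        size_t m = strlen(x), n = strlen(y);
        int *prev = calloc(n + 1, sizeof(int));
        int *curr = calloc(n + 1, sizeof(int));
        for (size_t i = 1; i <= m; ++i) {
            for (size_t j = 1; j <= n; ++j) {
                if (x[i - 1] == y[j - 1])
                    curr[j] = prev[j - 1] + 1;
                else
                    curr[j] = prev[j] > curr[j - 1] ? prev[j] : curr[j - 1];
            }
            int *tmp = prev; prev = curr; curr = tmp;   /* roll the rows */
        }
        int result = prev[n];
        free(prev); free(curr);
        return result;
    }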
Parallel Algorithm
▫ Based on a simple parallel algorithm for grid DAG computation
▫ The dynamic programming matrix L is partitioned into a grid of G×G rectangular blocks of size (m/G)×(n/G) (G: grid size)
▫ Blocks on the same wavefront (anti-diagonal) can be processed in parallel (see the sketch after this list)
▫ Assumptions: strings of equal length m = n; the ratio a = G/p is an integer
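The wavefront order can be expressed as follows (an illustrative C sketch; process_block is a hypothetical function, since block (i, j) can be computed as soon as blocks (i-1, j) and (i, j-1) are done):

    /* process_block is hypothetical: it fills DP block (i, j) from the
       boundaries of blocks (i-1, j) and (i, j-1), received earlier. */
    void process_block(int i, int j);

    /* Process a G x G grid of blocks in wavefront (anti-diagonal) order.
       All blocks on an anti-diagonal d = i + j are independent; with a
       block-cyclic distribution, processor (j mod p) owns block column j. */
    void wavefront(int G, int p, int pid) {
        for (int d = 0; d <= 2 * (G - 1); ++d) {    /* one wave per superstep */
            for (int j = 0; j < G; ++j) {
                int i = d - j;
                if (i < 0 || i >= G) continue;      /* block outside the grid */
                if (j % p == pid)
                    process_block(i, j);
            }
            /* ...exchange block boundaries here, then synchronize
               (e.g. bsp_sync()), ending the superstep. */
        }
    }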
Parallel Cost Model
▫ Input/output data distribution is block-cyclic ► data for whole block-columns can be kept locally
▫ Running time: [formula shown on slide; a reconstruction is sketched below]
▫ The parameter a = G/p can be used to tune performance
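The slide's running-time formula is not preserved in this transcript; the following back-of-envelope reconstruction (my derivation, under the stated assumptions m = n, G = a·p, and one wavefront per superstep; the talk's exact formula may differ) shows the expected shape:

    \[
      T(n,p,a) \;\approx\;
        f \cdot \frac{n^2}{p}\Bigl(1 + \frac{1}{a}\Bigr)
        \;+\; g \cdot O(a\,n)
        \;+\; l \cdot (2ap - 1)
    \]
    % Computation: the (ap)^2 blocks of (n/G)^2 cells each are load-balanced
    %   up to the wavefront fill/drain, giving efficiency roughly a/(a+1).
    % Communication: each block exchanges O(n/G) boundary values per superstep.
    % Synchronization: one superstep per wavefront, S = 2G - 1 = 2ap - 1.

Under this reading, a larger a improves load balance (the 1 + 1/a factor) but increases the number of supersteps and hence the latency term, which is why a can be used to tune performance.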
Bit-Parallel Algorithms
▫ Bit-parallel computation processes w entries of L in parallel (w: machine word size); a single-word sketch follows below
▫ This leads to a substantial speedup in the sequential computation phase and a slightly lower communication cost per superstep
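For reference, a single-word C transcription of the published bit-vector LLCS algorithm of Crochemore, Iliopoulos, Pinzon and Reid (2001), restricted here to |Y| ≤ 64 (multi-word versions split the vectors across machine words):

    #include <stdint.h>
    #include <string.h>

    /* Bit-vector LLCS for n = strlen(y) <= 64. Bit j of V is 0 exactly where
       row i of L increases from column j to j+1, so the LLCS is the number
       of zero bits among the low n bits of the final V. */
    int llcs_bitparallel(const char *x, const char *y) {
        uint64_t match[256] = {0};
        size_t m = strlen(x), n = strlen(y);
        for (size_t j = 0; j < n; ++j)             /* match vectors for Y */
            match[(unsigned char)y[j]] |= (uint64_t)1 << j;

        uint64_t V = ~(uint64_t)0;                 /* row 0 of L is all zeros */
        for (size_t i = 0; i < m; ++i) {
            uint64_t M = match[(unsigned char)x[i]];
            V = (V + (V & M)) | (V & ~M);          /* advance one row of L */
        }

        int llcs = 0;
        for (size_t j = 0; j < n; ++j)             /* count zero bits */
            if (!((V >> j) & 1)) ++llcs;
        return llcs;
    }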
Systems Used
Measurements were carried out on parallel machines at the Centre for Scientific Computing:
▫ aracari: IBM cluster, 64 × 2-way SMP Pentium 3 1.4 GHz / 128 GB of memory (interconnect: Myrinet 2000, MPI: mpich-gm)
▫ argus: Linux cluster, 31 × 2-way SMP Pentium 4 Xeon 2.6 GHz / 62 GB of memory (interconnect: 100 Mbit Ethernet, MPI: mpich-p4)
▫ skua: SGI Altix shared-memory machine, 56 × Itanium processors / 112 GB of memory (MPI: SGI native)
BSP Libraries Used
▫ The Oxford BSP Toolset on top of MPI (oxtool/)
▫ PUB on top of MPI, except on the SGI (wwwcs.uni-paderborn.de/~bsp/)
▫ A simple BSPlib implementation based on MPI(-2)
Input and Parameters
▫ Input strings generated randomly, of equal length
▫ Predictability examined for string lengths between 8192 and 65536, and for the grid size parameter a between 1 and 5
▫ Values of l and g measured by timing random permutations (see the sketch below)
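One way such a measurement can be structured with BSPlib primitives (an illustrative sketch, not the talk's benchmark code; the permutation perm and the buffers are supplied by the caller): each processor sends h bytes to its permutation target, and repeating this over several values of h lets one fit the superstep time as t(h) = l + g·h to estimate l and g.

    #include <bsp.h>

    /* Time one superstep in which every processor sends h bytes according
       to a random permutation of processor IDs. */
    double time_permutation(int h, const int *perm, char *src, char *dst) {
        bsp_push_reg(dst, h);
        bsp_sync();                         /* registration takes effect */
        double t0 = bsp_time();
        bsp_put(perm[bsp_pid()], src, dst, 0, h);
        bsp_sync();                         /* communication completes here */
        double t1 = bsp_time();
        bsp_pop_reg(dst);
        bsp_sync();
        return t1 - t0;
    }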
Experimental Values of f and f′
Simple algorithm (f):        skua 130 M op/s · argus 61 M op/s · aracari 86 M op/s
Bit-parallel algorithm (f′): skua 4.5 G op/s · argus 2.9 G op/s · aracari 1.8 G op/s
(The per-operation times in ns are the reciprocals of these rates.)
Predictions
Good results on distributed-memory systems.
[Chart: predicted vs. measured running times, aracari/MPI, 32 processors]
Predictions
Slightly worse results on shared memory.
[Chart: predicted vs. measured running times, skua/MPI, p = 32]
Problems when Predicting Performance
▫ Results for PUB are less accurate on shared memory
▫ Setup costs are only covered by the parameter l ► difficult to measure ► problems on the shared-memory machine when the communication size is small
▫ PUB shows a sharp change in performance once the communication size reaches a certain value
▫ A busy communication network can create ‘spikes’
Predictions for the Bit-Parallel Version
▫ Good results on distributed-memory systems
▫ Results on the SGI have a larger prediction error because the local computations use block sizes for which f′ is not stable
Speedup Results
[Chart: speedup of the LLCS computation on aracari]
Speedup for the Bit-Parallel Version
▫ Speedup is slightly lower than for the standard version
▫ However, overall running times for the same problem sizes are shorter
▫ Parallel speedup can be expected for larger problem sizes
Speedup for the Bit-Parallel Version
[Charts: speedup on argus (p = 10) and on skua (p = 32)]
Result Summary
[Table of results shown on slide; not preserved in this transcript]
Summary and Outlook
Summary
▫ High-level BSP programming is efficient for the dynamic programming problem we considered.
▫ Implementations benefit from low-latency communication (the Oxford BSP Toolset, PUB).
▫ Very good predictability.
Outlook
▫ Different modelling of bandwidth would allow better predictions.
▫ Lower latency is possible by using subgroup synchronization.
▫ Extraction of an LCS is possible, using a post-processing step or a different algorithm.