Exploiting Recent SIMD Architectural Advances for Irregular Applications
Linchuan Chen, Peng Jiang and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University, USA
Motivation - Applications
Irregular Data Accesses are Very Common
  Graph applications
  Unstructured grid applications
  Particle simulations
  Sparse matrix operations
  And more ...
Motivation - SIMD Hardware Evolution
More Powerful Parallelism
  Wider SIMD lanes
  Massive MIMD cores
Challenging to Exploit for Applications with Irregular Data Access
  Memory access locality
  Write conflicts
Intel Xeon Phi
Large-scale Parallelism
  61 cores, each supporting 4 hyper-threads
Wide SIMD Lanes
  512-bit lane width = 16 floats / 8 doubles
More Flexible SIMD Programming with Gather/Scatter
  Enables programming for irregular memory accesses
Intel Xeon Phi - Gather/Scatter Performance
SIMD access locality is influenced by the access range
Gather/scatter performance falls significantly below load/store at access ranges larger than 256
Contributions
A General Optimization Methodology
  Based on a sparse matrix view: data access pattern identification made easy
The Steps
  Data locality enhancement through matrix tiling
  Data access pattern identification
  Write conflict removal at both SIMD and MIMD levels
Subclasses Studied
  Irregular reductions
  Graph algorithms
  SpMM
Sparse Matrix View of Irregular Reductions
Sequential computation steps:
  Read node elements
  Do computation
  Write to reduction array
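For reference, a minimal scalar sketch of this pattern, assuming an edge list in COO-style arrays; the force-like computation and the array names are illustrative, not the talk's actual kernel:

```cpp
// Scalar irregular reduction over an edge list (COO view of the sparse matrix).
// For each edge (i, j): read the two node elements, compute, and accumulate
// into the reduction array at both endpoints.
void irregular_reduction(int num_edges,
                         const int* row,          // node i of each edge
                         const int* col,          // node j of each edge
                         const float* edge_val,   // per-edge data
                         const float* node,       // node attributes (read)
                         float* reduction)        // reduction array (written)
{
    for (int e = 0; e < num_edges; ++e) {
        int i = row[e], j = col[e];
        float f = (node[i] - node[j]) * edge_val[e];  // illustrative computation
        reduction[i] += f;    // writes follow the irregular edge structure
        reduction[j] -= f;    // both endpoints are updated
    }
}
```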
Parallel Optimization - Irregular Reductions
Step 1: Matrix Tiling
  The sparse matrix is tiled into squares (tiles)
  A tile is stored as COO (rows, cols, vals)
  Each tile is exclusively processed by one thread in SIMD
Step 2: Access Pattern Identification
  Writes happen in both the row and column directions
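One plausible layout for the tiled data, sketched below; the struct and field names are assumptions, not the talk's exact format:

```cpp
#include <vector>

// A square tile of the sparse matrix, stored in COO form.
// Row/column IDs are kept per non-zero so gather/scatter indices
// can be loaded directly.
struct Tile {
    std::vector<int>   rows;   // row ID of each non-zero
    std::vector<int>   cols;   // column ID of each non-zero
    std::vector<float> vals;   // non-zero values
};

// The matrix becomes a 2-D grid of tiles; each tile is processed by one
// thread in SIMD, which bounds the access range of gathers/scatters
// to the tile width and improves SIMD access locality.
struct TiledMatrix {
    int tile_size;                          // e.g. 1024 or 2048 (tuned)
    std::vector<std::vector<Tile>> tiles;   // tiles[tile_row][tile_col]
};
```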
Optimization - Irregular Reductions
Step 3: Conflict Removal
Conflicts
  SIMD level: different lanes might write to the same locations
  MIMD level: different threads might write to the same locations
Conflict Removal at SIMD Level - Grouping
  Divide the non-zeros in each tile into conflict-free groups
  In each conflict-free group, every two elements have distinct row IDs and distinct column IDs
Conflict Removal at MIMD Level
  Group tiles in the same way
  Each parallel step: all threads process one conflict-free tile group
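The talk does not spell out the grouping algorithm; a simple greedy policy like the sketch below satisfies the conflict-free property (within a group, all row IDs distinct and all column IDs distinct):

```cpp
#include <unordered_set>
#include <vector>

// Partition a tile's non-zeros into groups such that, within a group,
// all row IDs are distinct and all column IDs are distinct.
// Lanes of a SIMD scatter over one group therefore never collide.
std::vector<std::vector<int>>
conflict_free_groups(const std::vector<int>& rows, const std::vector<int>& cols)
{
    std::vector<std::vector<int>> groups;              // indices of non-zeros
    std::vector<std::unordered_set<int>> used_r, used_c;

    for (int nz = 0; nz < (int)rows.size(); ++nz) {
        size_t g = 0;
        // Greedily find the first group with no row or column conflict.
        while (g < groups.size() &&
               (used_r[g].count(rows[nz]) || used_c[g].count(cols[nz])))
            ++g;
        if (g == groups.size()) {                      // open a new group
            groups.emplace_back();
            used_r.emplace_back();
            used_c.emplace_back();
        }
        groups[g].push_back(nz);
        used_r[g].insert(rows[nz]);
        used_c[g].insert(cols[nz]);
    }
    return groups;
}
```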
Optimization - Irregular Reductions
Execution (SIMD processing of a conflict-free group):
  Load row IDs and column IDs, using aligned loads
  Load non-zero values, using an aligned load
  Gather node data according to row IDs
  Gather node data according to column IDs
  Compute in SIMD
  Update the reduction array using row IDs with scatter
  Update the reduction array using column IDs with scatter
Tile arrays: rid, cid, val
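A sketch of this per-group SIMD step using AVX-512-style gather/scatter intrinsics (the Xeon Phi generation in the talk uses the earlier IMCI instruction set, whose intrinsics differ slightly; the arithmetic in the middle is illustrative):

```cpp
#include <immintrin.h>

// Process one conflict-free group of 16 non-zeros in SIMD.
// rid/cid/val point to the group's row IDs, column IDs and values,
// padded and aligned to 64 bytes.
void process_group(const int* rid, const int* cid, const float* val,
                   const float* node, float* reduction)
{
    __m512i vr = _mm512_load_epi32(rid);                 // aligned load: row IDs
    __m512i vc = _mm512_load_epi32(cid);                 // aligned load: column IDs
    __m512  vv = _mm512_load_ps(val);                    // aligned load: values

    __m512  ni = _mm512_i32gather_ps(vr, node, 4);       // gather node data (rows)
    __m512  nj = _mm512_i32gather_ps(vc, node, 4);       // gather node data (cols)

    __m512  f  = _mm512_mul_ps(_mm512_sub_ps(ni, nj), vv);   // illustrative compute

    __m512  ri = _mm512_i32gather_ps(vr, reduction, 4);  // read-modify-write via
    __m512  rj = _mm512_i32gather_ps(vc, reduction, 4);  // gather + scatter; safe only
    _mm512_i32scatter_ps(reduction, vr, _mm512_add_ps(ri, f), 4);  // because the group
    _mm512_i32scatter_ps(reduction, vc, _mm512_sub_ps(rj, f), 4);  // is conflict-free
}
```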
Sparse Matrix View of Graph Algorithms
Sequential computation steps:
  Load values of source vertices
  Load values of edges
  Compute
  Update destination vertices
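The same four steps in scalar form, using an edge-centric Bellman-Ford relaxation as an illustrative instance (array names are assumptions):

```cpp
// Edge-centric relaxation step (Bellman-Ford style) in the sparse-matrix view:
// each non-zero (src, dst, weight) reads the source vertex value, reads the
// edge value, computes, and updates the destination vertex only.
void relax_edges(int num_edges,
                 const int* src, const int* dst, const float* weight,
                 float* dist)
{
    for (int e = 0; e < num_edges; ++e) {
        float cand = dist[src[e]] + weight[e];    // load source value + edge value
        if (cand < dist[dst[e]])                  // compute
            dist[dst[e]] = cand;                  // write only to the destination
    }
}
```

Because writes go only to destination vertices (one direction of the matrix), the grouping policy on the next slide only has to keep column IDs distinct.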
Optimization - Graph Algorithms
Locality Enhancement
  Tiling - same as for irregular reductions
Access Pattern
  Different from irregular reductions: writes go in only one direction
  This influences the grouping policy
Conflict Removal
  Conflict-free grouping only needs to consider column IDs
SpMM
Sparse Matrix-Matrix Multiplication: C = A x B
  Accesses two sparse matrices
  A, B, C are in CSR
Straightforward SIMD Implementation
  Load a vector of elements from A
  For each element from A, load a vector of elements from B
  Conduct a SIMD multiplication
  Store results to a hash table using column IDs as keys
  Very irregular memory access
  Needs unpacking of vector data from B
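A scalar sketch of this straightforward row-by-row SpMM with a hash accumulator keyed by column ID (standard CSR arrays assumed):

```cpp
#include <unordered_map>
#include <vector>

// C = A * B with A, B in CSR. Each row of C is accumulated in a hash table
// keyed by column ID. This is the baseline the tiled SIMD version improves on:
// the accesses into B and into the hash table are irregular.
void spmm_rowwise(int n,
                  const std::vector<int>& a_ptr, const std::vector<int>& a_col,
                  const std::vector<float>& a_val,
                  const std::vector<int>& b_ptr, const std::vector<int>& b_col,
                  const std::vector<float>& b_val,
                  std::vector<std::unordered_map<int, float>>& c_rows)
{
    c_rows.assign(n, {});
    for (int i = 0; i < n; ++i)
        for (int a = a_ptr[i]; a < a_ptr[i + 1]; ++a) {
            int   k  = a_col[a];
            float av = a_val[a];
            for (int b = b_ptr[k]; b < b_ptr[k + 1]; ++b)
                c_rows[i][b_col[b]] += av * b_val[b];   // scatter into hash table
        }
}
```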
Optimization - SpMM
Storage format:
  Partition the sparse matrix into 4 x 4 tiles
  Each tile containing non-zero(s) is stored as a dense matrix
  Tiles containing non-zero(s) are indexed in CSR format
SIMD Multiplication between Tiles
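A sketch of such a tiled format; field names are assumptions:

```cpp
#include <vector>

// Sparse matrix stored as 4x4 tiles: only tiles with at least one non-zero
// are kept, each as a dense 4x4 block (zeros filled in), and the tiles
// themselves are indexed in CSR over tile-rows/tile-columns.
struct TiledCSR {
    static constexpr int T = 4;           // tile dimension
    std::vector<int>   tile_row_ptr;      // CSR row pointers over tile rows
    std::vector<int>   tile_col;          // tile-column index of each stored tile
    std::vector<float> tile_val;          // 16 floats per tile, row-major
};
```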
Tile-wise Multiplication: Conflict-free SIMD Processing
In each iteration:
  Load Va from A; gather Vb from B using an index mapping such that elements in the same position in Va and Vb have the same color
  Conduct a SIMD multiplication to get the result Vr
  Permute Vr (with a gather) to align its elements with the corresponding elements (same color) in the partial reduction result Vc, and add Vr to Vc
  Locate the result tile Tc in the hash table for the corresponding row
  A SIMD add reduces Vc into Tc
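A simplified sketch of a conflict-free 4x4 tile product in 16-wide SIMD. The permutation indices below reproduce the C[i][j] += A[i][k] * B[k][j] dependence; they are not necessarily the exact index mapping or gather sequence from the talk, and in practice the index vectors would be precomputed constants:

```cpp
#include <immintrin.h>

// Multiply two dense 4x4 tiles (row-major, 16 floats each) and accumulate the
// product into the 4x4 result tile tc, all with 16-wide SIMD and no lane
// conflicts: every lane owns exactly one output element C[i][j].
void tile_mul_accumulate(const float* ta, const float* tb, float* tc)
{
    __m512 va = _mm512_load_ps(ta);          // whole A tile in one vector
    __m512 vb = _mm512_load_ps(tb);          // whole B tile in one vector
    __m512 vc = _mm512_setzero_ps();         // partial result for this tile pair

    for (int k = 0; k < 4; ++k) {
        alignas(64) int ia[16], ib[16];
        for (int l = 0; l < 16; ++l) {
            ia[l] = (l / 4) * 4 + k;          // lane l needs A[i][k], i = l / 4
            ib[l] = k * 4 + (l % 4);          // lane l needs B[k][j], j = l % 4
        }
        __m512 ak = _mm512_permutexvar_ps(_mm512_load_si512(ia), va);
        __m512 bk = _mm512_permutexvar_ps(_mm512_load_si512(ib), vb);
        vc = _mm512_fmadd_ps(ak, bk, vc);     // C[i][j] += A[i][k] * B[k][j]
    }

    // Reduce the partial result into the output tile with one SIMD add.
    _mm512_store_ps(tc, _mm512_add_ps(_mm512_load_ps(tc), vc));
}
```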
MIMD Level Parallelism
C = A x B
  Treat each tile as an element
  Divide the workload across threads based on the rows of A
  For each tile in A, scale it (in SIMD) with the elements in the corresponding row of B
  Store the scaling results to a hash table (in SIMD)
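A sketch of the MIMD level, assuming OpenMP: tile-rows of A are divided across threads, so each thread writes only its own rows of C and there are no cross-thread conflicts. TiledCSR and tile_mul_accumulate refer to the illustrative definitions above, and result_tile stands in for the hash-table lookup of the output tile:

```cpp
#include <omp.h>

// Assumed helper: returns (creating on demand) the dense 4x4 result tile of
// C at (tile_row i, tile_col j); each thread only touches its own rows of C.
float* result_tile(int i, int j);

void spmm_tiled(const TiledCSR& A, const TiledCSR& B, int num_tile_rows)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < num_tile_rows; ++i) {                   // tile-rows of A per thread
        for (int a = A.tile_row_ptr[i]; a < A.tile_row_ptr[i + 1]; ++a) {
            int k = A.tile_col[a];                              // tile-column in A = tile-row in B
            for (int b = B.tile_row_ptr[k]; b < B.tile_row_ptr[k + 1]; ++b) {
                int j = B.tile_col[b];
                tile_mul_accumulate(&A.tile_val[16 * a],        // SIMD tile product,
                                    &B.tile_val[16 * b],        // accumulated into C(i, j)
                                    result_tile(i, j));
            }
        }
    }
}
```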
Results
Platform
  Intel Xeon Phi SE10P coprocessor
  61 cores, 1.1 GHz, 4 hyper-threads per core
  8 GB GDDR5
  Intel ICC 13.1.0, -O3 enabled
Applications
  Irregular reductions: Moldyn, Euler
  Graph algorithms: Bellman-Ford, PageRank
  SpMM: float and double used
Irregular Reductions
Single Thread: different execution approaches
  Serial: single-thread scalar, row-by-row processing
  Serial Tiling (optimal tile size): scalar, on tiled data
  SIMD Naive: row-by-row processing, SIMD
  SIMD Tiling (Ours, optimal tile size): SIMD processing, on tiled data
Reported speedups: 9.05x, 7.47x, 2.09x, 2.01x
Irregular Reductions
Overall Speedup (over Single-Thread Scalar)
  Moldyn: up to 290x speedup, with 45-3.0r, 1024 tile size, 80 threads
  Euler: up to 64x speedup, with kron_g500-logn19, 2048 tile size, 200 threads
Graph Algorithms
Single Thread: different execution approaches
Up to 7.4x speedup
Graph Algorithms
MIMD + SIMD Performance
  Bellman-Ford: up to 467x speedup, with soc-Pokec, 512 tile size, 200 threads
  PageRank: up to 341x speedup, with kron_g500-logn19, 128 tile size, 244 threads
SpMM
Single Thread Performance (clustered datasets)
SIMD Tiling speedups:    float            double
  Over Serial            6.58x ~ 8.07x    4.75x ~ 7.70x
  Over MKL               1.5x ~ 2.89x     1.01x ~ 1.97x
SpMM
Overall Performance
Max speedups over Serial:        float       double
  Our Approach (solid lines)     392.49x     334.34x
  MKL (dashed lines)             139.68x     156.23x
Conclusions
Effective and General Methodology for SIMDizing Irregular Applications
  Irregular reductions
  Graph algorithms
  SpMM
Sparse Matrix View
  Optimization made easy
  Common steps can be applied to different subclasses
Performance
  High efficiency in both SIMD and MIMD utilization
Thanks for your attention! Q?
Linchuan Chen  chen.1996@osu.edu
Peng Jiang  jiang.952@osu.edu
Gagan Agrawal  agrawal@cse.ohio-state.edu