
1 Exploiting Recent SIMD Architectural Advances for Irregular Applications
Linchuan Chen, Peng Jiang and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University, USA

2 Motivation - Applications
Irregular data accesses are very common:
- Graph applications
- Unstructured grids
- Particle simulations
- Sparse matrix operations
- And more ...

3 Motivation - SIMD Hardware Evolution
More powerful parallelism:
- Wider SIMD lanes
- Massive MIMD cores
Challenging to exploit for applications with irregular data access:
- Memory access locality
- Write conflicts

4 Intel Xeon Phi
Large-scale parallelism:
- 61 cores, each supporting 4 hyper-threads
Wide SIMD lanes:
- 512-bit lane width = 16 floats / 8 doubles
More flexible SIMD programming with gather/scatter:
- Enables programming for irregular memory access (see the sketch below)
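To make the gather/scatter support concrete, here is a minimal sketch of an indirect read-modify-write across 16 SIMD lanes. It uses AVX-512-style _mm512 intrinsics; the KNC-generation Xeon Phi actually exposed these operations through the earlier IMCI instruction set, so treat the exact intrinsic names as illustrative.

```c
#include <immintrin.h>

/* Minimal sketch: scale 16 indirectly addressed floats in one shot.
   Assumes idx holds 16 distinct indices, so the scatter has no
   within-vector write conflicts (the theme of the following slides). */
void scale_indirect(float *data, const int *idx, float s)
{
    __m512i vidx = _mm512_loadu_si512(idx);          /* 16 indices        */
    __m512  v = _mm512_i32gather_ps(vidx, data, 4);  /* read data[idx[l]] */
    v = _mm512_mul_ps(v, _mm512_set1_ps(s));         /* compute in SIMD   */
    _mm512_i32scatter_ps(data, vidx, v, 4);          /* write back        */
}
```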

5 Intel Xeon Phi - Gather/Scatter Performance
SIMD access locality is influenced by the access range: gather/scatter performance falls significantly below that of load/store once the access range grows beyond 256.

6 Contributions
A general optimization methodology:
- Based on a sparse matrix view
- Makes data access pattern identification easy
The steps:
- Data locality enhancement through matrix tiling
- Data access pattern identification
- Write conflict removal at both SIMD and MIMD levels
Subclasses studied:
- Irregular reductions
- Graph algorithms
- SpMM

7 Sparse Matrix View of Irregular Reductions
Sequential computation steps (a scalar sketch follows below):
- Read node elements
- Do the computation
- Write to the reduction array
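For concreteness, here is a scalar sketch of such a loop. The names (edge_u, edge_v, interact) are assumptions for illustration; in the sparse matrix view, edge e corresponds to a non-zero at position (edge_u[e], edge_v[e]).

```c
/* Scalar irregular reduction over an edge list (names illustrative). */
static float interact(float a, float b) { return a - b; } /* placeholder kernel */

void irregular_reduction(int num_edges, const int *edge_u, const int *edge_v,
                         const float *node, float *reduction)
{
    for (int e = 0; e < num_edges; e++) {
        int u = edge_u[e], v = edge_v[e];
        float f = interact(node[u], node[v]); /* read node elements, compute  */
        reduction[u] += f;                    /* write to the reduction array */
        reduction[v] -= f;                    /* at both endpoints            */
    }
}
```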

8 Parallel Optimization – Irregular Reductions
Step 1: Matrix Tiling
- The sparse matrix is tiled into squares (tiles)
- A tile is stored as COO (rows, cols, vals); see the struct sketch below
- Each tile is processed exclusively by one thread, in SIMD
Step 2: Access Pattern Identification
- Writes happen in both the row and the column direction
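A sketch of the per-tile COO layout, with assumed field names:

```c
/* One tile of the sparse matrix in COO form (field names assumed).
   Keeping a tile's rows/cols within a small square bounds the address
   range seen by gathers and scatters, which is what restores locality. */
typedef struct {
    int   *rows;  /* row IDs of the non-zeros in this tile    */
    int   *cols;  /* column IDs of the non-zeros in this tile */
    float *vals;  /* the non-zero (edge) values               */
    int    nnz;   /* number of non-zeros in this tile         */
} Tile;
```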

9 Optimization – Irregular Reductions
Step 3: Conflict Removal
Conflicts:
- SIMD level: different lanes might write to the same locations
- MIMD level: different threads might write to the same locations
Conflict removal at the SIMD level, by grouping (a greedy sketch follows below):
- Divide the non-zeros in each tile into conflict-free groups
- In each conflict-free group, every two elements have distinct row IDs and distinct column IDs
Conflict removal at the MIMD level:
- Group tiles in the same way
- In each parallel step, all threads process one conflict-free tile group
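One simple way to build such groups is a greedy pass: each non-zero goes into the first group that does not yet contain its row or its column. This is a sketch under assumptions (tile-relative IDs, fixed bounds, static scratch arrays, so not reentrant), not necessarily the paper's exact algorithm; for graph algorithms (slide 12) the same routine would test column IDs only.

```c
#include <string.h>

#define TILE_DIM   1024  /* assumed tile width                      */
#define MAX_GROUPS 64    /* assumed to be enough groups in practice */

/* Greedy conflict-free grouping: group_of[e] receives the group of
   non-zero e; returns the number of groups. rows/cols are assumed
   tile-relative (0 <= id < TILE_DIM). */
int assign_groups(int nnz, const int *rows, const int *cols, int *group_of)
{
    static char row_used[MAX_GROUPS][TILE_DIM];
    static char col_used[MAX_GROUPS][TILE_DIM];
    memset(row_used, 0, sizeof row_used);
    memset(col_used, 0, sizeof col_used);
    int num_groups = 0;
    for (int e = 0; e < nnz; e++) {
        int g = 0;  /* first group free of this row and this column */
        while (g < MAX_GROUPS - 1 &&
               (row_used[g][rows[e]] || col_used[g][cols[e]]))
            g++;
        row_used[g][rows[e]] = 1;
        col_used[g][cols[e]] = 1;
        group_of[e] = g;
        if (g + 1 > num_groups) num_groups = g + 1;
    }
    return num_groups;
}
```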

10 Optimization – Irregular Reductions
Execution, i.e. SIMD processing of a conflict-free group over the tile arrays rid, cid, val (an intrinsics sketch follows below):
1. Load row IDs and column IDs, using aligned loads
2. Load non-zero values, using an aligned load
3. Gather node data according to row IDs
4. Gather node data according to column IDs
5. Compute in SIMD
6. Update the reduction array by row IDs, with scatter
7. Update the reduction array by column IDs, with scatter
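These steps map almost one-to-one onto intrinsics. Below is a sketch for one group of 16 non-zeros, again with AVX-512-style names standing in for KNC's IMCI, and a placeholder compute kernel:

```c
#include <immintrin.h>

/* SIMD processing of one conflict-free group of 16 non-zeros (rid, cid,
   val assumed 64-byte aligned). Within the group all row IDs are pairwise
   distinct, and likewise all column IDs, so each scatter is conflict-free;
   the row-side update completes before the column side is read, in case a
   row ID equals a column ID of another lane. */
void process_group(const int *rid, const int *cid, const float *val,
                   const float *node, float *red)
{
    __m512i vr = _mm512_load_si512(rid);           /* 1. aligned load: row IDs    */
    __m512i vc = _mm512_load_si512(cid);           /*    aligned load: column IDs */
    __m512  vv = _mm512_load_ps(val);              /* 2. aligned load: values     */
    __m512  nu = _mm512_i32gather_ps(vr, node, 4); /* 3. gather nodes by rid      */
    __m512  nv = _mm512_i32gather_ps(vc, node, 4); /* 4. gather nodes by cid      */
    __m512  f  = _mm512_mul_ps(_mm512_sub_ps(nu, nv), vv); /* 5. placeholder      */
    __m512  ru = _mm512_i32gather_ps(vr, red, 4);
    _mm512_i32scatter_ps(red, vr, _mm512_add_ps(ru, f), 4); /* 6. scatter by rid  */
    __m512  rv = _mm512_i32gather_ps(vc, red, 4);
    _mm512_i32scatter_ps(red, vc, _mm512_sub_ps(rv, f), 4); /* 7. scatter by cid  */
}
```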

11 Sparse Matrix View of Graph Algorithms
Sequential computation steps (a Bellman-Ford sketch follows below):
- Load values of source vertices
- Load values of edges
- Compute
- Update destination vertices
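As a concrete instance, here is a scalar sketch of one Bellman-Ford relaxation pass (src, dst, wgt, dist are assumed names). Note that, unlike irregular reductions, the write touches only the destination side:

```c
/* One pass of Bellman-Ford edge relaxation over an edge list.
   In the sparse matrix view, edge e is the non-zero (src[e], dst[e])
   with value wgt[e]; only destination vertices are written. */
void relax_edges(int num_edges, const int *src, const int *dst,
                 const float *wgt, float *dist)
{
    for (int e = 0; e < num_edges; e++) {
        float cand = dist[src[e]] + wgt[e];  /* source value + edge value */
        if (cand < dist[dst[e]])
            dist[dst[e]] = cand;             /* update destination vertex */
    }
}
```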

12 Optimization – Graph Algorithms
Locality enhancement:
- Tiling, the same as for irregular reductions
Access pattern:
- Different from irregular reductions: writes go in only one direction
- This influences the grouping policy
Conflict removal:
- Conflict-free grouping only needs to consider column IDs

13 SpMM
Sparse Matrix-Matrix Multiplication: C = A X B
- Accesses two sparse matrices; A, B and C are in CSR
Straightforward SIMD implementation (a scalar analogue is sketched below):
- Load a vector of elements from A
- For each element from A, load a vector of elements from B
- Conduct a SIMD multiplication
- Store results to a hash table, using column IDs as keys
Drawbacks:
- Very irregular memory access
- Needs unpacking of the vector data from B
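The scalar analogue of this scheme is Gustavson-style row-wise SpMM, sketched below with a dense accumulator standing in for the hash table (the CSR array names are conventional, not from the paper):

```c
/* Row i of C = A x B, Gustavson style: every non-zero a_ik of A walks
   row k of B. acc is a dense accumulator for row i (zeroed by the
   caller), standing in for the hash table keyed by column IDs. */
void spmm_row(int i,
              const int *A_rowptr, const int *A_colidx, const float *A_vals,
              const int *B_rowptr, const int *B_colidx, const float *B_vals,
              float *acc)
{
    for (int p = A_rowptr[i]; p < A_rowptr[i + 1]; p++) {
        int   k = A_colidx[p];
        float a = A_vals[p];
        for (int q = B_rowptr[k]; q < B_rowptr[k + 1]; q++)
            acc[B_colidx[q]] += a * B_vals[q]; /* data-dependent writes */
    }
}
```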

14 Optimization – SpMM
Storage format:
- Partition the sparse matrix into 4 x 4 tiles
- Each tile containing non-zero(s) is stored as a dense matrix
- Tiles containing non-zero(s) are indexed in CSR format (see the struct sketch below)
SIMD multiplication between tiles
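A sketch of this layout, essentially a blocked CSR with 4 x 4 dense tiles (field names assumed):

```c
#define TDIM 4  /* tile dimension used in the paper */

/* Blocked-CSR-style sketch of the SpMM storage format: non-empty 4x4
   tiles are stored densely (zeros included) and indexed by a CSR
   structure over tile rows/columns. A full tile is exactly one 512-bit
   vector of 16 floats, loadable with a single aligned load. */
typedef struct {
    int    num_tile_rows;  /* number of tile rows                        */
    int   *tile_rowptr;    /* CSR row pointers over tile rows            */
    int   *tile_colidx;    /* tile-column index of each non-empty tile   */
    float *tile_vals;      /* TDIM*TDIM floats per tile, 64-byte aligned */
} TiledCSR;
```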

15 Tile-wise Multiplication: Conflict-free SIMD Processing
In each iteration (an intrinsics sketch follows below):
1. Load Va from A; gather Vb from B using an index mapping such that elements in the same position of Va and Vb belong to the same product (same color in the slide's figure)
2. Conduct a SIMD multiplication to get the result Vr
3. Conduct a permutation (with gather) on Vr, aligning its elements with the corresponding elements (same color) of the partial reduction result Vc, and add Vr to Vc
4. Locate the result tile Tc in the hash table for the corresponding row
5. Reduce Vc into Tc with a SIMD add
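One concrete realization of these steps for a 4 x 4 tile pair is sketched below, again with AVX-512-style intrinsics (_mm512_permutexvar_ps plays the role of the "permutation with gather"). The index mappings implement one valid coloring, rotating the B column by the iteration number; they are illustrative, not necessarily the paper's exact tables.

```c
#include <immintrin.h>

/* Tc += Ta x Tb for 4x4 dense tiles held as 16 row-major floats.
   Lane l of a vector holds element (l/4, l%4) of a tile. In iteration s,
   source lane (i,k) of Va is paired with B[k][j], j = (k+s)%4, so each
   lane computes a distinct product A[i][k]*B[k][j] destined for C[i][j];
   destination lane (i,j) then pulls its product from lane (i,(j-s)%4). */
void tile_multiply(const float *tileA, const float *tileB, float *tileC)
{
    __m512 va = _mm512_load_ps(tileA);   /* aligned load of Va (all of Ta) */
    __m512 vc = _mm512_setzero_ps();     /* partial reduction result Vc    */
    for (int s = 0; s < 4; s++) {
        int gi[16], pi[16];
        for (int l = 0; l < 16; l++) {
            int i = l / 4, k = l % 4, j = (k + s) % 4;
            gi[l] = 4 * k + j;                     /* source lane l gathers B[k][j] */
            pi[l] = 4 * i + ((l % 4 - s + 4) % 4); /* dest lane l's source in Vr    */
        }
        __m512 vb = _mm512_i32gather_ps(_mm512_loadu_si512(gi), tileB, 4);
        __m512 vr = _mm512_mul_ps(va, vb);         /* Vr = Va * Vb */
        vc = _mm512_add_ps(vc,
                 _mm512_permutexvar_ps(_mm512_loadu_si512(pi), vr));
    }
    /* final SIMD add reduces Vc into the result tile Tc */
    _mm512_store_ps(tileC, _mm512_add_ps(_mm512_load_ps(tileC), vc));
}
```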

16 MIMD Level Parallelism
C = A X B, treating each tile as an element (an OpenMP sketch follows below):
- Divide the workload among the threads, based on the rows of A
- For each tile in A, scale it (in SIMD) with the elements in the corresponding row of B
- Store the scaling results to a hash table (in SIMD)
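A minimal OpenMP sketch of this split, building on the TiledCSR sketch above. HashTable and process_tile_pair are hypothetical helpers, not the paper's API:

```c
typedef struct HashTable HashTable;  /* hypothetical per-row accumulator */

/* Hypothetical helper: SIMD-multiplies tile p of A's tile-row against
   the matching tile-row of B (e.g. via tile_multiply) and accumulates
   the result tiles into out. */
void process_tile_pair(const TiledCSR *A, int p, const TiledCSR *B,
                       HashTable *out);

/* MIMD split: tile rows of A are divided among threads, so each thread
   owns its output rows and the hash tables need no locking. */
void spmm_mimd(const TiledCSR *A, const TiledCSR *B, HashTable *C_rows)
{
    #pragma omp parallel for schedule(dynamic)
    for (int ti = 0; ti < A->num_tile_rows; ti++)
        for (int p = A->tile_rowptr[ti]; p < A->tile_rowptr[ti + 1]; p++)
            process_tile_pair(A, p, B, &C_rows[ti]);
}
```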

17 Results
Platform:
- Intel Xeon Phi SE10P coprocessor: 61 cores, 1.1 GHz, 4 hyper-threads per core, 8 GB GDDR5
- Intel ICC, with -O3 enabled
Applications:
- Irregular reductions: Moldyn, Euler
- Graph algorithms: Bellman-Ford, PageRank
- SpMM: float and double used

18 Irregular Reductions: Single Thread
Different execution approaches:
- Serial: single-thread scalar, row-by-row processing
- Serial Tiling (optimal tile size): scalar, on tiled data
- SIMD Naive: row-by-row processing, in SIMD
- SIMD Tiling, ours (optimal tile size): SIMD processing, on tiled data
(chart: speedups of 9.05x, 7.47x, 2.09x and 2.01x reported across the approaches and datasets)

19 Irregular Reductions: Overall Speedup (over single-thread scalar)
- Moldyn: up to 290x speedup (dataset r, tile size 1024, 80 threads)
- Euler: up to 64x speedup (kron_g500-logn19, tile size 2048, 200 threads)

20 Graph Algorithms: Single Thread
Different execution approaches
(chart: 7.4x speedup reported)

21 Graph Algorithms: MIMD + SIMD Performance
- Bellman-Ford: up to 467x speedup (soc-Pokec, tile size 512, 200 threads)
- PageRank: up to 341x speedup (kron_g500-logn19, tile size 128, 244 threads)

22 SpMM: Single Thread Performance (clustered datasets)
SIMD Tiling speedups (float / double):
- Over Serial: 6.58x ~ 8.07x / 4.75x ~ 7.70x
- Over MKL: 1.5x ~ 2.89x / 1.01x ~ 1.97x

23 SpMM: Overall Performance
Max speedups over Serial:
                              float     double
  Our Approach (solid lines)  392.49x   334.34x
  MKL (dashed lines)          139.68x   156.23x

24 Conclusions
An effective and general methodology for SIMDizing irregular applications:
- Irregular reductions, graph algorithms, SpMM
Sparse matrix view:
- Makes optimization easy
- Common steps can be applied to different subclasses
Performance:
- High efficiency in both SIMD and MIMD utilization

25 Thanks for your attention! Q?
Linchuan Chen, Peng Jiang, Gagan Agrawal

