1
Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen and Saman Amarasinghe, MIT CSAIL. Presenters: Baixu Chen, Rui Hou, Yifan Zhao
2
Outline
- Introduction
  - Vector Parallelism
  - Superword Level Parallelism
- SLP Compiler Algorithm and Example
- Evaluation
- Conclusion
3
Introduction: Vector Parallelism (VP)
- Single instruction, multiple data (SIMD): a vector register holds several values at once, and a single instruction applies the same operation to each of them in one pass.
- Example: in the vectorized iteration below, each load brings 3 adjacent values into one vector register, and one vector add executes instead of 3 individual scalar adds, so execution is ideally 3x faster.

Original loop:
for (i = 0; i < 9; ++i) {
    A[i] += B[i] + 2;
}

Unrolled loop:
for (i = 0; i < 9; i += 3) {
    A[i]   += B[i]   + 2;
    A[i+1] += B[i+1] + 2;
    A[i+2] += B[i+2] + 2;
}

Vectorized iteration:
for (i = 0; i < 9; i += 3) {
    A[i..i+2] += B[i..i+2] + 2;
}
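As a concrete illustration (hand-written, not from the slides) of what a vectorized iteration maps to on real hardware, here is a sketch using x86 SSE2 intrinsics. It assumes 4-wide integer vectors, 16-byte-aligned arrays, and a length that is a multiple of 4; the function name is illustrative:

#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical 4-wide realization of A[i..] += B[i..] + 2.
// Assumes A and B are 16-byte aligned and n is a multiple of 4.
void add_vectorized(int* A, const int* B, int n) {
    const __m128i two = _mm_set1_epi32(2);                    // broadcast constant 2
    for (int i = 0; i < n; i += 4) {
        __m128i a = _mm_load_si128((__m128i*)(A + i));        // load A[i..i+3]
        __m128i b = _mm_load_si128((const __m128i*)(B + i));  // load B[i..i+3]
        a = _mm_add_epi32(a, _mm_add_epi32(b, two));          // A[i..i+3] += B[i..i+3] + 2
        _mm_store_si128((__m128i*)(A + i), a);                // store A[i..i+3]
    }
}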
4
Introduction: SLP vs. Vector Parallelism
- VP identifies vectorizable instructions by unrolling the loop:

for (i = 0; i < 9; ++i) {
    A[i] += B[i] + 2;
}

becomes

for (i = 0; i < 9; i += 3) {
    A[i..i+2] += B[i..i+2] + 2;
}

- SLP identifies vectorizable instructions by finding isomorphic instructions, which share the same instruction structure:

a = b + c * z[i+0];
d = e + f * z[i+1];
g = h + j * z[i+2];

One SIMD operation computes all three:

[a d g] = [b e h] + [c f j] * [z[i+0] z[i+1] z[i+2]]
5
Introduction: Superword Level Parallelism (SLP)
- Generally applicable: SLP is not restricted to the parallelization of loops.
- Finds independent, isomorphic instructions within the same basic block.
- Goals: (1) gain more speedup via parallelization; (2) minimize the cost of packing and unpacking.
- Prefers operations on adjacent memory, whose packing cost is minimal.
- In the SLP vs. VP example above, SLP packs both the multiplies and the adds into SIMD operations without relying on the loop structure.
6
Introduction: Superword Level Parallelism (SLP)
- SLP vs. vector parallelism vs. ILP: vector parallelism is a subset of SLP, and SLP is in turn a restricted form of instruction-level parallelism.
7
Outline
- Introduction
  - Vector Parallelism
  - Superword Level Parallelism
- SLP Compiler Algorithm and Example
- Evaluation
- Conclusion
8
Loop Unrolling
- Allows SLP to also cover loop vectorization.
- Unroll factor = SIMD datapath size / scalar variable size.
- Compiler pipeline: Loop unrolling -> Alignment analysis -> Pre-optimization -> then, per basic block: PackSet seeding -> PackSet extension -> Group combination -> Scheduling -> SIMD instructions.

Example with sizeof(A[i]) == 4 bytes and an 8-byte datapath, so the unroll count is 2:

for (i = 0; i < 8; ++i) {
    A[i] += B[i] + 2;
}

unrolls to

for (i = 0; i < 8; i += 2) {
    A[i]   += B[i]   + 2;
    A[i+1] += B[i+1] + 2;
}

which SLP vectorizes to

A[i:i+2] += B[i:i+2] + 2;   // elements i and i+1
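The unroll-factor rule above is a simple division; a minimal sketch (function name is illustrative, not from the paper):

int unroll_factor(int datapath_bytes, int element_bytes) {
    // e.g., an 8-byte datapath with 4-byte scalars gives a factor of 2
    return datapath_bytes / element_bytes;
}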
9
Alignment Analysis
- SLP vectorization emits wider loads/stores:
  - A[i] += B[i] + 2 loads 4 bytes; A[i:i+2] += B[i:i+2] + 2 loads 8 bytes.
- Goal: emit aligned SIMD loads (aligned to the datapath size). Unaligned loads are slower on most Intel chips and illegal on ARM chips.
- Alignment analysis computes address-modulo information: 0x1001 mod 4 = 1, 0x1004 mod 8 = 4, etc. A modulo of 0 means the access is aligned.
- Merge 2 loads only if the result is an aligned load:
  - load A[i], 4 bytes + load A[i+1], 4 bytes -> load A[i], 8 bytes, but only if A[i] is aligned to 8 bytes.
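A minimal sketch of that merge decision, assuming the compiler already knows each address modulo the datapath size (helper name and signature are illustrative, not from the paper):

#include <cstddef>
#include <cstdint>

// Merge two adjacent scalar loads into one wide load only if the
// resulting wide load is itself aligned to the datapath size.
bool can_merge_loads(std::uintptr_t addr_a, std::uintptr_t addr_b,
                     std::size_t elem_size, std::size_t datapath) {
    if (addr_b != addr_a + elem_size)   // loads must be adjacent in memory
        return false;
    return addr_a % datapath == 0;      // merged load stays aligned
}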
10
Pre-optimization
- Emit three-address code (flattened instructions): each instruction has at most 3 operands, and isomorphic instructions are easier to identify in this form.
- Run classical optimizations first: LICM, dead code elimination, constant propagation, etc. Don't spend effort vectorizing instructions that will be optimized away!
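For instance, a compound expression is flattened so its pieces line up with other statements of the same shape:

D = E * 3 + B

flattens into

C = E * 3
D = C + B

which matches the statement shapes in the running example on the next slide.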
11
Identifying Adjacent Memory References
- Find pairs of adjacent, independent memory loads.
  - Adjacent: A[i] and A[i+1].
  - Independent: no overlap, and the loads write different registers.
- PackSet: a set of pairs of instructions.

Basic block:
(1) B = A[i+0]
(2) C = E * 3
(3) D = C + B
(4) F = A[i+1]
(5) G = F * 6
(6) H = G + F
(7) I = A[i+2]
(8) J = I * 8
(9) K = J + I

Seed PackSet:
L: (1) B = A[i+0]   R: (4) F = A[i+1]
L: (4) F = A[i+1]   R: (7) I = A[i+2]
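A runnable toy model of PackSet seeding (the Load struct and all names are illustrative simplifications of the compiler's IR; a real implementation also checks independence and alignment):

#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Simplified model of a load statement: dst = base[i + index].
struct Load { std::string dst, base; int index; };

// Adjacent: consecutive elements of the same array (A[i] and A[i+1]).
// Independence holds here because each load writes a distinct register.
static bool adjacent(const Load& a, const Load& b) {
    return a.base == b.base && b.index == a.index + 1;
}

// Seed the PackSet with (left, right) pairs of adjacent loads.
static std::vector<std::pair<int, int>>
seed_packset(const std::vector<Load>& loads) {
    std::vector<std::pair<int, int>> packset;
    for (std::size_t i = 0; i < loads.size(); ++i)
        for (std::size_t j = 0; j < loads.size(); ++j)
            if (i != j && adjacent(loads[i], loads[j]))
                packset.push_back({int(i), int(j)});
    return packset;
}

int main() {
    // Statements (1), (4), (7) above: B = A[i+0], F = A[i+1], I = A[i+2].
    std::vector<Load> loads = {{"B", "A", 0}, {"F", "A", 1}, {"I", "A", 2}};
    for (const auto& p : seed_packset(loads))
        std::printf("L: %s = A[i+%d]  R: %s = A[i+%d]\n",
                    loads[p.first].dst.c_str(), loads[p.first].index,
                    loads[p.second].dst.c_str(), loads[p.second].index);
    // Prints the two seed pairs: (B, F) and (F, I).
}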
12
Extending the PackSet
- Add more instructions to the PackSet by using existing packed data as source operands (following def-use chains).

Since B, F, I are packed, the isomorphic adds that consume them join the PackSet:
L: (3) D = C + B    R: (6) H = G + F
L: (6) H = G + F    R: (9) K = J + I
13
Extending the PackSet
- Use existing packed data as source operands (def-use).
- Produce needed source operands in packed form (use-def).

Use-def extension packs the multiplies that produce the adds' operands C, G, J, giving the full PackSet:
L: (1) B = A[i+0]   R: (4) F = A[i+1]
L: (4) F = A[i+1]   R: (7) I = A[i+2]
L: (3) D = C + B    R: (6) H = G + F
L: (6) H = G + F    R: (9) K = J + I
L: (2) C = E * 3    R: (5) G = F * 6
L: (5) G = F * 6    R: (8) J = I * 8
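A runnable toy model of def-use extension on the running example (the Stmt struct and names are illustrative; use-def extension works analogously by matching the statements that define each packed operand):

#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Simplified three-address statement: dst = src1 op src2.
struct Stmt { std::string dst, src1, op, src2; };

static bool uses(const Stmt& s, const std::string& v) {
    return s.src1 == v || s.src2 == v;
}

// def-use extension: if the two results of a packed pair (a, b) feed two
// isomorphic statements (same opcode), pack those consumers as well.
static std::vector<std::pair<int, int>>
follow_def_uses(const std::vector<Stmt>& code, int a, int b) {
    std::vector<std::pair<int, int>> out;
    for (std::size_t i = 0; i < code.size(); ++i)
        for (std::size_t j = 0; j < code.size(); ++j)
            if (i != j && code[i].op == code[j].op &&
                uses(code[i], code[a].dst) && uses(code[j], code[b].dst))
                out.push_back({int(i), int(j)});
    return out;
}

int main() {
    // Statements (1)-(9) above; loads are written with opcode "ld".
    std::vector<Stmt> code = {
        {"B", "A[i+0]", "ld", ""}, {"C", "E", "*", "3"}, {"D", "C", "+", "B"},
        {"F", "A[i+1]", "ld", ""}, {"G", "F", "*", "6"}, {"H", "G", "+", "F"},
        {"I", "A[i+2]", "ld", ""}, {"J", "I", "*", "8"}, {"K", "J", "+", "I"},
    };
    // Extending from the seed pair ((1) B = A[i+0], (4) F = A[i+1]):
    for (const auto& p : follow_def_uses(code, 0, 3))
        std::printf("new pair: %s with %s\n",
                    code[p.first].dst.c_str(), code[p.second].dst.c_str());
    // Prints "new pair: D with H", i.e., statement (3) packs with (6).
}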
14
Extending the PackSet -- Cost Model
- Estimate the cost before and after the optimization; perform it only when profitable.
- Packing and unpacking have a cost:
  - Packing: scalar registers -> SIMD register.
  - Unpacking: SIMD register -> scalar registers.

Example: packing [A B] and unpacking [C D] adds instructions around the single SIMD add:

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

[C D] = [A B] + [2 3]   <- requires packing A, B before and unpacking C, D after
15
Extending the PackSet -- Cost Model
- Using a vector value compensates for its packing cost: a group of n scalar instructions becomes 1 SIMD instruction, so generating longer use chains amortizes the one-time packing.
- Choose PackSet instructions wisely:
  - Contiguous memory loads/stores incur no packing/unpacking cost.
  - Prioritize minimum-cost instructions when making choices.

Example with no packing/unpacking cost at all:

long tmp1, tmp2, A[], B[], D[];
tmp1 = B[i+0] + D[i+0];
tmp2 = B[i+1] + D[i+1];
A[i+0] = tmp1;
A[i+1] = tmp2;

[tmp1 tmp2] = [B[i+0] B[i+1]] + [D[i+0] D[i+1]]
[A[i+0] A[i+1]] = [tmp1 tmp2]   <- no packing/unpacking cost!
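A toy profitability estimate capturing these rules (constants are illustrative; the paper uses per-target cost tables, and the n-1 pack/unpack count is stated in the evaluation section):

// Estimate the benefit of packing a group of n isomorphic statements.
// Packing n scalar values into a SIMD register takes n-1 instructions,
// and likewise for unpacking -- unless the values are contiguous in
// memory, where one wide load/store does the job for free.
int pack_savings(int n, int packed_source_vectors, int unpacked_result_vectors) {
    int saved = n - 1;  // n scalar ops collapse into 1 SIMD op
    int cost = (packed_source_vectors + unpacked_result_vectors) * (n - 1);
    return saved - cost;  // positive -> estimated profitable
}

For the contiguous example above, both vector counts are 0, so the pack is pure savings.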
16
Combination
- Combine all profitable pairs into larger groups.
- Two pairs can be combined when the left statement of one is the same as the right statement of the other.
- Prevent a statement from appearing in more than one group in the final PackSet.

PackSet pairs:
L: (1) B = A[i+0]   R: (4) F = A[i+1]
L: (4) F = A[i+1]   R: (7) I = A[i+2]
L: (3) D = C + B    R: (6) H = G + F
L: (6) H = G + F    R: (9) K = J + I
L: (2) C = E * 3    R: (5) G = F * 6
L: (5) G = F * 6    R: (8) J = I * 8

Combined groups:
(1) B = A[i+0]   (4) F = A[i+1]   (7) I = A[i+2]
(3) D = C + B    (6) H = G + F    (9) K = J + I
(2) C = E * 3    (5) G = F * 6    (8) J = I * 8
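A runnable toy model of combination, operating on pairs named by statement number (illustrative; a full implementation also enforces the one-group-per-statement rule when pairs conflict):

#include <cstddef>
#include <cstdio>
#include <vector>

// When one group's last statement equals another group's first
// statement, splice them into a single longer group.
static std::vector<std::vector<int>>
combine(std::vector<std::vector<int>> groups) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < groups.size() && !changed; ++i)
            for (std::size_t j = 0; j < groups.size() && !changed; ++j)
                if (i != j && groups[i].back() == groups[j].front()) {
                    groups[i].insert(groups[i].end(),
                                     groups[j].begin() + 1, groups[j].end());
                    groups.erase(groups.begin() + j);
                    changed = true;   // restart the scan after each merge
                }
    }
    return groups;
}

int main() {
    // The six PackSet pairs above, named by statement number.
    auto groups = combine({{1, 4}, {4, 7}, {3, 6}, {6, 9}, {2, 5}, {5, 8}});
    for (const auto& g : groups) {
        for (int s : g) std::printf("(%d) ", s);
        std::printf("\n");   // prints {1,4,7}, {3,6,9}, {2,5,8}
    }
}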
17
Scheduling
- Previous steps may break inter-group dependencies! Groups are formed without regard to order, and moving instructions together risks breaking dependencies.
- Schedule the groups to guarantee correctness:
  - An edge a -> b means some instruction in group a depends on some instruction in group b.
  - A cycle is a circular dependency between a and b; break one of them (e.g., a) and move a's dependent instructions after b.
- Finally, emit one SIMD instruction for each group.

Example: packing {x, y, z} and {q, r, s} creates a cycle (q depends on y, and z depends on s). Emitting {x, y, z} as one group would place z before s:

Original:
x = a[i+0] + k1;
y = a[i+1] + k2;
q = b[i+0] + y;
r = b[i+1] + k3;
s = b[i+2] + k4;
z = a[i+2] + s;

Naive group order (broken: z reads s before it is written):
x = a[i+0] + k1;
y = a[i+1] + k2;
z = a[i+2] + s;
q = b[i+0] + y;
r = b[i+1] + k3;
s = b[i+2] + k4;

So the scheduler breaks the {x, y, z} group, moving its dependent instruction z after the {q, r, s} group.
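A minimal sketch of dependence-aware group scheduling (illustrative; deps[g] is assumed to list the groups that must be emitted before group g):

#include <vector>

// Emit a group only once everything it depends on has been emitted.
// If no group is ready, the remainder contains a dependence cycle; a
// real implementation would split one group into scalar statements
// (as in the z example above) and retry.
std::vector<int> schedule(int n, const std::vector<std::vector<int>>& deps) {
    std::vector<bool> emitted(n, false);
    std::vector<int> order;
    bool progress = true;
    while ((int)order.size() < n && progress) {
        progress = false;
        for (int g = 0; g < n; ++g) {
            if (emitted[g]) continue;
            bool ready = true;
            for (int d : deps[g])
                if (!emitted[d]) ready = false;   // a predecessor is pending
            if (ready) {
                emitted[g] = true;
                order.push_back(g);
                progress = true;
            }
        }
    }
    return order;   // shorter than n when a cycle was detected
}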
18
Outline
- Introduction
  - Vector Parallelism
  - Superword Level Parallelism
- SLP Compiler Algorithm and Example
- Evaluation
- Conclusion
19
Evaluation
- SLP compiler implemented in SUIF, a compiler infrastructure.
- Tested on two benchmark suites: SPEC95fp and multimedia kernels.
- Performance measured three ways: SLP availability, comparison to vector parallelism, and speedup on AltiVec.
- SLP availability: the percentage of dynamic instructions eliminated from a sequential program after parallelization. (n-1 instructions are needed to pack or unpack n values into a single SIMD register.)

[Charts: SLP availability; SLP vs. vector parallelism]
20
Evaluation
- Speedup on AltiVec: measured on a Motorola MPC7400 microprocessor with the AltiVec instruction set.

[Chart: speedup on AltiVec, up to 6.7x]
21
Conclusion
- Multimedia architectures are abundant and need automatic compilation.
- SLP is the right paradigm: 20% dynamic instruction savings even from non-vectorizable code sequences in SPEC95fp.
- SLP extraction is successful: a simple, local analysis that provides speedups (up to 6.7x on AltiVec).
- Future work: extending SLP beyond basic blocks.