1
Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen and Saman Amarasinghe, MIT CSAIL. Presenters: Baixu Chen, Rui Hou, Yifan Zhao
2
Outline
- Introduction
  - Vector Parallelism
  - Superword Level Parallelism
- SLP Compiler Algorithm and Example
- Evaluation
- Conclusion
3
Introduction: Vector Parallelism (VP)
- Single instruction, multiple data (SIMD): a vector register holds several values at once, and a single instruction applies the same operation to each of them in one pass.
- Example: in the vectorized iteration below, each load brings 3 adjacent values into one vector register, and one vector add executes instead of 3 individual scalar adds, so execution is ideally 3x faster.

Original loop:
for (i = 0; i < 9; ++i) {
    A[i] += B[i] + 2;
}

Unrolled loop:
for (i = 0; i < 9; i += 3) {
    A[i]   += B[i]   + 2;
    A[i+1] += B[i+1] + 2;
    A[i+2] += B[i+2] + 2;
}

Vectorized iteration:
for (i = 0; i < 9; i += 3) {
    A[i..i+2] += B[i..i+2] + 2;
}
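As a concrete illustration (hand-written, not from the slides) of what a vectorized iteration maps to on real hardware, here is a sketch using x86 SSE2 intrinsics. It assumes 4-wide integer vectors, 16-byte-aligned arrays, and a length that is a multiple of 4; the function name is illustrative:

#include <emmintrin.h>  // SSE2 intrinsics

// Hypothetical 4-wide realization of A[i..] += B[i..] + 2.
// Assumes A and B are 16-byte aligned and n is a multiple of 4.
void add_vectorized(int* A, const int* B, int n) {
    const __m128i two = _mm_set1_epi32(2);                    // broadcast constant 2
    for (int i = 0; i < n; i += 4) {
        __m128i a = _mm_load_si128((__m128i*)(A + i));        // load A[i..i+3]
        __m128i b = _mm_load_si128((const __m128i*)(B + i));  // load B[i..i+3]
        a = _mm_add_epi32(a, _mm_add_epi32(b, two));          // A[i..i+3] += B[i..i+3] + 2
        _mm_store_si128((__m128i*)(A + i), a);                // store A[i..i+3]
    }
}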
4
Introduction: SLP vs. Vector Parallelism
- VP identifies vectorizable instructions by unrolling the loop:

for (i = 0; i < 9; ++i) {
    A[i] += B[i] + 2;
}

becomes

for (i = 0; i < 9; i += 3) {
    A[i..i+2] += B[i..i+2] + 2;
}

- SLP identifies vectorizable instructions by finding isomorphic instructions, which share the same instruction structure:

a = b + c * z[i+0];
d = e + f * z[i+1];
g = h + j * z[i+2];

One SIMD operation computes all three:

[a d g] = [b e h] + [c f j] * [z[i+0] z[i+1] z[i+2]]
5
Introduction: Superword Level Parallelism (SLP)
- Generally applicable: SLP is not restricted to the parallelization of loops.
- Finds independent, isomorphic instructions within the same basic block.
- Goals: (1) gain more speedup via parallelization; (2) minimize the cost of packing and unpacking.
- Prefers operations on adjacent memory, whose packing cost is minimal.
- In the SLP vs. VP example above, SLP packs both the multiplies and the adds into SIMD operations without relying on the loop structure.
6
Introduction: Superword Level Parallelism (SLP)
- SLP vs. vector parallelism vs. ILP: vector parallelism is a subset of SLP, and SLP is in turn a restricted form of instruction-level parallelism.
7
Outline
- Introduction
  - Vector Parallelism
  - Superword Level Parallelism
- SLP Compiler Algorithm and Example
- Evaluation
- Conclusion
8
Loop Unrolling
- Allows SLP to also cover loop vectorization.
- Unroll factor = SIMD datapath size / scalar variable size.
- Compiler pipeline: Loop unrolling -> Alignment analysis -> Pre-optimization -> then, per basic block: PackSet seeding -> PackSet extension -> Group combination -> Scheduling -> SIMD instructions.

Example with sizeof(A[i]) == 4 bytes and an 8-byte datapath, so the unroll count is 2:

for (i = 0; i < 8; ++i) {
    A[i] += B[i] + 2;
}

unrolls to

for (i = 0; i < 8; i += 2) {
    A[i]   += B[i]   + 2;
    A[i+1] += B[i+1] + 2;
}

which SLP vectorizes to

A[i:i+2] += B[i:i+2] + 2;   // elements i and i+1
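The unroll-factor rule above is a simple division; a minimal sketch (function name is illustrative, not from the paper):

int unroll_factor(int datapath_bytes, int element_bytes) {
    // e.g., an 8-byte datapath with 4-byte scalars gives a factor of 2
    return datapath_bytes / element_bytes;
}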
9
Alignment Analysis
- SLP vectorization emits wider loads/stores:
  - A[i] += B[i] + 2 loads 4 bytes; A[i:i+2] += B[i:i+2] + 2 loads 8 bytes.
- Goal: emit aligned SIMD loads (aligned to the datapath size). Unaligned loads are slower on most Intel chips and illegal on ARM chips.
- Alignment analysis computes address-modulo information: 0x1001 mod 4 = 1, 0x1004 mod 8 = 4, etc. A modulo of 0 means the access is aligned.
- Merge 2 loads only if the result is an aligned load:
  - load A[i], 4 bytes + load A[i+1], 4 bytes -> load A[i], 8 bytes, but only if A[i] is aligned to 8 bytes.
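A minimal sketch of that merge decision, assuming the compiler already knows each address modulo the datapath size (helper name and signature are illustrative, not from the paper):

#include <cstddef>
#include <cstdint>

// Merge two adjacent scalar loads into one wide load only if the
// resulting wide load is itself aligned to the datapath size.
bool can_merge_loads(std::uintptr_t addr_a, std::uintptr_t addr_b,
                     std::size_t elem_size, std::size_t datapath) {
    if (addr_b != addr_a + elem_size)   // loads must be adjacent in memory
        return false;
    return addr_a % datapath == 0;      // merged load stays aligned
}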
10
Pre-optimization
- Emit three-address code (flattened instructions): each instruction has at most 3 operands, and isomorphic instructions are easier to identify in this form.
- Run classical optimizations first: LICM, dead code elimination, constant propagation, etc. Don't spend effort vectorizing instructions that will be optimized away!
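For instance, a compound expression is flattened so its pieces line up with other statements of the same shape:

D = E * 3 + B

flattens into

C = E * 3
D = C + B

which matches the statement shapes in the running example on the next slide.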
11
Identifying Adjacent Memory References
- Find pairs of adjacent, independent memory loads.
  - Adjacent: A[i] and A[i+1].
  - Independent: no overlap, and the loads write different registers.
- PackSet: a set of pairs of instructions.

Basic block:
(1) B = A[i+0]
(2) C = E * 3
(3) D = C + B
(4) F = A[i+1]
(5) G = F * 6
(6) H = G + F
(7) I = A[i+2]
(8) J = I * 8
(9) K = J + I

Seed PackSet:
L: (1) B = A[i+0]   R: (4) F = A[i+1]
L: (4) F = A[i+1]   R: (7) I = A[i+2]
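A runnable toy model of PackSet seeding (the Load struct and all names are illustrative simplifications of the compiler's IR; a real implementation also checks independence and alignment):

#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Simplified model of a load statement: dst = base[i + index].
struct Load { std::string dst, base; int index; };

// Adjacent: consecutive elements of the same array (A[i] and A[i+1]).
// Independence holds here because each load writes a distinct register.
static bool adjacent(const Load& a, const Load& b) {
    return a.base == b.base && b.index == a.index + 1;
}

// Seed the PackSet with (left, right) pairs of adjacent loads.
static std::vector<std::pair<int, int>>
seed_packset(const std::vector<Load>& loads) {
    std::vector<std::pair<int, int>> packset;
    for (std::size_t i = 0; i < loads.size(); ++i)
        for (std::size_t j = 0; j < loads.size(); ++j)
            if (i != j && adjacent(loads[i], loads[j]))
                packset.push_back({int(i), int(j)});
    return packset;
}

int main() {
    // Statements (1), (4), (7) above: B = A[i+0], F = A[i+1], I = A[i+2].
    std::vector<Load> loads = {{"B", "A", 0}, {"F", "A", 1}, {"I", "A", 2}};
    for (const auto& p : seed_packset(loads))
        std::printf("L: %s = A[i+%d]  R: %s = A[i+%d]\n",
                    loads[p.first].dst.c_str(), loads[p.first].index,
                    loads[p.second].dst.c_str(), loads[p.second].index);
    // Prints the two seed pairs: (B, F) and (F, I).
}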
12
Extending the PackSet
- Add more instructions to the PackSet by using existing packed data as source operands (following def-use chains).

Since B, F, I are packed, the isomorphic adds that consume them join the PackSet:
L: (3) D = C + B    R: (6) H = G + F
L: (6) H = G + F    R: (9) K = J + I
13
Extending the PackSet
- Use existing packed data as source operands (def-use).
- Produce needed source operands in packed form (use-def).

Use-def extension packs the multiplies that produce the adds' operands C, G, J, giving the full PackSet:
L: (1) B = A[i+0]   R: (4) F = A[i+1]
L: (4) F = A[i+1]   R: (7) I = A[i+2]
L: (3) D = C + B    R: (6) H = G + F
L: (6) H = G + F    R: (9) K = J + I
L: (2) C = E * 3    R: (5) G = F * 6
L: (5) G = F * 6    R: (8) J = I * 8
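A runnable toy model of def-use extension on the running example (the Stmt struct and names are illustrative; use-def extension works analogously by matching the statements that define each packed operand):

#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Simplified three-address statement: dst = src1 op src2.
struct Stmt { std::string dst, src1, op, src2; };

static bool uses(const Stmt& s, const std::string& v) {
    return s.src1 == v || s.src2 == v;
}

// def-use extension: if the two results of a packed pair (a, b) feed two
// isomorphic statements (same opcode), pack those consumers as well.
static std::vector<std::pair<int, int>>
follow_def_uses(const std::vector<Stmt>& code, int a, int b) {
    std::vector<std::pair<int, int>> out;
    for (std::size_t i = 0; i < code.size(); ++i)
        for (std::size_t j = 0; j < code.size(); ++j)
            if (i != j && code[i].op == code[j].op &&
                uses(code[i], code[a].dst) && uses(code[j], code[b].dst))
                out.push_back({int(i), int(j)});
    return out;
}

int main() {
    // Statements (1)-(9) above; loads are written with opcode "ld".
    std::vector<Stmt> code = {
        {"B", "A[i+0]", "ld", ""}, {"C", "E", "*", "3"}, {"D", "C", "+", "B"},
        {"F", "A[i+1]", "ld", ""}, {"G", "F", "*", "6"}, {"H", "G", "+", "F"},
        {"I", "A[i+2]", "ld", ""}, {"J", "I", "*", "8"}, {"K", "J", "+", "I"},
    };
    // Extending from the seed pair ((1) B = A[i+0], (4) F = A[i+1]):
    for (const auto& p : follow_def_uses(code, 0, 3))
        std::printf("new pair: %s with %s\n",
                    code[p.first].dst.c_str(), code[p.second].dst.c_str());
    // Prints "new pair: D with H", i.e., statement (3) packs with (6).
}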
14
Extending the PackSet -- Cost Model
- Estimate the cost before and after the optimization; perform it only when profitable.
- Packing and unpacking have a cost:
  - Packing: scalar registers -> SIMD register.
  - Unpacking: SIMD register -> scalar registers.

Example: packing [A B] and unpacking [C D] adds instructions around the single SIMD add:

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

[C D] = [A B] + [2 3]   <- requires packing A, B before and unpacking C, D after
15
Extending the PackSet -- Cost Model
- Using a vector value compensates for its packing cost: a group of n scalar instructions becomes 1 SIMD instruction, so generating longer use chains amortizes the one-time packing.
- Choose PackSet instructions wisely:
  - Contiguous memory loads/stores incur no packing/unpacking cost.
  - Prioritize minimum-cost instructions when making choices.

Example with no packing/unpacking cost at all:

long tmp1, tmp2, A[], B[], D[];
tmp1 = B[i+0] + D[i+0];
tmp2 = B[i+1] + D[i+1];
A[i+0] = tmp1;
A[i+1] = tmp2;

[tmp1 tmp2] = [B[i+0] B[i+1]] + [D[i+0] D[i+1]]
[A[i+0] A[i+1]] = [tmp1 tmp2]   <- no packing/unpacking cost!
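A toy profitability estimate capturing these rules (constants are illustrative; the paper uses per-target cost tables, and the n-1 pack/unpack count is stated in the evaluation section):

// Estimate the benefit of packing a group of n isomorphic statements.
// Packing n scalar values into a SIMD register takes n-1 instructions,
// and likewise for unpacking -- unless the values are contiguous in
// memory, where one wide load/store does the job for free.
int pack_savings(int n, int packed_source_vectors, int unpacked_result_vectors) {
    int saved = n - 1;  // n scalar ops collapse into 1 SIMD op
    int cost = (packed_source_vectors + unpacked_result_vectors) * (n - 1);
    return saved - cost;  // positive -> estimated profitable
}

For the contiguous example above, both vector counts are 0, so the pack is pure savings.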
16
Combination
- Combine all profitable pairs into larger groups.
- Two pairs can be combined when the left statement of one is the same as the right statement of the other.
- Prevent a statement from appearing in more than one group in the final PackSet.

PackSet pairs:
L: (1) B = A[i+0]   R: (4) F = A[i+1]
L: (4) F = A[i+1]   R: (7) I = A[i+2]
L: (3) D = C + B    R: (6) H = G + F
L: (6) H = G + F    R: (9) K = J + I
L: (2) C = E * 3    R: (5) G = F * 6
L: (5) G = F * 6    R: (8) J = I * 8

Combined groups:
(1) B = A[i+0]   (4) F = A[i+1]   (7) I = A[i+2]
(3) D = C + B    (6) H = G + F    (9) K = J + I
(2) C = E * 3    (5) G = F * 6    (8) J = I * 8
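A runnable toy model of combination, operating on pairs named by statement number (illustrative; a full implementation also enforces the one-group-per-statement rule when pairs conflict):

#include <cstddef>
#include <cstdio>
#include <vector>

// When one group's last statement equals another group's first
// statement, splice them into a single longer group.
static std::vector<std::vector<int>>
combine(std::vector<std::vector<int>> groups) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (std::size_t i = 0; i < groups.size() && !changed; ++i)
            for (std::size_t j = 0; j < groups.size() && !changed; ++j)
                if (i != j && groups[i].back() == groups[j].front()) {
                    groups[i].insert(groups[i].end(),
                                     groups[j].begin() + 1, groups[j].end());
                    groups.erase(groups.begin() + j);
                    changed = true;   // restart the scan after each merge
                }
    }
    return groups;
}

int main() {
    // The six PackSet pairs above, named by statement number.
    auto groups = combine({{1, 4}, {4, 7}, {3, 6}, {6, 9}, {2, 5}, {5, 8}});
    for (const auto& g : groups) {
        for (int s : g) std::printf("(%d) ", s);
        std::printf("\n");   // prints {1,4,7}, {3,6,9}, {2,5,8}
    }
}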
17
Scheduling
- Previous steps may break inter-group dependencies! Groups are formed without regard to order, and moving instructions together risks breaking dependencies.
- Schedule the groups to guarantee correctness:
  - An edge a -> b means some instruction in group a depends on some instruction in group b.
  - A cycle is a circular dependency between a and b; break one of them (e.g., a) and move a's dependent instructions after b.
- Finally, emit one SIMD instruction for each group.

Example: packing {x, y, z} and {q, r, s} creates a cycle (q depends on y, and z depends on s). Emitting {x, y, z} as one group would place z before s:

Original:
x = a[i+0] + k1;
y = a[i+1] + k2;
q = b[i+0] + y;
r = b[i+1] + k3;
s = b[i+2] + k4;
z = a[i+2] + s;

Naive group order (broken: z reads s before it is written):
x = a[i+0] + k1;
y = a[i+1] + k2;
z = a[i+2] + s;
q = b[i+0] + y;
r = b[i+1] + k3;
s = b[i+2] + k4;

So the scheduler breaks the {x, y, z} group, moving its dependent instruction z after the {q, r, s} group.
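A minimal sketch of dependence-aware group scheduling (illustrative; deps[g] is assumed to list the groups that must be emitted before group g):

#include <vector>

// Emit a group only once everything it depends on has been emitted.
// If no group is ready, the remainder contains a dependence cycle; a
// real implementation would split one group into scalar statements
// (as in the z example above) and retry.
std::vector<int> schedule(int n, const std::vector<std::vector<int>>& deps) {
    std::vector<bool> emitted(n, false);
    std::vector<int> order;
    bool progress = true;
    while ((int)order.size() < n && progress) {
        progress = false;
        for (int g = 0; g < n; ++g) {
            if (emitted[g]) continue;
            bool ready = true;
            for (int d : deps[g])
                if (!emitted[d]) ready = false;   // a predecessor is pending
            if (ready) {
                emitted[g] = true;
                order.push_back(g);
                progress = true;
            }
        }
    }
    return order;   // shorter than n when a cycle was detected
}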
18
Outline
- Introduction
  - Vector Parallelism
  - Superword Level Parallelism
- SLP Compiler Algorithm and Example
- Evaluation
- Conclusion
19
Evaluation
- SLP compiler implemented in SUIF, a compiler infrastructure.
- Tested on two benchmark suites: SPEC95fp and multimedia kernels.
- Performance measured three ways: SLP availability, comparison to vector parallelism, and speedup on AltiVec.
- SLP availability: the percentage of dynamic instructions eliminated from a sequential program after parallelization. (n-1 instructions are needed to pack or unpack n values into a single SIMD register.)

[Charts: SLP availability; SLP vs. vector parallelism]
20
Evaluation
- Speedup on AltiVec: measured on a Motorola MPC7400 microprocessor with the AltiVec instruction set.

[Chart: speedup on AltiVec, up to 6.7x]
21
Conclusion
- Multimedia architectures are abundant and need automatic compilation.
- SLP is the right paradigm: 20% dynamic instruction savings even from non-vectorizable code sequences in SPEC95fp.
- SLP extraction is successful: a simple, local analysis that provides speedups (up to 6.7x on AltiVec).
- Future work: extending SLP beyond basic blocks.