Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen and Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp
Overview
- Problem statement
- New paradigm for parallelism: SLP
- SLP extraction algorithm
- Results
- SLP vs. ILP and vector parallelism
- Conclusions
- Future work
Multimedia Extensions
- Additions to all major ISAs
- SIMD operations
Using Multimedia Extensions
- Library calls and inline assembly
  - Difficult to program
  - Not portable
- Different extensions to the same ISA
  - MMX and SSE
  - SSE vs. 3DNow!
- Need automatic compilation
Vector Compilation
- Pros:
  - Successful for vector computers
  - Large body of research
- Cons:
  - Involved transformations
  - Targets loop nests
Superword Level Parallelism (SLP)
- Small amount of parallelism
  - Typically 2 to 8-way
- Exists within basic blocks
- Uncovered with a simple analysis
- Independent isomorphic operations
  - New paradigm
1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R G B] = [R G B] + X[i:i+2]
3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]
3. Vectorizable Loops

Unrolled by 4:
for (i=0; i<100; i+=4) {
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]
}

Packed:
for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
4. Partially Vectorizable Loops

for (i=0; i<16; i+=1) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
}
4. Partially Vectorizable Loops

Unrolled by 2:
for (i=0; i<16; i+=2) {
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)
}

Packed:
for (i=0; i<16; i+=2) {
  [L0 L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
}
Exploiting SLP with SIMD Execution
- Benefit:
  - Multiple ALU ops become one SIMD op
  - Multiple ld/st ops become one wide memory op
- Cost:
  - Packing and unpacking
  - Reshuffling within a register
Packing/Unpacking Costs
- Packing source operands: A = f() and B = g() must be packed into [A B]
- Unpacking destination operands: E = C / 5 and F = D * 7 require unpacking [C D]

A = f()
B = g()
C = A + 2        [C D] = [A B] + [2 3]
D = B + 3
E = C / 5
F = D * 7
Optimizing Program Performance
- To achieve the best speedup:
  - Maximize parallelization
  - Minimize packing/unpacking
- Many packing possibilities
  - Worst case: n ops yield n! configurations
  - Different cost/benefit for each choice
Observation 1: Packing Costs Can Be Amortized
- Use packed result operands:

  A = B + C
  D = E + F
  G = A - H
  I = D - J

- Share packed source operands:

  A = B + C
  D = E + F
  G = B + H
  I = E + J
Observation 2: Adjacent Memory is Key
- Large potential performance gains
  - Eliminate ld/st instructions
  - Reduce memory bandwidth
- Few packing possibilities
  - Only one ordering exploits pre-packing
SLP Extraction Algorithm
Identify adjacent memory references:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
SLP Extraction Algorithm
Follow def-use chains:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[H J] = [C D] - [A B]
SLP Extraction Algorithm
Follow use-def chains:

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[C D] = [E F] * [3 5]
[H J] = [C D] - [A B]
SLP Compiler Results
- SLP compiler implemented in SUIF
- Tested on two benchmark suites
  - SPEC95fp
  - Multimedia kernels
- Performance measured three ways:
  - SLP availability
  - Compared to vector parallelism
  - Speedup on AltiVec
SLP Availability
SLP vs. Vector Parallelism
Speedup on AltiVec [chart; maximum speedup 6.7x]
SLP vs. Vector Parallelism
- Extracted with a simple analysis
  - SLP is fine grain: basic blocks
- Superset of vector parallelism
  - Unrolling transforms VP to SLP
  - Handles partially vectorizable loops
SLP vs. Vector Parallelism [diagram: loop iterations unrolled into a single basic block]
SLP vs. ILP
- Subset of instruction level parallelism
- SIMD hardware is simpler
  - Lacks heavily ported register files
- SIMD instructions are more compact
  - Reduces instruction fetch bandwidth
SLP and ILP
- SLP & ILP can be exploited together
  - Many architectures can already do this
- SLP & ILP may compete
  - Occurs when parallelism is scarce
  - Unroll the loop more times when ILP is due to loop level parallelism
Conclusions
- Multimedia architectures abundant
  - Need automatic compilation
- SLP is the right paradigm
  - 20% non-vectorizable in SPEC95fp
- SLP extraction successful
  - Simple, local analysis
  - Provides speedups from 1.24x to 6.70x
- Found SLP in general-purpose codes
Future Work
- SLP analysis beyond basic blocks
  - Packing maintained across blocks
  - Loop invariant packing
  - Fill unused slots with speculative ops
- SLP architectures
  - Emphasis on SIMD
  - Better packing/unpacking