
1 Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology

2 Overview
Problem statement
New paradigm for parallelism → SLP
SLP extraction algorithm
Results
SLP vs. ILP and vector parallelism
Conclusions
Future work

3 Multimedia Extensions
Additions to all major ISAs
SIMD operations

6 Using Multimedia Extensions
Library calls and inline assembly
Difficult to program
Not portable
Different extensions to the same ISA
MMX and SSE
SSE vs. 3DNow!
Need automatic compilation

8 Vector Compilation
Pros:
Successful for vector computers
Large body of research
Cons:
Involved transformations
Targets loop nests

10 Superword Level Parallelism (SLP)
Small amount of parallelism
Typically 2 to 8-way
Exists within basic blocks
Uncovered with a simple analysis
Independent isomorphic operations: a new paradigm

11 1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * …
(diagram: the three multiply-adds become one SIMD operation on the packed registers [R G B] and [XR XG XB])

12 2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
(diagram: one wide load replaces the three scalar loads: [R G B] = [R G B] + X[i:i+2])

13 3. Vectorizable Loops
for (i=0; i<100; i+=1)
A[i+0] = A[i+0] + B[i+0]

14 3. Vectorizable Loops
for (i=0; i<100; i+=4)
A[i+0] = A[i+0] + B[i+0]
A[i+1] = A[i+1] + B[i+1]
A[i+2] = A[i+2] + B[i+2]
A[i+3] = A[i+3] + B[i+3]
for (i=0; i<100; i+=4)
A[i:i+3] = A[i:i+3] + B[i:i+3]

15 4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
L = A[i+0] – B[i+0]
D = D + abs(L)

16 4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
L = A[i+0] – B[i+0]
D = D + abs(L)
L = A[i+1] – B[i+1]
D = D + abs(L)
for (i=0; i<16; i+=2)
[L0 L1] = A[i:i+1] – B[i:i+1]
D = D + abs(L0)
D = D + abs(L1)

18 Exploiting SLP with SIMD Execution
Benefit:
Multiple ALU ops → One SIMD op
Multiple ld/st ops → One wide mem op
Cost:
Packing and unpacking
Reshuffling within a register

21 Packing/Unpacking Costs
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
Packing source operands: A and B must be gathered into [A B] before the SIMD add [C D] = [A B] + [2 3]
Unpacking destination operands: C and D must be extracted from [C D] before the scalar uses E = C / 5 and F = D * 7

23 Optimizing Program Performance
To achieve the best speedup:
Maximize parallelization
Minimize packing/unpacking
Many packing possibilities
Worst case: n ops → n! configurations
Different cost/benefit for each choice

25 Observation 1: Packing Costs can be Amortized
Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J
(the packed result [A D] feeds the second pair directly)
Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
(the packed operand [B E] is built once and used by both pairs)

27 Observation 2: Adjacent Memory is Key
Large potential performance gains
Eliminate load/store instructions
Reduce memory bandwidth
Few packing possibilities
Only one ordering exploits pre-packing

29 SLP Extraction Algorithm
Identify adjacent memory references:
A = X[i+0]
C = E * 3
B = X[i+1]
H = C – A
D = F * 5
J = D - B
→ pack: [A B] = X[i:i+1]

31 SLP Extraction Algorithm
Follow def-use chains: A and B are consumed by the isomorphic statements H = C – A and J = D - B
→ pack: [H J] = [C D] - [A B]

33 SLP Extraction Algorithm
Follow use-def chains: the remaining operands C and D come from the isomorphic statements C = E * 3 and D = F * 5
→ pack: [C D] = [E F] * [3 5]

35 SLP Compiler Results
SLP compiler implemented in SUIF
Tested on two benchmark suites: SPEC95fp and multimedia kernels
Performance measured three ways:
SLP availability
Compared to vector parallelism
Speedup on AltiVec

36 SLP Availability (chart)

37 SLP vs. Vector Parallelism (chart)

38 Speedup on AltiVec (chart; speedups up to 6.7)

40 SLP vs. Vector Parallelism
Extracted with a simple analysis
SLP is fine grain → basic blocks
Superset of vector parallelism
Unrolling transforms VP to SLP
Handles partially vectorizable loops

41 SLP vs. Vector Parallelism
(diagram: vector parallelism spans corresponding operations across loop iterations, while SLP is found among operations within a single basic block)

45 SLP vs. ILP
Subset of instruction level parallelism
SIMD hardware is simpler
Lack of heavily ported register files
SIMD instructions are more compact
Reduces instruction fetch bandwidth

48 SLP and ILP
SLP & ILP can be exploited together
Many architectures can already do this
SLP & ILP may compete
Occurs when parallelism is scarce
When ILP comes from loop-level parallelism, unroll the loop more times

52 Conclusions
Multimedia architectures abundant
Need automatic compilation
SLP is the right paradigm
20% non-vectorizable in SPEC95fp
SLP extraction successful
Simple, local analysis
Provides speedups from 1.24 to 6.70
Found SLP in general-purpose codes

54 Future Work
SLP analysis beyond basic blocks
Packing maintained across blocks
Loop invariant packing
Fill unused slots with speculative ops
SLP architectures
Emphasis on SIMD
Better packing/unpacking


