1
Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology
2
Overview
Problem statement
New paradigm for parallelism: SLP
SLP extraction algorithm
Results
SLP vs. ILP and vector parallelism
Conclusions
Future work
3
Multimedia Extensions
Additions to all major ISAs
SIMD operations
6
Using Multimedia Extensions
Library calls and inline assembly:
Difficult to program
Not portable
Different extensions to the same ISA:
MMX and SSE
SSE vs. 3DNow!
Need automatic compilation
8
Vector Compilation
Pros:
Successful for vector computers
Large body of research
Cons:
Involved transformations
Targets loop nests
10
Superword Level Parallelism (SLP)
Small amount of parallelism: typically 2 to 8-way
Exists within basic blocks
Uncovered with a simple analysis
Independent isomorphic operations: a new paradigm
11
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * …
[diagram: the three independent multiply-adds packed into one SIMD multiply-add]
12
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
[diagram: the three adjacent loads combined into one wide load: (R,G,B) = (R,G,B) + X[i:i+2]]
13
3. Vectorizable Loops
for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]
14
3. Vectorizable Loops
for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

becomes

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
15
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
16
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)

becomes

for (i=0; i<16; i+=2)
  (L0,L1) = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
18
Exploiting SLP with SIMD Execution
Benefit:
Multiple ALU ops become one SIMD op
Multiple ld/st ops become one wide memory op
Cost:
Packing and unpacking
Reshuffling within a register
21
Packing/Unpacking Costs
Packing source operands, unpacking destination operands:
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
[diagram: A and B packed into one register before the SIMD add; C and D unpacked afterward for the scalar uses]
23
Optimizing Program Performance
To achieve the best speedup:
Maximize parallelization
Minimize packing/unpacking
Many packing possibilities:
Worst case: n ops yield n! configurations
Different cost/benefit for each choice
25
Observation 1: Packing Costs can be Amortized
Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J
Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
27
Observation 2: Adjacent Memory is Key
Large potential performance gains:
Eliminate ld/st instructions
Reduce memory bandwidth
Few packing possibilities:
Only one ordering exploits pre-packing
28
SLP Extraction Algorithm
Identify adjacent memory references:
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
31
SLP Extraction Algorithm
Follow def-use chains:
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
Packed so far:
(A,B) = X[i:i+1]
(H,J) = (C,D) - (A,B)
33
SLP Extraction Algorithm
Follow use-def chains:
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B
Packed so far:
(A,B) = X[i:i+1]
(C,D) = (E,F) * (3,5)
(H,J) = (C,D) - (A,B)
35
SLP Compiler Results
SLP compiler implemented in SUIF
Tested on two benchmark suites:
SPEC95fp
Multimedia kernels
Performance measured three ways:
SLP availability
Compared to vector parallelism
Speedup on AltiVec
36
SLP Availability
[chart: SLP availability across the benchmarks]
37
SLP vs. Vector Parallelism
[chart: SLP compared with vector parallelism across the benchmarks]
38
Speedup on AltiVec
[chart: speedups on AltiVec; maximum 6.7]
40
SLP vs. Vector Parallelism
SLP is fine grain:
Basic blocks
Extracted with a simple analysis
Superset of vector parallelism:
Unrolling transforms vector parallelism to SLP
Handles partially vectorizable loops
41
SLP vs. Vector Parallelism
[diagram: unrolled loop iterations forming one basic block; vector parallelism spans the iterations, while SLP is found within the basic block]
45
SLP vs. ILP
Subset of instruction level parallelism
SIMD hardware is simpler:
Lack of heavily ported register files
SIMD instructions are more compact:
Reduces instruction fetch bandwidth
48
SLP and ILP
SLP & ILP can be exploited together:
Many architectures can already do this
SLP & ILP may compete:
Occurs when parallelism is scarce
Unroll the loop more times when ILP is due to loop-level parallelism
52
Conclusions
Multimedia architectures abundant:
Need automatic compilation
SLP is the right paradigm:
20% non-vectorizable in SPEC95fp
SLP extraction successful:
Simple, local analysis
Provides speedups from 1.24 to 6.70
Found SLP in general-purpose codes
54
Future Work
SLP analysis beyond basic blocks:
Packing maintained across blocks
Loop invariant packing
Fill unused slots with speculative ops
SLP architectures:
Emphasis on SIMD
Better packing/unpacking