
1 Exploiting Superword Level Parallelism with Multimedia Instruction Sets
Samuel Larsen Saman Amarasinghe Laboratory for Computer Science Massachusetts Institute of Technology

2 Overview
Problem statement
New paradigm for parallelism → SLP
SLP extraction algorithm
Results
SLP vs. ILP and vector parallelism
Conclusions
Future work

3 Multimedia Extensions
Additions to all major ISAs
SIMD operations

6 Using Multimedia Extensions
Library calls and inline assembly
Difficult to program
Not portable
Different extensions to the same ISA
MMX and SSE
SSE vs. 3DNow!
Need automatic compilation

8 Vector Compilation
Pros:
Successful for vector computers
Large body of research
Cons:
Involved transformations
Targets loop nests

10 Superword Level Parallelism (SLP)
Small amount of parallelism
Typically 2 to 8-way
Exists within basic blocks
Uncovered with a simple analysis
Independent isomorphic operations: a new paradigm

11 1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * …
(diagram: the three multiply-adds become one SIMD operation on the packed registers [R G B] and [XR XG XB])

12 2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
(diagram: one wide load replaces the three scalar loads: [R G B] = [R G B] + X[i:i+2])

13 3. Vectorizable Loops
for (i=0; i<100; i+=1)
A[i+0] = A[i+0] + B[i+0]

14 3. Vectorizable Loops
for (i=0; i<100; i+=4)
A[i+0] = A[i+0] + B[i+0]
A[i+1] = A[i+1] + B[i+1]
A[i+2] = A[i+2] + B[i+2]
A[i+3] = A[i+3] + B[i+3]
for (i=0; i<100; i+=4)
A[i:i+3] = A[i:i+3] + B[i:i+3]

15 4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
L = A[i+0] – B[i+0]
D = D + abs(L)

16 4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
L = A[i+0] – B[i+0]
D = D + abs(L)
L = A[i+1] – B[i+1]
D = D + abs(L)
for (i=0; i<16; i+=2)
[L0 L1] = A[i:i+1] – B[i:i+1]
D = D + abs(L0)
D = D + abs(L1)

18 Exploiting SLP with SIMD Execution
Benefit:
Multiple ALU ops → One SIMD op
Multiple ld/st ops → One wide mem op
Cost:
Packing and unpacking
Reshuffling within a register

21 Packing/Unpacking Costs
A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7
Packing source operands: A and B must be gathered into [A B] before the SIMD add [C D] = [A B] + [2 3]
Unpacking destination operands: C and D must be extracted from [C D] before the scalar uses E = C / 5 and F = D * 7

23 Optimizing Program Performance
To achieve the best speedup:
Maximize parallelization
Minimize packing/unpacking
Many packing possibilities
Worst case: n ops → n! configurations
Different cost/benefit for each choice

25 Observation 1: Packing Costs can be Amortized
Use packed result operands:
A = B + C
D = E + F
G = A - H
I = D - J
(the packed result [A D] feeds the second pair directly)
Share packed source operands:
A = B + C
D = E + F
G = B + H
I = E + J
(the packed operand [B E] is built once and used by both pairs)

27 Observation 2: Adjacent Memory is Key
Large potential performance gains
Eliminate load/store instructions
Reduce memory bandwidth
Few packing possibilities
Only one ordering exploits pre-packing

29 SLP Extraction Algorithm
Identify adjacent memory references:
A = X[i+0]
C = E * 3
B = X[i+1]
H = C – A
D = F * 5
J = D - B
→ pack: [A B] = X[i:i+1]

31 SLP Extraction Algorithm
Follow def-use chains: A and B are consumed by the isomorphic statements H = C – A and J = D - B
→ pack: [H J] = [C D] - [A B]

33 SLP Extraction Algorithm
Follow use-def chains: the remaining operands C and D come from the isomorphic statements C = E * 3 and D = F * 5
→ pack: [C D] = [E F] * [3 5]

35 SLP Compiler Results
SLP compiler implemented in SUIF
Tested on two benchmark suites: SPEC95fp and multimedia kernels
Performance measured three ways:
SLP availability
Compared to vector parallelism
Speedup on AltiVec

36 SLP Availability (chart)

37 SLP vs. Vector Parallelism (chart)

38 Speedup on AltiVec (chart; speedups up to 6.7)

40 SLP vs. Vector Parallelism
Extracted with a simple analysis
SLP is fine grain → basic blocks
Superset of vector parallelism
Unrolling transforms VP to SLP
Handles partially vectorizable loops

41 SLP vs. Vector Parallelism
(diagram: vector parallelism spans corresponding operations across loop iterations, while SLP is found among operations within a single basic block)

45 SLP vs. ILP
Subset of instruction level parallelism
SIMD hardware is simpler
Lack of heavily ported register files
SIMD instructions are more compact
Reduces instruction fetch bandwidth

48 SLP and ILP
SLP & ILP can be exploited together
Many architectures can already do this
SLP & ILP may compete
Occurs when parallelism is scarce
When ILP comes from loop-level parallelism, unroll the loop more times

52 Conclusions
Multimedia architectures abundant
Need automatic compilation
SLP is the right paradigm
20% non-vectorizable in SPEC95fp
SLP extraction successful
Simple, local analysis
Provides speedups from 1.24 to 6.70
Found SLP in general-purpose codes

54 Future Work
SLP analysis beyond basic blocks
Packing maintained across blocks
Loop invariant packing
Fill unused slots with speculative ops
SLP architectures
Emphasis on SIMD
Better packing/unpacking


