1 Exploiting Vector Parallelism in Software Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

2 Multimedia Extensions
Short vector extensions in ILP processors
- AltiVec, 3DNow!, SSE, etc.
- Accelerate loops in multimedia & DSP codes
- New designs have floating-point support

3 Multimedia Extensions
Vector resources do not overwhelm the scalar resources
- Scalar: 2 FP ops / cycle
- Vector: 4 FP ops / cycle
Full vectorization may underutilize scalar resources
ILP techniques do not target vector resources
Need both
(Image courtesy of International Business Machines Corporation. Unauthorized use not permitted.)

4 Modulo Scheduling

for (i=0; i<N; i++) {
  s = s + X[i] * Y[i];
}

Operations: LOAD -> MULT -> ADD

Modulo schedule (3 issue slots):
  cycle 1: LOAD
  cycle 2: MULT
  cycle 3: LOAD, ADD
  cycle 4: MULT
  ...

II = 2 (modulo scheduling)
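
At the source level, the modulo-scheduled kernel corresponds to overlapping adjacent iterations. A minimal C sketch under that reading, with the multiply of one iteration issued while the previous iteration's accumulate retires (variable names are illustrative; the real schedule is produced by the compiler, not written by hand):

/* Hedged sketch: software-pipelined dot product. The loop body mixes two
 * iterations: the multiply for iteration i overlaps the accumulate for
 * iteration i-1. Assumes N >= 1. */
double dot_pipelined(const double *X, const double *Y, int N) {
    double s = 0.0;
    double prod = X[0] * Y[0];      /* prologue: start the first iteration */
    for (int i = 1; i < N; i++) {
        double next = X[i] * Y[i];  /* stage 1 of iteration i              */
        s += prod;                  /* stage 2 of iteration i-1            */
        prod = next;
    }
    s += prod;                      /* epilogue: finish the last iteration */
    return s;
}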

5 Traditional Vectorization

The multiply is distributed into its own vectorizable loop; the reduction stays scalar:

for (i=0; i<N; i+=2) {
  S[i:i+1] = X[i:i+1] * Y[i:i+1];
}
for (i=0; i<N; i++) {
  s = s + S[i];
}

Vector loop schedule (3 issue slots):
  cycle 1: VLOAD
  cycle 2:
  cycle 3: VMUL
  cycle 4: VSTORE
  ...

Scalar reduction loop schedule:
  cycle 1: LOAD
  cycle 2: ADD
  ...

II = 2 (modulo scheduling); II = 2 + 1 (traditional vectorization)
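
For readers unfamiliar with the slice notation, here is a minimal plain-C sketch of the same distributed form, assuming N is even and S is a temporary array of length N; a vectorizing compiler would turn each pair of multiplies in the first loop into one vector multiply, while the reduction in the second loop stays scalar:

/* Hedged sketch of loop distribution for vectorization. The first loop is
 * the data-parallel part; the second is the scalar reduction that forces
 * the distribution in the first place. Assumes N is even. */
for (int i = 0; i < N; i += 2) {
    S[i]     = X[i]     * Y[i];      /* maps to one 2-wide vector multiply */
    S[i + 1] = X[i + 1] * Y[i + 1];
}
for (int i = 0; i < N; i++) {
    s = s + S[i];                    /* scalar reduction loop              */
}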

6 Vectorization without Distribution

for (i=0; i<N; i+=2) {
  S = X[i:i+1] * Y[i:i+1];
  s = s + S[0];
  s = s + S[1];
}

Schedule (3 issue slots):
  cycle 1: VLOAD
  cycle 2:
  cycle 3: VMUL
  cycle 4: VLOAD, ADD
  cycle 5: VLOAD, ADD
  cycle 6: VMUL

II = 2 (modulo scheduling); II = 3 (traditional); II = 1.5 (no distribution)

7 Selective Vectorization

The same loop as before, but some operations are now scheduled on scalar functional units so that scalar and vector resources are used together:

for (i=0; i<N; i+=2) {
  S = X[i:i+1] * Y[i:i+1];
  s = s + S[0];
  s = s + S[1];
}

Schedule (3 issue slots):
  cycle 1: VLOAD, LOAD
  cycle 2:
  cycle 3: VLOAD, LOAD
  cycle 4: VMUL, LOAD
  cycle 5: VLOAD, LOAD, ADD
  cycle 6: VMUL, LOAD, ADD

II = 2 (modulo scheduling); II = 3 (traditional); II = 1.5 (no distribution); II = 1 (selective)

8 Complications
Complex scheduling requirements
- Particularly in statically scheduled machines
Memory alignment
Example assumes no communication cost
In reality, explicit operations required
- Often through memory
- Reserve critical resources
- Potential long latency
Performance improvement still possible

9 Tomcatv main loop (50%)

10 Tomcatv (SpecFP 95)

Machine configuration:
  Issue Width: 6
  Memory Units: 2
  ALUs: 4
  FPUs: 2
  Vector Units: 1
  Vector Length: 2*

  Technique                 ALU   MEM   FPU   VEC
  Modulo Scheduling          6     22    46     0
  Full Vectorization         7     13     0    46
  Selective Vectorization    7     27    19    27

1.7x speedup over modulo scheduling

11 Tomcatv (SpecFP 95)

12 Selective Vectorization
Balance computation among resources
- Minimize II when the loop is modulo scheduled
Carefully manage communication
Incorporate alignment information
- Software pipelining hides latency
Adapt a 2-cluster partitioning heuristic (a sketch follows below)
- [Fiduccia & Mattheyses ’82]
- [Kernighan & Lin ’70]
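
A minimal sketch of what an adapted two-way partitioning pass of that kind might look like, assuming a caller-supplied projected_ii() cost callback (the ResMII-style estimate on the next slide) and a vectorizable[] legality mask. This is a greedy simplification of the Kernighan-Lin / Fiduccia-Mattheyses idea, not the authors' implementation; the real heuristics also admit temporarily worsening moves:

/* Hedged sketch: move operations between the scalar side (part[op] == 0)
 * and the vector side (part[op] == 1) whenever the move lowers the
 * projected II. All names are illustrative. */
void selective_vectorize(int num_ops, int part[], const int vectorizable[],
                         double (*projected_ii)(const int part[])) {
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int op = 0; op < num_ops; op++) {
            if (part[op] == 0 && !vectorizable[op])
                continue;                 /* op may not legally be vectorized */
            double before = projected_ii(part);
            part[op] ^= 1;                /* tentatively move op              */
            if (projected_ii(part) < before)
                improved = 1;             /* keep the improving move          */
            else
                part[op] ^= 1;            /* revert                           */
        }
    }
}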

13 Selective Vectorization (figure: LOAD, MULT, and ADD operations partitioned between a scalar side and a vector side, with an associated cost)

14 Cost Function
Projected II due to resources (ResMII, sketched below)
- Bin-packing approach [Rau MICRO ’94]
- With some modifications
Can ignore operation latency
- Software pipelining hides latency
- Vectorizable ops not on dependence cycles

for (i=0; i<N; i++) {
  X[i+4] = X[i];
}
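
A minimal sketch of the resource-constrained bound underlying such a cost function: for each resource class, divide the number of operations that need it by the number of available units and round up; the projected II is the maximum over all classes. The talk's bin-packing modifications for vector operations are not shown, and the names below are illustrative:

/* Hedged sketch of ResMII: the resource-constrained lower bound on II.
 * ops_per_class[r] counts operations needing resource class r;
 * units_per_class[r] is how many units of that class the machine has. */
int res_mii(int num_classes, const int ops_per_class[],
            const int units_per_class[]) {
    int ii = 1;
    for (int r = 0; r < num_classes; r++) {
        int need = (ops_per_class[r] + units_per_class[r] - 1)
                   / units_per_class[r];             /* ceiling division */
        if (need > ii)
            ii = need;
    }
    return ii;
}

Under this bound, selective vectorization amounts to choosing the scalar/vector partition that keeps the maximum over all resource classes as low as possible.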

15 Evaluation
SUIF front-end
- Dependence analysis
- Dataflow optimization
Trimaran back-end
- Modulo scheduler
- Register allocator
- VLIW simulator
Added vector ops
Toolflow: C or Fortran -> SUIF Front-end -> Dependence Analysis -> Dataflow Optimization -> SUIF to Trimaran -> Selective Vectorization -> Modulo Scheduling -> Binary -> Simulation

16 Evaluation
Operands communicated through memory
Software responsible for realignment

Machine configuration:
  Issue Width: 6
  Memory Units: 2
  ALUs: 4
  FPUs: 2
  Vector Units: 1
  Vector Length: 2*

17 Evaluation
SpecFP 92, 95, 2000
- Easier to extract dependence information
- Detectable data parallelism
- 64-bit data means a vector length of 2
- Considered amenable to vectorization & SWP
Apply selective vectorization to DO loops
- No control flow, no function calls
Fully simulate with training sets

18 Traditional Vectorization

19 Vectorization without Distribution

20 Vectorization + Free Communication

21 Vectorization without Distribution

22 Selective Vectorization

23 (benchmarks: tomcatv, su2cor, swim, mgrid)

24 Communication Support
Transfer through memory (sketched below)
Register-to-register copy
- Uses fewer issue slots
- Frees memory resources
Shared register file
- Vector elements addressable in scalar ops
- Requires no extra issue slots
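
A hedged plain-C sketch of the first option, transferring a vector result to scalar code through memory; on the simulated machine the two multiplies and the store would map to one vector multiply plus one vector store, and the extra memory traffic is exactly what the register-to-register and shared-register-file options avoid (variable names are illustrative):

/* Hedged sketch: through-memory communication of a 2-wide vector result.
 * The spill buffer makes the extra store/load traffic explicit; it consumes
 * memory units and adds latency that the software pipeline must hide. */
double spill[2];                     /* communication buffer in memory        */
spill[0] = X[i]     * Y[i];          /* together: one vector multiply and one */
spill[1] = X[i + 1] * Y[i + 1];      /* vector store on the real machine      */
s = s + spill[0];                    /* scalar adds re-load the elements      */
s = s + spill[1];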

25 Through Memory (benchmarks: tomcatv, su2cor, swim, mgrid)

26 Reg to Reg Transfer Support (benchmarks: tomcatv, su2cor, swim, mgrid); 1.2x improvement

27 Shared Register File (benchmarks: tomcatv, su2cor, swim, mgrid); 1.28x improvement

28 Related Work
Traditional vectorization
- Allen & Kennedy, Wolfe
Software pipelining
- Rau’s iterative modulo scheduling
Clustered VLIW
- [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
- Partitioning among clusters is similar
- Ours is also an instruction selection problem
- No dedicated communication resources

29 Conclusion
Targeting all FUs improves performance
- Selective vectorization
Vectorization is better done in the back-end
- Cost analysis is more accurate
Software pipeline vectorized loops
- Good idea anyway
- Facilitates selective vectorization
- Hides communication and alignment latency

