Published by Sibyl Terry. Modified over 9 years ago.
1
Exploiting Vector Parallelism in Software Pipelined Loops
Sam Larsen, Rodric Rabbah, Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
2
Multimedia Extensions
- Short vector extensions in ILP processors
  - AltiVec, 3DNow!, SSE, etc.
- Accelerate loops in multimedia & DSP codes
- New designs have floating point support
3
Multimedia Extensions
- Vector resources do not overwhelm the scalar resources
  - Scalar: 2 FP ops / cycle
  - Vector: 4 FP ops / cycle
- Full vectorization may underutilize scalar resources
- ILP techniques do not target vector resources
- Need both
Image courtesy of International Business Machines Corporation. Unauthorized use not permitted.
4
Modulo Scheduling

    for (i=0; i<N; i++) {
      s = s + X[i] * Y[i];
    }

Operations: LOAD, MULT, ADD

    Cycle   Ops issued (Slots 1-3)
    1       LOAD
    2       MULT
    3       LOAD, ADD
    4       MULT
    ...     ...

II = 2 (mod sched)
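To make the role of the initiation interval (II) concrete, here is a minimal C sketch of the usual cycle estimate for a modulo-scheduled loop: a new iteration starts every II cycles, and the last one still needs the full length of a single iteration's schedule to finish. The function name and its parameters are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles to run n iterations of a modulo-scheduled loop.
 * ii           : initiation interval (cycles between iteration starts)
 * schedule_len : cycles from the first to the last op of one iteration */
uint64_t pipelined_cycles(uint64_t n, uint64_t ii, uint64_t schedule_len)
{
    assert(ii > 0 && schedule_len > 0);
    if (n == 0)
        return 0;
    /* n - 1 iteration starts spaced ii apart, then the last one drains. */
    return (n - 1) * ii + schedule_len;
}
```

With II = 2 and a 3-cycle schedule (LOAD, MULT, ADD in consecutive cycles, as in the table on this slide), `pipelined_cycles(100, 2, 3)` gives 201 cycles, so for large N the cost per iteration approaches II.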
5
Traditional Vectorization

Vector loop (after loop distribution):

    for (i=0; i<N; i+=2) {
      S[i:i+1] = X[i:i+1] * Y[i:i+1];
    }

    Cycle   Ops issued (Slots 1-3)
    1       VLOAD
    2
    3       VMUL
    4       VSTORE
    ...     ...

Scalar loop:

    for (i=0; i<N; i++) {
      s = s + S[i];
    }

    Cycle   Ops issued (Slots 1-3)
    1       LOAD
    2       ADD
    3
    4
    ...     ...

II = 2 (mod sched)
II = 2 + 1 (traditional)
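The distributed form above can be sketched in portable C. This is only an illustration under stated assumptions: the 2-wide vector operations are modeled by handling two elements per trip rather than by real SIMD instructions, and the function names are hypothetical. The point it shows is that distribution introduces a temporary array `S` that the vector loop stores and the scalar loop reloads.

```c
#include <assert.h>
#include <stddef.h>

/* Loop 1: element-wise products, two per trip (models a 2-wide
 * VLOAD / VMUL / VSTORE sequence). */
static void vector_multiply(const double *X, const double *Y,
                            double *S, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        S[i]     = X[i]     * Y[i];
        S[i + 1] = X[i + 1] * Y[i + 1];
    }
    if (i < n)                      /* scalar remainder when n is odd */
        S[i] = X[i] * Y[i];
}

/* Loop 2: the reduction stays scalar because each add depends on s. */
static double scalar_sum(const double *S, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s = s + S[i];
    return s;
}

/* S must have room for n elements; it is the temporary array that
 * loop distribution introduces. */
double dot_distributed(const double *X, const double *Y, double *S, size_t n)
{
    assert(X && Y && S);
    vector_multiply(X, Y, S, n);
    return scalar_sum(S, n);
}
```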
6
Vectorization without Distribution

    for (i=0; i<N; i+=2) {
      S = X[i:i+1] * Y[i:i+1];
      s = s + S_0;
      s = s + S_1;
    }

    Cycle   Ops issued (Slots 1-3)
    1       VLOAD
    2
    3       VMUL
    4       VLOAD, ADD
    5       VLOAD, ADD
    6       VMUL

II = 2 (mod sched)
II = 3 (traditional)
II = 1.5 (no distrib)
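A rough C rendering of the fused loop, under the same modeling assumption as before (a 2-element local array stands in for a vector register; the function name is hypothetical). Compared with the distributed version, the products never go through a temporary array in memory, and the two scalar adds consume the vector elements directly.

```c
#include <assert.h>
#include <stddef.h>

double dot_fused(const double *X, const double *Y, size_t n)
{
    assert(X && Y);
    double s = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        /* Models one 2-wide VMUL; S plays the role of a vector register. */
        double S[2] = { X[i] * Y[i], X[i + 1] * Y[i + 1] };
        /* Two scalar adds, in the original summation order. */
        s = s + S[0];
        s = s + S[1];
    }
    if (i < n)                      /* scalar remainder when n is odd */
        s = s + X[i] * Y[i];
    return s;
}
```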
7
Selective Vectorization

    for (i=0; i<N; i+=2) {
      S = X[i:i+1] * Y[i]:Y[i+1];
      s = s + S_0;
      s = s + S_1;
    }

    Cycle   Ops issued (Slots 1-3)
    1       VLOAD, LOAD
    2
    3       VLOAD, LOAD
    4       VMUL, LOAD
    5       VLOAD, LOAD, ADD
    6       VMUL, LOAD, ADD

II = 2 (mod sched)
II = 3 (traditional)
II = 1.5 (no distrib)
II = 1 (selective)
8
Complications
- Complex scheduling requirements
  - Particularly in statically scheduled machines
- Memory alignment
- Example assumes no communication cost
  - In reality, explicit operations required
  - Often through memory
  - Reserve critical resources
  - Potential long latency
- Performance improvement still possible
9
Tomcatv main loop (50%)
10
Tomcatv (SpecFP 95)

    Issue Width     6
    Memory Units    2
    ALUs            4
    FPUs            2
    Vector Units    1
    Vector Length   2*

    Technique                 ALU   MEM   FPU   VEC
    Modulo Scheduling           6    22    46     0
    Full Vectorization          7    13     0    46
    Selective Vectorization     7    27    19    27

1.7x speedup over modulo scheduling
11
Tomcatv (SpecFP 95)
12
Selective Vectorization
- Balance computation among resources
  - Minimize II when loop is modulo scheduled
- Carefully manage communication
  - Incorporate alignment information
  - Software pipelining hides latency
- Adapt a 2-cluster partitioning heuristic
  - [Fiduccia & Mattheyses ’82]
  - [Kernighan & Lin ’70]
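The slides do not spell out the partitioning algorithm, so what follows is only a simplified sketch of one Fiduccia-Mattheyses-style pass, not the authors' implementation. It assumes a toy cost model: a scalar op is issued once per original iteration (so `VECTOR_LENGTH` times per vectorized trip), a vector op is issued once per trip, and communication and alignment costs are ignored. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define VECTOR_LENGTH 2   /* vector length used in the slides' examples */
#define MAX_OPS 64        /* arbitrary limit for this sketch */

typedef struct {
    int scalar_units;     /* scalar issue slots per cycle */
    int vector_units;     /* vector issue slots per cycle */
} Machine;

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Projected resource-limited II for one vectorized trip
 * (VECTOR_LENGTH original iterations). Communication is ignored. */
int projected_ii(const Machine *m, const bool *in_vector, size_t n)
{
    int scalar_issues = 0, vector_issues = 0;
    for (size_t i = 0; i < n; i++) {
        if (in_vector[i]) vector_issues += 1;
        else              scalar_issues += VECTOR_LENGTH;
    }
    int s = ceil_div(scalar_issues, m->scalar_units);
    int v = ceil_div(vector_issues, m->vector_units);
    return s > v ? s : v;
}

/* One pass: repeatedly apply the cheapest single move among unlocked,
 * vectorizable ops (even if it does not improve the cost), lock the
 * moved op, and finally restore the best partition seen.
 * Returns the projected II of that partition. */
int partition_pass(const Machine *m, const bool *vectorizable,
                   bool *in_vector, size_t n)
{
    assert(m && m->scalar_units > 0 && m->vector_units > 0);
    assert(vectorizable && in_vector && n <= MAX_OPS);

    bool locked[MAX_OPS] = { false };
    bool best[MAX_OPS];
    memcpy(best, in_vector, n * sizeof(bool));
    int best_cost = projected_ii(m, in_vector, n);

    for (size_t step = 0; step < n; step++) {
        size_t pick = n;              /* n means "no candidate yet" */
        int pick_cost = 0;
        for (size_t i = 0; i < n; i++) {
            if (locked[i] || !vectorizable[i])
                continue;
            in_vector[i] = !in_vector[i];             /* tentative move */
            int cost = projected_ii(m, in_vector, n);
            in_vector[i] = !in_vector[i];             /* undo */
            if (pick == n || cost < pick_cost) {
                pick = i;
                pick_cost = cost;
            }
        }
        if (pick == n)
            break;                    /* nothing left to move */
        in_vector[pick] = !in_vector[pick];
        locked[pick] = true;
        if (pick_cost < best_cost) {
            best_cost = pick_cost;
            memcpy(best, in_vector, n * sizeof(bool));
        }
    }
    memcpy(in_vector, best, n * sizeof(bool));
    return best_cost;
}
```

For four vectorizable ops on a machine with 2 scalar units and 1 vector unit, starting from an all-scalar assignment, the pass settles on two vector and two scalar ops with a projected II of 2 per trip, versus 4 for either all-scalar or all-vector. A fuller cost model would also charge for the transfers between the two sides.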
13
Selective Vectorization
[Figure: LOAD, MULT, and ADD operations partitioned between scalar and vector, with the associated cost]
14
Cost Function
- Projected II due to resources (ResMII)
  - Bin-packing approach [Rau MICRO ’94]
  - With some modifications
- Can ignore operation latency
  - Software pipelining hides latency
  - Vectorizable ops not on dependence cycles

    for (i=0; i<N; i++) {
      X[i+4] = X[i];
    }
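The slide shows the loop `X[i+4] = X[i]` without commentary. One reading, offered here as an assumption, is that a loop-carried dependence of distance 4 does not block vectorization at vector length 2, because no element written in a 2-wide trip is read in that same trip. The sketch below checks this by comparing a scalar version against a version that performs both loads before both stores, as a vector load followed by a vector store would. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEP_DISTANCE 4   /* from the slide's X[i+4] = X[i] */

/* Reference: one element per iteration.
 * X must hold at least n + DEP_DISTANCE elements. */
void copy_scalar(double *X, size_t n)
{
    assert(X);
    for (size_t i = 0; i < n; i++)
        X[i + DEP_DISTANCE] = X[i];
}

/* 2-wide version: both loads complete before either store.
 * This matches the scalar result because 2 <= DEP_DISTANCE. */
void copy_vector2(double *X, size_t n)
{
    assert(X);
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        double v0 = X[i], v1 = X[i + 1];      /* models a VLOAD  */
        X[i + DEP_DISTANCE]     = v0;         /* models a VSTORE */
        X[i + 1 + DEP_DISTANCE] = v1;
    }
    if (i < n)                                /* scalar remainder */
        X[i + DEP_DISTANCE] = X[i];
}
```

Running both functions on identical inputs leaves identical arrays. The same check would fail for a dependence distance of 1, where the second lane would need the value the first lane has not yet stored.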
15
Evaluation
- SUIF front-end
  - Dependence analysis
  - Dataflow optimization
- Trimaran back-end
  - Modulo scheduler
  - Register allocator
  - VLIW Simulator
- Added vector ops
[Toolchain diagram, labels: C or Fortran, SUIF Front-end, Dependence Analysis, Dataflow Optimization, SUIF to Trimaran, Selective Vectorization, Modulo Scheduling, Binary, Simulation]
16
Evaluation
- Operands communicated through memory
- Software responsible for realignment

    Issue Width     6
    Memory Units    2
    ALUs            4
    FPUs            2
    Vector Units    1
    Vector Length   2*
17
Evaluation
- SpecFP 92, 95, 2000
  - Easier to extract dependence information
  - Detectable data parallelism
  - 64-bit data means vector length of 2
  - Considered amenable to vectorization & SWP
- Apply selective vectorization to DO loops
  - No control flow, no function calls
- Fully simulate with training sets
18
Traditional Vectorization
19
Vectorization without Distribution
20
Vectorization + Free Communication
21
Vectorization without Distribution
22
Selective Vectorization
23
[Chart: results for tomcatv, su2cor, swim, mgrid]
24
Communication Support
- Transfer through memory
- Register to register copy
  - Uses fewer issue slots
  - Frees memory resources
- Shared register file
  - Vector elements addressable in scalar ops
  - Requires no extra issue slots
25
Through Memory
[Chart: results for tomcatv, su2cor, swim, mgrid]
26
Reg to Reg Transfer Support
[Chart: results for tomcatv, su2cor, swim, mgrid]
1.2x improvement
27
Shared Register File
[Chart: results for tomcatv, su2cor, swim, mgrid]
1.28x improvement
28
Related Work
- Traditional vectorization
  - Allen & Kennedy, Wolfe
- Software Pipelining
  - Rau’s iterative modulo scheduling
- Clustered VLIW
  - [Aleta MICRO34], [Codina PACT01], [Nystrom MICRO31], [Sanchez MICRO33], [Zalamea MICRO34]
  - Partitioning among clusters similar
  - Ours is also an instruction selection problem
  - No dedicated communication resources
29
Conclusion
- Targeting all FUs improves performance
  - Selective vectorization
- Vectorization better in the backend
  - Cost analysis more accurate
- Software pipeline vectorized loops
  - Good idea anyway
  - Facilitates selective vectorization
  - Hides communication and alignment latency