1
Liquid SIMD: Abstracting SIMD Hardware Using Lightweight Dynamic Mapping
Nathan Clark, Amir Hormati, Scott Mahlke, Sami Yehia*, Krisztián Flautner*
University of Michigan, Electrical Engineering and Computer Science
*ARM Ltd.
2
Computational Efficiency
Low power envelope
More useful work/transistors
Hardware accelerators
–Niagara II encryption engine
Source: AMD Analyst Day 12/14/06
3
How Are Accelerators Used?
Control statically placed in binary
[Diagram: Program, CPU, Accel.]
4
Problem With Static Control
Not forward/backward compatible
[Diagram: Program, CPU, Accel.; second CPU, Accel.]
5
Solution: Virtualization
Statically identify accelerated computation
Abstract accelerator features
Dynamically retarget binary
[Diagram: Engineer/Compiler, Program, Trans., Proc., Accel.]
6
Liquid SIMD
Virtualize SIMD accelerators
Why virtualize SIMD?
–Intel MMX to SSE2
–ARM v6 to Neon
–Wide vectors useful [Lin 06]
7
SIMD Accelerator Assumptions
Same instruction stream
Separate pipeline – memory interface
[Pipeline diagram: Fetch, Decode, Scalar Exec, SIMD Exec, Retire]
8
How to Virtualize
Use scalar ISA to represent SIMD operations
–Compatibility, low overhead
Key: easy to translate
[Diagram: Program, Branch]
9
Virtualization Architecture
[Pipeline diagram: Fetch, Decode, Execute, Retire, Accel., uCode Cache, Trans.]
10
1. Data Parallel Operations
for(i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  r4 = r3 & constant;
  C[i] = r4;
}
[Diagram: A, B, +, &, C dataflow, repeated per element]
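The correspondence between the scalar loop and a data-parallel operation can be modeled with a small sketch. This is an illustration only: the array contents, the mask value standing in for "constant", and the 4-lane width are assumptions, and Python lists stand in for vector registers.

```python
# Illustrative model only: array contents, MASK, and the 4-lane width
# are assumptions, not values taken from the slides.
A = [1, 2, 3, 4, 5, 6, 7, 8]
B = [10, 20, 30, 40, 50, 60, 70, 80]
MASK = 0x0F  # stands in for the slide's "constant"


def scalar_loop(a, b, mask):
    """The slide's scalar loop: one element per iteration."""
    c = [0] * len(a)
    for i in range(len(a)):
        r1 = a[i]
        r2 = b[i]
        r3 = r1 + r2
        r4 = r3 & mask
        c[i] = r4
    return c


def simd_loop(a, b, mask, width=4):
    """Same computation, `width` lanes per iteration (lists model vectors).

    Assumes len(a) is a multiple of `width`.
    """
    c = [0] * len(a)
    for i in range(0, len(a), width):
        v1 = a[i:i + width]
        v2 = b[i:i + width]
        v3 = [x + y for x, y in zip(v1, v2)]
        v4 = [x & mask for x in v3]
        c[i:i + width] = v4
    return c


scalar_result = scalar_loop(A, B, MASK)
simd_result = simd_loop(A, B, MASK)
print(scalar_result == simd_result)
```

Because each iteration touches only element i, the two loops produce the same array regardless of lane width.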
11
1a. What If There’s No Scalar Equivalent?
for(i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  cmp r3, #FF;
  r3 = movgt #FF;
  ...
}
[Diagram: SADD of A and B]
Idioms can always be constructed
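A minimal sketch of the saturating-add idiom, assuming unsigned values clamped at 0xFF (the #FF bound on the slide). The function names and test data are hypothetical; the point is only that the add, compare, conditional-move sequence computes the same result as a single lane-wise saturating add.

```python
SAT_MAX = 0xFF  # the #FF bound from the slide; unsigned clamping is assumed


def saturating_add_idiom(a, b):
    """Scalar idiom: add, compare, conditionally move the bound."""
    out = []
    for i in range(len(a)):
        r1 = a[i]
        r2 = b[i]
        r3 = r1 + r2
        if r3 > SAT_MAX:   # cmp r3, #FF
            r3 = SAT_MAX   # movgt: replace r3 with #FF when greater
        out.append(r3)
    return out


def simd_saturating_add(a, b):
    """Model of one saturating-add instruction applied to every lane."""
    return [min(x + y, SAT_MAX) for x, y in zip(a, b)]


# Hypothetical test data
X = [250, 100, 255, 0]
Y = [10, 100, 255, 0]
idiom_result = saturating_add_idiom(X, Y)
simd_sat_result = simd_saturating_add(X, Y)
print(idiom_result == simd_sat_result)
```

A translator that recognizes the three-instruction pattern can emit the single SIMD instruction, while hardware without the accelerator simply runs the idiom as written.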
12
2. Scalarizing Permutations
for(i = 0; i < 8; i++) {
  …
  r1 = r2 + r3;
  tmp[i] = r1;
}
for(i = 0; i < 8; i++) {
  r1 = offset[i];
  r2 = tmp[r1 + i];
  r3 = r2 & const;
  …
}
offset = {4, 4, 4, 4, -4, -4, -4, -4}
[Diagram: + results permuted before &]
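The effect of the offset array can be checked with a short sketch. The offset values come from the slide; the helper names and the sample data are hypothetical. With these offsets, element i reads from index i + offset[i], which swaps the two halves of the 8-element vector.

```python
OFFSET = [4, 4, 4, 4, -4, -4, -4, -4]  # values from the slide


def scalar_permute(tmp, offset):
    """Second loop on the slide: element i reads tmp[offset[i] + i]."""
    out = []
    for i in range(len(tmp)):
        r1 = offset[i]
        r2 = tmp[r1 + i]
        out.append(r2)
    return out


def source_indices(offset):
    """Absolute source index for each destination lane."""
    return [offset[i] + i for i in range(len(offset))]


def swap_halves(vec):
    """The vector permutation this offset pattern encodes."""
    half = len(vec) // 2
    return vec[half:] + vec[:half]


tmp = [10, 11, 12, 13, 14, 15, 16, 17]  # hypothetical data
permuted = scalar_permute(tmp, OFFSET)
print(source_indices(OFFSET))
print(permuted == swap_halves(tmp))
```

Since the offset array is a compile-time constant, the pattern of source indices identifies which permutation the scalar code stands for.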
13
3. Scalarizing Reductions
for(i = 0; i < 8; i++) {
  …
  r1 = A[i];
  r2 = r2 + r1;
  …
}
[Diagram: + reduction]
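A sketch of how the accumulating scalar loop relates to a vector reduction. The slides do not show the vector form, so the lane-wise partial-sum strategy, the 4-lane width, and the data below are assumptions chosen for illustration.

```python
def scalar_reduction(a):
    """The slide's loop: r2 accumulates every element of A."""
    r2 = 0
    for i in range(len(a)):
        r1 = a[i]
        r2 = r2 + r1
    return r2


def simd_reduction(a, width=4):
    """Assumed vector form: per-lane partial sums, then a horizontal add.

    Assumes len(a) is a multiple of `width`.
    """
    acc = [0] * width
    for i in range(0, len(a), width):
        v1 = a[i:i + width]
        acc = [x + y for x, y in zip(acc, v1)]
    return sum(acc)  # final horizontal add across lanes


data = [1, 2, 3, 4, 5, 6, 7, 8]  # hypothetical data
scalar_total = scalar_reduction(data)
simd_total = simd_reduction(data)
print(scalar_total, simd_total)
```

The two forms agree for integer addition because it is associative; with floating-point values the reordered sum may differ in the last bits.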
14
Applied to ARM Neon
All instructions supported except…
–VTBL – indirect indexing: v1 = vtbl v2, v3
–Interleaved memory accesses
Not needed in evaluated benchmarks
[Diagram: v3 = 1 0 1 3 indexes v2 to produce v1; Mem]
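A small model of what "indirect indexing" means for v1 = vtbl v2, v3. The index values 1 0 1 3 are from the slide's diagram; the table contents are hypothetical. One plausible reading of why this case is excluded is that the indices live in a register whose contents are only known at run time, unlike a constant offset array.

```python
def vtbl(table, indices):
    """Table lookup: lane i of the result is table[indices[i]].

    Out-of-range indices yield 0, as the Neon VTBL instruction does.
    """
    return [table[k] if 0 <= k < len(table) else 0 for k in indices]


v2 = [10, 20, 30, 40]  # hypothetical table contents
v3 = [1, 0, 1, 3]      # index values shown on the slide
v1 = vtbl(v2, v3)
print(v1)
```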
15
Translation to SIMD
Update induction variable (i++ becomes i += 4)
Use inverse of defined translation rules
Scalar representation:
for(i = 0; i < 8; i++) {
  r1 = A[i];
  r2 = B[i];
  r3 = r1 + r2;
  r4 = offset[i];
  C[i + r4] = r3;
}
Translated SIMD loop (final animation step):
for(i = 0; i < 8; i += 4) {
  v1 = A[i];
  v2 = B[i];
  v3 = v1 + v2;
  v3 = shuffle v3;
  C[i] = v3;
}
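The two translation steps on this slide can be mimicked with a toy rewriter over instruction strings. This is a sketch under stated assumptions, not the hardware translator from the talk: the string format, the regular expressions, and the `translate` helper are all hypothetical, and only the one permutation pattern from the slide is handled.

```python
import re

# Scalar loop body from the slide, one instruction per string
SCALAR_BODY = [
    "r1 = A[i]",
    "r2 = B[i]",
    "r3 = r1 + r2",
    "r4 = offset[i]",
    "C[i + r4] = r3",
]


def translate(body, width=4):
    """Toy translator: rename rN to vN, widen the induction step, and turn
    an offset-array load plus offset-indexed store into shuffle + store."""
    offset_regs = set()
    out = []
    for line in body:
        # A load from the offset array marks its register as a permutation index
        m = re.fullmatch(r"(r\d+) = offset\[i\]", line)
        if m:
            offset_regs.add(m.group(1))
            continue
        # A store indexed by i plus such a register becomes shuffle + plain store
        m = re.fullmatch(r"(\w+)\[i \+ (r\d+)\] = (r\d+)", line)
        if m and m.group(2) in offset_regs:
            arr, _, src = m.groups()
            vec = "v" + src[1:]
            out.append(f"{vec} = shuffle {vec}")
            out.append(f"{arr}[i] = {vec}")
            continue
        # Everything else maps one-to-one: scalar register to vector register
        out.append(re.sub(r"\br(\d+)\b", r"v\1", line))
    header = f"for(i = 0; i < 8; i += {width})"
    return header, out


header, simd_body = translate(SCALAR_BODY)
print(header)
for line in simd_body:
    print("  " + line + ";")
```

The output matches the final loop on the slide: the induction variable advances by the vector width, and the offset load disappears into a shuffle.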
16
Translator Design
Translator: efficiency, speed, flexibility
[Diagram: Engineer/Compiler, Program, Trans., Proc., Accel.]
17
Evaluation
Trimaran ARM
Hand SIMDized loops
SimpleScalar model
ARM926 w/ Neon SIMD
VHDL translator, 130nm std. cell
18
Liquid SIMD Issues
Code bloat
–<1% overhead beyond baseline
Register pressure
–Not a problem
Translator cost
–0.2 mm² + 2KB cache
Translation overhead
19
Translation Overhead
[Chart: translation overhead for SPECfp, MediaBench, and Kernels]
20
Summary
Accelerators are more common and evolving
–Costly binary migration
SIMD virtualization using scalar ISA
–One binary: forward/backward compatibility
–Negligible overhead
21
Questions?