1 University of Michigan, Electrical Engineering and Computer Science
Exploring the Design Space of LUT-based Transparent Accelerators
Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
* ARM Ltd. ▪ Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27
2 Embedded Products Convergence
Increasing application demands call for more performance.
Embedded systems win through customization: more performance, lower power, etc.
Traditional ISA customization and hardware specialization cannot cope with the growing number of functionalities.
One way forward: Transparent Instruction Set Customization.
[Figure: concept smartphone of 2008, converging 3.5G (HSDPA), WiMax, NFC/RFID, stereo headset, Bluetooth/UWB, biometrics, GPS, TV out, PC/Mac, memory card, DMB (Digital Mobile Broadcast), and a 20 GB HD.]
3 Transparent Instruction Set Customization
[Figure: an instruction sequence I1 to I5 is sped up either by a higher frequency or by collapsing instructions (customization).]
An alternative way to performance.
Transparent:
  No (or minor) ISA change
  Baseline CPU unchanged
  Hardware generates control
  Eases software burden
  Forward compatible
4 Architecture Framework
[Figure: block diagram. Labels: Application, Compiler (augments instruction stream), instruction stream (… BRL … BRL …), Subgraph, Standard Pipeline, Control Generation (instructions), Subgraph Execution Unit (inputs, outputs).]
5 Pipeline Interface
6 LUT-based accelerator
Logic subgraph:
  inst1: EOR r6,r1,r2
  inst2: AND r7,r4,r5
  inst3: ORR r12,r6,r7
The three instructions collapse into one 4-input function of r1, r2, r4, r5: r12 = (a^b) | (c&d), evaluated by a LUT at each of the 32 bit positions.
Addition/Subtraction:
  inst1: ADD r6,r1,r2
  r6_i = r1_i ⊕ r2_i ⊕ Cin_{i-1}
  Cin_i = r1_i.r2_i | Cin_{i-1}.(r1_i ⊕ r2_i)
A carry generator that is also programmable and LUT-based.
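The collapsing of the EOR/AND/ORR subgraph into a single lookup can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described in the talk: the helper names `build_lut` and `apply_lut`, and the choice of input ordering (r1 as index bit 0, r5 as index bit 3), are assumptions made for this example.

```python
MASK32 = 0xFFFFFFFF

def build_lut(fn, n_inputs):
    """Pack the truth table of fn into an int: bit k holds fn of the bits of k."""
    table = 0
    for k in range(1 << n_inputs):
        bits = [(k >> j) & 1 for j in range(n_inputs)]
        if fn(*bits) & 1:
            table |= 1 << k
    return table

def apply_lut(table, operands, width=32):
    """Evaluate the same LUT independently at every bit position."""
    out = 0
    for i in range(width):
        index = 0
        for j, value in enumerate(operands):
            index |= ((value >> i) & 1) << j   # operand j selects index bit j
        out |= ((table >> index) & 1) << i
    return out

# Collapsed function (a ^ b) | (c & d), with a=r1, b=r2, c=r4, d=r5 (assumed order).
lut = build_lut(lambda a, b, c, d: (a ^ b) | (c & d), 4)

def reference(r1, r2, r4, r5):
    """The original three-instruction sequence, executed one step at a time."""
    r6 = r1 ^ r2          # inst1: EOR
    r7 = r4 & r5          # inst2: AND
    return (r6 | r7) & MASK32  # inst3: ORR
```

For any 32-bit register values, `apply_lut(lut, [r1, r2, r4, r5])` should match `reference(r1, r2, r4, r5)`, which is the sense in which three instructions become one lookup.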
7 Programmable Carry Functional Unit (PCFU)
[Figure: network of "o" operator cells, with labels L1 to L4.]
L_i = (g_i, p_i)
(G, P) o (G', P') = (G | GP', P.P')
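A carry network built from an associative "o" operator on (generate, propagate) pairs can be modeled as follows. This sketch uses the textbook carry operator, (G, P) o (G', P') = (G | P.G', P.P') with the left operand more significant, and a Kogge-Stone style arrangement of levels; both are assumptions for illustration, and the operand order and network topology on the slide may differ. The function names are hypothetical. The g and p words are passed in as plain arguments to mirror the idea that they are programmable.

```python
def carry_op(hi, lo):
    """Textbook carry operator: combine a more significant (G, P) with a less significant one."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_carries(g_word, p_word, width=32):
    """Return the carry out of every bit position, given per-bit g and p words."""
    gp = [((g_word >> i) & 1, (p_word >> i) & 1) for i in range(width)]
    dist = 1
    while dist < width:
        # Each level combines position i with position i - dist (old values).
        gp = [carry_op(gp[i], gp[i - dist]) if i >= dist else gp[i]
              for i in range(width)]
        dist *= 2
    return [g for g, _ in gp]

def add_with_prefix(a, b, width=32):
    """Addition as one configuration: g = a AND b, p = a XOR b."""
    carries = prefix_carries(a & b, a ^ b, width)
    total = 0
    for i in range(width):
        cin = carries[i - 1] if i > 0 else 0
        total |= (((a >> i) ^ (b >> i) ^ cin) & 1) << i
    return total
```

With g and p set as in `add_with_prefix`, the result equals ordinary 32-bit addition; supplying other g and p words changes what the same network computes.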
8 Configuration generation
Subgraph:
  AND r3, r1, r2
  ADD r4, r1, r2
  XOR r5, r3, r
Meta register file: entries m-r3, m-r4, m-r5, with p and g fields.
Meta function unit: combines the LUT contents of the source registers, e.g. LUT(r3) = LUT(r1) AND LUT(r2).
Per-instruction functions (inputs A, B):
  AND (r1, r2): Out = A AND B
  ADD (r1, r2): Out = A ⊕ B ⊕ cin, g = A.B, p = A ⊕ B
  XOR (r3, r4): Out = A ⊕ B
[Figure: g LUT and p LUT (inputs in1, in2) feed the carry generator (g1, p1, cin1); OutLUT (inputs in1, in2) produces Out.]
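The rule LUT(r3) = LUT(r1) AND LUT(r2) can be read as symbolic execution over truth tables: each register carries the truth table of its value as a function of the subgraph inputs, and each instruction combines its sources' tables. The sketch below models only logic instructions; the g/p entries and carry handling needed for ADD are deliberately left out. The names `input_table`, `generate_config`, and the 4-input table width are assumptions for this example.

```python
N_INPUTS = 4                 # assumed number of subgraph inputs
ROWS = 1 << N_INPUTS         # one truth-table row per input combination
FULL = (1 << ROWS) - 1

def input_table(j):
    """Truth table of 'input j itself': row k is 1 when bit j of k is 1."""
    return sum(1 << k for k in range(ROWS) if (k >> j) & 1)

# Logic operations only; ADD/SUB would also need g/p tables and a carry term.
OPS = {
    "AND": lambda x, y: x & y,
    "ORR": lambda x, y: x | y,
    "EOR": lambda x, y: x ^ y,
}

def generate_config(instructions, live_in):
    """Walk the subgraph once; map every register to the truth table of its value."""
    meta = {reg: input_table(j) for j, reg in enumerate(live_in)}
    for op, dst, src1, src2 in instructions:
        meta[dst] = OPS[op](meta[src1], meta[src2]) & FULL
    return meta

# First instruction of the subgraph on this slide.
cfg_and = generate_config([("AND", "r3", "r1", "r2")], ["r1", "r2"])

# The three-instruction logic subgraph from the LUT-based accelerator slide.
cfg_logic = generate_config(
    [("EOR", "r6", "r1", "r2"), ("AND", "r7", "r4", "r5"), ("ORR", "r12", "r6", "r7")],
    ["r1", "r2", "r4", "r5"],
)
```

After the walk, the entry for the subgraph's output register (here `cfg_logic["r12"]`) is the table that would be loaded into the output LUT.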
9 Design Space
Number of inputs
Number of outputs
Number of additions/subtractions
Shift support: at inputs, at outputs
[Figure: PCFU variants. Pairs of g/p LUTs (g1/p1, g2/p2) feed carry generators (cin1, cin2) and output LUTs (OutLUT, OutLUT2). Variants are shown with four inputs (in1 to in4) and six inputs (in1 to in6), with a second output (Out2), and with a shifter.]
10 Evaluation
Ported the Trimaran compiler to the ARM ISA
Subgraph identification engine
Synthesized with a Synopsys standard cell library at 0.13 µm
SimpleScalar configured as an ARM926EJ-S:
  5-stage pipeline, 250 MHz
  1-cycle 16k I/D caches
  Single issue
Baseline: 1-cycle subgraph execution latency
11 Speedup: Baseline PCFU
4-input, 2-output PCFU design
12 Number of inputs/outputs
Area is proportional
13 Number of addition/subtractions
14 Collapsing Emulation
15 Shift support
16 Design points
Each point lists inputs (I), outputs (O), adders (A), and shift support.
  4I, 2O, 2A, None
  4I, 3O, 2A, None
  5I, 3O, 2A, None
17 Conclusions
Transparent Instruction Set Customization needs:
  Extracting computations from the program
  An efficient substrate to map subgraphs onto
PCFU: LUT-based accelerators
  Flexible, configurable accelerators
  Efficient configuration
You can get up to 66% with a 6-input / 3-output / 2-adder PCFU ...
... but you get 62% with an 8 times smaller, ~40% faster PCFU.
18 Q & A
19 Backups
20 PCFU Design Space
[Table: Latency (ns), Area (cells), and Speedup for each of the following designs.]
  CCA (Michigan), CCA (R&D), PCFU (2 AS/4IN/2OUT)
  PCFU Logic only, PCFU 1 ADD, PCFU 2 ADD, PCFU 3 ADD
  PCFU 2 IN, PCFU 3 IN, PCFU 4 IN, PCFU 5 IN, PCFU 6 IN
  PCFU (1 OUT), PCFU (2 OUT), PCFU (3 OUT)
  PCFU (Shift at inputs), PCFU (Shift at outputs)
21 LUT-based accelerator
  ADD r4,r1,r2
  XOR r5,r3,r4
  r5_i = r3_i ⊕ (r1_i ⊕ r2_i ⊕ cin_{i-1})
  cin_i = (r1_i.r2_i) OR (r1_i ⊕ r2_i).cin_{i-1}
[Figure: r1 and r2 are added to give r4, which is XORed with r3 to give r5; per bit, a LUT takes r1_i, r2_i, r3_i and Cin_{i-1} and produces r5_i, across 32 bits.]
Closer to an FPGA.
Bit-level functions are too complex.
The proposed ripple-carry scheme is too slow.
It may involve a carry propagation network, which is also very complex.
Hard to configure while keeping a reasonable latency in a GPP.
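The per-bit equations above can be checked with a small ripple-carry model. The sketch below is a software illustration only, and the names `bit_function` and `ripple_eval` are hypothetical. The loop makes the serial dependence visible: bit i cannot be evaluated until the carry from bit i-1 is known, which is the latency concern raised on this slide.

```python
MASK32 = 0xFFFFFFFF

def bit_function(r1_i, r2_i, r3_i, cin):
    """Per-bit function for ADD r4,r1,r2 followed by XOR r5,r3,r4."""
    r5_i = r3_i ^ (r1_i ^ r2_i ^ cin)
    cout = (r1_i & r2_i) | ((r1_i ^ r2_i) & cin)
    return r5_i, cout

def ripple_eval(r1, r2, r3, width=32):
    """Evaluate the collapsed subgraph bit by bit, rippling the carry upward."""
    r5, carry = 0, 0
    for i in range(width):
        bit, carry = bit_function((r1 >> i) & 1, (r2 >> i) & 1,
                                  (r3 >> i) & 1, carry)
        r5 |= bit << i
    return r5

def reference(r1, r2, r3):
    """The two instructions executed separately on 32-bit values."""
    r4 = (r1 + r2) & MASK32   # ADD r4,r1,r2
    return r3 ^ r4            # XOR r5,r3,r4
```

For 32-bit inputs, `ripple_eval(r1, r2, r3)` should agree with `reference(r1, r2, r3)`.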