1
University of Michigan Electrical Engineering and Computer Science
Exploring the Design Space of LUT-based Transparent Accelerators
Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
* ARM Ltd.
▪ Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27
2
Embedded Products Convergence
Performance needs keep growing with application demands
Embedded systems win through customization: more performance, low power, etc.
Traditional ISA customization and hardware specialization cannot cope with the growing number of functions
One way: Transparent Instruction Set Customization
[Figure: concept smartphone of 2008. Labels: 3.5G (HSDPA), WiMax, NFC / RFID, Stereo Headset, Bluetooth/UWB, Biometrics, GPS, TV out, PC / Mac, Memory card, DMB (Digital Mobile Broadcast), 20 GB HD]
3
Transparent Instruction Set Customization
An alternative way to performance
No ISA (or minor) change
Baseline CPU unchanged
Hardware generates control
Eases software burden
Forward compatible
[Figure: instruction sequence I1 to I5, either run at a higher frequency or collapsed. Labels: Transparent, Higher Frequency, OR…, Collapse Instructions (Customization)]
4
Architecture Framework
[Figure: architecture framework. Labels: Application, Compiler, Augments Instruction Stream ("… BRL … BRL …"), Instructions, Subgraph, Standard Pipeline, Control Generation, Subgraph Execution Unit, Inputs, Outputs]
5
Pipeline Interface
6
LUT-based accelerator

inst1: EOR r6,r1,r2
inst2: AND r7,r4,r5
inst3: ORR r12,r6,r7

r5 r4 r2 r1   (a^b) | (c&d)
 0  0  0  0   0
 0  0  0  1   1
 0  0  1  0   1
 0  0  1  1   0
 0  1  0  0   0
 0  1  0  1   1
 0  1  1  0   1
 0  1  1  1   0
 1  0  0  0   0
 1  0  0  1   1
 1  0  1  0   1
 1  0  1  1   0
 1  1  0  0   1
 1  1  0  1   1
 1  1  1  0   1
 1  1  1  1   1

LUT configuration: 1111011001100110
[Figure: 32-bit r1, r2, r4 and r5 feed the LUT, which produces 32-bit r12]

Addition/Subtraction
inst1: ADD r6,r1,r2
$r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
$Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
A Carry Generator that is also programmable
LUT-Based
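The truth table above can be derived mechanically from the three collapsed instructions. The following is a minimal Python sketch, assuming the index ordering r5 r4 r2 r1 (most significant first) used in the table. The names `build_lut`, `lut_to_string` and `apply_lut` are illustrative and do not come from the slides.

```python
def build_lut(func, n_inputs=4):
    """Return the truth table of func as a list indexed by the input pattern."""
    table = []
    for index in range(2 ** n_inputs):
        # Unpack the index into single bits, least significant input first.
        bits = [(index >> k) & 1 for k in range(n_inputs)]
        table.append(func(*bits) & 1)
    return table


def lut_to_string(table):
    """Render the table with the all-ones input pattern first."""
    return "".join(str(b) for b in reversed(table))


def apply_lut(table, words, width=32):
    """Evaluate the same LUT independently at every bit position of the input words."""
    result = 0
    for i in range(width):
        index = 0
        for k, word in enumerate(words):
            index |= ((word >> i) & 1) << k
        result |= table[index] << i
    return result


# Inputs in the order (r1, r2, r4, r5), so r1 is the least significant index bit.
collapsed = build_lut(lambda r1, r2, r4, r5: (r1 ^ r2) | (r4 & r5))
print(lut_to_string(collapsed))  # 1111011001100110
```

Because no carry is involved, one 16-entry table replicated across all 32 bit positions is enough to replace the EOR/AND/ORR sequence.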
7
Programmable Carry Functional Unit (PCFU)
[Figure: 32-bit parallel-prefix carry network over bit positions 0 to 31, with five levels L1 to L5]
Node i = $(g_i, p_i)$
$(G, P) \circ (G', P') = (G' \mid G \cdot P',\; P \cdot P')$
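A five-level prefix network over 32 bits can be modelled in a few lines. The sketch below is an assumption-laden illustration rather than the PCFU's actual wiring: it uses a Kogge-Stone style topology (the exact topology is not recoverable from the slide), and `combine`, `prefix_carries` and `add32` are hypothetical names. `combine` takes the more significant pair first.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def combine(high, low):
    """Prefix operator: merge a more significant (G', P') with a less significant (G, P)."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def prefix_carries(g_bits, p_bits):
    """Return the carry out of every bit position using log2(WIDTH) prefix levels."""
    nodes = list(zip(g_bits, p_bits))
    distance = 1
    while distance < WIDTH:  # 5 levels for 32 bits
        nxt = list(nodes)
        for i in range(distance, WIDTH):
            nxt[i] = combine(nodes[i], nodes[i - distance])
        nodes = nxt
        distance *= 2
    return [g for g, _ in nodes]


def add32(a, b):
    """32-bit addition from per-bit generate/propagate and the prefix network."""
    g_bits = [((a >> i) & (b >> i)) & 1 for i in range(WIDTH)]
    p_bits = [((a >> i) ^ (b >> i)) & 1 for i in range(WIDTH)]
    carries = prefix_carries(g_bits, p_bits)
    total = 0
    for i in range(WIDTH):
        carry_in = carries[i - 1] if i > 0 else 0
        total |= (p_bits[i] ^ carry_in) << i
    return total & MASK


print(hex(add32(0x0000FFFF, 0x00000001)))  # 0x10000
```

In this model the per-bit `g` and `p` values are the only thing that depends on the operation, which is why making them programmable lets the same network serve different collapsed subgraphs.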
8
Configuration generation

Subgraph:
AND r3, r1, r2
ADD r4, r1, r2
XOR r5, r3, r4

LUT input columns:
in1: 0101010101010101
in2: 0011001100110011
Cin: 0000111100001111

Meta Register file:
m-r3: 0001000100010001 (Out = A AND B)
m-r4: 0110100101101001 (Out = $A \oplus B \oplus cin$), with p = $A \oplus B$ = 0110011001100110 and g = A.B = 0001000100010001
m-r5: 0111100001111000 (Out = $A \oplus B$)

Meta Function Unit: LUT(r3) = LUT(r1) AND LUT(r2)

[Figure: g LUT and p LUT (inputs in1, in2) feed the Carry Generator (g1, p1), which produces cin1; OutLUT takes in1, in2 and cin1 and produces Out. Truth-table columns shown in the figure: 1 0 0 0; 1 0 0 1 0 1 1 0; 0 1 1 0; 1 0 0 0; 0 0 0 1 1 1 1 0]
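The meta register values above can be reproduced by running each instruction of the subgraph on truth-table columns instead of data. This is a minimal sketch under two assumptions: r1 and r2 map to in1 and in2, and there is a single carry input. The helper names (`meta_and`, `meta_add`, `meta_xor`) are made up for illustration.

```python
def column(bits):
    """Parse a truth-table column written as a bit string."""
    return int(bits, 2)


def show(value):
    """Format a 16-entry column the way it is written above."""
    return format(value, "016b")


# Meta values of the subgraph inputs and of the carry-in.
meta = {
    "r1": column("0101010101010101"),
    "r2": column("0011001100110011"),
}
CIN = column("0000111100001111")


def meta_and(dst, a, b):
    # LUT(dst) = LUT(a) AND LUT(b)
    meta[dst] = meta[a] & meta[b]


def meta_xor(dst, a, b):
    meta[dst] = meta[a] ^ meta[b]


def meta_add(dst, a, b):
    """Record the g and p columns for the carry generator; the sum column uses CIN."""
    g = meta[a] & meta[b]
    p = meta[a] ^ meta[b]
    meta[dst] = p ^ CIN
    return g, p


meta_and("r3", "r1", "r2")
g_lut, p_lut = meta_add("r4", "r1", "r2")
meta_xor("r5", "r3", "r4")

for name in ("r3", "r4", "r5"):
    print(name, show(meta[name]))
print("g", show(g_lut), "p", show(p_lut))
```

The column left in `meta["r5"]` is the OutLUT configuration, and `g_lut` / `p_lut` are the configurations of the g and p LUTs.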
9
Design Space
Number of inputs
Number of outputs
Number of additions/subtractions
Shift support
  At inputs
  At outputs
[Figure: PCFU variants with four inputs (in1 to in4) and six inputs (in1 to in6). Labels: g1 LUT / p1 LUT, g2 LUT / p2 LUT, 32-bit Carry Generators (g1, p1, g2, p2), cin1, cin2, OutLUT, OutLUT2, Out, Out2, Shifter]
10
Evaluation
Ported Trimaran compiler to ARM ISA
Subgraph identification engine
Synthesized with Synopsys standard cell library at 0.13 µm
SimpleScalar configured as ARM926EJ-S
  5-stage pipeline, 250 MHz
  1-cycle 16k I/D caches
  Single issue
Baseline: 1-cycle subgraph execution latency
11
Speedup – Baseline PCFU
4-inputs, 2-outputs PCFU design
12
Number of inputs/outputs
Area is proportional
13
Number of addition/subtractions
14
Collapsing Emulation
15
Shift support
16
Design points
4I, 2O, 2A, None
4I, 3O, 2A, None
5I, 3O, 2A, None
17
Conclusions
Transparent Instruction Set Customization needs
  Extracting computations from program
  Efficient substrate to map subgraphs
PCFU
  LUT-based accelerators
  Flexible configurable accelerators
  Efficient configuration
You can get up to 66% with a 6-input / 3-output / 2-adder PCFU ...
… but you get 62% with an 8 times smaller, ~40% faster PCFU
18
Q & A
19
Backups
20
PCFU Design Space

Design                     Latency (ns)   Area (cells)   Speedup
CCA (Michigan)             4.32           278748         1.8
CCA (R&D)                  7.07           606345         1.8
PCFU (2 AS/4IN/2OUT)       4.2            171305         1.62
PCFU Logic only            0.59           26007          1.18
PCFU 1 ADD                 2.15           63603          1.33
PCFU 2 ADD                 3.79           134637         1.62
PCFU 3 ADD                 5.82           274939         1.63
PCFU 2 IN                  3.03           52437          1.49
PCFU 3 IN                  3.24           68846          1.56
PCFU 4 IN                  3.79           134637         1.62
PCFU 5 IN                  5.25           214885         1.63
PCFU 6 IN                  5.47           465630         1.63
PCFU (1 OUT)               3.79           134637         1.45
PCFU (2 OUT)               4.2            171305         1.62
PCFU (3 OUT)               4.57           230189         1.63
PCFU (Shift at inputs)     5.02           170529         1.75
PCFU (Shift at outputs)    4.45           158009         1.74
21
LUT-based accelerator

ADD r4,r1,r2
XOR r5,r3,r4

$r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
$cin_i = (r1_i \cdot r2_i) \mid (r1_i \oplus r2_i) \cdot cin_{i-1}$

Cin_{i-1} r3_i r2_i r1_i   r5_i
 0         0    0    0     0
 0         0    0    1     1
 0         0    1    0     1
 0         0    1    1     0
 0         1    0    0     1
 0         1    0    1     0
 0         1    1    0     0
 0         1    1    1     1
 1         0    0    0     1
 1         0    0    1     0
 1         0    1    0     0
 1         0    1    1     1
 1         1    0    0     0
 1         1    0    1     1
 1         1    1    0     1
 1         1    1    1     0

LUT configuration: 0110100110010110
[Figure: 32-bit r1_i, r2_i, r3_i and Cin_{i-1} feed the LUT, which produces 32-bit r5_i]

Closer to FPGA
Bit-level functions too complex
Proposed ripple-carry scheme too slow
May involve a carry propagation network, which is also very complex
Hard to configure and to keep a reasonable latency in a GPP
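The latency concern with the ripple-carry scheme can be seen in a bit-serial model of the fused ADD/XOR pair. The sketch below is illustrative only: the sum LUT is taken to be the four-input XOR implied by the equation for r5_i, the index order (r1, r2, r3, cin from least to most significant) is an assumption, and `add_xor_ripple` is a hypothetical name.

```python
WIDTH = 32

# Sum LUT: parity of the four index bits (r1, r2, r3, cin).
SUM_LUT = [(i ^ (i >> 1) ^ (i >> 2) ^ (i >> 3)) & 1 for i in range(16)]


def carry_out(a, b, cin):
    """Carry of one bit position: (a AND b) OR ((a XOR b) AND cin)."""
    return (a & b) | ((a ^ b) & cin)


def add_xor_ripple(r1, r2, r3):
    """Compute r5 = r3 XOR (r1 + r2) one bit at a time."""
    result = 0
    cin = 0
    for i in range(WIDTH):  # every bit waits for the carry of the previous one
        a = (r1 >> i) & 1
        b = (r2 >> i) & 1
        c = (r3 >> i) & 1
        index = a | (b << 1) | (c << 2) | (cin << 3)
        result |= SUM_LUT[index] << i
        cin = carry_out(a, b, cin)
    return result


print(hex(add_xor_ripple(0x0000FFFF, 0x00000001, 0x000000FF)))  # 0x100ff
```

The loop carries a dependency through all 32 iterations, which corresponds to a 32-stage critical path in hardware.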