
1 University of Michigan Electrical Engineering and Computer Science
Exploring the Design Space of LUT-based Transparent Accelerators
Sami Yehia *, Nathan Clark ▪, Scott Mahlke ▪, and Krisztian Flautner *
* ARM Ltd.
▪ Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27

2 Embedded Products Convergence
- Increasing application demands call for more performance.
- Embedded systems win through customization: more performance, lower power, etc.
- Traditional ISA customization and hardware specialization cannot keep up with the growth in functionality.
- One way forward: Transparent Instruction Set Customization
[Figure: concept smartphone of 2008, labeled with 3.5G (HSDPA), WiMax, NFC/RFID, stereo headset, Bluetooth/UWB, biometrics, GPS, TV out, PC/Mac, memory card, DMB (Digital Mobile Broadcast), 20 GB HD]

3 Transparent Instruction Set Customization
[Figure: instructions I1 to I5, with labels "Transparent", "Higher Frequency", "OR…", and "Collapse Instructions (Customization)"]
- An alternative route to performance
- No (or only minor) ISA change
- Baseline CPU unchanged
- Hardware generates control
- Eases software burden
- Forward compatible

4 Architecture Framework
[Figure: block diagram with labels Application, Compiler, Subgraph (delimited by BRL), Instructions, Control Generation (augments instruction stream), Standard Pipeline, and Subgraph Execution Unit (inputs, outputs)]

5 Pipeline Interface

6 LUT-based accelerator
Logic subgraph:
  inst1: EOR r6, r1, r2
  inst2: AND r7, r4, r5
  inst3: ORR r12, r6, r7
Truth table for r12 = (r1 ^ r2) | (r4 & r5):
  r5 r4 r2 r1 | r12
  0  0  0  0  | 0
  0  0  0  1  | 1
  0  0  1  0  | 1
  0  0  1  1  | 0
  0  1  0  0  | 0
  0  1  0  1  | 1
  0  1  1  0  | 1
  0  1  1  1  | 0
  1  0  0  0  | 0
  1  0  0  1  | 1
  1  0  1  0  | 1
  1  0  1  1  | 0
  1  1  0  0  | 1
  1  1  0  1  | 1
  1  1  1  0  | 1
  1  1  1  1  | 1
LUT configuration: 1111011001100110
[Figure: a LUT takes the 32-bit registers r1, r2, r4, r5 and produces the 32-bit r12]
Addition/Subtraction:
  inst1: ADD r6, r1, r2
  $r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
  $Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
- Needs a carry generator that is also programmable, i.e. LUT-based
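The logic-only case above can be sketched in a few lines of Python. This is a minimal illustration, not the PCFU hardware: the helper names `build_lut` and `apply_lut` are hypothetical, and the bit ordering (r1 as the least significant index bit) is an assumption chosen to match the table columns r5 r4 r2 r1.

```python
MASK32 = 0xFFFFFFFF

def build_lut(fn, n_inputs=4):
    """Return fn's truth table as an integer; bit k holds fn for input pattern k."""
    lut = 0
    for k in range(1 << n_inputs):
        # bits[0] is the least significant input of pattern k
        bits = [(k >> j) & 1 for j in range(n_inputs)]
        if fn(*bits):
            lut |= 1 << k
    return lut

def apply_lut(lut, words, width=32):
    """Evaluate the same LUT independently at every bit position of the input words."""
    out = 0
    for i in range(width):
        index = 0
        for j, w in enumerate(words):
            index |= ((w >> i) & 1) << j  # gather bit i of each input word
        out |= ((lut >> index) & 1) << i
    return out

# Collapse EOR / AND / ORR into one 16-entry table (inputs ordered r1, r2, r4, r5).
lut = build_lut(lambda r1, r2, r4, r5: (r1 ^ r2) | (r4 & r5))
print(format(lut, "016b"))  # 1111011001100110, the configuration shown on the slide
```

Calling `apply_lut(lut, [r1, r2, r4, r5])` then gives the same 32-bit value as executing the three instructions in sequence.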

7 Programmable Carry Functional Unit (PCFU)
[Figure: 32-bit parallel-prefix carry network over bit positions 0 to 31, with five levels L1 to L5]
- Input at bit i: $(g_i, p_i)$
- Prefix operator: $(G, P) \circ (G', P') = (G' \mid G \cdot P',\; P \cdot P')$
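A parallel-prefix carry computation of this kind can be sketched as follows. The transcript does not preserve the exact wiring of the network, so the Kogge-Stone arrangement used here is an assumption; it does need five combining levels for 32 bits, which matches the labels L1 to L5. The function names are hypothetical, and in the PCFU the g and p vectors would come from programmable LUTs rather than from a fixed AND and XOR.

```python
MASK32 = 0xFFFFFFFF

def prefix_carries(g, p, width=32):
    """Return (carries, levels): carry-out of every bit position (carry-in = 0)
    and the number of combining levels used (Kogge-Stone style, assumed)."""
    mask = (1 << width) - 1
    G, P = g & mask, p & mask
    shift, levels = 1, 0
    while shift < width:
        # Combine each group with the group ending `shift` bits below it:
        # G_new = G_upper | (P_upper & G_lower), P_new = P_upper & P_lower.
        G, P = (G | (P & (G << shift))) & mask, (P & (P << shift)) & mask
        shift *= 2
        levels += 1
    return G, levels

def add32(a, b):
    """32-bit addition built from generate/propagate vectors and prefix carries."""
    g = a & b          # generate: both bits set
    p = a ^ b          # propagate: exactly one bit set
    carries, _ = prefix_carries(g, p)
    # The carry into bit i is the carry out of bit i-1.
    return (p ^ (carries << 1)) & MASK32

print(hex(add32(0x0000FFFF, 0x00000001)))  # 0x10000
```

The point of the sketch is the depth: every carry is available after log2(32) = 5 combining steps instead of 32 sequential ones.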

8 Configuration generation
Subgraph:
  AND r3, r1, r2
  ADD r4, r1, r2
  XOR r5, r3, r4
Meta register file (each entry is a 16-bit LUT pattern):
  in1 (r1)     0101010101010101
  in2 (r2)     0011001100110011
  Cin          0000111100001111
  m-r3         0001000100010001   Out = A AND B
  m-r4 (g)     0001000100010001   g = A.B
  m-r4 (p)     0110011001100110   p = A XOR B
  m-r4         0110100101101001   Out = A XOR B XOR cin
  m-r5         0111100001111000   Out = A XOR B
- The meta function unit applies each instruction to the LUT patterns of its source registers, e.g. LUT(r3) = LUT(r1) AND LUT(r2).
[Figure: g LUT and p LUT feed the carry generator (g1, p1, cin1); OutLUT combines in1, in2, and the carry into Out]
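The meta-execution idea on this slide can be sketched in Python: run the subgraph once on truth-table patterns instead of data, and read the LUT configurations out of the resulting meta registers. This is a simplified illustration under stated assumptions: the function name `generate_config` and the dictionary layout are hypothetical, and only a single carry input pattern is modeled.

```python
# Canonical 16-bit input patterns, as listed on the slide.
IN1 = 0b0101010101010101
IN2 = 0b0011001100110011
CIN = 0b0000111100001111

def generate_config():
    """Meta-execute AND r3,r1,r2 / ADD r4,r1,r2 / XOR r5,r3,r4 on LUT patterns."""
    m = {"r1": IN1, "r2": IN2}

    # AND r3, r1, r2: LUT(r3) = LUT(r1) AND LUT(r2)
    m["r3"] = m["r1"] & m["r2"]

    # ADD r4, r1, r2: record generate/propagate LUTs for the carry generator,
    # and the sum pattern as a function of in1, in2, and the carry input.
    g = m["r1"] & m["r2"]
    p = m["r1"] ^ m["r2"]
    m["r4"] = p ^ CIN

    # XOR r5, r3, r4
    m["r5"] = m["r3"] ^ m["r4"]

    return {"g": g, "p": p, "meta": m}

config = generate_config()
for name in ("r3", "r4", "r5"):
    print(name, format(config["meta"][name], "016b"))
# r3 0001000100010001
# r4 0110100101101001
# r5 0111100001111000
```

The three printed patterns are the m-r3, m-r4, and m-r5 entries of the slide's meta register file; `g` and `p` correspond to the g LUT and p LUT contents.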

9 Design Space
- Number of inputs
- Number of outputs
- Number of additions/subtractions
- Shift support
  - At inputs
  - At outputs
[Figure: PCFU variants built from g/p LUT pairs (g1 LUT, p1 LUT, g2 LUT, p2 LUT), 32-bit carry generators (cin1, cin2), output LUTs (OutLUT, OutLUT2) with four inputs (in1 to in4) or six inputs (in1 to in6), and optional shifters at the inputs or outputs]

10 Evaluation
- Ported Trimaran compiler to ARM ISA
- Subgraph identification engine
- Synthesized with Synopsys standard cell library at 0.13 µm
- SimpleScalar configured as ARM926EJ-S
  - 5-stage pipeline, 250 MHz
  - 1-cycle 16k I/D caches
  - Single issue
- Baseline: 1-cycle subgraph execution latency

11 Speedup – Baseline PCFU
- 4-input, 2-output PCFU design

12 Number of inputs/outputs
- Area is proportional

13 Number of addition/subtractions

14 Collapsing Emulation

15 Shift support

16 Design points
- 4I, 2O, 2A, None
- 4I, 3O, 2A, None
- 5I, 3O, 2A, None

17 Conclusions
- Transparent Instruction Set Customization needs
  - Extracting computations from the program
  - An efficient substrate to map subgraphs
- PCFU: LUT-based accelerators
  - Flexible, configurable accelerators
  - Efficient configuration
- You can get up to 66% speedup with a 6-input / 3-output / 2-adder PCFU...
- ...but you get 62% with an 8 times smaller, ~40% faster PCFU

18 Q & A

19 Backups

20 PCFU Design Space
  Design                     Latency (ns)   Area (cells)   Speedup
  CCA (Michigan)             4.32           278748         1.8
  CCA (R&D)                  7.07           606345         1.8
  PCFU (2 AS/4 IN/2 OUT)     4.2            171305         1.62
  PCFU Logic only            0.59           26007          1.18
  PCFU 1 ADD                 2.15           63603          1.33
  PCFU 2 ADD                 3.79           134637         1.62
  PCFU 3 ADD                 5.82           274939         1.63
  PCFU 2 IN                  3.03           52437          1.49
  PCFU 3 IN                  3.24           68846          1.56
  PCFU 4 IN                  3.79           134637         1.62
  PCFU 5 IN                  5.25           214885         1.63
  PCFU 6 IN                  5.47           465630         1.63
  PCFU (1 OUT)               3.79           134637         1.45
  PCFU (2 OUT)               4.2            171305         1.62
  PCFU (3 OUT)               4.57           230189         1.63
  PCFU (Shift at inputs)     5.02           170529         1.75
  PCFU (Shift at outputs)    4.45           158009         1.74

21 LUT-based accelerator
Subgraph with an addition:
  ADD r4, r1, r2
  XOR r5, r3, r4
  $r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
  $cin_i = (r1_i \cdot r2_i) \mid (r1_i \oplus r2_i) \cdot cin_{i-1}$
Truth table for $r5_i$:
  Cin_{i-1} r3_i r2_i r1_i | r5_i
  0         0    0    0    | 0
  0         0    0    1    | 1
  0         0    1    0    | 1
  0         0    1    1    | 0
  0         1    0    0    | 1
  0         1    0    1    | 0
  0         1    1    0    | 0
  0         1    1    1    | 1
  1         0    0    0    | 1
  1         0    0    1    | 0
  1         0    1    0    | 0
  1         0    1    1    | 1
  1         1    0    0    | 0
  1         1    0    1    | 1
  1         1    1    0    | 1
  1         1    1    1    | 0
LUT configuration: 0110100110010110
[Figure: a LUT takes bit i of r1, r2, r3 and Cin_{i-1} and produces bit i of the 32-bit r5]
- Closer to an FPGA
- Bit-level functions are too complex
- The proposed ripple-carry scheme is too slow
- May also involve a very complex carry propagation network
- Hard to configure while keeping a reasonable latency in a GPP
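The ripple-carry scheme criticized on this slide can be sketched bit by bit. This is an illustrative model only: the function name `ripple_add_xor` is hypothetical, and the loop stands in for a chain of per-bit LUT evaluations in which bit i cannot start until the carry from bit i-1 is known.

```python
MASK32 = 0xFFFFFFFF

def ripple_add_xor(r1, r2, r3, width=32):
    """Evaluate ADD r4,r1,r2 ; XOR r5,r3,r4 one bit at a time with a rippling carry."""
    r5 = 0
    cin = 0  # carry into bit 0
    for i in range(width):
        a = (r1 >> i) & 1
        b = (r2 >> i) & 1
        c = (r3 >> i) & 1
        # r5_i = r3_i XOR (r1_i XOR r2_i XOR cin_{i-1})
        r5 |= (c ^ a ^ b ^ cin) << i
        # cin_i = (r1_i AND r2_i) OR ((r1_i XOR r2_i) AND cin_{i-1})
        cin = (a & b) | ((a ^ b) & cin)
    return r5

r1, r2, r3 = 0x0000FFFF, 0x00000001, 0x00F0F0F0
print(ripple_add_xor(r1, r2, r3) == (r3 ^ ((r1 + r2) & MASK32)))  # True
```

The result is correct, but the loop carries a dependency through all 32 iterations, which is the latency problem the slide points to.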
