Presentation is loading. Please wait.

Presentation is loading. Please wait.

Architectural Exploration:

Similar presentations


Presentation on theme: "Architectural Exploration:"— Presentation transcript:

1 Architectural Exploration:
Area-Power tradeoff in a transmitter design Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 16, 2007

2 This lecture has two purposes
Illustrate how area-power tradeoff can be studied at a high-level for a realistic design Example: a transmitter Illustrate some features of BSV Static elaboration Combinational circuits Simple synchronous pipelines Valid bits as the Maybe type in BSV No prior understanding of a is necessary to follow this lecture February 16, 2007

3 Bluespec: Two-Level Compilation
(Objects, Types, Higher-order functions) Lennart Augustsson @Sandburst Type checking Massive partial evaluation and static elaboration Level 1 compilation Rules and Actions (Term Rewriting System) Now we call this Guarded Atomic Actions Rule conflict analysis Rule scheduling Level 2 synthesis James Hoe & Arvind @MIT Object code (Verilog/C) February 16, 2007

4 Static Elaboration At compile time Software Toolflow: Hardware
Inline function calls and datatypes Instantiate modules with specific parameters Resolve polymorphism/overloading source Software Toolflow: source Hardware Toolflow: design2 design3 design1 elaborate w/params .exe compile run1 run2.1 run3.1 run1.1 run w/ params run w/ params run2 run1 run3 February 16, 2007

5 802.11a Transmitter Overview
headers Must produce one OFDM symbol every 4 msec Controller Scrambler Encoder 24 Uncoded bits data Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol Interleaver Mapper IFFT Cyclic Extend One OFDM symbol (64 Complex Numbers) IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers accounts for 85% area February 16, 2007

6 Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind
Design Lines of Relative Block Code (BSV) Area Controller % Scrambler % Conv. Encoder % Interleaver % Mapper % IFFT % Cyc. Extender % Complex arithmetic libraries constitute another 200 lines of code February 16, 2007

7 Combinational IFFT … * + - *j t2 t0 t3 t1
Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 * + - *j t2 t0 t3 t1 Constants t0 to t3 are different for each box and can have dramatci impact on optimizations. All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ... February 16, 2007

8 4-way Butterfly Node BSV has a very strong notion of types * + - *i t0
k0 k1 k2 k3 function Vector#(4,Complex) Bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); BSV has a very strong notion of types Every expression has a type. Either it is declared by the user or automatically deduced by the compiler The compiler verifies that the type declarations are compatible February 16, 2007

9 BSV code: 4-way Butterfly
function Vector#(4,Complex) Bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k); Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector(); m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3]; y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]); z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3]; return(z); endfunction * + - *i m y z Polymorphic code: works on any type of numbers for which *, + and - have been defined Note: Vector does not mean storage February 16, 2007

10 Combinational IFFT … stage_f function repeat it three times in0 in1
Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 stage_f function repeat it three times February 16, 2007

11 BSV Code: Combinational IFFT
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data); //Declare vectors SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector); stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); The for loop is unfolded and stage_f is inlined during static elaboration Note: no notion of loops or procedures during execution February 16, 2007

12 BSV Code: Combinational IFFT- Unfolded
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data); //Declare vectors SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector); stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]); stage_data[1] = stage_f(0,stage_data[0]); stage_data[2] = stage_f(1,stage_data[1]); stage_data[3] = stage_f(2,stage_data[2]); Stage_f can be inlined now; it could have been inlined before loop unfolding also. Does the order matter? February 16, 2007

13 Bluespec Code for stage_f
function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; return(stage_out); February 16, 2007

14 Architectural Exploration
February 16, 2007

15 Design Alternatives Reuse a block over multiple cycles we expect:
f g f g we expect: Throughput to Area to decrease – less parallelism decrease – reusing a block Energy/unit work ? The clock needs to run faster for the same throughput  hyper-linear increase in energy more on power issues later February 16, 2007

16 Combinational IFFT Opportunity for reuse
in1 in2 in63 in3 in4 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 Reuse the same circuit three times February 16, 2007

17 Circular pipeline: Reusing the Pipeline Stage
in1 in2 in63 in3 in4 out0 out1 out2 out63 out3 out4 Bfly4 Permute_1 64, 4-way Muxes Permute_2 Stage Counter 16 Bfly4s can be shared but not the three permutations. Hence the need for muxes Permute_3 February 16, 2007

18 Superfolded circular pipeline: Just one Bfly-4 node!
in1 in2 in63 in3 in4 out0 out1 out2 out63 out3 out4 Bfly4 Permute_1 Permute_2 Permute_3 Stage Counter 0 to 2 Index Counter 0 to 15 64, 4-way Muxes 4, 16-way Muxes 4, 16-way DeMuxes February 16, 2007

19 Algorithmic Improvements
in0 in1 in2 in63 in3 in4 Bfly-4 Bfly4 x16 out0 out1 out2 out63 out3 out4 Permute_1 Permute_2 Permute_3 1. All the three permutations can be made identical  more saving in area in the folded case 2. One multiplication can be removed from Bfly-4 February 16, 2007

20 Area improvements because of change in Algorithm
Design Old Area New Area (mm2) (mm2) Combinational Simple Pipe Folded Pipe February 16, 2007

21 Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives? single source with parameters? Not practical in traditional hardware description languages like Verilog/VHDL 5-minute break to stretch you legs February 16, 2007

22 Pipelining a block inQ outQ f2 f1 f3 Combinational C inQ outQ f2 f1 f3
Pipeline P inQ outQ f Folded Pipeline FP Clock: C < P  FP Clock? Area? Throughput? Area: FP < C < P Throughput: FP < C < P February 16, 2007

23 Synchronous pipeline x sReg1 inQ f1 f2 f3 sReg2 outQ
rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule This rule can fire only if - inQ has an element - outQ has space Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated This is real IFFT code; just replace f1, f2 and f3 with stage_f code February 16, 2007

24 Stage functions f1, f2 and f3
function f1(x); return (stage_f(1,x)); endfunction function f2(x); return (stage_f(2,x)); function f3(x); return (stage_f(3,x)); The stage_f fucntion is given on slide 12 February 16, 2007

25 Problem: What about pipeline bubbles?
x sReg1 inQ f1 f2 f3 sReg2 outQ rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2)); endrule Red and Green tokens must move even if there is nothing in the inQ! Also if there is no token in sReg2 then nothing should be enqueued in the outQ Valid bits or the Maybe type Modify the rule to deal with these conditions February 16, 2007

26 The Maybe type data in the pipeline
typedef union tagged { void Invalid; data_T Valid; } Maybe#(type data_T); data valid/invalid Registers contain Maybe type values rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= Valid f1(inQ.first()); inq.deq(); end else sReg1 <= Invalid; case (sReg1) matches tagged Valid .sx1: sReg2 <= Valid f2(sx1); tagged Invalid: sReg2 <= Invalid; case (sReg2) matches tagged Valid .sx2: outQ.enq(f3(sx2)); endrule February 16, 2007

27 Folded pipeline x sReg inQ f outQ stage The same code will work for superfolded pipelines by changing n and stage function f rule folded-pipeline (True); if (stage==1) begin sxIn= inQ.first(); inQ.deq(); end else sxIn= sReg; sxOut = f(stage,sxIn); if (stage==n) outQ.enq(sxOut); else sReg <= sxOut; stage <= (stage==n)? 1 : stage+1; endrule notice stage is a dynamic parameter now! February 16, 2007

28 802.11a Transmitter Synthesis results (Only the IFFT block is changing)
IFFT Design Area (mm2) ThroughputLatency (CLKs/sym) Min. Freq Required Pipelined 5.25 04 1.0 MHz Combinational 4.91 Folded (16 Bfly-4s) 3.97 Super-Folded (8 Bfly-4s) 3.69 06 1.5 MHz SF(4 Bfly-4s) 2.45 12 3.0 MHz SF(2 Bfly-4s) 1.84 24 6.0 MHz SF (1 Bfly4) 1.52 48 12 MHZ All these designs were done in less than 24 hours! The same source code TSMC .18 micron; numbers reported are before place and route. February 16, 2007

29 Why are the areas so similar
Folding should have given a 3x improvement in IFFT area BUT a constant twiddle allows low-level optimization on a Bfly-4 block a 2.5x area reduction! February 16, 2007

30 Summary It is essential to do architectural exploration for better (area, power, performance, ...) designs. It is possible to do so with new design tools and methodologies, i.e., Bluespec Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration. February 16, 2007

31 Bluespec Learnings How to write highly parameterized combinational codes How to write rules for simple synchronous pipelines Effect of dynamic vs static values on generated circuits Using Maybe types to express valid/invalid data Thanks February 16, 2007

32 Backup slides February 16, 2007

33 Function f for the folded pipeline is the same stage_f function but ...
function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; return(stage_out); will cause a mux to be generated February 16, 2007

34 Folded pipeline: stage function f
getTwiddle getTwiddle2 getTwiddle3 twid The rest of stage_f, i.e. Bfly-4s and permutations (shared) sx February 16, 2007

35 Function f for the Superfolded pipeline (One Bfly-4 case)
f will be invoked for 48 dynamic values of stage each invocation will modify 4 numbers in sReg after 16 invocations a permutation would be done on the whole sReg February 16, 2007

36 Code for the Superfolded pipeline stage function
function SVector#(64, Complex) f (Bit#(6) stage, SVector#(64, Complex) stage_in); begin let idx = stage `mod` 16; let twid = getTwiddle(stage `div` 16, idx); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp = stage_in; stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; end return((idx == 15) ? stage_out: stage_temp); One Bfly-4 case February 16, 2007

37 Nirav Dave, Mike Pellauer, Steve Gerding, Arvind MEMOCODE 2006
Experimental Results Nirav Dave, Mike Pellauer, Steve Gerding, Arvind MEMOCODE 2006 February 16, 2007

38 Expressing these designs in Bluespec was easy
All these designs were done in less than one day! Designers were experts in Bluespec Area and power estimates? Combinational Pipelined Folded (16 Bfly-4s) Super-Folded (8 Bfly-4s) Super-Folded (4 Bfly-4s) Super-Folded (2 Bfly-4s) Super-Folded (1 Bfly-4) The same source code February 16, 2007

39 Bluespec SystemVerilog source
Bluespec Tool flow Blueview Bluespec SystemVerilog source Sequence Design PowerTheater Bluespec Compiler VCD output Debussy Visualization C Bluesim Cycle Accurate Verilog 95 RTL Verilog sim RTL synthesis gates Power estimation tool files Bluespec tools 3rd party tools Legend Physical Place & Route Tapeout FPGA February 16, 2007

40 802.11a Transmitter Synthesis results (Only the IFFT block is changing)
IFFT Design Area (mm2) Symbol Latency (CLKs) ThroughputLatency (CLKs/sym) Min. Freq Required Average Power (mW) Pipelined 5.25 12 04 1.0 MHz 4.92 Combinational 4.91 10 3.99 Folded (16 Bfly-4s) 3.97 7.27 Super-Folded (8 Bfly-4s) 3.69 15 06 1.5 MHz 10.9 SF(4 Bfly-4s) 2.45 21 3.0 MHz 14.4 SF(2 Bfly-4s) 1.84 33 24 6.0 MHz 21.1 SF (1 Bfly4) 1.52 57 48 12 MHZ 34.6 TSMC .18 micron; numbers reported are before place and route. (DesignCompiler), Power numbers are from Sequence PowerTheater February 16, 2007

41 Power can be reduced further
Right now all blocks in the transmitter run on the same clock  if we run IFFT faster then all other blocks also run faster Bluespec has facilities for Multiple Clock Domains and the design can be easily modified to run the earlier blocks at a lower clock rate February 16, 2007


Download ppt "Architectural Exploration:"

Similar presentations


Ads by Google