Architectural Exploration:

Presentation transcript:

Architectural Exploration: Area-Power tradeoff in 802.11a transmitter design Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 16, 2007 http://csg.csail.mit.edu/6.375/

This lecture has two purposes
1. Illustrate how the area-power tradeoff can be studied at a high level for a realistic design. Example: the 802.11a transmitter.
2. Illustrate some features of BSV: static elaboration, combinational circuits, simple synchronous pipelines, and valid bits as the Maybe type in BSV.
No prior understanding of 802.11a is necessary to follow this lecture.

Bluespec: Two-Level Compilation
- Source: Bluespec (objects, types, higher-order functions)
- Level 1 compilation (Lennart Augustsson @Sandburst, 2000-2002): type checking; massive partial evaluation and static elaboration
- Intermediate form: Rules and Actions (Term Rewriting System). Now we call this Guarded Atomic Actions.
- Level 2 synthesis (James Hoe & Arvind @MIT, 1997-2000): rule conflict analysis; rule scheduling
- Output: object code (Verilog/C)

Static Elaboration
At compile time:
- Inline function calls and datatypes
- Instantiate modules with specific parameters
- Resolve polymorphism/overloading
[Figure: Software toolflow: source -> compile -> .exe -> run w/ params -> run1, run2, run3, ... Hardware toolflow: source -> elaborate w/ params -> design1, design2, design3 -> run1.1, run2.1, run3.1, ...]

802.11a Transmitter Overview
[Figure: headers and data -> Controller -> Scrambler -> Encoder -> Interleaver -> Mapper -> IFFT -> Cyclic Extend]
- The transmitter must produce one OFDM symbol every 4 µsec.
- The front of the pipeline works on tokens of 24 uncoded bits. Depending upon the transmission rate, 1, 2 or 4 tokens are consumed to produce one OFDM symbol.
- One OFDM symbol is 64 complex numbers.
- The IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers, and accounts for 85% of the area.

Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind

| Design Block | Lines of Code (BSV) | Relative Area |
|---|---|---|
| Controller | 49 | 0% |
| Scrambler | 40 | 0% |
| Conv. Encoder | 113 | 0% |
| Interleaver | 76 | 1% |
| Mapper | 112 | 11% |
| IFFT | 95 | 85% |
| Cyc. Extender | 23 | 3% |

Complex arithmetic libraries constitute another 200 lines of code.

Combinational IFFT
[Figure: three stages, each made of 16 Bfly4 nodes followed by a permutation (Permute_1, Permute_2, Permute_3), producing out0 ... out63. Each Bfly4 node multiplies its inputs by constants t0 to t3 and combines them with +, - and *j.]
- Constants t0 to t3 are different for each box and can have a dramatic impact on optimizations.
- All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

4-way Butterfly Node
[Figure: Bfly4 node with inputs k0 ... k3, twiddle constants t0 ..., and operators *, +, - and *i]
    function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k);
BSV has a very strong notion of types:
- Every expression has a type. Either it is declared by the user or automatically deduced by the compiler.
- The compiler verifies that the type declarations are compatible.

BSV code: 4-way Butterfly
    function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k);
       Vector#(4,Complex) m = newVector(), y = newVector(), z = newVector();
       // multiply each input by its twiddle constant
       m[0] = k[0] * t[0]; m[1] = k[1] * t[1];
       m[2] = k[2] * t[2]; m[3] = k[3] * t[3];
       // first layer of additions and subtractions
       y[0] = m[0] + m[2]; y[1] = m[0] - m[2];
       y[2] = m[1] + m[3]; y[3] = i*(m[1] - m[3]);
       // second layer of additions and subtractions
       z[0] = y[0] + y[2]; z[1] = y[1] + y[3];
       z[2] = y[0] - y[2]; z[3] = y[1] - y[3];
       return(z);
    endfunction
- Polymorphic code: works on any type of numbers for which *, + and - have been defined.
- Note: Vector does not mean storage.
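As a quick sanity check of the butterfly arithmetic, here is a minimal behavioural sketch in Python rather than BSV. It assumes Python's built-in complex numbers stand in for the fixed-point Complex type, and `direct_sum` is a hypothetical reference helper that is not part of the lecture. Under those assumptions the three layers compute z[j] = sum over n of m[n] * i^(n*j), where m[n] = k[n] * t[n].

```python
# Behavioural model of the 4-way butterfly. Python complex numbers stand in
# for the fixed-point Complex type of the BSV design (an assumption).

def bfly4(t, k):
    """Return the four outputs z for twiddle constants t and inputs k."""
    m = [k[n] * t[n] for n in range(4)]            # twiddle multiplications
    y = [m[0] + m[2], m[0] - m[2],
         m[1] + m[3], 1j * (m[1] - m[3])]          # first add/subtract layer
    z = [y[0] + y[2], y[1] + y[3],
         y[0] - y[2], y[1] - y[3]]                 # second add/subtract layer
    return z


def direct_sum(m):
    """Hypothetical reference: z[j] = sum over n of m[n] * i**(n*j)."""
    powers_of_i = [1, 1j, -1, -1j]                 # i**0 .. i**3
    return [sum(m[n] * powers_of_i[(n * j) % 4] for n in range(4))
            for j in range(4)]


if __name__ == "__main__":
    unit_twiddles = [1, 1, 1, 1]
    print(bfly4(unit_twiddles, [1, 2, 3, 4]))
    print(direct_sum([1, 2, 3, 4]))
```

With unit twiddles the two functions should agree, which is one way to convince yourself that the node needs only the four multiplications by t plus additions, subtractions and one multiplication by i.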

Combinational IFFT
[Figure: in0, in1, ... pass through three stages, each consisting of 16 Bfly4 nodes followed by a permutation (Permute_1, Permute_2, Permute_3), to produce out0 ... out63.]
Write one stage as a function, stage_f, and repeat it three times.

BSV Code: Combinational IFFT
    function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
       //Declare vectors
       SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
       stage_data[0] = in_data;
       for (Integer stage = 0; stage < 3; stage = stage + 1)
          stage_data[stage+1] = stage_f(stage,stage_data[stage]);
       return(stage_data[3]);
    endfunction
- The for loop is unfolded and stage_f is inlined during static elaboration.
- Note: there is no notion of loops or procedures during execution.

BSV Code: Combinational IFFT - Unfolded
    function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
       //Declare vectors
       SVector#(4,SVector#(64, Complex)) stage_data = replicate(newSVector);
       stage_data[0] = in_data;
       for (Integer stage = 0; stage < 3; stage = stage + 1)
          stage_data[stage+1] = stage_f(stage,stage_data[stage]);
       return(stage_data[3]);
    endfunction
Static elaboration unfolds the for loop into:
    stage_data[1] = stage_f(0,stage_data[0]);
    stage_data[2] = stage_f(1,stage_data[1]);
    stage_data[3] = stage_f(2,stage_data[2]);
stage_f can be inlined now; it could also have been inlined before loop unfolding. Does the order matter?

Bluespec Code for stage_f
    function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in);
       // declarations of stage_temp and stage_out are not shown on the slide
       for (Integer i = 0; i < 16; i = i + 1)
       begin
          Integer idx = i * 4;
          let twid = getTwiddle(stage, fromInteger(i));
          let y = bfly4(twid, stage_in[idx:idx+3]);
          stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
          stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
       end
       //Permutation
       for (Integer i = 0; i < 64; i = i + 1)
          stage_out[i] = stage_temp[permute[i]];
       return(stage_out);
    endfunction
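The structure of one stage (16 butterflies on consecutive groups of four, then a permutation) can be sketched behaviourally in Python. This is only a model under stated assumptions: the real twiddle table and permutation are not given in the transcript, so `get_twiddle` and `permute` are passed in as parameters, and `unit_twiddle` and `identity_permute` are hypothetical placeholders.

```python
# Behavioural model of one IFFT stage. The twiddle table and the permutation
# are NOT taken from the lecture; they are supplied by the caller.

def bfly4(t, k):
    """4-way butterfly, mirroring the BSV function of the same name."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def stage_f(stage, stage_in, get_twiddle, permute):
    """Apply 16 butterflies to groups of 4 elements, then permute all 64."""
    assert len(stage_in) == 64 and len(permute) == 64
    stage_temp = [0] * 64
    for i in range(16):
        idx = 4 * i                                 # first element of group i
        twid = get_twiddle(stage, i)                # four constants per node
        stage_temp[idx:idx + 4] = bfly4(twid, stage_in[idx:idx + 4])
    return [stage_temp[permute[i]] for i in range(64)]


def unit_twiddle(stage, i):
    """Hypothetical placeholder: every twiddle constant is 1."""
    return [1, 1, 1, 1]


identity_permute = list(range(64))                  # hypothetical placeholder


def ifft_model(in_data, get_twiddle, permute):
    """Three stages back to back, like the unfolded ifft function."""
    data = in_data
    for stage in range(3):
        data = stage_f(stage, data, get_twiddle, permute)
    return data
```

In the Python model the loop over `stage` runs at run time; in BSV the corresponding loop over an Integer is unfolded at compile time, so no loop exists in the generated circuit.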

Architectural Exploration

Design Alternatives
Reuse a block over multiple cycles.
[Figure: blocks f and g]
We expect:
- Throughput to decrease (less parallelism)
- Area to decrease (reusing a block)
- Energy/unit work: ?
The clock needs to run faster for the same throughput => hyper-linear increase in energy. More on power issues later.

Combinational IFFT: Opportunity for reuse
[Figure: the inputs pass through three stages, each consisting of 16 Bfly4 nodes followed by a permutation (Permute_1, Permute_2, Permute_3), to produce out0 ... out63.]
Reuse the same circuit three times.

Circular pipeline: Reusing the Pipeline Stage
[Figure: a single row of Bfly4 nodes feeds Permute_1, Permute_2 and Permute_3; 64 4-way muxes, controlled by a stage counter, select which permuted result is used.]
The 16 Bfly4s can be shared, but not the three permutations. Hence the need for muxes.

Superfolded circular pipeline: Just one Bfly-4 node!
[Figure: a single Bfly4 node with 4 16-way muxes on its inputs and 4 16-way demuxes on its outputs, driven by an index counter (0 to 15); 64 4-way muxes, driven by a stage counter (0 to 2), select among Permute_1, Permute_2 and Permute_3.]

Algorithmic Improvements
[Figure: the combinational IFFT with three stages of 16 Bfly4 nodes and Permute_1, Permute_2, Permute_3]
1. All three permutations can be made identical => more saving in area in the folded case.
2. One multiplication can be removed from Bfly-4.

Area improvements because of change in Algorithm

| Design | Old Area (mm²) | New Area (mm²) |
|---|---|---|
| Combinational | 4.69 | 4.91 |
| Simple Pipe | 5.14 | 5.25 |
| Folded Pipe | 5.89 | 3.97 |

Which design consumes the least energy to transmit a symbol?
Can we quickly code up all the alternatives? From a single source with parameters?
Not practical in traditional hardware description languages like Verilog/VHDL.
5-minute break to stretch your legs

Pipelining a block
[Figure: Combinational (C): inQ -> f1 -> f2 -> f3 -> outQ. Pipeline (P): inQ -> f1 -> f2 -> f3 -> outQ, with the stages separated. Folded Pipeline (FP): inQ -> f -> outQ, with a single block f reused.]
Clock? Area? Throughput?
- Clock: C < P ≈ FP
- Area: FP < C < P
- Throughput: FP < C < P

Synchronous pipeline
[Figure: x -> inQ -> f1 -> sReg1 -> f2 -> sReg2 -> f3 -> outQ]
    rule sync_pipeline (True);
       inQ.deq();
       sReg1 <= f1(inQ.first());
       sReg2 <= f2(sReg1);
       outQ.enq(f3(sReg2));
    endrule
This rule can fire only if
- inQ has an element
- outQ has space
Atomicity: either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated.
This is real IFFT code; just replace f1, f2 and f3 with stage_f code.
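The firing condition and the all-or-nothing update can be illustrated with a small Python model of the rule. This is a sketch under assumptions: `SyncPipeline`, `out_capacity` and the initial register value 0 are hypothetical choices made for the illustration, and one call to `step` stands for one clock cycle.

```python
from collections import deque


class SyncPipeline:
    """Behavioural model of the synchronous pipeline rule (not BSV)."""

    def __init__(self, f1, f2, f3, out_capacity=2):
        self.f1, self.f2, self.f3 = f1, f2, f3
        self.in_q = deque()
        self.out_q = deque()
        self.out_capacity = out_capacity   # assumed finite size of outQ
        self.s_reg1 = 0                    # assumed reset values
        self.s_reg2 = 0

    def can_fire(self):
        # The rule's implicit guard: inQ has an element and outQ has space.
        return len(self.in_q) > 0 and len(self.out_q) < self.out_capacity

    def step(self):
        """One clock cycle. Returns True if the rule fired."""
        if not self.can_fire():
            return False                   # nothing at all is updated
        # Every right-hand side reads the OLD state ...
        new_s_reg1 = self.f1(self.in_q[0])
        new_s_reg2 = self.f2(self.s_reg1)
        out_value = self.f3(self.s_reg2)
        # ... and then all updates are committed together.
        self.in_q.popleft()
        self.s_reg1 = new_s_reg1
        self.s_reg2 = new_s_reg2
        self.out_q.append(out_value)
        return True
```

Running the model with three inputs shows that the first two values enqueued in `out_q` are computed from the reset contents of the registers, and that the last two inputs stay stuck in the registers once `in_q` is empty. The model makes no attempt to hide either effect.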

Stage functions f1, f2 and f3
    function f1(x);
       return (stage_f(1,x));
    endfunction
    function f2(x);
       return (stage_f(2,x));
    endfunction
    function f3(x);
       return (stage_f(3,x));
    endfunction
The stage_f function is given on slide 12.

Problem: What about pipeline bubbles?
[Figure: x -> inQ -> f1 -> sReg1 -> f2 -> sReg2 -> f3 -> outQ, with a red and a green token in the registers]
    rule sync_pipeline (True);
       inQ.deq();
       sReg1 <= f1(inQ.first());
       sReg2 <= f2(sReg1);
       outQ.enq(f3(sReg2));
    endrule
- Red and Green tokens must move even if there is nothing in the inQ!
- Also, if there is no token in sReg2 then nothing should be enqueued in the outQ.
Solution: valid bits, or the Maybe type. Modify the rule to deal with these conditions.

The Maybe type data in the pipeline
    typedef union tagged {
       void   Invalid;
       data_T Valid;
    } Maybe#(type data_T);
Registers contain Maybe type values: the data plus a valid/invalid tag.
    rule sync_pipeline (True);
       if (inQ.notEmpty())
       begin
          sReg1 <= tagged Valid (f1(inQ.first()));
          inQ.deq();
       end
       else
          sReg1 <= tagged Invalid;
       case (sReg1) matches
          tagged Valid .sx1: sReg2 <= tagged Valid (f2(sx1));
          tagged Invalid:    sReg2 <= tagged Invalid;
       endcase
       case (sReg2) matches
          tagged Valid .sx2: outQ.enq(f3(sx2));
       endcase
    endrule
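A Python model of the same rule shows how the valid bits let bubbles travel through the pipeline. The sketch assumes that `None` stands in for the Invalid tag (so `None` cannot be a data value) and that `out_q` is unbounded; the class name `BubblePipeline` is hypothetical.

```python
from collections import deque

INVALID = None   # stands in for `tagged Invalid` (assumption of this model)


class BubblePipeline:
    """Behavioural model of the pipeline whose registers hold Maybe values."""

    def __init__(self, f1, f2, f3):
        self.f1, self.f2, self.f3 = f1, f2, f3
        self.in_q = deque()
        self.out_q = deque()       # assumed unbounded for simplicity
        self.s_reg1 = INVALID
        self.s_reg2 = INVALID

    def step(self):
        """One clock cycle: every token advances, valid or not."""
        old1, old2 = self.s_reg1, self.s_reg2      # read the old state first
        # Stage 1: take an input if there is one, otherwise insert a bubble.
        if self.in_q:
            new1 = self.f1(self.in_q.popleft())
        else:
            new1 = INVALID
        # Stage 2: a bubble in sReg1 becomes a bubble in sReg2.
        new2 = self.f2(old1) if old1 is not INVALID else INVALID
        # Stage 3: only valid data is enqueued in outQ.
        if old2 is not INVALID:
            self.out_q.append(self.f3(old2))
        self.s_reg1, self.s_reg2 = new1, new2
```

A single input now drains completely after three calls to `step`, and nothing spurious is enqueued while the registers hold bubbles.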

Folded pipeline
[Figure: x -> inQ -> f -> outQ, with register sReg feeding back into f and a stage counter]
    rule folded_pipeline (True);
       if (stage==1)
       begin
          sxIn = inQ.first();
          inQ.deq();
       end
       else
          sxIn = sReg;
       sxOut = f(stage,sxIn);
       if (stage==n)
          outQ.enq(sxOut);
       else
          sReg <= sxOut;
       stage <= (stage==n) ? 1 : stage+1;
    endrule
- Notice that stage is a dynamic parameter now!
- The same code will work for superfolded pipelines by changing n and the stage function f.
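The control structure of the folded rule can also be modelled in Python. This is a sketch, not the BSV semantics in full: it assumes the rule stalls only when it actually needs an input (stage 1 with an empty inQ), it treats `out_q` as unbounded, and `FoldedPipeline` is a hypothetical name.

```python
from collections import deque


class FoldedPipeline:
    """Behavioural model of the folded pipeline: one block f reused n times."""

    def __init__(self, f, n):
        self.f, self.n = f, n
        self.in_q = deque()
        self.out_q = deque()     # assumed unbounded for simplicity
        self.s_reg = None
        self.stage = 1           # stage is run-time state, not a constant

    def step(self):
        """One clock cycle. Returns True if the rule fired."""
        if self.stage == 1:
            if not self.in_q:
                return False                 # no input available: stall
            sx_in = self.in_q.popleft()
        else:
            sx_in = self.s_reg
        sx_out = self.f(self.stage, sx_in)
        if self.stage == self.n:
            self.out_q.append(sx_out)        # last pass: emit the result
        else:
            self.s_reg = sx_out              # otherwise feed it back
        self.stage = 1 if self.stage == self.n else self.stage + 1
        return True
```

Because `stage` is read from state on every call, `f` sees a value that is only known at run time. That is the Python analogue of the slide's remark that stage has become a dynamic parameter, whereas the unfolded designs pass a compile-time constant.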

802.11a Transmitter Synthesis results (Only the IFFT block is changing)

| IFFT Design | Area (mm²) | Throughput Latency (CLKs/sym) | Min. Freq Required |
|---|---|---|---|
| Pipelined | 5.25 | 4 | 1.0 MHz |
| Combinational | 4.91 | | |
| Folded (16 Bfly-4s) | 3.97 | | |
| Super-Folded (8 Bfly-4s) | 3.69 | 6 | 1.5 MHz |
| SF (4 Bfly-4s) | 2.45 | 12 | 3.0 MHz |
| SF (2 Bfly-4s) | 1.84 | 24 | 6.0 MHz |
| SF (1 Bfly-4) | 1.52 | 48 | 12 MHz |

All these designs were done in less than 24 hours, from the same source code.
TSMC .18 micron; numbers reported are before place and route.
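The minimum-frequency column follows from the throughput column, assuming the 802.11a requirement of one OFDM symbol every 4 µs: a design that needs N clocks per symbol must be clocked at least at N / 4 µs. A short Python check of that arithmetic (the function and constant names are illustrative, not from the lecture):

```python
SYMBOL_PERIOD_US = 4   # assumed: one OFDM symbol every 4 microseconds


def min_frequency_mhz(clocks_per_symbol):
    """Clocks per symbol divided by microseconds per symbol gives MHz."""
    return clocks_per_symbol / SYMBOL_PERIOD_US


if __name__ == "__main__":
    # Clocks per symbol for the rows of the table that list a value.
    for clocks in (4, 6, 12, 24, 48):
        print(f"{clocks:2d} CLKs/sym -> {min_frequency_mhz(clocks):4.1f} MHz")
```

Each halving of the number of Bfly-4 nodes doubles the clocks per symbol and therefore doubles the clock rate needed to keep up with the symbol rate.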

Why are the areas so similar?
Folding should have given a 3x improvement in IFFT area, BUT a constant twiddle allows low-level optimization on a Bfly-4 block: a 2.5x area reduction!

Summary
- It is essential to do architectural exploration for better (area, power, performance, ...) designs.
- It is possible to do so with new design tools and methodologies, i.e., Bluespec.
- Better and faster tools for estimating area, timing and power would dramatically increase our capability to do architectural exploration.

Bluespec Learnings
- How to write highly parameterized combinational codes
- How to write rules for simple synchronous pipelines
- Effect of dynamic vs static values on generated circuits
- Using Maybe types to express valid/invalid data
Thanks

Backup slides

Function f for the folded pipeline is the same stage_f function, but ...
    function SVector#(64, Complex) stage_f (Bit#(2) stage, SVector#(64, Complex) stage_in);
       // declarations of stage_temp and stage_out are not shown on the slide
       for (Integer i = 0; i < 16; i = i + 1)
       begin
          Integer idx = i * 4;
          let twid = getTwiddle(stage, fromInteger(i));   // will cause a mux to be generated
          let y = bfly4(twid, stage_in[idx:idx+3]);
          stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
          stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
       end
       //Permutation
       for (Integer i = 0; i < 64; i = i + 1)
          stage_out[i] = stage_temp[permute[i]];
       return(stage_out);
    endfunction
... because stage is now a dynamic value, the twiddle lookup will cause a mux to be generated.

Folded pipeline: stage function f
[Figure: a mux selects twid from getTwiddle, getTwiddle2 and getTwiddle3 according to stage; the rest of stage_f, i.e. the Bfly-4s and permutations, is shared and operates on sx.]

Function f for the Superfolded pipeline (One Bfly-4 case)
- f will be invoked for 48 dynamic values of stage.
- Each invocation will modify 4 numbers in sReg.
- After 16 invocations a permutation would be done on the whole sReg.

Code for the Superfolded pipeline stage function (One Bfly-4 case)
    function SVector#(64, Complex) f (Bit#(6) stage, SVector#(64, Complex) stage_in);
       // declarations of stage_temp and stage_out are not shown on the slide
       let i    = stage % 16;             // which Bfly-4 of the current stage (0 to 15)
       let idx  = i * 4;                  // first of the 4 numbers this invocation modifies
       let twid = getTwiddle(stage / 16, i);
       let y = bfly4(twid, stage_in[idx:idx+3]);
       stage_temp = stage_in;
       stage_temp[idx]   = y[0]; stage_temp[idx+1] = y[1];
       stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3];
       //Permutation, used only after the 16th Bfly-4 of a stage
       for (Integer j = 0; j < 64; j = j + 1)
          stage_out[j] = stage_temp[permute[j]];
       return((i == 15) ? stage_out : stage_temp);
    endfunction
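One way to check the superfolded scheme is to model it in Python and compare 16 single-butterfly invocations against one full stage. The sketch assumes that invocation number stage handles the four elements starting at index 4 * (stage mod 16). `demo_twiddle` and `demo_permute` are hypothetical placeholders, since the real twiddle table and permutation are not in the transcript.

```python
def bfly4(t, k):
    """4-way butterfly, mirroring the BSV function of the same name."""
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], 1j * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]


def full_stage(stage, stage_in, get_twiddle, permute):
    """Reference: all 16 butterflies at once, then the permutation."""
    temp = [0] * 64
    for i in range(16):
        temp[4 * i:4 * i + 4] = bfly4(get_twiddle(stage, i),
                                      stage_in[4 * i:4 * i + 4])
    return [temp[permute[j]] for j in range(64)]


def superfolded_f(stage, stage_in, get_twiddle, permute):
    """One invocation: a single butterfly updates 4 of the 64 numbers."""
    i = stage % 16                      # which butterfly of the current stage
    idx = 4 * i                         # first of the 4 numbers it modifies
    twid = get_twiddle(stage // 16, i)
    temp = list(stage_in)
    temp[idx:idx + 4] = bfly4(twid, stage_in[idx:idx + 4])
    if i == 15:                         # stage complete: permute everything
        return [temp[permute[j]] for j in range(64)]
    return temp


def demo_twiddle(stage, i):
    """Hypothetical twiddle constants, chosen only to vary with i."""
    return [1, i + 1, 1j, -1]


demo_permute = list(reversed(range(64)))    # hypothetical permutation
```

Since the 16 groups of four are disjoint, applying `superfolded_f` for stage values 0 to 15 in order should give the same 64 numbers as a single call to `full_stage` for stage 0.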

Experimental Results
Nirav Dave, Mike Pellauer, Steve Gerding, Arvind. MEMOCODE 2006.

Expressing these designs in Bluespec was easy
- All these designs were done in less than one day! The designers were experts in Bluespec.
- Designs, all from the same source code: Combinational, Pipelined, Folded (16 Bfly-4s), Super-Folded (8 Bfly-4s), Super-Folded (4 Bfly-4s), Super-Folded (2 Bfly-4s), Super-Folded (1 Bfly-4).
- Area and power estimates?

Bluespec Tool flow
[Figure: Bluespec SystemVerilog source -> Bluespec Compiler -> C (Bluesim, cycle accurate) or Verilog 95 RTL. The RTL goes to Verilog simulation (VCD output, Debussy visualization, Blueview) and to RTL synthesis (gates), followed by physical place & route and then tapeout or FPGA. Power estimation tool: Sequence Design PowerTheater. Legend: files, Bluespec tools, 3rd party tools.]

802.11a Transmitter Synthesis results (Only the IFFT block is changing)

| IFFT Design | Area (mm²) | Symbol Latency (CLKs) | Throughput Latency (CLKs/sym) | Min. Freq Required | Average Power (mW) |
|---|---|---|---|---|---|
| Pipelined | 5.25 | 12 | 4 | 1.0 MHz | 4.92 |
| Combinational | 4.91 | 10 | | | 3.99 |
| Folded (16 Bfly-4s) | 3.97 | | | | 7.27 |
| Super-Folded (8 Bfly-4s) | 3.69 | 15 | 6 | 1.5 MHz | 10.9 |
| SF (4 Bfly-4s) | 2.45 | 21 | 12 | 3.0 MHz | 14.4 |
| SF (2 Bfly-4s) | 1.84 | 33 | 24 | 6.0 MHz | 21.1 |
| SF (1 Bfly-4) | 1.52 | 57 | 48 | 12 MHz | 34.6 |

TSMC .18 micron; numbers reported are before place and route (DesignCompiler). Power numbers are from Sequence PowerTheater.
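The power column gives a rough answer to the earlier question about energy per symbol. The Python sketch below rests on two assumptions that the slides do not state: every design delivers one symbol per 4 µs, and the reported average power applies at that operating point. Under those assumptions, energy per symbol is average power times 4 µs, and mW times µs gives nJ.

```python
SYMBOL_PERIOD_US = 4.0   # assumed: one OFDM symbol every 4 microseconds

# Average power in mW, copied from the table above.
AVERAGE_POWER_MW = {
    "Pipelined": 4.92,
    "Combinational": 3.99,
    "Folded (16 Bfly-4s)": 7.27,
    "Super-Folded (8 Bfly-4s)": 10.9,
    "SF (4 Bfly-4s)": 14.4,
    "SF (2 Bfly-4s)": 21.1,
    "SF (1 Bfly-4)": 34.6,
}


def energy_per_symbol_nj(power_mw, period_us=SYMBOL_PERIOD_US):
    """Energy in nanojoules: milliwatts times microseconds."""
    return power_mw * period_us


if __name__ == "__main__":
    for design, power in AVERAGE_POWER_MW.items():
        print(f"{design:26s} {energy_per_symbol_nj(power):7.2f} nJ/symbol")
```

On these assumptions the ranking by energy per symbol is the same as the ranking by average power, so the smallest design (one Bfly-4) is the most expensive per symbol and the combinational design the cheapest.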

Power can be reduced further
- Right now all blocks in the transmitter run on the same clock => if we run the IFFT faster, then all other blocks also run faster.
- Bluespec has facilities for Multiple Clock Domains, and the design can be easily modified to run the earlier blocks at a lower clock rate.