Bluespec-3: Architecture exploration using static elaboration
Arvind, Computer Science & Artificial Intelligence Lab
March 1, 2006, http://csg.csail.mit.edu/6.375/ (L09-1)

Presentation transcript:

L09-1: Bluespec-3: Architecture exploration using static elaboration
Arvind, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology
March 1, 2006, http://csg.csail.mit.edu/6.375/

L09-2: Design an 802.11a Transmitter
802.11a is an IEEE standard for wireless communication.
- Frequency of operation: 5 GHz band
- Modulation: Orthogonal Frequency Division Multiplexing (OFDM)
[Figure: block diagram of a Transmitter and a Receiver connected by a Channel, with MAC, TX/RX, and Analog blocks on each side]

L09-3: Nomenclature
Base data unit of the system: 24 uncoded bits
- Sample: one complex baseband value
- Symbol: one OFDM symbol that will be transmitted
  - In time domain: 64 samples long
  - In frequency domain: 64 tones (48 data, 4 pilot, 12 unused)
  - Represented in fixed point (16-bit real, 16-bit imaginary)
- Frame: a unit of data, corresponding to:
  - 1 symbol at 6 Mbps (i.e. 1 frame represents one symbol)
  - ½ symbol at 12 Mbps (i.e. 2 frames represent one symbol)
  - ¼ symbol at 24 Mbps (i.e. 4 frames represent one symbol)
- Message: a sequence of data symbols preceded by a header symbol (SIGNAL)
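The terms above can be written down as Bluespec types. This is a minimal sketch: the field names `i` and `q` follow the `radix4` code later in this lecture, but the type names `Sample`, `Symbol`, and `Frame` are illustrative assumptions (the lecture code calls the element type `Complex`), and `Vector` is used here in place of the `SVector` library that appears on the slides.

```bsv
import Vector::*;

// One complex baseband value: 16-bit real (i) and 16-bit imaginary (q) parts.
// The name Sample is illustrative; the slides use a type called Complex.
typedef struct {
   Int#(16) i;
   Int#(16) q;
} Sample deriving (Bits, Eq);

// One OFDM symbol: 64 samples in the time domain, 64 tones in the frequency domain.
typedef Vector#(64, Sample) Symbol;

// Base data unit: 24 uncoded bits.
typedef Bit#(24) Frame;
```

With these definitions a `Symbol` packs into 64 × 32 = 2048 bits, which is the width of each pipeline register discussed later.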

L09-4: Need Fixed Point Arithmetic
- Floating point is too inefficient to use
- We need to represent fractional values between -1 and 1 in our system
- Fixed point: use a 16-bit integer to represent each value
  - Store the value multiplied by 2^15 (32,768)
  - Use 2's complement arithmetic on fixed-point values, but watch for overflow
  - MSB indicates the sign of the number (1 for negative)
Examples:
  -1.0    => 0x8000 (-32768)
  1/√2    => 0x5a82 ( 23170)
  -3/√10  => 0x8692 (-31086)
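To make the "watch for overflow" point concrete, here is a hedged sketch of a multiply on this 16-bit format. The names `Fx16` and `fxMul` are hypothetical and do not come from the lecture code; the sketch truncates rather than rounds, and it clamps the one product that cannot be represented.

```bsv
// Fixed point as described above: value = raw / 2^15, 16-bit two's complement.
// Fx16 and fxMul are illustrative names, not part of the lecture code.
typedef Int#(16) Fx16;

function Fx16 fxMul(Fx16 a, Fx16 b);
   // The full product of two such values needs 32 bits (30 fraction bits).
   Int#(32) wide = signExtend(a) * signExtend(b);
   // Drop 15 fraction bits; >> on Int#(n) is an arithmetic (sign-preserving) shift.
   Int#(32) scaled = wide >> 15;
   // Only (-1.0) * (-1.0) = +1.0 falls outside the 16-bit range: clamp it.
   Int#(32) maxV = 32767;
   return (scaled > maxV) ? maxBound : truncate(scaled);
endfunction
```

For example, multiplying 0x5a82 (1/√2) by itself gives 23170 × 23170 = 536,848,900, which shifts down to 16383, one step below the exact encoding of 0.5 (16384).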

L09-5: Transmitter Overview
[Figure: transmitter pipeline. Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend; labels: headers, data]
The IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers. It is compute intensive.

L09-6: Mapper
- Maps incoming data to tones based on rate
- Outputs 1 OFDM symbol to the IFFT
- Depending on the rate, 48, 96, or 192 bits of input may be required to fill one symbol
Input: [rate (2), data (48)]
Output: [data (64 complex numbers)]

L09-7: Receiver Overview
[Figure: receiver pipeline with blocks Synchronizer, FFT, Serial to Parallel, Detector / Deinterleaver, Viterbi, Controller, Descrambler; annotation: compute intensive]
The FFT, in a half-duplex system, is often shared with the IFFT.

L09-8: Synchronizer
Performs two important tasks:
- Timing estimation and synchronization
  - Decides when a new message is present
  - Tells the rest of the receiver at which sample the incoming symbol starts
- Frequency offset estimation and correction
  - Estimates the offset of the transmitter and receiver clocks
  - Rotates input data to correct for this offset
Extremely complicated!

L09-9: Viterbi Decoder
- Uses the Viterbi algorithm to decode convolutionally encoded symbols
- Requires three 48-bit inputs to perform sufficient traceback
  - Will only output a frame after it receives the two subsequent frames
- Detector flushes the Viterbi module with zeros after the header and at the end of a message

L09-10: IFFT Requirements
- 802.11a needs to process a symbol in 4 μsec (250 kHz)
- The IFFT must output a symbol every 4 μsec
  - i.e. perform an inverse FFT of 64 complex numbers
- Each module before the IFFT must process, every 4 μsec:
  - 1 frame for the 6 Mbps rate
  - 2 frames for the 12 Mbps rate
  - 4 frames for the 24 Mbps rate
- Even in the worst case (24 Mbps) the clock frequency can be as low as 1 MHz. But what about the area and power?
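The 1 MHz figure follows from the symbol period. The worked arithmetic below assumes that each module before the IFFT handles one 24-bit frame per clock cycle; that per-cycle workload is an assumption made here, not something the slide states.

```latex
\begin{align*}
f_{\text{symbol}} &= \frac{1}{4\,\mu\text{s}} = 250\ \text{kHz} \\
\text{data rate at 1 frame per symbol} &= 24\ \text{bits} \times 250\ \text{kHz} = 6\ \text{Mbps} \\
f_{\text{clk,min}} &= (\text{frames per symbol}) \times f_{\text{symbol}} \\
6\ \text{Mbps}:\quad f_{\text{clk,min}} &= 1 \times 250\ \text{kHz} = 250\ \text{kHz} \\
12\ \text{Mbps}:\quad f_{\text{clk,min}} &= 2 \times 250\ \text{kHz} = 500\ \text{kHz} \\
24\ \text{Mbps}:\quad f_{\text{clk,min}} &= 4 \times 250\ \text{kHz} = 1\ \text{MHz}
\end{align*}
```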

L09-11: Area-Frequency Tradeoff
We can decrease the area by multiplexing some circuits and running the system at a higher frequency.
[Figure: a circuit duplicated in series versus one copy reused]
Reuse: twice the frequency but half the area.
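A small sketch of the reuse idea, under stated assumptions: `f` is a hypothetical stand-in for an expensive combinational block, and `Twice` and `mkTwiceFolded` are illustrative names. Instead of instantiating `f` twice in series, the module keeps one copy and passes the value through it on two successive clock cycles.

```bsv
// Illustrative only: f, Twice and mkTwiceFolded are hypothetical names.
function Bit#(16) f(Bit#(16) x);
   return (x << 1) ^ 16'h1021;   // stand-in for an expensive combinational block
endfunction

interface Twice;
   method Action put(Bit#(16) x);
   method ActionValue#(Bit#(16)) get();
endinterface

(* synthesize *)
module mkTwiceFolded (Twice);
   Reg#(Bit#(16)) r      <- mkRegU;        // operand, then intermediate, then result
   Reg#(UInt#(2)) passes <- mkReg(0);      // how many times f has been applied
   Reg#(Bool)     busy   <- mkReg(False);

   // The only use of f: the register feeds it on every pass.
   rule step (busy && passes < 2);
      r      <= f(r);
      passes <= passes + 1;
   endrule

   // Accept a new operand only when idle.
   method Action put(Bit#(16) x) if (!busy);
      r      <= x;
      passes <= 0;
      busy   <= True;
   endmethod

   // The result is available once f has been applied twice.
   method ActionValue#(Bit#(16)) get() if (busy && passes == 2);
      busy <= False;
      return r;
   endmethod
endmodule
```

The folded version needs two cycles per result, so it must be clocked twice as fast to match the throughput of the unfolded circuit, which is the tradeoff the slide describes. The register and control logic add some area of their own, so "half" is an idealization.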

L09-12: Combinational IFFT
[Figure: inputs in0 … in63 pass through three stages, each made of 16 Radix-4 nodes followed by a permutation (Permute_1, Permute_2, Permute_3), producing outputs out0 … out63]

L09-13: Radix-4 Node
[Figure: inputs k0 … k3 are multiplied by twiddle factors twid0 … twid3, then combined by adders and subtractors (one path multiplied by j) to produce out0 … out3]

L09-14: Bluespec code: Radix-4 Node
function Tuple4#(Complex, Complex, Complex, Complex)
      radix4(Tuple4#(Complex, Complex, Complex, Complex) twids,
             Complex k0, Complex k1, Complex k2, Complex k3);
   match {.t0, .t1, .t2, .t3} = twids;
   // Multiply each input by its twiddle factor
   Complex m0 = k0 * t0;
   Complex m1 = k1 * t1;
   Complex m2 = k2 * t2;
   Complex m3 = k3 * t3;
   // First combining stage
   Complex y0 = m0 + m2;
   Complex y1 = m0 - m2;
   Complex y2 = m1 + m3;
   Complex y3 = m1 - m3;
   // Multiply y3 by j
   Complex y3_j = Complex {i: negate(y3.q), q: y3.i};
   // Second combining stage
   Complex z0 = y0 + y2;
   Complex z1 = y1 - y3_j;
   Complex z2 = y0 - y2;
   Complex z3 = y1 + y3_j;
   return tuple4(z0, z1, z2, z3);
endfunction
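Expanding the intermediate values gives the four outputs directly in terms of the twiddled inputs. This expansion assumes the fourth output takes the opposite sign on the j term from the second output, as a radix-4 butterfly requires (two outputs with identical expressions would carry no extra information). Here \(m_n = k_n t_n\).

```latex
\begin{align*}
z_0 &= (m_0 + m_2) + (m_1 + m_3) = m_0 + m_1 + m_2 + m_3 \\
z_1 &= (m_0 - m_2) - j\,(m_1 - m_3) = m_0 - j\,m_1 - m_2 + j\,m_3 \\
z_2 &= (m_0 + m_2) - (m_1 + m_3) = m_0 - m_1 + m_2 - m_3 \\
z_3 &= (m_0 - m_2) + j\,(m_1 - m_3) = m_0 + j\,m_1 - m_2 - j\,m_3
\end{align*}
```

The multiplication by \(j\) costs no multiplier: for \(y_3 = a + jb\), \(j\,y_3 = -b + ja\), which is the swap-and-negate performed by `y3_j`.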

L09-15: Bluespec code for pure Combinational Circuit
function SVector#(64, Complex) ifft (SVector#(64, Complex) in_data);
   //Declare vectors
   SVector#(64, Complex) stage12_data = newSVector();
   SVector#(64, Complex) stage12_permuted = newSVector();
   SVector#(64, Complex) stage12_out = newSVector();
   SVector#(64, Complex) stage23_data = newSVector();
   …
   //Radix 4 stage 1 (unpermuted)
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid0 = getTwiddle(0, fromInteger(i));
      match {.y0, .y1, .y2, .y3} = radix4(twid0, in_data[idx], in_data[idx + 1],
                                          in_data[idx + 2], in_data[idx + 3]);
      stage12_data[idx] = y0;
      stage12_data[idx + 1] = y1;
      stage12_data[idx + 2] = y2;
      stage12_data[idx + 3] = y3;
   end
   //Stage 1 permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage12_permuted[i] = stage12_data[permute64_1to2[i]];
   //Continued on next slide…

L09-16: Bluespec code for pure Combinational Circuit (continued)
   // (* continued from previous *)
   stage12_out = stage12_permuted; //Later implementations will change this
   //Radix 4 stage 2 (unpermuted)
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid1 = getTwiddle(1, fromInteger(i));
      match {.y0, .y1, .y2, .y3} = radix4(twid1, stage12_out[idx], stage12_out[idx + 1],
                                          stage12_out[idx + 2], stage12_out[idx + 3]);
      stage23_data[idx] = y0;
      stage23_data[idx + 1] = y1;
      stage23_data[idx + 2] = y2;
      stage23_data[idx + 3] = y3;
   end
   //Stage 2 permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage23_permuted[i] = stage23_data[permute64_2to3[i]];
   …
   //Repeat for Stage 3
   …
   return stage3out_permuted;
endfunction
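The `for` loops in this function are unrolled at compile time (static elaboration): because `i` is an `Integer`, the loops describe wiring, not a runtime counter. The same mechanism can generate the three stages from one parameterized function instead of repeating the code. The sketch below illustrates this under loud assumptions: `Elem`, `stageBody`, and `threeStages` are hypothetical names, the stage body is a placeholder computation rather than the radix-4 logic, and `Vector` replaces the older `SVector` library used on the slides.

```bsv
import Vector::*;

// Hypothetical stand-ins so the sketch is self-contained.
typedef Int#(16) Elem;

// One "stage": a placeholder computation followed by a fixed reordering.
// Both loops and the parameter s are resolved during elaboration.
function Vector#(64, Elem) stageBody(Integer s, Vector#(64, Elem) v);
   Vector#(64, Elem) out = newVector;
   for (Integer i = 0; i < 64; i = i + 1)
      out[i] = v[63 - i] + fromInteger(s);   // reversal stands in for a permutation
   return out;
endfunction

// The loop over s is unrolled by the compiler into three copies of
// stageBody connected in series: a purely combinational circuit.
function Vector#(64, Elem) threeStages(Vector#(64, Elem) v);
   Vector#(64, Elem) cur = v;
   for (Integer s = 0; s < 3; s = s + 1)
      cur = stageBody(s, cur);
   return cur;
endfunction
```

Changing the loop bound or the stage function changes the generated hardware without touching the rest of the design, which is what makes this style useful for architecture exploration.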

L09-17: Pipelined IFFT
[Figure: the same three stages of 16 Radix-4 nodes and permutations (Permute_1, Permute_2, Permute_3) from in0 … in63 to out0 … out63, with registers between stages]
Put a register to hold 64 complex numbers at the output of each stage.
Even more hardware, but the clock can go faster because there is less combinational circuitry between two stages.

L09-18: Bluespec code for Pipeline Stage
module mkIFFT_Pipelined() (I_IFFT);
   //Declare vectors
   SVector#(64, Complex) in_data;
   SVector#(64, Complex) stage12_data = newSVector();
   …
   //Declare FIFOs
   FIFO#(SVector#(64, Complex)) in_fifo <- mkFIFO();
   //Declare pipeline registers
   Reg#(SVector#(64, Complex)) stage12_reg <- mkReg(newSVector());
   Reg#(SVector#(64, Complex)) stage23_reg <- mkReg(newSVector());
   //Read input
   in_data = in_fifo.first();
   //Radix 4 stage 1 (unpermuted)
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid0 = getTwiddle(0, fromInteger(i));
      match {.y0, .y1, .y2, .y3} = radix4(twid0, in_data[idx], in_data[idx + 1],
      //Continue as before…

L09-19: Bluespec code for Pipeline Stage (continued)
   …
   //Read from pipe register for stage 2
   stage12_out = stage12_reg;
   //Radix 4 stage 2 (unpermuted)
   for (Integer i = 0; i < 16; i = i + 1)
   …
   //Read from pipe register for stage 3
   stage23_out = stage23_reg;

   rule writeRegs (True);
      stage12_reg <= stage12_permuted;
      stage23_reg <= stage23_permuted;
      in_fifo.deq();
      out_fifo.enq(stage3out_permuted);
   endrule

   method Action inp (SVector#(64, Complex) data);
      in_fifo.enq(data);
   endmethod
   …
endmodule

L09-20: Circular pipeline: Reusing the Pipeline Stage
[Figure: in0 … in63 feed a single stage of Radix-4 nodes; a Stage Counter selects among Permute_1, Permute_2, and Permute_3 through 64 4-way muxes; outputs out0 … out63]
The 16 Radix-4 nodes can be shared but not the three permutations, hence the need for muxes (64 4-way muxes).

L09-21: Bluespec Code for Circular Pipeline
module mkIFFT_Circular (I_IFFT);
   SVector#(64, Complex) in_data = newSVector();
   SVector#(64, Complex) stage_data = newSVector();
   SVector#(64, Complex) stage_permuted = newSVector();
   //State elements
   Reg#(SVector#(64, Complex)) data_reg <- mkReg(newSVector());
   Reg#(Bit#(2)) stage_counter <- mkReg(0);
   FIFO#(SVector#(64, Complex)) in_fifo <- mkFIFO();
   //Read input
   in_data = data_reg;
   //Perform a single Radix 4 stage (unpermuted)
   for (Integer i = 0; i < 16; i = i + 1)
   begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage_counter, fromInteger(i));
      match {.y0, .y1, .y2, .y3} = radix4(twid, in_data[idx], in_data[idx + 1],
                                          in_data[idx + 2], in_data[idx + 3]);
      stage_data[idx] = y0;
      stage_data[idx + 1] = y1;
      stage_data[idx + 2] = y2;
      stage_data[idx + 3] = y3;
   end
   //Continued…

L09-22: Bluespec Code for Circular Pipeline (continued)
   //Stage permutation
   for (Integer i = 0; i < 64; i = i + 1)
      stage_permuted[i] = case (stage_counter)
                             0: return in_wire._read[i];
                             1: return stage_data[permute64_1to2[i]];
                             2: return stage_data[permute64_2to3[i]];
                             3: return stage_data[permute64_3toOut[i]];
                          endcase;

   rule writeRegs (True);
      data_reg <= stage_permuted;
      stage_counter <= stage_counter + 1;
   endrule

   method Action inp(SVector#(64, Complex) data) if (stage_counter == 0);
      in_fifo.enq(data);
      stage_counter <= 1;
   endmethod
   …
endmodule

L09-23: Just one Radix-4 node!
[Figure: in0 … in63 and out0 … out63 around a single Radix-4 node, with Permute_1, Permute_2, Permute_3]
- Stage Counter: 0 to 2
- Index Counter: 0 to 15
- 64 4-way muxes
- 4 16-way muxes
- 4 16-way demuxes
The two stage registers can be folded into one.

L09-24: Bluespec Code for Extreme Reuse
module mkIFFT_SuperCircular (I_IFFT);
   SVector#(64, Complex) new_post_reg = newSVector();
   //State
   Reg#(SVector#(64, Complex)) data_reg <- mkReg(newSVector());
   Reg#(SVector#(64, Complex)) post_reg <- mkReg(newSVector());
   Reg#(Bit#(2)) stage_counter <- mkReg(0); //0: no value
   Reg#(Bit#(5)) idx_counter <- mkReg(0);   //16: permute
   FIFO#(SVector#(64, Complex)) in_fifo <- mkFIFO();

   let twid = getTwiddle(stage_counter, idx_counter);
   match {.y0, .y1, .y2, .y3} = radix4(twid,
                                       select(in_data, {idx_counter, 2'b00}),
                                       select(in_data, {idx_counter, 2'b01}),
                                       select(in_data, {idx_counter, 2'b10}),
                                       select(in_data, {idx_counter, 2'b11}));
   //Permutation takes post_reg's values back to data_reg
   for (Integer i = 0; i < 64; i = i + 1)
      permutedV[i] = case (stage_counter)
                        1: return post_reg[permute64_1to2[i]];
                        2: return post_reg[permute64_2to3[i]];
                        3: return post_reg[permute64_3toOut[i]];
                        default: return in_fifo.first()[i];
                     endcase;

L09-25: Bluespec Code for Extreme Reuse-2
   rule doRadix (stage_counter != 0);
      if (idx_counter < 16) //We need to calc new radix values
      begin
         //generates new_post_reg value: post_reg after writing in the 4 new values
         let stage_data0 = post_reg;
         let stage_data1 = update(stage_data0, idx, y0);
         let stage_data2 = update(stage_data1, idx + 1, y1);
         let stage_data3 = update(stage_data2, idx + 2, y2);
         new_post_reg = update(stage_data3, idx + 3, y3);
         post_reg <= new_post_reg;
      end
      else //(idx_counter == 16) We need to permute
      begin
         data_reg <= permutedV;
      end
      //We always increment counters
      idx_counter <= (idx_counter == 16) ? 0 : idx_counter + 1;
      if (idx_counter == 16)
         stage_counter <= stage_counter + 1;
   endrule
   //Everything else as before…

L09-26: Synthesis results
Did not have time to synthesize these various designs.
But we have results from a term project from last year: Steve Gerding, Elizabeth Basha & Rose Liu.

L09-27: IFFT Initial Design
[Figure: InputDataQ → 16-Node Stage 1 → 16-Node Stage 2 → 16-Node Stage 3 → OutputDataQ; each Radix4 Node consists of a Twiddle Multiply Stage, Combining Stage 1, and Combining Stage 2]
Area = 29.12 mm²
Cycle Time = 63.18 ns
Throughput = 1 symbol / 63.18 ns
* Steve Gerding, Elizabeth Basha & Rose Liu


L09-29: IFFT Design Exploration 1
[Figure: InputDataQ → Data and Twiddle Setup → 16-Node Stage → OutputDataQ]
Area = 5.19 mm²
Cycle Time = 30.50 ns
Throughput = 1 symbol / (3 × 30.50 ns) = 1 symbol / 91.50 ns
Steve Gerding, Elizabeth Basha & Rose Liu

L09-30: IFFT Design Exploration 2
[Figure: Start, InputDataQ → Data and Twiddle Setup → 16-Node Stage → OutputDataQ]
Area = 4.57 mm²
Cycle Time = 32.89 ns
Throughput = 1 symbol / (3 × 32.89 ns) = 1 symbol / 98.67 ns
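Putting the three reported designs side by side makes the tradeoff explicit. The arithmetic below uses only the figures reported on these slides, assumes the area unit on the earlier results slides is mm² as on this one, and compares against the 4 μs symbol period of 802.11a.

```latex
\begin{align*}
\text{Area ratio (Initial / Exploration 1)} &= \frac{29.12}{5.19} \approx 5.6 \\
\text{Area ratio (Initial / Exploration 2)} &= \frac{29.12}{4.57} \approx 6.4 \\
\text{Time per symbol (Exploration 1 / Initial)} &= \frac{3 \times 30.50\ \text{ns}}{63.18\ \text{ns}} = \frac{91.50}{63.18} \approx 1.45 \\
\text{Time per symbol (Exploration 2 / Initial)} &= \frac{3 \times 32.89\ \text{ns}}{63.18\ \text{ns}} = \frac{98.67}{63.18} \approx 1.56 \\
\text{Slack against the symbol period (Exploration 2)} &= \frac{4000\ \text{ns}}{98.67\ \text{ns}} \approx 40.5
\end{align*}
```

On these numbers, reusing a single 16-node stage cuts area by roughly a factor of six while taking about one and a half times as long per symbol, and even the slowest design is about forty times faster than the 4 μs deadline requires.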