IP Lookup-2: The Completion Buffer. Arvind, Computer Science & Artificial Intelligence Lab, Massachusetts Institute of Technology. October 6, 2009. http://csg.csail.mit.edu/korea (L10-1)



IP-Lookup module without the completion buffer

    module mkIPLookup(IPLookup);
       rule recirculate… ;
       rule exit …;

       method Action enter (IP ip);
          ram.req(ip[31:16]);
          fifo.enq(ip[15:0]);
       endmethod

       method ActionValue#(Msg) getResult();
          outQ.deq();
          return outQ.first();
       endmethod
    endmodule

Figure: circular pipeline with enter, RAM, fifo, the done? test, outQ and getResult.

The packets may come out of order.
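The out-of-order departures can be seen in a small behavioural model. The sketch below is written in Python rather than BSV and rests on several assumptions that are not in the slides: a fixed RAM latency (`ram_latency`), a single RAM request port shared by enter and recirculate, and packets described only by how many RAM accesses they need. The function name `run_without_cbuf` is made up for illustration.

```python
from collections import deque

def run_without_cbuf(lookups_needed, ram_latency=2):
    """Return packet ids in departure order.

    lookups_needed[k] is the number of RAM accesses packet k needs.
    """
    pending = deque(enumerate(lookups_needed))
    in_flight = deque()   # (cycle the RAM response is ready, packet id, accesses left)
    departures = []
    cycle = 0
    while pending or in_flight:
        ram_port_free = True
        if in_flight and in_flight[0][0] <= cycle:
            _, pid, left = in_flight.popleft()
            if left == 1:
                departures.append(pid)            # rule exit
            else:
                # rule recirculate: issue the next RAM request
                in_flight.append((cycle + ram_latency, pid, left - 1))
                ram_port_free = False
        if ram_port_free and pending:
            pid, left = pending.popleft()         # method enter
            in_flight.append((cycle + ram_latency, pid, left))
        cycle += 1
    return departures
```

With `run_without_cbuf([3, 1, 2])`, packet 1 needs a single access and leaves before packet 0, which entered first.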

IP Lookup rules

    rule recirculate (!isLeaf(ram.peek()));
       IP rip = fifo.first();
       fifo.enq(rip << 8);
       ram.req(ram.peek() + rip[15:8]);
       fifo.deq();
       ram.deq();
    endrule

    rule exit (isLeaf(ram.peek()));
       outQ.enq(ram.peek());
       fifo.deq();
       ram.deq();
    endrule

Method enter and rule exit can be scheduled simultaneously, assuming fifo.enq and fifo.deq can be done simultaneously.
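Stripped of the pipelining, enter and the two rules compute a multi-level table lookup: the top 16 address bits index the first level, and each further level is indexed by a pointer plus the next 8 bits. A minimal Python sketch of that computation follows. The memory layout (a dict mapping an address to `("leaf", result)` or `("ptr", base)`) and the name `lookup` are assumptions made for illustration, not the lecture's data structure.

```python
def lookup(mem, ip):
    """Walk the lookup table for a 32-bit address ip.

    mem maps an address to ("leaf", result) or ("ptr", base).
    """
    entry = mem[ip >> 16]          # enter: ram.req(ip[31:16])
    rip = ip & 0xFFFF              # remaining 16 bits, kept in the fifo
    while entry[0] != "leaf":      # rule recirculate
        entry = mem[entry[1] + (rip >> 8)]   # pointer + rip[15:8]
        rip = (rip << 8) & 0xFFFF            # rip << 8
    return entry[1]                # rule exit
```

For example, with `mem = {0x0A00: ("leaf", "A"), 0x0A01: ("ptr", 0x10000), 0x10002: ("leaf", "B")}`, the address `0x0A011234` does not match any entry at the second level, whereas `0x0A010299` resolves to `"B"` after one recirculation.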

IP-Lookup module with the completion buffer

Figure: circular pipeline with enter, RAM, fifo, the done? test (yes/no), outQ, cbuf, getToken and getResult.

Completion buffer
- ensures that departures take place in order even if lookups complete out-of-order
- gives out tokens to control the entry into the circular pipeline

The fifo now must also hold the “token” while the memory access is in progress: Tuple2#(Token,Bit#(16)) remainingIP

Completion buffer: Interface

    interface CBuffer#(type t);
       method ActionValue#(Token) getToken();
       method Action put(Token tok, t d);
       method ActionValue#(t) getResult();
    endinterface

    typedef Bit#(TLog#(n)) TokenN#(numeric type n);
    typedef TokenN#(16) Token;

Figure: cbuf with its three methods: getToken, put (result & token) and getResult.
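The token type is sized by TLog, the ceiling of the base-2 logarithm, so 16 slots need a 4-bit token. The same arithmetic can be sketched in Python; the helper name `token_bits` is made up for illustration.

```python
def token_bits(n_slots):
    """Bits needed to name n_slots distinct tokens, i.e. ceil(log2(n_slots))."""
    if n_slots < 1:
        raise ValueError("need at least one slot")
    # (n - 1).bit_length() equals ceil(log2(n)) for every n >= 1
    return (n_slots - 1).bit_length()
```

Under this reading, `token_bits(16)` gives the width of `Token` as defined above.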

Completion buffer: Implementation

    module mkCBuffer (CBuffer#(t))
       provisos (Bits#(t,sz));   // Elements must be representable as bits

       RegFile#(Token, Maybe#(t)) buf <- mkRegFileFull();
       Reg#(Token) i <- mkReg(0);      //input index
       Reg#(Token) o <- mkReg(0);      //output index
       Reg#(Int#(32)) cnt <- mkReg(0); //number of filled slots
       …

Figure: buf drawn as a row of slots marked I or V, with the pointers i and o and the counter cnt.

- A circular buffer with two pointers i and o, and a counter cnt
- Elements are of Maybe type

Completion buffer: Implementation

    // state elements
    // buf, i, o, cnt...

    method ActionValue#(Token) getToken() if (cnt < maxToken);
       cnt <= cnt + 1;
       i <= (i == maxToken) ? 0 : i + 1;
       buf.upd(i, Invalid);
       return i;
    endmethod

    method Action put(Token tok, t data);
       buf.upd(tok, Valid data);
    endmethod

    method ActionValue#(t) getResult() if (cnt > 0 &&& buf.sub(o) matches tagged Valid .x);
       o <= (o == maxToken) ? 0 : o + 1;
       cnt <= cnt - 1;
       return x;
    endmethod

Figure: buf drawn as a row of slots marked I or V, with the pointers i and o and the counter cnt.

Can these methods execute concurrently? Does it matter?
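The one-method-at-a-time behaviour of these three methods can be modelled in a few lines. The sketch below is a behavioural model in Python, not BSV. As assumptions of the sketch, a guard that is false raises an exception, Invalid is modelled as `None`, Valid data as a one-element tuple, and the full test is `cnt < size` rather than the slide's `cnt < maxToken`.

```python
class CompletionBuffer:
    """Sequential model: one method call at a time."""

    def __init__(self, size=16):
        self.size = size
        self.buf = [None] * size   # None models Invalid
        self.i = 0                 # input index
        self.o = 0                 # output index
        self.cnt = 0               # number of slots handed out

    def can_get_token(self):
        return self.cnt < self.size

    def get_token(self):
        if not self.can_get_token():
            raise RuntimeError("guard of getToken is false")
        tok = self.i
        self.buf[tok] = None                 # buf.upd(i, Invalid)
        self.i = (self.i + 1) % self.size
        self.cnt += 1
        return tok

    def put(self, tok, data):
        self.buf[tok] = (data,)              # buf.upd(tok, Valid data)

    def can_get_result(self):
        return self.cnt > 0 and self.buf[self.o] is not None

    def get_result(self):
        if not self.can_get_result():
            raise RuntimeError("guard of getResult is false")
        (data,) = self.buf[self.o]
        self.o = (self.o + 1) % self.size
        self.cnt -= 1
        return data
```

If two tokens are taken and the second one is filled first, `can_get_result()` stays false until the first token's data arrives, which is what keeps departures in order.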

IP-Lookup module with the completion buffer

    module mkIPLookup(IPLookup);
       rule recirculate… ;
       rule exit …;

       method Action enter (IP ip);
          Token tok <- cbuf.getToken();
          ram.req(ip[31:16]);
          fifo.enq(tuple2(tok,ip[15:0]));
       endmethod

       method ActionValue#(Msg) getResult();
          let result <- cbuf.getResult();
          return result;
       endmethod
    endmodule

Figure: circular pipeline with enter, RAM, fifo, the done? test (yes/no), cbuf, getToken and getResult.

For enter and getResult to execute simultaneously, cbuf.getToken and cbuf.getResult must execute simultaneously.

IP Lookup rules with completion buffer

    rule recirculate (!isLeaf(ram.peek()));
       match {.tok, .rip} = fifo.first();
       fifo.enq(tuple2(tok, (rip << 8)));
       ram.req(ram.peek() + rip[15:8]);
       fifo.deq();
       ram.deq();
    endrule

    rule exit (isLeaf(ram.peek()));
       Token tok = tpl_1(fifo.first());   // put needs the token held in the fifo
       cbuf.put(tok, ram.peek());
       fifo.deq();
       ram.deq();
    endrule

For rule exit and method enter to execute simultaneously, cbuf.put and cbuf.getToken must execute simultaneously.

For no dead cycles, cbuf.getToken, cbuf.put and cbuf.getResult must be able to execute simultaneously.
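To see the effect of routing results through the completion buffer, the pipeline can be modelled with the buffer inlined. This is a behavioural sketch in Python, not BSV, and it assumes a fixed RAM latency, a single RAM request port, and packets described only by the number of RAM accesses they need. The name `run_with_cbuf` and the parameters `ram_latency` and `slots` are made up for illustration.

```python
from collections import deque

def run_with_cbuf(lookups_needed, ram_latency=2, slots=4):
    """Return packet ids in departure order when results pass through cbuf."""
    pending = deque(enumerate(lookups_needed))
    in_flight = deque()        # (ready cycle, token, packet id, accesses left)
    buf = [None] * slots       # completion buffer slots; None models Invalid
    i = o = cnt = 0
    departures = []
    cycle = 0
    while pending or in_flight or cnt:
        # getResult: drain the oldest slot once it holds a result
        if cnt and buf[o] is not None:
            departures.append(buf[o])
            o = (o + 1) % slots
            cnt -= 1
        ram_port_free = True
        if in_flight and in_flight[0][0] <= cycle:
            _, tok, pid, left = in_flight.popleft()
            if left == 1:
                buf[tok] = pid                    # rule exit: cbuf.put(tok, result)
            else:
                in_flight.append((cycle + ram_latency, tok, pid, left - 1))
                ram_port_free = False             # rule recirculate used the port
        if ram_port_free and pending and cnt < slots:
            pid, left = pending.popleft()         # method enter
            tok = i                               # cbuf.getToken()
            buf[tok] = None
            i = (i + 1) % slots
            cnt += 1
            in_flight.append((cycle + ram_latency, tok, pid, left))
        cycle += 1
    return departures
```

Lookups still complete out of order inside the loop, but the slots are drained strictly in the order the tokens were handed out, so the departure order matches the arrival order.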

Completion buffer: Concurrency Issue

    // state elements
    // buf, i, o, cnt...

    method ActionValue#(Token) getToken() if (cnt < maxToken);
       cnt <= cnt + 1;
       i <= (i == maxToken) ? 0 : i + 1;
       buf.upd(i, Invalid);
       return i;
    endmethod

    method Action put(Token tok, t data);
       buf.upd(tok, Valid data);
    endmethod

    method ActionValue#(t) getResult() if (cnt > 0 &&& buf.sub(o) matches tagged Valid .x);
       o <= (o == maxToken) ? 0 : o + 1;
       cnt <= cnt - 1;
       return x;
    endmethod

Figure: buf drawn as a row of slots marked I or V, with the pointers i and o and the counter cnt.

Can these methods execute concurrently? NO!

Concurrency Analysis

Figure: buf drawn as a row of slots marked I or V, with the pointers i and o and the counter cnt.

- A circular buffer with two pointers i and o, and a counter cnt
- Elements are of Maybe type
- buf must allow two simultaneous updates and one read
- It is possible to design such a buf because the updates are always to different addresses, because of the way buf is used
  - Use a vector of registers
- No compiler can detect that

Adding Concurrent methods to the completion buffer

Assume: the only tokens passed to put are those given out by the completion buffer.

Can we reach a state where two methods write on the same data? No.
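The claim can be checked empirically with a randomized model in which any subset of the three methods may fire in a cycle. The Python sketch below is an illustration, not a proof. It assumes that put is only ever called once per outstanding token, and that guards are evaluated on the state at the start of the cycle. The name `simulate_writes` is made up.

```python
import random

def simulate_writes(slots=4, cycles=2000, seed=0):
    """Return (cycles in which put and getToken both fired, write collisions)."""
    rng = random.Random(seed)
    i = o = cnt = 0
    valid = [False] * slots
    outstanding = []            # tokens handed out but not yet passed to put
    both_fired = collisions = 0
    for _ in range(cycles):
        # Guards are evaluated on the state at the start of the cycle.
        do_put = bool(outstanding) and rng.random() < 0.5
        do_token = cnt < slots and rng.random() < 0.5
        do_result = cnt > 0 and valid[o] and rng.random() < 0.5
        written = []            # slots of buf written in this cycle
        if do_put:
            tok = rng.choice(outstanding)   # lookups may finish in any order
            outstanding.remove(tok)
            valid[tok] = True
            written.append(tok)
        if do_token:
            valid[i] = False
            written.append(i)
            outstanding.append(i)
            i = (i + 1) % slots
        if do_result:
            o = (o + 1) % slots
        cnt += int(do_token) - int(do_result)
        both_fired += do_put and do_token
        collisions += len(written) != len(set(written))
    return both_fired, collisions
```

The reason no collision shows up is that put writes a slot between o and i that has already been handed out, while getToken writes slot i, which has not.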

Bypass Registers

- For concurrency, we want to write a register logically earlier in the cycle than where it is read
- For correctness, we must bypass the written value to the reads
- Build a BypassRegister from RWire

write < read
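The difference between the two orderings can be shown with a small cycle-level model. This is a Python sketch, not the RWire-based BSV module; the class names `OrdinaryReg` and `BypassReg` and the explicit `tick()` clock edge are assumptions made for illustration.

```python
_NO_WRITE = object()   # marks "no write happened in this cycle"

class OrdinaryReg:
    """read < write: a read always sees the value from the start of the cycle."""

    def __init__(self, init):
        self.value = init
        self.pending = _NO_WRITE

    def read(self):
        return self.value

    def write(self, v):
        self.pending = v

    def tick(self):
        """Clock edge: commit the pending write, if any."""
        if self.pending is not _NO_WRITE:
            self.value = self.pending
            self.pending = _NO_WRITE

class BypassReg(OrdinaryReg):
    """write < read: a write in the same cycle is forwarded to the read."""

    def read(self):
        if self.pending is not _NO_WRITE:
            return self.pending      # the bypass path
        return self.value
```

After `write(5)` in the same cycle, `OrdinaryReg.read()` still returns the old value while `BypassReg.read()` returns 5; after `tick()` both return 5.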

Configuration Registers

- Sometimes the higher-level semantics say that the bypass is not necessary
  - The bypass condition never arises
- In such cases, an ordinary register behaves like a bypass register
  - But the concurrency analysis prohibits the bypass orderings
- Configuration Registers are ordinary registers with CF (conflict-free) reads and writes
  - Allows the bypass ordering

Concurrent Completion buffer: Implementation

    module mkCBuffer (CBuffer#(t))
       provisos (Bits#(t,sz));

       Vector#(TokenSz, Reg#(Maybe#(t))) buf <- replicateM(mkConfigReg(Invalid));
       Reg#(Token) i <- mkReg(0);         //input index
       Reg#(Token) o <- mkReg(0);         //output index
       Counter#(32) cnt <- mkCounter(0);  //number of filled slots
       …

Figure: buf drawn as a row of slots marked I or V, with the pointers i and o and the counter cnt.

- A circular buffer with two pointers i and o, and a counter cnt
- Elements are of Maybe type

A special counter module

- We often need to keep count of certain events
  - Need to read count, decrement and increment
  - Since a simultaneous decrement and increment leave the count unchanged, we can remove some bypassing links
- Implemented as Counter Library modules (implemented using RWires)
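A cycle-level model makes the intended behaviour concrete: read, inc and dec may all be called in one cycle, and the net change is applied at the clock edge. This Python sketch is not the library module; the `tick()` method and the choice that `read()` returns the start-of-cycle value are assumptions of the sketch.

```python
class Counter:
    """Counter whose read, inc and dec can all be used in the same cycle."""

    def __init__(self, init=0):
        self.value = init
        self._inc = False    # models the wire set by inc()
        self._dec = False    # models the wire set by dec()

    def read(self):
        return self.value    # value at the start of the cycle

    def inc(self):
        self._inc = True

    def dec(self):
        self._dec = True

    def tick(self):
        """Clock edge: apply the net effect of this cycle's inc and dec."""
        self.value += int(self._inc) - int(self._dec)
        self._inc = self._dec = False
```

Calling both `inc()` and `dec()` before `tick()` leaves the value as it was, which is the case that arises when getToken and getResult fire together.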

Completion buffer: Concurrency Issue

    // state elements
    // buf, i, o, cnt...

    method ActionValue#(Token) getToken() if (cnt.read() < maxToken);
       cnt.inc();
       i <= (i == maxToken) ? 0 : i + 1;
       buf[i] <= Invalid;
       return i;
    endmethod

    method Action put(Token tok, t data);
       buf[tok] <= Valid data;
    endmethod

    method ActionValue#(t) getResult() if (cnt.read() > 0 &&& buf[o] matches tagged Valid .x);
       o <= (o == maxToken) ? 0 : o + 1;
       cnt.dec();
       return x;
    endmethod

Figure: buf drawn as a row of slots marked I or V, with the pointers i and o and the counter cnt.

Can these methods execute concurrently? Yes!
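What firing all three methods in one cycle means can be modelled as follows. This is a behavioural Python sketch, not BSV. It assumes that guards and reads use the state at the start of the cycle, that all writes take effect together at the end of it, and that the full test is `cnt < slots` rather than the slide's `cnt.read() < maxToken`. The class name `ConcurrentCBuffer` and the `cycle()` method are made up for illustration.

```python
class ConcurrentCBuffer:
    """Completion buffer model in which all three methods may fire per cycle."""

    def __init__(self, slots=4):
        self.slots = slots
        self.buf = [None] * slots   # None = Invalid, (data,) = Valid data
        self.i = self.o = self.cnt = 0

    def cycle(self, want_token=False, put=None, want_result=False):
        """Fire any subset of the methods; return (token or None, result or None).

        put is None or a (token, data) pair.
        """
        # Guards and reads use the state at the start of the cycle.
        token = self.i if (want_token and self.cnt < self.slots) else None
        head = self.buf[self.o]
        got = bool(want_result) and self.cnt > 0 and head is not None
        result = head[0] if got else None
        # All updates take effect together at the clock edge.
        if put is not None:
            tok, data = put
            self.buf[tok] = (data,)
        if token is not None:
            self.buf[token] = None
            self.i = (self.i + 1) % self.slots
        if got:
            self.o = (self.o + 1) % self.slots
        self.cnt += int(token is not None) - int(got)
        return token, result
```

A call such as `cb.cycle(want_token=True, put=(tok, data), want_result=True)` corresponds to method enter, rule exit and method getResult of the lookup module all firing in the same cycle.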

Longest Prefix Match for IP lookup: 3 possible implementation architectures

- Rigid pipeline: inefficient memory usage but simple design
- Linear pipeline: efficient memory usage through memory port replicator
- Circular pipeline: efficient memory with most complex control

Which is “best”?

Arvind, Nikhil, Rosenband & Dave, ICCAD 2004

Implementations of Static pipelines
Two designers, two results

Table: LPM versions Static V (Replicated FSMs) and Static V (Single FSM), compared on Best Area (gates) and Best Speed (ns).

Figure, Replicated: IP addr, MUX / De-MUX, FSMs, Counter, RAM, result.
Figure, Shared FSM: IP addr, MUX, FSM, RAM, result.

BEST: Each packet is processed by one FSM

Synthesis results

LPM versions   Code size (lines)   Best Area (gates)   Best Speed (ns)      Mem. util. (random workload)
Static V       …                   …                   …                    …%
Static BSV     …                   … (5% larger)       3.32 (7% faster)     63.5%
Linear V       …                   …                   …                    …%
Linear BSV     …                   … (8% larger)       4.7 (same)           99.9%
Circular V     …                   …                   …                    …%
Circular BSV   …                   … (1% larger)       3.67 (2% slower)     99.9%

V = Verilog; BSV = Bluespec System Verilog
Synthesis: TSMC 0.18 µm lib

- Bluespec results can match carefully coded Verilog
- Micro-architecture has a dramatic impact on performance
- Architecture differences are much more important than language differences in determining QoR