IP Lookup Arvind Computer Science & Artificial Intelligence Lab

Slides:

Advertisements

Similar presentations

March 2007http://csg.csail.mit.edu/arvindSemantics-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Advertisements

Stmt FSM Richard S. Uhler Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology (based on a lecture prepared by Arvind)

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

February 21, 2007http://csg.csail.mit.edu/6.375/L07-1 Bluespec-4: Architectural exploration using IP lookup Arvind Computer Science & Artificial Intelligence.

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L06-1.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 4, 2013

February 22, 2005http://csg.csail.mit.edu/6.884/L07-1 Bluespec-1: Design Affects Everything Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February 20, 2013http://csg.csail.mit.edu/6.375L05-1.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

September 3, 2009L02-1http://csg.csail.mit.edu/korea Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Constructive Computer Architecture: Guards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September 24, 2014.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture Sequential Circuits - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Constructive Computer Architecture Store Buffers and Non-blocking Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

October 6, 2009http://csg.csail.mit.edu/koreaL10-1 IP Lookup-2: The Completion Buffer Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Elastic Pipelines: Concurrency Issues

Bluespec-3: A non-pipelined processor Arvind

Bluespec-6: Modeling Processors

Folded “Combinational” circuits

Scheduling Constraints on Interface methods

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Caches-2 Constructive Computer Architecture Arvind

IP Lookup: Some subtle concurrency issues

Stmt FSM Arvind (with the help of Nirav Dave)

Performance Specifications

Pipelining combinational circuits

Multirule Systems and Concurrent Execution of Rules

Bluespec-1: Design Affects Everything

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

Pipelining combinational circuits

EHR: Ephemeral History Register

Bluespec-4: Architectural exploration using IP lookup Arvind

Constructive Computer Architecture: Well formed BSV programs

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Modeling Processors: Concurrency Issues

Modules with Guarded Interfaces

Pipelining combinational circuits

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Elastic Pipelines: Concurrency Issues

Bluespec-3: A non-pipelined processor Arvind

Multirule systems and Concurrent Execution of Rules

Stmt FSM Arvind (with the help of Nirav Dave)

Modular Refinement - 2 Arvind

IP Lookup: Some subtle concurrency issues

Elastic Pipelines: Concurrency Issues

Elastic Pipelines and Basics of Multi-rule Systems

Constructive Computer Architecture: Guards

Elastic Pipelines and Basics of Multi-rule Systems

Multirule systems and Concurrent Execution of Rules

Introduction to Bluespec: A new methodology for designing Hardware

IP Lookup: Some subtle concurrency issues

Bluespec-5: Scheduling & Rule Composition

Implementing for Correct Concurrency

Constructive Computer Architecture: Well formed BSV programs

Caches-2 Constructive Computer Architecture Arvind

Presentation transcript:

IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 17, 2009 http://csg.csail.mit.edu/arvind

IP Lookup block in a router Queue Manager Packet Processor Exit functions Control Processor Line Card (LC) IP Lookup SRAM (lookup table) Arbitration Switch LC The design example you’ll be seeing is from an IP switch. We’ll be looking at the IP lookup function that you typically find on the ingress port of a line card. A packet is routed based on the “Longest Prefix Match” (LPM) of it’s IP address with entries in a routing table Line rate and the order of arrival must be maintained line rate  15Mpps for 10GE http://csg.csail.mit.edu/arvind February 17, 2009

Sparse tree representation F … A … 7 A … B F … F * E 5.*.*.* D 10.18.200.5 C 10.18.200.* B 7.14.7.3 A 7.14.*.* 14 3 5 E F … F 7 F … 10 F … F … 18 255 F … F … IP address Result M Ref 7.13.7.3 F 10.18.201.5 7.14.7.2 5.13.7.2 E 10.18.200.7 C C … 5 D 200 2 F … 3 4 A In this lecture: Level 1: 16 bits Level 2: 8 bits Level 3: 8 bits  1 to 3 memory accesses 1 4 http://csg.csail.mit.edu/arvind February 17, 2009

“C” version of LPM int lpm (IPA ipa) /* 3 memory lookups */ { int p; /* Level 1: 16 bits */ p = RAM [ipa[31:16]]; if (isLeaf(p)) return value(p); /* Level 2: 8 bits */ p = RAM [ptr(p) + ipa [15:8]]; /* Level 3: 8 bits */ p = RAM [ptr(p) + ipa [7:0]]; return value(p); /* must be a leaf */ } … 216 -1 28 -1 Not obvious from the C code how to deal with - memory latency - pipelining Here’s a longest prefix match algorithm in software. It’s pretty simple, up to three lookups of different parts of the IP address. (Click now) Can you visualize how to implement this in HW? It’s not obvious. Memory latency ~30ns to 40ns Must process a packet every 1/15 ms or 67 ns Must sustain 3 memory dependent lookups in 67 ns http://csg.csail.mit.edu/arvind February 17, 2009

Longest Prefix Match for IP lookup: 3 possible implementation architectures Rigid pipeline Inefficient memory usage but simple design Linear pipeline Efficient memory usage through memory port replicator Circular pipeline Efficient memory with most complex control So, let’s look at three different approaches: Rigid pipeline Linear pipeline Circular pipeline They all appear at a high-level to have different advantages, but… 1 2 3 Designer’s Ranking: Which is “best”? http://csg.csail.mit.edu/arvind February 17, 2009 Arvind, Nikhil, Rosenband & Dave ICCAD 2004

Circular pipeline enter? done? RAM yes inQ fifo no outQ The fifo holds the request while the memory access is in progress The architecture has been simplified for the sake of the lecture. Otherwise, a “completion buffer” has to be added at the exit to make sure that packets leave in order. http://csg.csail.mit.edu/arvind February 17, 2009

FIFO n = # of bits needed to represent a value of type t interface FIFO#(type t); method Action enq(t x); // enqueue an item method Action deq(); // remove oldest entry method t first(); // inspect oldest item endinterface n enab enq rdy not full n = # of bits needed to represent a value of type t enab rdy deq module FIFO not empty n first not empty rdy http://csg.csail.mit.edu/arvind February 17, 2009

Request-Response Interface for Synchronous Memory Data Ack Ready req deq peek Addr Ready ctr (ctr > 0) ctr++ ctr-- deq Enable enq Synch Mem Latency N interface Mem#(type addrT, type dataT); method Action req(addrT x); method Action deq(); method dataT peek(); endinterface Making a synchronous component latency- insensitive http://csg.csail.mit.edu/arvind February 17, 2009

Circular Pipeline Code enter? done? RAM inQ fifo rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule done? Is the same as isLeaf rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq(); endrule When can enter fire? inQ has an element and ram & fifo each has space http://csg.csail.mit.edu/arvind February 17, 2009

Circular Pipeline Code: discussion enter? done? RAM inQ fifo rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq(); endrule When can recirculate fire? ram & fifo each has an element and ram, fifo & outQ each has space Is this possible? http://csg.csail.mit.edu/arvind February 17, 2009

One Element FIFO We can build such a FIFO more on this later module mkFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); method Action enq(t x) if (!full); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; method t first() if (full); return (data); method Action clear(); endmodule enq and deq cannot even be enabled together much less fire concurrently! n not empty not full rdy enab enq deq module FIFO The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally We can build such a FIFO more on this later http://csg.csail.mit.edu/arvind February 17, 2009

Dead cycles rule enter (True); IP ip = inQ.first(); done? RAM inQ fifo rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule assume simultaneous enq & deq is allowed rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq(); endrule Can a new request enter the system when an old one is leaving? Is this worth worrying about? http://csg.csail.mit.edu/arvind February 17, 2009

The Effect of Dead Cycles enter done? RAM yes in fifo no Circular Pipeline RAM takes several cycles to respond to a request Each IP request generates 1-3 RAM requests FIFO entries hold base pointer for next lookup and unprocessed part of the IP address What is the performance loss if “exit” and “enter” don’t ever happen in the same cycle? >33% slowdown! Unacceptable http://csg.csail.mit.edu/arvind February 17, 2009

The compiler issue Can the compiler detect all the conflicting conditions? Important for correctness Does the compiler detect conflicts that do not exist in reality? False positives lower the performance The main reason is that sometimes the compiler cannot detect under what conditions the two rules are mutually exclusive or conflict free What can the user specify easily? Rule priorities to resolve nondeterministic choice yes yes In many situations the correctness of the design is not enough; the design is not done unless the performance goals are met http://csg.csail.mit.edu/arvind February 17, 2009

Scheduling conflicting rules When two rules conflict on a shared resource, they cannot both execute in the same clock The compiler produces logic that ensures that, when both rules are applicable, only one will fire Which one? source annotations (* descending_urgency = “recirculate, enter” *) http://csg.csail.mit.edu/arvind February 17, 2009

So is there a dead cycle? rule enter (True); IP ip = inQ.first(); done? RAM inQ fifo rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq(); endrule In general these two rules conflict but when isLeaf(p) is true there is no apparent conflict! http://csg.csail.mit.edu/arvind February 17, 2009

Rule Spliting rule foo (True); if (p) r1 <= 5; else r2 <= 7; endrule rule fooT (p); r1 <= 5; endrule rule fooF (!p); r2 <= 7;  rule fooT and fooF can be scheduled independently with some other rule http://csg.csail.mit.edu/arvind February 17, 2009

Spliting the recirculate rule rule recirculate (!isLeaf(ram.peek())); IP rip = fifo.first(); fifo.enq(rip << 8); ram.req(ram.peek() + rip[15:8]); fifo.deq(); ram.deq(); endrule rule exit (isLeaf(ram.peek())); outQ.enq(ram.peek()); fifo.deq(); ram.deq(); endrule rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(ip[15:0]); inQ.deq(); endrule Now rules enter and exit can be scheduled simultaneously, assuming fifo.enq and fifo.deq can be done simultaneously http://csg.csail.mit.edu/arvind February 17, 2009

Back to the fifo problem module mkFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); method Action enq(t x) if (!full); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; method t first() if (full); return (data); method Action clear(); endmodule The functionality we want is as if deq “happens” before enq; if deq does not happen then enq behaves normally n not empty not full rdy enab enq deq module FIFO http://csg.csail.mit.edu/arvind February 17, 2009

RWire to rescue interface RWire#(type t); method Action wset(t x); method Maybe#(t) wget(); endinterface Like a register in that you can read and write it but unlike a register - read happens after write - data disappears in the next cycle RWires can break the atomicity of a rule if not used properly http://csg.csail.mit.edu/arvind February 17, 2009

One Element “Loopy” FIFO module mkLFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); RWire#(void) deqEN <- mkRWire(); method Action enq(t x) if (!full || isValid (deqEN.wget())); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; deqEN.wset(?); method t first() if (full); return (data); method Action clear(); full <= False; endmodule This works correctly in both cases (fifo full and fifo empty). not empty not full rdy enab enq deq module FIFO !full or http://csg.csail.mit.edu/arvind February 17, 2009

Problem solved! LFIFO fifo <- mkLFIFO; // use a loopy fifo rule recirculate (True); TableEntry p = ram.peek(); ram.deq(); IP rip = fifo.first(); if (isLeaf(p)) outQ.enq(p); else begin fifo.enq(rip << 8); ram.req(p + rip[15:8]); end fifo.deq(); endrule RWire has been safely encapsulated inside the Loopy FIFO – users of Loopy fifo need not be aware of RWires http://csg.csail.mit.edu/arvind February 17, 2009

Packaging a module: Turning a rule into a method inQ outQ enter? RAM done? fifo rule enter (True); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(p[15:0]); inQ.deq(); endrule method Action enter (IP ip); ram.req(ip[31:16]); fifo.enq(ip[15:0]); endmethod Similarly a method can be written to extract elements from the outQ http://csg.csail.mit.edu/arvind February 17, 2009

Circular pipeline with Completion Buffer getToken luResp cbuf inQ yes luReq enter? RAM done? no fifo Completion buffer - gives out tokens to control the entry into the circular pipeline - ensures that departures take place in order even if lookups complete out-of-order The fifo holds the token while the memory access is in progress: Tuple2#(Bit#(16), Token) http://csg.csail.mit.edu/arvind remainingIP February 17, 2009

Circular Pipeline Code with Completion Buffer enter? done? RAM cbuf inQ fifo rule enter (True); Token tok <- cbuf.getToken(); IP ip = inQ.first(); ram.req(ip[31:16]); fifo.enq(tuple2(ip[15:0], tok)); inQ.deq(); endrule rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .tok} = fifo.first(); if (isLeaf(p)) cbuf.put(tok, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+rip[15:8]); end fifo.deq(); endrule http://csg.csail.mit.edu/arvind February 17, 2009

Completion buffer i o cnt buf Elements must be representable as bits V cnt i o buf interface CBuffer#(type t); method ActionValue#(Token) getToken(); method Action put(Token tok, t d); method ActionValue#(t) getResult(); endinterface typedef Bit#(TLog#(n)) TokenN#(numeric type n); typedef TokenN#(16) Token; module mkCBuffer (CBuffer#(t)) provisos (Bits#(t,sz)); RegFile#(Token, Maybe#(t)) buf <- mkRegFileFull(); Reg#(Token) i <- mkReg(0); //input index Reg#(Token) o <- mkReg(0); //output index Reg#(Int#(32)) cnt <- mkReg(0); //number of filled slots … Elements must be representable as bits http://csg.csail.mit.edu/arvind February 17, 2009

Completion buffer i o // state elements // buf, i, o, n ... cnt V cnt i o buf Completion buffer // state elements // buf, i, o, n ... method ActionValue#(t) getToken() if (cnt < maxToken); cnt <= cnt + 1; i <= i + 1; buf.upd(i, Invalid); return i; endmethod method Action put(Token tok, t data); return buf.upd(tok, Valid data); endmethod method ActionValue#(t) getResult() if (cnt > 0) &&& (buf.sub(o) matches tagged (Valid .x)); o <= o + 1; cnt <= cnt - 1; return x; endmethod Home work: Think about concurrency Issues, i.e., can these methods be executed concurrently? Do they need to? http://csg.csail.mit.edu/arvind February 17, 2009

Longest Prefix Match for IP lookup: 3 possible implementation architectures Rigid pipeline Inefficient memory usage but simple design Linear pipeline Efficient memory usage through memory port replicator Circular pipeline Efficient memory with most complex control So, let’s look at three different approaches: Rigid pipeline Linear pipeline Circular pipeline They all appear at a high-level to have different advantages, but… Which is “best”? http://csg.csail.mit.edu/arvind Arvind, Nikhil, Rosenband & Dave ICCAD 2004 February 17, 2009

Implementations of Static pipelines Two designers, two results LPM versions Best Area (gates) Best Speed (ns) Static V (Replicated FSMs) 8898 3.60 Static V (Single FSM) 2271 3.56 Replicated: RAM FSM MUX / De-MUX Counter result IP addr FSM RAM MUX result IP addr BEST: Each packet is processed by one FSM Shared FSM All three architectures, in Verilog (V) and in Bluespec (BSV). BSV designs done in 3 days, with same speed/area as hand coded RTL (V). – REVIEW Static BS; Initial; 3.3ns 252 1719 105% 3375 41% 3.69 11% Static BS; Data alignment; 3.3ns 252 1038 24% 2606 9% 3.48 5% Static BS; No type conversion; 3.3ns 252 948 13% 2478 4% 3.61 9% Static BS; Nest case (BEST); 3.3ns 252 837 0% 2391 0% 3.32 0% Static Verilog; Replicated; 3.3ns 243 7151 775% 8898 292% 3.60 1% Static Verilog; BEST; 3.3ns 240 818 0% 2271 0% 3.56 0% http://csg.csail.mit.edu/arvind February 17, 2009

Synthesis results - Bluespec results can match carefully coded Verilog LPM versions Code size (lines) Best Area (gates) Best Speed (ns) Mem. util. (random workload) Static V 220 2271 3.56 63.5% Static BSV 179 2391 (5% larger) 3.32 (7% faster) Linear V 410 14759 4.7 99.9% Linear BSV 168 15910 (8% larger) 4.7 (same) Circular V 364 8103 3.62 Circular BSV 257 8170 (1% larger) 3.67 (2% slower) Synthesis: TSMC 0.18 µm lib - Bluespec results can match carefully coded Verilog - Micro-architecture has a dramatic impact on performance - Architecture differences are much more important than language differences in determining QoR http://csg.csail.mit.edu/arvind February 17, 2009 V = Verilog;BSV = Bluespec System Verilog