Bluespec-4: Architectural exploration using IP lookup Arvind

Slides:

Advertisements

Similar presentations

March 2007http://csg.csail.mit.edu/arvindSemantics-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Advertisements

Stmt FSM Richard S. Uhler Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology (based on a lecture prepared by Arvind)

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

February 21, 2007http://csg.csail.mit.edu/6.375/L07-1 Bluespec-4: Architectural exploration using IP lookup Arvind Computer Science & Artificial Intelligence.

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L06-1.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 4, 2013

February 22, 2005http://csg.csail.mit.edu/6.884/L07-1 Bluespec-1: Design Affects Everything Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

September 3, 2009L02-1http://csg.csail.mit.edu/korea Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial.

March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Constructive Computer Architecture: Guards Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September 24, 2014.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture Sequential Circuits - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Constructive Computer Architecture Store Buffers and Non-blocking Caches Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Simple Inelastic and Folded Pipelines Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 14, 2011L04-1.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

October 6, 2009http://csg.csail.mit.edu/koreaL10-1 IP Lookup-2: The Completion Buffer Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Bluespec-3: A non-pipelined processor Arvind

Bluespec-6: Modeling Processors

Folded “Combinational” circuits

Scheduling Constraints on Interface methods

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Caches-2 Constructive Computer Architecture Arvind

IP Lookup: Some subtle concurrency issues

Stmt FSM Arvind (with the help of Nirav Dave)

Performance Specifications

Pipelining combinational circuits

Multirule Systems and Concurrent Execution of Rules

Bluespec-1: Design Affects Everything

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

Pipelining combinational circuits

EHR: Ephemeral History Register

Constructive Computer Architecture: Well formed BSV programs

Blusepc-5: Dead cycles, bubbles and Forwarding in Pipelines Arvind

Modules with Guarded Interfaces

Pipelining combinational circuits

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Bluespec-3: A non-pipelined processor Arvind

Multirule systems and Concurrent Execution of Rules

Stmt FSM Arvind (with the help of Nirav Dave)

Caches-2 Constructive Computer Architecture Arvind

Modular Refinement - 2 Arvind

IP Lookup Arvind Computer Science & Artificial Intelligence Lab

Caches and store buffers

IP Lookup: Some subtle concurrency issues

Elastic Pipelines and Basics of Multi-rule Systems

Constructive Computer Architecture: Guards

Elastic Pipelines and Basics of Multi-rule Systems

Simple Synchronous Pipelines

Multirule systems and Concurrent Execution of Rules

IP Lookup: Some subtle concurrency issues

Bluespec-5: Scheduling & Rule Composition

Modeling Processors Arvind

Modular Refinement Arvind

Implementing for Correct Concurrency

Constructive Computer Architecture: Well formed BSV programs

Caches-2 Constructive Computer Architecture Arvind

Presentation transcript:

Bluespec-4: Architectural exploration using IP lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 21, 2007 http://csg.csail.mit.edu/6.375/

IP Lookup block in a router Queue Manager Packet Processor Exit functions Control Processor Line Card (LC) IP Lookup SRAM (lookup table) Arbitration Switch LC The design example you’ll be seeing is from an IP switch. We’ll be looking at the IP lookup function that you typically find on the ingress port of a line card. A packet is routed based on the “Longest Prefix Match” (LPM) of it’s IP address with entries in a routing table Line rate and the order of arrival must be maintained line rate  15Mpps for 10GE http://csg.csail.mit.edu/6.375/ February 21, 2007

Sparse tree representation F … A … 7 A … B F … F * E 5.*.*.* D 10.18.200.5 C 10.18.200.* B 7.14.7.3 A 7.14.*.* 14 3 5 E F … F 7 F … 10 F … F … 18 255 F … F … IP address Result M Ref 7.13.7.3 F 10.18.201.5 7.14.7.2 5.13.7.2 E 10.18.200.7 C C … 5 D 200 2 F … 3 Real-world lookup algorithms are more complex but all make a sequence of dependent memory references. 1 4 http://csg.csail.mit.edu/6.375/ February 21, 2007

Table representation issues Table size Depends on the number of entries: 10K to 100K Too big to fit on chip memory  SRAM  DRAM  latency, cost, power issues Number of memory accesses for an LPM? Too many  difficult to do table lookup at line rate (say at 10Gbps) Control-plane issues: incremental table update size, speed of table maintenance software In this lecture (to fit the code on slides!): Level 1: 16 bits, Level 2: 8 bits, Level 3: 8 bits  from 1 to 3 memory accesses for an LPM http://csg.csail.mit.edu/6.375/ February 21, 2007

“C” version of LPM int lpm (IPA ipa) /* 3 memory lookups */ { int p; /* Level 1: 16 bits */ p = RAM [ipa[31:16]]; if (isLeaf(p)) return value(p); /* Level 2: 8 bits */ p = RAM [ptr(p) + ipa [15:8]]; /* Level 3: 8 bits */ p = RAM [ptr(p) + ipa [7:0]]; return value(p); /* must be a leaf */ } … 216 -1 28 -1 Not obvious from the C code how to deal with - memory latency - pipelining Here’s a longest prefix match algorithm in software. It’s pretty simple, up to three lookups of different parts of the IP address. (Click now) Can you visualize how to implement this in HW? It’s not obvious. Memory latency ~30ns to 40ns Must process a packet every 1/15 ms or 67 ns Must sustain 3 memory dependent lookups in 67 ns http://csg.csail.mit.edu/6.375/ February 21, 2007

Microarchitecture -1 Static Pipeline IP Lookup: Microarchitecture -1 Static Pipeline February 21, 2007 http://csg.csail.mit.edu/6.375/

Static Pipeline Assume the memory has a latency of n (4) cycles and can accept a request every cycle Assume every IP look up takes exactly m (3) memory reads Assuming there is always an input to process: Number of lookups per packet Pipelining to deal with latency Inefficient memory usage – unused memory slots represent wasted bandwidth Difficult to schedule table updates The system needs space for at least n packets for full pipelining http://csg.csail.mit.edu/6.375/ February 21, 2007

Static (Synchronous) Pipeline Microarcitecture Provide n (> latency) registers; mark all of them as Empty Let a new message enter the system when the last register is empty or an old request leaves Each Register r hold either the result value or the remainder of the IP address. r5 also has to hold the next address for the memory typedef union tagged { Value Result; struct{Bit#(16) remainingIP; Bit#(19) ptr;} IPptr } regData The state c of each register is typedef enum { Empty , Level1 , Level2 , Level3 } State; RAM req IP addr resp RAM latency=4 ri, ci http://csg.csail.mit.edu/6.375/ February 21, 2007

Static code RAM rule static (True); if (next(c5) == Empty) req IP addr resp RAM latency=4 r1, c1 inQ outQ r2, c2 r3, c3 r4, c4 r5, c5 rule static (True); if (next(c5) == Empty) if (inQ.notEmpty) begin IP ip = inQ.first(); inQ.deq(); ram.req(ext(ip[31:16])); r1 <= IPptr{ip[15:0],?}; c1 <= Level1; end else c1 <= Empty; else begin r1 <= r5; c1 <= next(c5); if(!isResult(r5)) ram.req(ptr(r5));end r2 <= r1; c2 <= c1; r3 <= r2; c3 <= c2; r4 <= r3; c4 <= c3; TableEntry p; if((c4 != Empty)&& !isResult(r4)) p <- ram.resp(); r5 <= nextReq(p, r4); c5 <= c4; if (c5 == Level3) outQ.enq(result(r5)); endrule http://csg.csail.mit.edu/6.375/ February 21, 2007

The next function function State next (State c); case (c) Empty : return(Empty); Level1 : return(Level2); Level2 : return(Level3); Level3 : return(Empty); endcase endfunction http://csg.csail.mit.edu/6.375/ February 21, 2007

The nextReq function function RegData nextReq(TableEntry p, RegData r); case (r) matches tagged Result .*: return r; tagged IPptr .ip: if (isLeaf(p)) return tagged Result value(p), else return tagged IPptr{remainingIP: ip.remainingIP << 8, ptr: ptr(p) + ip.remainingIP[15:8] }; endcase endfunction http://csg.csail.mit.edu/6.375/ February 21, 2007

Another Static Organization RAM FSM MUX / De-MUX Counter result IP addr Each packet is processed by its own FSM Counter determines which FSM gets to go http://csg.csail.mit.edu/6.375/ February 21, 2007

Code for Static-2 Organization function Action doFSM(r,c); action if (c == Empty) else if (c == Level1 || c == Level2) begin else if (c == Level3) begin endaction endfunction RAM FSM MUX / De-MUX result IP addr Counter http://csg.csail.mit.edu/6.375/ February 21, 2007

Implementations of Static pipelines Two designers, two results LPM versions Best Area (gates) Best Speed (ns) Static V (Replicated FSMs) 8898 3.60 Static V (Single FSM) 2271 3.56 Replicated: RAM FSM MUX / De-MUX Counter result IP addr FSM RAM MUX result IP addr BEST: Each packet is processed by one FSM Shared FSM All three architectures, in Verilog (V) and in Bluespec (BSV). BSV designs done in 3 days, with same speed/area as hand coded RTL (V). – REVIEW Static BS; Initial; 3.3ns 252 1719 105% 3375 41% 3.69 11% Static BS; Data alignment; 3.3ns 252 1038 24% 2606 9% 3.48 5% Static BS; No type conversion; 3.3ns 252 948 13% 2478 4% 3.61 9% Static BS; Nest case (BEST); 3.3ns 252 837 0% 2391 0% 3.32 0% Static Verilog; Replicated; 3.3ns 243 7151 775% 8898 292% 3.60 1% Static Verilog; BEST; 3.3ns 240 818 0% 2271 0% 3.56 0% http://csg.csail.mit.edu/6.375/ February 21, 2007

Microarchitecture -2 Circular Pipeline IP Lookup: Microarchitecture -2 Circular Pipeline 5-minute break to stretch you legs February 21, 2007 http://csg.csail.mit.edu/6.375/

Circular pipeline Completion buffer getToken luResp cbuf inQ yes luReq enter? RAM done? no fifo Completion buffer - gives out tokens to control the entry into the circular pipeline - ensures that departures take place in order even if lookups complete out-of-order The fifo holds the token while the memory access is in progress: Tuple2#(Bit#(16), Token) http://csg.csail.mit.edu/6.375/ remainingIP February 21, 2007

Circular Pipeline Code enter? done? RAM cbuf inQ fifo rule enter (True); Token tok <- cbuf.getToken(); IP ip = inQ.first(); ram.req(ext(ip[31:16])); fifo.enq(tuple2(ip[15:0], tok)); inQ.deq(); endrule rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .t} = fifo.first(); if (isLeaf(p)) cbuf.put(t, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule http://csg.csail.mit.edu/6.375/ February 21, 2007

Completion buffer i o cnt buf interface CBuffer#(type t); V cnt i o buf interface CBuffer#(type t); method ActionValue#(Token) getToken(); method Action put(Token tok, t d); method ActionValue#(t) getResult(); endinterface module mkCBuffer (CBuffer#(t)) provisos (Bits#(t,sz)); RegFile#(Token, Maybe#(t)) buf <- mkRegFileFull(); Reg#(Token) i <- mkReg(0); //input index Reg#(Token) o <- mkReg(0); //output index Reg#(Token) cnt <- mkReg(0); //number of filled slots … http://csg.csail.mit.edu/6.375/ February 21, 2007

Completion buffer i o ... // state elements buf, i, o, n ... cnt V cnt i o buf Completion buffer ... // state elements buf, i, o, n ... method ActionValue#(t) getToken() if (cnt <= maxToken); cnt <= cnt + 1; i <= i + 1; buf.upd(i, Invalid); return i; endmethod method Action put(Token tok, t data); return buf.upd(tok, Valid data); endmethod method ActionValue#(t) getResult() if (cnt > 0) &&& (buf.sub(o) matches tagged (Valid .x)); o <= o + 1; cnt <= cnt - 1; return x; endmethod http://csg.csail.mit.edu/6.375/ February 21, 2007

Longest Prefix Match for IP lookup: 3 possible implementation architectures Rigid pipeline Inefficient memory usage but simple design Linear pipeline Efficient memory usage through memory port replicator Circular pipeline Efficient memory with most complex control So, let’s look at three different approaches: Rigid pipeline Linear pipeline Circular pipeline They all appear at a high-level to have different advantages, but… 1 2 3 Designer’s Ranking: Which is “best”? http://csg.csail.mit.edu/6.375/ February 21, 2007 Arvind, Nikhil, Rosenband & Dave ICCAD 2004

Synthesis results - Bluespec results can match carefully coded Verilog LPM versions Code size (lines) Best Area (gates) Best Speed (ns) Mem. util. (random workload) Static V 220 2271 3.56 63.5% Static BSV 179 2391 (5% larger) 3.32 (7% faster) Linear V 410 14759 4.7 99.9% Linear BSV 168 15910 (8% larger) 4.7 (same) Circular V 364 8103 3.62 Circular BSV 257 8170 (1% larger) 3.67 (2% slower) Synthesis: TSMC 0.18 µm lib - Bluespec results can match carefully coded Verilog - Micro-architecture has a dramatic impact on performance - Architecture differences are much more important than language differences in determining QoR http://csg.csail.mit.edu/6.375/ February 21, 2007 V = Verilog;BSV = Bluespec System Verilog

A problem ... rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .t} = fifo.first(); if (isLeaf(p)) cbuf.put(t, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule What condition does the fifo need to satisfy for this rule to fire? http://csg.csail.mit.edu/6.375/ February 21, 2007

One Element FIFO enq and deq cannot be enabled together! module mkFIFO1 (FIFO#(t)); Reg#(t) data <- mkRegU(); Reg#(Bool) full <- mkReg(False); method Action enq(t x) if (!full); full <= True; data <= x; endmethod method Action deq() if (full); full <= False; method t first() if (full); return (data); method Action clear(); endmodule enq and deq cannot be enabled together! http://csg.csail.mit.edu/6.375/ February 21, 2007

Another Problem: Dead cycle elimination enter? done? RAM cbuf inQ fifo rule enter (True); Token tok <- cbuf.getToken(); IP ip = inQ.first(); ram.req(ext(ip[31:16])); fifo.enq(tuple2(ip[15:0], tok)); inQ.deq(); endrule rule recirculate (True); TableEntry p <- ram.resp(); match {.rip, .t} = fifo.first(); if (isLeaf(p)) cbuf.put(t, p); else begin fifo.enq(tuple2(rip << 8, tok)); ram.req(p+signExtend(rip[15:8])); end fifo.deq(); endrule Can a new request enter the system simultaneously with an old one leaving? Solutions next time ... http://csg.csail.mit.edu/6.375/ February 21, 2007