March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Slides:

Advertisements

Similar presentations

Elastic Pipelines and Basics of Multi-rule Systems Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February.

Advertisements

An EHR based methodology for Concurrency management Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Constructive Computer Architecture: Multirule systems and Concurrent Execution of Rules Arvind Computer Science & Artificial Intelligence Lab. Massachusetts.

March 2007http://csg.csail.mit.edu/arvindSemantics-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

CS 151 Digital Systems Design Lecture 37 Register Transfer Level

Stmt FSM Richard S. Uhler Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology (based on a lecture prepared by Arvind)

Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology October 13, 2009http://csg.csail.mit.edu/koreaL12-1.

February 21, 2007http://csg.csail.mit.edu/6.375/L07-1 Bluespec-4: Architectural exploration using IP lookup Arvind Computer Science & Artificial Intelligence.

March, 2007http://csg.csail.mit.edu/arvindIPlookup-1 IP Lookup Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

September 24, L08-1 IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 22, 2011L06-1.

IP Lookup: Some subtle concurrency issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology March 4, 2013

December 10, 2009 L29-1 The Semantics of Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute.

Computer Architecture: A Constructive Approach Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

February 22, 2005http://csg.csail.mit.edu/6.884/L07-1 Bluespec-1: Design Affects Everything Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

December 12, 2006http://csg.csail.mit.edu/6.827/L24-1 Scheduling Primitives for Bluespec Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Pipelining combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology February 20, 2013http://csg.csail.mit.edu/6.375L05-1.

Realistic Memories and Caches Li-Shiuan Peh Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology March 21, 2012L13-1

September 3, 2009L02-1http://csg.csail.mit.edu/korea Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

SEQUENTIAL CIRCUITS Component Design and Use. Register with Parallel Load  Register: Group of Flip-Flops  Ex: D Flip-Flops  Holds a Word of Data 

Folded Combinational Circuits as an example of Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology.

Multiple Clock Domains (MCD) Continued … Arvind with Nirav Dave Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology November.

September 22, 2009http://csg.csail.mit.edu/koreaL07-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Elastic Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology February 28, 2011L08-1http://csg.csail.mit.edu/6.375.

Constructive Computer Architecture Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology September.

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Constructive Computer Architecture Sequential Circuits - 2 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

February 20, 2009http://csg.csail.mit.edu/6.375L08-1 Asynchronous Pipelines: Concurrency Issues Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Non-blocking Caches Arvind (with Asif Khan) Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology May 14, 2012L25-1

Introduction to Bluespec: A new methodology for designing Hardware Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Computer Architecture: A Constructive Approach Bluespec execution model and concurrent rule scheduling Teacher: Yoav Etsion Taken (with permission) from.

October 20, 2009L14-1http://csg.csail.mit.edu/korea Concurrency and Modularity Issues in Processor pipelines Arvind Computer Science & Artificial Intelligence.

October 6, 2009http://csg.csail.mit.edu/koreaL10-1 IP Lookup-2: The Completion Buffer Arvind Computer Science & Artificial Intelligence Lab Massachusetts.

Introduction to Bluespec: A new methodology for designing Hardware

Introduction to Bluespec: A new methodology for designing Hardware

Bluespec-3: A non-pipelined processor Arvind

Bluespec-6: Modeling Processors

Sequential Circuits Constructive Computer Architecture Arvind

Sequential Circuits: Constructive Computer Architecture

Introduction to Bluespec: A new methodology for designing Hardware

Caches-2 Constructive Computer Architecture Arvind

IP Lookup: Some subtle concurrency issues

Stmt FSM Arvind (with the help of Nirav Dave)

Performance Specifications

Multirule Systems and Concurrent Execution of Rules

Bluespec-1: Design Affects Everything

Constructive Computer Architecture: Guards

Sequential Circuits Constructive Computer Architecture Arvind

Bluespec-4: Architectural exploration using IP lookup Arvind

Bluespec-7: Scheduling & Rule Composition

Modules with Guarded Interfaces

Sequential Circuits - 2 Constructive Computer Architecture Arvind

Introduction to Bluespec: A new methodology for designing Hardware

Bluespec-3: A non-pipelined processor Arvind

Stmt FSM Arvind (with the help of Nirav Dave)

IP Lookup Arvind Computer Science & Artificial Intelligence Lab

Bluespec-4: Rule Scheduling and Synthesis

IP Lookup: Some subtle concurrency issues

Elastic Pipelines and Basics of Multi-rule Systems

Constructive Computer Architecture: Guards

GCD: A simple example to introduce Bluespec

Elastic Pipelines and Basics of Multi-rule Systems

Bluespec-7: Scheduling & Rule Composition

Multirule systems and Concurrent Execution of Rules

Introduction to Bluespec: A new methodology for designing Hardware

IP Lookup: Some subtle concurrency issues

Bluespec-5: Scheduling & Rule Composition

Modular Refinement Arvind

Presentation transcript:

March 6, 2006http://csg.csail.mit.edu/6.375/L10-1 Bluespec-4: Rule Scheduling and Synthesis Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology

March 6, 2006http://csg.csail.mit.edu/6.375/L10-2 Exploring microarchitectures IP Lookup Module

March 6, 2006L10-3http://csg.csail.mit.edu/6.375/ IP Lookup block in a router Queue Manager Packet Processor Exit functions Control Processor Line Card (LC) IP Lookup SRAM (lookup table) Arbitration Switch LC A packet is routed based on the “Longest Prefix Match” (LPM) of it’s IP address with entries in a routing table Line rate and the order of arrival must be maintained line rate  15Mpps for 10GE

March 6, 2006L10-4http://csg.csail.mit.edu/6.375/ IP addressResultM Ref F F E C Sparse tree representation 3 A … A … B C … C … 5 D F … F … 14 A … A … 7 F … F … 200 F … F … F* E5.*.*.* D C * B A7.14.*.* F … F … F F … E A Real-world lookup algorithms are more complex but all make a sequence of dependent memory references.

March 6, 2006L10-5http://csg.csail.mit.edu/6.375/ Table representation issues Table size Depends on the number of entries: 10K to 100K Too big to fit on chip memory  SRAM  DRAM  latency, cost, power issues Number of memory accesses for an LPM? Too many  difficult to do table lookup at line rate (say at 10Gbps) Control-plane issues: incremental table update size, speed of table maintenance software In this lecture (to fit the code on slides!): Level 1: 16 bits, Level 2: 8 bits, Level 3: 8 bits  from 1 to 3 memory accesses for an LPM

March 6, 2006L10-6http://csg.csail.mit.edu/6.375/ “C” version of LPM int lpm (IPA ipa) /* 3 memory lookups */ { int p; /* Level 1: 16 bits */ p = RAM [ipa[31:16]]; if (isLeaf(p)) return p; /* Level 2: 8 bits */ p = RAM [p + ipa [15:8]]; if (isLeaf(p)) return p; /* Level 3: 8 bits */ p = RAM [p + ipa [7:0]]; return p; /* must be a leaf */ } How to implement LPM in HW? Not obvious from the C code! … … … … 0 Must process a packet every 1/15 s or 67 ns Must sustain 3 memory dependent lookups in 67 ns

March 6, 2006L10-7http://csg.csail.mit.edu/6.375/ Static Pipeline Assume the memory has a latency of n cycles and can accept a request every cycle Inefficient memory usage – unused memory slots represent wasted bandwidth. Difficult to schedule table updates RAM req IP addr resp RAM latency=4

March 6, 2006L10-8http://csg.csail.mit.edu/6.375/ Circular pipeline luReq luResp Completion buffer - gives out tokens to control the entry into the circular pipeline - ensures that departures take place in order even if lookups complete out-of-order enter? done? RAM cbuf yes getToken in active no

March 6, 2006L10-9http://csg.csail.mit.edu/6.375/ RAMs: Synchronous vs Asynchronous view Basic memory components are "synchronous": Present a read-address A J on clock J Data D J arrives on clock J+N If you don't "catch" D J on clock J+N, it may be lost, i.e., data D J+1 may arrive on clock J+1+N This kind of synchronicity can pervade the design and cause complications Synch Mem, Latency N AddrData Clock

March 6, 2006L10-10http://csg.csail.mit.edu/6.375/ Asynchronous RAMs It's easier to work with an "asynchronous" block Synch Mem Latency N Addr Ready ctr (ctr > 0) ctr++ ctr-- deq Enable enq interface AsyncRAM#(type addr_T, type data_T);  method Action req(addr_T a);  method ActionValue#(data_T) resp(); endinterface Data Ack Data Ready req resp

March 6, 2006L10-11http://csg.csail.mit.edu/6.375/ Static code rule static (True); if (c5 == 3) begin IP ip = in.first(); ram.req(ip[31:16]); r1 <= ip[15:0]; in.deq(); c1 <= 1; end else begin r1 <= r5; c1 <= c5+1; ram.req(r5); end r2 <= r1; c2 <= c1; r3 <= r2; c3 <= c2; r4 <= r3; c4 <= c3; TableEntry p <- ram.resp(); r5 <= nextReq( p, r4); c5 <= c4; if (c5 == 3) out.enq(r5); endrule RAM req IP addr resp ri, ci RAM latency=4

March 6, 2006L10-12http://csg.csail.mit.edu/6.375/ Circular Pipeline Code rule enter (True); Token t <- cbuf.getToken(); IP ip = in.first(); ram.req(ip[31:16]); active.enq(tuple2(ip[15:0], t)); in.deq(); endrule rule done (True); TableEntry p <- ram.resp(); match {.rip,.t} = active.first(); if (isLeaf(p)) cbuf.done(t, p); else begin active.enq(rip << 8, t); ram.req(p + signExtend(rip[15:7])); end active.deq(); endrule enter? done? RAM cbuf in active

March 6, 2006L10-13http://csg.csail.mit.edu/6.375/ Completion buffer interface CBuffer#(type any_T); method ActionValue#(Token) getToken(); method Action done(Token t, any_T d); method ActionValue#(any_T) getResult(); endinterface module mkCBuffer (CBuffer#(any_T)) provisos (Bits#(any_T,sz)); RegFile#(Token, Maybe#(any_T)) buf <- mkRegFileFull(); Reg#(Token) i <- mkReg(0); //input index Reg#(Token) o <- mkReg(0); //output index Reg#(Token) cnt <- mkReg(0); //number of filled slots … IIVIVIIIVIVI n i o buf

March 6, 2006L10-14http://csg.csail.mit.edu/6.375/ Completion buffer... // state elements buf, i, o, n... method ActionValue#(any_T) getToken() if (cnt <= maxToken); cnt <= cnt + 1; i <= i + 1; buf.upd(i, Invalid); return i; endmethod method Action done(Token t, any_T data); return buf.upd(t, Valid data); endmethod method ActionValue#(any_T) get() if (cnt > 0) &&& (buf.sub(o) matches tagged (Valid.x)); o <= o + 1; cnt <= cnt - 1; return x; endmethod IIVIVIIIVIVI cnt i o buf

March 6, 2006L10-15http://csg.csail.mit.edu/6.375/ Synthesis from rules... we will revisit IP LPM block synthesis results after a better understanding of the synthesis procedure

March 6, 2006L10-16http://csg.csail.mit.edu/6.375/ Synthesis: From State & Rules into Synchronous FSMs interface module Transition Logic I O S “Next” S Collection of State Elements

March 6, 2006L10-17http://csg.csail.mit.edu/6.375/ Hardware Elements Combinational circuits Mux, Demux, ALU,... Synchronous state elements Flipflop, Register, Register file, SRAM, DRAM Sel O I0I1InI0I1In Mux... Sel I De- Mux... O0O1OnO0O1On OpSelect - Add, Sub, AddU,... - And, Or, Not,... - GT, LT, EQ,... - SL, SR, SRA,... Result NCVZ A B ALU ff D D D D D D D QQQQQQQQ D Clk En register

March 6, 2006L10-18http://csg.csail.mit.edu/6.375/ Flip-flops with Write Enables ff Q D C EN C D Q ff Q DCDC EN 0101 ff Q D C EN dangerous! Edge-triggered: Data is sampled at the rising edge

March 6, 2006L10-19http://csg.csail.mit.edu/6.375/ Semantics and synthesis Rules Semantics: “Untimed” (one rule at a time) Verilog RTL Semantics: clocked synchronous HW (multiple rules per clock) Scheduling and Synthesis by the BSV compiler Using Rule Semantics, establish functional correctness Using Schedules, establish performance correctness Verification activities

March 6, 2006L10-20http://csg.csail.mit.edu/6.375/ Rule semantics Given a set of rules and an initial state while ( some rule is applicable in the current state ) choose one applicable rule apply that rule to the current state to produce the next state of the system* (*) These rule semantics are “untimed” – the action to change the state can take as long as necessary provided the state change is seen as atomic, i.e., not divisible. Bluespec synthesis is all about executing many rules concurrently while preserving the above semantics

March 6, 2006L10-21http://csg.csail.mit.edu/6.375/ Rule: As a State Transformer A rule may be decomposed into two parts (s) and (s) such that s next = if (s) then (s) else s (s) is the condition (predicate) of the rule, a.k.a. the “CAN_FIRE” signal of the rule. (conjunction of explicit and implicit conditions) (s) is the “state transformation” function, i.e., computes the next-state value in terms of the current state values.

March 6, 2006L10-22http://csg.csail.mit.edu/6.375/ Compiling a Rule f x current state next state values   enable f x rule r (f.first() > 0) ; x <= x + 1 ; f.deq (); endrule  = enabling condition  = action signals & values rdy signals read methods enable signals action parameters

March 6, 2006L10-23http://csg.csail.mit.edu/6.375/ Combining State Updates: strawman next state value latch enable R OR 11 nn  1,R  n,R OR ’s from the rules that update R ’s from the rules that update R What if more than one rule is enabled?

March 6, 2006L10-24http://csg.csail.mit.edu/6.375/ Combining State Updates next state value latch enable R Scheduler: Priority Encoder OR 11 nn 11 nn  1,R  n,R OR ’s from the rules that update R Scheduler ensures that at most one  i is true ’s from all the rules

March 6, 2006L10-25http://csg.csail.mit.edu/6.375/ One-rule-at-a-time Scheduler Scheduler: Priority Encoder 11 22 nn 11 22 nn 1  i   i 2  1   2 ....   n   1   2 ....   n 3. One rewrite at a time i.e. at most one  i is true Very conservative way of guaranteeing correctness

March 6, 2006L10-26http://csg.csail.mit.edu/6.375/ Executing Multiple Rules Per Cycle Can these rules be executed simultaneously? These rules are “conflict free” because they manipulate different parts of the state rule ra (z > 10); x <= x + 1; endrule rule rb (z > 20); y <= y + 2; endrule Rule a and Rule b are conflict-free if s.  a (s)   b (s)  1.  a ( b (s))   b ( a (s)) 2.  a ( b (s)) ==  b ( a (s))

March 6, 2006L10-27http://csg.csail.mit.edu/6.375/ Executing Multiple Rules Per Cycle Can these rules be executed simultaneously? These rules are “sequentially composable”, parallel execution behaves like ra < rb rule ra (z > 10); x <= y + 1; endrule rule rb (z > 20); y <= y + 2; endrule Rule a and Rule b are sequentially composable if s.  a (s)   b (s)   b ( a (s))

March 6, 2006L10-28http://csg.csail.mit.edu/6.375/ Multiple-Rules-per-Cycle Scheduler Scheduler 11 22 nn 11 22 nn 1.  i   i 2.  1   2 ....   n   1   2 ....   n 3.  Multiple operations such that  i   j  R i and R j are conflict-free or sequentially composable Scheduler Divide the rules into smallest conflicting groups; provide a scheduler for each group

March 6, 2006L10-29http://csg.csail.mit.edu/6.375/ Sequentially composable Muxing structure Muxing logic requires determining for each register (action method) the rules that update it and under what conditions Conflict Free and or and or 11 11 22 22 11  1 and ~  2 22 22  1  ~  2

March 6, 2006L10-30http://csg.csail.mit.edu/6.375/ Scheduling and control logic Modules (Current state) Rules   Scheduler 11 nn 11 nn Muxing 11 nn nn nn Modules (Next state) cond action “CAN_FIRE”“WILL_FIRE”

March 6, 2006L10-31http://csg.csail.mit.edu/6.375/ Synthesis Summary Bluespec generates a combinational hardware scheduler allowing multiple enabled rules to execute in the same clock cycle The hardware makes a rule-execution decision on every clock (i.e., it is not a static schedule) Among those rules that CAN_FIRE, only a subset WILL_FIRE that is consistent with a Rule order Since multiple rules can write to a common piece of state, the compiler introduces suitable muxing and mux control logic This is very simple logic: the compiler will not introduce long paths on its own (details later)

March 6, 2006L10-32http://csg.csail.mit.edu/6.375/ Scheduling conflicting rules When two rules conflict on a shared resource, they cannot both execute in the same clock The compiler produces logic that ensures that, when both rules are applicable, only one will fire Which one? more on this later