1
SPP V2 Router Design
John DeHart and Mike Wilson
2
Revision History
- 3 June 2008: Initial release, presentation
- 25 June 2008: Updates on feedback from presentation
- 27 July 2009: Current status, changes, Control documentation
- 24 August 2009: Updates from debugging, simulation
3
Current Status: Summary
- Memory layout: done, may need revisiting
- Scripts (.ind files): done, missing TCAM initialization
- NPUA: blocks written, simulates; some GPE-to-NPE problems
- NPUB: broken, needs some changes:
  - needs the RxB SRAM ring fix
  - HdrFmt needs the internal header fix
  - recent changes to LookupB/Copy not yet added
  - needs some changes to TxB for chained buffers
- Recent changes:
  - Exception and Local Delivery packets were omitted in the original design; necessitates changes to Parse
  - changed ResultTable indexing; impacts LookupB/Copy
4
SPP Versions
- SPP Version 0: what we used for the SIGCOMM paper
- SPP Version 1: the bare minimum we would need to release something to PlanetLab users
- SPP Version 2: what we would REALLY like to release to PlanetLab users
5
Objectives for SPP-NPE version 2
- Deal with constraints imposed by the switch: it can send to only one NPU and receive from only one NPU, so split processing across the NPUs (parsing and lookup on one; queuing on the other).
- Provide more resources for slice-specific processing.
- Decouple the QM schedulers from links: a collection of largely independent schedulers; several may be used to send to the same link, e.g. separate rate classes (1-10M, ...); optionally adjust scheduler rates dynamically.
- Provide support for multicast: requires addition of the next-hop IP address after queueing.
- Enable a single slice to operate at 10 Gb/s.
- Support "slow" code options: use separate rate classes to limit the rate to slow code options (LCI QMs for Parse, NPUB QMs for HdrFmt).
6
SPP Version 2 System Architecture
Default Data Path
[Block diagram: packets enter at LC Ingress (7010 blade), cross NPUA (Decap, Parse, Lookup, AddShim), traverse the Switch Blade (SPI/FIC/RTM), then NPUB (Copy, QM, HdrFormat), and exit at LC Egress; GPE blades attach through the switch; links are 1x10Gb/s or 10x1Gb/s.]
7
SPP Version 2 System Architecture
Fast-Path Data
[Same block diagram with the fast path highlighted: LC Ingress, NPUA (Decap, Parse, Lookup, AddShim), Switch Blade, NPUB (Copy, QM, HdrFormat), LC Egress.]
8
SPP Version 2 System Architecture
Exception Data Path / Local Delivery
[Same block diagram with the exception and local-delivery paths highlighted: packets are diverted through the switch to the GPE blades rather than to LC Egress.]
9
NPE Version 2 Block Diagram
[Block diagram. NPUA: RxA (2 ME), Decap (1 ME), Parse (8 ME), LookupA (1 ME), AddShim (1 ME), TxA (2 ME), plus StatsA (1 ME), SRAM, and the TCAM. NPUB: RxB (2 ME), LookupB&Copy (2 ME), Queue Manager (4 MEs), HdrFmt/SubEncap (4 MEs), TxB (2 ME), plus StatsB (1 ME), Scr2NN/Freelist (1 ME), and SRAM. The NPUs connect to each other and to the GPE via SPI interfaces and the Switch Blade.]
10
NPE Version 2 Block Diagram
[Same block diagram annotated with ring sizes: NN rings and Scr/512 and Scr/1024 rings on NPUA; Scr/256 rings between most NPUB blocks. NPUB has 9 scratch rings in use and can have up to 12 that can do the "full" test. Note: the LookupB/Copy scratch rings must be at least 256 entries or overflow is possible.]
11
NPE Version 2 Block Diagram
[NPE Version 2 block diagram repeated from above.]
12
PlanetLab NPE Input Frame from LC or GPE
Ethernet header: DstAddr = MAC address of the NPE; SrcAddr = MAC address of the LC or GPE; VLAN: one VLAN per MR (MR == slice); only the lower 11 bits of the VLAN tag are used.
IP header: Dst Addr = IP address of this node (how many IP addresses can a node have?); Src Addr = IP address of the previous hop; Protocol = UDP.
UDP header: Dst Port identifies the input tunnel; Src Port together with the IP Src Addr identifies the sending entity.
Frame layout (8-byte boundaries indicated on the original slide; assuming no IP options):
Ethernet header: DstAddr (6B) | SrcAddr (6B) | Type=802.1Q (2B) | VLAN (2B) | Type=IP (2B)
IP header: Ver/HLen/Tos/Len (4B) | ID/Flags/FragOff (4B) | TTL (1B) | Protocol=UDP (1B) | Hdr Cksum (2B) | Src Addr (4B) | Dst Addr (4B) | IP Options (0-40B)
UDP header: Src Port (2B) | Dst Port (2B) | UDP length (2B) | UDP checksum (2B)
UDP payload (MN packet) | PAD (nB) | Ethernet trailer: CRC (4B)
A C overlay of these headers follows.
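As a reference for the layout above, here is a minimal C overlay of the header stack; this is standard Ethernet 802.1Q / IPv4 / UDP framing shown only for orientation (assuming no IP options; the packed-struct attribute is GCC-specific):

    struct eth_8021q_hdr {
        unsigned char  dst[6], src[6];   /* DstAddr = NPE MAC, SrcAddr = LC/GPE MAC */
        unsigned short tpid;             /* 0x8100 = 802.1Q */
        unsigned short vlan;             /* slice ID in the low 11 bits */
        unsigned short ethertype;        /* 0x0800 = IP */
    } __attribute__((packed));

    struct ipv4_hdr {
        unsigned char  ver_hlen, tos;
        unsigned short len;
        unsigned short id, frag;
        unsigned char  ttl, proto;       /* proto = 17 (UDP) */
        unsigned short cksum;
        unsigned int   saddr, daddr;     /* previous hop / this node */
    } __attribute__((packed));

    struct udp_hdr {
        unsigned short sport, dport;     /* dport identifies the input tunnel */
        unsigned short len, cksum;
    } __attribute__((packed));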
13
Local Delivery / Exceptions
- The GPE has separate tunnels for LD and EX; standard filters handle these packets. No internal packet headers are required, although we can still use internal headers for exceptions.
- The return path from the GPE uses the same tunnels; standard filters handle the re-classify cases.
- Internal packet headers from the GPE to the NPE are MNet-specific: they provide the filter key for GPE-routed packets; substrate headers are unchanged; MN frames carry code-option-specific details and the filter key. For IPv4, the MN frame has IP version 0 and the payload carries the 112b lookup key to use. If the GPE wants to reclassify, it sends a normal packet.
- The GPE can figure out exceptions on its own; it is silly to do it on the NPE. However, Parse still sends exception bits to HdrFmt, which can add the internal headers for the GPE. I think this is a bad idea, personally: the GPE knows a packet is an exception because of the socket on which it arrives, and it can check things much more easily, albeit more slowly, than the NPE.
14
NPE Version 2 Block Diagram
[Block diagram annotated with the RxA ring entry: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b), and Eth. Frame Len(16b) | Reserved(12b) | Port(4b).]
15
RxA No change from V1
16
NPE Version 2 Block Diagram
[Block diagram annotated with the ring entry fields around Decap: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b); MN Frm Length(16b) | MN Frm Offset(16b); Slice ID (VLAN)(16b) | Rx UDP DPort(16b); Rx IP SAddr(32b); Rx UDP SPort(16b) | Reserved(12b) | Code(4b); Slice Data Ptr(32b).]
17
Decap
Inputs: packet from RxA.
Outputs: meta-frame (handle, offset and length); Slice ID (VLAN tag), actually the lower 11b of the VLAN tag plus the lower 4b of the Rx DA (for the RxID); meta-interface (Rx SAddr, Rx SPort, Rx DPort); code option (4b, only 16 available); slice data pointer.
Initialization: VLAN table, NPE MAC address.
Functionality: read the VLAN tag from DRAM and determine the correct code option; validate the packet; drop invalid and unmatched packets (IP options destined for the NPE are dropped in the LC and should never arrive here!); enqueue valid packets to the scratch ring; update stats.
Status: works for valid packets; invalid packet handling untested.
18
VLAN table
Indexed by the lower 11b of the VLAN tag (2048 VLANs); each entry holds code_option, slice_data_ptr, and slice_data_size (example rows on the slide: entries 1, 0x0aa, ..., 0x7ff pointing at their slice-data areas). code_option = 0 implies an invalid slice; this is the "on switch" for a slice in the data plane. SD data is currently only counters (56B of slice data). Only changes from V1: we no longer need all the data on NPUA; the HF data and per-slice buffer limits are dropped. An indexing sketch follows.
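A minimal sketch of the Decap-side table lookup, assuming the entry layout above (the struct fields match the Control slide later in the deck; the function and table names are illustrative):

    struct vlan_entry {
        unsigned int code_opt;        /* only the 4 LSBs used; 0 = invalid slice */
        unsigned int slice_data_ptr;  /* SRAM pointer to per-slice data */
        unsigned int slice_data_size;
    };

    extern struct vlan_entry vlan_table[2048];   /* indexed by low 11b of VLAN tag */

    struct vlan_entry *decap_lookup(unsigned int vlan_tag)
    {
        struct vlan_entry *e = &vlan_table[vlan_tag & 0x7FF];
        if ((e->code_opt & 0xF) == 0)
            return 0;   /* code_option 0: slice is switched off; drop the packet */
        return e;
    }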
19
NPE Version 2 Block Diagram
[Block diagram annotated with the ring entry fields around Parse: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b); MN Frm Length(16b) | MN Frm Offset(16b); the 144b lookup key: Key[143-112] = Type(1b)/RxID(4b)/Slice ID(11b)/Rx UDP DPort(16b), Key[111-80] = DA(32b), Key[79-48] = SA(32b), Key[47-16] = Ports(32b), Key[15-0] = Proto/TCP_Flags(16b); Slice ID (VLAN)(16b) | Rx UDP DPort(16b); Rx IP SAddr(32b); Rx UDP SPort(16b) | Reserved(12b) | Code(4b); Slice Data Ptr(32b); Code(4b) | Exception Bits(12b).]
20
Parse
Inputs: meta-frame (handle, offset and length); Slice ID (VLAN tag, RxID); tunnel ID (Rx SAddr, Rx SPort, Rx DPort); code option (4b, only 16 available); slice data pointer.
Outputs: lookup key (includes the slice ID and Rx UDP DPort); exception bits (MN-specific): do we still need these? (Probably.)
Initialization: slice data.
Functionality:
- Slice-specific processing: parse the meta-frame, extract the lookup key, and raise any relevant exceptions. Slice data can be passed to HdrFmt in bytes of the packet (bytes 0..15 are reserved for AddShim).
- Substrate processing: add substrate-specific information to the lookup key (32b: lookup type, RxID, slice ID, Rx UDP DPort); see the packing sketch below.
Status: needs internal packet handling from the GPE for GPE-specified filter keys; needs to use the "special" filter key, 0x0, for the exception path (substrate processing should still prepend the substrate-specific key information: slice, MiID); works for normal (LCI-to-NPE) packets.
Note: no block is currently setting the RxID; this should probably be done in Decap and become part of the slice ID (Rsvd:1, RxID:4, VLAN & 0x7FF).
Exception bits: these are passed along from Parse to HeaderFormat via the shim. However, normal operation for exceptions is to pass the packets to the GPE for exception processing; the exception bits are never passed to the GPE, so they are wasted. Currently they are nothing more than a 12-bit opaque data field from Parse to HdrFmt.
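A minimal sketch of how the 32b substrate word of the lookup key can be packed from the widths given above (Type 1b, RxID 4b, Slice ID 11b, Rx UDP DPort 16b); the function name is illustrative:

    unsigned int pack_substrate_key(unsigned int type, unsigned int rxid,
                                    unsigned int vlan, unsigned int rx_udp_dport)
    {
        return ((type & 0x1)   << 31) |
               ((rxid & 0xF)   << 27) |
               ((vlan & 0x7FF) << 16) |   /* slice ID = low 11b of the VLAN tag */
               (rx_udp_dport & 0xFFFF);
    }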
21
NPE Version 2 Block Diagram
[Block diagram annotated with the ring entry fields around LookupA: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b); MN Frm Length(16b) | MN Frm Offset(16b); the 144b lookup key (as on the previous diagram); Slice ID (VLAN)(16b); Code(4b) | Exception Bits(12b); Result Index(32b); Rsvd(16b) | Stats Index(16b).]
22
LookupA
Inputs: meta-frame (handle, offset and length); lookup key (includes the slice ID, RxID, and Rx UDP DPort); code option (4b, only 16 available); exception bits.
Outputs: lookup result (an index into the SRAM table on NPUB; the actual max index is 0x3FFFF for unicast, which with the single-bit type flag = 19 bits); Slice ID (VLAN tag); exception bits (from Parse); stats index (from the TCAM). Can the stats index fit in the 13 bits left over from the result index? No, the result is bigger now.
Initialization: filters set in the TCAM by Control.
Functionality: look the key up in the TCAM; on a miss, drop the packet. Local delivery is now a normal lookup. The lookup result is now just a 32b index (plus the stats index).
Status: written; untested. The result size is currently 48b (see the sketch below); we would like to reduce it to 32b.
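A sketch of the 48b A-side result as a C struct, following the Type(1b)/ResultIndex(31b) and Stats Index(16b) layout shown on the later "Filter / Result Operations" slide; the exact word packing is an assumption:

    struct tcam_result {
        unsigned int   type_and_index;  /* bit 31: result type (unicast/multicast);
                                           bits 0..30: index into the NPUB ResultTable */
        unsigned short stats_index;     /* statistics counter index */
    };

    #define RESULT_TYPE(r)   (((r)->type_and_index >> 31) & 0x1)
    #define RESULT_INDEX(r)  ((r)->type_and_index & 0x7FFFFFFF)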
23
NPE Version 2 Block Diagram
[Block diagram annotated with the ring entry fields around AddShim: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b); MN Frm Length(16b) | MN Frm Offset(16b); Slice ID (VLAN)(16b); Code(4b) | Exception Bits(12b); Result Index(32b); Rsvd(16b) | Stats Index(16b).]
24
AddShim
Inputs: meta-frame (handle, offset and length); lookup result (index into the SRAM table on NPUB); Slice ID (VLAN tag); code option (4b, only 16 available); exception bits (from Parse); stats index (from the TCAM).
Outputs: shim packet (buffer handle); the buffer descriptor contains updated offset and length, if needed.
Initialization: none.
Functionality: prepend a shim header to preserve packet annotations across the NPUs. Overwrite the existing Ethernet header (up to 18B) with: Slice ID (16b); code option (4b); exception bits (12b); MN frame offset (16b); MN frame length (16b); result index (32b); stats index (16b) [the same on NPUA and NPUB]. 30B remain for opaque slice data, with proper memory alignment required; that data is written by Parse, not AddShim! A struct sketch of the shim follows.
Status: written; works for properly aligned packets; needs optimization.
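A sketch of the 16B shim as a C struct, using the field order from the "NPUA to NPUB Frame" slide; packing the code option and exception bits into one 16b field is an assumption:

    struct shim {
        unsigned short slice_id;             /* VLAN-derived slice ID */
        unsigned short code_and_exceptions;  /* Code Option (4b) | Exception Bits (12b) */
        unsigned int   result_index;         /* index into the NPUB ResultTable */
        unsigned short stats_index;
        unsigned short mn_frame_offset;
        unsigned short mn_frame_length;
        unsigned short pad;                  /* 2B memory-alignment padding */
    };  /* 16 bytes total */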
25
NPE Version 2 Block Diagram
[NPE Version 2 block diagram repeated from above.]
26
TxA
Sends the shim packet to NPUB. Unmodified 10 Gb/s Tx (2 MEs).
27
SPP Version 2 NPUA-to-NPUB Frame
SHIM (16B): Slice ID (16b) | Code Option (4b) | Exception Bits (12b) | Result Index (32b) | Stats Index (16b) | Offset of MN Packet (16b) | Length of MN Packet (16b) | Memory Alignment Padding (2B).
The IP and UDP headers may be overwritten by opaque slice data, written in Parse.
Frame layout (8-byte boundaries indicated on the original slide; assuming no IP options):
SHIM (16B) | Type=IP (2B)
IP header: Ver/HLen/Tos/Len (4B) | ID/Flags/FragOff (4B) | TTL (1B) | Protocol=UDP (1B) | Hdr Cksum (2B) | Src Addr (4B) | Dst Addr (4B) | IP Options (0-40B)
UDP header: Src Port (2B) | Dst Port (2B) | UDP length (2B) | UDP checksum (2B)
UDP payload (MN packet) | PAD (nB) | Ethernet trailer: CRC (4B)
28
NPE Version 2 Block Diagram
[Block diagram annotated with the RxB ring entry: Port(4b) | Reserved(12b) | Eth. Frame Len(16b); Buffer Handle(24b) | (8b).]
29
RxB Needs to switch from NN output to Scratch or SRAM
Comments in the code indicate SRAM should work, but the supporting code seems to exist only for scratch rings; this needs further examination. DZar notes there are some obscure #defines needed for SRAM rings.
30
NPE Version 2 Block Diagram
[Block diagram annotated with the ring entry fields around LookupB/Copy: Buffer Handle(24b) | Reserved(8b); Reserved(12b) | PerSchedQID(15b) | Sch(3b) | QM(2b); Port(4b) | Reserved(12b) | Eth. Frame Len(16b); Buffer Handle(24b) | (8b); Frame Length(16b) | Stats Index(16b).]
31
LookupB/Copy
Inputs: shim packet (buffer handle, frame length).
Outputs: packet (buffer handle, frame length); QueueID (QM, scheduler, queue ID); stats index.
Initialization: the ResultTable (unicast + multicast); the local endpoint table; the Ethernet SAddr; the per-slice packet limits.
Functionality (overview): copy the shim header into the buffer descriptor; look up routing information from the result index; if multicast, make the copies; enqueue to the correct QM (from the ResultTable).
Status: written, broken. Needs changes to the handling of the ResultTable; result indices are now absolute, not per-slice.
32
LookupB/Copy – Code Sketch
if not currently processing an mcast packet:
    read packet from SRAM ring
    extract shim
    load ResultTable value
    fill buffer descriptor
    if unicast:
        if the per-slice packet limit permits:
            update the per-slice packet count
            write to the SRAM ring for the correct QM (by qmschedID in the ResultTable value)
        else:
            drop buffer
    else:
        start mcast processing
        fetch first header buffer descriptor
        if payload length != 0:
            write ref count into the payload descriptor
        else:
            drop payload buffer
            drop buffer
            finish mcast processing
else (currently processing mcast; have an empty header buffer handle):
    fill header buffer descriptor (only chain if the payload buffer is not empty)
    if still making copies:
        fetch next header buffer descriptor
    else:
        finish mcast processing
    write current header buffer handle to the SRAM ring for the correct QM (by qmschedID)
signal next ME
33
ResultTable – Unicast
Data needed to enqueue and rewrite the packet:
- Fanout: ignored (memory padding)
- QID: QMID, SchedID, QID (20b) (Lookup Result)
- Src MI: IP SAddr (32b) (per-SchedID table); UDP SPort (16b) (Lookup Result)
- Tunnel next hop: IP DAddr (32b) (Lookup Result); UDP DPort (16b) (Lookup Result)
- Chassis addressing: Ethernet Dst MAC (48b) (per-SchedID table)
- Slice-specific lookup result data (?) (Lookup Result)
- Ethernet Src MAC: should be constant across all packets
Results entry: Fanout (4b) | QID (20b) | IP DAddr (32b) | UDP DPort (16b) | UDP SPort (16b) | HFIndex (16b)
Per-sched entry: IP SAddr (32b) | Eth DA (48b)
A C sketch of these entries follows.
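A minimal sketch of the two entries as C structs, padding the result entry to the 16B size used on the sizing slide later in the deck; the word packing and padding placement are assumptions:

    struct result_entry {
        unsigned int   fanout_qid;   /* Fanout (4b) | QID (20b: QMID/SchedID/QID) | pad */
        unsigned int   ip_daddr;     /* tunnel next-hop IP DAddr */
        unsigned short udp_dport;    /* tunnel next-hop UDP DPort */
        unsigned short udp_sport;    /* source meta-interface UDP SPort */
        unsigned short hf_index;     /* index into the slice's HFTable */
        unsigned short pad;          /* padding; also holds the Valid bit */
    };  /* 16 bytes */

    struct per_sched_entry {
        unsigned int   ip_saddr;     /* source tunnel IP SAddr for this scheduler */
        unsigned char  eth_da[6];    /* Ethernet Dst MAC (LC, GPE, ...) */
        unsigned short pad;
    };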
34
ResultTable – Multicast
Fanout gives the number of copies (0..15). Data needed per copy on NPUB:
- QID: QMID, SchedID, QID (20b) (Lookup Result)
- Src MI: IP SAddr (32b) (per-SchedID table); UDP SPort (16b) (Lookup Result)
- Tunnel next hop: IP DAddr (32b) (Lookup Result); UDP DPort (16b) (Lookup Result)
- Chassis addressing: Ethernet Dst MAC (48b) (per-SchedID table)
- Slice-specific lookup result data (?) (Lookup Result)
- Ethernet Src MAC: should be constant across all packets
Support multicast, but optimize for unicast.
Results entry (x16): Fanout (4b) | QID (20b) | IP DAddr (32b) | UDP DPort (16b) | UDP SPort (16b) | HFIndex (16b)
Per-sched entry: IP SAddr (32b) | Eth DA (48b)
A sketch of the copy loop follows.
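A minimal sketch of how Copy could walk a multicast result block, assuming the result_entry layout sketched above and the block layout from the sizing slide (16 entries per multicast block); enqueue_copy and the table name are illustrative:

    extern struct result_entry mc_table[];  /* M blocks of 16 entries each */
    extern void enqueue_copy(struct result_entry *e);

    void copy_multicast(unsigned int mc_result_index, unsigned int fanout)
    {
        struct result_entry *block = &mc_table[mc_result_index * 16];
        for (unsigned int i = 0; i < fanout && i < 16; i++) {
            /* allocate a header buffer, chain the shared payload buffer,
               then enqueue to the QM selected by block[i].fanout_qid */
            enqueue_copy(&block[i]);
        }
    }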
35
NPE Version 2 Block Diagram
[Block diagram annotated with the ring entry fields around the QM: Frame Length(16b); Buffer Handle(24b) | Stats Index(16b) | Reserved(8b); Reserved(12b) | PerSchedQID(15b) | Sch(3b) | QM(2b); Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b).]
36
QM
No change from V1, except that it incorporates the change to limit queues by packet count rather than queue length. Some changes in how Control allocates bandwidth: we need to ensure that slow HdrFmt blocks can't tie up the system; currently looking at worst-case engineering (everyone runs at the slowest block's speed). A sketch of the packet-count check follows.
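A minimal sketch of the per-queue packet-count limit, assuming a queue-parameter record with a current count and a threshold (all names are illustrative):

    struct queue_params {
        unsigned int pkt_count;  /* packets currently enqueued */
        unsigned int pkt_limit;  /* per-queue packet threshold */
    };

    int qm_admit(struct queue_params *q)
    {
        if (q->pkt_count >= q->pkt_limit)
            return 0;            /* over the limit: drop instead of enqueue */
        q->pkt_count++;
        return 1;
    }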
37
NPE Version 2 Block Diagram
[Block diagram annotated with the buffer-handle ring entries around HdrFmt/SubEncap: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b).]
38
HdrFmt / SubEncap
Inputs: buffer handle. The remaining inputs come from the buffer descriptor: multicast or unicast (from buffer_next); frame length and offset; HFIndex (index into the HFTable, a slice-specific table); ResultIndex (for tunnel headers).
Outputs: packet (buffer handle); the buffer descriptor contains updated offset and length.
Initialization: the HFTable, containing slice-specific data (unused for IPv4), and the ResultTable, holding tunnel header information.
Functionality:
- Substrate level: read the buffer descriptor and pass the frame offset, length, HFIndex, and mcast/ucast flag to the slice-specific HdrFmt.
- Slice level: arbitrary processing. For IPv4, this writes any next-hop information; except for redirects such as exception packets, it effectively does nothing.
- Encapsulate for the output tunnel (from the ResultTable).
- Update stats.
Status: revisit the multicast model; needs the internal header code (missing!).
Multicast model (current, for IPv4): strip the substrate (tunnel) headers from the input packet but leave the MNet headers; duplicate the stripped packet (with headers) via packet chaining. MNet-specific packet handling code needs to account for this (for IPv4 that's trivial, since we don't do anything). Substrate Encap adds headers to the header buffer descriptor. What if an MNet wants to change the MNet headers on a multicast packet? The only current possibility is to prepend data at the header buffer descriptor. Does this make sense?
39
NPE Version 2 Block Diagram
[Block diagram annotated with the buffer-handle ring entries around Scr2NN/Freelist: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b).]
40
Scr2NN/FreelistMgr
Inputs: buffer handle (possibly chained).
Initialization: none.
Functionality: combines the Freelist Manager with the Scr2NN glue. FM: read from the scratch ring; free buffers, correctly handling chained buffers and reference counts. Scr2NN: read from scratch, write to NN.
Status: needs to be reworked; my method of combining the two was wrong and could (probably would) deadlock. Both blocks exist, but combining them is not straightforward.
Open question: how should we prioritize among these tasks? The author should ensure that no deadlock is possible. (TxB writes to the FM; if the FM ring is full, TxB stalls. If Scr2NN is writing to TxB, it stalls too. Gridlock.) As of August 2009, we'll use a temporary 4x4 thread split and revisit later.
41
NPE Version 2 Block Diagram
[Block diagram annotated with the TxB input ring entry: Buffer Handle(24b) | Rsv(3b) | Intf(4b) | V(1b).]
42
TxB Must support chained buffers
Multicast uses header buffers and payload buffers. Headers are slice-specific, so we can't rely on known, static lengths as we did in ONL; TxB sends the header from one buffer and the payload from a chained buffer. Can Tx do this? Comments in the code seem to imply that chained (non-SOP) buffers must start at offset 0, and our payloads usually won't. This will probably take some Tx modification, but there's no reason why it won't work. Might have a performance penalty, of course... [DZar]
43
SPP V2 SideB SRAM Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list = 0000 (4b) | Reserved (4b) | Ref_Cnt (8b)
LW3: Stats Index (16b) | Reserved (4b) | Slice ID (xsid) (12b)
LW4: HFIndex (16b) | MR Exception Bits (16b)
LW5: ResultIndex (32b)
LW6: MR Bits (optional) (32b)
LW7: Packet_Next (32b)
Notes from the slide: fields are written by Rx, LookupB/Copy, the QM, and the Freelist Mgr (Buffer_Next is written by Rx and added to by Copy; Ref_Cnt is decremented by the Freelist Mgr); the MR bits are the first 4 bytes of MNet data in the payload.
44
SPP V2 SideB SRAM Buffer Descriptor
Same descriptor layout as the previous slide. HFIndex is an index into the HFTable (unused in IPv4); it may not be needed in the buffer descriptor, since SubstrateEncap can fetch it using ResultIndex. ResultIndex is used to get tunnel header info from the ResultTable. A C overlay sketch follows.
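A minimal C overlay of the eight-longword descriptor above, assuming a 32-bit unsigned int and the LW packing shown (the sub-word packing in LW2/LW3 is an assumption):

    struct buf_desc {
        unsigned int   buffer_next;       /* LW0: chain to the next buffer */
        unsigned short buffer_size;       /* LW1 */
        unsigned short offset;
        unsigned short packet_size;       /* LW2 */
        unsigned char  freelist_rsv;      /* Free_list (4b) | Reserved (4b) */
        unsigned char  ref_cnt;           /* decremented by the Freelist Mgr */
        unsigned short stats_index;       /* LW3 */
        unsigned short slice_id;          /* Reserved (4b) | Slice ID (xsid) (12b) */
        unsigned short hf_index;          /* LW4: index into the HFTable */
        unsigned short mr_exception_bits;
        unsigned int   result_index;      /* LW5: index into the ResultTable */
        unsigned int   mr_bits;           /* LW6: optional; first 4B of MNet payload data */
        unsigned int   packet_next;       /* LW7 */
    };  /* 8 LWs = 32 bytes */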
45
SPP v2 Control New data path adds new Control requirements
- Heterogeneous MNet execution times: Control must select parameters for the LCI QMs and NPUB QMs to avoid Parse and HdrFmt execution lag.
- The slice is now a partial VLAN tag: Control must ensure all VLAN tags have distinct low 11 bits.
- Filters/results are now split across NPUA and NPUB: Control must coordinate updates to multiple data locations; the synchronization issues require some care in Control.
46
SPP v2 Control NPUA Data areas requiring Control setup
- NPE MAC address at IPV4_SD_MAC_ADDR_HI32 / IPV4_SD_MAC_ADDR_lo16.
- VLAN table: used by Decap and Parse; maps VLANs to code options and data areas; a 2048-entry table at PL_SD_VLAN_CODE_OPT_TABLE_BASE:

    struct {
        unsigned int code_opt;        /* only the 4 LSBs are used */
        unsigned int slice_data_ptr;
        unsigned int slice_data_size;
    };
47
SPP v2 Control Data areas requiring Control setup VLAN Table -cont'd-
- Pointer to slice-specific SRAM areas: slice owners request the amount needed (the IPv4 code option needs 72B for counters); Control must pass along the slice owner's initialization data.
- Control can allocate in any 4B-aligned location within Bank 3 addresses 0x300000-0x7FFFFF (the upper 5MB of BANK3).
- Each slice-specific region must be at least SLICE_DATA_ENTRY_SIZE_MINIMUM (56B) in size; each code option has different additional size needs. E.g., for IPv4, 56+72 = 128B total; for i3, 3256B total.
48
SPP v2 Control Data areas requiring Control setup TCAM filters
Used by LookupA Tightly interlinked with tables on NPUB
49
SPP v2 Control NPUB Data areas requiring Control setup
- NPE source MAC address (HdrFmt/SubstrateEncap): LC_MAC_ADDR_HI32 / LC_MAC_ADDR_LO32.
- Per-slice (2048-entry) packet-limits table (LookupB/Copy) at LC_PER_SLICE_PACKET_LIMIT_BASE (a usage sketch follows this list):

    struct {
        unsigned int current;
        unsigned int maximum;
        unsigned int overLimits;
    };

- Queue Manager parameters: must properly rate limit both bandwidth and slow HdrFmt code options. No heterogeneous HdrFmt code options yet.
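A minimal sketch of how LookupB/Copy could apply the per-slice packet limit using the table above; the accessor and table names are illustrative:

    struct slice_pkt_limit {
        unsigned int current;     /* packets currently held by this slice */
        unsigned int maximum;     /* configured per-slice limit */
        unsigned int overLimits;  /* drops caused by the limit */
    };

    extern struct slice_pkt_limit slice_limits[2048];  /* at LC_PER_SLICE_PACKET_LIMIT_BASE */

    int slice_admit(unsigned int slice_id)
    {
        struct slice_pkt_limit *s = &slice_limits[slice_id & 0x7FF];
        if (s->current >= s->maximum) {
            s->overLimits++;
            return 0;   /* drop */
        }
        s->current++;
        return 1;
    }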
50
SPP v2 Control NPUB Data areas requiring Control setup Result Table
- Used by LookupB/Copy and HdrFmt/SubstrateEncap.
- Holds the results corresponding to TCAM lookups.
- Links to the per-QM-scheduler tunnel endpoint values.
- Also links to the per-slice HdrFmt data areas.
51
Filters and Results Slice owner maps filters to results
- A filter is a 144b key; the first 32b is the substrate's Meta-Interface ID, and the slice owner controls the remaining 112b.
- Results have multiple pieces: the type (unicast/multicast); the output QID(s) (associated with a Meta-Interface; Control translates the slice's representation to the substrate's tunnel); and an index into the slice's data in the HFTable for Header Format to use.
52
Adding (Multicast) Filter
Slice owner view: a slice owns x filters, y unicast results, and z multicast results.
1. Add filter <Meta-Interface In, IP DA, IP SA, DPort, SPort, Proto> with result <Type=Multicast, Result R (in 1..z)>; Result R = <Fanout, Meta-Interface, Index> [up to 16 entries].
2. Update the HFTable (index, length, bytestream).
[Diagram: Control maps and range-validates the request, writes the slice/RxID/DPort key and ResultIndex into the TCAM, and copies up to 16 multicast ResultTable entries (fanout; qm/sched/qid; tunnel DA/DPort/SPort; HFIndex), filling the tunnel SA, Eth DA, and next hop from the local "subnet" data; the HFTable holds 32 opaque entries.]
53
Filters and Results First, some things to remember:
- This is the NPE: we are supporting protocols that may not be IP!
- The order of filters in a TCAM database defines those filters' priority; lower-numbered filter indices are higher priority.
- The TCAM filter lookup is done on the A-side. The TCAM filter result gives us a pointer to a full result which resides in SRAM on the B-side. Thus the A-side filter and the B-side result need not be a 1-1 mapping; we could have many filters using the same B-side result.
- We support unicast and multicast filters and results. Multicast supports a maximum fanout of 16.
54
Filters and Results (continued)
Slice owners allocate N unicast filters and M multicast filters. They get:
- N+M filter ids (0 .. N+M-1), contiguous in the TCAM; the order in the TCAM indicates priority, with a lower id meaning higher priority.
- N unicast result indices (0 .. N-1), contiguous in the unicast portion of the Result table.
- M multicast result indices (0 .. M-1), contiguous in the multicast portion of the Result table.
The filter id and result index (unicast or multicast) are referenced separately; for example, filter id 4 might use unicast result index 12.
Unicast and multicast filters can be mingled in the TCAM. Remember: order in the TCAM is important. Example: a unicast catch-all (all wildcards) filter should probably be the LAST filter in a slice's set of filters so it does not override other filters, including multicast filters.
55
Filters and Results (continued)
- Slice owners will have the ability to disable a filter: Control removes the filter from the TCAM (LookupA), and the result is left on NPUB for "in-flight" packets.
- Slice owners can also remove a filter: this also deletes the results from the B-side.
56
Filter / Result Operations
TCAM result (A-side): Type (1b) | ResultIndex (31b) | MC Result BitMask (16b) | Stats Index (16b).
Result Table (B-side): the unicast portion holds N results at 16B per result; the multicast portion holds M blocks of 16 results each, 16B per result.
Result entry (B-side), 16B: Valid (1b) | QID (20b) | Pad (11b) | IP DAddr (32b) | UDP DPort (16b) | UDP SPort (16b) | HFIndex (16b) | Pad (16b).
Sizing if we use an entire SRAM bank: SRAM banks are 8MB, the result size is 16B, and the TCAM has 128K 144b entries, so N + M = 128K and (N + 16*M) * 16 <= 8MB, giving N = M = 26214.
57
Filter / Result Operations
Result entry (B-side): Valid (1b) | QID (20b) | IP DAddr (32b) | UDP DPort (16b) | UDP SPort (16b) | HFIndex (16b).
TCAM result (A-side): Type (1b) | ResultIndex (31b) | Result BitMask (16b) | Stats Index (16b).
Operations:
add_mc_filter(fid, RxMI, Key, Mask, mcResultIndex, statIndex)
update_mc_filter(fid, mcResultIndex, resultMask)
add_mc_result(fid, mcResultIndex, entryIndex, Qinfo, DestInfo)
update_mc_result(fid, mcResultIndex, entryIndex, Qinfo, DestInfo)
remove_mc_filter(fid)
remove_mc_result(mcResultIndex)
add_uc_filter(fid, RxMI, Key, Mask, ucResultIndex, statIndex)
update_uc_filter(fid, ucResultIndex, statIndex)
add_uc_result(fid, ucResultIndex, Qinfo, DestInfo)
update_uc_result(fid, ucResultIndex, Qinfo, DestInfo)
remove_uc_filter(fid)
remove_uc_result(ucResultIndex)
58
Multicast Filter / Result Operations
- add_mc_filter(fid, RxMI, Key, mcResultIndex, resultMask, statIndex): adds a multicast filter to the TCAM.
- update_mc_filter(fid, mcResultIndex, resultMask, statIndex): updates (re-writes) the TCAM result.
- add_mc_result(mcResultIndex, entryIndex, Qinfo, DestInfo): writes an MC result entry into the Result Table and marks the result as valid.
- update_mc_result(mcResultIndex, entryIndex, Qinfo, DestInfo): updates (re-writes) an MC result entry in the Result Table. (The implementation will almost certainly be the same as add_mc_result, so why have both?)
- remove_mc_filter(fid): removes the filter from the TCAM; leaves the B-side results unchanged.
- remove_mc_result(mcResultIndex): invalidates a multicast filter result.
A sketch of a plausible call order follows.
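A minimal sketch of one plausible installation order for a 4-way multicast filter using the operations above: writing the B-side result entries before installing the A-side TCAM filter would ensure no packet can match a filter whose results are not yet valid (the deck does not mandate this order). All types and values here are illustrative; Qinfo and DestInfo are unspecified in the deck:

    typedef unsigned int Qinfo;            /* placeholder: QM/Sched/Queue ID */
    typedef unsigned long long DestInfo;   /* placeholder: tunnel + MAC info */

    extern void add_mc_result(unsigned int mcResultIndex, unsigned int entryIndex,
                              Qinfo q, DestInfo d);
    extern void add_mc_filter(unsigned int fid, unsigned int rxMI,
                              const unsigned char key[18],  /* 144b key */
                              unsigned int mcResultIndex,
                              unsigned short resultMask, unsigned int statIndex);

    void install_mc_filter_example(Qinfo q[4], DestInfo d[4],
                                   const unsigned char key[18])
    {
        unsigned int fid = 7, mcResultIndex = 3, statIndex = 42;  /* illustrative */
        for (unsigned int i = 0; i < 4; i++)          /* fanout of 4 copies */
            add_mc_result(mcResultIndex, i, q[i], d[i]);
        add_mc_filter(fid, /* RxMI */ 1, key, mcResultIndex,
                      /* resultMask */ 0x000F, statIndex);
    }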
59
Unicast Filter / Result Operations
- add_uc_filter(fid, RxMI, Key, ucResultIndex, statIndex): adds a unicast filter to the TCAM.
- update_uc_filter(fid, ucResultIndex, statIndex): updates (re-writes) the TCAM result.
- add_uc_result(ucResultIndex, Qinfo, DestInfo): writes a UC result entry into the Result table and marks the result as valid.
- update_uc_result(ucResultIndex, Qinfo, DestInfo): updates (re-writes) a UC result entry in the Result table. (The implementation will almost certainly be the same as add_uc_result, so why have both?)
- remove_uc_filter(fid): removes the filter from the TCAM; leaves the B-side results unchanged.
- remove_uc_result(ucResultIndex): invalidates a unicast filter result.
60
Extra Slides The rest of the slides are old or for extra information
61
Design Questions Small hole for abuse in HdrFmt
- The QM rate limits on payload length, but HdrFmt (after the QM) can vastly increase the packet length. Should the LookupB table give the padding size for each entry, enforced in SubEncap? ANSWER: No; we will resort to our control of HdrFmt to force it to behave. (We write all of the code options right now.)
- What are the best places to update stats on NPUB? ANSWER: post-queue only.
- Is there any remaining reason that NPUB would need the source tunnel information? ANSWER: No. If a code option needs it, put it into the opaque slice data.
62
Questions/Issues 4/28/08: How many code options?
Limit of 16?
To handle slow code options:
- LCI queues would control traffic to fast/slow Parse code; classes of code options are defined by how long their Parse code takes, and a scheduler is assigned to each class of code option.
- NPE queues would control traffic to fast/slow HF code.
- LCE queues control the output rate to interfaces.
Multicast problems: what is the impact of multicast traffic overloading Lookup/Copy and becoming a bottleneck?
Rx on SideB: can it use an SRAM output ring? All our other 10G Rx blocks have NN output rings.
Option for HF to send out additional packets?
How do we pass the MR and substrate headers to TxB: through the ring, or through the header buffer associated with the header buffer descriptor? If the latter, what are the constraints in Tx for buffer chaining?
63
Meeting Notes 1/15/08:
- QM: add a packet count to the queue params; change the limit from QLen to PktCount.
- Add a per-slice packet limit to NPUA and NPUB.
- Limit fanout to 16.
- MCast: Control will allocate all 16 entries for a multicast result entry; a result entry will be typed as multicast or unicast and will not transition from one to the other.
- What happens to packets in queues when a route change sends that flow's packets to a different interface and queue? Packet ordering problems?
64
NPE Version 2 Block Diagram
[Earlier annotated block diagram: slice#, resultIndx, etc. are passed in the shim; Lookup produces resultIndx and statsIndx; for unicast, resultIndx is replaced by a QiD, allowing the output side to skip the lookup; a lookup on <slice#, resultIndx> yields the fanout and a list of QiDs, and the packet is copied to the queues with a copy# added (slice# and resultIndx remain in the packet buffer); slice# selects the slice that formats the packet and resultIndx gives the next hop. NPUA combines Decap, Parse, LookupA, and AddShim in 8 MEs; "flow control?" is marked between TxA and RxB.]
65
Questions/Issues
- Where are the exit and entry points for packets sent to and from the GPE for exception processing? Parse (NPUA) and LookupA (NPUA) are where most exceptions are generated (IP options, no route, etc.); HdrFormat (NPUB) is where we do the Ethernet header processing.
- What needs to be in the SHIM going from NPUA to NPUB? ResultIndex (32b), Exception Bits (12b), StatsIndex (16b), Slice# (12b), ???
- Will we support multi-copy in a way similar to the ONL Router? How big can the fanout be? How many QIDs need to be stored with the LookupB result? Is there some encoding for the QIDs that can take into account support for multicast and the copy#? For example (see the sketch after this list):
  - Multicast QID (20b): Multicast (1b) = 1; Copy# (4b); PerMulticast QID (15b), with one PerMulticast QID allocated for each multicast.
  - Unicast QID (20b): Unicast (1b) = 0; QID (19b).
- Are there timing/synchronization issues with adding, deleting or changing lookup entries between the two NPUs' databases?
- Do we need flow control between TxA and RxB?
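A minimal sketch of the QID encoding proposed in the list above (20 bits total); the function names are illustrative:

    #define QID_MULTICAST_BIT  (1u << 19)

    /* Multicast (1b) = 1 | Copy# (4b) | PerMulticast QID (15b) */
    unsigned int make_mc_qid(unsigned int copy_num, unsigned int per_mc_qid)
    {
        return QID_MULTICAST_BIT | ((copy_num & 0xF) << 15) | (per_mc_qid & 0x7FFF);
    }

    /* Unicast (1b) = 0 | QID (19b) */
    unsigned int make_uc_qid(unsigned int qid)
    {
        return qid & 0x7FFFF;
    }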
66
NPE Version 2 Block Diagram
[Earlier block diagram.] NPUA:
- RxA: same as Version 0.
- TxA: new 10Gb/s.
- Decap: same as Version 0.
- Parse: same as Version 0; new code options?
- LookupA: results will be different from Version 0.
- AddShim: new.
67
NPE Version 2 Block Diagram
NPUB:
- RxB: same as Version 0.
- TxB: new 10Gb/s, with the L2 header coming in on the input ring?
- LookupB: new.
- Copy: new; may be able to use some code from the ONL Copy.
- QM: new, decoupled from links.
- HF: new; may use some code from Version 0.
[Earlier block diagram repeated.]
68
NPE Version 2 Block Diagram
[Earlier block diagram with separate Sram2NN (1 ME), FreeList MgrA (1 ME), FreeList MgrB (1 ME), and Scr2NN (1 ME) blocks.] NPUB has 17 MEs currently spec'ed.
69
SPP V2: MR Specific Code What about LookupA and LookupB?
Where does the MR-specific code reside in V2? In Parse and HdrFormat.
- What about LookupA and LookupB? Lookup is a "service" provided to the MRs by the substrate; no MR-specific code is needed in LookupA or LookupB.
- What about SideA AddShim? The exception bits that go in the shim are MR-specific, but they are passed to AddShim and it writes them into the shim; no MR-specific code is needed in AddShim.
- What about SideB Copy? Is there anything MR-specific about setting up multiple copies of a packet? There shouldn't be. We will have the Copy block allocate a new header buffer descriptor, link it to the existing data buffer descriptor, and take care of reference counts; the actual building of the new header(s) for the copies is left to HF. No MR-specific code is needed in Copy.
70
SPP V2: Hdr Format Lots of changes for HF:
- Move behind the QM.
- Be more general: support multiple source IP addresses; general support for tunnels, eventually of different kinds (UDP/IP, GRE, ...)?
- Support multicast: deal with header buffer descriptors; read the Fanout table.
- The substrate portion of HF will need to do a Decap-type table lookup: Slice ID -> (code option, slice memory pointer, slice memory size).
HF gets a buffer descriptor from the QM. The substrate portion of HF must determine: the code option (8b); the Slice ID (12b); the location of the next-hop information (20b-32b); LD vs. FWD; and the stats index (16b). (Should HF do this, or the QM?) The MR portion of HF must determine the exception bits (16b).
Let's put all of the above data in the buffer descriptor; LookupB/Copy will need to write it there based on what comes across from SideA in the shim.
71
SPP V2: Result We need to be much more general in our support for Tunnels, Interfaces, MetaInterfaces, and Next Hops. SideB Result: Interface IP SAddr (32b) Eth MAC DAddr (48b) (LC, GPE1, GPE2, …, GPEn) SchedulerId (8b): which QM should handle pkt TxMI: IP Sport (16b) TxNextHop: IP DAddr (32b) IP DPort (16b)
72
Data Areas: where are the tables, and what data is transmitted from SideA to SideB?
- SideA tables
- The shim between SideA and SideB
- SideB tables
73
Pkt Processing Data and Tables
SideA:
- MR/Slice table: generated by Control; used by substrate Decap to retrieve an MR/Slice's parameters; indexed by SliceId == VLAN. Contains: code option; slice memory ptr; slice memory size; ???
- TCAM (used by LookupA): Key: ... Result: ...
74
Data Areas Shim between SideA and SideB
The shim is written to the DRAM buffer to be sent from SideA to SideB. It contains:
- resultIndex (32b): generated by Control; the result of the TCAM lookup on SideA; translates into an SRAM address on SideB.
- exceptionBits (12b): generated by SideA Parse/Lookup; used by SideB HF.
- statsIndex (16b): used by SideA Lookup/AddShim to increment counters; by SideB Lookup/Copy to increment the pre-queue counters (or perhaps SideA holds the pre-queue counters); and by SideB HF or the QM to increment the post-queue counters.
- sliceId (12b): the result of the Decap read of the Ethernet header (VLAN).
- ??? codeOption (4b); Slice Memory Ptr (32b).
75
Data Areas SideB Data Buffer Descriptor Hdr Buffer Descriptor
SideB:
- Data buffer descriptor / header buffer descriptor: used for multi-copy packets. SPP V2 may require Tx to handle multi-buffer packets; it is unclear if we can cleanly do the same thing that we do with ONL, where HF passes the Ethernet header to Tx. We may also need support for MR-specific per-copy data.
- Results Table: generated by Control; used by LookupB/Copy and HF (should HF get its per-copy info from here as well?). Contains: Fanout (if the fanout is > 1 we can overload some of the following fields with a pointer into a Fanout table); QID; InterfaceId; TxMI Id (probably doesn't help to make this an index into a table for UDP tunnels, since a UDP port is 16 bits, but for tunnels other than UDP it may help?); Tx NextHop Id (an index into a table of tunnel next hops).
76
Data Areas (continued)
SideB (continued):
- Fanout Table: generated by Control; used by LookupB/Copy and HF. Contains: QID[Fanout]; InterfaceId; TxMI Id; Tx Next Hop Id[Fanout].
- Implementation choices: one contiguous block of memory, fixed-size or variable-sized; chained with one set of values per entry; or chained with N (N=4?) sets of values per entry.