An NP-Based Router for the Open Network Lab


An NP-Based Router for the Open Network Lab
Jon Turner, with Patrick Crowley, John DeHart, Brandon Heller, Fred Kuhns, Jing Lu, Mike Wilson, Charlie Wiseman, Dave Zar

Issues and Questions
- Drop counters
- What is our performance target? 5-port router, full link rates.
- How should SRAM banks be allocated?
  - How many packets should be able to be resident in the system at any given time?
  - How many queues do we need to support? Etc.
- How will lookups be structured?
  - One operation across multiple DBs vs. multiple operations, each on one DB
  - Will results be stored in Associated Data SRAM or in one of our SRAM banks?
  - Can we use SRAM Bank0 and still get the throughput we want?
- Multicast:
  - Are we defining how an ONL user should implement multicast, or are we just trying to provide some mechanisms to allow ONL users to experiment with multicast?
  - Do we need to allow a unicast lookup with one copy going out and one copy going to a plugin? If so, this would use the NH_MAC field and the copy vector field.
- Plugins:
  - Can they send pkts directly to the QM instead of always going back through Parse/Lookup/Copy?
  - Use of NN rings between Plugins to do plugin chaining
  - Plugins should also be able to write to the Stats module ring, to utilize stats counters as they want.
- Continued on next slide…

Issues and Questions (continued)
- XScale:
  - Can it send pkts directly to the QM instead of always going through the Parse/Lookup/Copy path?
  - ARP request and reply?
  - What else will it do besides handling ARP?
  - Do we need to guarantee in-order delivery of packets for a flow that triggers an ARP operation? A re-injected packet may be behind a recently arrived packet for the same flow.
- What is the format of our Buffer Descriptor?
  - Add Reference Count (4 bits)
  - Add MAC DAddr (48 bits)
  - Does the Packet Size or Offset ever change once written? Plugins: can they change the packet? Other?
- How will we write L2 headers for multicast packets? If we are going to do this for multicast, we will do it for all packets, right?
  - Copy writes MAC DAddr into the buffer descriptor
  - HF reads MAC DAddr from the buffer descriptor
  - HF writes the full L2 header into the scratch ring data for Tx
  - Tx takes the L2 header data (14 bytes) from the scratch ring and writes it to the TBUF
  - Tx initiates the transfer of the rest of the packet from DRAM to the TBUF
- Continued on next slide…

Issues and Questions (continued)
- How will we manage the free list?
  - Support for multicast (ref count in buf desc) makes reclaiming buffers a little trickier.
  - Scratch ring to a separate ME
  - Modify dl_buf_drop() (a sketch of a ref-counted drop follows this slide)
  - Performance assumptions of blocks that do drops may have to be changed if we add an SRAM operation to a drop
  - Note: the test_and_decr SRAM atomic operation returns the pre-modified value
- Usage scenarios:
  - It would be good to document some typical ONL usage examples. This might just be extracting some material from existing ONL documentation and class projects. Ken?
  - It might also be good to document a JST dream sequence for an ONL experiment. Oh my, what have I done now…
- Do we need to worry about balancing MEs across the two clusters?
  - QM and Lookup are probably the heaviest SRAM users
  - Rx and Tx are probably the heaviest DRAM users
  - Plugins need to be in neighboring MEs
  - QM and HF need to be in neighboring MEs
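A minimal sketch of the ref-counted drop discussed above. The real code would use the IXP test_and_decr SRAM atomic (which returns the pre-modified value) and hand freed buffers to the Freelist Mgr ME over a scratch ring; here both are stood in by plain C stubs so the control flow is clear, and this is not the actual dl_buf_drop() code.

```c
#include <stdint.h>
#include <stdio.h>

static uint32_t ref_cnt[4];                    /* stand-in for the Ref_Cnt word of each descriptor */

/* Stand-in for the SRAM test_and_decr atomic: returns the value
 * *before* the decrement, as the hardware operation does. */
static uint32_t test_and_decr(uint32_t idx)
{
    return ref_cnt[idx]--;
}

static void freelist_mgr_enqueue(uint32_t buf)  /* stand-in for the scratch-ring put to the Freelist Mgr */
{
    printf("buffer %u returned to free list\n", buf);
}

/* Modified drop path: only the dropper that sees the count reach zero
 * actually returns the buffer; earlier copies just decrement. */
static void onl_buf_drop(uint32_t buf)
{
    if (test_and_decr(buf) == 1)
        freelist_mgr_enqueue(buf);
}

int main(void)
{
    ref_cnt[0] = 3;                            /* e.g. a 3-way multicast copy */
    onl_buf_drop(0);
    onl_buf_drop(0);
    onl_buf_drop(0);                           /* only this call frees the buffer */
    return 0;
}
```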

Performance
- What is our performance target? To hit a 5 Gb/s rate:
  - Minimum Ethernet frame: 76B (64B frame + 12B inter-frame spacing)
  - 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mpkt/sec
- IXP ME processing: 1.4 GHz clock rate
  - 1.4 Gcycle/sec * 1 sec / 8.22 Mpkt = 170.3 cycles per packet
- Compute budget (MEs * 170):
  - 1 ME: 170 cycles
  - 2 MEs: 340 cycles
  - 3 MEs: 510 cycles
  - 4 MEs: 680 cycles
- Latency budget (threads * 170):
  - 1 ME, 8 threads: 1360 cycles
  - 2 MEs, 16 threads: 2720 cycles
  - 3 MEs, 24 threads: 4080 cycles
  - 4 MEs, 32 threads: 5440 cycles
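A back-of-envelope check of the budgets quoted above: a 76-byte minimum frame (64B + 12B inter-frame spacing) at 5 Gb/s against 1.4 GHz MEs with 8 threads each. The slide rounds 170.3 cycles/packet to 170 for the per-ME budgets.

```c
#include <stdio.h>

int main(void)
{
    const double pkt_rate = 5e9 / (76 * 8);        /* ~8.22 Mpkt/s */
    const double cyc_pkt  = 1.4e9 / pkt_rate;      /* ~170.3 ME cycles per packet */

    printf("packet rate: %.2f Mpkt/s, %.1f cycles/pkt\n", pkt_rate / 1e6, cyc_pkt);
    for (int mes = 1; mes <= 4; mes++)
        printf("%d ME: compute budget %.0f cycles, latency budget %.0f cycles (%d threads)\n",
               mes, mes * cyc_pkt, mes * 8 * cyc_pkt, mes * 8);
    return 0;
}
```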

ONL NP Router (Jon's Original)
[Block diagram: xScale, TCAM, SRAM, Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (3 MEs), Queue Manager (1 ME), HdrFmt (1 ME), Tx (2 ME), Stats (1 ME), five Plugins, with large SRAM rings added on the plugin and xScale paths.]
- Each output has a common set of QiDs
- Multicast copies use the same QiD for all outputs
- QiD is ignored for plugin copies

Design Configuration
- Add NN rings between Plugins for chaining
- Add Plugin write to the QM scratch ring
- Tx is only 1 ME
- Add a Freelist Mgr ME

ONL NP Router
[Block diagram of the revised design: xScale, TCAM, Assoc. Data ZBT-SRAM, SRAM, Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), Freelist Mgr (1 ME), and Plugins 1-5 chained by NN rings, connected by scratch rings and SRAM rings among Rx, Mux, Parse/Lookup/Copy, QM, Plugins, Tx, Stats, and the XScale.]

ONL Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list = 0000 (4b) | Reserved (12b)
LW3: MAC DAddr_47_32 (16b) | Stats Index (16b)
LW4: MAC DAddr_31_00 (32b)
LW5: Reserved (28b) | Ref_Cnt (4b)
LW6: Reserved (32b)
LW7: Packet_Next (32b)
Fields are written by the Freelist Mgr, Rx, Copy, and QM (color-coded on the original slide).
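A C view of the 8-word (32-byte) descriptor above, as a readability sketch only: bitfield packing is compiler-dependent, this is not a drop-in microengine declaration, and the per-field writer comments are my reading of the slide's color legend (Freelist Mgr, Rx, Copy, QM).

```c
#include <stdint.h>

typedef struct onl_buf_desc {
    uint32_t buffer_next;                  /* LW0: freelist link (Freelist Mgr) */

    uint32_t buffer_size   : 16;           /* LW1: written by Rx */
    uint32_t offset        : 16;

    uint32_t packet_size   : 16;           /* LW2 */
    uint32_t free_list     : 4;            /* 0000 */
    uint32_t reserved2     : 12;

    uint32_t mac_daddr_hi  : 16;           /* LW3: MAC DAddr[47:32] (Copy) */
    uint32_t stats_index   : 16;

    uint32_t mac_daddr_lo;                 /* LW4: MAC DAddr[31:0] (Copy) */

    uint32_t reserved5     : 28;           /* LW5 */
    uint32_t ref_cnt       : 4;            /* multicast reference count */

    uint32_t reserved6;                    /* LW6 */
    uint32_t packet_next;                  /* LW7: packet chaining (QM) */
} onl_buf_desc_t;
```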

MR Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list = 0000 (4b) | Reserved (4b) | Reserved (8b)
LW3: Reserved (16b) | Stats Index (16b)
LW4: Reserved (16b) | Reserved (8b) | Reserved (4b) | Reserved (4b)
LW5: Reserved (4b) | Reserved (4b) | Reserved (32b)
LW6: Reserved (16b) | Reserved (16b)
LW7: Packet_Next (32b)

Intel Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list (4b) | Rx_stat (4b) | Hdr_Type (8b)
LW3: Input_Port (16b) | Output_Port (16b)
LW4: Next_Hop_ID (16b) | Fabric_Port (8b) | Reserved (4b) | NHID type (4b)
LW5: ColorID (4b) | Reserved (4b) | FlowID (32b)
LW6: Class_ID (16b) | Reserved (16b)
LW7: Packet_Next (32b)

SRAM Usage
What will be using SRAM?
- Buffer descriptors
  - Current MR supports 229,376 buffers
  - 32 bytes per SRAM buffer descriptor
  - 7 MBytes
- Queue descriptors
  - Current MR supports 65,536 queues
  - 16 bytes per queue descriptor
  - 1 MByte
- Queue parameters
  - 16 bytes per queue params (actually only 12 used in SRAM)
- QM scheduling structure
  - Current MR supports 13,109 batch buffers per QM ME
  - 44 bytes per batch buffer
  - 576,796 bytes
- QM port rates: 4 bytes per port
- Plugin "scratch" memory: how much per plugin?
- Large inter-block rings: Rx -> Mux, -> Plugins, Plugins ->
- Stats/Counters: currently 64K sets, 16 bytes per set: 1 MByte
- Lookup results
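A quick size check of the figures above. The 65,536-entry count used for the queue params line is an assumption inferred from the 1 MB allocated to them on the bank-allocation slide; the other counts are quoted directly.

```c
#include <stdio.h>

int main(void)
{
    printf("buffer descriptors : %u B\n", 229376u * 32u);  /* 7,340,032 B  ~ 7 MB  */
    printf("queue descriptors  : %u B\n", 65536u  * 16u);  /* 1,048,576 B  = 1 MB  */
    printf("queue params       : %u B\n", 65536u  * 16u);  /* 1 MB allocated, 12 B of 16 used each */
    printf("QM sched structure : %u B\n", 13109u  * 44u);  /* 576,796 B per QM ME  */
    printf("stats counters     : %u B\n", 65536u  * 16u);  /* 64K sets * 16 B = 1 MB */
    return 0;
}
```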

SRAM Bank Allocation
- SRAM banks:
  - Same interface/bus as the TCAM
  - Banks 1-3: 8 MB each
- Criteria for how SRAM banks should be allocated?
  - Size
  - SRAM bandwidth: how many SRAM accesses per packet are needed for the various SRAM uses?
  - QM needs buffer descriptors and queue descriptors in the same bank

SRAM Accesses Per Packet
To support 8.22 M pkts/sec we can have 24 reads and 24 writes per pkt (200M/8.22M).
- Rx:
  - SRAM dequeue (1 word): to retrieve a buffer descriptor from the free list
  - Write buffer desc (2 words)
- Parse
- Lookup:
  - TCAM operations
  - Reading results
- Copy:
  - Write buffer desc (3 words): Ref_cnt, MAC DAddr, Stats Index
  - Pre-Q stats increments: read 2 words, write 2 words
- HF: should not need to read or write any of the buffer descriptor
- Tx: read buffer desc (4 words)
- Freelist Mgr:
  - SRAM enqueue – write 1 word: to return the buffer descriptor to the free list

QM SRAM Accesses Per Packet
QM (worst-case analysis)
- Enqueue (assume the queue is idle and not loaded in the Q-Array):
  - Write Q-Desc (4 words): eviction of the least recently used queue
  - Write Q-Params? When we evict a queue, do we need to write its params back? The Q-Length is the only thing the QM is changing. It looks like the QM writes it back every time it enqueues or dequeues AND when it evicts (we can probably remove the write on eviction).
  - Read Q-Desc (4 words)
  - Read Q-Params (3 words): Q-Length, Threshold, Quantum
  - Write Q-Length (1 word)
  - SRAM enqueue – write (1 word)
  - Scheduling structure accesses? They are done once every 5 pkts (when running at full rate)
- Dequeue (assume the queue is not loaded in the Q-Array):
  - See the notes in the enqueue section
  - SRAM dequeue – read (1 word)
  - Post-Q stats increments: 2 reads, 2 writes

QM SRAM Accesses Per Packet
QM (worst-case analysis) – total per-pkt accesses:
- Queue descriptors and buffer enq/deq: write 9 words, read 9 words
- Queue params: write 2 words, read 6 words
- Scheduling structure accesses per iteration (batch of 5 packets):
  - Advance head: read 11 words
  - Write tail: write 11 words
  - Update freelist: read 2 words OR write 5 words

Proposed SRAM Bank Allocation
- TCAM: Lookup results
- SRAM Bank 1 (2.5 MB / 8 MB):
  - QM queue params (1 MB)
  - QM scheduling struct (0.5 MB)
  - QM port rates (20 B)
  - Large inter-block rings (1 MB); SRAM rings come in sizes (in words) of 0.5K, 1K, 2K, 4K, 8K, 16K, 32K, 64K (a size check follows this slide):
    - Rx -> Mux (2 words per pkt): 32KW (16K pkts): 128 KB
    - -> Plugin (3 words per pkt): 32KW each (10K pkts each): 640 KB
    - Plugin -> (3 words per pkt): 64KW (20K pkts): 256 KB
- SRAM Bank 2 (8 MB / 8 MB):
  - Buffer descriptors (7 MB)
  - Queue descriptors (1 MB)
- SRAM Bank 3 (6 MB / 8 MB):
  - Stats counters (1 MB)
  - Plugin "scratch" memory (5 MB, 1 MB per plugin)
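A sanity check that the quoted ring sizes fill exactly the 1 MB carved out of Bank 1 (1 word = 4 bytes). Reading the second and third entries as "one ring to each of the five plugins" and "one shared ring back from the plugins" is an assumption based on the 640 KB and 256 KB totals.

```c
#include <stdio.h>

int main(void)
{
    unsigned rx_to_mux    = 32 * 1024 * 4;        /* 128 KB: 2 words/pkt -> ~16K pkts */
    unsigned to_plugins   = 5 * 32 * 1024 * 4;    /* 640 KB: one 32KW ring per plugin */
    unsigned from_plugins = 64 * 1024 * 4;        /* 256 KB: shared 64KW return ring  */

    printf("total ring space: %u KB of 1024 KB\n",
           (rx_to_mux + to_plugins + from_plugins) / 1024);   /* prints 1024 */
    return 0;
}
```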

Lookups
How will lookups be structured?
- Three databases:
  - Route Lookup: containing unicast and multicast entries
    - Unicast:
      - Port: can be wildcarded
      - Longest prefix match on DAddr
      - Routes should be sorted in the DB with longest prefixes first.
    - Multicast:
      - Port: can be wildcarded?
      - Exact match on DAddr
      - Longest prefix match on SAddr
      - Routes should be sorted in the DB with longest prefixes first.
  - Primary Filter: filters should be sorted in the DB with higher-priority filters first
  - Auxiliary Filter
- Will results be stored in Associated Data SRAM or in one of our external SRAM banks? Can we use SRAM Bank0 and still get the throughput we want?
- Priority between Primary Filter and Route Lookup:
  - A priority will be stored with each Primary Filter
  - A priority will be assigned to RLs (all routes have the same priority)
  - The PF priority and RL priority are compared after the results are retrieved; one of them is selected based on this comparison.
- Auxiliary Filters: if matched, they cause a copy of the packet to be sent out according to the Aux Filter's result.

TCAM Operations for Lookups
Five TCAM operations of interest:
- Lookup (Direct): 1 DB, 1 result
- Multi-Hit Lookup (MHL) (Direct): 1 DB, <= 8 results
- Simultaneous Multi-Database Lookup (SMDL) (Direct): 2 DBs, 1 result each
  - The DBs must be consecutive! Care must be given when assigning segments to DBs that use this operation; there must be a clean separation of even and odd DBs and segments.
- Multi-Database Lookup (MDL) (Indirect): <= 8 DBs, 1 result each
- Simultaneous Multi-Database Lookup (SMDL) (Indirect): functionally the same as the Direct version, but key presentation and DB selection are different. The DBs need not be consecutive.

Lookups
Route Lookup:
- Key (68b):
  - Port/Plugin (4b): can be a wildcard for unicast; probably can't be a wildcard for multicast
  - DAddr (32b): prefixed for unicast, exact match for multicast
  - SAddr (32b): unicast entries always have this and its mask 0; prefixed for multicast
- Result (72b):
  - Port/Plugin (4b): one of 5 ports or 5 plugins
  - QID (17b)
  - NH_IP/NH_MAC/CopyVector (48b): at most one of NH_IP, NH_MAC, or CopyVector should be valid
  - Valid bits (3b): at most one of the following three bits should be set
    - MCast Valid (1b)
    - NH_IP_Valid (1b)
    - NH_MAC_Valid (1b)
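A sketch of the route-lookup key and result as C structs, following the field widths above (68-bit key, 72-bit result). The packing and field names are illustrative only, not the microengine declarations; the 48-bit next-hop field is carried in a full 64-bit word for portability.

```c
#include <stdint.h>

typedef struct route_key {          /* 68 bits used */
    uint32_t port_plugin : 4;       /* may be wildcarded for unicast */
    uint32_t daddr;                 /* prefix match (unicast), exact match (multicast) */
    uint32_t saddr;                 /* 0/masked for unicast, prefix for multicast */
} route_key_t;

typedef struct route_result {       /* 72 bits used */
    uint32_t port_plugin  : 4;      /* one of 5 ports or 5 plugins */
    uint32_t qid          : 17;
    uint32_t mcast_valid  : 1;      /* valid bits: at most one of the three set */
    uint32_t nh_ip_valid  : 1;
    uint32_t nh_mac_valid : 1;
    uint64_t nh;                    /* 48 bits used: NH_IP, NH_MAC, or copy vector */
} route_result_t;
```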

Lookups
Filter Lookup:
- Key (136b):
  - Port/Plugin (4b): can be a wildcard for unicast; probably can't be a wildcard for multicast
  - DAddr (32b)
  - SAddr (32b)
  - Protocol (8b)
  - DPort (16b)
  - SPort (16b)
  - TCP Flags (12b)
  - Exception bits (16b): allow for directing of packets based on defined exceptions
- Result (84b):
  - NH IP (32b) / MAC (48b) / CopyVector (10b) field (48b): at most one of NH_IP, NH_MAC, or CopyVector should be valid
  - QID (17b)
  - LD (1b): send to XScale
  - Drop (1b): drop pkt
  - Valid bits (3b): at most one of the following three bits should be set
    - NH IP Valid (1b)
    - NH MAC Valid (1b)
    - MCast Valid (1b)
  - Sampling bits (2b): for Aux Filters only
  - Priority (8b): for Primary Filters only

TCAM Core Lookup Performance
- Lookup/core size of 72 or 144 bits (routes, filters), Freq = 200 MHz
- The CAM core can support 100M searches per second
- For 1 router on each of NPUA and NPUB:
  - 8.22 Mpkt/s per router
  - 3 searches per pkt (Primary Filter, Aux Filter, Route Lookup)
  - Total per router: 24.66 M searches per second
  - TCAM total: 49.32 M searches per second
- So, the CAM core can keep up. Now let's look at the LA-1 interfaces…

TCAM LA-1 Interface Lookup Performance
- Lookup/core size of 144 bits (routes, filters; ignore for now that the route size is smaller)
- Each LA-1 interface can support 40M searches per second
- For 1 router on each of NPUA and NPUB (each NPU uses a separate LA-1 interface):
  - 8.22 Mpkt/s per router
  - Maximum of 3 searches per pkt (Primary Filter, Aux Filter, Route Lookup); the maximum of 3 assumes they are each done as a separate operation
  - Total per interface: 24.66 M searches per second
- So, the LA-1 interfaces can keep up. Now let's look at the AD SRAM results…

TCAM Assoc. Data SRAM Results Performance
- 8.22M 72b or 144b lookups:
  - 32b results consume 1/12
  - 64b results consume 1/6
  - 128b results consume 1/3
- Lookup/core size of 72 or 144 bits (routes, filters), Freq = 200 MHz, SRAM result size of 128 bits
- The Associated SRAM can support up to 25M searches per second
- For 1 router on each of NPUA and NPUB:
  - 8.22 Mpkt/s per router
  - 3 searches per pkt (Primary Filter, Aux Filter, Route Lookup)
  - Total per router: 24.66 M searches per second
  - TCAM total: 49.32 M searches per second
- So, the Associated Data SRAM can NOT keep up

Lookups: Proposed Design
- Use SRAM Bank 0 (4 MB) for all results
  - B0 byte address range: 0x000000 – 0x3FFFFF (22 bits)
  - B0 word address range: 0x000000 – 0x3FFFFC (20 bits, two trailing 0s)
- Use the 32-bit Associated Data SRAM result for the address of the actual result:
  - Done: 1b
  - Hit: 1b
  - MHit: 1b
  - Priority: 8b (present for Primary Filters; should be 0 for RL and Aux Filters)
  - SRAM B0 word address: 21b
  - 1 spare bit
- Use Multi-Database Lookup (MDL) Indirect for searching all 3 DBs
  - The order of fields in the key is important
  - Each thread will need one TCAM context
- Route DB:
  - Lookup size: 68b (3 32-bit words transferred across the QDR interface)
  - Core size: 72b
  - AD result size: 32b
  - SRAM B0 result size: 72b (3 words)
- Primary DB:
  - Lookup size: 136b (5 32-bit words transferred across the QDR interface)
  - Core size: 144b
  - SRAM B0 result size: 76b (3 words)
  - Priority is not included in the SRAM B0 result because it is in the AD result
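A sketch of the 32-bit Associated Data result proposed above, plus the word-to-byte address conversion. The slide lists a 21-bit Bank 0 word address and a spare bit; since only 11 bits of flags and priority precede it, this sketch reads that as a 20-bit word address (matching the 0x000000–0x3FFFFC word range) plus one spare bit, which exactly fills 32 bits. The exact bit positions are an assumption.

```c
#include <stdint.h>

typedef struct ad_result {            /* 32-bit Associated Data result */
    uint32_t done     : 1;
    uint32_t hit      : 1;
    uint32_t mhit     : 1;
    uint32_t priority : 8;            /* primary filters only; 0 for RL and aux */
    uint32_t b0_word  : 20;           /* SRAM Bank 0 word address of the full result */
    uint32_t spare    : 1;
} ad_result_t;

/* Word address -> Bank 0 byte address (restore the two trailing zero bits). */
static inline uint32_t ad_b0_byte_addr(const ad_result_t *r)
{
    return r->b0_word << 2;
}
```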

Lookups: Latency
Three searches in one MDL Indirect operation. Latencies for the operation:
- QDR xfer time: 6 clock cycles (1 for the MDL Indirect subinstruction, 5 for the 144-bit key transferred across the QDR bus)
- Instruction FIFO: 2 clock cycles
- Synchronizer: 3 clock cycles
- Execution latency: search dependent
- Re-synchronizer: 1 clock cycle
- Total: 12 clock cycles (plus the search-dependent execution latency)

Lookups: Latency
- 144-bit DB, 32 bits of AD (two of these): instruction latency 30, core blocking delay 2, backend latency 8
- 72-bit DB, 32 bits of AD: core blocking delay 2
- Latency of the first search (144-bit DB): 11 + 30 = 41 clock cycles
- Latency of subsequent searches: (previous search latency) – (backend latency of previous search) + (core blocking delay of previous search) + (backend latency of this search)
  - Latency of the second 144-bit search: 41 – 8 + 2 + 8 = 43
  - Latency of the third search (72-bit): 43 – 8 + 2 + 8 = 45 clock cycles
- 45 QDR clock cycles (200 MHz clock) -> 315 IXP clock cycles (1400 MHz clock)
- This is JUST for the TCAM operation; we also need to read SRAM:
  - one SRAM read to retrieve the TCAM results mailbox (3 words – one per search)
  - TWO SRAM reads to then retrieve the full results (3 words each) from SRAM Bank 0, but we don't have to wait for one to complete before issuing the second
  - About 150 IXP cycles for an SRAM read -> 315 + 150 + 150 = 615 IXP clock cycles
- Let's estimate 650 IXP clock cycles for issuing, performing, and retrieving results for a lookup (multi-word, two reads, …). This does not include any lookup block processing.
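A worked version of the chained-search latency recurrence and the cycle conversions above, reproducing the 45 / 315 / 615-cycle figures.

```c
#include <stdio.h>

int main(void)
{
    /* next = prev - backend(prev) + core_block(prev) + backend(this) */
    int first  = 11 + 30;              /* xfer/fifo/sync (11) + 144b instruction latency (30) = 41 */
    int second = first  - 8 + 2 + 8;   /* second 144-bit search: 43 QDR cycles */
    int third  = second - 8 + 2 + 8;   /* third (72-bit) search: 45 QDR cycles */

    int ixp = third * 1400 / 200;      /* 200 MHz QDR -> 1.4 GHz IXP: 315 IXP cycles */
    printf("TCAM: %d QDR cycles = %d IXP cycles; + mailbox read (~150) + "
           "overlapped full-result reads (~150) = %d IXP cycles\n",
           third, ixp, ixp + 150 + 150);   /* prints 45, 315, 615 */
    return 0;
}
```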

Lookups: SRAM Bandwidth
- The analysis is PER LA-1 QDR interface; that is, each of NPUA and NPUB can do the following.
- 16-bit QDR SRAM at 200 MHz:
  - Separate read and write buses
  - Operations on the rising and falling edge of each clock
  - 32 bits of read AND 32 bits of write per clock tick
- QDR write bus: 6 32-bit cycles per instruction
  - Cycle 0: the write address bus carries the TCAM Indirect instruction; the write data bus carries the TCAM Indirect MDL sub-instruction
  - Cycles 1-5: the write data bus carries the 5 words of the lookup key
  - The write bus can support 200M/6 = 33.33 M searches/sec
- QDR read bus:
  - Retrieval of the results mailbox: 3 32-bit cycles per instruction
  - Retrieval of two full results from QDR SRAM Bank 0: total of 9 32-bit cycles per instruction
  - The read bus can support 200M/9 = 22.22 M searches/sec
- Conclusion: plenty of SRAM bandwidth to support TCAM operations AND SRAM Bank 0 accesses to perform all aspects of lookups at over 8.22 M searches/sec.

Objectives for ONL Router
- Reproduce approximately the same functionality as the current hardware router: routes, filters (including sampling filters), stats, plugins
- Extensions: multicast, explicit-congestion marking
- Use each NPU as a separate 5-port router, each responsible for half the external ports
- The xScale on each NPU implements CP functions:
  - access to control variables, memory-resident statistics
  - updating of routes, filters
  - interaction with plugins through shared memory
  - simple message buffer interface for request/response

Unicast, ARP and Multicast
- Each port has an Ethernet header with a fixed source MAC address – several cases for the destination MAC address:
  - Case 1 – unicast packet with destination on an attached subnet: requires ARP to map dAdr to a MAC address; the ARP cache holds mappings – issue an ARP request on a cache miss
  - Case 2 – other unicast packets: the lookup must provide the next-hop IP address, then use ARP to obtain the MAC address, as in case 1
  - Case 3 – multicast packet: the lookup specifies a copy-vector and QiD; the destination MAC address is formed from the IP multicast address
- Could avoid ARP in some cases, e.g. a point-to-point link, but there is little advantage, since the ARP mechanism is required anyway
- Do we learn MAC addresses from received pkts?

Proposed Approach
- Lookup does separate route lookup and filter lookup
  - at most one match for the route, up to two for filters (primary, aux)
  - combine the route lookup with the ARP cache lookup: the xScale adds routes for multi-access subnets, based on ARP
- Route lookup
  - for unicast, stored keys are (rcv port)+(dAdr prefix); the lookup key is (rcv port)+(dAdr); the result includes Port/Plugin, QiD, next-hop IP or MAC address, valid next-hop bit
  - for multicast, stored keys are (rcv port)+(dAdr)+(sAdr prefix); the lookup key is (rcv port)+(dAdr)+(sAdr); the result includes a 10-bit copy vector and QiD
- Filter lookup
  - stored key is the IP 5-tuple + TCP flags – arbitrary bit masks allowed
  - lookup key is the IP 5-tuple + flags if applicable
  - result includes Port/Plugin or copy vector, QiD, next-hop IP or MAC address, valid next-hop bit, primary-aux bit, priority
- Destination MAC address passed through the QM by being written in the buffer descriptor? Do we have 48 bits to spare? Yes, we actually have 14 free bytes – enough for a full (non-VLAN) Ethernet header.

Lookup Processing
- On receiving a unicast packet, do route & filter lookups
  - if the MAC address returned by the route (or a higher-priority primary filter) is valid, queue the packet and continue
  - else, pass the packet to the xScale, marking it as no-MAC; leave it to the xScale to generate the ARP request, handle the reply, insert the route, and re-inject the packet into the data path
- On receiving a multicast packet, do route & filter lookups
  - take the higher-priority result from the route lookup or primary filter
  - format the multicast MAC address
  - copy to the queues specified by the copy vector
  - if there is a matching auxiliary filter, the filter supplies the MAC address
(A sketch of this decision logic follows.)
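A sketch of the decision logic above. The result layout, helper names, and the priority convention (larger value wins) are assumptions for illustration; queueing, the xScale hand-off, and multicast MAC formation are stand-in stubs, and the aux-filter copy is skipped when ARP handling takes over, per the Notes slide.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct result {
    bool     valid;          /* a matching entry was found */
    bool     mac_valid;      /* next-hop MAC present in the result */
    bool     is_mcast;
    uint16_t copy_vector;    /* 10-bit multicast copy vector */
    uint32_t qid;
    uint64_t next_hop;       /* NH IP or MAC, per the valid bits */
    uint8_t  priority;
} result_t;

static void queue_pkt(uint32_t buf, uint32_t qid, uint64_t mac)
{ printf("queue buf %u on qid %u (mac %012llx)\n", buf, qid, (unsigned long long)mac); }

static void send_to_xscale_no_mac(uint32_t buf)      /* xScale does ARP and re-injects */
{ printf("buf %u to xScale, marked no-MAC\n", buf); }

static uint64_t mcast_mac_from_ip(uint32_t group)    /* placeholder for Fct(IP mcast addr) */
{ return 0x01005E000000ULL | (group & 0x007FFFFF); }

static void process_lookup(uint32_t buf, uint32_t ip_daddr,
                           const result_t *route, const result_t *pri, const result_t *aux)
{
    /* Take the higher-priority of the route result and the primary filter. */
    const result_t *r = (pri->valid && pri->priority >= route->priority) ? pri : route;
    bool went_to_xscale = false;

    if (r->is_mcast) {
        uint64_t mac = mcast_mac_from_ip(ip_daddr);          /* same MAC for all copies */
        for (int port = 0; port < 10; port++)
            if (r->copy_vector & (1u << port))
                queue_pkt(buf, r->qid, mac);                 /* same QiD for every output */
    } else if (r->mac_valid) {
        queue_pkt(buf, r->qid, r->next_hop);
    } else {
        send_to_xscale_no_mac(buf);                          /* needs ARP */
        went_to_xscale = true;
    }

    /* Aux filter makes an extra copy, unless ARP handling took precedence. */
    if (aux->valid && !went_to_xscale)
        queue_pkt(buf, aux->qid, aux->next_hop);
}

int main(void)
{
    result_t route = { .valid = true, .mac_valid = true, .qid = 7, .next_hop = 0x0a0b0c0d0e0fULL };
    result_t none  = { .valid = false };
    process_lookup(1, 0, &route, &none, &none);
    return 0;
}
```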

Extra Slides

ONL NP Router
[Block diagram: TCAM, SRAM, Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (3 MEs), Queue Manager (1 ME), HdrFmt (1 ME), Tx (2 ME).]

ONL NP Router
[Block diagram: TCAM, SRAM, Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (3 MEs), Queue Manager (1 ME), HdrFmt (1 ME), Tx (2 ME), annotated with the inter-block ring word formats, including:]
- Buffer Handle (32b) | Frame Length (16b) | Stats Index (16b) | QID (20b) | Rsv (4b) | Port
- Buf Handle (32b) | Port (8b) | Reserved | Eth. Frame Len (16b)
- Buf Handle (24b) | Frm Offset (16b) | Frm Length (16b) | Port (8b)
- Buffer Handle (24b) | Rsv (3b) | Port (4b) | V (1b)

ONL NP Router: Parse, Lookup, PHF&Copy (3 MEs)
- Parse:
  - Do IP router checks
  - Extract the lookup key
- Lookup:
  - Perform lookups – potentially three: Route Lookup, Primary Filter lookup, Auxiliary Filter lookup
- Copy:
  - Port identifies the source MAC addr: write it to the buffer descriptor, or let HF determine it via the port?
  - Unicast:
    - Valid MAC: write the MAC addr to the buffer descriptor and queue the pkt
    - No valid MAC: prepare the pkt to be sent to the XScale for ARP processing
  - Multicast:
    - Calculate the Ethernet multicast Dst MAC addr = Fct(IP Multicast Dst Addr) (a sketch of this mapping follows)
    - Write the Dst MAC addr to the buf desc – the same for all copies!
    - For each bit set in the copy bit vector: queue a packet to the port represented by that bit
    - Reference count in the buffer desc
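One conventional reading of the Fct(IP Multicast Dst Addr) step is the standard IPv4-to-Ethernet multicast mapping: 01:00:5E followed by the low 23 bits of the group address. The slides do not pin down the exact function the Copy block will use, so treat this as an assumption.

```c
#include <stdint.h>
#include <stdio.h>

/* Standard IPv4 multicast group -> Ethernet multicast MAC mapping:
 * 01:00:5E prefix, low 23 bits of the group address. */
static uint64_t mcast_mac_from_ip(uint32_t group_addr)
{
    return 0x01005E000000ULL | (group_addr & 0x007FFFFF);
}

int main(void)
{
    /* 224.1.2.3 -> 01:00:5e:01:02:03 */
    printf("%012llx\n", (unsigned long long)mcast_mac_from_ip(0xE0010203));
    return 0;
}
```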

Notes
- Need a reference count for multicast (in the buffer descriptor). How do we handle freeing the buffer for a multicast packet?
- Drops can take place in the following blocks: Parse, QM, Plugin, Tx
- Mux -> Parse: Reclassify bit
- For traffic that does not get reclassified after coming from a Plugin or the XScale, we need all the data that the QM will need: QID, Stats Index, Output Port
- If a packet matches an Aux filter AND it needs ARP processing, the ARP processing takes precedence and we do not process the Aux filter result.
- Does anything other than ARP-related traffic go to the XScale? IP exceptions like expired TTL? Can users direct traffic for delivery to the XScale and add processing there? Probably not, if we are viewing the XScale as being like our CPs in the NSP implementation.

Notes
- Combining Parse/Lookup/Copy:
  - Dispatch loop
  - Build settings
  - TCAM mailboxes (there are 128 contexts), so with 24 threads we can have up to 5 TCAM contexts per thread
  - Rewrite Lookup in C
  - Input and output on scratch rings
- Configurable priorities on Mux inputs: XScale, Plugins, Rx
- Should we allow plugins to write directly to the QM input scratch ring for packets that do not need reclassification? If we allow this, is there any reason for a plugin to send a packet back through Parse/Lookup/Copy when it wants the packet NOT to be reclassified?
- We can give plugins the capability to use NN rings between themselves to chain plugins.

ONL NP Router
[Block diagram: xScale, TCAM, Assoc. Data ZBT-SRAM, SRAM, Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (4 MEs), Queue Manager (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), five Plugins, large SRAM rings.]
- Add configurable per-port delay (up to 150 ms total delay)
- Add a large SRAM ring
- Plugin write access to the QM scratch ring
- Each output has a common set of QiDs
- Multicast copies use the same QiD for all outputs
- QiD is ignored for plugin copies

ONL NP Router
[Block diagram: xScale, TCAM, SRAM, Rx (2 ME), Mux (1 ME), Parse/Lookup/Copy (4 MEs), Queue Manager (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), Plugins 1-5 chained by NN rings.]
- Each output has a common set of QiDs
- Multicast copies use the same QiD for all outputs
- QiD is ignored for plugin copies