An NP-Based Router for the Open Network Lab

1 An NP-Based Router for the Open Network Lab
John DeHart

2 Notes from 3/23/07 ONL Control Mtg
Using the same QID for all copies of a multicast does not work: the QM does not partition QIDs across ports.
Do we need to support datagram queues? Yes, we will support 64 datagram queues per port.
We will use the same hash function as in the NSP router.
For testing purposes, can users assign the datagram queues to filters/routes?
Proposed partitioning of QIDs (a sketch of the field packing follows below):
  QID[15:13]: port number (0-4)
  QID[12]: reserved by RLI vs. XScale (0: RLI reserved; 1: XScale reserved)
  QID[11:0]: per-port queues
    4096 RLI-reserved queues per port
    4032 XScale-reserved queues per port
    64 datagram queues per port (yyy xx xxxx: datagram queues for port <yyy>)
IDT XScale software kernel memory issues still need to be resolved.
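A minimal sketch of the proposed QID packing as C macros; the field layout is from this slide, while the macro names and the helper are hypothetical, not from the ONL source tree:

    #include <stdint.h>

    /* QID[15:13] = port (0-4), QID[12] = 0:RLI / 1:XScale, QID[11:0] = queue */
    #define QID_PORT(qid)      (((qid) >> 13) & 0x7)
    #define QID_IS_XSCALE(qid) (((qid) >> 12) & 0x1)
    #define QID_QUEUE(qid)     ((qid) & 0xFFF)

    static inline uint16_t qid_make(unsigned port, unsigned xscale, unsigned queue)
    {
        return (uint16_t)(((port & 0x7) << 13) |
                          ((xscale & 0x1) << 12) |
                           (queue & 0xFFF));
    }

For example, qid_make(3, 1, 5) yields QID 0x7005: XScale-reserved queue 5 on port 3.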

3 ONL NP Router
[Diagram: the full datapath: Rx (2 ME) → SRAM ring → Mux (1 ME) → scratch ring → Parse/Lookup/Copy (3 MEs) → QM (1 ME) → NN ring → HdrFmt (1 ME) → Tx (1 ME), with Stats (1 ME), FreeList Mgr (1 ME), five plugins on NN rings, the xScale, and the TCAM with its associated-data ZBT-SRAM; SRAM rings are 32KW each, one 64KW. Annotations mark each block as Mostly Unchanged, New, Needs Some Mod., or Needs A Lot Of Mod.]

4 Project Assignments
XScale daemons, etc.: Charlie, with design and policy help from Fred and Ken.
PLC (Parse, Lookup and Copy): Jing and JohnD, with consulting from Brandon.
QM: Dave and JohnD.
Rx: Dave.
Tx: Dave.
Stats: Dave.
Header Format: Mike.
Mux: Mart?
Freelist Mgr: JohnD.
Plugin framework: Charlie and Shakir, with consulting from Ken.
Dispatch loop and utilities: All.
  Dl_sink_to_Stats, dl_sink_to_freelist_mgr: these should take in a signal and not wait.
Documentation: Ken, with help from All.
Test cases and test pkt generation: Brandon.

5 Project Level Stuff
Upgrade to IXA SDK 4.3.1: Techx/Development/IXP_SDK_4.3/{cd1,cd2,4-3-1_update}
Project files: we're working on them right now.
C vs. uc:
  Probably any new blocks should be written in C.
  Existing code (Rx, Tx, QM, Stats) can remain as uc.
  Freelist Mgr might go either way.
Stubs: do we need them this time around?
SRAM rings: we need to understand the implications of using them. No way to pre-test for empty/full?
Subversion: do we want to take this opportunity to upgrade?
  Current versions: Cygwin (my laptop): …; Linux (bang.arl.wustl.edu): 1.3.2
  Available: Cygwin (subversion.tigris.org): 1.4.3

6 Notes from 3/13/07
Ethertype needs to be written to the buffer descriptor so HF can get it.
Who tags non-IP pkts for being sent to the XScale: Parse?
We will not be supporting Ethernet headers with VLANs or LLC/SNAP encapsulation.
Add an In Plugin field to the data going to a Plugin: it tells which plugin last had the packet.
Plugins can write to other plugins' SRAM rings.
Support for XScale participation in an IP multicast: for use with control protocols?
Add In Port values for Plugin- and XScale-generated packets.
Include both In Port and In Plugin in the lookup key?
Should flag bits also go to Plugins?
For users to use our IP MCast support they must abide by the IP multicast addressing rules. That is, Copy will do the translation of the IP MCast DAddr to an Ethernet MCast DAddr (see the sketch below), so if the IP DA does not conform, Copy can't do the translation.
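The translation mentioned here is presumably the standard RFC 1112 mapping (the low 23 bits of the IPv4 group address placed under the 01:00:5E prefix); a sketch in C, not the actual Copy-block code:

    #include <stdint.h>

    static void ip_mcast_to_eth_mcast(uint32_t ip_daddr, uint8_t mac[6])
    {
        mac[0] = 0x01;                       /* fixed 01:00:5E multicast prefix */
        mac[1] = 0x00;
        mac[2] = 0x5E;
        mac[3] = (ip_daddr >> 16) & 0x7F;    /* bit 23 of the group is dropped  */
        mac[4] = (ip_daddr >>  8) & 0xFF;
        mac[5] =  ip_daddr        & 0xFF;
    }

Because only 23 bits survive, 32 IP group addresses map to each Ethernet multicast address.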

7 Issues and Questions
Upgrade to IXA SDK 4.3.1: Techx/Development/IXP_SDK_4.3/{cd1,cd2,4-3-1_update}
Which Rx to use? The Intel Rx from the IXA SDK is our base for further work.
Which Tx to use? Three options:
  Our current Tx (Intel IXA SDK 4.0, Radisys modifications, WU modifications). Among other changes, we removed some code that supported buffer chaining.
  The Radisys Tx based on SDK 4.0; we would need to re-do our modifications. This would get the buffer-chaining code back if we need/want it.
  The Intel IXA SDK Tx: no Radisys modifications; we would need to re-do our modifications.
How will we write L2 headers?
  When there are >1 copies:
    For a copy going to the QM, Copy allocates a buffer and buffer descriptor for the L2 header, and writes the DAddr into the buffer descriptor.
    Options:
      HF writes the full L2 header to the DRAM buffer and Tx initiates the transfer from DRAM to the TBUF (unicast: to the packet DRAM buffer; multicast: to the prepended header DRAM buffer).
      HF writes the L2 header to a scratch ring, and Tx reads it from the ring and writes it directly to the TBUF.
  When there is only one copy of the packet:
    No extra buffer and buffer descriptor are allocated.
    The L2 header is given to Tx in the same way as in the >1 copy case.
How should exceptions be handled? TTL expired; IP options present; no route.
C vs. uc: probably any new blocks should be written in C; existing code (Rx, Tx, QM, Stats) can remain as uc. Freelist Mgr?
Continued on the next slide…

8 Issues and Questions
Need to add global counters (see ONLStats.ppt):
  Per-port Rx and Tx pkt and byte counters.
  Drop counters:
    Rx (out of buffers)
    Parse (malformed IP header/pkt)
    QM (queue overflow)
    Plugin
    XScale
    Copy (lookup result has the Drop bit set; lookup MISS?)
    Tx (internal buffer overflow)
What is our performance target? 5-port router, full link rates.
How should SRAM banks be allocated?
  How many packets should be able to be resident in the system at any given time?
  How many queues do we need to support? Etc.
How will lookups be structured?
  One operation across multiple DBs vs. multiple operations, each on one DB.
  Will results be stored in associated-data SRAM or in one of our SRAM banks?
  Can we use SRAM Bank 0 and still get the throughput we want?
Multicast:
  Are we defining how an ONL user should implement multicast, or are we just trying to provide some mechanisms to allow ONL users to experiment with multicast?
  Do we need to allow a unicast lookup with one copy going out and one copy going to a plugin? If so, this would use the NH_MAC field and the copy vector field.
Continued on the next slide…

9 Issues and Questions
Plugins:
  Can they send pkts directly to the QM instead of always going back through Parse/Lookup/Copy?
  Use of NN rings between plugins to do plugin chaining.
  Plugins should also be able to write to the Stats module ring, to use the stats counters as they want.
XScale:
  Can it send pkts directly to the QM instead of always going through the Parse/Lookup/Copy path?
  ARP request and reply? What else will it do besides handling ARP?
  Do we need to guarantee in-order delivery of packets for a flow that triggers an ARP operation? A re-injected packet may be behind a recently arrived packet of the same flow.
What is the format of our buffer descriptor?
  Add a reference count (4 bits).
  Add the MAC DAddr (48 bits).
  Does the packet size or offset ever change once written? Yes, plugins can change the packet size and offset.
  Other?
Continued on the next slide…

10 Issues and Questions
How will we manage the free list?
  Support for multicast (the ref count in the buf desc) makes reclaiming buffers a little trickier.
  A scratch ring to a separate ME.
  Do we want it to batch requests?
    Read 5 or 10 entries from the scratch ring at once, compare the buffer handles and accumulate.
    Depending on the queue, copies of a packet will go out close in time to one another…
    But the vast majority of packets will be unicast, so no accumulation will be possible.
    Or, use the CAM to accumulate 16 buffer handles; evict unicast or completed multicast entries from the CAM and actually free the descriptor.
  Do we want to put the Freelist Mgr ME just ahead of Rx and use an NN ring into Rx to feed buffer descriptors when we can?
  We might be able to have Mux and Freelist Mgr share an ME (4 threads each, or something).
  Modify dl_buf_drop() (a sketch follows this list):
    The performance assumptions of blocks that do drops may have to change if we add an SRAM operation to a drop; it will also add a context swap.
    The drop code will need to do a test_and_decr, wait for the result (i.e. context swap) and then, depending on the result, perhaps do the drop.
    Note: the test_and_decr SRAM atomic operation returns the pre-modified value.
Usage scenarios:
  It would be good to document some typical ONL usage examples; this might just be extracting some material from existing ONL documentation and class projects. Ken?
  It might also be good to document a JST dream sequence for an ONL experiment. Oh my, what have I done now…
Do we need to worry about balancing MEs across the two clusters?
  QM and Lookup are probably the heaviest SRAM users.
  Rx and Tx are probably the heaviest DRAM users.
  Plugins need to be in neighboring MEs.
  QM and HF need to be in neighboring MEs.
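A minimal sketch of the modified dl_buf_drop() described above, assuming a hypothetical test_and_decr() wrapper for the SRAM atomic (which, per the note, returns the pre-modified value) and a hypothetical free_buffer_descriptor(); this is illustrative, not the actual dispatch-loop code:

    #include <stdint.h>

    extern uint32_t test_and_decr(volatile uint32_t *addr); /* atomic; returns old value */
    extern void free_buffer_descriptor(uint32_t buf_handle);

    void dl_buf_drop(uint32_t buf_handle, volatile uint32_t *ref_cnt_addr)
    {
        /* One extra SRAM operation, plus a context swap while waiting. */
        uint32_t old = test_and_decr(ref_cnt_addr);

        /* Only the holder of the last reference returns the descriptor. */
        if (old == 1)
            free_buffer_descriptor(buf_handle);
    }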

11 Hardware
Promentum™ ATCA-7010 (NP Blade):
  Two Intel IXP2850 NPs: 1.4 GHz core, 700 MHz XScale.
  Each NPU has:
    3x256MB RDRAM, 533 MHz, 3 channels; the address space is striped across all three.
    4 QDR II SRAM channels; channels 1, 2 and 3 populated with 8MB each, running at 200 MHz.
    16KB of scratch memory.
    16 microengines; instruction store: 8K 40-bit-wide instructions; local memory: 640 32-bit words.
    TCAM: Network Search Engine (NSE) on SRAM channel 0; each NPU has a separate LA-1 interface; part number IDT75K72234, 18Mb TCAM.
  Rear Transition Module (RTM): connects via ATCA Zone 3; 10 1GE physical interfaces; supports fiber or copper interfaces using SFP modules.

12 Hardware
[Photos: ATCA chassis, NP blade, and RTM.]

13 NP Blades

14 ONL Router Architecture
Each NPU is one 5-port router; the ONL chassis has no switch blade.
1Gb/s links on the RTM connect to external ONL switch(es).
[Diagram: 7010 blade with NPUA and NPUB, each connected over SPI to 5x1Gb/s links on the RTM.]

15 Performance
What is our performance target? To hit a 5 Gb/s rate:
  Minimum Ethernet frame: 76B (64B frame + 12B inter-frame spacing)
  5 Gb/sec * 1B/8b * packet/76B = 8.22 Mpkt/sec
IXP ME processing, 1.4 GHz clock rate (the arithmetic is worked in the sketch below):
  1.4 Gcycle/sec * 1 sec / 8.22 Mpkt = ~170 cycles per packet
  Compute budget (MEs * 170): 1 ME: 170 cycles; 2 MEs: 340 cycles; 3 MEs: 510 cycles; 4 MEs: 680 cycles
  Latency budget (threads * 170): 1 ME (8 threads): 1360 cycles; 2 MEs (16 threads): 2720 cycles; 3 MEs (24 threads): 4080 cycles; 4 MEs (32 threads): 5440 cycles
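The same arithmetic written out as a quick host-side sanity check (plain C, not ME code):

    #include <stdio.h>

    int main(void)
    {
        const double pkt_rate = 5e9 / (76 * 8);     /* ~8.22 Mpkt/s          */
        const double cycles   = 1.4e9 / pkt_rate;   /* ~170 ME cycles/packet */

        for (int mes = 1; mes <= 4; mes++)
            printf("%d ME(s): compute %.0f cycles, latency (%d threads) %.0f cycles\n",
                   mes, mes * cycles, mes * 8, mes * 8 * cycles);
        return 0;
    }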

16 ONL NP Router (Jon’s Original)
xScale xScale add large SRAM ring TCAM SRAM Rx (2 ME) Mux (1 ME) Parse, Lookup, Copy (3 MEs) Queue Manager (1 ME) HdrFmt (1 ME) Tx (2 ME) Stats (1 ME) large SRAM ring Each output has common set of QiDs Multicast copies use same QiD for all outputs QiD ignored for plugin copies Plugin Plugin Plugin Plugin Plugin xScale SRAM large SRAM ring

17 ONL NP Router
[Diagram: revised datapath: Rx (2 ME) → SRAM ring → Mux (1 ME) → scratch ring → Parse/Lookup/Copy (3 MEs) → QM (1 ME) → NN ring → HdrFmt (1 ME) → Tx (1 ME), with Stats (1 ME), FreeList Mgr (1 ME), five plugins chained by NN rings, the xScale, and the TCAM with its associated-data ZBT-SRAM. SRAM rings are 32KW each, except the 64KW plugin-to-Mux ring. Apparently Tx, QM, Parse, Plugins and the XScale feed the Stats ring, and QM, Copy and Plugins feed the FreeList Mgr ring.]

18 Inter Block Rings
Scratch rings (sizes in 32b words: 128, 256, 512, 1024):
  XScale → Mux: N words per pkt; 256-word ring; 256/N pkts
  PLC → XScale
  Mux → PLC: 1 word per pkt; 256 pkts
  → QM: N words per pkt; 1024-word ring; 1024/N pkts
  HF → Tx
  → Stats
  → Freelist Mgr
Total scratch size: 4KW (16KB); total used in rings: 2.5 KW.

19 Inter Block Rings
SRAM rings (sizes in 32b KW: 0.5, 1, 2, 4, 8, 16, 32, 64):
  Rx → Mux: 2 words per pkt; 32KW ring; 16K pkts
  PLC → Plugins (5 of them): 3 words per pkt; 32KW rings; ~10K pkts each
  Plugins → Mux (1 ring serving all plugins): 64KW ring; ~20K pkts
NN rings (128 32b words):
  QM → HF: 1 word per pkt; 128 pkts
  Plugin N → Plugin N+1 (for N=1 to 4): words per pkt is plugin dependent

20 SRAM Buffer Descriptor
Problem: with the use of filters, plugins and recycling back around for reclassification, we can end up with an arbitrary number of copies of one packet in the system at a time. Each copy of a packet could end up going to an output port and need a different MAC DAddr from all the other copies. Having one buffer descriptor per packet, regardless of the number of copies, will not be sufficient.
Solution: when there are multiple copies of the packet in the system, each copy will need a separate Header buffer descriptor, which will contain the MAC DAddr for that copy.
SRAM buffer descriptors are the scarce resource and we want to optimize their use. Therefore we do NOT want to always prepend a Header buffer descriptor:
  When the Copy block gets a packet for which it only needs to send one copy to the QM, it will read the current reference count, and if this copy is the ONLY copy in the system, it will not prepend a Header buffer descriptor.
  Otherwise, Copy will prepend a Header buffer descriptor to each copy going to the QM.
  Copy does not need to prepend a Header buffer descriptor to copies going to plugins.
  We have to think some more about the case of copies going to the XScale.
The Header buffer descriptors will come from the same pool (freelist 0) as the packet/payload buffer descriptors. There is no advantage to associating these Header buffer descriptors with small DRAM buffers: DRAM is not the scarce resource; SRAM buffer descriptors are.

21 ONL Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list 0000 (4b) | Reserved (4b) | Ref_Cnt (8b)
LW3: MAC DAddr_47_32 (16b) | Stats Index (16b)
LW4: MAC DAddr_31_00 (32b)
LW5: Reserved (16b) | EtherType (16b)
LW6: Reserved (32b)
LW7: Packet_Next (32b)
Annotations (likely mapping): Ref_Cnt is written (to 1) by Rx, added to by Copy, and decremented by the Freelist Mgr; Buffer_Next is written by the Freelist Mgr; the size and offset fields by Rx and Plugins; the MAC DAddr by Copy; Packet_Next by the QM. A struct sketch follows below.
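The same layout as a C struct, for illustration only (field names follow the slide; the MEs operate on raw SRAM words, not this struct):

    #include <stdint.h>

    typedef struct onl_buf_desc {
        uint32_t buffer_next;     /* LW0: next descriptor on the free list        */
        uint16_t buffer_size;     /* LW1[31:16]                                   */
        uint16_t offset;          /* LW1[15:0]                                    */
        uint16_t packet_size;     /* LW2[31:16]                                   */
        uint8_t  freelist_rsv;    /* LW2[15:8]: free list (4b) + reserved (4b)    */
        uint8_t  ref_cnt;         /* LW2[7:0]: set to 1 by Rx, bumped by Copy     */
        uint16_t mac_daddr_hi;    /* LW3[31:16]: MAC DAddr[47:32]                 */
        uint16_t stats_index;     /* LW3[15:0]                                    */
        uint32_t mac_daddr_lo;    /* LW4: MAC DAddr[31:0]                         */
        uint16_t reserved5;       /* LW5[31:16]                                   */
        uint16_t ether_type;      /* LW5[15:0]                                    */
        uint32_t reserved6;       /* LW6                                          */
        uint32_t packet_next;     /* LW7: next packet in the queue (QM)           */
    } onl_buf_desc_t;             /* 32 bytes, matching slide 24's sizing         */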

22 MR Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list 0000 (4b) | Reserved (4b) | Reserved (8b)
LW3: Reserved (16b) | Stats Index (16b)
LW4: Reserved (16b) | Reserved (8b) | Reserved (4b) | Reserved (4b)
LW5: Reserved (32b)
LW6: Reserved (16b) | Reserved (16b)
LW7: Packet_Next (32b)

23 Intel Buffer Descriptor
LW0: Buffer_Next (32b)
LW1: Buffer_Size (16b) | Offset (16b)
LW2: Packet_Size (16b) | Free_list (4b) | Rx_stat (4b) | Hdr_Type (8b)
LW3: Input_Port (16b) | Output_Port (16b)
LW4: Next_Hop_ID (16b) | Fabric_Port (8b) | Reserved (4b) | NHID type (4b)
LW5: ColorID (4b) | Reserved (4b) | FlowID (32b)
LW6: Class_ID (16b) | Reserved (16b)
LW7: Packet_Next (32b)

24 SRAM Usage
What will be using SRAM?
  Buffer descriptors: the current MR supports 229,376 buffers at 32 bytes per SRAM buffer descriptor: 7 MBytes.
  Queue descriptors: the current MR supports 65,536 queues at 16 bytes per queue descriptor: 1 MByte.
  Queue parameters: 16 bytes per queue-params entry (actually only 12 used in SRAM).
  QM scheduling structure: the current MR supports … batch buffers per QM ME at 44 bytes per batch buffer: … bytes.
  QM port rates: 4 bytes per port.
  Plugin "scratch" memory: how much per plugin?
  Large inter-block rings: Rx → Mux, PLC → Plugins, Plugins → Mux.
  Stats/counters: currently 64K sets at 16 bytes per set: 1 MByte.
  Lookup results.

25 SRAM Bank Allocation
SRAM banks:
  Bank 0: 4 MB total, 2MB per NPU; same interface/bus as the TCAM.
  Banks 1-3: 8 MB each.
Criteria for how SRAM banks should be allocated?
  Size.
  SRAM bandwidth: how many SRAM accesses per packet are needed for the various SRAM uses?
  The QM needs buffer descriptors and queue descriptors in the same bank.

26 SRAM Accesses Per Packet
To support 8.22 M pkts/sec we can have 24 reads and 24 writes per pkt (200M/8.22M).
Rx: SRAM dequeue (1 word) to retrieve a buffer descriptor from the free list; write buffer desc (2 words).
Parse and Lookup: TCAM operations and reading results.
Copy: write buffer desc (3 words: ref_cnt, MAC DAddr, stats index); pre-queue stats increments: read 2 words, write 2 words.
HF: should not need to read or write any of the buffer descriptor.
Tx: read buffer desc (4 words).
Freelist Mgr: SRAM enqueue – write 1 word to return the buffer descriptor to the free list.

27 QM SRAM Accesses Per Packet
QM (worst-case analysis):
Enqueue (assume the queue is idle and not loaded in the Q-Array):
  Write Q-Desc (4 words): eviction of the least recently used queue.
  Write Q-Params? When we evict a queue, do we need to write its params back? The Q-Length is the only thing the QM changes, and it looks like the QM writes it back every time it enqueues or dequeues AND when it evicts (we can probably remove the write on evict).
  Read Q-Desc (4 words).
  Read Q-Params (3 words): Q-Length, Threshold, Quantum.
  Write Q-Length (1 word).
  SRAM enqueue – write (1 word).
  Scheduling structure accesses? They are done once every 5 pkts (when running at full rate).
Dequeue (assume the queue is not loaded in the Q-Array):
  See the notes in the enqueue section.
  SRAM dequeue – read (1 word).
  Post-queue stats increments: 2 reads, 2 writes.

28 QM SRAM Accesses Per Packet
QM (worst-case analysis), total per-pkt accesses:
  Queue descriptors and buffer enq/deq: write 9 words; read 9 words.
  Queue params: write 2 words; read 6 words.
Scheduling structure accesses per iteration (batch of 5 packets):
  Advance head: read 11 words.
  Write tail: write 11 words.
  Update freelist: read 2 words OR write 5 words.

29 Proposed SRAM Bank Allocation
SRAM Bank 0 (TCAM associated data): lookup results.
SRAM Bank 1 (2.5MB/8MB):
  QM queue params (1MB)
  QM scheduling struct (0.5 MB)
  QM port rates (20B)
  Large inter-block rings (1MB); SRAM rings come in sizes (in words) of 0.5K, 1K, 2K, 4K, 8K, 16K, 32K, 64K:
    Rx → Mux (2 words per pkt): 32KW (16K pkts): 128KB
    PLC → Plugins (3 words per pkt): 32KW each (10K pkts each): 640KB
    Plugins → Mux (3 words per pkt): 64KW (20K pkts): 256KB
SRAM Bank 2 (8MB/8MB):
  Buffer descriptors (7MB)
  Queue descriptors (1MB)
SRAM Bank 3 (6MB/8MB):
  Stats counters (1MB)
  Plugin "scratch" memory (5MB, 1MB per plugin)

30 Lookups
How will lookups be structured? Three databases:
  Route Lookup: contains unicast and multicast entries.
    Unicast: port can be wildcarded; longest prefix match on DAddr; routes should be sorted in the DB with longest prefixes first.
    Multicast: port can be wildcarded?; exact match on DAddr; longest prefix match on SAddr; routes should be sorted in the DB with longest prefixes first.
  Primary Filter: filters should be sorted in the DB with higher-priority filters first.
  Auxiliary Filter.
Will results be stored in associated-data SRAM or in one of our external SRAM banks? Can we use SRAM Bank 0 and still get the throughput we want?
Priority between the Primary Filter and the Route Lookup:
  A priority will be stored with each primary filter.
  A priority will be assigned to RLs (all routes have the same priority).
  The PF priority and RL priority are compared after the results are retrieved, and one of the two is selected based on this comparison.
Auxiliary filters: if matched, they cause a copy of the packet to be sent out according to the aux filter's result.

31 TCAM Operations for Lookups
Five TCAM operations of interest:
  Lookup (Direct): 1 DB, 1 result.
  Multi-Hit Lookup (MHL) (Direct): 1 DB, up to 8 results.
  Simultaneous Multi-Database Lookup (SMDL) (Direct): 2 DBs, 1 result each.
    The DBs must be consecutive. Care must be taken when assigning segments to DBs that use this operation: there must be a clean separation of even and odd DBs and segments.
  Multi-Database Lookup (MDL) (Indirect): up to 8 DBs, 1 result each.
  Simultaneous Multi-Database Lookup (SMDL) (Indirect): functionally the same as the direct version, but key presentation and DB selection are different; the DBs need not be consecutive.

32 Lookups
Route Lookup key (72b):
  Port (4b): can be a wildcard (for unicast; probably not for multicast)
  Plugin (4b): can be a wildcard (for unicast; probably not for multicast)
  DAddr (32b): prefixed for unicast; exact match for multicast
  SAddr (32b): unicast entries always have this field and its mask 0; prefixed for multicast
Result (79b) (a struct sketch follows below):
  CopyVector (11b): one bit for each of the 5 ports and 5 plugins, and one bit for the XScale
  QID (16b)
  Drop (1b): drop pkt
  NH_IP/NH_MAC (48b): at most one of NH_IP or NH_MAC should be valid
  Valid bits (3b), at most one of which should be set: IP_MCast Valid (1b), NH_IP_Valid (1b), NH_MAC_Valid (1b)
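The key and result above, packed into illustrative C structs (field names are from the slide; the packing itself is an assumption, not the TCAM image format):

    #include <stdint.h>

    typedef struct route_key {       /* 72 bits of key material           */
        uint8_t  port_plugin;        /* [7:4] port, [3:0] plugin; both wildcardable */
        uint32_t daddr;              /* prefix (unicast) / exact (multicast)        */
        uint32_t saddr;              /* mask 0 (unicast), prefix (multicast)        */
    } route_key_t;

    #define RES_IP_MCAST_VALID 0x4   /* at most one of these three is set */
    #define RES_NH_IP_VALID    0x2
    #define RES_NH_MAC_VALID   0x1

    typedef struct route_result {    /* 79 bits of result material        */
        uint16_t copy_vector;        /* low 11 bits: 5 ports, 5 plugins, XScale */
        uint16_t qid;
        uint8_t  drop;               /* 1 = drop pkt                      */
        uint8_t  valid_bits;         /* RES_* flags above                 */
        uint64_t nh_ip_or_mac;       /* only the low 48 bits are meaningful */
    } route_result_t;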

33 Lookups
Filter Lookup key (140b):
  Port (4b): can be a wildcard (for unicast; probably not for multicast)
  Plugin (4b): can be a wildcard (for unicast; probably not for multicast)
  DAddr (32b)
  SAddr (32b)
  Protocol (8b)
  DPort (16b)
  SPort (16b)
  TCP Flags (12b)
  Exception Bits (16b): allow for directing packets based on defined exceptions
Result (89b):
  CopyVector (11b): one bit for each of the 5 ports and 5 plugins, and one bit for the XScale
  NH IP (32b) / NH MAC (48b) (48b field): at most one of NH_IP or NH_MAC should be valid
  QID (16b)
  Drop (1b): drop pkt
  Valid bits (3b), at most one of which should be set: NH IP Valid (1b), NH MAC Valid (1b), IP_MCast Valid (1b)
  Sampling bits (2b): for aux filters only
  Priority (8b): for primary filters only

34 TCAM Core Lookup Performance
Lookup/core size of 72 or 144 bits, Freq = 200MHz: the CAM core can support 100M searches per second.
For 1 router on each of NPUA and NPUB:
  8.22 Mpkt/s per router
  3 searches per pkt (Primary Filter, Aux Filter, Route Lookup)
  Total per router: 3 x 8.22 = 24.66 M searches per second
  TCAM total: 49.32 M searches per second
So the CAM core can keep up. Now let's look at the LA-1 interfaces…

35 TCAM LA-1 Interface Lookup Performance
Lookup/core size of 144 bits (ignore for now that the route size is smaller): each LA-1 interface can support 40M searches per second.
For 1 router on each of NPUA and NPUB (each NPU uses a separate LA-1 interface):
  8.22 Mpkt/s per router
  Maximum of 3 searches per pkt (Primary Filter, Aux Filter, Route Lookup); the maximum of 3 assumes each is done as a separate operation
  Total per interface: 24.66 M searches per second
So the LA-1 interfaces can keep up. Now let's look at the AD SRAM results…

36 TCAM Assoc. Data SRAM Results Performance
For 8.22M 72b or 144b lookups, the share of AD SRAM bandwidth consumed depends on the result size: 32b results consume 1/12, 64b results 1/6, and 128b results 1/3.
Lookup/core size of 72 or 144 bits, Freq = 200MHz, SRAM result size of 128 bits: the associated SRAM can support up to 25M searches per second.
For 1 router on each of NPUA and NPUB:
  8.22 Mpkt/s per router
  3 searches per pkt (Primary Filter, Aux Filter, Route Lookup)
  Total per router: 24.66 M searches per second
  TCAM total: 49.32 M searches per second
So the associated data SRAM can NOT keep up.

37 Lookups: Proposed Design
Use SRAM Bank 0 (2 MB per NPU) for all results:
  B0 byte address range: 0x000000 – 0x3FFFFF (22 bits)
  B0 word address range: 0x000000 – 0x3FFFFC (20 significant bits, two trailing 0s)
Use the 32-bit associated-data SRAM result for the address of the actual result (decoded in the sketch below):
  Done: 1b
  Hit: 1b
  MHit: 1b
  Priority: 8b (present for primary filters; should be 0 for RL and aux filters)
  SRAM B0 word address: 21b
  1 spare bit
Use Multi-Database Lookup (MDL) Indirect for searching all 3 DBs:
  The order of fields in the key is important.
  Each thread will need one TCAM context.
  Route DB: lookup size 68b (3 32b words transferred across the QDR intf); core size 72b; AD result size 32b; SRAM B0 result size 78b (3 words).
  Primary DB: lookup size 136b (5 32b words transferred across the QDR intf); core size 144b; SRAM B0 result size 82b (3 words); priority is not included in the SRAM B0 result because it is in the AD result.
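A sketch of decoding that 32-bit AD word; the field widths are from the slide, but the bit positions (packed from the top bit down, using the 21b address) are an assumption:

    #include <stdint.h>

    #define AD_DONE(r)     (((r) >> 31) & 0x1)   /* search complete        */
    #define AD_HIT(r)      (((r) >> 30) & 0x1)   /* a DB entry matched     */
    #define AD_MHIT(r)     (((r) >> 29) & 0x1)   /* multiple matches       */
    #define AD_PRIORITY(r) (((r) >> 21) & 0xFF)  /* primary filters only   */
    #define AD_B0_WADDR(r) ((r) & 0x1FFFFF)      /* word address in Bank 0 */

    /* Byte address of the full result in SRAM Bank 0: word address << 2. */
    static inline uint32_t ad_result_byte_addr(uint32_t ad_word)
    {
        return AD_B0_WADDR(ad_word) << 2;
    }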

38 Lookups: Latency
Three searches in one MDL Indirect operation. Latencies for the operation:
  QDR transfer time: 6 clock cycles (1 for the MDL Indirect sub-instruction, 5 for the 144-bit key transferred across the QDR bus)
  Instruction FIFO: 2 clock cycles
  Synchronizer: 3 clock cycles
  Execution latency: search dependent
  Re-synchronizer: 1 clock cycle
  Total: 12 clock cycles plus the search-dependent execution latency

39 Lookups: Latency
144-bit DB, 32 bits of AD (two of these searches): instruction latency 30; core blocking delay 2; backend latency 8.
72-bit DB, 32 bits of AD: core blocking delay 2.
Latency of the first search (144-bit DB): 41 clock cycles.
Latency of subsequent searches: (previous search latency) – (backend latency of previous search) + (core blocking delay of previous search) + (backend latency of this search).
  Latency of the second 144-bit search: 41 – 8 + 2 + 8 = 43.
  Latency of the third search (72-bit): 43 – 8 + 2 + 8 = 45 clock cycles.
45 QDR clock cycles (200 MHz clock) correspond to 315 IXP clock cycles (1400 MHz clock).
This is JUST the TCAM operation; we also need to read SRAM:
  one SRAM read to retrieve the TCAM Results Mailbox (3 words, one per search)
  TWO SRAM reads to then retrieve the full results (3 words each) from SRAM Bank 0, though we don't have to wait for one to complete before issuing the second
At about 150 IXP cycles per SRAM read: 315 + 150 + 150 = 615 IXP clock cycles.
Let's estimate 650 IXP clock cycles for issuing, performing and retrieving the results of a lookup (multi-word, two reads, …). This does not include any lookup-block processing. The recurrence is worked in the sketch below.
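The recurrence above, written out in plain C (the 72-bit search's backend latency of 8 is inferred from the 45-cycle total, so treat it as an assumption):

    #include <stdio.h>

    int main(void)
    {
        int backend[]    = {8, 8, 8};  /* per-search backend latency         */
        int core_block[] = {2, 2, 2};  /* per-search core blocking delay     */
        int latency      = 41;         /* first search: 144-bit DB, 32b AD   */

        printf("search 1: %d QDR cycles\n", latency);
        for (int i = 1; i < 3; i++) {
            /* latency[i] = latency[i-1] - backend[i-1] + core_block[i-1] + backend[i] */
            latency += -backend[i - 1] + core_block[i - 1] + backend[i];
            printf("search %d: %d QDR cycles\n", i + 1, latency);
        }
        printf("%d QDR cycles = %d IXP cycles (x7)\n", latency, latency * 7);
        return 0;
    }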

40 Lookups: SRAM Bandwidth
Analysis is PER LA-1 QDR interface; that is, each of NPUA and NPUB can do the following.
16-bit QDR SRAM at 200 MHz: separate read and write buses; operations on the rising and falling edge of each clock; 32 bits of read AND 32 bits of write per clock tick.
QDR write bus: 6 32-bit cycles per instruction:
  Cycle 0: the write address bus carries the TCAM Indirect instruction; the write data bus carries the TCAM Indirect MDL sub-instruction.
  Cycles 1-5: the write data bus carries the 5 words of the lookup key.
  The write bus can support 200M/6 = 33.3 M searches/sec.
QDR read bus:
  Retrieval of the Results Mailbox: 3 32-bit cycles per instruction.
  Retrieval of two full results from QDR SRAM Bank 0: a total of 9 32-bit cycles per instruction.
  The read bus can support 200M/9 = 22.2 M searches/sec.
Conclusion: plenty of SRAM bandwidth to support the TCAM operations AND the SRAM Bank 0 accesses, performing all aspects of lookups at over 8.22 M searches/sec.

41 Block Interfaces
The next set of slides shows the block interfaces. These slides are still very much a work in progress.

42 ONL NP Router
[Diagram: the datapath block diagram, repeated as a reference for the interface annotations on the following slides.]

43 ONL NP Router
[Diagram: datapath with one interface highlighted, apparently the 2-words-per-pkt Rx → Mux SRAM ring: word 0: Buf Handle (32b); word 1: InPort (4b), Reserved (12b), Eth. Frame Len (16b).]

44 ONL NP Router
[Diagram: datapath with the Mux → PLC interface highlighted (3 words):
  word 0: Rsv (4b), Out Port (4b), Buffer Handle (24b)
  word 1: In Plugin (4b), In Port (4b), Flags (8b), QID (16b)
  word 2: Frame Length (16b), Stats Index (16b)
Flags (8b): Source (3b): Rx/XScale/Plugin; PassThrough(1)/Classify(0) (1b); Reserved (4b). A second annotation shows the Flags bit layout: Reserved (4b), then the Rx, X(Scale), Pl(ugin) and PT bits.]

45 ONL NP Router
[Diagram: datapath with another interface highlighted, apparently the PLC → QM ring: Buffer Handle (24b), Out Port (4b), Reserved (4b), QID (16b), Rsv (8b), Frame Length (16b), Stats Index (16b).]

46 ONL NP Router
[Diagram: datapath with a one-word interface highlighted, apparently the QM → HdrFmt NN ring: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b).]

47 ONL NP Router
[Diagram: datapath with the HdrFmt → Tx interface highlighted: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b); Ethernet DA[47-16] (32b); Ethernet DA[15-0] (16b); Ethernet SA[47-32] (16b); Ethernet SA[31-0] (32b); Ethernet Type (16b); Reserved (16b).]

48 ONL NP Router
[Diagram: datapath with the PLC → Plugin interface highlighted (3 words): Reserved (8b), Buffer Handle (24b); In Plugin (4b), In Port (4b), Rsv (8b), QID (16b); Frame Length (16b), Stats Index (16b).]

49 ONL NP Router
[Diagram: datapath with the Plugin → Mux interface highlighted, apparently: Rsv (4b), Out Port (4b), Buffer Handle (24b); In Plugin (4b), In Port (4b), Flags (8b), QID (16b); Frame Length (16b), Stats Index (16b). Flags: PassThrough/Classify (1b), Reserved (7b).]

50 ONL NP Router
[Diagram: datapath with the PLC → xScale interface highlighted: Buffer Handle (24b), Out Port (4b), Rsv (8b), In Plugin (4b), In Port (4b), Flags (8b), QID (16b), Frame Length (16b), Stats Index (16b). The Flags encode why the pkt is being sent to the XScale.]

51 ONL NP Router
[Diagram: datapath with the xScale → Mux interface highlighted: Buffer Handle (24b), Out Port (4b), Rsv, In Plugin (4b), In Port (4b), Flags (8b), QID (16b), Frame Length (16b), Stats Index (16b). Flags: PassThrough/Classify (1b), Reserved (7b).]

52 ONL NP Router
[Diagram: datapath with the one-word Stats-ring entry highlighted: Opcode (4b), Data (12b), Stats Index (16b).]

53 ONL NP Router
[Diagram: datapath with the one-word FreeList Mgr ring entry highlighted: Buffer Handle (24b), Reserved (8b).]

54 ONL NP Router
[Diagram: the Mux and its three input interfaces (Rx, xScale, plugins), repeating the word formats shown above.]
What is the priority for servicing the input rings?
In Port: used as part of the lookup key.
In Plugin: used as part of the lookup key.
Out Port: used to tell the QM, HF and Tx which physical interface the pkt is destined for.

55 ONL NP Router
[Diagram: Parse → Lookup → Copy internals, showing the packet context carried between the stages (Buffer Handle (24b), In Plugin (4b), In Port (4b), Out Port, Flags (8b), QID (16b), Frame Length (16b), Stats Index (16b)) and the Copy output formats toward the QM and the plugins.]

56 ONL NP Router
PLC input data: Buffer Handle; In Plugin; In Port; Out Port; Flags (Source (3b): Rx/XScale/Plugin; PassThrough/Classify (1b); Reserved (4b)); QID; Frame Length; Stats Index.
Control flags: PassThrough/Reclassify.
Lookup key (136b): Port/Plugin (4b) (0-4: port; 5-9: plugin; 15: XScale); DAddr (32b); SAddr (32b); Protocol (8b); DPort (16b); SPort (16b); TCP Flags (12b); Exception Bits (16b: TTL expired, IP options present, no route).
Primary result: Valid (1b); CopyVector (10b); NH IP/MAC (48b); QID (16b); LD (1b): send to XScale; Drop (1b): drop pkt; valid bits (3b): NH IP Valid, NH MAC Valid, IP_MCast Valid.
Auxiliary result: Valid (1b); CopyVector (10b); NH IP/MAC (48b); QID (16b); LD (1b); Drop (1b); NH IP Valid (1b); NH MAC Valid (1b); IP_MCast Valid (1b); Sampling bits (2b).

57 Lookup Results
Results of a lookup could be:
1 PF/RL result:
  IP unicast: 1 packet sent to a port.
  Plugin unicast: 1 packet sent to a plugin.
  Unicast with plugin copies: 0 or 1 packet sent to a port; 1-5 copies sent to plugin(s).
  IP multicast: 0-10 copies sent, 1 to each of 5 ports and one to each of 5 plugins.
1 Aux Filter result:
  0 or 1 copy sent to a port.
  1-5 copies sent to plugins.

58 PLC
Main() {
  if (PassThrough) {
    Copy()
  } else {
    Parse()
    if (!Drop) {
      Lookup()
      Copy()
    }
  }
}

59 PLC
Lookup() {
  write KEY to TCAM
  use timestamp delay to wait the appropriate time
  while (!DoneBit) {
    // DONE-bit bug fix requires reading just the first word
    read 1 word from Results Mailbox
    check DoneBit
  }
  // done
  read words 2 and 3 from Results Mailbox
  if (PrimaryFilter and RouteLookup results HIT) {
    compare priorities
    PrimaryResult.Valid <- TRUE
    store the higher-priority result as the Primary Result (read result from SRAM Bank 0)
  } else if (PrimaryFilter result HIT) {
    PrimaryResult.* <- PrimaryFilter.* (read result from SRAM Bank 0)
  } else if (RouteLookup result HIT) {
    PrimaryResult.* <- RouteLookup.* (read result from SRAM Bank 0)
  }
  if (AuxiliaryFilter result HIT) {
    store the result as the Auxiliary Result (read result from SRAM Bank 0)
    mark the Auxiliary Result VALID
  }
}

60 PLC
Copy() {
  currentRefCnt <- Read(Buffer Descriptor Ref Cnt)
  copyCount <- 0
  outputData.bufferHandle <- inputData.bufferHandle
  outputData.QID <- inputData.QID
  outputData.frameLength <- inputData.frameLength
  outputData.statsIndex <- inputData.statsIndex
  if (PassThrough) {
    // It came from either the XScale or a Plugin; process inputData
    copyCount <- 1
    if (inputData.outPort == XScale) {
      // Do we need to include any additional flags when sending to the XScale?
      outputData.outPort <- inputData.outPort
      outputData.Flags <- inputData.Flags
      outputData.inPort <- inputData.inPort
      outputData.Plugin <- inputData.Plugin
      // Packets to the XScale do not (we think) need an additional Header buf desc.
      sendToXScale()
    }
    if (inputData.outPort == {Port}) {
      // A pass-through pkt should already have the MAC DAddr in its buffer desc.
      // A pass-through pkt should not need any additional Header buf desc.
      sendToQM()
    }
    if (inputData.outPort == {Plugin}) {
      // Packets to Plugins do not need an additional Header buf desc.
      sendToPlugin(Plugin#)
    }
    return
  }

61 PLC
  else {
    // Process Lookup Results
    // PrimaryResult is either the Primary Filter or the Route Lookup result, depending on priority
    if (PrimaryResult.Valid == TRUE) {
      if (PrimaryResult.IP_MCastValid == TRUE) {
        IP_MCast_Daddr <- read DRAM
        MacDAddr <- calculateMCast(IP_MCast_Daddr)
      } else {
        // Unicast
        if (countPorts(PrimaryResult.copyVector) > 1) { ILLEGAL }
        if (PrimaryResult.NH_Mac_Valid == TRUE) {
          MacDAddr <- PrimaryResult.NH_Address
        }
      }
      copyCount <- copyCount + countOnes(PrimaryResult.copyVector)
    }
    if (AuxiliaryResult.Valid == TRUE) {
      if (countPorts(AuxiliaryResult.copyVector) > 1) { ILLEGAL }
      copyCount <- copyCount + countOnes(AuxiliaryResult.copyVector)
    }
    update the reference counter in the pkt buffer descriptor
    for each copy {
      if ((copy is going to the QM) and ((copyCount + currentRefCnt) > 1)) {
        add a header SRAM buffer descriptor and header DRAM buffer
        sendCopy(header Buffer Descriptor)
      } else {
        sendCopy(Pkt Buffer Descriptor)
      }
    }
  }
}

62 ONL NP Router
[Diagram: the annotated block diagram from slide 3, repeated.]

63 JST: Objectives for ONL Router
Reproduce approximately the same functionality as the current hardware router: routes, filters (including sampling filters), stats, plugins.
Extensions: multicast, explicit-congestion marking.
Use each NPU as a separate 5-port router, each responsible for half the external ports.
The xScale on each NPU implements CP functions:
  access to control variables and memory-resident statistics
  updating of routes and filters
  interaction with plugins through shared memory
  a simple message buffer interface for request/response

64 JST: Unicast, ARP and Multicast
Each port has an Ethernet header with a fixed source MAC address; there are several cases for the destination MAC address:
  Case 1 – unicast packet with destination on an attached subnet: requires ARP to map the dAdr to a MAC address; the ARP cache holds mappings; issue an ARP request on a cache miss.
  Case 2 – other unicast packets: the lookup must provide the next-hop IP address, then ARP is used to obtain the MAC address, as in case 1.
  Case 3 – multicast packet: the lookup specifies a copy-vector and QiD; the destination MAC address is formed from the IP multicast address.
We could avoid ARP in some cases (e.g. a point-to-point link), but there is little advantage, since the ARP mechanism is required anyway.
Do we learn MAC addresses from received pkts?

65 JST: Proposed Approach
Lookup does separate route lookup and filter lookup:
  at most one match for a route, up to two for filters (primary, aux)
  combine the route lookup with the ARP cache lookup (a sketch of such a combined entry follows below); the xScale adds routes for multi-access subnets, based on ARP
Route lookup:
  for unicast, stored keys are (rcv port)+(dAdr prefix); the lookup key is (rcv port)+(dAdr); the result includes Port/Plugin, QiD, next-hop IP or MAC address, and a valid next-hop bit
  for multicast, stored keys are (rcv port)+(dAdr)+(sAdr prefix); the lookup key is (rcv port)+(dAdr)+(sAdr); the result includes a 10-bit copy vector and QiD
Filter lookup:
  the stored key is the IP 5-tuple + TCP flags; arbitrary bit masks are allowed
  the lookup key is the IP 5-tuple + flags, if applicable
  the result includes Port/Plugin or a copy vector, QiD, next-hop IP or MAC address, a valid next-hop bit, a primary/aux bit, and a priority
The destination MAC address is passed through the QM by writing it in the buffer descriptor. Do we have 48 bits to spare? Yes, we actually have 14 free bytes: enough for a full (non-VLAN) Ethernet header.
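A sketch of the combined route/ARP-cache entry this slide implies, with illustrative names (the actual result formats are on slides 32 and 56):

    #include <stdint.h>
    #include <stdbool.h>

    struct route_entry {
        uint16_t copy_vector;   /* port/plugin bits (10 used)             */
        uint16_t qid;
        uint32_t nh_ip;         /* next-hop IP from the route lookup      */
        uint8_t  nh_mac[6];     /* filled in once ARP resolves            */
        bool     nh_mac_valid;  /* false: packet goes to the xScale       */
    };

    /* Data-path decision from the next slide: queue if the MAC is known,
     * otherwise hand the packet to the xScale marked "no-MAC" so it can
     * issue the ARP request, insert the route, and re-inject the packet. */
    enum verdict { QUEUE_PKT, TO_XSCALE_NO_MAC };

    static enum verdict mac_decision(const struct route_entry *r)
    {
        return r->nh_mac_valid ? QUEUE_PKT : TO_XSCALE_NO_MAC;
    }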

66 JST: Lookup Processing
On receiving a unicast packet, do route & filter lookups:
  if the MAC address returned by the route (or by a higher-priority primary filter) is valid, queue the packet and continue
  else, pass the packet to the xScale, marking it as no-MAC; leave it to the xScale to generate the ARP request, handle the reply, insert the route and re-inject the packet into the data path
On receiving a multicast packet, do route & filter lookups:
  take the higher-priority result from the route lookup or the primary filter
  format the MAC multicast address
  copy to the queues specified by the copy vector
  if an auxiliary filter matches, the filter supplies the MAC address

67 Extra Slides

68 ONL NP Router
[Diagram: bare datapath skeleton: Rx (2 ME) → Mux (1 ME) → Parse/Lookup/Copy (3 MEs) → Queue Manager (1 ME) → HdrFmt (1 ME) → Tx (2 ME), with the TCAM and SRAM.]

69 ONL NP Router
[Diagram: earlier datapath with ring formats, apparently: Rx → Mux: Buf Handle (32b); Port (8b), Reserved, Eth. Frame Len (16b). Mux → PLC: Buf Handle (24b), Frm Offset (16b), Frm Length (16b), Port (8b). PLC → QM: Buffer Handle (32b), Stats Index (16b), QID (20b), Rsv (4b), Port, Frame Length (16b). QM → HdrFmt and HdrFmt → Tx: V (1b), Rsv (3b), Port (4b), Buffer Handle (24b).]

70 ONL NP Router
Parse, Lookup, PHF&Copy (3 MEs):
Parse:
  Do IP router checks.
  Extract the lookup key.
Lookup:
  Perform lookups; potentially three: route lookup, primary filter lookup, auxiliary filter lookup.
Copy:
  Port identifies the source MAC addr: write it to the buffer descriptor, or let HF determine it via the port?
  Unicast:
    Valid MAC: write the MAC addr to the buffer descriptor and queue the pkt.
    No valid MAC: prepare the pkt to be sent to the XScale for ARP processing.
  Multicast:
    Calculate the Ethernet multicast Dst MAC addr as fct(IP multicast Dst addr).
    Write the Dst MAC addr to the buf desc; same for all copies!
    For each bit set in the copy bit vector, queue a packet to the port represented by that bit.
    Reference count in the buffer desc.

71 Notes
Need a reference count for multicast (in the buffer descriptor). How do we handle freeing the buffer for a multicast packet?
Drops can take place in the following blocks: Parse, QM, Plugin, Tx.
Mux → Parse: Reclassify bit.
For traffic that does not get reclassified after coming from a Plugin or the XScale, we need all the data the QM will need: QID, Stats Index, Output Port.
If a packet matches an aux filter AND it needs ARP processing, the ARP processing takes precedence and we do not process the aux filter result.
Does anything other than ARP-related traffic go to the XScale? IP exceptions like expired TTL? Can users direct traffic for delivery to the XScale and add processing there? Probably not, if we are viewing the XScale as being like our CPs in the NSP implementation.

72 Notes
Combining Parse/Lookup/Copy:
  dispatch loop
  build settings
  TCAM mailboxes: there are 128 contexts, so with 24 threads we can have up to 5 TCAM contexts per thread
  rewrite Lookup in C
  input and output on scratch rings
Configurable priorities on the Mux inputs: XScale, Plugins, Rx.
Should we allow plugins to write directly to the QM input scratch ring for packets that do not need reclassification? If we allow this, is there any reason for a plugin to send a packet back through Parse/Lookup/Copy when it wants it NOT to be reclassified?
We can give plugins the capability to use NN rings between themselves to chain plugins.

73 ONL NP Router
[Diagram: variant of the original datapath with Parse/Lookup/Copy grown to 4 MEs and Tx at 1 ME. Annotations: add a configurable per-port delay (up to 150 ms total delay); add large SRAM rings; plugin write access to the QM scratch ring; each output has a common set of QiDs; multicast copies use the same QiD for all outputs; QiD ignored for plugin copies.]

74 ONL NP Router
[Diagram: datapath with Parse/Lookup/Copy (4 MEs) and the five plugins chained by NN rings. Annotations: each output has a common set of QiDs; multicast copies use the same QiD for all outputs; QiD ignored for plugin copies.]

75 Lookup Results
Results of a lookup could be:
A: 1 PF/RL result:
  A1. IP unicast: 1 packet sent to a port.
  A2. Plugin unicast: 1 packet sent to a plugin.
  A3. Unicast with plugin copies: 0 or 1 packet sent to a port; 1-5 copies sent to plugin(s).
  A4. IP multicast: 0-10 copies sent, 1 to each of 5 ports and one to each of 5 plugins.
B: 1 Aux Filter result:
  B1. 0 or 1 copy sent to a port.
  B2. 1-5 copies sent to plugins.
Valid combinations of the above:
  (A1 or A3) and (B1 or B3): potentially two different unicast MAC DAddresses needed.
  (A1 or A3) and B2.
  A1 and (B1 or B3).
  A2 and B2.
  A4 and B4: potentially 1 unicast MAC DAddr and 1 multicast MAC DAddr needed.

76 PLC
Input data: Buffer Handle; In Plugin; In Port; Out Port; Flags (Source (3b): Rx/XScale/Plugin; PassThrough/Classify (1b); Reserved (4b)); QID; Frame Length; Stats Index.
Control flags: PassThrough/Reclassify.
Key (136b): Port/Plugin (4b) (0-4: port; 5-9: plugin; 15: XScale); DAddr (32b); SAddr (32b); Protocol (8b); DPort (16b); SPort (16b); TCP Flags (12b); Exception Bits (16b: TTL expired, IP options present, no route).
Primary result: Valid (1b); CopyVector (10b); NH IP/MAC (48b); QID (16b); LD (1b): send to XScale; Drop (1b): drop pkt; valid bits (3b): NH IP Valid (1b), NH MAC Valid (1b), IP_MCast Valid (1b).
Auxiliary result: same fields as the primary result, plus Sampling bits (2b).
Output data: Buffer Handle; Plugin (to XScale only); In Port (to XScale only); Out Port (to XScale or QM only); Flags (to XScale only); QID; Frame Length; Stats Index.

