David M. Zar Applied Research Laboratory Computer Science and Engineering Department ONL Stats Block.

Presentation transcript:

David M. Zar Applied Research Laboratory Computer Science and Engineering Department ONL Stats Block

2 - David M. Zar - 9/11/2015 Stats Engine
The Stats Engine is a single ME devoted to accepting messages in a scratch ring and performing increment and add operations to counters.
»All MEs that need to update counters will use the Stats Engine
»Operations supported will be
    Atomic increment (+1)
    Atomic add (+data)
»Format of the commands will be
    Opcode (4b) | Data (12b) | Index (16b)
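The command format above fits in one 32-bit word (4 + 12 + 16 bits). A minimal sketch of packing and unpacking such a word follows. The bit positions are an assumption: the slides give only the field widths, so this sketch places the opcode in the top four bits, then data, then index. The function names are hypothetical.

```python
# Hypothetical helpers; the slides do not name any packing routine.
# ASSUMED layout, most- to least-significant: opcode[31:28] data[27:16] index[15:0].
OPCODE_BITS, DATA_BITS, INDEX_BITS = 4, 12, 16


def pack_command(opcode: int, data: int, index: int) -> int:
    """Build one 32-bit Stats Engine command word."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 12 bits")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index must fit in 16 bits")
    return (opcode << (DATA_BITS + INDEX_BITS)) | (data << INDEX_BITS) | index


def unpack_command(word: int) -> tuple:
    """Split a 32-bit command word back into (opcode, data, index)."""
    opcode = (word >> (DATA_BITS + INDEX_BITS)) & 0xF
    data = (word >> INDEX_BITS) & 0xFFF
    index = word & 0xFFFF
    return opcode, data, index


# Example: +1, +data on the pre-q counters of stats index 5 for a 76-byte packet.
word = pack_command(0b0011, 76, 5)
```

With a 12-bit data field, a single command can add at most 4095 to a counter.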

ONL NP Router
[Block diagram: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugin0 through Plugin4, and the xScale, with TCAM, Assoc. Data ZBT-SRAM, and SRAM attached. Blocks are connected by SRAM rings, scratch rings, and NN rings. Other labels: "Tx, QM Parse Plugin XScale", "QM Copy Plugins", "64KW Each".]

MEs -> Stats Block
[Diagram: command word sent by the MEs to the Stats block. Field labels: Opcode (4b), Index (16b), Data (12b)]

Opcodes
»0011: +1, +data pre-q counter specified in Index
»0111: +1, +data post-q counter specified in Index
»0010: +1 pre-q counter specified in Index
»0110: +1 post-q counter specified in Index
»0001: +data pre-q counter specified in Index
»0101: +data post-q counter specified in Index
»1011: +1, +data global register specified in Index
»1010: +1 global register specified in Index
»1001: +data global register specified in Index (not implemented – 4/23/07)
Command format: Opcode (4b) | Data (12b) | Index (16b)
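The listed opcodes read naturally as four flag bits. A minimal decoding sketch follows, assuming (as an inference from the opcode values, not a statement in the slides) that bit 3 selects a global register, bit 2 selects post-q over pre-q, bit 1 means +1, and bit 0 means +data. The type and function names are hypothetical.

```python
from typing import NamedTuple


class DecodedOpcode(NamedTuple):
    target: str       # "pre-q", "post-q", or "global"
    increment: bool   # True when the command performs +1
    add_data: bool    # True when the command performs +data


def decode_opcode(opcode: int) -> DecodedOpcode:
    """Decode a 4-bit stats opcode using the ASSUMED flag-bit reading."""
    if not 0 <= opcode <= 0xF:
        raise ValueError("opcode must fit in 4 bits")
    if opcode & 0b1000:
        target = "global"      # bit 2 is not interpreted for global registers here
    elif opcode & 0b0100:
        target = "post-q"
    else:
        target = "pre-q"
    return DecodedOpcode(target, bool(opcode & 0b0010), bool(opcode & 0b0001))
```

Opcodes that the slide does not list (0000, for example) are decoded mechanically by this sketch; how the real Stats Engine treats them is not stated.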

Stats Counters
Each Index specifies a group of four counters
»Pre-Q packet count
»Pre-Q byte count
»Post-Q packet count
»Post-Q byte count
The packet counters get updated when the +1 instructions are specified (opcodes 0-1-)
The byte counters get updated when the +data instructions are specified (opcodes 0--1)
For plug-ins, the use of each counter can be redefined but the opcodes do not change (i.e. each stats index corresponds to two incrementers and two adders).
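A small model of one four-counter group may make the update rules concrete. This is a sketch only: the class and function names are hypothetical, the counters are assumed to be 32-bit words that wrap on overflow, and bit 2 of the opcode is assumed to select the post-Q pair.

```python
from dataclasses import dataclass

WORD_MASK = 0xFFFFFFFF  # ASSUMPTION: each counter is one 32-bit word


@dataclass
class CounterGroup:
    preq_packets: int = 0
    preq_bytes: int = 0
    postq_packets: int = 0
    postq_bytes: int = 0


def apply_stats_opcode(group: CounterGroup, opcode: int, data: int) -> None:
    """Apply a stats-counter opcode (pattern 0---) to one counter group."""
    if opcode & 0b1000:
        raise ValueError("global-register opcodes do not address counter groups")
    post_q = bool(opcode & 0b0100)
    if opcode & 0b0010:  # +1 updates the packet counter
        if post_q:
            group.postq_packets = (group.postq_packets + 1) & WORD_MASK
        else:
            group.preq_packets = (group.preq_packets + 1) & WORD_MASK
    if opcode & 0b0001:  # +data updates the byte counter
        if post_q:
            group.postq_bytes = (group.postq_bytes + data) & WORD_MASK
        else:
            group.preq_bytes = (group.preq_bytes + data) & WORD_MASK


group = CounterGroup()
apply_stats_opcode(group, 0b0011, 76)   # packet arrives: pre-Q +1, +76 bytes
apply_stats_opcode(group, 0b0111, 76)   # packet leaves the queue: post-Q +1, +76 bytes
```

A plug-in that redefines the counters still sees the same shape: two words that can only be incremented by one and two words that can only be added to.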

Global Registers
For system-wide counters, we define a separate set of global registers to handle them.
»RX (packet and byte, 5 ports → 10 words)
»TX (packet and byte, 5 ports → 10 words)
»Drop counts (10 words)
»Plug-in use (four per plug-in → 20 words)
»Per ME error counters (8 words)
»10 + 10 + 10 + 20 + 8 = 58 so reserve 64 words for these
The register gets incremented when the +1 instructions are specified (opcodes 101-)
The register gets added to when the +data instructions are specified (opcodes 10-1)
The RX and TX counters will be assigned on even-word boundaries (lsb = 0) so that the packet and byte counters sit together and the +1, +data instruction can update both in one command (1011 opcode)
For plug-ins, the use of each register is under the control of the plug-in
»Four independent counters
»Two sets of two counters
»One set of two and two independent
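The word budget and the paired update can be checked with a few lines. This is a sketch under assumptions: the packet counter is taken to sit at the even word and the byte counter at the following odd word, and the function name is hypothetical.

```python
# Word counts as listed on the slide.
GLOBAL_REGISTER_WORDS = {
    "rx": 5 * 2,        # packet and byte counter per port, 5 ports
    "tx": 5 * 2,
    "drop": 10,
    "plugin": 5 * 4,    # four registers for each of five plug-ins
    "me_error": 8,
}
USED_WORDS = sum(GLOBAL_REGISTER_WORDS.values())
RESERVED_WORDS = 64     # next power of two above the words actually used


def paired_update(registers: list, index: int, data: int) -> None:
    """Model opcode 1011: +1 on the even word, +data on the word after it.

    ASSUMPTION: packet counter at `index` (even), byte counter at `index + 1`.
    """
    if index % 2 != 0:
        raise ValueError("paired counters must start on an even word (lsb = 0)")
    registers[index] += 1
    registers[index + 1] += data


registers = [0] * RESERVED_WORDS
paired_update(registers, 2, 76)   # one 76-byte packet on the pair starting at word 2
```

Keeping each pair on an even boundary means the engine can derive the second address by setting the low bit rather than carrying an add across the index.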

ONL Router Counter Registers (in dl_system.h)
// RX Per Port registers: (Updated by MUX)
ONL_ROUTER_RX_PORT0_PKT_CNTR
ONL_ROUTER_RX_PORT0_BYTE_CNTR
ONL_ROUTER_RX_PORT1_PKT_CNTR
ONL_ROUTER_RX_PORT1_BYTE_CNTR
ONL_ROUTER_RX_PORT2_PKT_CNTR
ONL_ROUTER_RX_PORT2_BYTE_CNTR
ONL_ROUTER_RX_PORT3_PKT_CNTR
ONL_ROUTER_RX_PORT3_BYTE_CNTR
ONL_ROUTER_RX_PORT4_PKT_CNTR
ONL_ROUTER_RX_PORT4_BYTE_CNTR
// TX Per Port registers: (Updated by HF)
ONL_ROUTER_TX_PORT0_PKT_CNTR
ONL_ROUTER_TX_PORT0_BYTE_CNTR
ONL_ROUTER_TX_PORT1_PKT_CNTR
ONL_ROUTER_TX_PORT1_BYTE_CNTR
ONL_ROUTER_TX_PORT2_PKT_CNTR
ONL_ROUTER_TX_PORT2_BYTE_CNTR
ONL_ROUTER_TX_PORT3_PKT_CNTR
ONL_ROUTER_TX_PORT3_BYTE_CNTR
ONL_ROUTER_TX_PORT4_PKT_CNTR
ONL_ROUTER_TX_PORT4_BYTE_CNTR
// IP Drop registers (Updated by PLC)
ONL_ROUTER_IP_HEC_DROP_CNTR
ONL_ROUTER_IP_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_HDR_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_VERSION_ERR_DROP_CNTR

ONL Router Counter Registers (cont.)
// PLC Drop registers (Updated by Parse, Lookup or Copy)
ONL_ROUTER_PLC_TO_PLUGIN_DROP_CNTR
ONL_ROUTER_PLC_TO_XSCALE_DROP_CNTR
// QM Drop registers (Updated by QM)
ONL_ROUTER_QUEUE_OVERFLOW_DROP_CNTR
// XScale Drop registers (Updated by XScale)
ONL_ROUTER_XSCALE_DROP_CNTR
// Rx Drop registers (Updated by Rx)
ONL_ROUTER_RX__DROP_CNTR
// Tx Drop registers (Updated by Tx)
ONL_ROUTER_TX_DROP_CNTR
// Per Block Generic Error Counters
ONL_ROUTER_RX_GENERIC_ERROR_CNTR
ONL_ROUTER_MUX_GENERIC_ERROR_CNTR
ONL_ROUTER_PLC_GENERIC_ERROR_CNTR
ONL_ROUTER_QM_GENERIC_ERROR_CNTR
ONL_ROUTER_HF_GENERIC_ERROR_CNTR
ONL_ROUTER_TX_GENERIC_ERROR_CNTR
ONL_ROUTER_STATS_GENERIC_ERROR_CNTR
ONL_ROUTER_FREELISTMGR_GENERIC_ERROR_CNTR

ONL Router Counter Registers (cont.)
// Plugin 0 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_0_CNTR_0
ONL_ROUTER_PLUGIN_0_CNTR_1
ONL_ROUTER_PLUGIN_0_CNTR_2
ONL_ROUTER_PLUGIN_0_CNTR_3
// Plugin 1 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_1_CNTR_0
ONL_ROUTER_PLUGIN_1_CNTR_1
ONL_ROUTER_PLUGIN_1_CNTR_2
ONL_ROUTER_PLUGIN_1_CNTR_3
// Plugin 2 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_2_CNTR_0
ONL_ROUTER_PLUGIN_2_CNTR_1
ONL_ROUTER_PLUGIN_2_CNTR_2
ONL_ROUTER_PLUGIN_2_CNTR_3
// Plugin 3 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_3_CNTR_0
ONL_ROUTER_PLUGIN_3_CNTR_1
ONL_ROUTER_PLUGIN_3_CNTR_2
ONL_ROUTER_PLUGIN_3_CNTR_3
// Plugin 4 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_4_CNTR_0
ONL_ROUTER_PLUGIN_4_CNTR_1
ONL_ROUTER_PLUGIN_4_CNTR_2
ONL_ROUTER_PLUGIN_4_CNTR_3

Stats Counter Priority
There are two levels of priority for Stats Counters
»High-priority (high-speed) are kept in local memory. There are 64 sets of counters for the router and 64 for the plug-ins
»Low-priority (low-speed) are in SRAM and cover the remaining stats indexes.
Stats Counters 0 to 127 point to the high-priority counters, while higher indexes are low-priority counters.
Using low-priority Stats Counters to count events that happen at high speed may degrade system performance (using one as a pre-Q counter on a high-priority queue, for example)
Plug-ins need to be aware of the segmentation of priority so they can use the proper priority counters based on needs
Global Registers are always high-priority
Eight threads used
»Seven threads process messages from the input scratch ring
»One thread writes 8W chunks of the local memory counters/registers to SRAM so that each counter/register is updated in SRAM several times a second.
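A plug-in choosing a stats index needs to know which tier it lands in. The sketch below classifies an index, assuming the split follows the "Index > 127?" test in the Stats block diagram and that the 64 router sets come before the 64 plug-in sets (the order in which the local memory map lists them). The function name and tier labels are hypothetical.

```python
ROUTER_FAST_SETS = 64   # high-priority sets for the router
PLUGIN_FAST_SETS = 64   # high-priority sets for the plug-ins
FAST_SETS = ROUTER_FAST_SETS + PLUGIN_FAST_SETS


def counter_tier(index: int) -> str:
    """Return the priority tier for a 16-bit stats index."""
    if not 0 <= index <= 0xFFFF:
        raise ValueError("stats index must fit in 16 bits")
    if index < ROUTER_FAST_SETS:
        return "high-priority (router, local memory)"   # ASSUMED ordering
    if index < FAST_SETS:
        return "high-priority (plug-in, local memory)"  # ASSUMED ordering
    return "low-priority (SRAM)"
```

For example, counter_tier(127) reports a local-memory counter while counter_tier(128) reports an SRAM counter, so a per-packet counter on a busy queue would want an index below 128.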

Stats ME Local Memory Map
Global Registers: words 0 to 63
Reserved
Stats Counters (router): 64*4W = 256W
Stats Counters (plug-ins): 64*4W = 256W

Stats Pseudocode
While (true and ctx={0:6}) {
    dl_source_scr_1word()
    decode_opcode()
    case (opcode) {
        Global Register:
            lm_addr = index << 2;
            do fast_opcode;
        Stats Counter:
            if (index > 127) {
                do slow_opcode;
            } else {
                lm_addr = (128*4) + (index << 4);
                do fast_opcode;
            }
    }
}
While (true and ctx=7) {
    offset = 0;
    for (l_mem=0; l_mem<(64*4); l_mem=l_mem+8) {
        sram_write(GLOBAL_REGS_BASE, offset, l_mem, 8);
        offset = offset + 32;
    }
    offset = 0;
    for (l_mem=(128*4); l_mem<(128*16); l_mem=l_mem+8) {
        sram_write(ONL_STATS_BASE, offset, l_mem, 8);
        offset = offset + 32;
    }
}
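The address arithmetic in the pseudocode can be exercised on its own. The sketch below is an interpretation, not the author's code: it assumes byte addressing in local memory, one-word (4-byte) counters, global registers at index << 2 (the shift shown in the Stats block diagram), and a write-back that advances both local memory and SRAM by one 8-word chunk (32 bytes) across all 128 fast counter sets. The slide's loops step l_mem by 8 and use different bounds, so the units here are a guess. All Python names are hypothetical.

```python
from typing import Iterator, Optional, Tuple

WORD_BYTES = 4
GLOBAL_REG_WORDS = 64
STATS_BASE = 128 * WORD_BYTES   # byte offset of the first fast counter set
SET_BYTES = 4 * WORD_BYTES      # four one-word counters per stats index
FAST_SETS = 128
CHUNK_BYTES = 8 * WORD_BYTES    # one 8-word write-back chunk


def local_memory_address(is_global: bool, index: int) -> Optional[int]:
    """Return the local memory byte address, or None for a slow (SRAM) counter."""
    if is_global:
        return index << 2
    if index > 127:
        return None
    return STATS_BASE + (index << 4)


def writeback_chunks() -> Iterator[Tuple[str, int, int]]:
    """Yield (sram_region, sram_offset, lm_addr) for one full write-back pass."""
    for lm_addr in range(0, GLOBAL_REG_WORDS * WORD_BYTES, CHUNK_BYTES):
        yield ("GLOBAL_REGS_BASE", lm_addr, lm_addr)
    stats_end = STATS_BASE + FAST_SETS * SET_BYTES
    for lm_addr in range(STATS_BASE, stats_end, CHUNK_BYTES):
        yield ("ONL_STATS_BASE", lm_addr - STATS_BASE, lm_addr)
```

Under these assumptions a full pass is 8 chunks of global registers plus 64 chunks of stats counters.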

Stats Function Calls
Defined in counter_util.uc:
»_WU_preq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
»_WU_preq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
»_WU_preq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
»_WU_postq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
»_WU_postq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
»_WU_postq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
»_WU_global_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
»_WU_global_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
»_WU_global_register_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data

Performance Targets
How many packets processed per second?
»To hit 5 Gb rate:
    76B per min IPv4 packet (64B min Enet Frame + 12B IFS)
    1.4 GHz clock rate
    5 Gb/sec * 1B/8b * packet/76B = 8.23 Mp/sec
    1.4 Gcycle/sec * 1 sec/8.23 Mp = 170 cycles per packet
    compute budget: 170 cycles
    latency budget: (threads*170)
        7 threads: 1190 cycles
How many count requests per packet (typical packet)?
»RX per-port count
»TX per-port count
»Pre-Q stats index
»Post-Q stats index
Total counts = 8.23 Mp/sec * 4 counts/packet = 32.9 Mcounts/sec
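The budget arithmetic on this slide can be reproduced directly. The sketch below uses only the figures given above (5 Gb/sec, 76 bytes per minimum packet, 1.4 GHz, 7 count threads, 4 counts per packet); the variable names are illustrative.

```python
LINE_RATE_BPS = 5e9            # 5 Gb/sec target
MIN_PACKET_BYTES = 64 + 12     # minimum Ethernet frame plus inter-frame spacing
CLOCK_HZ = 1.4e9               # ME clock rate
COUNT_THREADS = 7              # threads servicing the scratch ring
COUNTS_PER_PACKET = 4          # RX, TX, pre-Q, post-Q

packets_per_sec = LINE_RATE_BPS / 8 / MIN_PACKET_BYTES      # about 8.2 Mp/sec
cycles_per_packet = CLOCK_HZ / packets_per_sec              # about 170 cycles
compute_budget = round(cycles_per_packet)
latency_budget = COUNT_THREADS * compute_budget             # cycles across 7 threads
counts_per_sec = packets_per_sec * COUNTS_PER_PACKET        # about 32.9 Mcounts/sec

print(f"{packets_per_sec / 1e6:.2f} Mp/sec, {compute_budget} cycles per packet")
print(f"latency budget {latency_budget} cycles, {counts_per_sec / 1e6:.1f} Mcounts/sec")
```

The count rate, not the packet rate, is what the Stats Engine has to sustain, since every packet produces four commands on the scratch ring.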

Stats Block Diagram
[Flowchart: Read Scratch Ring (SCR READ: 60L + 2C); Decode Opcode (3C); Global Register? (4 CLK); Index > 127? (3 CLK), with the Y branch leading to Slow Counter; +1? (3C); +data? (3C). Address and update labels: LM_ADDR = (index << 4), LM_ADDR = (index << 2), LM_ADDR++ = *LM_ADDR + data, LM_ADDR = *LM_ADDR + 1.]
Worst case (fast) is for Stats Counters: 20 Clocks + 60 Cycles Latency

Performance Results
Total fast counts:
»Count time is, effectively, 20 cycles (all 60 cycles of latency are hidden)
»1400 Mcycles/sec / 20 cycles/count = 70 Mcounts/sec*
»Target is 32.9 Mcounts/sec.
Slow counts:
»Count time is about 150 – 60 = 90 cycles (the SRAM latency is not completely hidden)
»1400/90 = 15.6 Mcounts/sec
SRAM Write-back
»After each count thread has had the chance to run, the write-back thread writes one 8-word block of local memory to SRAM.
»Measured performance is 20 ms for a full write-back (50 updates per second)
»This will slow down the counting, but only by 19 cycles every 7th count (when the counter is fully-loaded) or less than 3 instructions per count thread.
*In simulation, only 17 cycles were measured for >82 Mcounts/sec
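The throughput figures follow from the 1400 Mcycles/sec clock and the per-count cycle costs quoted on the slide. A short check, with illustrative variable names:

```python
CLOCK_MCYCLES_PER_SEC = 1400

FAST_COUNT_CYCLES = 20        # effective cost with latency hidden
SIMULATED_FAST_CYCLES = 17    # cost measured in simulation
SLOW_COUNT_CYCLES = 150 - 60  # SRAM latency only partly hidden
WRITEBACK_CYCLES = 19         # write-back cost, paid once per 7 counts
COUNT_THREADS = 7
FULL_WRITEBACK_SEC = 0.020    # measured time for a full write-back

fast_mcounts = CLOCK_MCYCLES_PER_SEC / FAST_COUNT_CYCLES           # 70 Mcounts/sec
simulated_mcounts = CLOCK_MCYCLES_PER_SEC / SIMULATED_FAST_CYCLES  # a little over 82
slow_mcounts = CLOCK_MCYCLES_PER_SEC / SLOW_COUNT_CYCLES           # about 15.6
writeback_overhead = WRITEBACK_CYCLES / COUNT_THREADS              # under 3 cycles per count
sram_updates_per_sec = 1 / FULL_WRITEBACK_SEC                      # 50 per second

print(f"fast {fast_mcounts:.0f}, simulated {simulated_mcounts:.1f}, slow {slow_mcounts:.1f} Mcounts/sec")
print(f"write-back overhead {writeback_overhead:.2f} cycles per count")
```

The fast path clears the per-packet counting target of four counts per minimum-size packet at 5 Gb/sec by roughly a factor of two, while the slow path alone would not, which is why high-rate events belong on the local-memory counters.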

Lookup File locations
Code
»src/applications/ONL_Router/src/freelistMgr/freelistMgr.uc
»Src/library/dataplane/counter_util.uc
Include Paths
»src/applications/ONL_Router/src/dispatch_loop/ONL/
    dl_source.h and dl_source.uc
        dl_source() and dl_sink() functions
»Other, standard, include paths (Intel SDK provided)