David M. Zar Applied Research Laboratory Computer Science and Engineering Department ONL Stats Block
2 - David M. Zar - 9/11/2015
Stats Engine
The Stats Engine is a single ME (microengine) devoted to accepting messages from a scratch ring and performing increment and add operations on counters.
»All MEs that need to update counters will use the Stats Engine
»Operations supported will be
  Atomic increment (+1)
  Atomic add (+data)
»Format of the commands will be
  Opcode (4b) | Data (12b) | Index (16b)
ONL NP Router
[Block diagram. Blocks shown: Rx (2 ME), Mux (1 ME), Parse, Lookup, Copy (3 MEs), QM (1 ME), HdrFmt (1 ME), Tx (1 ME), Stats (1 ME), FreeList Mgr (1 ME), Plugin0 through Plugin4, and the xScale. Memories shown: SRAM, TCAM, and Assoc. Data ZBT-SRAM. Blocks are connected by NN rings, scratch rings, and SRAM rings (labelled "64KW Each"). The Stats ME takes input from Tx, QM, Parse, the plug-ins, and the XScale.]
MEs -> Stats Block
[Diagram: the MEs send one-word messages to the Stats block.]
Message format: Opcode (4b) | Data (12b) | Index (16b)
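The one-word message can be illustrated with a small packing helper. This is only a sketch: the slides give the field widths, but the bit positions (opcode in the top four bits, index in the low sixteen) and the function names `pack_command` and `unpack_command` are assumptions made for illustration.

```python
OPCODE_BITS, DATA_BITS, INDEX_BITS = 4, 12, 16  # field widths from the slide


def pack_command(opcode, data, index):
    """Pack the three fields into one 32-bit word.

    Assumed layout (not confirmed by the slides): opcode in bits 31..28,
    data in bits 27..16, index in bits 15..0.
    """
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit in 4 bits")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data must fit in 12 bits")
    if not 0 <= index < (1 << INDEX_BITS):
        raise ValueError("index must fit in 16 bits")
    return (opcode << (DATA_BITS + INDEX_BITS)) | (data << INDEX_BITS) | index


def unpack_command(word):
    """Split a 32-bit command word back into (opcode, data, index)."""
    opcode = (word >> (DATA_BITS + INDEX_BITS)) & ((1 << OPCODE_BITS) - 1)
    data = (word >> INDEX_BITS) & ((1 << DATA_BITS) - 1)
    index = word & ((1 << INDEX_BITS) - 1)
    return opcode, data, index


word = pack_command(0b0011, 76, 5)
print(hex(word), unpack_command(word))
```

One consequence of the 12-bit Data field is that a single +data command can add at most 4095.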
Opcodes
Opcode (4b) | Data (12b) | Index (16b)
Opcode:
»0011 +1, +data pre-q counter specified in Index
»0111 +1, +data post-q counter specified in Index
»0010 +1 pre-q counter specified in Index
»0110 +1 post-q counter specified in Index
»0001 +data pre-q counter specified in Index
»0101 +data post-q counter specified in Index
»1011 +1, +data global register specified in Index
»1010 +1 global register specified in Index
»1001 +data global register specified in Index (not implemented – 4/23/07)
Stats Counters
Each Index specifies a group of four counters
»Pre-Q packet count
»Pre-Q byte count
»Post-Q packet count
»Post-Q byte count
The packet counters get updated when the +1 instructions are specified (opcodes 0-1-)
The byte counters get updated when the +data instructions are specified (opcodes 0--1)
For plug-ins, the use of each counter can be redefined, but the opcodes do not change (i.e. each stats index corresponds to two incrementers and two adders).
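The opcode patterns above can be read bit by bit, and a short model makes that reading concrete. This is a sketch, not the microengine code: the interpretation of bit 3 as "global register", bit 2 as "post-Q", bit 1 as "+1", and bit 0 as "+data" is inferred from the opcode table, and `apply_stats_opcode` is a hypothetical name. Counter width and wrap-around are not modelled.

```python
# Position of each counter inside a four-counter group, in the order listed on the slide.
PRE_Q_PKT, PRE_Q_BYTE, POST_Q_PKT, POST_Q_BYTE = range(4)


def apply_stats_opcode(group, opcode, data):
    """Update one four-counter group in place according to a 4-bit stats opcode."""
    if opcode & 0b1000:
        raise ValueError("opcode addresses a global register, not a stats counter")
    # Bit 2 selects the post-Q pair; otherwise the pre-Q pair is used.
    base = POST_Q_PKT if opcode & 0b0100 else PRE_Q_PKT
    if opcode & 0b0010:  # +1: packet counter (opcodes 0-1-)
        group[base] += 1
    if opcode & 0b0001:  # +data: byte counter (opcodes 0--1)
        group[base + 1] += data
    return group


group = [0, 0, 0, 0]
apply_stats_opcode(group, 0b0011, 76)  # pre-Q: one packet, 76 bytes
apply_stats_opcode(group, 0b0111, 60)  # post-Q: one packet, 60 bytes
print(group)
```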
Global Registers
For system-wide counters, we define a separate set of global registers to handle them.
»RX (packet and byte, 5 ports: 10 words)
»TX (packet and byte, 5 ports: 10 words)
»Drop counts (10 words)
»Plug-in use (four per plug-in: 20 words)
»Per-ME error counters (8 words)
»10 + 10 + 10 + 20 + 8 = 58, so reserve 64 words for these
The register gets incremented when the +1 instructions are specified (opcodes 101-)
The register gets added to when the +data instructions are specified (opcodes 10-1)
The RX and TX counters will be assigned on even-word boundaries (lsb = 0) so that the packet and byte counters are associated and the +1, +data instruction can be done on both in one command (1011 opcode)
For plug-ins, the use of each register is under the control of the plug-in
»Four independent counters
»Two sets of two counters
»One set of two and two independent
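The word budget and the even-word pairing can be checked with a short sketch. Several things here are assumptions for illustration only: that the blocks are packed back to back from word 0 in the order listed, that the byte counter sits in the odd word directly after its packet counter, and the helper names `layout` and `paired_update`.

```python
# (block name, size in words), taken from the bullet list on the slide.
GLOBAL_REGISTER_BLOCKS = [
    ("RX per-port (packet, byte)", 5 * 2),
    ("TX per-port (packet, byte)", 5 * 2),
    ("Drop counts", 10),
    ("Plug-in use", 5 * 4),
    ("Per-ME error counters", 8),
]
RESERVED_WORDS = 64


def layout(blocks):
    """Return ({name: (first_word, word_count)}, words_used) for back-to-back packing."""
    table, next_word = {}, 0
    for name, words in blocks:
        table[name] = (next_word, words)
        next_word += words
    return table, next_word


def paired_update(regs, index, data):
    """Model opcode 1011: +1 on the even word, +data on the following odd word (assumed)."""
    if index & 1:
        raise ValueError("paired update needs an even-word index (lsb = 0)")
    regs[index] += 1
    regs[index + 1] += data


table, used = layout(GLOBAL_REGISTER_BLOCKS)
print(used, "words used of", RESERVED_WORDS)

regs = [0] * RESERVED_WORDS
paired_update(regs, 2, 76)  # hypothetical: RX port 1 packet/byte pair at words 2 and 3
print(regs[2], regs[3])
```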
ONL Router Counter Registers (in dl_system.h)
// RX Per Port registers: (Updated by MUX)
ONL_ROUTER_RX_PORT0_PKT_CNTR
ONL_ROUTER_RX_PORT0_BYTE_CNTR
ONL_ROUTER_RX_PORT1_PKT_CNTR
ONL_ROUTER_RX_PORT1_BYTE_CNTR
ONL_ROUTER_RX_PORT2_PKT_CNTR
ONL_ROUTER_RX_PORT2_BYTE_CNTR
ONL_ROUTER_RX_PORT3_PKT_CNTR
ONL_ROUTER_RX_PORT3_BYTE_CNTR
ONL_ROUTER_RX_PORT4_PKT_CNTR
ONL_ROUTER_RX_PORT4_BYTE_CNTR
// TX Per Port registers: (Updated by HF)
ONL_ROUTER_TX_PORT0_PKT_CNTR
ONL_ROUTER_TX_PORT0_BYTE_CNTR
ONL_ROUTER_TX_PORT1_PKT_CNTR
ONL_ROUTER_TX_PORT1_BYTE_CNTR
ONL_ROUTER_TX_PORT2_PKT_CNTR
ONL_ROUTER_TX_PORT2_BYTE_CNTR
ONL_ROUTER_TX_PORT3_PKT_CNTR
ONL_ROUTER_TX_PORT3_BYTE_CNTR
ONL_ROUTER_TX_PORT4_PKT_CNTR
ONL_ROUTER_TX_PORT4_BYTE_CNTR
// IP Drop registers (Updated by PLC)
ONL_ROUTER_IP_HEC_DROP_CNTR
ONL_ROUTER_IP_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_HDR_LENGTH_ERR_DROP_CNTR
ONL_ROUTER_IP_VERSION_ERR_DROP_CNTR
ONL Router Counter Registers (cont.)
// PLC Drop registers (Updated by Parse, Lookup or Copy)
ONL_ROUTER_PLC_TO_PLUGIN_DROP_CNTR
ONL_ROUTER_PLC_TO_XSCALE_DROP_CNTR
// QM Drop registers (Updated by QM)
ONL_ROUTER_QUEUE_OVERFLOW_DROP_CNTR
// XScale Drop registers (Updated by XScale)
ONL_ROUTER_XSCALE_DROP_CNTR
// Rx Drop registers (Updated by Rx)
ONL_ROUTER_RX__DROP_CNTR
// Tx Drop registers (Updated by Tx)
ONL_ROUTER_TX_DROP_CNTR
// Per Block Generic Error Counters
ONL_ROUTER_RX_GENERIC_ERROR_CNTR
ONL_ROUTER_MUX_GENERIC_ERROR_CNTR
ONL_ROUTER_PLC_GENERIC_ERROR_CNTR
ONL_ROUTER_QM_GENERIC_ERROR_CNTR
ONL_ROUTER_HF_GENERIC_ERROR_CNTR
ONL_ROUTER_TX_GENERIC_ERROR_CNTR
ONL_ROUTER_STATS_GENERIC_ERROR_CNTR
ONL_ROUTER_FREELISTMGR_GENERIC_ERROR_CNTR
ONL Router Counter Registers (cont.)
// Plugin 0 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_0_CNTR_0
ONL_ROUTER_PLUGIN_0_CNTR_1
ONL_ROUTER_PLUGIN_0_CNTR_2
ONL_ROUTER_PLUGIN_0_CNTR_3
// Plugin 1 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_1_CNTR_0
ONL_ROUTER_PLUGIN_1_CNTR_1
ONL_ROUTER_PLUGIN_1_CNTR_2
ONL_ROUTER_PLUGIN_1_CNTR_3
// Plugin 2 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_2_CNTR_0
ONL_ROUTER_PLUGIN_2_CNTR_1
ONL_ROUTER_PLUGIN_2_CNTR_2
ONL_ROUTER_PLUGIN_2_CNTR_3
// Plugin 3 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_3_CNTR_0
ONL_ROUTER_PLUGIN_3_CNTR_1
ONL_ROUTER_PLUGIN_3_CNTR_2
ONL_ROUTER_PLUGIN_3_CNTR_3
// Plugin 4 Counters (for use however Plugin writer wants to use them)
ONL_ROUTER_PLUGIN_4_CNTR_0
ONL_ROUTER_PLUGIN_4_CNTR_1
ONL_ROUTER_PLUGIN_4_CNTR_2
ONL_ROUTER_PLUGIN_4_CNTR_3
Stats Counter Priority
There are two levels of priority for Stats Counters
»High-priority (high-speed) are kept in local memory. There are 64 sets of counters for the router and 64 for the plug-ins
»Low-priority (low-speed) are in SRAM. There are = of these.
Stats Counters 0-127 point to the high-priority counters, while indices above 127 are low-priority counters.
Using low-priority Stats Counters to count events that happen at high speed may degrade system performance (using one as a pre-Q counter on a high-priority queue, for example)
Plug-ins need to be aware of the segmentation of priority so they can use the proper priority counters based on their needs
Global Registers are always high-priority
Eight threads used
»Seven threads process messages from the input scratch ring
»One thread writes 8W chunks of the local memory counters/registers to SRAM so that each counter/register is updated in SRAM several times a second.
Stats ME Local Memory Map
Global Registers: words 0 to 63
Reserved
Stats Counters (router): 64*4W = 256W
Stats Counters (plug-ins): 64*4W = 256W
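The map can be turned into word offsets with a few constants. This is a sketch under stated assumptions: the Reserved region is taken to be 64 words (so that the stats counters start at word 128, which is what the `128*4` byte base in the Stats pseudocode suggests), and the function names are hypothetical.

```python
GLOBAL_WORDS = 64        # global registers, words 0..63
RESERVED_WORDS = 64      # assumed size of the Reserved region
WORDS_PER_GROUP = 4      # pre-Q pkt, pre-Q byte, post-Q pkt, post-Q byte
FAST_GROUPS = 64 + 64    # router groups followed by plug-in groups
BYTES_PER_WORD = 4

STATS_BASE_WORD = GLOBAL_WORDS + RESERVED_WORDS
TOTAL_WORDS = STATS_BASE_WORD + FAST_GROUPS * WORDS_PER_GROUP


def global_register_word(index):
    """Word offset of a global register in local memory."""
    if not 0 <= index < GLOBAL_WORDS:
        raise ValueError("global register index out of range")
    return index


def stats_group_word(index):
    """Word offset of the first counter of a high-priority stats group."""
    if not 0 <= index < FAST_GROUPS:
        raise ValueError("index is not a high-priority (local memory) stats counter")
    return STATS_BASE_WORD + index * WORDS_PER_GROUP


print(TOTAL_WORDS, stats_group_word(5) * BYTES_PER_WORD)
```

Converted to bytes, `stats_group_word(index)` gives `(128*4) + (index << 4)`, the same expression the pseudocode uses for fast stats counters.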
Stats Pseudocode
While (true and ctx={0:6}) {
  dl_source_scr_1word()
  decode_opcode()
  case (opcode) {
    Global Register:
      lm_addr = (index << 2);
      do fast_opcode;
    Stats Counter:
      if (index > 127) { do slow_opcode; }
      else { lm_addr = (128*4) + (index << 4); do fast_opcode; }
  }
}
While (true and ctx=7) {
  offset = 0;
  for (l_mem=0; l_mem<(64*4); l_mem=l_mem+8) {
    sram_write(GLOBAL_REGS_BASE, offset, l_mem, 8);
    offset = offset + 32;
  }
  offset = 0;
  for (l_mem=(128*4); l_mem<(128*16); l_mem=l_mem+8) {
    sram_write(ONL_STATS_BASE, offset, l_mem, 8);
    offset = offset + 32;
  }
}
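The write-back thread (ctx 7) can be modelled as a chunked copy from local memory to SRAM. The sketch below works entirely in word addresses, which is a simplification of the pseudocode's address units, and takes the two regions from the local memory map (global registers in words 0 to 63, stats counters assumed to occupy words 128 to 639). The SRAM base offsets and the name `write_back` are hypothetical.

```python
CHUNK_WORDS = 8  # the write-back thread moves 8-word chunks


def write_back(local_mem, sram, lm_start, lm_end, sram_base):
    """Copy local_mem[lm_start:lm_end] into sram in 8-word chunks; return the chunk count."""
    if (lm_end - lm_start) % CHUNK_WORDS:
        raise ValueError("region must be a whole number of 8-word chunks")
    chunks = 0
    for lm in range(lm_start, lm_end, CHUNK_WORDS):
        dst = sram_base + (lm - lm_start)
        sram[dst:dst + CHUNK_WORDS] = local_mem[lm:lm + CHUNK_WORDS]
        chunks += 1
    return chunks


local_mem = list(range(640))  # stand-in contents: each word holds its own address
sram = [0] * 1024             # hypothetical SRAM mirror

global_chunks = write_back(local_mem, sram, 0, 64, sram_base=0)
stats_chunks = write_back(local_mem, sram, 128, 640, sram_base=64)
print(global_chunks, stats_chunks)
```

Under these assumptions one full pass is 8 chunks of global registers plus 64 chunks of stats counters.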
Stats Function Calls
Defined in counter_util.uc:
»_WU_preq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
»_WU_preq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
»_WU_preq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
»_WU_postq_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
»_WU_postq_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
»_WU_postq_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
»_WU_global_register_add(reg_num, tx_reg, update_sig, error_addr) // +1
»_WU_global_register_add(reg_num, tx_reg, data, update_sig, error_addr) // +data
»_WU_global_register_update(reg_num, tx_reg, data, update_sig, error_addr) // +1 & +data
Performance Targets
How many packets processed per second?
»To hit 5 Gb/sec rate:
  76B per min IPv4 packet (64B min Enet frame + 12B IFS)
  1.4 GHz clock rate
  5 Gb/sec * 1B/8b * packet/76B = 8.22 Mp/sec
  1.4 Gcycle/sec * 1 sec/8.22 Mp = 170 cycles per packet
  compute budget: 170 cycles
  latency budget: (threads * 170)
    7 threads: 1190 cycles
How many count requests per packet (typical packet)?
»RX per-port count
»TX per-port count
»Pre-Q stats index
»Post-Q stats index
Total counts = 8.22 Mp/sec * 4 counts/packet = 32.9 Mcounts/sec
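The budget arithmetic on this slide can be reproduced in a few lines. All inputs come from the slide; the variable names are illustrative only.

```python
LINE_RATE_BPS = 5e9       # 5 Gb/sec target
MIN_PACKET_BYTES = 76     # 64B minimum Ethernet frame + 12B IFS
CLOCK_HZ = 1.4e9          # ME clock rate
COUNT_THREADS = 7         # threads servicing the scratch ring
COUNTS_PER_PACKET = 4     # RX, TX, pre-Q, post-Q

packets_per_sec = LINE_RATE_BPS / 8 / MIN_PACKET_BYTES
cycles_per_packet = CLOCK_HZ / packets_per_sec
latency_budget = COUNT_THREADS * round(cycles_per_packet)
counts_per_sec = packets_per_sec * COUNTS_PER_PACKET

print(f"{packets_per_sec / 1e6:.2f} Mp/sec")
print(f"{cycles_per_packet:.0f} cycles per packet")
print(f"{latency_budget} cycles latency budget")
print(f"{counts_per_sec / 1e6:.1f} Mcounts/sec")
```

The count-rate target follows directly from the packet rate: four count requests for each of roughly 8.2 million packets per second.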
Stats Block Diagram
[Flowchart. Nodes, with cycle costs where given: Read Scratch Ring (SCR READ: 60L + 2C); Decode Opcode (3C); Global Register? (4 CLK); Index > 127? (3 CLK); Slow Counter; LM_ADDR = (index << 2); LM_ADDR = (index << 4); +1? (3C); +data? (3C); LM_ADDR = *LM_ADDR + 1; LM_ADDR++ = *LM_ADDR + data.]
Worst case (fast) is for Stats Counters: 20 Clocks + 60 Cycles Latency
Performance Results
Total fast counts:
»Count time is, effectively, 20 cycles (all 60 cycles of latency are hidden)
»1400 Mcycles/sec / 20 cycles/count = 70 Mcounts/sec*
»Target is 32.9 Mcounts/sec.
Slow counts:
»Count time is about 150 – 60 = 90 cycles (the SRAM latency is not completely hidden)
»1400/90 = 15.6 Mcounts/sec
SRAM Write-back
»After each count thread has had the chance to run, the write-back thread writes one 8-word block of local memory to SRAM.
»Measured performance is 20 ms for a full write-back (50 updates per second)
»This will slow down the counting, but only by 19 cycles every 7th count (when the counter is fully loaded), or less than 3 cycles per count thread.
*In simulation, only 17 cycles were measured, for >82 Mcounts/sec
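The throughput figures can be cross-checked by dividing the clock rate by the effective cycles per count. The sketch uses only numbers quoted in the slides; note that the 15.6 Mcounts/sec slow-path figure corresponds to the 90-cycle effective count time, and that the target is recomputed here from the 5 Gb/sec, 76B, four-counts-per-packet assumptions.

```python
CLOCK_MHZ = 1400  # 1.4 GHz ME clock


def mcounts_per_sec(cycles_per_count):
    """Counts per second, in millions, for a given effective cycle cost per count."""
    return CLOCK_MHZ / cycles_per_count


fast = mcounts_per_sec(20)        # analytic fast path
fast_sim = mcounts_per_sec(17)    # fast path as measured in simulation
slow = mcounts_per_sec(150 - 60)  # slow path, SRAM latency partly hidden

target = 5e9 / 8 / 76 * 4 / 1e6   # Mcounts/sec needed at the 5 Gb/sec line rate
writeback_cost = 19 / 7           # extra cycles per count from the write-back thread

print(f"fast {fast:.1f}, simulated {fast_sim:.1f}, slow {slow:.1f} Mcounts/sec")
print(f"headroom over target: {fast / target:.2f}x")
print(f"write-back overhead: {writeback_cost:.2f} cycles per count")
```

On these numbers the fast path has roughly a twofold margin over the target, while the slow path on its own falls short of it, which is consistent with the advice to keep high-rate events on high-priority counters.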
Lookup File locations
Code
»src/applications/ONL_Router/src/freelistMgr/freelistMgr.uc
»Src/library/dataplane/counter_util.uc
Include Paths
»src/applications/ONL_Router/src/dispatch_loop/ONL/
  dl_source.h and dl_source.uc
    dl_source() and dl_sink() functions
»Other, standard, include paths (Intel SDK provided)