QM Performance Analysis


1 QM Performance Analysis
John DeHart

2 ONL NP Router
[Block diagram, only partially recoverable from the transcript: the ONL NP Router pipeline of Rx (2 ME) -> Mux (1 ME) -> Parse, Lookup, Copy (3 MEs) -> QM (1 ME) -> HdrFmt (1 ME) -> Tx (1 ME), plus Stats (1 ME), FreeList Mgr (1 ME), five plugin MEs (Plugin0-Plugin4), and the xScale. Blocks are connected by 512W scratch rings, NN rings, and 64KW small and large SRAM rings, with SRAM, ZBT-SRAM associated data, and the TCAM attached; xScale paths carry exceptions, errors, update requests, and plugin control messages. Annotations mark blocks as "Mostly Unchanged", "Needs Some Mod.", or "Needs A Lot Of Mod.".]

3 Performance
What is our performance target? To hit the 5 Gb/s rate:
Minimum Ethernet frame: 76B (64B frame + 12B inter-frame spacing)
5 Gb/sec * 1B/8b * packet/76B = 8.22 Mpkt/sec
IXP ME processing: 1.4 GHz clock rate
1.4 Gcycle/sec * 1 sec / 8.22 Mpkt = ~170 cycles per packet
Compute budget (MEs * 170):
1 ME: 170 cycles
2 MEs: 340 cycles
3 MEs: 510 cycles
4 MEs: 680 cycles
Latency budget (threads * 170):
1 ME, 8 threads: 1360 cycles
2 MEs, 16 threads: 2720 cycles
3 MEs, 24 threads: 4080 cycles
4 MEs, 32 threads: 5440 cycles
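
A minimal C sketch (illustrative, not from the slides) that recomputes these budgets from the 76B wire footprint and the 1.4 GHz ME clock:

    /* Recompute the slide's per-packet cycle budgets. All constants are
     * taken from the slide above; the program itself is illustrative. */
    #include <stdio.h>

    int main(void) {
        const double link_bps    = 5e9;   /* 5 Gb/s target rate      */
        const double frame_bytes = 76.0;  /* 64B frame + 12B spacing */
        const double clock_hz    = 1.4e9; /* IXP ME clock rate       */

        double pkt_rate = link_bps / 8.0 / frame_bytes;  /* ~8.22 Mpkt/s */
        double budget   = clock_hz / pkt_rate;           /* ~170 cycles  */

        printf("packet rate: %.2f Mpkt/s\n", pkt_rate / 1e6);
        for (int mes = 1; mes <= 4; mes++)
            printf("%d ME: compute %.0f cycles, latency (8 threads/ME) %.0f cycles\n",
                   mes, mes * budget, mes * 8 * budget);
        return 0;
    }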

4 QM Performance
1 ME using 7 threads:
Threads each run once per iteration.
The Enqueue thread and the Dequeue threads run in parallel, so their latencies can overlap.
The Freelist management thread runs in isolation from the other threads.
1 Enqueue thread: processes a batch of 5 packets per iteration.
5 Dequeue threads: each processes 1 packet per iteration.
1 Freelist management thread: maintains the state of the freelist once per iteration.
Each iteration can enqueue and dequeue 5 packets.
Total latency budget for an iteration: 5 * 170 cycles = 850 cycles, the sum of:
the latency of the Freelist management thread
the combined latency of the Enqueue thread and the Dequeue threads
Compute budget: (FL_cpu/5) + DQ_cpu + (ENQ_cpu/5) <= 170 cycles
Current (June 2007) BEST CASE (all queues already loaded) estimates (see the sketch below):
FL_cpu: 41 cycles
DQ_cpu: 216 cycles
  Execution cycles range from 128 to 202 cycles.
  Abort cycles range from 47 to 70 cycles.
  There seems to be a strange progression: DQ0 <= DQ1 <= DQ2 <= DQ3 <= DQ4.
ENQ_cpu: 501 cycles
Total latency per iteration: 1600-1800 cycles (over the 850-cycle budget).
(FL_cpu/5) + DQ_cpu + (ENQ_cpu/5) = 41/5 + 216 + 501/5 = 324 cycles (well over the 170-cycle budget).
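
A minimal C check of the per-packet compute budget, using only the formula and the June 2007 estimates quoted above:

    /* Check the iteration budget: FL and ENQ costs are amortized over the
     * 5 packets of an iteration; each DQ thread handles one packet. */
    #include <stdio.h>

    int main(void) {
        const double budget  = 170.0; /* per-packet budget for 1 ME       */
        const double fl_cpu  = 41.0;  /* freelist mgmt, once/iteration    */
        const double dq_cpu  = 216.0; /* one dequeue thread, 1 pkt        */
        const double enq_cpu = 501.0; /* enqueue thread, 5 pkts/iteration */

        double per_pkt = fl_cpu / 5.0 + dq_cpu + enq_cpu / 5.0;  /* 324.4 */
        printf("per-packet compute: %.1f cycles (budget %.0f) -> %s\n",
               per_pkt, budget, per_pkt <= budget ? "OK" : "over budget");
        return 0;
    }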

5 QM Performance Improvements
These are simple improvements that might save us tens of cycles each.

Change the way we read the scratch ring on input to Enqueue.
Currently we do this (each get fetches the input data for 1 pkt):

    .xfer_order $rdata_a
    .xfer_order $rdata_b
    .xfer_order $rdata_c
    .xfer_order $rdata_d
    .xfer_order $rdata_e
    scratch[get, $rdata_a[0], 0, ring, 3], sig_done[sram_sig0]
    scratch[get, $rdata_b[0], 0, ring, 3], sig_done[sram_sig1]
    scratch[get, $rdata_c[0], 0, ring, 3], sig_done[sram_sig2]
    scratch[get, $rdata_d[0], 0, ring, 3], sig_done[sram_sig3]
    scratch[get, $rdata_e[0], 0, ring, 3], sig_done[sram_sig4]

The fifth scratch get always causes a stall, since the cmd FIFO on the ME is only 4 deep. When it stalls, it also causes an abort of that instruction and the following 2 instructions: a total of 15 cycles consumed by the abort (3 cycles) and the stall (12 cycles).

This seems more efficient (two 6-word gets plus one 3-word get):

    .xfer_order $rdata_a, $rdata_b
    .xfer_order $rdata_c, $rdata_d
    .xfer_order $rdata_e
    scratch[get, $rdata_a[0], 0, ring, 6], sig_done[sram_sig0]
    scratch[get, $rdata_c[0], 0, ring, 6], sig_done[sram_sig1]
    scratch[get, $rdata_e[0], 0, ring, 3], sig_done[sram_sig2]

If there is just one pkt in the input ring, it shows up in $rdata_e.
If there are two pkts, they show up in $rdata_a and $rdata_b.
If there are three pkts, they show up in $rdata_a, $rdata_b and $rdata_e.
If there are four pkts, they show up in $rdata_a, $rdata_b, $rdata_c and $rdata_d.
If there are five pkts, they show up in $rdata_a, $rdata_b, $rdata_c, $rdata_d and $rdata_e.

Local Memory operations are all done as uncoordinated macro calls. Each call to the macro sets the Local Memory Address CSR, and this instruction seems to put a command in the ME cmd FIFO, which may stall if that FIFO is full.

Combine the _qm_read_enqueue_request and do_enqueue macros: when we load pkt_descriptor[n] we can immediately test it instead of testing it again later.
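
A small C model (hypothetical, not ME microcode) of the landing behavior enumerated above, under the assumption the slide implies: a get that asks for more words than the ring currently holds returns nothing, so the trailing 3-word get catches the odd packet.

    /* Model which transfer buffers receive data for 1-5 queued packets,
     * assuming 3 ring words per packet and all-or-nothing gets. */
    #include <stdio.h>

    #define WORDS_PER_PKT 3

    int main(void) {
        for (int pkts = 1; pkts <= 5; pkts++) {
            int words = pkts * WORDS_PER_PKT;
            int in_ab = 0, in_cd = 0, in_e = 0;        /* pkts landing per get   */
            if (words >= 6) { in_ab = 2; words -= 6; } /* 6-word get, $rdata_a/b */
            if (words >= 6) { in_cd = 2; words -= 6; } /* 6-word get, $rdata_c/d */
            if (words >= 3) { in_e  = 1; words -= 3; } /* 3-word get, $rdata_e   */
            printf("%d pkt(s): a/b=%d, c/d=%d, e=%d\n", pkts, in_ab, in_cd, in_e);
        }
        return 0;
    }

Running the model reproduces the five cases listed above (e.g. one packet lands only in $rdata_e; three packets land in $rdata_a, $rdata_b and $rdata_e).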

6 QM Snapshots
Breakpoint set at the start of the maintain_fl() macro in the FL management thread.
All queues should be already loaded.
Run for one iteration:
ENQ processes 5 pkts.
Each of the 5 DQ threads processes 1 pkt.
Rx reports 10 packets received and Tx reports 5 packets transmitted.

7 QM Snapshots
Breakpoint set at the start of the maintain_fl() macro in the FL management thread.
All queues should be already loaded.
Run for one iteration:
ENQ processes 5 pkts.
Each of the 5 DQ threads processes 1 pkt.
Rx reports 10 packets received and Tx reports 5 packets transmitted.

8 200-Byte Eth Frames
With 200-byte Ethernet frames and 5 ports sending at full rate, Dequeue cannot keep up. After about 1030 packets we start discarding in Enqueue because the queues are full.
Queue thresholds were set to 0xfff.
Port rates were set to 0x1000 (greater than 1 Gb/s).
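
For context, a hypothetical extension of the slide-3 arithmetic to these frame sizes, assuming the same frame + 12B spacing accounting (the 200B and 400B figures are computed here, not taken from the slides):

    /* Per-packet cycle budgets at 5 Gb/s for the tested frame sizes. */
    #include <stdio.h>

    int main(void) {
        const double link_bps = 5e9, clock_hz = 1.4e9, ifs_bytes = 12.0;
        const int frames[] = { 64, 200, 400 };

        for (int i = 0; i < 3; i++) {
            double wire_bytes = frames[i] + ifs_bytes;
            double pkt_rate   = link_bps / 8.0 / wire_bytes;
            printf("%3dB frames: %.2f Mpkt/s, %.0f cycles/pkt budget\n",
                   frames[i], pkt_rate / 1e6, clock_hz / pkt_rate);
        }
        return 0;
    }

This reproduces the 170-cycle budget for 64B frames and gives roughly 475 and 923 cycles per packet at 200B and 400B respectively.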

9 400-Byte Eth Frames
With 400-byte Ethernet frames and 5 ports sending at full rate, the queues eventually build up. I suspect there is an inherent problem in the way dequeue works that keeps it from keeping up: Tx is flow-controlling the dequeue engines in this case, and this seems to be what is causing the queues to build up.

10 More snapshots (June 13, 2007)

11 More snapshots (June 13, 2007)

12 More snapshots (June 13, 2007)

13 More snapshots (June 13, 2007)

14 More snapshots (June 13-15, 2007)
QM Totals
Enqueue "WORST CASE": every pkt causes a queue to be evicted by Enqueue and a new one loaded.
Enqueue "BEST CASE": queues are always already loaded; nothing gets evicted.

