QM Performance Analysis

QM Performance Analysis John DeHart

[Block diagram: ONL NP Router. MEs: Rx (2), Mux (1), Parse/Lookup/Copy (3), QM (1), HdrFmt (1), Tx (1), Stats (1), FreeList Mgr (1), Plugins 0-4, plus the xScale; lookup uses the TCAM with associated-data ZBT-SRAM. Blocks are connected by scratch rings, NN rings, and small (512W) and large (64KW) SRAM rings. Annotations mark which blocks are new, mostly unchanged, need some modification, or need a lot of modification.]

Performance

What is our performance target? To hit the 5 Gb/s rate:
- Minimum Ethernet frame: 76B (64B frame + 12B inter-frame spacing)
- 5 Gb/sec * 1B/8b * packet/76B = 8.22 Mpkt/sec
- IXP ME processing: 1.4 GHz clock rate
- 1.4 Gcycle/sec * 1 sec / 8.22 Mpkt = 170.3 cycles per packet

Compute budget (MEs * 170):
- 1 ME: 170 cycles
- 2 MEs: 340 cycles
- 3 MEs: 510 cycles
- 4 MEs: 680 cycles

Latency budget (threads * 170):
- 1 ME, 8 threads: 1360 cycles
- 2 MEs, 16 threads: 2720 cycles
- 3 MEs, 24 threads: 4080 cycles
- 4 MEs, 32 threads: 5440 cycles
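
As a quick check of this arithmetic, here is a minimal sketch (plain C, not part of the router code) that reproduces the packet-rate and budget numbers above from the 5 Gb/s target, the 76B minimum frame, and the 1.4 GHz ME clock; all constants come from this slide.

    /* Minimal sketch reproducing the budget arithmetic above; the per-packet
     * budget is rounded to 170 cycles, as on the slide. */
    #include <stdio.h>

    int main(void) {
        double line_rate_bps   = 5e9;          /* 5 Gb/s target                   */
        double min_frame_bytes = 64.0 + 12.0;  /* min Ethernet frame + IFS = 76B  */
        double me_clock_hz     = 1.4e9;        /* IXP microengine clock rate      */

        double pkts_per_sec   = line_rate_bps / 8.0 / min_frame_bytes; /* ~8.22 Mpkt/s  */
        double cycles_per_pkt = me_clock_hz / pkts_per_sec;            /* ~170.3 cycles */
        int    budget         = 170;           /* rounded per-packet budget       */

        printf("packet rate: %.2f Mpkt/s, %.1f cycles per packet\n",
               pkts_per_sec / 1e6, cycles_per_pkt);
        for (int mes = 1; mes <= 4; mes++)
            printf("%d ME: compute budget %d cycles, latency budget (%d threads) %d cycles\n",
                   mes, mes * budget, mes * 8, mes * 8 * budget);
        return 0;
    }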

QM Performance

1 ME using 7 threads; each thread runs once per iteration.
- The Enqueue thread and the Dequeue threads run in parallel, so their latencies can overlap.
- The Freelist management thread runs in isolation from the other threads.

Thread roles:
- 1 Enqueue thread: processes a batch of 5 packets per iteration.
- 5 Dequeue threads: each processes 1 packet per iteration.
- 1 Freelist management thread: maintains the state of the freelist once per iteration.

Each iteration can enqueue and dequeue 5 packets.
Total latency budget for an iteration: 5 * 170 cycles = 850 cycles, which must cover the sum of:
- the latency of the Freelist management thread
- the combined latency of the Enqueue thread and the Dequeue threads

Compute budget: (FL_cpu/5) + DQ_cpu + (ENQ_cpu/5) <= 170 cycles (evaluated in the sketch below)

Current (June 2007) BEST CASE (all queues already loaded) estimates:
- FL_cpu: 41 cycles
- DQ_cpu: 216 cycles
  - execution cycles range from 128 to 202 cycles
  - abort cycles range from 47 to 70 cycles
  - there seems to be a strange progression: DQ0 <= DQ1 <= DQ2 <= DQ3 <= DQ4
- ENQ_cpu: 501 cycles
- Total latency per iteration: 1600 - 1800 cycles
- (FL_cpu/5) + DQ_cpu + (ENQ_cpu/5) = 8 + 216 + 100 = 324 cycles
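
To make the amortization explicit, the following sketch (plain C, using the June 2007 best-case estimates from this slide) evaluates the per-packet compute formula and the 850-cycle iteration latency budget; the variable names are illustrative, not taken from the QM source.

    /* Sketch of the QM per-iteration budget check above. FL and ENQ costs are
     * amortized over the 5-packet batch; each dequeue thread pays its cost once
     * per packet. Cycle counts are the slide's June 2007 best-case estimates. */
    #include <stdio.h>

    int main(void) {
        int fl_cpu  = 41;   /* freelist management thread, runs once per iteration */
        int enq_cpu = 501;  /* enqueue thread, handles a batch of 5 packets        */
        int dq_cpu  = 216;  /* one dequeue thread, handles 1 packet                */
        int budget  = 170;  /* per-packet compute budget from the 5 Gb/s target    */

        int per_pkt = fl_cpu / 5 + dq_cpu + enq_cpu / 5;  /* (FL/5) + DQ + (ENQ/5) */

        printf("(FL_cpu/5) + DQ_cpu + (ENQ_cpu/5) = %d cycles (budget %d)\n",
               per_pkt, budget);
        printf("iteration latency budget: 5 * %d = %d cycles\n", budget, 5 * budget);
        return 0;
    }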

QM Performance Improvements

These are simple improvements that might each save us tens of cycles.

1. Change the way we read the scratch ring on input to Enqueue.
Currently we do this (each get fetches the input data for 1 pkt):

    .xfer_order $rdata_a
    .xfer_order $rdata_b
    .xfer_order $rdata_c
    .xfer_order $rdata_d
    .xfer_order $rdata_e
    scratch[get, $rdata_a[0], 0, ring, 3], sig_done[sram_sig0]
    scratch[get, $rdata_b[0], 0, ring, 3], sig_done[sram_sig1]
    scratch[get, $rdata_c[0], 0, ring, 3], sig_done[sram_sig2]
    scratch[get, $rdata_d[0], 0, ring, 3], sig_done[sram_sig3]
    scratch[get, $rdata_e[0], 0, ring, 3], sig_done[sram_sig4]

The fifth scratch get always causes a stall, since the cmd FIFO on the ME is only 4 deep. When it stalls, it also causes an abort of that instruction and the following 2 instructions, for a total of 15 cycles consumed by the abort (3 cycles) and the stall (12 cycles).

This seems more efficient:

    .xfer_order $rdata_a, $rdata_b
    .xfer_order $rdata_c, $rdata_d
    scratch[get, $rdata_a[0], 0, ring, 6], sig_done[sram_sig0]
    scratch[get, $rdata_c[0], 0, ring, 6], sig_done[sram_sig1]
    scratch[get, $rdata_e[0], 0, ring, 3], sig_done[sram_sig2]

With this layout (modeled in the sketch after this slide):
- If there is just one pkt in the input ring, it shows up in $rdata_e.
- If there are two pkts, they show up in $rdata_a and $rdata_b.
- If there are three pkts, they show up in $rdata_a, $rdata_b and $rdata_e.
- If there are four pkts, they show up in $rdata_a, $rdata_b, $rdata_c and $rdata_d.
- If there are five pkts, they show up in $rdata_a, $rdata_b, $rdata_c, $rdata_d and $rdata_e.

2. Local Memory operations are all done as uncoordinated macro calls.
- Each call to the macro sets the Local Memory Address CSR.
- This instruction seems to put a command in the ME cmd FIFO, which may stall if that FIFO is full.

3. Combine the _qm_read_enqueue_request and do_enqueue macros.
- When we load pkt_descriptor[n] we can test it immediately instead of testing it again later.
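
As a sanity check on the batched-read layout, here is a small sketch (plain C, not ME microcode) that models which transfer registers end up holding data for 1 to 5 packets, assuming 3 ring words per enqueue request and that a scratch[get] returns data only when the ring holds at least the requested number of words.

    /* Models the transfer-register mapping listed above for the proposed batched
     * scratch-ring reads: two 6-word gets ($rdata_a/$rdata_b, $rdata_c/$rdata_d)
     * followed by one 3-word get ($rdata_e). Assumes 3 ring words per packet. */
    #include <stdio.h>

    int main(void) {
        for (int pkts = 1; pkts <= 5; pkts++) {
            int words = pkts * 3;                       /* 3 ring words per packet */
            int in_ab = 0, in_cd = 0, in_e = 0;

            if (words >= 6) { in_ab = 1; words -= 6; }  /* get of 6 -> $rdata_a/$rdata_b */
            if (words >= 6) { in_cd = 1; words -= 6; }  /* get of 6 -> $rdata_c/$rdata_d */
            if (words >= 3) { in_e  = 1; words -= 3; }  /* get of 3 -> $rdata_e          */

            printf("%d pkt(s): %s%s%s\n", pkts,
                   in_ab ? "$rdata_a,$rdata_b " : "",
                   in_cd ? "$rdata_c,$rdata_d " : "",
                   in_e  ? "$rdata_e" : "");
        }
        return 0;
    }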

QM Snapshots

Breakpoint set at the start of the maintain_fl() macro in the FL management thread.
- All queues should already be loaded.
- Run for one iteration:
  - ENQ processes 5 pkts
  - each of the 5 DQ threads processes 1 pkt
- Rx reports 10 packets received and Tx reports 5 packets transmitted.

200 Byte Eth Frames

With 200 byte Ethernet frames and 5 ports sending at full rate:
- Dequeue cannot keep up.
- After about 1030 packets we start discarding in Enqueue because the queues are full.
- Queue thresholds were set to 0xfff.
- Port rates were set to 0x1000 (greater than 1 Gb/s).

400 Byte Eth Frames

With 400 byte Ethernet frames and 5 ports sending at full rate:
- Queues build up eventually.
- I suspect there is an inherent problem in the way dequeue works that keeps it from keeping up.
- Tx is flow controlling the dequeue engines in this case; this seems to be what is causing the queues to build up.

More snapshots (June 13, 2007)
[Four screenshot-only slides; no transcript text beyond the title.]

More snapshots (June 13-15, 2007): QM Totals

- Enqueue "WORST CASE": every pkt causes a queue to be evicted by Enqueue and a new one loaded.
- "BEST CASE": queues are always already loaded, nothing gets evicted.