Block Design Review: Queue Manager and Scheduler Amy M. Freestone Sailesh Kumar.



2 - Amy M. Freestone, Sailesh Kumar - 11/26/2015
Overview
- QM/Scheduler
  - Function:
    - Enqueue to and dequeue from queues
    - Scheduling algorithm (5 ports, N queues per port, WDRR across queues)
    - Drop policy
    - RR port scheduling, rate controlled
  - Memory accesses:
    - SRAM:
      - Q-Array reads and writes
      - Scheduling data structure reads and writes
      - QLength data structure reads and writes
      - Queue weight, discard threshold, and port rate reads
      - Packet length reads from the buffer descriptor
[Figure: pipeline block diagram with blocks Phy Int Rx, Key Extract, Lookup, Hdr Format, QM/Schd, Switch Tx, and Switch, together with the QM/Schd message formats. Field labels shown: Frame Length (16b), Buffer Handle (32b), Stats Index (16b), QID (20b), Rsv (4b), Buffer Handle (24b), Rsv (3b), Port (4b), V, Rsv (4b), Port (4b). V: Valid Bit.]

3 - Data Structures
[Figure: high-level cache architecture and SRAM data structures. Labels shown:
- QM/Schd message formats: Frame Length (16b), Buffer Handle (32b), Stats Index (16b), QID (20b), Rsv (4b), Buffer Handle (24b), Rsv (3b), Port (4b), V, Rsv (4b), Port (4b)
- CAM (16 entries): Queue id (20b), Head Valid, Tail Valid, Qlen Valid
- Local memory (16 entries): Queue length, Discard threshold, Weight quantum
- SRAM Q-array (16 entries): Queue head/tail/count
- SRAM, Q params (per queue): Queue length, Discard threshold, Weight quantum
- SRAM, Q Descrpt. (per queue): Head, Tail, Count
- SRAM, Buf. Descrpt.: LW0-1, LW2, LW3-7; Pkt_Size (16b)
- Enqueuer, Dequeuer]

4 - Interface
- Scratch ring interface
  - For both ingress and egress
- Threads used: 7
  - Thread 0: free list maintenance and initialization
  - Threads 1-5: dequeue for ports 0-4
  - Thread 6: enqueue for all 5 ports
- Threads are synchronized after each round
  - A round enqueues up to 5 packets
  - A round dequeues up to 5 packets, one for each port
[Figure: pipeline block diagram and QM/Schd message formats, as on the Overview slide.]

5 - Thread Synchronization
Note: the enqueue thread does not use signal A. Instead, the signal is implemented using a register that is set by thread 0 and reset by the enqueuer.

6 - Resource Usage
- Local memory: 1512 bytes
  - #define PAR_CACHE_LM_BASE 0x0
  - #define PORT_DATA_LM_BASE 0x100
  - #define BBUF_FL_LM_BASE 0x1a8
  - #define BBUF_LM_BASE 0x1fc
  - #define FL_LM_BASE 0x598
- SRAM
  - Queue descriptors (16B per queue)
  - Queue parameters (16B per queue)
  - Port rates (4B per port)
  - Free lists
  - Batch buffers
- Enqueue:
  - 15 signals, 16 RD xfer, 10 WR xfer
- Dequeue:
  - 9 signals; QM uses 4 RD xfer, 1 WR xfer. SCH uses more xfers.

7 - Local Memory Map (JDD, 4/1/08)
[Figure: local memory map, 0x000 to 0x9FF. Regions and addresses shown:
- PAR Cache: starts at 0x000
- Port Data: 0x100 to 0x1A7
- Batch Buf FL: 0x1A8 to 0x1FB
- Batch Buffers (21 * 44 bytes): 0x1FC to 0x597
- Free List (>= 40 * 4 bytes): starts at 0x598
- Port Rate Control Data: 0x680
- Unallocated: 0x690
- Annotation: residualResult written here]
- Port data structure:
  - 0: Old Tail LM
  - 1: Old Tail SRAM
  - 2: head SRAM
  - 3: tail SRAM
  - 4: tail offset (first empty slot)
  - 5: nexthead LM
  - 6: LM (head|tail)
  - 7: unused
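As a rough consistency check on the local memory map, the region sizes can be derived from the listed start addresses. This is only an illustrative sketch: the Python names are hypothetical, the 0xA00 end of local memory is inferred from the 0x9FF label, and each region is assumed to extend up to the next listed start address.

```python
# Start addresses taken from the Local Memory Map slide; names are illustrative.
LM_END = 0xA00  # assumption: local memory ends at 0x9FF, per the diagram label

REGION_STARTS = [
    ("PAR Cache", 0x000),
    ("Port Data", 0x100),
    ("Batch Buf FL", 0x1A8),
    ("Batch Buffers", 0x1FC),
    ("Free List", 0x598),
    ("Port Rate Control Data", 0x680),
    ("Unallocated", 0x690),
]


def region_sizes(starts, end):
    """Return {name: size_in_bytes}, assuming each region runs to the next start."""
    sizes = {}
    for i, (name, start) in enumerate(starts):
        next_start = starts[i + 1][1] if i + 1 < len(starts) else end
        sizes[name] = next_start - start
    return sizes


SIZES = region_sizes(REGION_STARTS, LM_END)

if __name__ == "__main__":
    for name, size in SIZES.items():
        print(f"{name:24s} {size:4d} bytes")
```

Under these assumptions the Batch Buffers region comes out to 924 bytes, which matches the 21 * 44 bytes on the slide, and the Free List region to 232 bytes, which satisfies the >= 40 * 4 bytes requirement.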

8 - Data Consistency Precautions
- Only one thread (dequeue or enqueue) reads in the queue parameters of a queue
  - Flags ensure that while thread x is reading in the Q params:
    - thread y doesn't read them
    - thread y also waits until thread x has stored the data it read into the cache
  - Flags are stored in local memory
    - Three flags are used (head valid, tail valid, and Q param valid)
    - Head valid implies the dequeue thread has cached the Q descriptor
    - Tail valid implies the enqueue thread has cached the Q descriptor
    - Both valid means both head and tail are cached
- Before a thread swaps out
  - Move relevant register contents (flags, queue length) into local memory
- After a thread resumes
  - Move relevant local memory data back into registers
- Cache contents are refreshed every 4k iterations
- Port rates held in registers are refreshed every 4k iterations

9 - Initialization
- Thread 0 initializes all shared data structures ???
  - CAM and Q-array (cam_clear and Q-array empty)
  - Memory controller variables
    - Set SRAM Channel CSR to ignore the cellcount and eop bits in the buffer handle
  - Local memory
    - Queue parameter cache (all zeroes)
    - Scheduling data structures (set by scheduler)
  - SRAM
    - Queue parameters (length, weight quantum, discard threshold)
    - Queue descriptors (all zeroes)
    - Port rates (as per token bucket)
    - Free list (set by free list macro)
    - Scheduling data structure (set by scheduler)

10 - Enqueue Thread
- Operates in batch mode (5 packets at a time)
  - Read 5 requests from the scratch ring
  - Check the CAM for the 5 queue ids read
  - If miss
    - Evict the LRU entry (write back queue params and descriptor)
    - Read queue params from SRAM into the cache
    - Read the queue descriptor into the Q-array
    - Update the CAM
  - Check for discard
    - If discard, call dl_drop_buf
  - If admit
    - Send the enqueue command to the Q-array
    - Check if the queue was already active
      - If not, call add_queue_to_tail
    - Update the queue length in the cache
    - Write back the queue length (in future may want to do this less often)
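The enqueue steps above can be sketched in Python as follows. This is a simplified model, not the microengine code: the class and function names are hypothetical, SRAM is modelled as a dictionary, the Q-array enqueue command is omitted, and two rules are assumptions made for the sketch (a packet is discarded when it would push the queue length past the discard threshold, and a queue counts as active when its length is non-zero).

```python
from collections import OrderedDict

CACHE_ENTRIES = 16  # CAM / local-memory cache size from the Data Structures slide
BATCH_SIZE = 5      # the enqueue thread handles up to 5 requests per round


class QueueParamCache:
    """LRU cache of per-queue parameters, standing in for the CAM plus local memory."""

    def __init__(self, sram):
        self.sram = sram          # qid -> {"length", "threshold"}; models SRAM
        self.cam = OrderedDict()  # qid -> cached params, least recently used first

    def lookup(self, qid):
        """Return cached params for qid, loading them (and evicting LRU) on a miss."""
        if qid in self.cam:
            self.cam.move_to_end(qid)          # hit: mark as most recently used
            return self.cam[qid]
        if len(self.cam) >= CACHE_ENTRIES:
            victim, params = self.cam.popitem(last=False)
            self.sram[victim] = dict(params)   # write back the evicted entry
        self.cam[qid] = dict(self.sram[qid])   # read queue params from SRAM
        return self.cam[qid]


def enqueue_batch(cache, requests, active_queues, dropped):
    """Process up to BATCH_SIZE (qid, pkt_len) requests, as in one enqueue round."""
    for qid, pkt_len in requests[:BATCH_SIZE]:
        params = cache.lookup(qid)
        # Assumed discard rule: drop if the packet would exceed the threshold.
        if params["length"] + pkt_len > params["threshold"]:
            dropped.append((qid, pkt_len))     # stands in for dl_drop_buf
            continue
        was_active = params["length"] > 0      # assumed meaning of "already active"
        params["length"] += pkt_len            # update the queue length in the cache
        cache.sram[qid]["length"] = params["length"]  # write back the queue length
        if not was_active:
            active_queues.append(qid)          # stands in for add_queue_to_tail
```

An `OrderedDict` is used here only because it gives LRU ordering cheaply in software; the hardware CAM tracks LRU state itself.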

11 - Dequeue Thread (per port)
- One thread handles one port
  - Done for the round if port rate $$tx_q_flow_control is set, or the port is inactive (port_active macro), or the tokens are used up
  - If the current batch is done, call the get_head macro
  - If the batch buffer is non-empty, consider the first queue_id
    - Check the CAM for the queue_id
    - If miss
      - Evict the LRU entry (write back queue params and descriptor)
      - Read queue params from SRAM into the cache
      - Read the queue descriptor into the Q-array and update the CAM
    - If hit, or after the data is ready
      - Send the dequeue command to the Q-array
      - Call dl_sink_1ME_SCR_1words
    - Read the pkt_length from the buffer descriptor
    - Update the queue length (and write back) and the credit
      - If credit 0 then add_queue_to_tail
      - If queue_length <= 0 OR credit <= 0 then increment batch_index
      - If batch_index = 5 OR queue_id = 0 then call advance_head
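For readers unfamiliar with the credit accounting, the following is a generic weighted deficit round robin sketch in Python. It is not the batch-buffer implementation described on the slide: it uses the textbook rule of sending a packet only when the queue's credit covers it, and all names (`wdrr_dequeue`, `queues`, `quanta`) are hypothetical.

```python
from collections import deque


def wdrr_dequeue(queues, quanta):
    """Drain all queues in weighted deficit round robin order.

    queues: qid -> deque of packet lengths (bytes)
    quanta: qid -> weight quantum (bytes of credit added per visit, must be > 0)
    Returns the list of (qid, pkt_len) in the order packets are sent.
    """
    if any(quanta[qid] <= 0 for qid in queues):
        raise ValueError("every quantum must be positive")
    active = deque(qid for qid in queues if queues[qid])  # queues with packets
    credit = {qid: 0 for qid in queues}
    sent = []
    while active:
        qid = active.popleft()
        credit[qid] += quanta[qid]            # one quantum per visit
        while queues[qid] and queues[qid][0] <= credit[qid]:
            pkt_len = queues[qid].popleft()
            credit[qid] -= pkt_len            # spend credit on the packet
            sent.append((qid, pkt_len))
        if queues[qid]:
            active.append(qid)                # still backlogged: back to the tail
        else:
            credit[qid] = 0                   # idle queues do not bank credit
    return sent
```

With quanta of 200 and 100 bytes and equal-sized packets, the first queue sends two packets for every one sent by the second while both are backlogged.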

12 - Enqueue Thread
[Figure: enqueue thread flowchart with instruction counts.
- Steps shown:
  - Read 15 words from scratch
  - For 5 q_ids, check CAM hit; if miss, write back LRU and read queue param/descriptor
  - Admit? enqueue / update Q params; otherwise dl_drop_buf()
  - Active? add_queue_to_tail() (x instr)
  - Write back the queue length
  - Loop around
- Instruction counts shown:
  - 28 inst.
  - 40/31 inst. per Q; 202/157 inst. total
  - Per packet: 41 if discard; if admit: 62 + add_q_2_tail; total 205 / 310+5x
  - + 6 inst. for signals
  - For all 5 requests: worst case 545+5x; all discard 395; all accept/hit 500+5x
  - x = 18-49
- Memory accesses shown:
  - 2x5 Writes: 1, 3 words
  - 2x5 Reads: 3, 2 words
  - 2x5 Writes: 1, 1 word
  - SCH reads]

13 - Dequeue Thread (per port)
[Figure: dequeue thread flowchart with instruction counts.
- Steps shown:
  - Rate_control
  - If curr_queue = 0, get_head()
  - Check CAM, evict, load
  - Update cache, dequeue
  - Send tx_msg, read pkt_len
  - Update credit/q_len, Wr q_len
  - add_queue_to_tail()
  - advance_head()
  - Write_old_tail and loop around
- Instruction counts shown:
  - 27 inst., 34 inst., 32/44 inst., 24+ inst., 13 inst.
  - Adv_head: inst
  - Add_queu..: inst
  - Overheads: 13 inst
  - Worst case: 320; best: 170
- Memory accesses shown:
  - 1 Read (once / 16K cycles)
  - 2 Writes: 1, 3 words
  - 2 Reads: 3, 2 words
  - 1 Read
  - 1 Write]

14 - Dequeue Rate Control (Updated by JDD)
- Token bucket
  - The unit of port_rate is bytes per 4096 clocks (ME clock/16 MHz).
  - curr_time is the count of 16-clock ticks (ME clock/16 MHz).
  - last_time is the time when the last packet was sent.
  - IF port is inactive THEN tokens = 4095
  - ELSE IF (tokens = 4095)
    - SEND PACKET
    - last_time := curr_time
    - tokens = tokens - pkt_length
  - ELSE
    - result = ((curr_time - last_time) x port_rate) + residualResult   // 16 x 16 multiply
    - residualResult = (result << 22) >> 22   // save bits shifted out to add back in next time
    - tokens = min[4095, tokens + (result >> 10)]
    - IF (tokens > 0)
      - SEND PACKET
      - last_time := curr_time
      - tokens = tokens - pkt_length
- Port rates
  - Must be specified in the LSB 16 bits
  - 1 unit = 683 Kbps
  - Max port rate = 64K = 44.8 Gbps
[Figure: port rate word layout: Reserved (16b), Port rate (16b)]
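The token bucket update can be sketched in Python as below. This is a minimal model under stated assumptions, not the microengine code: the class and method names are hypothetical, the residual is taken to be the low 10 bits of the product (the bits dropped by the `>> 10`), Python integers do not wrap the way 16-bit or 32-bit registers do, and, as a choice made for this sketch, the state is left untouched when no packet is sent so that elapsed time is never counted twice.

```python
MAX_TOKENS = 4095  # bucket depth from the slide
SHIFT = 10         # result >> 10 converts the product into whole tokens


class PortTokenBucket:
    """Per-port token bucket; tokens are in bytes and may go negative after a send."""

    def __init__(self, port_rate):
        self.port_rate = port_rate & 0xFFFF  # rate lives in the low 16 bits
        self.tokens = MAX_TOKENS
        self.last_time = 0
        self.residual = 0

    def try_send(self, curr_time, pkt_length, port_active=True):
        """Return True if a packet of pkt_length bytes may be sent at curr_time."""
        if not port_active:
            self.tokens = MAX_TOKENS         # an inactive port refills its bucket
            return False
        if self.tokens == MAX_TOKENS:
            tokens, residual = self.tokens, self.residual
        else:
            result = (curr_time - self.last_time) * self.port_rate + self.residual
            residual = result & ((1 << SHIFT) - 1)   # bits shifted out below
            tokens = min(MAX_TOKENS, self.tokens + (result >> SHIFT))
            if tokens <= 0:
                return False  # sketch assumption: keep state until a send happens
        self.tokens = tokens - pkt_length
        self.residual = residual
        self.last_time = curr_time
        return True


def rate_unit_bps(me_clock_hz=1.4e9):
    """Bits per second for port_rate = 1, i.e. one byte per 2**SHIFT ticks of 16 clocks."""
    return 8 * me_clock_hz / (16 * (1 << SHIFT))
```

With the 1.4 GHz clock quoted on the Cycle Budget slide, `rate_unit_bps()` evaluates to about 683.6 Kbps, in line with the slide's "1 unit = 683 Kbps", and 64K units gives about 44.8 Gbps.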

15 - Performance Analysis
- The dequeue thread runs much longer than the enqueue thread
  - Dequeue
    - 1273 cycles in case of a cache miss plus add_queue_to_tail() and advance_head()
    - 867 cycles in case of a cache hit and no scheduler calls
  - Enqueue
    - 876 cycles in case of all 5 cache misses
    - 342 cycles in case of a single enqueue and a cache hit
- Dequeue takes more time due to memory accesses
  - Read Queue_param: 110 cycles
  - Dequeue: 120 cycles
  - Read pkt_len: 110 cycles
- There are a few idle cycles at present
  - They can be removed by giving higher priority to the dequeue threads

16 - File locations (in …/IPv4_MR/)
- Code
  - src/qm/PL/common_macros.uc
  - src/qm/PL/dequeue.uc
  - src/qm/PL/enqueue.uc
  - src/qm/PL/fl_macros.uc
  - src/qm/PL/qm.h
  - src/qm/PL/qm.uc
  - src/qm/PL/sched_macros.h
- Includes
  - ../dispatch_loop/dl_source_WU.uc
    - dl_buf_drop() and dl_sink_1ME_SCR_1words() functions
  - Also uses local memory read and write macros (localmem.uc)

17 - Queue Manager Validation
- Tested
  - Threshold length discards (set the length to 0 and tested whether packets are enqueued)
  - Enqueue
    - Single port, single queue active
    - Multiple ports/queues active
    - Cache hit/miss (not all scenarios are tested)
  - Dequeue
    - Rate control partially tested (set the port rate to 0 and checked whether packets are dequeued)
    - Partial fairness test (set the quantum to 0 and checked whether packets are dequeued)
    - Multiple active ports/queues
- Both queue managers enabled
  - There is one bug concerning Q-array contention

18 - Cycle Budget
- 76B packet
- 1.4 GHz clock rate
- 1.4 Gcycle/sec
- 5 Gbps => 170 cycles per packet
  - Dequeue worst-case = 320 inst. (best case 170 inst.)
  - Dequeue worst-case = x inst. for 5 packets
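The per-packet budget follows from the packet time multiplied by the clock rate. The short Python sketch below works through that arithmetic; the function name is hypothetical, and the 5 Gbps line rate is an assumption: it is the value that reproduces the slide's 170 cycles per packet for a 76-byte packet at 1.4 GHz.

```python
def cycles_per_packet(pkt_bytes, line_rate_bps, clock_hz):
    """Clock cycles available per packet when packets arrive back to back."""
    packet_time_s = pkt_bytes * 8 / line_rate_bps  # time one packet occupies the link
    return packet_time_s * clock_hz


# 76-byte packets, assumed 5 Gbps line rate, 1.4 GHz microengine clock.
BUDGET = cycles_per_packet(76, 5e9, 1.4e9)

if __name__ == "__main__":
    print(f"{BUDGET:.2f} cycles per packet")
```

The result is roughly 170 cycles, which is the figure the dequeue best case (170 inst.) is being compared against.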

19 - Scheduling Structure Overview
[Figure: scheduling data structures. Labels shown:
- Per port (Port 0 to Port 4): Head, Next Head, Tail
- Batch Buffer: SRAM Next Pointer; Queue 0 / Credits 0 through Queue 4 / Credits 4
- Batch Buffers in SRAM
- Free List (for SRAM Batch Buffers): stack in local memory, stack in SRAM
- Batch Buffer Free List (for LM Batch Buffers): stack in local memory]

20 - Scheduling Structure Interface
- Scheduling structure macros contained in \src\qm\PL\sched_macros.uc
  - add_queue_to_tail(queue, credits, port)
  - get_head(port, head_ptr)
  - advance_head(port, sig_a, sig_b)
  - port_active(port, label)
  - write_old_tail(port, sig_a, sig_b)
- Free list macro contained in \src\qm\PL\fl_macros.uc
  - maintain_fl()