Block Design Review: Queue Manager and Scheduler
Amy M. Freestone, Sailesh Kumar
2 - Amy M. Freestone, Sailesh Kumar - 11/26/2015
Overview
- QM/Scheduler
  - Function:
    - Enqueue and dequeue from queues
    - Scheduling algorithm (5 ports, N queues per port, WDRR across queues)
    - Drop policy
    - RR port scheduling, rate controlled
  - Memory accesses:
    - SRAM:
      - Q-array reads and writes
      - Scheduling data structure reads and writes
      - QLength data structure reads and writes
      - Reads of queue weight, discard threshold, and port rates
      - Reads of the packet length from the buffer descriptor
[Figure: pipeline block diagram (blocks labeled Lookup, Phy Int Rx, Switch Tx, QM/Schd, Key Extract, Hdr Format, SWITCH) with the QM/Schd message formats. Fields shown: Frame Length (16b), Buffer Handle (32b), Stats Index (16b), QID (20b), Rsv (4b), Buffer Handle (24b), Rsv (3b), Port (4b), V (valid bit), Rsv (4b), Port (4b).]
Data Structures
[Figure: high-level cache architecture shared by the Enqueuer and Dequeuer, shown next to the QM/Schd message formats (Frame Length 16b, Buffer Handle 32b, Stats Index 16b, QID 20b, Rsv 4b, Buffer Handle 24b, Rsv 3b, Port 4b, V, Rsv 4b, Port 4b).]
Structures named in the figure:
- CAM (16 entries): Queue id / QID (20b); labels Head Valid, Tail Valid, Qlen Valid
- Local memory (16 entries): queue length, discard threshold, weight quantum
- SRAM Q-array (16 entries): queue head/tail/count
- SRAM, Q params (per queue): queue length, discard threshold, weight quantum
- SRAM, Q descriptor (per queue): head, tail, count
- SRAM, buffer descriptor: long words LW0-1, LW2, LW3-7; Pkt_Size (16b) is the only field labeled, the rest are marked xxx
Interface
- Scratch ring interface
  - For both ingress and egress
- Threads used: 7
  - Thread 0: free list maintenance and initialization
  - Threads 1-5: dequeue for ports 0-4
  - Thread 6: enqueue for all 5 ports
- Threads are synchronized after each round
  - A round enqueues up to 5 packets
  - A round dequeues up to 5 packets, one for each port
[Figure: same pipeline block diagram and message formats as on the Overview slide.]
Thread Synchronization
Note: the enqueue thread does not use signal A. Its equivalent is implemented with a register that is set by thread 0 and reset by the enqueuer.
Resource Usage
- Local memory: 1512 bytes
  - #define PAR_CACHE_LM_BASE 0x0
  - #define PORT_DATA_LM_BASE 0x100
  - #define BBUF_FL_LM_BASE 0x1a8
  - #define BBUF_LM_BASE 0x1fc
  - #define FL_LM_BASE 0x598
- SRAM
  - Queue descriptors (16B per queue)
  - Queue parameters (16B per queue)
  - Port rates (4B per port)
  - Free lists
  - Batch buffers
- Enqueue:
  - 15 signals, 16 RD xfer, 10 WR xfer
- Dequeue:
  - 9 signals; QM uses 4 RD xfer and 1 WR xfer. SCH uses more xfers.
Local Memory Map (JDD, 4/1/08)
  Start   End     Region
  0x000           PAR Cache
  0x100   0x1A7   Port Data
  0x1A8   0x1FB   Batch Buf FL
  0x1FC   0x597   Batch Buffers (21 * 44 bytes)
  0x598           Free List (>= 40 * 4 bytes)
  0x680           Port Rate Control Data
  0x690   0x9FF   Unallocated
  Annotation in the diagram: "residualResult written here"
- Port Data Structure (word index: contents):
  - 0: Old Tail LM
  - 1: Old Tail SRAM
  - 2: head SRAM
  - 3: tail SRAM
  - 4: tail offset (first empty slot)
  - 5: nexthead LM
  - 6: LM (head|tail)
  - 7: unused
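The region boundaries on this slide can be cross-checked against the sizes it quotes. The following is a minimal sketch in Python (not the project's microcode); the base addresses are taken from the slide, while the dictionary keys and the helper `region_sizes` are illustrative names, and each region is assumed to run up to the next listed base.

```python
# Base addresses (in bytes) as listed on the Local Memory Map slide.
BASES = {
    "PAR_CACHE": 0x000,
    "PORT_DATA": 0x100,
    "BBUF_FL": 0x1A8,
    "BBUF": 0x1FC,
    "FL": 0x598,
    "PORT_RATE_CTRL": 0x680,
    "UNALLOCATED": 0x690,
}
LM_END = 0x9FF  # last byte address shown on the map


def region_sizes(bases, lm_end):
    """Return {name: size_in_bytes}, assuming each region ends at the next base."""
    names = sorted(bases, key=bases.get)
    sizes = {}
    for i, name in enumerate(names):
        end = bases[names[i + 1]] if i + 1 < len(names) else lm_end + 1
        sizes[name] = end - bases[name]
    return sizes


SIZES = region_sizes(BASES, LM_END)
```

Under these assumptions the batch buffer region is exactly 21 * 44 = 924 bytes, and the space between 0x598 and 0x680 (232 bytes) is large enough for the stated minimum of 40 four-byte free-list entries.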
Data Consistency Precautions
- Only one thread (dequeue or enqueue) reads in the queue parameters of a given queue
  - Flags ensure that while thread x is reading in the Q params:
    - thread y does not read them
    - thread y also waits until thread x has stored the data it read into the cache
  - Flags are stored in local memory
    - Three flags are used (head valid, tail valid, and Q param valid)
    - Head valid implies the dequeue thread has cached the Q descriptor
    - Tail valid implies the enqueue thread has cached the Q descriptor
    - Both valid means both head and tail are cached
- Before a thread swaps out
  - Move the relevant register contents (flags, queue length) into local memory
- After a thread resumes
  - Move the relevant local memory data back into registers
- Cache contents are refreshed every 4k iterations
- Port rates held in registers are refreshed every 4k iterations
Initialization
- Thread 0 initializes all shared data structures ???
  - CAM and Q-array (cam_clear and Q-array empty)
  - Memory controller variables
    - Set SRAM Channel CSR to ignore cellcount and eop bit in the buffer handle
  - Local memory
    - Queue parameter cache (all zeroes)
    - Scheduling data structures (set by scheduler)
  - SRAM
    - Queue parameters (length, weight quantum, discard threshold)
    - Queue descriptors (all zeroes)
    - Port rates (as per token bucket)
    - Free list (set by free list macro)
    - Scheduling data structure (set by scheduler)
Enqueue Thread
- Operates in batch mode (5 packets at a time)
  - Read 5 requests from the scratch ring
  - Check CAM for the 5 queue ids read
  - If miss
    - Evict LRU entry (write back queue params and descriptor)
    - Read queue params from SRAM into cache
    - Read queue descriptor into Q-array
    - Update CAM
  - Check for discard
    - If discard, call dl_drop_buf
  - If admit
    - Send enqueue command to Q-array
    - Check if queue was already active
      - If not, call add_queue_to_tail
    - Update the queue length in cache
    - Write back queue length (in future may want to do this less often)
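The enqueue steps above can be sketched as a small software model. This is a rough Python illustration, not the microcode: `QueueCache`, `enqueue_round`, the `active` set, and the callbacks are hypothetical names, the SRAM is modeled as a dictionary, and the discard test (queue length plus frame length exceeding the threshold) is an assumption, since the slide does not spell out the exact comparison.

```python
from collections import OrderedDict

CACHE_ENTRIES = 16  # CAM / Q-array / local-memory cache size from the slides
BATCH = 5           # requests handled per round


class QueueCache:
    """Software model of the CAM-indexed queue parameter cache (LRU)."""

    def __init__(self, sram):
        self.sram = sram            # qid -> {"length", "threshold"}; models SRAM
        self.lines = OrderedDict()  # qid -> cached params, least recent first

    def lookup(self, qid):
        if qid in self.lines:                   # CAM hit
            self.lines.move_to_end(qid)
            return self.lines[qid]
        if len(self.lines) >= CACHE_ENTRIES:    # miss with a full cache
            old_qid, old = self.lines.popitem(last=False)
            self.sram[old_qid] = old            # write back the evicted entry
        self.lines[qid] = dict(self.sram[qid])  # read params into the cache
        return self.lines[qid]


def enqueue_round(cache, requests, active, drop, add_queue_to_tail):
    """Process up to BATCH (qid, frame_length) requests; return admitted qids."""
    admitted = []
    for qid, frame_length in requests[:BATCH]:
        params = cache.lookup(qid)
        # Assumed discard test: admitting the frame would exceed the threshold.
        if params["length"] + frame_length > params["threshold"]:
            drop(qid)
            continue
        if qid not in active:       # queue was idle: hand it to the scheduler
            active.add(qid)
            add_queue_to_tail(qid)
        params["length"] += frame_length
        cache.sram[qid]["length"] = params["length"]  # write back queue length
        admitted.append(qid)
    return admitted
```

Here `drop` stands in for the drop macro and `add_queue_to_tail` for the scheduler macro of the same name; the Q-array enqueue command itself is not modeled.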
Dequeue Thread (per port)
- One thread handles one port
  - Done for the round if port rate $$tx_q_flow_control is set, or the port is inactive (port_active macro), or the tokens are used up
  - If the current batch is done, call the get_head macro
  - If the batch buffer is non-empty, consider the first queue_id
    - Check CAM for the queue_id
    - If miss
      - Evict LRU entry (write back queue params and descriptor)
      - Read queue params from SRAM into cache
      - Read queue descriptor into Q-array and update CAM
    - If hit, or after the data is ready
      - Send dequeue command to Q-array
      - Call dl_sink_1ME_SCR_1words
    - Read the pkt_length from the buffer descriptor
    - Update the queue length (and write it back) and the credit
      - If credit [?] 0 then add_queue_to_tail
      - If queue_length <= 0 OR credit <= 0 then increment batch_index
      - If batch_index = 5 OR queue_id = 0 then call advance_head
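The credit and batch_index bookkeeping at the end of the dequeue step can be illustrated with a short Python model. It is a sketch under stated assumptions: `dequeue_step` and its arguments are hypothetical names, queues are plain lists of packet lengths standing in for the Q-array, and the rule for when add_queue_to_tail is called (queue still backlogged and credit used up, re-added with leftover credit plus a fresh quantum) is an assumed deficit-round-robin convention, because the comparison on the slide is not legible.

```python
BATCH_SLOTS = 5  # entries per batch buffer


def dequeue_step(batch, state, queues, quantum, add_queue_to_tail):
    """Serve one packet from the current batch entry.

    batch   : list of BATCH_SLOTS [queue_id, credit] pairs; queue_id 0 = empty slot
    state   : {"batch_index": int}
    queues  : queue_id -> list of packet lengths (stand-in for the Q-array)
    quantum : queue_id -> weight quantum in bytes
    Returns (packet_length, batch_finished).
    """
    entry = batch[state["batch_index"]]
    qid = entry[0]
    pkt_len = queues[qid].pop(0)   # dequeue, then read pkt_length
    entry[1] -= pkt_len            # update the credit
    backlog = sum(queues[qid])     # remaining queue length in bytes

    if backlog <= 0 or entry[1] <= 0:
        # Assumed rule: a backlogged queue that exhausted its credit is
        # appended again with its leftover credit plus a fresh quantum.
        if backlog > 0:
            add_queue_to_tail(qid, entry[1] + quantum[qid])
        state["batch_index"] += 1

    i = state["batch_index"]
    finished = i == BATCH_SLOTS or batch[i][0] == 0
    return pkt_len, finished       # the caller runs advance_head when finished
```

A queue keeps its slot while it still has both backlog and credit, which matches the slide's condition for incrementing batch_index.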
Enqueue Thread (instruction counts)
Flowchart steps and their costs:
- Read 15 words from scratch: 28 inst.
- For 5 q_ids, check CAM hit; if miss, write back LRU and read queue param/descriptor: 40/31 inst. per Q (miss/hit), 202/157 inst. total
- Per packet: Admit? enqueue / update Q params; Active? add_queue_to_tail() (x inst.)
  - 41 inst. if discard
  - If admit: 62 + add_q_2_tail
  - Total 205 / 310+5x
- Write back the queue length
- + 6 inst. for signals
For all 5 requests:
- Worst case: 545+5x
- All discard: 395
- All accept/hit: 500+5x
- x = 18-49
Memory-access annotations (in diagram order): 2x5 Writes (1, 3 words); 2x5 Reads (3, 2 words); 2x5 Writes (1, 1 word); SCH reads; dl_drop_buf(); loop around.
Dequeue Thread (per port) (instruction counts)
Flowchart steps:
- Rate_control
- If curr_queue = 0, get_head()
- Check CAM, evict, load
- Update cache, dequeue
- Send tx_msg, read pkt_len
- Update credit/q_len, Wr q_len
- add_queue_to_tail()
- advance_head()
- Write_old_tail and loop around
Instruction counts shown next to the steps (the step-to-count pairing is not recoverable from the flattened diagram): 27 inst., 34 inst., 32/44 inst., 24+ inst., 13 inst.; Adv_head: inst; Add_queu..: inst; Overheads: 13 inst.
- Worst case: 320
- Best: 170
Memory-access annotations (in diagram order): 1 Read (once / 16K cycles); 2 Writes (1, 3 words); 2 Reads (3, 2 words); 1 Read; 1 Write.
Dequeue Rate Control (Updated by JDD)
- Token bucket
  - The unit of port_rate is bytes per 4096 clocks (ME clock/16 MHz).
  - curr_time counts units of 16 clocks (ME clock/16 MHz).
  - last_time is the time when the last packet was sent.
  - IF PORT IS INACTIVE THEN tokens = 4095
  - ELSE IF (tokens = 4095)
    - SEND PACKET
    - last_time := curr_time
    - tokens = tokens - pkt_length
  - ELSE
    - result = ((curr_time - last_time) x port_rate) + residualResult   // 16 x 16 multiply
    - residualResult = (result << 22) >> 22   // save bits shifted out to add back in next time
    - tokens = min [ 4095, tokens + (result >> 10) ]
    - IF (tokens > 0)
      - SEND PACKET
      - last_time := curr_time
      - tokens = tokens - pkt_length
- Port rates
  - Must be specified in the LSB 16 bits
  - 1 unit = 683 Kbps
  - Max port rate = 64K = 44.8 Gbps
  - Register layout: Reserved (16b) | Port rate (16b)
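The token bucket above can be modeled in a few lines of Python to make the fixed-point arithmetic concrete. This is a sketch that follows the slide's pseudocode literally rather than the actual microcode; `PortBucket`, `try_send`, and `rate_unit_bps` are hypothetical names. Two assumptions are worth stating: the residual is taken to be the low 10 bits of `result` (the bits that `result >> 10` discards), and the rate unit is computed as bytes per 1024 time ticks of 16 clocks at a 1.4 GHz ME clock, because that interpretation reproduces the 683 Kbps and 44.8 Gbps figures, whereas the "4096 clocks" wording does not.

```python
MAX_TOKENS = 4095
FRACTION_BITS = 10                        # matches "result >> 10"
FRACTION_MASK = (1 << FRACTION_BITS) - 1  # low 10 bits, kept as the residual


def rate_unit_bps(me_clock_hz=1.4e9):
    """Bits per second represented by one port_rate unit (assumed time base)."""
    clocks_per_unit_interval = 16 * (1 << FRACTION_BITS)  # 16384 clocks
    return 8 * me_clock_hz / clocks_per_unit_interval


class PortBucket:
    def __init__(self, port_rate):
        self.port_rate = port_rate & 0xFFFF  # rate lives in the low 16 bits
        self.tokens = MAX_TOKENS
        self.last_time = 0
        self.residual = 0

    def try_send(self, curr_time, pkt_length, active=True):
        """Return True if a packet of pkt_length bytes is sent at curr_time."""
        if not active:
            self.tokens = MAX_TOKENS
            return False
        if self.tokens != MAX_TOKENS:
            elapsed = curr_time - self.last_time
            result = elapsed * self.port_rate + self.residual
            self.residual = result & FRACTION_MASK
            self.tokens = min(MAX_TOKENS, self.tokens + (result >> FRACTION_BITS))
            if self.tokens <= 0:
                return False
        self.last_time = curr_time
        self.tokens -= pkt_length
        return True
```

As on the slide, `last_time` only advances when a packet is sent. Read literally, that means a port that fails the `tokens > 0` test is credited again for the same interval on its next attempt; whether the microcode compensates for this is not stated in the slides.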
Performance Analysis
- The dequeue thread runs much longer than the enqueue thread
  - Dequeue
    - 1273 cycles in case of a cache miss with add_queue_to_tail() and advance_head()
    - 867 cycles in case of a cache hit and no scheduler calls
  - Enqueue
    - 876 cycles in case of all 5 cache misses
    - 342 cycles in case of a single enqueue and a cache hit
- Dequeue takes more time due to memory accesses
  - Read Queue_param: 110 cycles
  - Dequeue: 120 cycles
  - Read pkt_len: 110 cycles
- There are a few idle cycles at present
  - They can be removed by giving higher priority to the dequeue threads
File locations (in …/IPv4_MR/)
- Code
  - src/qm/PL/common_macros.uc
  - src/qm/PL/dequeue.uc
  - src/qm/PL/enqueue.uc
  - src/qm/PL/fl_macros.uc
  - src/qm/PL/qm.h
  - src/qm/PL/qm.uc
  - src/qm/PL/sched_macros.h
- Includes
  - ../dispatch_loop/dl_source_WU.uc
    - dl_buf_drop() and dl_sink_1ME_SCR_1words() functions
  - Also uses local memory read and write macros (localmem.uc)
Queue Manager Validation
- Tested
  - Threshold length discards (set the length at 0, and tested whether packets are enqueued)
  - Enqueue
    - Single port, single queue active
    - Multiple ports/queues active
    - Cache hit/miss (not all scenarios are tested)
  - Dequeue
    - Rate control partially tested (set the port rate at 0, and see if packets are dequeued)
    - Partial fairness test (set the quantum at 0, and see if packets are dequeued)
    - Multiple active ports/queues
- Both queue managers enabled
  - There is one bug concerning Q-array contention
Cycle Budget
  - 76B packet
  - 1.4 GHz clock rate
  - 1.4 Gcycle/sec
  - 5 Gbps => 170 cycles per packet
    - Dequeue worst-case = 320 inst. (best case 170 inst.)
    - Dequeue worst-case = x inst. for 5 packets
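The per-packet budget follows from the packet size, the line rate, and the clock rate. A short worked calculation in Python, assuming a 5 Gbps line rate (the value that reproduces the 170-cycle figure); `cycles_per_packet` is an illustrative helper, not part of the code base.

```python
def cycles_per_packet(pkt_bytes, line_rate_bps, clock_hz):
    """Clock cycles available per packet at the given line rate."""
    packet_time_s = pkt_bytes * 8 / line_rate_bps
    return packet_time_s * clock_hz


# 76-byte packets, assumed 5 Gbps line rate, 1.4 GHz microengine clock.
BUDGET = cycles_per_packet(76, 5e9, 1.4e9)
```

A 76-byte packet is 608 bits, which takes 121.6 ns at 5 Gbps; at 1.4 GHz that is about 170 cycles.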
Scheduling Structure Overview
[Figure: scheduling structure for Port 0 through Port 4. Each port keeps Head, Next Head, and Tail pointers into a linked list of batch buffers. A batch buffer holds the pairs Queue 0 / Credits 0 through Queue 4 / Credits 4 plus an SRAM Next Pointer. Remaining labels in the figure: Batch Buffers in SRAM; Free List (for SRAM Batch Buffers), with a stack in local memory and a stack in SRAM; Batch Buffer Free List (for LM Batch Buffers), with a stack in local memory.]
Scheduling Structure Interface
- Scheduling structure macros contained in \src\qm\PL\sched_macros.uc
  - add_queue_to_tail(queue, credits, port)
  - get_head(port, head_ptr)
  - advance_head(port, sig_a, sig_b)
  - port_active(port, label)
  - write_old_tail(port, sig_a, sig_b)
- Free list macro contained in \src\qm\PL\fl_macros.uc
  - maintain_fl()
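To show how these macros relate to one another, here is a toy Python model of one port's scheduling structure. It is a simplified sketch, not a translation of sched_macros.uc: `PortSchedule` is a hypothetical class, the method names mirror the macro names but not their signatures (signals, labels, and pointers are omitted), and the free lists, the local-memory/SRAM split, and write_old_tail are not modeled.

```python
from collections import deque

SLOTS = 5  # (queue, credits) pairs per batch buffer


class PortSchedule:
    """Toy model of one port's list of batch buffers."""

    def __init__(self):
        self.batches = deque()  # head batch buffer is batches[0]

    def add_queue_to_tail(self, queue, credits):
        # Start a new tail buffer when there is none or the tail is full.
        if not self.batches or len(self.batches[-1]) == SLOTS:
            self.batches.append([])
        self.batches[-1].append((queue, credits))

    def port_active(self):
        return bool(self.batches)

    def get_head(self):
        """Return the head batch buffer, padded with (0, 0) empty slots."""
        head = self.batches[0]
        return head + [(0, 0)] * (SLOTS - len(head))

    def advance_head(self):
        # The real structure would return the buffer to a free list here.
        self.batches.popleft()
```

In this model a queue id of 0 marks an empty slot, which is consistent with the dequeue thread treating queue_id = 0 as the end of a batch.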