Published by Duane Brett Marshall. Modified over 9 years ago.
1
Addressing Queuing Bottlenecks at High Speeds
Sailesh Kumar, Patrick Crowley, Jonathan Turner
2
2 - Sailesh Kumar - 10/23/2015
Agenda and Overview
- In this paper, we
  - Introduce the potential bottlenecks in high-speed queuing systems
  - Address only the bottlenecks associated with off-chip SRAM
- Overview of queuing system
- Bottlenecks
- Our solution
- Conclusion
4
High Speed Packet Queuing Systems
- Packet queues are crucial to isolate traffic (QoS, etc.)
  - Modern systems can have thousands or maybe millions of queues
  - Queues must operate at link rates (OC192, OC768, …)
  - Memory must be efficiently utilized
    - Dynamic sharing among queues
    - Minimal wastage due to fragmentation, etc.
  - Power consumption, chip area, cost, …
- Most shared-memory queuing systems consist of:
  - A packet buffer
  - A queuing subsystem to store per-queue control information
    - Supports enqueue and dequeue
  - A scheduler to make dequeue decisions
- Currently we concentrate on the queuing subsystem
5
Assumptions
- Considering only hardware implementation
  - Shared-memory linked-list queues
- Packets are broken into cells and are stored
- Implicit mapping
  - Will explain shortly
- Separate memory for
  - Packet storage (DRAM)
  - Queue linked lists
  - Queue descriptors (head, tail, etc.)
6
A shared memory queuing subsystem
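The figure for this slide did not survive extraction. As a rough stand-in, here is a minimal software sketch of linked-list queues over a shared cell buffer. It is not the authors' hardware design. It assumes that "implicit mapping" means list node i describes buffer cell i, so the link memory only needs one next pointer per cell. The class name `SharedMemoryQueues`, its fields, and the `link_accesses` counter are hypothetical, and the free list is simplified to a Python list.

```python
class SharedMemoryQueues:
    """Toy model of linked-list queues sharing one cell buffer.

    Assumption (not confirmed by the slides): list node i maps implicitly
    to buffer cell i, so link memory holds only a next pointer per cell.
    """

    NIL = -1

    def __init__(self, num_cells, num_queues):
        self.cells = [None] * num_cells          # packet storage (DRAM in the slides)
        self.next_ptr = [self.NIL] * num_cells   # linked-list memory (SRAM in the slides)
        self.head = [self.NIL] * num_queues      # queue descriptors: head pointers
        self.tail = [self.NIL] * num_queues      # queue descriptors: tail pointers
        self.free = list(range(num_cells))       # free cells, simplified to a list
        self.link_accesses = 0                   # reads + writes of next_ptr

    def enqueue(self, q, cell):
        idx = self.free.pop()
        self.cells[idx] = cell
        if self.head[q] == self.NIL:
            # Empty queue: only the descriptor changes.
            self.head[q] = idx
        else:
            # Link the old tail to the new cell (one link-memory write).
            self.next_ptr[self.tail[q]] = idx
            self.link_accesses += 1
        self.tail[q] = idx

    def dequeue(self, q):
        idx = self.head[q]
        if idx == self.NIL:
            return None
        cell = self.cells[idx]
        if idx == self.tail[q]:
            # Last cell: the queue becomes empty, no link read needed.
            self.head[q] = self.tail[q] = self.NIL
        else:
            # The next head is only known after reading link memory.
            self.head[q] = self.next_ptr[idx]
            self.link_accesses += 1
        self.cells[idx] = None
        self.free.append(idx)
        return cell


queues = SharedMemoryQueues(num_cells=8, num_queues=2)
for c in ("a", "b", "c"):
    queues.enqueue(0, c)
queues.enqueue(1, "x")
drained = [queues.dequeue(0) for _ in range(3)]
```

In this sketch every dequeue from a backlogged queue depends on a link-memory read to find the next head, which is the dependency that ties dequeue rate to SRAM latency.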
7
Limitations of such a Queuing subsystem
- Dequeue throughput is determined by SRAM latency
  - SRAM latency > 20 ns; with 64-byte cells this caps throughput at about 25 Gbps
  - Might not be desirable in systems:
    - With speedup
    - With multipass support
- Performance of fair queuing (scheduling) algorithms
  - Most algorithms require packet lengths to make decisions
  - Since packet lengths are stored in SRAM, each scheduling decision will take at least 20 ns
- Cost
  - Per-bit cost of SRAM is still 100 times that of DRAM
  - A 512 MB packet buffer needs 40 MB of SRAM (64 B cells)
    - Cost of SRAM can be 8 times that of DRAM
    - 12 SRAM chips versus 4 DRAM chips (power, area, etc.)
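The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes one dequeue per SRAM access (no pipelining) and takes 1 MB as 2^20 bytes; the ratios do not depend on that choice. All variable names are illustrative.

```python
CELL_BYTES = 64
SRAM_LATENCY_NS = 20

# One cell dequeued per SRAM access; bits per nanosecond equals Gbps.
max_dequeue_gbps = CELL_BYTES * 8 / SRAM_LATENCY_NS

PACKET_BUFFER_BYTES = 512 * 2**20   # 512 MB of DRAM
LINK_SRAM_BYTES = 40 * 2**20        # 40 MB of SRAM

# Number of 64-byte cells in the packet buffer.
num_cells = PACKET_BUFFER_BYTES // CELL_BYTES

# SRAM control state implied per cell (next pointer plus any per-cell fields).
sram_bytes_per_cell = LINK_SRAM_BYTES / num_cells

# SRAM is about 100x the per-bit cost of DRAM, per the slide.
SRAM_COST_PER_BIT_RATIO = 100
sram_to_dram_cost = SRAM_COST_PER_BIT_RATIO * LINK_SRAM_BYTES / PACKET_BUFFER_BYTES

print(max_dequeue_gbps, num_cells, sram_bytes_per_cell, sram_to_dram_cost)
```

The numbers line up with the slide: about 25.6 Gbps, roughly 8 million cells at 5 bytes of SRAM each, and an SRAM cost close to 8 times the DRAM cost.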
8
Scope for Improvement
- Reduce the impact of SRAM latency
- Reduce SRAM bandwidth
- Is it possible to control the cost by replacing the SRAM with DRAM?
- With all of the above, is it possible to ensure good worst-case throughput?
9
Buffer Aggregation
- Each node of the linked list maps to multiple buffers (X buffers per node)
  - An occupancy or offset tracks the fill level of nodes
  - Note that only the head and tail nodes can be partially filled
- Fewer SRAM accesses are needed (1/X as many on average)
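To make the 1/X claim concrete, the sketch below counts next-pointer reads needed to drain a backlogged queue. It assumes nodes are filled in order, so every node except possibly the tail is full, and that a link read happens only when the head node is exhausted and another node follows. The function name `link_reads_to_drain` is hypothetical.

```python
import math


def link_reads_to_drain(num_cells, buffers_per_node):
    """Count next-pointer reads needed to dequeue num_cells from one queue.

    Assumes all nodes but the tail are full, and one read per move from an
    exhausted head node to its successor.
    """
    if num_cells == 0:
        return 0
    num_nodes = math.ceil(num_cells / buffers_per_node)
    return num_nodes - 1


# Same 64-cell backlog, with and without aggregation.
reads_plain = link_reads_to_drain(64, buffers_per_node=1)
reads_aggregated = link_reads_to_drain(64, buffers_per_node=8)
print(reads_plain, reads_aggregated)
```

With X = 1 this reduces to the ordinary linked list (one read per cell after the first), while a short queue that fits in one node needs no link reads at all.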
10
Buffer Aggregation
- To prove the effectiveness, we must consider 2 cases
  - 1. When queues remain near empty
    - Not a difficult case, as SRAM accesses are not required
  - 2. Backlogged queues with near-empty heads
    - Legitimate concern: throughput can be the same as in a system without buffer aggregation
- We show that by adding a few request queues, performance remains high with good probability
14
Queuing model with request queues
- Arrival process makes enqueue requests
  - Requests are stored in the enqueuer queue
- Departure process makes dequeue requests
  - Requests requiring SRAM access are stored in the dequeuer queue
  - Output queue holds the dequeued cells
- Elastic buffer holds a few free nodes
15
Queuing model with request queues
- We use a discrete-time Markov model and show that relatively small request queues (8-16 entries) ensure good throughput with very high probability
  - Experimental results for a few example systems are shown in the paper
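The slides do not give the model itself, so the following is only a generic illustration of the kind of calculation involved, not the paper's model. It treats one request queue as a discrete-time birth-death Markov chain: in each time slot a request arrives with probability `p_arrival`, and a queued request is served with probability `p_service`. Both probabilities, the chain's boundary behavior, and the function name `stationary_occupancy` are assumptions made for this sketch.

```python
def stationary_occupancy(capacity, p_arrival, p_service):
    """Stationary distribution of a finite discrete-time birth-death queue.

    States 0..capacity are queue occupancies. Per slot (assumed model):
      - from an empty queue, occupancy grows with probability p_arrival;
      - otherwise it grows with p_arrival * (1 - p_service)
        and shrinks with p_service * (1 - p_arrival).
    Requires capacity >= 1, 0 < p_arrival < 1 and 0 < p_service <= 1.
    """
    up = [p_arrival] + [p_arrival * (1 - p_service)] * (capacity - 1)
    down = p_service * (1 - p_arrival)

    # Detailed balance for a birth-death chain: pi[i+1] * down = pi[i] * up[i].
    weights = [1.0]
    for i in range(capacity):
        weights.append(weights[-1] * up[i] / down)

    total = sum(weights)
    return [w / total for w in weights]


# Probability that the request queue is full, for two queue sizes.
full_8 = stationary_occupancy(8, p_arrival=0.5, p_service=0.8)[-1]
full_16 = stationary_occupancy(16, p_arrival=0.5, p_service=0.8)[-1]
print(full_8, full_16)
```

Under these assumed rates the probability of a full queue falls off geometrically with queue size, which is the qualitative behavior the slide relies on when it says 8-16 entries suffice.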
16
Queue Descriptor size
- Note that every list node now consists of X buffers
  - Potentially X packets can be stored at the head node
  - Thus the queue descriptor may need to hold the lengths of all X packets
  - A large queue descriptor memory might not be desirable
- We introduce a clever encoding of packet lengths and boundaries, which results in coarse-grained scheduling
18
Coarse grained scheduling
- Coarse-grained scheduling makes scheduling decisions based upon the cell count
  - May result in short-term error
  - Error can be easily compensated within an X-cell window
  - Thus long-term fairness is ensured
- The jitter remains negligible for all practical purposes
[Figure labels: "Not needed"; "log2(XC + P)"]
19
Conclusions
- We presented the linked-list queue bottlenecks and tried to address them
- The current work might have been used in some production systems
- Our aim was to present these ideas to the research community
- Further research
  - Our ideas can be extended to asynchronous packet buffers
  - Thus no need for segmentation of packets into cells
    - Improved space and bandwidth efficiency
20
Questions?