Addressing Queuing Bottlenecks at High Speeds
Sailesh Kumar, Patrick Crowley, Jonathan Turner


Slide 2: Agenda and Overview
- In this paper we:
  - Introduce the potential bottlenecks in high-speed queuing systems
  - Address only the bottlenecks associated with off-chip SRAM
- Overview of queuing systems
- Bottlenecks
- Our solution
- Conclusion

Slides 3-4: High-Speed Packet Queuing Systems
- Packet queues are crucial for isolating traffic (QoS, etc.)
  - Modern systems can have thousands or even millions of queues
  - Queues must operate at link rates (OC-192, OC-768, ...)
  - Memory must be utilized efficiently
    - Dynamic sharing among queues
    - Minimal waste due to fragmentation, etc.
  - Power consumption, chip area, cost, ...
- Most shared-memory queuing systems consist of:
  - A packet buffer
  - A queuing subsystem that stores each queue's control information
    - Supports enqueue and dequeue
  - A scheduler that makes dequeue decisions
- Here we concentrate on the queuing subsystem
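A minimal interface sketch of how the three components named on this slide might fit together. The names and type widths are our own illustration, not the paper's:

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical top-level interface; names are ours, for illustration. */
typedef uint32_t cell_addr_t;   /* address of a cell in the DRAM packet buffer */
typedef uint16_t queue_id_t;    /* supports up to 64K queues in this sketch    */

/* Queuing subsystem: maintains per-queue control information. */
void enqueue(queue_id_t q, cell_addr_t cell, uint16_t pkt_len);
bool dequeue(queue_id_t q, cell_addr_t *cell);   /* false if queue empty */

/* Scheduler: decides which queue to dequeue from next. */
queue_id_t schedule_next(void);
```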

Slide 5: Assumptions
- We consider only hardware implementations
  - Shared-memory linked-list queues
- Packets are broken into cells before being stored
- Implicit mapping (explained shortly)
- Separate memories for:
  - Packet storage (DRAM)
  - The queues' linked lists
  - Queue descriptors (head, tail, etc.)
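A sketch of the data layout these assumptions imply. The field widths are assumed for illustration; the slides do not give an exact format:

```c
#include <stdint.h>

/* "Implicit mapping" presumably means list node i corresponds to DRAM
 * cell i, so a node never stores an explicit buffer pointer. */

typedef struct {
    uint32_t next;      /* SRAM index of the next list node           */
    uint16_t pkt_len;   /* length of the packet in this cell; the
                           scheduler reads this to make its decisions */
} list_node_t;          /* one node per 64-byte cell in DRAM          */

typedef struct {
    uint32_t head;        /* node index of the first cell   */
    uint32_t tail;        /* node index of the last cell    */
    uint32_t cell_count;  /* queue occupancy in cells       */
} queue_desc_t;           /* one descriptor per queue       */
```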

Slide 6: A shared-memory queuing subsystem (figure)

Slide 7: Limitations of Such a Queuing Subsystem
- Dequeue throughput is bounded by SRAM latency
  - SRAM latency > 20 ns; with 64-byte cells this implies about 25 Gbps
  - This may not suffice in systems with speedup or with multipass support
- Fair-queuing (scheduling) algorithm performance suffers
  - Most algorithms require packet lengths to make decisions
  - Since packet lengths are stored in SRAM, each scheduling decision also takes at least 20 ns
- Cost
  - Per-bit cost of SRAM is still about 100 times that of DRAM
  - A 512 MB packet buffer needs 40 MB of SRAM (64 B cells)
    - That SRAM can cost 8 times as much as the DRAM
    - 12 SRAM chips versus 4 DRAM chips (power, area, etc.)
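The quoted 25 Gbps figure follows from simple access arithmetic: if every 64-byte cell dequeue requires one dependent SRAM read of at least 20 ns, the sustainable line rate is

```latex
\[
R_{\max} \;=\; \frac{\text{cell size}}{\text{SRAM latency}}
         \;=\; \frac{64 \times 8~\text{bits}}{20~\text{ns}}
         \;=\; 25.6~\text{Gb/s}.
\]
```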

Slide 8: Scope for Improvement
- Reduce the impact of SRAM latency
- Reduce SRAM bandwidth
- Is it possible to control cost by replacing the SRAM with DRAM?
- With all of the above, is it possible to ensure good worst-case throughput?

Slide 9: Buffer Aggregation
- Each node of the linked list maps to multiple buffers
  - An occupancy count, or offset, tracks the fill level of a node
  - Note that only the head and tail nodes can be partially filled
- Fewer SRAM accesses are needed (1/X as many, on average)
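A sketch of the aggregated layout, assuming each SRAM node covers X consecutive DRAM cells (X = 4 here is an example value, not the paper's choice):

```c
#include <stdint.h>

#define X 4  /* cells per list node; an assumed example value */

/* With aggregation, one SRAM node maps (implicitly) to X consecutive
 * DRAM cells, so the linked list is walked only once every X cells. */
typedef struct {
    uint32_t next;  /* index of the next X-cell node */
} agg_node_t;

/* Only the head and tail nodes can be partially filled, so their fill
 * state lives in the queue descriptor, not in every node. */
typedef struct {
    uint32_t head, tail;   /* first and last node indices             */
    uint8_t  head_offset;  /* cells already consumed from head node,
                              in 0..X-1                               */
    uint8_t  tail_offset;  /* cells already written to tail node,
                              in 0..X-1                               */
    uint32_t cell_count;   /* total occupancy in cells                */
} agg_queue_desc_t;
```

In this scheme a dequeue touches SRAM only when head_offset wraps from X-1 back to 0, which is where the 1/X average access rate comes from.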

Slides 10-13: Buffer Aggregation (cont.)
- To prove its effectiveness, we must consider two cases:
  - 1. Queues that remain near-empty
    - Not a difficult case, since SRAM accesses are not required
  - 2. Backlogged queues with near-empty head nodes
    - A legitimate concern: throughput can fall to that of a system without buffer aggregation
    - We show that adding a few request queues keeps performance high with good probability

Slide 14: Queuing Model with Request Queues
- The arrival process makes enqueue requests
  - Requests are stored in the enqueue request queue
- The departure process makes dequeue requests
  - Requests requiring SRAM access are stored in the dequeue request queue
  - An output queue holds the dequeued cells
- An elastic buffer holds a few free nodes

Slide 15: Queuing Model with Request Queues (cont.)
- Using a discrete-time Markov model, we show that relatively small request queues (8-16 entries) ensure good throughput with very high probability
  - Experimental results for a few example systems appear in the paper
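The paper's analysis is a Markov model; as a stand-in, here is a toy Monte Carlo sketch showing the same qualitative effect. All parameters (X, the latency L, the arrival model) are our assumptions, not the paper's:

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy discrete-time simulation: one dequeue request per time slot, a
 * request needs an SRAM read with probability 1/X, and the SRAM
 * completes one read every L slots.  We estimate how often a request
 * queue of K entries is full (i.e. stalls the dequeuer). */
int main(void) {
    const int  X = 4;            /* cells per aggregated node        */
    const int  L = 3;            /* SRAM latency in cell slots       */
    const long SLOTS = 10000000;
    srand(42);

    for (int K = 2; K <= 16; K *= 2) {
        int  q = 0;              /* outstanding SRAM requests        */
        int  busy = 0;           /* slots until current read is done */
        long stalls = 0;

        for (long t = 0; t < SLOTS; t++) {
            if (busy > 0 && --busy == 0 && q > 0)
                q--;                              /* read completed   */
            if (rand() % X == 0) {                /* needs SRAM read  */
                if (q < K) q++; else stalls++;    /* full => stall    */
            }
            if (busy == 0 && q > 0)
                busy = L;                         /* start next read  */
        }
        printf("K=%2d  stall fraction = %.6f\n",
               K, (double)stalls / SLOTS);
    }
    return 0;
}
```

Even this crude model shows the stall fraction dropping steeply as K grows, consistent with the slide's claim that 8-16 entries suffice.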

Slides 16-17: Queue Descriptor Size
- Note that every list node now consists of X buffers
  - Up to X packets can be stored in the head node
  - So the queue descriptor may need to hold the lengths of all X packets
  - A large queue-descriptor memory may not be desirable
- We introduce a clever encoding of packet lengths and boundaries that results in coarse-grained scheduling
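A rough accounting under one reading of the log2(XC+P) annotation on the next slide, taking C as the cell size and P as the maximum packet size (this interpretation is ours): naively the descriptor carries one length field per packet in the head node, whereas a single coarse count over the head node's span would suffice:

```latex
\[
\underbrace{X \,\lceil \log_2 P \rceil}_{\text{one length per packet}}
\;\text{bits}
\quad\longrightarrow\quad
\underbrace{\lceil \log_2 (XC+P) \rceil}_{\text{one coarse count}}
\;\text{bits}.
\]
```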

Slide 18: Coarse-Grained Scheduling
- Coarse-grained scheduling makes scheduling decisions based on the cell count
  - This may introduce short-term error
  - The error can easily be compensated within an X-cell window
  - Thus long-term fairness is ensured
- The jitter remains negligible for all practical purposes
(Figure annotations: "Not needed"; "log2(XC+P)")
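One way to read the compensation argument is as a deficit-style scheduler that charges queues in whole cells and settles the per-packet remainder once the boundary is known. A sketch under that assumption (the transcript does not name deficit round robin; this is our framing):

```c
#include <stdint.h>
#include <stdbool.h>

#define CELL 64  /* cell size in bytes, as elsewhere in the talk */

/* Hypothetical per-queue scheduler state.  Credit may briefly go
 * negative: that is the bounded short-term error, repaid within an
 * X-cell window. */
typedef struct {
    int32_t credit;  /* byte credit */
} sched_state_t;

/* Per-round refill; quantum choice is outside this sketch. */
static void new_round(sched_state_t *s, uint32_t quantum) {
    s->credit += (int32_t)quantum;
}

/* Coarse decision, made per cell without knowing packet lengths:
 * charge one full cell up front. */
static bool may_send_cell(sched_state_t *s) {
    if (s->credit <= 0)
        return false;
    s->credit -= CELL;
    return true;
}

/* Once the dequeued packet's true length is known (e.g. read from the
 * head node), refund the unused tail bytes of its last cell. */
static void settle_packet(sched_state_t *s, uint32_t pkt_len) {
    uint32_t cells = (pkt_len + CELL - 1) / CELL;
    s->credit += (int32_t)(cells * CELL - pkt_len);
}
```

Because every overcharge is refunded as soon as the packet boundary is known, per-queue service error never exceeds a few cells, which matches the slide's long-term fairness claim.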

Slide 19: Conclusions
- We presented the bottlenecks of linked-list queues and addressed them
- Work along these lines may already be in use in some production systems
- Our aim was to present these ideas to the research community
- Further research:
  - Our ideas can be extended to asynchronous packet buffers
    - Packets would then not need to be segmented into cells
    - Improved space and bandwidth efficiency

Slide 20: Questions?