
Slide 1: A Scalable, Cache-Based Queue Management Subsystem for Network Processors
Sailesh Kumar, Patrick Crowley
Dept. of Computer Science and Engineering

Slide 2: Packet processing systems

- ASIC based
  » High performance
  » Low configurability
  » May be expensive when volumes are low
- Network processor (NP) based
  » Very high degree of configurability
  » High volumes can result in low cost
  » Challenging to match ASIC performance
- In this paper we concentrate on the queuing bottlenecks associated with NP-based packet processors

Slide 3: Basic NP architecture

- Chip-multiprocessor (CMP) architecture
  » A pool of relatively simple processors
  » Dedicated hardware units for common cases
  » High-speed interconnect between processors
  » Integrated network interfaces and memory controllers
- A group of processors collectively performs the packet processing tasks (queuing, scheduling, etc.)
- Best-case performance is N times higher when each of the N processors operates in parallel
- Example: Intel's IXP architecture

Slide 4: Intel's IXP2850 introduction

- CMP architecture
- 16 RISC-type processors called microengines (MEs)
- 8 hardware thread contexts on each ME
- SPI4.2 and CSIX interface cores
- 3 Rambus DRAM and 4 QDR SRAM controllers
- Various hardware units, such as hash and CAM units
- Typically the MEs are arranged in a pipeline, and groups of MEs collectively perform the packet processing task

Slide 5: Why does packet processing need queues?

- Routers and switch fabrics are packet processing systems which
  » receive packets at input ports
  » classify packets and identify their next hop
  » transmit packets at the appropriate output ports
- The ingress rate can exceed the output link capacity due to
  » traffic from many input ports destined to one output port
  » the bursty nature of Internet traffic
  » statistical oversubscription
- Implications of unmanaged congestion:
  » unbounded delay for all flows
  » packet loss across every active flow
  » a single misbehaving flow can affect all other flows

Slide 6: Solution

- Keep a queue for every flow or group of flows
  » Put arriving packets into the appropriate queue
  » Manage each queue so that resources are allocated fairly across flows
  » Send packets from the queues so that each flow receives a fair share of the aggregate link bandwidth
- In fact, queues are the fundamental data structure in any packet processing system. They ensure
  » fair allocation of resources (bandwidth, buffer space, etc.)
  » isolation of misbehaving and high-priority flows
  » guaranteed traffic treatment: delay, bandwidth, QoS
- Conclusion: any packet processing system must handle a large number of queues at very high speed

Slide 7: A simple queuing model

- DRAM space is divided into fixed-size units, each of which can hold an arriving packet
  » Each such unit is called a buffer
- SRAM keeps the queue descriptors (QDs) and next pointers
  » A QD is the set of head address, tail address, and length of a queue
  » A next pointer holds the address of the next buffer in a queue
- We need two categories of queues
  » A queue of all the free buffers available (the free buffer queue)
  » A set of queues holding the buffers that contain packets belonging to the various flows (the virtual queues)
    – These enable isolation of flows
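
To make the layout concrete, here is a minimal sketch in C of the data structures this slide describes; all names and sizes are ours (illustrative assumptions), not taken from the paper.

#include <stdint.h>

#define NUM_BUFFERS (1u << 20)   /* DRAM packet buffers (illustrative) */
#define NUM_QUEUES  (1u << 16)   /* virtual queues, e.g. one per flow  */
#define NIL         0xFFFFFFFFu  /* end-of-list / empty marker         */

/* One descriptor per queue, kept in off-chip SRAM: head and tail are
 * buffer indices into the DRAM buffer pool, length counts buffers. */
typedef struct {
    uint32_t head;
    uint32_t tail;
    uint32_t length;
} queue_desc_t;

/* Per-buffer next pointers, also in SRAM: next_ptr[i] is the buffer
 * following buffer i within whatever queue buffer i belongs to. */
static uint32_t     next_ptr[NUM_BUFFERS];
static queue_desc_t vq[NUM_QUEUES];  /* virtual queue descriptors    */
static queue_desc_t free_q;          /* free buffer queue descriptor */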

Slide 8: A simple queuing model (cont.) [figure]

Slide 9: Queuing operation

- For each arriving packet
  » A buffer is dequeued from the free buffer queue
  » The packet is written into it
  » The buffer is enqueued into the appropriate virtual queue
  » The queue descriptors are updated
- Thus an enqueue operation involves
  » an update of the free queue descriptor (a read followed by a write)
  » an update of the virtual queue descriptor (a read followed by a write)
- The free queue descriptor is kept on-chip, so its updates are fast
- The virtual queue descriptors, however, are off-chip, so their updates are slow
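
Continuing the sketch above, the per-packet enqueue path might look as follows; in hardware, every access to vq[qid] is the slow off-chip SRAM read-modify-write this slide refers to, while free_q lives on-chip. dram_write_packet is a hypothetical helper standing in for the DRAM copy.

void dram_write_packet(uint32_t buf, const void *pkt, uint32_t len); /* hypothetical */

/* Remove the buffer at the head of a queue; returns NIL if empty. */
static uint32_t queue_pop(queue_desc_t *qd) {
    uint32_t buf = qd->head;            /* read descriptor */
    if (buf == NIL) return NIL;
    qd->head = next_ptr[buf];
    if (qd->head == NIL) qd->tail = NIL;
    qd->length--;                       /* write descriptor back */
    return buf;
}

/* Append a buffer at the tail of a queue. */
static void queue_push(queue_desc_t *qd, uint32_t buf) {
    next_ptr[buf] = NIL;
    if (qd->tail == NIL) qd->head = buf;        /* queue was empty */
    else                 next_ptr[qd->tail] = buf;
    qd->tail = buf;                     /* write descriptor back */
    qd->length++;
}

/* The per-packet sequence from this slide. */
void enqueue_packet(uint32_t qid, const void *pkt, uint32_t len) {
    uint32_t buf = queue_pop(&free_q);  /* fast: free QD is on-chip  */
    if (buf == NIL) return;             /* no free buffers: drop     */
    dram_write_packet(buf, pkt, len);   /* copy packet into DRAM     */
    queue_push(&vq[qid], buf);          /* slow: virtual QD off-chip */
}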

Slide 10: Queuing operation in an NP

- To achieve high throughput, a group of processors and their associated threads collectively perform the queuing
  » Each thread handles one packet at a time and enqueues/dequeues it into/from the appropriate queue
- When the arriving/departing packets all belong to different queues, this scheme speeds up the operation linearly with the number of threads
- However, when packets belong to the same queue, the entire operation is serialized and the threads compete for the same queue descriptor
  » Multiple processors/threads then yield no benefit
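
In software terms this contention might be modeled as below (our sketch, not the paper's code): each thread must hold a queue's descriptor exclusively for the whole read-modify-write, so threads hitting the same queue line up behind one lock, while threads hitting different queues proceed in parallel. receive_packet is a hypothetical RX helper.

#include <pthread.h>

void receive_packet(uint32_t *qid, void **pkt, uint32_t *len); /* hypothetical */

/* One lock per queue descriptor; initialize with pthread_mutex_init
 * at startup. free_q would need its own protection; omitted here. */
static pthread_mutex_t qd_lock[NUM_QUEUES];

/* Loop run by each cooperating thread. If successive packets map to
 * distinct qids, the critical sections overlap and throughput scales
 * with the thread count; if they all map to one qid, each thread
 * waits out the previous thread's full off-chip QD update. */
void *queuing_thread(void *arg) {
    (void)arg;
    for (;;) {
        uint32_t qid, len;
        void *pkt;
        receive_packet(&qid, &pkt, &len);
        pthread_mutex_lock(&qd_lock[qid]);
        enqueue_packet(qid, pkt, len);  /* read QD, update, write QD */
        pthread_mutex_unlock(&qd_lock[qid]);
    }
}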

Slide 11: Operation

If all threads access different queues, the per-thread read-modify-write sequences overlap in time:
  » Thread 0: read QD A, update, write QD A; read QD B, update, write QD B; ...
  » Thread 1: read QD C, update, write QD C; read QD D, update, ...
  » Thread 2: read QD E, update, write QD E; read QD F, ...
  » Thread x: read QD G, update, write QD G; read QD H, ...

What if all threads access the same queue? Every operation serializes on QD A:
  » Thread 0: read QD A, update, write QD A
  » Thread 1: waits for thread 0, then reads QD A, updates, writes QD A
  » Threads 2 through x: wait their turn, each paying the full off-chip read and write latency

Slide 12: Solution

- Accelerate the serialized operations
  » Use a mechanism that makes serialized operations run much faster
- This can be done by adding a small on-chip cache that holds the queue descriptors currently being accessed
- All threads but the first can then update the queue descriptor much faster
  » When threads access different queue descriptors, the operation proceeds as before
  » When threads access the same queue descriptor, the operation is still serialized, but each individual update is very fast

Slide 13: Queuing cache

- The queuing cache sits between the memory hierarchy and the MEs
  » Whenever a queue descriptor is accessed, it is brought into the cache
- Design questions
  » What size should the cache be?
  » What eviction policy should it use?
- Intuitively, the cache size should equal the maximum number of threads collectively performing the queuing operation
  » Only that many QDs can be in use at any one time
- The eviction policy can be least recently used (LRU)
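
A minimal software model of such a cache (our sketch, assuming one entry per hardware thread and fully associative lookup, which a CAM provides in hardware): sram_read_qd and sram_write_qd are hypothetical helpers for the off-chip transfers, and entries must start with qid set to NIL.

void sram_read_qd(uint32_t qid, queue_desc_t *qd);        /* hypothetical */
void sram_write_qd(uint32_t qid, const queue_desc_t *qd); /* hypothetical */

#define NUM_THREADS 64   /* illustrative: MEs x threads per ME */

typedef struct {
    uint32_t     qid;      /* which queue this entry holds (NIL if none) */
    queue_desc_t qd;       /* on-chip copy of the descriptor             */
    uint64_t     last_use; /* timestamp driving LRU replacement          */
} qcache_entry_t;

static qcache_entry_t qcache[NUM_THREADS]; /* one entry per active thread */
static uint64_t       qcache_clock;

/* Return the cached descriptor for qid, filling from SRAM on a miss
 * and writing the least recently used victim back to SRAM first. */
queue_desc_t *qcache_get(uint32_t qid) {
    qcache_entry_t *victim = &qcache[0];
    for (int i = 0; i < NUM_THREADS; i++) {
        if (qcache[i].qid == qid) {              /* hit: on-chip speed */
            qcache[i].last_use = ++qcache_clock;
            return &qcache[i].qd;
        }
        if (qcache[i].last_use < victim->last_use)
            victim = &qcache[i];
    }
    if (victim->qid != NIL)                      /* miss: evict LRU entry */
        sram_write_qd(victim->qid, &victim->qd);
    sram_read_qd(qid, &victim->qd);
    victim->qid = qid;
    victim->last_use = ++qcache_clock;
    return &victim->qd;
}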

Slide 14: Operation with queuing cache

If all threads access different queues, behavior is as before, except that descriptor write-backs happen lazily, on LRU eviction:
  » Thread 0: read QD A, update (write back the LRU entry); read QD B, update, ...
  » Thread 1: read QD C, update; read QD D, update, ...
  » Thread 2: read QD E, update; read QD F, ...
  » Thread x: read QD G, update; read QD H, ...

If all threads access the same queue, the operation is still serialized, but after thread 0's initial read, QD A stays in the cache:
  » Thread 0: read QD A, update
  » Thread 1: waits for QD A, then updates it in the cache at on-chip speed
  » Threads 2 through x: each waits only for the previous update, not for an off-chip read and write

Slide 15: Performance comparison

- Assume a 200 MHz DDR SRAM with an SRAM access latency of 80 ns and a queuing cache access latency of 20 ns
- Also assume the processor takes 10 ns to execute all the queuing-related instructions associated with a single packet
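
Read as a back-of-the-envelope model (ours, not the paper's exact methodology), these numbers suggest the benefit in the fully serialized case: without the cache, each packet's QD update costs roughly 80 ns (read) + 10 ns (update) + 80 ns (write) = 170 ns, or about 5.9 million queue operations per second on a single hot queue; with the cache, once the descriptor is resident, each update costs roughly 20 + 10 + 20 = 50 ns, or about 20 million operations per second, a better than 3x improvement.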

Slide 16: Our design approach

- Since queuing is so common in NPs, it makes sense to add hardware-level support for the enqueue and dequeue operations
- The queuing cache is the best place for this functionality, because that is exactly where queuing must be fast when it becomes serialized
- Each NP would then support standard instructions such as enqueue and dequeue
  » These instructions are sent to the queuing cache
  » The queuing cache internally manages the pointers and handles any contention when threads access the same queue
- The threads themselves are relieved of the burden of synchronization and pointer management, and can operate independently
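
From the software side, such an interface might look like the following (entirely our sketch; the paper's actual instruction encoding may differ). qcache_execute is a hypothetical stand-in for issuing a command to the queuing cache unit.

/* Command set an ME could issue to the queuing cache (illustrative). */
typedef enum { QC_ENQUEUE, QC_DEQUEUE } qc_op_t;

typedef struct {
    qc_op_t  op;
    uint32_t qid;  /* target virtual queue                                */
    uint32_t buf;  /* buffer handle: input for enqueue, result of dequeue */
} qc_cmd_t;

void qcache_execute(qc_cmd_t *cmd); /* hypothetical command interface */

/* From the thread's point of view, queuing collapses to one command
 * exchange: no QD fetch, no next-pointer update, no locking. */
uint32_t dequeue_via_cache(uint32_t qid) {
    qc_cmd_t cmd = { .op = QC_DEQUEUE, .qid = qid, .buf = NIL };
    qcache_execute(&cmd);
    return cmd.buf;   /* NIL if the queue was empty */
}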

Slide 17: Implementation [figure]

Slide 18: Intel's approach

- Intel's second-generation IXP network processors support queuing via
  » the SRAM controller, which holds queue descriptors and implements the queuing operations, and
  » the MEs, which support enqueue and dequeue instructions
- Caching of queue descriptors is implemented using
  » a Q-array in the memory controller
    – Any queuing operation is preceded by a transfer of the queue descriptor from SRAM into the Q-array
  » a CAM kept in each ME
    – Tracks which QDs are cached and their positions in the Q-array
- The CAM supports LRU, which is used to evict entries from the Q-array
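
As a rough software model of the per-ME bookkeeping just described (ours; the slot count is illustrative, not the IXP's actual sizing): the CAM maps a queue ID to its Q-array slot, so a thread can tell whether the descriptor it needs is already resident in the memory controller.

#define QARRAY_SLOTS 16   /* illustrative Q-array size */

static uint32_t cam_qid[QARRAY_SLOTS]; /* CAM tags: qid held by each slot */

/* Returns the Q-array slot caching qid, or -1 on a CAM miss, in which
 * case the QD must first be transferred from SRAM into a slot chosen
 * by the CAM's LRU state (eviction logic omitted). */
int cam_lookup(uint32_t qid) {
    for (int slot = 0; slot < QARRAY_SLOTS; slot++)
        if (cam_qid[slot] == qid)
            return slot;
    return -1;
}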

Slide 19: Comparison

- Reduced instruction count on each processor
  » Moving all the enqueue/dequeue logic into the queuing cache makes the software simpler
- Simple and modular software code for queuing tasks
  » No need for synchronization, etc.
- Building the queuing cache near the memory controller significantly reduces on-chip communication
  » Since the queuing cache handles the pointer processing as well, the processors need not fetch the queue descriptors at all
  » The only communication between the processors and the queuing cache is the instruction exchange
- More scalable
  » Any number of MEs can participate in queuing
  » No per-ME CAM is needed, unlike in Intel's IXP approach

Slide 20: Conclusion

- Contributions
  » A brief qualitative and quantitative analysis of the queuing cache
  » A proposal for an efficient and scalable design
- Future work
  » Comparison with other caching techniques
  » An implementation, to measure the real complexity
- We believe that such a cache-based, centralized queuing hardware unit will make future network processors more
  » scalable, and
  » easy to program
- Questions?