QoS Support
Shivkumar Kalyanaraman, Rensselaer Polytechnic Institute

Slide 2: What is QoS?
- "Better performance," as described by a set of parameters or measured by a set of metrics.
- Generic parameters: bandwidth; delay and delay-jitter; packet loss rate (or loss probability).
- Transport/application-specific parameters: timeouts; percentage of "important" packets lost.

Slide 3: What is QoS? (contd.)
- These parameters can be measured at several granularities: "micro" flow, aggregate flow, population.
- QoS is considered "better" if (a) more parameters can be specified, and (b) QoS can be specified at a finer granularity.
- QoS vs. CoS: CoS maps micro-flows to classes and may perform optional per-class resource reservation.
- QoS spectrum: from best effort to leased line.

Slide 4: Example QoS
- Bandwidth: r Mbps in a time T, with burstiness b.
- Delay: worst-case.
- Loss: worst-case or statistical.
(Figure: token-bucket regulator — r tokens per second, b-token depth, output <= R bps — and the cumulative-bits arrival curve: initial slope R, then slope r, knee at b*R/(R-r) bits; delay d and buffer B_a marked.)

Slide 5: Fundamental Problems
- In a FIFO service discipline, the performance seen by one flow is convoluted with the arrivals of packets from all other flows!
- You can't get QoS with a "free-for-all."
- Need new scheduling disciplines that "isolate" a flow's performance from the arrival rates of background traffic.
(Figure: a shared buffer B served FIFO vs. served by a scheduling discipline.)

Slide 6: Fundamental Problems (contd.)
- Conservation Law (Kleinrock): the sum over flows i of rho(i) * W_q(i) is a constant K.
- Irrespective of the scheduling discipline chosen: average backlog (delay) is constant; average bandwidth is constant.
- Zero-sum game => need to "set aside" resources for premium services.
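Reconstructed in standard notation (the summation symbols were lost in transcription), Kleinrock's conservation law says that for any work-conserving discipline the load-weighted sum of mean queueing delays is fixed:

```latex
\sum_{i} \rho_i \, \overline{W_q}(i) = K, \qquad \rho_i = \lambda_i \bar{x}_i
```

where rho_i is flow i's utilization: its arrival rate lambda_i times its mean service time. Lowering one flow's delay necessarily raises another's.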

Slide 7: QoS Big Picture: Control/Data Planes

Slide 8: Example: Integrated Services (IntServ)
- An architecture for providing QoS guarantees in IP networks for individual application sessions.
- Relies on resource reservation; routers must maintain state about allocated resources (e.g., the guaranteed rate g) and respond to new call-setup requests.

Slide 9: Call Admission
- Routers admit calls based on their R-spec and T-spec and on the resources currently allocated at the routers to other calls.

Slide 10: Token Bucket
- Characterized by three parameters (b, r, R): b = token depth; r = average arrival rate; R = maximum arrival rate (e.g., R = link capacity).
- A bit is transmitted only when a token is available.
- When a bit is transmitted, exactly one token is consumed.
(Figure: regulator — r tokens per second, b-token bucket, output <= R bps — and the arrival curve: slope R during the burst, then slope r, with b*R/(R-r) bits at the knee.)
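The (b, r) conformance test above can be sketched in a few lines. This is an illustrative model (class and method names are mine, not from the slides); the peak rate R is enforced by the link itself, so only the bucket is modeled.

```python
class TokenBucket:
    """Token-bucket regulator: depth b (tokens), fill rate r (tokens/sec)."""

    def __init__(self, b, r):
        self.b = b          # bucket depth
        self.r = r          # token fill rate
        self.tokens = b     # start with a full bucket
        self.last = 0.0     # time of the last conformance check

    def conforms(self, t, size):
        """At time t, may a packet needing `size` tokens be sent?"""
        # accrue tokens since the last check, capped at the depth b
        self.tokens = min(self.b, self.tokens + self.r * (t - self.last))
        self.last = t
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False


tb = TokenBucket(b=1000, r=500)   # 1000-token bucket, 500 tokens/sec
print(tb.conforms(0.0, 800))      # burst within the depth: True
print(tb.conforms(0.1, 800))      # only 250 tokens accrued: False
print(tb.conforms(2.0, 800))      # bucket has refilled: True
```

The burst parameter b is exactly what lets a conforming source exceed rate r briefly, which is why the arrival curve has the initial slope R.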

Slide 11: Per-hop Reservation
- Given (b, r, R) and a per-hop delay target d, allocate bandwidth r_a and buffer space B_a so as to guarantee d.
(Figure: arrival curve with burst b and slope r; service at slope r_a; delay d and buffer B_a shown as the horizontal and vertical gaps between the curves.)
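The slide leaves the allocation implicit. The standard network-calculus result — stated here as background, not taken from the slide — for the arrival curve A(t) = min(Rt, b + rt) served at a constant rate r_a with r <= r_a <= R is:

```latex
d = \frac{b}{r_a}\cdot\frac{R - r_a}{R - r}, \qquad
B_a = r_a\, d = b\,\frac{R - r_a}{R - r}
```

The worst case occurs at the end of the peak-rate burst, at t = b/(R - r); serving faster (larger r_a) shrinks both the delay d and the buffer B_a.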

Slide 12: Mechanisms: Queuing/Scheduling
- Use a few bits in the header to indicate which queue (class) a packet goes into (also branded as CoS).
- High-$$ users are classified into high-priority queues, which may also be less populated => lower delay and lower likelihood of packet drop.
- Ideas: priority, round-robin, classification, aggregation, ...
(Figure: traffic sources paying $, $$$, $$$$$$ mapped to traffic classes C, B, A.)

Slide 13: Mechanisms: Buffer Management/Priority Drop
- Ideas: packet marking, queue thresholds, differential dropping, buffer assignments.
(Figure: a queue with two thresholds — above the first, drop only BLUE packets; above the second, drop RED and BLUE packets.)

Slide 14: Classification

Slide 15: Why Classification? Providing Value-Added Services
Some examples:
- Differentiated services: regard traffic from Autonomous System #33 as "platinum-grade."
- Access Control Lists: deny udp host eq snmp.
- Committed Access Rate: rate-limit WWW traffic from sub-interface #739 to 10 Mbps.
- Policy-based Routing: route all voice traffic through the ATM network.

Slide 16: Packet Classification
(Figure: an incoming packet's header is matched by the packet classification stage against a classifier — a policy database of predicate/action rules — and the forwarding engine applies the resulting action.)

Slide 17: Multi-field Packet Classification
- Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet.

Slide 18: Prefix Matching: a 1-D Range Problem
- The most specific route is the "longest matching prefix."
(Figure: an example prefix table; the prefixes themselves were lost in transcription, apart from one /24 entry.)
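Longest-prefix match can be sketched directly from the definition as a linear scan over (prefix, next-hop) pairs; the routing table below is hypothetical, since the slide's own prefixes did not survive.

```python
import ipaddress

def longest_prefix_match(table, addr):
    """Return the next hop of the table entry whose prefix matches `addr`
    with the greatest prefix length, i.e. the most specific route."""
    ip = ipaddress.ip_address(addr)
    best = None
    for prefix, next_hop in table:
        net = ipaddress.ip_network(prefix)
        if ip in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

# hypothetical routing table
table = [("10.0.0.0/8", "A"), ("10.1.0.0/16", "B"), ("10.1.2.0/24", "C")]
print(longest_prefix_match(table, "10.1.2.3"))   # all three match; /24 wins: C
print(longest_prefix_match(table, "10.9.9.9"))   # only the /8 matches: A
```

Production routers replace the linear scan with tries or hardware lookups, but the selection rule is the same.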

Slide 19: Classification: a 2-D Geometry Problem
- Each rule is a region in (Field #1, Field #2) space — e.g., (144.24/16, 64/24) — and a packet is a point; the highest-priority enclosing region wins.
(Figure: rules R1..R7 drawn as rectangles over Field #1 x Field #2, with points P1, P2, and a table of Field #1 / Field #2 / Data.)

Slide 20: Packet Classification References
- T. V. Lakshman and D. Stiliadis, "High-speed policy-based packet forwarding using efficient multi-dimensional range matching," SIGCOMM 1998.
- V. Srinivasan, S. Suri, G. Varghese, and M. Waldvogel, "Fast and scalable layer 4 switching," SIGCOMM 1998.
- V. Srinivasan, G. Varghese, and S. Suri, "Fast packet classification using tuple space search," SIGCOMM 1999.
- P. Gupta and N. McKeown, "Packet classification using hierarchical intelligent cuttings," Hot Interconnects VII, 1999.
- P. Gupta and N. McKeown, "Packet classification on multiple fields," SIGCOMM 1999.

Slide 21: Proposed Schemes

Slide 22: Proposed Schemes (contd.)

Slide 23: Proposed Schemes (contd.)

Slide 24: Scheduling

Slide 25: Output Scheduling
- The scheduler allocates output bandwidth and controls packet delay.

Slide 26: Output Scheduling
(Figure: FIFO vs. Fair Queueing at the output port.)

Slide 27: Motivation: Parekh-Gallager Theorem
- Let a connection be allocated weights at each WFQ scheduler along its path, so that the least bandwidth it is allocated is g.
- Let it be leaky-bucket regulated such that the number of bits sent in [t1, t2] <= g(t2 - t1) + sigma.
- Let the connection pass through K schedulers, where the k-th scheduler has rate r(k).
- Let the largest packet size in the network be P.
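The slide sets up the notation but stops before the bound itself. In the usual formulation (e.g., Keshav's treatment), the theorem's end-to-end worst-case delay is:

```latex
D \;\le\; \frac{\sigma}{g} \;+\; \sum_{k=1}^{K-1}\frac{P}{g} \;+\; \sum_{k=1}^{K}\frac{P}{r(k)}
```

i.e., the burst drained at the guaranteed rate g, plus a per-hop packetization term at rate g, plus a store-and-forward term at each scheduler's line rate r(k).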

Slide 28: Motivation
- FIFO is natural but gives poor QoS: bursty flows increase delays for others, so delays cannot be guaranteed.
- Need round-robin-style scheduling of packets: Fair Queueing; Weighted Fair Queueing / Generalized Processor Sharing.

Slide 29: Scheduling: Requirements
An ideal scheduling discipline:
- is easy to implement (VLSI space, execution time);
- is fair (max-min fairness);
- provides performance bounds, deterministic or statistical, at micro-flow or aggregate-flow granularity;
- allows easy admission-control decisions (whether a new flow can be allowed).

Slide 30: Choices: 1. Priority
- A packet is served from a given priority level only if no packets exist at higher levels (multilevel priority with exhaustive service).
- The highest level gets the lowest delay.
- Watch out for starvation!
- Usually map priority levels to delay classes, e.g., low-bandwidth urgent messages above realtime above non-realtime.

Slide 31: Scheduling Policies: Choices #1
- Priority queuing: classes have different priorities; a packet's class may depend on explicit marking or other header info, e.g., IP source or destination, TCP port numbers, etc.
- Transmit a packet from the highest-priority class with a non-empty queue. Problem: starvation.
- Preemptive and non-preemptive versions exist.
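A minimal sketch of non-preemptive priority queuing (class names and structure are mine): always dequeue from the highest-priority non-empty queue — which is exactly the behavior that can starve low classes.

```python
from collections import deque

class PriorityScheduler:
    """Serve the highest-priority (lowest-index) non-empty queue first."""

    def __init__(self, num_classes):
        self.queues = [deque() for _ in range(num_classes)]

    def enqueue(self, klass, pkt):
        self.queues[klass].append(pkt)

    def dequeue(self):
        for q in self.queues:   # scan from highest priority down
            if q:
                return q.popleft()
        return None             # all queues empty

s = PriorityScheduler(3)
s.enqueue(2, "bulk")
s.enqueue(0, "urgent")
s.enqueue(1, "realtime")
print([s.dequeue() for _ in range(3)])  # ['urgent', 'realtime', 'bulk']
```

Note that as long as class 0 keeps arriving, classes 1 and 2 are never served — the starvation problem the slide flags.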

Slide 32: Scheduling Policies (more)
- Round robin: scan the class queues, serving one packet from each class that has a non-empty queue.

Slide 33: Choices: 2. Work-Conserving vs. Non-Work-Conserving
- A work-conserving discipline is never idle when packets await service.
- Why bother with non-work-conserving disciplines?

Slide 34: Non-Work-Conserving Disciplines
- Key conceptual idea: delay a packet until it becomes eligible.
- Reduces delay-jitter => fewer buffers in the network.
- How to choose the eligibility time? A rate-jitter regulator bounds the maximum outgoing rate; a delay-jitter regulator compensates for variable delay at the previous hop.

Slide 35: Do We Need Non-Work-Conservation?
- Delay-jitter can be removed at an endpoint instead — though removing it in-network also reduces the size of switch buffers.
- Increases mean delay: not a problem for playback applications.
- Wastes bandwidth: best-effort packets can be served instead.
- Always punishes a misbehaving source: can't have it both ways.
- Bottom line: not too bad; implementation cost may be the biggest problem.

Slide 36: Choices: 3. Degree of Aggregation
- More aggregation => less state => cheaper (smaller VLSI, less to advertise), BUT less individualization.
- Solution: aggregate to a class; members of a class have the same performance requirement; no protection within a class.

Slide 37: Choices: 4. Service Within a Priority Level
- In order of arrival (FCFS) or in order of a service tag.
- Service tags => the queue can be arbitrarily reordered, but it must be kept sorted, which can be expensive.
- FCFS: bandwidth hogs win (no protection); no guarantee on delays.
- Service tags: with an appropriate choice, both protection and delay bounds are possible (e.g., combined with differential buffer management and packet drop).

Slide 38: Weighted Round Robin
- Serve a packet from each non-empty queue in turn.
- Unfair if packets are of different lengths or weights are not equal.
- Different weights, fixed packet size: serve more than one packet per visit, after normalizing to obtain integer weights.
- Different weights, variable-size packets: normalize weights by mean packet size. E.g., with weights {0.5, 0.75, 1.0} and mean packet sizes {50, 500, 1500}, the normalized weights are {0.5/50, 0.75/500, 1.0/1500} = {0.01, 0.0015, 0.000667}; normalized again to integers, {60, 9, 4}.
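The normalization example works out as follows — a direct transcription of the slide's arithmetic using exact fractions (requires Python 3.9+ for `math.lcm`):

```python
import math
from fractions import Fraction

weights = [Fraction(1, 2), Fraction(3, 4), Fraction(1)]  # 0.5, 0.75, 1.0
mean_sizes = [50, 500, 1500]                             # bytes

# each queue's service share is weight / mean packet size
shares = [w / s for w, s in zip(weights, mean_sizes)]    # 1/100, 3/2000, 1/1500

# scale to integer packets-per-round: multiply by the lcm of the denominators
scale = math.lcm(*(f.denominator for f in shares))
packets_per_round = [int(f * scale) for f in shares]
print(packets_per_round)  # [60, 9, 4]
```

So in each round the scheduler serves 60, 9, and 4 packets from the three queues respectively, giving each queue bandwidth proportional to its weight despite the different packet sizes.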

Slide 39: Problems with Weighted Round Robin
- With variable-size packets and different weights, the mean packet size must be known in advance.
- Can be unfair for long periods of time. E.g., a T3 trunk (45 Mbps) with 500 connections, each with mean packet length 500 bytes: 250 connections with weight 1 and 250 with weight 10.
- Each packet takes 500 * 8 / 45 Mbps = 88.8 microseconds; a round serves 250 * 1 + 250 * 10 = 2750 packets, so the round time is 2750 * 88.8 microseconds = 244.4 ms.
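The round-time arithmetic from the example, spelled out (all values are the slide's):

```python
link_rate = 45e6                          # T3 trunk, bits/sec
pkt_bits = 500 * 8                        # 500-byte mean packet
per_packet = pkt_bits / link_rate         # ~88.9 microseconds per packet
packets_per_round = 250 * 1 + 250 * 10    # 2750 packets in one full round
round_time = packets_per_round * per_packet
print(round(round_time * 1000, 1))        # ~244.4 ms
```

A weight-1 connection can thus wait nearly a quarter of a second between its service opportunities — the "unfair for long periods" problem.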

Slide 40: Generalized Processor Sharing (GPS)
- Assume a fluid model of traffic.
- Visit each non-empty queue in turn (round robin), serving an infinitesimal amount from each.
- Leads to "max-min" fairness.
- GPS is unimplementable: we cannot serve infinitesimals, only packets.

Slide 41: Fair Queuing (FQ)
- Idea: serve packets in the order in which they would have finished transmission in the fluid-flow system.
- Maps the bit-by-bit schedule onto a packet transmission schedule.
- Transmit the packet with the lowest finish tag F_i at any given time.
- Variation: Weighted Fair Queuing (WFQ).

Slide 42: FQ Example
(Figure: a Flow 1 packet with F = 10 arrives while Flow 2 is transmitting; queued packets with F = 2, 5, 8 are served first, then the F = 10 packet. A packet currently being transmitted cannot be preempted.)
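A deliberately simplified finish-tag sketch (real FQ tracks GPS *virtual* time; this substitutes the arrival time, so it is only indicative of the mechanism, and all names are mine): each packet's tag is the flow's previous finish tag, or the arrival time if the flow was idle, plus the packet size; packets are served in increasing tag order.

```python
import heapq

class FairQueue:
    """Simplified FQ: F = max(flow's previous F, arrival time) + size;
    serve in increasing F. (Real FQ uses GPS virtual time here.)"""

    def __init__(self):
        self.last_finish = {}
        self.heap = []
        self.seq = 0   # tie-breaker so heapq never compares payloads

    def arrive(self, t, flow, size):
        start = max(self.last_finish.get(flow, 0.0), t)
        finish = start + size
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow))
        self.seq += 1

    def send(self):
        finish, _, flow = heapq.heappop(self.heap)
        return flow, finish

fq = FairQueue()
fq.arrive(0.0, "A", 100)   # A sends a back-to-back burst
fq.arrive(0.0, "A", 100)
fq.arrive(0.0, "B", 100)   # B sends a single packet
print([fq.send()[0] for _ in range(3)])  # ['A', 'B', 'A']
```

B's packet is interleaved into A's burst rather than queued behind it — the isolation FIFO lacks.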

Slide 43: WFQ: Practical Considerations
- For every packet, the scheduler needs to (1) classify it into the right flow queue, maintaining a linked list per flow, and (2) schedule it for departure.
- The complexity of both is O(log [# of flows]); the first is hard to overcome (studied earlier), but the second can be overcome by DRR.

Slide 44: Deficit Round Robin (DRR)
- Each queue is assigned a quantum size.
- A good approximation of FQ, and much simpler to implement.
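A minimal DRR sketch (the quantum and packet sizes are illustrative, not from the slide): each queue carries a deficit counter topped up by one quantum per round, and sends packets while the counter covers the head-of-line packet.

```python
from collections import deque

def drr(queues, quantum, rounds):
    """queues: list of deques of packet sizes (bytes).
    Returns the send order as (queue index, packet size) pairs."""
    deficit = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                deficit[i] = 0          # idle queues accumulate no credit
                continue
            deficit[i] += quantum
            # send while the deficit covers the head-of-line packet
            while q and q[0] <= deficit[i]:
                pkt = q.popleft()
                deficit[i] -= pkt
                sent.append((i, pkt))
    return sent

queues = [deque([600, 600]), deque([300, 300, 300])]
print(drr(queues, quantum=500, rounds=2))
# [(1, 300), (0, 600), (1, 300), (1, 300)]
```

Over the two rounds each queue sends about one quantum's worth of bytes per round, so bandwidth is shared fairly even though the packet sizes differ — with only O(1) work per packet, which is how DRR avoids WFQ's sorted queue.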

Slide 45: WFQ Problems
- To get a delay bound, g must be chosen: the lower the delay bound, the larger g must be; a large g excludes more competitors from the link; g can be very large, in some cases 80 times the peak rate!
- Sources must be leaky-bucket regulated, but choosing leaky-bucket parameters is problematic.
- WFQ couples delay and bandwidth allocations: low delay requires allocating more bandwidth, which wastes bandwidth for low-bandwidth, low-delay sources.

Slide 46: Delay-Earliest Due Date (Delay-EDD)
- Earliest due date: the packet with the earliest deadline is selected.
- Delay-EDD prescribes how to assign deadlines to packets.
- A source is required to send no faster than its peak rate; bandwidth at the scheduler is reserved at the peak rate.
- Deadline = expected arrival time + delay bound. If a source sends faster than its contract, the delay bound does not apply.
- Each packet gets a hard delay bound, independent of its bandwidth requirement — but the reservation is at the connection's peak rate.
- Implementation requires per-connection state and a priority queue.

Slide 47: Rate-Controlled Scheduling
- A class of disciplines with two components: a regulator and a scheduler.
- Incoming packets are placed in the regulator, where they wait to become eligible; then they are handed to the scheduler.
- The regulator shapes the traffic; the scheduler provides performance guarantees.
- Considered impractical; interest waned as QoS declined.

Slide 48: Examples
- Recall: a rate-jitter regulator bounds the maximum outgoing rate; a delay-jitter regulator compensates for variable delay at the previous hop.
- Rate-jitter regulator + FIFO: similar to Delay-EDD.
- Rate-jitter regulator + multi-priority FIFO: gives both bandwidth and delay guarantees (RCSP).
- Delay-jitter regulator + EDD: gives bandwidth, delay, and delay-jitter bounds (Jitter-EDD).

Slide 49: Stateful Solution Complexity
- Data path: per-flow classification, per-flow buffer management, per-flow scheduling.
- Control path: install and maintain per-flow state for both the data and control paths.
(Figure: a classifier feeds per-flow queues (flow 1 ... flow n), buffer management, and a scheduler on the output interface, all keyed by per-flow state.)

Slide 50: Differentiated Services Model
- Edge routers: traffic conditioning (policing, marking, dropping) and SLA negotiation; set values in the DS byte of the IP header based on the negotiated service and the observed traffic.
- Interior routers: traffic classification and forwarding (near-stateless core!); use the DS byte as an index into the forwarding table.
(Figure: ingress edge router, interior routers, egress edge router.)

Slide 51: Diffserv Architecture
- Edge router: per-flow traffic management; marks packets as in-profile or out-of-profile (e.g., against an (r, b) token-bucket profile).
- Core router: per-class traffic management; buffering and scheduling based on the marking applied at the edge; preference given to in-profile packets; Assured Forwarding scheduling.

Slide 52: Diffserv: Implementation
- Classify flows into classes; maintain only per-class queues; perform FIFO within each class.
- Avoids the "curse of dimensionality."

Slide 53: Diffserv
- A framework for providing differentiated QoS: set the Type of Service (ToS) bits in packet headers to classify packets into classes; routers maintain per-class queues; condition traffic at the network edges to conform to class requirements.
- May still need queue management inside the network.

Slide 54: Network Processors (NPUs)
(Slides from Raj Yavatkar.)

Slide 55: CPUs vs. NPUs
- What makes a CPU appealing for a PC: flexibility (supports many applications); time to market (allows quick introduction of new applications); future-proofing (supports as-yet-unthought-of applications).
- No one would consider using fixed-function ASICs for a PC.

Slide 56: Why NPUs Seem Like a Good Idea
What makes an NPU appealing:
- Time to market: saves the ~18 months of building an ASIC; code re-use.
- Flexibility: protocols and standards change.
- Future proof: new protocols emerge.
- Less risk: bugs are more easily fixed in software.
- Surely no one would consider using fixed-function ASICs for new networking equipment?

Slide 57: The Other Side of the NPU Debate
- Jack of all trades, master of none.
- NPUs are difficult to program.
- NPUs inevitably consume more power, run more slowly, and cost more than an ASIC.
- They require domain expertise — and why would a networking vendor educate its suppliers?
- Designed for computation rather than memory-intensive operations.

Slide 58: NPU Characteristics
- NPUs try hard to hide memory latency.
- Conventional caching doesn't work: roughly equal numbers of reads and writes; no temporal or spatial locality; cache misses lose throughput, confuse schedulers, and break pipelines.
- Therefore it is common to use multiple processors with multiple contexts.

Slide 59: Network Processors: Load Balancing
(Figure: a dispatch unit feeds incoming packets to CPUs with caches, off-chip memory, and dedicated hardware support, e.g., for lookups.)
- Incoming packets are dispatched to: (1) an idle processor; or (2) the processor dedicated to packets of this flow (to prevent mis-sequencing); or (3) a special-purpose processor for the flow, e.g., security, transcoding, or application-level processing.

Slide 60: Network Processors: Pipelining
(Figure: a pipeline of CPUs, each with its own cache, off-chip memory, and dedicated hardware support, e.g., for lookups.)
- Processing is broken down into (hopefully balanced) steps; each processor performs one step of the processing.

Slide 61: NPUs and Memory
- Packet processing is all about getting packets into and out of a chip and memory; computation is a side issue.
- Memory speed is everything: speed matters more than size.

Slide 62: NPUs and Memory (contd.)
- Memory uses: buffer memory, lookup tables, counters, schedule state, classification tables, program data, instruction code.
- A typical NPU or packet processor has 8-64 CPUs, ~12 memory interfaces, and ~2000 pins.

Slide 63: Intel IXP Network Processors
- Microengines: RISC processors optimized for packet processing, with hardware support for multi-threading; they implement the fast path.
- Embedded StrongARM/XScale: runs an embedded OS and handles exception tasks; the slow path and control plane.
(Figure: microengines ME 1 ... ME n, StrongARM, SRAM, DRAM, media/fabric interface, control processor.)

Slide 64: NPU Building Blocks: Processors

Slide 65: Division of Functions

Slide 66: NPU Building Blocks: Memory

Slide 67: Memory Scaling

Slide 68: Memory Types

Slide 69: NPU Building Blocks: CAM and Ternary CAM
(Figure: CAM operation; ternary CAM (T-CAM).)

Slide 70: Memory Caching vs. CAM
(Figure: cache vs. content-addressable memory (CAM).)

Slide 71: Ternary CAMs
(Figure: using T-CAMs for classification — rules R1 ... R5, each a value/mask pair in associative memory; a priority encoder selects the matching entry's next hop.)
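A T-CAM lookup can be modeled in software as value/mask pairs checked in priority order, with the first match winning; the entries below are hypothetical 8-bit examples, and the hardware does all comparisons in parallel rather than scanning.

```python
def tcam_lookup(entries, key):
    """entries: list of (value, mask, action), highest priority first.
    A T-CAM compares `key` against every entry in parallel and a priority
    encoder returns the first hit; we model that with a sequential scan."""
    for value, mask, action in entries:
        if key & mask == value & mask:
            return action
    return None

# hypothetical 8-bit "destination address" entries
entries = [
    (0b1010_0000, 0b1111_0000, "next-hop-1"),  # matches 1010xxxx
    (0b1000_0000, 0b1100_0000, "next-hop-2"),  # matches 10xxxxxx
    (0b0000_0000, 0b0000_0000, "default"),     # wildcard (mask of zeros)
]
print(tcam_lookup(entries, 0b1010_1111))  # next-hop-1 (most specific)
print(tcam_lookup(entries, 0b1011_0000))  # next-hop-2
print(tcam_lookup(entries, 0b0101_0101))  # default
```

Because the mask can "don't-care" arbitrary bit positions, a T-CAM handles multi-field classification rules directly, not just prefixes.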

Slide 72: IXP: A Building Block for Network Systems
- Example: IXP2800. 16 microengines + an XScale core; ME speeds up to 1.4 GHz; 8 hardware threads per ME; 4K control store per ME; multi-level memory hierarchy; multiple inter-processor communication channels.
- NPU vs. GPU trade-offs: reduced core complexity; no hardware caching; simpler instructions => shallow pipelines; multiple cores with hardware multi-threading per chip.
(Figure: 16 MEv2 microengines, RDRAM controller, Intel XScale core, media switch fabric I/F, PCI, QDR SRAM controller, scratch memory, hash unit; a multi-threaded (x8) microengine array with per-engine memory, CAM, signals, and interconnect.)

Slide 73: IXP2800 Features
- Half-duplex OC-192 / 10 Gb/s Ethernet network processor.
- XScale core: 700 MHz (half the ME clock); 32 KB instruction cache and 32 KB data cache.
- Media/switch fabric interface: 2 x 16-bit LVDS transmit and receive; configured as CSIX-L2 or SPI-4.
- PCI interface: 64-bit / 66 MHz interface for control; 3 DMA channels.
- QDR interface (with parity): four 36-bit SRAM channels (QDR or co-processor); Network Processor Forum LookAside-1 standard interface; using a "clamshell" topology, both memory and a co-processor can be instantiated on the same channel.
- RDRAM interface: three independent Direct Rambus DRAM interfaces; supports 4 banks or 16 interleaved banks; supports 16/32-byte bursts.

Slide 74: Hardware Features to Ease Packet Processing
- Ring buffers: for inter-block communication/synchronization (producer-consumer paradigm).
- Next-neighbor registers and signaling: allow single-cycle transfer of context to the next logical microengine, dramatically improving performance; simple, easy transfer of state.
- Distributed data caching within each microengine: allows all threads to keep processing even when multiple threads are accessing the same data.

Slide 75: XScale Core Processor
- Compliant with the ARM V5TE architecture: support for ARM's Thumb instructions; support for Digital Signal Processing (DSP) enhancements to the instruction set.
- Intel's improvements to the internal pipeline improve the core's memory-latency-hiding abilities.
- Does not implement the floating-point instructions of the ARM V5 instruction set.

Slide 76: Microengines: RISC Processors
- The IXP2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster).
- The ME instruction set is specifically tuned for processing network data.
- 40-bit x 4K control store.
- Six-stage pipeline; on average an instruction takes one cycle to execute.
- Each ME has eight hardware-assisted threads of execution and can be configured to use either all eight threads or only four.
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order.

Slide 77: MicroEngine v2
(Figure: ME datapath — 128-entry GPR banks, 4K-instruction control store, 640-word local memory, 128 next-neighbor registers, 128-entry S and D transfer-in/out registers, local CSRs, CRC unit, 32-bit execution datapath with multiply, find-first-bit, and add/shift/logical units, pseudo-random number generator, 16-entry CAM with status and LRU logic, timers and timestamp, and the S/D push/pull buses to and from neighboring MEs.)

Slide 78: Registers Available to Each ME
- Four types of registers: general-purpose, SRAM transfer, DRAM transfer, and next-neighbor (NN).
- 256 32-bit GPRs, accessible in thread-local or absolute mode.
- 256 32-bit SRAM transfer registers, used to read/write all functional units on the IXP2xxx except the DRAM.
- 256 32-bit DRAM transfer registers, divided equally into read-only and write-only, used exclusively for communication between the MEs and the DRAM.
- Benefit of separate transfer registers and GPRs: an ME can continue processing with its GPRs while other functional units read and write the transfer registers.

Slide 79: Different Types of Memory

Type of memory   | Logical width (bytes) | Size (bytes) | Approx. unloaded latency (cycles) | Special notes
Local to ME      | 4                     | 2560         | 3                                 | Indexed addressing, post-incr/decr
On-chip scratch  | 4                     | 16K          | 60                                | Atomic ops; 16 rings with atomic get/put
SRAM             | 4                     | 256M         | 150                               | Atomic ops; 64-element queue array
DRAM             | 8                     | 2G           | 300                               | Direct path to/from MSF

Slide 80: IXA Software Framework
(Figure: the microengine pipeline runs microblocks built from the microblock, utility, and protocol libraries, written in Microengine C; the XScale core runs core components, the core-component library, the resource manager library, and the Control Plane PDK with control-plane protocol stacks, in C/C++; external processors sit behind a hardware abstraction library.)

Slide 81: Microengine C Compiler
- C language constructs: basic types, pointers, bit fields.
- In-line assembly code support.
- Aggregates: structs, unions, arrays.

Slide 82: What is a Microblock?
- Data-plane packet processing on the microengines is divided into logical functions called microblocks.
- Microblocks are coarse-grained and stateful; examples: 5-tuple classification, IPv4 forwarding, NAT.
- Several microblocks running on a microengine thread can be combined into a microblock group, which has a dispatch loop defining the dataflow for packets between microblocks; a microblock group runs on each thread of one or more microengines.
- Microblocks can send and receive packets to/from an associated XScale core component.

Slide 83: Core Components and Microblocks
(Figure: user-written code and Intel/3rd-party microblocks from the microblock library run on the microengines; core components, core libraries, the core-component library, and the resource manager library run on the XScale core.)

Slide 84: Debate About Network Processors
- The nail (packet-processing characteristics): (1) stream processing; (2) multiple flows; (3) most processing is on the header, not the data; (4) two sets of data: packets and context; (5) packets have no temporal locality and special spatial locality; (6) context has both temporal and spatial locality.
- The hammer (conventional-processor characteristics): (1) shared in/out bus; (2) optimized for data with spatial and temporal locality; (3) optimized for register accesses.