Framework For Supporting Multi-Service Edge Packet Processing On Network Processors Arun Raghunath, Aaron Kunze, Erik J. Johnson Intel Research and Development.

Presentation transcript:

Slide 1: Framework for Supporting Multi-Service Edge Packet Processing on Network Processors
Arun Raghunath, Aaron Kunze, Erik J. Johnson, Intel Research and Development
Vinod Balakrishnan, Openwave Systems Inc.
ANCS 2005

Navigation: Overview | Mechanisms | Monitoring | Policy | Resource allocation | Results | Conclusion

Slide 2: Problem (Overview)
- Edge routers need to support a sophisticated set of services
- How to best use the numerous hardware resources that network processors provide
  - Cores, multiple memory levels, inter-core queuing, crypto assists
- Workloads fluctuate over time

Slide 3: Problem: workload variations (Overview)
Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II, vol. 34, issue 1, January.
[Figure: average http_data rate over time. Location: network edge in front of a group of Internet clients. Duration: 5 days.]
There is no representative workload!

Slide 4: Problem (Overview)
- Edge routers need to support large sets of sophisticated services
- How to best use the numerous hardware resources that network processors provide
  - Cores, multiple memory levels, inter-core queuing, crypto assists
- Workloads fluctuate over time
- There is no representative workload
- Usually over-provision to handle the worst case
Run-time adaptation: the ability to change the mapping of services to hardware resources

Slide 5: Adaptation Opportunities (Overview)
[Diagrams: an Intel IXP2400-style layout with 16 MEv2 microengines and an Intel XScale® core, running IPv6 and IPv4 compression-and-forwarding services, plus VPN encrypt/decrypt in Ex. 2]
- Ex. 1: Change allocation to increase individual service performance
- Ex. 2: Support a large set of services in the "fast path", according to use
- Ex. 3: Power down unneeded processors

Slide 6: Theory of Operation (Overview)
[Diagram: a traffic mix of services A, B, and C mapped onto the 16-microengine / XScale® system. Executable binaries for each service exist in both XScale and ME versions. A run-time system built on a Resource Abstraction Layer (RAL) takes queue information from a system monitor and a resource mapping from the linker; adaptation checkpoints processors and binds resources.]

Slide 7: Rate-based Monitoring (Monitoring)
- Observe the queue between two stages
- Arrival/departure rates are indicative of processing needs
Definitions:
- R_arr = current arrival rate
- R_dep = current departure rate
- R_worst = worst-case arrival rate
- t_sw = time to switch on a core
- Q_size = queue size
Assumption: R_dep scales linearly, so for a stage running on n cores, R_dep = n * R_dep1.
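The monitoring step described on this slide can be sketched in Python. All names here are hypothetical stand-ins; on the real IXP2400 the monitor would read hardware queue counters rather than Python arguments:

```python
def rates(enq_count, deq_count, prev_enq, prev_deq, interval_s):
    """Estimate arrival/departure rates from cumulative queue counters.

    enq_count/deq_count: cumulative enqueue/dequeue counts now;
    prev_enq/prev_deq: the same counters one monitoring interval ago.
    Returns (R_arr, R_dep) in packets per second.
    """
    r_arr = (enq_count - prev_enq) / interval_s  # packets arriving per second
    r_dep = (deq_count - prev_deq) / interval_s  # packets serviced per second
    return r_arr, r_dep


def r_dep_for(n_cores, r_dep1):
    """Linear-scaling assumption from the slide: a stage on n cores
    departs packets at n * R_dep1, where R_dep1 is one core's rate."""
    return n_cores * r_dep1
```

The linear-scaling assumption is what lets the policy on the next slide convert a measured arrival rate directly into a core count.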

Slide 8: Allocation Policy (Policy)
- NumCores(R_arr) = R_arr / R_dep1
- If R_arr = R_worst, the system moves directly to the worst-case provisioned state
- Only request cores as needed
- If R_arr >> R_dep, request allocation of processors immediately
  - How many? A function of R_arr / R_dep1
- If R_arr is only slightly larger than R_dep, let the queue grow to Q_adapt, then request allocation of one processor
Q_adapt = buffer space needed to handle the worst burst
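The allocation policy on this slide can be written as a small decision function. This is a sketch under assumptions: the slide does not say how much larger "R_arr >> R_dep" is, so the 2x threshold below is invented for illustration, and all names are hypothetical:

```python
import math


def num_cores(r_arr, r_dep1):
    """NumCores(R_arr) = R_arr / R_dep1, rounded up to whole cores."""
    return math.ceil(r_arr / r_dep1)


def allocation_request(r_arr, r_dep, r_dep1, cur_cores, q_depth, q_adapt):
    """How many extra cores to request right now.

    Large overloads jump straight to NumCores(R_arr); slight overloads
    wait until the queue grows past Q_adapt, then request one core.
    """
    if r_arr > 2 * r_dep:                     # "R_arr >> R_dep" (threshold assumed)
        return num_cores(r_arr, r_dep1) - cur_cores
    if r_arr > r_dep and q_depth > q_adapt:   # slight overload, queue past Q_adapt
        return 1
    return 0
```

With R_arr = R_worst the first branch fires and the request moves the stage directly to its worst-case core count, matching the slide's claim.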

Slide 9: De-allocation Policy (Policy)
- While increasing allocation, latch R_dep1
- If R_arr / R_dep1 < current allocation, request de-allocation of one core
- Hysteresis: wait for some cycles before requesting de-allocation again
  - Avoids fluctuations on transient dips in the arrival rate
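The de-allocation rule and its hysteresis guard can be sketched the same way (hypothetical names; the hysteresis length is a tunable the slide leaves open):

```python
def deallocation_request(r_arr, r_dep1, cur_cores, cycles_since_last, hysteresis):
    """Return 1 to release one core, else 0.

    r_dep1 is the per-core departure rate latched while allocation was
    last increased. The hysteresis wait suppresses repeated releases on
    transient dips in the arrival rate.
    """
    if cycles_since_last < hysteresis:
        return 0                      # too soon since the last de-allocation
    if r_arr / r_dep1 < cur_cores:
        return 1                      # fewer cores suffice; give one back
    return 0
```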

Slide 10: Theory of Operation (Overview)
[Diagram: as on slide 6, with a Resource Allocator added: the system monitor's triggers feed the resource allocator, which drives the linker and the resource mapping.]

Slide 11: Resource Allocator (Resource allocation)
- Handles requests for allocation/de-allocation from individual stages
- Aware of global system state; decides:
  - which specific processor to allocate or free
  - whether to de-allocate or migrate a stage when no free processors are available
    - Steal only when the victim's arrival rate < the requesting stage's arrival rate
  - whether a request is declined
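The allocator's core-selection logic can be illustrated with a sketch. The data structures and tie-breaking are assumptions (the slide only states the stealing rule, not how the victim or core is picked):

```python
def choose_core(free_cores, stage_rates, requester):
    """Pick a core for `requester`, or decline.

    stage_rates maps stage -> (arrival_rate, [cores it holds]).
    Returns (core, victim_stage): a free core if one exists; otherwise
    steal from the lowest-arrival-rate stage, but only if its rate is
    below the requester's (the slide's stealing rule); else decline.
    """
    if free_cores:
        return free_cores[0], None                 # a free core exists
    victim = min((s for s in stage_rates if s != requester),
                 key=lambda s: stage_rates[s][0], default=None)
    if victim is not None and stage_rates[victim][0] < stage_rates[requester][0]:
        return stage_rates[victim][1][-1], victim  # steal one of victim's cores
    return None, None                              # request declined
```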

Slide 12: Theory of Operation: System Evaluation (Overview)
[Diagram: as on slide 10, showing the full run-time system: RAL, system monitor (queue info), resource allocator (triggers), linker, and resource mapping.]

Slide 13: Experimental Setup (Results)
- Radisys, Inc. ENP-2611*
- 600 MHz Intel® IXP2400 processor
- MontaVista Linux*
- 3 optical Gigabit Ethernet ports
- IXIA* traffic generator for packet stimulus
* Third-party brands/names are the property of their respective owners

Slide 14: Adaptation Costs (Results)
- Overhead due to function calls into the resource abstraction layer:
  - 14% performance degradation when processing minimum-size packets at line rate
- Overall adaptation time = binding time + (checkpointing and loading time * number of cores)
- Cumulative effect: ~100 ms, dominated by the cost of the binding mechanism
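The cost formula on this slide is easy to make concrete. The numbers below are hypothetical, chosen only so the binding term dominates and the total lands near the ~100 ms the slide reports; the paper does not give this breakdown:

```python
def adaptation_time_ms(binding_ms, checkpoint_load_ms, n_cores):
    """Overall adaptation time from the slide:
    binding time + (checkpointing-and-loading time * number of cores)."""
    return binding_ms + checkpoint_load_ms * n_cores


# Hypothetical breakdown: binding dominates, per-core work is small.
total = adaptation_time_ms(binding_ms=80.0, checkpoint_load_ms=5.0, n_cores=4)
```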

Slide 15: Adaptation Benefits: Testing Methodology (Results)
- Need to measure the system's ability to handle long-term workload variations
- Systems compared:
  - Static system (profile-driven compilation)
  - Adaptive system

Slide 16: Adaptation Benefits: Testing Methodology, Layer 3 Switching Application (Results)
[Diagram: a profile compiler produces a static binary mapping L2 bridge and L3 forwarder stages onto the microengines; traffic drives the system and performance is measured.]
Pipeline: Rx -> L2 classifier -> L3 forwarder / L2 bridge -> Ethernet encapsulation -> Tx

Slide 17: Benefits of Run-time Adaptation (Results)
[Figure: performance of the static vs. adaptive systems across traffic mixes of 0%/100%, 20%/80%, 40%/60%, 50%/50%, 60%/40%, 80%/20%, and 100%/0%]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Source: Intel

Slide 18: Future Work (Conclusion)
- Study the ability of an adaptive system to handle short-term fluctuations
  - Would it drop more packets than a non-adaptive system?
- Enable flow-aware run-time adaptation
- Explore more sophisticated resource allocation algorithms
  - Support properties like fairness and performance guarantees

Slide 19: Related Work (Conclusion)
- Ease of programming
  - NP-Click: N. Shah et al., NP-2 Workshop, 2003
  - Nova: L. George, M. Blume, ACM SIGPLAN 2003
  - Auto-partitioning programming model: Intel whitepaper, 2003
- Dynamic extensibility
  - Router plugins: D. Decasper et al., SIGCOMM 1998
  - PromethOS: R. Keller et al., IWAN 2002
  - VERA: S. Karlin, L. Peterson, Computer Networks, 2002
  - NetBind: M. Kounavis, Software: Practice and Experience, 2004
- Load balancing
  - ShaRE: R. Kokku, Ph.D. thesis, UT Austin, 2005

Slide 20: Conclusion
- Run-time adaptation is an attractive approach for handling traffic fluctuations
- Implemented a framework capable of adapting the processing cores allocated to network services
- Implemented a policy that:
  - Automatically balances the service pipeline
  - Overcomes the code-store limitation of fixed-control-store processor cores

Slide 21: Background

Slide 22: Checkpointing: Leveraging Domain Characteristics (Mechanisms)
- Finding the best checkpoint is easier in packet processing than in general domains
- Characteristics of data-flow applications:
  - Typically implemented as a dispatch loop
  - The dispatch loop executes at high frequency
  - The top of the dispatch loop carries no stack state
- Since the compiler creates the dispatch loop, the compiler inserts checkpoints into the code
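The dispatch-loop structure this slide relies on can be sketched as follows. All four callables are hypothetical stand-ins for compiler-generated code; the point is only where the checkpoint test sits:

```python
def run_stage(dequeue_packet, process, should_checkpoint, save_state):
    """Dispatch-loop sketch: the checkpoint test is at the top of the
    loop, where no stack state is live, so migrating the stage only
    requires saving its (empty) loop context before handing the core
    to another service."""
    while True:
        if should_checkpoint():   # compiler-inserted checkpoint
            save_state()
            return                # stage can now move to another core
        pkt = dequeue_packet()
        process(pkt)
```

Because the loop runs at high frequency, a raised checkpoint flag is observed within roughly one packet's processing time.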

Slide 23: Why Have Binding? (Mechanisms)
[Diagram: stages A and B mapped onto the microengine array in two different placements]
- Want to be able to use the fastest implementations of the resources available
- In the second placement, the stages can use next-neighbor (NN) rings and local locks

Slide 24: Binding (Mechanisms)
- Goal: use the fastest implementations of the resources available
- Resource abstraction:
  - Programmers write to abstract resources (packet channels, uniform memory, locks, etc.)
  - Must have little impact on run-time performance
- Our approach: adaptation-time linking

Slide 25: Resource Binding Approach: Adaptation-time Linking (Mechanisms)
A microengine-based example:
- At run time, the RTS has the application .o file and the RAL .o file
- RAL calls in the application code are initially undefined
- The linker adjusts jump targets (using the import-variable mechanism) to select among RAL implementations 0 through 6, producing the final .o file
- The process is repeated after each adaptation

Slide 26: Binding: The Value of Choosing the Right Resource Implementation on the Intel® IXP2400 Processor

Implementation     | S-push/S-pull bytes | % S-push/S-pull bandwidth
Next-neighbor      | 0                   | 0%
Scratchpad ring    | 4                   | 0.47%
SRAM ring w/ stats | 68                  | 7.9%

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

Slide 27: Problem Domain
[Diagram: an enterprise LAN connected through an access network to the MAN/WAN. Edge services shown: VPN gateway, firewall, intrusion detection, forwarding, switching, XML & SSL acceleration, L4-L7 switching, application acceleration, compression, and monitoring (billing, QoS).]

Slide 28: Determining Q_adapt and the Monitoring Interval (Policy)
[Diagram: a queue annotated with the buffer space needed to handle the worst burst with n cores vs. with n+1 cores, and the queue fill-up while a core comes online]
- Want to maximize Q_adapt
- Q_adapt is a function of the queue monitoring interval
- The theoretical maximum Q_adapt is reached when queue depth can be detected instantaneously
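One reading of this slide is that the adaptation threshold must leave room for the worst-case burst plus everything that accumulates while the monitor notices the queue (one monitoring interval) and the new core powers on (t_sw). The slide does not give an explicit formula, so the sketch below is an interpretation, not the paper's equation:

```python
def q_adapt(q_size, worst_burst, r_arr, r_dep, t_sw, t_monitor):
    """Sketch of one plausible Q_adapt bound.

    Leave headroom for the worst burst and for the packets that pile
    up (at rate R_arr - R_dep) during detection (t_monitor) plus core
    switch-on (t_sw). A shorter monitoring interval raises Q_adapt,
    matching the slide's instantaneous-detection maximum.
    """
    fill_during_switch = max(r_arr - r_dep, 0) * (t_sw + t_monitor)
    return q_size - worst_burst - fill_during_switch
```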