
1 Framework For Supporting Multi-Service Edge Packet Processing On Network Processors
Arun Raghunath, Aaron Kunze, Erik J. Johnson (Intel Research and Development); Vinod Balakrishnan (Openwave Systems Inc.). ANCS 2005

2 Problem
 Edge routers need to support a sophisticated set of services
 How best to use the numerous hardware resources that network processors provide?
 Cores, multiple memory levels, inter-core queuing, crypto assists
 Workloads fluctuate over time

3 Problem: Workload Variations
[Plot: HTTP traffic volume over time from the UCB Home-IP trace (http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html); location: network edge in front of a group of Internet clients; duration: 5 days]
There is no representative workload!
Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II (ACM SIGCOMM CCR, vol. 34, issue 1), January 2004

4 Problem
 Edge routers need to support large sets of sophisticated services
 How best to use the numerous hardware resources that network processors provide?
 Cores, multiple memory levels, inter-core queuing, crypto assists
 Workloads fluctuate over time
 There is no representative workload
 Systems are usually over-provisioned to handle the worst case
Run-time adaptation: the ability to change the mapping of services to hardware resources

5 Adaptation Opportunities
[Diagrams: an Intel XScale® core plus a grid of 16 MEv2 microengines, shown with services (IPv4/IPv6 compression and forwarding, VPN encrypt/decrypt) mapped onto them in different configurations]
 Ex. 1: Change the allocation to increase an individual service's performance
 Ex. 2: Support a large set of services in the "fast path", according to use
 Ex. 3: Power down unneeded processors

6 Theory of Operation
[Diagram: a traffic mix of services A, B, and C arrives; executable binaries for each service are mapped onto the Intel XScale® core and MEv2 microengines by a linker; the run-time system's monitor observes queue info, checkpoints processors, and binds resources through the Resource Abstraction Layer (RAL)]

7 Rate-based Monitoring
 Observe the queue between two pipeline stages
 Arrival and departure rates are indicative of processing needs
 Definitions: R_arr = current arrival rate; R_dep = current departure rate; R_worst = worst-case arrival rate; t_sw = time to switch on a core; Q_size = queue capacity
 Assumption: R_dep scales linearly with cores, so for a stage running on n cores, R_dep = n * R_dep1 (where R_dep1 is the departure rate on a single core)
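The rate definitions above map naturally onto per-queue counters. Below is a minimal sketch of how a monitor might derive R_arr and R_dep by sampling a stage's enqueue/dequeue counters once per monitoring interval; all names and the structure are illustrative assumptions, not the paper's code.

```c
/* Illustrative rate-based monitor: derives arrival/departure rates from
 * a queue's cumulative enqueue/dequeue counters, sampled periodically. */
#include <stdint.h>

struct stage_monitor {
    uint64_t last_enq;   /* enqueue counter at the previous sample  */
    uint64_t last_deq;   /* dequeue counter at the previous sample  */
    double   r_arr;      /* current arrival rate (packets/second)   */
    double   r_dep;      /* current departure rate (packets/second) */
};

/* Called once per monitoring interval of interval_s seconds. */
void monitor_sample(struct stage_monitor *m,
                    uint64_t enq, uint64_t deq, double interval_s)
{
    m->r_arr = (double)(enq - m->last_enq) / interval_s;
    m->r_dep = (double)(deq - m->last_deq) / interval_s;
    m->last_enq = enq;
    m->last_deq = deq;
}
```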

8 Allocation Policy
 Q_adapt: queue fill level that still leaves enough buffer space to handle the worst-case burst
 Number of cores needed: NumCores(R_arr) = R_arr / R_dep1
 If R_arr = R_worst, the system moves directly to the worst-case provisioned state
 Only request cores as needed:
 If R_arr >> R_dep, request allocation of processors immediately; how many is a function of R_arr / R_dep1
 If R_arr is only slightly larger than R_dep, let the queue grow to Q_adapt, then request allocation of one processor
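A sketch of that two-regime decision is below. The MUCH_GREATER_FACTOR threshold for "R_arr >> R_dep" and the q_adapt parameter are assumed tuning knobs; the slide does not specify their values.

```c
/* Sketch of the allocation policy: act immediately on a large rate gap,
 * otherwise wait for the queue to reach Q_adapt and request one core. */
#include <math.h>

#define MUCH_GREATER_FACTOR 2.0  /* assumed threshold for R_arr >> R_dep */

/* Returns how many additional cores to request (0 = none yet). */
int cores_to_request(double r_arr, double r_dep, double r_dep1,
                     int cur_cores, unsigned q_depth, unsigned q_adapt)
{
    int needed = (int)ceil(r_arr / r_dep1);      /* NumCores(R_arr) */

    if (r_arr > MUCH_GREATER_FACTOR * r_dep)     /* large gap: act now */
        return needed > cur_cores ? needed - cur_cores : 0;

    if (r_arr > r_dep && q_depth >= q_adapt)     /* small gap: wait for */
        return 1;                                /* queue to hit Q_adapt */

    return 0;
}
```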

9 De-allocation Policy
 While increasing the allocation, latch R_dep1
 If R_arr / R_dep1 < current allocation, request de-allocation of one core
 Hysteresis: wait for some cycles before requesting de-allocation again
 Avoids fluctuations on transient dips in the arrival rate
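The hysteresis can be expressed as a simple per-stage cooldown, as in the sketch below; HYSTERESIS_INTERVALS is an assumed constant, not a value from the paper.

```c
/* Sketch of the de-allocation policy with hysteresis. r_dep1 is the
 * single-core departure rate latched while the allocation was growing. */
#include <math.h>

#define HYSTERESIS_INTERVALS 10  /* monitoring intervals to wait (assumed) */

struct dealloc_state { int cooldown; };  /* per-stage hysteresis state */

/* Called once per monitoring interval; returns 1 to release one core. */
int should_release_core(struct dealloc_state *s,
                        double r_arr, double r_dep1, int cur_cores)
{
    if (s->cooldown > 0) {       /* still inside the hysteresis window */
        s->cooldown--;
        return 0;
    }
    if ((int)ceil(r_arr / r_dep1) < cur_cores) {
        s->cooldown = HYSTERESIS_INTERVALS;  /* transient dips can't retrigger */
        return 1;
    }
    return 0;
}
```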

10 Theory of Operation
[Diagram: as on slide 6, now with a Resource Allocator added; the system monitor's queue info drives triggers to the resource allocator, which directs the linker's resource mapping]

11 Resource Allocator
 Handles requests for allocation/de-allocation from individual stages
 Aware of global system state; decides
 which specific processor to allocate or free
 whether to de-allocate or migrate a stage when no free processors are available
 Steal a core only when the victim stage's arrival rate < the requesting stage's arrival rate
 whether a request is declined
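The stealing rule lends itself to a small victim-selection routine, sketched below. The struct layout and the tie-break (prefer the least-loaded eligible stage) are assumptions for illustration.

```c
/* Sketch of the allocator's victim selection when no free cores remain:
 * steal only from a stage whose arrival rate is below the requester's. */
struct stage {
    double r_arr;   /* current arrival rate     */
    int    cores;   /* cores currently assigned */
};

/* Returns the index of the stage to steal a core from, or -1 to decline. */
int pick_victim(const struct stage *stages, int n, int requester)
{
    int victim = -1;
    for (int i = 0; i < n; i++) {
        if (i == requester || stages[i].cores <= 1)
            continue;   /* never leave a stage with zero cores */
        if (stages[i].r_arr < stages[requester].r_arr &&
            (victim < 0 || stages[i].r_arr < stages[victim].r_arr))
            victim = i; /* prefer the least-loaded eligible stage */
    }
    return victim;      /* -1: decline the request instead */
}
```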

12 Theory of Operation: System Evaluation
[Diagram: the complete system from slide 10; the resource allocator's mapping decisions feed the linker, which re-binds the service binaries onto the Intel XScale® core and MEv2 microengines]

13 Experimental Setup
 Radisys, Inc. ENP-2611*
 600 MHz Intel® IXP2400 processor
 MontaVista Linux*
 3 optical Gigabit Ethernet ports
 IXIA* traffic generator for packet stimulus
* Third-party brands/names are the property of their respective owners

14 Adaptation Costs
 Overhead due to function calls into the resource abstraction layer
 14% performance degradation for processing minimum-size packets at line rate
 Overall adaptation time: binding time + (checkpointing and loading time * number of cores)
 Cumulative effect: ~100 ms
 Dominated by the cost of the binding mechanism

15 Adaptation Benefits: Testing Methodology
 Need to measure the system's ability to handle long-term workload variations
 Systems compared:
 Static system (profile-driven compilation)
 Adaptive system

16 Adaptation Benefits: Testing Methodology
 Layer 3 switching application: Rx → L2 classifier → {L3 forwarder, L2 bridge} → Ethernet encapsulation → Tx
[Diagram: a profile compiler produces a static binary that maps the L3 forwarder and L2 bridge stages onto the Intel XScale® core and MEv2 microengines; traffic is applied and system performance is measured]

17 Benefits of Run-time Adaptation
[Chart: system performance of the static vs. adaptive system across traffic mixes ranging from (0%, 100%) through (50%, 50%) to (100%, 0%)]
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Source: Intel

18 Future Work
 Study the ability of an adaptive system to handle short-term fluctuations
 Would it drop more packets than a non-adaptive system?
 Enable flow-aware run-time adaptation
 Explore more sophisticated resource allocation algorithms
 Support properties like fairness and performance guarantees

19 Related Work
 Ease of programming
 NP-Click: N. Shah et al., NP-2 Workshop, 2003
 Nova: L. George, M. Blume, ACM SIGPLAN, 2003
 Auto-Partitioning programming model: Intel whitepaper, 2003
 Dynamic extensibility
 Router Plugins: D. Decasper et al., SIGCOMM 1998
 PromethOS: R. Keller et al., IWAN 2002
 VERA: S. Karlin, L. Peterson, Computer Networks, 2002
 NetBind: M. Kounavis et al., Software: Practice and Experience, 2004
 Load balancing
 ShaRE: R. Kokku, Ph.D. thesis, UT Austin, 2005

20 Conclusion
 Run-time adaptation is an attractive approach for handling traffic fluctuations
 Implemented a framework capable of adapting the processing cores allocated to network services
 Implemented a policy that
 automatically balances the service pipeline
 overcomes the code-store limitation of fixed-control-store processor cores

21 Background

22 Checkpointing: Leveraging Domain Characteristics
 Finding the best checkpoint is easier in packet processing than in general domains
 Characteristics of data-flow applications:
 Typically implemented as a dispatch loop
 The dispatch loop executes at high frequency
 The top of the dispatch loop has no live stack state
 Since the compiler creates the dispatch loop, the compiler inserts checkpoints into the code
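A sketch of what such a compiler-inserted checkpoint could look like is below. checkpoint_requested(), save_context_and_yield(), and the packet-handling hooks are hypothetical runtime functions, not the framework's actual API.

```c
/* Sketch: a data-flow dispatch loop with a checkpoint at its top, where
 * no stack state is live, so saving the stage's context is cheap. */
extern int  checkpoint_requested(void);    /* hypothetical RTS hooks */
extern void save_context_and_yield(void);
extern int  next_packet(void **pkt);
extern void process_packet(void *pkt);

void dispatch_loop(void)
{
    void *pkt;
    for (;;) {
        /* compiler-inserted checkpoint: no live stack state here */
        if (checkpoint_requested())
            save_context_and_yield();
        if (next_packet(&pkt))
            process_packet(pkt);
    }
}
```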

23 Why Have Binding?
 Want to be able to use the fastest implementations of resources available
[Diagram: stages A and B mapped onto the microengine grid in two configurations; once A and B are placed on adjacent microengines, next-neighbor (NN) rings and local locks can be used]

24 Binding
 Goal: use the fastest implementations of resources available
 Resource abstraction
 Programmers write to abstract resources (packet channels, uniform memory, locks, etc.)
 Must have little impact on run-time performance
 Our approach: adaptation-time linking
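A hypothetical sketch of what such an abstract packet-channel interface might look like; the names and signatures are assumptions for illustration, not the framework's actual RAL API.

```c
/* Sketch of a Resource Abstraction Layer (RAL) interface: the stage code
 * is written against abstract channels, and the adaptation-time linker
 * later binds each call to the fastest implementation (next-neighbor
 * ring, scratchpad ring, or SRAM ring) for wherever the two
 * communicating stages happen to be placed. */
typedef int ral_channel_t;   /* opaque channel handle */

/* Initially undefined in the application .o file; the adaptation-time
 * linker fills in the jump target via the import-variable mechanism. */
int ral_channel_send(ral_channel_t ch, const void *pkt);
int ral_channel_recv(ral_channel_t ch, void **pkt);

/* A stage's code is oblivious to which implementation it gets: */
static inline void forward_one(ral_channel_t in, ral_channel_t out)
{
    void *pkt;
    if (ral_channel_recv(in, &pkt) == 0)
        ral_channel_send(out, pkt);
}
```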

25 Resource Binding Approach: Adaptation-time Linking (a microengine-based example)
 Inputs: application .o file and RAL .o file (RAL implementations 0 through 6); output: final .o file
 RAL calls in the application code are initially undefined
 At run time, the RTS has both the application .o file and the RAL .o file
 The linker adjusts jump targets using the import-variable mechanism
 The process is repeated after each adaptation

26 Binding: The Value of Choosing the Right Resource Implementation on the Intel® IXP2400 Processor

 Mechanism            S-push/S-pull bytes   % S-push/S-pull bandwidth
 Next-neighbor         0                     0%
 Scratchpad ring       4                     0.47%
 SRAM ring w/ stats   68                     7.9%

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.

27 Problem Domain
[Diagram: the network edge between an Enterprise LAN / Access Network and the MAN/WAN]
 Edge services: VPN gateway, firewall, intrusion detection, forwarding, switching, XML & SSL acceleration, L4-L7 switching, application acceleration, compression, monitoring (billing, QoS)

28 Determining Q_adapt and the Monitoring Interval
[Diagram: the inter-stage queue; Q_adapt marks the fill level that leaves enough buffer space to absorb the worst-case burst with n cores while an (n+1)th core comes online, including the queue fill-up during core bring-up]
 Want to maximize Q_adapt
 Q_adapt is a function of the queue monitoring interval
 The theoretical maximum Q_adapt is reached when queue depth can be detected instantaneously
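One way to make this concrete, as a sketch built from the slide-7 definitions rather than the paper's exact derivation: with n cores active, the queue fills at rate R_worst − n·R_dep1 during a worst-case burst, and the trigger threshold must leave headroom for the core switch-on time t_sw plus one monitoring interval T_mon (an assumed symbol) during which the overflow can go undetected:

```latex
Q_{\mathrm{adapt}} \;\le\; Q_{\mathrm{size}}
  \;-\; \left(R_{\mathrm{worst}} - n \cdot R_{\mathrm{dep1}}\right)
        \left(t_{\mathrm{sw}} + T_{\mathrm{mon}}\right)
```

Letting T_mon → 0 (instantaneous queue-depth detection) maximizes the right-hand side, which matches the slide's observation about the theoretical maximum Q_adapt.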

