Framework for Supporting Multi-Service Edge Packet Processing on Network Processors
Arun Raghunath, Aaron Kunze, Erik J. Johnson (Intel Research and Development)
Vinod Balakrishnan (Openwave Systems Inc.)
ANCS 2005
Agenda: Overview · Mechanisms · Monitoring · Policy · Resource allocation · Results · Conclusion

Overview: Problem
- Edge routers need to support a sophisticated set of services
- How to best use the numerous hardware resources that network processors provide: cores, multiple memory levels, inter-core queuing, crypto assists
- Workloads fluctuate over time
Problem: workload variations
- Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II; ACM SIGCOMM CCR vol. 34, issue 1, January 2004
- Trace: http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html
- Location: network edge in front of a group of Internet clients; duration: 5 days
- [Figure: average HTTP data rate over the trace, showing large fluctuations over time]
- There is no representative workload!
Problem (continued)
- Edge routers need to support large sets of sophisticated services
- How to best use the numerous hardware resources that network processors provide: cores, multiple memory levels, inter-core queuing, crypto assists
- Workloads fluctuate over time, and there is no representative workload
- Systems are usually over-provisioned to handle the worst case
- Run-time adaptation: the ability to change the mapping of services to hardware resources
Adaptation opportunities
- [Figure: three IXP2400 layouts, each showing 16 MEv2 microengines plus the Intel XScale® core with services mapped onto them]
- Ex. 1: change the allocation between IPv6 and IPv4 compression-and-forwarding to increase an individual service's performance
- Ex. 2: support a large set of services in the "fast path", according to use (e.g., adding VPN encrypt/decrypt alongside IPv4/IPv6 compression and forwarding)
- Ex. 3: power down unneeded processors
Theory of operation
- [Figure: run-time system architecture. Executable binaries for services A, B, and C are mapped onto the XScale core and microengines (MEs) according to the traffic mix. A System Monitor observes queue info; the run-time system checkpoints processors, and a Linker binds resources through the Resource Abstraction Layer (RAL) to produce a new resource mapping.]
Rate-based monitoring
- Observe the queue between two pipeline stages; arrival and departure rates are indicative of processing needs
- Notation: Q_size = queue size; R_arr = current arrival rate; R_dep = current departure rate; R_worst = worst-case arrival rate; t_sw = time to switch on a core
- Assumption: R_dep scales linearly, so for a stage running on n cores, R_dep = n * R_dep1 (where R_dep1 is the departure rate with a single core)
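The rate-based monitor above can be sketched as follows. This is an illustrative model, not the paper's implementation: the class and parameter names are assumptions, and rates are derived by sampling the queue's cumulative enqueue/dequeue counters at a fixed interval.

```python
# Hypothetical sketch of a rate-based queue monitor. Arrival and departure
# rates (R_arr, R_dep) are estimated from the change in cumulative
# enqueue/dequeue counters over one sampling interval.

class QueueMonitor:
    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.prev_enq = 0   # cumulative packets enqueued at last sample
        self.prev_deq = 0   # cumulative packets dequeued at last sample

    def sample(self, enq_count, deq_count):
        """Return (R_arr, R_dep) in packets/second for the last interval."""
        r_arr = (enq_count - self.prev_enq) / self.interval_s
        r_dep = (deq_count - self.prev_deq) / self.interval_s
        self.prev_enq, self.prev_deq = enq_count, deq_count
        return r_arr, r_dep
```

A monitor like this would run on the XScale core, polling hardware queue counters once per interval.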
Allocation policy
- Number of cores needed: NumCores(R_arr) = R_arr / R_dep1
- If R_arr = R_worst, the system moves directly to the worst-case-provisioned state; otherwise, request cores only as needed
- If R_arr >> R_dep, request allocation of processors immediately; how many is a function of R_arr / R_dep1
- If R_arr is only slightly larger than R_dep, let the queue grow until it reaches Q_adapt, then request allocation of one processor
- Q_adapt: buffer space reserved to absorb the worst-case burst
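The allocation decision can be sketched as below. This is a hedged reconstruction of the policy: the function names, the `surge_factor` threshold used to decide when R_arr is "much larger" than R_dep, and the exact request sizes are assumptions, not values from the paper.

```python
import math

# Illustrative sketch of the allocation policy. R_dep1 is the measured
# per-core departure rate; Q_adapt is the queue-depth threshold at which
# a slightly-overloaded stage requests one more core.

def cores_needed(r_arr, r_dep1):
    return math.ceil(r_arr / r_dep1)

def allocation_request(r_arr, r_dep, r_dep1, q_len, q_adapt, cur_cores,
                       surge_factor=2.0):
    """Return how many extra cores to request right now (0 if none)."""
    if r_arr >= surge_factor * r_dep:
        # Large mismatch: jump straight to the target allocation.
        return max(0, cores_needed(r_arr, r_dep1) - cur_cores)
    if r_arr > r_dep and q_len >= q_adapt:
        # Slight overload: let the queue absorb it up to Q_adapt,
        # then ask for exactly one more core.
        return 1
    return 0
```

The two-tier structure mirrors the slide: a big rate mismatch triggers an immediate multi-core request, while a small one is smoothed through the queue before any request is made.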
De-allocation policy
- While increasing the allocation, latch R_dep1 (the single-core departure rate)
- If R_arr / R_dep1 < current allocation, request de-allocation of one core
- Hysteresis: wait for some cycles before requesting de-allocation again; this avoids fluctuations on transient dips in the arrival rate
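The de-allocation rule with hysteresis might look like the following sketch; the class name and the hold-off mechanism's shape are assumptions, while the release condition (R_arr / R_dep1 below the current allocation) comes from the slide.

```python
# Hedged sketch of the de-allocation policy with hysteresis. After
# releasing a core, the policy waits `holdoff_intervals` monitoring
# intervals before it will release another.

class Deallocator:
    def __init__(self, holdoff_intervals):
        self.holdoff = holdoff_intervals
        self.cooldown = 0   # intervals remaining before next release

    def should_release_core(self, r_arr, r_dep1, cur_cores):
        if self.cooldown > 0:
            self.cooldown -= 1          # still in the hysteresis window
            return False
        if cur_cores > 1 and r_arr / r_dep1 < cur_cores:
            self.cooldown = self.holdoff
            return True                 # release exactly one core
        return False
```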
Theory of operation (with resource allocator)
- [Figure: the same run-time system diagram as before, now with a Resource Allocator that receives triggers from the System Monitor and drives the Linker to change the resource mapping through the RAL.]
Resource allocator
- Handles requests for allocation/de-allocation from individual stages
- Aware of global system state; decides the specific processor to allocate or free
- When no free processors are available, decides whether to de-allocate or migrate a stage, or whether the request is declined
- Steals a core only when the victim stage's arrival rate is lower than the requesting stage's arrival rate
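A minimal sketch of the allocator's core-selection decision, assuming illustrative data structures (per-stage arrival rates and per-stage core lists; none of these names are from the paper):

```python
# Illustrative global-allocator sketch. With no free cores, a core is
# stolen from the least-loaded other stage, but only if that stage's
# arrival rate is below the requester's; otherwise the request is declined.

def pick_core(free_cores, stage_rates, stage_cores, requester):
    """Return a core id to give to `requester`, or None to decline."""
    if free_cores:
        return free_cores[0]
    # No free cores: consider stealing from another stage.
    victims = [(rate, s) for s, rate in stage_rates.items()
               if s != requester and stage_cores[s]]
    if not victims:
        return None
    rate, victim = min(victims)
    if rate < stage_rates[requester]:
        return stage_cores[victim][-1]   # migrate the victim off this core
    return None                          # decline the request
```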
System evaluation
- [Figure: the run-time system diagram repeated to frame the evaluation: traffic mix → System Monitor → Resource Allocator → Linker → resource mapping via the RAL.]
Experimental setup
- Radisys, Inc. ENP-2611* board
- 600 MHz Intel® IXP2400 processor
- MontaVista Linux*
- 3 optical Gigabit Ethernet ports
- IXIA* traffic generator for packet stimulus
* Third-party brands/names are the property of their respective owners
Adaptation costs
- Overhead due to function calls into the resource abstraction layer: 14% performance degradation when processing minimum-size packets at line rate
- Overall adaptation time = binding time + (checkpointing and loading time * number of cores)
- Cumulative effect: ~100 ms, dominated by the cost of the binding mechanism
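The cost formula above can be written out directly. The per-step costs in the test are made-up placeholders for illustration; the paper reports only the ~100 ms total, dominated by binding.

```python
# Direct transcription of the adaptation-time formula from the slide:
# total = binding time + (checkpoint-and-load time per core * cores moved).

def adaptation_time_ms(bind_ms, ckpt_load_ms, num_cores):
    return bind_ms + ckpt_load_ms * num_cores
```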
Adaptation benefits: testing methodology
- Need to measure the ability of the system to handle long-term workload variations
- Systems compared: a static system (profile-driven compilation) vs. the adaptive system
Testing methodology: Layer 3 switching application
- [Figure: traffic with varying mixes of L2-bridged and L3-forwarded packets drives both a statically compiled binary (produced by a profile compiler) and the adaptive system, mapped onto the IXP2400 microengines; system performance is compared.]
- Pipeline: Rx → L2 classifier → {L3 forwarder | L2 bridge} → Ethernet encapsulation → Tx
Benefits of run-time adaptation
- [Figure: performance of the static vs. adaptive system across traffic mixes ranging from 0%/100% to 100%/0% of the two packet classes. Source: Intel]
- Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Future work
- Study the ability of an adaptive system to handle short-term fluctuations: would it drop more packets than a non-adaptive system?
- Enable flow-aware run-time adaptation
- Explore more sophisticated resource allocation algorithms that support properties like fairness and performance guarantees
Related work
- Ease of programming: NP-Click (N. Shah et al., NP-2 workshop, 2003); Nova (L. George, M. Blume, ACM SIGPLAN 2003); auto-partitioning programming model (Intel whitepaper, 2003)
- Dynamic extensibility: Router Plugins (D. Decasper et al., SIGCOMM 1998); PromethOS (R. Keller et al., IWAN 2002); VERA (S. Karlin, L. Peterson, Computer Networks, 2002); NetBind (M. Kounavis et al., Software: Practice and Experience, 2004)
- Load balancing: ShaRE (R. Kokku, Ph.D. thesis, UT Austin, 2005)
Conclusion
- Run-time adaptation is an attractive approach for handling traffic fluctuations
- Implemented a framework capable of adapting the processing cores allocated to network services
- Implemented a policy that automatically balances the service pipeline and overcomes the code-store limitation of fixed-control-store processor cores
Background (backup slides)
Checkpointing: leveraging domain characteristics
- Finding the best checkpoint is easier in packet processing than in general domains
- Characteristics of data-flow applications: typically implemented as a dispatch loop; the dispatch loop executes at high frequency; the top of the dispatch loop carries no stack state
- Since the compiler creates the dispatch loop, the compiler inserts the checkpoints into the code
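The dispatch-loop checkpoint can be sketched as below. All names here are illustrative; the point is only the structure: the checkpoint test sits at the top of the loop, where no stack state is live, so the run-time can safely stop the stage there and resume it on another core.

```python
# Hedged sketch of a dispatch loop with a compiler-inserted checkpoint at
# its top. Returning from the loop at the checkpoint models yielding the
# core back to the run-time system with no live stack state to save.

def dispatch_loop(rx_queue, handlers, runtime):
    processed = 0
    while rx_queue:
        if runtime.checkpoint_requested():
            runtime.save_state()     # safe point: empty stack
            return processed         # yield the core to the run-time
        pkt = rx_queue.pop(0)
        handlers[pkt["proto"]](pkt)  # per-packet processing
        processed += 1
    return processed
```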
Why have binding?
- [Figure: two microengine layouts for stages A and B; once A and B land on neighboring microengines, they can communicate over next-neighbor (NN) rings and use local locks.]
- Want to be able to use the fastest implementations of resources available
Binding
- Goal: use the fastest implementations of resources available
- Resource abstraction: programmers write to abstract resources (packet channels, uniform memory, locks, etc.)
- Must have little impact on run-time performance
- Our approach: adaptation-time linking
Resource binding approach: adaptation-time linking
- A microengine-based example: application .o file + RAL .o file → final .o file
- The RAL .o file provides multiple implementations of each resource; RAL calls in the application code are initially undefined
- The linker adjusts jump targets using the import-variable mechanism
- At run time, the RTS has both the application .o file and the RAL .o file; the linking process is repeated after each adaptation
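A toy model of the adaptation-time linking step, under stated assumptions: the data structures (call-site addresses keyed by abstract RAL operation, offsets per implementation) are invented for illustration and do not reflect the real microengine object format.

```python
# Toy model of adaptation-time linking. "Linking" patches each unresolved
# call site's jump target to the offset of the RAL implementation chosen
# for the current core placement (import-variable style fix-up).

def link(app_callsites, ral_offsets, chosen_impl):
    """Return {callsite_addr: target_offset} for the chosen implementations."""
    patched = {}
    for addr, ral_op in app_callsites.items():
        impl = chosen_impl[ral_op]          # e.g. "nn_ring" vs "scratch_ring"
        patched[addr] = ral_offsets[impl]
    return patched
```

Re-running this after each adaptation is what lets neighboring stages switch to, say, next-neighbor rings when the new placement allows it.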
Binding: the value of choosing the right resource implementation on the Intel® IXP2400 processor

  Implementation       # S-push/S-pull bytes   % S-push/S-pull bandwidth
  Next-neighbor        0                       0%
  Scratchpad ring      4                       0.47%
  SRAM ring w/ stats   68                      7.9%

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance.
Problem domain
- [Figure: enterprise LAN ↔ access network ↔ MAN/WAN, with edge services deployed in between.]
- Edge services: VPN gateway, firewall, intrusion detection, forwarding, switching, XML & SSL acceleration, L4-L7 switching, application acceleration, compression, monitoring (billing, QoS)
Determining Q_adapt and the monitoring interval
- [Figure: the queue buffer, sized to handle the worst burst with n cores, is divided into: buffer space to handle the worst burst with n+1 cores, queue fill-up while a core comes online, and Q_adapt.]
- Want to maximize Q_adapt
- Q_adapt is a function of the queue monitoring interval
- Theoretical maximum Q_adapt is reached when queue depth can be detected instantaneously
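The sizing in the figure can be written out as arithmetic. This derivation is reconstructed from the slide's picture and is hedged accordingly: Q_adapt is the slack left in a worst-case-sized buffer after reserving worst-burst space for n+1 cores plus the fill-up that accumulates at rate (R_arr - R_dep) during the switch-on time t_sw.

```python
# Reconstructed sketch of the Q_adapt sizing (quantities are illustrative).
# buf_worst_n  : buffer space to handle the worst burst with n cores
# buf_worst_n1 : buffer space to handle the worst burst with n+1 cores
# t_sw         : time to switch on a core

def q_adapt(buf_worst_n, buf_worst_n1, r_arr, r_dep, t_sw):
    fill_during_switch = max(0.0, (r_arr - r_dep) * t_sw)
    return buf_worst_n - buf_worst_n1 - fill_during_switch
```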