
Slide 1: Infrastructure and Protocols for Dedicated Bandwidth Channels
Nagi Rao, Computer Science and Mathematics Division, Oak Ridge National Laboratory (raons@ornl.gov)
March 14, 2005
1st Annual Workshop of the Cyber Security and Information Infrastructure Research Group (CSIIR) and Information Operations Center (IOC), Oak Ridge, TN
Research sponsored by the Department of Energy, the National Science Foundation, and the Defense Advanced Research Projects Agency

Slide 2: Collaborators and Sponsors
Collaborators:
- Steven Carter, Oak Ridge National Laboratory
- Leon O. Chua, University of California at Berkeley
- Jianbo Gao, University of Florida
- Qishi Wu, Oak Ridge National Laboratory
- William Wing, Oak Ridge National Laboratory
Sponsors:
- Department of Energy, High-Performance Networking Program
- National Science Foundation, Advanced Network Infrastructure Program
- Defense Advanced Research Projects Agency, Network Modeling and Simulation Program
- Oak Ridge National Laboratory, Laboratory Directed R&D Program

Slide 3: Outline of Presentation
- Network Infrastructure Projects: DOE UltraScienceNet, NSF CHEETAH
- Dynamics and Control of Transport Protocols: TCP AIMD dynamics, analytical results, experimental results
- New Class of Protocols: throughput stabilization for control, transport protocol
- Probabilistic Quickest Path Problem: quickest path algorithm, probabilistic algorithm

Slide 4: Outline of Presentation
- Network Infrastructure Projects: DOE UltraScienceNet, NSF CHEETAH
- Dynamics and Control of Transport Protocols: TCP AIMD dynamics, analytical results, experimental results
- New Class of Protocols: throughput stabilization for control, transport protocol
- Probabilistic Quickest Path Problem: quickest path algorithm, probabilistic algorithm

Slide 5: Motivation for Networking Projects: Terascale Supernova Initiative (TSI)
A DOE large-scale science application.
- Science objective: understand supernova evolution
- DOE SciDAC project: ORNL and 8 universities
- Teams of field experts across the country collaborate on computations: experts in hydrodynamics, fusion energy, and high-energy physics
- Massive computational code: currently about a terabyte of data generated per day
- Data are archived at a nearby HPSS and visualized locally on clusters (archival data only)
Desired network capabilities:
- Archive and supply massive amounts of data to supercomputers and visualization engines
- Monitor, visualize, collaborate, and steer computations
(Figure: visualization channel, visualization control channel, steering channel)

Slide 6: DOE UltraScience Net: The Need
- DOE large-scale science applications on supercomputers and experimental facilities require high-performance networking: petabyte data sets, collaborative visualization, and computational steering
- Application areas span the disciplinary spectrum: high-energy physics, climate, astrophysics, fusion energy, genomics, and others
Promising solution:
- A high-bandwidth, agile network capable of providing on-demand dedicated channels, from 150 Mbps to multiple 10 Gbps
- Protocols are simpler for high-throughput and control channels
Challenges (several technologies need to be fully developed):
- User- and application-driven agile control plane: dynamic scheduling and provisioning; security (encryption, authentication, authorization)
- Protocols, middleware, and applications optimized for dedicated channels
Contacts: Bill Wing (wrw@ornl.gov), Nagi Rao (raons@ornl.gov)

Slide 7: DOE UltraScience Net
Connects ORNL, Chicago, Seattle, and Sunnyvale:
- Dynamically provisioned, dedicated dual 10 Gbps SONET links
- Proximity to several DOE locations: SNS, NLCF, FNL, ANL, NERSC
- Peering with ESnet, NSF CHEETAH, and other networks
Data-plane user connections:
- Direct connections to core switches (SONET channels) and MSPPs (Ethernet channels)
- Utilize UltraScience Net hosts
Funded by the U.S. DOE High-Performance Networking Program at Oak Ridge National Laboratory: $4.5M for 3 years

Slide 8: Control Plane
Phase I:
- Centralized VPN connectivity
- TL1-based communication with core switches and MSPPs
- User access via a centralized web-based scheduler
Phase II:
- GMPLS direct enhancements and wrappers for TL1
- User access via GMPLS and the web to the bandwidth scheduler
- Inter-domain GMPLS-based interface
Web-based user interface and API:
- Allows users to log on to a website and request dedicated circuits
- Based on CGI scripts
Bandwidth scheduler:
- Computes a path with the target bandwidth ("is bandwidth available now?"): an extension of Dijkstra's algorithm (see the sketch below)
- Provides all available slots: an extension of the closed-semiring structure to sequences of reals
- Both are polynomial-time algorithms; GMPLS does not have this capability
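The slide does not spell out the scheduler's path computation; as a minimal sketch of the kind of Dijkstra extension it describes, the following finds a path whose bottleneck bandwidth meets a requested rate. The function names, graph layout, and example topology are illustrative assumptions, not the actual UltraScience Net code.

import heapq

def widest_path(graph, src, dst):
    """Dijkstra-style search that maximizes the bottleneck (minimum edge)
    bandwidth along a path.  graph[u] is a list of (v, bandwidth) pairs."""
    best = {src: float("inf")}          # best bottleneck bandwidth found so far
    prev = {}
    heap = [(-best[src], src)]          # max-heap via negated keys
    while heap:
        neg_bw, u = heapq.heappop(heap)
        bw_u = -neg_bw
        if bw_u < best.get(u, 0):
            continue                    # stale heap entry
        for v, bw_uv in graph.get(u, []):
            bottleneck = min(bw_u, bw_uv)
            if bottleneck > best.get(v, 0):
                best[v] = bottleneck
                prev[v] = u
                heapq.heappush(heap, (-bottleneck, v))
    if dst not in prev and dst != src:
        return None, 0
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1], best[dst]

def bandwidth_available_now(graph, src, dst, target_bw):
    """Answer the scheduler's 'is bandwidth available now?' question."""
    path, bottleneck = widest_path(graph, src, dst)
    return path if path and bottleneck >= target_bw else None

# Example: request a 5 Gbps channel on a toy 3-node topology.
net = {"ORNL": [("Chicago", 10)], "Chicago": [("Seattle", 5)], "Seattle": []}
print(bandwidth_available_now(net, "ORNL", "Seattle", 5))   # ['ORNL', 'Chicago', 'Seattle']

Scheduling a channel for a future time slot would additionally prune edges whose capacity is already reserved in that slot before running the same search.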

Slide 9: NSF CHEETAH: Circuit-switched High-speed End-to-End Transport ArcHitecture
Objective: develop the infrastructure and networking technologies to support a broad class of eScience projects, and specifically the Terascale Supernova Initiative.
Main technical components:
- Optical network testbed
- Transport protocols
- Middleware and applications
Collaborative project: $3.5M for 3 years; U. Virginia, ORNL, NC State, CUNY
Sponsor: National Science Foundation
Contacts: Malathi Veeraraghavan (mv@cs.virginia.edu), Nagi Rao (raons@ornl.gov)

Slide 10: CHEETAH Project Concept
Network:
- Create a network that offers on-demand, end-to-end dedicated bandwidth channels to applications
- Operate a PARALLEL network alongside existing high-speed IP networks: NOT an alternative
Transport protocols:
- Designed to take advantage of dedicated and dual end-to-end paths (an IP path plus a dedicated channel)
eScience application requirements:
- High-throughput file/data transfers
- Interactive remote visualization
- Remote computational steering
- Multipoint collaborative computation

Slide 11: CHEETAH: Initial Configuration
(Diagram: MSPPs at ORNL, NCSU, NCSU/MCNC/NLR, and Atlanta (NLR/SOX), each with OC192, control, and GbE/10GbE cards; GbE/10GbE Ethernet switches connect to hosts; a link to DC connects to DRAGON, which implements GMPLS protocols.)

Slide 12: Peering: UltraScience Net and CHEETAH
- Peering provides coast-to-coast dedicated channels
- Access to ORNL supercomputers and storage
- Applications: TSI on a larger scale

Slide 13: Outline of Presentation
- Network Infrastructure Projects: DOE UltraScienceNet, NSF CHEETAH
- Dynamics and Control of Transport Protocols: TCP AIMD dynamics, analytical results, experimental results
- New Class of Protocols: throughput stabilization for control, transport protocol
- Probabilistic Quickest Path Problem: quickest path algorithm, probabilistic algorithm

Slide 14: Transport Dynamics are Important
- Data transport: high bandwidth for large data transfers over dedicated channels; a suitable sending rate must be maintained to achieve effective throughput
- Control of end devices: remote control of visualizations, computations, and instruments; jittery dynamics will destabilize the control loops, making it impossible to effectively execute interactive simulations

Slide 15: Study of Transport Dynamics
Understanding of transport dynamics:
- Analytically showed that TCP-AIMD contains chaotic regimes (concept of the w-update map)
- Internet traces are shown to be both chaotic and stochastic; the underlying process is anomalous diffusion
Development and tuning of protocols:
- Protocols for stable flows of fixed rate (ONTCOU), based on the classical Robbins-Monro method
- Transport protocols with statistical stability (RUNAT), a combination of AIAD and the Kiefer-Wolfowitz method

Slide 16: Complicated TCP AIMD Dynamics: History
Simulation results: TCP-AIMD exhibits "complicated" trajectories
- TCP streams competing with each other (Veres and Boda, 2000)
- TCP competing with UDP (Rao and Chua, 2002)
Analytical results (Rao and Chua, 2002): TCP-AIMD has chaotic regimes; developed state-space analysis and Poincare maps
Internet measurements (2004): TCP-AIMD traces are a complicated mixture of stochastic and chaotic components
Working definition of chaotic trajectories:
- Nearby starting points result in trajectories that move far apart at a rate determined by a positive Lyapunov exponent
- Trajectories are non-periodic for some starting points
- The attractor is geometrically complicated

Slide 17: Simplified View: Dynamics of TCP
Transmission Control Protocol outline:
- Uses a window mechanism to send W bytes per RTT
- Dynamically adjusts W to the network and receiver state: keeps increasing W if there are no losses, shrinks W when losses are detected
- Slow-start phase: W increases exponentially until a threshold is reached or a loss occurs
- Congestion control: additively increase W as packets are delivered, multiplicatively decrease W on loss (see the sketch below)
(Figure: window trajectories over time; an early loss slows throughput.)
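To make the window dynamics above concrete, here is a minimal, self-contained sketch of slow start followed by AIMD. It is not the simulator used in the talk; the threshold, step sizes, and per-RTT Bernoulli loss model are illustrative assumptions.

import random

def tcp_window_trace(rtts=200, ssthresh=32.0, loss_prob=0.02, seed=1):
    """Toy slow-start + AIMD window evolution, one value of W per RTT.
    Loss is modeled as an independent Bernoulli event per RTT (an
    illustrative assumption, not a real network model)."""
    random.seed(seed)
    w, trace = 1.0, []
    for _ in range(rtts):
        trace.append(w)
        if random.random() < loss_prob:
            w = max(1.0, w / 2.0)        # multiplicative decrease on loss
            ssthresh = w                  # the next slow start ends here
        elif w < ssthresh:
            w *= 2.0                      # slow start: exponential growth
        else:
            w += 1.0                      # congestion avoidance: +1 per RTT
    return trace

print(tcp_window_trace()[:20])

Even in this toy form, the trace shows the qualitative point of the slide: the trajectory never settles into the idealized smooth saw-tooth once losses arrive at random times.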

Slide 18: Chaotic Dynamics of TCP
Competing TCP streams: window dynamics are chaotic
- Hard to predict: resemblance to random noise
- Hard to conclude from experiments: nearby orbits move far apart later
- Hard to characterize: chaotic attractor
Poincare map of two window sizes: two-stream case and four-stream case
Veres and Boda (2000) did not rigorously establish chaos in a formal sense:
- The attractor could have been generated by a periodic orbit with a large period
- We repeated the simulation and found only quasi-periodic trajectories

Slide 19: Noisy Nature of TCP (simulation)
- Simple random traffic generates complicated attractors
- TCP reacts to randomness in network traffic: jittery end-to-end delays
- Chaos is not needed to generate complicated attractors
- Poincare map of message delay vs. window size
(Simulation setup: TCP source, a router with uniform random drops, destination.)

Slide 20: TCP Competing with UDP (ns-2 simulation)
- As the CBR rate is varied, TCP competing with UDP/CBR at the router generates a variety of dynamics
- Setup: TCP/Reno source and sink, UDP/CBR source, shared router; link parameters 2 Mb, 10 ms, DT and 1.7 Mb, 10 ms, DT
- Poincare phase plot: window size W(t) vs. packet end-to-end delay D(t), shown for UDP/CBR = 1 Mbps

Slide 21: TCP Competing with UDP
(Poincare plots for UDP/CBR rates of 0, 0.5, 1.0, 1.7, and 1.75 Mbps.)

Slide 22: Summary of Our Analytical Results
State space of TCP: congestion window; packet delay including retransmits; acknowledgements since the last multiplicative decrease; losses inferred since the last additive increase
TCP-AIMD dynamics have two qualitatively different regimes:
- Regime one: highlighted in the usual TCP literature; the window increases while no losses are detected
- Regime two: the window decreases as losses are inferred; its effect and duration are enhanced by network delay and high buffer occupancy
- Trajectories move back and forth between these two regimes
We define a Poincare map that updates the window w: the w-update map M
- M is 1-dimensional if regime two is short-lived
- M is 2-dimensional and complicated if regime two is significant
- M is qualitatively similar to the tent map and generates chaotic trajectories

Slide 23: Dynamics of Transitions Between Regimes
(Figure: the w-update map for long TCP transfers, with window trajectories over time in regime 1 and regime 2.)
Both regimes are unstable (eigenvalue analysis).

Slide 24: M: the w-update Map
- Given a window value w, M gives its next updated value after some (not fixed) time period
- The regime 1 and regime 2 pieces depend on the number of dropped packets, the buffer occupancy at that time, and the delay between the source and the bottleneck buffer
- Result: M is parametrized, and each piece resembles a twisted version of the classical tent map (recalled below)
Rao, Gao and Chua, chapter in Complex Dynamics in Communication Networks, 2004
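For reference, the classical tent map that M is compared with is standard material (not from the slide itself):

T(x) =
\begin{cases}
2x, & 0 \le x < \tfrac{1}{2},\\
2(1-x), & \tfrac{1}{2} \le x \le 1,
\end{cases}
\qquad
\lambda = \lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}\ln\bigl|T'(x_k)\bigr| = \ln 2 > 0 .

The positive Lyapunov exponent ln 2 means nearby initial points separate exponentially, which is the sense in which a tent-map-like w-update map generates chaotic window trajectories.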

Slide 25: Internet Measurements (joint work with Jianbo Gao)
Question 1: How relevant are the previous simulation and analytical results on chaotic trajectories?
Answer: Relevant from an analysis perspective, to a certain extent.
Question 2: Do actual Internet TCP measurements exhibit chaotic behavior?
Answer: Yes, and they are more complicated than purely chaotic (deterministic) behavior.

Slide 26: Internet Measurements
Internet (net100) traces show that TCP-AIMD dynamics are a complicated mixture of chaotic and stochastic regimes:
- Chaotic: TCP-AIMD dynamics
- Stochastic: TCP response to network traffic
Basic point: TCP traces collected on all Internet connections showed complicated dynamics; the classical "saw-tooth" profile is not seen even once. This is not a criticism of TCP, which was not designed for smooth dynamics.

Slide 27: cwnd Time Series for the ORNL-LSU Connection
- Connection: OC192 to Atlanta SOX, Internet2 to Houston, LAnet to LSU
- Time series: cwnd = x(t), collected at approximately 1 ms resolution using net100 instruments

Slide 28: Time-Dependent Exponent Plots
- Informally, a measure of how separated close-by states become in time; exponential separation is characteristic of a chaotic regime
- Form state vectors of size m from the sampled time series x(1), x(2), ...; for pairs of state vectors that start close together, the time-dependent exponent is the average logarithmic growth of their separation after k steps (see the sketch below)
- Lorenz (chaotic) data show a common envelope; uniform random data spread out
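The slide's formula did not survive this transcript; the following is a minimal sketch of the usual time-dependent exponent computation (average log growth of separation over pairs of delay vectors whose initial distance falls in a small shell). The embedding dimension, shell bounds, and horizon are illustrative assumptions, not the paper's settings.

import numpy as np

def time_dependent_exponent(x, m=4, k_max=50, shell=(0.0, 0.05)):
    """Lambda(k): average of ln(||X_{i+k}-X_{j+k}|| / ||X_i-X_j||) over pairs
    (i, j) whose initial separation lies in the given shell.
    x: 1-D time series; m: embedding dimension (illustrative choice)."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m + 1
    X = np.array([x[i:i + m] for i in range(n)])          # delay vectors
    lam, counts = np.zeros(k_max), np.zeros(k_max)
    lo, hi = shell
    for i in range(n - k_max):
        d0 = np.linalg.norm(X[i + 1:n - k_max] - X[i], axis=1)
        js = np.where((d0 > max(lo, 1e-12)) & (d0 <= hi))[0] + i + 1
        for j in js:
            for k in range(1, k_max):
                dk = np.linalg.norm(X[i + k] - X[j + k])
                if dk > 0:
                    lam[k] += np.log(dk / np.linalg.norm(X[i] - X[j]))
                    counts[k] += 1
    return np.where(counts > 0, lam / np.maximum(counts, 1), np.nan)

# For a chaotic series Lambda(k) grows roughly linearly (a common envelope)
# before saturating; for white noise it jumps immediately to a plateau.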

Slide 29: Internet cwnd Measurements: Both Stochastic and Chaotic Parts are Dominant
TCP traces have:
- A common envelope: chaotic
- Spread-out curves: stochastic at certain scales
Observations:
- From the analysis, the chaotic dynamics come from AIMD
- The stochastic component is the response to network traffic: losses and RTT variations
Gao and Rao, IEEE Communications Letters, 2005, in press

Slide 30: Design of Transport Protocols with Smooth Dynamics
- Observation 1: avoid AIMD-like behavior to avoid chaotic dynamics
- Challenge: randomness is inherent in Internet connections and will not go away even if the protocol is non-chaotic
- Our solution: explicitly account for randomness in the protocol design via stochastic approximation

Slide 31: Throughput Stabilization
Niche application requirement: provide stable throughput at a target rate, typically much below the peak bandwidth
- High-priority channels
- Commands for computational steering and visualization
- Control loops for remote instrumentation
TCP AIMD is not suited for stable throughput: complicated dynamics; underflows with sustained traffic
Important consideration: the stochasticity of Internet connections must be explicitly accounted for
Rao, Wu and Iyengar, IEEE Communications Letters, 2004

Slide 32: Stochastic Approximation: UDP Window-Based Method
- Objective: adjust the source rate to achieve an (almost) fixed goodput at the destination application
- Difficulty: data packets and acknowledgements are subject to random processes
- Approach: rely on statistical properties of the data paths (transport control loop between source and destination)

Slide 33: UDP-Based Framework
- The source sends a window of datagrams and then waits for an idle period
- Quantities of interest: source sending rate, destination goodput, and loss rate
- The goodput and the loss rate are related to the sending rate through regression functions (goodput regression and loss regression) estimated from measurements

Slide 34: Channel Throughput Profile
- Plot of the receiving rate as a function of the sending rate
- Its precise interpretation depends on the sending and receiving mechanisms and on the definition of the rates
- For protocol optimization, it is important to generate the profile with the protocol's own sending mechanism
Window-based sending process for UDP datagrams:
- Send a window's worth of datagrams in one step, then wait for a period called the idle time or wait time
- The sending rate at this time resolution is the data sent in the window divided by the send time plus the wait time (see the sketch below)
- This is an ad hoc mechanism facilitated by a 1 GigE NIC
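A minimal sketch of this window-based sending loop; the socket handling, parameter names, and example destination are illustrative assumptions, not the tool used in the talk.

import socket, time

def windowed_send(dest, data, window=50, datagram=1400, idle=0.005):
    """Send `data` as UDP datagrams: `window` datagrams back-to-back, then
    sleep for `idle` seconds.  Returns the achieved sending rate in bits/s,
    i.e. bits sent divided by elapsed time (send time + idle time)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sent, start = 0, time.time()
    for off in range(0, len(data), window * datagram):
        chunk = data[off:off + window * datagram]
        for i in range(0, len(chunk), datagram):
            sock.sendto(chunk[i:i + datagram], dest)
            sent += len(chunk[i:i + datagram])
        time.sleep(idle)                       # idle (wait) time between windows
    elapsed = time.time() - start
    return 8.0 * sent / elapsed                # sending rate r = W*s / (t_send + t_idle)

# Example (requires a UDP listener at the destination, e.g. on port 5001):
# rate = windowed_send(("receiver.example.org", 5001), b"x" * 10_000_000)

Sweeping the window size and the idle time in such a loop, while recording the rate received at the destination, is what produces a throughput profile of the kind shown on the next slide.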

Slide 35: Throughput Profile
- Throughput and loss rates vs. sending rate (window size, cycle time)
- Objective: adjust the source rate to yield the desired throughput at the destination
(Figure: profiles measured on a typical day and on Christmas day, showing a stabilization zone and a peak zone.)

Slide 36: Adaptation of Source Rate
- Sending process: send a window of datagrams and wait for a chosen duration
- Either adjust the window size or adjust the cycle time (wait time)
- Both are special cases of the classical Robbins-Monro method: the control parameter is moved, with a decreasing step size, in the direction that drives a noisy goodput estimate toward the target throughput (see the sketch below)
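A minimal sketch of a Robbins-Monro style update for the window size, assuming a measure_goodput() callback that returns a noisy goodput estimate for a given window. The step-size schedule, function names, and synthetic channel are illustrative assumptions.

import random

def stabilize_window(measure_goodput, target, w0=10.0, a=0.8, steps=200):
    """Robbins-Monro iteration: w_{n+1} = w_n + a_n * (target - g_n), where
    g_n is a noisy goodput measurement at window w_n and a_n = a / n^0.8 is
    a decreasing gain (a standard choice with sum a_n = inf, sum a_n^2 < inf)."""
    w = w0
    for n in range(1, steps + 1):
        g = measure_goodput(w)           # noisy observation of goodput at window w
        gain = a / n ** 0.8
        w = max(1.0, w + gain * (target - g))
    return w

# Toy usage with a synthetic noisy channel: goodput ~ 0.9*w + noise (Mbps).
w_star = stabilize_window(lambda w: 0.9 * w + random.gauss(0, 0.5), target=20.0)
print(round(w_star, 2))                  # settles near 20/0.9, about 22.2

The decreasing gain is what smooths out the measurement noise: early iterations move quickly toward the target, later ones average it away instead of reacting to every fluctuation the way AIMD does.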

Slide 37: Performance Guarantees
- Summary: stabilization is achieved with high probability, with a very simple estimation of the source rate
- Basic result: a probabilistic bound for the general update rule

Slide 38: Internet Measurements
ORNL-LSU connection (before a recent upgrade):
- Hosts with 10 Mbps NICs
- 2000-mile network distance
- ORNL to NYC via ESnet; NYC to DC to Houston via Abilene; Houston to LSU via a local network
ORNL-GaTech connection:
- Hosts with GigE NICs
- ORNL to Juniper router: 1 Gbps link; Juniper to Atlanta SOX: OC192 (1 Gbps link); SOX to GaTech: 1 Gbps link

Slide 39: ORNL-LSU Connection

Slide 40: Goodput Stabilization: ORNL-LSU Experimental Results
- Case 1: target goodput = 1.0 Mbps, rate control through the congestion window, a = 0.8
- Case 2: target goodput = 2.0 Mbps, rate control through the congestion window, a = 0.8
(Plots: datagram acknowledging time vs. source rate (Mbps) and goodput (Mbps).)

Slide 41: Throughput Stabilization: ORNL-GaTech
- Target goodput = 20.0 Mbps, a = 0.8, adjusting the congestion window size
- Target goodput = 2.0 Mbps, a = 0.8, adjusting the sleep time

Slide 42: RUNAT: Reliable UDP-based Network Adaptive Transport
Transport protocol:
- Maximizes connection utilization by tracking the peak goodput
- Uses Kiefer-Wolfowitz stochastic approximation to handle ACKs and losses (see the sketch below)
Features:
- Tailored to random loss rate and RTT: segmented rate control with 3 control zones (bottleneck link underutilized, saturated, or overloaded)
- Explicit accounting for random components: stochastic approximation methods based on goodput estimates
- TCP-friendliness: rate-increasing and rate-decreasing coefficients are dynamically adjusted
- Adaptable to diverse network environments: measurement and control periods are not constant but link-specific (based on RTT)
Wu and Rao, INFOCOM 2005
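As a minimal sketch of the Kiefer-Wolfowitz idea RUNAT builds on (not RUNAT itself): the peak of the goodput profile is sought by finite-difference gradient estimates from noisy goodput measurements. The gain sequences and the synthetic profile below are illustrative assumptions.

import random

def kw_peak_search(measure_goodput, r0=5.0, steps=150, a=1.0, c=0.5):
    """Kiefer-Wolfowitz stochastic approximation: climb toward the sending
    rate that maximizes a noisy goodput function using two-sided finite
    differences,
      r_{n+1} = r_n + a_n * (g(r_n + c_n) - g(r_n - c_n)) / (2 c_n),
    with a_n = a/n and c_n = c/n^(1/3) (standard gain choices)."""
    r = r0
    for n in range(1, steps + 1):
        a_n, c_n = a / n, c / n ** (1.0 / 3.0)
        g_plus = measure_goodput(r + c_n)
        g_minus = measure_goodput(r - c_n)
        r = max(0.1, r + a_n * (g_plus - g_minus) / (2.0 * c_n))
    return r

# Synthetic unimodal goodput profile peaking near r = 12 Mbps, plus noise.
profile = lambda r: -0.05 * (r - 12.0) ** 2 + 10.0 + random.gauss(0, 0.2)
print(round(kw_peak_search(profile), 2))   # lands near 12

Unlike Robbins-Monro, which drives the rate to a prescribed target, Kiefer-Wolfowitz needs no target: it searches for the maximizer of the (unknown, noisy) goodput curve, which is why it fits the peak-tracking role in RUNAT.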

Slide 43: Three Zones of the Goodput Profile
Three control zones (see the sketch below):
- Zone I (adaptive increase): the bottleneck link is underutilized; low packet loss due to occasional congestion or transmission errors, remaining essentially fixed as the source rate increases
- Zone II (transitional, dynamic KWSA method): the bottleneck link is saturated; the peak goodput falls within this zone; stochastic approximation determines whether to increase or decrease the source rate
- Zone III (adaptive decrease): the bottleneck link is overloaded; large packet loss due to network congestion; back off to recover from congestion collapse
(Figure: goodput regression vs. sending rate r; zone I has near-zero loss, zone II low loss, zone III high loss; the sending rate is stabilized near the peak.)
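A minimal sketch of the segmented (three-zone) control decision; the loss-rate thresholds and coefficients are illustrative placeholders, not RUNAT's actual parameters.

def classify_zone(loss_rate, low=0.005, high=0.05):
    """Map a measured loss rate to a control zone (thresholds are illustrative)."""
    if loss_rate <= low:
        return "I"      # underutilized: near-zero loss
    if loss_rate >= high:
        return "III"    # overloaded: heavy congestion loss
    return "II"         # saturated: the peak goodput lives here

def next_rate(rate, loss_rate, kw_step, alpha=1.0, beta=0.5):
    """One segmented rate-control step:
    zone I   -> additive increase by alpha,
    zone II  -> hand control to the stochastic-approximation step (kw_step),
    zone III -> multiplicative back-off by beta."""
    zone = classify_zone(loss_rate)
    if zone == "I":
        return rate + alpha
    if zone == "III":
        return rate * beta
    return kw_step(rate)       # e.g. a Kiefer-Wolfowitz update as sketched earlier

# Example: at 10 Mbps with 2% measured loss we are in zone II and defer to the SA step.
print(next_rate(10.0, 0.02, kw_step=lambda r: r + 0.1))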

Slide 44: Segmented Rate Control Algorithm
- Basic idea: control the sending rate based on a loss-rate estimate to achieve the peak goodput
(The slide gives the per-zone rate-update formulas and the loss-rate estimator.)

Slide 45: Convergence Properties of RUNAT
Informal statement: if the protocol is in zone I or III, it will exit to zone II; if it is in zone II, it will converge to the maximum throughput.
- Condition A1: loss statistics vary slowly
- Condition A2: the loss regression is differentiable and its derivative is monotonically increasing with respect to r in zone II
Result: starting in zone I or III, RUNAT enters zone II in a finite number of steps almost surely; in zone II, RUNAT converges almost surely to the peak goodput.

Slide 46: Experimental Results on the Link Between ozy4 (ORNL) and robot (LSU)
Illustration of microscopic RUNAT behavior during a transfer of 20 MB of data:
- Slow start, followed by episodes in zone I (loss rate 0%), zone II (loss rate 3.33%), and zone III (loss rate 37.33%)
- When far from the saturation (peak) point, the step size is adjusted to large values to move quickly toward the peak
- When approaching the saturation (peak) point, the step size is adjusted to small values to converge slowly to, and remain at, the peak
- The decrement of the source rate upon packet loss is determined by the congestion level (local loss-rate measurements): higher congestion levels result in larger rate drops
- The increment of the source rate is likewise determined by the congestion level (local loss-rate measurements)

Slide 47: Experimental Results on the Link Between ozy4 (ORNL) and robot (LSU)
RUNAT transport performance during a transfer of 2 GB of data with a concurrent TCP transfer of 50 MB:
- Case 1 (RUNAT and TCP running concurrently): RUNAT throughput 10.49 Mbps, TCP throughput 0.376 Mbps
- Case 2 (a single TCP flow only): TCP throughput 0.377 Mbps
Note: the low throughputs were due to the high traffic volume at the time of the experiments. On a normal day with regular traffic volume, TCP achieves 3~6 Mbps and RUNAT may reach 15~30 Mbps at lower loss rates without significantly affecting concurrent TCP on this link.

Slide 48: Experimental Results on the Link from ozy4 (ORNL) to orbitty (NC State)

Transport method | Data sent (MB) | Throughput (Mbps) | TCP friendly?
iperf (TCP)      | 100            | 8.7               | Yes
iperf (UDP)      | 1000           | 95.6              | No (no congestion control, no reliability)
FTP (TCP)        | 100            | 18.6              | Yes
SABUL            | 1000           | 80.0              | Seems not (concurrent TCP: 18.6 Mbps drops to 10.1 Mbps)
RUNAT            | 1500           | 80.0              | Statistically yes (concurrent TCP: 18.6 Mbps drops to 18.2 Mbps)

Slide 49: ORNL-Atlanta-ORNL 1 Gbps Channel
Setup: Dell dual Xeon 3.2 GHz and dual Opteron 2.2 GHz hosts, GigE connections to Juniper M160 routers at ORNL and Atlanta, OC192 ORNL-ATL link with GigE and SONET blades, and an IP loop at Atlanta
- Host to router: dedicated 1 GigE NIC
- ORNL router: filter-based forwarding overrides both the input and middle queues and disables other traffic to the GigE interfaces; IP packets on both GigE interfaces are forwarded to the outgoing SONET port
- Atlanta SOX router: default IP loopback
- Only 1 Gbps of the OC192 link is used for production traffic, leaving 9 Gbps of spare capacity

Slide 50: 1 Gbps Dedicated IP Channel
- Non-uniform physical channel: GigE to SONET to GigE, about 500 network miles, forming an end-to-end IP path
- Both GigE links are dedicated to the channel; other host traffic is handled through a second NIC
- Routers, the OC192 link, and the hosts are lightly loaded
- IP-based applications and protocols can be executed without modification
(Same hardware as the previous slide: Dell dual Xeon 3.2 GHz, dual Opteron 2.2 GHz, Juniper M160 routers at ORNL and Atlanta, OC192 ORNL-ATL, IP loopback.)

Slide 51: Dedicated Hosts
- Hosts: Linux 2.4 kernel (Red Hat, SuSE)
- Two NICs: an optical connection to the Juniper M160 router and a copper connection to an Ethernet switch/router
- Disks: RAID 0 dual disks (140 GB SCSI), XFS file system
- Peak disk data rate is about 1.2 Gbps (IOzone measurements), so the disk is not a bottleneck at 1 Gbps data rates

Slide 52: UDP Goodput and Loss Profile
- Each point in the horizontal plane is a (window size, wait time) setting
- Goodput plateau at about 990 Mbps
- Non-zero and random loss rate: high goodput is achieved at a non-trivial loss rate

Slide 53: 1 GigE NICs Act as Rate Controllers
- Data rates from the application buffer through the kernel buffer to the NIC can exceed 1 Gbps; the GigE NIC and the Juniper M160 port are both rate-limited to 1 Gbps
- Our window-based method: the flow from the application to the NIC is ON/OFF and exceeds 1 Gbps at times, but the NIC regulates the flow to 1 Gbps, so the NIC rate matches the link rate
- This method does not work well if the NIC rate is higher than the link rate or the router port rate: the NIC may send at the higher rate, causing losses at the router port

Slide 54: Best Performance of Existing Protocols
Disk-to-disk transfers (unet2 to unet1):
- tsunami: 919 Mbps
- UDT: 890 Mbps
- FOBS: 708 Mbps
Memory-to-memory transfers:
- UDT: 958 Mbps
- Both iperf and the throughput profiles indicated levels of about 990 Mbps
- Such rates are potentially achievable if disk access and protocol parameters are tuned

Slide 55: Hurricane Protocol
- Composed based on principles of, and experience with, UDT and SABUL; it was not easy for us to figure out all the tweaks needed to reach peak performance
- UDP window-based flow control
- Nothing fundamentally new, but needed for fine tuning
- 990 Mbps disk-to-disk on the dedicated 1 Gbps connection
- No attempt at congestion control

Slide 56: Hurricane Control Structure
- Sender: reads from disk, sends UDP datagrams, and reloads lost datagrams on request
- Receiver: receiver buffer, reordering of datagrams, and NACKs grouped k at a time, returned over a TCP control channel
- Different subtasks are handled by threads, which are woken up on demand
- Thread invocations are reduced by clustered NACKs instead of individual ACKs (see the sketch below)
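A minimal sketch of the grouped-NACK bookkeeping on the receiver side (not the Hurricane code; the sequence numbering and group size k are illustrative), showing how a cluster of missing sequence numbers is reported at once instead of acknowledging every datagram:

class NackTracker:
    """Track received UDP datagram sequence numbers and emit NACKs for
    gaps, grouped k losses at a time (illustrative receiver-side logic)."""
    def __init__(self, k=8):
        self.k = k
        self.expected = 0          # next in-order sequence number
        self.missing = set()       # lost sequence numbers not yet retransmitted
        self.unreported = set()    # losses not yet NACKed

    def on_datagram(self, seq):
        """Record a received datagram; return a list of k NACKed sequence
        numbers once enough losses have accumulated, else None."""
        if seq == self.expected:
            self.expected += 1
        elif seq > self.expected:
            gap = range(self.expected, seq)      # a gap: these datagrams were lost
            self.missing.update(gap)
            self.unreported.update(gap)
            self.expected = seq + 1
        else:
            self.missing.discard(seq)            # a retransmission filled a hole
        if len(self.unreported) >= self.k:
            nack = sorted(self.unreported)[: self.k]
            self.unreported.difference_update(nack)
            return nack                          # sent back over the TCP control channel
        return None

t = NackTracker(k=3)
for s in [0, 1, 5, 6, 7]:          # datagrams 2, 3, 4 were lost
    r = t.on_datagram(s)
    if r:
        print("NACK", r)           # -> NACK [2, 3, 4]

Batching the loss reports this way is what lets the sender-side retransmission thread sleep until a whole group arrives, which is the thread-wakeup reduction the slide describes.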

Slide 57: Hurricane

Slide 58: Ad Hoc Optimizations
Manual tuning of parameters:
- Wait-time parameter: the initial value is chosen from the throughput profile; empirically, goodput is unimodal in the wait time, so pairwise measurements support a binary-search style tuning (see the sketch below)
- Group size k for NACKs: empirically, goodput is unimodal in k, and k is tuned the same way
- Disk-specific details: reads are done in batches (no input buffer); NACKed blocks are handled using fseek and attached to the next batch
- This tuning is not likely to be transferable to other configurations or different host loads
- More work is needed on automatic tuning and systematic analysis
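A minimal sketch of tuning one parameter under the unimodality assumption, using pairwise goodput comparisons to shrink the search interval (a ternary-search style variant of the binary search the slide mentions). The measurement callback, bounds, and synthetic curve are illustrative assumptions.

def tune_unimodal(measure, lo, hi, rounds=20):
    """Shrink [lo, hi] toward the argmax of a unimodal goodput curve by
    comparing measurements at two interior probe points each round.
    `measure(x)` returns a (possibly noisy) goodput for parameter value x."""
    for _ in range(rounds):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if measure(m1) < measure(m2):
            lo = m1                   # the peak lies to the right of m1
        else:
            hi = m2                   # the peak lies to the left of m2
    return (lo + hi) / 2.0

# Example: synthetic goodput unimodal in the wait time, peaking at 4 ms.
best_wait = tune_unimodal(lambda t: -(t - 4.0) ** 2 + 990.0, lo=1.0, hi=10.0)
print(round(best_wait, 2))            # about 4.0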

Slide 59: Outline of Presentation
- Network Infrastructure Projects: DOE UltraScienceNet, NSF CHEETAH
- Dynamics and Control of Transport Protocols: TCP AIMD dynamics, analytical results, experimental results
- New Class of Protocols: throughput stabilization for control, transport protocol
- Probabilistic Quickest Path Problem: quickest path algorithm, probabilistic algorithm

Slide 60: Shortest Path Problem
Classical problem: given a graph with a "distance" (delay) function on the edges, the distance of a path is the sum of the distances of its edges. Compute a path with the smallest path distance from the source node to the destination node.
Solved using Dijkstra's algorithm in polynomial time (for example, O(|E| log |V|) with a binary-heap priority queue).

Slide 61: Quickest Path Problem
Problem: given a graph with (1) a "delay" function and (2) a "bandwidth" function on the edges, the total delay of a path for a message of size sigma is the sum of the edge delays plus sigma divided by the path's minimum edge bandwidth (see the formula below). Compute a path with the smallest total delay from the source node to the destination node.
Solved using Chen and Chin's algorithm with polynomial complexity.
Important observation: a subpath of a quickest path is not necessarily quickest.
(Figure: a small example network from s to d with edge labels (delay, bandwidth) such as (5, 20), (15, 5), (15, 20), and total delays T(60) = 32, 29, 52, 57 for different paths.)
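The total-delay formula is garbled in this transcript; it has the standard quickest-path form, with d, b, and sigma as defined on the slide:

T_P(\sigma) \;=\; \sum_{e \in P} d(e) \;+\; \frac{\sigma}{\min_{e \in P} b(e)} .

As an illustrative check (not necessarily the path in the slide's figure): a two-edge path with labels (5, 20) and (15, 5) has delay 5 + 15 = 20 and bottleneck bandwidth 5, so T(60) = 20 + 60/5 = 32, matching one of the values shown.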

Slide 62: Quickest Path Algorithm (Chen and Chin)
- Let b_1 < b_2 < ... < b_m denote the distinct edge bandwidths
- Let G_b denote the subnetwork in which edges with bandwidth smaller than b are removed
- Compute the path with the least delay in each G_{b_i}; the quickest path is the best of these m candidates once the message size is taken into account (see the sketch below)
- Typically implemented using m invocations of Dijkstra's algorithm; m could be quite large
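A minimal sketch of this construction (the edge-list representation is an illustrative choice, not an optimized implementation): run a delay-only Dijkstra on each bandwidth-pruned subnetwork and keep the candidate with the smallest total delay.

import heapq

def least_delay_path(edges, src, dst, min_bw):
    """Dijkstra on the subnetwork containing only edges with bandwidth >= min_bw.
    edges: list of (u, v, delay, bandwidth); returns (delay, path) or None."""
    adj = {}
    for u, v, d, b in edges:
        if b >= min_bw:
            adj.setdefault(u, []).append((v, d))
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue                              # stale heap entry
        for v, d in adj.get(u, []):
            if du + d < dist.get(v, float("inf")):
                dist[v], prev[v] = du + d, u
                heapq.heappush(heap, (dist[v], v))
    if dst not in dist:
        return None
    path, node = [dst], dst
    while node != src:
        node = prev[node]
        path.append(node)
    return dist[dst], path[::-1]

def quickest_path(edges, src, dst, sigma):
    """Chen-Chin construction: one shortest-delay computation per distinct bandwidth."""
    best = None
    for b in sorted({e[3] for e in edges}):
        r = least_delay_path(edges, src, dst, b)
        if r is None:
            continue
        delay, path = r
        total = delay + sigma / b                 # transfer time at bandwidth threshold b
        if best is None or total < best[0]:
            best = (total, path)
    return best

net = [("s", "a", 5, 20), ("a", "d", 15, 5), ("s", "d", 15, 20)]
print(quickest_path(net, "s", "d", 60))           # direct edge: 15 + 60/20 = 18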

Slide 63: Simple Probabilistic Quickest Path Algorithm
- Randomly choose a fraction of the distinct bandwidth values b_i and compute the least-delay paths only on the corresponding subnetworks (see the sketch below)
- For larger networks we needed fewer than 10% of the shortest-delay computations
- Question: is there a fundamental reason for this?
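A minimal sketch of the sampled variant, reusing the least_delay_path helper from the previous sketch; the sampling fraction and seed are illustrative parameters.

import random

def probabilistic_quickest_path(edges, src, dst, sigma, fraction=0.1, seed=0):
    """Approximate quickest path: run the per-bandwidth shortest-delay
    computation only on a random fraction of the distinct bandwidths."""
    random.seed(seed)
    bandwidths = sorted({e[3] for e in edges})
    k = max(1, int(fraction * len(bandwidths)))
    best = None
    for b in random.sample(bandwidths, k):
        r = least_delay_path(edges, src, dst, b)   # helper from the previous sketch
        if r is None:
            continue
        delay, path = r
        total = delay + sigma / b
        if best is None or total < best[0]:
            best = (total, path)
    return best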

Slide 64: Analysis
- Critical observation: as a function of the bandwidth threshold, the optimal delay is non-decreasing
- Its Vapnik-Chervonenkis dimension is 1, which makes it efficient to approximate by random sampling
- The approximation based on p shortest-path computations is a piecewise-linear approximation with p points
Rao 2004, Theoretical Computer Science

Slide 65: Conclusions
TCP-AIMD dynamics:
- Analytically established chaotic dynamics
- Analyzed Internet traces: a combination of chaotic and stochastic dynamics
New classes of protocols:
- ONTCOU: achieves a stable target flow level
- RUNAT: a statistical approach to congestion control
- Based on stochastic approximation: convergence proofs under general conditions
- Experimental results are promising on both the Internet and dedicated connections

Slide 66: Thank You

