Improving the Reliability of Internet Paths with One-hop Source Routing
Krishna Gummadi, Harsha Madhyastha, Steve Gribble, Hank Levy, David Wetherall
Department of Computer Science and Engineering, University of Washington, Seattle, WA
Reliability of Internet paths Enormous interest in understanding Internet path reliability Proposals to improve reliability using indirection routing –RON, Detour Current implementations maintain complex overlays that do not scale
This talk What are the failure characteristics of Internet paths? What do they imply about reliability benefits of indirection routing? Can a simple, stateless, scalable scheme realize these benefits? What benefits would end-users see in practice? –for a real application, such as Web browsing
Outline Introduction Measurement study of Internet path failures One-hop source routing An implementation study of SOSR Conclusions
Measurement study of path failures We conducted a week-long measurement study –probed 3,153 destinations from 67 PlanetLab sites –each destination is probed from exactly one node Our goal is to answer the following: –How often do paths fail? –Where do failures occur? –How long do failures last?
Choosing destinations We want to understand how the network paths to servers and broadband hosts differ –it has implications for different workloads/apps Web transfers between servers and broadband hosts VoIP apps between broadband hosts We chose 3,153 destinations: –378 popular web servers –1,139 broadband hosts –1,636 randomly selected IPs
Detecting path failures Each probe (response) is a TCP ACK (RST) packet –default probe frequency: one every 15 seconds Upon a single probe response loss, we: –increase probe frequency: one every 5 seconds till 10 consecutive probe responses are received –perform traceroute to detect failure location A path fails when 3 consecutive probes and traceroute fail
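The detection rule on this slide can be sketched as a small state machine. This is a minimal sketch, not the study's actual prober: the class name `PathMonitor`, its methods, and the use of booleans for probe and traceroute outcomes are assumptions made for illustration. The 15-second and 5-second intervals, the 10-response threshold, and the 3-probe failure rule come from the slide.

```python
from dataclasses import dataclass

NORMAL_INTERVAL_S = 15   # default probe period (from the slide)
FAST_INTERVAL_S = 5      # probe period after a lost response (from the slide)
RESPONSES_TO_RELAX = 10  # consecutive responses needed to return to the default period
LOSSES_TO_FAIL = 3       # consecutive lost probes needed to declare a failure


@dataclass
class PathMonitor:
    """Hypothetical per-path probing state; all names are illustrative."""
    fast_mode: bool = False
    consecutive_losses: int = 0
    consecutive_responses: int = 0

    def record_probe(self, got_response: bool, traceroute_failed: bool = False) -> bool:
        """Record one probe outcome; return True if the path is declared failed."""
        if got_response:
            self.consecutive_losses = 0
            self.consecutive_responses += 1
            # Drop back to the default period after enough consecutive responses.
            if self.fast_mode and self.consecutive_responses >= RESPONSES_TO_RELAX:
                self.fast_mode = False
            return False
        # A single lost response switches to fast probing.
        self.consecutive_responses = 0
        self.consecutive_losses += 1
        self.fast_mode = True
        # The path counts as failed only if the traceroute fails as well.
        return self.consecutive_losses >= LOSSES_TO_FAIL and traceroute_failed

    def next_interval(self) -> int:
        """Seconds to wait before sending the next probe."""
        return FAST_INTERVAL_S if self.fast_mode else NORMAL_INTERVAL_S
```

A caller would invoke `record_probe` once per probe and sleep for `next_interval()` seconds between probes; when and how the traceroute is launched is left to the caller in this sketch.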
How often do paths fail? Failures do happen, but not frequently –on average each path sees 6 failures/week –server paths see 4 failures/week –broadband paths see 7 failures/week Most paths see at least one failure in a week –85% of all paths –78% of server paths –88% of broadband paths
Categories of failure locations Categories help distinguish between core and edge failures [Figure: path from Source to Destination across local ISPs and Tier-1 ISPs, labeled with the categories source_side, core, dst_side, last_hop]
Where do paths fail? Server path failures occur throughout the network –very few (16%) last_hop failures –suggests network is the dominating cause for server unavailability
Where do paths fail? Most of the broadband failures happen on last_hop Excluding last_hop, server and broadband paths see similar number of failures
How long do failures last? Failure durations are highly skewed Majority of failures are short –median failure duration: 1-2 min for all paths –median path availability: 99.9% for all paths A non-negligible fraction of paths see long failures –tend to occur on last_hop –mean path availability: 99.6% for servers and 94.4% for broadband
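To put the availability figures in perspective, they can be converted into downtime per week. This is a minimal sketch that assumes availability means the fraction of time a path is usable; the function name `weekly_downtime_minutes` is illustrative and not from the talk.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes


def weekly_downtime_minutes(availability_percent: float) -> float:
    """Minutes per week a path is unavailable at the given availability."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_WEEK


for label, pct in [
    ("median, all paths", 99.9),
    ("mean, server paths", 99.6),
    ("mean, broadband paths", 94.4),
]:
    print(f"{label}: {weekly_downtime_minutes(pct):.1f} min/week")
```

Under that assumption, 99.9% corresponds to roughly 10 minutes of downtime per week, 99.6% to roughly 40 minutes, and 94.4% to about 9.4 hours, which shows how strongly the long failures pull down the broadband mean.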
Implications for indirection routing Failures happen often enough that they are worth fixing But, they are rare enough that recovery schemes should be inexpensive under normal conditions Failures near the end-nodes limit the performance of indirection routing –good news: servers see very few failures near end hosts –bad news: broadband hosts see many last_hop failures
One-hop source routing Use default path under normal conditions When default path fails, source attempts to recover by routing through an intermediary [Figure: src reaches dst through an intermediate node when the default path is broken (X)]
Our goals Understand the potential reliability benefits of one-hop source routing Design a simple, stateless, scalable scheme to realize this potential
Evaluating one-hop source routing For each path failure during the week-long trace –we sent probes via intermediaries at 39 PlanetLab sites Compared the success of probes along default and intermediate paths –estimate the maximum potential of any one-hop scheme –estimate the success rate of a specific one-hop scheme
Potential of any one-hop routing scheme A failure is recoverable if any of the 39 intermediaries help Server failures more recoverable than broadband Almost all Internet core failures can be avoided through one-hop routing
Percent of failures that are recoverable:
            servers   broadband
src_side    54%       55%
core        92%       90%
dst_side    79%       66%
last_hop    41%       12%
all         66%       39%
What fraction of intermediaries help in recovery? For most failures, > half of the intermediaries avoid the failure All we need to do is find one of them! Suggests that a randomly selected intermediary might work
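A back-of-the-envelope calculation shows why random selection is promising. If `helpers` of the `n` intermediaries avoid a given failure and `k` distinct intermediaries are picked uniformly at random, the probability that at least one of them helps is 1 - C(n - helpers, k) / C(n, k). This is an idealized model, not a result from the study: the function name is illustrative, and the value `helpers=20` is an assumption standing in for "just over half" of the 39 intermediaries.

```python
from math import comb


def p_at_least_one_helper(n: int, helpers: int, k: int) -> float:
    """Probability that k distinct random picks out of n include at least one helper."""
    if not (0 <= helpers <= n and 0 <= k <= n):
        raise ValueError("need 0 <= helpers <= n and 0 <= k <= n")
    # comb(n - helpers, k) counts the picks that contain no helper at all.
    return 1.0 - comb(n - helpers, k) / comb(n, k)


# 39 intermediaries (from the talk); 20 helpers is an assumed example value.
for k in (1, 2, 4):
    print(k, round(p_at_least_one_helper(39, 20, k), 3))
```

Under these assumptions a single random pick helps about half the time, while four picks find a working intermediary about 95% of the time.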
How effective is a random policy? Random-k: pick k intermediaries at random Random-4 delivers near-optimal success rate –requires no a priori probing or state
Recovery latency with random-4 Random-4 either helps early or not at all –nearly 60% of failures are recovered in 5-10 seconds After that, we have to wait for paths to self-repair So, initiate and abandon recovery early [Figure: recovery latency for server failures]
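The random-k policy needs no probing history or per-destination state, which a short sketch makes concrete. This is a hedged illustration rather than the SOSR code: `random_k_recover` and the `try_via` callback are hypothetical names, `try_via` stands in for relaying a packet through one intermediary, and trying the picks one after another is a simplification chosen here.

```python
import random
from typing import Callable, Optional, Sequence


def random_k_recover(
    intermediaries: Sequence[str],
    k: int,
    try_via: Callable[[str], bool],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick up to k distinct intermediaries at random; return the first that works."""
    rng = rng or random.Random()
    picks = rng.sample(list(intermediaries), min(k, len(intermediaries)))
    for node in picks:
        # try_via is a hypothetical stand-in for sending traffic through `node`.
        if try_via(node):
            return node
    return None  # none of the k picks helped; give up rather than keep retrying


# Example with made-up node names and an assumed set of working intermediaries.
nodes = [f"node{i}" for i in range(39)]
working = set(nodes[:20])
print(random_k_recover(nodes, 4, lambda n: n in working, random.Random(0)))
```

Returning `None` after k attempts mirrors the policy's bounded cost: the sender tries a fixed number of intermediaries and then stops.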
SOSR implementation Validate random-4 policy in practice using a real application, Web browsing SOSR: Scalable One-hop Source Routing Implemented in Linux –transparent to destinations (NAT on intermediate nodes) –transparent to applications on source node (netfilter) –extensible (can plug in policies)
Evaluating SOSR implementation Ran two clients, one with SOSR and one without –repeatedly fetched Web pages from 982 popular servers –both machines located at UW –one request per second over 3 days Client 1: default wget command-line web browser Client 2: default wget + SOSR with random-4 policy –deployed intermediaries on 39 PlanetLab nodes
User perceived benefits of SOSR wget succeeds 99.8% of time –Web seems pretty reliable A SOSR user sees only 20% fewer failures! –not clear whether SOSR matters for Web [Table: requests / failures. wget: 273, (values truncated in source). wget + SOSR: 273,978 / 383]
User perceived benefits of SOSR SOSR recovers from 56% of network failures But, can’t recover from application failures 62% of wget + SOSR failures are application related [Figure: failure breakdown for wget and wget + SOSR; legend: network level failures, application level failures, HTTP error codes, TCP refused, HTTP refused, HTTP timeout]
Conclusions What are the failure characteristics of Internet paths? –failures do happen, but they are short and infrequent –many occur on last_hop for broadband paths What do they imply about reliability benefits of indirection routing? –recovery must be cheap in the common case –one-hop source routing recovers from 66% of server and 39% of broadband path failures
Conclusions Can a simple, stateless, scalable scheme realize these benefits? –random-4 realizes the potential of any one-hop scheme –no cost in common case –no a priori probing or state needed What benefits would end-users see in practice for real applications? –Web users see only 20% fewer failures –many application-level failures
Conclusions Is indirection routing useful or not? –pessimistic view: not for the Web –optimistic view: perhaps for other applications, like VoIP
For more information Visit our research group website: