Measuring End-User Availability on the Web: Practical Experience
Matthew Merzbacher (visiting research scientist)
Dan Patterson (undergraduate)
Recovery-Oriented Computing (ROC), University of California, Berkeley
E-Commerce Goal
Non-stop availability
–24 hours/day
–365 days/year
How realistic is this goal?
How do we measure availability?
–To evaluate competing systems
–To see how close we are to optimum
The State of the World
Uptime measured in “nines”
–Four nines == 99.99% uptime (just under an hour of downtime per year)
–Does not include scheduled downtime
Manufacturers advertise six nines
–About 30 s of unscheduled downtime per year
–May be true in a perfect world
–Not true in practice on the real Internet
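To make the arithmetic concrete, here is a minimal Python sketch (not from the talk) converting “nines” into allowed downtime per year:

```python
# Convert "nines" of availability into allowed downtime per year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_per_year(nines: int) -> float:
    """Seconds of downtime per year permitted at `nines` of uptime."""
    unavailability = 10 ** (-nines)  # e.g. four nines -> 1e-4
    return SECONDS_PER_YEAR * unavailability

for n in (4, 5, 6):
    print(f"{n} nines: {downtime_per_year(n):,.1f} s/year")
# 4 nines -> 3,153.6 s/year (just under an hour)
# 6 nines -> 31.5 s/year (about half a minute)
```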
Measuring Availability
Measuring “nines” of uptime is not sufficient
–Reflects unrealistic operating conditions
Must capture the end-user’s experience
–Server + Network + Client
–Client = client machine and client software
Existing Systems
Topaz, Porivo, SiteAngel
–Measure response time, not availability
–Monitor service-level agreements
NetCraft
–Measures availability, not performance or end-user experience
We measured end-user experience and located common problems
Experiment
“Hourly” small web transactions
–From two relatively proximate sites (Mills CS, Berkeley CS)
–To a variety of sites, including:
–Internet Retailer (US and international)
–Search Engine
–Directory Service (US and international)
Ran for 6+ months
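One probe transaction might look like the sketch below, using only Python's standard library; the target URLs are hypothetical stand-ins, since the slides do not name the measured sites:

```python
import time
import urllib.request

# Hypothetical stand-ins for the measured site categories.
TARGETS = {
    "retailer": "https://retailer.example.com/",
    "search": "https://search.example.com/",
    "directory": "https://directory.example.com/",
}

def probe(url: str, timeout: float = 30.0) -> dict:
    """Run one small web transaction; record outcome and elapsed time."""
    start = time.time()
    exc = None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()              # fetch the whole (small) page
    except Exception as caught:      # DNS, TCP, HTTP, timeout, ...
        exc = caught
    return {"url": url, "ok": exc is None,
            "exc": exc, "elapsed": time.time() - start}
```

Run hourly (e.g., from cron), with each record appended to a log, this is the raw material for the availability numbers that follow.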
Availability: Did the Transaction Succeed?
(Table: measured availability for each site category (All, Retailer, Search, Directory), computed four ways:)
–Raw (overall)
–Ignoring local problems
–Ignoring local and network problems
–Ignoring local, network, and transient problems
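The rows above can be reproduced from the probe log with a filter like this sketch, assuming each record carries an error_class of None (success), "local", "network", "transient", "server", or "corporate":

```python
def availability(attempts, ignore=frozenset()):
    """Fraction of successful transactions, after dropping attempts whose
    failure falls into one of the ignored error classes."""
    counted = [a for a in attempts
               if a["error_class"] is None or a["error_class"] not in ignore]
    ok = sum(1 for a in counted if a["error_class"] is None)
    return ok / len(counted) if counted else 1.0

# The four rows, from raw to most forgiving:
# availability(log)                                     raw (overall)
# availability(log, {"local"})                          ignoring local
# availability(log, {"local", "network"})               ... and network
# availability(log, {"local", "network", "transient"})  ... and transient
```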
Types of Errors
Local (82%)
Network: medium (11%), severe (4%)
Server (2%)
Corporate (1%)
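A heuristic way to assign those categories from the exceptions the probe catches; this mapping is illustrative, not the study's actual rules. Note that a client that is down entirely leaves a gap in the hourly log rather than raising an exception, so gap detection is needed as well:

```python
import socket
import urllib.error

def classify(exc: Exception) -> str:
    """Coarsely bucket a probe failure (heuristic, not the paper's rules)."""
    if isinstance(exc, urllib.error.HTTPError) and exc.code >= 500:
        return "server"          # far end reachable, but returning errors
    if isinstance(exc, socket.timeout):
        return "network-medium"  # slow or lossy path, often transient
    if isinstance(exc, (ConnectionError, urllib.error.URLError)):
        return "network-severe"  # host unreachable
    return "local"               # everything else: client machine/software
```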
Client Hardware Problems Dominate User Experience
System-wide crashes
Administration errors
Power outages
And many, many more…
–Many, if not most, caused or aggravated by human error
What About Speed?
Does Retry Help?
(Table: retry success rate per error type (Client, Medium Network, Severe Network, Server, Corporate) for All, Retailer, Search, and Directory sites; Corporate is n/a.)
Legend: green > 80%, red < 50%
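The policy the table evaluates can be sketched with the probe() and classify() helpers above; the attempt count and spacing here are assumptions, not the study's parameters:

```python
import time

RETRYABLE = {"network-medium", "network-severe", "server"}  # not "local"

def fetch_with_retry(url: str, attempts: int = 3, delay: float = 10.0) -> dict:
    """Retry failed transactions, but only when the error class suggests a
    retry could succeed; local failures are not worth retrying."""
    result = probe(url)
    for i in range(1, attempts):
        if result["ok"] or classify(result["exc"]) not in RETRYABLE:
            break
        time.sleep(delay * i)   # space retries out a little more each time
        result = probe(url)
    return result
```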
What Guides Retry?
Uniqueness of data
Importance of data to user
Loyalty of user to site
Transience of information
And more…
Conclusion
Experiment modeled the end-user experience
The vast majority (81%) of errors were on the local end
Almost all errors were in the “last mile” of service
Retry doesn’t help for local errors
–The user may be aware of the problem and therefore less frustrated by it