1 The Content and Access Dynamics of a Busy Web Server: Findings and Implications. Venkata N. Padmanabhan (Microsoft Research), Lili Qiu (Cornell University). SIGCOMM’2000, Stockholm, Sweden, August 30, 2000
2 Outline: Motivation; Related Work; Overview; Content Dynamics; Access Dynamics; Summary & Implications; Future Work
3 Motivation: A solid understanding of Web workloads is critical for designing robust and scalable systems. Each component of the Web provides a unique perspective on how the Web functions. [Figure: clients, proxies, replicas, and servers connected through the Internet]
4 Motivation (Cont.): Distinguishing features of our work. We study the MSNBC Web site, a large news server consistently ranked among the busiest sites on the Web. We study both content and access dynamics: the dynamics of file modification and creation, and the dynamics of user accesses.
5 Related Work: Server-based studies. [ABC+96] observed that file popularity follows Zipf’s distribution (α ≈ 1) and that file accesses show temporal locality. [AW96] found 10 invariants, including that 10% of files account for 90% of accesses. [MS97] found that long latencies are not necessarily due to server overloading or CGI traffic. [AJ99] studied the 1998 World Cup traces and found a significant volume of cache consistency traffic.
6 Related Work (Cont.): Proxy workload characterization. Page popularity follows a Zipf-like distribution, i.e. request frequency ∝ 1/i^α with α < 1 [BCF+99]. The hit rate of proxy caches is no more than 50% [DMF97, GB97]. A substantial fraction of misses arises from first-time accesses [VDA+99]. Organizational membership is significant [WVS+99]. Client-based studies: [CBC95] and [BBB+98] report change in file popularity and temporal locality.
7 Overview: MSNBC server site: a large news site; a server cluster with 40 nodes; 25 million accesses a day (HTML content alone). Period studied: Aug. to Oct. 99, plus the Dec. 17, 98 flash crowd. Server logs: HTTP access logs, Content Replication System (CRS) logs, and HTML content logs. Data analysis: content dynamics and access dynamics.
8 Major Findings: Content dynamics: modification history is a rough predictor of future modifications; file modifications are frequent but minimal. Access dynamics: the set of popular files remains stable for days; domain membership has a significant bearing on client accesses, except during a flash crowd of global interest; file popularity follows a Zipf-like distribution, but with a much larger α than at proxies; accesses to old documents account for most first-time misses, so it is hard to anticipate such accesses and to eliminate these first-time misses.
9 Content Dynamics: Period studied: 10/1/99 to 10/28/99. CDF of modification intervals: distinct knees in the CDF at one hour and one day. Predictive power of modification history: modification history is a rough predictor of the future modification interval. Extent of change upon file modification: most file modifications are minimal, so delta encoding can be very useful.
10 CDF of Modification Intervals: Distinct knees in the CDF at one hour and one day.
11 Predictive Power of Modification History: Predictability has a significant bearing on cache consistency control algorithms, such as adaptive TTL. Prediction algorithm studied: estimate the future modification interval as the mean of the past x samples. Performance metrics: the correlation coefficient between the predicted and actual values, and the error in prediction.
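A minimal Python sketch of the predictor and its two metrics, under stated assumptions: the function names, the `window` parameter (the slide's x), and the sample intervals are hypothetical, and percentage error is assumed to be |predicted - actual| / actual, which the slides do not spell out.

```python
from statistics import mean


def predict_intervals(intervals, window):
    """Predict each interval as the mean of the `window` intervals before it."""
    return [mean(intervals[k - window:k]) for k in range(window, len(intervals))]


def percentage_errors(intervals, window):
    """Assumed error definition: |predicted - actual| / actual, in percent."""
    actual = intervals[window:]
    predicted = predict_intervals(intervals, window)
    return [abs(p - a) / a * 100.0 for p, a in zip(predicted, actual)]


def correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical modification intervals for one file, in minutes.
intervals = [60, 60, 120, 60, 1440, 60]
predicted = predict_intervals(intervals, window=2)
errors = percentage_errors(intervals, window=2)
```

A single outlier interval (the 1440-minute gap here) inflates the error of the predictions that follow it, which is one way a mean error can end up far above the median error.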
12 Correlation Coefficient: A larger averaging window helps to predict the future modification interval, but only up to a certain point.
13 Error in Prediction: Percentage error in predicting the file modification interval, with an averaging window of 16 samples: mean error 226%, median error 45%. Modification history yields only a rough predictor, so an alternative mechanism (e.g. callback-based invalidation) is needed as a backup.
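14 Extent of Change Upon File Modifications: Compute the delta between successive versions v1 and v2 using the vdelta algorithm. Metric: |vdelta(v1, v2)| / ((|v1| + |v2|) / 2), i.e. the delta size relative to the average size of the two versions. Results: the metric is within 1% in 77% of cases and within 10% in 96% of cases. Modification between successive versions is small, so delta encoding can be very useful.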
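The relative-change metric can be sketched in Python as follows. This is only an illustration: vdelta is not available in the standard library, so `difflib.SequenceMatcher` is swapped in as a rough stand-in that counts the bytes of the new version not covered by blocks copied from the old one. The function names and sample content are hypothetical, and the numbers it produces are not comparable to vdelta output.

```python
import difflib


def delta_size(v1: bytes, v2: bytes) -> int:
    """Rough delta size: bytes of v2 not covered by blocks matched in v1.

    Stand-in for |vdelta(v1, v2)|; a real delta encoder also pays for
    copy instructions and may compress the new bytes.
    """
    matcher = difflib.SequenceMatcher(None, v1, v2, autojunk=False)
    copied = sum(block.size for block in matcher.get_matching_blocks())
    return len(v2) - copied


def relative_change(v1: bytes, v2: bytes) -> float:
    """Delta size divided by the average size of the two versions."""
    average_size = (len(v1) + len(v2)) / 2
    if average_size == 0:
        return 0.0  # two empty versions: nothing changed
    return delta_size(v1, v2) / average_size


old = b"Breaking news: markets rise"
new = old + b"!"
change = relative_change(old, new)  # one new byte over ~27.5 bytes average
```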
15 Access Dynamics: Correlation between content and access dynamics: impact of age on file popularity; causes of first-time misses. Spatial locality in client accesses: domain membership is significant, except when there is a “hot” event of global interest. Temporal stability of file popularity: the set of popular documents mostly remains stable over a timescale of days. Distribution of file popularity: Zipf-like, but with a much larger α than at proxies.
16 Impact of Age on Popularity: For most documents, accesses are concentrated soon after creation.
17 Causes of First-time Misses: Up to 40% of cache misses are due to first-time misses [VDA+99]. [Table: Date, New files (%), Old files (%); rows for Oct. 8, Oct. 9, Oct. 10, Oct. 11] Accesses to old documents account for most first-time misses, so it is hard to anticipate such accesses and to eliminate first-time misses.
18 Temporal Stability of File Popularity: Methodology: consider the traces from a pair of days, pick the top n popular documents from each day, and compute the overlap. Results: one day apart, significant overlap (≥ 80%); two months apart, smaller overlap (20-80%); ten months apart, very small overlap (mostly below 20%). The set of popular documents remains stable for days.
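The three methodology steps can be sketched in Python as below. The overlap is assumed here to be the number of documents common to both top-n sets divided by n; the slides do not give the exact definition, and the function names and request lists are hypothetical.

```python
from collections import Counter


def top_n(requests, n):
    """Return the set of the n most frequently requested documents."""
    counts = Counter(requests)
    return {doc for doc, _ in counts.most_common(n)}


def overlap(day_a, day_b, n):
    """Fraction of the top-n documents shared by the two days (assumed metric)."""
    return len(top_n(day_a, n) & top_n(day_b, n)) / n


# Hypothetical one-day request traces (one entry per access).
day_a = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
day_b = ["a"] * 4 + ["c"] * 3 + ["e"] * 2 + ["b"]

overlap_top2 = overlap(day_a, day_b, n=2)  # top sets {a, b} and {a, c}
overlap_top3 = overlap(day_a, day_b, n=3)  # top sets {a, b, c} and {a, c, e}
```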
19 Spatial Locality in Client Accesses: Domain membership is significant, except when there is a “hot” event of global interest.
20 The Applicability of Zipf’s Law to Web Requests: Web requests follow a Zipf-like distribution: request frequency ∝ 1/i^α, where i is a document’s popularity rank. The value of α is much larger in the MSNBC traces: 1.4 to 1.8 in the MSNBC traces, smaller than or close to 1 in the proxy traces, and close to 1 in the small departmental server logs [ABC+96]. α is highest when there is a hot event.
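One common way to estimate α from a trace is a least-squares line fit of log(frequency) against log(rank); the slope of that line is -α. The Python sketch below assumes this fitting method, which the slides do not confirm as the one used in the study; the function name and access counts are hypothetical.

```python
import math


def estimate_alpha(access_counts):
    """Estimate the Zipf exponent from per-document access counts.

    Fits log(count) = c - alpha * log(rank) by ordinary least squares,
    with documents ranked from most to least accessed.
    """
    ordered = sorted(access_counts, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two documents to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


# Hypothetical counts that follow count = 3600 / rank**2 exactly.
alpha = estimate_alpha([3600, 900, 400, 225])
```

On real traces the tail of rarely accessed documents is noisy, so the fitted value depends on how many ranks are included in the fit.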
21 Impact of Larger α: Accesses in the MSNBC traces are much more concentrated. 90% of the accesses are accounted for by: the top 2-4% of files in the MSNBC traces; the top 36% of files in proxy traces (Microsoft proxies and the proxies studied in [BCF+99]); the top 10% of files in the small departmental server logs reported in [AW96]. Popular news sites like MSNBC see much more concentrated accesses, so reverse caching and replication can be very effective!
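The concentration figures quoted here answer the question "what fraction of files, taken in order of popularity, covers 90% of accesses?". A minimal Python sketch of that computation follows; the function name and the sample counts are hypothetical and not taken from the traces.

```python
def fraction_of_files_for_coverage(access_counts, coverage=0.9):
    """Smallest fraction of files (most popular first) whose accesses
    add up to at least `coverage` of all accesses."""
    ordered = sorted(access_counts, reverse=True)
    target = coverage * sum(ordered)
    running = 0
    for files_used, count in enumerate(ordered, start=1):
        running += count
        if running >= target:
            return files_used / len(ordered)
    return 1.0


# Hypothetical skewed workload: one file receives 95 of 100 accesses.
skewed = fraction_of_files_for_coverage([95, 2, 1, 1, 1])
# Hypothetical uniform workload: every file is equally popular.
uniform = fraction_of_files_for_coverage([1] * 10, coverage=0.85)
```

The more skewed the popularity distribution, the smaller this fraction, which is why a cache or replica holding only the most popular files can absorb most of the load.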
22 Summary of Results & Implications
Fact: Past modification history, when averaged over a sufficiently large window, yields a rough predictor. Implication: A guide for setting TTL, but an alternative mechanism (e.g. callback-based invalidation) is needed as backup.
Fact: Modification between successive versions is small. Implication: Delta encoding can be very useful.
23 Summary of Results & Implications (Cont.)
Fact: The set of popular documents remains stable over a timescale of days. Implication: Prefetch/push previously popular files that have undergone modification.
Fact: File popularity follows a Zipf-like distribution, but with a much larger α than at proxies. Implication: Potential of reverse caching & replication.
Fact: Accesses to old documents account for most first-time accesses. Implication: Hard to anticipate such accesses and to eliminate first-time misses.
24 Future Work: Study data sets from other large server sites: different types of Web servers may have very different workloads, so more studies such as ours will be needed. Develop efficient cache consistency algorithms.
25 Acknowledgement: Jason Bender and Ian Marriott; Erich Nahum; Kiem-Phong Vo; Damon Cole, Susan Dumais, Niccole Golden, Chris Haslam, Eric Horvitz, Geoff Voelker; anonymous reviewers