“Rate of Change and other Metrics: a Live Study of the World Wide Web”
Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, Jeffrey Mogul
Group 3: Olena Hunsicker and Divya Josyula
CS 791/891 "Web Syndication Formats" ODU Spring 2008

Presentation Overview
- Motivation behind the research
- Internet in 1997
- What is a Web cache?
- Traces
- Statistics
- Analyzing the results
  - Access rate
  - Modification times
  - Ages
  - Modification intervals
  - Duplication
  - Semantic differences
- Conclusion

Motivation Behind the Research
Assumptions:
1. A significant number of web resources are accessed more than once (locality of reference)
2. “Those resources don’t change between accesses” [1] (stability of value)
- Validate these assumptions
- Measure the benefits of using a shared proxy server
- Calculate the rate and nature of changes of Web resources
- How these metrics depend on:
  - Access rate
  - Resource size
  - Content type
  - Age at the time of reference
  - Internet top-level domain (TLD)
- Frequency of duplicates on the Web

Historic Overview
Table 1. Changes on the Web from 1997 to 2007
1997                      2007
19.5 million hosts [4]    200 million hosts
1 million websites        >92 million websites [5]
Dial-up                   DSL/cable Internet

What is a Web cache?
- Advantages: reduces latency and network traffic
- Proxy servers don’t cache documents that require authorization or include a no-cache header
- Delta encoding reduces cache misses
Diagram: client browser, proxy server, origin server

Traces
- Static: Web “crawling”; doesn’t provide dynamic access information
- Dynamic: analyzing proxy or web server logs; can reflect access times and modification dates

Traces (cont)
- Amount of data: 19 GBytes
- Duration: 17 days
- Where: gateway between AT&T Labs-Research and the Internet
- Type of data: full contents of all HTTP requests and responses

Traces (cont)
Used only 200 “OK” and 304 “Not Modified” HTTP responses

Traces (cont)
79% of status-200 responses included a Last-Modified header

    > telnet www.cs.odu.edu 80 | tee a1.out
    Trying 128.82.4.2...
    Connected to xenon.cs.odu.edu.
    Escape character is '^]'.
    GET /~ohunsick/index.html HTTP/1.1
    Host: www.cs.odu.edu

    HTTP/1.1 200 OK
    Date: Sun, 27 Jan 2008 22:12:38 GMT
    Server: Apache/2.2.0
    Last-Modified: Sat, 10 Nov 2007 14:22:47 GMT
    ETag: "5caedb-d56-d553dbc0"
    Accept-Ranges: bytes
    Content-Length: 3414
    Content-Type: text/html

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    <meta name="keywords" content="Olena Hunsicker" />

Traces (cont)
If a status-200 response didn’t include a Last-Modified header and the content changed, the resource was assumed to be dynamically generated, and the Date header was used instead.

    > telnet www.cs.odu.edu 80 | tee a2.out
    Trying 128.82.4.2...
    Connected to xenon.cs.odu.edu.
    Escape character is '^]'.
    GET /~ohunsick/index.html HTTP/1.1
    Host: www.cs.odu.edu
    If-Modified-Since: Sat, 10 Nov 2007 14:22:47 GMT

    HTTP/1.1 304 Not Modified
    Date: Sun, 27 Jan 2008 22:16:28 GMT
    Server: Apache/2.2.0
    ETag: "5caedb-d56-d553dbc0"

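The choice between a 200 and a 304 response in the two telnet sessions can be sketched as a small date comparison. This is a minimal sketch, not the server's actual logic: the function name `conditional_status` is hypothetical, and ETag validation is deliberately ignored.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: str,
                       if_modified_since: Optional[str] = None) -> int:
    """Return 200 if the full body must be sent, 304 if the client's copy is current.

    Assumption: only the Last-Modified / If-Modified-Since pair is checked.
    """
    if if_modified_since is None:
        return 200  # unconditional GET: always send the full body
    resource_time = parsedate_to_datetime(last_modified)
    client_time = parsedate_to_datetime(if_modified_since)
    # Not modified since the client's copy was fetched: no body needed
    return 304 if resource_time <= client_time else 200


stamp = "Sat, 10 Nov 2007 14:22:47 GMT"
print(conditional_status(stamp))         # first request, no validator
print(conditional_status(stamp, stamp))  # revalidation with the same timestamp
```
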
Statistics
Table 2. Content type distribution (% by count)
Content-Type                         Accesses    Resources
Images (jpeg & gif)                  69%         64%
Text/html                            20%         24%
Application/octet-stream + others    11%         12%

Results: Access Rate
- 474,000 distinct resources in the AT&T trace
- 105,000 resources (22%) were accessed more than once and returned multiple 200 “OK” or 304 “Not Modified” responses

Results: Change Ratio
Change ratio for a resource = (# new instances of the resource) / (total # references)
- Resources accessed more than once: 13% modified
- Resources accessed 2 or more times: 16.5% modified
- Overall: 15.4% of all resources were modified between accesses

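The change ratio can be sketched for a single resource as follows. This is only an illustration under stated assumptions: each reference is represented by a hypothetical instance identifier (for example a checksum of the body), and the first reference is not counted as a new instance, so a resource that never changes has a ratio of 0.

```python
def change_ratio(instance_ids):
    """Fraction of references that returned a new instance of the resource.

    instance_ids: one identifier per reference, in access order.
    Assumption: the first reference does not count as a new instance.
    """
    if not instance_ids:
        raise ValueError("at least one reference is required")
    new_instances = 0
    previous = instance_ids[0]
    for current in instance_ids[1:]:
        if current != previous:
            new_instances += 1  # content differs from the previous reference
        previous = current
    return new_instances / len(instance_ids)


# Five references; the content changed twice (a -> b, b -> c)
print(change_ratio(["a", "a", "b", "b", "c"]))  # 0.4
```
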
Results: Change Ratio (cont)
Fig. 2 Cumulative distribution of change ratio for the AT&T trace [1]
- Grouped by content type
- HTML only, by # of references

Results: Age
Age = Request time - Last-Modified time
Fig. 3 Grouping data by number of references and resource size
Thus, frequency of access and resource size do not affect the age

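As a worked example of the age formula, the sketch below uses the header values from the sample 200 response on the Traces slide. Treating the Date header as the request time is an assumption made here for illustration, and `resource_age` is a hypothetical helper name.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def resource_age(request_time: str, last_modified: str) -> timedelta:
    """Age = request time - Last-Modified time, both given as HTTP dates."""
    return parsedate_to_datetime(request_time) - parsedate_to_datetime(last_modified)


age = resource_age("Sun, 27 Jan 2008 22:12:38 GMT",   # Date header (assumed request time)
                   "Sat, 10 Nov 2007 14:22:47 GMT")   # Last-Modified header
print(age)  # 78 days, 7:49:51
```
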
Results: Age (cont)
Fig. 4 Grouping data by top-level domain (TLD) (edu, com, gov) and by content type.

Results: Age (cont)
Fig. 5 Grouping data by number of references
- All content types
- HTML only
Conclusion: frequently accessed resources are younger

Results: Modification Interval
Definition: elapsed time between modifications of a resource.
Benefit: helps a cache maintain data consistency.

Results: Modification Interval (cont)
Statistics                                    Results
Measurement by varying the no. of accesses    The interval shrinks as frequency of access increases
Measurement by varying content type           HTML resources change more often than static content types

Content type                Interval
HTML                        15 minutes
Application/octet-stream    1 hour
Images (gif/jpeg)           1 day

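A modification interval can be estimated from the Last-Modified values seen for one resource across a trace. The sketch below is an assumption-laden illustration: `modification_intervals` is a hypothetical helper, the timestamps are made up, and modifications that happen between two accesses are invisible to it.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def modification_intervals(last_modified_headers):
    """Elapsed time between successive distinct Last-Modified values."""
    # Repeated headers mean the resource was unchanged, so collapse them first
    times = sorted({parsedate_to_datetime(h) for h in last_modified_headers})
    return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical observations of one HTML resource
observed = [
    "Sat, 10 Nov 2007 14:22:47 GMT",
    "Sat, 10 Nov 2007 14:22:47 GMT",  # unchanged on the second access
    "Sat, 10 Nov 2007 14:37:47 GMT",  # modified before the third access
]
print(modification_intervals(observed))
```
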
Results: Duplication
A resource can have many replicas available under different URLs on the same or different machines
Benefits of identifying replicas:
- Reduce storage size of the cache
- Reduce the number of accesses to the resource
- The extent of duplication is an important aspect for the HTTP Distribution and Replication Protocol

Results: Duplication (cont)
Fig. 6 Number of hosts compared with number of replicas

Results: Duplication (cont)
Observations:
- 18% of the full-body responses accessing an instance of a particular resource were identical to at least one other instance of a different resource
Possible causes:
- Multiple URLs point to the same resource; for example, going to http://www.apple.com/tiger/ ends up at http://www.apple.com/macosx/
- The same image embedded in two different HTML resources
- Different resources with the same links in their content

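One way to spot identical bodies served under different URLs is to group responses by a digest of their content. This is a minimal sketch and not necessarily how the study detected duplicates: `find_replicas` is a hypothetical helper, SHA-256 is an arbitrary choice of digest, and the response bodies are made up.

```python
import hashlib
from collections import defaultdict


def find_replicas(responses):
    """Group URLs whose full-body responses are byte-for-byte identical.

    responses: iterable of (url, body_bytes) pairs.
    Returns a list of sorted URL groups with more than one member.
    """
    urls_by_digest = defaultdict(set)
    for url, body in responses:
        urls_by_digest[hashlib.sha256(body).hexdigest()].add(url)
    return [sorted(urls) for urls in urls_by_digest.values() if len(urls) > 1]


# Hypothetical bodies; the two apple.com URLs return the same content
trace = [
    ("http://www.apple.com/tiger/", b"<html>Mac OS X</html>"),
    ("http://www.apple.com/macosx/", b"<html>Mac OS X</html>"),
    ("http://www.cs.odu.edu/", b"<html>ODU CS</html>"),
]
print(find_replicas(trace))
```
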
Results: Semantic Differences
Semantically interesting items should:
- have a recognizable pattern (phone numbers, <href ...>, <img ...>, email addresses)
- occur reasonably often
The string “757-200-1111” is not necessarily, but very likely, a phone number

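Recognizable patterns of this kind can be approximated with regular expressions. The patterns below are illustrative assumptions, not the ones used in the study: they are deliberately loose, and `extract_forms` and the sample text are hypothetical.

```python
import re

# Illustrative patterns only; real-world HTML and phone formats vary widely
PATTERNS = {
    "href": re.compile(r'<a\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE),
    "img": re.compile(r'<img\s[^>]*src\s*=\s*["\']?([^"\'\s>]+)', re.IGNORECASE),
    "email": re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+'),
    "phone10": re.compile(r'(?<![\d-])\d{3}-\d{3}-\d{4}(?![\d-])'),
    "phone7": re.compile(r'(?<![\d-])\d{3}-\d{4}(?![\d-])'),
}


def extract_forms(text):
    """Return every match of each pattern, keyed by pattern name."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


sample = ('<a href="http://www.cs.odu.edu/">ODU</a> <img src="logo.gif"> '
          'call 757-200-1111 or 555-1234, mail someone@example.com')
for name, matches in extract_forms(sample).items():
    print(name, matches)
```

As the slide notes, a match such as 757-200-1111 is only likely, not certain, to be a phone number.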
Results: Semantic Differences (cont)
Churn = (# of forms that changed) / (total # of forms)
For example, an instance of a resource has 8 phone numbers. The next instance changes 4 of them: Churn = 4/8 * 100% = 50%

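The churn arithmetic can be sketched for two successive instances of a resource. The assumptions here are that a form counts as changed when it no longer appears in the next instance, and that the total is the number of distinct forms in the earlier instance; `churn` and the phone numbers are hypothetical.

```python
def churn(old_forms, new_forms):
    """Percentage of forms in the earlier instance that changed in the next one.

    Assumption: a form has changed if it is absent from the next instance.
    """
    old, new = set(old_forms), set(new_forms)
    if not old:
        return 0.0  # nothing to compare against
    changed = len(old - new)
    return 100.0 * changed / len(old)


# 8 hypothetical phone numbers, 4 of which are replaced in the next instance
before = ["757-200-%04d" % n for n in range(1000, 1008)]
after = before[:4] + ["757-200-%04d" % n for n in range(2000, 2004)]
print(churn(before, after))  # 50.0
```
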
Results: Semantic Differences (cont)
Table 3. Percentage of instances having a given value of churn [1]
churn    HREF    IMG     Email    10-digit phone    7-digit phone
100%     3.3     4.7     1.4      0.9               3.2
>=75%    5.6     6.2     1.5      1.0               4.9
>=50%    9.7     12.6    2.1                        6.3
>=25%    17.8    24.6    2.6      1.6               7.1
0%       41.2    48.6    96.5     98.0              90.2
Example: in 4.9% of instances, at least 75% of the recognizable 7-digit phone numbers changed from the previous instance

Conclusion
- Many resources change frequently
- Frequency of access, resource age, and frequency of modification:
  - depend on content type and TLD
  - do not depend on resource size
- The assumptions about locality of reference and stability of value for Web caching are valid only for a subset of the resources on the Web

Questions:
1. Earlier studies of servers at Boston and Harvard Universities found that the most popular resources change less frequently than others. Why were their results different?
2. When can multiple URLs refer to the same resource located on the same server?
3. The researchers calculated the age of a resource as Age = Response time - Last-Modified timestamp. How does this differ from the Age header in an HTTP response?

References:
1. Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, Jeffrey Mogul (1997). “Rate of Change and other Metrics: a Live Study of the World Wide Web”. http://www.research.ibm.com/people/f/fdouglis/papers/roc.pdf
2. Craig E. Wills, Mikhail Mikhailov (1999). “Towards a Better Understanding of Web Resources and Server Responses for Improved Caching”. http://www8.org/w8-papers/2a-webserver/towards/towards.html
3. Paul James (2006). “HTTP caching”. http://www.peej.co.uk/articles/http-caching.html
4. “History of the Internet”. http://www.netvalley.com/archives/mirrors/davemarsh-timeline-1.htm
5. “Brief Timeline of the Internet” (2007). http://www.webopedia.com/quick_ref/timeline.asp