Group 3: Olena Hunsicker and Divya Josyula

Presentation transcript:

“Rate of Change and other Metrics: a Live Study of the World Wide Web”
Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, Jeffrey Mogul
Group 3: Olena Hunsicker and Divya Josyula
CS 791/891 "Web Syndication Formats" ODU Spring 2008

Presentation Overview
- Motivation behind the research
- Internet in 1997
- What is a Web cache?
- Traces
- Statistics
- Analyzing the results
  - Access rate
  - Modification times
  - Ages
  - Modification intervals
  - Duplication
  - Semantic differences
- Conclusion

Motivation Behind the Research
Assumptions:
1. A significant number of web resources are accessed more than once (locality of reference)
2. “Those resources don’t change between accesses” [1] (stability of value)
Goals:
- Validate these assumptions
- Measure the benefits of using a shared proxy server
- Calculate the rate and nature of changes of Web resources
- How these metrics depend on:
  - Access rate
  - Resource size
  - Content type
  - Age at the time of reference
  - Internet top-level domain (TLD)
- Frequency of duplicates on the Web

Historic Overview
Table 1. Changes on the Web from 1997 to 2007
1997                      2007
19.5 million hosts [4]    200 million hosts
1 million websites        >92 million websites [5]
Dial-up                   DSL/cable Internet

What is a Web cache?
- Advantages: reduces latency and network traffic
- Proxy servers don’t cache documents that require authorization or include a no-cache header
- Delta-encoding reduces cache misses
Diagram: client browser -> proxy server -> origin server
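The caching rule on this slide can be sketched in Python. This is a deliberately simplified illustration, not a full HTTP cache policy: the function name `proxy_may_cache` and the plain-dict representation of headers are assumptions made for the example, and only the two exclusions named on the slide are checked.

```python
def proxy_may_cache(request_headers, response_headers):
    """Return False when one of the slide's two exclusions applies.

    Simplified sketch: a real shared cache follows many more rules.
    Both arguments are plain dicts mapping header names to values.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    req = {name.lower(): value for name, value in request_headers.items()}
    resp = {name.lower(): value for name, value in response_headers.items()}

    # Exclusion 1: the document required authorization.
    if "authorization" in req:
        return False

    # Exclusion 2: the response carries a no-cache directive.
    directives = [d.strip().lower() for d in resp.get("cache-control", "").split(",")]
    if "no-cache" in directives:
        return False
    if resp.get("pragma", "").strip().lower() == "no-cache":
        return False

    return True
```

For instance, `proxy_may_cache({}, {"Cache-Control": "no-cache"})` returns `False`, while a plain HTML response with no such headers is treated as cacheable.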

Traces
- Static: Web “crawling”; doesn’t provide dynamic access information
- Dynamic: analyzing the proxy or web server log; can reflect access times and modification dates

Traces (cont)
- Amount of data: 19 GBytes
- Duration: 17 days
- Where: gateway between AT&T Labs-Research and the Internet
- Type of data: full contents of all HTTP requests and responses

Traces (cont)
Only 200 “OK” and 304 “Not Modified” HTTP responses were used

Traces (cont)
79% of status-200 responses included a Last-Modified header
    > telnet www.cs.odu.edu 80 | tee a1.out
    Trying 128.82.4.2...
    Connected to xenon.cs.odu.edu.
    Escape character is '^]'.
    GET /~ohunsick/index.html HTTP/1.1
    Host: www.cs.odu.edu

    HTTP/1.1 200 OK
    Date: Sun, 27 Jan 2008 22:12:38 GMT
    Server: Apache/2.2.0
    Last-Modified: Sat, 10 Nov 2007 14:22:47 GMT
    ETag: "5caedb-d56-d553dbc0"
    Accept-Ranges: bytes
    Content-Length: 3414
    Content-Type: text/html

    <html> <head>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    <meta name="keywords" content="Olena Hunsicker" />
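As a rough illustration of how a trace analysis might check for the Last-Modified header, the sketch below parses the response head shown on this slide with Python's standard `email.parser` module. The helper name `parse_response_head` is an assumption for the example, and no network access is involved: the response text is copied from the telnet session.

```python
from email.parser import Parser

# Status line and headers copied from the telnet session on the slide.
RAW_RESPONSE_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Sun, 27 Jan 2008 22:12:38 GMT\r\n"
    "Server: Apache/2.2.0\r\n"
    "Last-Modified: Sat, 10 Nov 2007 14:22:47 GMT\r\n"
    'ETag: "5caedb-d56-d553dbc0"\r\n'
    "Accept-Ranges: bytes\r\n"
    "Content-Length: 3414\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
)


def parse_response_head(raw):
    """Split a raw response head into (status code, header mapping)."""
    status_line, _, header_block = raw.partition("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    # HTTP headers share the "Name: value" syntax of mail headers.
    headers = Parser().parsestr(header_block, headersonly=True)
    return int(code), headers


status, headers = parse_response_head(RAW_RESPONSE_HEAD)
# Header lookup is case-insensitive; a missing header yields None.
has_last_modified = headers["Last-Modified"] is not None
```

Counting `has_last_modified` over all status-200 responses in a trace is one way a figure such as the 79% on the slide could be computed.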

Traces (cont)
If a status-200 response didn’t include a Last-Modified header and the content changed, assume that the resource was dynamically generated and use the Date header
    > telnet www.cs.odu.edu 80 | tee a2.out
    Trying 128.82.4.2...
    Connected to xenon.cs.odu.edu.
    Escape character is '^]'.
    GET /~ohunsick/index.html HTTP/1.1
    Host: www.cs.odu.edu
    If-Modified-Since: Sat, 10 Nov 2007 14:22:47 GMT

    HTTP/1.1 304 Not Modified
    Date: Sun, 27 Jan 2008 22:16:28 GMT
    Server: Apache/2.2.0
    ETag: "5caedb-d56-d553dbc0"
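The fallback rule on this slide can be written as a small function. This is a sketch under stated assumptions: the name `modification_timestamp`, the `content_changed` flag, and the plain-dict headers with exact-case keys are all hypothetical choices for the example, not taken from the paper.

```python
from email.utils import parsedate_to_datetime


def modification_timestamp(headers, content_changed):
    """Choose the modification time for a status-200 response.

    headers: dict with exact-case header names (an assumption of this sketch).
    content_changed: True if the body differs from the previous instance.
    Returns a datetime, or None when the rule gives no answer.
    """
    last_modified = headers.get("Last-Modified")
    if last_modified is not None:
        return parsedate_to_datetime(last_modified)
    if content_changed:
        # No Last-Modified and the body changed: treat the resource as
        # dynamically generated and fall back to the Date header.
        return parsedate_to_datetime(headers["Date"])
    return None
```

With the headers from the 304 exchange above (no Last-Modified, only Date), calling the function with `content_changed=True` would return the Date value, 27 Jan 2008 22:16:28 GMT.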

Statistics
Table 2. Content type distribution
Content-Type                         Accesses (% by count)    Resources (% by count)
Images (jpeg & gif)                  69%                      64%
Text/html                            20%                      24%
Application/octet-stream + others    11%                      12%

Results: Access Rate
- 474,000 distinct resources in the AT&T trace
- 105,000 resources (22%) were accessed more than once and returned multiple 200 “OK” or 304 “Not Modified” responses

Results: Change Ratio
Change ratio for a resource = (# new instances of the resource) / (total # references)
- Resources accessed more than once: 13% modified
- Resources accessed 2 or more times: 16.5% modified
- Overall: 15.4% of all resources were modified between accesses
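A minimal sketch of the change ratio in Python follows. The function name `change_ratio` and the input format (one version identifier per reference, such as a Last-Modified string) are assumptions for illustration. The slide does not say how the first reference is treated, so this sketch assumes it is not counted as a new instance.

```python
def change_ratio(observed_versions):
    """Fraction of references that returned a new instance of a resource.

    observed_versions: one value per reference, in access order, that
    identifies the instance returned (for example its Last-Modified string).
    Assumption: the first reference has nothing to compare against, so it
    is not counted as a new instance.
    """
    if not observed_versions:
        raise ValueError("need at least one reference")
    new_instances = sum(
        1
        for previous, current in zip(observed_versions, observed_versions[1:])
        if current != previous
    )
    return new_instances / len(observed_versions)
```

For example, five references that return versions `a, a, b, b, c` contain two changes, giving a ratio of 2/5.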

Results: Change Ratio (cont)
Fig. 2. Cumulative distribution of change ratio for the AT&T trace [1]. Panels: grouped by content type; HTML only, grouped by number of references

Results: Age
Age = Request time - Last-Modified time
Fig. 3. Grouping data by number of references and resource size
Thus, frequency of access and resource size do not affect the age
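The age formula can be worked through with the timestamps from the earlier telnet example. In this sketch the response's Date header stands in for the request time, which is an assumption, and `resource_age` is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime


def resource_age(request_time, last_modified):
    """Age = request time - Last-Modified time, both given as HTTP dates."""
    return parsedate_to_datetime(request_time) - parsedate_to_datetime(last_modified)


# Date header used as a stand-in for the request time (assumption).
age = resource_age(
    "Sun, 27 Jan 2008 22:12:38 GMT",
    "Sat, 10 Nov 2007 14:22:47 GMT",
)
print(age)  # 78 days, 7:49:51
```

The result is a `datetime.timedelta`, so it can be compared or bucketed directly when grouping resources by age.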

Results: Age (cont)
Fig. 4. Grouping data by top-level domain (TLD) (edu, com, gov) and by content type

Results: Age (cont)
Fig. 5. Grouping data by number of references. Panels: all content types; HTML only
Conclusion: frequently accessed resources are younger

Results: Modification Interval
- Definition: elapsed time between modifications of a resource
- Benefit: helps a cache maintain data consistency
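One way to compute modification intervals from a trace is sketched below. It assumes the distinct Last-Modified times observed for a resource are already available as `datetime` objects; the function name `modification_intervals` is hypothetical.

```python
from datetime import datetime, timezone


def modification_intervals(modification_times):
    """Elapsed time between consecutive distinct modification times."""
    # Repeated observations of the same instance collapse to one timestamp.
    ordered = sorted(set(modification_times))
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical observations for one resource.
observed = [
    datetime(2008, 1, 27, 12, 0, tzinfo=timezone.utc),
    datetime(2008, 1, 27, 12, 15, tzinfo=timezone.utc),
    datetime(2008, 1, 27, 12, 15, tzinfo=timezone.utc),
    datetime(2008, 1, 27, 13, 15, tzinfo=timezone.utc),
]
intervals = modification_intervals(observed)
```

Here `intervals` holds two values, 15 minutes and 1 hour. A trace only sees the modifications that happen to be observed, so intervals computed this way are upper bounds on the true ones.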

Results: Modification Interval (cont)
Statistics                                    Results
Measurement by varying the no. of accesses    The interval shrinks as frequency of access increases
Measurement by varying content type           HTML resources change more often than static content types

Content type                Interval
HTML                        15 minutes
Application/octet-stream    1 hour
Images (gif/jpeg)           1 day

Results: Duplication
A resource can have many replicas available under different URLs on the same or different machines
Benefits of identifying replicas:
- Reduce the storage size of the cache
- Reduce the number of accesses to the resource
The extent of duplication is an important aspect for the HTTP Distribution and Replication Protocol

Results: Duplication (cont)
Fig. 6. Number of hosts compared with number of replicas

Results: Duplication (cont)
Observations:
- 18% of the full-body responses accessing an instance of a particular resource were identical to at least one other instance of a different resource
Possible causes:
- Multiple URLs point to the same resource; for example, if you go to http://www.apple.com/tiger/, you will end up at http://www.apple.com/macosx/
- The same image is embedded in two different HTML resources
- Different resources have the same links in their content
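Replicas of this kind can be detected by comparing digests of response bodies. The sketch below is an illustration only: hashing with SHA-256 is an assumption of this example rather than the method confirmed by the slides, `find_replicas` is a hypothetical name, and the bodies are made up.

```python
import hashlib
from collections import defaultdict


def find_replicas(bodies_by_url):
    """Group URLs whose response bodies are byte-for-byte identical."""
    urls_by_digest = defaultdict(list)
    for url, body in bodies_by_url.items():
        urls_by_digest[hashlib.sha256(body).hexdigest()].append(url)
    # Keep only digests shared by more than one URL.
    return [sorted(urls) for urls in urls_by_digest.values() if len(urls) > 1]


# Hypothetical bodies; the two Apple URLs are the ones named on the slide.
replicas = find_replicas({
    "http://www.apple.com/tiger/": b"<html>Mac OS X</html>",
    "http://www.apple.com/macosx/": b"<html>Mac OS X</html>",
    "http://www.cs.odu.edu/": b"<html>ODU CS</html>",
})
```

Storing one copy per digest instead of one per URL is what yields the cache storage savings mentioned on the previous slide.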

Results: Semantic Differences
Semantically interesting items should:
- have a recognizable pattern (phone numbers, <href ...>, <img ...>, email addresses)
- occur reasonably often
The string “757-200-1111” is not necessarily a phone number, but it likely is one
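Pattern recognition of this kind is commonly done with regular expressions. The pattern below is an assumption chosen to match strings shaped like the slide's example and is not the pattern used in the study. As the slide notes, a match is only likely, not certain, to be a phone number.

```python
import re

# Three digits, three digits, four digits, separated by hyphens.
PHONE_10_DIGIT = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def find_phone_like(text):
    """Return every substring shaped like a 10-digit phone number."""
    return PHONE_10_DIGIT.findall(text)


matches = find_phone_like("Call 757-200-1111 for details, room 12-34.")
```

Here `matches` contains only `"757-200-1111"`; the string `12-34` does not fit the pattern.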

Results: Semantic Differences (cont)
Churn = (# of forms that changed) / (total # of forms)
For example, an instance of a resource has 8 phone numbers, and the next instance of the resource changes 4 of them:
Churn = 4/8 * 100% = 50%
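The churn example can be reproduced with a short function. How a "changed" form is counted is not spelled out on the slide, so this sketch assumes a form has changed when it appears in the previous instance but has no match in the next one. The name `churn_percent` and the sample phone numbers are hypothetical.

```python
from collections import Counter


def churn_percent(previous_forms, current_forms):
    """Percentage of forms in the previous instance that changed.

    Assumption: a form counts as changed when it occurs in the previous
    instance and has no matching occurrence in the current instance.
    """
    total = len(previous_forms)
    if total == 0:
        return 0.0
    # Multiset difference keeps the previous forms left unmatched.
    changed = sum((Counter(previous_forms) - Counter(current_forms)).values())
    return 100.0 * changed / total


# Hypothetical data: 8 phone numbers, 4 of which are replaced.
previous = [f"757-200-11{i:02d}" for i in range(8)]
current = previous[:4] + [f"757-200-22{i:02d}" for i in range(4)]
result = churn_percent(previous, current)  # 50.0
```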

Results: Semantic Differences (cont)
Table 3. Percentage of instances having a given value of churn [1]
Churn    HREF    IMG     Email    10-digit phone    7-digit phone
100%     3.3     4.7     1.4      0.9               3.2
>=75%    5.6     6.2     1.5      1.0               4.9
>=50%    9.7     12.6    2.1                        6.3
>=25%    17.8    24.6    2.6      1.6               7.1
0%       41.2    48.6    96.5     98.0              90.2
Example: in 4.9% of instances, at least 75% of the recognizable 7-digit phone numbers changed between instances

Conclusion
- Many resources change frequently
- Frequency of access, resource age, and frequency of modification depend on content type and TLD; they do not depend on resource size
- The assumptions about locality of reference and stability of value for Web caching are valid only for a subset of the resources on the Web

Questions:
1. Earlier studies of servers at Boston and Harvard Universities found that the most popular resources change less frequently than others. Why were their results different?
2. When can multiple URLs refer to the same resource located on the same server?
3. The researchers used this formula to calculate the age of a resource: Age = Response time - Last-Modified time stamp. How is it different from the Age header in an HTTP response?

References:
[1] Fred Douglis, Anja Feldmann, Balachander Krishnamurthy, Jeffrey Mogul (1997). “Rate of Change and other Metrics: a Live Study of the World Wide Web”. http://www.research.ibm.com/people/f/fdouglis/papers/roc.pdf
[2] Craig E. Wills, Mikhail Mikhailov (1999). “Towards a Better Understanding of Web Resources and Server Responses for Improved Caching”. http://www8.org/w8-papers/2a-webserver/towards/towards.html
[3] Paul James (2006). “HTTP caching”. http://www.peej.co.uk/articles/http-caching.html
[4] “History of the Internet”. http://www.netvalley.com/archives/mirrors/davemarsh-timeline-1.htm
[5] (2007) “Brief Timeline of the Internet”. http://www.webopedia.com/quick_ref/timeline.asp