Web Measurement Chapter 7, Section 7.3 Hessam Mirsadeghi ECE Department, University of Tehran Fall 2009.


1 Web Measurement Chapter 7, Section 7.3 Hessam Mirsadeghi ECE Department, University of Tehran Fall 2009

2 Outline Web measurement motivation Properties of interest Challenges of web measurement Web measurement tools State of the art  Web properties  Web traffic data gathering and analysis  Web performance  Web applications 2 Web Measurement University of Tehran

3 Motivation The single most popular Internet application. Measurement can be very useful. The single largest application studied in Internet measurement 75% of the Internet traffic in the first decade of existence Around a billion web users 3 Web Measurement University of Tehran

4 Properties of Interest University of Tehran Web Measurement 4

5 Web Measurement Properties Web is at the most-visible level for users Some of the properties are decomposable into components at other layers of protocol stack Web latency  DNS, TCP, HTTP  Web server delay  Client-side rendering 5 Web Measurement University of Tehran

6 Web Measurement Properties (cont’d) 6 Web Measurement University of Tehran

7 High-Level Characterization Measuring fraction of web traffic  Measuring the use of HTTP protocol Considerable traffic over HTTP while the clients and servers are p2p nodes Here we consider web traffic that involves web clients communicating with a web server 7 Web Measurement University of Tehran

8 High-Level Characterization (cont’d) Knowledge of entities involved in web transactions  Clients, proxies, servers Measuring the count and growth of web entities  Providing insight on how the web has evolved and is being used  e.g. number of clients behind a proxy provides insights on the extent of caching 8 Web Measurement University of Tehran

9 Location Identifying where clients and proxies are present can help content providers move resources closer to them Location data can help businesses tailor content, manner of delivery, and consider alternate architectural improvements in placement of services. Network and physical location 9 Web Measurement University of Tehran

10 Configuration Different server configurations impact performance Client and proxy configurations Protocol variants supported Compliance with protocol specification Client connectivity

11 User Workload Models How resources are accessed within a web site  reconfiguring the web site  modifying the resources  Alternatives for delivery of popular resources Constructing models for “think-time” of users Help in dealing with the new classes of users Modeling novel phenomena such as flash crowds and attacks 11 Web Measurement University of Tehran

12 Traffic Properties Reduction of redundant transfers and sudden surges  Caching the resources Cacheability of resources, deployment and use of caches, performance of caches Handling circumstances like flash crowds 12 Web Measurement University of Tehran

13 Application Demands Better understanding of the interaction between the application and transport-level protocols  Improvements in the protocols  Reducing time-to-glass The actual flow of a web transaction from the user click to displaying data 13 Web Measurement University of Tehran

14 Web Performance Dominating much of the web measurement work Popularity of a web site is highly dependent on its performance Finding ways to reduce delays Sources of slowdowns

15 Challenges of web measurement University of Tehran Web Measurement 15

16 Challenges to Measurement Application-level nature Dependence on multiple protocols  DNS, TCP, HTTP Large sets of entities with varying configurations Equally diverse user population 16 Web Measurement University of Tehran

17 Challenges to Measurement (cont’d) Hidden data Hidden layers Hidden entities 17 Web Measurement University of Tehran

18 Hidden Data Much of the traffic is intra-net and inaccessible. Access to remote server data, even old logs is often unavailable. From the server end, information about the clients (e.g. connection bandwidth) is obscured. New pages are constantly added, old ones removed or modified. 18 Web Measurement University of Tehran

19 Hidden Data (cont’d) Access information for web pages is not accessible. TCP configuration parameters significantly impact performance and cannot be remotely ascertained  Tools like TBIT for testing impacts of TCP variants like Reno, Tahoe, or Vegas

20 Hidden Layers Protocol and network layers are harder to measure.  Requires both deep knowledge of the network protocol as well as an understanding of the precise interactions between the different network protocols Not knowing the number of end-clients due to proxies. Requests may be redirected at different layers of the protocol to different servers.  Redirections can happen at DNS, TCP, or HTTP level. 20 Web Measurement University of Tehran

21 Hidden Layers (cont’d) [Figure: a client requesting Index.html, whose embedded objects (Foo1.jpg, Foo2.jpg, Foo3.jpg, ad1, ad2, ad3) are spread across the origin server, CDN Servers 1 and 2, and Ad Servers 1, 2, and 3]

22 Hidden Entities Proxies, HTTP and TCP redirectors Transparent interception proxies, return results from a cache. Different behavior of switches for web-related and non web- related traffic Lack of predictability due to multiple hidden entities at various layers of protocol stack. 22 Web Measurement University of Tehran

23 Web Measurement Tools University of Tehran Web Measurement 23

24 Tools: Estimation of Web Traffic Since the start of the 21st century, peer-to-peer traffic has taken the lead in terms of number of bytes Web still remains the number one application in terms of active users Almost 1 billion Internet users, a vast majority of whom use the web

25 Tools: Sampling & DNS Netflow: traffic to the HTTP port (80) DNS traces to see what IP addresses are looked up  Well-known web servers are likely to be high 25 Web Measurement University of Tehran
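As a rough illustration of the port-based estimate on this slide, the sketch below computes the share of bytes carried by flows that touch port 80. The `FlowRecord` type and the sample values are hypothetical and far simpler than real NetFlow exports; the approach also inherits the caveat from slide 7 that port 80 traffic is not always client-to-server web traffic.

```python
from dataclasses import dataclass

HTTP_PORT = 80


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, simplified flow record (real NetFlow records carry more fields)."""
    src_port: int
    dst_port: int
    byte_count: int


def http_byte_fraction(flows):
    """Return the fraction of bytes in flows where either endpoint uses port 80."""
    total = sum(f.byte_count for f in flows)
    if total == 0:
        return 0.0
    http = sum(f.byte_count for f in flows
               if HTTP_PORT in (f.src_port, f.dst_port))
    return http / total


# Made-up sample: two HTTP flows (request and response) and one non-HTTP flow.
sample_flows = [
    FlowRecord(src_port=51000, dst_port=80, byte_count=600),
    FlowRecord(src_port=80, dst_port=51000, byte_count=300),
    FlowRecord(src_port=6881, dst_port=6882, byte_count=100),
]
```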

26 Tools: Server Logs Number of requests and clients are logged in web server logs Web log analyzers for generating statistics Presence of obscured data  Proxies Inter-arrival time of requests Range and diversity of resources requested  Crawlers and Spiders Disproportionate number of requests from one or a few IP addresses  Anonymizers  Caches
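One way to look for the "disproportionate number of requests" signal in a server log is to compute each client address's share of all requests. This is only a sketch: the 20% threshold is an arbitrary assumption, and the input is assumed to be a list of client IP strings already extracted from the log.

```python
from collections import Counter


def heavy_hitters(client_ips, share_threshold=0.2):
    """Return {ip: share} for addresses issuing at least share_threshold of all requests.

    share_threshold is an illustrative value, not one taken from the slides.
    """
    counts = Counter(client_ips)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ip: n / total for ip, n in counts.items()
            if n / total >= share_threshold}


# Made-up log extract: one address issues 6 of 10 requests.
sample_ips = ["192.0.2.1"] * 6 + ["192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5"]
```

An address flagged this way could be a proxy, anonymizer, cache, or spider; telling them apart needs the other signals listed on the slide, such as inter-arrival times and the range of resources requested.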

27 Tools: Surveys Estimating the number of web servers (Netcraft) Important metric: number and identity of popular web servers  Business, technical, and social implications 27 Web Measurement University of Tehran

28 Tools: Locating Entities An increasingly difficult problem Server resources are distributed geographically  Large number of resources  Increase availability  Being closer to clients Several businesses can use the same server farm to increase utilization. Locating clients: simple ‘traceroute’, techniques such as network aware clustering

29 Tools: Structural View The linkage structure on web pages HITS algorithm for identifying hubs and authorities  Hub: a page having multiple high-value links about a topic  Authority: the page having high-quality content on a given topic  Web pages as nodes and links as edges in a graph model Page rankings and Improvement of web searching 29 Web Measurement University of Tehran
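The hub and authority definitions above can be sketched with the usual mutual-reinforcement iteration: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The slide only names HITS, so the iteration count, the normalization, and the toy graph here are assumptions.

```python
def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the list of pages it links to.

    Returns (hub_scores, authority_scores) as dicts keyed by page.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority update: add up hub scores of pages that link here.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: add up authority scores of pages linked from here.
        hub = {page: sum(auth[t] for t in links.get(page, ())) for page in pages}
        auth = _normalize(auth)
        hub = _normalize(hub)
    return hub, auth


# Toy graph with made-up page names.
toy_links = {"portal": ["a", "b", "c"], "blog": ["a", "b"]}
```

On this toy graph, "portal" ends up as the strongest hub, and "a" and "b" outrank "c" as authorities because two hubs point to them.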

30 Tools: Web Searching & Crawling One of the most important www applications Components:  Crawler: traverses the accessible part of the web to fetch web pages  Indexer: indexes the crawled pages  Search tool: accepts queries and returns pointers to the matching pages 30 Web Measurement University of Tehran

31 Tools: Web Performance Measuring a particular web site’s latency and availability from diverse client perspectives. Examining different latency components such as DNS, TCP or HTTP differences, and CDNs Global measurements of the web to examine protocol compliance and ensure reduction of outages.

32 Tools: Web Performance (cont’d) A variety of companies offer such services:  Keynote, Akamai, eValid Test Suite, etc. A common technique: a distributed set of monitors around the world sending periodic requests to web sites.

33 Tools: Network Aware Clustering An effective technique to group IP addresses into clusters quickly and automatically  Non-overlapping cluster  Being close topologically  Common administrative control Clustering by use of BGP routing table snapshots and longest prefix matching.  Same prefix → same cluster 33 Web Measurement University of Tehran
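A minimal sketch of the "same prefix, same cluster" rule, using Python's standard `ipaddress` module. The prefix list stands in for a BGP routing table snapshot and, like the client addresses, is made up; the linear scan is for clarity only, and a real implementation over a full routing table would more likely use a trie.

```python
import ipaddress
from collections import defaultdict


def longest_prefix_match(ip, prefixes):
    """Return the most specific network in prefixes containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for net in prefixes:
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best


def cluster_clients(client_ips, bgp_prefixes):
    """Group client IPs by their longest matching prefix.

    Clients matching no prefix are collected under the key "unclustered".
    """
    prefixes = [ipaddress.ip_network(p) for p in bgp_prefixes]
    clusters = defaultdict(list)
    for ip in client_ips:
        match = longest_prefix_match(ip, prefixes)
        key = str(match) if match is not None else "unclustered"
        clusters[key].append(ip)
    return dict(clusters)


# Hypothetical prefixes and clients (documentation address ranges).
sample_prefixes = ["192.0.2.0/24", "192.0.2.128/25", "198.51.100.0/24"]
sample_clients = ["192.0.2.10", "192.0.2.200", "198.51.100.7", "203.0.113.5"]
```

Because each address is assigned to exactly one longest match, the resulting clusters do not overlap, which matches the non-overlapping property listed on the slide.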

34 Tools: Network Aware Clustering (cont’d) BGP routing table snapshot 34 Web Measurement University of Tehran

35 Tools: Network Aware Clustering (cont’d) Application  Used to group client IP addresses in web server logs  Recognizing proxies and spiders  Better content access prediction  etc 35 Web Measurement University of Tehran

36 Tools: Network Aware Clustering (cont’d) 36 Total server log Client containing spider Cluster containing proxy Web Measurement University of Tehran

37 Tools: Network Aware Clustering (cont’d) 37 Balachander Krishnamurthy and Jia Wang. On Network-Aware Clustering of Web Clients. In Proceedings of ACM Sigcomm, August 2000. Web Measurement University of Tehran

38 Tools: Handling Mobile Clients 38 Jesse Steinberg and Joseph Pasquale. A Web Middleware Architecture for Dynamic Customization of Content for Wireless Clients. In Proceedings of the World Wide Web Conference, May 2002. Web Measurement University of Tehran

39 Tools: Handling Mobile Clients (cont’d) 39 Figure 3. Document Browsing with Summarizer on WAP Christopher C. Yang and Fu Lee Wang. Fractal Summarization for Mobile Devices to Access Large Documents on the Web. In Proceedings of the World Wide Web Conference, May 2003. Web Measurement University of Tehran

40 Tools: Handling Mobile Clients (cont’d) Continued growth in mobile web Wireless network delays Tailored content Similar methods:  Server logs of mobile content providers  Lab experiments (e.g., emulate mobile devices, induce packet loss)  Wide-area experiments

41 State of the Art University of Tehran Web Measurement 41

42 State of the Art Four main parts of web measurement: Web properties, Traffic gathering and analysis, Performance issues, Applications

43 Web Properties: High Level Reduction in web traffic estimation  Unreachable data Firewalls and other barriers due to attacks Use of internal web sites  The shift from Web to P2P Around a million new sites a month (Netcraft) 43 Web Measurement University of Tehran

44 Web Properties: High Level (cont’d) 60 million web sites in fall 2004  A vast fraction have little or no traffic compared to the top few hundred. Apache and Microsoft server implementations together have 90% of the market (68% for Apache ) 44 Web Measurement University of Tehran

45 Web Properties: High Level (cont’d) 45 Netcraft survey. (news.netcraft.com) Web Measurement University of Tehran

46 Web Properties: High Level (cont’d) 46 Netcraft survey. (news.netcraft.com) Web Measurement University of Tehran

47 Web Properties: Location Steadily growing number of users are in Asian countries such as China and India. The fraction of web content from the US and Europe is falling. Implications on where servers will be mirrored and supported languages. 47 Web Measurement University of Tehran

48 Web Properties: Configuration Popular sites use a variety of techniques to improve server performance:  Distribute servers geographically (e.g. 3 world cup servers in the U.S., 1 in France)  Redirecting requests to the least loaded server in a farm.  Caching frequently requested resources 48 Web Measurement University of Tehran

49 Web Properties: User Workload Models We measure user workload by looking at:  the duration of HTTP connections  request and response sizes,  unique number of IP addresses contacting a given Web site  number and frequency of accesses of individual resources at a given Web site  etc. 49 Web Measurement University of Tehran

50 Web Properties: Access Dynamics Web page access has been experimentally verified to follow a Zipf-like distribution. Zipf’s law:  Probability of a request to the ith most popular page is proportional to 1/i
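To make the proportionality concrete, the sketch below normalizes the weights 1/i over a finite set of pages so they sum to 1. The `alpha` exponent is an added assumption to cover "Zipf-like" variants (probability proportional to 1/i^alpha); with `alpha=1.0` it reduces to the law as stated on the slide.

```python
def zipf_probabilities(n_pages, alpha=1.0):
    """Return request probabilities for ranks 1..n_pages under a Zipf-like law.

    The probability of the i-th most popular page is proportional to 1 / i**alpha.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# With four pages and alpha=1, the top page is requested twice as often as
# the second and four times as often as the fourth.
probs = zipf_probabilities(4)
```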

51 State of the Art Four main parts of web measurement: Web properties, Traffic gathering and analysis, Performance issues, Applications

52 Web Traffic: Critical Path Analysis Constructing critical path to understand where delays are introduced in web requests  Packet propagation  Network variation (e.g. queuing at routers)  Packet loss  Delay at server and client 52 Web Measurement University of Tehran

53 Web Traffic: Critical Path Analysis (cont’d) Only some of the components are responsible for overall response time Importance of activities on the critical path 53 Web Measurement University of Tehran
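Critical path analysis can be sketched as a longest-path computation over a dependency graph of activities: only activities on the longest chain determine the overall response time. The activity names, durations, and dependencies below are invented for illustration, and the sketch assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, depends_on):
    """Return (total_time, path) for the longest dependency chain.

    durations maps activity -> time; depends_on maps activity -> set of
    prerequisite activities. Every prerequisite must appear in durations.
    """
    graph = {act: set(depends_on.get(act, ())) for act in durations}
    finish = {}       # earliest finish time of each activity
    predecessor = {}  # prerequisite that finishes last, or None
    for act in TopologicalSorter(graph).static_order():
        latest = max(graph[act], key=lambda p: finish[p], default=None)
        start = finish[latest] if latest is not None else 0
        finish[act] = start + durations[act]
        predecessor[act] = latest
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = predecessor[node]
    return total, path[::-1]


# Hypothetical page load, durations in milliseconds.
durations_ms = {
    "dns_lookup": 30, "tcp_handshake": 40, "fetch_html": 100,
    "fetch_image": 80, "fetch_css": 20, "render": 10,
}
dependencies = {
    "tcp_handshake": {"dns_lookup"},
    "fetch_html": {"tcp_handshake"},
    "fetch_image": {"fetch_html"},
    "fetch_css": {"fetch_html"},
    "render": {"fetch_image", "fetch_css"},
}
```

In this made-up example the stylesheet fetch is off the critical path, so speeding it up would not change the total, whereas any saving on the image fetch would.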

54 Web Traffic: Software Aid httperf:  Sends HTTP requests and processes responses  Simulates workload  Gathers statistics  Supports HTTP/1.1  Freely available in source code 54 Web Measurement University of Tehran

55 Web Traffic: Software Aid (cont’d) wget  Fetches a large number of pages rooted at a particular node.  Can fetch all the pages up to a certain “level” according to links Mercator (a personalized crawler)  Uses a seed page and then does breadth-first search on the links to find pages.  Higher weight for pages having more incoming links. 55 Web Measurement University of Tehran
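The breadth-first traversal from a seed page, limited to a certain link "level", can be sketched over an in-memory link graph. This is not the implementation of wget or Mercator: `get_links` is a hypothetical callback standing in for fetching and parsing a page, and the weighting by incoming links mentioned above is not modeled.

```python
from collections import deque


def bfs_crawl(seed, get_links, max_depth):
    """Return pages in breadth-first order, at most max_depth links from seed.

    get_links(page) must return an iterable of pages linked from page.
    """
    visited = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in visited:
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph with made-up page names; "b" links back to the seed.
toy_web = {"index": ["a", "b"], "a": ["c"], "b": ["c", "index"], "c": ["d"], "d": []}
crawl_order = bfs_crawl("index", lambda page: toy_web.get(page, []), max_depth=2)
```

The `visited` set keeps the crawl from looping on cycles such as the link from "b" back to "index".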

56 Web Traffic: Software Aid (cont’d) Detailed study in 2000 of 33 million requests from over 50,000 wireless and PDA users.  Top 1% of notifications responsible for 60% of content.  Notification messages had Zipf-like distribution  For popularity: 0.5% of URLs were accessed 90% of the time. In another study:  Threefold increase in average daily traffic per wireless card between Fall 2003 and Winter 2004 56 Web Measurement University of Tehran

57 Web Traffic: Wireless Users [Figure: number of active cards per week at Dartmouth] Tristan Henderson, David Kotz, and Ilya Abyzov. The Changing Usage of a Mature Campus-wide Wireless Network. In Proceedings of ACM Mobicom, September 2004.

58 State of the Art Four main parts of web measurement: Web properties, Traffic gathering and analysis, Performance issues, Applications

59 Web Performance: Intro User-perceived latency is a key factor because it affects the popularity of a site. Beyond a certain delay, the number of users who cancel the page load increases sharply.

60 Web Performance: CDNs Busy servers outsource delivery of some of their pages CDNs combine the workload of several sites into a single provider. Mirroring the CDNs to be located near clients. DNS-based redirection DNS overhead is a serious bottleneck in some CDNs 60 Web Measurement University of Tehran

61 Web Performance: CDNs (cont’d) Motivation: More hops between client and Web server => more congestion! Same data flowing repeatedly over links between clients and Web server [Figure: server S and clients C1, C2, C3, C4 connected through IP routers]

62 Web Performance: CDNs (cont’d) Caches [Figure: Web server www.cnn.com, user merlot.cis.udel.edu, 1,000,000 other hosts, and an ISP caching proxy; labels: new content (“WTC News!”), old content, request, congestion/bottleneck]

63 Web Performance: CDNs (cont’d) Caching problems: Caching proxies serve only their clients, not all users on the Internet Content providers (say, Web servers) cannot rely on existence and correct implementation of caching proxies Accounting issues with caching proxies. For instance, www.cnn.com needs to know the number of hits to the webpage for advertisements displayed on the webpage

64 Web Performance: CDNs (cont’d) [Figure: Web server www.cnn.com, user merlot.cis.udel.edu, and 1,000,000 other users; a distribution infrastructure with mirrors in FL, IL, DE, NY, MA, MI, CA, and WA; labels: new content (“WTC News!”), request]

65 Web Performance: CDNs (cont’d) Overlay network to distribute content from origin servers to users Avoids large amounts of same data repeatedly traversing potentially congested links on the Internet Reduces Web server load Reduces user perceived latency

66 DNS-based Request Routing [Figure: merlot.cis.udel.edu (128.4.30.15) sends a DNS query for www.cnn.com to its local DNS server louie.udel.edu (128.4.4.12), which queries the Akamai DNS; the DNS response is A 145.155.10.15, and the session then goes to that surrogate. Akamai CDN surrogates shown: 145.155.10.15 and 58.15.100.152; names shown: delaware.cnn.akamai.com, california.cnn.akamai.com] Q: How does the Akamai DNS know which surrogate is closest?

67 DNS-Based Request Routing (cont’d) [Figure: the DNS query/response and session flow among merlot.cis.udel.edu (128.4.30.15), local DNS server louie.udel.edu (128.4.4.12), the Akamai DNS, and Akamai CDN surrogates for www.cnn.com; labels: measure to client DNS, measurement results, measurements]

68 DNS-Based Redirection Problem:  The content server is optimized for the local name server, not the actual client  Client may be far from name server  In a study, only 16% of the clients were in the same network-aware cluster as the local DNS server 68 Web Measurement University of Tehran

69 Total & Selective Redirection 1. Total redirection  Any request for origin server is redirected to CDN  Basically, CDN takes control of content provider’s DNS zone  Benefit: All requests are automatically redirected  Disadvantage: May send lots of traffic to CDN, hence expensive for the content provider 2. Selective redirection  Content provider marks which objects are to be served from CDN  Typically, larger objects like images are selected  Refer to images as:  Pro: Fine-grained control over what gets delivered  Con: Have to (manually) mark content for CDN 69 Web Measurement University of Tehran

70 Surrogate Server CDN Origin Server Client GET index.html GET image1.gif, image2.gif index.html, image1.gif, image2.gif Total Redirection 70 index.html embedded image1.gif image2.gif Web Measurement University of Tehran

71 Origin Server Surrogate Server CDN Client GET index.html GET image1.gif, image2.gif image1.gif, image2.gif Partial Redirection 71 index.html embedded image1.gif image2.gif Web Measurement University of Tehran

72 Total vs. Selective Redirection Total redirection has clearly superior performance Selective redirection is typically slower than downloading everything from the origin server  But origin server might be loaded… Which redirection is more used?  Initially, selective redirection was used  These days, mainly total redirection 72 Web Measurement University of Tehran

73 Web Performance: Client Connectivity Finding clients’ connection quality Delivering the most suitable version of content  Sending just the base document  Using compression Tailoring server’s policy  Keep persistent connections open longer Measure the inter-arrival time of requests to classify clients. 73 Web Measurement University of Tehran

74 Web Performance: Client Connectivity (cont’d) Stability of client classification Classifying new clients using network-aware clustering  same cluster → same class Classification works best for sites having a variety of clients.

75 Web Performance: Client Connectivity (cont’d) Balachander Krishnamurthy, Craig E. Wills, Yin Zhang, and Kashi Vishwanath. Design, Implementation, and Evaluation of a Client Characterization Driven Web Server. In Proceedings of the World Wide Web Conference, May 2003. Server action conclusions: - Compression: consistently good results for poorer but not well-connected clients. - Reducing the quality of objects only yielded benefits for a modem client. - Bundling was effective when there was good connectivity or poor connectivity with large latency. - Persistent connections with serialized requests did not show significant improvement. - Pipelining was only significant for clients with high throughput or RTT.

76 Web Performance: Protocol Compliance A 16-month study used the httperf tool to test for HTTP protocol compliance. Absence of required headers (such as date) Nearly half the servers did not implement range requests. Inability to handle long URIs in a graceful manner. The popular Apache server was most compliant, then Microsoft’s IIS. 76 Web Measurement University of Tehran

77 State of the Art Four main parts of web measurement: Web properties, Traffic gathering and analysis, Performance issues, Applications

78 Web Applications: Searching In 1999, 200 million pages and 1.5 billion links were examined. The probability of a node having in-degree i is proportional to 1/i^x (x > 1). Nodes with a large in-degree are considered “high rank”  Used frequently in search engines  Sites may use fake linkages to trick crawlers.

79 Web Applications: Searching (cont’d) A four-part separation in web structure.  A central core  Two parts connected to the core  One part with no connection to the core  All the components have roughly equal number of pages! 79 Web Measurement University of Tehran

80 Web Applications: Searching (cont’d) Over 90% of web pages are connected to each other when link direction is ignored. Following links in their actual direction, the probability that a path exists from one random page to another is only about 0.25. The well-connected component will remain connected even if we remove nodes with large degrees (hubs).

81 Web Applications: Searching (cont’d) Image resources change infrequently. Many text documents change periodically. Some studies have tried to model the rate of change of pages as a Poisson process. Some studies have examined the rate of change in different domains (e.g., .com vs. .org).
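Under a Poisson model with change rate λ, the probability that a page changes at least once within an interval t is 1 − e^(−λt). The sketch below evaluates that formula; the rates and intervals are hypothetical, and the slide does not say which estimator the cited studies used, so the simple "changes per day" estimate is an assumption.

```python
import math


def estimate_rate(num_changes, observed_days):
    """Naive rate estimate: observed changes divided by observation time (per day)."""
    return num_changes / observed_days


def prob_changed(rate_per_day, days):
    """Probability of at least one change within `days` under a Poisson model."""
    return 1.0 - math.exp(-rate_per_day * days)


# Hypothetical page seen to change 4 times over 28 days (about once a week).
weekly_rate = estimate_rate(4, 28)
p_one_day = prob_changed(weekly_rate, 1)
p_one_week = prob_changed(weekly_rate, 7)
```

A crawler could use such a probability to decide how often to revisit a page, revisiting frequently changing pages sooner.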

82 Web Applications: Searching (cont’d) 150 web sites were studied over a 7-month period.  Incoming links of the pages were computed  Rich getting richer!  Pages in the bottom 60% ranking received no additional links.  Need for a change in how search engines rank pages.

83 Web Applications: Searching (cont’d) A study examined several subsets of pages.  A significant fraction of links were dead, with an impact on crawling and page ranking.  Over 50% dead links in some cases.  Faster crawling and more useful ranking by avoiding dead links.

84 Web Applications: Flash Crowds Large number of legitimate and wanted requests (unlike DoS attacks in which the requests are not wanted) During flash crowds  Same average number of requests per client  No increase in the number of client clusters  Between 60% and 82% of the resources are accessed only at this time.  Less than 10% of the resources are responsible for 90% of the requests. DoS attackers have no way of knowing the typical distribution of client clusters.  Many new clusters emerge.

85 Flash Crowd vs DoS Attack Flash crowd  Increase in number of clients  Fixed number of clusters DoS attack  Increase in number of both clients and clusters University of Tehran Web Measurement 85
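The contrast above suggests a simple heuristic: compare how much the number of clients and the number of client clusters grew relative to a baseline period. The sketch below encodes only that rule. The 1.5x growth threshold and the labels are assumptions made for illustration, and a real detector would need more evidence than two counts.

```python
def classify_surge(base_clients, base_clusters, surge_clients, surge_clusters,
                   growth_threshold=1.5):
    """Label a traffic surge from client and cluster counts.

    growth_threshold is an arbitrary illustrative value: a count is treated
    as "increased" when it grows by at least this factor over the baseline.
    """
    client_growth = surge_clients / base_clients
    cluster_growth = surge_clusters / base_clusters
    if client_growth < growth_threshold:
        return "no surge"
    if cluster_growth >= growth_threshold:
        # Clients and clusters both grew: many previously unseen clusters.
        return "possible DoS attack"
    # More clients, but drawn from roughly the same set of clusters.
    return "possible flash crowd"
```

For example, a jump from 1,000 to 5,000 clients with clusters going from 200 to 220 would be labeled a possible flash crowd, while the same jump with clusters going from 200 to 900 would be labeled a possible DoS attack.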

86 Web Applications: Blogs Providing early warning of flash crowds Different rate of change compared to traditional web pages Having many references, the same as popular web sites Significant fraction of links going to other blogs having significantly more self-references

