Published by Kourtney Brame. Modified over 10 years ago.
1
Web Measurement
Chapter 7, Section 7.3
Hessam Mirsadeghi
ECE Department, University of Tehran
Fall 2009
2
Outline Web measurement motivation Properties of interest Challenges of web measurement Web measurement tools State of the art Web properties Web traffic data gathering and analysis Web performance Web applications 2 Web Measurement University of Tehran
3
Motivation
- The Web is the single most popular Internet application.
- Measurement can be very useful.
- The Web is the single largest application studied in Internet measurement.
- It accounted for 75% of Internet traffic in its first decade of existence.
- Around a billion web users
4
Properties of Interest
5
Web Measurement Properties
- The Web is the layer most visible to users.
- Some properties decompose into components at other layers of the protocol stack.
- Example: web latency
  - DNS, TCP, HTTP
  - Web server delay
  - Client-side rendering
6
Web Measurement Properties (cont’d)
7
High-Level Characterization
- Measuring the fraction of traffic that is web traffic
- Measuring the use of the HTTP protocol
  - Considerable traffic runs over HTTP even though the clients and servers are P2P nodes.
- Here we consider only web traffic in which web clients communicate with a web server.
8
High-Level Characterization (cont’d)
- Knowledge of the entities involved in web transactions
  - Clients, proxies, servers
- Measuring the count and growth of web entities
  - Provides insight into how the web has evolved and is being used
  - e.g., the number of clients behind a proxy provides insight into the extent of caching
9
Location
- Identifying where clients and proxies are present can help content providers move resources closer to them.
- Location data can help businesses tailor content and manner of delivery, and consider alternate architectural improvements in the placement of services.
- Network and physical location
10
Configuration
- Different server configurations impact performance.
- Client and proxy configurations
  - Protocol variants supported
  - Compliance with the protocol specification
- Client connectivity
11
User Workload Models
- How resources are accessed within a web site, which helps in
  - reconfiguring the web site
  - modifying the resources
- Alternatives for delivery of popular resources
- Constructing models for the “think time” of users
- Help in dealing with new classes of users
- Modeling novel phenomena such as flash crowds and attacks
12
Traffic Properties
- Reduction of redundant transfers and sudden surges
- Caching the resources
  - Cacheability of resources, deployment and use of caches, performance of caches
- Handling circumstances like flash crowds
13
Application Demands
- Better understanding of the interaction between the application and transport-level protocols
- Improvements in the protocols
- Reducing time-to-glass
  - The actual flow of a web transaction, from the user click to displaying data
14
Web Performance
- Dominates much of the web measurement work
- The popularity of a web site is highly dependent on its performance.
- Finding ways to reduce delays
- Sources of slowdowns
15
Challenges of Web Measurement
16
Challenges to Measurement
- Application-level nature
- Dependence on multiple protocols
  - DNS, TCP, HTTP
- Large sets of entities with varying configurations
- Equally diverse user population
17
Challenges to Measurement (cont’d)
- Hidden data
- Hidden layers
- Hidden entities
18
Hidden Data
- Much of the traffic is intranet traffic and inaccessible.
- Access to remote server data, even old logs, is often unavailable.
- From the server end, information about the clients (e.g., connection bandwidth) is obscured.
- New pages are constantly added, and old ones are removed or modified.
19
Hidden Data (cont’d)
- Access information for web pages is not available.
- TCP configuration parameters significantly impact performance and cannot be ascertained remotely.
  - Tools like TBIT test the impact of TCP variants such as Reno, Tahoe, or Vegas.
20
Hidden Layers
- Protocol and network layers are harder to measure.
  - Measuring them requires both deep knowledge of each network protocol and an understanding of the precise interactions between the different protocols.
- The number of end clients is unknown because of proxies.
- Requests may be redirected to different servers at different layers of the protocol stack.
  - Redirection can happen at the DNS, TCP, or HTTP level.
21
Hidden Layers (cont’d)
[Figure: a client, the server hosting Index.html, CDN servers 1 and 2 (Foo1.jpg, Foo2.jpg, Foo3.jpg), and ad servers 1 to 3 (ad1, ad2, ad3).]
22
Hidden Entities
- Proxies, HTTP and TCP redirectors
- Transparent interception proxies return results from a cache.
- Switches behave differently for web-related and non-web-related traffic.
- Lack of predictability due to multiple hidden entities at various layers of the protocol stack
23
Web Measurement Tools
24
Tools: Estimation of Web Traffic
- Since the start of the 21st century, peer-to-peer traffic has taken the lead in terms of number of bytes.
- The Web still remains the number one application in terms of active users.
- Almost 1 billion Internet users, a vast majority of whom use the web
25
Tools: Sampling & DNS
- NetFlow: traffic to the HTTP port (80)
- DNS traces show which IP addresses are looked up.
- Lookups of well-known web servers are likely to be frequent.
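As a rough illustration of the port-based estimate above, the sketch below computes the share of bytes carried by flows that touch port 80. The `Flow` record and its field names are hypothetical and do not follow any particular NetFlow export format. A port-based count also misses HTTP on other ports and includes non-web traffic tunnelled over port 80.

```python
from dataclasses import dataclass

HTTP_PORT = 80


@dataclass
class Flow:
    """Hypothetical, simplified flow record (not a real NetFlow schema)."""
    src_port: int
    dst_port: int
    byte_count: int


def web_byte_fraction(flows):
    """Return the fraction of bytes in flows with port 80 on either side."""
    total = sum(f.byte_count for f in flows)
    if total == 0:
        return 0.0
    web = sum(f.byte_count for f in flows
              if HTTP_PORT in (f.src_port, f.dst_port))
    return web / total


# Made-up sample: two web flows (1000 bytes) and one non-web flow (1000 bytes).
sample = [Flow(50000, 80, 300), Flow(80, 50000, 700), Flow(40000, 6881, 1000)]
print(web_byte_fraction(sample))
```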
26
Tools: Server Logs
- The number of requests and clients is recorded in web server logs.
- Web log analyzers generate statistics.
- Presence of obscured data
  - Proxies
    - Inter-arrival time of requests
    - Range and diversity of resources requested
  - Crawlers and spiders
    - Disproportionate number of requests from one or a few IP addresses
  - Anonymizers
  - Caches
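One way to look for the "disproportionate number of requests" signal is to compute each client IP's share of all logged requests and flag the outliers. This is only a sketch: the 20% threshold is an arbitrary assumption, and a flagged address could be a crawler, a proxy, or simply a busy client.

```python
from collections import Counter


def heavy_hitters(client_ips, share_threshold=0.2):
    """Map each client IP whose request share meets the threshold to that share.

    client_ips holds one entry per logged request. The default threshold is a
    hypothetical value chosen for illustration only.
    """
    counts = Counter(client_ips)
    total = sum(counts.values())
    return {ip: n / total for ip, n in counts.items()
            if n / total >= share_threshold}


# Made-up log: one address issues 8 of the 10 requests.
log_ips = ["10.0.0.1"] * 8 + ["10.0.0.2", "10.0.0.3"]
print(heavy_hitters(log_ips))
```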
27
Tools: Surveys
- Estimating the number of web servers (Netcraft)
- Important metric: the number and identity of popular web servers
- Business, technical, and social implications
28
Tools: Locating Entities
- An increasingly difficult problem
- Server resources are distributed geographically
  - Large number of resources
  - Increased availability
  - Being closer to clients
- Several businesses can use the same server farm to increase utilization.
- Locating clients: simple ‘traceroute’, or techniques such as network-aware clustering
29
Tools: Structural View
- The linkage structure of web pages
- HITS algorithm for identifying hubs and authorities
  - Hub: a page having multiple high-value links about a topic
  - Authority: a page having high-quality content on a given topic
- Web pages as nodes and links as edges in a graph model
- Page rankings and improvement of web searching
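The hub and authority idea can be sketched as a short iterative computation over the link graph. The version below is a minimal, textbook-style illustration of HITS on a small in-memory graph. The function names, the fixed iteration count, and the sample graph are assumptions made for this example.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """links maps each page to the pages it links to; returns (hub, auth)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking to it.
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = _normalize(auth)
        # Hub score: sum of the authority scores of the pages it links to.
        hub = _normalize({page: sum(auth[t] for t in links.get(page, ()))
                          for page in pages})
    return hub, auth


# Made-up graph: h1 and h2 are link pages, a and b are content pages.
hub, auth = hits({"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []})
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the page with the most incoming links from good hubs ends up with the highest authority score, and the page linking to the most authorities ends up as the best hub.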
30
Tools: Web Searching & Crawling
- One of the most important WWW applications
- Components:
  - Crawler: traverses the accessible part of the web to fetch web pages
  - Indexer: indexes the crawled pages
  - Search tool: accepts queries and returns pointers to the matching pages
31
Tools: Web Performance (cont’d)
- Measuring a particular web site’s latency and availability from diverse client perspectives
- Examining different latency components such as DNS, TCP, or HTTP differences, and CDNs
- Global measurements of the web to examine protocol compliance and ensure reduction of outages
32
Tools: Web Performance (cont’d)
- A variety of companies offer such services: Keynote, Akamai, eValid Test Suite, etc.
- A common technique: a distributed set of monitors around the world sends periodic requests to web sites.
33
Tools: Network Aware Clustering
- An effective technique to group IP addresses into clusters quickly and automatically
  - Non-overlapping clusters
  - Topologically close
  - Under common administrative control
- Clustering uses BGP routing table snapshots and longest prefix matching.
  - Same prefix → same cluster
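The "same prefix → same cluster" rule can be sketched with Python's standard `ipaddress` module. The prefix list below stands in for a BGP routing table snapshot and uses documentation address ranges. The function names and the linear scan are assumptions for illustration (a real table would normally be searched with a trie), and this is not the cited authors' implementation.

```python
import ipaddress


def longest_prefix_match(address, prefixes):
    """Return the most specific network in prefixes containing address, or None."""
    ip = ipaddress.ip_address(address)
    best = None
    for net in prefixes:
        if ip in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return best


def cluster_clients(client_ips, prefix_strings):
    """Group client IPs by their longest matching prefix (None if unmatched)."""
    prefixes = [ipaddress.ip_network(p) for p in prefix_strings]
    clusters = {}
    for addr in client_ips:
        match = longest_prefix_match(addr, prefixes)
        key = str(match) if match is not None else None
        clusters.setdefault(key, []).append(addr)
    return clusters


# Hypothetical routing-table prefixes and client addresses from a server log.
table = ["192.0.2.0/24", "198.51.100.0/24", "198.51.100.128/25"]
clients = ["192.0.2.10", "192.0.2.77", "198.51.100.5",
           "198.51.100.200", "203.0.113.9"]
print(cluster_clients(clients, table))
```

Because the most specific prefix wins, 198.51.100.200 lands in the /25 cluster rather than the enclosing /24, which keeps clusters non-overlapping.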
34
Tools: Network Aware Clustering (cont’d)
[Figure: BGP routing table snapshot]
35
Tools: Network Aware Clustering (cont’d)
- Application
  - Used to group client IP addresses in web server logs
  - Recognizing proxies and spiders
  - Better content access prediction, etc.
36
Tools: Network Aware Clustering (cont’d)
[Figure labels: Total server log; Client containing spider; Cluster containing proxy]
37
Tools: Network Aware Clustering (cont’d)
Balachander Krishnamurthy and Jia Wang. On Network-Aware Clustering of Web Clients. In Proceedings of ACM Sigcomm, August 2000.
38
Tools: Handling Mobile Clients
Jesse Steinberg and Joseph Pasquale. A Web Middleware Architecture for Dynamic Customization of Content for Wireless Clients. In Proceedings of the World Wide Web Conference, May 2002.
39
Tools: Handling Mobile Clients (cont’d)
Figure 3. Document Browsing with Summarizer on WAP
Christopher C. Yang and Fu Lee Wang. Fractal Summarization for Mobile Devices to Access Large Documents on the Web. In Proceedings of the World Wide Web Conference, May 2003.
40
Tools: Handling Mobile Clients (cont’d)
- Continuous growth in the mobile web
- Wireless network delays
- Tailored content
- Similar methods:
  - Server logs of mobile content providers
  - Lab experiments (e.g., emulating mobile devices, inducing packet loss)
  - Wide-area experiments
41
State of the Art
42
State of the Art
Four main parts of web measurement:
- Web properties
- Traffic gathering and analysis
- Performance issues
- Applications
43
Web Properties: High Level
- Estimates of web traffic have fallen, for several reasons:
  - Unreachable data
  - Firewalls and other barriers put up because of attacks
  - Use of internal web sites
  - The shift from Web to P2P
- Around a million new sites a month (Netcraft)
44
Web Properties: High Level (cont’d)
- 60 million web sites in fall 2004
- A vast fraction have little or no traffic compared to the top few hundred.
- The Apache and Microsoft server implementations together have 90% of the market (68% for Apache).
45
Web Properties: High Level (cont’d)
[Figure: Netcraft survey (news.netcraft.com)]
46
Web Properties: High Level (cont’d)
[Figure: Netcraft survey (news.netcraft.com)]
47
Web Properties: Location
- A steadily growing number of users are in Asian countries such as China and India.
- The fraction of web content from the US and Europe is falling.
- This has implications for where servers will be mirrored and which languages are supported.
48
Web Properties: Configuration
- Popular sites use a variety of techniques to improve server performance:
  - Distributing servers geographically (e.g., 3 World Cup servers in the U.S., 1 in France)
  - Redirecting requests to the least loaded server in a farm
  - Caching frequently requested resources
49
Web Properties: User Workload Models
- We measure user workload by looking at:
  - the duration of HTTP connections
  - request and response sizes
  - the number of unique IP addresses contacting a given web site
  - the number and frequency of accesses to individual resources at a given web site
  - etc.
50
Web Properties: Access Dynamics
- Web page access has been experimentally verified to follow a Zipf-like distribution.
- Zipf’s law: the probability of a request to the i-th most popular page is proportional to 1/i.
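To make the 1/i relationship concrete, the sketch below turns page ranks into request probabilities by normalizing the weights so they sum to 1. The exponent `alpha` is an added assumption covering the "Zipf-like" case. With `alpha=1.0` it reduces to the plain 1/i law stated above.

```python
def zipf_probabilities(n_pages, alpha=1.0):
    """Return request probabilities for ranks 1..n_pages under a Zipf-like law.

    The probability of the i-th most popular page is proportional to
    1 / i**alpha. alpha is a hypothetical tuning parameter for illustration.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_pages + 1)]
    total = sum(weights)
    return [w / total for w in weights]


# With four pages, rank 1 is requested twice as often as rank 2,
# three times as often as rank 3, and so on.
for rank, p in enumerate(zipf_probabilities(4), start=1):
    print(rank, round(p, 3))
```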
51
State of the Art
Four main parts of web measurement:
- Web properties
- Traffic gathering and analysis
- Performance issues
- Applications
52
Web Traffic: Critical Path Analysis
- Constructing the critical path to understand where delays are introduced in web requests
  - Packet propagation
  - Network variation (e.g., queuing at routers)
  - Packet loss
  - Delay at server and client
53
Web Traffic: Critical Path Analysis (cont’d)
- Only some of the components are responsible for the overall response time.
- The activities on the critical path are the ones that matter.
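The idea that only activities on the critical path determine the response time can be illustrated with a longest-path computation over a dependency graph. The activity names, durations (in milliseconds), and dependencies below are invented for the example, and the graph is assumed to be non-empty and acyclic.

```python
def critical_path(durations, depends_on):
    """Return (path, total_time) for the longest dependency chain.

    durations maps activity -> time; depends_on maps activity -> prerequisites.
    """
    finish = {}    # earliest finish time of each activity
    previous = {}  # the prerequisite that determines each activity's start

    def visit(activity):
        if activity in finish:
            return finish[activity]
        start = 0.0
        previous[activity] = None
        for dep in depends_on.get(activity, ()):
            dep_finish = visit(dep)
            if dep_finish > start:
                start = dep_finish
                previous[activity] = dep
        finish[activity] = start + durations[activity]
        return finish[activity]

    for activity in durations:
        visit(activity)
    last = max(finish, key=finish.get)
    total = finish[last]
    path = []
    while last is not None:
        path.append(last)
        last = previous[last]
    path.reverse()
    return path, total


# Hypothetical page load: the image fetch overlaps with server processing.
durations = {"dns": 40, "tcp_connect": 60, "http_request": 30,
             "server_processing": 120, "image_fetch": 80, "render": 50}
depends_on = {"tcp_connect": ["dns"], "http_request": ["tcp_connect"],
              "server_processing": ["http_request"],
              "image_fetch": ["http_request"],
              "render": ["server_processing", "image_fetch"]}
print(critical_path(durations, depends_on))
```

In this example the image fetch is off the critical path, so speeding it up would not reduce the total time.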
54
Web Traffic: Software Aid
- httperf:
  - Sends HTTP requests and processes responses
  - Simulates workload
  - Gathers statistics
  - Supports HTTP/1.1
  - Freely available in source code
55
Web Traffic: Software Aid (cont’d)
- wget
  - Fetches a large number of pages rooted at a particular node
  - Can fetch all the pages up to a certain “level” by following links
- Mercator (a personalized crawler)
  - Starts from a seed page and then does a breadth-first search on the links to find pages
  - Gives higher weight to pages having more incoming links
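Both tools above rest on the same traversal idea: start from a seed, follow links level by level, and stop at some depth. The sketch below shows that breadth-first traversal over an in-memory link graph. It is not how wget or Mercator are implemented, and the `get_links` callback and sample graph are assumptions standing in for real page fetching and link extraction.

```python
from collections import deque


def bfs_crawl(seed, get_links, max_depth):
    """Return pages reachable from seed within max_depth links, in BFS order."""
    seen = {seed}
    order = [seed]
    queue = deque([(seed, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical site structure; "/d" is three links away from the seed.
site = {"/": ["/a", "/b"], "/a": ["/c"], "/b": ["/a"], "/c": ["/d"]}
print(bfs_crawl("/", lambda page: site.get(page, []), max_depth=2))
```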
56
Web Traffic: Software Aid (cont’d)
- A detailed study in 2000 examined 33 million requests from over 50,000 wireless and PDA users.
  - The top 1% of notifications were responsible for 60% of content.
  - Notification messages had a Zipf-like distribution.
  - For popularity: 0.5% of URLs were accessed 90% of the time.
- In another study: a threefold increase in average daily traffic per wireless card between Fall 2003 and Winter 2004
57
Web Traffic: Wireless Users
[Figure: number of active cards per week at Dartmouth]
Tristan Henderson, David Kotz, and Ilya Abyzov. The Changing Usage of a Mature Campus-wide Wireless Network. In Proceedings of ACM Mobicom, September 2004.
58
State of the Art
Four main parts of web measurement:
- Web properties
- Traffic gathering and analysis
- Performance issues
- Applications
59
Web Performance: Intro
- User-perceived latency is a key factor because it affects the popularity of a site.
- Beyond a certain delay, the number of users who cancel the page load increases sharply.
60
Web Performance: CDNs
- Busy servers outsource delivery of some of their pages.
- CDNs combine the workload of several sites into a single provider.
- CDN servers are mirrored so that they are located near clients.
- DNS-based redirection
  - DNS overhead is a serious bottleneck in some CDNs.
61
Web Performance: CDNs (cont’d)
- Motivation:
  - More hops between client and web server => more congestion!
  - The same data flows repeatedly over the links between clients and web server.
[Figure: server S and clients C1 to C4 connected through IP routers]
62
Web Performance: CDNs (cont’d)
Caches
[Figure: the user merlot.cis.udel.edu and 1,000,000 other hosts request pages from the web server www.cnn.com through caching proxies at their ISPs. The server holds new content (“WTC News!”) while old content is still cached. Legend: caching proxy; congestion / bottleneck.]
63
Web Performance: CDNs (cont’d)
- Caching problems:
  - Caching proxies serve only their own clients, not all users on the Internet.
  - Content providers (say, web servers) cannot rely on the existence and correct implementation of caching proxies.
  - Accounting issues with caching proxies. For instance, www.cnn.com needs to know the number of hits to a web page for the advertisements displayed on it.
64
Web Performance: CDNs (cont’d)
[Figure: the user merlot.cis.udel.edu and 1,000,000 other users request new content (“WTC News!”) from the web server www.cnn.com through a distribution infrastructure of mirrors in FL, IL, DE, NY, MA, MI, CA, and WA. Legend: mirrors; distribution infrastructure.]
65
Web Performance: CDNs (cont’d)
- An overlay network distributes content from origin servers to users.
- Avoids large amounts of the same data repeatedly traversing potentially congested links on the Internet
- Reduces web server load
- Reduces user-perceived latency
66
DNS-based Request Routing
[Figure: the client merlot.cis.udel.edu (128.4.30.15) sends a DNS query for www.cnn.com to its local DNS server louie.udel.edu (128.4.4.12), which forwards the query to the Akamai DNS. The DNS response (A 145.155.10.15) is returned to the client, which then opens its session with that surrogate in the Akamai CDN. Surrogates shown: 145.155.10.15 and 58.15.100.152. Names shown: delaware.cnn.akamai.com and california.cnn.akamai.com.]
Q: How does the Akamai DNS know which surrogate is closest?
67
DNS-Based Request Routing (cont’d)
[Figure: the same DNS query and response exchange between merlot.cis.udel.edu (128.4.30.15), the local DNS server louie.udel.edu (128.4.4.12), and the Akamai DNS. The surrogates measure to the client’s DNS server and report the measurement results to the Akamai DNS.]
68
DNS-Based Redirection
- Problem: the content server is chosen to be optimal for the local name server, not the actual client.
  - The client may be far from its name server.
  - In one study, only 16% of the clients were in the same network-aware cluster as their local DNS server.
69
Total & Selective Redirection
1. Total redirection
  - Any request for the origin server is redirected to the CDN.
  - Basically, the CDN takes control of the content provider’s DNS zone.
  - Benefit: all requests are automatically redirected.
  - Disadvantage: may send lots of traffic to the CDN, hence expensive for the content provider.
2. Selective redirection
  - The content provider marks which objects are to be served from the CDN.
  - Typically, larger objects like images are selected.
  - Refer to images as:
  - Pro: fine-grained control over what gets delivered
  - Con: have to (manually) mark content for the CDN
70
Total Redirection
[Figure: the client sends GET index.html and GET image1.gif, image2.gif to the CDN surrogate server, which returns index.html, image1.gif, and image2.gif. The origin server sits behind the CDN. index.html embeds image1.gif and image2.gif.]
71
Partial Redirection
[Figure: the client sends GET index.html to the origin server and GET image1.gif, image2.gif to the CDN surrogate server, which returns image1.gif and image2.gif. index.html embeds image1.gif and image2.gif.]
72
Total vs. Selective Redirection
- Total redirection has clearly superior performance.
  - Selective redirection is typically slower than downloading everything from the origin server.
  - But the origin server might be loaded…
- Which redirection is used more?
  - Initially, selective redirection was used.
  - These days, mainly total redirection.
73
Web Performance: Client Connectivity
- Finding clients’ connection quality
- Delivering the most suitable version of the content
  - Sending just the base document
  - Using compression
- Tailoring the server’s policy
  - Keeping persistent connections open longer
- Measuring the inter-arrival time of requests to classify clients
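A minimal sketch of classifying a client from request inter-arrival times follows. It assumes the server has the timestamps (in seconds) of one client's requests. The use of the median gap and the one-second cut-off are illustrative assumptions, not the criteria used in the work these slides summarize.

```python
from statistics import median


def inter_arrival_times(timestamps):
    """Return the gaps between consecutive request timestamps (seconds)."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def classify_client(timestamps, slow_gap_seconds=1.0):
    """Label a client 'poor' or 'good' from its median inter-arrival gap.

    slow_gap_seconds is a hypothetical threshold chosen for illustration.
    """
    gaps = inter_arrival_times(timestamps)
    if not gaps:
        return "unknown"  # a single request gives no gap to measure
    return "poor" if median(gaps) > slow_gap_seconds else "good"


print(classify_client([0.0, 0.2, 0.4, 0.5]))  # closely spaced requests
print(classify_client([0.0, 2.0, 4.5]))       # widely spaced requests
```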
74
Web Performance: Client Connectivity (cont’d)
- Stability of client classification
- Classifying new clients using network-aware clustering
  - Same cluster → same class
- Classification works best for sites that have a variety of clients.
75
Web Performance: Client Connectivity (cont’d)
Balachander Krishnamurthy, Craig E. Wills, Yin Zhang, and Kashi Vishwanath. Design, Implementation, and Evaluation of a Client Characterization Driven Web Server. In Proceedings of the World Wide Web Conference, May 2003.
Server action conclusions:
- Compression gave consistently good results for poorly connected clients, but not for well-connected ones.
- Reducing the quality of objects only yielded benefits for a modem client.
- Bundling was effective when there was good connectivity, or poor connectivity with large latency.
- Persistent connections with serialized requests did not show significant improvement.
- Pipelining was only significant for clients with high throughput or RTT.
76
Web Performance: Protocol Compliance
- A 16-month study used the httperf tool to test for HTTP protocol compliance.
  - Absence of required headers (such as Date)
  - Nearly half the servers did not implement range requests.
  - Inability to handle long URIs gracefully
- The popular Apache server was the most compliant, followed by Microsoft’s IIS.
77
State of the Art
Four main parts of web measurement:
- Web properties
- Traffic gathering and analysis
- Performance issues
- Applications
78
Web Applications: Searching
- In 1999, 200 million pages and 1.5 billion links were examined.
- The probability of a node having in-degree i is proportional to 1/i^x (x > 1).
- Nodes with a large in-degree are considered “high rank”.
  - Used frequently in search engines
  - Sites may use fake links to trick crawlers.
79
Web Applications: Searching (cont’d)
- A four-part separation in web structure:
  - A central core
  - Two parts connected to the core
  - One part with no connection to the core
- All the components have a roughly equal number of pages!
80
Web Applications: Searching (cont’d)
- Over 90% of web pages are reachable from each other when links are treated as undirected.
- The probability of reaching a random page from another by following links in their actual direction is only 0.25.
- The well-connected component will remain connected even if we remove nodes with large degrees (hubs).
81
Web Applications: Searching (cont’d)
- Image resources change infrequently.
- Many text documents change periodically.
- Some studies have tried to model the rate of change of pages as a Poisson process.
- Some studies have examined the rate of change in different domains (e.g., .com vs. .org).
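Under a Poisson model with change rate λ, the probability that a page changes at least once within time t is 1 − e^(−λt). The sketch below estimates λ from an observed change count and applies that formula. The function names and sample numbers are assumptions for illustration, and the simple rate estimate assumes every change during the observation window was detected.

```python
import math


def estimate_rate(changes_observed, days_observed):
    """Estimate the change rate (changes per day) from an observation window."""
    return changes_observed / days_observed


def change_probability(rate_per_day, days):
    """Probability of at least one change within `days` under a Poisson model."""
    return 1.0 - math.exp(-rate_per_day * days)


# Hypothetical page that changed 3 times in 30 days: how likely is it to have
# changed again 10 days after the last crawl?
rate = estimate_rate(3, 30)
print(round(change_probability(rate, 10), 3))
```

A crawler could use such an estimate to revisit fast-changing pages more often than slow-changing ones.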
82
Web Applications: Searching (cont’d)
- 150 web sites were studied over a 7-month period.
- The incoming links of the pages were computed.
- The rich get richer!
  - Pages in the bottom 60% of the ranking received no additional links.
- Search engines need to change how they rank pages.
83
Web Applications: Searching (cont’d)
- A study examined several subsets of pages.
- A significant fraction of links were dead, with an impact on crawling and page ranking.
  - Over 50% dead links in some cases
- Avoiding dead links allows faster crawling and more useful ranking.
84
Web Applications: Flash Crowds
- A large number of legitimate and wanted requests (unlike DoS attacks, in which the requests are not wanted)
- During flash crowds:
  - Same average number of requests per client
  - No increase in the number of client clusters
  - Between 60% and 82% of the resources are accessed only at this time.
  - Fewer than 10% of the resources are responsible for 90% of the requests.
- DoS attackers have no way of knowing the typical distribution of client clusters.
  - Many new clusters emerge.
85
Flash Crowd vs DoS Attack
- Flash crowd
  - Increase in the number of clients
  - Fixed number of clusters
- DoS attack
  - Increase in the number of both clients and clusters
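The distinction above can be turned into a simple rule of thumb: compare how much the number of clients and the number of client clusters grew relative to a baseline period. The sketch below does that. The growth threshold, the labels, and the sample counts are assumptions, and a real detector would need more evidence than these two ratios.

```python
def classify_surge(baseline_clients, baseline_clusters,
                   surge_clients, surge_clusters, growth_threshold=2.0):
    """Label a traffic surge from client and cluster growth.

    Baseline counts are assumed to be positive. growth_threshold is a
    hypothetical cut-off for what counts as an "increase".
    """
    client_growth = surge_clients / baseline_clients
    cluster_growth = surge_clusters / baseline_clusters
    if client_growth < growth_threshold:
        return "no surge"
    if cluster_growth >= growth_threshold:
        return "possible DoS attack"   # clients and clusters both grew
    return "possible flash crowd"      # many more clients, similar clusters


print(classify_surge(1000, 200, 10000, 230))
print(classify_surge(1000, 200, 10000, 1500))
```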
86
Web Applications: Blogs
- Providing early warning of flash crowds
- A different rate of change compared to traditional web pages
- Having many references, as popular web sites do
- A significant fraction of links go to other blogs.
  - Blogs have significantly more self-references.