1
Towards Understanding Modern Web Traffic
Sunghwan Ihm and Vivek S. Pai Google Inc. / Princeton University
2
Web Changes and Growth
Simple static documents → complex rich media applications
  Heavy client-side interactions (e.g., Ajax)
Traffic increase
  Social networking, file-sharing, and video streaming sites
Trends expected to continue
  Applications migrated to the Web
  A de facto standard interface of cloud services
Sunghwan Ihm, Princeton University
3
Understanding Changes
Goal: shape system design by better understanding the traffic optimization opportunities
  Improve response times
  Understand caching effectiveness
  Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems
4
Challenges
Tracking changes
  Requires a large-scale data set spanning many years, collected under the same conditions
Web page analysis
  Requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax)
Redundancy and caching
  Requires full content instead of simple access logs for assessing the implications of content-based caching
We address these challenges by
  Analyzing large-scale data with full content
  Developing a new Web page analysis technique
5
CoDeeN Traffic
CoDeeN content distribution network (CDN)
  A semi-open, globally distributed open proxy on PlanetLab nodes
  Running since 2003
  30+ million requests per day
6
Data Collection
[Figure: data collection setup; labels: User, Browser Cache, Local Proxy, Cache, WAN, CoDeeN, Origin Web Server, Access Logs, Full Content]
Assume local proxy caches
1. Access logs (all requests, but limited info.)
   URL, Timestamp, Content-Length, Content-Type, Referer, etc.
2. Full content (cache misses)
   Header + body
7
Data Set
5 years: from 2006 to 2010
  Focus on one month (April) per year
  Full content data only for 2010
Total volume per month
  3.3~6.6 TB
  280~460 million requests
  240~360K unique client IPs (40~60% /8 nets)
  168~187 countries and regions
  820K~1.2 million servers
Focus on US, CN, FR, BR: 100M+ requests / 1TB+ / 100K+ users
8
Analysis Outline
1. High-level analysis
2. Page-level analysis
3. Caching analysis
Data sources: Access Logs, Full Content
9
1. High-Level Analysis
Q: What has changed over five years?
Connection speed
NAT usage
Max # concurrent browser connections
Content type
Object size
Traffic share of Web sites
10
Content Type
[Figure: US, 2006 → 2010, both X and Y log-scale]
A sharp increase in Ajax: JavaScript / CSS / XML
A sharp increase in Flash video (FLV) (<5% → 25%)
11
Traffic Share of Web Sites
Increase in video sites' traffic
Increase in ad networks' and analytics sites' requests (~12%)
  Ad network market growth
Sites accessed by the most users: search / analytics
  google.com, baidu.com, google-analytics.com
  % user share increasing, tracking up to 65% of users
12
2. Page-Level Analysis
Q: How have Web pages changed?
New page detection heuristic
Initial page characteristics
  Page size / # of embedded objects / latency
Page load latency simulation
Entire page characterization
13
Page Detection Problem
Given a set of access logs, detect the page boundaries
  # of embedded objects, page size, time, etc.
Challenge: previous approaches from the 1990s are a poor fit and inaccurate for modern Web traffic
[Figure: timeline of main and embedded object requests]
14
Previous Approach #1: Time-based
Check idle time between requests
  If within a threshold (e.g., 1 second), they belong to the same page
Misclassifies client-side interactions (Ajax) with longer idle times as separate pages
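The time-based rule above can be sketched in a few lines. This is a minimal illustration, not the code used in the study; the function name `split_pages_by_idle_time` and the sample timestamps are made up for the example.

```python
from typing import List

def split_pages_by_idle_time(timestamps: List[float],
                             threshold: float = 1.0) -> List[List[float]]:
    """Group request timestamps (seconds, ascending) into pages.

    A request within `threshold` seconds of the previous request joins the
    current page; a longer gap starts a new page.
    """
    pages: List[List[float]] = []
    for ts in timestamps:
        if pages and ts - pages[-1][-1] <= threshold:
            pages[-1].append(ts)
        else:
            pages.append([ts])
    return pages

# Hypothetical trace: three requests for one page, then an Ajax call 5 s later.
requests = [0.0, 0.2, 0.5, 5.5]
pages = split_pages_by_idle_time(requests)
```

With this trace the late Ajax request lands in a page of its own, which is the misclassification the slide describes.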
15
Previous Approach #2: Type-based
Check file extension / content type
  Regard every HTML object as a main object
Misclassifies frames/iframes within a page as separate pages
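A rough sketch of the type-based rule, assuming each log entry is a dictionary with hypothetical `url` and `content_type` keys (the actual log schema is not given in the slides):

```python
from typing import Dict, List

def main_objects_by_type(log: List[Dict[str, str]]) -> List[str]:
    """Return the URLs of every entry whose content type is HTML."""
    return [entry["url"] for entry in log
            if entry["content_type"].lower().startswith("text/html")]

# Hypothetical log: one page that embeds a stylesheet and an ad iframe.
log = [
    {"url": "http://example.com/", "content_type": "text/html"},
    {"url": "http://example.com/style.css", "content_type": "text/css"},
    {"url": "http://ads.example.net/frame.html",
     "content_type": "text/html; charset=utf-8"},
]
mains = main_objects_by_type(log)
```

Both the page and its iframe come back as main objects here, so one page view is counted as two.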
16
StreamStructure Algorithm
1. Group logs into streams by Referer field
2. Consider all HTML objects as main object candidates (type-based)
3. Ignore those with no children (embedded objects)
4. Apply idle time among the candidates to finalize the selection (time-based)
[Figure callouts: Ajax, frames/iframes]
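The four steps can be sketched as follows. This is one plausible reading of the slide, not the authors' implementation: the `Request` record, the exact-match Referer linking, and the interpretation of step 4 (a candidate becomes a main object only if the idle time before it exceeds the threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Request:  # hypothetical log record
    time: float
    url: str
    referer: Optional[str]
    content_type: str

def stream_structure(log: List[Request], idle_threshold: float = 1.0) -> List[str]:
    log = sorted(log, key=lambda r: r.time)
    stream_of: Dict[str, int] = {}      # url -> stream id
    child_count: Dict[str, int] = {}    # url -> # requests naming it as Referer
    streams: Dict[int, List[Request]] = {}
    # Step 1: a request joins the stream of the URL in its Referer field.
    for req in log:
        if req.referer is not None and req.referer in stream_of:
            sid = stream_of[req.referer]
            child_count[req.referer] = child_count.get(req.referer, 0) + 1
        else:
            sid = len(streams)          # unknown or missing Referer: new stream
        stream_of[req.url] = sid
        streams.setdefault(sid, []).append(req)
    main_objects: List[str] = []
    for requests in streams.values():
        previous_time: Optional[float] = None
        for req in requests:
            # Steps 2 and 3: HTML objects that have at least one child.
            is_candidate = (req.content_type.startswith("text/html")
                            and child_count.get(req.url, 0) > 0)
            # Step 4: keep a candidate only if it follows an idle period.
            if is_candidate and (previous_time is None
                                 or req.time - previous_time > idle_threshold):
                main_objects.append(req.url)
            previous_time = req.time
    return main_objects

# Hypothetical trace: a page with an iframe, an Ajax update, then a second page.
trace = [
    Request(0.0, "http://a.com/", None, "text/html"),
    Request(0.1, "http://a.com/a.css", "http://a.com/", "text/css"),
    Request(0.3, "http://ads.net/f.html", "http://a.com/", "text/html"),
    Request(0.4, "http://ads.net/ad.png", "http://ads.net/f.html", "image/png"),
    Request(6.0, "http://a.com/api/update", "http://a.com/", "text/html"),
    Request(10.0, "http://a.com/page2", "http://a.com/", "text/html"),
    Request(10.2, "http://a.com/p2.js", "http://a.com/page2", "application/javascript"),
]
mains = stream_structure(trace)
```

In this trace the iframe is rejected by the idle-time check and the Ajax response is rejected because it has no children, leaving only the two real pages.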
17
Validation
Ground truth: browse Alexa's top 100 sites
  Visit about 10 pages per site
  Record Web page URLs (main objects)
  Total 1197 pages
Precision
  # correct pages found / # total pages found
Recall
  # correct pages found / # total correct pages
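The two metrics follow directly from the definitions above. A small sketch, with made-up URL sets standing in for the detected pages and the recorded ground truth:

```python
from typing import Iterable, Tuple

def precision_recall(found: Iterable[str], truth: Iterable[str]) -> Tuple[float, float]:
    """Precision = correct / found; recall = correct / ground truth."""
    found_set, truth_set = set(found), set(truth)
    correct = len(found_set & truth_set)
    precision = correct / len(found_set) if found_set else 0.0
    recall = correct / len(truth_set) if truth_set else 0.0
    return precision, recall

# Hypothetical result: 3 pages detected, 2 of them among the 4 true pages.
p, r = precision_recall(found=["/a", "/b", "/frame"],
                        truth=["/a", "/b", "/c", "/d"])
```

Here precision is 2/3 and recall is 2/4.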
18
Validation Result
[Figure: validation result plot; annotations: Better, 4, 26~33, 19~30, 4~24, 1 sec]
StreamStructure outperforms other approaches
Robust to the idle time parameter selection
19
Identifying Initial Page Loads
[Figure labels: Initial Page Load, Client-side Interactions (e.g., Ajax)]
Initial page: user-perceived page → user-perceived latency → traffic/revenue of Web sites
Could apply the time-based approach, but DNS lookup or browser processing time can vary significantly
Use the Google Analytics beacon
  JavaScript collecting various client-side info.
  Fires when the document is loaded
40-60% of traffic occurs after initial page loads
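As an illustration of the beacon idea, the sketch below splits one page's requests at the first beacon request and reports the share of bytes that arrive afterwards. The `(time, url, size)` tuple layout and the URL substring used to recognize the beacon are assumptions for this example, not details taken from the slides.

```python
from typing import List, Optional, Tuple

def fraction_after_initial_load(
    requests: List[Tuple[float, str, int]],            # (time, url, bytes)
    beacon_marker: str = "google-analytics.com/__utm.gif",  # assumed beacon URL pattern
) -> Optional[float]:
    """Share of a page's bytes transferred after the first beacon request.

    Returns None when the page has no beacon or no bytes.
    """
    requests = sorted(requests)
    total = sum(size for _, _, size in requests)
    beacon_time = next((t for t, url, _ in requests if beacon_marker in url), None)
    if beacon_time is None or total == 0:
        return None
    after = sum(size for t, _, size in requests if t > beacon_time)
    return after / total

# Hypothetical page: 40 KB before the beacon, 40 KB of Ajax traffic after it.
page = [
    (0.0, "http://a.com/", 30_000),
    (0.3, "http://a.com/app.js", 9_965),
    (0.9, "http://www.google-analytics.com/__utm.gif?x=1", 35),
    (7.0, "http://a.com/api/feed", 40_000),
]
share = fraction_after_initial_load(page)
```

For this made-up page half of the bytes fall after the initial load.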
20
Initial Page Size and # Objects
Initial pages have become increasingly complex
US: about 2x increase
  2006: 69 KB / 6 objects
  2010: 133 KB / 12 objects
Caching Effectiveness
21
Initial Page Load Latency
Median latency dropped in 2009 and 2010
  Increased # of concurrent browser connections
  Reduced per-object latency from improved caching behavior / client bandwidth
22
3. Caching Analysis
Q: Implications for caching?
URL popularity
Caching effectiveness
Required cache storage size
Impact of aborted transfers
23
Two Caching Approaches
HTTP Object-based Approach
  Whole object
  HTTP-cacheable only
  Previously reported cache hit rate: 35~50%
  Byte hit rate usually much less
Content-based Approach
  Cache smaller chunks instead of objects
  Protocol independent
  Effective for uncacheable content as well
  WAN accelerators, storage/file systems
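To make the contrast concrete, here is a simplified sketch that computes both byte hit rates over a sequence of responses with an unbounded cache. It makes several assumptions: an object hit requires the same URL and an identical body (cacheability headers and expiry are ignored), and chunks are cut at fixed offsets, whereas content-based systems typically use content-defined chunk boundaries.

```python
import hashlib
from typing import Iterable, Set, Tuple

def byte_hit_rates(responses: Iterable[Tuple[str, bytes]],
                   chunk_size: int = 128) -> Tuple[float, float]:
    """Return (object-based, content-based) byte hit rates.

    responses: (url, body) pairs in request order.
    """
    seen_objects: Set[Tuple[str, bytes]] = set()   # (url, digest of whole body)
    seen_chunks: Set[bytes] = set()                # digests of individual chunks
    total = object_hits = chunk_hits = 0
    for url, body in responses:
        total += len(body)
        key = (url, hashlib.sha1(body).digest())
        if key in seen_objects:
            object_hits += len(body)
        seen_objects.add(key)
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            if digest in seen_chunks:
                chunk_hits += len(chunk)
            seen_chunks.add(digest)
    if total == 0:
        return 0.0, 0.0
    return object_hits / total, chunk_hits / total

# Hypothetical workload: the same URL fetched twice, then an updated version
# whose first half is unchanged.
v1 = b"A" * 128 + b"B" * 128
v2 = b"A" * 128 + b"C" * 128
object_rate, content_rate = byte_hit_rates([("/news", v1), ("/news", v1), ("/news", v2)])
```

In this toy workload the object cache saves only the exact repeat (1/3 of the bytes), while the chunk cache also saves the unchanged half of the updated version (1/2 of the bytes).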
24
Ideal Cache Hit Rate
HTTP object-based: 17~28%
  Mainly effective for JavaScript and images
Content-based: 42~51% with 128-byte chunks (1.8~2.5x the object-based rate)
  Effective for any content type
Growth of the tail, which hurts caching
25
Origins of Redundancy
[Figure: US, 128-byte chunks; callouts: Aborted, Content updates]
Most of the additional savings come from redundancy
  across different versions (intra-URL)
  across different objects (inter-URL)
26
Required Cache Storage Size
[Figure: CN, 218 GB]
1-KB chunks outperform 128-B chunks with metadata overhead
  MRC: Multi-Resolution Chunking (USENIX'10)
Increases working set size
Large cache storage highly desirable
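A back-of-the-envelope sketch of why metadata overhead favors larger chunks. The 20-byte index entry per chunk is a hypothetical figure chosen for illustration (the size of a SHA-1 digest); the slides do not state the actual per-chunk cost.

```python
def index_overhead(chunk_size: int, entry_bytes: int = 20) -> float:
    """Index bytes needed per byte of cached content.

    entry_bytes is an assumed per-chunk metadata cost, not a measured value.
    """
    return entry_bytes / chunk_size

overhead_128 = index_overhead(128)     # 128-byte chunks
overhead_1k = index_overhead(1024)     # 1-KB chunks
```

Under this assumption, 128-byte chunks spend about 16% extra on the index versus about 2% for 1-KB chunks, so smaller chunks must find correspondingly more redundancy to come out ahead.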
27
Conclusions
Analyzed five years of real Web traffic with over 70,000 users
Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking 65% of users
Developed StreamStructure
  Half of the traffic occurs due to client-side interactions after initial page loads
  Pages have become increasingly complex
Content-based caching with large cache storage highly desirable
  2x larger byte hit rate, aborted transfers
28
Thank You
sihm@cs.princeton.edu
http://www.cs.princeton.edu/~sihm/