Towards Understanding Modern Web Traffic

Presentation transcript:

Towards Understanding Modern Web Traffic
Sunghwan Ihm and Vivek S. Pai
Google Inc. / Princeton University

Web Changes and Growth
- Simple static documents → complex rich media applications
- Heavy client-side interactions (e.g., Ajax)
- Traffic increase: social networking, file-sharing, and video streaming sites
- Trends expected to continue
- Applications migrated to the Web
- A de facto standard interface of cloud services
Sunghwan Ihm, Princeton University

Understanding Changes
- Goal: shape system design by better understanding the traffic
- Optimization opportunities: improve response times, understand caching effectiveness
- Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems

Challenges
- Tracking changes: requires a large-scale data set spanning many years, collected under the same conditions
- Web page analysis: requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax)
- Redundancy and caching: requires full content instead of simple access logs to assess the implications of content-based caching
We address these challenges by
- Analyzing large-scale data with full content
- Developing a new Web page analysis technique

CoDeeN Traffic
- CoDeeN content distribution network (CDN): http://codeen.cs.princeton.edu/
- A semi-open, globally distributed proxy running on 500+ PlanetLab nodes
- Running since 2003
- 30+ million requests per day

Data Collection
[Diagram labels: User, Browser Cache, Local Proxy, CoDeeN Cache, WAN, Origin Web Server, Access Logs, Full Content]
- Assume local proxy caches
- 1. Access logs (all requests, but limited info.): URL, Timestamp, Content-Length, Content-Type, Referer, etc.
- 2. Full content (cache misses): header + body

Data Set
- 5 years: from 2006 to 2010
- Focus on one month (April) per year
- Full content data only for 2010
- Total volume per month:
  - 3.3~6.6 TB
  - 280~460 million requests
  - 240~360K unique client IPs (40~60% /8 nets)
  - 168~187 countries and regions
  - 820K~1.2 million servers
- Focus on US, CN, FR, BR: 100M+ requests / 1TB+ / 100K+ users

Analysis Outline
- 1. High-level analysis
- 2. Page-level analysis
- 3. Caching analysis
- Data sources: Access Logs, Full Content

1. High-Level Analysis
Q: What has changed over five years?
- Connection speed
- NAT usage
- Max # of concurrent browser connections
- Content type
- Object size
- Traffic share of Web sites

Content Type
[Figure: US, 2006 → 2010, both X and Y axes log-scale]
- A sharp increase of Ajax: JavaScript / CSS / XML
- A sharp increase of Flash video (FLV) (<5% → 25%)

Traffic Share of Web Sites
- Increase in video sites' traffic
- Increase in ad networks' and analytics sites' requests (~12%), in line with ad network market growth
- Sites accessed by the most users: search / analytics (google.com, baidu.com, google-analytics.com)
- Their user share is increasing, tracking up to 65% of users

2. Page-Level Analysis
Q: How have Web pages changed?
- New page detection heuristic
- Initial page characteristics: page size / # of embedded objects / latency
- Page load latency simulation
- Entire page characterization

Page Detection Problem
- Given a set of access logs, detect the page boundaries: # of embedded objects, page size, time, etc.
- Challenge: previous approaches from the 1990s are a poor fit and inaccurate for modern Web traffic
[Figure: timeline of main and embedded object requests]

Previous Approach #1: Time-based
- Check the idle time between requests
- If it is within a threshold (e.g., 1 second), the requests belong to the same page
- Misclassifies client-side interactions (Ajax) that follow a longer idle time as new pages
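A minimal sketch of the time-based idea, assuming requests are reduced to timestamps in seconds. The function name `split_pages_by_idle_time` and the sample data are hypothetical and not taken from the talk.

```python
from typing import List


def split_pages_by_idle_time(timestamps: List[float],
                             threshold: float = 1.0) -> List[List[float]]:
    """Group request timestamps into pages.

    A request that follows the previous one within `threshold` seconds
    joins the current page; a longer idle gap starts a new page.
    """
    pages: List[List[float]] = []
    for ts in sorted(timestamps):
        if pages and ts - pages[-1][-1] <= threshold:
            pages[-1].append(ts)
        else:
            pages.append([ts])
    return pages


# Hypothetical trace: the request at t=9.0 could be an Ajax call made by the
# page loaded at t=3.0, yet the idle-time rule counts it as a separate page.
requests = [0.0, 0.2, 0.5, 3.0, 3.1, 9.0]
pages = split_pages_by_idle_time(requests)
print(len(pages))
```

The late request illustrates the misclassification named on the slide: the rule cannot tell a new page from a client-side interaction that happens after a pause.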

Previous Approach #2: Type-based
- Check the file extension / content type
- Regard every HTML object as a main object
- Misclassifies frames/iframes within a page as separate pages
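A minimal sketch of the type-based idea, assuming each log entry is a dictionary with `url` and `content_type` keys. The entry layout, the function name, and the example URLs are assumptions made for illustration.

```python
def main_objects_by_type(log):
    """Return the URLs of all HTML responses, treating each as a page."""
    mains = []
    for entry in log:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = entry["content_type"].split(";")[0].strip().lower()
        if media_type == "text/html":
            mains.append(entry["url"])
    return mains


# Hypothetical log: one page that embeds an iframe and an image.
log = [
    {"url": "http://a.example/", "content_type": "text/html; charset=utf-8"},
    {"url": "http://a.example/frame.html", "content_type": "text/html"},
    {"url": "http://a.example/logo.png", "content_type": "image/png"},
]
found = main_objects_by_type(log)
print(found)
```

Both HTML responses are reported, so the iframe is counted as a page of its own, which is the weakness noted on the slide.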

StreamStructure Algorithm
- 1. Group logs into streams by the Referer field
- 2. Consider all HTML objects as main object candidates (type-based)
- 3. Ignore those with no children (embedded objects)
- 4. Apply idle time among the candidates to finalize the selection (time-based)
[Slide callouts: Ajax, frames/iframes]
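The four steps can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. It assumes log entries are dictionaries with `time`, `url`, `referer`, and `content_type` keys, that each URL appears once, that a stream is the set of requests whose Referer chain leads to the same root URL, and that idle time is measured from the preceding request in the same stream. All names and sample data are hypothetical.

```python
from collections import defaultdict


def stream_structure(log, idle_threshold=1.0):
    """Return URLs chosen as main objects (simplified sketch)."""
    by_url = {e["url"]: e for e in log}

    # Children of a URL: requests that name it in their Referer field.
    children = defaultdict(list)
    for e in log:
        if e["referer"]:
            children[e["referer"]].append(e)

    def stream_root(entry):
        # Follow the Referer chain up to a request with no known referer.
        seen = set()
        while entry["referer"] in by_url and entry["url"] not in seen:
            seen.add(entry["url"])
            entry = by_url[entry["referer"]]
        return entry["url"]

    # Step 1: group requests into streams, in time order.
    streams = defaultdict(list)
    for e in sorted(log, key=lambda e: e["time"]):
        streams[stream_root(e)].append(e)

    mains = []
    for entries in streams.values():
        prev_time = None
        for e in entries:
            is_html = e["content_type"].startswith("text/html")  # step 2
            has_children = bool(children[e["url"]])               # step 3
            idle = None if prev_time is None else e["time"] - prev_time
            # Step 4: keep a candidate only if it follows an idle period.
            if is_html and has_children and (idle is None or idle >= idle_threshold):
                mains.append(e["url"])
            prev_time = e["time"]
    return mains


root = "http://a.example/"
log = [
    {"time": 0.0, "url": root, "referer": None, "content_type": "text/html"},
    {"time": 0.1, "url": root + "style.css", "referer": root, "content_type": "text/css"},
    {"time": 0.2, "url": root + "frame.html", "referer": root, "content_type": "text/html"},
    {"time": 0.3, "url": root + "frame.png", "referer": root + "frame.html", "content_type": "image/png"},
    {"time": 5.0, "url": root + "ajax.json", "referer": root, "content_type": "application/json"},
    {"time": 6.5, "url": root + "next.html", "referer": root, "content_type": "text/html"},
    {"time": 6.6, "url": root + "next.js", "referer": root + "next.html", "content_type": "application/javascript"},
]
mains = stream_structure(log)
print(mains)
```

In this example the iframe is dropped because it loads right after its parent, and the late Ajax request is never a candidate because it is not HTML, so only the two real page loads remain.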

Validation
- Ground truth: browse Alexa's top 100 sites
- Visit about 10 pages per site
- Record Web page URLs (main objects)
- Total: 1197 pages
- Precision = # correct pages found / # total pages found
- Recall = # correct pages found / # total correct pages
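The two metrics can be computed directly from sets of page URLs. The sketch below follows the definitions on the slide; the function name and the sample URLs are hypothetical.

```python
def precision_recall(found, ground_truth):
    """Precision = correct found / total found; recall = correct found / total correct."""
    found, ground_truth = set(found), set(ground_truth)
    correct = found & ground_truth
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical example: 4 pages detected, 3 pages actually visited, 2 in common.
detected = ["/a", "/b", "/c", "/d"]
visited = ["/a", "/b", "/e"]
precision, recall = precision_recall(detected, visited)
print(precision, recall)
```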

Validation Result
[Figure: comparison of the page detection approaches; plot labels not recoverable]
- StreamStructure outperforms other approaches
- Robust to the idle time parameter selection

Identifying Initial Page Loads
[Timeline figure: Initial Page Load, then Client-side Interactions (e.g., Ajax)]
- Initial page: user-perceived page → user-perceived latency → traffic/revenue of Web sites
- The time-based approach could be applied, but DNS lookup or browser processing time can vary significantly
- Use the Google Analytics beacon: JavaScript collecting various client-side info; fires when the document is loaded
- 40-60% of traffic occurs after initial page loads

Initial Page Size and # Objects
- Initial pages have become increasingly complex
- US: about 2x increase
  - 2006: 69 KB / 6 objects
  - 2010: 133 KB / 12 objects

Initial Page Load Latency
- Median latency dropped in 2009 and 2010
  - Increased # of concurrent browser connections
  - Reduced per-object latency from improved caching behavior / client bandwidth

3. Caching Analysis
Q: Implications for caching?
- URL popularity
- Caching effectiveness
- Required cache storage size
- Impact of aborted transfers

Two Caching Approaches
- HTTP object-based approach
  - Whole object
  - HTTP-cacheable content only
  - Previously reported cache hit rate: 35~50%
  - Byte hit rate usually much less
- Content-based approach
  - Cache smaller chunks instead of objects
  - Protocol independent
  - Effective for uncacheable content as well
  - Used in WAN accelerators and storage/file systems
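The difference between the two approaches can be illustrated with a toy byte hit rate calculation. This is a sketch under simplifying assumptions: the object cache is idealized (a hit means the same URL returns an identical body, ignoring HTTP cacheability headers and eviction), and the chunk cache uses fixed-size chunks for brevity, whereas deployed WAN accelerators often use content-defined chunk boundaries. Function names and data are hypothetical.

```python
import hashlib


def object_byte_hit_rate(responses):
    """Idealized object cache keyed by URL: a hit requires an identical body."""
    cache = {}
    hit = total = 0
    for url, body in responses:
        total += len(body)
        if cache.get(url) == body:
            hit += len(body)
        cache[url] = body
    return hit / total if total else 0.0


def chunk_byte_hit_rate(responses, chunk_size=128):
    """Content-based cache: a chunk is a hit if its hash was seen before."""
    seen = set()
    hit = total = 0
    for _url, body in responses:
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            total += len(chunk)
            if digest in seen:
                hit += len(chunk)
            else:
                seen.add(digest)
    return hit / total if total else 0.0


# Hypothetical trace: an updated page (same URL, new tail) and an alias URL
# that serves the first version again.
shared = bytes(range(256))
v1 = shared + b"A" * 128
v2 = shared + b"B" * 128
responses = [("/news", v1), ("/news", v2), ("/copy", v1)]

object_rate = object_byte_hit_rate(responses)
chunk_rate = chunk_byte_hit_rate(responses)
print(object_rate, chunk_rate)
```

The object cache finds nothing to reuse here, while the chunk cache reuses the unchanged bytes across versions of one URL and across different URLs.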

Ideal Cache Hit Rate
- HTTP object-based: 17~28%
  - Mainly effective for JavaScript and images
- Content-based: 42~51% with 128-byte chunks (1.8~2.5x)
  - Effective for any content type
- Growth of the tail hurts caching

Origins of Redundancy
[Figure: US, 128-byte chunks; labels include Aborted and Content updates]
- Most of the additional savings come from redundancy
  - across different versions (intra-URL)
  - across different objects (inter-URL)

Required Cache Storage Size
[Figure: CN: 218 GB]
- 1-KB chunks outperform 128-B chunks once metadata overhead is counted
- MRC: Multi-Resolution Chunking (USENIX'10)
- Increases working set size
- Large cache storage highly desirable
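A rough way to see why metadata overhead favors larger chunks: each cached chunk needs an index entry, so the index grows as the chunk size shrinks. The 32-byte entry size below is an assumed figure for illustration only, and the 218 GB value is borrowed from the slide purely as an example cache size.

```python
def index_overhead_bytes(cache_bytes, chunk_size, entry_bytes=32):
    """Index size for a chunk cache, assuming one fixed-size entry per chunk."""
    num_chunks = -(-cache_bytes // chunk_size)  # ceiling division
    return num_chunks * entry_bytes


GB = 10 ** 9
cache_bytes = 218 * GB  # example size; entry_bytes is a hypothetical value

overhead_128b = index_overhead_bytes(cache_bytes, 128)
overhead_1kb = index_overhead_bytes(cache_bytes, 1024)
print(overhead_128b / GB, overhead_1kb / GB)
```

Under these assumptions the 128-byte index is eight times larger than the 1-KB index, which is storage that cannot hold content.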

Conclusions
- Analyzed five years of real Web traffic with over 70,000 users
- Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking 65% of users
- Developed StreamStructure
- Half of the traffic occurs due to client-side interactions after initial page loads
- Pages have become increasingly complex
- Content-based caching with large cache storage highly desirable: 2x larger byte hit rate, aborted transfers

Thank You
sihm@cs.princeton.edu
http://www.cs.princeton.edu/~sihm/