15-744: Computer Networking L-21: Caching and CDNs.

Presentation transcript:

15-744: Computer Networking L-21: Caching and CDNs

L-21; © Srinivasan Seshan
Caching & CDNs
- HTTP APIs
- Assigned reading
  - [FCAB98] Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol
  - [Cla00] Freenet: A Distributed Anonymous Information Storage and Retrieval System

Overview
- Web caches
- Content distribution networks
- Peer-to-peer networks

Web Caching
- Why cache HTTP objects?
  - Reduce client response time
  - Reduce network bandwidth usage
    - Wide area vs. local area use
- These two objectives are often in conflict
  - May do exhaustive local search to avoid using wide area bandwidth
  - Prefetching uses extra bandwidth to reduce client response time

Web Proxies
- Also used for security
  - Proxy is the only host that can access the Internet
  - Administrators make sure that it is secure
- Performance
  - How many clients can a single proxy handle?
- Caching
  - Provides a centralized coordination point to share information across clients
- How to index
  - Early caches used the file system to find a file
  - Metadata now kept in memory on most caches

Caching Proxies - Sources for Misses
- Capacity
  - How large a cache is necessary or equivalent to infinite
  - On disk vs. in memory → typically on disk
- Compulsory
  - First time access to document
  - Non-cacheable documents
    - CGI scripts
    - Personalized documents (cookies, etc.)
    - Encrypted data (SSL)
- Consistency
  - Document has been updated/expired before reuse
- Conflict → no such issue

Cache Hierarchies
- Use hierarchy to scale a proxy to more than a limited population
  - Why?
    - Larger population = higher hit rate
    - Larger effective cache size
  - Why is population for a single proxy limited?
    - Performance, administration, policy, etc.
- NLANR cache hierarchy
  - Most popular
  - 9 top level caches
  - Internet Cache Protocol (ICP) based
  - Squid/Harvest proxy

ICP
- Simple protocol to query another cache for content
- Uses UDP – why?
- ICP message contents
  - Type – query, hit, hit_obj, miss
  - Other – identifier, URL, version, sender address (is this needed?)
- Special message types used with UDP echo port
  - Used to probe server or “dumb cache”
- Transfers between caches still done using HTTP

Squid Cache ICP Use
- Upon query that is not in cache
  - Sends ICP_Query to each peer (or ICP_Decho to echo port of peer caches that do not speak ICP)
  - May also send ICP_Secho to origin server’s echo port
  - Sets timer to a short period (default 2 sec)
- Peer caches process queries and return either ICP_Hit or ICP_Miss
- Proxy begins transfer upon reception of ICP_Hit, ICP_Decho or ICP_Secho
- Upon timer expiration, proxy requests object from closest (RTT) parent proxy
  - Would be better to direct to parent that is towards origin server
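The decision rule on this slide can be sketched without any networking. The following is a minimal illustration, not Squid's actual code: the function name `choose_source`, the reply tuples, and the short reply labels ("HIT", "MISS", "DECHO", "SECHO") are all hypothetical stand-ins for the ICP messages named above.

```python
def choose_source(replies, parent_rtts, timeout=2.0):
    """Pick where to fetch an object from after an ICP query round.

    replies: list of (arrival_time_sec, peer, kind) tuples, where kind is one of
             "HIT", "MISS", "DECHO", "SECHO" (hypothetical labels).
    parent_rtts: dict mapping parent proxy name -> measured RTT in seconds.
    """
    # Walk replies in arrival order and take the first positive one
    # that arrived before the timer expired.
    for arrival, peer, kind in sorted(replies):
        if arrival > timeout:
            break
        if kind in ("HIT", "DECHO", "SECHO"):
            return peer
    # No usable reply in time: fall back to the closest parent by RTT.
    if parent_rtts:
        return min(parent_rtts, key=parent_rtts.get)
    return None


parents = {"parent1": 0.080, "parent2": 0.030}
fast_hit = choose_source([(0.3, "peerA", "MISS"), (0.5, "peerB", "HIT")], parents)
late_hit = choose_source([(0.3, "peerA", "MISS"), (2.5, "peerB", "HIT")], parents)
```

In the second call the hit arrives after the 2 second timer, so the sketch falls back to the parent with the lowest RTT, which is the behavior the slide criticizes as ignoring the direction of the origin server.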

Squid
[Figure sequence (six animation frames): a client, child caches, and a parent cache. Frame labels, in order: Web page request, ICP Query; ICP MISS; Web page request; Web page request, ICP Query; Web page request, ICP MISS, ICP HIT; Web page request.]

ICP vs HTTP
- Why not just use HTTP to query other caches?
- ICP is lightweight – positive and negative
  - Makes it easy to process quickly
  - Caches may process many more ICP requests than HTTP requests
  - HTTP has many functions that are not supported by ICP
  - ICP does not evolve with HTTP changes
- Adds extra RTT to any proxy-proxy transfer

Optimal Cache Mesh Behavior
- Minimize number of hops through mesh
  - Each hop adds significant latency
  - ICP hops can cost a 2 sec timeout each!
  - Strict hierarchies cost disk lookup, etc.
  - Especially painful for misses
- Share across many users and scale to many caches
  - ICP does not scale to a large number of peers
- Cache and fetch data close to clients

Hinting
- Have proxies store content as well as metadata about contents of other proxies (hints)
  - Minimizes number of hops through mesh
  - Size of hint cache is a concern – size of key vs. size of document
- Having hints can help consistency
  - Makes it possible to push updated documents or invalidations to other caches
- How to keep hints up-to-date?
  - Not critical – an incorrect hint results in extra lookups, not incorrect behavior
  - Can batch updates to peers

Summary Cache
- Primary innovation – use of compact representation of cache contents
  - Typical cache has 8GB of space and 8KB objects → 1M objects
  - Using 16-byte MD5 → 16MB per peer
  - Solution: Bloom filters
- Delayed propagation of hints
  - Waits until a threshold percentage of cached documents are not in summary
  - Perhaps should have looked at percentage of false hits?
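The sizing argument above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: binary units (GB = 2^30 bytes) are an assumption, and the 8 to 16 bits per object used for the Bloom filter estimate is the load factor quoted on the Bloom Filters slide.

```python
# Back-of-the-envelope numbers from the slide (binary units assumed).
CACHE_BYTES = 8 * 2**30    # 8 GB of cache space
OBJECT_BYTES = 8 * 2**10   # 8 KB per object
MD5_BYTES = 16             # one MD5 digest per cached object

objects = CACHE_BYTES // OBJECT_BYTES        # 2**20, about 1M objects
exact_dir_bytes = objects * MD5_BYTES        # 16 MB of digests per peer

# Bloom filter alternative at 8 to 16 bits per cached object.
bloom_bytes_low = objects * 8 // 8           # 1 MB per peer
bloom_bytes_high = objects * 16 // 8         # 2 MB per peer

print(objects, exact_dir_bytes // 2**20, bloom_bytes_low // 2**20, bloom_bytes_high // 2**20)
```

Under these assumptions the per-peer summary shrinks from 16 MB of digests to 1 to 2 MB of filter bits.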

Bloom Filters
- Proxy contents summarized as an M-bit value
- Each page stored contributes k hash values in range [1..M]
  - Bits for k hashes set in summary
- Check for page: if all of the page’s k hash bits are set in the summary, it is likely that the proxy has the page
- Tradeoff → false positives
  - Larger M reduces false positives
  - What should M be? 8-16 × number of pages seems to work well
  - What about k? Is related to (M / number of pages) → 4 works for above M
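A minimal Bloom filter sketch follows, using the slide's parameters (M = 16 × number of pages, k = 4). The class name, method names, and the way the k bit positions are derived (salting an MD5 hash with the index i) are illustrative assumptions, not the construction used by Summary Cache. Bit positions here are 0-based rather than the slide's [1..M].

```python
import hashlib


class BloomFilter:
    """An M-bit summary with k hash functions (illustrative sketch)."""

    def __init__(self, m_bits, k):
        self.m = m_bits
        self.k = k
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, url):
        # Derive k bit positions by hashing the URL with k different salts.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{url}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        # True means "probably cached"; False means "definitely not cached".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


pages = [f"http://example.org/page{i}" for i in range(1000)]
summary = BloomFilter(m_bits=16 * len(pages), k=4)
for page in pages:
    summary.add(page)

false_hits = sum(summary.might_contain(f"http://example.org/other{i}")
                 for i in range(1000))
```

A page that was added is always reported as present, while `false_hits` counts pages that were never added but still match; a false hit costs a wasted query to the peer, not a wrong answer.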

Leases
- Only consistency mechanism in HTTP is for clients to poll server for updates
- Should HTTP also support invalidations?
  - Problem: server would have to keep track of many, many clients who may have the document
  - Possible solution: leases
- Leases – server promises to provide invalidations for a particular lease duration
- Server can adapt time/duration of lease as needed
  - To number of clients, frequency of page change, etc.
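The bookkeeping a lease implies on the server side can be sketched as follows. This is a toy model under stated assumptions: `LeaseServer`, its methods, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are hypothetical and are not part of HTTP.

```python
class LeaseServer:
    """Tracks which clients hold an unexpired lease on each URL (sketch)."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self.leases = {}  # url -> {client: expiry_time}

    def fetch(self, client, url, now):
        # Serving the document also grants (or renews) a lease.
        expiry = now + self.lease_seconds
        self.leases.setdefault(url, {})[client] = expiry
        return expiry

    def update(self, url, now):
        # The document changed: only clients with live leases must be told.
        holders = self.leases.pop(url, {})
        return sorted(c for c, expiry in holders.items() if expiry > now)


server = LeaseServer(lease_seconds=60)
server.fetch("clientA", "/index.html", now=0)    # lease expires at t=60
server.fetch("clientB", "/index.html", now=50)   # lease expires at t=110
to_invalidate = server.update("/index.html", now=70)
```

The server's state is bounded by the lease duration: `clientA`'s lease has lapsed by t=70, so only `clientB` needs an invalidation, and `clientA` must poll again before reusing its copy.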

Problems
- Over 50% of all HTTP objects are uncacheable – why?
- Not easily solvable
  - Dynamic data → stock prices, scores, web cams
  - CGI scripts → results based on passed parameters
- Obvious fixes
  - SSL → encrypted data is not cacheable
    - Most web clients don’t handle mixed pages well → many generic objects transferred with SSL
  - Cookies → results may be based on passed data
  - Hit metering → owner wants to measure # of hits for revenue, etc.
- What will be the end result?

Proxy Implementation Problems
- Aborted transfers
  - Many proxies transfer entire document even though client has stopped → eliminates saving of bandwidth
- Making objects cacheable
  - Proxies apply heuristics → cookies don’t apply to some objects, guesswork on expiration
  - May not match client behavior/desires
- Client misconfiguration
  - Many clients have either absurdly small caches or no cache
  - How much would hit rate drop if clients did the same things as proxies?

Problems – Population Size
- How does population size affect hit rate?
- Critical to understand usefulness of hierarchy or placement of caches
- Issues: frequency of access vs. frequency of change (ignore working set size → infinite cache)
- UW/Msoft measurement → hit rate rises quickly to about 5000 people and very slowly beyond that
- Proxies/hierarchies don’t make much sense for populations > 5000
  - Single proxies can easily handle such populations
  - Hierarchies only make sense for policy/administrative reasons

Problems – Common Interests
- Do different communities have different interests?
  - I.e., do CS and English majors access the same pages? IBM and Pepsi workers?
  - Has some impact → UW departments have about 5% higher hit rate than randomly chosen UW groups
  - Many common interests remain
- Is this true in general? UW students have more in common than IBM & Pepsi workers
- Some related observations
  - Geographic caching – server traces have shown that there is geographic locality to interest
  - UW & MS hierarchy performance is bad – could be due to size or interests?

Overview
- Web caches
- Content distribution networks
- Peer-to-peer networks

CDN
- Replicate content on many servers
- Challenges
  - How to replicate content
  - Where to replicate content
  - How to find replicated content
  - How to choose among known replicas
  - How to direct clients towards replica
    - Discussed in DNS/server selection lecture
    - DNS, HTTP redirect (302) response, anycast, etc.
- Akamai

How Akamai Works
- Clients fetch html document from primary server
  - E.g. fetch index.html from cnn.com
- URLs for replicated content are replaced in html
  - E.g. ... replaced with ...
- Client is forced to resolve aXYZ.g.akamaitech.net hostname

How Akamai Works
- How is content replicated?
- Akamai only replicates static content
- Modified name contains original file
- Akamai server is asked for content
  - First checks local cache
  - If not in cache, requests file from primary server and caches file

How Akamai Works
- Root server gives NS record for akamai.net
- akamai.net name server returns NS record for g.akamaitech.net
  - Name server chosen to be in region of client’s name server
  - TTL is large
- g.akamaitech.net name server chooses server in region
  - Should try to choose server that has file in cache – how to choose?
  - Uses aXYZ name and consistent hash
  - TTL is small

Consistent Hash
- “view” = subset of all hash buckets that are visible
- Desired features
  - Smoothness – little impact on hash bucket contents when buckets are added/removed
  - Spread – small set of hash buckets that may hold an object regardless of views
  - Load – across all views # of objects assigned to hash bucket is small

Consistent Hash – Example Construction
- Assign each of C hash buckets to K log(C) random points on unit interval
- Map object to random position on unit interval
- Hash of object = closest bucket
- Monotone → addition of bucket does not cause movement between existing buckets
- Spread & Load → small set of buckets that lie near object
- Balance → no bucket is responsible for large portion of unit interval
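The construction above can be sketched in a few lines. This is an illustration under assumptions, not Akamai's implementation: points are derived from SHA-1 so the example is deterministic (the slide uses random points), `points_per_bucket` stands in for K log(C), and the names `ConsistentHash` and `unit_point` are hypothetical.

```python
import bisect
import hashlib


def unit_point(label):
    """Map a string to a pseudo-random position on the unit interval."""
    digest = hashlib.sha1(label.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ConsistentHash:
    def __init__(self, points_per_bucket=8):
        self.points_per_bucket = points_per_bucket
        self.points = []  # sorted list of (position, bucket)

    def add_bucket(self, bucket):
        for i in range(self.points_per_bucket):
            bisect.insort(self.points, (unit_point(f"{bucket}#{i}"), bucket))

    def remove_bucket(self, bucket):
        self.points = [p for p in self.points if p[1] != bucket]

    def lookup(self, obj):
        """Return the bucket owning the point closest to the object.

        Requires at least one bucket to have been added.
        """
        x = unit_point(obj)
        i = bisect.bisect_left(self.points, (x, ""))
        # Only the points immediately left and right of x can be closest.
        neighbours = self.points[max(i - 1, 0):i + 1]
        return min(neighbours, key=lambda p: abs(p[0] - x))[1]


ring = ConsistentHash()
for name in ("cacheA", "cacheB", "cacheC"):
    ring.add_bucket(name)
objects = [f"a{i}.g.akamaitech.net/foo{i}.jpg" for i in range(200)]
before = {obj: ring.lookup(obj) for obj in objects}

ring.add_bucket("cacheD")
after = {obj: ring.lookup(obj) for obj in objects}
```

Comparing `before` and `after` shows the monotone property: an object either stays on its old bucket or moves to the newly added `cacheD`, never between two existing buckets.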

How Akamai Works
[Figure: numbered request flow among End-user, cnn.com (content provider), DNS root server, Akamai high-level DNS server, Akamai low-level DNS server, and closest Akamai server. Request labels: "Get index.html", "Get /cnn.com/foo.jpg", "Get foo.jpg".]

Akamai – Subsequent Requests
[Figure: same participants as the previous figure. Request labels: "Get index.html", "Get /cnn.com/foo.jpg".]

Overview
- Web caches
- Content distribution networks
- Peer-to-peer networks

Peer-to-Peer Networks
- Typically each member stores content that it desires
- Basically a replication system for files
  - Always a tradeoff between possible location of files and searching difficulty
  - Peer-to-peer allows files to be anywhere → searching is the challenge
  - Dynamic member list makes it more difficult

Napster
- Simple centralized scheme → motivated by ability to sell/control
- On startup client contacts central server and reports list of files
- Upon query, central server returns list of possible clients that store data
- Transfer is done peer-to-peer

Gnutella
- On startup client contacts any servent (server + client) in network
- Basic message header
  - Unique ID, TTL, Hops
- Transfers are done with HTTP between peers
- Servent interconnection used to forward control (queries, hits, etc.)

Gnutella Details
- Message types
  - Ping – probes network for other servents
  - Pong – response to ping, contains IP addr, # of files, # of Kbytes shared
  - Query – search criteria + speed requirement of servent
  - QueryHit – successful response to Query, contains addr + port to transfer from, speed of servent, number of hits, hit results, servent ID
  - Push – request to servent ID to initiate connection, used to traverse firewalls
- Ping, Queries are flooded
- QueryHit, Pong, Push follow reverse path of previous message
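Flooding with a TTL and reverse-path replies can be simulated on a small graph. This is a simplified model, not the Gnutella wire protocol: `flood_query`, the adjacency-list `graph`, and the `files` map are hypothetical, and a single `seen` set stands in for each servent's check of the message's unique ID.

```python
from collections import deque


def flood_query(graph, files, origin, wanted, ttl):
    """Flood a query from origin; return {servent_with_file: reverse path to origin}."""
    seen = {origin}               # duplicate suppression (unique message ID)
    came_from = {origin: None}    # reverse-path state kept at each servent
    hits = {}
    queue = deque([(origin, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if node != origin and wanted in files.get(node, ()):
            # A QueryHit travels back along the path the Query arrived on.
            path, hop = [], node
            while hop is not None:
                path.append(hop)
                hop = came_from[hop]
            hits[node] = path
        if remaining == 0:
            continue              # TTL exhausted: do not forward further
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                came_from[neighbour] = node
                queue.append((neighbour, remaining - 1))
    return hits


graph = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"], "E": ["A"]}
files = {"C": {"song.mp3"}, "D": {"song.mp3"}}
near = flood_query(graph, files, "A", "song.mp3", ttl=2)
far = flood_query(graph, files, "A", "song.mp3", ttl=3)
```

With `ttl=2` the query reaches only C; raising it to 3 also finds the copy on D, at the cost of every reachable servent processing the query.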

Freenet
- Anonymity a primary goal
- Files are stored according to associated key
  - Core idea: try to cluster information about similar keys
- Messages
  - Random 64-bit ID used for loop detection
  - TTL
    - Messages with TTL 1 are forwarded with finite probability
    - Helps anonymity
  - Depth counter
    - Opposite of TTL – incremented with each hop
    - Depth counter initialized to small random value

Freenet Requests
- User requests key XYZ – not in local cache
- Looks up nearest key in routing table and forwards to corresponding node
- If request reaches node with data, it forwards data back to upstream requestor
  - Requestor adds file to cache, adds entry in routing table
  - Any node forwarding reply may change the source of the reply (to itself or any other node)
    - Helps anonymity
- If data is not found, failure is reported back
  - Requestor then tries next closest match in routing table
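The request procedure above amounts to a depth-first search steered by key closeness, with backtracking on failure and caching on the reply path. The sketch below makes several simplifying assumptions: keys are integers and closeness is absolute difference, a shared `visited` set stands in for the message-ID loop check, and `freenet_request` and the `nodes` layout are hypothetical. It also omits the source-rewriting step used for anonymity.

```python
def freenet_request(nodes, start, key, hops_to_live, visited=None):
    """Return the data stored under key, or None if the request fails.

    nodes: {name: {"store": {key: data}, "routes": {known_key: neighbour}}}
    """
    visited = set() if visited is None else visited
    if start in visited or hops_to_live <= 0:
        return None
    visited.add(start)  # loop detection
    node = nodes[start]
    if key in node["store"]:
        return node["store"][key]
    # Try neighbours in order of how close their known key is to the request.
    for known_key in sorted(node["routes"], key=lambda k: abs(k - key)):
        neighbour = node["routes"][known_key]
        data = freenet_request(nodes, neighbour, key, hops_to_live - 1, visited)
        if data is not None:
            node["store"][key] = data         # cache the file on the reply path
            node["routes"][key] = neighbour   # remember where this key came from
            return data
    return None  # every candidate failed: report failure upstream


nodes = {
    "A": {"store": {}, "routes": {10: "B", 50: "C"}},
    "B": {"store": {}, "routes": {12: "D"}},
    "C": {"store": {14: "doc"}, "routes": {}},
    "D": {"store": {}, "routes": {}},
}
result = freenet_request(nodes, "A", key=14, hops_to_live=5)
```

Here A first tries B (known key 10 is closest to 14), the search fails at D and backtracks, and A then succeeds via C. Afterwards A holds a cached copy and a routing entry for key 14, which is how nodes come to specialize in nearby keys.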

Freenet Request
[Figure: numbered request sequence among nodes A, B, C, D, E, and F. Legend: Data Request, Data Reply, Request Failed.]

Freenet Search Features
- Nodes tend to specialize in searching for similar keys over time
  - Gets queries from other nodes for similar keys
- Nodes store similar keys over time
  - Caching of files as a result of successful queries
- Similarity of keys does not reflect similarity of files
- Routing does not reflect network topology

Freenet File Creation
- Key for file generated and searched → helps identify collision
  - Not found (“All clear”) result indicates success
  - Source of insert message can be changed by any forwarding node
- Creation mechanism adds files/info to locations with similar keys
- New nodes are discovered through file creation
- Erroneous/malicious inserts propagate original file further

Cache Management
- LRU cache of files
- Files are not guaranteed to live forever
  - Files “fade away” as fewer requests are made for them
- File contents can be encrypted with original text names as key
  - Cache owners do not know either original name or contents → cannot be held responsible
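The "fade away" behavior follows directly from LRU eviction, which can be sketched with the standard library's `OrderedDict`. The class `LRUFileCache` and its capacity, counted in files rather than bytes, are simplifying assumptions for illustration.

```python
from collections import OrderedDict


class LRUFileCache:
    """Fixed-capacity store that evicts the least recently used file."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.files:
            return None
        self.files.move_to_end(key)  # a request refreshes the file
        return self.files[key]

    def put(self, key, data):
        self.files[key] = data
        self.files.move_to_end(key)
        if len(self.files) > self.capacity:
            self.files.popitem(last=False)  # drop the least recently used file


cache = LRUFileCache(capacity=2)
cache.put("key1", b"popular file")
cache.put("key2", b"unpopular file")
cache.get("key1")                    # key1 is requested again
cache.put("key3", b"new file")       # key2 is evicted
```

No file is ever deleted explicitly; `key2` disappears only because nothing requested it before space was needed.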

Freenet Naming
- Freenet deals with keys
  - But humans need names
  - Keys are flat → would like structure as well
- Could have files that store keys for other files
  - File /text/philosophy could store keys for files in that directory → how to update this file though?
- Search engine → undesirable centralized solution

Freenet Naming - Indirect Files
- Normal files stored using content-hash key
  - Prevents tampering, enables versioning, etc.
- Indirect files stored using name-based key
  - Indirect files store keys for normal files
  - Inserted at same time as normal file
- Has same update problems as directory files
  - Updates handled by signing indirect file with public/private key
  - Collisions for insert of new indirect file handled specially → check to ensure same key used for signing
- Allows for files to be split into multiple smaller parts

Next Lecture: QOS & IntServ
- QOS
- IntServ Architecture
- Assigned reading
  - [She95] Fundamental Design Issues for the Future Internet
  - [CSZ92] Supporting Real-Time Applications in an Integrated Services Packet Network: Architecture and Mechanisms