Web Caching Dr. Yingwu Zhu
What is Web Caching? Web caching introduces proxy servers at certain points in the network that cache Web documents for faster client access. It is comparable to the cache memory in a computer system.
Proxy Cache [Figure: clients send requests (Req.) to the proxy; replies (Reply) come back from the servers through the proxy.]
How? Clients send requests to the proxy. If the requested document is in its cache, the proxy serves the request from its cache. Otherwise, the proxy forwards the request to the server. The server replies through the proxy, and the proxy keeps a copy of the requested document.
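The hit/miss behaviour described above can be sketched in a few lines. This is a minimal illustration, not a real proxy: `ProxyCache` and `fake_origin` are hypothetical names, and the "server" is just a local function standing in for a network fetch.

```python
from typing import Callable, Dict


class ProxyCache:
    """Toy forward proxy: serve from cache on a hit, fetch and store on a miss."""

    def __init__(self, fetch_from_server: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_server      # hypothetical origin-fetch callback
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> bytes:
        if url in self._store:               # cache hit: the server is not contacted
            self.hits += 1
            return self._store[url]
        self.misses += 1                     # cache miss: forward to the server
        body = self._fetch(url)
        self._store[url] = body              # keep a copy for later requests
        return body


def fake_origin(url: str) -> bytes:
    """Stand-in for a real web server (assumption for this demo)."""
    return f"<html>content of {url}</html>".encode()


proxy = ProxyCache(fake_origin)
first = proxy.get("/pub/www/index.html")     # miss: fetched from the "server"
second = proxy.get("/pub/www/index.html")    # hit: served from the cache
print(proxy.hits, proxy.misses)
```

The sketch ignores freshness entirely; a cached copy is served forever, which is exactly the problem the consistency slides address.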
Why Web Caching? Rapid growth in HTTP traffic, which forms the largest part of Internet traffic, causes more network congestion and server unavailability. The number of static Web pages almost doubles every year. Some old data: –Number of unique pages: 800M < X < 2.2B –Number of unique web sites: 8,500,000 –Static pages: 30% - 40% –Pages revisited: 80% –Expected hit rate: 24% - 32%
Why Web Caching? Bandwidth Latency Performance = Response Time Server Load Failure Redundancy
Expected Gains Bandwidth saving Improving content availability. Improving web server availability. Server load balancing. Reducing user-perceived latency
What: Content and Protocols HTTP 1.0 Basic protocol –Send a request based on a fixed number of verbs: GET, HEAD, POST –Receive a response: metadata and content
What: Content and Protocols
HTTP Request
Request = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request = Request-Line
               *( General-Header | Request-Header | Entity-Header )
               CRLF
               [ Entity-Body ]
What: Content and Protocols
Example:
GET /pub/www/index.html HTTP/1.0
Response:
HTTP/ OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct :46:53 GMT
Expires: Sun, 20 Oct :00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
Example "if-modified-since" (document modified since the given date):
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct :43:31 GMT
Response:
HTTP/ OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul :46:53 GMT
Expires: Sun, 20 Oct :00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
Example "if-modified-since" (document not modified since the given date):
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct :43:31 GMT
Response:
HTTP/ Not Modified
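How a server might decide between a full reply and "Not Modified" can be sketched as follows. This is a simplified illustration under stated assumptions: `conditional_get` is a hypothetical helper, the dates are made up for the demo, and real servers handle more cases (entity tags, invalid dates, clock skew).

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional, Tuple


def conditional_get(last_modified: datetime,
                    if_modified_since: Optional[str]) -> Tuple[int, str]:
    """Return (status, reason) for a GET with an optional If-Modified-Since header."""
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None                      # unparsable date: ignore the header
        if since is not None:
            if since.tzinfo is None:          # treat a zone-less date as UTC
                since = since.replace(tzinfo=timezone.utc)
            if last_modified <= since:
                return 304, "Not Modified"    # the client's copy is still current
    return 200, "OK"                          # send the full document


# Hypothetical modification time, chosen only for the demo.
modified = datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
same = format_datetime(modified, usegmt=True)
older = format_datetime(modified - timedelta(hours=1), usegmt=True)

print(conditional_get(modified, same))    # copy is as new as the document
print(conditional_get(modified, older))   # document changed after the copy was made
print(conditional_get(modified, None))    # plain, unconditional GET
```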
HTTP support for caching Conditional requests (IMS) Servers can set expires and max-age Request indirection: application level routing Range requests, entity tag Cache-control header –Requests: min-fresh, max-stale, no-transform –Responses: must-revalidate, public, private, no-cache
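A cache has to read the Cache-control directives before deciding what to do with a response. The sketch below is a deliberately naive parser: `parse_cache_control` and `may_store_in_shared_cache` are hypothetical helpers, the comma split does not handle quoted values that themselves contain commas, and only the `private` directive from the list above is checked.

```python
from typing import Dict, Optional


def parse_cache_control(value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-control header value into {directive: argument-or-None}."""
    directives: Dict[str, Optional[str]] = {}
    for part in value.split(","):             # naive: breaks on quoted commas
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if sep else None
    return directives


def may_store_in_shared_cache(response_cache_control: str) -> bool:
    """A shared (proxy) cache must not store responses marked private."""
    return "private" not in parse_cache_control(response_cache_control)


print(parse_cache_control("max-age=3600, must-revalidate"))
print(may_store_in_shared_cache("private"))
print(may_store_in_shared_cache("public, max-age=60"))
```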
Where [Figure: cache locations between browser and content server. Labels: browser cache, intranet, local ISP cache, ISP, CDN cache, L4 switch, reverse proxy, data center, content servers.]
Cache Types Proxy Caching Reverse Proxy Caching Transparent Caching Adaptive Caching Push Caching Active Caching
Proxy Caching Harvest/Squid Provide web content for a fixed user base Deployed at the network edges (company or institutional gateway or firewall hosts) Standalone operation Manual configuration in web browsers Commodity product/technology Single point of failure
Reverse Proxy Caching Designed to offload duties from one or more specific servers Data size is limited to size of static content on the server Challenge is fast, disk-less operation Cache consistency is easy
Transparent Caching Intercept HTTP requests and redirect them to web cache servers or cache clusters No client configuration Violates end-to-end paradigm –Client thinks it is talking directly to server –Server thinks it is talking to cache Implemented as: L4-switch –Layer 4 switch makes switching decisions based on TCP or UDP port number, e.g., port 80 for HTTP
Adaptive Caching ISP Level caching, global data placement optimization Cooperating multiple distributed caches Operate as a cache-mesh based on content demand Cache Group Management Protocol –How meshes are formed –How individual caches join/leave the meshes Content Routing Protocol sends request to the appropriate cache within the meshes Uses distributed cache meshes to solve the hot spot problem Caches dynamically join and leave the groups based on content demand Administrative boundaries must be relaxed
Push Caching Keep data close to the clients requesting it Send the data out proactively Assumption: we are able to launch caches that may cross administrative boundaries Incurs cost (storage and transmission)
Active Caching Applies caching to dynamic documents 30% of client HTTP requests contain cookies The server provides the cache with the objects and any associated cache applets –Use an applet inside the cache to customize dynamic pages on the fly
Cache Placement/Deployment Close to clients/content consumers –Proxy caching –Transparent proxy caching Close to servers/content providers –Improve access to logical sets of data –Delay-sensitive data: video, audio –Reverse proxy caching –Push caching Network choke points: strategic deployment –Adaptive caching –Problem with administrative control
Zipf Law vs. Web Access Zipf Law Web Access Caching?
Zipf’s Law Zipf’s law: The frequency $P_i$ of an event as a function of its rank $i$ is a power law function: $P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$
Zipf’s Law Observed to be true for –Frequency of written words in English texts –Population of cities –Income of a company as a function of rank
Zipf’s Law vs. Web Access For a given server, page access by rank follows Zipf’s law Web requests from a fixed population of users follow Zipf’s law, with 0.64 < α < 0.83
Observations Top 1% of all documents account for 20% - 35% of proxy requests Top 10% account for 45% - 55% of requests It takes 25% to 40% of all documents to account for 70% of requests It takes 70% to 80% of all documents to account for 90% of requests
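The Zipf model can be explored numerically to see how much of the request stream the most popular documents would capture. The sketch below assumes a hypothetical collection of 100,000 documents and α = 0.75 (inside the range quoted for Web access); the results depend on both assumptions and need not reproduce the measured figures above.

```python
from typing import List


def zipf_probabilities(n_docs: int, alpha: float) -> List[float]:
    """Request probability of each document by rank: P_i = omega / i**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n_docs + 1)]
    omega = 1.0 / sum(weights)            # normalizing constant so that sum(P_i) = 1
    return [omega * w for w in weights]


def top_share(probs: List[float], fraction: float) -> float:
    """Share of all requests that go to the top `fraction` of documents."""
    k = max(1, int(len(probs) * fraction))
    return sum(probs[:k])


probs = zipf_probabilities(100_000, alpha=0.75)   # assumed size and exponent
for fraction in (0.01, 0.10, 0.40):
    print(f"top {fraction:.0%} of documents -> {top_share(probs, fraction):.1%} of requests")
```

Under this model, `top_share(probs, f)` is also the best hit rate an idealized cache holding exactly the top fraction `f` of documents could reach, which is one way to read the heavy skew when sizing a cache.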
Zipf’s Law and Caching Discussion How does this help in cache design?
Basic caching algorithm Pages may be Fresh: up-to-date Expired: current date > expiration date Stale: “old”
Basic caching algorithm - #2
If (page is in the cache)
    If (page is expired or stale)
        Ask the server with if-modified-since
        If not modified, get from cache
        Else get from server
    Else get from cache
Else
    Get from server
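The lookup logic above can be written out as a small function. This is a sketch under assumptions: times are plain numbers rather than HTTP dates, `Entry`, `lookup`, and `demo_origin` are hypothetical names, and the origin callback returns a 304 status to mean "not modified".

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    body: bytes
    last_modified: float
    expires: float


# origin(url, if_modified_since, now) -> (status, body, last_modified, expires)
Origin = Callable[[str, Optional[float], float],
                  Tuple[int, Optional[bytes], float, float]]


def lookup(cache: Dict[str, Entry], url: str, now: float, origin: Origin) -> bytes:
    entry = cache.get(url)
    if entry is not None and now <= entry.expires:
        return entry.body                          # in cache and fresh
    # Missing or expired: ask the server, conditionally if we hold a copy.
    ims = entry.last_modified if entry is not None else None
    status, body, last_modified, expires = origin(url, ims, now)
    if status == 304 and entry is not None:
        entry.expires = expires                    # copy confirmed, extend its life
        return entry.body
    assert body is not None                        # a 200 reply carries the document
    cache[url] = Entry(body, last_modified, expires)
    return body


def demo_origin(url: str, if_modified_since: Optional[float],
                now: float) -> Tuple[int, Optional[bytes], float, float]:
    """Stand-in server: document last changed at t=100, replies expire after 60."""
    last_modified, expires = 100.0, now + 60.0
    if if_modified_since is not None and last_modified <= if_modified_since:
        return 304, None, last_modified, expires
    return 200, b"<html>v1</html>", last_modified, expires


cache: Dict[str, Entry] = {}
a = lookup(cache, "/x", 200.0, demo_origin)   # miss: fetched from the server
b = lookup(cache, "/x", 210.0, demo_origin)   # fresh: served from the cache
c = lookup(cache, "/x", 300.0, demo_origin)   # expired: revalidated, not modified
```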
Basic caching algorithm - #3
If cache has space
    Store the file
Else
    1. Delete expired from cache
    2. Delete stale from cache
    3. Delete LRU from cache
    4. Delete largest/smallest from cache?
Cache Replacement Cache size is limited, need replacement policy LRU LFU Greedy-dual size Many others
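Of the policies listed, LRU is the simplest to sketch. The example below is a minimal byte-budgeted LRU cache; `LRUCache` is a hypothetical class for illustration, and it implements only LRU, not LFU or Greedy-dual size.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Evicts the least recently used objects when the byte budget is exceeded."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, url: str) -> Optional[bytes]:
        if url not in self._items:
            return None
        self._items.move_to_end(url)              # mark as most recently used
        return self._items[url]

    def put(self, url: str, body: bytes) -> None:
        if len(body) > self.capacity:
            return                                # too large to cache at all
        if url in self._items:                    # replacing an existing copy
            self.used -= len(self._items.pop(url))
        while self.used + len(body) > self.capacity:
            _, evicted = self._items.popitem(last=False)   # oldest entry first
            self.used -= len(evicted)
        self._items[url] = body
        self.used += len(body)


lru = LRUCache(capacity_bytes=10)
lru.put("/a", b"aaaa")
lru.put("/b", b"bbbb")
lru.get("/a")                 # "/a" is now more recently used than "/b"
lru.put("/c", b"cccc")        # needs space, so "/b" is evicted
```

Size-aware policies such as Greedy-dual size differ mainly in how the eviction victim is chosen; the bookkeeping around it stays similar.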
Cache Consistency Multiple copies of objects are created – How and when should the copies be renewed? Goals –Avoid stale copies –Keep unnecessary traffic as low as possible
Cache Consistency: Polling Solution 1: polling every time Implemented in HTTP using the optional "if-modified-since" request header field Benefit: strong consistency Drawback: very slow cache hit
Cache Consistency: Polling Solution 2: polling if TTL expires, widely used –Associate a TTL (12 hours or 2 days) with each cached object Implemented in HTTP using the optional "expires" header field Benefit: fast cache hit Drawback: weak cache consistency (5% stale) because the TTL is only an a priori estimate of an object's lifetime
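The trade-off between the two polling solutions can be made concrete with a toy simulation of a single cached object. Everything here is an assumption for illustration: `simulate` is a hypothetical function, times are abstract units, and a TTL of 0 stands for "poll every time".

```python
from typing import List, Tuple


def simulate(request_times: List[float], modification_times: List[float],
             ttl: float) -> Tuple[int, int]:
    """Return (server_contacts, stale_hits) for one object under a given TTL."""
    contacts = 0
    stale = 0
    validated_at = None        # time of the last fetch or validation
    cached_version = None      # version of the object held in the cache
    for t in request_times:
        # Version on the server = number of modifications so far.
        current_version = sum(1 for m in modification_times if m <= t)
        if validated_at is None or t - validated_at >= ttl:
            contacts += 1                      # poll the server
            cached_version = current_version
            validated_at = t
        elif cached_version != current_version:
            stale += 1                         # fast hit, but the copy is out of date
    return contacts, stale


requests = [float(t) for t in range(10)]       # one request per time unit
changes = [4.5]                                # the object changes once
print(simulate(requests, changes, ttl=0))      # poll every time
print(simulate(requests, changes, ttl=4))      # poll only when the TTL has expired
```

With these inputs, polling every time contacts the server on all ten requests and never serves a stale copy, while the TTL of 4 contacts the server three times and serves three stale copies.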
Cache Consistency Solution 3: Invalidation Protocols The server helps the proxy maintain consistency Invalidation protocols –When the proxy makes a request: Piggyback cache validation (PCV): the proxy sends a list of other potentially stale copies for the server to validate Piggyback cache invalidation (PCI): the server lists copies that have been updated since the last access –Use of volumes Volume lease: –The client receives a lease from the server –While the lease is valid, the client can retrieve copies from the proxy –When the lease expires, the client has to renew it Problems: scalability; servers need to keep cache state
Cache Cooperation Hierarchical caching –Cache servers form a hierarchy, a tree-like structure –Parent servers: at the top of the hierarchy, receive requests from child servers. If they do not have the requested objects, they ask either their own parents or the origin web servers –Sibling servers: if the local cache does not have the requested object, it asks its sibling caches. If the sibling caches do not have the object either, the local cache asks the parent cache
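The lookup order described above (local cache, then siblings, then parent, then origin server) can be sketched as follows. This is a simplification: `CacheNode` is a hypothetical class, siblings are checked by inspecting their store directly rather than by exchanging query messages, and nothing ever expires.

```python
from typing import Callable, Dict, List, Optional, Tuple


class CacheNode:
    def __init__(self, name: str, origin: Callable[[str], bytes],
                 parent: Optional["CacheNode"] = None) -> None:
        self.name = name
        self.origin = origin                     # used only by the top of the tree
        self.parent = parent
        self.siblings: List["CacheNode"] = []
        self.store: Dict[str, bytes] = {}

    def get(self, url: str) -> Tuple[bytes, str]:
        """Return (body, where the body was found, as seen from this node)."""
        if url in self.store:
            return self.store[url], "local"
        for sibling in self.siblings:            # siblings answer only from their cache
            if url in sibling.store:
                body, source = sibling.store[url], "sibling"
                break
        else:                                    # no sibling had the object
            if self.parent is not None:
                body, source = self.parent.get(url)[0], "parent"
            else:
                body, source = self.origin(url), "origin"
        self.store[url] = body                   # every cache on the path keeps a copy
        return body, source


origin_calls: List[str] = []


def origin(url: str) -> bytes:
    origin_calls.append(url)                     # count real server fetches
    return b"page"


parent = CacheNode("parent", origin)
a = CacheNode("a", origin, parent)
b = CacheNode("b", origin, parent)
a.siblings.append(b)
b.siblings.append(a)

r1 = a.get("/x")[1]    # siblings miss, parent fetches from the origin
r2 = b.get("/x")[1]    # found in sibling "a"
r3 = a.get("/x")[1]    # found locally
```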
Cache Hierarchies Use hierarchy to scale a proxy –Why? Larger population = higher hit rate (fewer compulsory misses) Larger effective cache size –Why is the population for a single proxy limited? Performance, administration, policy, etc. NLANR cache hierarchy –Most popular –9 top-level caches –Internet Cache Protocol (ICP) based –Squid/Harvest proxy How to locate content?
ICP (Internet cache protocol) Simple protocol to query another cache for content Uses UDP – why? ICP message contents –Type – query, hit, hit_obj, miss –Other – identifier, URL, version, sender address –Special message types used with UDP echo port Used to probe server or “dumb cache” Query and then wait till time-out (2 sec) Transfers between caches still done using HTTP
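To give a feel for how small an ICP message is, the sketch below packs and unpacks a query. The field layout and opcode numbers are assumed from ICP version 2 as described in RFC 2186 (a 20-byte header followed by the payload) and should be checked against the RFC before any real use; `build_query` and `parse_header` are hypothetical helpers, and no packet is actually sent.

```python
import struct
from typing import Tuple

# Assumed from RFC 2186: opcode values and ICP version number.
ICP_OP_QUERY, ICP_OP_HIT, ICP_OP_MISS = 1, 2, 3
ICP_VERSION = 2

# opcode (1 byte), version (1), message length (2), request number (4),
# options (4), option data (4), sender host address (4): 20 bytes, network order.
HEADER = struct.Struct("!BBHIII4s")


def build_query(request_number: int, url: str) -> bytes:
    """Build an ICP query: header + requester address + null-terminated URL."""
    payload = b"\x00" * 4 + url.encode("ascii") + b"\x00"
    length = HEADER.size + len(payload)
    header = HEADER.pack(ICP_OP_QUERY, ICP_VERSION, length,
                         request_number, 0, 0, b"\x00" * 4)
    return header + payload


def parse_header(message: bytes) -> Tuple[int, int, int, int]:
    """Return (opcode, version, message length, request number)."""
    opcode, version, length, reqnum, _opts, _optdata, _sender = HEADER.unpack_from(message)
    return opcode, version, length, reqnum


msg = build_query(7, "http://example.com/")
print(len(msg), parse_header(msg))
```

A message this small fits in a single UDP datagram, so a query costs one packet each way with no connection setup, which is one plausible answer to the "why UDP?" question above.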
Squid [Figure sequence: a client, a child cache, and a parent cache. Labels across the frames: web page request, ICP query, ICP MISS, ICP HIT.]
Hierarchical caching Ideally, want the cache mesh to behave as a single cache with equivalent capacity and processing capability ICP: many copies of popular objects created – capacity wasted High Latency: More than one hop needed for searching object How to improve? Discuss!
Problems with caching Over 50% of all HTTP objects are uncacheable. Sources: –Dynamic data: stock prices, frequently updated content –CGI scripts: results based on passed parameters –SSL: encrypted data is not cacheable; most web clients don’t handle mixed pages well, so many generic objects are transferred with SSL –Cookies: results may be based on passed data –Hit metering: the owner wants to measure the number of hits for revenue, etc., which leads to cache busting
Risks of Using Proxy Benefits: reduce latency, bandwidth saving, etc. Risks –Obsolete data –Violate client privacy: the proxy can keep a log file telling which objects the client has requested –Data integrity
Real Proxy Servers Squid: the most widely used; it works well and is free. Microsoft ISA Server 2004: Microsoft developed ISA to replace Microsoft Proxy Server. It is fully functional with Active Directory. Apache: the Apache web server has a module for reverse caching (experimental). Cisco Cache Engine: sits next to (mostly) Cisco routers and receives transparently redirected HTTP requests. CERN/W3C HTTPd: the original proxy server.