Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol By Abuzafor Rasal and Vinoth Rayappan
Web caching [Figure: Client1, Client2, Client3, Cache, and Server; arrows 1 (HTTP request) and 2 (HTTP response)]
Web Cache Sharing [Figure labels: Users, Proxy Caches, Regional Network, Rest of Internet, Bottleneck...]
Web Cache Sharing: Internet Cache Protocol (ICP) The Internet Cache Protocol is the currently deployed technique for web cache sharing. With ICP, a proxy multicasts a query message to all other proxies whenever a cache miss occurs.
Internet Cache Protocol [Figure: a Client, four Proxy Caches, and the Internet]
Internet Cache Protocol [Figure: Clients 1 to n, Proxies 1 to N, and the Internet; HTTP links] First request: the document is available in the local proxy (HTTP hit).
Internet Cache Protocol [Figure: Clients 1 to n, Proxies 1 to N, and the Internet; HTTP and ICP links] Second request: the document is not available in the local proxy.
Problem of ICP As the number of collaborating proxies increases, the overhead increases dramatically, so ICP is not scalable. –A proxy multicasts a query message to all other proxies whenever a cache miss occurs
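To make the scaling problem concrete, here is a minimal sketch. It assumes, purely for illustration, that every local miss costs one query to each other proxy plus one reply from each; the function name and the miss count are hypothetical and do not come from the slides.

```python
def icp_messages(num_proxies: int, misses_per_proxy: int) -> int:
    """Total ICP messages when every proxy sees `misses_per_proxy` local misses.

    Assumption (not from the slides): each miss costs one query and one
    reply per peer, i.e. 2 * (num_proxies - 1) messages.
    """
    messages_per_miss = 2 * (num_proxies - 1)
    return num_proxies * misses_per_proxy * messages_per_miss


if __name__ == "__main__":
    # Same miss count per proxy, growing number of cooperating proxies.
    for n in (2, 4, 8, 16):
        print(n, "proxies:", icp_messages(n, misses_per_proxy=1000), "messages")
```

Under these assumptions the total grows roughly with the square of the number of proxies, which is the overhead the slide refers to.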
Problem of ICP UDP = ICP query and reply messages. TCP = HTTP traffic between proxies, servers, and clients. Total packets (IP) = UDP + TCP.
Summary Cache Each proxy maintains a Bloom filter (a compressed representation) of its local cache. It also holds the Bloom filters representing the caches of the other proxies. Updates to the Bloom filters are exchanged periodically or after a certain percentage of the documents in the cache has been replaced. A request is sent only to the proxies that most likely hold the requested document.
Summary Cache [Figure: a Client, four Proxy Caches, and the Internet] First request: the document is in another proxy.
Summary Cache [Figure: a Client, four Proxy Caches, and the Internet] Second request: the document is not in any proxy.
Summary Cache [Figure: a Client, four Proxy Caches, and the Internet] Third request: the summary gives a false hit.
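The three request scenarios can be sketched in a few lines. This is a simplified illustration, not the protocol's actual implementation: plain Python sets stand in for the Bloom-filter summaries, peers are tried one at a time, and the function and variable names are hypothetical.

```python
from typing import Dict, Set, Tuple


def resolve(url: str,
            local_cache: Set[str],
            peer_summaries: Dict[str, Set[str]],
            peer_caches: Dict[str, Set[str]]) -> Tuple[str, int]:
    """Return (where the document was found, number of queries sent)."""
    if url in local_cache:
        return "local", 0
    queries_sent = 0
    for peer, summary in peer_summaries.items():
        if url in summary:                  # summary says the peer probably has it
            queries_sent += 1               # only now is a query message sent
            if url in peer_caches[peer]:
                return peer, queries_sent   # remote hit
            # otherwise the summary gave a false hit and the query was wasted
    return "origin", queries_sent           # fall back to the web server


if __name__ == "__main__":
    local = {"a"}
    summaries = {"p2": {"b", "c"}, "p3": set()}   # what this proxy believes
    caches = {"p2": {"b"}, "p3": set()}           # what the peers really hold
    print(resolve("b", local, summaries, caches))  # document is in another proxy
    print(resolve("d", local, summaries, caches))  # document is not in any proxy
    print(resolve("c", local, summaries, caches))  # summary gives a false hit
```

In the second case no query is sent at all, which is where the savings over multicasting every miss come from.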
Summary Cache
Two parameters in the design of the Summary Cache protocol:
–The frequency of summary updates (inter-proxy traffic, overhead).
–The representation of the summary (memory).
Solutions:
–Delay summary updates until a fixed percentage (e.g. 1%) of the cached documents are new. Positive: reduces overhead (traffic). Negative: introduces "false miss" errors.
–Store summaries as Bloom filters, an efficient hash-based probabilistic scheme that represents the URLs of cached documents. Positive: reduces the memory requirement. Negative: introduces "false hit" errors.
Summary Cache
False misses:
–Definition: the requested document is cached at some other proxy, but that proxy's summary does not reflect the fact.
–Effect: a remote cache hit is lost, and the total hit ratio within the collection of caches is reduced.
–Improvement: can be reduced by updating the summaries more frequently.
False hits:
–Definition: the requested document is not cached at some other proxy, but its summary indicates that it is. The proxy sends a query message to the other proxy, only to be told that the document is not cached there.
–Effect: a query message is wasted.
–Improvement: can be reduced by increasing the vector size of the Bloom filter, i.e. the memory used for the representation.
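A small sketch of the delayed-update rule shows how a false miss arises. It assumes a proxy republishes its summary once the number of documents added since the last update reaches a fixed fraction of the cache; the class and attribute names are hypothetical, and a plain set stands in for the real summary.

```python
class DelayedSummary:
    """Publishes a new summary only after enough of the cache is new."""

    def __init__(self, threshold: float = 0.01):
        self.threshold = threshold     # e.g. 0.01 means 1% of cached documents
        self.cache = set()             # what the proxy really holds
        self.published = set()         # what the other proxies currently see
        self.new_since_update = 0

    def add(self, url: str) -> None:
        if url not in self.cache:
            self.cache.add(url)
            self.new_since_update += 1
        # Republish once the new documents reach the threshold fraction.
        if self.new_since_update >= self.threshold * len(self.cache):
            self.publish()

    def publish(self) -> None:
        self.published = set(self.cache)
        self.new_since_update = 0


if __name__ == "__main__":
    summary = DelayedSummary(threshold=0.1)    # 10% only to keep the demo short
    for i in range(11):
        summary.add(f"u{i}")
    # "u10" is cached but not yet advertised: a peer asking now sees a false miss.
    print("u10" in summary.cache, "u10" in summary.published)
```

A larger threshold means fewer update messages but a longer window in which newly cached documents are invisible to the other proxies.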
Summary Cache Remote stale hits: the document is cached at another proxy, but the cached copy is stale (this is not caused by update delay). –Delta compression can be used to transfer the new document. Delta compression transfers only the difference between the old and the new document instead of downloading the whole document.
Summary Cache Two factors limit the scalability: –The network overhead of inter-proxy communication, determined by the update frequency, false hits, and remote hits. –The memory required to store the summaries, determined by the size of an individual summary and the number of proxies.
Impact of Update Delay: Explanation of the Graph ICP = hit ratio when no update delay is introduced. exact_dir = hit ratio with update delay introduced. ICP minus exact_dir = hit ratio lost to the delay, i.e. the false misses. false-hit = a summary indicates a document that the remote proxy does not hold. stale-hit = remote hit on a copy that is stale (outdated); staleness is not reflected in the summary.
Impact of Update Delay: Observations from the Graph exact_dir = the hit ratio decreases linearly as the threshold increases. stale-hit = not affected by the threshold, because stale hits occur under both ICP and Summary Cache. false-hit = increases as the threshold increases, because a document deleted from a cache may still be shown as present in the summary.
Summary Representations Summary Representation = how to store the summaries in proxies. Summary needs to be stored in DRAM (main memory) –Disk arms become bottlenecks in proxy cache –DRAM price continues to drop –DRAM is faster
Summary Representations: Naïve approach
Exact-directory = the summary is essentially the list of URLs of cached documents, with each URL represented by its 16-byte MD5 signature.
–Positive: fewer errors
–Negative: consumes too much memory
Server-name = the web server names in the URLs of cached documents.
–Positive: cuts the memory requirement by a factor of 10
–Negative: generates too many false hits and thus increases network traffic
Summary Representations: Bloom Filters
Process
–Step 1: Feed each URL to four different hash functions.
–Step 2: Map each hash output (32 bits) to one bit position in the vector.
–Step 3: Set the four bits selected by the four hash functions in the vector.
Positive: consumes much less memory
Negative: introduces a small number of errors (false hits)
Summary Representations Server-name produces too much network traffic because a request is sent to every proxy whose summary contains the server name.
Bloom filter A Bloom filter is a technique for compressing the memory needed to represent a set, at the cost of occasional false hits. Summary Cache uses Bloom filters to compress its summaries. It is a method of representing a set A of n elements to support membership queries. It has also been used as a mechanism for identifying which pages have associated comments stored in a common knowledge server.
Problem? [Figure labels: Place A, Place B, cnn.com/index.html, wayne.edu/, Compact Representation, arbitrary URI ?, Bloom]
How does a Bloom filter work?
–Pick a large bit array with all bits set to 0.
–Pick a number of independent hash functions; in this case we have four (4).
–For every URL in the bag (the proxy's cache), apply the four hash functions to get four integers.
–Use the four integers as positions in the bit array and set the bits at those positions to 1.
–Repeat this for all URLs in the proxy's cache.
The above is the insertion process. To test whether a URL is present, hash it the same way and check that all four bits are 1. A Bloom filter is not encryption: it cannot be reversed to recover the URLs.
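The steps above can be sketched as follows. This is a minimal illustration rather than the deck's own implementation: it assumes the four hash values are obtained by splitting the 128-bit MD5 digest of the URL into four 32-bit words (in line with the "MD5 128 bit" and "32 bit hash" labels elsewhere in the deck), each reduced modulo the array size. The class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array plus four hash positions per URL (a sketch, see assumptions)."""

    def __init__(self, num_bits: int):
        self.num_bits = num_bits
        self.bits = bytearray((num_bits + 7) // 8)    # all bits start at 0

    def _positions(self, url: str):
        digest = hashlib.md5(url.encode("utf-8")).digest()   # 16 bytes = 128 bits
        for i in range(0, 16, 4):
            word = int.from_bytes(digest[i:i + 4], "big")    # one 32-bit hash value
            yield word % self.num_bits                       # position in the array

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)            # set the bit to 1

    def might_contain(self, url: str) -> bool:
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


if __name__ == "__main__":
    summary = BloomFilter(num_bits=1024)
    summary.add("cnn.com/index.html")
    print(summary.might_contain("cnn.com/index.html"))
    print(summary.might_contain("wayne.edu/"))   # usually False, but not guaranteed
```

Membership tests only read bits, so a URL that was never added can still test positive when other URLs happen to have set all four of its positions.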
How does a hash function work? A hash function turns data into a relatively small number that may serve as a digital "fingerprint" of the data.
Bloom filter A hashing technique: an m-bit vector, k independent hash functions, and a many-to-one mapping, hence "false positives".
Bloom filter: false positives - Given a query for b, we check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, b is not in the set A. - Otherwise we assume b is in the set A, although there is a certain probability that we are wrong. If false positives increase, the number of wasted accesses goes up; if false negatives increase, more remote cache hits are lost. The salient feature of the Bloom filter is the trade-off between memory size (the bit array) and the false positive rate.
Probability of false positives [Figure] Upper graph: 4 hash functions. Lower graph: optimal integral number of hash functions (5 hash functions).
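The curves can be approximated with the standard Bloom filter estimate: with m bits, n stored elements, and k hash functions, the false positive probability is about (1 - e^(-kn/m))^k, and the real-valued optimum is k = ln(2) * m/n. The sketch below evaluates this estimate; the function names are hypothetical, and the numbers it prints are analytical estimates, not values read from the graph.

```python
import math


def false_positive_rate(bits_per_item: float, num_hashes: int) -> float:
    """Approximate false positive rate (1 - e^(-k / (m/n)))^k."""
    return (1.0 - math.exp(-num_hashes / bits_per_item)) ** num_hashes


def best_integer_hash_count(bits_per_item: float) -> int:
    """Integer k next to ln(2) * m/n that gives the lowest approximate rate."""
    ideal = math.log(2) * bits_per_item
    candidates = {max(1, math.floor(ideal)), max(1, math.ceil(ideal))}
    return min(candidates, key=lambda k: false_positive_rate(bits_per_item, k))


if __name__ == "__main__":
    for bits_per_item in (8, 16, 32):
        k = best_integer_hash_count(bits_per_item)
        print(bits_per_item,
              false_positive_rate(bits_per_item, 4),   # fixed 4 hash functions
              k,
              false_positive_rate(bits_per_item, k))   # best integer k
```

The estimate makes the memory trade-off explicit: each extra bit per stored element lowers the false positive rate, and four hash functions stay close to the optimum only for small arrays.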
Bloom filters as summaries They provide a straightforward mechanism to build summaries. A proxy builds a Bloom filter from the URLs of its cached documents. Increasing the memory decreases false positives, so there is a clear trade-off between the two.
How are the hash functions built? [Figure labels: MD5, 128 bit; 32 bit hash …… bit hash]
Hit ratio
Observations on the cache hit ratio exact_dir and bloom_filter_8, _16, and _32 have virtually the same hit ratio, compared with server name. exact_dir gives the same hit ratio as the Bloom filters, but it consumes more memory because it stores a full signature for every URL. bloom_filter_8, _16, and _32 consume less memory than exact_dir because each URL is reduced to a few hashed bits.
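A rough memory comparison, as a sketch under stated assumptions: exact_dir stores one 16-byte MD5 signature per URL (as on the naïve-approach slide), and the suffix in bloom_filter_8, _16, _32 is taken to mean bits of filter per cached document, which the slides do not define. The document count, proxy count, and function names are hypothetical.

```python
def exact_dir_bytes(num_docs: int) -> int:
    """One 16-byte MD5 signature per cached URL."""
    return 16 * num_docs


def bloom_bytes(num_docs: int, bits_per_doc: int) -> int:
    """Filter size, assuming `bits_per_doc` bits of array per cached document."""
    return (num_docs * bits_per_doc + 7) // 8


def peer_summary_bytes(summary_bytes: int, num_proxies: int) -> int:
    """Memory one proxy needs to hold the summaries of all other proxies."""
    return summary_bytes * (num_proxies - 1)


if __name__ == "__main__":
    docs, proxies = 1_000_000, 16
    print("exact_dir", peer_summary_bytes(exact_dir_bytes(docs), proxies))
    for bits in (8, 16, 32):
        print(f"bloom_filter_{bits}",
              peer_summary_bytes(bloom_bytes(docs, bits), proxies))
```

Under these assumptions even the largest filter needs a quarter of the memory of the exact directory, and the gap scales with the number of cooperating proxies.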
False hit ratio under different summary representations
Observations on the false hit (miss) ratio Server name has a much higher false hit (miss) ratio. Why? Because the summary holds only the server name and not the specific address of the requested URL. The request is therefore sent to every proxy whose summary lists that server, although the document may be in only one of them, so the false hit ratio is high. exact_dir has the lowest false hit ratio of all, but it needs a large amount of memory.
Message per request
Observations on messages per request We included ICP for comparison. With ICP (without Summary Cache), the request is sent to all proxies to find the requested URL, so the number of messages per client request is high compared to the others. At the other extreme, bloom_filter_8, _16, _32 and exact_dir need far fewer messages per client request to find the URL, which makes them the economical choice. Server name falls in between, because its higher false hit (miss) ratio leads to more messages per client request.
Bytes of Msg size per request
Observations on size of inter-network messages in bytes
We consider this issue because update messages are larger than query messages. Summary Cache sends occasional bursts of large update messages in between the small query messages. This significantly reduces CPU overhead and network interface packets (results are in Tables 2 and 4).
For query messages:
                               Header size   Average URL
ICP and others                 20            50
For summary updates:
                               Header size   Bytes/change
Exact directory                20            16
Server name                    20            16
Bloom filter based summaries   32            4
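The byte counts in the table translate directly into message sizes. The sketch below only does that arithmetic; it assumes one update message carries a single header followed by its changes, and that for Bloom filters a "change" is one flipped bit. The function names and the example counts are hypothetical.

```python
QUERY_HEADER_BYTES = 20
AVERAGE_URL_BYTES = 50

# (header bytes, bytes per change) for each summary representation
UPDATE_COSTS = {
    "exact_directory": (20, 16),
    "server_name": (20, 16),
    "bloom_filter": (32, 4),
}


def query_bytes(num_queries: int) -> int:
    """Bytes for `num_queries` query messages of average size."""
    return num_queries * (QUERY_HEADER_BYTES + AVERAGE_URL_BYTES)


def update_bytes(kind: str, num_changes: int) -> int:
    """Bytes for one update message carrying `num_changes` changes."""
    header, per_change = UPDATE_COSTS[kind]
    return header + per_change * num_changes


if __name__ == "__main__":
    print(query_bytes(1))
    for kind in UPDATE_COSTS:
        print(kind, update_bytes(kind, num_changes=100))
```

One update message is larger than one query, but it replaces the many queries that would otherwise be multicast on every miss.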
Memory requirements in terms of % of proxy cache: NLANR, 4 proxies
Memory requirements in terms of % of proxy cache: DEC, 16 proxies
Summary Web caching is an active research area. Directory server: this approach uses a central server to keep track of the cache directories of all the proxies, and proxies query the server for cache hits in other proxies. This approach falls short because the centralized server has to serve all requests, so its network overhead is high. To overcome this, we have the summary-cache enhanced ICP web cache sharing protocol. Our inspection of the Questnet traces showed that child-to-parent ICP queries can be a significant portion of the messages that the parent proxy has to process. In this case, applying Summary Cache will significantly reduce the number of queries and the overhead.
Future work Investigate the impact of the protocol on parent-child proxy cooperation and the optimal hierarchy configuration for a given workload. Investigate the application of Summary Cache in various web cache consistency protocols. Design new methods for implementing Summary Cache in a proxy to speed up lookups.
Conclusion We proposed the summary-cache enhanced ICP, a scalable World Wide Web cache sharing protocol, and showed that it performs better than the other techniques compared. Our study addresses two key issues: the effect of delayed summary updates, and the representation of the summary. For the first, updates can be delayed until 1% to 10% of the cached documents are new (based on trace-driven simulation); this causes errors, but they are tolerable. For the second, we introduced the Bloom filter technique for representing the summary. We achieve over 50% reduction in bandwidth and reduce the number of inter-proxy communication messages by a factor of 25 to 60.