
On Caching Search Engine Query Results
Evangelos Markatos
http://archvlsi.ics.forth.gr/OS/os.html
Computer Architecture and VLSI Systems Division
Institute of Computer Science
Foundation for Research and Technology Hellas
Heraklion, Crete, Greece

CARV ICS, FORTH

Outline
- Introduction - The Problem:
    - Web caching has focused on static data: an ever-decreasing percentage of URL requests
- Caching Dynamic Data
    - Search Engine Query Results
- There exists significant locality of reference
    - i.e. different people ask the same queries
- Medium-sized caches can exploit this locality
- Conclusions

Caching static data is not enough anymore
- Web caching has focused on static documents (files)
    - html pages, images, videos
- BUT: 40% of http requests are to dynamic data [Wolman 99]
    - up from 7% in 1997
    - it will probably increase in the future

Caching Search Engine Query Results
- Queries represent:
    - 14% of all URL requests (1 out of 7)
    - 30-50% of non-image URL requests (1 out of 3)
- Caching query results may
    - increase overall hit rate
    - reduce network traffic
    - reduce search engine overload
    - reduce client latency

Caching Query Results
- Where?
    - At the client side
        - little reuse, small hit rates
    - At the proxy
        - medium reuse
    - At the (Web/database) server, using inverse proxies (accelerators)
        - maximum reuse, highest hit rates
        - controlled environment
        - close interaction with database

Caching at the Web Server
- Avoids re-evaluation of the query
    - reduces computation overhead
        - forking processes to process queries
        - processing of database buffers
    - reduces I/O (DB index and data) requests
- Main memory caching
    - avoids disk requests

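Caching at the Web Server
[Figure: a query request arriving at the search engine is first looked up in the query cache. On a hit ("yes") the query reply is returned from the cache; on a miss ("no") the request is passed to the database server, which produces the query reply.]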
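The hit/miss flow in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `QueryCacheFrontEnd`, the `(keywords, page)` cache key, and the stand-in `fake_database_server` function are hypothetical and do not come from the slides.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a query is identified by its keywords plus the results page number.
QueryKey = Tuple[str, int]


class QueryCacheFrontEnd:
    """Looks in an in-memory cache first and asks the backend only on a miss."""

    def __init__(self, backend: Callable[[str, int], str]) -> None:
        self.backend = backend                  # stand-in for the database server
        self.cache: Dict[QueryKey, str] = {}    # unbounded here; no replacement policy
        self.hits = 0
        self.misses = 0

    def handle(self, keywords: str, page: int = 1) -> str:
        key = (keywords, page)
        if key in self.cache:                   # "Hit? yes": reply from the cache
            self.hits += 1
            return self.cache[key]
        self.misses += 1                        # "Hit? no": evaluate the query
        reply = self.backend(keywords, page)
        self.cache[key] = reply
        return reply


backend_calls: List[QueryKey] = []


def fake_database_server(keywords: str, page: int) -> str:
    """Hypothetical backend that records how often it is actually invoked."""
    backend_calls.append((keywords, page))
    return f"results for {keywords!r}, page {page}"


front_end = QueryCacheFrontEnd(fake_database_server)
front_end.handle("dogs", 1)   # miss: forwarded to the backend
front_end.handle("dogs", 1)   # hit: served from the cache
front_end.handle("dogs", 2)   # miss: a different page is a different query
```

In this toy run the backend is invoked twice for three requests, which is the saving the slide attributes to avoiding re-evaluation of the query.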

The Traces
- 1M queries from EXCITE
- 927,010 are keyword-based queries
- FORMAT: uid keywords
    user-id1  dogs (first page)
    user-id1  dogs (second page)
    user-id1  dogs & cats (first page)
    user-id2  california (first page)
- Definition: a query is a single page of results of a keyword-based search

Locality of Reference: Are there any popular Queries?
- Although people have a wide variety of interests, there exist some very popular query topics
- Most popular query: 2219 accesses
- 1000th most popular: 27 accesses

What % of requests goes to popular Queries?
- The 100 most popular queries account for 2.5% of the accesses
- The 1000 most popular queries account for 7% of the accesses
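A share of this kind can be computed by counting accesses per query and summing the counts of the k most popular ones. The sketch below is illustrative only: the function name `top_k_share` and the toy trace are assumptions, not data from the EXCITE trace.

```python
from collections import Counter
from typing import Hashable, Iterable


def top_k_share(accesses: Iterable[Hashable], k: int) -> float:
    """Fraction of all accesses that go to the k most frequently requested queries."""
    counts = Counter(accesses)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / total


# Toy trace (made up for illustration): 8 accesses to 5 distinct queries.
toy_trace = ["dogs", "cats", "dogs", "maps", "dogs", "cats", "news", "chat"]
share_top1 = top_k_share(toy_trace, 1)   # "dogs": 3 of 8 accesses
share_top2 = top_k_share(toy_trace, 2)   # "dogs" and "cats": 5 of 8 accesses
```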

Cache Placement
- All query requests are cached
- All queries have the same size
    - 1 page of results at a time (~4 Kbytes)
- All queries are served by one server
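Because every cached query is assumed to have the same size, a cache size in bytes translates directly into a number of cached result pages. A small worked example, assuming the approximate 4 Kbytes figure is exactly 4096 bytes (the helper name `cache_entries` is hypothetical):

```python
# Assumption: the slide's "~4 Kbytes" per result page is taken as exactly 4096 bytes.
RESULT_PAGE_BYTES = 4 * 1024


def cache_entries(cache_bytes: int, entry_bytes: int = RESULT_PAGE_BYTES) -> int:
    """Number of fixed-size result pages that fit in a cache of the given size."""
    return cache_bytes // entry_bytes


# A 256 Mbyte cache holds 65,536 result pages under this assumption.
entries_256mb = cache_entries(256 * 1024 * 1024)
```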

Cache Replacement
- Cache replacement using
    - LRU (least recently used)
        - keeps a queue sorted on the access time
        - new accesses move to the head of the queue
        - tail of the queue may be evicted
    - SLRU
        - much like LRU, but:
            - accessing non-cached URLs puts them in the middle (not head) of the sorted queue
        - frequently accessed queries are given better chances of staying in the cache

LRU
[Figure: an access sequence over time steps 1-4 and the resulting LRU queue, ordered from MRU (hot) to LRU (cold).]
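The LRU queue can be sketched with Python's `collections.OrderedDict`, keeping the most recently used entry at the head and evicting from the tail. This is a minimal illustration rather than the simulator used for the talk; the class name `LRUCache` and the toy access sequence are assumptions.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class LRUCache:
    """Fixed-capacity cache: head of the queue is MRU (hot), tail is LRU (cold)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.entries: "OrderedDict[Hashable, Optional[str]]" = OrderedDict()

    def access(self, key: Hashable, value: Optional[str] = None) -> bool:
        """Return True on a hit, False on a miss (the key is then inserted)."""
        if key in self.entries:
            self.entries.move_to_end(key, last=False)   # move to the head (MRU)
            return True
        self.entries[key] = value
        self.entries.move_to_end(key, last=False)       # new entries start at the head
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=True)             # evict the tail (LRU)
        return False


lru = LRUCache(capacity=2)
lru_hits = [lru.access(q) for q in ["dogs", "cats", "dogs", "maps", "cats"]]
# "maps" evicts "cats" (the tail at that point), so the final "cats" access misses.
```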

SLRU
[Figure: the same access sequence over time steps 1-4 and the resulting SLRU queue, ordered from MRU (hot) to LRU (cold).]
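One common way to realize the "insert in the middle" behaviour is to split the queue into two segments: a protected segment (the head half) and a probationary segment (the tail half). The sketch below follows that segmented formulation; the 50/50 split, the class name `SLRUCache`, and the demotion rule are assumptions and may differ from the variant evaluated in the talk.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class SLRUCache:
    """Segmented LRU: new keys enter the probationary segment (the 'middle'),
    keys that are hit again are promoted to the protected segment (the head)."""

    def __init__(self, capacity: int, protected_fraction: float = 0.5) -> None:
        if capacity < 2 or not 0.0 < protected_fraction < 1.0:
            raise ValueError("need capacity >= 2 and 0 < protected_fraction < 1")
        # Assumption: the split point is a tunable fraction of the capacity.
        self.protected_capacity = max(1, int(capacity * protected_fraction))
        self.probation_capacity = capacity - self.protected_capacity
        self.protected: "OrderedDict[Hashable, Optional[str]]" = OrderedDict()
        self.probation: "OrderedDict[Hashable, Optional[str]]" = OrderedDict()

    def access(self, key: Hashable, value: Optional[str] = None) -> bool:
        """Return True on a hit, False on a miss (the key is then inserted)."""
        if key in self.protected:
            self.protected.move_to_end(key, last=False)      # back to the head
            return True
        if key in self.probation:
            self._insert_protected(key, self.probation.pop(key))
            return True
        self._insert_probation(key, value)                   # miss: enter the middle
        return False

    def _insert_protected(self, key: Hashable, value: Optional[str]) -> None:
        self.protected[key] = value
        self.protected.move_to_end(key, last=False)
        if len(self.protected) > self.protected_capacity:
            # Demote the coldest protected entry instead of evicting it outright.
            old_key, old_value = self.protected.popitem(last=True)
            self._insert_probation(old_key, old_value)

    def _insert_probation(self, key: Hashable, value: Optional[str]) -> None:
        self.probation[key] = value
        self.probation.move_to_end(key, last=False)
        if len(self.probation) > self.probation_capacity:
            self.probation.popitem(last=True)                # evict the overall tail


slru = SLRUCache(capacity=4)
sequence = ["dogs", "dogs", "a", "b", "c", "d", "e", "dogs"]
slru_hits = [slru.access(q) for q in sequence]
```

In this toy sequence the repeated query "dogs" survives five one-time queries because they only churn the probationary segment; a plain LRU queue of the same capacity would have evicted it.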

Cache Effectiveness
- Hit rate increases sharply with cache size
- Max hit rate: 25%
- Frequency of reference is important for small caches

Using Warm Caches
- Use warm caches (1.6 Gbytes in size)
    - hit rate: calculated only for the last 50K requests
    - max hit rate: 29%
    - 1 out of 3.5 queries can be found in the cache
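Measuring on a warm cache means that the whole trace passes through the cache, but hits and misses are counted only for the final part of it. A minimal sketch, assuming LRU replacement and a trace given as a list of query identifiers (the function name `warm_hit_rate` and the toy trace are hypothetical):

```python
from collections import OrderedDict
from typing import Hashable, Sequence


def warm_hit_rate(trace: Sequence[Hashable], capacity: int, measured_tail: int) -> float:
    """LRU hit rate counted only over the last `measured_tail` requests;
    earlier requests warm the cache but are not counted."""
    cache: "OrderedDict[Hashable, None]" = OrderedDict()
    warmup_length = max(0, len(trace) - measured_tail)
    hits = 0
    measured = 0
    for position, query in enumerate(trace):
        hit = query in cache
        if hit:
            cache.move_to_end(query)          # most recently used kept at the end
        else:
            cache[query] = None
            if len(cache) > capacity:
                cache.popitem(last=False)     # evict the least recently used entry
        if position >= warmup_length:         # count only the measured tail
            measured += 1
            hits += int(hit)
    return hits / measured if measured else 0.0


toy_trace = ["dogs", "cats", "maps", "dogs", "cats", "maps"]
cold_rate = warm_hit_rate(toy_trace, capacity=3, measured_tail=6)  # counts cold misses
warm_rate = warm_hit_rate(toy_trace, capacity=3, measured_tail=3)  # counts warm part only
```

In the toy trace the cold measurement includes the three compulsory misses, while the warm measurement does not, which is why warm-cache hit rates come out higher.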

Static Caching
- Don’t cache the recent queries
- Cache the popular ones
    - no cache pollution
    - no cache replacement overhead
- BUT:
    - may miss recent queries, e.g. due to an earthquake
    - yesterday’s popular queries may not be popular anymore

Static Caching: Performance
- Static Caching:
    - calculate the popular queries of the first half of the trace
    - cache them throughout the second half
- Static Caching is good for small caches
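The two-step procedure above can be sketched as follows: the first half of the trace selects the cache contents, and the second half is replayed against that fixed set with no replacement. The function name `static_cache_hit_rate` and the toy trace are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def static_cache_hit_rate(trace: Sequence[Hashable], capacity: int) -> float:
    """Fill the cache with the `capacity` most popular queries of the first half
    of the trace, then measure the hit rate over the second half."""
    half = len(trace) // 2
    training, evaluation = trace[:half], trace[half:]
    popular = {query for query, _ in Counter(training).most_common(capacity)}
    if not evaluation:
        return 0.0
    hits = sum(1 for query in evaluation if query in popular)
    return hits / len(evaluation)


# Toy trace: "quake" becomes popular only in the second half, so a static cache
# built from the first half can never serve it.
toy_trace = ["dogs", "dogs", "cats", "maps", "dogs", "quake", "quake", "cats"]
rate_small = static_cache_hit_rate(toy_trace, capacity=1)   # only "dogs" is cached
rate_large = static_cache_hit_rate(toy_trace, capacity=3)   # "dogs", "cats", "maps"
```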

Related Work
- Alta-Vista traces [Silverstein 98]
    - 1 billion-long query trace
    - avg. number of accesses per query: 4, which implies a 75% hit rate
- Active Caching [Zhang98, Meira99]
    - cache at the proxy
    - execute a server-provided “cachelet” on hit
- Query Containment [Luo00]
    - evaluate subqueries from cached queries
    - “dogs and cats” is contained in “dogs”
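One way to read the step from 4 accesses per query to a 75% hit rate: with a cache large enough to hold every distinct query, only the first access to each query misses, so the hit rate is 1 - 1/4. The helper below is a sketch of that arithmetic; the unbounded-cache assumption and the function name `max_hit_rate` are not stated on the slide.

```python
def max_hit_rate(avg_accesses_per_query: float) -> float:
    """Upper bound on the hit rate, assuming an unbounded cache in which only
    the first access to each distinct query is a miss."""
    if avg_accesses_per_query < 1:
        raise ValueError("each distinct query is accessed at least once")
    return 1.0 - 1.0 / avg_accesses_per_query


# 4 accesses per query on average: 1 miss and 3 hits, i.e. 75%.
alta_vista_bound = max_hit_rate(4)
```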

Conclusions
- Queries have locality of reference
    - 30% in our trace (75% in AV trace)
- Medium-size caches are effective
    - 256 Mbytes result in 20% hit rate
    - even higher (30%) for warm caches
- Both frequency and recency count
- Static Caching is effective
    - for small cache sizes


Temporal Locality
- 1,639 queries resubmitted in less than 100 time units
- 14K queries resubmitted in less than 1K time units
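Counts of this kind can be obtained by remembering when each query was last seen and measuring the distance to its next submission. The sketch below assumes that one time unit corresponds to one position in the trace, which the slide does not state; the function name and the toy trace are also hypothetical.

```python
from typing import Dict, Hashable, Sequence


def count_resubmissions_within(trace: Sequence[Hashable], max_distance: int) -> int:
    """Number of requests that repeat a query submitted fewer than
    `max_distance` positions earlier in the trace."""
    last_seen: Dict[Hashable, int] = {}
    count = 0
    for position, query in enumerate(trace):
        previous = last_seen.get(query)
        if previous is not None and position - previous < max_distance:
            count += 1
        last_seen[query] = position      # always remember the latest submission
    return count


toy_trace = ["dogs", "cats", "dogs", "maps", "news", "cats"]
within_3 = count_resubmissions_within(toy_trace, 3)   # "dogs" repeats after 2 units
within_5 = count_resubmissions_within(toy_trace, 5)   # also "cats" after 4 units
```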

Freshness of cached data
- Dynamic caching may return stale data
- But:
    - our caching lasts for a few/several hours
    - search engine data are several weeks old
        - search engines do not archive the entire web every day
- Thus:
    - caching does not return noticeably more stale data

Popular queries
1. sex
2. sex (second page)
3. yahoo
4. playboy
5. chat
6. porn
7. princess diana
8. adult-related
9. sex (third page)
10. adult-related
11. adult-related
12. jokes
13. hotmail
14. chat rooms
15. music

DB Caching
- [Labrinidis and Roussopoulos 00]
- Web server caching
    - is 1-2 orders of magnitude better than db caching
    - gets better with load update rates
