Mining di Dati Web: Web Search Engine's Query Log Mining (A.A. 2006/2007)

What’s Recorded in a WSE Query Log?  Each component of a WSE records information about its operations.  We are mainly concerned with frontend logs.  They record each query submitted to the WSE.

Data Recorded  Among other information, WSEs record: the query topic, the rank of the first result wanted, and the number of results wanted. Some examples: q(fabrizio silvestri)f(1)n(10), q("information retrieval")f(5)n(15). Some other information: the language, whether results are folded (Y/N), etc. The recorded request is commonly referred to simply as "the query".
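As an illustration, here is a minimal sketch of a parser for the log-line format shown in the examples above. The format, field names, and function are assumptions based on these slides, not a real WSE log specification:

```python
import re

# Hypothetical pattern matching the format shown above: q(<topic>)f(<first>)n(<count>)
LOG_LINE = re.compile(r'q\((?P<topic>[^)]*)\)f\((?P<first>\d+)\)n\((?P<count>\d+)\)')

def parse_log_line(line):
    """Parse one query-log entry into (topic, first result wanted, number of results wanted)."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return m.group('topic'), int(m.group('first')), int(m.group('count'))

if __name__ == "__main__":
    print(parse_log_line('q(fabrizio silvestri)f(1)n(10)'))      # ('fabrizio silvestri', 1, 10)
    print(parse_log_line('q("information retrieval")f(5)n(15)'))  # ('"information retrieval"', 5, 15)
```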

What Can We Look For?  The most popular queries.  How queries are distributed.  How queries are related.  How subsequent queries are related.  How topics are distributed.  How topics change throughout the 24 hours.  Can we exploit this information?

Let’s Start Looking at Some Interesting Items  What are the most popular queries?

Most Popular Topics

Most Popular Terms

What Are Users Doing?  Not typing many words! The average query was 2.6 words long (in 2001), up from 2.4 words previously. Users are moving toward e-commerce: less sex (down from 17% to 9%), more business (up from 13% to 25%). Spink A., et al., "From E-Sex to E-Commerce: Web Search Changes", IEEE Computer, March 2002.

Why Are Queries so Short?  Users minimize effort. Users don't realize that more information is better. Users learn that too many words lead to fewer results (since terms are implicitly ANDed). Query boxes are small. Belkin, N.J., et al., "Rutgers' TREC 2001 Interactive Track Experience", in Voorhees & Harman, The Tenth Text Retrieval Conference.

Different Kinds of Queries

Distribution of Query Types

Hourly Analysis of a Query Log  Steven M. Beitzel, Eric C. Jensen, Abdur Chowdhury, David Grossman, Ophir Frieder, "Hourly Analysis of a Very Large Topically Categorized Web Query Log", Proceedings of the 2004 ACM Conference on Research and Development in Information Retrieval (ACM-SIGIR), Sheffield, UK, July 2004.

Frequency Time Distribution

Query Repetition

Query Categories

Categories over Time

Analysis of Three Query Logs  Tiziano Fagni, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri. "Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data". ACM Transactions on Information Systems, 24(1), January 2006.

Temporal Locality (measured value: 0.66)

Query Submission Distance

Page Requested

Subsequent Page Requests

Query Caching  (Diagram: the query "Francesca", result page 1, the results cache, the WSE, and the index.)

Caching: Who Cares?!?  Successful caching of query results can: lower the number/cost of query executions; shorten the engine's response time; increase the engine's throughput.

Caching: How-To?  Caching can exploit the locality of reference in the query streams search engines face. Query popularity follows a power law and varies widely, from the extremely popular to the very rare.
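As a rough illustration of such a power-law stream, the following sketch generates a synthetic query stream whose popularity follows a Zipf-like distribution. The parameters are assumed for illustration and are not taken from any real log:

```python
import random

def zipf_query_stream(num_queries, vocabulary_size, s=1.0, seed=0):
    """Return a list of query ids whose popularity follows a Zipf-like power law (rank^-s)."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, vocabulary_size + 1)]
    return rng.choices(range(vocabulary_size), weights=weights, k=num_queries)

# Example: the few most popular query ids dominate the stream.
stream = zipf_query_stream(100_000, 10_000)
```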

Caching: What to Measure?  Hit Ratio: let N be the number of requests to the WSE and H the number of hits, i.e. the number of queries that can be answered by the cache. The Hit Ratio is defined as HR = H/N and is usually expressed as a percentage. E.g. HR = 30% means that thirty percent of the queries are satisfied using the cache. Alternatively, we can define the Miss Ratio MR = 1 - HR = M/N, where M is the number of misses, i.e. the number of queries that cannot be answered by the cache.
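A minimal sketch of how HR and MR can be measured by replaying a query stream against a cache. The cache interface (get/put) and the backend callable are assumptions for illustration; any replacement policy can be plugged in:

```python
def measure_hit_ratio(query_stream, cache, backend):
    """Replay a query stream; `cache` exposes get(q)/put(q, results), `backend` answers misses."""
    hits = misses = 0
    for q in query_stream:
        if cache.get(q) is not None:
            hits += 1                  # H: query answered by the cache
        else:
            misses += 1                # M: query must be evaluated by the WSE
            cache.put(q, backend(q))
    n = hits + misses
    return hits / n, misses / n        # HR = H/N, MR = M/N = 1 - HR
```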

What About the Throughput?  The throughput is defined as the number of queries answered per second. Caching, in general, raises the throughput. The lower the hit ratio, the lower the throughput. The lower the cache response time, the higher the throughput.
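As a back-of-the-envelope illustration (the timings below are assumed, not measurements from the cited papers), the average per-query service time and the resulting single-server throughput can be estimated as follows:

```python
def expected_throughput(hit_ratio, t_hit_ms, t_miss_ms):
    """Average per-query service time and resulting single-server throughput (queries/second)."""
    avg_ms = hit_ratio * t_hit_ms + (1.0 - hit_ratio) * t_miss_ms
    return 1000.0 / avg_ms

# Illustrative (assumed) numbers: 1 ms on a hit, 100 ms on a miss.
print(expected_throughput(0.30, 1.0, 100.0))  # ~14 queries/s with HR = 30%
print(expected_throughput(0.00, 1.0, 100.0))  # 10 queries/s without caching
```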

Caching Complexity  The cache response time depends on the complexity of the replacement policy. The complexity usually depends on the cache size K. There exist policies that are: O(1), i.e. constant, not depending on the size of the cache; O(log K); O(K).

Is There Only Caching?  No!!!! There's also PREFETCHING! What's prefetching: anticipating users' queries by exploiting query stream properties. Uhuuuu! Sounds like a kind of "Usage Mining"! For instance, let's have a look at the probability of subsequent page requests. The prefetching factor p is the number of result pages prefetched.
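A minimal sketch of result-page prefetching on a miss, under the assumption that the backend can return p consecutive pages for roughly the cost of one. The cache interface and the backend callable are hypothetical, used only for illustration:

```python
def answer_with_prefetching(topic, page, cache, backend, p=2):
    """Serve (topic, page); on a miss, fetch and cache p consecutive result pages."""
    results = cache.get((topic, page))
    if results is not None:
        return results                                         # hit: no backend work
    pages = backend(topic, first_page=page, num_pages=p)       # ~ same cost as one page
    for offset, page_results in enumerate(pages):
        cache.put((topic, page + offset), page_results)        # later page requests become hits
    return pages[0]
```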

Prefetching: PROS and CONS  Prefetching enhances the hit ratio. Prefetching reduces the query load on the query server. The cost of computing p pages of results is approximately the same as computing only one page. However, prefetching is very likely to load pages that will never be requested in the future.

Adaptive Prefetching

Theoretical Bounds

Some Classical Caching Policies  LRU: Least Recently Used. Evict from the cache the query results that were accessed farthest in the past.  SLRU: two segments, a probationary one and a protected one. Lines in each segment are ordered from the most to the least recently accessed. Data from misses is added to the cache at the most recently accessed end of the probationary segment. Hits are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Lines in the protected segment have thus been accessed at least twice. The protected segment is finite, so the migration of a line from the probationary segment to the protected segment may force the migration of the LRU line in the protected segment to the most recently used (MRU) end of the probationary segment, giving this line another chance to be accessed before being replaced.
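A compact sketch of the two policies described above. This is a simplified rendering for illustration, not the exact implementations evaluated in the cited papers:

```python
from collections import OrderedDict

class LRUCache:
    """Least Recently Used: evict the entry accessed farthest in the past."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()              # ordered from oldest to newest access

    def get(self, query):
        if query not in self.entries:
            return None
        self.entries.move_to_end(query)           # mark as most recently used
        return self.entries[query]

    def put(self, query, results):
        self.entries[query] = results
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)      # evict the LRU entry

class SLRUCache:
    """Segmented LRU: a probationary segment for first accesses and a protected segment."""
    def __init__(self, capacity, protected_fraction=0.5):
        self.protected_capacity = int(capacity * protected_fraction)
        self.probationary_capacity = capacity - self.protected_capacity
        self.probationary = OrderedDict()
        self.protected = OrderedDict()

    def get(self, query):
        if query in self.protected:
            self.protected.move_to_end(query)
            return self.protected[query]
        if query in self.probationary:
            results = self.probationary.pop(query)   # second access: promote to protected
            self._insert_protected(query, results)
            return results
        return None

    def put(self, query, results):
        # Misses enter at the MRU end of the probationary segment.
        self.probationary[query] = results
        self.probationary.move_to_end(query)
        if len(self.probationary) > self.probationary_capacity:
            self.probationary.popitem(last=False)

    def _insert_protected(self, query, results):
        self.protected[query] = results
        self.protected.move_to_end(query)
        if len(self.protected) > self.protected_capacity:
            # Demote the protected LRU line to the MRU end of the probationary segment.
            demoted_query, demoted_results = self.protected.popitem(last=False)
            self.put(demoted_query, demoted_results)
```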

Problems  Classical replacement policies do not care about stream characteristics. They are not designed using usage mining investigation techniques. They offer good performance, though! Uhmmm… Are you sure?!? Stay tuned!

Caching May Be Attacked from Two Directions  Architecture of the caching system: two-level caching, three-level caching, SDC. Replacement policy: PDC, SDC. Both: SDC.

Two-level Caching  Cache of Query Results  Cache of Inverted Lists  Both

Throughput

Three-level Caching  Long, X. and Suel, T. Three-level caching for efficient query processing in large Web search engines. In Proceedings of the 14th International Conference on World Wide Web (Chiba, Japan, May 2005), WWW '05, ACM Press, New York, NY.

Probability Driven Caching  Lempel, R. and Moran, S. Predictive caching and prefetching of query results in search engines. In Proceedings of the 12th International Conference on World Wide Web (Budapest, Hungary, May 2003), WWW '03, ACM Press, New York, NY. Thanks to Ronny for his original slides!

Static-Dynamic Caching  Tiziano Fagni, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri. "Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data". ACM Transactions on Information Systems, 24(1), January 2006.  Idea: divide the cache into two sets, a Static Set and a Dynamic Set. Fill the Static Set with the results of the queries most frequently submitted in the past. The Static Set is read-only: good in multithreaded architectures.

Inside SDC  Static-Dynamic Caching. The cache is divided into two sets. Static Set: contains the results of the queries most frequently submitted so far. Dynamic Set: implemented using a classical caching replacement policy such as LRU, SLRU, or PDC. The Static Set size is f_static · N, where 0 < f_static < 1 is the fraction of the total cache entries (N) devoted to the Static Set. Adaptive prefetching is adopted.
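A simplified sketch of the SDC idea as described above. It is a rough rendering for illustration, not the implementation from the cited paper; the training log, backend callable, and plain-LRU dynamic set are assumptions:

```python
from collections import Counter, OrderedDict

class SDCCache:
    """Static-Dynamic Caching: a read-only static set plus an LRU-managed dynamic set."""
    def __init__(self, total_entries, f_static, training_log, backend):
        static_size = int(f_static * total_entries)
        self.dynamic_capacity = total_entries - static_size
        # Static Set: results of the queries most frequently submitted in the training log.
        top_queries = [q for q, _ in Counter(training_log).most_common(static_size)]
        self.static_set = {q: backend(q) for q in top_queries}   # read-only, never evicted
        # Dynamic Set: managed here with plain LRU; SLRU or PDC could be used instead.
        self.dynamic_set = OrderedDict()

    def get(self, query):
        if query in self.static_set:                 # no locking needed in multithreaded settings
            return self.static_set[query]
        if query in self.dynamic_set:
            self.dynamic_set.move_to_end(query)      # refresh recency on a hit
            return self.dynamic_set[query]
        return None

    def put(self, query, results):
        if query in self.static_set:
            return                                   # only the Dynamic Set is ever updated
        self.dynamic_set[query] = results
        self.dynamic_set.move_to_end(query)
        if len(self.dynamic_set) > self.dynamic_capacity:
            self.dynamic_set.popitem(last=False)     # evict the least recently used entry
```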

Benefits in Real-World Caches  (Diagram: several WSE threads each access the SDC cache; the read-only Static Set needs no locking, while the Dynamic Set is protected by a mutex.)

SDC Performance  Linux PC: 2 GHz Pentium Xeon, 1 GB RAM. Single process. f_static = 0.5. No prefetching.

SDC Hit-Ratio

Why Does the Static Set Help?

Concurrent Caching

Freshness of the Training Data  How frequently should we re-run the mining on the usage data? Does the performance of usage-mining-based caching degrade gracefully as time goes by? Do time-of-day patterns exist in the query stream?

Daily Patterns