Optimized Query Execution in Large Search Engines with Global Page Ordering
Xiaohui Long, Torsten Suel
CIS Department, Polytechnic University, Brooklyn, NY 11201

The Problem: "how to optimize query throughput in large search engines, when the ranking function is a combination of term-based ranking and a global ordering such as Pagerank"

Talk Outline:
- intro: query processing in search engines
- related work: query execution and pruning techniques
- algorithmic techniques
- experimental evaluation: single and multiple nodes
- concluding remarks

Query processing in search engines (single node case)

inverted index: a data structure for supporting text queries, like the index in a book

[diagram: disks with documents feed an indexing process that builds the inverted index]
  aalborg    → 3452, 11437, …
  …
  arm        → 4, 19, 29, 98, 143, …
  armada     → 145, 457, 789, …
  armadillo  → 678, 2134, 3970, …
  armani     → 90, 256, 372, 511, …
  …
  zz         → 602, 1189, 3209, …

Boolean queries such as (zebra AND armadillo) OR alligator are evaluated as unions/intersections of lists
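A minimal sketch of this structure, with toy posting lists and merge-based AND/OR (illustrative code, not the authors' implementation):

```python
# Inverted index over sorted docID lists, with Boolean AND/OR by merging.
from typing import Dict, List

index: Dict[str, List[int]] = {
    "zebra":     [4, 19, 143, 678],
    "armadillo": [678, 2134, 3970],
    "alligator": [19, 90, 511],
}

def intersect(a: List[int], b: List[int]) -> List[int]:
    """AND: walk two sorted docID lists, keeping common entries."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a: List[int], b: List[int]) -> List[int]:
    """OR: merge two sorted docID lists, keeping every entry once."""
    return sorted(set(a) | set(b))

# (zebra AND armadillo) OR alligator
print(union(intersect(index["zebra"], index["armadillo"]), index["alligator"]))
# -> [19, 90, 511, 678]
```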

Ranking in search engines:
- scoring function: assigns a score to each document with respect to a given query
- top-k queries: return the k documents with the highest scores
- example: the cosine measure for a query with terms t_0 to t_{m-1}
- can be implemented by computing the score for all documents that contain any of the query words (union of inverted lists)
- in the case of search engines: often intersection instead of union
- in large collections, lists are many MB long for average queries
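The formula image did not survive the transcript; as a reference point, one standard tf·idf form of the cosine measure (an assumption, not necessarily the exact variant on the slide) is:

  cosine(d, q) = (1 / (|d| · |q|)) · Σ_{i=0}^{m-1} w(t_i, d) · w(t_i, q),  with  w(t, d) = f_{t,d} · ln(N / N_t)

where f_{t,d} is the frequency of term t in document d, N is the collection size, N_t is the number of documents containing t, and |d|, |q| are the vector norms.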

Pagerank Algorithm (Brin/Page 1998)

Basic idea: "the significance of a page is determined by the significance of the pages linking to it"

More precisely:
- prune the graph until there are no nodes with out-degree 0
- initialize every node with value 1/N (N = number of nodes)
- iterate, where in each iteration we update each node as follows:

    r(p) = ε/N + (1 − ε) · Σ_{(q,p) ∈ E} r(q) / d(q)

  where d(q) is the out-degree of q, and ε = 0.15 corresponds to a random walk with a random jump with probability ε
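A minimal power-iteration sketch of this update rule on a toy graph (it assumes out-degree-0 nodes were already pruned, as the slide requires, so every node appears as a key):

```python
# PageRank power iteration: r(p) = eps/N + (1-eps) * sum over in-links
# q of r(q)/outdeg(q). Toy code, not a scalable implementation.
def pagerank(links, eps=0.15, iters=50):
    """links: dict mapping each node to the list of nodes it points to;
    every node must have out-degree >= 1 and appear as a key."""
    nodes = list(links)
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}      # initialize every node to 1/N
    for _ in range(iters):
        new = {p: eps / n for p in nodes}   # random-jump component
        for q, outs in links.items():
            share = (1.0 - eps) * rank[q] / len(outs)
            for p in outs:                  # q passes its rank to its targets
                new[p] += share
        rank = new
    return rank

print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```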

Combining Term-Based Methods and Pagerank
- recall the cosine measure
- Pagerank assigns a fixed score to each page (independent of the query)
- naïve way of integrating the Pagerank value: add it to the cosine score
- works for any global ordering of page scores (e.g., based on traffic)
- but some more details remain

Using Pagerank in ranking:
- how to combine/add the Pagerank score and the cosine score? (addition)
- use PR or log(PR)?
- normalize using the mean of the top-100 entries in the list (Richardson/Domingos)
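A hedged sketch of one plausible reading of this recipe (the paper's exact weighting may differ): rank by cosine + log(PR), with log(PR) scaled by its mean over the top-100 Pagerank values in the candidate list so the two signals have comparable magnitude.

```python
# Combined ranking sketch: cosine + normalized log(PR). Assumes the
# Pagerank values were rescaled so PR >= 1 (e.g., multiplied by the
# number of pages N), keeping the logarithms nonnegative.
import math

def rank(candidates, k=10):
    """candidates: list of (docid, cosine, pagerank) triples."""
    top100 = sorted((pr for _, _, pr in candidates), reverse=True)[:100]
    norm = sum(math.log(p) for p in top100) / len(top100) or 1.0
    return sorted(((cos + math.log(pr) / norm, doc)
                   for doc, cos, pr in candidates), reverse=True)[:k]
```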

Query Processing in Parallel Search Engines

[diagram: a LAN cluster of nodes, each holding pages plus a local index, with a query integrator that broadcasts each query and combines the results]

- local index organization: every node stores and indexes a subset of the pages
- every query is broadcast to all nodes by the query integrator (QI)
- every node supplies its top-10, and the QI computes the global top-10
- note: we don't really need the top-10 from every node, maybe only the top-2
- low-cost cluster architecture (usually with additional replication)
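The QI's merge step is a simple k-way combination of the per-node results; a sketch (function names are illustrative, not from the paper's code):

```python
# Query-integrator merge: each node returns its local top-10 as
# (score, docid) pairs; the QI merges them into the global top-10.
import heapq

def global_top10(per_node_results):
    """per_node_results: list of per-node lists of (score, docid) pairs."""
    merged = [hit for node in per_node_results for hit in node]
    return heapq.nlargest(10, merged)   # global top-10 by score
```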

Related Work on top-k Queries
- IR: optimized evaluation of cosine measures (since the 1980s)
- DB: top-k queries for multimedia databases (Fagin 1996); does not consider combinations of term-based and global scores
- Brin/Page 1998: fancy lists in Google

Related Work (IR)
- basic idea: "presort entries in each inverted list by their contribution to the cosine"
- also process inverted lists from the shortest to the longest list
- various schemes, either reliable or probabilistic
- most closely related:
  - Persin/Zobel/Sacks-Davis 1993/96
  - Anh/Moffat 1998, Anh/de Kretser/Moffat 2001
- typical assumptions: many keywords per query, OR semantics

Related Work (DB) (Fagin 1996 and others)
- motivation: searching multimedia objects by several criteria
- typical assumptions: few attributes, OR semantics, random access available
- FA (Fagin's algorithm), TA (Threshold Algorithm), and others
- formal bounds for k lists, assuming the lists are independent
- term-based ranking: presort each list by its contribution to the cosine
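A compact sketch of the Threshold Algorithm under these assumptions (monotone sum aggregate, random access available); toy code, not Fagin's original pseudocode:

```python
# TA: round-robin sorted access on each list; for every newly seen doc,
# random-access its other scores and aggregate; stop once the k-th best
# aggregate reaches the threshold formed by the last values seen under
# sorted access (no unseen doc can then enter the top-k).
def threshold_algorithm(lists, k):
    """lists: per-attribute lists of (doc, score), sorted by score desc."""
    lookup = [dict(l) for l in lists]            # random-access tables
    best = {}                                    # doc -> aggregate score
    for depth in range(max(len(l) for l in lists)):
        for l in lists:
            if depth >= len(l):
                continue
            doc, _ = l[depth]
            if doc not in best:                  # aggregate via random access
                best[doc] = sum(t.get(doc, 0.0) for t in lookup)
        threshold = sum(l[min(depth, len(l) - 1)][1] for l in lists)
        top = sorted(best.values(), reverse=True)[:k]
        if len(top) == k and top[-1] >= threshold:
            break                                # top-k is provably final
    return sorted(best.items(), key=lambda x: -x[1])[:k]
```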

Related Work (Google) (Brin/Page 1998)
- "fancy lists" optimization in Google
- create an extra, shorter inverted list for "fancy matches" (matches that occur in the URL, anchor text, title, bold face, etc.)
- note: fancy matches can be modeled by higher weights in the term-based vector space model
- no details given or numbers published

[diagram: inverted lists for "chair" and "table", each split into a short fancy list followed by the rest of the list with the other matches]

Results of our Paper
- pruning techniques for query execution in large search engines
- focus on a combination of a term-based and a global score (such as Pagerank)
- techniques combine previous approaches such as fancy lists and presorting of lists by term scores
- experimental evaluation on 120 million pages
- very significant savings with almost no impact on results
- it's good to have a global ordering!

Algorithms:
- exhaustive algorithm: "no pruning; traverse the entire lists"
- first-m: "a naïve algorithm with lists sorted by Pagerank; stop after m elements are found in the intersection"
- fancy first-m: "use fancy and non-fancy lists, each sorted by Pagerank, and stop after m elements are found"
- reliable pruning: "stop when the top-k results are provably found"
- fancy last-m: "stop when at most m elements are unresolved"
- single-node and parallel case, with optimizations

Experimental setup:
- 120 million pages on 16 machines (1.8 TB uncompressed)
- P-4 1.7 GHz machines with 2x80 GB Seagate Barracuda IDE disks
- compressed index based on Berkeley DB (using the mg compression macros)
- queries from an Excite query trace from December 1999; queries with 2 terms used in the following
- local index organization with a query integrator
- first, results for one node (7.5 million pages), then for 16
- note: we do not need the top-10 from every node, which motivates the top-1 and top-4 schemes and precision at 1 and 4
- ranking by cosine + log(PR), with normalization

A naïve approach: first-m
- sort inverted lists by Pagerank (docID = rank due to Pagerank)
- exhaustive: return the top-10
- first-m: return the 10 highest-scoring among the first 10/100/1000 pages in the intersection
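A sketch of this heuristic: with both lists in Pagerank order (docID = Pagerank rank), intersecting in docID order visits pages in decreasing Pagerank, so we can stop after m hits. `score` is a hypothetical stand-in for the cosine + log(PR) ranking function.

```python
# first-m over two Pagerank-sorted inverted lists.
def first_m(list_a, list_b, m, score, k=10):
    """list_a, list_b: docID lists sorted ascending (best Pagerank first);
    score: callable mapping a docID to its combined score."""
    hits, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b) and len(hits) < m:
        if list_a[i] == list_b[j]:
            hits.append(list_a[i]); i += 1; j += 1   # page in intersection
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return sorted(hits, key=score, reverse=True)[:k]  # best 10 of first m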

first-m (ctd.)

Loose/strict precision, relative to the "correct" cosine + log(PR) ranking:
- for first-10, about 45% of the returned top-10 results belong in the true top-10
- for first-1000, about 85% of the returned top-10 results belong in the true top-10
- for first-100, about 80% of queries return the correct top-1 result
- for first-1000, about 70% of queries return all correct top-10 results

[chart: average cost per query in terms of disk blocks]

How can we do better?

(1) Use better stopping criteria?
- reliable pruning: stop when we are sure
- probabilistic pruning: stop when we are almost sure
- these do not work well for a Pagerank-sorted index

(2) Reorganize the index structure?
- sort lists by term score (cosine) instead of Pagerank: does not do any better than sorting by Pagerank only
- sort lists by term score + log(PR) (or some combination of these): some problems with normalization and with dependence on the number of keywords
- generalized fancy lists (see the sketch below):
  - for each list, put the entries with the highest term value into a fancy list
  - sort both lists by Pagerank (docID)
  - note: anything that does well in 2 out of 3 scores is found soon
  - deterministic or probabilistic pruning, or first-m

[diagram: lists for "chair" and "table", each split into a fancy list followed by the rest of the list, where the remainder has cosine < x (resp. < y)]
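A minimal sketch of the construction step for one term, under the assumptions above (sizes and field names are illustrative):

```python
# Generalized fancy list construction: the entries with the highest
# term (cosine) contribution go into a short fancy list, the remainder
# into the main list; both stay in docID (= Pagerank) order.
def build_fancy_lists(postings, fancy_size):
    """postings: list of (docid, term_score); docID order = Pagerank order."""
    by_score = sorted(postings, key=lambda p: p[1], reverse=True)
    fancy = sorted(by_score[:fancy_size])    # top term scores, docID order
    rest = sorted(by_score[fancy_size:])     # everything else, docID order
    return fancy, rest
```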

Results for generalized fancy lists
- loose vs. strict precision for various sizes of the fancy lists [charts]
- MUCH better precision than without fancy lists!
- for first-1000, we always got the correct top-1 result in these runs

Costs of Fancy Lists
- cost is similar to first-m without fancy lists, plus the additional cost of reading the fancy lists
- cost increases slightly with the size of the fancy list
- slight inefficiency: fancy list items are not removed from the other list
- note: we do not consider savings due to caching

Reliable Pruning
- always gives the "correct" result
- top-4 can be computed reliably with ~20% of the original cost
- with 16 nodes, the top-4 from each node suffices with 99% probability to get the global top-10
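A hedged reconstruction of the stopping test behind "stop when we are sure": with lists in Pagerank order, an unresolved document's combined score is bounded by the Pagerank contribution at the current scan position plus the per-term cosine caps (the fancy-list thresholds x and y from the earlier slide). Once the current k-th best score meets that bound, the top-k is provably complete. This is my reading, not the paper's exact pseudocode.

```python
# Reliable-pruning stopping test (sketch).
def can_stop(topk_scores, k, next_log_pr, cosine_caps):
    """topk_scores: current k best combined scores; next_log_pr: log(PR)
    bound at the scan frontier; cosine_caps: per-term upper bounds on
    the remaining term scores (e.g., fancy-list thresholds)."""
    upper_bound = next_log_pr + sum(cosine_caps)  # best any unseen doc can do
    return len(topk_scores) >= k and min(topk_scores) >= upper_bound
```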

Summary for all Approaches (single node)
[chart: comparison of the first-m, last-m, and reliable pruning schemes for top-4 queries with loose precision]

Results for 16 Nodes
- first-30 returns the correct top-10 for almost 98% of all queries

Throughput and Latency for 16 Nodes
- top-10 queries on 16 machines with 120 million pages
- up to 10 queries/sec with reliable pruning
- up to 20 queries/sec with the first-30 scheme
- note: reliable pruning was not implemented in a purely incremental manner

Current and Future Work
- results for 3+ terms and an incremental query integrator
- need to do a precision/recall study
- need to engineer the ranking function and reevaluate
- how to include term distance within documents
- impact of caching at the lower level
- working on a publicly available engine prototype
- tons of loose ends and open questions