Winter Semester 2003/2004, Selected Topics in Web IR and Mining

5 Index Pruning
5.1 Index-based Query Processing
5.2 Pruning with Combined Authority/Similarity Scoring
5.3 Pruning with Similarity Scoring

5.1 Index-based Query Processing

Given: query q = t1 t2 ... tz with z (conjunctive) keywords,
       document collection D = {d1, ..., dn} of m-dimensional vectors dj,
       similarity scoring function score(q,d).
Find: the top-k results with regard to score(q,d).

Naive QP algorithm:
  candidate-docs := ∅;
  for i := 1 to z do { candidate-docs := candidate-docs ∪ index-lookup(ti) };
  for each dj ∈ candidate-docs do { compute score(q,dj) };
  sort candidate-docs by score(q,dj) in descending order;

Example inverted index:
  t12 → d66, d93, d95, ..., d101
  t13 → d98, d101, ...
  t15 → d53, d66, d99, ...
  t27 → d47, d66, d75, ...

Inverted index:
- contains for each term ti an index list with all docs that contain ti, sorted by doc id
- each ti has an idf value, each doc id in an index list has a tf value
- the index is organized as a search tree (usually a B+-tree or a suffix tree/trie)
- index lists are only scanned, no random access
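
To make the naive algorithm concrete, here is a minimal Python sketch over a toy inverted index (the index layout, term names, and tf/idf values are illustrative assumptions, not the course's actual data structures):

    from collections import defaultdict

    # Toy inverted index: term -> list of (doc_id, tf), sorted by doc id.
    index = {
        "t12": [(66, 2), (93, 1), (95, 3), (101, 1)],
        "t13": [(98, 1), (101, 2)],
    }
    idf = {"t12": 1.5, "t13": 2.0}   # assumed per-term idf values

    def naive_qp(query_terms, k):
        # 1) union of all index lookups gives the candidate documents
        candidates = set()
        for t in query_terms:
            candidates |= {d for d, _ in index.get(t, [])}
        # 2) compute the full score of every candidate (here: a simple tf*idf sum)
        scores = defaultdict(float)
        for t in query_terms:
            for d, tf in index.get(t, []):
                if d in candidates:
                    scores[d] += tf * idf[t]
        # 3) sort descending and return the top k
        return sorted(scores.items(), key=lambda x: -x[1])[:k]

    print(naive_qp(["t12", "t13"], k=2))   # -> [(101, 5.5), (95, 4.5)]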

Framework for Index Pruning

Assume scoring of the form score(q,dj) = r(dj) + Σ_{i=1..z} s_i(ti,dj), i.e., a global authority part r plus per-term similarity contributions s_i.

Goal: avoid scanning very long index lists to completion.

Key ideas:
1) keep index lists in a specific sort order
2) possibly keep redundant lists (possibly in an alternative sort order)
3) accumulate partial scores for docs by adding up s_i or r values
4) while scanning the z query-relevant index lists, maintain:
   pos(i)  – current position in the i-th index list
   high(i) – upper bound for all docs in the i-th list that follow pos(i)
   high(0) – upper bound for r(dj) among all docs not in the current top-k
   high    – upper bound for all docs with incomplete score
5) in each step recompute high(i) and high and compare them to the minimum score of the current top-k; stop scanning the i-th index list once high(i) < min score of the top-k
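
A rough Python sketch of the bookkeeping this framework implies, assuming lists sorted by descending per-term score s_i (toy data; a simplified illustration of the pos(i)/high(i) mechanics and the per-list stop test, not a complete, provably exact top-k algorithm):

    import heapq

    # Hypothetical index lists: list i -> [(doc_id, s_i), ...], sorted by s_i descending.
    lists = [
        [(7, 0.9), (3, 0.7), (5, 0.2)],
        [(3, 0.8), (5, 0.5), (7, 0.1)],
    ]

    def pruned_scan(lists, k):
        pos = [0] * len(lists)               # pos(i): current position in list i
        active = set(range(len(lists)))      # lists that are still being scanned
        partial = {}                         # accumulated partial scores per doc
        while active:
            for i in list(active):
                doc, s = lists[i][pos[i]]
                partial[doc] = partial.get(doc, 0.0) + s   # add up the s_i values
                pos[i] += 1
                if pos[i] == len(lists[i]):
                    active.discard(i)                      # list exhausted
                    continue
                high_i = lists[i][pos[i]][1]               # bound for the rest of list i
                topk = heapq.nlargest(k, partial.values())
                # stop list i once its bound is below the current k-th best score
                if len(topk) == k and high_i < topk[-1]:
                    active.discard(i)
        return sorted(partial.items(), key=lambda x: -x[1])[:k]

    print(pruned_scan(lists, k=2))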

5.2 Pruning with Combined Authority/Similarity Scoring (Long/Suel 2003)

Focus on score(q,dj) = r(dj) + s(q,dj) with the normalization r(·) ≤ a, s(·) ≤ b (and often a+b=1).
Keep index lists sorted in descending order of authority r(dj).

Authority-based pruning (used in Google?):
  high(0) := max{ r(pos(i)) | i=1..z };
  high := high(0) + b;
  high(i) := r(pos(i)) + b;
  stop scanning the i-th index list when high(i) < min score of the top-k
  → effective when the total score of the top-k results is dominated by r

First-m heuristics (used in Google?):
  scan all z index lists until m ≥ k docs have been found that appear in all lists;
  the stopping condition is easy to check because of the sorting by r
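
A hedged sketch of authority-based pruning over lists sorted by descending r(d), assuming for simplicity that the similarity part s(q,d) can be evaluated directly for each doc encountered (the lists, r values, b, and the sim() stub are all made-up illustrations):

    # Lists sorted by descending authority r(d): term -> [(doc_id, r), ...].
    # r(d) is a global score, so a doc carries the same r in every list.
    lists = {
        "t1": [(4, 0.9), (2, 0.6), (9, 0.3)],
        "t2": [(4, 0.9), (2, 0.6), (7, 0.2)],
    }
    b = 0.4   # assumed upper bound on the similarity part: s(q,d) <= b

    def sim(query_terms, doc):
        # placeholder for the similarity score s(q,d); a real engine would
        # compute this from tf/idf statistics of the query terms in doc
        return 0.3

    def authority_pruned(query_terms, k):
        scores = {}                                # doc -> total score r(d) + s(q,d)
        pos = {t: 0 for t in query_terms}
        active = set(query_terms)
        while active:
            for t in list(active):
                doc, r = lists[t][pos[t]]
                if doc not in scores:
                    scores[doc] = r + sim(query_terms, doc)
                pos[t] += 1
                if pos[t] == len(lists[t]):
                    active.discard(t)              # list exhausted
                    continue
                high_t = lists[t][pos[t]][1] + b   # high(i) := r(pos(i)) + b
                topk = sorted(scores.values(), reverse=True)[:k]
                if len(topk) == k and high_t < topk[-1]:
                    active.discard(t)              # authority-based stop for this list
        return sorted(scores.items(), key=lambda x: -x[1])[:k]

    print(authority_pruned(["t1", "t2"], k=2))     # -> [(4, 1.2), (2, 0.9)]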

Separating Documents with Large s_i Values

Idea (Google): in addition to the full index lists L(i) sorted by r, keep short "fancy lists" F(i) that contain the docs dj with the highest values of s_i(ti,dj), also sorted by r.

Fancy first-m heuristics:
  compute the total score for all docs in ∩_i F(i) (i=1..z) and keep the top-k results;
  Cand := ∪_i F(i) − ∩_i F(i);
  for each dj ∈ Cand do { compute the partial score of dj };
  scan the full index lists L(i) (i=1..z):
    if pos(i) ∈ Cand { add s_i(ti,pos(i)) to the partial score of pos(i) }
    else { add pos(i) to Cand and set its partial score to s_i(ti,pos(i)) };
  terminate the scan when m docs have a completely computed total score
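
A possible Python rendering of the fancy first-m heuristics (toy data; the structures F, S, R, the lock-step scan over equal-length lists, and the completeness test are simplifying assumptions made for illustration):

    # Full index lists L sorted by authority r, plus short fancy lists F that
    # hold, per term, the docs with the largest per-term scores s_i (toy data).
    L = {
        "t1": [(4, 0.9), (2, 0.6), (9, 0.3), (7, 0.2)],
        "t2": [(4, 0.9), (2, 0.6), (7, 0.2), (9, 0.1)],
    }
    S = {"t1": {4: 0.1, 2: 0.4, 9: 0.2, 7: 0.3},      # assumed s_i(ti, dj) values
         "t2": {4: 0.2, 2: 0.1, 9: 0.4, 7: 0.3}}
    F = {"t1": [2, 7], "t2": [9, 7]}                  # fancy lists per term
    R = {4: 0.9, 2: 0.6, 9: 0.3, 7: 0.2}              # global authority r(dj)

    def fancy_first_m(terms, k, m):
        inter = set.intersection(*(set(F[t]) for t in terms))
        union = set.union(*(set(F[t]) for t in terms))
        # docs in the intersection of all fancy lists get their exact score up front
        complete = {d: R[d] + sum(S[t][d] for t in terms) for d in inter}
        # Cand := union minus intersection starts with a partial score
        partial = {d: R[d] + sum(S[t][d] for t in terms if d in F[t])
                   for d in union - inter}
        seen = {d: {t for t in terms if d in F[t]} for d in union}
        for step in zip(*(L[t] for t in terms)):      # scan all full lists in lock step
            for t, (d, r) in zip(terms, step):
                if d not in seen:
                    seen[d] = set()
                    partial[d] = R[d]
                if t not in seen[d]:
                    partial[d] += S[t][d]             # add this term's contribution
                    seen[d].add(t)
                if len(seen[d]) == len(terms) and d not in complete:
                    complete[d] = partial.pop(d)      # score is now fully computed
            if len(complete) >= m:                    # first-m stopping rule
                break
        return sorted(complete.items(), key=lambda x: -x[1])[:k]

    print(fancy_first_m(["t1", "t2"], k=2, m=3))      # -> [(4, 1.2), (2, 1.1)]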

Authority-based Pruning with Fancy Lists

Guarantee that the top-k results are complete by extending the fancy first-m heuristics as follows: stop scanning the i-th index list L(i) not after m results, but only once we know that no incompletely scored doc can still qualify for the top-k results.

Maintain:
  r_high(i) := r(pos(i))
  s_high(i) := max{ s_i(ti,dj) | dj ∈ L(i) − F(i) }

Scan the index lists L(i) and accumulate partial scores for all docs dj.
Stop scanning L(i) iff r_high(i) + Σ_i s_high(i) < min{ score(d) | d ∈ current top-k results }
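
The stopping test itself is a one-liner once the bounds are tracked; a small hypothetical helper (names and data are assumptions):

    def can_stop_list(i, r_high, s_high, topk_scores, k):
        # r_high[i]: r(pos(i)), the authority at the current position of list i
        #            (lists are sorted by descending r, so it bounds the rest of the list)
        # s_high[j]: max s_j over docs in L(j) - F(j), i.e. outside the fancy list
        # An incompletely scored doc still hidden in list i can score at most
        # r_high[i] + sum(s_high); if even that is below the k-th best complete
        # score, list i can be abandoned without losing any top-k document.
        if len(topk_scores) < k:
            return False
        return r_high[i] + sum(s_high) < min(topk_scores)

    r_high = [0.4, 0.3]
    s_high = [0.20, 0.25]
    print(can_stop_list(0, r_high, s_high, topk_scores=[1.2, 1.0], k=2))   # True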

Probabilistic Pruning

Maintain statistics about the distribution of the s_i values.
For pos(i), estimate the probability p(i) that the rest of L(i) contains a doc d whose s_i score is high enough for d to qualify for the top-k results.
Stop scanning L(i) if p(i) drops below some threshold.

Simple "approximation" via the last-l heuristics: stop scanning when the number of docs in ∪_i F(i) − ∩_i F(i) with an incompletely computed score drops below l (e.g., l = 10 or 100).
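
The last-l test is cheap to evaluate; a minimal sketch (the data layout — fancy lists as sets and a set of completely scored docs — is an assumption for illustration):

    def last_l_stop(fancy, complete_docs, l):
        # fancy: term -> set of doc ids in F(i); complete_docs: docs whose total
        # score is already fully computed. Stop once fewer than l of the
        # "union minus intersection" fancy docs are still incompletely scored.
        union = set.union(*fancy.values())
        inter = set.intersection(*fancy.values())
        pending = (union - inter) - complete_docs
        return len(pending) < l

    F = {"t1": {2, 7}, "t2": {9, 7}}
    print(last_l_stop(F, complete_docs={2}, l=2))   # only doc 9 pending -> True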

Performance Experiments (1)

Setup:
- index lists for 120 million Web pages, distributed over 16 PCs (and stored in BerkeleyDB databases)
- query evaluation iterated over many sample queries with different degrees of concurrency (multiprogramming levels)

Evaluation measures:
- query throughput [queries/second]
- average query response time [seconds]
- error of the pruning heuristics:
  strict-k error: fraction of queries for which the top-k were not exact
  loose-k error: fraction of top-k results that do not belong to the true top-k

Performance Experiments (2)

Exact computation versus the first-m heuristics (without fancy lists).
Figures from: X. Long, T. Suel: Optimized Query Execution in Large Search Engines with Global Page Ordering, VLDB 2003.

Performance Experiments (3)

Fancy first-m heuristics.
Figures from: X. Long, T. Suel: Optimized Query Execution in Large Search Engines with Global Page Ordering, VLDB 2003.

Performance Experiments (4)

Authority-based pruning with fancy lists; overall comparison of the pruning techniques.
Figures from: X. Long, T. Suel: Optimized Query Execution in Large Search Engines with Global Page Ordering, VLDB 2003.

Performance Experiments (5)

Scalability on a 16-node cluster (with distributed index lists and centralized QP on one node).
Figures from: X. Long, T. Suel: Optimized Query Execution in Large Search Engines with Global Page Ordering, VLDB 2003.

5.3 Pruning with Similarity Scoring

Focus on scoring of the form score(q,dj) = Σ_{ti ∈ q} idf(ti) * tf(ti,dj) * idl(dj), where idl denotes the inverse document length.

Quit heuristics (with doc-id-ordered, tf-ordered, or tf*idl-ordered index lists):
- ignore index list L(i) if idf(ti) is below a threshold, or
- stop scanning L(i) if idf(ti)*tf(ti,dj)*idl(dj) drops below a threshold, or
- stop scanning L(i) when the number of accumulators becomes too high.
Implementation based on a hash array of accumulators for summing up the partial scores of candidate results.

Continue heuristics: upon reaching the threshold, continue scanning the index lists, but do not add any new documents to the accumulator array.
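
A simplified Python sketch of the accumulator-limiting quit/continue idea (toy lists and statistics; the thresholding is reduced to a single cap on the number of accumulators, whereas the heuristics above also allow idf- and score-based thresholds):

    # Toy index lists: term -> [(doc_id, tf), ...]; idf and idl are assumed statistics.
    lists = {
        "t1": [(1, 3), (4, 2), (6, 1)],
        "t2": [(4, 2), (6, 2), (9, 1)],
    }
    idf = {"t1": 2.0, "t2": 1.0}
    idl = {1: 0.5, 4: 0.8, 6: 1.0, 9: 0.7}

    def quit_or_continue(terms, k, max_acc, mode="continue"):
        acc = {}                                     # hash array of accumulators
        # process terms in descending idf order, i.e. most selective lists first
        for t in sorted(terms, key=lambda t: -idf[t]):
            for d, tf in lists[t]:
                contribution = idf[t] * tf * idl[d]
                if d in acc:
                    acc[d] += contribution           # refine an existing accumulator
                elif len(acc) < max_acc:
                    acc[d] = contribution            # still room for a new accumulator
                elif mode == "quit":
                    # quit: stop processing entirely once the accumulator limit is hit
                    return sorted(acc.items(), key=lambda x: -x[1])[:k]
                # continue: keep scanning to refine existing accumulators,
                # but never create new ones
        return sorted(acc.items(), key=lambda x: -x[1])[:k]

    print(quit_or_continue(["t1", "t2"], k=2, max_acc=2, mode="continue"))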

Greedy QP

Assume the index lists are sorted by descending tf(ti,dj) (or tf(ti,dj)*idl(dj)) values.

Open scan cursors on all z index lists L(i);
repeat
  find the cursor pos(g) among the current positions pos(i) (i=1..z) with the largest value of idf(ti)*tf(ti,dj) (or idf(ti)*tf(ti,dj)*idl(dj));
  update the accumulator of the corresponding doc;
  increment pos(g);
until stopping condition
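
A possible greedy merge over such impact-sorted lists, using a heap of cursor positions keyed by idf*tf so the globally largest contribution is consumed at each step (toy data; idl factors and the concrete stopping condition are left out):

    import heapq

    # Toy lists sorted by descending tf: term -> [(doc_id, tf), ...].
    lists = {
        "t1": [(4, 5), (1, 3), (6, 1)],
        "t2": [(6, 4), (4, 2), (9, 1)],
    }
    idf = {"t1": 2.0, "t2": 1.0}

    def greedy_qp(terms, steps):
        acc = {}                                           # accumulators per doc
        pos = {t: 0 for t in terms}
        # heap keyed by -idf*tf so the cursor with the largest value is popped first
        heap = [(-idf[t] * lists[t][0][1], t) for t in terms]
        heapq.heapify(heap)
        for _ in range(steps):                             # "until stopping condition"
            if not heap:
                break
            _, t = heapq.heappop(heap)
            d, tf = lists[t][pos[t]]
            acc[d] = acc.get(d, 0.0) + idf[t] * tf         # update this doc's accumulator
            pos[t] += 1
            if pos[t] < len(lists[t]):
                next_tf = lists[t][pos[t]][1]
                heapq.heappush(heap, (-idf[t] * next_tf, t))
        return sorted(acc.items(), key=lambda x: -x[1])

    print(greedy_qp(["t1", "t2"], steps=4))   # -> [(4, 10.0), (1, 6.0), (6, 6.0)]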

QP with Compressed Index Lists (Moffat/Zobel 1996)

Keep L(i) in ascending order of doc ids.
Compress L(i) by storing the gaps between successive doc ids rather than the ids themselves (possibly using a more sophisticated prefix-free code).
QP may start with those lists L(i) that are short and have high idf.
Candidate results then need to be looked up in the other lists L(j).
To avoid uncompressing an entire list L(j), L(j) is encoded in groups of entries with a skip pointer at the start of each group.
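
A minimal sketch of gap encoding with per-group skip pointers (the group layout and codes are assumptions for illustration, not Moffat/Zobel's exact self-indexing format): doc ids are stored as gaps within fixed-size groups, and a lookup decodes only the one group that could contain the target.

    def compress(doc_ids, group_size=4):
        # store each group's first doc id explicitly (the "skip pointer")
        # and the remaining ids as gaps to their predecessor
        groups = []
        for start in range(0, len(doc_ids), group_size):
            chunk = doc_ids[start:start + group_size]
            gaps = [chunk[0]] + [b - a for a, b in zip(chunk, chunk[1:])]
            groups.append({"first": chunk[0], "gaps": gaps})
        return groups

    def contains(groups, target):
        for i, g in enumerate(groups):
            if g["first"] > target:
                return False                      # all later groups start even higher
            nxt = groups[i + 1]["first"] if i + 1 < len(groups) else None
            if nxt is not None and nxt <= target:
                continue                          # skip this group without decoding it
            doc = 0
            for gap in g["gaps"]:                 # decode (prefix-sum) only this group
                doc += gap
                if doc == target:
                    return True
            return False
        return False

    postings = [3, 7, 12, 20, 25, 31, 44, 58, 90]
    index_list = compress(postings)
    print(contains(index_list, 31), contains(index_list, 32))   # True False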

Literature

Xiaohui Long, Torsten Suel: Optimized Query Execution in Large Search Engines with Global Page Ordering. VLDB Conference, 2003.
Alistair Moffat, Justin Zobel: Self-Indexing Inverted Files for Fast Text Retrieval. ACM TOIS, Vol. 14, No. 4, 1996.
Vo Ngoc Anh, Owen de Kretser, Alistair Moffat: Vector-Space Ranking with Effective Early Termination. SIGIR Conference, 2001.
Vo Ngoc Anh, Alistair Moffat: Impact Transformation: Effective and Efficient Web Retrieval. SIGIR Conference, 2002.
Norbert Fuhr, Norbert Gövert, Mohammad Abolhassani: Retrieval Quality vs. Effectiveness of Relevance-oriented Search in XML Documents. Technical Report, University of Duisburg, 2003.