K-tree/forest: Efficient Indexes for Boolean Queries
Rakesh M. Verma and Sanjiv Behl
University of Houston
www.cs.uh.edu/~rmverma

Boolean queries
- Alice and Bob -- retrieve documents containing both Bob and Alice
- Alice or Bob -- retrieve documents containing either Bob or Alice or both
- Alice and not Bob, …

Existing solutions
Query: Bob and Alice
Inverted file:
- Retrieve the inverted list (on disk) for Bob
- Retrieve the inverted list for Alice
- Merge the lists to compute the intersection, or
- For "and" only: retrieve the shorter list and scan the docs (disk I/Os "saved?" at the expense of CPU time)
Google times for query: Bob: 0.11 s, Alice: 0.1 s, Bob and Alice: 0.2 s
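The merge step for an "and" query can be sketched as follows. This is a minimal illustration, assuming each inverted list is an ascending list of document ids already read from disk; the function name `intersect_sorted` and the sample lists are made up for the example and do not come from the slides.

```python
def intersect_sorted(a, b):
    """Intersect two ascending lists of document ids with a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Document appears in both lists: part of the "and" result.
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical inverted file: keyword -> sorted document ids.
inverted = {"alice": [1, 3, 4, 7, 9], "bob": [2, 3, 7, 8]}
both = intersect_sorted(inverted["alice"], inverted["bob"])
print(both)  # [3, 7]
```

The merge touches every entry of both lists once, which is why both lists have to be fetched from disk before any result is known.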

Existing solutions
Query: Bob and Alice
Build a secondary index on the inverted lists:
- Retrieve the secondary index on Bob's list from disk (assuming the secondary index on Bob's list is the smaller one)
- Search for Alice in the secondary index
- Retrieve documents

K-tree
(Leaves point to lists on disk)
[Figure: K-tree over the keywords Alice and Bob]
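The figure is not available in the transcript, so the following is only one plausible reading of it, not the authors' implementation. It assumes that each level of the tree branches on the presence or absence of one keyword, so that K keywords give 2^K leaves, and that each leaf holds the list of documents with exactly that combination of keywords. The class `KTree`, its methods, and the sample documents are hypothetical. The tree is stored flat, as a dict keyed by root-to-leaf paths.

```python
from itertools import product


class KTree:
    """Assumed K-tree: one leaf per presence/absence pattern of K keywords."""

    def __init__(self, keywords, docs):
        self.keywords = list(keywords)
        # One leaf per path; a path is a tuple of K booleans.
        self.leaves = {
            bits: [] for bits in product((False, True), repeat=len(self.keywords))
        }
        for doc_id, words in sorted(docs.items()):
            bits = tuple(k in words for k in self.keywords)
            self.leaves[bits].append(doc_id)

    def match(self, pattern):
        """Docs whose leaf agrees with pattern (keyword -> True/False)."""
        result = []
        for bits, doc_ids in self.leaves.items():
            if all(pattern.get(k, b) == b for k, b in zip(self.keywords, bits)):
                result.extend(doc_ids)
        return sorted(result)

    def any_of(self, wanted):
        """'Or' query: union of every leaf where some wanted keyword is present."""
        result = []
        for bits, doc_ids in self.leaves.items():
            if any(b for k, b in zip(self.keywords, bits) if k in wanted):
                result.extend(doc_ids)
        return sorted(result)


# Hypothetical collection: document id -> set of words.
docs = {1: {"alice"}, 2: {"alice", "bob"}, 3: {"bob"}, 4: {"carol"}, 5: {"alice", "bob"}}
tree = KTree(["alice", "bob"], docs)

alice_and_bob = tree.match({"alice": True, "bob": True})       # single leaf
alice_not_bob = tree.match({"alice": True, "bob": False})      # single leaf
alice_or_bob = tree.any_of({"alice", "bob"})                   # three leaves
```

Under this reading, an and/and-not query that names every keyword of the tree lands on exactly one leaf, while an or query has to collect several leaves.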

Experiments
Data:
- 1 million word documents divided into pages of 100 words each
- Pages indexed by the keywords they contain
Methods:
- BST-based inverted file using merge or scan technique
- K-tree
Queries of type:
- Single keyword
- Two keywords, "and/and-not"

Results for single-word query

Method                                         I/O's
BST-based inverted file                        31.26
K-tree (parallel)
K-tree (sequential)                            37.05
K-tree (sequential with no fragmentation)      31.26

Note: index in memory, inverted lists on disk for all methods. Results are averages over all possible queries of the type listed before.

Results for 2-word "and" query

Method                                         I/O's
BST-based inverted file (merge)
BST-based inverted file (scan)
K-tree (parallel)
K-tree (sequential)
K-tree (sequential with no fragmentation)

Note: index in memory, inverted lists on disk for all methods. Results are averages over all possible queries of the type listed before.

K-forest
- Tradeoff: size of K-forest vs. post-processing
- In general, choose the subset size s such that C(K,s) * 2^s <= available memory.
- K can be reduced by standard techniques and by considering frequency.
[Figure: index on subsets of size 3, pointing to K-trees for 3 keywords]
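The bound C(K,s) * 2^s <= available memory can be checked numerically. The sketch below assumes memory is counted in leaves (one unit per leaf); K = 20 and the budget of 100,000 leaves are made-up values, and `largest_subset_size` is a hypothetical helper, not something named on the slides.

```python
from math import comb


def forest_leaves(K, s):
    """Leaves in a K-forest with one 2**s-leaf tree per s-subset of K keywords."""
    return comb(K, s) * 2 ** s


def largest_subset_size(K, budget_leaves):
    """Largest s in 1..K with C(K, s) * 2**s <= budget_leaves, or 0 if none fits."""
    best = 0
    for s in range(1, K + 1):
        # The product is not monotone in s, so every s is checked.
        if forest_leaves(K, s) <= budget_leaves:
            best = s
    return best


for s in range(1, 6):
    print(s, forest_leaves(20, s))
# 1 40
# 2 760
# 3 9120
# 4 77520
# 5 496128

chosen = largest_subset_size(20, 100_000)
print(chosen)  # 4
```

With these numbers, subsets of size 4 fit in the budget and subsets of size 5 do not; queries with more keywords than s would then need the post-processing mentioned in the tradeoff.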

K-tree highlights
Advantages:
- And/But queries: no post-processing
- Or queries: require some K-tree traversal
- Easy to implement
- Easy to parallelize, especially for shorter and/and-not queries and all or queries
Disadvantage:
- Size 2^K for K keywords, but this is overkill since user queries are typically short (over 90% of queries contain at most 5 keywords). It is very rare to have queries with 10 or more keywords.

Conclusions and Future Work
- We have presented efficient structures (K-tree/forest) for Boolean queries.
- One direction is to do more experiments using, for example, TREC collections.
- Another direction is to study how document characteristics can help in choosing the "right set of keywords" to include in these structures.