K-tree/forest: Efficient Indexes for Boolean Queries
Rakesh M. Verma and Sanjiv Behl
University of Houston
www.cs.uh.edu/~rmverma

Slide 2: Boolean queries
- Alice and Bob: retrieve documents containing both Alice and Bob
- Alice or Bob: retrieve documents containing either Alice or Bob, or both
- Alice and not Bob, …

Slide 3: Existing solutions
Query: Bob and Alice
Inverted file:
- Retrieve the inverted list (on disk) for Bob
- Retrieve the inverted list for Alice
- Merge the lists to compute the intersection, or
- For "and" queries only: retrieve the shorter list and scan the docs (disk I/Os "saved?" at the expense of CPU time)
Google times for the queries: Bob, 0.11 s; Alice, 0.1 s; Bob and Alice, 0.2 s
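The merge step listed on this slide can be sketched as a two-pointer intersection of sorted inverted lists. This is only an illustration, assuming each list is a sorted sequence of integer document IDs already read into memory; the function name `intersect_sorted` and the sample IDs are hypothetical and do not come from the talk.

```python
def intersect_sorted(a, b):
    """Intersect two sorted lists of document IDs with a linear merge."""
    i, j = 0, 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # The document appears in both lists, so it matches "a and b".
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical inverted lists for the two keywords.
bob = [2, 5, 8, 13, 21]
alice = [1, 2, 3, 5, 13, 34]
both = intersect_sorted(bob, alice)
```

The merge touches every entry of both lists, which is why both lists have to be fetched from disk before the intersection is known.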

Slide 4: Existing solutions
Query: Bob and Alice
Build a secondary index on the inverted lists:
- Retrieve the secondary index on Bob's list from disk (assuming the secondary index on Bob's list is the smaller one)
- Search for Alice in the secondary index
- Retrieve the documents

Slide 5: K-tree (leaves point to lists on disk)
[Figure: binary tree with one level for Alice and one for Bob; each branch is labeled 1 or 0.]
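The slide shows the K-tree only as a figure, so the following is one possible reading rather than the authors' implementation. It assumes each level of the tree tests one keyword (1 = present, 0 = absent) and each leaf holds the documents that match that presence/absence pattern exactly. The class name `KTree`, its methods, and the sample documents are hypothetical, and a dictionary keyed by bit patterns stands in for both the tree and the on-disk lists.

```python
from itertools import product


class KTree:
    """Toy K-tree: one leaf per presence/absence pattern of K keywords.

    Assumption: the leaf for a bit pattern holds the IDs of documents whose
    keyword membership matches that pattern exactly (1 = present, 0 = absent).
    """

    def __init__(self, keywords):
        self.keywords = list(keywords)
        # 2**K leaves, keyed by bit tuples such as (1, 0).
        self.leaves = {
            bits: [] for bits in product((0, 1), repeat=len(self.keywords))
        }

    def add(self, doc_id, words):
        """File a document under the leaf that matches its keyword pattern."""
        words = set(words)
        bits = tuple(int(k in words) for k in self.keywords)
        self.leaves[bits].append(doc_id)

    def query(self, predicate):
        """Collect IDs from every leaf whose bit pattern satisfies predicate."""
        result = []
        for bits, docs in self.leaves.items():
            if predicate(dict(zip(self.keywords, bits))):
                result.extend(docs)
        return sorted(result)


tree = KTree(["Alice", "Bob"])
tree.add(1, ["Alice", "Bob", "met"])
tree.add(2, ["Alice", "left"])
tree.add(3, ["Bob", "stayed"])
tree.add(4, ["nobody"])

# Only leaf (1, 1) matches.
alice_and_bob = tree.query(lambda p: p["Alice"] and p["Bob"])
# Only leaf (1, 0) matches.
alice_not_bob = tree.query(lambda p: p["Alice"] and not p["Bob"])
# Leaves (1, 1), (1, 0) and (0, 1) match.
alice_or_bob = tree.query(lambda p: p["Alice"] or p["Bob"])
```

Under this reading, an "and" or "and-not" query over all indexed keywords maps to a single leaf, while an "or" query or a single-keyword query has to gather several leaves.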

Slide 6: Experiments
Data:
- 1 million word documents divided into pages of 100 words each
- Pages indexed by the keywords they contain
Methods:
- BST-based inverted file using the merge or scan technique
- K-tree
Queries of type:
- Single keyword
- Two keywords, "and"/"and-not"

Slide 7: Results for single-word query

Method                                       I/Os
BST-based inverted file                      31.26
K-tree (parallel)                            25.36
K-tree (sequential)                          37.05
K-tree (sequential with no fragmentation)    31.26

Note: the index is in memory and the inverted lists are on disk for all methods. Results are averages over all possible queries of the type listed before.

Slide 8: Results for two-word "and" query

Method                                       I/Os
BST-based inverted file (merge)              62.52
BST-based inverted file (scan)               10.13
K-tree (parallel)                            0.57
K-tree (sequential)                          0.77
K-tree (sequential with no fragmentation)    0.61

Note: the index is in memory and the inverted lists are on disk for all methods. Results are averages over all possible queries of the type listed before.

Slide 9: K-forest
- Tradeoff: size of the K-forest vs. post-processing
- In general, choose the subset size s so that C(K,s) * 2^s <= available memory.
- K can be reduced by standard techniques and by considering frequency.
[Figure: index on subsets of size 3; K-trees for 3 keywords]
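The sizing rule C(K,s) * 2^s <= available memory can be checked numerically. The sketch below assumes memory is counted in leaf entries (one unit per leaf), which the slide does not specify; the function names `forest_size` and `largest_subset_size` and the sample numbers are hypothetical.

```python
from math import comb


def forest_size(num_keywords, subset_size):
    """Leaf count of a K-forest: C(K, s) trees with 2**s leaves each."""
    return comb(num_keywords, subset_size) * 2 ** subset_size


def largest_subset_size(num_keywords, memory_budget):
    """Largest s with C(K, s) * 2**s <= memory_budget, or 0 if none fits.

    Every s is checked because the product is not monotonic in s:
    it rises, peaks near s = 2K/3, and falls back to 2**K at s = K.
    """
    best = 0
    for s in range(1, num_keywords + 1):
        if forest_size(num_keywords, s) <= memory_budget:
            best = s
    return best


# Hypothetical setting: 20 keywords and room for one million leaf entries.
s = largest_subset_size(20, 1_000_000)
```

With these assumed numbers, s = 5 needs 15504 * 32 = 496128 entries and fits, while s = 6 needs 38760 * 64 = 2480640 and does not.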

Slide 10: K-tree highlights
Advantages:
- And/But queries: no post-processing
- Or queries: require some K-tree traversal
- Easy to implement
- Easy to parallelize, especially for shorter and/and-not queries and all or queries
Disadvantage:
- Size 2^K for K keywords, but this is overkill since user queries are typically short (over 90% of queries contain at most 5 keywords). Queries with 10 or more keywords are very rare.

Slide 11: Conclusions and future work
- We have presented efficient structures (K-tree/forest) for Boolean queries.
- One direction is to run more experiments, for example on TREC collections.
- Another direction is to study how document characteristics can help in choosing the "right set of keywords" to include in these structures.
