IO-Top-k: Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger.

Presentation transcript:

IO-Top-k: Index-access Optimized Top-k Query Processing Debapriyo Majumdar Max-Planck-Institut für Informatik Saarbrücken, Germany Joint work with Holger Bast, Ralf Schenkel, Martin Theobald, Gerhard Weikum VLDB 2006, Seoul, Korea

Setup
Pre-computed index-lists over multiple attributes:

price             resolution        zoom
camera 1  €300    camera 5  8MP     camera 3  7x
camera 3  €330    camera 1  7MP     camera 1  5x
camera 5  €490    camera 4  6MP     camera 2  4x
camera 4  €580    camera 2  4MP     camera 5  4x
…                 …                 …

Every item has a single numeric score for each attribute.
Lists are accessible by both sorted and random accesses.
Scores are combined by some monotonic aggregation function, e.g. a weighted sum of the form $w_1 \cdot res + w_2 \cdot zoom - w_3 \cdot price$.
Goal: find the top-k items with the highest total scores.

Top-k algorithms: example
Fagin’s NRA Algorithm, run on three index lists (List 1, List 2, List 3), each sorted by score.

Top-k algorithms: example
Fagin’s NRA Algorithm, round 1: read one item from every list (List 1, List 2, List 3; lists sorted by score).
Candidates, written as item [current score, best-score]:
item 83 [0.9, 2.1]
item 17 [0.6, 2.1]
item 25 [0.6, 2.1]
min top-2 score: 0.6
maximum score for unseen items: 2.1
min-top-2 < best-score of candidates, so scanning continues.

Top-k algorithms: example
Fagin’s NRA Algorithm, round 2: read one item from every list.
Candidates:
item 17 [1.3, 1.8]
item 83 [0.9, 2.0]
item 25 [0.6, 1.9]
item 38 [0.6, 1.8]
item 78 [0.5, 1.8]
min top-2 score: 0.9
maximum score for unseen items: 1.8
min-top-2 < best-score of candidates, so scanning continues.

Top-k algorithms: example
Fagin’s NRA Algorithm, round 3: read one item from every list.
Candidates:
item 83 [1.3, 1.9]
item 17 [1.3, 1.9]
item 25 [0.6, 1.5]
item 78 [0.5, 1.4]
min top-2 score: 1.3
maximum score for unseen items: 1.3
No more new items can get into the top-2, but min-top-2 < best-score of candidates: extra candidates are left in the queue.

Top-k algorithms: example
Fagin’s NRA Algorithm, round 4: read one item from every list.
Candidates:
item …
item 83 [1.3, 1.9]
item 25 [0.6, 1.4]
min top-2 score: 1.3
maximum score for unseen items: 1.1
No more new items can get into the top-2, but min-top-2 < best-score of candidates: extra candidates are left in the queue.

Top-k algorithms: example
Fagin’s NRA Algorithm, round 5: read one item from every list.
Candidates:
item …
item …
min top-2 score: 1.6
maximum score for unseen items: 0.8
No extra candidate in queue. Done!
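The rounds above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes summation as the aggregation function, non-negative scores, and a score of 0 for items missing from a list. The names `nra_top_k`, `worst` and `best` are made up for this sketch.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, float]  # (item id, score); each list is sorted by descending score


def nra_top_k(lists: List[List[Entry]], k: int) -> List[str]:
    """NRA sketch: sorted accesses only, round robin, sum aggregation (assumed)."""
    m = len(lists)
    seen: Dict[str, Dict[int, float]] = {}               # item -> {list index: score}
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # last score read per list
    max_depth = max((len(lst) for lst in lists), default=0)

    def worst(item: str) -> float:
        # Current score: sum of the scores seen so far.
        return sum(seen[item].values())

    def best(item: str) -> float:
        # Best-score: current score plus the last-read score of every unseen list.
        return worst(item) + sum(high[i] for i in range(m) if i not in seen[item])

    for depth in range(max_depth):
        # One round: read one entry from every list.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                item, score = lst[depth]
                seen.setdefault(item, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # exhausted list: unseen items contribute nothing
        ranked = sorted(seen, key=lambda d: (-worst(d), d))
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        min_top_k = worst(top[-1])
        # Stop when neither an unseen item nor a queued candidate can beat the top-k.
        if sum(high) <= min_top_k and all(best(d) <= min_top_k for d in rest):
            return top
    return sorted(seen, key=lambda d: (-worst(d), d))[:k]


example_lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)],
    [("b", 0.7), ("a", 0.6), ("d", 0.2), ("c", 0.1)],
    [("a", 0.8), ("c", 0.5), ("b", 0.4), ("d", 0.3)],
]
print(nra_top_k(example_lists, 2))
```

The per-round re-sorting of all candidates is the kind of bookkeeping overhead that the talk later identifies as a weakness of NRA.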

Top-k algorithms
NRA (No Random Access) performs only sorted accesses (SA)
Random access (RA)
–looks up the actual (final) score of an item
–costlier than SA (100 to 100,000 times), $c_R/c_S$ := (cost of RA)/(cost of SA)
–often very useful
CA (Combined Algorithm) (Fagin et al., 2001)
–one RA after every $c_R/c_S$ SAs
–total cost of SA ~ total cost of RA
Measure of effectiveness (access cost): #SA + $c_R/c_S$ × #RA
Full-merge: compute scores for all items, followed by a partial sort
–simple and efficient
–important baseline for any top-k algorithm
Problems with NRA, CA
–high bookkeeping overhead: cannot beat full-merge in runtime
–for “high” values of k, even the gain in access cost is not significant
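As a rough illustration of the cost measure and of the full-merge baseline, the following sketch assumes summation as the aggregation function. The function names and the toy data are hypothetical.

```python
import heapq
from collections import defaultdict


def access_cost(num_sa, num_ra, cost_ratio):
    """Access cost in units of one sorted access: #SA + (c_R/c_S) * #RA."""
    return num_sa + cost_ratio * num_ra


def full_merge_top_k(lists, k):
    """Full-merge baseline: aggregate every entry, then partially sort."""
    totals = defaultdict(float)
    for lst in lists:
        for item, score in lst:
            totals[item] += score          # sum aggregation (an assumption)
    # heapq.nlargest performs the partial sort for the k best totals.
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])


toy_lists = [
    [("a", 0.9), ("b", 0.8), ("c", 0.3)],
    [("b", 0.7), ("a", 0.6), ("c", 0.1)],
]
top_items = [item for item, _ in full_merge_top_k(toy_lists, 2)]
print(top_items, access_cost(num_sa=6, num_ra=0, cost_ratio=1000))
```

Full-merge reads every entry, so its access cost is simply the total list length, but it needs no candidate bookkeeping at all.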

Top-k algorithms
Greedy heuristics for sorted access scheduling, based on crude estimates of scores (Guntzer, Balke, Kiessling, ITCC 2001)
RankSQL: ordering of binary rank joins at query planning time (Ilyas et al., SIGMOD ’04 and Li et al., SIGMOD ’05)
Scheduling RAs on “expensive predicates”, where SAs may not even be possible on all attributes (our setting is different)
–MPro (Chang and Hwang, SIGMOD 2002)
–Upper, Pick (Bruno, Gravano and Marian, ICDE ’02, ACM TODS ’04)
Probabilistic pruning of candidates, RA scheduling (Theobald, Schenkel and Weikum, VLDB ’04, VLDB ’05)
Main related previous works: NRA, CA

Our algorithm: IO-Top-k
Round 1: same as NRA (List 1, List 2, List 3; lists sorted by score). Later rounds are not necessarily round robin.
Candidates:
item 83 [0.9, 2.1]
item 17 [0.6, 2.1]
item 25 [0.6, 2.1]
min top-2 score: 0.6
maximum score for unseen items: 2.1
min-top-2 < best-score of candidates

Our algorithm: IO-Top-k
Round 2 (not necessarily round robin)
Candidates:
item 17 [1.3, 1.8]
item 83 [0.9, 2.0]
item 25 [0.6, 1.9]
item 78 [0.5, 1.4]
min top-2 score: 0.9
maximum score for unseen items: 1.4
min-top-2 < best-score of candidates

Our algorithm: IO-Top-k
Round 3 (not necessarily round robin)
Candidates:
item …
item 83 [1.3, 1.9]
item 25 [0.6, 1.4]
min top-2 score: 1.3
maximum score for unseen items: 1.1
min-top-2 < best-score of candidates: there is still a potential candidate for the top-2.

Our algorithm: IO-Top-k
Round 4: random access for item 83
Candidates:
item …
item …
min top-2 score: 1.6
maximum score for unseen items: 1.1
No extra candidate in queue. Done!
–fewer sorted accesses (not necessarily round robin)
–carefully scheduled random access

Outline
Our contributions
–Inverted block-index data structure
–Sorted access scheduling
–Random access scheduling
–Lower bound
Experiments
Conclusion

Inverted block-index Lists are first sorted by score

Inverted block-index
Lists are first sorted by score, then split into blocks; each block is sorted by item-id.
Top-k algorithm with block-index: full-merge the first blocks, prune, full-merge the next blocks, prune, and so on…
Blocks are sorted by item ids, so they are efficiently merged by full-merge!
Choose the block size by balancing disk seek time and data transfer rate.
Low overhead: prune once every round.
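A small in-memory sketch of this layout, assuming entries are (item id, score) pairs and summation as the aggregation. The names `build_block_index` and `merge_blocks` are illustrative, and disk layout is ignored entirely.

```python
import heapq
from itertools import groupby


def build_block_index(entries, block_size):
    """Sort a list by descending score, split it into blocks, sort each block by item id."""
    by_score = sorted(entries, key=lambda e: -e[1])
    blocks = []
    for start in range(0, len(by_score), block_size):
        block = by_score[start:start + block_size]
        blocks.append(sorted(block, key=lambda e: e[0]))
    return blocks


def merge_blocks(blocks):
    """Merge id-sorted blocks (one per list) in a single pass, summing scores per item."""
    merged = heapq.merge(*blocks, key=lambda e: e[0])
    return {
        item: sum(score for _, score in group)
        for item, group in groupby(merged, key=lambda e: e[0])
    }


index = build_block_index([(17, 0.5), (83, 1.0), (25, 0.125), (78, 0.25)], block_size=2)
print(index)
print(merge_blocks([[(17, 0.5), (83, 1.0)], [(17, 0.25), (25, 0.5)]]))
```

Because every block is id-sorted, combining one block from each list is a plain sorted merge, which is what keeps the per-round overhead low.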

Sorted access scheduling: general paradigm
Inverted block-index over List 1, List 2 and List 3.

Sorted access scheduling: general paradigm
Inverted block-index with a benefit $b_{ij}$ for block j of list i:

List 1    List 2    List 3
b_11      b_21      b_31
b_12      b_22      b_32
b_13      b_23      b_33
b_14      b_24      b_34

We assign benefits to every block of each list.
Optimization problem
–Goal: choose a total of 3 blocks from any of the lists such that the total benefit is maximized
–We can show: this problem is NP-hard; the well-known Knapsack problem reduces to it
–But the number of blocks to choose and the number of lists to choose from are small, so we can solve it by enumerating all possibilities
–We choose the schedule with maximum benefit, and continue to the next round


Sorted access scheduling: general paradigm
The chosen schedule scans to different depths in the different lists.
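The enumeration step can be sketched as follows. The sketch assumes that blocks of a list must be read in order, so a schedule is just the number of further blocks read from each list, and that the benefit of a schedule is the sum of the benefits of the blocks it reads. Both are simplifications for illustration; `best_schedule` and the benefit values are hypothetical.

```python
from itertools import product


def best_schedule(benefits, budget):
    """Enumerate all ways to read `budget` blocks in total and keep the best one.

    benefits[i][j] is the benefit of the j-th not-yet-read block of list i.
    Returns (blocks to read per list, total benefit).
    """
    best, best_value = None, float("-inf")
    ranges = [range(min(budget, len(b)) + 1) for b in benefits]
    for schedule in product(*ranges):
        if sum(schedule) != budget:
            continue
        # Reading x blocks from a list collects the benefits of its next x blocks.
        value = sum(sum(b[:x]) for b, x in zip(benefits, schedule))
        if value > best_value:
            best, best_value = schedule, value
    return best, best_value


toy_benefits = [[5, 1, 1], [2, 4, 1], [3, 2, 3]]
print(best_schedule(toy_benefits, budget=3))
```

In this toy input the best schedule reads one block from list 1 and two from list 2, skipping list 3, even though list 3 has the second-best first block. That is why lists end up scanned to different depths.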

Sorted access scheduling: Knapsack for Score Reduction (KSR)
Inverted block-index over List 1, List 2 and List 3.
Pre-compute the score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block – min-score of the block)

Sorted access scheduling: Knapsack for Score Reduction (KSR)
Pre-compute the score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block – min-score of the block)
Consider a candidate item d with interval $[s_d, b_d]$, with the lists scanned till some depth.
Candidate item d is already seen in list 3. If we scan list 3 further, the score $s_d$ and best-score $b_d$ of d do not change: no benefit.
In list 2, d is not yet seen. If we scan one block (block 22) from list 2
–with high probability d will not be found in that block: the best-score $b_d$ of d decreases by $\Delta_{22}$
Benefit of block B in list i: $\sum_{d} \Delta_B \,(1 - \Pr[d \text{ found in } B]) \approx \sum_{d} \Delta_B$, where the sum is taken over all candidates d not yet seen in list i.
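A minimal sketch of the KSR benefit computation, under the simplifying assumption that every unseen candidate has the same probability `p_found` of appearing in the block (0 by default, which gives the approximation on the slide). The function names and the candidate bookkeeping format are made up for illustration.

```python
def score_reduction(block_scores):
    """Delta of a block: max-score of the block minus min-score of the block."""
    return max(block_scores) - min(block_scores)


def ksr_benefit(delta, seen_lists_by_candidate, list_id, p_found=0.0):
    """Benefit of scanning one block of list `list_id` under KSR.

    seen_lists_by_candidate maps each candidate to the set of lists where it
    has already been seen; only candidates unseen in `list_id` contribute.
    """
    unseen = [d for d, seen in seen_lists_by_candidate.items() if list_id not in seen]
    return len(unseen) * delta * (1.0 - p_found)


candidates = {"d": {3}, "e": {1, 3}, "f": {2}}
delta_22 = score_reduction([1.0, 0.75, 0.5])
print(ksr_benefit(delta_22, candidates, list_id=2))
```

A block in a list where every candidate has already been seen gets benefit 0, so the scheduler has no reason to scan that list further.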

Random access scheduling
Redundant random accesses of CA
–CA: one RA after every $c_R/c_S$ SAs
–Many RAs turn out to be redundant: CA makes an RA for item d, but d is found anyway in a subsequent SA round
Our strategy: two-phase algorithm
–First sorted access rounds only, then switch to random access: no redundant random access
Switch from SA to RA when
–max-score for unseen ≤ min-top-k score
–estimated RA-cost ≤ total SA-cost so far (need to estimate cost of RA)
–cost of SA ~ cost of RA
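One possible reading of the switch rule is sketched below. It assumes both conditions must hold together and measures all costs in units of one sorted access; the function name and parameters are hypothetical.

```python
def should_switch_to_ra(unseen_max, min_top_k, est_num_ra, num_sa_done, cost_ratio):
    """Decide whether to stop sorted accesses and start random accesses.

    unseen_max  -- maximum possible score of an item not seen in any list
    min_top_k   -- current min-top-k score
    est_num_ra  -- estimated number of random accesses still needed
    num_sa_done -- number of sorted accesses performed so far
    cost_ratio  -- c_R / c_S, the cost of one RA in units of one SA
    """
    no_new_items = unseen_max <= min_top_k
    ra_affordable = cost_ratio * est_num_ra <= num_sa_done
    # Assumption: the two conditions are combined with a logical AND.
    return no_new_items and ra_affordable


print(should_switch_to_ra(1.1, 1.3, est_num_ra=20, num_sa_done=30_000, cost_ratio=1000))
```

Switching at this point keeps the total RA cost at or below the total SA cost, which matches the "cost of SA ~ cost of RA" balance that CA aims for, without CA's redundant lookups.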

Random access scheduling: estimating number of random accesses
Lists (List 1, List 2, List 3) are scanned till some depths by sorted access.
Candidate items, each with a current score and a best-score, are sorted by best score and compared with the current min-top-3 score; random accesses are made CA style.
A crude upper estimate: # of items in queue.
Each random access can prune some candidates, so a better estimate of #RAs is necessary.

Random access scheduling: estimating number of random accesses
Candidate items are sorted by best score; random accesses are made CA style against the current min-top-3 score.
Consider item d with interval $[s_d, b_d]$.
If there are at least three items before d with final score > $b_d$, d will be pruned before its random access: d is pruned.

Random access scheduling: estimating number of random accesses
Candidate items are sorted by best score; random accesses are made CA style against the current min-top-3 score.
Consider item d with interval $[s_d, b_d]$.
If there are fewer than three items before d with final score > $b_d$, a random access for d must be made: next, RA for d.

Random access scheduling: estimating number of random accesses
Candidate items are sorted by best score and compared with the current min-top-k score; item d has interval $[s_d, b_d]$.
Let d be the j-th item $d_j$ by best-score ordering, so j-1 items come before d.
For all i < j, define random variables $F_{i,j}$: $F_{i,j} = 1$ if final-score($d_i$) > best-score(d), 0 otherwise.
We compute $\Pr[F_{i,j} = 1]$ using histograms of the score distributions of the lists.
For general k: there will be a random access for d if and only if the # of items before d with final score > $b_d$ is less than k.
Observation: $\Pr[\text{RA is made for } d] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$
Expected # of random accesses: $\sum_j \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$, where the sum is taken over all candidate items.
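The estimate can be computed exactly with a small dynamic program if the indicator variables are assumed to be independent. That independence is an assumption of this sketch, as are the function names and the input format (one list of probabilities per candidate).

```python
def prob_fewer_than_k(probs, k):
    """Pr[F_1 + ... + F_n < k] for independent Bernoulli variables with Pr[F_i = 1] = probs[i].

    dist[c] holds the probability of exactly c successes so far, for c < k;
    probability mass that reaches k or more successes is simply dropped.
    """
    dist = [1.0] + [0.0] * (k - 1)
    for p in probs:
        for c in range(k - 1, 0, -1):
            dist[c] = dist[c] * (1 - p) + dist[c - 1] * p
        dist[0] *= 1 - p
    return sum(dist)


def expected_random_accesses(exceed_probs, k):
    """Sum over candidates j of Pr[fewer than k earlier items beat best-score(d_j)].

    exceed_probs[j] lists Pr[F_{i,j} = 1] for the items i that precede candidate j.
    """
    return sum(prob_fewer_than_k(probs, k) for probs in exceed_probs)


print(expected_random_accesses([[], [0.5], [0.5, 0.5]], k=1))
```

With k = 1, the first candidate always needs a random access, the second needs one with probability 0.5, and the third with probability 0.25.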

Experiments: estimate of RA queue size
TREC Terabyte data, TREC 2005 adhoc task queries.
[Figure: after all sorted accesses, # items in queue, #RA estimated (EST) and #RA actually done (DONE); total RA for 50 queries]

Lower bound: what is the best possible?
Try every possible SA-schedule over List 1, List 2 and List 3.
Count the essential number of RAs that must be done.
Block size: 10,000.

              #SA           +  $c_R/c_S$ × #RA   =  Total cost
Schedule 1    6 × 10,000    +  1,000 × 75        =  135,000
Schedule 2    9 × 10,000    +  1,000 × 12        =  102,000
Schedule 3    12 × 10,000   +  1,000 × 3         =  123,000
…             …                …                    …
Lower bound                                         102,000

Carefully engineered dynamic programming to try out all schedules.
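The arithmetic behind the schedule table can be checked with a few lines. The block counts and RA counts come from the slide; the cost ratio of 1,000 is inferred from the stated totals and block size, and the variable names are made up. Only the three schedules shown are compared here, whereas the real lower bound ranges over all schedules.

```python
BLOCK_SIZE = 10_000   # entries per block, as stated on the slide
COST_RATIO = 1_000    # c_R / c_S, inferred from the totals on the slide

# (number of blocks scanned, essential number of random accesses)
schedules = {
    "Schedule 1": (6, 75),
    "Schedule 2": (9, 12),
    "Schedule 3": (12, 3),
}


def total_cost(num_blocks, num_ra):
    """Access cost #SA + (c_R/c_S) * #RA, with #SA = blocks * block size."""
    return num_blocks * BLOCK_SIZE + COST_RATIO * num_ra


costs = {name: total_cost(blocks, ras) for name, (blocks, ras) in schedules.items()}
cheapest = min(costs, key=costs.get)
print(costs, cheapest)
```

Scanning more blocks lowers the number of essential RAs but raises the SA cost, so the cheapest schedule lies between the extremes.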

Experiments: TREC
TREC Terabyte benchmark collection: over 25 million documents, 426 GB raw data.
50 queries from TREC 2005 adhoc task.
[Figure: average cost (in #SA and #RA) versus k for full merge, NRA, CA, IO-Top-k (OUR) and lower bound]
[Figure: average running time (milliseconds) versus k for full merge, NRA, CA and IO-Top-k (OUR)]

Experiments: HTTP logs
FIFA World Cup HTTP logs: World Cup …, … billion HTTP requests.
Schema: Log(interval, user-id, bytes), aggregated for each user within one-day intervals.
Typical query: find the k users with most usage during June 1-10.
[Figure: full merge, NRA, CA, IO-Top-k (OUR) and lower bound]

Experiments: IMDB
IMDB movie data: more than 375,000 movies, 1,200,000 persons.
Attributes: Title, Genre, Actors, Description.
20 human-generated queries.

Conclusion
We presented
An inverted block-index data structure
–efficient: optimizes disk access
–performs fast merge in blocks, minimizes overhead
Integrated sorted access and random access scheduling
–SA scheduling: maximizes the benefit of scanning blocks
–RA scheduling: effectively estimates the RA-cost at every round
–postpones RA till the end of all SA: saves redundant RAs
Lower bound
–shows that our algorithm is close to the best possible

Thank you!

Appendix

Sorted access scheduling: Knapsack for Benefit Aggregation (KBA)
Pre-compute the expected score $e_{ij}$ of an item seen in block j of list i: (average score of the block)
Pre-compute the score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block – min-score of the block)
Inverted block-index:

List 1    List 2    List 3
e_11      e_21      e_31
e_12      e_22      e_32
e_13      e_23      e_33
e_14      e_24      e_34

Sorted access scheduling: Knapsack for Benefit Aggregation (KBA)
Pre-compute the expected score $e_{ij}$ of an item seen in block j of list i: (average score of the block)
Pre-compute the score reduction $\Delta_{ij}$ of every block j of each list i: (max-score of the block – min-score of the block)
Consider a candidate item d with interval $[s_d, b_d]$.
Candidate item d is already seen in list 3. If we scan list 3 further, the score $s_d$ and best-score $b_d$ of d do not change.
In list 2, d is not yet seen. If we scan one block from list 2
–either d is found in that block: the score $s_d$ of d increases, expected increase = $e_{22}$
–or d is not found in that block: the best-score $b_d$ of d decreases by $\Delta_{22}$
Benefit of block B in list i: $\sum_{d} \big( e_B \Pr[d \text{ found in } B] + \Delta_B \,(1 - \Pr[d \text{ found in } B]) \big)$
The sum is taken over all candidates d not yet seen in list i.
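The KBA benefit of a single block can be sketched directly from the formula. The function name and the input format, one "found" probability per candidate not yet seen in the list, are assumptions of this sketch; how those probabilities are estimated is not covered here.

```python
def kba_benefit(expected_score, delta, p_found_by_candidate):
    """Benefit of scanning block B of list i under KBA.

    expected_score       -- e_B, the average score of the block
    delta                -- Delta_B, max-score minus min-score of the block
    p_found_by_candidate -- Pr[d found in B] for each candidate d unseen in list i
    """
    return sum(
        expected_score * p + delta * (1.0 - p)   # score gain if found, bound reduction if not
        for p in p_found_by_candidate
    )


print(kba_benefit(expected_score=0.5, delta=0.25, p_found_by_candidate=[0.0, 1.0, 0.5]))
```

Setting every probability to 0 reduces this to the KSR benefit, which counts only the score reduction.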

Random access scheduling: details
Estimating number of random accesses
Candidate items are sorted by best score and compared with the current min-top-k score; item d has interval $[s_d, b_d]$.
Let d be the j-th item by best-score ordering, so j-1 items come before d.
There will be a random access for d if and only if the # of items before d with final score > $b_d$ is less than k.
For all i < j, define a random variable $F_{i,j}$ which takes value 1 if the final score of the i-th item is greater than the best-score of d, 0 otherwise.
Compute $\Pr[F_{i,j} = 1]$ using the expected score gain of the i-th item from lists where it is not yet seen.
Also define a random variable $R_j$ which takes value 1 if a random access is made for d, 0 otherwise.
Observation: $\Pr[R_j = 1] = \Pr[F_{1,j} + \dots + F_{j-1,j} < k]$
Let $X_j := F_{1,j} + \dots + F_{j-1,j}$.
Assume the $F_{i,j}$ are independent; then $X_j$ approximately follows a Poisson distribution with mean $\sum_i \Pr[F_{i,j} = 1]$.
We can compute $\Pr[X_j < k]$ using the incomplete gamma function.
Expected number of random accesses: $\sum_j E(R_j) = \sum_j \Pr[R_j = 1] = \sum_j \Pr[X_j < k]$, where the sum is taken over all candidate items.
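Under the Poisson approximation, the estimate reduces to a Poisson tail sum per candidate. The sketch below evaluates Pr[X < k] by summing the probability mass function directly, which for integer k equals the regularized upper incomplete gamma function Q(k, mean). The function names and input format are hypothetical.

```python
import math


def poisson_prob_below(k, mean):
    """Pr[X < k] for X ~ Poisson(mean), as a direct sum of the pmf over 0..k-1."""
    return sum(math.exp(-mean) * mean ** x / math.factorial(x) for x in range(k))


def expected_ra_poisson(exceed_probs, k):
    """Expected number of random accesses under the Poisson approximation.

    exceed_probs[j] lists Pr[F_{i,j} = 1] for the items i preceding candidate j;
    their sum is the Poisson mean used for candidate j.
    """
    return sum(poisson_prob_below(k, sum(probs)) for probs in exceed_probs)


print(expected_ra_poisson([[], [0.5], [0.5, 0.5]], k=1))
```

For small k the direct sum is cheap, so no special-function library is needed in this sketch.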

Other Experiments
For different values of the cost of RA compared to the cost of SA ($c_R/c_S$ ratio: 100, 1000) and varying query size
–title fields: average size 3
–description fields: average size 8
TREC Terabyte collection indexed with BM25 scores.
[Figures: cost versus (cost of RA)/(cost of SA), for query size 3 and query size 8]

End of appendix