
KLEE: A Framework for Distributed Top-k Query Algorithms
Max-Planck-Institut Informatik / University of Patras, NetCInS Lab
Sebastian Michel, Max-Planck Institute for Informatics, Saarbrücken, Germany
Peter Triantafillou, RACTI / Univ. of Patras, Rio, Greece
Gerhard Weikum, Max-Planck Institute for Informatics, Saarbrücken, Germany

KLEE: A Framework for Distributed Top-k Query Algorithms, VLDB 2005, Trondheim
Overview
– Problem Statement
– Related Work
– KLEE
– The Histogram Bloom Structure
– Candidate Filtering
– Evaluation
– Conclusion / Future Work

Computational Model
Distributed aggregation queries: a query with m terms whose index lists are spread across m peers P1 ... Pm
Applications:
– Internet traffic monitoring
– Sensor networks
– P2P Web search

Problem Statement
Consider:
– network consumption
– per-peer load
– latency (query response time): network, I/O, processing
Query initiator P0 serves as per-query coordinator.
[Figure: coordinator P0 and cohort peers P1, P2, P3; each cohort holds the index list of one query term (t1, t2, t3) as (docId, score) entries, e.g. (d1, 0.7), (d64, 0.4)]

Related Work
Existing methods:
Distributed NRA/TA: extend NRA/TA (Fagin et al. '99/'03, Güntzer et al. '01, Nepal et al. '99) with batched access
TPUT (Cao/Wang 2004):
1) fetch the k best entries (d, s_j) from each of P1 ... Pm and aggregate (Σ_{j=1..m} s_j(d)) at P0
2) ask each of P1 ... Pm for all entries with s_j > min-k / m and aggregate the results at P0
3) fetch missing scores for all candidates by random lookups at P1 ... Pm
+ DNRA aims to minimize per-peer work
- DTA/DNRA incur many messages
+ TPUT guarantees a fixed number of message rounds
- TPUT incurs high per-peer load and network bandwidth
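
The three TPUT phases can be sketched in a few lines. This is a minimal single-process illustration, not the authors' or Cao/Wang's implementation: each peer's index list is modelled as a Python dict from docId to score, network messages are replaced by direct dictionary access, and the function name `tput`, the treatment of missing entries as score 0, and the sample data are all assumptions. The candidate pruning a real implementation would apply before phase 3 is left out.

```python
import heapq


def tput(peers, k):
    """Return the top-k (doc_id, total_score) pairs over all peers.

    peers: list of dicts, one per peer, mapping doc_id -> local score.
    Missing entries count as score 0 (an assumption of this sketch).
    """
    m = len(peers)

    # Phase 1: fetch the k best entries from each peer and sum them at P0.
    partial = {}
    for scores in peers:
        for doc, s in heapq.nlargest(k, scores.items(), key=lambda e: e[1]):
            partial[doc] = partial.get(doc, 0.0) + s
    best = heapq.nlargest(k, partial.values())
    min_k = best[-1] if len(best) == k else 0.0  # k-th best partial sum
    threshold = min_k / m

    # Phase 2: ask every peer for all entries with score above min-k / m.
    candidates = set(partial)
    for scores in peers:
        candidates.update(doc for doc, s in scores.items() if s > threshold)

    # Phase 3: random lookups fill in the missing scores of each candidate.
    totals = {doc: sum(scores.get(doc, 0.0) for scores in peers)
              for doc in candidates}
    return heapq.nlargest(k, totals.items(), key=lambda e: e[1])


# Hypothetical index lists for three peers.
peers = [
    {"d1": 0.9, "d2": 0.6, "d3": 0.3, "d4": 0.1},
    {"d2": 0.8, "d3": 0.7, "d1": 0.2, "d4": 0.1},
    {"d3": 0.9, "d1": 0.5, "d2": 0.1, "d4": 0.05},
]
top2 = tput(peers, 2)
print([doc for doc, _ in top2])  # ['d3', 'd1']
```

In this example the phase-1 threshold is 1.4 / 3, so d4 never becomes a candidate, while phase 3 still has to look up one missing score for each of d1, d2, and d3.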

TPUT
[Figure: coordinator peer P0 holds the current top-k and the candidate set; cohort peers Pi ... Pj each hold a score-sorted index list with the top k entries and the min-k / m threshold marked; the final step retrieves missing scores]

KLEE: Key Ideas
If min-k / m is small, TPUT retrieves a lot of data in Phase 2 → high network traffic
Random accesses → high per-peer load
KLEE:
– Different philosophy: approximate answers!
– Efficiency:
  – reduces (docId, score)-pair transfers
  – no random accesses at each peer
– Two pillars:
  – the HistogramBlooms structure
  – the Candidate List Filter structure

The KLEE Algorithms
KLEE runs in 3 or 4 steps:
1. Exploration Step: … to get a better approximation of the min-k score threshold
2. Optimization Step: decide between 3 or 4 steps
3. Candidate Filtering: … a docID is a good candidate if it is high-scored at many peers
4. Candidate Retrieval: get all good docID candidates

Histogram Bloom Structure
Each peer pre-computes for each index list:
– an equi-width histogram
– a Bloom filter for each cell
– the average score per cell
– the upper/lower score
“increase” the min-k / m threshold
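
A rough sketch of such a per-list structure is given here. Everything beyond the four listed ingredients is an assumption made for illustration: scores are taken to lie in [0, 1], each cell's Bloom filter is packed into a Python integer, bit positions come from salted SHA-256 digests, and `estimate_score` shows one plausible coordinator-side use (take the average score of the highest cell whose filter reports the document), which the slides do not spell out.

```python
import hashlib
from dataclasses import dataclass


def bloom_positions(doc_id, num_bits, num_hashes):
    """Bit positions for doc_id; salted SHA-256 is an assumption of this sketch."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{doc_id}".encode()).digest()[:8], "big")
        % num_bits
        for i in range(num_hashes)
    ]


@dataclass
class Cell:
    lower: float          # lower score bound of the cell
    upper: float          # upper score bound of the cell
    bloom: int = 0        # Bloom filter bits packed into an int
    count: int = 0
    score_sum: float = 0.0

    @property
    def avg(self):
        return self.score_sum / self.count if self.count else 0.0


def build_histogram_blooms(index_list, num_cells, num_bits, num_hashes):
    """Equi-width histogram over [0, 1] with one Bloom filter per cell."""
    width = 1.0 / num_cells
    cells = [Cell(i * width, (i + 1) * width) for i in range(num_cells)]
    for doc_id, score in index_list.items():
        cell = cells[min(int(score / width), num_cells - 1)]
        for p in bloom_positions(doc_id, num_bits, num_hashes):
            cell.bloom |= 1 << p
        cell.count += 1
        cell.score_sum += score
    return cells


def estimate_score(cells, doc_id, num_bits, num_hashes):
    """Average score of the highest cell whose filter reports doc_id, else 0."""
    positions = bloom_positions(doc_id, num_bits, num_hashes)
    for cell in reversed(cells):
        if cell.count and all((cell.bloom >> p) & 1 for p in positions):
            return cell.avg
    return 0.0


index_list = {"d1": 0.95, "d2": 0.62, "d3": 0.31}  # hypothetical index list
cells = build_histogram_blooms(index_list, num_cells=10, num_bits=1024, num_hashes=3)
print(estimate_score(cells, "d1", 1024, 3))  # 0.95, the average of the top cell
```

Because a Bloom filter can report false positives, an estimate obtained this way can be too high but never too low for a document that is actually in the list.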

Bloom Filter
– bit array of size m
– k hash functions h_i: docId_space → {1, ..., m}
– insert n docs by hashing the ids and setting the corresponding bits
– Membership queries: a document is in the Bloom filter if the corresponding bits are set
– probability of false positives (pfp)
– tradeoff: accuracy vs. efficiency
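
For readers who have not met the structure before, a minimal Bloom filter might look like the following. Note that m and k here are the filter size and the number of hash functions, not the number of peers and the result size used elsewhere in the talk. The class name, the use of salted SHA-256 to derive the k hash functions, and the 0-based bit positions are assumptions of this sketch; `expected_pfp` uses the standard approximation (1 - e^(-kn/m))^k for the false-positive probability.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, m, k):
        self.m = m                  # number of bits
        self.k = k                  # number of hash functions
        self.bits = [False] * m

    def _positions(self, doc_id):
        # Derive k positions by salting one hash function (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{doc_id}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, doc_id):
        for p in self._positions(doc_id):
            self.bits[p] = True

    def __contains__(self, doc_id):
        # True for every inserted id; may also be True for others (false positive).
        return all(self.bits[p] for p in self._positions(doc_id))


def expected_pfp(m, k, n):
    """Approximate false-positive probability after inserting n ids."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1024, k=3)
for doc in ("d1", "d2", "d5"):
    bf.add(doc)
print("d1" in bf)  # True: inserted ids are always reported as present
print(expected_pfp(m=1000, k=3, n=100))
```

The accuracy versus efficiency tradeoff shows up directly in `expected_pfp`: fewer bits per inserted document means a smaller filter to ship but a higher false-positive rate.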

Exploration and Candidate Retrieval
[Figure: cohort peers Pi ... Pj each hold a score-sorted index list (top k entries and min-k / m threshold marked) plus a histogram of c cells with b bits per cell; coordinator peer P0 holds the current top-k and the candidate set]

Candidate List Filter Matrix
Goal: filter out unpromising candidate documents
In step 2:
– estimate the max number of docs that are above the min-k / m threshold
– send this number and the threshold to the cohort peers
[Figure: number of documents plotted against score, with the min-k / m threshold marked]

Candidate List Filter Matrix (2)
Each cohort returns a Bloom filter that “contains” all docs above the min-k / m threshold
→ Candidate List Filter Matrix (CLFM)
Select all columns with at least R bits set
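
The column selection can be illustrated with a small sketch. It assumes the matrix is represented as one row of 0/1 values per cohort, all of equal length; the function names and the sample matrix are hypothetical. `passes_filter` shows one plausible way a cohort could apply the selected columns afterwards, which is an assumption rather than something the slide states.

```python
def select_columns(clfm, r):
    """clfm: list of equal-length bit rows, one Bloom filter per cohort.

    Returns one boolean per column: True if at least r cohorts set that bit.
    """
    return [sum(column) >= r for column in zip(*clfm)]


def passes_filter(positions, selected):
    """A doc stays a candidate if every bit position it hashes to was selected."""
    return all(selected[p] for p in positions)


clfm = [
    [1, 0, 1, 1, 0],  # Bloom filter returned by cohort 1
    [1, 1, 0, 1, 0],  # cohort 2
    [0, 0, 1, 1, 0],  # cohort 3
]
selected = select_columns(clfm, r=2)
print(selected)  # [True, False, True, True, False]
```

A larger R keeps fewer columns and therefore fewer candidates, which matches the idea that a docID is a good candidate only if it is high-scored at many peers.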

KLEE – Candidate Set Reduction
[Figure: cohort peers Pi ... Pj with score-sorted index lists (top k entries and min-k / m threshold marked); coordinator peer P0 with the current top-k, the candidate set, and the candidate filter matrix]

KLEE – Candidate Retrieval
[Figure: cohort peers Pi ... Pj with score-sorted index lists (top k entries, min-k / m threshold, and an early stopping point marked); coordinator peer P0 with the current top-k, the candidate set, and the candidate filter matrix]

Enhanced Filtering
The BF representation can be improved …
Example index list: (d1, 0.9) (d2, 0.6) (d5, 0.5) (d3, 0.3) (d4, 0.25) (d17, 0.08) (d9, 0.07)
d1, d2, and d5 are promising documents, but e.g. s1 - s3 = 0.4!
Send a byte array with cell numbers instead of bits
Select “columns” with sum over upper bounds > min-k
[Figure: three byte arrays of cell numbers:
,23,14,85,34,23,11,24,54,33,60,34
43,13,15,25,54,21,71,44,64,71,72,31
31,43,21,75,21,71,64,34,74,63,50,62]
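
One way to read the enhanced selection rule is sketched here. The encoding is an assumption made for illustration: each cohort sends a list of small integers (standing in for the byte array) in which entry p is the 1-based number of an equi-width histogram cell over [0, 1], or 0 if no document hashed to position p; the upper bound of cell c out of `num_cells` is then c / num_cells. A column is kept when the upper bounds summed over all cohorts exceed min-k.

```python
def cell_upper_bound(cell, num_cells):
    """Upper score bound of a 1-based equi-width cell over [0, 1]; 0 means empty."""
    return cell / num_cells


def select_columns_by_upper_bound(cell_arrays, num_cells, min_k):
    """Keep a column if the summed per-cohort upper bounds exceed min_k."""
    selected = []
    for column in zip(*cell_arrays):
        upper_sum = sum(cell_upper_bound(c, num_cells) for c in column)
        selected.append(upper_sum > min_k)
    return selected


# Hypothetical cell-number arrays from three cohorts, 10 cells per histogram.
cell_arrays = [
    [9, 3, 0],
    [8, 2, 1],
    [7, 0, 0],
]
print(select_columns_by_upper_bound(cell_arrays, num_cells=10, min_k=2.0))
# [True, False, False]
```

Compared with counting set bits, this rule distinguishes a position that is high-scored at every cohort from one that merely clears the threshold at each of them.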

Architecture/Testbed
[Figure: architecture diagram with the components KLEE Algorithmic Framework; Extended IndexLists with BloomFilters, Histograms, and Batched Access; Index Lists; Oracle DB (SQL, B+ Index); and the operations open(), get(k), getWithBF(..), getAbove(score), next(), close()]

Evaluation: Benchmarks
– GOV: TREC .GOV collection + 50 TREC-2003 Web queries, e.g. juvenile delinquency
– XGOV: TREC .GOV collection + 50 manually expanded queries, e.g. juvenile delinquency youth minor crime law jurisdiction offense prevention
– IMDB: Movie Database, queries like actor = John Wayne; genre = western
– Synthetic Distribution (Zipf, different skewness): GOV collection but with synthetic scores
– Synthetic Distribution + Synthetic Correlation: 10 index lists

Evaluation: Metrics
– Relative recall w.r.t. the actual results
– Score error
– Bandwidth consumption
– Rank distance
– Number of RA and number of SA
– Query response time:
  - network cost (150 ms RTT, 800 Kb/s data transfer rate)
  - local I/O cost (8 ms rotation latency + 8 MB/s transfer delay)
  - processing cost
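
Two of these metrics are easy to make concrete. The sketch below computes relative recall as the fraction of the exact top-k that also appears in the approximate result, and estimates network time from the stated parameters. How the authors combine round trips and transfer volume is not given on the slide, so the formula (rounds times RTT plus payload divided by rate) and the reading of 800 Kb/s as 800,000 bits per second are assumptions.

```python
def relative_recall(approx_top_k, exact_top_k):
    """Fraction of the exact top-k documents found in the approximate result."""
    exact = set(exact_top_k)
    return len(exact.intersection(approx_top_k)) / len(exact)


def network_time(rounds, payload_bits, rtt_s=0.150, rate_bits_per_s=800_000):
    """Assumed cost model: one RTT per message round plus transfer time."""
    return rounds * rtt_s + payload_bits / rate_bits_per_s


print(relative_recall(["d1", "d3", "d4"], ["d1", "d2", "d3"]))
print(network_time(rounds=3, payload_bits=800_000))
```

In the first call two of the three exact results are recovered; in the second, three rounds contribute 0.45 s of latency and the payload one second of transfer time.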

Evaluated Algorithms
– DTA: batched distributed threshold algorithm, batch size k
– TPUT
– X-TPUT: approximate TPUT, no random accesses
– KLEE-3
– KLEE-4
C = 10% of the score mass

Synthetic Score Benchmarks
θ = 0.7

Synthetic Correlation Benchmark
→ randomly insert the top k documents from list i in the top π documents of list j
α = 30%

GOV / XGOV

Conclusion / Future Work
Conclusion
– KLEE: approximate top-k algorithms for wide-area networks
– significant performance benefits can be enjoyed, at only small penalties in result quality
– flexible framework for top-k algorithms, allowing for trading off efficiency versus result quality and bandwidth savings versus the number of communication phases
– various fine-tuning parameters
Future Work
– Reasoning about parameter values
– Consider a “moving” coordinator

Thanks for your attention! Questions? Comments?