1 An Overview of Distributed Top-K Ranking Algorithms 30-min presentation by Demetris Zeinalipour Lecturer School of Pure and Applied Sciences Open University.

Slides:



Advertisements
Similar presentations
Web Information Retrieval
Advertisements

1 Top-K Algorithms: Concepts and Applications by Demetris Zeinalipour Visiting Lecturer Department of Computer Science University of Cyprus Department.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
 Introduction  Views  Related Work  Preliminaries  Problems Discussed  Algorithm LPTA  View Selection Problem  Experimental Results.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets Chun Chen 1, Feng Li 2, Beng Chin Ooi 2, and Sai Wu 2 1 Zhejiang University, 2 National.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.
Case Study: BibFinder BibFinder: A popular CS bibliographic mediator –Integrating 8 online sources: DBLP, ACM DL, ACM Guide, IEEE Xplore, ScienceDirect,
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data Wenjie Zhang University of New South Wales & NICTA, Australia Joint work:
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.
6/15/20151 Top-k algorithms Finding k objects that have the highest overall grades.
ICNP'061 Benefit-based Data Caching in Ad Hoc Networks Bin Tang, Himanshu Gupta and Samir Das Computer Science Department Stony Brook University.
Evaluation of Top-k OLAP Queries Using Aggregate R-trees Nikos Mamoulis (HKU) Spiridon Bakiras (HKUST) Panos Kalnis (NUS)
SIGMOD'061 Energy-Efficient Monitoring of Extreme Values in Sensor Networks Adam Silberstein Kamesh Munagala Jun Yang Duke University.
Top-k Monitoring in Wireless Sensor Networks Minji Wu, Jianliang Xu, Xueyan Tang, and Wang-Chien Lee IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING,
1 Distributed Top-K Ranking Algorithms Demetris Zeinalipour Lecturer School of Pure and Applied Sciences Open University of Cyprus Monday, December 15.
1 Ranking Query Results in a Networked World Demetris Zeinalipour Lecturer Department of Computer Science University of Cyprus Thursday, July 23rd, 2010.
Link Recommendation In P2P Social Networks Yusuf Aytaş, Hakan Ferhatosmanoğlu, Özgür Ulusoy Bilkent University, Ankara, Turkey.
On the Construction of Data Aggregation Tree with Minimum Energy Cost in Wireless Sensor Networks: NP-Completeness and Approximation Algorithms National.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
1 Evaluating top-k Queries over Web-Accessible Databases Paper By: Amelie Marian, Nicolas Bruno, Luis Gravano Presented By Bhushan Chaudhari University.
« Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee » Proceedings of the 30th annual international ACM SIGIR, Amsterdam 2007) A.
1 Efficient Search Ranking in Social Network ACM CIKM2007 Monique V. Vieira, Bruno M. Fonseca, Rodrigo Damazio, Paulo B. Golgher, Davi de Castro Reis,
Workload-aware Optimization of Query Routing Trees in Wireless Sensor Networks Panayiotis Andreou (Univ. of Cyprus) Demetris Zeinalipour-Yazti (Open Univ.
Searching for Extremes Among Distributed Data Sources with Optimal Probing Zhenyu (Victor) Liu Computer Science Department, UCLA.
The Effect of Collection Organization and Query Locality on IR Performance 2003/07/28 Park,
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
MINT Views: Materialized In-Network Top-k Views in Sensor Networks Demetrios Zeinalipour-Yazti (Uni. of Cyprus) Panayiotis Andreou (Uni. of Cyprus) Panos.
1 Disclaimer Feel free to use any of the following slides for educational purposes, however kindly acknowledge the source. We would also like to know how.
Distributed Spatio-Temporal Similarity Search by Demetris Zeinalipour University of Cyprus & Open University of Cyprus Tuesday, July 4 th, 2007, 15:00-16:00,
Universität Stuttgart Institute of Parallel and Distributed Systems (IPVS) Universitätsstraße 38 D Stuttgart Scalable Processing of Trajectory-Based.
Reverse Top-k Queries Akrivi Vlachou *, Christos Doulkeridis *, Yannis Kotidis #, Kjetil Nørvåg * *Norwegian University of Science and Technology (NTNU),
Computer Science and Engineering Efficiently Monitoring Top-k Pairs over Sliding Windows Presented By: Zhitao Shen 1 Joint work with Muhammad Aamir Cheema.
1 Top-K Query Processing Techniques for Distributed Environments by Demetris Zeinalipour Visiting Lecturer Department of Computer Science University of.
Energy-Efficient Monitoring of Extreme Values in Sensor Networks Loo, Kin Kong 10 May, 2007.
Distributed Spatio-Temporal Similarity Search Demetrios Zeinalipour-Yazti University of Cyprus Song Lin
The Sweet Spot between Inverted Indices and Metric-Space Indexing for Top-K–List Similarity Search Evica Milchevski , Avishek Anand ★ and Sebastian Michel.
Efficient Processing of Top-k Spatial Preference Queries
Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.
Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.
Ranking objects based on relationships Computing Top-K over Aggregation Sigmod 2006 Kaushik Chakrabarti et al.
All right reserved by Xuehua Shen 1 Optimal Aggregation Algorithms for Middleware Ronald Fagin, Amnon Lotem, Moni Naor (PODS01)
1 Ranking Query Results in a Networked World Demetris Zeinalipour Lecturer Department of Computer Science University of Cyprus Thursday, May 27th, 2010.
Answering Top-k Queries Using Views Gautam Das (Univ. of Texas), Dimitrios Gunopulos (Univ. of California Riverside), Nick Koudas (Univ. of Toronto), Dimitris.
Searching Specification Documents R. Agrawal, R. Srikant. WWW-2002.
1 The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks D. Zeinalipour-Yazti, Z. Vagena, D. Gunopulos, V. Kalogeraki, V. Tsotras.
Information Technology Selecting Representative Objects Considering Coverage and Diversity Shenlu Wang 1, Muhammad Aamir Cheema 2, Ying Zhang 3, Xuemin.
Optimal Aggregation Algorithms for Middleware By Ronald Fagin, Amnon Lotem, and Moni Naor.
Efficient Skyline Computation on Vertically Partitioned Datasets Dimitris Papadias, David Yang, Georgios Trimponias CSE Department, HKUST, Hong Kong.
Database Searching and Information Retrieval Presented by: Tushar Kumar.J Ritesh Bagga.
03/02/20061 Evaluating Top-k Queries Over Web-Accessible Databases Amelie Marian Nicolas Bruno Luis Gravano Presented By: Archana and Muhammed.
Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.
1 VLDB, Background What is important for the user.
Computer Science and Engineering Jianye Yang 1, Ying Zhang 2, Wenjie Zhang 1, Xuemin Lin 1 Influence based Cost Optimization on User Preference 1 The University.
Cohesive Subgraph Computation over Large Graphs
Efficient Multi-User Indexing for Secure Keyword Search
Introduction to Load Balancing:
Demetrios Zeinalipour-Yazti (Univ. of Cyprus)
Evaluation of Relational Operations
Preference Query Evaluation Over Expensive Attributes
Spatio-temporal Pattern Queries
Rank Aggregation.
Efficient Evaluation of k-NN Queries Using Spatial Mashups
Efficient Subgraph Similarity All-Matching
D. ZeinalipourYazti, Z. Vagena, D. Gunopulos, V. Kalogeraki, V
Efficient Processing of Top-k Spatial Preference Queries
Efficient Aggregation over Objects with Extent
Presentation transcript:

1 An Overview of Distributed Top-K Ranking Algorithms 30-min presentation by Demetris Zeinalipour Lecturer School of Pure and Applied Sciences Open University of Cyprus Friday, December 12 th, 2008, 16:00-16:30 Communication Systems Group (CSG), ETH Zurich, Switzerland

Demetris Zeinalipour (Open University of Cyprus) 2 Top-k Queries: Introduction Top-K Queries are a long studied topic in the database and information retrieval communities The main objective has been to return the K highest- ranked answers quickly and efficiently. A Top-K query returns the subset of most relevant answers, in place of ALL answers, for two reasons: – i) to minimize the cost metric that is associated with the retrieval of all answers (e.g., disk, network, etc.) – ii) to maximize the quality of the answer set, such that the user is not overwhelmed with irrelevant results

Demetris Zeinalipour (Open University of Cyprus) 3 Top-k Queries: Then Assumptions The data is available locally on disks or over a “high- speed”, “always-on” network Trade-off Clients want to get the right answers quickly Service Providers want to consume the least possible resources SELECT TOP-2 pictures FROM PICTURES WHERE SIMILAR(picture, ) { } Query Processing

Demetris Zeinalipour (Open University of Cyprus) 4 Top-k Queries: Now A few motivating queries: Snapshot Query: Find the K nodes with the highest temperature values Continuous Query: For the next one hour continuously report the K rooms with the highest average temperature Historic Query (nodes store all data locally): Find the K nodes with the highest average temperature during the last 6 months Base Station In-Network Top-k Query Processing

Demetris Zeinalipour (Open University of Cyprus) 5 Top-k Queries: Now Assume a cluster of n=5 Web-servers Each server maintains locally a replica of the same m=5 static Web-pages When a web page is accessed by a client, the respective server increases a local hit counter by one TOP-1 Query: “Find the webpage with the highest number of hits across all servers” clien t Hits++ 5 { (N) Web-servers Hit s PageI D (M) Timestamps Scoring Table

Demetris Zeinalipour (Open University of Cyprus) 6 Presentation Outline A.Introduction B.Centralized Top-K Query Processing The Threshold Algorithm (TA) C.Distributed Top-K Query Processing The Threshold Join Algorithm (TJA) Experimentation using 75 workstations A. Other Applications of Top-K Queries Distributed Spatio-temporal Trajectory Retrieval In-Network Top-K Views (MINT Views)

Demetris Zeinalipour (Open University of Cyprus) 7 Centralized Top-K Query Processing Fagin’s* Threshold Algorithm (TA): (In ACM PODS’02) * Concurrently developed by 3 groups The most widely recognized algorithm for Top-K Query Processing in database systems ΤΑ Algorithm 1) Access the n lists in parallel. 2) While some object o i is seen, perform a random access to the other lists to find the complete score for o i. 3) Do the same for all objects in the current row. 4) Now compute the threshold τ as the sum of scores in the current row. 5)The algorithm stops after K objects have been found with a score above τ.

Demetris Zeinalipour (Open University of Cyprus) 8 Centralized Top-K: The TA Algorithm (Example) Have we found K=1 objects with a score above τ? =>ΝΟ Have we found K=1 objects with a score above τ? =>YES! Iteration 1 Threshold τ = => τ = 423 Iteration 2 Threshold τ (2nd row)= => τ = 354 O3, 405 O1, 363 O4, 207 Why is the threshold correct? It gives us the maximum score for the objects we have not seen yet (<= τ)

Demetris Zeinalipour (Open University of Cyprus) 9 Presentation Outline A.Top-K Algorithms: Definitions B.Centralized Top-K Query Processing The Threshold Algorithm (TA) C.Distributed Top-K Query Processing The Threshold Join Algorithm (TJA) Experimentation using 75 workstations A. Other Applications of Top-K Queries Distributed Spatio-temporal Trajectory Retrieval In-Network Top-K Views (MINT Views

Demetris Zeinalipour (Open University of Cyprus) 10 The Centralized Join Algorithm (CJA) Naive solution: – Perform the computation in one phase: each node sends its complete list of scores – Each intermediate node forwards all received lists Problem: To overcome the arbitrary phases of the Threshold Algorithm? Disadvantage – Overwhelming amount of messages. – Huge Query Response Time

Demetris Zeinalipour (Open University of Cyprus) 11 The Staged Join Algorithm (SJA) Improved Solution: Aggregate the lists before these are forwarded to the parent: This is the In-network aggregation approach Advantage: Only O(n) messages Disadvantage: The size of each message is still very large in size (i.e., the complete list)

Demetris Zeinalipour (Open University of Cyprus) 12 Threshold Join Algorithm (TJA) TJA is our 3-phase algorithm that optimizes top-k query execution in distributed (hierarchical) environments. Advantage: – It usually completes in 2 phases. – It never completes in more than 3 phases ( LB Phase, HJ Phase and CL Phase) – It is therefore highly appropriate for distributed environments “The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks", D. Zeinalipour-Yazti et. al, In VLDB’s DMSN’05. “Finding the K Highest-Ranked Answers in a Distributed Network”, D. Zeinalipour- Yazti et. al, Computer Networks, Elsevier, 2008.

Demetris Zeinalipour (Open University of Cyprus) 13 Step 1 - LB (Lower Bound) Phase Recursively send the K highest objectIDs of each node to the sink. Each intermediate node performs a union of the received results (defined as τ) Query: TOP-1 Τ=Τ=

Demetris Zeinalipour (Open University of Cyprus) 14 Step 2 – HJ (Hierarchical Join) Phase Disseminate τ to all nodes Each node sends back all objects with score above the objectIDs in τ Before sending the objects, each node tags as incomplete, scores that could not be computed exactly } Complete Incomplete

Demetris Zeinalipour (Open University of Cyprus) 15 Step 3 – CL (Cleanup) Phase Have we found K objects with a complete score that is above all incomplete scores? – Yes: The answer has been found! – No: Find the complete score for each incomplete object (all in a single batch phase) CL ensures correctness This phase is rarely required in practice!

Demetris Zeinalipour (Open University of Cyprus) 16 Experimental Evaluation We have implemented a P2P middleware in JAVA (sockets + binary transfer protocol). We tested our implementation with a network of 1000 real nodes using 75 Linux workstations. We use a trace driven experimentation methodology with data from an Environmental Monitoring Facility in Washington / Oregon Summary of Findings Bytes: CJA = 10xTJA; SJA = 3xTJA Time: TJA:3.7s [LB:1.0s,HJ:2.7s,CL:0.08s]; SJA: 8.2s; CJA:18.6s Messages:TJA:259, SJA:183, CJA:246

Demetris Zeinalipour (Open University of Cyprus) 17 Presentation Outline A.Top-K Algorithms: Definitions B.Centralized Top-K Query Processing The Threshold Algorithm (TA) C.Distributed Top-K Query Processing The Threshold Join Algorithm (TJA) Experimentation using 75 workstations A. Other Applications of Top-K Queries Distributed Spatio-temporal Trajectory Retrieval (UB-K and UBLB-K Algorithms) In-Network Top-K Views (MINT Views)

Demetris Zeinalipour (Open University of Cyprus) 18 Application 2: SpatioTemporal Similarity Search Similarity Search: Given a query Q, find the degree of similarity (Euclidean distance, DTW, LCSS) between Q and a set of m target trajectories {A1,A2,…,Am}. Each Αi (i<=m) is segmented into a number of non- overlapping cells {C1,C2,…,Cn} that maintain the local subsequences. Challenge: How can we find the K most similar trajectories to Q without pulling together all subsequences Q "Distributed Spatio-Temporal Similarity Search”, D. Zeinalipour-Yazti, S. Lin, D. Gunopulos, ACM 15th Conference on Information and Knowledge Management, (ACM CIKM 2006), November 6-11, Arlington, VA, USA, pp.14-23, August 2006.

Demetris Zeinalipour (Open University of Cyprus) 19 Application 2: Spatiotemporal Query Processing Solution Outline Each cell computes a lower bound and an upper bound on the matching of Q to its local subsequences. The distributed scoring table now contains score bounds (lower,upper) rather than exact scores. Q We have proposed two iterative algorithms: UB-K and UBLB-K, which combine these score bounds. UB-K and UBLB-K find the K most similar trajectories to Q without pulling together the distributed subsequences.

Demetris Zeinalipour (Open University of Cyprus) 20 Application 3: ΜΙΝT ΜΙΝΤ : a framework for optimizing the execution of continuous monitoring queries in sensor networks. "MINT Views: Materialized In-Network Top-k Views in Sensor Networks" D. Zeinalipour-Yazti, P. Andreou, P. Chrysanthis and G. Samaras, In IEEE 8th International Conference on Mobile Data Management, Mannheim, Germany, May 7 – 11, 2007 Query: Find the K=1 rooms with the highest average temperature

Demetris Zeinalipour (Open University of Cyprus) 21 ΜΙΝΤ Views: Problem Objective: To prune away tuples locally at each sensor such that messaging is minimized. Naïve Solution: Each node eliminates any tuple with a score lower than its top-1 result. D,76.5 C,75 B,41 (B,40) Problem: We received a incorrect answer i.e., (D,76.5) instead of (C,75).

Demetris Zeinalipour (Open University of Cyprus) 22 ΜΙΝΤ Views: Main Idea Bound above each tuple with its maximum possible value. K-covered Bound-set : Includes all the objects which have an upper bound (v ub ) greater or equal to the kth highest lower bound (τ), i.e., v ub > τ v ub v lb τ sum

Demetris Zeinalipour (Open University of Cyprus) 23 ΜΙΝΤ Views: Main Idea Bound above each tuple with its maximum possible value. K-covered Bound-set : Includes all the objects which have an upper bound (v ub ) greater or equal to the kth highest lower bound (τ), i.e., v ub > τ v ub v lb τ sum

24 An Overview of Distributed Top-K Ranking Algorithms Thank you! Demetris Zeinalipour This presentation is available at: Related Publications available at:

Backup Slides Main Findings: Dataset: Environmental Measurements from atmospheric monitoring stations in Washington & Oregon. ( ) Query: Find the K timestamps on which the average temperature across all stations was maximum. Network: Random Graph (degree=4, diameter 10) Evaluation Criterions: i) Bytes, ii) Time, iii) Messages

Demetris Zeinalipour (Open University of Cyprus) 26 Experimental Results TJA requires one order of magnitude less bytes than CJAs!

Demetris Zeinalipour (Open University of Cyprus) 27 Experimental Results TJA: 3.7sec [ LB:1.0sec, HJ:2.7sec, CL:0.08sec ] SJA: 8.2secCJA:18.6sec

Demetris Zeinalipour (Open University of Cyprus) 28 Experimental Results Although TJA consumes more messages than SJA these are small-size messages

Demetris Zeinalipour (Open University of Cyprus) 29 The TPUT Algorithm Phase 1 : o1 = = 183, o3 = = 240 τ = (Kth highest score (partial) / n) => 240 / 5 => τ = 48 Phase 2 : Have we computed K exact scores ? Computed Exactly: [o3, o1]Incompletely Computed: [o4,o2,o0] Drawback: The threshold is uniform (too coarse) Q: TOP- 1 o1=183, o3=240 o3=405 o1=363 o2’=158 o4’=137 o0’=124

Demetris Zeinalipour (Open University of Cyprus) 30 TJA vs. TPUT

Demetris Zeinalipour (Open University of Cyprus) 31 ΜΙΝΤ Views: Experimentation We obtained a real trace of atmospheric data collected by UC- Berkeley on the Great Duck Island (Maine) in We then performed a trace-driven experimentation using XBows TELOSB sensor. Our query was as follows: –SELECT TOP-K area, Avg(temp) –FROM sensors –GROUP BY area 0% 39% 77% 34% 12%