1 Top-K Query Processing Techniques for Distributed Environments by Demetris Zeinalipour Visiting Lecturer Department of Computer Science University of.

Presentation transcript:

1 Top-K Query Processing Techniques for Distributed Environments by Demetris Zeinalipour, Visiting Lecturer, Department of Computer Science, University of Cyprus. Wednesday, June 7th, 2006, "Mediterranean Studies" Seminar Room, FORTH, Heraklion, Crete

2 Presentation Goals To provide an overview of Top-K Query Processing algorithms for centralized and distributed settings. To present the Threshold Join Algorithm (TJA) which is our distributed top-k query processing algorithm. To present other research activities that are directly or indirectly related to this work.

3 Data Management & Query Processing Today We are living in a world where data is generated All The Time & Everywhere

4 Characteristics of these Applications “Data is generated in a distributed fashion” (e.g. sensor data, file-sharing data, geographically distributed clusters) “Distributed data is often outdated before it is ever utilized” (e.g. CCTV video traces, Internet ping data, sensor readings, weblogs, RFID tags, …) “Transferring the data to a centralized repository is usually more expensive than storing it locally”

5 Motivating Question Why design algorithms and systems that a priori organize information in centralized repositories? Our Approach: “In-situ Data Storage & Retrieval” –Data remains in-situ (at the generating site). –When users want to search or retrieve some information, they perform on-demand queries. Challenges: –Minimize the utilization of the communication medium. –Exploit the network and the inherent parallelism of a distributed environment. Focus on hierarchical networks, which are ubiquitous (e.g. P2P and sensor-nets). –The number of answers might be very large, so focus on Top-K.

6 Presentation Outline 1.Introduction to Top-K Query Processing 2.Related Work & Algorithms 3.The Threshold Join Algorithm (TJA) 4.Experimental Evaluation using our Middleware Testbed. 5.Related Activities & Future Work.

7 Distributed Top-K Query Processing TOP-k Query Objectives: 1.To find the k highest ranked answers to a user defined scoring function (e.g. Record1: 0.7 red, Record2: 0.4 red, etc) 2.Minimize some cost metric associated with the retrieval of the complete answer set.

8 Distributed Top-K Query Processing Cost Metric in a Distributed Environment A) Bandwidth –Transmitting less data conserves resources, energy and minimizes failures. –e.g. in a Sensor Network sending 1 byte ≈ 1120 CPU instructions. Source: The RISE (Riverside Sensor) (NetDB’05, IPSN’05 Demo, IEEE SECON’05) B) Query Response Time - The #bytes transmitted is not the only parameter. - We want to minimize the time to execute a query.

9 Distributed Top-K Query Processing Motivating Example Assume that we have a cluster of n=5 webservers. Each server maintains locally the same m=5 webpages. When a web page is accessed by a client, a server increases a local hit counter by one.

10 Distributed Top-K Query Processing Motivating Example (cont’d) TOP-1 Query: “Which webpage has the highest number of hits across all servers (i.e. the highest Score(o_i))?” Score(o_i) can only be calculated if we combine the hit counts from all 5 servers. (Figure: table of local scores, m URLs by n servers, with the TOTAL SCORE of each URL.)
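
The scoring step in this example can be sketched as follows. This is a minimal illustration, assuming Score(o_i) is the plain sum of the local hit counters. The per-server numbers are illustrative values, not the slide's table (which is not part of the transcript); they were chosen so that they reproduce the totals quoted later in the talk (405, 363, 207). The names `hits`, `total_scores` and `top_k` are hypothetical.

```python
from collections import Counter

# Illustrative hit counters: one dict per server (n = 5), one entry per page (m = 5).
# Assumed data; the table on the original slide is not in the transcript.
hits = [
    {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
]


def total_scores(servers):
    """Sum the local hit counts of every page across all servers."""
    totals = Counter()
    for local in servers:
        totals.update(local)  # Counter.update adds the counts key by key
    return totals


def top_k(servers, k):
    """Return the k pages with the highest total score."""
    return total_scores(servers).most_common(k)


print(top_k(hits, 1))
```

Note that this computes the answer only after every counter has been collected in one place, which is exactly the cost the rest of the talk tries to avoid.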

11 Distributed Top-K Query Processing Other Applications Sensor Networks: Each sensor maintains locally a sliding window of the last m readings (i.e. m (ts, val) pairs). Q: Find the K=3 timestamps with the highest average temperature across all sensors. Other Applications: Collaborative Spam Detection Networks, Content Distribution Networks, Information Retrieval, etc.

12 Presentation Outline 1.Introduction to Top-K Query Processing 2.Related Work & Algorithms 3.The Threshold Join Algorithm (TJA) 4.Experimental Evaluation using our Middleware Testbed. 5.Related Activities & Future Work.

13 Naïve Solution: Centralized Join (CJA) Each Node sends all its local scores (list) Each intermediate node forwards all received lists The Gnutella Approach Drawbacks Overwhelming amount of messages. Huge Query Response Time

14 Improved Solution: Staged Join (SJA) Aggregate the lists before these are forwarded to the parent using: This is essentially the TAG approach (Madden et al. OSDI '02) Advantage: Only (n-1) messages Drawback: Still sending everything!
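
The difference between the centralized and the staged join can be sketched on a small routing tree. This is a rough illustration under stated assumptions: the aggregate is a sum (the slide's aggregation formula is not in the transcript), the tree in `PARENT` and the scores in `LOCAL` are made-up values, and cost is counted in (objectID, score) tuples crossing each link. The function names `cja` and `sja` are hypothetical.

```python
from collections import Counter

# Hypothetical routing tree (child -> parent); node 0 is the querying node.
PARENT = {1: 0, 2: 0, 3: 1, 4: 1}

# Illustrative local score lists, one per node (assumed data).
LOCAL = {
    0: {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    1: {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    2: {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    3: {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    4: {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
}


def children(node):
    return [c for c, p in PARENT.items() if p == node]


def cja(node):
    """Centralized join: every list is forwarded unmerged.
    Returns (all lists in the subtree, tuples sent inside the subtree)."""
    lists, sent = [LOCAL[node]], 0
    for child in children(node):
        child_lists, child_sent = cja(child)
        lists.extend(child_lists)
        # The child forwards its own list plus every list it received.
        sent += child_sent + sum(len(lst) for lst in child_lists)
    return lists, sent


def sja(node):
    """Staged join: each node merges what it received before forwarding.
    Returns (merged scores of the subtree, tuples sent inside the subtree)."""
    merged, sent = Counter(LOCAL[node]), 0
    for child in children(node):
        child_merged, child_sent = sja(child)
        merged.update(child_merged)
        # The child forwards one merged list, i.e. one message per link.
        sent += child_sent + len(child_merged)
    return merged, sent


print("CJA tuples:", cja(0)[1])
print("SJA tuples:", sja(0)[1], "top-1:", sja(0)[0].most_common(1))
```

With one merged list per link, the staged variant needs n-1 messages, but each message still carries an entry for every object, which is the drawback the slide points out.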

15 The Threshold Algorithm (Not Distributed) Fagin’s* Threshold Algorithm (TA): Long studied and well understood. * Concurrently developed by 3 groups. TA Algorithm: 1) Access the n lists in parallel. 2) When an object o_i is seen, perform a random access to the other lists to find the complete score for o_i. 3) Do the same for all objects in the current row. 4) Compute the threshold τ as the sum of the scores in the current row. 5) The algorithm stops once K objects have been found with a score above τ.

16 The Threshold Algorithm (Example) Iteration 1: Threshold τ (1st row) = 423. Have we found K=1 objects with a score above τ? => NO. Iteration 2: Threshold τ (2nd row) = 354. Have we found K=1 objects with a score above τ? => YES! Complete scores computed so far: O3, 405; O1, 363; O4, 207.
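
The TA steps and the example can be traced with a short sketch. It assumes a sum as the aggregate and stops when the K-th best complete score is at least τ. The sorted lists are illustrative values chosen to reproduce the thresholds (423, 354) and scores (405, 363, 207) quoted on the slide; the slide's own table is not in the transcript. The name `threshold_algorithm` is hypothetical.

```python
# Illustrative score lists, one per source (assumed data).
LISTS = [
    {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
]


def threshold_algorithm(lists, k):
    """Return (top-k objects with complete scores, threshold of each row read)."""
    # Sorted access: every list is read in descending score order.
    ranked = [sorted(l.items(), key=lambda kv: kv[1], reverse=True) for l in lists]
    seen, thresholds, best = {}, [], []
    for row in range(max(len(r) for r in ranked)):
        current = [r[row] for r in ranked if row < len(r)]
        for obj, _ in current:
            if obj not in seen:
                # Random access: fetch the object's score from every list.
                seen[obj] = sum(l.get(obj, 0) for l in lists)
        tau = sum(score for _, score in current)  # best possible unseen score
        thresholds.append(tau)
        best = sorted(seen.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(best) == k and best[-1][1] >= tau:
            break  # no unseen object can still enter the top-k
    return best, thresholds


print(threshold_algorithm(LISTS, 1))
```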

17 The Threshold Algorithm (Not Distributed) Why is the threshold correct? Because the threshold essentially gives us the maximum score for the objects not seen (<= τ). Advantages: The number of objects accessed is minimized! Why not TA in a distributed environment? Disadvantages: Each object is accessed individually (random accesses); a huge number of round trips (phases); unpredictable latency (phases are sequential); in-network aggregation is not possible.
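
The correctness argument can be written out in one line, assuming the aggregate is a sum. The notation is introduced here for illustration and is not from the slides: s_j(o) is the score of object o in list j, and \underline{s}_j is the last score read from list j at the current depth.

```latex
% Every list is sorted in descending order, so an object o that has not been
% seen yet satisfies s_j(o) \le \underline{s}_j in every list j. Hence
\[
  \mathrm{Score}(o) \;=\; \sum_{j=1}^{n} s_j(o)
  \;\le\; \sum_{j=1}^{n} \underline{s}_j \;=\; \tau .
\]
% Once K seen objects have complete scores of at least \tau,
% no unseen object can displace them.
```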

18 Presentation Outline 1.Introduction to Top-K Query Processing 2.Related Work & Algorithms 3.The Threshold Join Algorithm (TJA) 4.Experimental Evaluation using our Middleware Testbed. 5.Related Activities & Future Work.

19 Threshold Join Algorithm (TJA) TJA is our 3-phase algorithm that minimizes the number of transmitted objects and hence the utilization of the communication channel. How does it work: 1.LB Phase: Ask each node to send the K (locally) highest ranked results. The union of these results defines a threshold τ. 2.HJ Phase: Ask each node to transmit everything above this threshold τ. 3.CL Phase: If at the end we have not identified the complete score of the K highest ranked objects, then we perform a cleanup phase to identify the complete score of all incompletely calculated scores.

20 Step 1 - LB (Lower Bound) Phase Each node sends its top-k results to its parent. Each intermediate node performs a union of all received lists (denoted as τ): Query: TOP-1

21 Step 2 – HJ (Hierarchical Join) Phase Disseminate τ to all nodes. Each node sends back all objects in τ together with every object whose local score is above theirs. Before sending the objects, each node tags as incomplete the scores that could not be computed exactly (these are upper bounds). (Figure: joined list split into Complete and Incomplete scores.)

22 Step 3 – CL (Cleanup) Phase Have we found K objects with a complete score? Yes: The answer has been found! No: Find the complete score for each incomplete object (all in a single batch phase) CL ensures correctness! This phase is rarely required in practice.
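
The three phases can be sketched end to end. This is a simplified simulation under loudly stated assumptions, not the authors' implementation: the hierarchy is flattened to a star (every node answers the querying node directly, so no in-network join takes place), every node holds a score for every object, the aggregate is a sum, and the per-node scores are illustrative. The name `tja_top_k` is hypothetical.

```python
# Illustrative local score lists, one per node (assumed data).
NODES = [
    {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
]


def tja_top_k(nodes, k):
    """Return (top-k with complete scores, objects that needed the CL phase)."""
    # Phase 1 (LB): union of the object IDs in every node's local top-k.
    tau_ids = set()
    for local in nodes:
        tau_ids.update(sorted(local, key=local.get, reverse=True)[:k])

    # Phase 2 (HJ): each node returns the objects in tau and everything that
    # scores at least as high locally; `cutoff` bounds whatever was withheld.
    replies = []
    for local in nodes:
        cutoff = min(local[obj] for obj in tau_ids)
        replies.append((cutoff, {o: s for o, s in local.items() if s >= cutoff}))

    bounds = {}  # objectID -> (upper bound on total score, is it complete?)
    for obj in set().union(*(reply for _, reply in replies)):
        upper = sum(reply.get(obj, cutoff) for cutoff, reply in replies)
        complete = all(obj in reply for _, reply in replies)
        bounds[obj] = (upper, complete)

    # Phase 3 (CL): resolve incomplete objects that could still beat the
    # K-th best complete score, all in one batch.
    kth = sorted((u for u, done in bounds.values() if done), reverse=True)[k - 1]
    cleanup = [o for o, (u, done) in bounds.items() if not done and u > kth]
    for obj in cleanup:
        bounds[obj] = (sum(local[obj] for local in nodes), True)

    ranked = sorted(((o, u) for o, (u, done) in bounds.items() if done),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:k], cleanup


print(tja_top_k(NODES, 1))
```

On this data the cleanup list stays empty, which is consistent with the remark that the CL phase is rarely needed; the objects in τ are always complete because every node is asked to return them.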

23 Presentation Outline 1.Introduction to Top-K Query Processing 2.Related Work & Algorithms 3.The Threshold Join Algorithm (TJA) 4.Experimental Evaluation using our Middleware Testbed. 5.Conclusions & Future Work.

24 Experimental Evaluation We implemented a real P2P middleware in Java (sockets + binary transfer protocol). We tested our implementation with a network of 1000 real nodes using 75 Linux workstations. We use a trace-driven experimentation methodology. For the results presented in this talk: Dataset: Environmental measurements from 32 atmospheric monitoring stations in Washington & Oregon. Query: K timestamps on which the average temperature across all stations was maximum. Network: Random graph (degree=4, diameter 10). Evaluation Criteria: i) Bytes, ii) Time, iii) Messages

25 Experimental Results TJA requires one order of magnitude fewer bytes than the Centralized Join Algorithm (CJA)!

26 Experimental Results TJA: 3,797 ms [LB: 1,059 ms, HJ: 2,730 ms, CL: 8 ms]; SJA: 8,224 ms; CJA: 18,660 ms
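
As a quick check of these figures, the arithmetic below sums the three TJA phases and derives the speedups from the reported times. Only the millisecond values come from the slide; the variable names are hypothetical.

```python
# Query response times reported on the slide, in milliseconds.
tja_phases_ms = {"LB": 1059, "HJ": 2730, "CL": 8}
sja_ms = 8224
cja_ms = 18660

tja_ms = sum(tja_phases_ms.values())          # total TJA time
speedup_vs_sja = sja_ms / tja_ms              # how much faster than SJA
speedup_vs_cja = cja_ms / tja_ms              # how much faster than CJA
cl_share = tja_phases_ms["CL"] / tja_ms       # fraction spent in cleanup

print(tja_ms, round(speedup_vs_sja, 2), round(speedup_vs_cja, 2))
print(f"cleanup share: {cl_share:.2%}")
```

The phase times add up to the reported total, and the cleanup phase accounts for well under one percent of it.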

27 Experimental Results Although TJA consumes more messages than SJA, these are small messages.

28 The TPUT Algorithm Q: TOP-1. Phase 1: partial sums o1 = 183, o3 = 240. τ = (Kth highest partial score) / n => 240 / 5 => τ = 48. Phase 2: Have we computed K exact scores? Computed exactly: [o3, o1] (o3 = 405, o1 = 363). Incompletely computed: [o4, o2, o0] (o2’ = 158, o4’ = 137, o0’ = 124). Drawback: The threshold is too coarse (uniform).
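
The TPUT phases can be sketched in the same style. This is a simplified, flat simulation that assumes a sum aggregate and nodes that answer the querying node directly; the per-node scores are illustrative values chosen to reproduce the numbers on the slide (partial sums 183 and 240, τ = 48, and the phase 2 scores), and the name `tput_top_k` is hypothetical.

```python
# Illustrative local score lists, one per node (assumed data).
NODES = [
    {"o3": 99, "o1": 66, "o0": 63, "o2": 48, "o4": 44},
    {"o1": 91, "o3": 90, "o0": 61, "o4": 7, "o2": 1},
    {"o1": 92, "o3": 75, "o4": 70, "o2": 16, "o0": 1},
    {"o3": 74, "o1": 56, "o2": 56, "o0": 28, "o4": 19},
    {"o3": 67, "o4": 67, "o1": 58, "o2": 54, "o0": 35},
]


def tput_top_k(nodes, k):
    """Return (top-k, uniform threshold, partial sums known after phase 2)."""
    n = len(nodes)

    # Phase 1: every node reports its local top-k; sum what was reported.
    local_tops, partial = [], {}
    for local in nodes:
        top = sorted(local, key=local.get, reverse=True)[:k]
        local_tops.append(set(top))
        for obj in top:
            partial[obj] = partial.get(obj, 0) + local[obj]
    tau1 = sorted(partial.values(), reverse=True)[k - 1]
    threshold = tau1 / n  # the same threshold is applied at every node

    # Phase 2: every node reports all objects scoring at least the threshold
    # (values already sent in phase 1 remain known to the querying node).
    replies = [
        {o: s for o, s in local.items() if s >= threshold or o in top}
        for local, top in zip(nodes, local_tops)
    ]
    sums, upper = {}, {}
    for obj in set().union(*replies):
        reported = [r[obj] for r in replies if obj in r]
        sums[obj] = sum(reported)
        # A withheld score is below the threshold, which bounds the total.
        upper[obj] = sums[obj] + (n - len(reported)) * threshold
    tau2 = sorted(sums.values(), reverse=True)[k - 1]
    candidates = [o for o in sums if upper[o] >= tau2]

    # Phase 3: fetch the exact scores of the surviving candidates.
    exact = {o: sum(local.get(o, 0) for local in nodes) for o in candidates}
    top_k = sorted(exact.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top_k, threshold, sums


print(tput_top_k(NODES, 1))
```

Because the threshold is the same at every node, nodes with generally high scores return many objects that cannot reach the top-k, which is the coarseness the slide refers to.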

29 TJA vs. TPUT

30 Presentation Outline 1.Introduction to Top-K Query Processing 2.Related Work & Algorithms 3.The Threshold Join Algorithm (TJA) 4.Experimental Evaluation using our Middleware Testbed. 5.Conclusions & Future Work.

31 Conclusions Distributed Top-K Query Processing is a new area with many new challenges and opportunities! We showed that the TJA is an efficient algorithm for computing the K highest ranked answers in a distributed environment. We believe that our algorithm will be a useful component in Query Optimization engines of future Database systems.

32 Future Work Implementation of the TJA algorithm in nesC, the programming language of TinyOS. Deployment using the Riverside Sensor (RISE). Provide the implementation of TJA as an extension of our open-source P2P information retrieval engine, Peerware. Explore other domains in which the discussed ideas might be beneficial: Grids, vehicular networks, etc.

33 Related Activity 1: Sensor Local Access Methods TJA assumes that random and sequential access methods to local data are available at each site. Problem: What happens if the target device is a battery-limited sensor device? Distinct Characteristics –New storage medium: flash memory –Asymmetric read/write characteristics. We propose "MicroHash: An Efficient Index Structure for Flash-Based Sensor Devices", D. Zeinalipour-Yazti, S. Lin, V. Kalogeraki, D. Gunopulos and W. Najjar, The 4th USENIX Conference on File and Storage Technologies (FAST’05).

34 Related Activity 2: Retrieval using Score Bounds Suppose that each node can only return lower and upper bounds rather than exact scores, e.g. instead of 16 it tells us that the similarity is in the range [11..19]. We developed two new algorithms: UBK & UBLBK. Proposed in "Distributed Spatiotemporal Similarity Search", D. Zeinalipour-Yazti, S. Lin, D. Gunopulos, under review.

35 References TOP-K Query Processing & In-Situ Data Storage –D. Zeinalipour-Yazti, Z. Vagena, D. Gunopulos, V. Kalogeraki, V. Tsotras, M. Vlachos, N. Koudas, D. Srivastava, "The Threshold Join Algorithm for Top-k Queries in Distributed Sensor Networks", Proceedings of the 2nd International Workshop on Data Management for Sensor Networks, DMSN (VLDB'2005), Trondheim, Norway, 2005. –D. Zeinalipour-Yazti, S. Neema, D. Gunopulos, V. Kalogeraki and W. Najjar, "Data Acquisition in Sensor Networks with Large Memories", IEEE Intl. Workshop on Networking Meets Databases, NetDB (ICDE'2005), Tokyo, Japan, 2005. –D. Zeinalipour-Yazti, V. Kalogeraki, D. Gunopulos, A. Mitra, A. Banerjee and W. Najjar, "Towards In-Situ Data Storage in Sensor Databases", 10th Panhellenic Conference on Informatics (PCI'2005), Volos, Greece, 2005.

36 Top-K Query Processing Techniques for Distributed Environments by Demetrios Zeinalipour, Department of Computer Science, University of Cyprus. Thanks! Wednesday, June 7th, 2006, "Mediterranean Studies" Seminar Room, FORTH, Heraklion, Crete