Bandwidth-Efficient Continuous Query Processing over DHTs Yingwu Zhu.



Background
- Instantaneous query
- Continuous query

Instantaneous Query (1)
- Documents are indexed
  - The node responsible for keyword t stores the IDs of the documents containing that term (i.e., an inverted list)
- Retrieves “one-time” relevant docs
- Latency is a top priority
- Query Q = t1 ∧ t2 ∧ …
  - Fetch the lists of doc IDs stored under t1, t2, …
  - Intersect these lists
- E.g., the Google search engine

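Instantaneous Query (2)
[Figure: nodes A, B, C, and D hold the inverted lists below.]
- cat: 1, 4, 7, 19, 20
- dog: 1, 5, 7, 26
- cow: 2, 4, 8, 18
- bat: 1, 8, 31
Query “cat ∧ dog”?
- cat? → cat: 1, 4, 7, 19, 20
- dog? → dog: 1, 5, 7, 26
- Send result: docs 1, 7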
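The lookup in this example can be sketched in a few lines of Python. This is only an illustration: a plain dict stands in for the DHT (in the real system each list lives on the node responsible for its term), and the function name `instantaneous_query` is made up for this sketch.

```python
# Inverted lists from the slide's example, keyed by term.
# Assumption: one dict replaces the DHT nodes that would each hold one list.
inverted_lists = {
    "cat": [1, 4, 7, 19, 20],
    "dog": [1, 5, 7, 26],
    "cow": [2, 4, 8, 18],
    "bat": [1, 8, 31],
}


def instantaneous_query(terms, index):
    """Return the sorted IDs of documents that contain every term in `terms`."""
    if not terms:
        return []
    result = None
    for term in terms:
        ids = set(index.get(term, []))  # fetch the list stored under this term
        result = ids if result is None else result & ids  # intersect
    return sorted(result)


print(instantaneous_query(["cat", "dog"], inverted_lists))  # [1, 7]
```

Every query ships whole inverted lists across the network, which is why latency dominates the design of instantaneous search.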

Continuous Query (1)
- Reverses the roles of documents and queries
- Queries are indexed
  - Query Q = t1 ∧ t2 ∧ … is stored under one of its terms t1, t2, …
  - Question 1: How is the index term selected? (query indexing)
- “Pushes” new relevant docs (incrementally)
- Enabled by “long-lived” queries
- E.g., the Google News Alert feature

Continuous Query (2)
- Upon insertion of a new doc D = t1 ∧ t2
  - Contact the nodes responsible for the inverted query lists of D’s keywords t1 and t2
  - Question 2: How are these nodes (query nodes, QN) located? (document announcement)
  - Resolve the query lists → the final list of queries satisfied by D
  - Question 3: What is the resolution strategy? (query resolution)
    - E.g., Term Dialogue, Bloom filters (Infocom’06)
  - Notify the owners of the satisfied queries

Query Resolution: Term Dialogue
[Figure: nodes A, B, C, and D. The new doc contains cat, dog, and cow. Inverted lists shown for “cat”, “dog”, and “cow”.]
- Inverted query list for “cat”: 1. dog; 2. horse & dog; 3. horse & cow
- Messages:
  1. Document announcement
  2. “dog” & “cow”
  3. “11” (bit vector)
  4. “horse”
  5. “0” (bit vector)
- Notify the owner of Q1
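A rough simulation of this message exchange is given below. It is a sketch under assumptions, not the protocol from the paper: the function name, the data layout, and the order in which terms are batched into rounds are all hypothetical, chosen only to reproduce the five messages on the slide.

```python
def term_dialogue(doc_terms, index_term, remaining_terms_by_query, rounds):
    """Simulate the dialogue between a query node and the document's node.

    remaining_terms_by_query maps a query ID to the terms it needs besides
    index_term. rounds lists the term batches the query node asks about
    (assumption: the batching is supplied by the caller).
    Returns (satisfied query IDs, [(asked terms, bit-vector reply), ...]).
    """
    doc_terms = set(doc_terms)
    known = {index_term: True}  # the announcement itself proves the index term
    transcript = []
    for batch in rounds:
        # The document's node answers with one bit per asked term.
        reply = "".join("1" if t in doc_terms else "0" for t in batch)
        transcript.append((list(batch), reply))
        for term, bit in zip(batch, reply):
            known[term] = bit == "1"
    satisfied = [
        qid
        for qid, terms in remaining_terms_by_query.items()
        if all(known.get(t, False) for t in terms)
    ]
    return satisfied, transcript


queries_under_cat = {"Q1": ["dog"], "Q2": ["horse", "dog"], "Q3": ["horse", "cow"]}
satisfied, transcript = term_dialogue(
    doc_terms=["cat", "dog", "cow"],
    index_term="cat",
    remaining_terms_by_query=queries_under_cat,
    rounds=[["dog", "cow"], ["horse"]],
)
print(satisfied)   # ['Q1']
print(transcript)  # [(['dog', 'cow'], '11'), (['horse'], '0')]
```

Only term names and short bit vectors cross the network, so the cost grows with the number of distinct terms the query node has to ask about.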

Query Resolution: Bloom filters
[Figure: nodes A, B, C, and D. The new doc contains cat, dog, and cow. Inverted lists shown for “cat”, “dog”, and “cow”.]
- Inverted query list for “cat”: 1. dog; 2. horse & dog; 3. horse & cow
- Messages:
  1. Doc announcement carrying “10110” (Bloom filter)
  2. “dog” (Term Dialogue)
  3. “1” (bit vector)
- Notify the owner of Q1
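The idea of shipping a Bloom filter of the document's terms with the announcement can be sketched as follows. The filter size, the number of hash functions, and the use of SHA-1 are assumptions for illustration; the slides do not give these parameters. Because a Bloom filter can report false positives but never false negatives, surviving queries still need a confirmation step (Term Dialogue on the slide).

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; the bit array is packed into one integer."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, term):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def might_contain(self, term):
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


def candidate_queries(doc_filter, remaining_terms_by_query):
    """Keep only queries whose remaining terms all pass the filter."""
    return [
        qid
        for qid, terms in remaining_terms_by_query.items()
        if all(doc_filter.might_contain(t) for t in terms)
    ]


doc_filter = BloomFilter()
for term in ["cat", "dog", "cow"]:
    doc_filter.add(term)

queries_under_cat = {"Q1": ["dog"], "Q2": ["horse", "dog"], "Q3": ["horse", "cow"]}
candidates = candidate_queries(doc_filter, queries_under_cat)
print(candidates)  # always includes 'Q1'; others only on a false positive
```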

Motivation
- Latency is not the primary concern, but bandwidth is an important design issue
- Different query indexing schemes incur different costs
- Different query resolution strategies incur different costs
→ Design a bandwidth-efficient continuous query system with “proper” query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches

Contributions
- Novel query indexing schemes → Question #1
  - Focus of this talk!
- Multicast-based document announcement → Question #2
  - In the paper
- Adaptive query resolution → Question #3
  - Makes intelligent decisions in resolving query terms
  - Minimizes the bandwidth cost
  - In the full tech. report

Design
- Focus on simple keyword queries, e.g., Q = t1 ∧ t2 ∧ … ∧ tn
- Leverage DHTs
  - Location & storage of documents and continuous queries
- Query indexing
  - How to choose index terms for queries?
- Doc. announcement, query resolution
  - Not covered in this talk!

Current Indexing Schemes
- Random Indexing (RI)
- Optimal Indexing (OI)

Random Indexing (RI)
- Randomly chooses a term as the index term
  - Q = t1 ∧ … ∧ tm
  - Index term ti is randomly selected
  - Q is indexed in the DHT node responsible for ti
- Pros: simple
- Cons:
  - Popular terms are more likely to be index terms for queries
  - Load imbalance
  - Introduces many irrelevant queries into query resolution, wasting bandwidth

Optimal Indexing (OI)
- Q = t1 ∧ … ∧ tm
- Index term ti is deterministically chosen: the most selective term, i.e., the one with the lowest frequency
- Q is indexed in the DHT node responsible for ti
- Pros: maximizes load balance & minimizes bandwidth cost
- Cons:
  - Assumes perfect knowledge of term statistics
  - Impractical, e.g., due to the large number of documents, node churn, continuous doc updates, …
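As a small illustration of OI's selection rule, the sketch below picks the least frequent term from a frequency table. The table is hypothetical (it reuses the list lengths from the earlier cat/dog/cow/bat example), and breaking ties alphabetically is an assumption; the point is that the rule needs global term statistics, which is exactly what makes OI impractical.

```python
def optimal_index_term(query_terms, doc_frequency):
    """Pick the most selective (least frequent) term of the query.

    Assumption: ties are broken alphabetically so the choice is deterministic.
    """
    return min(query_terms, key=lambda t: (doc_frequency[t], t))


# Hypothetical global statistics: number of documents containing each term.
doc_frequency = {"cat": 5, "dog": 4, "cow": 4, "bat": 3}

print(optimal_index_term(["cat", "dog"], doc_frequency))         # dog
print(optimal_index_term(["cat", "bat", "cow"], doc_frequency))  # bat
```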

Solution 1: MHI
- Minimum Hash Indexing
  - Order the query terms by their hashes
  - Select the term with the minimum hash as the index term
- Q = t1 ∧ … ∧ tm
  - Index term ti is deterministically chosen, s.t. h(ti) < h(tx) for all x ≠ i
  - Q is indexed in the DHT node responsible for ti
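A minimal sketch of the MHI rule follows. The slides do not say which hash function h is; SHA-1 is used here as an assumption, and the names `term_hash` and `mhi_index_term` are illustrative.

```python
import hashlib


def term_hash(term):
    """Hash a term to an integer (assumption: SHA-1 stands in for h)."""
    return int.from_bytes(hashlib.sha1(term.encode("utf-8")).digest(), "big")


def mhi_index_term(query_terms, h=term_hash):
    """Return the query term with the minimum hash value."""
    return min(query_terms, key=h)


# The choice depends only on the set of terms, not on their order,
# and needs no term statistics.
print(mhi_index_term(["cat", "dog", "cow"]) == mhi_index_term(["cow", "cat", "dog"]))  # True
```

Any node can recompute the index term of a query locally, which is what later lets document announcements skip queries whose minimum-hash term is not in the document.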

RI vs. MHI
- Terms t1, t2, …, t7, where h(ti) < h(tj) for i < j
- D = {t2, t4, t5, t6}
- 3 queries, irrelevant to D:
  - Q1 = t1 ∧ t2 ∧ t4
  - Q2 = t3 ∧ t4 ∧ t5
  - Q3 = t3 ∧ t5 ∧ t6
- (1) RI: Q1, Q2, and Q3 will each be considered in query resolution with a probability of 67%, since two of each query's three terms occur in D (need to resolve terms t1, t2, t3, t4, t5, and t6)
- (2) MHI: all of them will be filtered out → bandwidth savings! How?

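MHI: filtering irrelevant queries!
[Figure: nodes A to G on the DHT; D = {t2, t4, t5, t6}.]
- Q1 = t1 ∧ t2 ∧ t4, Q2 = t3 ∧ t4 ∧ t5, Q3 = t3 ∧ t5 ∧ t6
- Query lists per term:
  - t1: Q1
  - t2: none
  - t3: Q2, Q3
  - t4: none
  - t5: none
  - t6: none
- The nodes for D's terms t2, t4, t5, and t6 hold no queries: no action
- Q1, Q2, and Q3 are disregarded in query resolution, saving bandwidth!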
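The numbers in this example can be checked with a short script. Terms are represented by their subscripts, which is an assumption that works here because the example states h(ti) < h(tj) for i < j, so the minimum hash is simply the minimum subscript.

```python
from fractions import Fraction

# Terms are identified by subscript: 2 stands for t2, and so on.
doc = {2, 4, 5, 6}
queries = {"Q1": [1, 2, 4], "Q2": [3, 4, 5], "Q3": [3, 5, 6]}


def ri_probability(query, doc):
    """Chance that a uniformly random index term of the query occurs in the doc."""
    hits = sum(1 for t in query if t in doc)
    return Fraction(hits, len(query))


def mhi_considered(query, doc):
    """Under MHI the query is reached only if its minimum-hash term is in the doc."""
    return min(query) in doc


for name, query in queries.items():
    print(name, ri_probability(query, doc), mhi_considered(query, doc))
# Each query: RI probability 2/3, MHI considered False.
```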

MHI
- Pros:
  - Simple and deterministic
  - Does not require term stats
  - Saves bandwidth over RI (up to 39.3% saving for various query types)
- Cons:
  - Some popular terms can still become index terms when they have the minimum hash in their queries!
  - Load imbalance & irrelevant queries to process

Solution 2: SAP-MHI
- MHI is good but may still index queries under popular terms
- SAmPling-based MHI (SAP-MHI)
  - Sampling (synopsis of K popular terms) + MHI
  - Avoids indexing queries under the K popular terms
- Challenge: support duplicate-sensitive aggregates of popular terms, as synopses may be gossiped over multiple DHT overlay links and term frequencies may be overestimated!
- Borrows an idea from duplicate-sensitive aggregation in sensor networks

SAP-MHI
- Duplicate-sensitive aggregation
  - Goal: a synopsis of K popular terms
  - Based on the coin tossing experiment CT(y)
    - Toss a fair coin until either the first head occurs or y coin tosses end up with no head, and return the number of tosses
- Each node a
  - Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y))
  - Gossips Sa to its neighbor nodes
  - Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb, producing a new synopsis Sa (max() operations)
- Thus, each node has a synopsis of K popular terms after a sufficient number of gossip rounds
- Intuition: if a term appears in more documents, its value produced by CT(y) is likely to be larger than the values of rare terms
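The coin-tossing experiment and the max-based aggregation can be sketched as below. Several details are assumptions not stated on the slide: CT(y) is run once per document containing a term and the maximum is kept, ties in the top-K cut are broken alphabetically, and all function names are illustrative. The useful property is that taking max() makes merging insensitive to duplicates, so a synopsis that arrives twice over different overlay links does not inflate any term's value.

```python
import random


def ct(y, rng):
    """Toss a fair coin until the first head or until y tosses; return the toss count."""
    tosses = 0
    while tosses < y:
        tosses += 1
        if rng.random() < 0.5:  # head
            break
    return tosses


def top_k(values, k):
    """Keep the k terms with the highest values (ties broken alphabetically)."""
    ranked = sorted(values.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:k])


def local_synopsis(term_doc_counts, k, y, rng):
    """Assumption: one CT(y) run per document containing the term, keeping the max.

    Every count in term_doc_counts must be at least 1.
    """
    values = {
        term: max(ct(y, rng) for _ in range(count))
        for term, count in term_doc_counts.items()
    }
    return top_k(values, k)


def merge(sa, sb, k):
    """Aggregate two synopses with per-term max(), then keep the top k terms."""
    merged = dict(sa)
    for term, value in sb.items():
        merged[term] = max(value, merged.get(term, 0))
    return top_k(merged, k)


rng = random.Random(42)
s_a = local_synopsis({"the": 500, "query": 40, "chord": 2}, k=2, y=32, rng=rng)
s_b = local_synopsis({"the": 300, "dht": 25, "bloom": 1}, k=2, y=32, rng=rng)
print(merge(s_a, s_b, k=2))
```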

SAP-MHI: Indexing Example
- Query Q = t1 ∧ t2 ∧ t3 ∧ t4 ∧ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5)
- Synopsis S = {t1, t2}
- Q is indexed on the node responsible for t3, instead of t1
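This example translates directly into code. The hash values below are hypothetical, chosen only to satisfy the stated ordering, and the fallback to plain MHI when every term of a query is in the synopsis is an assumption; the slides do not cover that case.

```python
def sap_mhi_index_term(query_terms, synopsis, h):
    """Pick the minimum-hash term among the query terms not in the synopsis."""
    candidates = [t for t in query_terms if t not in synopsis]
    if not candidates:
        # Assumption: if all terms are popular, fall back to plain MHI.
        candidates = list(query_terms)
    return min(candidates, key=h)


# Hypothetical hash values with h(t1) < h(t2) < h(t3) < h(t4) < h(t5).
hashes = {"t1": 10, "t2": 20, "t3": 30, "t4": 40, "t5": 50}
query = ["t1", "t2", "t3", "t4", "t5"]
synopsis = {"t1", "t2"}

print(sap_mhi_index_term(query, synopsis, hashes.get))  # t3
print(sap_mhi_index_term(query, set(), hashes.get))     # t1 (plain MHI)
```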

Simulations

Parameter                   Value
DHT                         1000-node Chord
Document collection         TREC-1,2-AP
Mean of query sizes         5
# of continuous queries     100,000
# of docs                   10,000
# of unique terms           46,654
# of unique terms per doc   178
Query types                 Skew, Uniform, InverSkew
Query resolution            Term Dialogue, Bloom filters

SAP-MHI vs. MHI
- SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.

SAP-MHI vs. MHI
- Bloom filters are used in query resolution.

SAP-MHI vs. MHI
- Term Dialogue is used in query resolution.

SAP-MHI vs. MHI
- This shows why SAP-MHI saves bandwidth over MHI!

Summary
- Focus on a simple keyword query model
- Bandwidth is a top priority
- Query indexing impacts bandwidth cost
  - Goal: sift out as many irrelevant queries as possible!
- MHI and SAP-MHI
  - SAP-MHI is the more viable solution
  - Load is more balanced, more bandwidth saving!
  - Sampling cost is controlled
    - # of popular terms is relatively low
    - Memberships of popular terms do not change rapidly
- Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)

Thank You!