1
Bandwidth-Efficient Continuous Query Processing over DHTs
Yingwu Zhu
2
Background
- Instantaneous Query
- Continuous Query
3
Instantaneous Query (1)
- Documents are indexed: the node responsible for keyword t stores the IDs of documents containing that term (i.e., an inverted list)
- Retrieves "one-time" relevant docs; latency is a top priority
- Query Q = t1 Λ t2 …: fetch the lists of doc IDs stored under t1, t2, …, then intersect these lists
- E.g., the Google search engine
4
Instantaneous Query (2)
Example: nodes A, B, C, and D hold the inverted lists cat: 1,4,7,19,20; dog: 1,5,7,26; cow: 2,4,8,18; bat: 1,8,31. To answer "cat Λ dog", fetch the list stored under "cat" (1,4,7,19,20) and the list stored under "dog" (1,5,7,26), then intersect them. Result: docs 1 and 7.
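To make the fetch-and-intersect step concrete, here is a minimal sketch (Python; the in-memory INVERTED_LISTS dict and the dht_get helper are illustrative stand-ins for the DHT, not the system's actual API):

```python
# Illustrative in-memory stand-in for the DHT nodes' inverted lists.
INVERTED_LISTS = {
    "cat": {1, 4, 7, 19, 20},
    "dog": {1, 5, 7, 26},
    "cow": {2, 4, 8, 18},
    "bat": {1, 8, 31},
}

def dht_get(term):
    """Hypothetical lookup of the inverted list stored at the node for `term`."""
    return INVERTED_LISTS.get(term, set())

def instantaneous_query(*terms):
    """Fetch each term's doc-ID list and intersect them."""
    lists = [dht_get(t) for t in terms]
    return set.intersection(*lists) if lists else set()

print(instantaneous_query("cat", "dog"))   # {1, 7}, matching the slide's example
```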
5
Continuous Query (1)
- Reverses the roles of documents and queries: queries are indexed
- A query Q = t1 Λ t2 … is stored at the node responsible for one of its terms t1, t2, …
- Question 1: How is the index term selected? (query indexing)
- New relevant docs are "pushed" (incrementally) to matching queries, enabled by "long-lived" queries
- E.g., the Google News Alerts feature
6
Continuous Query (2)
- Upon insertion of a new doc D containing terms t1 and t2:
- Contact the nodes responsible for the inverted query lists of D's keywords t1 and t2. Question 2: How are these nodes (query nodes, QN) located? (document announcement)
- Resolve the query lists into the final list of queries satisfied by D. Question 3: What is the resolution strategy? (query resolution) E.g., Term Dialogue, Bloom filters (INFOCOM'06)
- Notify the owners of the satisfied queries
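A minimal, self-contained sketch of this flow; the in-memory QUERY_INDEX dict stands in for the distributed query nodes and all names are illustrative, not the paper's API (the paper's actual announcement is multicast-based, per the Contributions slide):

```python
# Illustrative stand-in for query nodes: index term -> queries,
# where each query is (owner, frozenset of its terms).
QUERY_INDEX = {
    "cat": [("alice", frozenset({"cat", "dog"}))],
    "bat": [("bob", frozenset({"bat", "cow"}))],
}

def announce_document(doc_id, doc_terms):
    """On insertion of doc D: contact the query node for each of D's terms
    (Question #2), resolve its query list against D (Question #3), and
    notify the owners of satisfied queries."""
    doc_terms = set(doc_terms)
    notified = set()
    for term in doc_terms:
        for owner, query_terms in QUERY_INDEX.get(term, []):
            if query_terms <= doc_terms and (owner, query_terms) not in notified:
                notified.add((owner, query_terms))
                print(f"notify {owner}: doc {doc_id} matches query {sorted(query_terms)}")

announce_document(42, ["cat", "dog", "cow"])   # notifies alice
```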
7
Query Resolution: Term Dialogue
Example: the node holding the inverted query list for "cat" has three queries indexed under it: Q1 = cat Λ dog, Q2 = cat Λ horse Λ dog, Q3 = cat Λ horse Λ cow. Other nodes hold the inverted query lists for "dog" and "cow". A new doc contains the terms cat, dog, and cow. Dialogue between the announcing node and the query node for "cat":
1. Document announcement
2. Query node asks: "dog" & "cow"?
3. Announcing node replies: "11" (bit vector: both present)
4. Query node asks: "horse"?
5. Announcing node replies: "0" (bit vector: absent)
Q1 is satisfied; notify the owner of Q1.
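Below is a minimal sketch of the bit-vector dialogue. It is a simplified variant that asks about every still-unresolved term in one batch per round, rather than the slide's term ordering or the paper's adaptive strategy; all function names are illustrative:

```python
def doc_node_reply(doc_terms, asked_terms):
    """Doc node's side: answer with a bit vector, one bit per asked term."""
    return "".join("1" if t in doc_terms else "0" for t in asked_terms)

def term_dialogue(indexed_queries, doc_terms, index_term):
    """Query node's side: iteratively ask about terms its queries still need,
    until every query is either satisfied or known to be unsatisfied."""
    known_present = {index_term}          # the announcement itself implies this term
    known_absent = set()
    pending = {name: set(terms) - known_present for name, terms in indexed_queries.items()}
    satisfied = []
    while True:
        to_ask = sorted({t for terms in pending.values() for t in terms
                         if t not in known_present | known_absent})
        if not to_ask:
            break
        bits = doc_node_reply(doc_terms, to_ask)       # one round-trip of the dialogue
        for t, b in zip(to_ask, bits):
            (known_present if b == "1" else known_absent).add(t)
        for name in list(pending):
            remaining = pending[name] - known_present
            if not remaining:
                satisfied.append(name)                 # all terms confirmed present
                del pending[name]
            elif remaining & known_absent:
                del pending[name]                      # query can no longer match
    return satisfied

# Slide's example: queries indexed under "cat"; the doc contains cat, dog, cow.
queries = {"Q1": {"cat", "dog"}, "Q2": {"cat", "horse", "dog"}, "Q3": {"cat", "horse", "cow"}}
print(term_dialogue(queries, {"cat", "dog", "cow"}, "cat"))   # ['Q1']
```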
8
Query Resolution: Bloom filters
Same setup: Q1 = cat Λ dog, Q2 = cat Λ horse Λ dog, and Q3 = cat Λ horse Λ cow are indexed under "cat", and a new doc contains cat, dog, and cow.
1. The doc announcement carries a Bloom filter of the doc's terms ("10110")
2. The query node tests each query's remaining terms against the filter; a hit ("dog") may be a false positive, so it is confirmed with a short Term Dialogue round
3. Reply: "1" (bit vector)
Notify the owner of Q1.
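A toy Bloom-filter sketch of that membership check (the parameters m=16, k=3 and the helper names are arbitrary assumptions; a real deployment would size the filter for its false-positive target):

```python
import hashlib

def bloom_bits(terms, m=16, k=3):
    """Build a tiny Bloom filter of a doc's terms (m bits, k hash functions)."""
    bits = 0
    for t in terms:
        for i in range(k):
            h = int(hashlib.sha1(f"{i}:{t}".encode()).hexdigest(), 16) % m
            bits |= 1 << h
    return bits

def maybe_contains(bits, term, m=16, k=3):
    """True if the filter may contain the term (false positives possible)."""
    return all(bits & (1 << (int(hashlib.sha1(f"{i}:{term}".encode()).hexdigest(), 16) % m))
               for i in range(k))

# The doc announcement carries the Bloom filter of the doc's terms.
doc_filter = bloom_bits({"cat", "dog", "cow"})

# Query node: drop queries whose remaining terms miss the filter; any hits are
# then confirmed with a short Term Dialogue round, since hits can be false positives.
queries = {"Q1": {"dog"}, "Q2": {"horse", "dog"}, "Q3": {"horse", "cow"}}
candidates = [q for q, terms in queries.items()
              if all(maybe_contains(doc_filter, t) for t in terms)]
print(candidates)   # likely ['Q1']: "horse" should miss the filter
```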
9
Motivation
- Latency is not the primary concern, but bandwidth is an important design issue
- Different query indexing schemes incur different costs
- Different query resolution strategies incur different costs
- Goal: design a bandwidth-efficient continuous query system with "proper" query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches
10
Contributions
- Novel query indexing schemes (Question #1): the focus of this talk!
- Multicast-based document announcement (Question #2): in the paper
- Adaptive query resolution (Question #3): makes intelligent decisions in resolving query terms to minimize the bandwidth cost; in the full technical report
11
Design
- Focus on simple keyword queries, e.g., Q = t1 Λ t2 Λ … Λ tn
- Leverage DHTs for the location & storage of documents and continuous queries
- Query indexing: how to choose index terms for queries?
- Doc announcement and query resolution: not covered in this talk!
12
Current Indexing Schemes
- Random Indexing (RI)
- Optimal Indexing (OI)
13
Random Indexing (RI)
- Randomly chooses a term as the index term: for Q = t1 Λ … Λ tm, the index term ti is selected at random, and Q is indexed at the DHT node responsible for ti
- Pros: simple
- Cons: popular terms are more likely to become index terms, causing load imbalance and introducing many irrelevant queries into query resolution, wasting bandwidth
14
Optimal Indexing (OI)
- For Q = t1 Λ … Λ tm, the index term ti is deterministically chosen as the most selective term, i.e., the one with the lowest frequency; Q is indexed at the DHT node responsible for ti
- Pros: maximizes load balance & minimizes bandwidth cost
- Cons: assumes perfect knowledge of term statistics, which is impractical, e.g., due to the large number of documents, node churn, continuous doc updates, …
15
Solution 1: MHI (Minimum Hash Indexing)
- Order the query terms by their hashes and select the term with the minimum hash as the index term
- For Q = t1 Λ … Λ tm, the index term ti is deterministically chosen s.t. h(ti) < h(tx) for all x ≠ i; Q is indexed at the DHT node responsible for ti
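A minimal sketch of MHI index-term selection (Python; SHA-1 stands in for whatever hash the DHT uses, and dht_put is a hypothetical store call, not the system's actual API):

```python
import hashlib

def term_hash(term: str) -> int:
    """Stable hash shared by all nodes (here: SHA-1 of the term)."""
    return int(hashlib.sha1(term.encode()).hexdigest(), 16)

def mhi_index_term(query_terms):
    """MHI: pick the query term with the minimum hash as the index term."""
    return min(query_terms, key=term_hash)

# Q = cat ^ horse ^ dog is stored at the node responsible for its minimum-hash term.
q = ["cat", "horse", "dog"]
index_term = mhi_index_term(q)
# dht_put(term_hash(index_term), q)   # hypothetical DHT store call
```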
16
RI vs. MHI
Terms t1, t2, …, t7, where h(ti) < h(tj) for i < j. New doc D = {t2, t4, t5, t6}. Three queries, all irrelevant to D:
- Q1 = t1 Λ t2 Λ t4
- Q2 = t3 Λ t4 Λ t5
- Q3 = t3 Λ t5 Λ t6
(1) RI: Q1, Q2, and Q3 will each be considered in query resolution with probability 67% (two of each query's three terms are in D, so a randomly chosen index term lands on a term of D with probability 2/3), requiring terms t1, t2, t3, t4, t5, and t6 to be resolved.
(2) MHI: all of them are filtered out. Bandwidth savings! How?
17
MHI: filtering irrelevant queries!
Under MHI, Q1 = t1 Λ t2 Λ t4 is indexed at the node for t1 (its minimum-hash term), while Q2 = t3 Λ t4 Λ t5 and Q3 = t3 Λ t5 Λ t6 are indexed at the node for t3. When D = {t2, t4, t5, t6} is announced, only the nodes responsible for t2, t4, t5, and t6 are contacted, and each of their inverted query lists is empty (no action). The nodes for t1 and t3 are never contacted, so Q1, Q2, and Q3 are disregarded in query resolution, saving bandwidth!
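A small sketch reproducing that comparison on the slide's example (the hash values are stipulated so that h(t1) < … < h(t7), as stated above; they are not real DHT hashes):

```python
# Illustrative: assign hashes so that h(t1) < h(t2) < ... < h(t7), as on the slide.
h = {f"t{i}": i for i in range(1, 8)}

doc = {"t2", "t4", "t5", "t6"}
queries = {
    "Q1": {"t1", "t2", "t4"},
    "Q2": {"t3", "t4", "t5"},
    "Q3": {"t3", "t5", "t6"},
}

for name, terms in queries.items():
    mhi_term = min(terms, key=h.get)              # MHI index term: minimum hash
    reached_mhi = mhi_term in doc                 # seen only if its index node is contacted
    p_reached_ri = len(terms & doc) / len(terms)  # RI: index term uniform over the query's terms
    print(name, "MHI reached:", reached_mhi, "RI reach probability:", round(p_reached_ri, 2))
# Each query: MHI reached = False, RI reach probability = 0.67
```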
18
MHI
- Pros: simple and deterministic; does not require term stats; saves bandwidth over RI (up to 39.3% savings across various query types)
- Cons: some popular terms can still become index terms when they have the minimum hash within their queries, causing load imbalance and leaving irrelevant queries to process
19
Solution 2: SAP-MHI
- MHI is good but may still index queries under popular terms
- SAmPling-based MHI (SAP-MHI): sampling (a synopsis of the K popular terms) + MHI; avoid indexing queries under the K popular terms
- Challenge: support duplicate-sensitive aggregation of popular terms, since synopses may be gossiped over multiple DHT overlay links and term frequencies may otherwise be overestimated!
- Borrow the idea of duplicate-sensitive aggregation from sensor networks
20
SAP-MHI: Duplicate-sensitive aggregation
- Goal: a synopsis of the K popular terms
- Based on the coin-tossing experiment CT(y): toss a fair coin until either the first head occurs or y tosses end up with no head, and return the number of tosses
- Each node a produces a local synopsis Sa containing K popular terms (the terms with the highest CT(y) values) and gossips Sa to its neighbor nodes
- Upon receiving a synopsis Sb from a neighbor b, node a aggregates Sa and Sb into a new synopsis Sa (via max() operations, which are insensitive to duplicates)
- Thus, after a sufficient number of gossip rounds, each node holds a synopsis of the K popular terms
- Intuition: if a term appears in more documents, the value produced by CT(y) for it will be larger than the values of rare terms
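A minimal sketch of CT(y) and the max-based merge, under the simplifying assumption that each node runs CT(y) once per local document occurrence of a term and keeps the per-term maximum; this is an illustrative reading of the slide, not the paper's exact construction:

```python
import random

def ct(y: int) -> int:
    """CT(y): toss a fair coin until the first head, or until y tosses show
    no head; return the number of tosses."""
    for tosses in range(1, y + 1):
        if random.random() < 0.5:   # head
            return tosses
    return y

def local_synopsis(term_doc_counts, k, y=32):
    """Local synopsis: for each term, keep the max CT(y) over its documents,
    then retain the K terms with the highest values."""
    values = {}
    for term, ndocs in term_doc_counts.items():
        values[term] = max((ct(y) for _ in range(ndocs)), default=0)
    top_k = sorted(values, key=values.get, reverse=True)[:k]
    return {t: values[t] for t in top_k}

def merge(s_a, s_b, k):
    """Duplicate-insensitive merge: take the max per term, keep the top K."""
    merged = dict(s_a)
    for term, v in s_b.items():
        merged[term] = max(merged.get(term, 0), v)
    top_k = sorted(merged, key=merged.get, reverse=True)[:k]
    return {t: merged[t] for t in top_k}
```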
21
SAP-MHI: Indexing Example
Query Q = t1 Λ t2 Λ t3 Λ t4 Λ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5). With synopsis S = {t1, t2}, Q is indexed at the node responsible for t3 (the minimum-hash term not in S), instead of t1.
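A small sketch of SAP-MHI's index-term choice; term_hash mirrors the MHI sketch above, and the fallback when every query term is popular is an assumption:

```python
import hashlib

def term_hash(term: str) -> int:
    """Same stable hash as in the MHI sketch above."""
    return int(hashlib.sha1(term.encode()).hexdigest(), 16)

def sap_mhi_index_term(query_terms, popular_synopsis):
    """SAP-MHI: minimum-hash term among the query's non-popular terms."""
    candidates = [t for t in query_terms if t not in popular_synopsis]
    if not candidates:                  # assumed fallback: every term is popular
        candidates = list(query_terms)
    return min(candidates, key=term_hash)

# On the slide's example (hashes ordered h(t1) < ... < h(t5), synopsis S = {t1, t2}),
# this selects t3 as the index term instead of t1.
```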
22
Simulations (parameters and values)
- DHT: 1000-node Chord
- Document collection: TREC-1,2-AP
- Mean query size: 5
- # of continuous queries: 100,000
- # of docs: 10,000
- # of unique terms: 46,654
- # of unique terms per doc: 178
- Query types: Skew, Uniform, InverSkew
- Query resolution: Term Dialogue, Bloom filters
23
SAP-MHI vs. MHI: SAP-MHI improves load balance over MHI as the synopsis size K increases, for Skew queries.
24
SAP-MHI vs. MHI: Bloom filters are used in query resolution.
25
SAP-MHI vs. MHI: Term Dialogue is used in query resolution.
26
SAP-MHI vs. MHI: This shows why SAP-MHI saves bandwidth over MHI!
27
Summary
- Focus on a simple keyword query model; bandwidth is a top priority
- Query indexing impacts the bandwidth cost; goal: sift out as many irrelevant queries as possible!
- MHI and SAP-MHI: SAP-MHI is the more viable solution (load is more balanced and more bandwidth is saved!)
- Sampling cost is controlled: the # of popular terms is relatively low, and the membership of the popular-term set does not change rapidly
- Document announcement & adaptive query resolution further cut down bandwidth consumption (not covered in this talk)
28
Thank You!