Count / Top-k Continuous Queries on P2P Networks 01/11/2006
Outline: Problem Definition; P2P Architecture; Count; Top-K; Experiment Setup; Future Work
Streaming Data in P2P. P2P: dynamically changing topology, large scale, … Streaming data: continuous, unbounded, rapid, time-varying, noisy. P2P + streaming data: dynamic in both data and topology
Objective and Goal. Objective: issue a continuous query to estimate count and top-K. Goal: lower the communication cost; lightweight maintenance; approximate answers; an adaptive and progressive approach
Naïve approach: flood the overlay continuously. Pros: closer to the exact answer. Cons: network congestion; still not real time
The State of the Art. Count: existing work focuses on one-time answers in P2P, or deals with streaming data only. Top-K: existing work covers P2P environments without streaming data, or distributed environments that are not P2P
P2P architecture: assumptions. Hierarchical P2P (the focus): super-peer hierarchical structure; the query issuer is a super-peer; super-peers connect with other super-peers; each peer belongs to only one super-peer. Pure unstructured P2P
Big picture: group; accumulate information within a group based on the constraint and statistics; set constraint; report changes; approximate answer
Group in hierarchical P2P: Issuer, Coordinator, Peer
After partition: Group1, Group2, Group3. Assume we have N objects and K groups after partition
User-specified Epsilon: Group1, Group2, Group3; user-specified ε (precision)
Consider a group: Coordinator; Nodes P1, P2, P3, P4; Objects O1, O2, O3
Each node maintains the distribution information of the objects it owns: nodes P1, P2, P3, P4; per-node table columns: object, #; rates R1, R2, R3, R4
Initial step: polling. Coordinator; Nodes P1, P2, P3, P4
Information at coordinator after polling: one (object, #) table for each of P1, P2, P3, P4
Statistics information (each cell: latest real value / estimated value)
object #   P1      P2      P3      P4      Δ
O1         1/1     6/6     10/10   5/5     22
O2         11/11   13/13   5/5     9/7     36
O3         15/15   6/6     3/3     9/9     33
R
T
Legend: T = updated time stamp; R = maximum changing rate (+/-) of objects in each peer; Δ = change value for each object
Update to Coordinator (T2): (Δ11, Δ21, Δ31), (Δ12, Δ22, Δ32), (Δ13, Δ23, Δ33)
Calculate Count
Redistribute Epsilon: $w_i = \max(\Delta_i) / C_{x,0}$, where $x$ is the $i$-index of $\max(\Delta_i)$; $\delta_i = w_i \, \varepsilon \, C_{x,0} / \sum_i w_i$
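The redistribution rule on this slide can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `redistribute_epsilon` and its inputs are hypothetical, `max_delta[i]` is assumed to hold Max(Δ_i), and `base_count[i]` is assumed to hold the C_{x,0} value paired with it.

```python
def redistribute_epsilon(max_delta, base_count, epsilon):
    """Split the precision budget epsilon into per-index slacks delta_i.

    Assumptions (not stated on the slide): max_delta[i] is Max(Delta_i)
    and base_count[i] is the C_{x,0} value paired with it.
    """
    # w_i = Max(Delta_i) / C_{x,0}
    weights = [d / c for d, c in zip(max_delta, base_count)]
    total = sum(weights)
    if total == 0:
        # The formula divides by the sum of weights, so it is undefined here.
        raise ValueError("all weights are zero; nothing to redistribute")
    # delta_i = w_i * epsilon * C_{x,0} / sum(w)
    return [w * epsilon * c / total for w, c in zip(weights, base_count)]


# Made-up numbers for illustration only.
slacks = redistribute_epsilon(max_delta=[1, 3], base_count=[10, 10], epsilon=0.2)
```

With these made-up inputs the weights are 0.1 and 0.3, so the second index receives three times the slack of the first. When every C_{x,0} equals the same value C, the slacks sum to εC.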
Visiting sequence (peers P1, P2, P3, P4): pick the peers that would violate δ
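The slide does not say how a violation is detected. The sketch below assumes one plausible test: a peer "would violate δ" when its maximum changing rate R, multiplied by the time elapsed since its last update T, exceeds its slack δ. The function name, the dictionary keys, and all numbers are hypothetical.

```python
def peers_to_visit(peers, now):
    """Return the names of peers whose worst-case drift exceeds their slack.

    Each peer is a dict with hypothetical keys:
      'name'  - peer identifier
      'rate'  - maximum changing rate R (change per time unit)
      'stamp' - time T of the last update
      'delta' - slack assigned to the peer
    """
    selected = []
    for p in peers:
        # Assumed violation test: rate * elapsed time > slack.
        worst_case_drift = p["rate"] * (now - p["stamp"])
        if worst_case_drift > p["delta"]:
            selected.append(p["name"])
    return selected


# Made-up values for illustration only.
peers = [
    {"name": "P1", "rate": 1.0, "stamp": 0, "delta": 5.0},
    {"name": "P2", "rate": 3.0, "stamp": 0, "delta": 5.0},
    {"name": "P3", "rate": 0.5, "stamp": 0, "delta": 5.0},
    {"name": "P4", "rate": 2.0, "stamp": 1, "delta": 1.5},
]
to_visit = peers_to_visit(peers, now=2)
```

Here only P2 and P4 exceed their slack, so only those two would be contacted; the others keep their estimated values.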
Update information (each cell: latest real value / estimated value)
Group   P1      P2      P3      P4      Δ
O1      1/1     6/6     10/10   8/8     -
O2      11/11   11/11   5/5     6/6     -
O3      15/15   5/5     3/3     11/11   -
R
T
For the nodes not visited (each cell: latest real value / estimated value)
Group   P1      P2      P3      P4      Δ
O1      1/2     6/6     10/9    8/8     25
O2      11/13   11/11   5/4     6/6     34
O3      15/18   5/5     3/2     11/11   36
R
T
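The last column of this table can be reproduced by summing per-peer values. The sketch below assumes each cell reads "latest real value / estimated value" and that the final column is the sum of the estimated values; the slide does not state this, but the numbers are consistent with it. The names `table` and `group_counts` are hypothetical.

```python
# Cells copied from the slide, stored as (latest real value, estimated value)
# in the order P1, P2, P3, P4.
table = {
    "O1": [(1, 2), (6, 6), (10, 9), (8, 8)],
    "O2": [(11, 13), (11, 11), (5, 4), (6, 6)],
    "O3": [(15, 18), (5, 5), (3, 2), (11, 11)],
}


def group_counts(table):
    """Sum the estimated value of every peer for each object."""
    return {obj: sum(est for _real, est in cells) for obj, cells in table.items()}


counts = group_counts(table)
```

The result is 25, 34, and 36 for O1, O2, and O3, matching the last column on the slide.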
Un-notified Leave: ping P1; P1 is dead; remove P1's information. Remaining peers: P2, P3, P4
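Handling an un-notified leave can be sketched as a liveness check followed by removal of the dead peer's statistics. This is a minimal sketch: `drop_dead_peers` and the `is_alive` callback are hypothetical stand-ins, and a real system would replace the callback with an actual ping and timeout.

```python
def drop_dead_peers(stats, is_alive):
    """Remove the statistics of every peer that fails a liveness check.

    stats    - dict mapping peer name to its stored information
    is_alive - hypothetical callable standing in for a real ping
    Returns the list of removed peer names.
    """
    # Collect first, then delete, so the dict is not mutated while iterating.
    dead = [name for name in stats if not is_alive(name)]
    for name in dead:
        del stats[name]
    return dead


# Made-up state: P1 has left without notice.
stats = {"P1": {"O1": 1}, "P2": {"O1": 6}, "P3": {"O1": 10}, "P4": {"O1": 8}}
removed = drop_dead_peers(stats, is_alive=lambda name: name != "P1")
```

After the call, `stats` holds only P2, P3, and P4, so later count estimates no longer include P1's objects.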
Experiment Setup. Generate synthetic data sets from statistical distributions for: streaming data; lifetime of peers. Metrics: message size; communication cost; response latency; result accuracy
Top-K: use regression to predict the trend of changes. Once an updated result is required, the super-peer only needs to ask the doubtful peers about the doubtful objects, update its counting list, and return the top-k objects
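The regression step can be illustrated with an ordinary least-squares line fitted to each object's count history. This is a sketch only: the slide does not name the regression model, so a straight line is an assumption, the function names and the history values are hypothetical, and the step of re-asking doubtful peers is omitted.

```python
def fit_line(times, values):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        return 0.0, mean_v  # a single time point carries no trend
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def predicted_top_k(history, at_time, k):
    """Rank objects by their predicted count at at_time and keep the top k."""
    predictions = {}
    for obj, samples in history.items():
        times = [t for t, _ in samples]
        values = [v for _, v in samples]
        slope, intercept = fit_line(times, values)
        predictions[obj] = slope * at_time + intercept
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return ranked[:k]


# Made-up (time, count) samples for illustration only.
history = {
    "O1": [(0, 22), (1, 24), (2, 26)],
    "O2": [(0, 36), (1, 35), (2, 34)],
    "O3": [(0, 33), (1, 35), (2, 37)],
}
top2 = predicted_top_k(history, at_time=3, k=2)
```

With this data the predicted counts at time 3 are 28, 33, and 39, so O3 overtakes O2 even though O2 had the highest count at time 0.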
Future Work: connect and recommend latent good friends for each user. Good friends: those with the same interests (behaviors). Exploit the currently connected peers to discover good friends bit by bit. Design a system that forms clusters reflecting the current interests of individual peers and connects them based on their similarity, using the user's social network
Advantages: reduce search time and query traffic by using the friends list. By exploiting the differing strengths of their arcs/edges/ties (= "friendshipness"), social networks outperform random-walk networks in quickly finding target objects
Example Level 1 Level 2
Example: "has larger weight than"; Score($N_i$); Similarity