MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG Distributed Ranked Data Dissemination in Social Networks. Kaiwen Zhang. Joint work with: Mo Sadoghi, Vinod Muthusamy, Hans-Arno Jacobsen. University of Toronto. ICDCS, July 9-11, 2013.
Top-k & publish/subscribe for social networks. [Figure: broker overlay (match & forward) with advertisement, subscription, and publication paths. Labels: publisher (name = 'John Doe', location = 'New York'); subscriber (name = 'John Doe'); subscriber (location = 'America', k = 1, W = 2, closest to Philadelphia); name = 'John Doe', location = 'USA'; name = 'John Doe', location = 'Philadelphia'.]
Use cases: Event-heavy applications require top-k: social networks, news feed homepage, location-based applications, online games. Efficient support for top-k in pub/sub: top-k publications for a subscription; mixed subscriptions (top-k and regular); topology is provided.
Outline: Top-k model for publish/subscribe; Related work; Current late vs. naive early approach; Proposed window chunking solution; Evaluation.
Top-k processing: Regular broker operation. Count-based window parameters are supplied by the subscription: k is the number of publications, W is the window size, δ is the shift size. Each publication is scored and the top-k are extracted.
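The window parameters above can be illustrated with a minimal Python sketch. It assumes a finite, already-collected stream and a numeric scoring function; the function name `sliding_top_k`, the publication labels, and the scores are hypothetical and not taken from the talk.

```python
import heapq
from typing import Callable, List, Sequence


def sliding_top_k(stream: Sequence[str], k: int, w: int, shift: int,
                  score: Callable[[str], float]) -> List[List[str]]:
    """Return the top-k publications of every full count-based window."""
    results = []
    # Windows start at 0, shift, 2*shift, ... and each holds exactly w items.
    for start in range(0, len(stream) - w + 1, shift):
        window = stream[start:start + w]
        results.append(heapq.nlargest(k, window, key=score))
    return results


# Hypothetical scores, for illustration only.
scores = {"p1": 5, "p2": 1, "p3": 4, "p4": 2, "p5": 9}
stream = list(scores)  # arrival order p1..p5
top = sliding_top_k(stream, k=2, w=4, shift=1, score=scores.__getitem__)
print(top)  # [['p1', 'p3'], ['p5', 'p3']]
```

With W = 4 and δ = 1, five publications produce two windows, and each window contributes its two highest-scoring publications.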
Related work: Top-k computation for pub/sub: defining scoring functions; data structures for storing top-k results; approximate solutions based on histograms. Top-k processing in databases: the reverse problem (find existing data for a query). No work on top-k dissemination in pub/sub: top-k computation occurs within a single broker, and the entire stream is collected at the edge.
Current late approach: Submit a top-k subscription {k = 2, W = 4, δ = 1}. The edge broker maintains top-k processing and converts the top-k subscription into a regular subscription. The rest of the topology is agnostic to the top-k semantics: the other brokers simply forward matching events. The entire matching stream is collected at the edge and processed to determine the top-k. Example: publisher streams (1,2,3,4) and (a,b,c,d) interleave as (1,2,a,b,3,c,4,d), which gives the windows [1,2,a,b][2,a,b,3][a,b,3,c][b,3,c,4][3,c,4,d]. This approach is not efficient: low-scoring publications are propagated to the edge and then filtered out. Can we reduce intra-network traffic by pushing the top-k computation upstream?
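The late approach can be sketched in Python as follows. The merged stream is the one from the slide; the scores are hypothetical and chosen only to show how many publications reach the edge without ever being delivered.

```python
from typing import Dict, List, Set


def count_windows(stream: List[str], w: int, shift: int) -> List[List[str]]:
    """Split a stream into full count-based windows of size w."""
    return [stream[i:i + w] for i in range(0, len(stream) - w + 1, shift)]


def late_top_k(stream: List[str], k: int, w: int, shift: int,
               score: Dict[str, int]) -> Set[str]:
    """Edge-side processing: union of the top-k of every window."""
    delivered: Set[str] = set()
    for window in count_windows(stream, w, shift):
        ranked = sorted(window, key=score.__getitem__, reverse=True)
        delivered.update(ranked[:k])
    return delivered


# Interleaved stream from the slide; scores are assumed for illustration.
merged = ["1", "2", "a", "b", "3", "c", "4", "d"]
score = {"1": 1, "2": 8, "a": 7, "b": 2, "3": 3, "c": 9, "4": 4, "d": 5}

windows = count_windows(merged, w=4, shift=1)
delivered = late_top_k(merged, k=2, w=4, shift=1, score=score)
wasted = [p for p in merged if p not in delivered]
print(windows[0], windows[-1])  # ['1', '2', 'a', 'b'] ['3', 'c', '4', 'd']
print(wasted)                   # ['1', 'b', '3']
```

Under these assumed scores, three of the eight publications cross the whole network only to be discarded at the edge.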
Naive early approach: Fill windows at the publisher edge and compute the top-k publications. Only disseminate the top-k publications. Merge the top-k streams to obtain the final results. Example: [1,2,3,4] yields (1,2) and [a,b,c,d] yields (b,c), which merge into (1,2,b,c).
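A minimal Python sketch of the naive early approach, using the windows from the slide. The scores are assumptions chosen so that the local results match the slide's (1,2) and (b,c); the helper name `local_top_k` is hypothetical.

```python
from typing import Dict, List


def local_top_k(window: List[str], k: int, score: Dict[str, int]) -> List[str]:
    """Top-k of one publisher-side window, kept in publication (FIFO) order."""
    keep = set(sorted(window, key=score.__getitem__, reverse=True)[:k])
    return [p for p in window if p in keep]


# Assumed scores: 1 and 2 rank highest among the numbers, b and c among letters.
score = {"1": 9, "2": 8, "3": 2, "4": 1, "a": 3, "b": 7, "c": 6, "d": 4}

from_numbers = local_top_k(["1", "2", "3", "4"], k=2, score=score)
from_letters = local_top_k(["a", "b", "c", "d"], k=2, score=score)

# Downstream brokers only see the already-filtered streams.
merged = from_numbers + from_letters
print(merged)  # ['1', '2', 'b', 'c']
```

Only four of the eight publications leave the publisher edge brokers, which is the traffic saving the approach is after.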
Correctness criterion: Goal: the same result as the late approach, with no false positives or false negatives. Stream reconstruction criterion: a stream of top-k publications is correct if a stream of all publications reconstructed from it, and possible according to the ordering semantics, can be processed centrally to obtain the same result. Ordering guarantees: consider per-publisher FIFO ordering, under which multiple interleavings of publications are possible.
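One way to read the reconstruction criterion is as a brute-force check over every interleaving that respects per-publisher FIFO order. The Python sketch below makes two simplifying assumptions that are not from the talk: there are exactly two publishers, and a result is compared as a set of delivered publications, ignoring delivery order. All names and scores are hypothetical.

```python
from typing import Dict, Iterator, List, Set


def interleavings(xs: List[str], ys: List[str]) -> Iterator[List[str]]:
    """Yield every merge of xs and ys that preserves each list's own order."""
    if not xs or not ys:
        yield list(xs) + list(ys)
        return
    for rest in interleavings(xs[1:], ys):
        yield [xs[0]] + rest
    for rest in interleavings(xs, ys[1:]):
        yield [ys[0]] + rest


def central_top_k(stream: List[str], k: int, w: int, shift: int,
                  score: Dict[str, int]) -> Set[str]:
    """Process a full stream at one place: union of every window's top-k."""
    delivered: Set[str] = set()
    for i in range(0, len(stream) - w + 1, shift):
        window = stream[i:i + w]
        delivered.update(sorted(window, key=score.__getitem__, reverse=True)[:k])
    return delivered


def is_reconstructible(candidate: Set[str], xs: List[str], ys: List[str],
                       k: int, w: int, shift: int, score: Dict[str, int]) -> bool:
    """True if some FIFO-respecting interleaving yields exactly `candidate`."""
    return any(central_top_k(s, k, w, shift, score) == candidate
               for s in interleavings(xs, ys))


# Hypothetical publishers and scores.
xs, ys = ["x1", "x2", "x3", "x4"], ["y1", "y2", "y3", "y4"]
score = {"x1": 3, "x2": 7, "x3": 6, "x4": 5, "y1": 1, "y2": 2, "y3": 9, "y4": 8}

# y3 has the highest score overall, so any result that omits it is incorrect.
print(is_reconstructible({"x2", "x3", "y4"}, xs, ys, 2, 4, 1, score))  # False
```

The enumeration grows combinatorially with stream length, so this is only useful for reasoning about small examples.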
Naive counter-example: k = 2, W = 4, δ = 1. Fill windows at the publisher edge brokers and forward the local top-k results. Publications are delivered according to per-source ordering. Reconstructing the stream: we fail to consider "overlapping" windows such as [b,c,d,1].
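The effect of the missing overlapping windows can be shown on a small example. The Python sketch below fixes one arrival order that respects per-source ordering (all letters, then all numbers) and compares the naive output with the late approach on that same order. The scores are assumptions; the sketch does not check other interleavings, which the reconstruction criterion would also allow.

```python
from typing import Dict, List, Set


def window_top_k(stream: List[str], k: int, w: int, shift: int,
                 score: Dict[str, int]) -> Set[str]:
    """Union of the top-k of every full window in the stream."""
    out: Set[str] = set()
    for i in range(0, len(stream) - w + 1, shift):
        out.update(sorted(stream[i:i + w], key=score.__getitem__, reverse=True)[:k])
    return out


# Assumed scores; d is not in its publisher's local top-2.
score = {"a": 3, "b": 7, "c": 6, "d": 5, "1": 1, "2": 2, "3": 9, "4": 8}
letters, numbers = ["a", "b", "c", "d"], ["1", "2", "3", "4"]

# Naive early: each publisher edge broker only looks at its own windows.
naive = window_top_k(letters, 2, 4, 1, score) | window_top_k(numbers, 2, 4, 1, score)

# Late: the edge sees the full stream, including the overlapping windows
# [b,c,d,1], [c,d,1,2] and [d,1,2,3].
late = window_top_k(letters + numbers, 2, 4, 1, score)

print(sorted(late - naive))  # ['d']
```

In the overlapping window [c,d,1,2], d ranks second, so the late approach delivers it on this arrival order. The naive approach has already dropped d at the publisher edge and cannot recover it downstream.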
Overlapping windows problem: Key idea: send a few more publications, enough to prevent false negatives but fewer than all publications, to remain efficient. Key insight: computing overlapping windows. Publishers compute windows for their own publications. Windows that contain publications from different source brokers can only be computed downstream, which requires full knowledge of the publications in such windows.
Window chunking technique: Hybrid solution: send all publications for overlapping windows and reduce the occurrence of overlapping windows. Early top-k filtering of local source windows; late top-k filtering of overlapping windows. Each chunk contains publications from a single source broker. Chunks contain a stream of top-k publications for successive windows. Left and right guards are full windows of publications that start and end a chunk. The subscriber edge broker must fully process one chunk before choosing another chunk to process. Overlapping windows can only occur in the inter-chunk region, for which we have guard windows that can be processed downstream.
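The chunk layout described above can be sketched in Python. This is an illustration under stated assumptions rather than the PADRES implementation: a chunk is built from a fixed batch of one source's publications, the guards do not overlap, and the `Chunk` and `build_chunk` names, the batch length, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Chunk:
    source: str
    left_guard: List[str]    # full first window, sent unfiltered
    top_k_stream: List[str]  # top-k survivors between the guards
    right_guard: List[str]   # full last window, sent unfiltered

    def forwarded(self) -> List[str]:
        return self.left_guard + self.top_k_stream + self.right_guard


def build_chunk(source: str, pubs: List[str], k: int, w: int, shift: int,
                score: Dict[str, int]) -> Chunk:
    """Filter one source's batch early, keeping full guard windows at both ends."""
    if len(pubs) < 2 * w or (len(pubs) - w) % shift:
        raise ValueError("sketch assumes non-overlapping guards and aligned windows")
    windows = [pubs[i:i + w] for i in range(0, len(pubs) - w + 1, shift)]
    left, right = windows[0], windows[-1]
    guards = set(left) | set(right)
    keep = set()
    for window in windows:
        keep.update(sorted(window, key=score.__getitem__, reverse=True)[:k])
    # Guard publications are always sent; only the interior is filtered.
    interior = [p for p in pubs if p not in guards and p in keep]
    return Chunk(source, left, interior, right)


pubs = [f"p{i}" for i in range(12)]
score = dict(zip(pubs, [50, 10, 20, 30, 90, 40, 5, 60, 70, 80, 15, 25]))
chunk = build_chunk("B1", pubs, k=1, w=4, shift=1, score=score)
print(chunk.top_k_stream, len(chunk.forwarded()))  # ['p4'] 9
```

In this example three of the twelve publications are filtered at the source. Longer chunks amortize the cost of the two unfiltered guards over more filtered windows.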
Evaluation summary: Setup: PADRES implementation; SciNet, a cluster of ~1000 cores. Main metric: throughput reduction, normalized to the late approach. Sensitivity analysis: top-k semantics, workload, etc. Performance analysis: traces from Twitter and Facebook.
Timing sensitivity: Does not scale when mixed.
Offset chunks: A publication that is filtered for S1 but is part of a guard of S2 must be forwarded. A publication can only be filtered if it is not part of any guards. Future work: solve the issue by synchronizing chunks adaptively.
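The filtering rule for offset chunks can be written as a small predicate. The Python sketch below assumes each subscription's chunk is described by the stream positions covered by its guards and the positions selected as top-k. The `ChunkView` type, the positions, and the two-position offset between S1 and S2 are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class ChunkView:
    """One subscription's view of a source stream, by publication position."""
    guards: FrozenSet[int]  # positions inside a left or right guard
    top_k: FrozenSet[int]   # interior positions selected as top-k


def can_filter(pos: int, views: Iterable[ChunkView]) -> bool:
    """A publication is dropped only if no subscription needs it."""
    return all(pos not in v.guards and pos not in v.top_k for v in views)


# S1 and S2 use the same window size, but S2's chunk starts two positions later.
s1 = ChunkView(guards=frozenset(range(0, 4)) | frozenset(range(8, 12)),
               top_k=frozenset({4}))
s2 = ChunkView(guards=frozenset(range(2, 6)) | frozenset(range(10, 14)),
               top_k=frozenset({7}))

print(can_filter(5, [s1]))      # True: position 5 is filtered for S1 alone
print(can_filter(5, [s1, s2]))  # False: position 5 lies in S2's left guard
print(can_filter(6, [s1, s2]))  # True: needed by neither subscription
```

The more the chunks of different subscriptions are offset, the more positions fall inside some guard, which is why synchronizing chunks would recover filtering opportunities.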
Social workload properties: Use of popularity as the scoring function.
Social workload properties: Offset chunks are present: windows are filled at different times. Top-k "cuts" the long tail of the Facebook popularity function: unpopular publications are filtered. Twitter has a wider tail: a wider variety of publications is found in the top-k results.
Conclusions: Top-k support for publish/subscribe, for event-heavy applications (social networks). Efficient top-k distribution: reduce intra-network traffic while maintaining correctness. Proposed hybrid chunking solution: early top-k computation of local windows, late top-k computation of overlapping windows. Evaluation observations: need for chunk synchronization; topic popularity in social networks is beneficial.
Thank you! Questions? padres.msrg.org
Scoring function sensitivity: Same top-k for every subscription: maximizes pruning. Uniform distribution: does not scale, as all publications are selected by at least one subscriber. Zipfian distribution: traffic reduction even at 1000 subscriptions.
Impact of deduplication: The non-buffering solution is even worse. Deduplication is essential. Constant traffic reduction (best-case scalability).
Latency comparison: Similar latency; computation overhead is not considered.