MIDDLEWARE SYSTEMS RESEARCH GROUP
Divide and Conquer Algorithms for Pub/Sub Overlay Design
Chen Chen (1), joint work with Hans-Arno Jacobsen (1,2) and Roman Vitenberg (3)
(1) Department of Electrical and Computer Engineering, University of Toronto
(2) Department of Computer Science, University of Toronto
(3) Department of Informatics, University of Oslo
ICDCS’10, Genoa, Italy

Example: Pub/Sub
[Figure: two subscribers, one with "Interests: boy" and one with "Interests: girl", and publications labeled "boy" and "girl".]

Pub/Sub
A communication paradigm
– Subscribers express their interests
– Publishers disseminate messages
Many applications and industry standards
– Applications: application integration, financial data dissemination, RSS feed distribution, business process management
– Standards: WS-Notification, WS-Eventing, OMG’s Data Distribution Service for Real-Time Systems (DDS)
Topic-based pub/sub
– TIBCO RV
– Google’s GooPS

Two components in pub/sub implementation
Design of routing protocols
– Design protocols so that publications and subscriptions are sent as efficiently as possible across the overlay network.
– G. Li et al., ICDCS’08; M. Castro et al., JSAC’02
Construction of overlay
– Construct the overlay topology such that network traffic is minimized.
– Chockler et al., PODC’07; Onus et al., INFOCOM’09

Desirable properties for overlays
– Low average node degree
– Low fan-out of a node
– Low diameter
– Topic-connectivity
– Efficiency to construct
– Adaptability to churn
– Ease of distributed implementation

Our contributions
Previous algorithm: GM
– High running time cost
– Full knowledge requirement
– Centralized operation (difficult to decentralize)
– No support for dynamic changes
– Construction from scratch only (no support for incremental addition)
Our algorithms
– Low running time cost
– Partial knowledge requirement
– Centralized operation (easy to decentralize)
– No direct support for dynamic changes
– Construction both from scratch and incrementally

Topic-connectivity
An overlay is topic-connected if, for every topic t, the nodes interested in t induce a connected suboverlay G_t.
[Figure: an overlay G over V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}, with its suboverlays G_a and G_b.]
– Suboverlay G_a (nodes V2, V4, V5) is topic-connected
– Suboverlay G_b (nodes V1, V3, V4) is NOT topic-connected
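
The topic-connectivity test can be sketched as a breadth-first search restricted to the nodes interested in one topic. This is a minimal illustration, not code from the talk: the function names and the edge list are hypothetical (the edges of the figure are not recoverable from the slide), and only the node interests follow the example.

```python
from collections import deque

def is_topic_connected(interests, edges, topic):
    """True if the nodes interested in `topic` induce a connected subgraph."""
    members = {v for v, ts in interests.items() if topic in ts}
    if len(members) <= 1:
        return True
    adj = {v: set() for v in members}
    for u, v in edges:
        if u in members and v in members:  # keep only edges inside the suboverlay
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(members))
    seen, queue = {start}, deque([start])
    while queue:
        x = queue.popleft()
        for y in adj[x] - seen:
            seen.add(y)
            queue.append(y)
    return seen == members

def is_tco(interests, edges):
    """True if the overlay is topic-connected for every topic."""
    topics = set().union(*interests.values())
    return all(is_topic_connected(interests, edges, t) for t in topics)

# Node interests as in the slide example; the edge list is an assumption.
interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
edges = [("v2", "v5"), ("v2", "v4"), ("v1", "v5"), ("v1", "v3")]

print(is_topic_connected(interests, edges, "a"))   # True
print(is_topic_connected(interests, edges, "b"))   # False: v4 is cut off from v1, v3
print(is_tco(interests, edges + [("v3", "v4")]))   # True once v3-v4 is added
```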

MinAvg-TCO problem
[Figure: two topic-connected overlays over the same nodes V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}. TCO 1 has 5 edges; TCO 2 has 10 edges.]

MinAvg-TCO problem
A high-quality overlay
– Topic-connectivity
– Total number of edges
Input:
– a set of nodes V,
– a set of topics T,
– the interest function Int
MinAvg-TCO(V,T,Int) (optimization version): construct a TCO(V,T,Int,E) such that |E| is minimum.
Avg-TCO(V,T,Int,k) (decision version): is there a TCO(V,T,Int,E) such that |E| = k?
Theorem: MinAvg-TCO is NP-complete.
[Figure: the example overlay over V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}.]

Greedy-Merge (GM) algorithm
Greedy: always make the choice that looks best at the moment.
GM for MinAvg-TCO: always add an edge with maximum link contribution.
Running time: O(|V|^2 |T|)
Approximation ratio: O(log(|V| |T|))
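
The greedy rule can be sketched as follows. The slide does not define link contribution, so this sketch assumes it is the number of topics shared by the two endpoints for which they still lie in different topic-connected components. The names `DisjointSet` and `greedy_merge` are illustrative, and this naive loop rescans all pairs in every round, so it does not attain the running time quoted above.

```python
from itertools import combinations

class DisjointSet:
    """Union-find over the nodes interested in one topic."""
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def greedy_merge(interests):
    """Build a topic-connected overlay by repeatedly adding a best edge."""
    topics = set().union(*interests.values())
    comps = {t: DisjointSet(v for v, ts in interests.items() if t in ts)
             for t in topics}

    def contribution(u, v):
        # Assumed definition: shared topics whose components this edge would merge.
        return sum(1 for t in interests[u] & interests[v]
                   if comps[t].find(u) != comps[t].find(v))

    edges = []
    pairs = list(combinations(sorted(interests), 2))
    while True:
        best = max(pairs, key=lambda e: contribution(*e), default=None)
        if best is None or contribution(*best) == 0:
            return edges  # no edge merges anything: every topic is connected
        u, v = best
        for t in interests[u] & interests[v]:
            comps[t].union(u, v)
        edges.append(best)

interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
tco = greedy_merge(interests)
print(len(tco))  # 5 edges for this example
```

On the five-node example the first edge chosen is v1-v3, the only pair whose link contribution is 2 (topics b and d).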

TCO join problem
Given p TCOs: TCO_d(V_d, T_d, Int_d, E_d), d = 1, ..., p
MinAvg-TCO-Join(V,T,Int,p) (optimization version): construct a TCO(V,T,Int,E) such that |E| is minimum.
Avg-TCO-Join(V,T,Int,p,k) (decision version): is there a TCO(V,T,Int,E) such that |E| = k?
MinAvg-TCO is a special case of MinAvg-TCO-Join (for example, when every input TCO is a single node with no edges).
Theorem: MinAvg-TCO-Join is NP-complete.

Solving MinAvg-TCO-Join
MinAvg-TCO-Join could be solved by GM, but this is NOT practical:
– Tear down all existing links
– Rebuild the overlay from scratch using GM
It is better to preserve all existing edges and only add edges incrementally.

Bad case for incremental addition of edges
[Figure: TCO_0 is an overlay over V_1, V_2, ..., V_i, ..., V_{n-1}, V_n. A node V_all, interested in all topics in T, is then added. TCO_1 shows the overlay constructed incrementally; TCO_2 shows the overlay constructed from scratch.]

Naive Merge (NM) algorithm
GM algorithm
Input: (V,T,Int)
Output: one TCO
Algorithm:
- Start with an empty edge set;
- Always add an edge with maximum link contribution.
Running time: O(|V|^2 |T|)
NM algorithm
Input: (V_d, T_d, Int_d, E_d), d = 1, ..., p
Output: one TCO
Algorithm:
- Start with existing internal-TCO links;
- Always add a cross-TCO edge with maximum link contribution.
Running time: [expression not preserved]
NM is based on the same greedy heuristic as GM.
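
A minimal sketch of NM under the same assumed definition of link contribution (shared topics whose components the edge would merge): the existing edges are kept, and only pairs whose endpoints belong to different input TCOs are candidates. The function names and the split of the five example nodes into two sub-TCOs are hypothetical.

```python
from itertools import combinations

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def naive_merge(interests, tcos):
    """tcos: list of (nodes, edges); each is assumed to be a valid TCO already."""
    topics = set().union(*interests.values())
    parent = {t: {v: v for v in interests if t in interests[v]} for t in topics}
    owner = {v: d for d, (nodes, _) in enumerate(tcos) for v in nodes}
    edges = [e for _, es in tcos for e in es]

    def link(u, v):
        for t in interests[u] & interests[v]:
            parent[t][find(parent[t], u)] = find(parent[t], v)

    for u, v in edges:  # start with the existing internal-TCO links
        link(u, v)

    def contribution(e):
        u, v = e
        return sum(find(parent[t], u) != find(parent[t], v)
                   for t in interests[u] & interests[v])

    # Candidates: every pair of nodes that sit in different input TCOs.
    cross = [(u, v) for u, v in combinations(sorted(interests), 2)
             if owner[u] != owner[v]]
    added = []
    while cross:
        best = max(cross, key=contribution)
        if contribution(best) == 0:
            break
        link(*best)
        added.append(best)
    return edges + added

interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
tcos = [(["v1", "v3"], [("v1", "v3")]),
        (["v2", "v4", "v5"], [("v2", "v4"), ("v2", "v5")])]
merged = naive_merge(interests, tcos)
print(merged[3:])  # cross-TCO edges added: [('v1', 'v4'), ('v1', 'v5')]
```

All six cross-TCO pairs are scanned in every round here, which is the cost that the star-set idea later reduces.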

Example of NM
[Figure: NM joining sub-TCOs over nodes V0 to V14, whose interests are drawn from topics {a,b,c,d}.]
Still a prohibitively high running time!

Star set
Given a TCO (V,T,Int,E), a star set S is a subset of V that covers all of V’s topics.
[Figure: a topic-connected overlay over V1 {b,c,d}, V2 {a}, V3 {b,d}, V4 {a,b}, V5 {a,c}.]
– {v3, v5} is a star set: it covers all topics {a,b,c,d}
– {v2, v3, v4} is not a star set: it only covers {a,b,d}

Star set
Star set nodes
– Represent the interests of all the nodes
– Can function as bridges to determine cross-TCO links
Observation: minimal star sets tend to be substantially smaller than the total number of nodes.
How to find a minimum star set S* for (V,T,Int)?
– Equivalent to the classic set cover problem: NP-complete
– Can be approximated within a logarithmic approximation ratio
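
One standard way to obtain the logarithmic ratio is the greedy set cover heuristic: repeatedly pick the node that covers the most still-uncovered topics. The slide does not say which approximation the authors use, so the sketch below is an assumption; `greedy_star_set` is a hypothetical name, and ties are broken by node name.

```python
def greedy_star_set(interests):
    """Greedy set cover: returns a list of nodes whose interests cover all topics."""
    uncovered = set().union(*interests.values())
    star = []
    while uncovered:
        # Pick the node covering the most uncovered topics (first in name order on ties).
        best = max(sorted(interests), key=lambda v: len(interests[v] & uncovered))
        star.append(best)
        uncovered -= interests[best]
    return star

interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
star = greedy_star_set(interests)
print(star)  # ['v1', 'v2']
```

For the five-node example this yields a star set of size 2, the same size as the star set {v3, v5} shown on the previous slide.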

Star Merge (SM) algorithm
NM algorithm
Input: (V_d, T_d, Int_d, E_d), d = 1, ..., p
Output: one TCO
Algorithm:
- Start with existing internal-TCO links;
- // Do nothing;
- Always add a cross-TCO edge with maximum link contribution.
SM algorithm
Input: (V_d, T_d, Int_d, E_d), d = 1, ..., p
Output: one TCO
Algorithm:
- Start with existing internal-TCO links;
- Find a star set for each sub-TCO;
- Always add a cross-Star edge with maximum link contribution.
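
The three SM steps can be sketched as below. As before, link contribution is assumed to count the shared topics whose components an edge would merge, the star sets come from a greedy set cover (an assumption), and all names and the two example sub-TCOs are hypothetical.

```python
from itertools import combinations

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def star_set(interests, nodes):
    """Greedy set cover over the topics of one sub-TCO."""
    uncovered = set().union(*(interests[v] for v in nodes))
    star = []
    while uncovered:
        best = max(sorted(nodes), key=lambda v: len(interests[v] & uncovered))
        star.append(best)
        uncovered -= interests[best]
    return star

def star_merge(interests, tcos):
    """tcos: list of (nodes, edges); each is assumed to be a valid TCO already."""
    topics = set().union(*interests.values())
    parent = {t: {v: v for v in interests if t in interests[v]} for t in topics}
    edges = [e for _, es in tcos for e in es]

    def link(u, v):
        for t in interests[u] & interests[v]:
            parent[t][find(parent[t], u)] = find(parent[t], v)

    for u, v in edges:  # step 1: keep the existing internal-TCO links
        link(u, v)

    stars = [star_set(interests, nodes) for nodes, _ in tcos]  # step 2
    # Step 3 candidates: only pairs of star nodes from different sub-TCOs.
    cands = [(u, v) for a, b in combinations(stars, 2) for u in a for v in b]

    def contribution(e):
        u, v = e
        return sum(find(parent[t], u) != find(parent[t], v)
                   for t in interests[u] & interests[v])

    added = []
    while cands:
        best = max(cands, key=contribution)
        if contribution(best) == 0:
            break
        link(*best)
        added.append(best)
    return edges + added

interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
tcos = [(["v1", "v3"], [("v1", "v3")]),
        (["v2", "v4", "v5"], [("v2", "v4"), ("v2", "v5")])]
merged = star_merge(interests, tcos)
print(merged[3:])  # [('v1', 'v4'), ('v1', 'v5')]
```

In this small case the star sets are [v1] and [v4, v5], so only 2 candidate pairs are examined per round instead of the 6 cross-TCO pairs NM would scan. The result stays topic-connected because every topic of a sub-TCO is held by one of its star nodes, and each sub-TCO is already topic-connected internally.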

Example of SM
[Figure: SM joining sub-TCOs over nodes V0 to V14, whose interests are drawn from topics {a,b,c,d}.]
Running time is largely improved because #stars << #nodes in most cases.

Divide and Conquer (DC) for MinAvg-TCO
The number of nodes is a dominant factor for the running time of the GM algorithm.
Divide-and-conquer
– Divide the MinAvg-TCO problem into several sub-overlay construction problems
– Conquer the sub-MinAvg-TCO problems independently and build sub-overlays into sub-TCOs
– Combine these sub-TCOs into one TCO

Design of DC algorithm
How to divide the node set V:
– Node clustering vs. random partitioning
– The number of partitions p
The balance between conquer and combine:
– p = 1 (single partition): conquer only = GM
– p = |V| (each node is a partition): combine only = GM
How to decentralize DC:
– Note that the DC algorithm as presented is fully centralized.
– However, it is possible to decentralize it.
Theoretical analysis: not straightforward.

Example of DC
[Figure: DC applied to nodes V0 to V14, whose interests are drawn from topics {a,b,c,d}.]
- Divide the overlay based on V
- Conquer each sub-TCO by GM
- Combine the sub-TCOs into one TCO by SM
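
The divide, conquer, and combine steps can be put together in one short sketch. It assumes random partitioning into p roughly equal parts, the same assumed link contribution as in GM (shared topics whose components an edge would merge), and greedy set cover for the star sets. The names `greedy_add`, `star_set`, and `divide_and_conquer`, and the `seed` parameter, are hypothetical.

```python
import random
from itertools import combinations

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def greedy_add(interests, parent, candidates):
    """Repeatedly add the candidate edge with maximum link contribution."""
    def gain(e):
        u, v = e
        return sum(find(parent[t], u) != find(parent[t], v)
                   for t in interests[u] & interests[v])
    added = []
    while candidates:
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break
        u, v = best
        for t in interests[u] & interests[v]:
            parent[t][find(parent[t], u)] = find(parent[t], v)
        added.append(best)
    return added

def star_set(interests, nodes):
    """Greedy set cover over the topics of one partition."""
    uncovered = set().union(*(interests[v] for v in nodes))
    star = []
    while uncovered:
        best = max(sorted(nodes), key=lambda v: len(interests[v] & uncovered))
        star.append(best)
        uncovered -= interests[best]
    return star

def divide_and_conquer(interests, p, seed=0):
    nodes = sorted(interests)
    random.Random(seed).shuffle(nodes)
    parts = [nodes[i::p] for i in range(p)]        # divide: random partitioning
    topics = set().union(*interests.values())
    parent = {t: {v: v for v in nodes if t in interests[v]} for t in topics}
    edges = []
    for part in parts:                              # conquer: GM inside each part
        edges += greedy_add(interests, parent,
                            list(combinations(sorted(part), 2)))
    stars = [star_set(interests, part) for part in parts if part]
    cross = [(u, v) for a, b in combinations(stars, 2) for u in a for v in b]
    edges += greedy_add(interests, parent, cross)   # combine: SM on star nodes
    return edges

interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
print(divide_and_conquer(interests, p=1))  # single partition: same as plain GM
print(divide_and_conquer(interests, p=2))  # edges depend on the random split
```

With p = 1 the combine step has nothing to do and the sketch reduces to GM. Larger p shifts work from the conquer step to the combine step.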

Experiment setting
– Number of nodes: |V| = 1000, ranging from 1000 to 8000
– Number of topics: |T| = 100, ranging from 100 to 1000
– Number of topics subscribed to by a node: NodeIntSize = 20, ranging from 10 to 100
– Topic distribution: uniform, Zipf, exponential

Experiment design
Evaluation: average node degree, running time
– Star Merge for MinAvg-TCO-Join
– DC for MinAvg-TCO
  – Random node partitioning
  – The effects of the number of nodes
  – The effects of the number of topics
  – The effects of the average subscription size of a node
  – Comparison with RingPT
RingPT is an algorithm that mimics the common practice of building a separate overlay for each topic.
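
As a rough illustration of the RingPT baseline and of the average node degree metric, the sketch below assumes that RingPT links the subscribers of each topic into a ring and takes the union of all rings. The slide only says that RingPT builds a separate overlay per topic, so the ring construction, the ring order (by node name), and the function names are assumptions.

```python
def ring_per_topic(interests):
    """Union of one ring per topic over that topic's subscribers (assumed RingPT)."""
    edges = set()
    topics = set().union(*interests.values())
    for t in sorted(topics):
        ring = sorted(v for v in interests if t in interests[v])
        for i, u in enumerate(ring):
            v = ring[(i + 1) % len(ring)]
            if u != v:  # a single subscriber needs no edge
                edges.add((min(u, v), max(u, v)))  # normalize to avoid duplicates
    return sorted(edges)

def average_degree(edges, num_nodes):
    """Average node degree: each edge contributes to two node degrees."""
    return 2 * len(edges) / num_nodes

interests = {"v1": {"b", "c", "d"}, "v2": {"a"}, "v3": {"b", "d"},
             "v4": {"a", "b"}, "v5": {"a", "c"}}
rings = ring_per_topic(interests)
print(len(rings), average_degree(rings, len(interests)))  # 7 2.8
```

On the five-node example the per-topic rings need 7 edges, while the slide reports a TCO with 5 edges for the same nodes.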

Star Merge: SM vs NM vs GM
[Plot not preserved.]

Divide-and-conquer: the effect of the number of nodes
[Plot not preserved.]

Divide-and-conquer: DC vs GM vs RingPT
[Plot not preserved.]

Algorithm summary
Algorithm | Running time | Quality of overlay: #edges (avg node degree) | Required information | Potential to decentralize
RingPT | good | poor | full knowledge | good
GM | poor: O(|V|^2 |T|) | good: O(log(|V| |T|)) | full knowledge | poor
NM | poor: 75% of GM | good | full knowledge | good
SM | good: 1.0% of GM | good: ≤ 0.15 compared to GM | partial knowledge | good
DC | good: 1.7% of GM | good: ≤ 2.12 compared to GM | partial knowledge | good

Minimal Number of Links
A typical pub/sub system combines a number of protocols, many of which maintain per-link state:
– A node must constantly monitor the availability of each of its neighbors (heartbeats and keep-alive state)
– If the links are maintained using TCP, there is the cost of connection state for each link
– The more links there are, the fewer topics can be routed over each individual link, thereby diminishing cross-topic aggregation benefits
– If a sequential-diff-based compression scheme is used, there is an extra cost associated with a history table