Achieving fast (approximate) event matching in large-scale content-based publish/subscribe networks. Yaxiong Zhao and Jie Wu.

Presentation transcript:

Achieving fast (approximate) event matching in large-scale content-based publish/subscribe networks
Yaxiong Zhao and Jie Wu
The speaker will be graduating next summer

Outline
Background
  – Content-based pub/sub networks
  – Counting algorithm
The design
  – Problem formulation
  – Index tree based discretization
Potential improvements
Performance evaluation

Content-based pub/sub networks
Pub/sub provides hassle-free messaging between users
Content-based pub/sub (CBPS) provides yet more expressiveness
Content representation model: how to describe contents?
Boolean expression model
Attribute constraint: a primitive description of the constraints on an attribute's value
A subscription/filter is defined as a conjunction of multiple attribute constraints
Each message has multiple attribute assignments
Subscription: conference = ICDCS ∧ keywords ∈ {content, pub, sub}
Message: Title = {xyz} ∧ conference = ICDCS ∧ keywords = {content}
(Figure labels: Subscription, Message, Pub/sub routing)

Preliminary: counting algorithm
Since each subscription is a conjunctive form of multiple attribute constraints
  – A message matches a subscription iff it matches all attribute constraints of the subscription
  – The counting algorithm works by counting the number of matched attribute constraints
  – The message matches if that number equals the total number of attribute constraints of the subscription
Message: Title = {xyz} ∧ conference = ICDCS ∧ keywords = {content}
Subscription: conference = ICDCS ∧ keywords ∈ {content, pub, sub}
The message matches 2 attribute constraints, so it matches the subscription
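The counting rule can be sketched in a few lines of Python using the slide's example. The representation (a dict of predicates per subscription, a dict of values per message) and the names `count_matches` and `matches` are assumptions made for illustration, not the authors' data structures.

```python
from typing import Callable, Dict

# A subscription maps attribute name -> predicate over the attribute's value.
Subscription = Dict[str, Callable[[object], bool]]


def count_matches(sub: Subscription, message: Dict[str, object]) -> int:
    """Count how many attribute constraints of `sub` the message satisfies."""
    return sum(
        1
        for attr, predicate in sub.items()
        if attr in message and predicate(message[attr])
    )


def matches(sub: Subscription, message: Dict[str, object]) -> bool:
    """A message matches iff every constraint of the subscription is satisfied."""
    return count_matches(sub, message) == len(sub)


# Example from the slide: the subscription has two attribute constraints.
subscription: Subscription = {
    "conference": lambda v: v == "ICDCS",
    "keywords": lambda v: v in {"content", "pub", "sub"},
}
# Assumption: the message's keywords value is modelled as a single string.
message = {"title": "xyz", "conference": "ICDCS", "keywords": "content"}
```

The extra `title` assignment in the message is ignored because the subscription places no constraint on it.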

Dissecting the counting algorithm
The algorithm goes through two steps
  – Retrieving matched attribute constraints
    – The most time-consuming stage
    – The focus of this paper
  – Comparing the number of matched constraints with the number of constraints for each subscription and returning the matched subscriptions
    – It's possible to shortcut this process

Optimizing the retrieval stage
The naïve approach, i.e. examining all attribute constraints, only works for a very small number of constraints
  – A few thousand
More intelligent techniques
  – Binary search to eliminate unmatched constraints
    – SIENA (Sigcomm'03)
  – Clustering possibly-matched constraints
    – Faster matching (Sigmod'01)

Range-based attribute constraints
Range-based attribute constraints represent a range that an attribute's value should be in
  – Can be seen as a conjunction of two primitive attribute constraints
General enough to replace primitive attribute constraints
  – A primitive constraint can be translated to a range-based constraint
  – Range-based constraints are highly desirable
Example: Height > 100 & Height < 200 ⇒ Height ∈ (100, 200)
(Figure labels: Height; Height > 0 & Height < 200)
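As a rough illustration of the translation described on this slide, the sketch below turns primitive comparisons into open intervals and intersects them. The helper names `to_range` and `intersect`, and the choice to support only `>` and `<`, are assumptions for the example.

```python
import math
from typing import Tuple

Range = Tuple[float, float]  # open interval (low, high)


def to_range(op: str, bound: float) -> Range:
    """Translate a primitive constraint such as `Height > 100` into a range."""
    if op == ">":
        return (bound, math.inf)
    if op == "<":
        return (-math.inf, bound)
    raise ValueError(f"unsupported operator: {op}")


def intersect(a: Range, b: Range) -> Range:
    """Conjunction of two range constraints on the same attribute."""
    return (max(a[0], b[0]), min(a[1], b[1]))


# Height > 100 & Height < 200 becomes Height in (100, 200).
height = intersect(to_range(">", 100), to_range("<", 200))
```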

Basic idea
Reverse indexing of subscriptions
  – Instead of checking each subscription to see which assignments match it
  – Directly retrieve the matched attribute constraints for each attribute assignment
We need a data structure that maps between attribute values and attribute constraints

Reverse indexing
Indexing subscriptions through their represented ranges for each attribute
  – [100 < height < 200]
  – [100, 200] → subscription 1
  – Given an assignment "height = 150", we can find all its matched subscriptions by finding the ranges containing this value

Discretization
Represent an arbitrary range by evenly separated cells
  – Mapping between a value and its corresponding cell is fast
Results in false positives/negatives (approximate matching)
  – We only accept false positives
  – Guarantees user satisfaction (no matching message is missed)
Example: cells [0, 1, 2] are used to represent this attribute constraint
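A minimal sketch of flat discretization, assuming half-open ranges `[lo, hi)` and cells of a fixed width `cell_len` numbered from 0. The function names and the numeric example are hypothetical; the example is chosen so that the cover is `[0, 1, 2]`, like the one on the slide.

```python
import math
from typing import List


def discretize(lo: float, hi: float, cell_len: float) -> List[int]:
    """IDs of the equal-width cells that cover the half-open range [lo, hi).

    Rounding outward makes the cells cover the whole range, so any error
    is a false positive and never a false negative.
    """
    first = math.floor(lo / cell_len)
    last = math.ceil(hi / cell_len) - 1
    return list(range(first, last + 1))


def cell_of(value: float, cell_len: float) -> int:
    """Constant-time mapping from an attribute value to its cell ID."""
    return math.floor(value / cell_len)


# A range [10, 250) with cells of width 100 is represented by cells 0, 1, 2.
cells = discretize(10, 250, 100)
```

A value of 270 lies outside the range but falls into cell 2, so it is reported as a match: this is the kind of false positive the slide accepts.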

Index tree and subscription discretization
The naïve discretization has a scalability issue
  – The worst-case false positive rate is 2 * cell-length / range-length
  – To achieve a false positive rate of p
    – The number of cells is 1/p
An analogy is counting numbers
  – Need exactly "n" tokens to represent the number "n"
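The bound can be read as follows: a cover that is rounded outward overhangs the true range by less than one cell at each end, hence at most two cell lengths in total. The sketch below just evaluates that bound; taking it literally gives 2/p cells per range, which is the same order as the 1/p quoted on the slide. Function names are assumptions.

```python
import math


def worst_case_false_positive(cell_len: float, range_len: float) -> float:
    """Upper bound quoted on the slide: 2 * cell-length / range-length."""
    return 2 * cell_len / range_len


def cells_per_range(p: float) -> int:
    """Cells needed per range so that the bound above is at most p."""
    return math.ceil(2 / p)
```

Halving the target false positive rate doubles the number of cells stored per constraint, which is the scalability issue the slide points out.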

A very simple remedy
Just like counting in positional notation
Binary separation of the attribute value space
  – Each cell is evenly divided into two in the next level
  – log(1/p) cells
  – Much better scalability
Example: {level 2: 1; level 3: 1, 4; level 4: 1, 10}
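One way to realize the multi-level representation is a dyadic decomposition, sketched below under these assumptions: the value space is normalized to `[0, 1)`, level k has 2**k cells with IDs starting at 0, and ranges are half-open. Cells that overhang the range are kept at the last level only, so coarser cells lie entirely inside the range. The function name `dyadic_cover` is hypothetical and this is not necessarily the authors' exact procedure, although with these conventions the range `[0.0625, 0.6875)` reproduces the slide's example.

```python
import math
from collections import defaultdict
from typing import Dict, List


def dyadic_cover(lo: float, hi: float, max_level: int) -> Dict[int, List[int]]:
    """Cover [lo, hi), with 0 <= lo < hi <= 1, by cells from several levels."""
    n = 1 << max_level                      # number of last-level cells
    outer_left, outer_right = math.floor(lo * n), math.ceil(hi * n)
    inner_left, inner_right = math.ceil(lo * n), math.floor(hi * n)
    cover: Dict[int, List[int]] = defaultdict(list)

    # Boundary cells overhang the range; keep them at the last level.
    for cell in {outer_left, outer_right - 1}:
        if cell < inner_left or cell >= inner_right:
            cover[max_level].append(cell)

    # The exactly covered part [inner_left, inner_right) is merged bottom-up,
    # emitting at most two cells per level.
    left, right, level = inner_left, inner_right, max_level
    while left < right:
        if left & 1:                        # right child: cannot merge upward
            cover[level].append(left)
            left += 1
        if right & 1:                       # cell right - 1 is a left child
            right -= 1
            cover[level].append(right)
        left, right, level = left >> 1, right >> 1, level - 1
    return {lvl: sorted(ids) for lvl, ids in sorted(cover.items())}
```

Because each level contributes at most two merged cells (plus at most two boundary cells), the number of cells grows with the number of levels rather than with the number of last-level cells.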

Working with counting algorithm
Each range attribute constraint is discretized on multiple levels
  – Each cell ID is associated with a subscription ID
Retrieval stage
  – Table lookup
Counting stage
  – Increment the counters for subscription IDs
  – Compare the counters' values to the number of attribute constraints of each subscription
(Figure labels: attributes, levels, cell IDs, subscription IDs)
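The two stages can be sketched as a table keyed by (attribute, level, cell ID). This is an illustrative layout, not the prototype's: the class name `MatchingTable`, the fixed `MAX_LEVEL`, and the normalization of values to `[0, 1)` are assumptions. Each constraint is passed in already discretized into disjoint cells, so a value can hit a given constraint at most once.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

MAX_LEVEL = 4  # assumed depth of the index tree
Cell = Tuple[int, int]  # (level, cell ID)


class MatchingTable:
    def __init__(self) -> None:
        # (attribute, level, cell ID) -> subscription IDs
        self.table: Dict[Tuple[str, int, int], List[int]] = defaultdict(list)
        # subscription ID -> number of attribute constraints
        self.constraint_count: Dict[int, int] = {}

    def add(self, sub_id: int, cells_by_attr: Dict[str, List[Cell]]) -> None:
        """Register a subscription given the cells of each of its constraints."""
        self.constraint_count[sub_id] = len(cells_by_attr)
        for attr, cells in cells_by_attr.items():
            for level, cell in cells:
                self.table[(attr, level, cell)].append(sub_id)

    def match(self, event: Dict[str, float]) -> Set[int]:
        counters: Dict[int, int] = defaultdict(int)
        for attr, value in event.items():
            for level in range(MAX_LEVEL + 1):
                cell = math.floor(value * 2**level)      # retrieval: table lookup
                for sub_id in self.table.get((attr, level, cell), ()):
                    counters[sub_id] += 1                # counting stage
        return {s for s, c in counters.items() if c == self.constraint_count[s]}


table = MatchingTable()
# Subscription 1: height in [0.25, 0.5) and weight in [0, 0.5).
table.add(1, {"height": [(2, 1)], "weight": [(1, 0)]})
# Subscription 2: height in [0.5, 0.75).
table.add(2, {"height": [(2, 2)]})
```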

Implementation
Data structure organization
  – Matching table: maps discretized cell IDs to subscription IDs
  – Subscription ID to attribute constraint count mapping
    – Linear table
More details are in the paper

Dynamic matching for shortcutting the matching process
In many situations we do not need to find all matched subscriptions
  – Interface matching
    – Stop matching once any one of the associated subscriptions matches
    – Just examine a fraction of the discretization levels
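A possible shape for the shortcut, sketched under assumptions the slide does not spell out: levels are scanned from coarse to fine, subscriptions whose interface is already matched are skipped, and the scan stops as soon as every interface has a match. The table layout and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def match_interfaces(
    table: Dict[Tuple[str, int, int], List[int]],
    constraint_count: Dict[int, int],
    interface_of: Dict[int, str],
    event: Dict[str, float],
    max_level: int,
) -> Tuple[Set[str], int]:
    """Return the matched interfaces and the number of levels examined."""
    counters: Dict[int, int] = defaultdict(int)
    matched: Set[str] = set()
    all_interfaces = set(interface_of.values())
    levels_examined = 0
    for level in range(max_level + 1):              # coarse to fine
        levels_examined += 1
        for attr, value in event.items():
            cell = int(value * 2**level)            # values assumed in [0, 1)
            for sub_id in table.get((attr, level, cell), ()):
                iface = interface_of[sub_id]
                if iface in matched:
                    continue                        # interface already decided
                counters[sub_id] += 1
                if counters[sub_id] == constraint_count[sub_id]:
                    matched.add(iface)
        if matched == all_interfaces:
            break                                   # skip the finer levels
    return matched, levels_examined


# Two single-constraint subscriptions on the same interface.
table = {("height", 2, 1): [1], ("height", 1, 0): [2]}
result = match_interfaces(
    table, {1: 1, 2: 1}, {1: "if0", 2: "if0"}, {"height": 0.3}, max_level=4
)
```

In this example the interface is decided at level 1, so only two of the five levels are examined.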

Optimal binary separation
The above analysis assumes a uniform distribution of the attribute values of events
The analysis holds for non-uniform distributions only when
  – Event values are evenly distributed over the cells
  – I.e. the number of events that fall into each cell is the same
Optimal binary separation does this
  – Bisect a range at its median
  – Ensures that each cell contains the same number of events
(Figure annotation: if 90% of events' attribute values fall into here)
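Median bisection can be sketched with a sample of observed event values. The sketch assumes such a sample is available and that repeated bisection at the median is approximated by cutting the sorted sample at equally spaced quantiles; `median_boundaries` and `cell_of` are hypothetical names.

```python
import bisect
from collections import Counter
from typing import List, Sequence


def median_boundaries(samples: Sequence[float], depth: int) -> List[float]:
    """Boundaries of the 2**depth cells obtained by repeated median bisection.

    Bisecting at the median `depth` times is equivalent (up to rounding) to
    cutting the sorted samples at the k / 2**depth quantiles.
    """
    data = sorted(samples)
    n_cells = 2 ** depth
    return [data[(k * len(data)) // n_cells] for k in range(1, n_cells)]


def cell_of(value: float, boundaries: List[float]) -> int:
    """Cell ID of a value: the number of boundaries less than or equal to it."""
    return bisect.bisect_right(boundaries, value)


# Skewed sample: 90% of the values fall into [0, 90), the rest near 900.
samples = list(range(90)) + list(range(900, 910))
boundaries = median_boundaries(samples, depth=2)
histogram = Counter(cell_of(v, boundaries) for v in samples)
```

Even though the sample is heavily skewed, each of the four cells ends up holding the same number of values.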

Eliminating false positives
Two types: interface false positives and subscription false positives
  – Matches a wrong interface
  – Matches a wrong subscription
Situations in which no false positive occurs
  – Interfaces/subscriptions matched before the last discretization level
  – Can be used to shortcut interface matching
To eliminate false positives
  – Double-check the subscriptions that are matched at the last discretization level
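The double check might look like the sketch below. It assumes that the matcher records, for each candidate subscription, whether any of its hits came from the last discretization level, and that the exact half-open ranges of every subscription are kept alongside the table. Both assumptions, and the name `confirm_matches`, are illustrative.

```python
from typing import Dict, Set, Tuple

Ranges = Dict[str, Tuple[float, float]]  # attribute -> half-open range [lo, hi)


def confirm_matches(
    candidates: Dict[int, bool],
    exact_ranges: Dict[int, Ranges],
    event: Dict[str, float],
) -> Set[int]:
    """Keep only true matches.

    candidates maps subscription ID -> True if a hit came from the last level.
    """
    confirmed: Set[int] = set()
    for sub_id, hit_last_level in candidates.items():
        if hit_last_level:
            # Only last-level hits can be false positives: re-check exactly.
            ranges = exact_ranges[sub_id]
            if not all(lo <= event[a] < hi for a, (lo, hi) in ranges.items()):
                continue
        confirmed.add(sub_id)
    return confirmed


exact_ranges = {1: {"height": (0.0625, 0.7)}, 2: {"height": (0.25, 0.5)}}
```

Subscriptions matched only through coarser cells skip the exact comparison, so the extra cost is limited to the candidates that touched the last level.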

Experiment settings
A working prototype written in C++
  – As a forwarding component in Siena
  – Total number of attributes is 1,000
  – Number of attributes per event: 100 to 1,000
  – Relative width of attribute constraints: 0.01 to 0.1

Subscription matching time (degenerate case)
Width of attribute constraints: 0.01 (relative to the entire value space)
10 range constraints per subscription
Each event has 100 attribute assignments
3.23 ms to return all matched subscriptions for an event with 20 million attribute constraints
  – Orders of magnitude faster than Siena

Interface matching speed with vs. without shortcutting
Fix the total number of subscriptions to 20,000
Vary the number of subscriptions (filters) per interface
Present the changes in interface matching with two different widths of attribute constraints
(Figure axis: relative value of the width of attribute constraints)

Subscription matching FPR
Stores 20,000 subscriptions
Vary the width of attribute constraints and the number of attributes per subscription
10 8 events
Feed 10,000 events to the matching table

Interface matching FPR
20,000 subscriptions
2,000 subscriptions per interface
Change the width of range constraints and the number of attributes per subscription

Q&A
Thank you for listening!
Drop me an email if your questions are not answered here