MIDDLEWARE SYSTEMS RESEARCH GROUP msrg.org Predictive Publish/Subscribe Matching Joint work with Vinod Muthusamy & Haifeng Liu University of Toronto P-ToPSS project Hans-Arno Jacobsen
Little Anecdote
Date: Mon, 14 Sep … 10:37:
From: …
To: …
Cc: … CNS Security Admin
Subject: DDoS attack originating from …
/var/log/secure* & LogWatch
aaron/password from : …
abdullah/password from :
abraham/password from :
abram/password from :
account/password from :
account/password from :
adam/password from :
addison/password from :
aditya/password from :
admin/password from : 18 Time(s)
admin/password from : 18 Time(s)
administrator/password from : 3 Time(s)
administrator/password from : 3 Time(s)
jacobsen/password from : 2 Time(s)
And So It Happened: Post-mortem forensics via events across different logs
… denied
John successful timestamp
…
John logoff timestamp
…
John successful timestamp
…
John password changed
Had set user john with password john!
Predictive Analytics?
Series of failed login attempts from same IP
– System is under attack
Series of failed login attempts from same IP, followed by successful login from that IP, followed by immediate logoff
– System compromised
Could we predict, with a certain probability, that the system is going to be compromised soon, after observing a partial match of the above pattern?
– E.g., "failed logins from IP, successful login from IP"
Compromised?
Events, Subscriptions & Publish/Subscribe
Here, events are
– Login attempts, logoff, system compromised
Here, subscriptions are
– Specific patterns of interest
   Series of login attempts from same IP
   Series of login attempts from same IP, followed by logoff
The publish/subscribe system is the abstraction that matches subscriptions based on events observed
A match detects the event, e.g., system compromised
Outline
Predictive Toronto Publish/Subscribe System
Event & subscription language model
Matching with P-ToPSS
Predicting with P-ToPSS
Evaluation
P-ToPSS is the Latest ToPSS Member
For many applications, raising an alert after a malicious activity has occurred is too late
– Credit card fraud (fraud committed)
– Network intrusion (system compromised)
– Problem determination (problem occurred)
– Root-cause analysis (system crashed, poor user experience)
The capability to predict the probability that a given subscription will match in the future is needed.
P-ToPSS computes the probability that a subscription will match, based on the event history and on the partial matches observed so far.
P-ToPSS Model
[Figure: the Match Engine reports full matches ("cs2 is matched", "cs1 is fully matched") and predictions ("cs1 will be matched with probability 0.5", "cs4 will be matched with probability 0.75", "cs1 will be matched with probability 0.8")]
Publish/Subscribe matching problem
– Find all matches
Publish/Subscribe prediction problem
– Find partial matches
– Determine subscriptions with matching probability > threshold
Event Model
An event: e = {(a1, v1), (a2, v2), …, (an, vn)}
Event stream: {e1, e2, …, ek, …}
Events are ordered (system timestamps)
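A minimal sketch of this event model, assuming Python as the illustration language. The names `Event`, `make_event` and `ordered_stream` are hypothetical and not part of P-ToPSS; the IP address is a documentation placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """An event: a set of (attribute, value) pairs plus a system timestamp."""
    timestamp: float
    attrs: tuple  # sorted tuple of (attribute, value) pairs

    def get(self, attribute, default=None):
        # Look up the value of one attribute, or return the default.
        return dict(self.attrs).get(attribute, default)


def make_event(timestamp, **attrs):
    # Build an event from keyword arguments, e.g. ip=..., login=...
    return Event(timestamp, tuple(sorted(attrs.items())))


def ordered_stream(events):
    # The event stream is ordered by system timestamp.
    return sorted(events, key=lambda e: e.timestamp)


stream = ordered_stream([
    make_event(2.0, ip="192.0.2.7", login="success"),
    make_event(1.0, ip="192.0.2.7", login="denied"),
])
```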
Subscription Language Model
Primitive subscriptions
– S = p1 ∧ p2 ∧ p3 ∧ …
– pi is a Boolean predicate
Composite subscriptions
– CS = R(S1, S2, S3, …, Sm)
R: Operators
– Temporal operators:
   , : contiguous sequence (contiguous event sequence; no event can be skipped)
   ; : non-contiguous sequence (non-contiguous event sequence; events can be skipped)
– Boolean operators:
   ∧ : conjunction
   ∨ : disjunction
Example
s1: ip=$x login=denied
s2: ip=$x login=denied
s3: ip=$x login=success
s4: ip=$x login=success
s5: ip=$x action=passwd
s6: ip=$x action=logoff
cs_intrusion = s1 ; ( ( s2 ; s3 -t2 < d) ) (s4, s5) ) ; s6
cs_intrusion matched by {e0, e1}, e2, e3, e4
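A hedged sketch of how a primitive subscription such as s1 or s3 could be checked against a single event. It assumes that `$x` is a variable that binds on first use and must take the same value afterwards (matching the "same IP" requirement), and that all predicates are equality tests. The function `match_primitive` and the dictionary encoding are illustrative, not the P-ToPSS matcher.

```python
def match_primitive(subscription, event, bindings=None):
    """Return the updated variable bindings if every predicate holds, else None.

    subscription: dict attribute -> required value; a value starting with "$"
    is treated as a variable (an assumption made for this sketch).
    event: dict attribute -> value.
    """
    bindings = dict(bindings or {})
    for attribute, required in subscription.items():
        if attribute not in event:
            return None
        actual = event[attribute]
        if isinstance(required, str) and required.startswith("$"):
            # Bind the variable on first use; later uses must agree.
            if bindings.setdefault(required, actual) != actual:
                return None
        elif actual != required:
            return None
    return bindings


s1 = {"ip": "$x", "login": "denied"}
s3 = {"ip": "$x", "login": "success"}

bound = match_primitive(s1, {"ip": "192.0.2.7", "login": "denied"})
same_ip = match_primitive(s3, {"ip": "192.0.2.7", "login": "success"}, bound)
other_ip = match_primitive(s3, {"ip": "192.0.2.9", "login": "success"}, bound)
```

Here `same_ip` keeps the binding for `$x`, while `other_ip` is `None` because the successful login comes from a different address.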
Problem Statement
Matching Problem
– Given a set of composite subscriptions, CS, and an event stream, {e_i}, find all cs = R(s1, s2, …, sn) such that there exists {e_j1, e_j2, …, e_jn} ⊆ {e_i} where e_j1 matches s1, …, e_jn matches sn, subject to R, and all time constraints are satisfied.
Prediction Problem
– Find all partially matched cs such that Pr_cs(full match | partial) > θ_cs
Required Matching Tasks
Composite subscription: s1 ; ( (s2 ; s3 -t1 < d) ) (s4, s5) ) ; s6
Primitive subscriptions, like si, matching single events (i.e., sets of attribute-value pairs)
Sequences of primitive subscriptions matching consecutive and non-consecutive events in the input
Boolean expressions, like the one combining term1 and term2 above, matching higher-level patterns of events
Computation of probabilities to predict full matches given partial matches
Matching Engine
[Figure: architecture of the matching engine. Components: Primitive Subscriptions Matcher, State Machine Engine, Boolean Expression Tree Matcher, Prediction Engine. Labeled data flows: event stream, primitive subscription matches, derived events, partial matches, full matches, predictions (subscription, matching probability > θ_S). The composite subscription from the previous slide is annotated with term1, term2, s2;s3 and s3.]
Algorithms for Matching Tasks
Primitive Subscription Matcher
– BDD-based approach (our ICDCS’05 algorithm)
– Alternatively, our SIGMOD’01 algorithm or our new indEX (fastest Boolean Expression Index in the market)
Boolean Expression Tree Matcher (state-based)
– Extension of the Rete algorithm as an in-memory event processing network (Forgy, 1982)
– For extensions & implementation, see our PADRES code base at padres.msrg.org
Algorithms for Matching Tasks
State Machine Engine
– Based on evaluating finite state machines (FSMs)
– Combined with techniques to merge states to amortize processing of similar subscriptions
– Combined with algorithms and data structures to track time conditions
Prediction Engine
– Based on training and evaluating a Markov model
   Trained on past events
   Evaluation over event stream
State Machine Engine
State machine creation
State machine evaluation
Example: F, F, F, S with time constraint (t_N3 − t_N1 < d)
We abstract for ease of presentation
– F represents a primitive subscription that evaluates to true for a failed login
– S represents a primitive subscription that evaluates to true for a successful login
– Index in time constraint refers to position (state) in the subscription (FSM)
[Figure: FSM with states N0, N1 (F), N2 (F,F), N3 (F,F,F), N4 (F,F,F,S); transitions are labeled with F, S and the time constraint; "," is the contiguous sequence operator]
t: time of the most recent transition into the state
Explicit temporal operator treated as another predicate to be evaluated over transition times tracked for all states
[Figure: evaluation of the contiguous-sequence FSM for S = F, F, F, S with time constraint (t_N3 − t_N1 < d) over the event stream F, F, F arriving at times t1, t2, t3]
At t1: current state N1 (F), entered at t1
At t2: current states N2 (F,F), entered at t2, and N1 (F), entered at t2
At t3: current states N3 (F,F,F), entered at t3, N2 (F,F), entered at t3, and N1 (F)
Contiguous sequence operator
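The evaluation shown on this slide can be sketched as follows. This is an illustration under assumptions, not the P-ToPSS engine: each partial match is kept as a list of state entry times, every event that matches the first predicate starts a new instance, a non-matching event discards an instance (contiguous sequence), and the time constraint is checked when the last predicate matches. The function name `match_contiguous` and the `(later, earlier, bound)` encoding of the constraint are hypothetical.

```python
def match_contiguous(pattern, events, constraint):
    """Match a contiguous sequence with one time constraint.

    pattern: list of labels, e.g. ["F", "F", "F", "S"].
    events: list of (timestamp, label) pairs ordered by time.
    constraint: (later, earlier, bound), meaning
        entry_time[later] - entry_time[earlier] < bound,
        with 0-based positions in the pattern.
    Returns the entry-time lists of all full matches.
    """
    later, earlier, bound = constraint
    instances = []  # one list of entry times per partial match
    matches = []
    for timestamp, label in events:
        survivors = []
        for times in instances:
            if label != pattern[len(times)]:
                continue  # contiguous: no event can be skipped
            times = times + [timestamp]
            if len(times) == len(pattern):
                if times[later] - times[earlier] < bound:
                    matches.append(times)
            else:
                survivors.append(times)
        if label == pattern[0]:
            survivors.append([timestamp])  # a new instance enters N1
        instances = survivors
    return matches


pattern = ["F", "F", "F", "S"]
events = [(1, "F"), (2, "F"), (3, "F"), (4, "S")]
found = match_contiguous(pattern, events, constraint=(2, 0, 5))
```

After the third F the sketch holds three instances (in N3, N2 and N1), as on the slide; the S at time 4 completes only the instance that is in N3.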
Example: F; S1; F; S2 with time constraint (t_S2 − t_S1 < T)
[Figure: FSM with states N0, N1 (F), N2 (F;S), N3 (F;S;F), N4 (F;S;F;…); ";" is the non-contiguous sequence operator]
Events not contributing to matching a subscription are allowed to occur (must remain in current state; achieved via self-links)
Upon a match of the next primitive subscription
– Time conditions are checked, if any
– Transition times are updated
Transition times are only tracked for primary & secondary links
– Primary link: first transition into state
– Secondary link: continued matching of the primitive subscription that led to the transition into this state
– Self link: triggered for every event except those that trigger primary & secondary links
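A rough sketch of the three link types for a non-contiguous sequence, with time conditions left out. It assumes that a primary link fires when the event matches the next primitive subscription, a secondary link when it matches again the subscription that led into the current state, and a self link otherwise; the primary link is given precedence, which is a choice made for this sketch. The names `classify_link` and `run_non_contiguous` are hypothetical.

```python
def classify_link(pattern, state, label):
    """Classify the link taken from `state` (number of predicates matched)."""
    if state < len(pattern) and label == pattern[state]:
        return "primary"    # first transition into the next state
    if state > 0 and label == pattern[state - 1]:
        return "secondary"  # same predicate matched again
    return "self"           # event does not contribute; stay in state


def run_non_contiguous(pattern, events):
    """Return (final_state, times); times[k] holds the transition times
    recorded for state k + 1 via primary and secondary links."""
    state = 0
    times = [[] for _ in pattern]
    for timestamp, label in events:
        link = classify_link(pattern, state, label)
        if link == "primary":
            state += 1
            times[state - 1].append(timestamp)
        elif link == "secondary":
            times[state - 1].append(timestamp)
        # self link: nothing is recorded
    return state, times


pattern = ["F", "S", "F", "S"]
events = [(1, "F"), (2, "X"), (3, "F"), (4, "S"), (5, "X"), (6, "F"), (7, "S")]
final_state, times = run_non_contiguous(pattern, events)
```

The unrelated events at times 2 and 5 take self links, and the second F at time 3 takes a secondary link, so its time is recorded alongside the first.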
[Figure: FSM for the non-contiguous sequence S1; S2; S3 with time conditions T1, T2, T3, and states N0, N1 (S1), N2 (S1;S2), N3 (S1;S2;S3); link labels include not(S1), not(S2), not(S3), not(T1), not(T2), not(T3)]
T1: (t_S2 − t_S1 < 3)
T2: (t_S3 − t_S1 < 6)
T3: (t_S3 − t_S2 > 3)
Timeline: S1 at t1, t2, t3; S2 at t4, t5, t6; S3 at t7, t8
Time(S1): S1:t1, S1:t2, S1:t3
Time(S2): S2:t4 with Tc(S1) = {t2, t3}; S2:t5 with Tc(S1) = {t3}
Time(S3): S3:t8 with Tc(S2) = {t4}, Tc(S1) = {t3}
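The candidate sets Tc on this slide can be reproduced with a short sketch, assuming that t_i equals i time units (the slide does not state the actual timestamps). The function names are hypothetical; only T1, T2 and T3 come from the slide.

```python
def t1(t_s2, t_s1):
    return t_s2 - t_s1 < 3   # T1


def t2(t_s3, t_s1):
    return t_s3 - t_s1 < 6   # T2


def t3(t_s3, t_s2):
    return t_s3 - t_s2 > 3   # T3


def candidates_for_s2(t_s2, time_s1):
    """Times of S1 that still satisfy T1 when S2 is observed at t_s2."""
    return {v for v in time_s1 if t1(t_s2, v)}


def candidates_for_s3(t_s3, time_s1, time_s2):
    """(t_S1, t_S2) pairs satisfying T1, T2 and T3 when S3 arrives at t_s3."""
    return {
        (v, u)
        for u in time_s2 if t3(t_s3, u)
        for v in time_s1 if t1(u, v) and t2(t_s3, v)
    }


time_s1 = [1, 2, 3]   # S1 observed at t1, t2, t3 (assumed t_i = i)
time_s2 = [4, 5, 6]   # S2 observed at t4, t5, t6

tc_at_t4 = candidates_for_s2(4, time_s1)
tc_at_t5 = candidates_for_s2(5, time_s1)
match_at_t7 = candidates_for_s3(7, time_s1, time_s2)
match_at_t8 = candidates_for_s3(8, time_s1, time_s2)
```

Under this assumption, S3 at t7 finds no admissible S2 time, while S3 at t8 pairs t3 for S1 with t4 for S2, in line with the Tc sets listed above.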
Merging State Machines
Two states N1 and N2 are equivalent iff:
1. The number of incoming transitions of N1 and N2 are equal.
2. Any incoming transitions arrive from equivalent states and are triggered by the same set of events.
Initial states are equivalent.
[Figure: three state machines, N0 → N1 (a) → N2 (a;b) → N3 (a;b,c), N0 → N1 (a) → N2 (a;b) → N3 (a;b,d), and N0 → N1 (a), merged into one machine with states M0, M1 (a), M2 (a;b), M3 (a;b,c), M4 (a;b,d) and N5 (a)]
Merged: a; b; c   a; b; d   a
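One simple way to approximate this merging is to share common prefixes, as in a trie. The sketch below treats two states as equivalent when they are reached from the initial state by the same sequence of (operator, predicate) steps; this is a simplification of the equivalence rule above and may merge more or less aggressively than the scheme in the figure. The class `State`, the function `merge` and the step encoding are assumptions made for illustration.

```python
class State:
    def __init__(self, label):
        self.label = label
        self.children = {}       # (operator, predicate) -> State
        self.subscriptions = []  # subscriptions that end in this state


def merge(subscriptions):
    """Build one shared machine; return (initial state, number of states)."""
    root = State("M0")
    count = 1
    for sub_id, steps in subscriptions.items():
        node = root
        for step in steps:
            if step not in node.children:
                node.children[step] = State(f"M{count}")
                count += 1
            node = node.children[step]
        node.subscriptions.append(sub_id)
    return root, count


# Each step is (operator linking it to the previous step, predicate).
subscriptions = {
    "cs1": [("", "a"), (";", "b"), (",", "c")],
    "cs2": [("", "a"), (";", "b"), (",", "d")],
    "cs3": [("", "a")],
}
root, merged_states = merge(subscriptions)
unmerged_states = sum(len(steps) + 1 for steps in subscriptions.values())
```

In this sketch the three machines need ten states when kept separate and five when merged by prefix.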
Markov Model for Prediction
FSMs record incremental matches of subscriptions
Probability of transitioning to next state for a given event depends only on current state
– Our FSMs are Markov processes
Our prediction algorithm uses the properties of Markov processes to predict future matches based on current state and event history
– Probability of reaching the final state in n events
– … of reaching final state in the next 1, 2, 3, … n events
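A minimal sketch of the look-ahead computation, assuming the transition probabilities are already known and the final state is treated as absorbing, so that "in the final state after n events" means "reached the final state within n events". The function `reach_within` and the three-state chain are made up for illustration.

```python
def reach_within(P, start, final, n):
    """Probability of reaching `final` from `start` within n events.

    P is a row-stochastic matrix (list of lists); P[i][j] is the
    probability of moving from state i to state j on the next event.
    """
    size = len(P)
    Q = [row[:] for row in P]
    # Make the final state absorbing so that mass cannot leave it.
    Q[final] = [1.0 if j == final else 0.0 for j in range(size)]
    dist = [1.0 if i == start else 0.0 for i in range(size)]
    for _ in range(n):
        dist = [sum(dist[i] * Q[i][j] for i in range(size))
                for j in range(size)]
    return dist[final]


# Hypothetical chain: state 2 is the final (full match) state.
P = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0],
]
look_ahead = [reach_within(P, start=1, final=2, n=n) for n in (1, 2, 3)]
```

A prediction would then be raised for a subscription whenever the value for the chosen look-ahead exceeds its threshold θ.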
Prediction & Training
Compute long-run transition probability of reaching a given state
Based on the input (event history), we count the number of times transitions are taken
Based on counters, we compute transition probabilities of the model
Complete Markov chain with finite state space
Transition probability from state i to j is p_ij = Pr(X_{n+1} = j | X_n = i)
– Conditional probability of transitioning to j given i
– Estimated as (# times transition taken) / (all incoming transitions)
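The counting step can be sketched as below. The sketch normalizes each counter by the total number of transitions observed out of state i, which is the usual maximum-likelihood estimate of p_ij; treating that total as the denominator is an assumption of this sketch. The function `train` and the state sequence are hypothetical.

```python
from collections import Counter, defaultdict


def train(state_sequence):
    """Estimate p_ij from a sequence of visited states."""
    counts = defaultdict(Counter)
    for current, nxt in zip(state_sequence, state_sequence[1:]):
        counts[current][nxt] += 1  # one more i -> j transition taken
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


# States visited by one FSM while replaying a hypothetical event history.
history = [0, 1, 0, 1, 2]
p = train(history)
```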
Experiments
Synthetic workload
Real data set
Effect of Number of Subscriptions
[Figure, two panels: "Number of states" and "Average matching time per event"; legend: Gaussian, Uniform; annotations: more sharing, less sharing]
Number of states
– Merging reduces number of states by up to 30% for given data set
– Number of states increases linearly in number of subscriptions
– More states are required for workloads with less state sharing potential
Average matching time per event
– Matching time increases in the number of subscriptions
– More sharing requires more processing as a given event may trigger more transitions
Effect of Number of Non-contiguous Operators
[Figure: average matching time per event]
Matching time increases in number of non-contiguous operators
– More and more subscription instances are partially matched, waiting for events
– Calls for a garbage collection scheme
Experiments on Synthetic Workload
Precision decreases as look-ahead increases
Precision increases as prediction threshold increases and stabilizes for large thresholds
Expert Model (full) vs. Learned Model
[Figure, two panels: Full model (about 1400 states) and Learned model (5 states)]
Precision defined as True positives / All predictions
Result: With increasing look-ahead, the learned model results in higher precision.
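For reference, the precision metric can be computed as in this small sketch. The function name and the subscription identifiers are hypothetical, and returning `None` when no prediction was made is a choice of this sketch.

```python
def precision(predicted, matched):
    """True positives / all predictions.

    predicted: subscriptions for which a future match was predicted.
    matched: subscriptions that did match afterwards.
    """
    predicted, matched = set(predicted), set(matched)
    if not predicted:
        return None  # undefined when nothing was predicted
    return len(predicted & matched) / len(predicted)


score = precision({"cs1", "cs2", "cs4", "cs5"}, {"cs1", "cs2", "cs4"})
```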
Conclusions
P-ToPSS is a new publish/subscribe model for event stream processing
– Predicts the probability a subscription will match in the future
– Performs traditional publish/subscribe matching
– Supports state-based, temporal and Boolean operators over predicates (complex subscriptions)
– Based on Markov chains for prediction
Prediction performance of learned model is better than hand-crafted model in our experiments