
1 Efficient Mining of Recurrent Rules from a Sequence Database
David Lo †*
Joint work with: Siau-Cheng Khoo † and Chao Liu ‡
† Prog. Lang. & Sys. Lab, Dept of Comp. Science, National Uni. of Singapore. Current: (Sch. of Info. Systems, Singapore Management Uni.)
‡ Data Mining Group, Department of Computer Science, Uni. of Illinois at Urbana-Champaign. Current: (Microsoft Research, Redmond)

2 Motivation
o A huge amount of data exists, and we want to mine knowledge from it.
o Recurrent rules: “Whenever a series of precedent events (pre) occurs, eventually another series of consequent events (post) occurs.” Denoted as: pre -> post
o We want to mine recurrent rules from a sequence database.

3 Recurrent Rules – Intuitive Examples
o Locking Protocol: “Whenever a lock is acquired, eventually it is released.”
o Internet Banking: “Whenever a connection to a bank server is made and authentication is completed, money transfer command is issued and verified, eventually money is transferred and notification is displayed.”

4 Soft. Specifications & Recurrent Rules
o A recurrent rule corresponds to a family of program properties useful for software verification
o Formalized in Linear Temporal Logic
o Software specs are often incomplete or outdated, which motivates mining for them [ABL02,DSB04,LKL07]
o Mining specifications helps in:
– Understanding existing/legacy systems
– Helping verification tools ensure correctness of systems and detect bugs

5 Problem Statements
Problem 1: “Given a set of sequences, find rules that recur (are satisfied) a significant number of times within a sequence and across multiple sequences. A rule is significant if it satisfies minimum thresholds of support and confidence.”
Problem 2: “Mine a set of non-redundant significant recurrent rules.”

6 Extending Sequential Rules [S99]
o Sequential rule pre -> post:
– Rules formed by composing sequential patterns [AS95,YHA03,WH04]: series of events supported by (i.e., a sub-sequence of) a significant number of sequences.
– Whenever a sequence is a super-seq. of pre, it will also be a super-seq. of pre++post.
o Recurrent rule:
– Multiple occurrences of the rule’s premise and consequent, both within a sequence and across multiple sequences, are considered.

7 Extending Episode Rules [MTV97]
o Episode rule pre -> post:
– Episode: series of events occurring close together (e.g., in a window).
– Whenever a window is a super-seq. of pre, it will also be a super-seq. of pre++post.
o Recurrent rule:
– Handles multiple sequences
– We want to break the window barrier
– It is hard to tell the right window size
– A lock may be separated from its unlock by an arbitrary number of events
– We mine a non-redundant set of rules

8 Preliminaries

9 Linear Temporal Logic (LTL)
o Formalism to precisely specify temporal requirements.
o It works on paths [HR03]
o There are a number of operators:
– G p: Globally, at every point in time, p holds
– F p: At that point in time or eventually (Finally), p holds
– X p: p holds at the neXt point in time
Rule               LTL
a -> b             G(a -> XF(b))
<a,b> -> <c,d>     G(a -> XG(b -> XF(c ^ XF(d))))
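The operators above can be tried out on short event traces. The following is only a sketch under a stated assumption: LTL is defined over infinite paths, whereas this code evaluates formulas on finite traces, treating atoms and F as false and G as true once the trace has ended. The helper names (`atom`, `implies`, `conj`, `X`, `F`, `G`) are illustrative and do not come from the presentation.

```python
from typing import Callable, Sequence

# A formula is a predicate over (trace, position); positions are 0-based here.
Formula = Callable[[Sequence[str], int], bool]

def atom(event: str) -> Formula:
    # Assumption: an atom is false at positions past the end of the finite trace.
    return lambda t, i: 0 <= i < len(t) and t[i] == event

def implies(f: Formula, g: Formula) -> Formula:
    return lambda t, i: (not f(t, i)) or g(t, i)

def conj(f: Formula, g: Formula) -> Formula:
    return lambda t, i: f(t, i) and g(t, i)

def X(f: Formula) -> Formula:
    # neXt: evaluate f one position later.
    return lambda t, i: f(t, i + 1)

def F(f: Formula) -> Formula:
    # Finally: f holds now or at some later position (false on an empty suffix).
    return lambda t, i: any(f(t, k) for k in range(i, len(t)))

def G(f: Formula) -> Formula:
    # Globally: f holds at every remaining position (true on an empty suffix).
    return lambda t, i: all(f(t, k) for k in range(i, len(t)))

# G(lock -> XF(unlock)): every lock is eventually followed by an unlock.
lock_rule = G(implies(atom("lock"), X(F(atom("unlock")))))

# G(a -> XG(b -> XF(c ^ XF(d)))): the second row of the Rule/LTL table.
abcd_rule = G(implies(atom("a"),
                      X(G(implies(atom("b"),
                                  X(F(conj(atom("c"), X(F(atom("d")))))))))))

print(lock_rule("main lock use unlock end".split(), 0))       # True
print(lock_rule("main lock use unlock lock end".split(), 0))  # False
```

Writing each operator as a function that returns a predicate keeps the Python expression close to the written formula, which makes it easy to compare against the table.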

10 Checking or Verifying Temporal Logics
[Figure: a program is transformed into an automata model, which is checked against the LTL property for violations.]
Program: main(x){ if (lock=0) lock;use;unlock;lock; else for i: 1 to 10 lock;use;unlock }
Possible traces or sequences:
– main lock use unlock lock end
– main lock use unlock lock use unlock end
– main lock use unlock end
– …

11 Concepts, Definitions and Rules Semantics

12 Temporal Points
“Whenever a series of precedent events occurs at a point in time or temporal point, eventually another series of consequent events occurs.”
– Peek at interesting temporal points & see what series of events are likely to happen next
– Temporal points in a sequence S:
  - The indices in S, starting from 1.
  - A sequence of six events has 6 temporal points.
– For a temporal point j in S = <a1, ..., an>, the prefix <a1, ..., aj> of S is called the j-prefix of S.

13 Occurrences & Instances
o Consider a pattern P and a sequence S
o The set of all occurrences of P in S, Occ(P,S), is the set: {j | P is a sub-sequence of the j-prefix of S && last(P) = S[j]}
o The set of all instances of P in S, Inst(P,S), is the set: {j-prefix of S | j is in Occ(P,S)}
o Example: a pattern whose set of occurrences in a sequence is {2,4,6} has as instances the 2-, 4-, and 6-prefixes of that sequence.
– These correspond to the temporal points to be checked for rules with that pattern as premise
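The two definitions translate almost directly into code. This is a minimal sketch; the function names and the six-event sequence are assumptions chosen for illustration (the sequence is simply one that yields the occurrence set {2,4,6}), and temporal points are 1-based as on the slides.

```python
def is_subseq(pattern, seq):
    """True if pattern is a (not necessarily contiguous) sub-sequence of seq."""
    it = iter(seq)
    # `ev in it` advances the iterator, so events must be found in order.
    return all(ev in it for ev in pattern)

def occurrences(pattern, seq):
    """Occ(P,S): temporal points j with P in the j-prefix of S and last(P) = S[j]."""
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subseq(pattern, seq[:j])]

def instances(pattern, seq):
    """Inst(P,S): the j-prefixes of S for every j in Occ(P,S)."""
    return [seq[:j] for j in occurrences(pattern, seq)]

# Hypothetical example data, not taken from the presentation.
S = ("a", "b", "a", "b", "a", "b")
P = ("a", "b")

print(occurrences(P, S))  # [2, 4, 6]
print(instances(P, S))
```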

14 Projected and Projected-all DB
o A database SeqDB projected on pattern P is defined as: SeqDB_P = {(j,sx) | s = SeqDB[j], s = px++sx, where px is the minimal prefix of s containing P}
[Example tables: SeqDB with sequences S1 and S2, and its projection with entries S1 and S2.]

15 Projected and Projected-all DB
o A database SeqDB projected-all on pattern P is defined as: SeqDB^all_P = {(j,sx) | s = SeqDB[j], s = px++sx, where px is an instance of P}
o Returns the temporal points to check
[Example tables: SeqDB with sequences S1 and S2, and SeqDB^all_P with entries S1 i, S1 ii, S2 i, S2 ii.]
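The difference between the two projections is easiest to see side by side. The sketch below assumes sequences are tuples of event names, sequence IDs are 1-based, and patterns are non-empty; `project`, `project_all`, and the sample database are hypothetical names and data, not part of the presentation.

```python
def minimal_prefix_end(pattern, seq):
    """1-based end of the minimal prefix of seq containing pattern, or None."""
    k = 0
    for j, ev in enumerate(seq, start=1):
        if ev == pattern[k]:  # greedy leftmost matching ends as early as possible
            k += 1
            if k == len(pattern):
                return j
    return None

def occurrences(pattern, seq):
    """Temporal points j where seq[j] = last(pattern) and the j-prefix contains pattern."""
    first = minimal_prefix_end(pattern, seq)
    if first is None:
        return []
    return [j for j in range(first, len(seq) + 1) if seq[j - 1] == pattern[-1]]

def project(db, pattern):
    """SeqDB_P: one (id, suffix) per sequence, cut after the minimal prefix."""
    out = []
    for sid, seq in enumerate(db, start=1):
        end = minimal_prefix_end(pattern, seq)
        if end is not None:
            out.append((sid, seq[end:]))
    return out

def project_all(db, pattern):
    """SeqDB^all_P: one (id, suffix) per instance of the pattern."""
    return [(sid, seq[j:])
            for sid, seq in enumerate(db, start=1)
            for j in occurrences(pattern, seq)]

# Hypothetical database for illustration only.
db = [("a", "b", "a", "b"), ("b", "a")]
print(project(db, ("a",)))      # [(1, ('b', 'a', 'b')), (2, ())]
print(project_all(db, ("a",)))  # [(1, ('b', 'a', 'b')), (1, ('b',)), (2, ())]
```

Here `occurrences` relies on the fact that every temporal point at or after the end of the minimal prefix whose event equals last(P) is an occurrence, since the earlier prefix already contains the rest of P.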

16 Counting Supports and Confidence
o Consider the rule pre -> post
o Sequence support (s-sup): The number of sequences where the prefix pre appears.
o Instance support (i-sup): The number of instances of pre++post.
o Confidence (conf): The likelihood that post appears after pre. This can be found by computing the ratio:
conf(pre -> post) = |instances of pre where post eventually occurs afterwards| / |instances of pre| = |(SeqDB^all_pre)_post| / |SeqDB^all_pre|

17 Counting Supports and Confidence
Example over a database with sequences S1 and S2:
s-sup ( -> ) = 2
i-sup ( -> ) = 3
conf( -> ) = 1.0
conf( -> ) = 0.5
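The three measures can be computed directly from their definitions. The following is a rough sketch, assuming sequences are tuples of event names; the function names and the two-sequence lock/unlock database are hypothetical and are not the database used in the presentation's example.

```python
def is_subseq(pattern, seq):
    it = iter(seq)
    return all(ev in it for ev in pattern)

def occurrences(pattern, seq):
    # 1-based temporal points j with pattern in the j-prefix and last(pattern) = seq[j].
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subseq(pattern, seq[:j])]

def s_sup(db, pre):
    """Sequence support: number of sequences in which pre appears."""
    return sum(is_subseq(pre, seq) for seq in db)

def i_sup(db, pre, post):
    """Instance support: number of instances of pre++post over the database."""
    return sum(len(occurrences(pre + post, seq)) for seq in db)

def conf(db, pre, post):
    """Fraction of instances of pre that are eventually followed by post."""
    suffixes = [seq[j:] for seq in db for j in occurrences(pre, seq)]
    if not suffixes:
        return 0.0  # assumption: a rule whose premise never occurs gets confidence 0
    return sum(is_subseq(post, sx) for sx in suffixes) / len(suffixes)

# Hypothetical database for illustration only.
db = [("lock", "use", "unlock"), ("lock", "unlock", "lock")]
pre, post = ("lock",), ("unlock",)

print(s_sup(db, pre))        # 2: "lock" appears in both sequences
print(i_sup(db, pre, post))  # 2: one instance of <lock, unlock> per sequence
print(conf(db, pre, post))   # 2 of the 3 instances of <lock> are followed by "unlock"
```

The final lock in the second sequence is never released, which is why the confidence drops below 1 even though both sequences support the rule.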

18 Properties, Theorems, and Algorithms

19 Apriori Properties – Support & Conf.
Theorem 1. Consider two rules Rx = p -> c and Ry = q -> c. If p ⊑ q (p is a sub-sequence of q) and s-sup(Rx) < min-s-sup, then s-sup(Ry) < min-s-sup.
Example: Rx: a -> z with s-sup(Rx) < min-s-sup. Non-significant Ry's: a,b -> z; a,b,c -> z; a,c -> z; a,b,d -> z; ….
Theorem 2. Consider two rules Rx = p -> c and Ry = p -> d. If c ⊑ d (c is a sub-sequence of d) and conf(Rx) < min-conf, then conf(Ry) < min-conf.
Example: Rx: a -> z with conf(Rx) < min-conf. Non-significant Ry's: a -> b,z; a -> b,c,z; a -> c,z; a -> b,d,z; ….

20 Rule Redundancy
o Consider two rules Rx = p -> c and Ry = q -> d. Rx is redundant if the following conditions hold:
1. Rx is a sub-seq. of Ry (i.e., p++c ⊑ q++d)
2. Rx & Ry have the same sup. and conf. values.
o Redundant rules are identified and removed early during the mining process.
o Example: a -> b; a -> c; a -> b,c; a -> b,d; …. are redundant w.r.t. a -> b,c,d iff their sup. and conf. values are the same as those of a -> b,c,d.
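The redundancy test can be sketched as a predicate over pairs of rules. Several things here are assumptions rather than the authors' definitions: the `Rule` record and its field names are hypothetical, "same sup." is read as equal s-sup and equal i-sup, and a rule is only compared against strictly longer rules so that two rules with the same concatenation do not eliminate each other.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    pre: tuple
    post: tuple
    s_sup: int
    i_sup: int
    conf: float

def is_subseq(pattern, seq):
    it = iter(seq)
    return all(ev in it for ev in pattern)

def is_redundant(rx: Rule, ry: Rule) -> bool:
    """True if rx is made redundant by ry under the two conditions on the slide."""
    x, y = rx.pre + rx.post, ry.pre + ry.post
    same_stats = (rx.s_sup, rx.i_sup, rx.conf) == (ry.s_sup, ry.i_sup, ry.conf)
    # Assumption: require a strictly longer ry to avoid mutual elimination on ties.
    return len(x) < len(y) and is_subseq(x, y) and same_stats

def remove_redundant(rules):
    """Keep only the rules that no other rule in the list makes redundant."""
    return [r for r in rules if not any(is_redundant(r, other) for other in rules)]

# Hypothetical rules with made-up statistics, for illustration only.
rules = [
    Rule(("a",), ("b",), 2, 3, 1.0),
    Rule(("a",), ("b", "c"), 2, 3, 1.0),
    Rule(("a",), ("c",), 2, 4, 1.0),
]
for r in remove_redundant(rules):
    print(r.pre, "->", r.post)
```

In this toy list, a -> b is dropped because a -> b,c carries the same statistics, while a -> c survives because its instance support differs.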

21 Theorem 3. Given two pre-conditions PX and PY where PX ⊑ PY (PX is a sub-sequence of PY), if SeqDB_PX = SeqDB_PY, then for all sequences of events post, the rule PX -> post is rendered redundant by PY -> post.
Theorem 4. Given two rules RX (pre -> CX) and RY (pre -> CY), if CX ⊑ CY and (SeqDB^all_pre)_CX = (SeqDB^all_pre)_CY, then RX is rendered redundant by RY and can be pruned.

22 Algorithm
o Step 1: Mine a pruned set of pre-conditions
– Satisfy min-s-sup threshold
– Use Theorems 1 & 3
o Step 2: For each pre-cond. pre, create SeqDB^all_pre.
o Step 3: Mine a pruned set of post-conditions
– Corresponding rules satisfy min-conf.
– Use Theorems 2 & 4
o Step 4: Remove rules that don’t satisfy min-i-sup.
o Step 5: Filter any remaining redundant rules.
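A naive version of Steps 1 to 4 can be sketched as two nested depth-first searches. This is not the authors' implementation: it uses only the apriori pruning of Theorems 1 and 2, omits the redundancy pruning of Theorems 3 and 4 and the final filtering of Step 5, and bounds pattern length with a hypothetical `max_len` parameter to keep the search small. It also assumes `min_s_sup` is at least 1. All function names are illustrative.

```python
from itertools import chain

def is_subseq(pattern, seq):
    it = iter(seq)
    return all(ev in it for ev in pattern)

def occurrences(pattern, seq):
    return [j for j in range(1, len(seq) + 1)
            if seq[j - 1] == pattern[-1] and is_subseq(pattern, seq[:j])]

def mine(db, min_s_sup, min_i_sup, min_conf, max_len=3):
    """Return (pre, post, conf) triples for significant rules (no redundancy removal)."""
    events = sorted(set(chain.from_iterable(db)))
    rules = []

    def grow_post(pre, post, suffixes):
        # Step 3: extend the post-condition one event at a time.
        for ev in events:
            cand = post + (ev,)
            conf = sum(is_subseq(cand, sx) for sx in suffixes) / len(suffixes)
            if conf < min_conf:
                continue  # Theorem 2: no extension of cand can reach min_conf
            # Step 4: instance support is checked as a filter on emitted rules.
            i_sup = sum(len(occurrences(pre + cand, seq)) for seq in db)
            if i_sup >= min_i_sup:
                rules.append((pre, cand, conf))
            if len(cand) < max_len:
                grow_post(pre, cand, suffixes)

    def grow_pre(pre):
        # Step 1: extend the pre-condition one event at a time.
        for ev in events:
            cand = pre + (ev,)
            s_sup = sum(is_subseq(cand, seq) for seq in db)
            if s_sup < min_s_sup:
                continue  # Theorem 1: no extension of cand can reach min_s_sup
            # Step 2: suffixes of the projected-all database on cand.
            suffixes = [seq[j:] for seq in db for j in occurrences(cand, seq)]
            if suffixes:
                grow_post(cand, (), suffixes)
            if len(cand) < max_len:
                grow_pre(cand)

    grow_pre(())
    return rules

# Hypothetical database for illustration only.
db = [("a", "b", "a", "b"), ("a", "b")]
print(mine(db, min_s_sup=2, min_i_sup=2, min_conf=0.9, max_len=2))
# [(('a',), ('b',), 1.0)]
```

Growing patterns by appending one event is compatible with both theorems, because every pattern is a super-sequence of the shorter pattern it was grown from.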

23 Equiv. Proj DB & LS-Set Patterns
o From Theorem 3 (& 4), a pre- (post-) condition is not pruned iff there does not exist any super-sequence pattern having the same projected database.
o Such patterns are also referred to as projected-database closed or LS-Set (Yan and Han, 2003)
o We generate this set by modifying BIDE (Wang and Han, 2004)
– Keep the search space pruning strategy
– Remove the closure checks
– Proof of completeness in technical report

24 Mine Pruned Pre-Conds -> Mine Pruned Post-Conds -> Check Instance Support & Remove Remaining Red. Rules

25 Performance & Case Study

26 Synthetic Dataset D5C20N10S20: 147x Faster, 8500x More Compact

27 Gazelle Dataset (KDD Cup 2000): Full set of significant rules is not minable

28 JBoss Security
Premise:
– XLoginConfImpl.getCfgEntry()
– AuthenticationInfo.getName()
Consequent:
– ClientLoginModule.initialize()
– ClientLoginModule.login()
– ClientLoginModule.commit()
– SecAssocActs.setPrincipalInfo()
– SetPrincipalInfoAction.run()
– SecAssocActs.pushSubjectCtx()
– SubjectThdLocalStack.push()
– SimplePrincipal.toString()
– SecAssoc.getPrincipal()
– SecAssoc.getCredential()
– SecAssoc.getPrincipal()
– SecAssoc.getCredential()
“Whenever login configuration information is checked, eventually invocations of authentication events, binding of principal to subject, and utilization of subject & principal information occur.”

29 Conclusion
o We propose a novel framework to mine a non-redundant set of significant recurrent rules: “Whenever a series of precedent events occurs, eventually a series of consequent events occurs”
o Employ 2 apriori properties and 2 redundancy theorems
o Major speedup and reduction in the number of rules by the non-redundant rule mining strategy.
o We show the utility in mining the behavior of JBoss Security
Future Work
o Improve mining speed
o More case studies and applications to DM/SE problems

