Published by Dante Bolding. Modified over 9 years ago.
1
Sequential Patterns & Process Mining Current State of Research Edgar de Graaf LIACS
2
2/34 Mining Sequential Patterns Sequential Patterns Sequence Databases AprioriAll PrefixSpan Gap Constraints
3
3/34 Sequential Patterns contained in not contained in
4
4/34 Sequential databases The Database with sequences
5
5/34 Sequential databases Support count 0 A Generated Candidate Pattern
6
6/34 Sequential databases Support count 0 1
7
7/34 Sequential databases Support count 1 Not Contained → Not Counted
8
8/34 Sequential databases Support count 1 Contained 2345 IF Minimal Support ≤ 50% THEN frequent
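The support counting walked through on the preceding slides can be sketched as follows. This is a minimal illustration, not the presenter's code: it assumes a sequence is a list of item sets, and the names `is_contained`, `support` and the example database are hypothetical.

```python
def is_contained(pattern, sequence):
    """True if each item set of `pattern` is a subset of a distinct item set
    of `sequence`, in the same order (greedy left-to-right matching)."""
    pos = 0
    for wanted in pattern:
        # Skip item sets of the sequence that do not cover the wanted item set.
        while pos < len(sequence) and not set(wanted) <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern item set must match strictly later
    return True


def support(pattern, database):
    """Fraction of the sequences in `database` that contain `pattern`."""
    return sum(is_contained(pattern, s) for s in database) / len(database)


# Hypothetical database of four sequences (not the one shown on the slides).
database = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"a"}, {"c", "d"}],
    [{"d"}, {"a"}],
]
candidate = [{"a"}, {"c"}]
min_support = 0.5
is_frequent = support(candidate, database) >= min_support
```

A sequence that does not contain the candidate simply adds nothing to the count, which matches the "Not Contained → Not Counted" step.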
9
9/34 Lifting order (1) Notation by example: an ordered list of sets ≡ sequence. Every set A, B and C is unordered, e.g. A = (x,y,z) = (y,z,x) = (z,y,x) = … [x,y,z] is an extension: we ignore the order when counting frequency
10
10/34 Lifting order (2) and frequent → is frequent Says: t3 and t2 frequently occur in between t1 and t4, in either order
11
11/34 Lifting Order (3) and infrequent; suppose (t1)[t3,t2](t4) is frequent. Says: t3 and t2 often occur in between t1 and t4
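One way to read the lifted notation is that a pattern such as (t1)[t3,t2](t4) is counted whenever any ordering of the bracketed group occurs. The sketch below works under that assumption and with sequences of single tasks. The representation (a Python set for a bracketed group) and the names `expansions` and `lifted_support` are hypothetical.

```python
from itertools import permutations, product


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def expansions(lifted):
    """Yield every fully ordered pattern covered by a lifted pattern.
    A set element stands for a bracketed group whose order is ignored."""
    options = [
        list(permutations(sorted(e))) if isinstance(e, (set, frozenset)) else [(e,)]
        for e in lifted
    ]
    for combo in product(*options):
        yield [item for part in combo for item in part]


def lifted_support(lifted, database):
    """Fraction of sequences that contain at least one expansion."""
    hits = sum(
        any(is_subsequence(p, s) for p in expansions(lifted)) for s in database
    )
    return hits / len(database)


# Hypothetical event sequences.
database = [
    ["t1", "t3", "t2", "t4"],
    ["t1", "t2", "t3", "t4"],
    ["t1", "t2", "t4"],
    ["t1", "t3", "t2", "t4"],
]
ordered_a = lifted_support(["t1", "t3", "t2", "t4"], database)
ordered_b = lifted_support(["t1", "t2", "t3", "t4"], database)
lifted = lifted_support(["t1", {"t3", "t2"}, "t4"], database)
```

With a minimal support of 0.6, both ordered patterns in this toy database are infrequent while the lifted pattern is frequent, which is the situation described on slide 11.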
12
12/34 Existing Algorithms AprioriAll: the first algorithm based on the anti-monotone principle. PrefixSpan: currently the fastest algorithm around; it uses projected databases
13
13/34 AprioriAll (1)
AprioriAll(DB, min_sup) {
  L_1 = {frequent sequences of size 1}
  k = 2
  while (L_{k-1} is not empty) {
    C_k = candidateGeneration(L_{k-1}, k)
    C_k = candidatePruning(C_k, k)
    L_k = supportBasedPruning(C_k)
    k++
  }
}
14
14/34 AprioriAll (2)
candidateGeneration(L_{k-1}, k) {
  C_k = ø
  for each a in L_{k-1} {
    for each b in L_{k-1} {
      if (for all n, 1 ≤ n ≤ k-2: a_n = b_n)
        add to C_k the sequences {a_1 … a_{k-2}, a_{k-1}, b_{k-1}} and {a_1 … a_{k-2}, b_{k-1}, a_{k-1}}
    }
  }
}
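The candidate generation step can be sketched in Python as below. This is an illustration under simplifying assumptions: each sequence is a tuple of single items, and `candidate_pruning` implements the usual anti-monotone check (every subsequence with one item removed must itself be frequent), which the slides name but do not spell out.

```python
def candidate_generation(prev_level, k):
    """Join every pair of frequent (k-1)-sequences that share their first
    k-2 items, emitting both orders of the two final items."""
    candidates = set()
    for a in prev_level:
        for b in prev_level:
            if a[:k - 2] == b[:k - 2]:
                candidates.add(a[:k - 2] + (a[k - 2], b[k - 2]))
                candidates.add(a[:k - 2] + (b[k - 2], a[k - 2]))
    return candidates


def candidate_pruning(candidates, prev_level):
    """Keep a candidate only if dropping any single item leaves a frequent
    (k-1)-sequence (assumed reading of the pruning step)."""
    return {
        c for c in candidates
        if all(c[:i] + c[i + 1:] in prev_level for i in range(len(c)))
    }


# Hypothetical frequent 2-sequences.
level_2 = {("a", "b"), ("b", "c"), ("a", "c")}
generated = candidate_generation(level_2, 3)
pruned = candidate_pruning(generated, level_2)
```

Only candidates that survive pruning need a support count against the database, which is where the anti-monotone principle saves work.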
15
15/34 PrefixSpan (1)
Assume that the prefix =
1. Scan the projected database to find every frequent item x such that
   1. is frequent, or
   2. is frequent
2. Append x to the prefix and output the pattern
3. Now call recursively, e.g. PrefixSpan(, newProjDB)
16
16/34 PrefixSpan (2) A projected DB only stores the postfix E.g. if prefix = then we store as New projected DB = Old projected DB – sequences without prefix
17
17/34 PrefixSpan (3) Faster than AprioriAll No non-existing candidates Testing on a shrinking projected DB
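The recursion over shrinking projected databases might look like the following sketch. It is simplified to sequences of single items, so only the "append x to the prefix" case appears, and the function name and example data are hypothetical.

```python
from collections import Counter


def prefixspan(projected_db, min_count, prefix=()):
    """Yield (pattern, count) pairs. `projected_db` holds only the postfixes
    of the sequences that contain `prefix`."""
    counts = Counter()
    for postfix in projected_db:
        counts.update(set(postfix))  # count an item once per sequence
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue
        pattern = prefix + (item,)
        yield pattern, count
        # New projected DB: drop sequences without the item and keep only
        # what follows its first occurrence.
        new_db = [p[p.index(item) + 1:] for p in projected_db if item in p]
        yield from prefixspan(new_db, min_count, pattern)


# Hypothetical sequence database.
database = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
patterns = dict(prefixspan(database, min_count=2))
```

No candidate is ever generated that does not occur in the data, and each recursive call scans a smaller database than its caller.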
18
18/34 Gap Constraint Simple idea: a maximal distance between the item sets of a sequence, e.g. pattern = and gap = 1 then this sequence is not counted
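A gap-constrained containment test could be sketched as below. It assumes that "gap" means the maximal number of elements allowed between two consecutive matched elements, which the slide does not define precisely, and it works on sequences of single items. The function name is hypothetical.

```python
def contains_with_max_gap(pattern, sequence, max_gap):
    """True if the non-empty `pattern` occurs in `sequence` in order, with at
    most `max_gap` elements between consecutive matched elements."""
    # Positions at which a match of the pattern prefix seen so far can end.
    ends = [i for i, x in enumerate(sequence) if x == pattern[0]]
    for wanted in pattern[1:]:
        ends = [
            j for j, x in enumerate(sequence)
            if x == wanted and any(0 < j - i <= max_gap + 1 for i in ends)
        ]
    return bool(ends)


not_counted = contains_with_max_gap("ac", "abbc", max_gap=1)
counted = contains_with_max_gap("ac", "abbc", max_gap=2)
```

All possible end positions are tracked instead of matching greedily, because the earliest match of one element may violate the gap while a later one satisfies it.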
19
19/34 Process Mining What is process mining? Using D/F tables and graphs Genetic Algorithms Problem areas Using sequential patterns
20
20/34 What is process mining? (1)
The ordering of events is known, e.g.
Process mining constructs a Petri net (figure labels: claim, register, to_be_evaluated, pay, send_letter, ready)
Source: Workflow Management by W. van der Aalst and K. van Hee. (1997)
21
21/34 What is process mining? (2) Usability of process mining: Given the audit trails, what is the workflow network? Mined workflow network ≡ original design? (Delta Analysis) Mined workflow network better than the original design? (Performance Analysis)
22
22/34 Using D/F tables and graphs (1)
For every task A, a D/F table:

B    #B    B<A  A>B  B<<<A  A>>>B  A→B
T1   1000  0    0    687    0      -0.246
T2   1994  0    0    1035   505    -0.487

Intuition: if A is often followed by B then the probability of A causing B increases
23
23/34 Using D/F tables and graphs (2) A D/F graph is constructed: IF ((A→B ≥ N) AND (A>B ≥ σ) AND (B<A ≤ σ)) THEN connect A to B. More complicated rules deal with recursion and short loops
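The connection rule can be illustrated with a small sketch. Several assumptions are made here: "A>B" is read as the number of times A is directly followed by B, "B<A" as the number of times B directly precedes A, and the causality value A→B is passed in as a number because the slides do not give its formula. The function names and traces are hypothetical.

```python
from collections import Counter


def direct_succession_counts(traces):
    """Count how often task x is directly followed by task y over all traces."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def connect(causality_ab, a_then_b, b_then_a, n_threshold, sigma):
    """Slide rule: (A→B ≥ N) AND (A>B ≥ σ) AND (B<A ≤ σ)."""
    return causality_ab >= n_threshold and a_then_b >= sigma and b_then_a <= sigma


# Hypothetical audit trails.
traces = [("A", "B", "C"), ("A", "B", "D"), ("B", "A")]
counts = direct_succession_counts(traces)
edge_ab = connect(
    causality_ab=0.8,  # assumed value, supplied by whatever computes A→B
    a_then_b=counts[("A", "B")],
    b_then_a=counts[("B", "A")],
    n_threshold=0.5,
    sigma=2,
)
```

A negative causality value never passes the first test for a positive threshold N, so no edge is drawn in that case.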
24
24/34 Using D/F tables and graphs (3) D/F Graph example:
25
25/34 Using D/F tables and graphs (4) AND/OR-Splits: OR if neither C > B nor B > C is higher than the threshold; AND if both are higher than the threshold. A B C
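The split rule reduces to comparing two direct-succession counts against a threshold, as in the sketch below. The function name is hypothetical, and the "undecided" outcome is an added assumption for the case the slide leaves open, where only one of the two counts exceeds the threshold.

```python
def split_type(b_then_c, c_then_b, threshold):
    """Classify the split after A into branches B and C.
    AND: B and C follow each other in both orders (they run in parallel).
    OR:  B and C rarely follow each other in either order (a choice)."""
    if b_then_c > threshold and c_then_b > threshold:
        return "AND"
    if b_then_c <= threshold and c_then_b <= threshold:
        return "OR"
    return "undecided"  # not covered by the slide; assumed here


parallel = split_type(b_then_c=40, c_then_b=35, threshold=10)
choice = split_type(b_then_c=0, c_then_b=2, threshold=10)
```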
26
26/34 Genetic Algorithms (1) 1. Create an initial population of workflows 2. Calculate their fitness using audit trails 3. Create a child 4. Mutate the child 5. Repeat 3 to 4 to create the new population 6. Go to 2
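The six steps can be sketched as a generic loop. Everything concrete below is a stand-in: the bit-tuple chromosome, the `sum` fitness, the selection of parents from the better half, and the elitism step are assumptions for illustration, not the workflow encoding or the audit-trail fitness the slides refer to.

```python
import random


def evolve(population, fitness, crossover, mutate, generations, rng):
    """Generic GA loop; needs at least two individuals."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)  # step 2
        parents = ranked[: max(2, len(ranked) // 2)]
        new_population = [ranked[0]]  # keep the best (elitism, an added assumption)
        while len(new_population) < len(population):  # step 5
            mother, father = rng.sample(parents, 2)
            child = crossover(mother, father, rng)      # step 3
            new_population.append(mutate(child, rng))   # step 4
        population = new_population                     # step 6
    return max(population, key=fitness)


def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(chromosome, rng):
    i = rng.randrange(len(chromosome))
    return chromosome[:i] + (1 - chromosome[i],) + chromosome[i + 1:]


rng = random.Random(0)
# Step 1: a toy initial population of bit tuples.
initial = [tuple(rng.randint(0, 1) for _ in range(8)) for _ in range(6)]
best = evolve(initial, sum, one_point_crossover, flip_one_bit, generations=20, rng=rng)
```

Because the best individual is copied into each new population, the best fitness never decreases from one generation to the next.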
27
27/34 Genetic Algorithms (2) Advantages: Can deal with duplicate tasks and non-free choice. Disadvantages: The structure of the “chromosome” How do we measure fitness? How do we do cross-over and mutation?
28
28/34 Problem Areas (1) Hidden tasks: Duplicate tasks: when tasks have the same name B C
29
29/34 Problem Areas (2) Mining non-free-choice A B C D E
30
30/34 Problem Areas (3) Mining Loops: ABCDBCD BC DA
31
31/34 Problem Areas (4) Delta analysis: how do we compare two models? Other problems: time, dealing with noise and incompleteness.
32
32/34 Using sequential patterns Mining loops? Fitness measure in a GA? Use in delta analysis? Generate the important frequent subsequences to help the designer
33
33/34 Further research in sequences How about gaps between items in different item sets? What type of frequent subsequences to use in fitness? Lifting order, is it useful in workflow generation? Further research of lifting order
34
34/34 The End Thank you for your attention Edgar de Graaf edegraaf@liacs.nl