Panagiotis Papapetrou Department of Computer Science Boston University Constraint-based Mining of Frequent Arrangements of Temporal Intervals Master Thesis.

Presentation transcript:

Panagiotis Papapetrou Department of Computer Science Boston University Constraint-based Mining of Frequent Arrangements of Temporal Intervals Master Thesis Defense

Introduction and Motivation
 Sequential pattern mining has received particular attention in the last decade:
  Database of sequences: ordered lists of instantaneous events.
  Extract frequent sequential patterns.
 In many applications, events occur over time intervals.
 Extracting frequent arrangements of these temporally correlated labeled intervals may lead to useful observations.
 So far, most algorithms concentrate on the case where events occur instantaneously.
  Several works address mining temporal patterns of interval-based events.
  However, the mining algorithms were Apriori-based, and in some cases [1] the extracted patterns were restricted to certain forms.
1. P. Kam and A. W. Fu. “Discovering temporal patterns of Interval-based Events”. In Proc. of the DaWak, pages 317–326, London, UK, Springer-Verlag.

Applications (1/4) Linguistics  ASL Database Collections of utterances. Utterance:  Associates a segment of video with a detailed transcription.  Number of ASL fields occurring over time intervals.  Syntactic Structures: Wh-Question. Negation. Yes/No Question.  Gestural Fields: Head-shake. Eye-brow raise/lower.

Applications (2/4) Linguistics (An example) > Who drove the car? (Eye-brow Lower) (Wh-Question) (Wh-Word) time (Rapid head shake)

Applications (3/4) Networks Router 1 Router 2 IPs A B (D, C) (D, B) (A, B) D C time

Applications (4/4) Biology Human Gene (Nucleotide C) (Nucleotide G) (Nucleotide A) Position in the Gene

Main Contributions  Formal definition of the problem of mining frequent temporal arrangements of intervals in an interval database using temporal and structural constraints.  Development of three algorithms: BFS-based DFS-based Prefix-based  Further improvement of the mining process with the incorporation of interestingness measures for the extracted arrangement rules.  Extensive experimental evaluation and comparison with a standard sequential pattern mining method both on real and synthetic datasets.

Outline  Preliminaries  Problem Formulation  Proposed Algorithms BFS-based DFS-based Prefix-based  Extraction of Arrangement Rules  Experimental Evaluation  Related Work  Conclusions and Future Work

Preliminaries (1/9)  There can be many types of relations between two event intervals 2.  We consider seven of them: 2. J. F. Allen and G. Ferguson. “Actions and events in interval temporal logic”. Technical Report 521, The University of Rochester, July 1994.

Preliminaries (2/9)
 Let S = {E_1, E_2, …, E_m} be an ordered set of event intervals, called an event interval sequence, or e-sequence.
 Each E_i is a triple (e_i, t_i^start, t_i^end):
  e_i: the event label.
  t_i^start: the event start time.
  t_i^end: the event end time.
  Note: S is ordered by t_i^start.
 k-e-sequence: an e-sequence of size k.
 e-sequence database D: a set of e-sequences.

Preliminaries (3/9)  Example of a 5-e-sequence: S = { (A,1,7), (B,3,19), (D,4,30), (C,7,15), (C,23,42) }
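The e-sequence definition can be sketched as a small data structure. This is a minimal illustration, not the thesis implementation: the names `EventInterval` and `make_e_sequence` are hypothetical, and breaking ties on equal start times by end time and label is an assumption the slides do not state.

```python
from typing import NamedTuple

class EventInterval(NamedTuple):
    label: str   # e_i, the event label
    start: int   # t_i^start
    end: int     # t_i^end

def make_e_sequence(intervals):
    """Order event intervals by start time (ties by end time, then label: an assumption)."""
    return sorted(intervals, key=lambda e: (e.start, e.end, e.label))

# The 5-e-sequence from the slide, given here in arbitrary order.
S = make_e_sequence([
    EventInterval("C", 23, 42),
    EventInterval("A", 1, 7),
    EventInterval("D", 4, 30),
    EventInterval("B", 3, 19),
    EventInterval("C", 7, 15),
])
```

After sorting, the intervals appear in the same order as on the slide: A, B, D, C, C.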

Preliminaries (4/9)
 k-Arrangement: a set of k temporally correlated events in an e-sequence, denoted as A = {E, R}, where:
  E: the set of labels of the event intervals in the arrangement.
  R: the set of temporal relations between the events in E. Each element of R is the temporal relation between a pair E_i and E_j.

Preliminaries (5/9)
 Given an e-sequence S and an arrangement A = {E, R}: S contains A if all the events in E appear in S, with the relations defined in R.
 Given an e-sequence database D and a minimum support threshold min_sup: an arrangement A is frequent if it is contained in at least min_sup e-sequences (i.e. records) of D.

Preliminaries (6/9)  Example of an arrangement A, contained in an e-sequence S: S = {(A,1,7), (B,3,19), (D,4,30), (C,7,15), (C,23,42)}
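A rough sketch of the containment and frequency tests, under loud assumptions. The five relation names used here (match, follow, meet, contain, overlap) are illustrative and are not the seven relations of the thesis. An arrangement is encoded as a tuple of labels plus a dictionary mapping index pairs (i, j) to the relation expected between the interval for label i and the later interval for label j. The function names are hypothetical.

```python
from itertools import product

def ordered_relation(a, b):
    """Coarse relation of b relative to a, or None if a does not come first.

    Intervals are (label, start, end) with positive duration.
    """
    _, s1, e1 = a
    _, s2, e2 = b
    if (s1, -e1) > (s2, -e2):
        return None            # a must start first (longer interval first on ties)
    if (s1, e1) == (s2, e2):
        return "match"
    if e1 < s2:
        return "follow"        # b starts strictly after a ends
    if e1 == s2:
        return "meet"
    if e2 <= e1:
        return "contain"       # b lies inside a
    return "overlap"

def contains(S, labels, rels):
    """True if some choice of distinct intervals in S realizes the arrangement."""
    slots = [[i for i, e in enumerate(S) if e[0] == lab] for lab in labels]
    for combo in product(*slots):
        if len(set(combo)) < len(combo):
            continue           # the same interval cannot play two roles
        if all(ordered_relation(S[combo[i]], S[combo[j]]) == name
               for (i, j), name in rels.items()):
            return True
    return False

def is_frequent(D, labels, rels, min_sup):
    """Count the e-sequences of D that contain the arrangement."""
    return sum(contains(S, labels, rels) for S in D) >= min_sup

S = [("A", 1, 7), ("B", 3, 19), ("D", 4, 30), ("C", 7, 15), ("C", 23, 42)]
rels = {(0, 1): "overlap", (0, 2): "meet", (1, 2): "contain"}
found = contains(S, ("A", "B", "C"), rels)
```

The exhaustive search over interval choices is exponential in the worst case; it is meant only to make the definition concrete.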

Preliminaries (7/9)
 Arrangement Rule: A = {E, R} is split into:
  A_i = {E_i, R_i}
  A_j = {E_j, R_j}
  E_i ∩ E_j = Ø
  R_ij: defines the relations between E_i and E_j.
  λ: an interestingness measure.

Preliminaries (8/9)  Example of an arrangement rule r, given an arrangement A:

Preliminaries (9/9)
 Monotone Interestingness Measures:
  Support (A) = |A| / |D|
  All-Confidence (A) = sup(A) / max{sup(A_k)}
 Anti-Monotone Interestingness Measures:
  Confidence (r) = support (r) / coverage (r)
  Lift (r) = support (r) / (cover (A) * cover (B))
  Leverage (r) = support (r) - cover (A) * cover (B)
  Conviction (r) = (1 - support (B)) / (1 - confidence (r))
 Cover (A) = |A| / |D|
 Coverage (r : A->B) = Cover (A)
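The rule measures above can be computed directly from four counts. The sketch below assumes a rule r : A -> B and takes the number of e-sequences in D, the number containing A, the number containing B, and the number containing the whole rule; the function and parameter names are hypothetical.

```python
def rule_measures(n_total, n_antecedent, n_consequent, n_rule):
    """Interestingness measures for a rule r : A -> B from raw e-sequence counts."""
    cover_a = n_antecedent / n_total      # Cover(A) = |A| / |D|
    cover_b = n_consequent / n_total      # Cover(B), the same quantity as support(B)
    support = n_rule / n_total
    confidence = support / cover_a        # coverage(r) = Cover(A)
    lift = support / (cover_a * cover_b)
    leverage = support - cover_a * cover_b
    # Conviction is unbounded when the rule always holds (confidence = 1).
    conviction = float("inf") if confidence == 1 else (1 - cover_b) / (1 - confidence)
    return {
        "support": support,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

m = rule_measures(n_total=10, n_antecedent=5, n_consequent=4, n_rule=3)
```

With these counts, support is 0.3, confidence 0.6, lift 1.5, leverage 0.1 and conviction 1.5, up to floating-point rounding.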

Problem Formulation
1. Find the complete set of frequent arrangements given:
  An e-sequence database D.
  A minimum support threshold min_sup.
2. Find the top K frequent arrangement rules given:
  An e-sequence database D.
  A minimum support threshold min_sup.
  A set of constraints C.
  An interestingness measure λ.
  An integer K.

Constraints
 Regular Expressions R: a set of regular expressions that limit the form of the extracted patterns.
 Gap Constraint C_g: the two intervals of a Follow should be separated by at most C_g units.
 Overlap Constraint C_o = {C_ol, C_or}: an Overlap should be between C_ol % and C_or %.
 Contain Constraint C_t = {C_tl, C_tr}: a Contain should be between C_tl % and C_tr %.
 Duration Constraint C_d: each event interval should have a duration of at least C_d units.
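The gap, overlap and duration constraints could be checked along the following lines. The slides do not say how the overlap percentage is measured, so measuring the shared span against the shorter of the two intervals is an assumption, as are all function names. Intervals are (start, end) pairs with positive duration.

```python
def duration_ok(interval, c_d):
    """Duration constraint: the interval lasts at least c_d units."""
    start, end = interval
    return end - start >= c_d

def gap_ok(a, b, c_g):
    """Gap constraint for a Follow: b starts at most c_g units after a ends."""
    return 0 <= b[0] - a[1] <= c_g

def overlap_percent(a, b):
    """Shared span as a percentage of the shorter interval (an assumed definition)."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    return 100.0 * max(shared, 0) / shorter

def overlap_ok(a, b, c_ol, c_or):
    """Overlap constraint: the overlap percentage lies in [c_ol, c_or]."""
    return c_ol <= overlap_percent(a, b) <= c_or

pct = overlap_percent((1, 7), (3, 19))   # shared span 4, shorter interval 6
```

A Contain constraint could be handled the same way, with the contained duration measured against the containing interval.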

Apply a sequential pattern mining algorithm?
 Consider the start and end points of an interval as two instantaneous events.
 Convert each e-sequence into a regular sequence.
 Apply an efficient sequential pattern mining algorithm + post-processing.
 Basic drawbacks:
  k-e-sequence = sequence of 2k events. May produce 2^(2k) patterns. Can we reduce it to 2^k?
  Extracted patterns will carry lots of redundant information: {A start, B start, A end, B end}, but also {A start, B start}, …
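The conversion of an e-sequence into a regular sequence of endpoint events can be sketched as follows. The function name and the `_start` / `_end` naming are hypothetical, and placing start events before end events at the same timestamp is an assumption.

```python
def to_endpoint_sequence(e_sequence):
    """Turn each (label, start, end) interval into two instantaneous events."""
    events = []
    for label, start, end in e_sequence:
        events.append((start, 0, f"{label}_start"))   # 0 sorts starts before ends on ties
        events.append((end, 1, f"{label}_end"))
    events.sort()
    return [name for _, _, name in events]

seq = to_endpoint_sequence([("A", 1, 7), ("B", 3, 19)])
```

A k-e-sequence always yields 2k events, which is where the blow-up in the number of candidate patterns comes from.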

Frequent Arrangement Mining Algorithms  Use a logical Tree-like structure to enumerate the arrangements 4.  Traverse the Tree using: BFS DFS Hybrid DFS  BFS for the first two levels.  DFS for the rest of the mining process. 4. R. J. Bayardo. “Efficiently mining long patterns from databases”. In Proc. of ACM SIGMOD, pages 85–93, 1998.

The Arrangement Enumeration Tree Let LEVEL 3 LEVEL 2 LEVEL 1 Intermediate

BFS-based Approach (1/4)
 Traverse the Tree in BFS order.
 2 database scans.
 On each step k:
  Build candidate k-arrangements based on (k-1)-arrangements.
  Find 2-relations by scanning the second level of the Tree.
  Determine frequency: the min_sup threshold must be satisfied.
  If a node is not frequent, do not expand its sub-tree (Apriori Principle) 5.
 Stop at step k, where no frequent arrangements are found.
5. R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. In Proc. of VLDB, pages , 1994.
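The level-wise loop can be illustrated on a deliberately simplified problem. The sketch below mines frequent sets of labels only and ignores the temporal relations entirely, so it is not the BFS arrangement miner of the thesis; it only shows candidate generation from level k-1, support counting, and Apriori pruning. All names are hypothetical.

```python
from itertools import combinations

def levelwise_frequent_label_sets(D, min_sup):
    """D is a list of label sets, one per e-sequence; returns {label set: support}."""
    def support(candidate):
        return sum(1 for labels in D if candidate <= labels)

    items = sorted({label for labels in D for label in labels})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
    frequent = {}
    k = 1
    while level:                                   # stop when a level is empty
        for candidate in level:
            frequent[candidate] = support(candidate)
        k += 1
        previous = set(level)
        # Join frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in previous for sub in combinations(c, k - 1))
                 and support(c) >= min_sup]
    return frequent

D = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
result = levelwise_frequent_label_sets(D, min_sup=3)
```

Here every single label and every pair is frequent, while {A, B, C} appears in only two e-sequences and is dropped at level 3.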

BFS-based Approach (2/4) An Example

BFS-based Approach (3/4) Creating a 2-arrangement (Example)

BFS-based Approach (4/4) Creating a 3-arrangement (Example)

DFS-based Approach  Candidate generation in DFS order.  Leads to frequent large arrangements faster.  Skips expansions of nodes that are definitely going to lead to frequent arrangements.  DFS is inappropriate: For each node we would have to scan the database multiple times to detect the 2-relations among the items in the node.  Hybrid-DFS Generates the first two levels of the Tree using BFS, then uses DFS. Eliminates multiple database scans, 2-relations are available.

Support Counting

Prefix-based Approach (1/8) The Sequential Approach  Prefix and Suffix (Projection),, and are prefixes of sequence Given sequence PrefixSuffix (Prefix-Based Projection)

Prefix-based Approach (2/8) Example Sequence_id Sequence (min_sup=2)

Prefix-based Approach (3/8) The Sequential Approach (continued) Step1: Find length-1 sequential patterns; :4, :4, :4, :3, :3, :3 pattern support Step2: Divide search space; six subsets according to the six prefixes; Step3: Find subsets of sequential patterns; By constructing corresponding projected databases and mine each recursively.
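The three steps above can be sketched for the simplest case, where each sequence is a list of single items rather than itemsets. This simplification, and the names `project` and `prefix_span`, are assumptions made for illustration; the sketch is not the interval-based variant.

```python
def project(sequences, item):
    """Keep, for each sequence containing item, the suffix after its first occurrence."""
    projected = []
    for seq in sequences:
        if item in seq:
            suffix = seq[seq.index(item) + 1:]
            if suffix:                      # empty suffixes cannot extend the prefix
                projected.append(suffix)
    return projected

def prefix_span(sequences, min_sup, prefix=()):
    """Return {pattern tuple: support} for all frequent sequential patterns."""
    counts = {}
    for seq in sequences:
        for item in set(seq):               # count each item once per sequence
            counts[item] = counts.get(item, 0) + 1
    patterns = {}
    for item in sorted(counts):
        if counts[item] >= min_sup:
            new_prefix = prefix + (item,)
            patterns[new_prefix] = counts[item]
            # Recurse on the database projected with respect to the new prefix.
            patterns.update(prefix_span(project(sequences, item), min_sup, new_prefix))
    return patterns

db = [list("abc"), list("acb"), list("ab")]
patterns = prefix_span(db, min_sup=2)
```

Projecting on the first occurrence is sufficient for plain sequences; the interval-based approach described later needs every occurrence.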

Prefix-based Approach (4/8) Example (continued) Sequence_id Original Sequences Projected Sequences  New locally frequent items: a : 2 b : 4 d : 2 c : 4 f : 3

Prefix-based Approach (5/8) Example (continued) Sequence_id Original Sequence Projected Sequences Sequence_id Original Sequences Projected Sequences

Prefix-based Approach (6/8) The Interval-based Approach
 Use a definition for the projection of an e-sequence S with respect to an arrangement A similar to that of the sequential approach.
 Problem: may skip frequent patterns.
 Solution: find every occurrence of A in S and project with respect to each one of them.

Prefix-based Approach (7/8) An Example of A Projection

Prefix-based Approach (8/8) An Example That Works And One That Does Not

Extracting Arrangement Rules
 Discover the top K arrangement rules that maximize a given interestingness measure λ.
 How deep can we push λ in the mining process? Depends on antimonotonicity.
  If λ is antimonotone: can prune a subset of the candidate arrangement rules.
  If λ is non-antimonotone: pruning cannot be done.
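Keeping only the top K rules can be done with a bounded min-heap, so that the K-th best score seen so far is always available as a threshold. This is a generic sketch, not the thesis procedure; rules are represented here as plain strings with precomputed λ scores, and the function name is hypothetical.

```python
import heapq

def top_k_rules(scored_rules, k):
    """Return the k (rule, score) pairs with the highest scores, best first."""
    heap = []   # min-heap of (score, arrival index, rule); heap[0] is the weakest kept rule
    for n, (rule, score) in enumerate(scored_rules):
        if len(heap) < k:
            heapq.heappush(heap, (score, n, rule))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, n, rule))   # evict the current weakest rule
    return [(rule, score) for score, _, rule in sorted(heap, reverse=True)]

best = top_k_rules(
    [("A->B", 0.6), ("B->C", 0.9), ("A->C", 0.4), ("C->D", 0.75)],
    k=2,
)
```

When λ allows pruning, `heap[0][0]` is the score a new candidate has to beat once K rules have been collected.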

Non-Antimonotone λ (1/2)  First discover the set of frequent arrangements.  The set of constraints C is applied during the mining process.  Infer the arrangement rules from the extracted patterns after the completion of the mining process.

Non-Antimonotone λ (2/2)
 Given a frequent arrangement A = {E, R}.
 A is split into: A_i = {E_i, R_i} and A_j = {E_j, R_j}.
 A rule r is defined.
 If r satisfies λ, add it into the set of valid rules.

Antimonotone λ
 If A is reached and no valid rule is inferred from A: the sub-tree of A is pruned.
 Otherwise, a set of rules R_A exists for node A. For each new arrangement C = {E, R}:
  E is split into E_1 and E_2.
  If A_i = {E_i, R_i} is in the antecedent part of any rule in R_A such that E 1 C, then E_1 cannot be the antecedent part of any rule inferred from C. The split is skipped.

Experimental Setup (1/4) Real Datasets  SignStream Database Created by the National Center for Sign Language and Gesture Resources at Boston University. Collection of 884 utterances. Some types of event labels:  Grammatical or syntactic structures: Wh-Question. Negation. Yes/No Question.  Gestural Fields: Head-shake. Eye-brow raise/lower.

Experimental Setup (2/4) Real Datasets  Network Data Sampled from flow data. Two routers with high communication rate:  ATLA: router in Atlanta.  LOSA: router in LA. Monitored communication for 10 days, between 200 IPs. An e-sequence is a set of IP connections for every 15 minutes:  An event label is the two IPs (source-destination).  The interval corresponds to the duration of this communication. Size of dataset: 960 e-sequences.

Experimental Setup (3/4) Synthetic Datasets  Generated considering the following factors: Number of e-sequences in the Database. Average e-sequence size. Number of distinct items. Density of frequent patterns.

Experimental Setup (4/4) Algorithms  Compared: BFS. Hybrid-DFS. Prefix-based. SPAM 6, modified as follows:  Considered the start and end points of each interval as two instantaneous events.  Post-processed the extracted sequential patterns to convert them into arrangements. 6. J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using a bitmap representation. In Proc. of ACM SIGKDD, pages 429–435, 2002.

Performance Analysis  BFS outperforms SPAM in large database sizes and small supports.  Hybrid-DFS outperforms both SPAM and BFS.  In low supports Hybrid-DFS is twice as fast as BFS.  In all cases the Prefix-based algorithm performs worse.

Sample Results (1/4) SignStream Database

Sample Results (2/4) SignStream Database  Negations:  YES/NO Questions:

Sample Results (3/4) SignStream Database  WH-questions: For more detailed results visit the following web page:

Sample Results (4/4) Network Dataset

Performance of Different Interestingness Measures ASL Dataset

Some Arrangement Rules (1/2) ASL Dataset

Some Arrangement Rules (2/2) ASL Dataset

Related Work (1/2)  Problem of sequential pattern mining first introduced in: R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In proc. of VLDB, pages ,  An extension to episodes (i.e. combinations of events with a partially specified order) was proposed in: H. Mannila, H. Toivonen, and A. Verkamo. Discovering Frequent episodes in sequences. In Proc. of ACM SIGKDD, pages 210–215,  The Itemset Enumeration Tree was described in: R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. of ACM SIGMOD, pages 85–93,  Some efficient sequential pattern mining algorithms have been proposed in: M. Zaki. Spade: An efficient algorithm for mining sequences. Machine Learning, 40:31–60, J. Ayres, J. Gehrke, T. Yiu, and J. Flannick. Sequential pattern mining using a bitmap representation. In Proc. of ACM SIGKDD, pages 429–435, 2002.

Related Work (2/2)  Closed sequential pattern mining: X. Yan, J. Han, and R. Afshar. Clospan: Mining closed sequential patterns in large databases. In Proc. of SDM, J.Wang and J. Han. Bide: Efficient mining of frequent closed sequences. In Proc. of IEEE ICDE, pages 79–90,  Mining association rules in temporal and spatio-temporal databases: T. Abraham and J. F. Roddick. Incremental meta-mining from large temporal data sets. In ER ’98: Proceedings of the Workshops on Data Warehousing and Data Mining, pages 41–54, X. Chen and I. Petrounias. Mining temporal features in association rules. In Proc. of PKDD, pages 295–300, London, UK, Springer-Verlag. I. Tsoukatos and D. Gunopulos. Efficient mining of spatiotemporal patterns. In Proc. of the SSTD, pages 425–442,  Discovering temporal patterns of Interval-based Events: P. Kam and A. W. Fu. Discovering temporal patterns of Interval-based Events. In Proc. of the DaWak, pages 317–326, London, UK, Springer-Verlag.

Conclusions  The problem of constraint-based mining of frequent arrangements of temporal intervals has been formally defined.  Three efficient methods for solving the problem have been discussed.  An efficient algorithm for applying interestingness measures on the discovered patterns and extracting interesting arrangement rules has been proposed.  Both BFS and DFS approaches use an arrangement enumeration tree to discover the set of frequent arrangements.  The DFS-based approach further improves performance over BFS: Longer arrangements are reached faster. The need to examine smaller subsets of these arrangements is eliminated.  The Prefix-based approach performs worse due to projections.

Future Work  Apply our algorithms on biological sequences: DNA. Proteins.  Consider e-sequences with categorical domains: Series of medical treatments for a disease. Result (Cure/Death).

EXTRA SLIDES

Apply a closed sequential pattern mining algorithm 3 ?  Noise again… {A start, B start, A end, B end }:2/3 But also: {A start, A end, B end }:3/3 3. J.Wang and J. Han. “Bide: Efficient mining of frequent closed sequences”. In Proc. of IEEE ICDE, pages 79–90, 2004.

The ISIdList Structure (1/2)
 An ISIdList is defined for every arrangement generated throughout the mining process.
 The ISIdList for an arrangement A = {E, R} in an e-sequence database D has the following structure:
  Head: arrangement representation using E and R.
  A record for each e-sequence in the database that supports A.
  Each record is of type (id, intv-List), where:
   id is the id of the e-sequence in D.
   intv-List: a set of intervals where A occurs in the e-sequence (for |E| ≤ 2), or a set of pointers to records of ISIdLists of the second level (for |E| > 2).
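One possible in-memory shape for this structure is sketched below. The field and method names are hypothetical, and the pointer records for larger arrangements are not modeled; each record simply maps an e-sequence id to a list of occurrences.

```python
from dataclasses import dataclass, field

@dataclass
class ISIdList:
    labels: tuple                                  # head: the label set E
    relations: dict                                # head: the relation set R
    records: dict = field(default_factory=dict)    # e-sequence id -> intv-List

    def add_occurrence(self, seq_id, occurrence):
        """Record one occurrence of the arrangement in e-sequence seq_id."""
        self.records.setdefault(seq_id, []).append(occurrence)

    def support(self):
        """Number of e-sequences with at least one record."""
        return len(self.records)

# ISIdList of the 1-arrangement {A}; the occurrences are made-up intervals.
a_list = ISIdList(labels=("A",), relations={})
a_list.add_occurrence(1, (1, 7))
a_list.add_occurrence(1, (20, 25))
a_list.add_occurrence(3, (2, 9))
```

Counting records rather than occurrences matches the frequency test, which compares the number of supporting e-sequences with min_sup.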

The ISIdList Structure (2/2) (Example)  Let D consist of a set of e-sequences of event intervals with labels A, B, C.  The set of frequent 1-arrangements is {A, B, C}, with the following ISIdLists:

BFS-based Approach At each Step k:  Use Tree to generate candidate arrangements:  Build N(k) from N(k-1).  Construct IM k. For every 2-relation, point to the second level of the Tree.  Check support. If it satisfies min_sup, then add to F k.  Continue with the rest of the Tree in a BFS order.  If a node is found not to be frequent, do not expand its sub-tree (Apriori Principle )3.  Stop at step k, where F k = empty. 3. R. Agrawal and R. Srikant. “Fast algorithms for mining association rules”. In Proc. of VLDB, pages , 1994.

BFS-based Approach (1/4)
 D: an input e-sequence database.
 F: the complete set of frequent arrangements.
 F_k: the complete set of frequent k-arrangements.
 C_k: the current set of candidate k-arrangements.
 min_sup: the minimum support threshold.
 ISIdList(A): the ISIdList of arrangement A.

BFS-based Approach (2/4)
 BFS: STEP 1: Find F_1
  Use Tree to generate C_1.
  Build N(1).
  For each n_i^1 in N(1):
   Build ISIdList(A_i), where A_i is the arrangement that corresponds to n_i^1.
   If the number of records in ISIdList(A_i) is at least min_sup, then A_i is inserted into F_1.

BFS-based Approach (3/4)
 BFS: STEP k: Find F_k
  Use Tree to generate C_k.
  Build N(k) from N(k-1).
  Construct IM_k.
  For each node in IM_k:
   Build ISIdList.
   If the number of records in the ISIdList is at least min_sup, insert the arrangement into F_k.
  Continue with the rest of the Tree in a BFS order.

BFS-based Approach (4/4)  Continue with the rest of the Tree in a BFS order.  If a node is found not to be frequent, do not expand its sub-tree (Apriori Principle) 1.  Stop at step k, where F k = empty. 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In proc. of VLDB, pages , 1994.

Hybrid DFS-based Approach  DFS is inappropriate: For each node we would have to scan the database multiple times to detect the 2-relations among the items in the node. Though in BFS these relations are already available.  Generate the first two levels of the Tree using BFS.  Then use DFS.  Eliminates multiple database scans, since now the 2-relations are available.

Experimental Setup Real Datasets  Dataset 1: Utterances of WH-Questions. Size: 73 e-sequences. # of labels: 400.  Dataset 2: SignStream Database. Size: 884 e-sequences. # of labels: 400.