A Fast Algorithm For Finding Frequent Episodes In Event Streams


A Fast Algorithm For Finding Frequent Episodes In Event Streams Srivatsan Laxman, P. S. Sastry , K. P. Unnikrishnan SIGKDD 2007

Introduction Frequent episode discovery is a popular framework for temporal data mining. The data is a single long sequence of ordered pairs, (Ei, ti), which are called events. The temporal patterns, referred to as episodes, are essentially small, (partially) ordered collections of event types.

Introduction When events of the appropriate types appear in the data sequence in the same order as in the episode, these events are said to constitute an occurrence of the episode. In this paper, we present a new algorithm for frequent episode discovery under the frequency count based on non-overlapped occurrences. The space complexity is the same as the number of candidates input to the algorithm, and the time complexity is linear in the number of candidates and in the total number of events in the data stream.

FREQUENT EPISODES MINING FRAMEWORK An episode, α, is defined by a triple (Vα, ≤α, gα), where Vα is a collection of nodes, ≤α is a partial order on Vα, and gα : Vα → ε associates each node of α with an event type from the alphabet ε. When ≤α is a total order on the nodes of α, the episode is referred to as a serial episode; when ≤α is trivial (empty), the episode is referred to as a parallel episode.

FREQUENT EPISODES MINING FRAMEWORK Given an event sequence <(E1, t1), ..., (En, tn)>, an occurrence of episode α = (Vα, ≤α, gα) in this sequence is an injective map h : Vα → {1, ..., n} such that gα(v) = Eh(v) for all v ∈ Vα, and th(v) ≤ th(w) for all v, w ∈ Vα with v ≤α w. An episode β is said to be a subepisode of α if all the event types in β appear in α as well, and if the partial order among the event types of β is the same as that among the corresponding event types in α.
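As a concrete illustration (ours, not the paper's), a serial episode can be represented simply as a tuple of event types; the subepisode relation then reduces to a subsequence check. A minimal Python sketch:

```python
def is_serial_subepisode(beta, alpha):
    """For serial episodes represented as tuples of event types,
    beta is a subepisode of alpha iff beta occurs as a (not
    necessarily contiguous) subsequence of alpha."""
    it = iter(alpha)
    return all(e in it for e in beta)  # 'in' consumes the iterator

print(is_serial_subepisode(('A', 'C'), ('A', 'B', 'C')))  # -> True
print(is_serial_subepisode(('C', 'A'), ('A', 'B', 'C')))  # -> False
```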

FREQUENT EPISODES MINING FRAMEWORK Two occurrences of an episode are said to be non-overlapped if no event corresponding to one occurrence appears in between events corresponding to the other.

FREQUENT EPISODES MINING FRAMEWORK Counting all occurrences is very inefficient, because many instances of an episode's automaton are needed to keep track of all possible state transitions. Example: <(A,1), (A,2), (B,3), (A,7), (C,8), (B,9), (B,10), (D,11), (C,12), (C,13)>
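By contrast, the non-overlapped count of a single serial episode needs only one greedy automaton that resets whenever it reaches its final state. The following sketch (our illustration, not the paper's pseudocode) counts non-overlapped occurrences in the example sequence above:

```python
def count_nonoverlapped_serial(events, episode):
    """Count non-overlapped occurrences of one serial episode.

    events  : time-ordered list of (event_type, time) pairs.
    episode : tuple of event types, e.g. ('A', 'B', 'C').

    A single automaton suffices: advance on the next expected event
    type; on reaching the final state, count one occurrence and reset.
    """
    count, state = 0, 0
    for ev, _t in events:
        if ev == episode[state]:
            state += 1
            if state == len(episode):
                count += 1
                state = 0
    return count

seq = [('A', 1), ('A', 2), ('B', 3), ('A', 7), ('C', 8),
       ('B', 9), ('B', 10), ('D', 11), ('C', 12), ('C', 13)]
print(count_nonoverlapped_serial(seq, ('A', 'B', 'C')))  # -> 1
print(count_nonoverlapped_serial(seq, ('A', 'B')))       # -> 2
```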

FREQUENT EPISODES MINING FRAMEWORK Each occurrence h of episode α is associated with a set of events {(Eh(vi), th(vi)) : vi ∈ Vα} in the data stream. Two occurrences h1 and h2 of an episode α are said to be distinct if they do not share any events in the event sequence, i.e., if h1(vi) ≠ h2(vj) ∀ vi, vj ∈ Vα.


Frequency counting algorithm for serial episodes In order to access these automata efficiently, they are indexed using waits(·) lists. The automata currently waiting for an event of type A can be accessed through waits(A). Each element of a waits(·) list is an ordered pair (α, j): (α, j) ∈ waits(A) means that an automaton for α is waiting for an event of type A to appear in the data in order to complete its transition into state j.

Frequency counting algorithm for serial episodes Automaton transitions are effected, and fresh automata for an episode are initialized, by adding and removing elements from the appropriate waits(·) lists. A temporary store called bag is used whenever elements must be added to the very waits(·) list over which we are currently looping.


Frequency counting algorithm for serial episodes If an automaton reaches its final state, a new automaton for the episode is initialized by adding (α, 1) to waits(α[1]). This addition happens inside the loop over the elements of waits(Ei); hence, whenever we want to add an element to waits(Ei), it is first stored in bag, which is emptied into waits(Ei) after exiting the loop.
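The waits(·)/bag bookkeeping described above can be sketched in Python as follows. This is a simplified illustration that keeps one automaton per candidate episode (function and variable names are ours, not the paper's):

```python
from collections import defaultdict

def count_serial_episodes(events, candidates):
    """One pass over the event stream, counting non-overlapped
    occurrences of every candidate serial episode via waits() lists.

    candidates: list of serial episodes, each a tuple of event types.
    An entry (idx, j) in waits[A] means: the automaton of
    candidates[idx] is waiting for an event of type A to complete
    its transition into state j.
    """
    waits = defaultdict(list)
    freq = [0] * len(candidates)
    for idx, alpha in enumerate(candidates):
        waits[alpha[0]].append((idx, 1))       # wait to enter state 1
    for ev, _t in events:
        bag = []                               # deferred additions to waits[ev]
        current, waits[ev] = waits[ev], []
        for idx, j in current:                 # this event completes the move to state j
            alpha = candidates[idx]
            if j == len(alpha):                # final state: one more occurrence
                freq[idx] += 1
                j = 0                          # re-initialize a fresh automaton
            nxt = alpha[j]                     # event type needed next
            if nxt == ev:
                bag.append((idx, j + 1))       # cannot append to waits[ev] mid-loop
            else:
                waits[nxt].append((idx, j + 1))
        waits[ev].extend(bag)                  # empty the bag after the loop
    return freq

seq = [('A', 1), ('A', 2), ('B', 3), ('A', 7), ('C', 8),
       ('B', 9), ('B', 10), ('D', 11), ('C', 12), ('C', 13)]
print(count_serial_episodes(seq, [('A', 'B'), ('A', 'B', 'C'), ('B', 'C')]))  # -> [2, 1, 2]
```

The bag keeps the loop over waits[ev] safe: an automaton whose next required event type is again ev would otherwise be appended to the list currently being iterated.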

Frequency counting algorithm for serial episodes Space and time complexity: The space complexity is O(|C|). The initialization time is O(|C| + |ε|). The time required for the actual data pass is linear in the length n of the data sequence, so the overall time complexity is O(n|C|).

Frequency counting algorithm for parallel episodes The difference when recognizing occurrences of parallel episodes (as compared to serial episodes) is that there is no need to worry about the order in which events occur. Instead, we ask whether each event type in the episode has occurred as many times as the episode prescribes.

Frequency counting algorithm for parallel episodes Each entry in the list waits(A) is an ordered pair (α, j), which now indicates that there is a partial occurrence of α that still needs j more events of type A before it can become a complete occurrence. Two quantities are associated with each episode α: α.freq, which stores the frequency, and α.counter, which counts the events in the sequence that constitute the current partial occurrence.


Frequency counting algorithm for parallel episodes If (α, j) ∈ waits(Ei) with j > 0, i.e., if the current partial occurrence has not yet accounted for enough events of type Ei, then (α, j) is replaced by (α, j − 1) in waits(Ei) and α.counter is incremented, indicating that the partial occurrence of α has progressed by one more node.
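The update rule above can be sketched as follows (our illustration; a Python dict plays the role of each waits(·) list, and counter[idx] stands in for α.counter):

```python
from collections import Counter, defaultdict

def count_parallel_episodes(events, candidates):
    """One pass counting non-overlapped occurrences of parallel
    episodes. waits[A][idx] == j means the current partial
    occurrence of candidates[idx] still needs j events of type A.
    """
    waits = defaultdict(dict)
    freq = [0] * len(candidates)
    counter = [0] * len(candidates)

    def init(idx):
        # one waits entry per distinct event type, with its multiplicity
        for ev_type, mult in Counter(candidates[idx]).items():
            waits[ev_type][idx] = mult

    for idx in range(len(candidates)):
        init(idx)
    for ev, _t in events:
        for idx, j in list(waits[ev].items()):
            if j > 0:                      # this event type is still needed
                waits[ev][idx] = j - 1
                counter[idx] += 1          # partial occurrence grew by one node
                if counter[idx] == len(candidates[idx]):
                    freq[idx] += 1         # occurrence complete
                    counter[idx] = 0
                    init(idx)              # re-initialize the waits entries
    return freq

seq = [('A', 1), ('A', 2), ('B', 3), ('A', 7), ('C', 8),
       ('B', 9), ('B', 10), ('D', 11), ('C', 12), ('C', 13)]
print(count_parallel_episodes(seq, [('A', 'B'), ('A', 'B', 'C'), ('B', 'B')]))  # -> [2, 1, 1]
```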

Frequency counting algorithm for parallel episodes Space and time complexity: Each waits(·) list can have at most |C| entries, so the space needed is O(N|C|) (there can be at most N distinct event types in an episode of size N). The initialization time is O(|ε| + N|C|). The main loop runs over the n events in the data, and each waits(·) loop is over at most |C| partial occurrences. Re-initializing the appropriate waits(·) lists when an occurrence completes takes O(N) time, and this re-initialization is needed at most n/N times per episode. The total worst-case time complexity is O(n|C|).

Simulations Each time the next event is to be generated, we first decide whether it will be drawn randomly with a uniform distribution over all event types (an iid event) or generated according to one of the temporal patterns to be embedded. This is controlled by the parameter ρ, the probability that the next event is iid.

Simulations If ρ = 1, the data is simply iid noise with no temporal patterns embedded. If the next event is to come from one of the temporal patterns, then we have a choice between continuing a pattern that is already partially embedded and starting a new occurrence of one of the patterns; this choice is also made randomly.
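The generation scheme can be sketched as below. The 50/50 continue-vs-start split, the seed, and all names here are illustrative assumptions; the slides do not specify these details:

```python
import random

def generate_sequence(n, patterns, rho, alphabet, seed=0):
    """Synthetic event stream in the spirit of the slides' setup.

    With probability rho the next event is iid uniform over the
    alphabet; otherwise it either continues a partially embedded
    pattern or starts a new occurrence of a randomly chosen pattern.
    """
    rng = random.Random(seed)
    partial = []                       # (pattern, next position) in progress
    out = []
    for t in range(1, n + 1):
        if rng.random() < rho:         # iid noise event
            out.append((rng.choice(alphabet), t))
            continue
        if partial and rng.random() < 0.5:
            k = rng.randrange(len(partial))            # continue a pattern
        else:
            partial.append((rng.choice(patterns), 0))  # start a new one
            k = len(partial) - 1
        pat, pos = partial[k]
        out.append((pat[pos], t))
        if pos + 1 == len(pat):
            partial.pop(k)             # pattern fully embedded
        else:
            partial[k] = (pat, pos + 1)
    return out

seq = generate_sequence(20, [('A', 'B', 'C')], 0.5, ['A', 'B', 'C', 'D', 'E'])
print(len(seq))  # -> 20
```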
