Discovering Lag Interval For Temporal Dependencies
Liang Tang, Tao Li, Larisa Shwartz

An Example for Time Lag
Disk_Capacity →[5min,6min] Database, where [5min,6min] is the lag interval.
Why is the time lag important? If the time lag is close to 0, the database is writing a huge log. If the time lag is larger than 0, the disk is really full.

Problem Definition
Our problem: Given a temporal dependency A → B (when event A happens, B will also happen), what is the time lag between the dependent events A and B?
Why study this problem: The time lag indicates the cause of the temporal dependency.

Related Work
Existing approaches either ask the user to predefine a time window for analyzing the event associations (the user may not know a suitable window), or assume the temporal dependency is not interleaved (two dependent events A and B have no other A or B between them).
[Figure: overlapping (interleaved) dependencies]

Relation with Other Temporal Patterns
Other temporal patterns can be seen as temporal dependencies with particular constraints on the time lag.

Challenges for Finding Time Lag
Given a temporal dependency A →[t1,t2] B, what kind of lag interval [t1,t2] do we want to find?
If the lag interval is too large, every A and every B would be "dependent".
If the lag interval is too small, truly dependent A and B might not be captured.
The time complexity is too high: in A →[t1,t2] B, t1 and t2 can each be the distance between any two time stamps, so there are O(n^4) possible lag intervals.

What Is a Qualified Lag Interval
If [t1,t2] is qualified, we should observe many occurrences of A →[t1,t2] B.

Lag Interval    Number of Occurrences
[0,1]           3
[5,6]           4
[0,6]           4
[0,+∞]          4

The longer the lag interval, the larger the number of occurrences becomes.
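The counting step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes an occurrence means "an A that has at least one B whose lag falls in [t1,t2]", and the function name and the timestamps in the usage note are hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_occurrences(a_times, b_times, t1, t2):
    """Count the As that have at least one B with lag in [t1, t2].

    Assumption: an "occurrence" is counted once per A, which matches
    counting distinct A indexes; the slides do not spell this out here.
    """
    b_sorted = sorted(b_times)
    count = 0
    for ta in a_times:
        # Bs whose time stamp lies in [ta + t1, ta + t2]
        lo = bisect_left(b_sorted, ta + t1)
        hi = bisect_right(b_sorted, ta + t2)
        if hi > lo:
            count += 1
    return count
```

With the hypothetical sequences `a_times = [0, 10, 20, 30]` and `b_times = [5, 16, 21, 25, 35]`, widening the interval from `[0, 1]` to `[0, 6]` can only keep or increase the count, which is why the raw count alone cannot decide whether an interval is qualified.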

What Is a Qualified Lag Interval
Intuition: If B is randomly and independently distributed, how many occurrences would be observed in a time interval [t1,t2]? What is the minimum number of occurrences?
Consider the number of occurrences in a lag interval to be a variable, n_r. Then use the chi-square test to judge whether the observed value is caused by randomness or not.
[Formula annotations on the slide: the number of As; the time frame of the event sequence; the expected value]
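The formula itself did not survive in this transcript, so the following is only a sketch of one way such a test could be set up. It assumes each A independently has probability p = |r| · N_B / T of seeing a B inside an interval of length |r| when B is random, treats n_r as binomial, and uses the standard one-degree-of-freedom chi-square statistic. All names are hypothetical.

```python
def chi_square(n_r, n_a, n_b, total_time, interval_len):
    """Chi-square statistic for n_r observed occurrences.

    Assumed null model (not taken from the slides): each of the n_a
    As sees a random B in the interval with probability
    p = interval_len * n_b / total_time.
    """
    p = min(1.0, interval_len * n_b / total_time)
    expected = n_a * p               # expected occurrences under randomness
    variance = n_a * p * (1.0 - p)   # binomial variance
    if variance == 0.0:
        return 0.0
    return (n_r - expected) ** 2 / variance


# 3.84 is the 95% critical value of chi-square with 1 degree of freedom.
CHI2_95 = 3.84


def is_qualified(n_r, n_a, n_b, total_time, interval_len):
    """True when n_r is significantly above the random expectation."""
    p = min(1.0, interval_len * n_b / total_time)
    stat = chi_square(n_r, n_a, n_b, total_time, interval_len)
    return n_r > n_a * p and stat > CHI2_95
```

Because the expected value grows with the interval length, a longer interval needs more occurrences to pass the test, which is the behavior the slide describes.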

Brute-Force Algorithm
Algorithm: For A →[t1,t2] B, for every possible t1 and t2, scan the event sequence and count the number of occurrences.
Time complexity:
The number of distinct time stamps is O(n).
The number of possible values for each of t1 and t2 is O(n^2).
The number of possible intervals [t1,t2] is O(n^4).
Each scan is O(n).
The total cost is O(n^5), so the algorithm cannot handle large event sequences.
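A brute-force search along these lines might look like the sketch below. It is illustrative only: `min_count` stands in for whatever qualification test is used, the function name is hypothetical, and the inner scan is written naively (quadratic rather than the linear scan the slide assumes) to keep the code short.

```python
def brute_force(a_times, b_times, min_count):
    """Try every candidate [t1, t2] and count occurrences by rescanning.

    Candidate endpoints are the distinct non-negative lags tb - ta.
    Returns (t1, t2, count) for every interval with count >= min_count.
    """
    lags = sorted({tb - ta for ta in a_times for tb in b_times if tb >= ta})
    results = []
    for i, t1 in enumerate(lags):
        for t2 in lags[i:]:
            # Rescan the whole sequence for this interval (naive on purpose).
            n = sum(
                1
                for ta in a_times
                if any(t1 <= tb - ta <= t2 for tb in b_times)
            )
            if n >= min_count:
                results.append((t1, t2, n))
    return results
```

Every interval triggers a full rescan, which is the redundancy that a sorted table of time lags is meant to remove.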

Maximum Length of Qualified Lag Interval
The length of a qualified lag interval cannot be very long: when you increase the length of the lag interval, the minimum threshold for the number of occurrences also increases.
Lemma 2: The length of any qualified lag interval is less than T/N · 1/minsup.
[Formula annotation on the slide: event sample rate (the polling interval in system monitoring, a small constant)]

STScan Algorithm
Idea: Avoid redundant scanning by storing all time lags in a sorted table.
Example: t(x_5) - t(x_3) = 20. E_2 is 20, so insert 3 into IA_2 and insert 5 into IB_2.

STScan Algorithm
Every lag interval is represented as a sub-segment of the linked list. For example, [20,120] is E_2 E_3 E_4, and its number of occurrences is |IA_2 ∪ IA_3 ∪ IA_4|.
The time cost for creating this table is O(n^2).
The number of elements is O(3n^2) = O(n^2).
The time cost for scanning is O(n^2).
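The sorted table and the sub-segment scan can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a Python list of `(lag, (IA, IB))` entries replaces the linked list, `min_count` replaces the real qualification test, and `max_len` is a hypothetical parameter standing in for the maximum-length bound.

```python
def build_sorted_table(a_times, b_times):
    """Map each distinct lag E_k to the index sets IA_k and IB_k."""
    table = {}
    for i, ta in enumerate(a_times):
        for j, tb in enumerate(b_times):
            lag = tb - ta
            if lag >= 0:
                ia, ib = table.setdefault(lag, (set(), set()))
                ia.add(i)
                ib.add(j)
    return sorted(table.items())  # ascending by lag


def scan_segments(table, min_count, max_len=None):
    """Visit sub-segments E_s .. E_e and count distinct As by set union."""
    results = []
    for start in range(len(table)):
        seen_a = set()
        for end in range(start, len(table)):
            lag, (ia, _ib) = table[end]
            if max_len is not None and lag - table[start][0] > max_len:
                break  # interval is longer than the assumed bound
            seen_a |= ia  # |IA_s ∪ ... ∪ IA_e|
            if len(seen_a) >= min_count:
                results.append((table[start][0], lag, len(seen_a)))
    return results
```

Extending a sub-segment by one node only needs one more union, so no interval requires rescanning the original event sequence.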

STScan* Algorithm
Problem of STScan: The O(n^2) space cost is large enough to run out of memory.
Observation: STScan only scans one sub-segment at a time and never goes back.
Solution: Create the sorted table incrementally while scanning.

STScan* Algorithm
Sort events by time stamp.
We have visited the lag interval of sub-segment E_4 E_5. The next lag interval is sub-segment E_5 E_6, so we first need to create E_6.
Legend: A_k is the k-th A; B_k is the k-th B.

STScan* Algorithm
The time lags pointed to by A_2 and A_4 have the smallest value, 24, so E_6 = 24.
Move the pointers of A_2 and A_4 to the next position.
Create links from E_6 to A_2 and A_4.
Legend: A_k is the k-th A; B_k is the k-th B.

STScan* Algorithm
For every A, only keep the pointer to the index of its next B.
Merge the time lag lists of all As (as in merge sort).
Only O(n · |r|_max) links are kept, so the space cost is O(n), where |r|_max is the maximum length of a qualified interval.
Legend: A_k is the k-th A; B_k is the k-th B.
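The incremental construction can be sketched as a k-way merge. The version below is an assumption-laden illustration rather than the authors' code: it uses a binary heap to pick the smallest pending lag (the slides only say "like merge-sort"), and the function names are hypothetical. Each A keeps a single pointer to its next B, so the heap never holds more than one entry per A.

```python
import heapq
from bisect import bisect_left


def iter_sorted_lags(a_times, b_times):
    """Yield (lag, a_index, b_index) in nondecreasing lag order.

    b_index refers to the position in the sorted list of B time stamps.
    """
    b_sorted = sorted(b_times)
    heap = []
    for i, ta in enumerate(a_times):
        j = bisect_left(b_sorted, ta)  # first B at or after this A
        if j < len(b_sorted):
            heap.append((b_sorted[j] - ta, i, j))
    heapq.heapify(heap)
    while heap:
        lag, i, j = heapq.heappop(heap)
        yield lag, i, j
        if j + 1 < len(b_sorted):
            # Advance this A's pointer to its next B.
            heapq.heappush(heap, (b_sorted[j + 1] - a_times[i], i, j + 1))


def iter_nodes(a_times, b_times):
    """Group equal lags into nodes (E_k, IA_k, IB_k), created on demand."""
    current = None
    for lag, i, j in iter_sorted_lags(a_times, b_times):
        if current is not None and current[0] == lag:
            current[1].add(i)
            current[2].add(j)
        else:
            if current is not None:
                yield current
            current = (lag, {i}, {j})
    if current is not None:
        yield current
```

A consumer would keep only the most recent nodes whose lags lie within |r|_max of the newest one and drop the rest; that sliding window is left out of this sketch.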

Time Complexity Lower Bound
The problem of finding all qualified time intervals is 3SUM-hard, so no truly subquadratic, O(n^(2-ε)), algorithm is expected in the worst case.
3SUM problem: Given a set of n integers, are there three integers a, b, c in the set such that a + b = c? It is conjectured that no O(n^(2-ε)) algorithm can solve this problem in the worst case.
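For reference, the quadratic baseline for the a + b = c form of 3SUM can be sketched as below. The function name is hypothetical, and the sketch assumes that a, b, and c must be three different elements of the set, which the slide does not state.

```python
def has_three_sum(values):
    """Return True if some a + b = c with a, b, c distinct set elements.

    Expected O(n^2) time: every pair is checked against a hash set.
    """
    s = set(values)
    vals = sorted(s)
    for i in range(len(vals)):
        for j in range(i + 1, len(vals)):
            c = vals[i] + vals[j]
            # c must be a third element, different from both a and b.
            if c in s and c != vals[i] and c != vals[j]:
                return True
    return False
```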

Evaluation
Evaluation objectives:
  Effectiveness: Is the method able to find interleaved temporal dependencies? Is the lag interval correct?
  Efficiency: run time cost and memory space cost.
Comparative methods:
  inter-arrival: cluster the time lags between each A and its following B.
  brute-force: try every possible t1, t2 for the lag interval [t1,t2].
  brute-force*: brute-force with pruning by |r|_max.
Testing environment: Linux 2.6, Intel Xeon 2.5 GHz (8 cores), Java VM memory heap of 12 GB.

Data Sets
Synthetic data: 7 data sequences, 8 event types, average sample period of 100. Randomly generated with 3 embedded dependencies. Time lags are large, and dependent items are very likely to be interleaved.

Embedded Dependency       Support
I_1 →[400,500] I_...      ...
I_2 →[1000,1100] I_...    ...
I_4 →[5500,5800] I_...    ...

Real data: Tivoli Monitoring system events from two large accounts in the IBM service center.

Dataset     Time Frame    #Events       #Event Types
Account1    54 days       1,124,834     95
Account2    32 days       2,076,...     ...

Synthetic Data
Effectiveness: brute-force, brute-force*, STScan, and STScan* find all embedded temporal dependencies whenever they finish running. Inter-arrival fails.
Efficiency:
[Table: cost by data size for STScan, STScan*, Brute-Force, Brute-Force*, and Inter-arrival. STScan ends with OutOfMemory; Inter-arrival is below 10^2.]

Tivoli Monitoring System Events

Dataset     Discovered Dependencies
Account1    MSG_Plat_APP →[3600,3600] MSG_Plat_APP
            Linux_Process →[0,96] Process
            SMP_CPU →[0,27] Linux_Process
Account2    TEC_Error →[0,1] Ticket_Retry
            TEC_Retry →[0,1] Ticket_Error
            AIX_HW_ERROR →[8,9] AIX_HW_ERROR

[Figure: event plot for Account2]
Inter-arrivals only find

Tivoli Monitoring System Events
[Figures: run times on Account1 data; run times on Account2 data]

Conclusion and Future Work
Conclusion:
  Studied the problem of discovering interleaved temporal dependencies.
  Proposed two algorithms, STScan and STScan*, which are faster than brute-force search, although their time complexity is still high at O(n^2).
  Proved that the problem is 3SUM-hard.
Future work:
  Develop an approximation algorithm that solves the problem in linear time.

Thank you! Any questions?