When to Update the Sequential Patterns of Stream Data?

When to Update the Sequential Patterns of Stream Data? Q. Zheng, K. Xu, and S. Ma, in Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2003. Adviser: Jia-Ling Koh Speaker: Shu-Ning Shin Date: 2004.8.12

Introduction
The paper presents an experimental method, TPD (Tradeoff between Performance and Difference), for deciding when to update the sequential patterns of stream data. TPD makes a tradeoff between the performance of the incremental updating algorithm and the difference between the old and new sequential patterns.

Stream Data Model (1)
Stream event: E_i = <e_i, t_n>, where e_i is the stream event type and t_n is the time at which the event occurs.
Stream tuple: Q_i = ((e_k1, e_k2, …, e_km), t_i) = (E_k1, E_k2, …, E_km)
Length of stream tuple: |Q_i| = |(e_k1, e_k2, …, e_km)| = m

Stream Data Model (2)
Stream queue: S_{i,j} = <Q_i, Q_{i+1}, …, Q_j> = <(E_i1, …, E_ik) … (E_j1, …, E_jm)>, where t_i < t_{i+1} < … < t_j
Length of queue: |S_{i,j}| = |<Q_i, Q_{i+1}, …, Q_j>| = j-i+1
Stream viewing window: W_k = <Q_m, …, Q_n | d = n-m+1>
Size of viewing window: |W_k| = n-m+1 = d

Stream Data Model (3)
occur(seq_m, W_k): the number of times seq_m occurs in W_k, where seq_m = <e_i1, e_i2, …, e_im> and W_k is a stream viewing window.
support(seq_m, W_k) = occur(seq_m, W_k) / |W_k|
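The occur and support definitions above can be sketched in Python. The slide does not say how an occurrence is counted, so this sketch assumes a sequence occurs once for every start position at which its event types appear in consecutive stream tuples. The names `StreamTuple`, `occur`, `support`, and the sample window are illustrative, not taken from the paper.

```python
from typing import FrozenSet, List, Sequence, Tuple

# A stream tuple Q_i is modelled as (set of event types, timestamp).
StreamTuple = Tuple[FrozenSet[str], int]


def occur(seq: Sequence[str], window: List[StreamTuple]) -> int:
    """Count start positions where seq matches consecutive tuples.

    ASSUMPTION: the counting rule (consecutive-tuple matching) is chosen
    for illustration; the transcript does not define it.
    """
    m = len(seq)
    if m == 0:
        return 0
    count = 0
    for start in range(len(window) - m + 1):
        if all(seq[j] in window[start + j][0] for j in range(m)):
            count += 1
    return count


def support(seq: Sequence[str], window: List[StreamTuple]) -> float:
    """support(seq, W) = occur(seq, W) / |W|, as defined on the slide."""
    return occur(seq, window) / len(window) if window else 0.0


# Hypothetical window of seven stream tuples.
window = [
    (frozenset({"e2"}), 1),
    (frozenset({"e5"}), 2),
    (frozenset({"e1"}), 3),
    (frozenset({"e3", "e6"}), 4),
    (frozenset({"e2"}), 5),
    (frozenset({"e5"}), 6),
    (frozenset({"e7"}), 7),
]

print(occur(["e2", "e5"], window))    # occurrences of <e2, e5>
print(support(["e2", "e5"], window))  # occurrences divided by window size
```

A different counting rule (for example, allowing gaps between matched tuples) would change `occur` but leave the `support` ratio unchanged in form.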

Stream Data Model - Example
S_{1,8} = <Q_1, Q_2, Q_3, Q_4, Q_5, Q_6, Q_7, Q_8>
S_{1,8} = <E_2, E_5, E_1, (E_3, E_6), E_7, E_9, E_10>
W_5 = <Q_1, Q_2, Q_3, Q_4, Q_5, Q_6, Q_7 | d = 7>

Sliding Stream viewing window
ΔW_i: incremental window, i = 0, 1, 2, 3, …
ΔW_0: initial window
W_{i+1} = W_i + ΔW_{i+1}
|ΔW_1| / |W_0|: incremental ratio of stream data
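As a minimal sketch of the window update and the incremental ratio, the snippet below models a window as a plain Python list of stream tuples and appends an incremental window to it. The function names and the placeholder data are assumptions made for illustration.

```python
from typing import List


def slide(window: List[str], increment: List[str]) -> List[str]:
    """Return the updated window: the old window followed by the increment."""
    return window + increment


def incremental_ratio(increment: List[str], initial: List[str]) -> float:
    """Size of the incremental window divided by the size of the initial window."""
    if not initial:
        raise ValueError("initial window must not be empty")
    return len(increment) / len(initial)


# Placeholder tuples; only the sizes matter for the ratio.
w0 = [f"Q{i}" for i in range(1, 21)]    # initial window, 20 tuples
dw1 = [f"Q{i}" for i in range(21, 27)]  # first incremental window, 6 tuples

w1 = slide(w0, dw1)
print(len(w1))                      # size of the updated window
print(incremental_ratio(dw1, w0))   # incremental ratio of the stream data
```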

Estimation of difference between the old and new sequential patterns
L_{W_k}: old frequent sequences in W_k
L_{W_{k+1}}: new frequent sequences in W_{k+1}
L_{W_k} Δ L_{W_{k+1}}: symmetric difference
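The symmetric difference of the old and new frequent-sequence sets can be illustrated with Python sets, representing each sequence as a tuple of event types. The transcript does not state how the symmetric difference is turned into a single number, so using its size here is an assumption; the pattern sets are made up.

```python
from typing import Set, Tuple

Pattern = Tuple[str, ...]


def pattern_difference(old: Set[Pattern], new: Set[Pattern]) -> Set[Pattern]:
    """Patterns that are frequent in exactly one of the two windows."""
    return old ^ new  # set symmetric difference


# Hypothetical frequent sequences before and after an update.
old_patterns = {("e1",), ("e2",), ("e1", "e2")}
new_patterns = {("e1",), ("e1", "e2"), ("e3",)}

diff = pattern_difference(old_patterns, new_patterns)
print(sorted(diff))  # patterns that appeared or disappeared
print(len(diff))     # ASSUMPTION: size used as the difference measure
```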

The Algorithm of Updating Sequential Pattern (IUS) (1)
The IUS algorithm uses the frequent and negative border sequences in DB and db as the candidates to compute the new frequent sequences and negative border sequences in the updated database U.
DB: The original database, which contains old time-related data.
db: The increment database, which contains new time-related data.
dd: The decrement database from DB, which contains deleted time-related data.
U: The updated database. When the database is incrementally updated, U is the total set of data, equal to DB+db. When the database is decrementally updated, U is the total set of data, equal to DB-dd.
Support(F, X): the support of the sequence F in the X database, where X ∈ {db, dd, DB, U}.
Min_supp: Minimum support threshold of the frequent sequence.
Min_nbd_supp: Minimum support threshold of the negative border sequence.
C_X: Candidate sequences in the X database, where X ∈ {db, dd, DB, U}.
L_X: Frequent sequences in the X database, where X ∈ {db, dd, DB, U}.
NBD(X) = C_X - L_X, where NBD(X) consists of the sequences in the X database whose sub_sets are

IUS (2)
Property 1: Let B be a frequent sequence in W_k, if , we have occur(A, DB) > occur(B, DB).
Property 2:
Proof (by contradiction): assume that occur(S, DB) < Min_supp*|DB| and occur(S, db) < Min_supp*|db|. Adding the two inequalities gives occur(S, DB+db) < Min_supp*|DB+db|, that is, Support(S, U) < Min_supp, which contradicts the given condition.

IUS – using the stream data model
W_k: The original stream viewing window, which contains old time-related data.
ΔW_{k+1}: The increment stream viewing window, which contains new time-related data.
W_{k+1}: The updated stream viewing window. When the stream data is incrementally updated, W_{k+1} is the total set of data, equal to W_k + ΔW_{k+1}.
Support(F, X): the support of the sequence F in the X stream viewing window, where X ∈ {W_{k+1}, W_k, ΔW_{k+1}}.
Min_supp: Minimum support threshold of the frequent sequence.
Min_nbd_supp: Minimum support threshold of the negative border sequence.
C_X: Candidate sequences in the X stream viewing window, where X ∈ {W_{k+1}, W_k, ΔW_{k+1}}.
L_X: Frequent sequences in the X stream viewing window, where X ∈ {W_{k+1}, W_k, ΔW_{k+1}}.
NBD(X) = C_X - L_X, where NBD(X) consists of the sequences in the X stream viewing window whose sub_sets are frequent and whose support is lower than Min_supp but greater than Min_nbd_supp.
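The split of candidates into frequent sequences and negative border sequences can be sketched as follows. This is not the IUS algorithm itself, only the threshold test from the definitions above. Two points are assumptions: "frequent" is taken to mean support at or above Min_supp, and "sub_sets are frequent" is taken to mean that every sequence obtained by dropping one element is frequent. The support values are invented.

```python
from typing import Dict, List, Set, Tuple

Pattern = Tuple[str, ...]


def one_shorter(seq: Pattern) -> List[Pattern]:
    """All sequences obtained by deleting exactly one element of seq."""
    if len(seq) <= 1:
        return []
    return [seq[:i] + seq[i + 1:] for i in range(len(seq))]


def split_candidates(
    supports: Dict[Pattern, float], min_supp: float, min_nbd_supp: float
) -> Tuple[Set[Pattern], Set[Pattern]]:
    """Return (frequent sequences, negative border sequences)."""
    # ASSUMPTION: a sequence is frequent when its support is >= min_supp.
    frequent = {s for s, v in supports.items() if v >= min_supp}
    # Negative border: support strictly between the two thresholds and
    # all one-element-shorter subsequences frequent (assumed reading).
    negative_border = {
        s
        for s, v in supports.items()
        if min_nbd_supp < v < min_supp
        and all(sub in frequent for sub in one_shorter(s))
    }
    return frequent, negative_border


# Hypothetical candidate supports in one stream viewing window.
supports = {
    ("a",): 0.50,
    ("b",): 0.40,
    ("c",): 0.15,
    ("a", "b"): 0.12,
    ("a", "c"): 0.12,
    ("d",): 0.02,
}

frequent, nbd = split_candidates(supports, min_supp=0.2, min_nbd_supp=0.1)
print(sorted(frequent))
print(sorted(nbd))
```

In this example `("a", "c")` falls between the thresholds but is excluded from the negative border because `("c",)` is not frequent.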

IUS – Algorithm (1)

IUS – Algorithm (2)

Tradeoff between Performance and Difference (TPD) (1)
Use the speedup to measure the performance of IUS: Speedup = execution time of Robust_search / execution time of IUS
Use the difference to measure the change between the old and the new frequent sequences.
Min-Max normalization is used to bring the two measures onto the same scale.
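A small sketch of the speedup ratio and of Min-Max normalization follows. The slide's own normalization formula did not survive in the transcript, so the code uses the standard form (v - min) / (max - min), which maps values onto [0, 1]; whether the authors used a different target range is unknown. The timing values are invented.

```python
from typing import List


def speedup(robust_search_time: float, ius_time: float) -> float:
    """Execution time of Robust_search divided by execution time of IUS."""
    if ius_time <= 0:
        raise ValueError("ius_time must be positive")
    return robust_search_time / ius_time


def min_max_normalize(values: List[float]) -> List[float]:
    """Standard Min-Max normalization onto [0, 1] (assumed target range)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: no spread to rescale, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical execution times (seconds) for three incremental window sizes.
robust_times = [80.0, 100.0, 120.0]
ius_times = [10.0, 20.0, 60.0]

speedups = [speedup(r, i) for r, i in zip(robust_times, ius_times)]
print(speedups)
print(min_max_normalize(speedups))
```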

TPD (2)
The TPD method plots the speedup curve and the difference curve, each as a function of the size of the incremental window, on the same graph at the same scale. The points where the two curves intersect give the suitable range of the incremental ratio of the initial window for IUS.
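One way to locate the intersection of the two curves from sampled points is linear interpolation between neighbouring samples, sketched below. The transcript does not say how the authors read off the intersection, so this procedure, the function name `crossings`, and the sample curves are all assumptions.

```python
from typing import List


def crossings(xs: List[float], a: List[float], b: List[float]) -> List[float]:
    """x positions where curves a and b cross, by linear interpolation."""
    points: List[float] = []
    for i in range(len(xs) - 1):
        d0 = a[i] - b[i]
        d1 = a[i + 1] - b[i + 1]
        if d0 == 0:
            points.append(xs[i])  # curves touch exactly at a sample
        elif d0 * d1 < 0:
            # Sign change: interpolate where the gap between curves is zero.
            t = d0 / (d0 - d1)
            points.append(xs[i] + t * (xs[i + 1] - xs[i]))
    if xs and a[-1] == b[-1]:
        points.append(xs[-1])
    return points


# Hypothetical normalized curves over incremental window sizes.
increment_sizes = [2000.0, 4000.0, 6000.0, 8000.0]
speedup_norm = [1.0, 0.75, 0.25, 0.0]     # falls as the increment grows
difference_norm = [0.0, 0.25, 0.75, 1.0]  # rises as the increment grows

initial_window_size = 20000.0
for x in crossings(increment_sizes, speedup_norm, difference_norm):
    print(x, x / initial_window_size)  # crossing size and incremental ratio
```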

Experiment
A set of experiments was conducted to find when to update sequential patterns for stream data.
Environment: DELL PC Server with 2 Pentium II CPUs, 512M memory, 16G disk
Operating system: Red Hat Linux 6.0
Data1: alarms in GSM networks, containing 194 alarm types and 100k alarm events. The alarm event times in Data1 range from 2001-08-11-18 to 2001-08-13-17.

Experiment 1 – on Data 1 |initial window|=20k The intersection point: 6K The suitable range of incremental ratio of initial window: 30% of W0.

Experiment 2 – on Data 1 |initial window|=40k The intersection point: 9K~10K The suitable range of incremental ratio of initial window: 22.5%~25% of W0.

Experiment 3 – on Data 1 |initial window|=50k The intersection point: 15K~18K The suitable range of incremental ratio of initial window: 30%~36% of W0.

Experiment 4 – on Data 1 |initial window|=60k The intersection point: 10K~12K The suitable range of incremental ratio of initial window: 16.7%~20% of W0.
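The incremental ratios reported in Experiments 1 to 4 follow from dividing each intersection point by the initial window size. The short check below recomputes them from the figures on the slides, reading "k"/"K" as 1000; the variable and function names are illustrative.

```python
from typing import List, Tuple

# (initial window size, (low, high) intersection point), from the slides.
experiments: List[Tuple[int, Tuple[int, int]]] = [
    (20_000, (6_000, 6_000)),    # Experiment 1
    (40_000, (9_000, 10_000)),   # Experiment 2
    (50_000, (15_000, 18_000)),  # Experiment 3
    (60_000, (10_000, 12_000)),  # Experiment 4
]


def ratio_range(initial: int, crossing: Tuple[int, int]) -> Tuple[float, float]:
    """Incremental ratio range, in percent of the initial window size."""
    low, high = crossing
    return (100 * low / initial, 100 * high / initial)


for number, (initial, crossing) in enumerate(experiments, start=1):
    low, high = ratio_range(initial, crossing)
    print(f"Experiment {number}: {low:.1f}% to {high:.1f}% of the initial window")
```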

Conclusion
Using the TPD method, the experiments show that the suitable incremental ratio for updating with the IUS algorithm is about 20 to 30 percent of the size of the initial window.