MINING FREQUENT ITEMSETS IN A STREAM TOON CALDERS, NELE DEXTERS, BART GOETHALS ICDM2007 Date: 5 June 2008 Speaker: Li, Huei-Jyun Advisor: Dr. Koh, Jia-Ling.



OUTLINE
Introduction
Problem Statement
Property of Max-Frequency
Algorithm
Experiments
Conclusion

INTRODUCTION
Most previous work on mining frequently occurring itemsets over data streams focuses on one of three models:
1. The sliding window model
2. The time-fading model
3. The landmark model
Each of these models requires a fixed window length or decay factor given by the user.
In many applications, however, choosing parameters that are appropriate for every itemset at every timepoint in an evolving stream is almost impossible.

INTRODUCTION
We propose to consider, for each itemset, the window in which it has the highest frequency.
We define the current frequency of an itemset as the maximum frequency over all windows that extend from some point in the past to the current state and satisfy a minimal size constraint.
As a stream evolves, the length of the window containing the highest frequency for a given itemset can change continuously.
This new stream measure turns out to be well suited to detecting sudden bursts of occurrences of itemsets early, while still taking the history of the itemset into account.

PROBLEM STATEMENT * STREAMS AND MAX-FREQUENCY
$S$: a stream $\langle I_1 I_2 \dots I_n \rangle$ is a sequence of itemsets.
$|S|$ is the length of the stream.
$I_1$ is considered the first and oldest itemset in the stream, and $I_n$ the latest and most recent.
$\mathrm{count}(I, S)$: the number of sets in the stream $S$ that contain itemset $I$.
$S[s, t]$: the sub-stream (window) $\langle I_s I_{s+1} \dots I_t \rangle$.
$\mathrm{last}(k, S)$: the sub-stream of $S$ consisting of the last $k$ itemsets of $S$, that is, $S[|S| - k + 1, |S|]$.

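PROBLEM STATEMENT * STREAMS AND MAX-FREQUENCY
Definition 1. Given a minimal window size mwl, the max-frequency $\mathrm{mfreq}(I, S)$ of itemset $I$ in a stream $S$ is defined as the maximum of the frequencies of $I$ over all windows of size at least mwl that extend from the end of the stream; that is,
$$\mathrm{mfreq}(I, S) := \max_{k = mwl, \dots, |S|} \mathrm{freq}(I, \mathrm{last}(k, S)),$$
where $\mathrm{freq}(I, S) = \mathrm{count}(I, S) / |S|$.
If the length of the stream is less than mwl, the max-frequency is defined to be 0.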

PROBLEM STATEMENT * STREAMS AND MAX-FREQUENCY
Definition 1. (cont.) The longest window in which the maximum frequency is reached is called the maximal window for $I$ in $S$, and its starting point is denoted $\mathrm{startmax}(I, S)$.
That is, $\mathrm{startmax}(I, S)$ is the smallest index $s$ such that $\mathrm{mfreq}(I, S) = \mathrm{freq}(I, S[s, |S|])$.
mwl will be omitted when clear from the context.
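Definition 1 can be checked directly on a small stream. The following is a minimal brute-force sketch, not the summary-based algorithm of the talk: it scans every window that ends at the current position. The function name `max_frequency`, the 1-based start index, and the example stream are assumptions made for illustration, and mwl is assumed to be at least 1.

```python
from fractions import Fraction


def max_frequency(stream, itemset, mwl):
    """Brute-force max-frequency and maximal-window start (1-based index).

    Hypothetical helper for illustration; assumes mwl >= 1.
    Returns (0, None) when the stream is shorter than mwl.
    """
    n = len(stream)
    if n < mwl:
        return Fraction(0), None
    target = frozenset(itemset)
    best, start = Fraction(-1), None
    count = 0
    # Grow the window backwards from the end of the stream: k = 1, 2, ..., n.
    for k in range(1, n + 1):
        if target <= stream[n - k]:
            count += 1
        if k >= mwl:
            f = Fraction(count, k)
            # ">=" keeps the longest window among ties (smallest start index).
            if f >= best:
                best, start = f, n - k + 1
    return best, start


# Example stream of five itemsets over the items "a" and "b".
stream = [frozenset(t) for t in ("ab", "b", "a", "ab", "a")]
print(max_frequency(stream, "a", mwl=2))  # (Fraction(1, 1), 3)
```

In the example, the windows of length 2 and 3 both reach frequency 1 for itemset {a}; the longer one wins, so the maximal window starts at position 3.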

PROPERTIES OF MAX-FREQUENCY


ALGORITHM * THE SUMMARY
Let $p_1 < p_2 < \dots < p_r$ be the borders for itemset $A$ in the stream $S$ of length $t$, ordered from oldest to most recent.
Let $a_i = \mathrm{count}(A, S[p_i, p_{i+1} - 1])$ be the number of occurrences of the target itemset $A$ between two subsequent border positions $p_i$ and $p_{i+1}$ (for $i = 1, \dots, r-1$).
$a_r = \mathrm{count}(A, S[p_r, t])$ denotes the number of occurrences of $A$ since the last border.
The summary $S_t$ of $S$ is defined as the array
$$S_t = [(p_1, a_1), (p_2, a_2), \dots, (p_r, a_r)].$$

ALGORITHM * THE SUMMARY
We can easily compute the frequencies of itemset $A$ for any of the border positions from this summary:
$$\mathrm{freq}(A, S[p_i, t]) = \frac{\sum_{j=i}^{r} a_j}{t - p_i + 1}.$$

ALGORITHM * THE SUMMARY
The fractions $a_i / (p_{i+1} - p_i)$ in the blocks between two subsequent border positions are increasing. As a consequence, among all borders $p_i$, $\mathrm{freq}(A, S[p_i, t])$ is maximal for $i$ equal to $r$.
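The frequency at each border can be read off the summary with a single backward pass. The sketch below assumes the summary is a list of `(p_i, a_i)` pairs with 1-based positions and that `t` is the current stream length; the function name `border_frequencies` and the example values are hypothetical and chosen so that the block fractions (1/3, 2/4, 3/3) increase.

```python
from fractions import Fraction


def border_frequencies(summary, t):
    """Frequency of the target itemset in S[p_i, t] for every border p_i.

    summary: list of (p_i, a_i) pairs, oldest border first (1-based positions).
    t: current length of the stream.
    """
    result = []
    suffix_count = 0
    # Walk from the most recent border back to the oldest, accumulating
    # the occurrences a_r, a_r + a_{r-1}, ...
    for p, a in reversed(summary):
        suffix_count += a
        result.append((p, Fraction(suffix_count, t - p + 1)))
    result.reverse()
    return result


# Hypothetical summary: blocks [1,3], [4,7], [8,10] with 1, 2 and 3 occurrences.
summary = [(1, 1), (4, 2), (8, 3)]
freqs = border_frequencies(summary, t=10)
print(freqs)  # [(1, Fraction(3, 5)), (4, Fraction(5, 7)), (8, Fraction(1, 1))]
print(max(freqs, key=lambda pf: pf[1])[0])  # 8, the most recent border
```

Because the block fractions increase from old to recent, the window starting at the last border has the highest frequency in this example.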


ALGORITHM * MINIMAL FREQUENCY
Until now, we assumed that we need to be able to report the frequency of the target itemset exactly.
We now relax this requirement by setting a minimal frequency threshold minfreq.
Let $S$ be a stream with [...], and suppose that [...]. Then we can remove $(p_1, a_1)$ from the left side of the summary.

ALGORITHM * MINIMAL WINDOW LENGTH
In the algorithm without a minimal window length, a border $q$ in stream $S$ can be pruned if we can find two blocks, one ending just before $q$ and one starting at $q$ and ending at some position $r$, such that the frequency of the target in the first block is higher than in the second.
When we are working with a minimal window length, it could be the case that the suffix of the stream starting at $r + 1$ does not meet the minimal window length requirement.
In that case, even though the window starting at $q$ has lower frequency than the window starting at $r + 1$, it can still have the highest frequency of all windows that meet the minimal window requirement!


ALGORITHM * MINIMAL WINDOW LENGTH
In order to know the maximal frequency with a minimal window length mwl, it suffices to apply the method without any minimal window length to keep track of the borders for the part of the stream that precedes the minimal window.
Then, when we need the max-frequency, we check the borders of that part in the complete stream, and the minimal window itself, $\mathrm{last}(mwl, S)$.
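The two-step check described above can be illustrated with a brute-force sketch. It is not the incremental summary maintenance of the talk: borders are recomputed from scratch using the pruning rule stated for the case without a minimal window length (a position is dropped when some earlier block has a strictly higher frequency than some block starting at it). Two points are assumptions: the tracked part of the stream is taken to be the first `|S| - mwl` itemsets, and mwl is at least 1. The names `freq`, `borders`, and `max_frequency_mwl` are hypothetical.

```python
from fractions import Fraction


def freq(stream, itemset, s, t):
    """Frequency of itemset in the window S[s, t] (1-based, inclusive)."""
    window = stream[s - 1:t]
    return Fraction(sum(itemset <= tx for tx in window), len(window))


def borders(stream, itemset):
    """Brute-force borders, ignoring any minimal window length.

    Position q is pruned when a block ending just before q has a strictly
    higher frequency than a block starting at q.
    """
    n = len(stream)
    kept = []
    for q in range(1, n + 1):
        pruned = any(
            freq(stream, itemset, j, q - 1) > freq(stream, itemset, q, k)
            for j in range(1, q)
            for k in range(q, n + 1)
        )
        if not pruned:
            kept.append(q)
    return kept


def max_frequency_mwl(stream, itemset, mwl):
    """Max-frequency with minimal window length mwl (assumes mwl >= 1)."""
    n = len(stream)
    if n < mwl:
        return Fraction(0)
    # Assumption: borders are tracked for the prefix before the minimal window.
    prefix = stream[:n - mwl]
    # Candidate starts: borders of the prefix, plus the minimal window itself.
    candidates = borders(prefix, itemset) + [n - mwl + 1]
    return max(freq(stream, itemset, s, n) for s in candidates)


stream = [frozenset(t) for t in ("ab", "b", "a", "ab", "a")]
print(max_frequency_mwl(stream, frozenset("a"), mwl=2))  # 1
```

In the example the prefix is the first three itemsets, its borders are positions 1 and 3, and the candidate starts are therefore 1, 3, and 4 (the minimal window).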


ALGORITHM * MINING ALL ITEMSETS
We do not need to maintain the summaries of all itemsets, but only of those that were once frequent in the minimal window and that are, at the same time, frequent now within the part of the stream [...].
Furthermore, we need to find the frequent itemsets in the mwl windows.

EXPERIMENTS


CONCLUSION
We presented a new frequency measure for itemsets in streams that does not rely on a fixed window length or a time-decaying factor.
An experimental evaluation supported the claim that the new measure can be computed from a summary with extremely small memory requirements, which can be maintained and updated efficiently.
The summary of the stream consists of the borders and their corresponding frequencies.