Resource-oriented Approximation for Frequent Itemset Mining from Bursty Data Streams SIGMOD’14 Toshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda.

Presentation transcript:

Introduction
▪A data stream: an unbounded sequence of data arriving at high speed
▪FIM-DS: Frequent Itemset Mining from Data Streams
  ▪e.g., {a}:4, {b}:3, {c}:3
▪Applications of FIM-DS: monitoring surveillance systems, communication networks, the retail industry, etc.
▪A challenging problem of FIM-DS: handling the huge combinatorial number of entries generated from each streaming transaction and stored in memory
▪This study considers approximation techniques for FIM-DS.
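To make the combinatorial problem concrete, here is a minimal sketch (in Python, not taken from the slides) that enumerates every non-empty subset of one transaction. A transaction with n distinct items yields 2^n - 1 candidate itemsets. The function name `nonempty_subsets` is a hypothetical helper introduced only for illustration.

```python
from itertools import combinations


def nonempty_subsets(transaction):
    """Yield every non-empty subset of a transaction as a sorted tuple."""
    items = sorted(set(transaction))
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


# The number of candidate itemsets roughly doubles with every extra item.
for n in (3, 10, 15):
    count = sum(1 for _ in nonempty_subsets(range(n)))
    print(n, count)  # count equals 2**n - 1
```

Even a single long transaction in a burst can therefore flood the entry table, which is the motivation for the approximation techniques that follow.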

Introduction
Some approximation methods for FIM-DS:
▪Parameter-oriented approaches
  ▪One-scan approximation algorithms
  ▪Two types: the deletion approach and the random sampling approach
  ▪Guarantee that the frequencies of the resulting itemsets have errors bounded by a given parameter
  ▪No false negatives under some conditions

Contents
▪Introduction
▪Preliminary and Background
▪Parameter-Oriented vs. Resource-Oriented
▪LC-SS Algorithm
▪Skip LC-SS Algorithm (Introduction & Performance & Improvement)
▪Further Improvements
▪Conclusion and Future Work

Notations and Terminology

Lossy Counting algorithm
▪The challenging problem:
  ▪The LC algorithm must generate (and check) every transaction subset at least once
  ▪Combinatorial explosion of memory consumption
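The following is a minimal sketch of Lossy Counting applied naively to itemsets. It assumes the standard formulation of the algorithm (bucket width ⌈1/ε⌉, one `(count, delta)` pair per entry, pruning at bucket boundaries); it is not the pseudocode from the slides, and the function and variable names are illustrative. The inner loops show why every subset of every transaction has to be generated at least once.

```python
from itertools import combinations
from math import ceil


def lossy_counting_itemsets(stream, epsilon):
    """Naive Lossy Counting over itemsets.

    Returns (table, n): table maps itemset -> (count, delta), where
    count <= true frequency <= count + delta, and n is the number of
    transactions processed.
    """
    width = ceil(1 / epsilon)  # bucket width
    table = {}
    n = 0
    for transaction in stream:
        n += 1
        bucket = ceil(n / width)  # id of the current bucket
        items = sorted(set(transaction))
        # Every non-empty subset is generated: 2**len(items) - 1 of them.
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                if itemset in table:
                    count, delta = table[itemset]
                    table[itemset] = (count + 1, delta)
                else:
                    table[itemset] = (1, bucket - 1)
        # At a bucket boundary, drop entries that cannot be frequent.
        if n % width == 0:
            table = {k: (c, d) for k, (c, d) in table.items()
                     if c + d > bucket}
    return table, n


stream = [["a", "b"], ["a", "c"], ["a", "b"], ["b", "c"]]
table, n = lossy_counting_itemsets(stream, epsilon=0.5)
print(table, n)
```

Pruning only happens at bucket boundaries, so between two boundaries the table can still grow by up to 2^n - 1 entries per transaction of length n.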

Space-Saving algorithm

Parameter-Oriented vs. Resource-Oriented
▪Resource-oriented approaches:
  ▪Approximation methods
  ▪Guarantee a resource-specified constraint: memory consumption or data processing time
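As an illustration of a memory-bounded method, here is a minimal sketch of Space-Saving counting over single items, assuming the standard formulation of that algorithm: the table never holds more than `capacity` entries, and the error follows from that limit instead of being specified in advance. The names `space_saving` and `capacity` are illustrative, and the linear scan for the minimum is a simplification of the dedicated data structures used in practice.

```python
def space_saving(stream, capacity):
    """Space-Saving counting with at most `capacity` table entries.

    Returns a dict mapping item -> (count, error), where
    count - error <= true frequency <= count.
    """
    table = {}
    for item in stream:
        if item in table:
            count, error = table[item]
            table[item] = (count + 1, error)
        elif len(table) < capacity:
            table[item] = (1, 0)
        else:
            # Table is full: replace the entry with the smallest count.
            # (Linear scan for clarity only.)
            victim = min(table, key=lambda k: table[k][0])
            min_count, _ = table.pop(victim)
            table[item] = (min_count + 1, min_count)
    return table


print(space_saving(list("abacab"), capacity=2))
```

The memory footprint is decided by `capacity` alone, regardless of how bursty the stream is, which is the kind of resource-specified guarantee the slide describes.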

LC-SS Algorithm

The validity in the LC-SS Algorithm
▪Theorem 2 guarantees the validity (i.e., completeness and accuracy) of the outputs of Algorithm 1.

Skip LC-SS Algorithm

The Validity of the output

Performance of the Skip LC-SS algorithm
▪Data:
  ▪Online data for earthquake occurrences in Japan from 1981 to 2013, which consists of transactions with 1229 items
▪Implemented in C
▪Mac Pro, Mac OS 10.6, 3.33 GHz, 16 GB

Improvement of Skip LC-SS algorithm
▪Two bottlenecks:
  ▪1. Updating entries
  ▪2. Replacing entries
▪The replacement operation tends to be more expensive than the update operation

R-skip

T-skip

Further Improvements
▪Key idea: use stream reduction to dynamically shorten each transaction
▪One fact: most items in bursty transactions are non-frequent
▪The anti-monotonicity principle: any itemset that contains a non-frequent item cannot itself be frequent
▪Eliminate non-frequent items from each transaction
▪The paper uses the SS-ST algorithm to perform the stream reduction
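The idea of stream reduction can be sketched as a filter placed in front of the miner. The sketch below is an assumption-laden illustration, not the SS-ST algorithm: it uses exact per-item counters as a stand-in for SS-ST, and the rule "keep an item if its count so far is at least `min_support` times the number of transactions seen" is a hypothetical threshold chosen for simplicity. An item dropped early could still become frequent later, so this filter is itself approximate.

```python
from collections import Counter


def reduce_stream(stream, min_support):
    """Yield each transaction with its currently non-frequent items removed.

    Exact counters are used here as a stand-in for SS-ST (assumption).
    """
    counts = Counter()
    seen = 0
    for transaction in stream:
        seen += 1
        items = set(transaction)
        counts.update(items)  # one occurrence per item per transaction
        threshold = min_support * seen
        yield [x for x in sorted(items) if counts[x] >= threshold]


stream = [["a", "b", "x"], ["a", "b", "y"], ["a", "z"], ["a", "b", "w"]]
for reduced in reduce_stream(stream, min_support=0.5):
    print(reduced)
```

Because the miner enumerates subsets of each transaction, shortening a transaction by even a few items cuts the number of generated entries exponentially.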

SS-ST algorithm

Experimental results
▪Web-log data: transactions with 9961 items, the maximal length decreases by 29 from 106
▪Retail data: transactions with items, the maximal length decreases by 58 from 76

Conclusion

Future Work
▪1. Introduce efficient data structures for the Skip LC-SS algorithm
▪2. Investigate an adaptive approach based on the Skip LC-SS algorithm that can fit the relevant resource in the context of FIM-DS
