NGDM’02 1 Efficient Data-Reduction Methods for On-line Association Rule Mining
H. Bronnimann (Polytechnic Univ.), B. Chen (Exelixis), M. Dash, Y. Qiao, P. Scheuermann (Northwestern University), P. Haas (IBM Almaden)

NGDM’02 2 Motivation
• Volume of data in warehouses and on the Internet is growing faster than Moore’s Law
• Scalability is a major concern: “classical” algorithms require one or more scans of the database
• Need to adapt to streaming data: data elements arrive on-line, with a limited amount of memory
• One solution: execute the algorithm on a sample, or on lossy compressed synopses (sketches) of the data

NGDM’02 3 Motivation
• Sampling methods
  – Advantage: can explicitly trade off accuracy and speed
  – Work best when tailored to the application
• Our contributions: sampling methods for count datasets
  – Base set of items; each data element is a vector of item counts
  – Application: association rule mining

NGDM’02 4 Outline of the Presentation
• Motivation
• FAST
• Epsilon Approximation
• Experimental Results
• Data Stream Reduction
• Conclusion

NGDM’02 5 The Problem
Generate a smaller subset S0 of a larger superset S such that the supports of 1-itemsets in S0 are close to those in S
NP-Complete (via the One-In-Three SAT problem)
Notation:
  I1(T) = set of all 1-itemsets in transaction set T
  L1(T) = set of frequent 1-itemsets in transaction set T
  f(A;T) = support of itemset A in transaction set T
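The notation above can be made concrete with a short sketch. The distance between samples used later by FAST is not defined on this slide; an L2 distance between 1-itemset support vectors is assumed here (the experimental results name a "D2" variant), and transactions are modeled as plain item lists:

```python
from collections import Counter

def supports(T):
    """f(A;T): support of each 1-itemset A in transaction set T."""
    n = len(T)
    counts = Counter(item for t in T for item in set(t))
    return {item: c / n for item, c in counts.items()}

def dist(S0, S, items):
    """Distance between the 1-itemset support vectors of S0 and S.
    The metric is an assumption: L2, matching the 'D2' variants
    named in the experimental results."""
    f0, f = supports(S0), supports(S)
    return sum((f0.get(a, 0.0) - f.get(a, 0.0)) ** 2 for a in items) ** 0.5
```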

NGDM’02 6 FAST-trim
FAST-trim outline: given a specified minimum support p and confidence c, the FAST-trim algorithm proceeds as follows:
1. Obtain a large simple random sample S from D.
2. Compute f(A;S) for each 1-itemset A.
3. Using the supports computed in Step 2, obtain a reduced sample S0 from S by trimming away outlier transactions.
4. Run a standard association-rule algorithm against S0, with minimum support p and confidence c, to obtain the final set of association rules.

NGDM’02 7 FAST-trim
Trimming phase: uses input parameter k to explicitly trade off speed and accuracy

    while (|S0| > n) {
        divide S0 into disjoint groups of min(k, |S0|) transactions each;
        for each group G {
            compute f(A;S0) for each item A;
            set S0 = S0 − {t*}, where Dist(S0 − {t*}, S) = min_{t ∈ G} Dist(S0 − {t}, S);
        }
    }

Note: removal of the outlier t* causes the maximum decrease, or minimum increase, in Dist(S0, S)
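The pseudocode above can be turned into a runnable sketch. Assumptions: Dist is taken to be the L2 distance between 1-itemset support vectors (the results mention a FAST_trim_D2 variant, but the metric is not fixed on this slide), transactions are plain item lists, and the helpers are restated so the block stands alone; efficiency is ignored in favor of matching the pseudocode step by step:

```python
from collections import Counter

def supports(T):
    # f(A;T): support of each 1-itemset in transaction set T
    n = len(T) or 1
    return {a: c / n for a, c in Counter(i for t in T for i in set(t)).items()}

def dist(S0, S, items):
    # assumed metric: L2 distance between 1-itemset support vectors
    f0, f = supports(S0), supports(S)
    return sum((f0.get(a, 0.0) - f.get(a, 0.0)) ** 2 for a in items) ** 0.5

def fast_trim(S, n, k):
    """Trim sample S down to n transactions. Per group of up to k
    transactions, remove the outlier t* whose removal minimizes
    Dist(S0 - {t*}, S)."""
    items = {i for t in S for i in t}
    S0 = list(S)
    while len(S0) > n:
        # disjoint groups of min(k, |S0|) transactions (snapshot of S0)
        groups = [S0[j:j + k] for j in range(0, len(S0), k)]
        for G in groups:
            if len(S0) <= n:
                break
            best = min(G, key=lambda t: dist([u for u in S0 if u is not t], S, items))
            S0.remove(best)
    return S0
```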

NGDM’02 8 FAST-grow
Growing phase: select representative transactions from S and add them to the sample S0, which is initially empty

    while (|S0| < n) {
        divide S into disjoint groups of min(k, |S|) transactions each;
        for each group G {
            compute f(A;S0) for each item A;
            set S0 = S0 ∪ {t*}, where Dist(S0 ∪ {t*}, S) = min_{t ∈ G} Dist(S0 ∪ {t}, S);
        }
    }
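The growing phase admits the same kind of sketch, under the same assumed L2 support distance (helpers restated so the block stands alone); note that repeated passes over S may select a transaction more than once, which this sketch permits:

```python
from collections import Counter

def supports(T):
    # f(A;T): support of each 1-itemset in transaction set T
    n = len(T) or 1
    return {a: c / n for a, c in Counter(i for t in T for i in set(t)).items()}

def dist(S0, S, items):
    # assumed metric: L2 distance between 1-itemset support vectors
    f0, f = supports(S0), supports(S)
    return sum((f0.get(a, 0.0) - f.get(a, 0.0)) ** 2 for a in items) ** 0.5

def fast_grow(S, n, k):
    """Grow a sample S0 (initially empty) to n transactions by adding,
    per group of up to k transactions from S, the representative t*
    whose addition minimizes Dist(S0 + {t*}, S)."""
    items = {i for t in S for i in t}
    S0 = []
    while len(S0) < n:
        for j in range(0, len(S), k):
            if len(S0) >= n:
                break
            G = S[j:j + k]
            S0.append(min(G, key=lambda t: dist(S0 + [t], S, items)))
    return S0
```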

NGDM’02 9 Epsilon Approximation (EA)
• Theory based on work in statistics on VC dimensions (Vapnik & Chervonenkis ’71) shows that the frequencies of a collection of subsets can be estimated simultaneously whenever the VC dimension is finite
• Applications to computational geometry and learning theory
• Def: a sample S0 of S1 is an ε-approximation iff the discrepancy satisfies Disc(S0, S1) = max_A |f(A;S0) − f(A;S1)| ≤ ε

NGDM’02 10 Epsilon Approximation (EA)
Halving method:
• Deterministically halve the data to get the sample S0
• Apply halving repeatedly (S1 ⇒ S2 ⇒ … ⇒ St (= S0)) until the discrepancy budget is exhausted
• Each halving step introduces a discrepancy ε_i, where m = total no. of items in the database and n_i = size of sub-sample S_i
• Halving stops with the maximum t such that the accumulated discrepancy ε_1 + … + ε_t remains at most ε
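The two formulas on this slide did not survive transcription. As an assumption consistent with the surrounding definitions (m items, sub-sample sizes n_i), a standard form of the per-halving discrepancy bound and the stopping rule would be:

```latex
\operatorname{Disc}(S_i, S_{i-1}) \le \varepsilon_i = O\!\left(\sqrt{\frac{\ln m}{n_i}}\right),
\qquad
t = \max\Bigl\{\, t' : \sum_{i=1}^{t'} \varepsilon_i \le \varepsilon \Bigr\}.
```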

NGDM’02 11 Epsilon Approximation (EA)
How to compute the halving? Hyperbolic cosine method [Spencer]:
1. Color each transaction red (in the sample) or blue (not in the sample)
2. Maintain a penalty for each item that reflects the red/blue imbalance:
   – penalty is small if red and blue are approximately balanced
   – penalty shoots up exponentially when red dominates (item is over-sampled) or blue dominates (item is under-sampled)
3. Color the transactions sequentially, keeping the penalty low
Key property: the penalty does not increase on average ⇒ one of the two colors does not increase the global penalty

NGDM’02 12 Epsilon Approximation (EA)
Penalty computation:
• Let Q_i = penalty for item A_i; initialize Q_i = 2
• Suppose that we have colored the first j transactions, where
  r_i = r_i(j) = no. of red transactions containing A_i
  b_i = b_i(j) = no. of blue transactions containing A_i
  δ_i = parameter that influences how fast the penalty changes as a function of |r_i − b_i|

NGDM’02 13 Epsilon Approximation (EA)
How to color transaction j+1:
• Compute the global penalty twice: once assuming transaction j+1 is red, and once assuming it is blue
• Choose the color for which the global penalty is smaller
EA is inherently an on-line method
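Slides 11 through 13 can be combined into a small sketch. The exact penalty formula on slide 12 was lost in transcription, so a hyperbolic-cosine form Q_i = q^(r_i − b_i) + q^(−(r_i − b_i)) with q = 1 + δ_i is assumed here; it satisfies the stated properties (Q_i = 2 when balanced, exponential growth with |r_i − b_i|):

```python
def ea_halve(transactions, items, delta=0.1):
    """Sequential red/blue coloring (Spencer's hyperbolic cosine method).
    Each transaction gets the color yielding the smaller global penalty.
    Items absent from the transaction keep their penalty either way, so
    only the transaction's own items need to be compared."""
    q = 1.0 + delta
    diff = {a: 0 for a in items}   # r_i - b_i for each item
    red = []
    for t in transactions:
        present = [a for a in set(t) if a in diff]

        def penalty(sign):
            # sum of assumed penalties Q_i over t's items if t gets
            # color `sign` (+1 = red, -1 = blue)
            return sum(q ** (diff[a] + sign) + q ** -(diff[a] + sign)
                       for a in present)

        if penalty(+1) <= penalty(-1):
            red.append(t)
            for a in present:
                diff[a] += 1
        else:
            for a in present:
                diff[a] -= 1
    return red   # the retained half
```

On a stream of identical one-item transactions the colors simply alternate, so the output is an exact half with zero discrepancy on that item.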

NGDM’02 14 Performance Evaluation
• Synthetic data set from the IBM QUEST project [AS94]
  – 100,000 transactions; 1,000 items
  – number of maximal potentially large itemsets = 2,000
  – average transaction length: 10
  – average length of maximal large itemsets: 4
  – length of the maximal large itemsets: 6
  – minimum support: 0.77%
• Final sampling ratios: 0.76%, 1.51%, 3.0%, … dictated by EA halvings

NGDM’02 15 Experimental Results
• At an 87% reduction in sample size, accuracy is: EA (99%), FAST_trim_D2 (97%), SRS (94.6%)

NGDM’02 16 Experimental Results
• FAST_grow_D2 is best for very small sampling ratios (< 2%)
• EA is best overall in accuracy

NGDM’02 17 Data Stream Reduction
Data Stream Reduction (DSR):
• Maintain a representative sample of the data stream
• Assign more weight to recent data while partially keeping track of old data

  Bucket#:              m_S     m_S−1   m_S−2   …   1
  Sample contribution:  N_S/2   N_S/4   N_S/8   …   1

To generate an N_S-element sample, halve bucket k (m_S − k) times
Total no. of transactions = m_S · N_S / 2
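The bucket scheme above can be sketched as follows. `halve` stands in for one deterministic halving step; the usage below substitutes a trivial take-every-other halving for EA, and assumes each bucket originally holds N_S/2 transactions so that bucket k (1 = oldest) contributes about N_S/2^(m_S − k + 1) to the sample:

```python
def dsr_sample(buckets, halve):
    """Data Stream Reduction: bucket k of m (1 = oldest) is halved
    (m - k) times, so the newest bucket contributes ~N_S/2 transactions,
    the previous one ~N_S/4, and so on."""
    m = len(buckets)
    out = []
    for k, bucket in enumerate(buckets, start=1):
        part = bucket
        for _ in range(m - k):
            part = halve(part)   # each step keeps ~half, e.g. one EA halving
        out.extend(part)
    return out
```

For example, with three buckets of 8 transactions each and `halve = lambda s: s[::2]`, the buckets contribute 2, 4, and 8 transactions respectively.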

NGDM’02 18 Data Stream Reduction
• Practical implementation
  [diagram: a row of buffers annotated with halving counts 0, 1, 2, …; when the active buffer fills and empties, the counts shift to 1, 2, 3, …]
• To avoid frequent halving, we fill one buffer at a time and compute a new representative sample, by applying EA, only when the buffer is full

NGDM’02 19 Data Stream Reduction
• Problem: two users querying immediately before and after a halving operation see data that varies substantially
• Continuous DSR: the buffer is divided into chunks; as the next n_s transactions arrive, the oldest chunk is halved first
  [diagram: chunk boundaries at n_s, 2n_s, 3n_s, …, N_s − n_s, N_s, shifting as new transactions arrive]

NGDM’02 20 Conclusion
• Two-stage sampling approach based on trimming outliers or selecting representative transactions
• Epsilon approximation: a deterministic method for repeatedly halving the data to obtain the final sample
• Can be used in conjunction with other non-sampling count-based mining algorithms
• EA-based data stream reduction
• Ongoing work: how to evaluate the goodness of a representative subset; frequency information to be used in the discrepancy function