Implementation of “A New Two-Phase Sampling Based Algorithm for Discovering Association Rules” Tokunbo Makanju Adan Cosgaya Faculty of Computer Science.

Presentation transcript:

Implementation of “A New Two-Phase Sampling Based Algorithm for Discovering Association Rules” Tokunbo Makanju Adan Cosgaya Faculty of Computer Science Dalhousie University Fall 2005 CSCI 6405 Data Warehousing and Data Mining

Overview
- Introduction
- Algorithm
- Data Preparation
- Experimental Results
- Conclusions
- References

Introduction
- Datasets are getting larger.
- The time required to mine information from a dataset increases with its size.
- This creates demand for faster rule mining.
- Solution: mine a sample of the original dataset.

Algorithm
FAST (Finding Associations from Sampled Transactions) comes in two versions:
- FAST-Trim
- FAST-Grow
FAST outline:
- Obtain a simple random sample S from the dataset D.
- Compute the frequency of each 1-itemset in S.
- Obtain a reduced sample S0 from S, either by trimming S (FAST-Trim) or by growing S0 (FAST-Grow).
- Run a standard association-rule algorithm against S0.
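The first two steps of the outline can be sketched as follows. This is a minimal illustration, not the presenters' code: the function names `simple_random_sample` and `item_supports`, the seed, and the toy database `D` are all assumptions made for the example.

```python
import random
from collections import Counter

def simple_random_sample(transactions, ratio, seed=0):
    """Draw a simple random sample (without replacement) of about ratio * N transactions."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    size = max(1, round(ratio * len(transactions)))
    return rng.sample(transactions, size)

def item_supports(transactions):
    """Return the support of every 1-itemset: the fraction of transactions containing it."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

# Toy database D (hypothetical data, not the accidents dataset).
D = [("a", "b"), ("a", "c"), ("a", "b", "c"), ("b",), ("a",), ("c", "b")] * 50
S = simple_random_sample(D, ratio=0.3)
supports = item_supports(S)
```

Here `supports` maps each item to its relative frequency in the sample S, which is the quantity the later steps compare against when building S0.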

Algorithm: Distance Functions
- I1(T) = set of all 1-itemsets in transaction set T
- L1(T) = set of frequent 1-itemsets in transaction set T
- f(A; T) = support of itemset A in transaction set T
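The distance formulas themselves did not survive in this transcript, only the notation. As a hedged illustration of how that notation can be combined, the sketch below defines two plausible distances between a reduced sample S0 and the sample S: a squared difference of 1-itemset supports, and a normalized symmetric difference of the frequent 1-itemset sets. Both definitions and all function names are assumptions for this example and may differ from the ones shown on the original slide.

```python
from collections import Counter

def item_supports(transactions):
    """f(A; T) for each 1-itemset A in I1(T); returns {} for an empty T."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def frequent_items(transactions, min_support):
    """L1(T): the 1-itemsets whose support reaches min_support."""
    return {a for a, f in item_supports(transactions).items() if f >= min_support}

def dist_l2(S0, S):
    """Assumed distance: sum over A in I1(S) of (f(A; S0) - f(A; S))^2."""
    f0, f = item_supports(S0), item_supports(S)
    return sum((f0.get(a, 0.0) - fa) ** 2 for a, fa in f.items())

def dist_frequent_sets(S0, S, min_support):
    """Assumed distance: |L1(S0) symmetric-difference L1(S)| / (|L1(S0)| + |L1(S)|)."""
    l0, l = frequent_items(S0, min_support), frequent_items(S, min_support)
    denom = len(l0) + len(l)
    return len(l0 ^ l) / denom if denom else 0.0

# Hypothetical toy data.
S = [("a", "b"), ("a",), ("b",), ("a", "b")]
S0 = [("a",)]
```

Both functions return 0 when S0 reproduces the item statistics of S exactly, and grow as S0 drifts away from S.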

Algorithm: FAST-Grow
  obtain a simple random sample S from D;
  compute f(A; S) for each 1-itemset A in I1(S);
  set i = 0, S0(i) = ∅, minDist = ∞, and minStage = -1;
  while (|S0(i)| < n) {
    divide S - S0(i) into disjoint groups of min(k, |S - S0(i)|) transactions each;
    for each group G {
      set S0(i) = S0(i) ∪ {t*}, where Dist(S0(i) ∪ {t*}, S) = min over t in G of Dist(S0(i) ∪ {t}, S);
    }
    compute f(A; S0(i)) for each item A in I1(S0(i));
    if (Dist(S0(i), S) < minDist) {
      set minDist := Dist(S0(i), S) and minStage := i;
    }
    set S0(i + 1) := S0(i) and i := i + 1;
  }
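The growing loop can be sketched in Python as below. This is an illustrative reading of the pseudocode under several assumptions, not the presenters' implementation: it uses a squared-support-difference distance (the slide's actual distance is not given in the transcript), stops at a fixed size `n_final`, shuffles the remaining transactions before grouping, and omits the minDist/minStage bookkeeping. The names `fast_grow`, `item_counts`, and `dist_l2` are hypothetical.

```python
import random
from collections import Counter

def item_counts(transactions):
    """Number of transactions containing each item."""
    return Counter(item for t in transactions for item in set(t))

def dist_l2(counts0, n0, counts, n):
    """Assumed distance: sum over items in S of squared support differences."""
    return sum((counts0.get(a, 0) / n0 - c / n) ** 2 for a, c in counts.items())

def fast_grow(S, n_final, k=10, seed=0):
    """Grow a reduced sample S0 from S, one transaction per group of size <= k."""
    rng = random.Random(seed)
    target, n = item_counts(S), len(S)
    pool = list(range(n))          # indices of transactions still in S - S0
    chosen, counts0 = [], Counter()
    while len(chosen) < n_final and pool:
        rng.shuffle(pool)
        next_pool = []
        for start in range(0, len(pool), k):
            group = pool[start:start + k]
            if len(chosen) >= n_final:
                next_pool.extend(group)   # target size reached: leave group untouched
                continue
            # Pick t* in the group that brings S0 closest to S.
            best = min(group, key=lambda i: dist_l2(
                counts0 + Counter(set(S[i])), len(chosen) + 1, target, n))
            chosen.append(best)
            counts0.update(set(S[best]))
            next_pool.extend(i for i in group if i != best)
        pool = next_pool
    return [S[i] for i in chosen]

# Hypothetical toy sample S.
S = [("a", "b"), ("a", "c"), ("a", "b", "c"), ("b",), ("a",)] * 40
S0 = fast_grow(S, n_final=20, k=10)
```

Transactions not selected from a group return to the pool and are regrouped in the next stage, which is how this sketch interprets "divide S - S0 into disjoint groups" being repeated inside the while loop.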

Data Preparation
- Dataset downloaded from fimi.cs.helsinki.fi/data/accidents.pdf
- The data source for this dataset is the National Institute of Statistics for the region of Flanders in Belgium.
- In total, 572 unique attribute values occur in the dataset, and an average of 45 attribute values are recorded for each accident.

Experimental Results
- Dataset with 340,183 transactions
- Obtained a reduced sample of 30%
- Final sample ratios of 2.5%, 5%, 7.5% and 10%
- Parameters:
  - Minimum support = 0.77%
  - Group size k = 10
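To make the setup concrete, the short calculation below converts the sampling ratios and the minimum support into transaction counts. The dataset size and percentages come from the slide; the helper names and the choice to round up the support threshold are assumptions made for this sketch.

```python
import math

N = 340_183            # transactions in the full dataset (from the slide)
MIN_SUPPORT = 0.0077   # 0.77% (from the slide)

def sample_size(n_transactions, permille):
    """Sample size for a ratio given in tenths of a percent (25 means 2.5%)."""
    return round(n_transactions * permille / 1000)

def min_support_count(sample_len, min_support=MIN_SUPPORT):
    """Smallest transaction count that meets the relative minimum support."""
    return math.ceil(sample_len * min_support)

# Final sample sizes for 2.5%, 5%, 7.5% and 10%.
sizes = {p / 10: sample_size(N, p) for p in (25, 50, 75, 100)}
```

The exact products are 8,505, 17,009, 25,514 and 34,018 transactions, so the sample sizes quoted in the results (8,500 to 34,020) appear to be rounded figures. At a sample of 8,500 transactions, a 0.77% minimum support corresponds to an itemset appearing in at least 66 transactions.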

Experimental Results

Sampling ratio               # of rules produced   Accuracy
2.5% (8,500 transactions)    [missing]             [missing]%
5% (17,010 transactions)     585                   100%
7.5% (25,500 transactions)   445                   100%
10% (34,020 transactions)    585                   100%

Results
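The transcript does not say how the accuracy column was computed. One way such a figure could be obtained, shown here purely as an assumption, is to compare the set of rules (or frequent itemsets) mined from the sample with the set mined from the full dataset and penalize both missed and spurious results. The function name `accuracy` and the toy rule sets are hypothetical.

```python
def accuracy(found_in_sample, found_in_full):
    """1 - |symmetric difference| / (|sample results| + |full results|).

    Assumed measure: 1.0 means the sample reproduces the full result exactly,
    0.0 means the two result sets share nothing.
    """
    a, b = set(found_in_sample), set(found_in_full)
    if not a and not b:
        return 1.0
    return 1.0 - len(a ^ b) / (len(a) + len(b))

# Hypothetical rule sets, written as (antecedent, consequent) pairs.
full_rules = {("a", "b"), ("b", "c"), ("a", "c")}
sample_rules = {("a", "b"), ("b", "c"), ("c", "d")}
score = accuracy(sample_rules, full_rules)
```

Under this definition a sample that misses one true rule and adds one spurious rule out of three scores about 0.67, while identical result sets score 1.0.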

Conclusions
- There is no need to process the full, large input dataset.
- FAST-Grow can achieve high accuracy even with a small sampling ratio of 5-10%.
- The algorithm performs better when using the fixed-size stopping criterion.

References
[1] B. Chen, P. Haas, and P. Scheuermann. A new two-phase sampling based algorithm for discovering association rules. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[2] H. Bronnimann, B. Chen, P. Haas, M. Dash, Y. Qiao, and P. Scheuermann. Efficient Data-Reduction Methods for On-Line Association Rule Discovery. Presented at NSF Workshop on Next-Generation Data Mining (NGDM02), November.
[3] K. Geurts. Traffic Accidents Data Set. fimi.cs.helsinki.fi/data/accidents.pdf. Last access: 17/11/2005.
[4] GNU publicly available implementation of the Apriori algorithm, written by Christian Borgelt. Last access: 24/11/2005.

Thank you! Questions?