Towards Data Mining Benchmarking: A Test Bed for Performance Study of Frequent Pattern Mining Jian Pei and Runying Mao (Simon Fraser University) Kan Hu and Hua Zhu (DBMiner Technology Inc.)
Outline Introduction Testing frequent pattern mining methods Performance study presentation The open test bed Conclusions
What Is Frequent Pattern Mining? Given a set of patterns, find the frequent sub-patterns Examples Given the set of transactions in a super market, try to find items frequently bought together Given the set of DNA of patients of one kind of disease, try to find what could be the possible common structure among them
Why Frequent Pattern Mining Important? Essential technique in many data mining tasks Association Correlations Sequential patterns Episodes Partial periodicity … Many novel and efficient methods are proposed in recent years
Comprehensive Performance Study and Benchmarking Testing related methods in one uniform platform and environment Important to both research and industry Features & problems of various methods can be recognized objectively Progress & novel inventions can be evaluated and reported consistently May lead to new idea in R & D Benchmarking is important in promotion of R & D in database industry
What Is the Demo? An open test bed for performance study of frequent pattern mining Implementations of typical frequent pattern mining methods A set of performance curves already got
Testing Frequent Pattern Mining Methods Mining complete set of frequent patterns Mining frequent closed itemsets Mining sequential patterns Mining max-patterns More is in plan and coming
Mining Complete Set of Frequent Patterns We demo Apriori TreeProjection FP-growth More is in plan DHP (Apriori + hashing) Partition Random sampling DIC (dynamic itemset counting) …
Mining Frequent Closed Itemsets A frequent itemset X is closed if there exists no itemset Y such that every transaction having X also contains Y We demo A-Close ChARM CLOSET
Mining Sequential Patterns Sequential patterns: frequent subsequences in a database of sequences We demo GSP FreeSpan
Mining Max-patterns A frequent pattern X is a max-pattern if every super-pattern of it is infrequent We are implementing MaxMiner TreeProjection FP-max
Performance Measurements Scalability with Size of datasets The support threshold Resource requirements Memory Disk space overhead CPU runtime
Data Sets for Testing Synthetic data generators Real datasets IBM Almaden synthetic data generator for associations and sequential patterns … Real datasets Irvine machine-learning database repository More data sets can be added in and dynamically connected
Performance Study Presentation The performance study is based on our current implementation according to the research papers We are willing to revise the performance study and obtain feedbacks from inventors Please consider donating your latest and most efficient implementation products
Mining Complete Set of Frequent Patterns on T10I4D100k
Mining Complete Set of Frequent Patterns on T25I20D100k
Mining Complete Set of Frequent Patterns on Connect-4
Mining Frequent Closed Itemsets on T25I20D100k
Mining Frequent Closed Itemsets on Connect-4
Mining Frequent Closed Itemsets on Pumsb
Mining Sequential Patterns on C10T2.5S4I1.25
Mining Sequential Patterns on C10T5S4I1.25
Mining Sequential Patterns on C10T5S4I2.5
Mining Sequential Patterns on C20T2.5S4I1.25
Mining Sequential Patterns on C20T2.5S8I1.25
The Architecture of the Open Test Bed
Features of The Test Bed Datasets are manageable Datasets and methods are independent Reporting on mining methods and/or datasets
Conclusions Comprehensive performance study and benchmarking is very important to data mining We demo A prototype of test bed Performance study using the test bed We plan to do Publish the test bed on web Benchmarking more data mining functionalities
Acknowledgements Thank Mr. Haiming Huang for helping us to implement the interface Thank anonymous reviewers for their comments Thank people in Intelligent Database Systems Lab., Simon Fraser University, for their help in research and development
References(1) R. Agarwal, C. Aggarwal, and V. V. V. Prasad. Depth-first generation of large itemsets for association rules. In IBM Technical Report RC21538, October, 1999. R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), (to appear), 2000. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487--499, Santiago, Chile, September 1994.
References(2) R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3--14, Taipei, Taiwan, March 1995. R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 85--93, Seattle, Washington, June 1998. R. J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining on large, dense data sets. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), Sydney, Australia, April 1999. S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 265--276, Tucson, Arizona, May 1997.
References(3) S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 255--264, Tucson, Arizona, May 1997. G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proc. 5th Int. Conf. Knowledge Discovery and Data Mining (KDD'99), pages 43--52, San Diego, August 1999. J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), pages 106--115, Sydney, Australia, April 1999. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases.
References(4) In Proc. 1995 Int. Conf. Very Large Data Bases, pages 420--431, Zurich, Switzerland, Sept. 1995. J. Han, J. Pei, and Y. Yin. Mining partial periodicity using frequent pattern trees. In Computing Science Techniqcal Report TR-99-10, Simon Fraser University, July 1999. M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 207--210, Newport Beach, California, August 1997. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proc. 3rd Int. Conf. Information and Knowledge Management, pages 401--408, Gaithersburg, Maryland, Nov. 1994.
References(5) B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), pages 220--231, Birmingham, England, April 1997. D. Lin and Z. Kedem. Pincer-search: A new algorithm for discovering the maximum frequent set. In Proc. of the Sixth Int'l Conf. on Extending Database Technology (EDBT'98), pages 105--119, Valencia, Spain, March 1998. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), pages 181--192, Seattle, WA, July 1994. H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259--289, 1997.
References(6) R.J. Miller and Y. Yang. Association rules over interval data. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 452--461, Tucson, Arizona, May 1997. R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 13--24, Seattle, Washington, June 1998. J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In Proc. 1995 ACM-SIGMOD Int. Conf. Management of Data, pages 175--186, San Jose, CA, May 1995. S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 343--354, Seattle, Washington, June 1998.
References(7) A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 432--443, Zurich, Switzerland, Sept. 1995. C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 594--605, New York, NY, August 1998. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 407--419, Zurich, Switzerland, Sept. 1995. R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 1--12, Montreal, Canada, June 1996.
References(8) R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT), pages 3--17, Avignon, France, March 1996. R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 67--73, Newport Beach, California, August 1997. H. Toivonen. Sampling large databases for association rules. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 134--145, Bombay, India, Sept. 1996. M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343--374, 1997.