Jian Pei and Runying Mao (Simon Fraser University)

Towards Data Mining Benchmarking: A Test Bed for Performance Study of Frequent Pattern Mining
Jian Pei and Runying Mao (Simon Fraser University); Kan Hu and Hua Zhu (DBMiner Technology Inc.)

Outline
Introduction; testing frequent pattern mining methods; performance study presentation; the open test bed; conclusions.

What Is Frequent Pattern Mining?
Given a set of patterns, find the frequent sub-patterns. Examples: Given the set of transactions in a supermarket, find the items that are frequently bought together. Given the DNA of patients with a particular disease, find the common structures they may share.
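The supermarket example can be made concrete with a brute-force count. This is only an illustrative sketch, not one of the methods in the demo: the function name `frequent_itemsets`, the basket contents, and the threshold are all assumptions, and enumerating every subset is practical only for tiny inputs.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets in >= min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        # Count every non-empty subset of the transaction once.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical market baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
result = frequent_itemsets(baskets, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

With a threshold of 2, the pair of diapers and beer is reported as frequent because it appears in two of the four baskets.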

Why Is Frequent Pattern Mining Important?
It is an essential technique in many data mining tasks: associations, correlations, sequential patterns, episodes, partial periodicity, … Many novel and efficient methods have been proposed in recent years.

Comprehensive Performance Study and Benchmarking
Testing related methods on one uniform platform and environment is important to both research and industry. The features and problems of the various methods can be recognized objectively. Progress and novel inventions can be evaluated and reported consistently. It may lead to new ideas in R&D. Benchmarking is important in promoting R&D in the database industry.

What Is the Demo?
An open test bed for performance study of frequent pattern mining; implementations of typical frequent pattern mining methods; a set of performance curves already obtained.

Testing Frequent Pattern Mining Methods
Mining the complete set of frequent patterns; mining frequent closed itemsets; mining sequential patterns; mining max-patterns. More are planned.

Mining Complete Set of Frequent Patterns
We demo: Apriori, TreeProjection, FP-growth. More are planned: DHP (Apriori + hashing), Partition, random sampling, DIC (dynamic itemset counting), …
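For readers unfamiliar with the first method in the list, here is a minimal level-wise sketch in the spirit of Apriori (candidate generation by joining frequent k-itemsets, then pruning candidates that have an infrequent subset). It is not the implementation used in the test bed; the function name, the data, and the threshold are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise search: return {frozenset: support count} of frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate with one pass over the data.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets that give a k-itemset.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transactions, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
patterns = apriori(baskets, min_support=2)
print(len(patterns), "frequent itemsets")
```

The prune step is what separates this from brute-force enumeration: a 3-itemset such as bread, diapers, beer is never counted because its subset bread, beer is already known to be infrequent.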

Mining Frequent Closed Itemsets
A frequent itemset X is closed if there exists no proper superset Y of X such that every transaction containing X also contains Y. We demo: A-Close, ChARM, CLOSET.
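The definition is equivalent to saying that no proper superset of X has the same support as X. A small sketch of that check follows; it filters an already-mined table of frequent itemsets and is not how A-Close, ChARM, or CLOSET work internally. The function name and the example supports are assumptions.

```python
def closed_itemsets(frequent):
    """Keep itemsets that have no proper superset with the same support.

    `frequent` maps frozenset -> support count and must contain every
    frequent itemset, so that any equal-support superset is present.
    """
    return {
        x: support
        for x, support in frequent.items()
        if not any(x < y and support == other for y, other in frequent.items())
    }


# Supports for the hypothetical transactions {a,b}, {a,b,c}, {a,b,c}
# with a minimum support of 2.
frequent = {
    frozenset("a"): 3, frozenset("b"): 3, frozenset("c"): 2,
    frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
    frozenset("abc"): 2,
}
closed = closed_itemsets(frequent)
print({"".join(sorted(x)): s for x, s in closed.items()})
```

Here only {a, b} and {a, b, c} survive: every transaction containing a also contains b, and every transaction containing c also contains a and b.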

Mining Sequential Patterns
Sequential patterns: frequent subsequences in a database of sequences. We demo: GSP, FreeSpan.
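To illustrate what "frequent subsequence" means, the sketch below counts how many sequences in a small database contain a given pattern. It assumes each sequence is an ordered list of itemsets and that a pattern matches when its elements are subsets of distinct sequence elements in the same order. The function names and data are hypothetical, and this is only the support test, not GSP or FreeSpan.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence` (both lists of sets)."""
    pos = 0
    for element in pattern:
        # Greedily advance to the first remaining element that covers it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for sequence in database if contains(sequence, pattern))


# Hypothetical customer sequences, invented for illustration.
database = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a"}, {"c"}],
    [{"b"}, {"a"}, {"c", "d"}],
]
print(support(database, [{"a"}, {"c"}]))
print(support(database, [{"b"}, {"d"}]))
```

A pattern is then frequent when its support reaches the chosen threshold.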

Mining Max-patterns
A frequent pattern X is a max-pattern if every proper super-pattern of it is infrequent. We are implementing: MaxMiner, TreeProjection, FP-max.
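Given the complete set of frequent itemsets, the max-patterns are those with no frequent proper superset. The following sketch applies that filter directly. It is an illustration of the definition under assumed names and data, not the search strategy of MaxMiner, TreeProjection, or FP-max, which avoid materializing all frequent itemsets first.

```python
def max_patterns(frequent):
    """Return the itemsets in `frequent` that have no proper superset in it."""
    frequent = set(frequent)
    return {x for x in frequent if not any(x < y for y in frequent)}


# Frequent itemsets for a hypothetical dataset, invented for illustration.
frequent = {
    frozenset("a"), frozenset("b"), frozenset("c"), frozenset("d"),
    frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("ad"),
    frozenset("abc"),
}
maximal = max_patterns(frequent)
print(sorted("".join(sorted(x)) for x in maximal))
```

The two max-patterns {a, b, c} and {a, d} summarize all nine frequent itemsets, since every frequent itemset is a subset of one of them.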

Performance Measurements
Scalability with: the size of datasets, the support threshold. Resource requirements: memory, disk space overhead, CPU runtime.
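As a rough sketch of how such measurements might be collected, the harness below records CPU time and peak memory for one mining run and repeats it over several support thresholds. Everything here is an assumption for illustration: the `measure` helper, the placeholder miner `frequent_items`, and the data. `tracemalloc` only tracks memory allocated by Python itself, and disk space overhead is not measured.

```python
import time
import tracemalloc
from collections import Counter


def frequent_items(transactions, min_support):
    """Placeholder miner: frequent single items only."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: n for item, n in counts.items() if n >= min_support}


def measure(miner, transactions, min_support):
    """Run one mining task and report pattern count, CPU seconds, peak bytes."""
    tracemalloc.start()
    start = time.process_time()  # CPU time of this process, not wall-clock time
    patterns = miner(transactions, min_support)
    cpu_seconds = time.process_time() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "min_support": min_support,
        "patterns": len(patterns),
        "cpu_seconds": cpu_seconds,
        "peak_bytes": peak_bytes,
    }


# Hypothetical transactions, invented for illustration.
transactions = [{1, 2}, {1, 3}, {1, 2, 3}, {2}]
rows = [measure(frequent_items, transactions, s) for s in (2, 3, 4)]
for row in rows:
    print(row)
```

Scalability with dataset size could be studied the same way by holding the threshold fixed and passing progressively larger samples of the data.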

Data Sets for Testing
Synthetic data generators: IBM Almaden synthetic data generator for associations and sequential patterns, … Real datasets: Irvine machine-learning database repository. More data sets can be added and dynamically connected.

Performance Study Presentation
The performance study is based on our current implementations, written according to the research papers. We are willing to revise the performance study and welcome feedback from the inventors. Please consider donating your latest and most efficient implementations.

Mining Complete Set of Frequent Patterns on T10I4D100k

Mining Complete Set of Frequent Patterns on T25I20D100k

Mining Complete Set of Frequent Patterns on Connect-4

Mining Frequent Closed Itemsets on T25I20D100k

Mining Frequent Closed Itemsets on Connect-4

Mining Frequent Closed Itemsets on Pumsb

Mining Sequential Patterns on C10T2.5S4I1.25

Mining Sequential Patterns on C10T5S4I1.25

Mining Sequential Patterns on C10T5S4I2.5

The Architecture of the Open Test Bed

Features of The Test Bed
Datasets are manageable. Datasets and methods are independent. Reporting on mining methods and/or datasets.
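One way to read "datasets and methods are independent" is that each is registered separately and a report is produced by pairing them at run time. The sketch below shows that idea only. The slides do not describe the test bed's actual architecture, so the registries, function names, and toy entries are all hypothetical.

```python
from collections import Counter

DATASETS = {}  # name -> zero-argument loader returning a list of transactions
METHODS = {}   # name -> miner(transactions, min_support) returning patterns


def register_dataset(name, loader):
    DATASETS[name] = loader


def register_method(name, miner):
    METHODS[name] = miner


def report(min_support, methods=None, datasets=None):
    """Run the selected methods on the selected datasets (default: all)."""
    rows = []
    for dataset_name in datasets or DATASETS:
        data = DATASETS[dataset_name]()
        for method_name in methods or METHODS:
            patterns = METHODS[method_name](data, min_support)
            rows.append((method_name, dataset_name, len(patterns)))
    return rows


def count_items(transactions, min_support):
    """Placeholder miner: frequent single items only."""
    counts = Counter(item for t in transactions for item in set(t))
    return {item: n for item, n in counts.items() if n >= min_support}


# Hypothetical registrations, invented for illustration.
register_dataset("toy", lambda: [{1, 2}, {1, 3}, {1, 2, 3}, {2}])
register_dataset("tiny", lambda: [{1}, {1}])
register_method("count-items", count_items)
print(report(min_support=2))
```

Adding a dataset or a method is then a single registration call, and a report can be restricted to chosen methods, chosen datasets, or both.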

Conclusions
Comprehensive performance study and benchmarking are very important to data mining. We demo: a prototype of the test bed, a performance study using the test bed. We plan to: publish the test bed on the web, benchmark more data mining functionalities.

Acknowledgements
We thank Mr. Haiming Huang for helping us implement the interface, the anonymous reviewers for their comments, and the people in the Intelligent Database Systems Lab., Simon Fraser University, for their help in research and development.

