Jian Pei and Runying Mao (Simon Fraser University)

Slides:



Advertisements
Similar presentations
Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
Advertisements

Mining Frequent Patterns Using FP-Growth Method Ivan Tanasić Department of Computer Engineering and Computer Science, School of Electrical.
Mining Biological Data
Query Optimization of Frequent Itemset Mining on Multiple Databases Mining on Multiple Databases David Fuhry Department of Computer Science Kent State.
LOGO Association Rule Lecturer: Dr. Bo Yuan
Mining Frequent Item Sets by Opportunistic Projection
Rakesh Agrawal Ramakrishnan Srikant
Chapter 5: Mining Frequent Patterns, Association and Correlations
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining: Concepts and Techniques (2nd ed.) — Chapter 5 —
Our New Progress on Frequent/Sequential Pattern Mining We develop new frequent/sequential pattern mining methods Performance study on both synthetic and.
Association Rule Mining Part 2 (under construction!) Introduction to Data Mining with Case Studies Author: G. K. Gupta Prentice Hall India, 2006.
Multi-dimensional Sequential Pattern Mining
Data Mining Association Rules Yao Meng Hongli Li Database II Fall 2002.
Sequence Databases & Sequential Patterns
CS 590M Fall 2001: Security Issues in Data Mining Lecture 5: Association Rules, Sequential Associations.
September, 13th gR2002, Vienna PAOLO GIUDICI Faculty of Economics, University of Pavia Research carried out within the laboratory: Statistical.
2/8/00CSE 711 data mining: Apriori Algorithm by S. Cha 1 CSE 711 Seminar on Data Mining: Apriori Algorithm By Sung-Hyuk Cha.
Chaotic Mining: Knowledge Discovery Using the Fractal Dimension Daniel Barbara George Mason University Information and Software Engineering Department.
Pattern-growth Methods for Sequential Pattern Mining: Principles and Extensions Jiawei Han (UIUC) Jian Pei (Simon Fraser Univ.)
Performance and Scalability: Apriori Implementation.
Mining Association Rules between Sets of Items in Large Databases presented by Zhuang Wang.
Mining Sequential Patterns: Generalizations and Performance Improvements R. Srikant R. Agrawal IBM Almaden Research Center Advisor: Dr. Hsu Presented by:
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Association Rule Mining COMP Seminar GNET 713 BCB Module Spring 2007.
A Short Introduction to Sequential Data Mining
What Is Sequential Pattern Mining?
1 Apriori Algorithm Review for Finals. SE 157B, Spring Semester 2007 Professor Lee By Gaurang Negandhi.
October 2, 2015 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques — Chapter 8 — 8.3 Mining sequence patterns in transactional.
©Jiawei Han and Micheline Kamber
October 6, 2015Data Mining: Concepts and Techniques1 Data Mining: Concepts and Techniques — Slides for Textbook — — Chapter 6 — ©Jiawei Han and Micheline.
AR mining Implementation and comparison of three AR mining algorithms Xuehai Wang, Xiaobo Chen, Shen chen CSCI6405 class project.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
1 Data Mining and Warehousing: Session 6 Association Analysis Jia-wei Han
1 Multi-dimensional Sequential Pattern Mining Helen Pinto, Jiawei Han, Jian Pei, Ke Wang, Qiming Chen, Umeshwar Dayal ~From: 10th ACM Intednational Conference.
1 Knowledge discovery & data mining Association rules and market basket analysis --introduction A EDBT2000 Fosca Giannotti and Dino Pedreschi.
Association Rule Mining ARM
Mining High Utility Itemset in Big Data
Pattern-Growth Methods for Sequential Pattern Mining Iris Zhang
Han: Association Rule Mining; modified & extended by Ch. Eick 1 Association Rule Mining — Slides for Textbook — — Chapter 6 — ©Jiawei Han and Micheline.
Mining Frequent Patterns without Candidate Generation : A Frequent-Pattern Tree Approach 指導教授:廖述賢博士 報 告 人:朱 佩 慧 班 級:管科所博一.
Overview Definition of Apriori Algorithm
Mining Sequential Patterns © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 Slides are adapted from Introduction to Data Mining by Tan, Steinbach,
1 Top Down FP-Growth for Association Rule Mining By Ke Wang.
Fuzzy Set Approach for Improving Web Log Mining Sajitha Naduvil-Vadukootu Csc 8810 : Computational Intelligence Instructor: Dr. Yanqing Zhang Dec 4, 2006.
CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Information Management course
Predictive Analytics in SQL and Datalog
Association rule mining
Byung Joon Park, Sung Hee Kim
©Jiawei Han and Micheline Kamber
CS6220: Data Mining Techniques
Data Mining: Concepts and Techniques
©Jiawei Han and Micheline Kamber
CARPENTER Find Closed Patterns in Long Biological Datasets
Data Mining: Concepts and Techniques
CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
به نام خداوند جان و خرد الگوکاوي در پايگاه‌هاي تراکنش بسيار بزرگ با استفاده از رويکرد تقسيم وحل Frequent Pattern Mining on Very Large Transaction Databases.
Gyozo Gidofalvi Uppsala Database Laboratory
Association Rule Mining
Data Mining: Concepts and Techniques
Association Rule Mining
Chapter 5: Mining Frequent Patterns, Association and Correlations
Mining Frequent Patterns without Candidate Generation
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
©Jiawei Han and Micheline Kamber
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 6 —
Association Rule Mining
Online Analytical Processing Stream Data: Is It Feasible?
CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets
Presentation transcript:

Towards Data Mining Benchmarking: A Test Bed for Performance Study of Frequent Pattern Mining Jian Pei and Runying Mao (Simon Fraser University) Kan Hu and Hua Zhu (DBMiner Technology Inc.)

Outline Introduction Testing frequent pattern mining methods Performance study presentation The open test bed Conclusions

What Is Frequent Pattern Mining? Given a set of patterns, find the frequent sub-patterns Examples Given the set of transactions in a super market, try to find items frequently bought together Given the set of DNA of patients of one kind of disease, try to find what could be the possible common structure among them

Why Frequent Pattern Mining Important? Essential technique in many data mining tasks Association Correlations Sequential patterns Episodes Partial periodicity … Many novel and efficient methods are proposed in recent years

Comprehensive Performance Study and Benchmarking Testing related methods in one uniform platform and environment Important to both research and industry Features & problems of various methods can be recognized objectively Progress & novel inventions can be evaluated and reported consistently May lead to new idea in R & D Benchmarking is important in promotion of R & D in database industry

What Is the Demo? An open test bed for performance study of frequent pattern mining Implementations of typical frequent pattern mining methods A set of performance curves already got

Testing Frequent Pattern Mining Methods Mining complete set of frequent patterns Mining frequent closed itemsets Mining sequential patterns Mining max-patterns More is in plan and coming

Mining Complete Set of Frequent Patterns We demo Apriori TreeProjection FP-growth More is in plan DHP (Apriori + hashing) Partition Random sampling DIC (dynamic itemset counting) …

Mining Frequent Closed Itemsets A frequent itemset X is closed if there exists no itemset Y such that every transaction having X also contains Y We demo A-Close ChARM CLOSET

Mining Sequential Patterns Sequential patterns: frequent subsequences in a database of sequences We demo GSP FreeSpan

Mining Max-patterns A frequent pattern X is a max-pattern if every super-pattern of it is infrequent We are implementing MaxMiner TreeProjection FP-max

Performance Measurements Scalability with Size of datasets The support threshold Resource requirements Memory Disk space overhead CPU runtime

Data Sets for Testing Synthetic data generators Real datasets IBM Almaden synthetic data generator for associations and sequential patterns … Real datasets Irvine machine-learning database repository More data sets can be added in and dynamically connected

Performance Study Presentation The performance study is based on our current implementation according to the research papers We are willing to revise the performance study and obtain feedbacks from inventors Please consider donating your latest and most efficient implementation products

Mining Complete Set of Frequent Patterns on T10I4D100k

Mining Complete Set of Frequent Patterns on T25I20D100k

Mining Complete Set of Frequent Patterns on Connect-4

Mining Frequent Closed Itemsets on T25I20D100k

Mining Frequent Closed Itemsets on Connect-4

Mining Frequent Closed Itemsets on Pumsb

Mining Sequential Patterns on C10T2.5S4I1.25

Mining Sequential Patterns on C10T5S4I1.25

Mining Sequential Patterns on C10T5S4I2.5

Mining Sequential Patterns on C20T2.5S4I1.25

Mining Sequential Patterns on C20T2.5S8I1.25

The Architecture of the Open Test Bed

Features of The Test Bed Datasets are manageable Datasets and methods are independent Reporting on mining methods and/or datasets

Conclusions Comprehensive performance study and benchmarking is very important to data mining We demo A prototype of test bed Performance study using the test bed We plan to do Publish the test bed on web Benchmarking more data mining functionalities

Acknowledgements Thank Mr. Haiming Huang for helping us to implement the interface Thank anonymous reviewers for their comments Thank people in Intelligent Database Systems Lab., Simon Fraser University, for their help in research and development

References(1) R. Agarwal, C. Aggarwal, and V. V. V. Prasad. Depth-first generation of large itemsets for association rules. In IBM Technical Report RC21538, October, 1999. R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), (to appear), 2000. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487--499, Santiago, Chile, September 1994.

References(2) R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering, pages 3--14, Taipei, Taiwan, March 1995. R. J. Bayardo. Efficiently mining long patterns from databases. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 85--93, Seattle, Washington, June 1998. R. J. Bayardo, R. Agrawal, and D. Gunopulos. Constraint-based rule mining on large, dense data sets. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), Sydney, Australia, April 1999. S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 265--276, Tucson, Arizona, May 1997.

References(3) S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 255--264, Tucson, Arizona, May 1997. G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proc. 5th Int. Conf. Knowledge Discovery and Data Mining (KDD'99), pages 43--52, San Diego, August 1999. J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99), pages 106--115, Sydney, Australia, April 1999. J. Han and Y. Fu. Discovery of multiple-level association rules from large databases.

References(4) In Proc. 1995 Int. Conf. Very Large Data Bases, pages 420--431, Zurich, Switzerland, Sept. 1995. J. Han, J. Pei, and Y. Yin. Mining partial periodicity using frequent pattern trees. In Computing Science Techniqcal Report TR-99-10, Simon Fraser University, July 1999. M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 207--210, Newport Beach, California, August 1997. M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. In Proc. 3rd Int. Conf. Information and Knowledge Management, pages 401--408, Gaithersburg, Maryland, Nov. 1994.

References(5) B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), pages 220--231, Birmingham, England, April 1997. D. Lin and Z. Kedem. Pincer-search: A new algorithm for discovering the maximum frequent set. In Proc. of the Sixth Int'l Conf. on Extending Database Technology (EDBT'98), pages 105--119, Valencia, Spain, March 1998. H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. In Proc. AAAI'94 Workshop Knowledge Discovery in Databases (KDD'94), pages 181--192, Seattle, WA, July 1994. H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259--289, 1997.

References(6) R.J. Miller and Y. Yang. Association rules over interval data. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data, pages 452--461, Tucson, Arizona, May 1997. R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 13--24, Seattle, Washington, June 1998. J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. In Proc. 1995 ACM-SIGMOD Int. Conf. Management of Data, pages 175--186, San Jose, CA, May 1995. S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 343--354, Seattle, Washington, June 1998.

References(7) A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 432--443, Zurich, Switzerland, Sept. 1995. C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. In Proc. 1998 Int. Conf. Very Large Data Bases, pages 594--605, New York, NY, August 1998. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. 1995 Int. Conf. Very Large Data Bases, pages 407--419, Zurich, Switzerland, Sept. 1995. R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, pages 1--12, Montreal, Canada, June 1996.

References(8) R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. 5th Int. Conf. Extending Database Technology (EDBT), pages 3--17, Avignon, France, March 1996. R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. 3rd Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 67--73, Newport Beach, California, August 1997. H. Toivonen. Sampling large databases for association rules. In Proc. 1996 Int. Conf. Very Large Data Bases, pages 134--145, Bombay, India, Sept. 1996. M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343--374, 1997.