
An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases
Takeaki Uno, Tatsuya Asai, Yuzo Uchida, Hiroki Arimura
National Institute of Informatics, JAPAN / Kyushu University, JAPAN / Hokkaido University, JAPAN
4/Oct/2004, Discovery Science

Transaction Database
・ A transaction database T: a database composed of transactions defined on an itemset I, i.e., ∀t ∈ T, t ⊆ I
   - basket data
   - links of web pages
   - words in documents
・ A subset of I is called a pattern

T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }

Real world data is often large and sparse.

Occurrences of a Pattern
・ For a pattern P:
   - an occurrence of P: a transaction in T that includes P
   - the denotation of P: the set of occurrences of P
・ The size of the denotation is called the frequency of P

Ex.) In the database T above, denotation of {1,2} = { {1,2,5,6,7,9}, {1,2,7,8,9} }

Frequent Pattern
・ Given a minimum support θ, a frequent pattern is a pattern with frequency ≥ θ (a subset of items included in at least θ transactions)

Ex.) Patterns of T included in at least 3 transactions:
{1} {2} {7} {9} {1,7} {1,9} {2,7} {2,9} {7,9} {1,7,9} {2,7,9}

・ Frequent patterns play an important role in discovering interesting knowledge
・ However, the number of frequent patterns is often very large…
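The definition above can be checked by brute force on the slides' example database. A minimal Python sketch (illustrative only; this is not how LCM computes frequencies):

```python
from itertools import combinations

# Example database and minimum support from the slides.
T = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
theta = 3

def frequency(P):
    """Size of P's denotation: the number of transactions including P."""
    return sum(1 for t in T if P <= t)

items = sorted(set().union(*T))
frequent = {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if frequency(set(c)) >= theta}
```

With θ = 3 this yields the eleven nonempty patterns listed above; the exponential blow-up of such enumeration is exactly why closed patterns are introduced next.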

Closed Pattern [Pasquier et al. 1999]
・ Patterns having the same denotation carry essentially the same information
・ Classify patterns into equivalence classes by their denotations
・ Closed pattern: the maximal pattern in an equivalence class (= the intersection of the occurrences in its denotation)
・ Closure of a pattern: the closed pattern of its equivalence class

[Figure: equivalence classes of patterns; each class contains one closed pattern (its maximum) and possibly many non-closed patterns]
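The closure operation is just the intersection of a pattern's occurrences. A minimal sketch on the slides' example database (the function name `closure` is illustrative):

```python
# The slides' example database.
T = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def closure(P):
    """Closed pattern of P's equivalence class: the intersection of
    the transactions in P's denotation."""
    return set.intersection(*[t for t in T if P <= t])

# {1,2} is not closed: its closure {1,2,7,9} has the same denotation.
```

For instance, closure({1,2}) = {1,2,7,9} while closure({7,9}) = {7,9}, so {7,9} is already closed.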

Advantages of Closed Patterns
Completeness: [Mannila ’96; Pasquier ’99]
 - The set C of frequent closed patterns carries the complete information about the set F of all frequent patterns and their frequencies
 - Any maximal association rule can be constructed from C
 - C is sufficient for building classification rules over itemsets
Compactness: [Mannila ’96]
 - |Maximal Frequent| ≤ |C| ≤ |F|
 - The frequent closed patterns can be exponentially fewer than the frequent patterns

Th 1 [This paper]: For any n and m, we can construct a database of n items and m transactions such that |C| = O(m^2) while |F| = 2^Ω(n+m).

Problem and Result
・ PROBLEM: given a transaction database, find all frequent closed patterns
・ Many existing studies, in both theory and practice

We propose the prefix preserving closure extension, and an efficient algorithm LCM (Linear time Closed pattern Miner)
・ Theoretical advantage: runs in time linear in the number of frequent closed patterns, using small memory
・ Practical advantage: faster than the other algorithms on many datasets (almost all datasets of the KDD-cup and FIMI’03)

Existing Approach
・ Frequent pattern mining based approaches: enumerate frequent patterns, and output the closed patterns among them
・ They reduce computation time by avoiding non-closed patterns: during the enumeration,
 - eliminate unnecessary patterns from memory
 - prune unnecessary branches of the recursion (not completely)

Our Approach
・ Existing algorithms
 - possibly operate on many non-closed patterns
 - require much memory for storing the patterns obtained so far
We propose:
・ closure extension based enumeration → operates on closed patterns only (linear time)
・ prefix preserving closure extension → no memory needed for previously obtained patterns (small memory)
・ some algorithms for fast computation → faster than the other algorithms

Closure Extension [Pasquier et al. ’99]
・ Closure extension: a rule for constructing a closed pattern from another closed pattern: add an item, then take the closure (closed pattern + item → closure)
・ Any closed pattern is a closure extension of at least one other closed pattern
・ Any closed pattern is strictly smaller than any of its closure extensions

Acyclic Relation [essentially Pasquier et al. ’99]
・ Closure extension induces an acyclic search graph over the frequent closed patterns
・ Via closure extension, we can compute all frequent closed patterns in time linear in their number
・ However, we still have to store the obtained closed patterns in memory…

Prefix Preserving Closure Extension [new]
・ The prefix preserving closure extension (ppc extension) is a variation of closure extension

Def. The closure tail of a closed pattern P ⇔ the minimum j s.t. closure(P ∩ {1,…,j}) = P
Def. H = closure(P ∪ {i}) (a closure extension of P) is a ppc extension of P ⇔ i > (closure tail of P) and H ∩ {1,…,i−1} = P ∩ {1,…,i−1}

・ Any closed pattern H is generated by ppc extension from a unique closed pattern (namely, from closure(H ∩ {1,…,i−1}))
 → no duplication occurs in a depth-first search
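The two definitions can be checked directly, if inefficiently, by recomputing closures. A minimal sketch on the slides' example database (the names `closure_tail` and `is_ppc_extension` are illustrative, not from the paper's code):

```python
T = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
ITEMS = sorted(set().union(*T))

def closure(P):
    """Intersection of all transactions in P's denotation."""
    return frozenset(set.intersection(*[t for t in T if P <= t]))

def closure_tail(P):
    """Minimum j with closure(P ∩ {1,…,j}) = P (0 for the empty pattern)."""
    return next(j for j in [0] + ITEMS
                if closure({x for x in P if x <= j}) == P)

def is_ppc_extension(P, i):
    """Is closure(P ∪ {i}) a ppc extension of the closed pattern P?"""
    if i in P or i <= closure_tail(P):
        return False
    H = closure(P | {i})
    return {x for x in H if x < i} == {x for x in P if x < i}
```

For the closed pattern {2} (closure tail 2), adding 7 gives closure {2,7,9} with unchanged prefix, so it is a ppc extension; adding 9 also gives {2,7,9} but changes the prefix below 9, so it is not. This is how each closed pattern gets a unique parent.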

Relation of ppc Extension [new]
・ Any closed pattern is a ppc extension of a unique closed pattern
 → ppc extension forms a tree over the frequent closed patterns
・ We can run a depth-first search along ppc extensions, without storing the obtained closed patterns in memory
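Putting the pieces together, the ppc-extension tree can be traversed depth-first while storing nothing but the current path. The following compact Python sketch illustrates the search scheme only; it recomputes closures naively, whereas LCM itself uses occurrence deliver and database reduction for speed:

```python
T = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]
ITEMS = sorted(set().union(*T))

def closure(P):
    return frozenset(set.intersection(*[t for t in T if P <= t]))

def frequency(P):
    return sum(1 for t in T if P <= t)

def lcm(P, theta, out):
    """Output the closed pattern P, then recurse into its ppc extensions."""
    out.append(P)
    tail = next(j for j in [0] + ITEMS
                if closure({x for x in P if x <= j}) == P)
    for i in ITEMS:
        if i <= tail or i in P or frequency(P | {i}) < theta:
            continue
        H = closure(P | {i})
        if {x for x in H if x < i} == {x for x in P if x < i}:
            lcm(H, theta, out)

found = []
lcm(closure(set()), 3, found)   # all closed patterns with frequency >= 3
```

On the example database with θ = 3, the search visits exactly φ, {1,7,9}, {2}, {2,7,9}, and {7,9}, each once, confirming the tree shape.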

Example
For T = { {1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2} }, the closed patterns are φ, {2}, {7,9}, {2,5}, {1,7,9}, {2,7,9}, {2,3,4,5}, {1,2,7,9}, {1,2,7,8,9}, {1,2,5,6,7,9}
・ closure extension → acyclic graph
・ ppc extension → tree

[Figure: closure-extension edges (acyclic) and ppc-extension edges (tree) among these closed patterns]

Fast Computation
To generate a ppc extension of a closed pattern P with item i, we:
 1. compute the denotation of P ∪ {i}
 2. compute the closure of P ∪ {i}
 3. compare the prefixes
We propose efficient algorithms for these tasks.

Occurrence Deliver [new]
・ Compute the denotations of P ∪ {i} for all items i at once, by transposing the trimmed database
・ The trimmed database is composed of
 - the items that may be added
 - the transactions including P
・ Takes time linear in the size of the trimmed database
・ Efficient for sparse datasets

[Figure: pattern {1,2} with denotation {A,B,C}; one scan distributes A, B, C into the denotations of {1,2,3}, {1,2,4}, and {1,2,5}]
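The transposition step can be sketched in a few lines of Python. This is an illustrative rendering of the idea on the slides' example database, not the paper's C implementation:

```python
from collections import defaultdict

T = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def occurrence_deliver(occurrences, last_item):
    """One scan over the occurrences of P: each transaction is appended
    to the bucket of every addable item i > last_item it contains, so
    bucket i ends up holding the denotation of P ∪ {i}."""
    buckets = defaultdict(list)
    for t in occurrences:
        for i in t:
            if i > last_item:
                buckets[i].append(t)
    return buckets

occ = [t for t in T if {1, 2} <= t]     # denotation of {1,2}
buckets = occurrence_deliver(occ, 2)    # denotations of {1,2,i} for i > 2
```

The cost is one pass over the trimmed database, i.e., linear in its size, which is why the technique pays off on sparse data.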

Anytime Database Reduction [new]
・ Reduce the database, as in [FP-growth, etc.]:
 ◆ remove an item e if e is included in fewer than θ transactions, or is included in all transactions
 ◆ merge identical transactions into one
・ Recursively apply trimming and this reduction inside the recursion
 → the database becomes small in the lower levels of the recursion
・ For taking closures, keep the intersection of the merged transactions (the closure operation takes the intersection of transactions)
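The reduction rules above can be sketched as follows; `reduce_database` is an illustrative name, and real LCM applies this recursively inside the search rather than once up front:

```python
from collections import Counter, defaultdict

T = [{1,2,5,6,7,9}, {2,3,4,5}, {1,2,7,8,9}, {1,7,9}, {2,7,9}, {2}]

def reduce_database(db, theta):
    """Drop items occurring in fewer than theta transactions or in all
    of them, then merge identical trimmed transactions, keeping each
    group's multiplicity and the intersection of its originals (needed
    later for closure computation)."""
    count = Counter(i for t in db for i in t)
    keep = {i for i, c in count.items() if theta <= c < len(db)}
    groups = defaultdict(list)
    for t in db:
        groups[frozenset(t & keep)].append(t)
    return [(k, len(ts), set.intersection(*ts)) for k, ts in groups.items()]

reduced = reduce_database(T, 3)
```

With θ = 3, only items 1, 2, 7, 9 survive and the six transactions merge into four weighted ones, so the recursion below this point works on a much smaller database.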

Experiments
・ Computational environment
 - CPU, memory: AMD Athlon XP 1600+, 224MB
 - OS, programming language, compiler: Linux, C, gcc
・ Algorithms compared: FP-growth, afopt, MAFIA, PATRICIAMINE, kDCI (all of these scored highly in the FIMI’03 competition)
・ Datasets: 13 real-world, machine learning, and artificial datasets used in FIMI’03 and the KDD-cup, with the specified supports

Result
・ LCM won on 12 of the 13 datasets for every support (the exception being the Accidents dataset at middle supports)
・ It outperforms the others especially at smaller supports

Results: [running-time comparison charts omitted from transcript]

Conclusion
・ Closed patterns are representatives of the frequent patterns [Pasquier et al. ’99]
 - far fewer than the frequent patterns (possibly exponentially fewer)
 - useful for compact representation and rule induction
・ We proposed an algorithm, LCM, for mining closed patterns in transaction databases
 - prefix preserving closure extension gives a tree-shaped search space
 - time complexity is linear in the number of closed patterns, with a small memory footprint
 - practical speed-ups: occurrence deliver and anytime database reduction
・ Experiments show that LCM outperforms the other algorithms on most of the KDD-cup and FIMI datasets, especially with small supports

Future work: closed patterns for sequences, trees, and other structured data

LCM has been submitted to the FIMI’04 competition; please look forward to it!

List of Datasets
Real datasets:
 ・ BMS-WebView-1 ・ BMS-WebView-2 ・ BMS-POS ・ Retail ・ Kosarak ・ Accidents
Machine learning benchmarks:
 ・ Chess ・ Mushroom ・ Pumsb ・ Pumsb* ・ Connect
Artificial datasets:
 ・ T10I4D100K ・ T40I10D100K