LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets (Linear-time Closed itemset Miner). Takeaki Uno, Tatsuya Asai, Hiroaki Arimura, Yuzo Uchida.


LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets (Linear-time Closed itemset Miner)
Takeaki Uno, Tatsuya Asai, Hiroaki Arimura, Yuzo Uchida
National Institute of Informatics / Kyushu University
19/Nov/2003, FIMI 2003

Motivation
- We want to solve difficult problems in short time: there are few solutions at small supports on some datasets, and many solutions even at large supports on others.
(figure: benchmark datasets retail, accidents, IBM data, chess, connect, mushroom, kosarak, pumsb*, pumsb, BMS-POS, BMS-web1,2 arranged by support size and by whether #closed sets = #frequent sets or #closed sets << #frequent sets)
Techniques:
- database reduction
- removal of infrequent items
- sparse/dense handling (occurrence deliver / diffsets)
- exact enumeration of closed item sets
- generation of all/maximal item sets from the closed item sets

Outline of Our Research
- Exact enumeration of closed item sets (no sophisticated pruning, no post-processing, and no memory for the already obtained closed item sets)
- Enumerate all/maximal frequent item sets using the closed item sets
- Algorithms for updating occurrences and checking maximality in the dense and sparse cases, and their adaptive hybrid
- Save additional memory use (right-first sweep; adjacency matrix only for large transactions)

Exact Enumeration of Closed Item Sets
- Introduce an acyclic parent-child relationship on the frequent closed sets (it induces a tree-shaped traversal route with root = φ)
- Traverse the route in a depth-first manner (find a child, and go to it)
This gives:
- Exact enumeration (time linear in the number of closed sets)
- Any child is found by taking a closure (in short time)
- No need to store the obtained item sets (small memory)
- We can enumerate all closed item sets (even without a minimum support)

Definition of Parent
Closure = the maximal item set with the same occurrences.
- For a closed item set X: parent of X = closure of X ∩ {1,…,i}, where i is the maximum such that X ≠ closure of X ∩ {1,…,i}. Hence parent of X ⊆ X, and the relation is acyclic.
- X' is a child of X ⇔ X' is the closure of X ∪ {i} for some i, with (cond): X' \ X includes no item < i.
- All children are found by taking closures of X ∪ {i}, and (cond) can be checked in short time by some algorithms.
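The parent and child definitions above can be sketched in Python. This is a naive illustration on a tiny hypothetical database, not LCM's actual linear-time implementation; all names (`DB`, `occ`, `clo`, …) are invented for the sketch.

```python
# Toy transaction database (hypothetical); items are positive integers.
DB = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
ITEMS = sorted(set().union(*DB))
MINSUP = 2

def occ(X):
    """T(X): indices of the transactions that contain every item of X."""
    return [t for t, trans in enumerate(DB) if X <= trans]

def clo(X):
    """Closure: the maximal item set with the same occurrences as X."""
    T = occ(X)
    return set.intersection(*(DB[t] for t in T)) if T else set(ITEMS)

def parent(X):
    """clo(X ∩ {1..i}) for the largest i with clo(X ∩ {1..i}) != X."""
    for i in range(max(ITEMS), -1, -1):
        P = clo({j for j in X if j <= i})
        if P != X:
            return P
    return None  # X is the root clo(φ)

def children(X):
    """Frequent children of X: closures of X ∪ {i} satisfying (cond)."""
    for i in ITEMS:
        if i in X or len(occ(X | {i})) < MINSUP:
            continue
        C = clo(X | {i})
        # (cond): C \ X contains no item < i, and X is really C's parent
        if min(C - X) == i and parent(C) == X:
            yield C

def enumerate_closed(X=None):
    """Depth-first traversal of the parent-child tree from the root."""
    if X is None:
        X = clo(set())
    yield X
    for C in children(X):
        yield from enumerate_closed(C)
```

On this database with minimum support 2, the traversal starts at the root clo(φ) = {2} and yields the four frequent closed sets {2}, {1,2}, {1,2,3}, {2,3}, each exactly once and without storing any of them.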

Computation of the Occurrences of X ∪ {i} in Sparse and Dense Cases
- In the sparse case, trace the items of each occurrence of X (occurrence deliver; maybe a known technique).
- In the dense case, use diffsets (proposed by Zaki).
Adaptive hybrid algorithm: in each iteration we choose the better of the two according to estimates of their computation time.
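The two representations can be contrasted on a hypothetical toy database; `extend_occ` and `diffset` are illustrative names for this sketch, not LCM's actual API.

```python
# Toy transaction database (hypothetical); items are positive integers.
DB = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]

def occ(X):
    """T(X): indices of the transactions containing every item of X."""
    return {t for t, trans in enumerate(DB) if X <= trans}

def extend_occ(T, i):
    """Sparse case: T(X ∪ {i}) = {t in T(X) : i in DB[t]}."""
    return {t for t in T if i in DB[t]}

def diffset(T, i):
    """Dense case (Zaki's diffsets): D(X ∪ {i}) = T(X) minus T(X ∪ {i}),
    so sup(X ∪ {i}) = sup(X) - |D(X ∪ {i})|."""
    return {t for t in T if i not in DB[t]}

T = occ({2})              # {0, 1, 2, 3}: every transaction contains item 2
D = diffset(T, 3)         # {1}: the only occurrence of {2} missing item 3
sup_23 = len(T) - len(D)  # sup({2, 3}) = 4 - 1 = 3
```

When occurrence sets stay large (dense data), the diffsets shrink quickly and are the cheaper representation; when they shrink fast (sparse data), storing T(X) directly is cheaper, which is what the adaptive hybrid estimates per iteration.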

Maximal and All Frequent Sets
- Maximal frequent sets: generated from the closed item sets.
- All frequent sets (hypercube decomposition):
  -- decompose the classes of closed item sets into complete sublattices
  -- enumerate pairs of greatest/least elements of the sublattices
  -- generate the other sets from the pairs
(figure: the class of a closed item set as a 01-lattice)
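The generation step of hypercube decomposition can be sketched as follows, on a hypothetical toy database; the (least, greatest) pair below is hand-picked for illustration, whereas the algorithm enumerates such pairs per class of each closed item set.

```python
from itertools import combinations

# Toy transaction database (hypothetical); items are positive integers.
DB = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]

def occ(X):
    """T(X): indices of the transactions containing every item of X."""
    return {t for t, trans in enumerate(DB) if X <= trans}

def hypercube(least, greatest):
    """All item sets S with least ⊆ S ⊆ greatest. If both ends have the
    same occurrence set, every S in between does too, so each member is
    produced without any further frequency counting."""
    free = sorted(greatest - least)
    for r in range(len(free) + 1):
        for extra in combinations(free, r):
            yield least | set(extra)

# {1} and {1, 2} occur in exactly the same transactions, so the interval
# [{1}, {1, 2}] is one complete sublattice of the class of its closure {1, 2}.
assert occ({1}) == occ({1, 2})
cube = list(hypercube({1}, {1, 2}))
```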

Result
(chart over retail, accidents, IBM data, chess, connect, mushroom, kosarak, pumsb*, pumsb, BMS-POS, BMS-web1,2: fast if the support is small; fast or usual elsewhere; slower than others at large supports on some datasets)

Conclusion
- For data sets such that #frequent closed sets << #frequent sets (large business data sets: BMS-web1,2, retail; machine learning data sets with small supports: the UCI repository), exact enumeration of closed item sets and hypercube decomposition perform well: fast without pruning, tries, or other existing methods.
- These techniques are orthogonal to the other techniques (database reduction, pruning of infrequent items, …), so we can do better at large supports / on accidents (the blue area).
- The parameter of the hybrid is not tuned, so we were not fast on kosarak and the IBM data; it is now faster.
For further speed up:

We think…
- What is the real problem (bottleneck)? Mining structured item sets (closed item sets, association rules with thresholds, …).
- Is it only a counting problem? For all-frequent-item-set mining, yes; the problem is how to make the occurrences of an item set from those of other item sets (choose the best way, represent …).
- Is the maximal item set useful? The closed item set is useful!! It has applications to classification and association rule mining.

Some Observations
- Almost all of the computation is for updating occurrences. Do we really need to prune? Computing the occurrences of the infrequent extensions of X is usually less than 1/2 of the work.
(figure: frequencies of X ∪ {1}, …, X ∪ {5} computed from X)
- There is a best item e from which to get the occurrences of X. Can we design an algorithm that chooses e in each iteration? How do we find this e? Does this accelerate the computation? (We can evaluate a lower bound on the occurrence computation.)
- Is pruning of infrequent sets really necessary? Do we need to accelerate the occurrence computation?

Some Observations (cont.)
- Do we really need to prune? Computing the occurrences of the infrequent extensions of X is usually less than 1/2 of the work.
(figure: frequencies of X ∪ {10}, …, X ∪ {14} computed from X)

Right-First Sweep
- Generate the recursive calls in decreasing order of items.
- Clear the memory after each recursive call, and re-use it in the following recursive calls; child iterations then need no additional memory.
(figure: recursion over X ∪ {10}, …, X ∪ {14} re-using memory blocks A–E)

Occurrence Deliver
- Compute T(X ∪ {i}) by tracing each occurrence of X; fast in sparse cases.
(figure: occurrences A–E of X delivered to the buckets of X ∪ {10}, …, X ∪ {14})
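Occurrence deliver can be sketched on a hypothetical toy database: a single scan over the occurrences of X fills the occurrence lists of every extension X ∪ {i} at once, instead of intersecting lists item by item.

```python
from collections import defaultdict

# Toy transaction database (hypothetical); items are positive integers.
DB = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]

def occ(X):
    """T(X): indices of the transactions containing every item of X."""
    return [t for t, trans in enumerate(DB) if X <= trans]

def occurrence_deliver(X):
    """Bucket each occurrence t of X under every item i ∉ X that t
    contains, yielding T(X ∪ {i}) for all i simultaneously, in time
    linear in the total size of the occurrences of X."""
    buckets = defaultdict(list)
    for t in occ(X):
        for i in DB[t]:
            if i not in X:
                buckets[i].append(t)
    return dict(buckets)

delivered = occurrence_deliver({2})
# delivered == {1: [0, 1, 3], 3: [0, 2, 3], 4: [3]}
```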

Checking (cond) of the Closure
- Check (cond): (closure of X ∪ {i}) \ X includes no item < i.
- In the sparse case: for each possible item j, find an occurrence not including j.
- In the dense case: update the occurrences of all frequent X ∪ {j}, and compute T(X ∪ {i} ∪ {j}).
- Quite a bit faster than computing the full closure of X ∪ {i}.
(figure: occurrences A, B, C of X ∪ {1}, X ∪ {2}, …, X ∪ {14})
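The sparse-case check can be sketched as follows, again on a hypothetical toy database. An item j enters the closure of X ∪ {i} exactly when every occurrence of X ∪ {i} contains j, so for each j < i it suffices to find one occurrence missing j; no full closure is computed. `cond_holds` is an invented name for this sketch.

```python
# Toy transaction database (hypothetical); items are positive integers.
DB = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]

def occ(X):
    """T(X): indices of the transactions containing every item of X."""
    return [t for t, trans in enumerate(DB) if X <= trans]

def cond_holds(X, i):
    """True iff the closure of X ∪ {i} adds no item smaller than i."""
    T = occ(X | {i})
    for j in range(1, i):
        if j in X:
            continue
        if all(j in DB[t] for t in T):
            return False  # j < i would enter the closure: (cond) fails
    return True

# cond_holds({2}, 3) is True:  clo({2, 3}) = {2, 3} adds nothing below 3.
# cond_holds({2, 3}, 4) is False: clo({2, 3, 4}) = {1, 2, 3, 4} adds 1 < 4.
```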

Results
(charts: running times for all / closed / maximal frequent item set mining)