1 Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining (SIGKDD'03). Mohammad El-Hajj, Osmar R. Zaïane. Presented by: Ivy Tong, 19 December 2003.
2 Introduction Association rule mining: existing algorithms depend heavily on memory size and on repeated I/O scans of the dataset => insufficient for extremely large datasets.
3 Related Work Two major approaches in the literature: Apriori-like algorithms and FP-tree-like algorithms. Apriori: extensive I/O scans of the DB and high computational cost. FP-tree: high memory requirement (data structure + conditional trees); storing the structure on disk significantly increases the number of I/Os.
4 Contributions of this paper Current algorithms handle small DBs with low dimensionality: they scale up to a few million transactions and a few thousand dimensions. No algorithm can handle > 15M transactions and hundreds of thousands of dimensions. This paper: a disk-based algorithm based on the conditional pattern concept, divided into 2 phases.
5 Contributions of this paper Phase 1 (Pre-processing): 2 full I/O scans of the dataset to generate a special disk-based data structure called the Inverted Matrix. Phase 2 (Mining): mine the Inverted Matrix (possibly with different support levels) to generate association rules, using less than one full I/O scan of the data structure: only the part covering frequent items is scanned and used to generate frequent patterns.
6 Outline of the talk Problem Statement Transactional Layout Motivations for using the Inverted Matrix Design and Construction of COFI-trees (Co-Occurrence Frequent-Item Trees) Experimental Results Conclusions and Future Work
7 Problem Statement Let I = {i1, i2, …, im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items (an itemset) such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅.
8 Problem Statement An itemset X is large (frequent) if its support s is greater than or equal to a given minimum support threshold min_sup. The rule X => Y has support s if s% of the transactions in D contain X ∪ Y. The rule X => Y has confidence c if c% of the transactions in D that contain X also contain Y.
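To make the support and confidence definitions concrete, here is a minimal Python sketch (not part of the original slides); the toy transactions and the rule A => C are illustrative assumptions.

```python
# Minimal sketch: support and confidence over a toy set of transactions.
# The transactions and the rule A => C are illustrative assumptions.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """support(X u Y) / support(X): how often Y occurs when X does."""
    return support(x | y, transactions) / support(x, transactions)

print(support({"A", "C"}, transactions))       # 0.5 -> frequent if min_sup <= 50%
print(confidence({"A"}, {"C"}, transactions))  # ~0.67 -> rule A => C has ~67% confidence
```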
9 Transactional Layout In practice, the minimum support is not known in advance and requires tuning, so the mining process is repeated whenever the support threshold changes. If min_sup is reduced: Apriori needs new scans of the DB, and FP-growth builds a new memory structure. Previously accumulated knowledge is not reused.
11 Horizontal VS Vertical The format of the transactions in the DB affects the efficiency of the algorithms. Existing algorithms use one of the two: Horizontal - relates all items of the same transaction together; key: the transaction ID. Vertical - relates all transactions that share the same item together; key: the item.
12 Horizontal Layout Vertical Layout
13 Horizontal VS Vertical Horizontal: combines all items of one transaction together; can eliminate the candidate generation step by using clever techniques, e.g. FP-growth. Problem: wasted work scanning the whole DB. Vertical: an index on the items; reduces the effect of large data sizes. Problem: expensive candidate generation, since the records of the different items of a candidate pattern must be intersected.
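As a small illustration of the two layouts (not from the original slides), the sketch below converts a horizontal (TID -> items) table into a vertical (item -> TIDs) index over made-up data.

```python
from collections import defaultdict

# Horizontal layout: key is the transaction ID, value is the transaction's items.
horizontal = {
    1: ["A", "B", "C"],
    2: ["A", "C"],
    3: ["B", "D"],
}

# Vertical layout: key is the item, value is the list of TIDs containing it.
vertical = defaultdict(list)
for tid, items in horizontal.items():
    for item in items:
        vertical[item].append(tid)

print(dict(vertical))   # {'A': [1, 2], 'B': [1, 3], 'C': [1, 2], 'D': [3]}
# Counting a candidate such as {A, C} in the vertical layout means intersecting
# the TID lists of A and C, which is what makes candidate generation expensive.
```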
14 Inverted Matrix Layout Combines horizontal and vertical layouts. Idea: associate each item with all transactions in which it occurs (inverted index), and associate each transaction with all its items using pointers.
15 Inverted Matrix Layout Similar to the vertical layout: the key is the item. Differences: the fields are not transaction IDs; each field is a pointer to the location of the next item of the same transaction. A pointer has 2 parts: the first element is the address of a row in the matrix (the row of the next item), the second element is the address of a column.
16 Inverted Matrix Layout Each row in the matrix has an address and is prefixed by the item it represents, together with its frequency in the database. Rows are ordered in ascending order of the frequency of the item they represent.
17 Inverted Matrix
18 Building the Inverted Matrix Two phases. Phase 1: scan the database once to find the frequency of each item, then order the items in ascending order of their frequencies.
19 Building the Inverted Matrix Phase 2: scan the database a second time. For each transaction, sort the items in ascending order of their frequencies and fill in the matrix accordingly.
20 Building the Inverted Matrix (Example) First transaction: (A, B, C, D, E) is reordered by ascending frequency into (D, B, C, E, A).
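The sketch below mirrors the two-phase construction just described, using plain Python dicts and lists as a stand-in for the disk-based matrix; this in-memory representation and the function name build_inverted_matrix are my own simplifications, not the paper's on-disk format.

```python
from collections import Counter

def build_inverted_matrix(transactions):
    # Phase 1: one scan to count item frequencies, then order items ascendingly.
    freq = Counter(item for t in transactions for item in t)
    order = sorted(freq, key=lambda i: freq[i])           # ascending frequency
    rank = {item: r for r, item in enumerate(order)}

    # One row per item: the item's frequency plus a list of pointer cells
    # (next_item_row, next_item_column); None marks the last item of a transaction.
    matrix = {item: {"freq": freq[item], "cells": []} for item in order}

    # Phase 2: second scan; reorder each transaction and chain its items.
    for t in transactions:
        items = sorted(set(t), key=lambda i: rank[i])     # ascending frequency
        for pos, item in enumerate(items):
            if pos + 1 < len(items):
                nxt = items[pos + 1]
                # The next item's cell will be appended at exactly this column index.
                pointer = (nxt, len(matrix[nxt]["cells"]))
            else:
                pointer = None
            matrix[item]["cells"].append(pointer)
    return matrix, order

# In the slide's example, transaction (A, B, C, D, E) is inserted as (D, B, C, E, A).
```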
22 Mining the Inverted Matrix Objectives: minimize candidate generation and eliminate scans of infrequent items. Support border: the first item in the index of the Inverted Matrix whose support is greater than or equal to min_sup.
23 Mining the Inverted Matrix Follow the chain of items starting from C and rebuild the parts of the transactions that contain only frequent items. Non-frequent items are never processed, and the whole transaction is never rebuilt at once.
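A rough sketch of the chain-following idea (not from the original slides), written against the simplified matrix representation of the previous sketch: starting from a frequent item's row, follow the pointers to rebuild only sub-transactions made of items at least as frequent, never touching the rows of infrequent items.

```python
def sub_transactions_for(item, matrix):
    """For each occurrence of `item`, yield the sub-transaction made of `item`
    followed by the more frequent items of the same original transaction.
    Rows of items less frequent than `item` are never visited."""
    for col in range(len(matrix[item]["cells"])):
        sub, row, c = [], item, col
        while True:
            sub.append(row)
            pointer = matrix[row]["cells"][c]
            if pointer is None:
                break
            row, c = pointer
        yield sub

# Usage sketch: if C is the support border, only chains starting at C's row
# (continuing upward through more frequent items) are ever read.
# matrix, _ = build_inverted_matrix(transactions)
# for sub in sub_transactions_for("C", matrix):
#     print(sub)
```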
24 COFI-Tree Co-Occurrence Frequent-Item Tree. To compute frequencies: read the sub-transactions of frequent items directly from the Inverted Matrix, build an independent, relatively small COFI-tree for each frequent item in the transactional database, and mine each tree separately once it is built (it is discarded as soon as it is mined).
25 COFI-Tree Similar to a conditional FP-tree: a header of ordered frequent items, horizontal pointers pointing to a succession of nodes containing the same frequent item, and a prefix tree whose paths represent sub-transactions. Differences: bidirectional links in the tree; nodes contain an item label, a frequency counter and a participation counter (explained later). A COFI-tree for a frequent item x contains only nodes labeled with items that are more frequent than, or as frequent as, x.
26 COFI-Tree Assume a transactional DB with frequent items (A, B, C, D, E, F) and order of increasing frequencies F < E < C < D < B < A. The sub-transactions are generated from the Inverted Matrix (not materialized).
27 COFI-Tree: Construction Itemsets of different sizes are found simultaneously. For each frequent 1-itemset, find all frequent k-itemsets that subsume it. A COFI-tree is built for each frequent item except the most frequent one, starting from the least frequent.
29 Construction Example 1 F is the least frequent item, so a COFI-tree is built for F first. All frequent items that are more frequent than F participate in building this tree. Root node: F (the item in question). For each sub-transaction containing F together with other frequent items (more frequent than F), a branch is formed starting from the root. If multiple sub-transactions share the same prefix, they are merged into one branch and the counts are adjusted accordingly.
30 Construction Example 2 Each tree node carries an item label, a support count and a participation count (initialized to 0, used in mining later). Round node: tree node. Square node: cell from the header table. Horizontal link: points to the next node that has the same item name. Vertical (bidirectional) link: links a parent with its child and the child with its parent.
31 Construction Example 2 (building the C-COFI-tree, nodes labeled item:support:participation) Step 1: insert CBA:4 => C:4:0 with branch B:4:0 - A:4:0. Step 2: insert CDA:1 => C:5:0, new branch D:1:0 - A:1:0. Step 3: insert CB:1 => C:6:0, B:5:0 - A:4:0. Step 4: insert CD:2 => C:8:0, D:3:0 - A:1:0.
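A minimal sketch of COFI-tree construction (my own simplification, not the paper's code): each node stores item, support and participation counters, and sub-transactions sharing a prefix are merged into the same branch; the header table and horizontal links are omitted for brevity.

```python
class CofiNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.support = 0        # frequency counter
        self.participation = 0  # used later, during mining
        self.parent = parent    # bidirectional: child -> parent
        self.children = {}      # item label -> child node (parent -> child)

def build_cofi_tree(root_item, sub_transactions):
    """sub_transactions: iterable of (items, count); items start with root_item
    followed by more frequent items, e.g. (["C", "B", "A"], 4)."""
    root = CofiNode(root_item)
    for items, count in sub_transactions:
        root.support += count
        node = root
        for item in items[1:]:                    # skip the root item itself
            if item not in node.children:         # new branch
                node.children[item] = CofiNode(item, parent=node)
            node = node.children[item]            # shared prefix: reuse branch
            node.support += count
    return root

# The slide's four C-sub-transactions yield root C:8 with branches B:5 - A:4 and D:3 - A:1.
c_tree = build_cofi_tree("C", [(["C", "B", "A"], 4), (["C", "D", "A"], 1),
                               (["C", "B"], 1), (["C", "D"], 2)])
```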
32 Construction-Example
33 Generate Frequent Patterns The COFI-trees of all frequent items are not constructed together: each tree is built, mined and then discarded before the next COFI-tree is built. Mining is done for each tree independently, to find all frequent k-itemset patterns in which the item at the root participates. A top-down approach is used to generate the patterns.
34 Generate Frequent Patterns The COFI-tree for a frequent item I is built by following the chain of pointers in the Inverted Matrix. The I-COFI-tree is mined branch by branch, starting with the node of the most frequent item and going upward in the tree to identify candidate frequent patterns containing I. A list of these candidates is kept and updated with the frequencies of the branches in which they occur. Since a node can belong to more than one branch, a participation count is used to avoid re-counting.
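The sketch below illustrates the branch-by-branch mining step with participation counts, written against the hypothetical CofiNode class from the previous sketch; it is a simplification of the paper's procedure: each branch's not-yet-attributed count is added to all candidate patterns (subsets of the branch combined with the root item), and the participation counters prevent re-counting across branches.

```python
from itertools import combinations

def mine_cofi_tree(root, min_count):
    """Walk each branch upward from its deepest node; the participation counter
    ensures support already attributed to a longer branch is not counted again."""
    # Collect non-root nodes, deepest first, so longer branches are consumed first.
    nodes, stack = [], [(child, 1) for child in root.children.values()]
    while stack:
        node, depth = stack.pop()
        nodes.append((depth, node))
        stack.extend((child, depth + 1) for child in node.children.values())
    nodes.sort(key=lambda pair: -pair[0])

    patterns = {frozenset([root.item]): root.support}
    for _, node in nodes:
        count = node.support - node.participation   # support not yet attributed
        if count == 0:
            continue
        path, cur = [], node
        while cur is not root:                      # branch from this node up to the root
            cur.participation += count              # mark this support as consumed
            path.append(cur.item)
            cur = cur.parent
        # Every subset of the branch, together with the root item, is a candidate.
        for k in range(1, len(path) + 1):
            for combo in combinations(path, k):
                key = frozenset(combo) | {root.item}
                patterns[key] = patterns.get(key, 0) + count
    return {p: c for p, c in patterns.items() if c >= min_count}

# e.g. mine_cofi_tree(c_tree, 3) keeps {C}:8, {C,B}:5, {C,A}:5, {C,A,B}:4, {C,D}:3.
```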
36 Experimental Results Compared with Apriori and FP-growth, run on a 733-MHz machine with 256 MB RAM using an IBM synthetic dataset. 2 tests: time needed to mine different transaction sizes, and time needed to mine with different support levels.
37 Scalability Settings: min_sup = 0.01%, DB size 1M-25M transactions, average transaction length 24 items; dimensionality: 10,000 items for the 1M DB, 100,000 items for 5M-25M. Results: Apriori failed to mine the 5M DB, and FP-growth could not mine beyond 5M; the Inverted Matrix scales well and can mine 25M transactions in 2731 seconds.
38 Performance (vs. min_sup) Settings: 1M transactions, 10,000 items, average transaction length 24, min_sup from 0.0025% to 0.01%. Results: the matrix is built in 763 s and occupies 109 MB on disk (original DB: 102 MB). (Figures: time needed to mine 1M transactions with different support levels, and accumulated time needed to mine 1M transactions using 4 different support levels.)
39 Future Work Reduction of the Inverted Matrix size. Reduction of the number of I/Os when building COFI-trees: the Inverted Matrix clusters frequent items at the bottom, so traversing one transaction may touch more than one page => cluster the items of the same transactions on the same page. Update of the matrix when transactions are added or deleted.
40 Example Two transactions: (A, B, C, D, E) and (A, E, C, H, G). Ordering both transactions according to the item frequencies gives (D, B, C, E, A) and (G, H, C, E, A). Both transactions share the same suffix C, E, A, so they can be clustered together.
41 Conclusions A new scalable algorithm is proposed. It uses the disk to store the transactions in a special layout, the Inverted Matrix, and uses memory to interactively mine small structures called COFI-trees. Experiments show the algorithm can mine very large transactional databases with a very large number of unique items. Useful in a repetitive and interactive setting.
42 References Mohammad El-Hajj, Osmar R. Zaïane. Inverted Matrix: Efficient Discovery of Frequent Items in Large Datasets in the Context of Interactive Mining. SIGKDD 2003. Mohammad El-Hajj, Osmar R. Zaïane. Non-Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations. DaWaK 2003.
43 Thank You!