Mining Optimal Decision Trees from Itemset Lattices (Siegfried Nijssen and Elisa Fromont, KDD'07). Presented by Xiaoxi Du.

Part Ⅰ: Itemset Lattices for Decision Tree Mining

Terminology
• I = {i₁, i₂, …, iₘ}: the set of all items.
• D = {T₁, T₂, …, Tₙ}: the database of transactions, where each Tₖ ⊆ I.
• TID-set: for an itemset I, t(I) ⊆ {1, 2, …, n} is the set of identifiers of the transactions that contain I.
• freq(I) = |t(I)|.
• support(I) = freq(I) / |D|.
• freq_c(I): the number of transactions of class c ∈ C that contain I.

Class Association Rule
• Associate with each itemset the class label for which its frequency is highest:
  I → c(I), where c(I) = argmax_{c′ ∈ C} freq_c′(I).
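A minimal Python sketch (not from the paper; the toy database and helper names such as tid_set are illustrative assumptions) showing how t(I), freq(I), support(I), freq_c(I) and the rule label c(I) can be computed:

    from collections import Counter

    # Toy transaction database: each entry is (set of items, class label).
    D = [({"A", "B"}, "pos"),
         ({"B", "C"}, "pos"),
         ({"A"},      "neg"),
         ({"A", "C"}, "neg")]

    def tid_set(I):
        """t(I): identifiers of the transactions that contain every item of I."""
        return {k for k, (T, _) in enumerate(D) if I <= T}

    def freq(I):
        """freq(I) = |t(I)|."""
        return len(tid_set(I))

    def support(I):
        """support(I) = freq(I) / |D|."""
        return freq(I) / len(D)

    def freq_c(I, c):
        """freq_c(I): number of transactions of class c that contain I."""
        return sum(1 for k in tid_set(I) if D[k][1] == c)

    def rule_label(I):
        """c(I) = argmax_{c'} freq_c'(I): the label of the class association rule I -> c(I)."""
        counts = Counter(D[k][1] for k in tid_set(I))
        return counts.most_common(1)[0][0] if counts else None

    print(tid_set({"A"}), freq({"A"}), support({"A"}), rule_label({"A"}))
    # -> {0, 2, 3} 3 0.75 neg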

The Decision Tree
• Assume that all tests are boolean; nominal attributes are transformed into boolean attributes by mapping each possible value to a separate attribute.
• The input of a decision tree learner is a binary matrix B, where B_ij contains the value of attribute i in example j.
• Observation: transform the binary table B into transactional form D such that T_j = {i | B_ij = 1} ∪ {¬i | B_ij = 0}. Then the examples that are sorted down every node of a decision tree for B are characterized by an itemset of items occurring in D.
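A short sketch of this transformation, assuming a 0/1 matrix with named boolean attributes; representing a negative item as the string "¬" plus the attribute name is an illustrative choice, not something prescribed by the paper:

    # Binary matrix B: one row per example, one column per boolean attribute.
    attributes = ["B", "C"]
    B = [[1, 0],   # example 0: B=1, C=0
         [0, 1],   # example 1: B=0, C=1
         [0, 0]]   # example 2: B=0, C=0

    def to_transaction(row):
        """T_j = {i | B_ij = 1} ∪ {¬i | B_ij = 0}."""
        return {a if v == 1 else "¬" + a for a, v in zip(attributes, row)}

    D = [to_transaction(row) for row in B]
    print(D)   # e.g. [{'B', '¬C'}, {'C', '¬B'}, {'¬B', '¬C'}] (set order may vary)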

Example: the decision tree
[Figure: a decision tree with root test B; the left branch (B true) ends in a leaf labelled 1 on the slide, the right branch tests C, and its two leaves are reached by the itemsets {¬B, C} and {¬B, ¬C}.]
• paths(T) = { ∅, {B}, {¬B}, {¬B, C}, {¬B, ¬C} }
• leaves(T) = { {B}, {¬B, C}, {¬B, ¬C} }

Example: the decision tree
• This example includes negative items, such as ¬B, in the itemsets.
• The leaves of a decision tree correspond to class association rules, as leaves have associated classes.

Accuracy of a decision tree
• The accuracy of a decision tree is derived from the number of misclassified examples in its leaves:
  accuracy(T) = (|D| − e(T)) / |D|
  where e(T) = Σ_{I ∈ leaves(T)} e(I) and e(I) = freq(I) − freq_{c(I)}(I), i.e. the number of examples in a leaf that do not belong to its majority class c(I).
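A self-contained sketch of these formulas, with the tree represented only by the itemsets of its leaves; the toy data and helper names are illustrative:

    # Labelled toy database in transactional form (items include negative items).
    D = [({"B", "C"}, 1), ({"B", "¬C"}, 1), ({"¬B", "C"}, 0), ({"¬B", "¬C"}, 1)]
    CLASSES = {cls for _, cls in D}

    def freq(I):
        return sum(1 for T, _ in D if I <= T)

    def freq_c(I, c):
        return sum(1 for T, cls in D if I <= T and cls == c)

    def leaf_error(I):
        """e(I) = freq(I) - freq_{c(I)}(I): examples in the leaf outside its majority class."""
        return freq(I) - max(freq_c(I, c) for c in CLASSES)

    def tree_error(leaves):
        """e(T) = sum of e(I) over the leaf itemsets of T."""
        return sum(leaf_error(I) for I in leaves)

    def accuracy(leaves):
        """accuracy(T) = (|D| - e(T)) / |D|."""
        return (len(D) - tree_error(leaves)) / len(D)

    leaves = [{"B"}, {"¬B", "C"}, {"¬B", "¬C"}]   # leaves(T) of the example tree
    print(tree_error(leaves), accuracy(leaves))   # -> 0 1.0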

Part Ⅱ: Queries for Decision Trees
• Locally constrained decision trees
• Globally constrained decision trees
• A ranked set of globally constrained decision trees

Locally Constrained Decision Trees
• The constraints are imposed on the nodes (paths) of the decision trees:
  T₁ := {T | T ∈ DecisionTrees, ∀I ∈ paths(T): p(I)}
• T₁: the set of locally constrained decision trees.
• DecisionTrees: the set of all possible decision trees.
• p(I): a constraint on paths (simplest case: p(I) := freq(I) ≥ minfreq).

Locally Constrained Decision Trees
• Two properties of p(I):
  ◎ the evaluation of p(I) must be independent of the tree T of which I is part;
  ◎ p must be anti-monotonic. A predicate p on itemsets is called anti-monotonic iff p(I) ∧ (I′ ⊆ I) ⟹ p(I′).
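For instance, the minimum-frequency constraint p(I) := freq(I) ≥ minfreq is anti-monotonic, because t(I) ⊆ t(I′) whenever I′ ⊆ I. The toy check below (an illustration, not a proof; the data is made up) verifies the implication p(I) ∧ (I′ ⊆ I) ⟹ p(I′) exhaustively on a small database:

    from itertools import combinations

    D = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}]
    MINFREQ = 2
    ITEMS = {"A", "B", "C"}

    def freq(I):
        return sum(1 for T in D if I <= T)

    def p(I):
        return freq(I) >= MINFREQ

    # For every itemset I satisfying p, every subset I' must satisfy p as well.
    for r in range(len(ITEMS) + 1):
        for I in map(set, combinations(ITEMS, r)):
            if p(I):
                assert all(p(set(Iprime))
                           for k in range(r)
                           for Iprime in combinations(I, k)), "anti-monotonicity violated"
    print("freq(I) >= MINFREQ is anti-monotonic on this toy database")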

Locally Constrained Decision Trees TTwo types of locally constrained: ◎ coverage-based constraints such as frequency ◎ pattern-based constraints such as the size of an itemset

Globally Constrained Decision Trees
• The constraints refer to the tree as a whole.
• Optional part of the query:
  T₂ := {T | T ∈ T₁, q(T)}
• T₂: the set of globally constrained decision trees.
• q(T): a conjunction of constraints of the form f(T) ≤ θ, where θ is a threshold (e.g. maxsize, maxdepth, maxerror).

Globally Constrained Decision Trees
• where f(T) can be:
  ◎ e(T), to constrain the error of a tree on the training dataset;
  ◎ ex(T), to constrain the expected error on unseen examples, according to some estimation procedure;
  ◎ size(T), to constrain the number of nodes in a tree;
  ◎ depth(T), to constrain the length of the longest root-leaf path in a tree.

A ranked set of globally constrained decision trees
• Expresses a preference among the trees in the set T₂.
• Mandatory output: argmin_{T ∈ T₂} [r₁(T), r₂(T), …, rₙ(T)]
• r(T) = [r₁(T), r₂(T), …, rₙ(T)]: the ranking function, with each rᵢ ∈ {e, ex, size, depth}.
• If depth or size occurs before e or ex in the ranking, then q must contain an atom depth(T) ≤ maxdepth or size(T) ≤ maxsize, respectively.
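With trees encoded as nested tuples, the global measures e(T), size(T) and depth(T) are simple recursions, and the ranked query is a lexicographic argmin over tuples of these measures. A sketch under that assumed encoding (it is not the paper's data structure):

    # A tree is either ("leaf", class_label, error) or ("node", test, left, right).
    def leaf(c, err=0):
        return ("leaf", c, err)

    def node(i, left, right):
        return ("node", i, left, right)

    def size(T):
        return 1 if T[0] == "leaf" else 1 + size(T[2]) + size(T[3])

    def depth(T):
        return 0 if T[0] == "leaf" else 1 + max(depth(T[2]), depth(T[3]))

    def error(T):
        return T[2] if T[0] == "leaf" else error(T[2]) + error(T[3])

    # Globally constrained, ranked query: keep trees with size(T) <= maxsize,
    # then take the argmin of the ranking r(T) = [e(T), size(T)].
    candidates = [node("B", leaf(1), node("C", leaf(0), leaf(1, err=1))),
                  node("B", leaf(1), leaf(1, err=1))]
    maxsize = 5
    T2 = [T for T in candidates if size(T) <= maxsize]
    best = min(T2, key=lambda T: (error(T), size(T)))
    print(error(best), size(best), depth(best))   # -> 1 3 1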

Part Ⅲ: The DL8 Algorithm

• The main idea: the lattice of itemsets can be traversed bottom-up, and we can determine the best decision tree(s) for the transactions t(I) covered by an itemset I by combining, for every item i, the optimal trees of its children I ∪ {i} and I ∪ {¬i} in the lattice.
• The main property: if a tree is optimal, then the left-hand and right-hand branches of its root must also be optimal; this applies recursively to every subtree of the decision tree.

Algorithm 1 DL8(p, maxsize, maxdepth, maxerror, r)
1:  if maxsize ≠ ∞ then
2:    S ← {1, 2, …, maxsize}
3:  else
4:    S ← {∞}
5:  if maxdepth ≠ ∞ then
6:    D ← {1, 2, …, maxdepth}
7:  else
8:    D ← {∞}
9:  𝒯 ← DL8-RECURSIVE(∅)
10: if maxerror ≠ ∞ then
11:   𝒯 ← {T | T ∈ 𝒯, e(T) ≤ maxerror}
12: if 𝒯 = ∅ then
13:   return undefined
14: return argmin_{T ∈ 𝒯} r(T)
15:
16: procedure DL8-RECURSIVE(I)
17:   if DL8-RECURSIVE(I) was computed before then
18:     return stored result
19:   C ← {l(c(I))}
20:   if pure(I) then
21:     store C as the result for I and return C
22:   for all i ∈ I do
23:     if p(I ∪ {i}) = true and p(I ∪ {¬i}) = true then
24:       𝒯₁ ← DL8-RECURSIVE(I ∪ {i})
25:       𝒯₂ ← DL8-RECURSIVE(I ∪ {¬i})
26:       for all T₁ ∈ 𝒯₁, T₂ ∈ 𝒯₂ do
27:         C ← C ∪ {n(i, T₁, T₂)}
28:     end if
29:   𝒯 ← ∅
30:   for all d ∈ D, s ∈ S do
31:     L ← {T ∈ C | depth(T) ≤ d ∧ size(T) ≤ s}
32:     𝒯 ← 𝒯 ∪ {argmin_{T ∈ L} [r_k = e_{t(I)}(T), …, r_n(T)]}
33:   end for
34:   store 𝒯 as the result for I and return 𝒯
35: end procedure
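Below is a condensed, runnable Python sketch of the same recursion, restricted for brevity to the simplest setting described later in the talk: only a minimum-frequency constraint p, ranking r(T) = [e(T)], and no size, depth, or error limits. The CACHE dictionary plays the role of "store the result for I"; the toy data, the negative-item encoding and names such as dl8_recursive are illustrative assumptions, not the authors' code.

    # Toy data in transactional form: each transaction lists its positive items and the
    # negation of every attribute it lacks, paired with a class label.
    ATTRS = ["A", "B"]
    D = [({"A", "B"}, 1), ({"A", "¬B"}, 1), ({"¬A", "B"}, 0), ({"¬A", "¬B"}, 0)]
    MINFREQ = 1

    def tid_set(I):
        return [k for k, (T, _) in enumerate(D) if I <= T]

    def freq(I):
        return len(tid_set(I))

    def best_leaf(I):
        """Best single leaf l(c(I)) for t(I), returned as (error, tree)."""
        classes = [D[k][1] for k in tid_set(I)]
        if not classes:
            return (0, ("leaf", None))
        majority = max(set(classes), key=classes.count)
        return (len(classes) - classes.count(majority), ("leaf", majority))

    def pure(I):
        return best_leaf(I)[0] == 0

    CACHE = {}

    def dl8_recursive(I):
        """Return (error, tree) of an optimal tree for the transactions t(I)."""
        I = frozenset(I)
        if I in CACHE:                                  # lines 17-18: reuse stored result
            return CACHE[I]
        best = best_leaf(I)                             # line 19: a single leaf is a candidate
        if not pure(I):                                 # lines 20-21: stop on pure itemsets
            for i in ATTRS:                             # line 22
                if i in I or "¬" + i in I:
                    continue
                pos, neg = I | {i}, I | {"¬" + i}
                if freq(pos) >= MINFREQ and freq(neg) >= MINFREQ:   # line 23: p(I∪{i}) and p(I∪{¬i})
                    e1, t1 = dl8_recursive(pos)                     # line 24
                    e2, t2 = dl8_recursive(neg)                     # line 25
                    cand = (e1 + e2, ("node", i, t1, t2))           # line 27: n(i, T1, T2)
                    best = min(best, cand, key=lambda x: x[0])
        CACHE[I] = best                                 # line 34: store the result for I
        return best

    print(dl8_recursive(frozenset()))
    # -> (0, ('node', 'A', ('leaf', 1), ('leaf', 0)))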

The DL8 Algorithm
• Parameters: DL8(p, maxsize, maxdepth, maxerror, r), where
  p: the local constraint;
  r: the ranking function;
  maxsize, maxdepth, maxerror: the global constraints.
  (Each global constraint is passed as a separate parameter; global constraints that are not specified are assumed to be ∞.)

The DL8 Algorithm
• Lines 1-8: the valid ranges of sizes and depths are computed here if a size or depth constraint was specified.
• Line 11: for each depth and size satisfying the constraints, DL8-RECURSIVE finds the most accurate tree possible. Some of these trees may still violate the error constraint and are removed from consideration here.
• Line 19: a candidate decision tree for classifying the examples in t(I) consists of a single leaf.

The DL8 Algorithm
• Line 20: if all examples in a set of transactions belong to the same class, continuing the recursion is not necessary; after all, any larger tree will not be more accurate than a leaf, and we require that size is used in the ranking. More sophisticated pruning is possible in some special cases.
• Line 23: this line uses the anti-monotonic property of the predicate p(I): an itemset that does not satisfy p cannot be part of a tree, nor can any of its supersets; therefore the search is not continued if p(I ∪ {i}) = false or p(I ∪ {¬i}) = false.

The DL8 Algorithm
• Lines 22-33: these lines make sure that every tree that should be part of the output 𝒯 is indeed returned. We can prove this by induction. Assume that for the set of transactions t(I), a tree T should be part of 𝒯 because it is the most accurate tree that is smaller than s and shallower than d for some s ∈ S and d ∈ D; assume T is not a leaf and contains test i in its root. Then T must have a left-hand branch T₁ and a right-hand branch T₂, which must be the most accurate trees that can be constructed for t(I ∪ {i}) and t(I ∪ {¬i}), respectively, under the depth and size constraints. We can inductively assume that such trees are found by DL8-RECURSIVE(I ∪ {i}) and DL8-RECURSIVE(I ∪ {¬i}), as size(T₁), size(T₂) ≤ maxsize and depth(T₁), depth(T₂) ≤ maxdepth. Consequently, T (or a tree with the same properties) must be among the trees obtained by combining the results of the two recursive calls in line 27.

The DL8 Algorithm
• Line 34: a key feature of DL8-RECURSIVE is that it stores every result that it computes. Consequently, DL8 never computes the optimal decision trees for an itemset more than once. Furthermore, we do not need to store entire decision trees with every itemset; it is sufficient to store the root and some statistics (error, and possibly size and depth); the left-hand and right-hand subtrees can be recovered from the results stored for the left-hand and right-hand itemsets if necessary. In particular, if maxdepth = ∞, maxsize = ∞, maxerror = ∞ and r(T) = [e(T)], DL8-RECURSIVE combines only two trees for each i ∈ I and returns the single most accurate tree in line 34.
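A sketch of that storage scheme: the cache maps each itemset to its optimal root test and error only, and a full tree is rebuilt on demand by following the child itemsets. The dictionary layout (error, best_test, leaf_class) and the helper names are illustrative assumptions:

    # CACHE[I] = (error, best_test, leaf_class): if best_test is None, the optimal
    # tree for t(I) is a single leaf labelled leaf_class; otherwise its subtrees are
    # the stored results for I ∪ {best_test} and I ∪ {¬best_test}.
    CACHE = {
        frozenset():       (0, "A", None),
        frozenset({"A"}):  (0, None, 1),
        frozenset({"¬A"}): (0, None, 0),
    }

    def rebuild(I=frozenset()):
        """Recover the full tree below itemset I from the per-itemset roots."""
        _err, test, cls = CACHE[I]
        if test is None:
            return ("leaf", cls)
        return ("node", test, rebuild(I | {test}), rebuild(I | {"¬" + test}))

    print(rebuild())   # -> ('node', 'A', ('leaf', 1), ('leaf', 0))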

The DL8 Algorithm
• The most important part of DL8 is its recursive search procedure.
• Functions used in the recursion:
  ◎ l(c): returns a tree consisting of a single leaf with class label c;
  ◎ n(i, T₁, T₂): returns a tree that contains test i in the root and has T₁ and T₂ as left-hand and right-hand branches;
  ◎ e_t(T): computes the error of tree T when only the transactions in TID-set t are considered;
  ◎ pure(I): blocks the recursion if all examples in t(I) belong to the same class.

The DL8 Algorithm
• As with most data mining algorithms, the most time-consuming operations are those that access the data. DL8 requires frequency counts for itemsets in lines 20, 23 and 32.

The DL8 Algorithm
• Four related strategies to obtain the frequency counts:
  ◎ the simple single-step approach;
  ◎ the FIM approach;
  ◎ the constrained FIM approach;
  ◎ the closure-based single-step approach.

The Simple Single-Step Approach
• DL8-SIMPLE
• The most straightforward approach.
• Once DL8-RECURSIVE is called for an itemset I, we obtain the frequencies of I in a scan over the data and store the result to avoid later recomputation.

The FIM Approach
• Apriori-Freq+DL8
• Every itemset that occurs in a tree must satisfy the local constraint p, so a frequent itemset miner can supply the required counts.
• Unfortunately, the frequent itemset mining approach may compute frequencies of itemsets that can never be part of a decision tree.

The Constrained FIM Approach
• DL8-SIMPLE stores an itemset I = {i₁, …, iₙ} only if there is an order [i_{k_1}, …, i_{k_n}] of its items such that for none of the proper prefixes I′ = [i_{k_1}, i_{k_2}, …, i_{k_m}] (m < n):
  ◎ the ¬pure(I′) predicate is false in line 20;
  ◎ the conjunction p(I′ ∪ {i_{k_{m+1}}}) ∧ p(I′ ∪ {¬i_{k_{m+1}}}) is false in line 23.
• ¬pure acts as a leaf constraint.

The principle of itemset relevancy
• Definition 1. Let p₁ be a local anti-monotonic tree constraint and p₂ be an anti-monotonic leaf constraint. Then the relevancy of I, denoted rel(I), is defined by:
  rel(I) = true,  if I = ∅ (Case 1)
  rel(I) = true,  if ∃i ∈ I such that rel(I − i) ∧ p₂(I − i) ∧ p₁(I) ∧ p₁((I − i) ∪ {¬i}) (Case 2)
  rel(I) = false, otherwise (Case 3)
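A direct recursive transcription of Definition 1, assuming p₁ is a minimum-frequency constraint and p₂ = ¬pure, with memoization over frozen itemsets; the toy data and helper names are illustrative:

    from functools import lru_cache

    D = [({"A", "B"}, 1), ({"A", "¬B"}, 1), ({"¬A", "B"}, 0), ({"¬A", "¬B"}, 0)]
    MINFREQ = 1

    def covered_classes(I):
        return [cls for T, cls in D if I <= T]

    def p1(I):
        """Local anti-monotonic tree constraint: freq(I) >= MINFREQ."""
        return len(covered_classes(I)) >= MINFREQ

    def p2(I):
        """Leaf constraint ¬pure(I): t(I) contains more than one class."""
        return len(set(covered_classes(I))) > 1

    def negate(i):
        return i[1:] if i.startswith("¬") else "¬" + i

    @lru_cache(maxsize=None)
    def rel(I):
        """rel(I) as in Definition 1; I is a frozenset of (possibly negated) items."""
        if not I:                                                    # Case 1
            return True
        return any(rel(I - {i}) and p2(I - {i})                      # Case 2
                   and p1(I) and p1((I - {i}) | {negate(i)})
                   for i in I)                                       # Case 3 otherwise: False

    print(rel(frozenset({"A"})), rel(frozenset({"A", "B"})))   # -> True True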

The principle of itemset relevancy
• Theorem 1. Let L₁ be the set of itemsets stored by DL8-SIMPLE, and let L₂ be the set of itemsets {I ⊆ I | rel(I) = true}. Then L₁ = L₂.
• Theorem 2. Itemset relevancy is an anti-monotonic property.

The Constrained FIM Approach
• So far we stored the optimal decision trees for every itemset separately. However, if the local constraint is only coverage-based, it is easy to see that for two itemsets I₁ and I₂ with t(I₁) = t(I₂), the results of DL8-RECURSIVE(I₁) and DL8-RECURSIVE(I₂) must be the same.

The Closure-Based Single-Step Approach
• DL8-CLOSED
• Closure: i(t) = ∩_{k ∈ t} Tₖ, where t is a TID-set; i(t(I)) is the closure of itemset I.
• An itemset I is closed iff I = i(t(I)).
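A small sketch of the closure operator, assuming the same transactional representation with explicit negative items; the toy data and helper names are illustrative:

    from functools import reduce

    D = [{"A", "B", "¬C"}, {"A", "B", "C"}, {"A", "¬B", "C"}]

    def t(I):
        """t(I): TID-set of itemset I."""
        return {k for k, T in enumerate(D) if I <= T}

    def i_of(tids):
        """i(t) = intersection of T_k over k in t: the items shared by all transactions in t."""
        return reduce(set.intersection, (D[k] for k in tids)) if tids else set()  # empty TID-set simplified to ∅ here

    def closure(I):
        return i_of(t(I))

    print(closure({"B"}))                       # -> {'A', 'B'} (set order may vary): {'B'} is not closed
    print(closure({"A", "B"}) == {"A", "B"})    # -> True: {'A', 'B'} is closed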

Thank You!