Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure
How to find good features from semi-structured raw data for classification
Feature Construction
Most data mining and machine learning models assume structured data of the form (x1, x2, ..., xk) -> y, where the xi are independent variables and y is the dependent variable.
y drawn from a discrete set: classification
y drawn from a continuous variable: regression
When the feature vectors are good, accuracy differs little among learners.
Question: where do good features come from?
Frequent Pattern-Based Feature Extraction
Some data does not come as pre-defined feature vectors:
Transactions
Biological sequences
Graph databases
Frequent patterns are good candidates for discriminative features.
So, how do we mine them?
FP: Sub-graph
[Figure: a discovered sub-graph pattern in NSC compounds, e.g. NSC 4960]
(example borrowed from George Karypis's presentation)
Frequent Pattern Feature Vector Representation
[Figure: each data example is represented as a vector over patterns P1, P2, P3 and passed to any classifier you can name (NN, DT, SVM, LR); the illustration is a decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75 into setosa, versicolor, virginica]
Mining these predictive features is an NP-hard problem.
Even 100 examples can yield a huge number of patterns.
Most are useless.
Example
192 examples
12% support (at least 12% of the examples contain the pattern): 8,600 patterns returned by itemset mining. 192 vs 8,600?
4% support: 92,000 patterns. 192 vs 92,000??
Most patterns have no predictive power and cannot be used to construct features.
Our algorithm finds only 20 highly predictive patterns, which can construct a decision tree with about 90% accuracy.
Data in bad feature space
Discriminative patterns
A non-linear combination of single feature(s)
Increase the expressive and discriminative power of the feature space
An example
[Table with columns X, Y, C and a plot over axes x and y]
Data is not linearly separable in (x, y)
New Feature Space
Solving the problem: map the data to a different space (mine & transform)
[Tables: original columns X, Y, C; transformed columns X, Y, F, C with F: x=0, y=0; plot over axes x, y, F]
Data is linearly separable in (x, y, F)
ItemSet: F: x=0, y=0
Association rule: F: x=0 -> y=0
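The effect of adding the pattern feature F can be checked on a toy example. The sketch below is only an illustration under stated assumptions: the slide's own X, Y, C table is not available, so an XOR-style dataset is assumed, and the helper names (`pattern_f`, `separates`) and the hand-picked hyperplane are hypothetical. It searches a small grid of linear rules in (x, y), which is a sanity check rather than a proof, and then tests one linear rule in (x, y, F).

```python
from itertools import product

# Assumed XOR-style toy data: ((x, y), class). Not the slide's original table.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def pattern_f(x, y):
    """Binary feature for the itemset pattern F: x=0, y=0."""
    return 1 if (x == 0 and y == 0) else 0


def separates(weights, bias, points):
    """True if the rule 'w.v + b > 0 means class 1' fits every point."""
    for v, label in points:
        score = sum(w * vi for w, vi in zip(weights, v)) + bias
        if (score > 0) != (label == 1):
            return False
    return True


# Candidate weights and biases: -3.0 to 3.0 in steps of 0.5.
grid = [i / 2 for i in range(-6, 7)]

# Look for any separating line in the original (x, y) space.
found_2d = any(
    separates((wx, wy), b, data) for wx, wy, b in product(grid, repeat=3)
)

# Map the same data into (x, y, F) and test one hand-picked hyperplane.
data_3d = [((x, y, pattern_f(x, y)), c) for (x, y), c in data]
found_3d = separates((-1, -1, -2), 1.5, data_3d)

print("separating line found in (x, y):", found_2d)
print("hyperplane separates (x, y, F):", found_3d)
```

The single extra column F is enough here because it isolates the one point, (0, 0), that the two original features cannot tell apart from (1, 1) with a linear rule.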
Computational Issues
A pattern is measured by its frequency or support, e.g. frequent subgraphs with sup >= 10%, meaning at least 10% of the examples contain the pattern.
Ordered enumeration: patterns with sup = 10% cannot be enumerated without first enumerating all patterns with sup > 10%.
NP-hard problem: the number of patterns easily becomes enormous for a realistic problem.
Most patterns are non-discriminative.
Low-support patterns can have high discriminative power. Bad!
Random sampling does not work since it is not exhaustive.
Most patterns are useless, so randomly sampling patterns (or blindly enumerating without considering frequency) is useless.
Small number of examples:
If only a subset of the vocabulary is used, the search is incomplete.
If the complete vocabulary is used, it won't help much and introduces a sample selection bias problem, in particular missing low-support but high-info-gain patterns.
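To make the support threshold concrete, here is a minimal sketch that counts frequent itemsets by brute force on a made-up transaction set. The data, the function names (`support`, `frequent_itemsets`) and the exhaustive enumeration are assumptions for illustration; a real miner would prune the search rather than test every subset.

```python
from itertools import combinations

# Hypothetical toy transactions, not taken from the slides.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c", "d"},
    {"a", "b", "c", "d"},
]


def support(itemset, rows):
    """Fraction of rows that contain every item of the itemset."""
    return sum(1 for t in rows if itemset <= t) / len(rows)


def frequent_itemsets(rows, min_sup):
    """Brute-force enumeration of all itemsets with support >= min_sup."""
    items = sorted(set().union(*rows))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            s = support(pattern, rows)
            if s >= min_sup:
                result[pattern] = s
    return result


# Lowering the threshold lets more and more patterns through.
for min_sup in (0.5, 0.3, 0.1):
    print(min_sup, len(frequent_itemsets(transactions, min_sup)))
```

On these five transactions the number of returned itemsets grows from 6 to 11 to 15 as the threshold drops from 0.5 to 0.3 to 0.1, which is the same effect the slides describe at a much larger scale.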
Conventional Procedure: Feature Construction and Selection (Two-Step Batch Method)
1. Mine frequent patterns (> sup);
2. Select the most discriminative patterns;
3. Represent the data in the feature space using such patterns;
4. Build classification models.
[Figure: DataSet -(mine)-> Frequent Patterns -(select)-> Mined Discriminative Patterns (F1, F2, F4) -(represent)-> one feature vector per data example -> any classifier you can name (NN, DT, SVM, LR); the illustration is a decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75 into setosa, versicolor, virginica]
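The first three steps of the two-step batch method can be sketched as follows. This is an illustrative toy, not the code behind the slides: the data is made up, the miner is brute force, information gain is the standard entropy-based definition, and all function names (`mine`, `info_gain`, `two_step`) are hypothetical. Step 4 would hand the resulting 0/1 matrix to any learner.

```python
from itertools import combinations
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * log2(p)
    return h


def info_gain(pattern, rows, labels):
    """Information gain of splitting on 'row contains pattern'."""
    has = [y for t, y in zip(rows, labels) if pattern <= t]
    lacks = [y for t, y in zip(rows, labels) if not pattern <= t]
    n = len(labels)
    return (entropy(labels)
            - len(has) / n * entropy(has)
            - len(lacks) / n * entropy(lacks))


def mine(rows, min_sup):
    """Step 1: brute-force mining of itemsets with support >= min_sup."""
    items = sorted(set().union(*rows))
    found = []
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            p = frozenset(combo)
            if sum(p <= t for t in rows) / len(rows) >= min_sup:
                found.append(p)
    return found


def two_step(rows, labels, min_sup, k):
    candidates = mine(rows, min_sup)
    # Step 2: keep the k patterns with the highest information gain.
    ranked = sorted(candidates,
                    key=lambda p: info_gain(p, rows, labels), reverse=True)
    selected = ranked[:k]
    # Step 3: one binary feature per selected pattern.
    features = [[int(p <= t) for p in selected] for t in rows]
    return selected, features


# Hypothetical data: class 1 exactly when both "a" and "b" are present.
rows = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"c"}, {"a"}]
labels = [1, 1, 0, 0, 0, 0]

selected, features = two_step(rows, labels, min_sup=0.3, k=1)
coarse_selected, _ = two_step(rows, labels, min_sup=0.5, k=1)
print(selected, features)
print(coarse_selected)
```

With a support threshold of 0.3 the pattern {a, b} is mined and selected, and its feature column matches the class labels. With a threshold of 0.5 the same pattern is never generated, so the selection step cannot recover it.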
Two Problems: Mine step
Combinatorial explosion
[Figure: DataSet -(mine)-> Frequent Patterns]
1. Exponential explosion
2. Patterns are not considered if the minimum support isn't small enough
Two Problems: Select step
Issue of discriminative power
[Figure: Frequent Patterns -(select)-> Mined Discriminative Patterns]
3. InfoGain is evaluated against the complete dataset, NOT on subsets of examples
4. Correlation among patterns is not directly evaluated on their joint predictability
Direct Mining & Selection via Model-based Search Tree
Basic Flow
Divide-and-conquer based frequent pattern mining; the tree is both a feature miner and a classifier.
[Figure: search tree over the dataset. Nodes 1, 2 and 5 are labeled "Mine & Select P: 20%"; each such node splits on the most discriminative F based on IG into Y and N branches leading to nodes 3, 4, 6, 7, ...; expansion stops at leaves marked "+" or "Few Data"]
Global support: 10*20%/10000 = 0.02%
Result: mined discriminative patterns, a compact set of highly discriminative patterns
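The basic flow can be sketched as a short recursion. This is a simplified reading of the slide, not the authors' implementation: the brute-force itemset miner stands in for whatever frequent pattern miner is plugged in, and the names `build_mbt`, `min_node_size` and `global_support`, as well as the toy data, are hypothetical. Each node mines patterns at a local support threshold on its own examples, splits on the pattern with the highest information gain, and recurses on both branches.

```python
from itertools import combinations
from math import log2


def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))


def info_gain(pattern, rows, labels):
    has = [y for t, y in zip(rows, labels) if pattern <= t]
    lacks = [y for t, y in zip(rows, labels) if not pattern <= t]
    n = len(labels)
    return (entropy(labels) - len(has) / n * entropy(has)
            - len(lacks) / n * entropy(lacks))


def mine(rows, min_sup):
    """Stand-in miner: brute-force itemsets with local support >= min_sup."""
    items = sorted(set().union(*rows))
    return [frozenset(c)
            for size in range(1, len(items) + 1)
            for c in combinations(items, size)
            if sum(frozenset(c) <= t for t in rows) / len(rows) >= min_sup]


def build_mbt(rows, labels, local_sup=0.2, min_node_size=2, found=None):
    """Return the patterns chosen at the internal nodes, in visiting order."""
    if found is None:
        found = []
    # Leaves: pure nodes, or nodes with too few examples.
    if len(set(labels)) <= 1 or len(rows) < min_node_size:
        return found
    candidates = mine(rows, local_sup)
    if not candidates:
        return found
    best = max(candidates, key=lambda p: info_gain(p, rows, labels))
    if info_gain(best, rows, labels) <= 1e-12:
        return found
    found.append(best)
    yes = [(t, y) for t, y in zip(rows, labels) if best <= t]
    no = [(t, y) for t, y in zip(rows, labels) if not best <= t]
    for part in (yes, no):
        if part:
            build_mbt([t for t, _ in part], [y for _, y in part],
                      local_sup, min_node_size, found)
    return found


def global_support(node_size, local_sup, total_size):
    """Support relative to the whole dataset for a pattern mined in a node."""
    return node_size * local_sup / total_size


# Hypothetical data: "c" occurs in only 2 of 8 examples.
rows = [{"a", "c"}, {"a", "c"}, {"a", "d"}, {"b"}, {"b"}, {"b"},
        {"b", "d"}, {"b", "d"}]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

patterns = build_mbt(rows, labels, local_sup=0.5)
print(patterns)
print(global_support(10, 0.20, 10000))
```

In this toy run the root can only see {b} at 50% support, yet the node below it still finds {c}, whose support over the whole dataset is 25%. The last line reproduces the slide's arithmetic: a node holding 10 of 10,000 examples mined at 20% local support corresponds to a global support of 0.02%.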
Analyses (I)
1. Scalability (Theorem 1)
Upper bound
Scale-down ratio to obtain extremely low support patterns
2. Bound on number of returned features (Theorem 2)
Analyses (II)
3. Subspace is important for discriminative patterns
C1 and C0: number of examples belonging to class 1 and class 0
P1: number of examples in C1 that contain a pattern α
P0: number of examples in C0 that contain the same pattern α
Original set: no information gain if the pattern is equally frequent in both classes, i.e. P1/C1 = P0/C0
Subsets could still have info gain
4. Non-overfitting
5. Optimality under exhaustive search
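The subspace argument can be worked through with small numbers. The sketch below uses the standard entropy-based information gain written in terms of the slide's counts C1, C0, P1, P0; the specific counts and the function names (`entropy2`, `info_gain`) are made up for illustration.

```python
from math import log2


def entropy2(pos, neg):
    """Entropy (bits) of a two-class set with the given class counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for k in (pos, neg):
        if k:
            p = k / total
            h -= p * log2(p)
    return h


def info_gain(c1, c0, p1, p0):
    """Information gain of a pattern contained in p1 of c1 class-1 examples
    and p0 of c0 class-0 examples."""
    n = c1 + c0
    with_pattern = p1 + p0
    without_pattern = n - with_pattern
    return (entropy2(c1, c0)
            - with_pattern / n * entropy2(p1, p0)
            - without_pattern / n * entropy2(c1 - p1, c0 - p0))


# Whole set: the pattern covers half of each class, so it tells us nothing.
whole = info_gain(c1=4, c0=4, p1=2, p0=2)

# The same 8 examples divided into two subsets of 4.
subset_a = info_gain(c1=2, c0=2, p1=2, p0=0)  # pattern marks class 1 here
subset_b = info_gain(c1=2, c0=2, p1=0, p0=2)  # pattern marks class 0 here

print(whole, subset_a, subset_b)
```

The counts of the two subsets add up to those of the whole set, yet the pattern is uninformative on the whole set and perfectly informative inside each subset. A selection step that only scores patterns against the complete dataset would discard it.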
Experimental Studies: Itemset Mining (I)
Scalability Comparison
[Table: Datasets (Adult, Chess, Hypo, Sick, Sonar) | #Pat using MbT sup | Ratio (MbT #Pat / #Pat using MbT sup); the Chess row reads "+" and "~0%"]
Experimental Studies: Itemset Mining (II)
Accuracy of Mined Itemsets
4 wins, 1 loss, with a much smaller number of patterns
Experimental Studies: Itemset Mining (III)
Convergence
Experimental Studies: Graph Mining (I)
9 NCI anti-cancer screen datasets
The PubChem Project. URL: pubchem.ncbi.nlm.nih.gov.
Active (positive) class: around 1% - 8.3%
2 AIDS anti-viral screen datasets
URL:
H1: CM+CA: 3.5%
H2: CA: 1%
Experimental Studies: Graph Mining (II)
Scalability
Experimental Studies: Graph Mining (III)
AUC and Accuracy
AUC: 11 wins
Accuracy: 10 wins, 1 loss
Experimental Studies: Graph Mining (IV)
AUC of MbT, DT
MbT vs. benchmarks: 7 wins, 4 losses
Summary
Model-based Search Tree
Integrated feature mining and construction
Dynamic support: can mine extremely small support patterns
Both a feature constructor and a classifier
Not limited to one type of frequent pattern: plug-and-play
Experiment Results
Itemset mining
Graph mining
Software and dataset available from: