From Feature Construction, to Simple but Effective Modeling, to Domain Transfer
Wei Fan, IBM T.J. Watson



Feature Vector
Most data mining and machine learning models assume structured data of the form (x1, x2, ..., xk) -> y, where the xi are independent variables and y is the dependent variable.
- y drawn from a discrete set: classification
- y drawn from a continuous range: regression

Frequent Pattern-Based Feature Construction
- Some data does not come in pre-defined feature vectors: transactions, biological sequences, graph databases
- Frequent patterns are good candidates for discriminative features
- So, how to mine them?

FP: Sub-graph
[Figure: a discovered sub-graph pattern shared by several NSC anti-cancer compounds, e.g., NSC 4960; example borrowed from a presentation by George Karypis]

Computational Issues
- A pattern is measured by its frequency, or support, e.g., frequent subgraphs with support >= 10%
- One cannot enumerate the patterns at support = 10% without first enumerating all patterns with support > 10%
- Random sampling does not work, since it is not exhaustive
- An NP-hard problem

Conventional Procedure: Feature Construction Followed by Selection (Two-Step Batch Method)
1. Mine frequent patterns (support > sup)
2. Select the most discriminative patterns
3. Represent the data in the feature space using those patterns
4. Build classification models: any classifier you can name (NN, DT, SVM, LR)
[Figure: DataSet -> mine -> Frequent Patterns -> select -> Mined Discriminative Patterns -> represent data as features F1, F2, F4, ... -> decision tree on Petal.Length < 2.45 and Petal.Width < 1.75 separating setosa / versicolor / virginica]
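To make the two-step procedure concrete, here is a minimal sketch (not from the deck). It treats transactions and patterns as Python frozensets, so `p <= t` tests whether pattern p occurs in transaction t; `mine_frequent_patterns` and `train_classifier` are hypothetical stand-ins for any off-the-shelf miner and classifier.

```python
from collections import Counter
import math

def info_gain(labels, mask):
    """Information gain of splitting `labels` by the boolean `mask`."""
    def entropy(ys):
        n = len(ys)
        if n == 0:
            return 0.0
        return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())
    n = len(labels)
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    return entropy(labels) - len(left) / n * entropy(left) - len(right) / n * entropy(right)

def two_step(transactions, labels, min_sup, k, mine_frequent_patterns, train_classifier):
    # Step 1: mine all patterns with support >= min_sup
    # (the combinatorial explosion happens here).
    patterns = mine_frequent_patterns(transactions, min_sup)
    # Step 2: rank patterns by information gain over the WHOLE dataset
    # and keep the top k.
    ranked = sorted(patterns,
                    key=lambda p: info_gain(labels, [p <= t for t in transactions]),
                    reverse=True)[:k]
    # Step 3: binary feature vectors ("does pattern p occur in transaction t?").
    X = [[int(p <= t) for p in ranked] for t in transactions]
    # Step 4: hand the features to any classifier (NN, DT, SVM, LR, ...).
    return ranked, train_classifier(X, labels)
```

The two problems on the next slides both live inside this function: step 1 explodes, and step 2 scores every pattern against the complete dataset.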

Two Problems: the Mine Step (Combinatorial Explosion)
1. Exponential explosion of frequent patterns
2. Patterns are not considered at all if min-support isn't small enough

Two Problems: the Select Step (Issue of Discriminative Power)
3. Information gain is computed against the complete dataset, not on subsets of examples
4. Correlation among patterns is not directly evaluated via their joint predictability

Direct Mining & Selection via Model-based Search Tree
Basic flow (divide-and-conquer frequent-pattern mining):
1. At the root, mine frequent patterns on the full dataset at a local support of, e.g., P = 20%, and select the most discriminative feature F based on information gain.
2. Split the data on F and recurse on each child ("Mine & Select, P: 20%") with the same local support.
3. Nodes with few data become "+" / "-" leaves.
The tree is both the feature miner and the classifier, and it yields a compact set of highly discriminative patterns. Because the support is local, deep nodes reach an extremely small global support: a node holding 10 examples mined at 20% in a dataset of 10,000 corresponds to 10 * 20% / 10,000 = 0.02%.
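A hedged sketch of the mine-and-select recursion, reusing `info_gain` and the frozenset convention from the previous sketch; `min_node` and the stopping details are assumptions, not the paper's exact parameters.

```python
from collections import Counter

def build_mbt(transactions, labels, mine_frequent_patterns,
              local_sup=0.2, min_node=5):
    """One node of the model-based search tree (itemset version)."""
    if len(transactions) <= min_node or len(set(labels)) == 1:
        return {"leaf": Counter(labels)}          # few data: stop and predict
    # Mine only on THIS node's examples: a fixed 20% local support
    # corresponds to an ever smaller global support deeper in the tree.
    patterns = mine_frequent_patterns(transactions, local_sup)
    if not patterns:
        return {"leaf": Counter(labels)}
    # Select the single most discriminative pattern by information gain,
    # evaluated on this node's subset rather than the complete dataset.
    best = max(patterns,
               key=lambda p: info_gain(labels, [p <= t for t in transactions]))
    yes = [(t, y) for t, y in zip(transactions, labels) if best <= t]
    no = [(t, y) for t, y in zip(transactions, labels) if not best <= t]
    if not yes or not no:
        return {"leaf": Counter(labels)}
    yes_t, yes_y = zip(*yes)
    no_t, no_y = zip(*no)
    return {"pattern": best,                      # one selected feature per node
            "yes": build_mbt(yes_t, yes_y, mine_frequent_patterns, local_sup, min_node),
            "no": build_mbt(no_t, no_y, mine_frequent_patterns, local_sup, min_node)}
```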

Analyses (I)
1. Scalability of pattern enumeration: upper bound (Theorem 1) and scale-down ratio
2. Bound on the number of returned features

Analyses (II)
3. Subspace pattern selection (original set vs. subset)
4. Non-overfitting
5. Optimality under exhaustive search

Experimental Studies: Itemset Mining (I)
Scalability comparison
[Table: for the Adult, Chess, Hypo, Sick, and Sonar datasets, the number of patterns mined at MbT's support level, and the ratio MbT #Pat / #Pat using MbT sup (e.g., ~0% for Chess)]

Experimental Studies: Itemset Mining (II)
Accuracy of mined itemsets: 4 wins, 1 loss, but with a much smaller number of patterns.

Experimental Studies: Itemset Mining (III) Convergence

Experimental Studies: Graph Mining (I)
- 9 NCI anti-cancer screen datasets (The PubChem Project, pubchem.ncbi.nlm.nih.gov); active (positive) class: around 1% - 8.3%
- 2 AIDS anti-viral screen datasets; H1: CM+CA (3.5%), H2: CA (1%)

Experimental Studies: Graph Mining (II)
Scalability
[Figure: scalability results, with the model-based search tree diagram repeated from the earlier slide]

Experimental Studies: Graph Mining (III)
AUC and accuracy. AUC: 11 wins; accuracy: 10 wins, 1 loss.

Experimental Studies: Graph Mining (IV)
AUC of MbT vs. DT and of MbT vs. benchmarks: 7 wins, 4 losses.

Summary
- Model-based search tree: integrated feature mining and construction
  - Dynamic support: can mine patterns with extremely small support
  - Both a feature constructor and a classifier
  - Not limited to one type of frequent pattern: plug-and-play
- Experimental results on itemset mining and graph mining
- New: found a DNA sequence not previously reported but explainable in biology
- Code and datasets available for download

How to Train Models?
Even though the true distribution is unknown, one common approach still assumes the data is generated by some known function, and estimates the parameters inside the function via the training data (e.g., cross-validation on the training data).
[Figure: some unknown distribution feeding Models 1-6]
- List of methods: logistic regression, probit models, Naïve Bayes, kernel methods, linear regression, RBF, mixture models
- Once the structure is fixed, learning becomes optimization to minimize errors: quadratic loss, exponential loss, slack variables
- There will probably always be mistakes unless: (1) the chosen model indeed generates the distribution, and (2) the data is sufficient to estimate those parameters
- But what if you don't know which to choose, or use the wrong one?

How to Train Models II
Not quite sure of the exact function? Use a family of free-form functions given some preference criteria.
- List of methods: decision trees, RIPPER rule learner, CBA (association rules), clustering-based methods, ...
- Preference criterion: the simplest hypothesis that fits the data is the best
  - Heuristics: information gain, Gini index, Kearns-Mansour, etc.
  - Pruning: MDL pruning, reduced-error pruning, cost-based pruning
- Truth: none of the purity-check functions guarantees accuracy on unseen test data; they only try to build a smaller model
- There will probably always be mistakes unless: the training data is sufficiently large, and the free-form function/criteria are appropriate

Can Data Speak for Themselves?
- Make no assumption about the true model, neither parametric form nor free form
- Encode the data in some rather neutral representation: think of it like encoding numbers in a computer's binary representation; it can never represent some numbers exactly, but overall it is accurate enough
- Main challenge: avoid rote learning (do not remember every detail) and generalize; evenly representing numbers corresponds to evenly encoding the data

Potential Advantages
If the accuracy is quite good, then the method is quite automatic and easy to use: a no-brainer, so data mining can be everybody's tool.

Encoding Data for Major Problems
- Classification: given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ...), where the label is whether the transaction is a fraud or non-fraud. Label: a set of discrete values; the classifier predicts if a transaction is a fraud or non-fraud.
- Probability estimation: similar to the above setting, but estimate the probability that a transaction is a fraud. Difference: no truth is given, i.e., no true probability.
- Regression: given a set of valued data items, such as (zipcode, capital gain, education, ...), the value of interest is annual gross income. Target value: continuous.
- Several other ongoing problems

Encoding Data in Decision Trees
- Think of each tree as a way to encode the training data
- Why a tree? A decision tree records some common characteristics of the data, but not every piece of trivial detail
- Obviously, each tree encodes the data differently; subjective criteria that prefer some encodings over others are always ad hoc
- So do not prefer anything: just do it randomly. Minimize the difference with multiple encodings, and then average them.

Random Decision Tree to Encode Data (Classification, Regression, Probability Estimation)
- At each node, an unused feature is chosen randomly
  - A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node
  - A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen

Continued
We stop when one of the following happens:
- A node becomes too small (<= 3 examples)
- The total height of the tree exceeds some limit, such as the total number of features
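A minimal construction sketch under stated assumptions (not the authors' code): rows are dicts keyed by feature name, and the schema maps each feature to either a list of discrete values or a (lo, hi) range; splitting a discrete feature as a binary test on one random value is a simplification of branching on all its values.

```python
import random
from collections import Counter

def build_rdt(rows, labels, schema, used=frozenset(), depth=0):
    # Stopping rules from the slides: node too small, or height limit
    # (here: the total number of features) exceeded.
    if len(rows) <= 3 or depth >= len(schema):
        return {"leaf": Counter(labels)}          # leaf keeps the class counts
    candidates = [f for f in schema if f not in used]
    if not candidates:
        return {"leaf": Counter(labels)}
    f = random.choice(candidates)                 # random: no purity check at all
    spec = schema[f]
    if isinstance(spec, tuple):                   # continuous: fresh random threshold,
        lo, hi = spec                             # feature stays reusable on this path
        thr = random.uniform(lo, hi)
        test, child_used = (lambda r: r[f] < thr), used
        node = {"feature": f, "threshold": thr}
    else:                                         # discrete: random value test,
        v = random.choice(spec)                   # feature never reused on this path
        test, child_used = (lambda r: r[f] == v), used | {f}
        node = {"feature": f, "value": v}
    left = [(r, y) for r, y in zip(rows, labels) if test(r)]
    right = [(r, y) for r, y in zip(rows, labels) if not test(r)]
    if not left or not right:
        return {"leaf": Counter(labels)}
    node["left"] = build_rdt(*zip(*left), schema, child_used, depth + 1)
    node["right"] = build_rdt(*zip(*right), schema, child_used, depth + 1)
    return node
```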

Illustration of RDT
[Figure: features B1: {0,1}, B2: {0,1}, B3: continuous. B1 is chosen randomly at the root; B2 and B3 are chosen randomly further down; B3 is tested against random thresholds 0.3 and 0.6 on the same decision path]

Classification
[Figure: tree with splits Petal.Length < 2.45 and Petal.Width < 1.75; leaf class counts setosa 50/0/0, versicolor 0/49/5, virginica 0/1/45]
For the leaf with counts 0/49/5: P(setosa | x, θ) = 0, P(versicolor | x, θ) = 49/54, P(virginica | x, θ) = 5/54.

Regression
[Figure: the same splits, with leaf values setosa height = 10 in, versicolor height = 15 in, virginica height = 12 in]
Each leaf predicts the average value of all examples in that leaf node.

Prediction
Simply average over the multiple trees.
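In symbols (notation assumed here, consistent with the leaf estimates two slides back): with N random trees θ1, ..., θN,

```latex
P(y \mid \mathbf{x}) \;\approx\; \frac{1}{N} \sum_{i=1}^{N} P(y \mid \mathbf{x}, \theta_i)
```

For regression, the same average is taken over the leaf means f_{θi}(x) instead of the class proportions.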

Potential Advantages
- Training can be very efficient, particularly for very large datasets
- No cross-validation-based estimation of parameters, unlike some parametric methods
- Natural multi-class probability
- Natural multi-label classification and probability estimation
- Imposes very little about the structure of the model

Reasons
- The true distribution P(y|X) is never known (is it an elephant?)
- Each random tree is not a random guess of this P(y|X): its structure is random, but its node statistics are not
- Every random tree is consistent with the training data
- Each tree is quite strong, not weak: if the distribution is the same, each random tree by itself is a rather decent model

Expected Error Reduction
It is proven that for quadratic loss, such as
- probability estimation: (P(y|X) - P(y|X, θ))^2
- regression: (y - f(x))^2
the expected quadratic loss of RDT (and of any other model averaging) is less than that of any combined model chosen at random.
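To see the flavor of the claim, here is a standard convexity (Jensen's inequality) argument, offered as a gloss rather than the slide's own proof: for any fixed x and y,

```latex
\Big( P(y \mid \mathbf{x}) - \mathbb{E}_{\theta}\big[ P(y \mid \mathbf{x}, \theta) \big] \Big)^{2}
\;\le\;
\mathbb{E}_{\theta}\Big[ \big( P(y \mid \mathbf{x}) - P(y \mid \mathbf{x}, \theta) \big)^{2} \Big]
```

so the averaged model's expected quadratic loss is at most the average loss of a randomly chosen single model.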

Theorem Summary

Number of Trees
- Sampling theory: random decision trees can be thought of as samples from a large (infinite, when continuous features exist) population of trees
- Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance; in most cases, 10 trees are enough

Variance Reduction

Optimal Decision Boundary
(from Tony Liu's thesis, supervised by Kai Ming Ting)

RDT looks like the optimal boundary

Regression Decision Boundary (GUIDE)
Properties: broken and discontinuous; some points are far from the truth; some wrong ups and downs.

RDT Computed Function
Properties: smooth and continuous; close to the true function; all ups and downs caught.

Hidden Variable

Limitations of GUIDE
- Need to decide grouping variables and independent variables: a non-trivial task
- If all variables are categorical, GUIDE becomes a single CART regression tree
- Strong assumptions and greedy search can sometimes lead to very unexpected results

It grows like …

ICDM'08 Cup Crown Winner: Nuclear-Ban Monitoring
The RDT-based approach was the highest award winner.

Ozone Level Prediction (ICDM'06 Best Application Paper)
Daily summary maps of two datasets from the Texas Commission on Environmental Quality (TCEQ).

SVM: 1-hr criteria CV

AdaBoost: 1-hr criteria CV

SVM: 8-hr criteria CV

AdaBoost: 8-hr criteria CV

Other Applications
- Credit card fraud detection
- Late and default payment prediction
- Intrusion detection
- Semiconductor process control
- Trading anomaly detection

Conclusion
- Imposing a particular form of model may not be a good idea for training highly accurate models for general-purpose data mining; it may not even be efficient for some forms of models
- RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately
- When the physical truth is unknown, RDT is highly recommended
- Code and datasets are available for download

Standard Supervised Learning
[Figure: a classifier trained on labeled New York Times data and tested on unlabeled New York Times data reaches 85.5% accuracy]

In Reality...
[Figure: labeled New York Times data is not available, so the classifier is trained on labeled Reuters data and tested on unlabeled New York Times data, dropping to 64.1% accuracy]

Domain Difference -> Performance Drop
- Ideal setting: train on NYT, test on NYT: 85.5%
- Realistic setting: train on Reuters, test on NYT: 64.1%

A Synthetic Example
[Figure: training data from multiple domains with conflicting concepts; the test domain only partially overlaps with them]

Goal
To unify the knowledge from multiple source domains (models) that is consistent with the target domain.

Summary
- Transfer from one or multiple source domains; the target domain has no labeled examples
- No re-training needed: rely on base models trained on each domain
- The base models are not necessarily developed for transfer-learning applications

Locally Weighted Ensemble
[Figure: models M1, M2, ..., Mk, each trained on its own training set, are combined per test example x; x: feature values, y: class label]

Modified Bayesian Model Averaging
[Figure: standard Bayesian model averaging of M1, ..., Mk over the test set, modified for transfer learning]

Global versus Local Weights
- Locally weighted scheme: the weight of each model is computed per example
- Weights are determined according to the models' performance on the test set, not the training set
[Figure: a table of training examples with per-example local weights w_l alongside fixed global weights w_g, e.g., 0.3 for M1 and 0.7 for M2]
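In symbols (a hedged restatement; the notation w_i(x) is assumed): rather than one global weight per model, the ensemble output at each test example x is

```latex
P(y \mid \mathbf{x}) \;=\; \sum_{i=1}^{k} w_i(\mathbf{x}) \, P(y \mid M_i, \mathbf{x}),
\qquad \sum_{i=1}^{k} w_i(\mathbf{x}) = 1, \quad w_i(\mathbf{x}) \ge 0 .
```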

Synthetic Example Revisited
[Figure: the training domains with conflicting concepts and the partially overlapping test domain, now with models M1 and M2 overlaid]

Optimal Local Weights
[Figure: two classifiers C1 and C2 around a test example x; the one closer to the truth gets a higher weight]
The optimal weight vector (w1, w2) is the solution to a regression problem of the form H w = f.

Approximating the Optimal Weights
- The optimal weights are impossible to get, since the true f is unknown
- M should be assigned a higher weight at x if P(y|M, x) is closer to the true P(y|x)
- With some labeled examples in the target domain, those examples could be used to compute the weights
- When none of the examples in the target domain are labeled, we need to make some assumptions about the relationship between feature values and class labels

Clustering-Manifold Assumption Test examples that are closer in feature space are more likely to share the same class label.

Graph-based Heuristics
Graph-based weight approximation: map the structures of the models onto the test domain.
[Figure: the clustering structure of the test set compared with the neighborhood structures of M1 and M2 to derive the weight at x]

Graph-based Heuristics: Local Weight Calculation
The weight of a model at x is proportional to the similarity between its neighborhood graph and the clustering structure around x: the more similar, the higher the weight. See the sketch below.
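A hedged sketch of one way to realize this heuristic. The similarity measure below (a Jaccard-style overlap of two neighbor sets around x) is an assumption, as are the KMeans and k-NN choices; models are assumed to expose a scikit-learn-style `predict`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def local_weights(X_test, models, n_clusters=2, n_neighbors=10):
    """Per-example model weights from neighborhood-graph similarity."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_test)
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_test)
    _, idx = nn.kneighbors(X_test)                 # column 0 is x itself
    preds = [m.predict(X_test) for m in models]    # each model labels the test set
    W = np.zeros((len(X_test), len(models)))
    for i in range(len(X_test)):
        nb = idx[i, 1:]
        # Neighbors the clustering puts together with x ...
        cluster_nb = set(nb[clusters[nb] == clusters[i]])
        for j, p in enumerate(preds):
            # ... versus neighbors the model predicts into x's class.
            model_nb = set(nb[p[nb] == p[i]])
            union = cluster_nb | model_nb
            W[i, j] = len(cluster_nb & model_nb) / len(union) if union else 0.0
        if W[i].sum() == 0:
            W[i] = 1.0 / len(models)               # no structural evidence: uniform
        else:
            W[i] /= W[i].sum()                     # normalize weights per example
    return W
```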

Local Structure Based Adjustment
Why is adjustment needed? It is possible that no model's structure is similar to the clustering structure at x; this simply means that the training information conflicts with the true target distribution at x.
[Figure: the clustering structure disagreeing with both M1 and M2, leading to error]

Local Structure Based Adjustment
How to adjust? Check whether the model similarity at x is below a threshold; if so, ignore the training information and propagate the labels of neighbors in the test set to x.
[Figure: the clustering structure overriding M1 and M2 at x]

Verify the Assumption
- Need to check the validity of this assumption, yet P(y|x) is unknown; also, how to choose the appropriate clustering algorithm?
- Findings from real data sets: this property is usually determined by the nature of the task
  - Positive cases: document categorization
  - Negative cases: sentiment classification
- One could validate this assumption on the training set

Algorithm
1. Check the assumption
2. Neighborhood graph construction
3. Model weight computation
4. Weight adjustment
(A compact sketch of the flow follows.)
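A compact sketch tying the steps together, reusing `local_weights` from the sketch above. The threshold `tau` and the max-weight test are assumptions standing in for the slide's threshold check, and the label-propagation fallback is left as a placeholder; models are assumed to expose a scikit-learn-style `predict_proba`.

```python
def lwe_predict_proba(X_test, models, tau=0.5):
    """Steps 2-4 of the flow above; step 1 (checking the clustering
    assumption) is done offline on the training sets."""
    W = local_weights(X_test, models)                  # graph construction + weights
    probs = [m.predict_proba(X_test) for m in models]  # per-model P(y | M_i, x)
    out = []
    for i in range(len(X_test)):
        if W[i].max() < tau:
            # Weight adjustment: no model matches the local structure, so
            # fall back to propagating test-set neighbors' labels (omitted).
            out.append(None)
        else:
            out.append(sum(W[i, j] * probs[j][i] for j in range(len(models))))
    return out
```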

Data Sets
- Different applications, plus synthetic data sets
- Spam filtering: public collection -> personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
- Text classification: same top-level classification problems with different sub-fields in the training and test sets (Newsgroup, Reuters)
- Intrusion detection: different types of intrusions in the training and test sets

Baseline Methods
- One source domain, single models: Winnow (WNN), Logistic Regression (LR), Support Vector Machine (SVM), Transductive SVM (TSVM)
- Multiple source domains: SVM on each domain; TSVM on each domain
- Merging all source domains into one (ALL): SVM, TSVM
- Simple averaging ensemble: SMA
- Locally weighted ensemble without local structure based adjustment: pLWE
- Locally weighted ensemble: LWE
- Implementation packages: classification: SNoW, BBR, LibSVM, SVMlight; clustering: the CLUTO package

Performance Measures
- Prediction accuracy (0-1 loss)
- Mean squared error (squared loss)
- Area under the ROC curve (AUC): trade-off between true positive rate and false positive rate; should be 1 ideally

A Synthetic Example
[Figure: training data with conflicting concepts and the partially overlapping test domain, repeated from the earlier slide]

Experiments on Synthetic Data

Spam Filtering
Problems: training set: public emails; test set: personal emails from three users (U00, U01, U02).
[Charts: accuracy and MSE for WNN, LR, SVM, TSVM, SMA, pLWE, and LWE]

20 Newsgroup
Problems: C vs S, R vs T, R vs S, C vs T, C vs R, S vs T.

[Charts: accuracy and MSE for WNN, LR, SVM, TSVM, SMA, pLWE, and LWE on the 20 Newsgroup problems]

Reuters
Problems: Orgs vs People (O vs Pe), Orgs vs Places (O vs Pl), People vs Places (Pe vs Pl).
[Charts: accuracy and MSE for WNN, LR, SVM, TSVM, SMA, pLWE, and LWE]

Intrusion Detection
Problems (normal vs. intrusions): Normal vs R2L (1), Normal vs Probing (2), Normal vs DOS (3).
Tasks: train on two problems, test on the third: 1 + 2 -> 3 (DOS); 1 + 3 -> 2 (Probing); 2 + 3 -> 1 (R2L).

Conclusions
- The locally weighted ensemble framework transfers useful knowledge from multiple source domains
- Graph-based heuristics compute the weights, making the framework practical and effective
- Code and datasets available for download

More Information
Code, datasets, and papers are available for download.