Knowledge Transfer via Multiple Model Local Structure Mapping. Jing Gao†, Wei Fan‡, Jing Jiang†, Jiawei Han†. †University of Illinois at Urbana-Champaign, ‡IBM T. J. Watson Research Center.

Presentation transcript:

Knowledge Transfer via Multiple Model Local Structure Mapping Jing Gao†, Wei Fan‡, Jing Jiang†, Jiawei Han† †University of Illinois at Urbana-Champaign ‡IBM T. J. Watson Research Center KDD'08, Las Vegas, NV

2/49 Outline Introduction to transfer learning Related work –Sample selection bias –Semi-supervised learning –Multi-task learning –Ensemble methods Learning from one or multiple source domains –Locally weighted ensemble framework –Graph-based heuristic Experiments Conclusions

3/49 Standard Supervised Learning [Diagram: training (labeled) and test (unlabeled) both drawn from New York Times → Classifier, 85.5% accuracy] Ack. from Jing Jiang's slides

4/49 In Reality… [Diagram: labeled New York Times data not available! Training (labeled) on Reuters, test (unlabeled) on New York Times → Classifier, 64.1% accuracy] Ack. from Jing Jiang's slides

5/49 Domain Difference → Performance Drop. Ideal setting: train NYT, test New York Times → Classifier, 85.5%. Realistic setting: train Reuters, test New York Times → Classifier, 64.1%. Ack. from Jing Jiang's slides

6/49 Other Examples Spam filtering –Public collection → personal inboxes Intrusion detection –Existing types of intrusions → unknown types of intrusions Sentiment analysis –Expert review articles → blog review articles The aim –To design learning methods that are aware of the training and test domain difference Transfer learning –Adapt the classifiers learnt from the source domain to the new domain

7/49 Outline Introduction to transfer learning Related work –Sample selection bias –Semi-supervised learning –Multi-task learning –Ensemble methods Learning from one or multiple source domains –Locally weighted ensemble framework –Graph-based heuristic Experiments Conclusions

8/49 Sample Selection Bias (Covariate Shift) Motivating examples –Loan approval –Drug testing –Training set: customers participating in the trials –Test set: the whole population Problems –Training and test distributions differ in P(x), but not in P(y|x) –But the difference in P(x) still affects the learning performance

9/49 Sample Selection Bias (Covariate Shift) [Chart: accuracy with unbiased vs. biased training samples; biased: 92.7%] Ack. from Wei Fan's slides

10/49 Sample Selection Bias (Covariate Shift) Existing work –Reweight training examples according to the distribution difference and maximize the re-weighted likelihood –Estimate the probability of an observation being selected into the training set and use this probability to improve the model –Use P(x,y) to make predictions instead of using P(y|x)
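The reweighting idea on this slide can be sketched in a few lines. This is a minimal illustration, not the deck's method: it assumes 1-D Gaussian domains whose densities are known exactly (in practice the density ratio must be estimated from data), and shows that importance weights w(x) = P_test(x)/P_train(x) let training data estimate a test-domain quantity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Covariate shift: training and test inputs come from different P(x).
x_train = rng.normal(0.0, 1.0, 500)   # P_train(x) = N(0, 1)
x_test_mean = 1.5                     # P_test(x)  = N(1.5, 1)

def gauss(x, mu, sigma):
    # Gaussian density, used here in place of an estimated density ratio
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Importance weights w(x) = P_test(x) / P_train(x)
w = gauss(x_train, x_test_mean, 1.0) / gauss(x_train, 0.0, 1.0)

# Estimate E_test[x] from training data: unweighted vs. reweighted
unweighted = x_train.mean()
reweighted = np.average(x_train, weights=w)
print(unweighted, reweighted)  # reweighted should land near 1.5
```

The same weights would be plugged into a weighted likelihood (e.g. a `sample_weight` argument) when fitting the classifier.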

11/49 Semi-supervised Learning (Transductive Learning) [Diagram: labeled data + unlabeled data → model; in the transductive setting the unlabeled data is the test set] Applications and problems –Labeled examples are scarce but unlabeled data are abundant –Web page classification, review ratings prediction

12/49 Semi-supervised Learning (Transductive Learning) Existing work –Self-training Give labels to unlabeled data –Generative models Unlabeled data help get better estimates of the parameters –Transductive SVM Maximize the unlabeled data margin –Graph-based algorithms Construct a graph based on labeled and unlabeled data, propagate labels along the paths –Distance learning Map the data into a different feature space where they could be better separated
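The self-training bullet above can be made concrete with a toy sketch (an illustration under assumptions, not code from the talk): a 1-nearest-neighbour classifier on two well-separated 1-D clusters, seeded with one labeled point per class, repeatedly labels the nearest unlabeled point and treats that label as ground truth.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two separated classes; only two labeled points, the rest unlabeled.
x_lab = np.array([-2.0, 2.0])
y_lab = np.array([0, 1])
x_unl = np.concatenate([rng.normal(-2, 0.5, 30), rng.normal(2, 0.5, 30)])
y_true = np.array([0] * 30 + [1] * 30)

def nearest_label(x, xs, ys):
    # 1-nearest-neighbour prediction in 1-D
    return ys[np.argmin(np.abs(xs - x))]

# Self-training: always pick the unlabeled point closest to the current
# labeled set, label it with the current classifier, and add it.
xs, ys = x_lab.copy(), y_lab.copy()
remaining = list(range(len(x_unl)))
while remaining:
    dists = [np.min(np.abs(xs - x_unl[i])) for i in remaining]
    i = remaining.pop(int(np.argmin(dists)))
    lab = nearest_label(x_unl[i], xs, ys)
    xs = np.append(xs, x_unl[i])
    ys = np.append(ys, lab)

# How well did the self-assigned labels recover the truth?
acc = np.mean([nearest_label(x, xs, ys) == t for x, t in zip(x_unl, y_true)])
print(acc)
```

When the cluster assumption holds, as here, the propagated labels are essentially perfect; when it fails, errors cascade, which is exactly the risk self-training carries.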

13/49 Learning from Multiple Domains Multi-task learning –Learn several related tasks at the same time with shared representations –Single P(x) but multiple output variables Transfer learning –Two stage domain adaptation: select generalizable features from training domains and specific features from test domain

14/49 Ensemble Methods Improve over single models –Bayesian model averaging –Bagging, Boosting, Stacking –Our studies show their effectiveness in stream classification Model weights –Usually determined globally –Reflect the classification accuracy on the training set

15/49 Ensemble Methods Transfer learning –Generative models: Training and test data are generated from a mixture of different models Use a Dirichlet Process prior to couple the parameters of several models from the same parameterized family of distributions –Non-parametric models Boost the classifier with labeled examples which represent the true test distribution

16/49 Outline Introduction to transfer learning Related work –Sample selection bias –Semi-supervised learning –Multi-task learning Learning from one or multiple source domains –Locally weighted ensemble framework –Graph-based heuristic Experiments Conclusions

17/49 All Sources of Labeled Information [Diagram: multiple labeled training sources (New York Times, Reuters, Newsgroup, …) → Classifier → completely unlabeled test set]

18/49 A Synthetic Example [Figure: two training domains with conflicting concepts; the test domain partially overlaps each of them]

19/49 Goal [Diagram: multiple source domains → target domain] To unify knowledge that is consistent with the test domain from multiple source domains (models)

20/49 Summary of Contributions Transfer from one or multiple source domains –Target domain has no labeled examples Do not need to re-train –Rely on base models trained from each domain –The base models are not necessarily developed for transfer learning applications

21/49 Locally Weighted Ensemble [Diagram: base models M1, M2, …, Mk, trained on training sets 1, 2, …, k, combined on each test example x] x: feature value, y: class label

22/49 Modified Bayesian Model Averaging [Diagrams: M1, …, Mk applied to the test set under standard Bayesian model averaging vs. the version modified for transfer learning]

23/49 Global versus Local Weights [Table: per-example (x, y) predictions of M1 and M2, with global weights wg (0.3 for M1, 0.7 for M2) versus per-example local weights wl] Locally weighting scheme –Weight of each model is computed per example –Weights are determined according to models' performance on the test set, not the training set
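The per-example combination described above can be sketched directly (a toy illustration with made-up posteriors and weights, not the deck's numbers): the ensemble posterior at each test example is a weighted average of the base-model posteriors, with weights that change from example to example.

```python
import numpy as np

# Posteriors P(y|M_k, x) from k = 2 base models on 3 test examples
# (columns are the two classes).
p_m1 = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
p_m2 = np.array([[0.4, 0.6], [0.3, 0.7], [0.1, 0.9]])

# Local weights: row i holds (w_1(x_i), w_2(x_i)); each row sums to 1.
# A global scheme would use the same row everywhere.
w = np.array([[0.7, 0.3], [0.5, 0.5], [0.2, 0.8]])

# P(y|x) = sum_k w_k(x) * P(y|M_k, x), computed per example
p_ens = w[:, 0:1] * p_m1 + w[:, 1:2] * p_m2
print(p_ens)            # each row remains a proper distribution
print(p_ens.argmax(1))  # ensemble predictions per example
```

Because each row of weights sums to one, the combined rows are valid class distributions; the whole question of the framework is how to choose w per example, which the graph-based slides address.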

24/49 Synthetic Example Revisited [Figure: training domains with conflicting concepts and a partially overlapping test domain; M1 and M2 are each reliable only in part of the test domain]

25/49 Optimal Local Weights [Figure: test example x between clusters C1 and C2; the model consistent with the local cluster gets the higher weight] Optimal weights w = (w1, w2) –Solution to a regression problem: Hw = f
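The equation on this slide is garbled in the transcript; a plausible reconstruction (notation assumed, consistent with the per-example weighting on the previous slides) is:

```latex
% Ensemble posterior at a test example x, combining k base models:
P(y \mid x) \;=\; \sum_{i=1}^{k} w_i(x)\, P(y \mid M_i, x),
\qquad \sum_{i=1}^{k} w_i(x) = 1 .

% Stack the model posteriors at x as the columns of H and write f for the
% true conditional P(y \mid x); the optimal local weights then solve a
% constrained least-squares (regression) problem:
\mathbf{w}^{*}(x) \;=\;
\arg\min_{\mathbf{w}\,\ge\,0,\;\mathbf{1}^{\top}\mathbf{w}=1}
\bigl\lVert H\mathbf{w} - \mathbf{f} \bigr\rVert^{2} .
```

As the next slide notes, this problem cannot be solved directly because f is unknown on the unlabeled test set; the weights must be approximated.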

26/49 Approximate Optimal Weights Optimal weights –Impossible to get, since f is unknown! How to approximate the optimal weights –M should be assigned a higher weight at x if P(y|M,x) is closer to the true P(y|x) Have some labeled examples in the target domain –Use these examples to compute weights None of the examples in the target domain are labeled –Need to make some assumptions about the relationship between feature values and class labels

27/49 Clustering-Manifold Assumption Test examples that are closer in feature space are more likely to share the same class label.

28/49 Graph-based Heuristics Graph-based weights approximation –Map the structures of the models onto the test domain [Figure: clustering structure of the test set vs. neighborhood graphs of M1 and M2 → weight on x]

29/49 Graph-based Heuristics Local weights calculation –Weight of a model is proportional to the similarity between its neighborhood graph and the clustering structure around x [Figure: the model whose neighborhood graph better matches the clustering structure gets the higher weight]

30/49 Local Structure Based Adjustment Why is adjustment needed? –It is possible that no model's structure is similar to the clustering structure at x –This simply means that the training information conflicts with the true target distribution at x [Figure: clustering structure vs. neighborhood graphs of M1 and M2, with the mismatch marked as error]

31/49 Local Structure Based Adjustment How to adjust? –Check if the similarity between the models' structures and the clustering structure at x is below a threshold –If so, ignore the training information and propagate the labels of x's neighbors in the test set to x [Figure: clustering structure vs. neighborhood graphs of M1 and M2]
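The adjustment step can be sketched as follows (an illustration under assumptions; the threshold `delta` and the majority-vote fallback are one plausible reading of the slide): when every model's local similarity falls below the threshold, the models are ignored at x and the labels already assigned to x's test-set neighbors are propagated by majority vote.

```python
import numpy as np

def predict_with_adjustment(sims, model_votes, neighbor_labels, delta=0.5):
    """sims: per-model similarity to the clustering structure at x;
    model_votes: each model's predicted label at x;
    neighbor_labels: labels of x's neighbors in the test set."""
    if max(sims) < delta:
        # Training information conflicts with the target distribution at x:
        # fall back to the labels propagated from test-set neighbors.
        vals, counts = np.unique(neighbor_labels, return_counts=True)
        return vals[np.argmax(counts)]
    # Otherwise trust the model that best matches the local structure.
    return model_votes[int(np.argmax(sims))]

# All similarities below threshold -> neighbors decide
print(predict_with_adjustment([0.1, 0.2], [1, 0], [0, 0, 1]))  # -> 0
# One model matches the local structure -> its vote is used
print(predict_with_adjustment([0.9, 0.2], [1, 0], [0, 0, 1]))  # -> 1
```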

32/49 Verify the Assumption Need to check the validity of this assumption –Still, P(y|x) is unknown –How to choose the appropriate clustering algorithm Findings from real data sets –This property is usually determined by the nature of the task –Positive cases: Document categorization –Negative cases: Sentiment classification –Could validate this assumption on the training set

33/49 Algorithm: Check Assumption → Neighborhood Graph Construction → Model Weight Computation → Weight Adjustment

34/49 Outline Introduction to transfer learning Related work –Sample selection bias –Semi-supervised learning –Multi-task learning Learning from one or multiple source domains –Locally weighted ensemble framework –Graph-based heuristic Experiments Conclusions

35/49 Data Sets Different applications –Synthetic data sets –Spam filtering: public collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006) –Text classification: same top-level classification problems with different sub-fields in the training and test sets (Newsgroup, Reuters) –Intrusion detection data: different types of intrusions in training and test sets

36/49 Baseline Methods –One source domain: single models Winnow (WNN), Logistic Regression (LR), Support Vector Machine (SVM) Transductive SVM (TSVM) –Multiple source domains: SVM on each of the domains TSVM on each of the domains –Merge all source domains into one: ALL SVM, TSVM –Simple averaging ensemble: SMA –Locally weighted ensemble without local structure based adjustment: pLWE –Locally weighted ensemble: LWE Implementation –Classification: SNoW, BBR, LibSVM, SVMlight –Clustering: CLUTO package

37/49 Performance Measure Prediction Accuracy –0-1 loss: accuracy –Squared loss: mean squared error Area Under ROC Curve (AUC) –Tradeoff between true positive rate and false positive rate –Should be 1 ideally
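The three measures above are standard; minimal implementations on a toy binary task (illustrative numbers, not results from the paper) make the definitions concrete. AUC is computed via its rank interpretation: the probability that a random positive is scored above a random negative, ties counting half.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])             # hard labels -> accuracy
p_pos  = np.array([0.9, 0.2, 0.4, 0.8, 0.3])   # P(y=1|x)    -> MSE, AUC

# 0-1 loss -> accuracy
accuracy = np.mean(y_pred == y_true)

# Squared loss on the predicted positive-class probability
mse = np.mean((p_pos - y_true) ** 2)

def auc(y, s):
    # P(score of random positive > score of random negative), ties = 0.5;
    # equals the area under the ROC curve.
    pos, neg = s[y == 1], s[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

print(accuracy, mse, auc(y_true, p_pos))  # 0.8 0.108 1.0
```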

38/49 A Synthetic Example [Figure: two training domains with conflicting concepts; partially overlapping test domain]

39/49 Experiments on Synthetic Data

40/49 Spam Filtering Problems –Training set: public emails –Test set: personal emails from three users: U00, U01, U02 [Charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, LWE]

41/49 20 Newsgroup Tasks: C vs S, R vs T, R vs S, C vs T, C vs R, S vs T

42/49 [Charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, LWE on the 20 Newsgroup tasks]

43/49 Reuters Problems –Orgs vs People (O vs Pe) –Orgs vs Places (O vs Pl) –People vs Places (Pe vs Pl) [Charts: Accuracy and MSE of WNN, LR, SVM, TSVM, SMA, pLWE, LWE]

44/49 Intrusion Detection Problems (Normal vs Intrusions) –Normal vs R2L (1) –Normal vs Probing (2) –Normal vs DOS (3) Tasks –→ 3 (DOS) –→ 2 (Probing) –→ 1 (R2L)

45/49 Parameter Sensitivity Parameters –Selection threshold in local structure based adjustment –Number of clusters

46/49 Outline Introduction to transfer learning Related work –Sample selection bias –Semi-supervised learning –Multi-task learning Learning from one or multiple source domains –Locally weighted ensemble framework –Graph-based heuristic Experiments Conclusions

47/49 Conclusions Locally weighted ensemble framework –transfer useful knowledge from multiple source domains Graph-based heuristics to compute weights –Make the framework practical and effective

48/49 Feedback Transfer learning is a real problem –Spam filtering –Sentiment analysis Learning from multiple source domains is useful –Relax the assumption –Determine parameters

49/49 Thanks! Any questions? Office: 2119B