Cross Domain Distribution Adaptation via Kernel Mapping
Erheng Zhong†, Wei Fan‡, Jing Peng*, Kun Zhang#, Jiangtao Ren†, Deepak Turaga‡, Olivier Verscheure‡
†Sun Yat-Sen University  ‡IBM T. J. Watson Research Center  *Montclair State University  #Xavier University of Louisiana
Can We?
Standard Supervised Learning
training (labeled) → Classifier → test (unlabeled): 85.5%
In a traditional learning setting, such as standard supervised learning, labeled and unlabeled data are assumed to come from the same domain. For example, in text categorization, the task is to tell whether a New York Times article comes from the Business or the Science section. A classifier "learns" from a collection of business and science articles from the New York Times website, then classifies unseen business or science articles from the same website. In such a setting, high classification accuracy (e.g., 85.5%) can be achieved.
In Reality... Labeled Data Not Available!
training (labeled, Reuters) → Classifier → test (unlabeled, New York Times): 64.1%
However, in reality, things can be different. It may be too costly to label the large number of New York Times articles needed to train a good classifier. One may instead want to use already available text data to classify the New York Times articles at hand, for example business and science articles from Reuters. This is practical, since the Reuters corpus is a well-known text classification data set. However, due to domain differences, some terms in Reuters may not appear in the New York Times. Even when the two corpora share the same terms, the distributions of those terms can differ.
Domain Difference → Performance Drop
Ideal setting: train on NYT, test on NYT → 85.5%
Realistic setting: train on Reuters, test on NYT → 64.1%
Such differences can lead to a dramatic performance drop. Though the idea of using a related domain to help classification in another domain is appealing, it comes with several inherent difficulties.
Synthetic Example
"Two moons" and "two circles" have significantly different distributions.
Synthetic Example
1. If we use only the labeled examples (highlighted by squares) to construct a model (an SVM with a polynomial kernel in this case), most of the unlabeled data are misclassified. [left figure]
2. If we simply borrow the labeled data from "two moons" to help learn a model on "two circles", most of the unlabeled data are still misclassified. [right figure]
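This setup is easy to reproduce. Below is a minimal sketch; the sample sizes, noise levels, and number of labeled points are illustrative assumptions, not the paper's exact configuration:

```python
# Synthetic example: "two moons" as the source domain, "two circles" as the
# target domain. An SVM with a polynomial kernel is trained (a) on the few
# labeled target points alone and (b) on the target labels pooled with the
# raw source data; both perform poorly, motivating the adaptation steps.
import numpy as np
from sklearn.datasets import make_moons, make_circles
from sklearn.svm import SVC

Xs, ys = make_moons(n_samples=400, noise=0.05, random_state=0)    # source
Xt, yt = make_circles(n_samples=400, noise=0.05, factor=0.5,
                      random_state=0)                             # target

# A handful of labeled target examples (5 per class).
labeled = np.concatenate([np.where(yt == c)[0][:5] for c in (0, 1)])

clf = SVC(kernel="poly", degree=3).fit(Xt[labeled], yt[labeled])
print("target-only accuracy:", clf.score(Xt, yt))                 # [left figure]

clf = SVC(kernel="poly", degree=3).fit(np.vstack([Xt[labeled], Xs]),
                                       np.concatenate([yt[labeled], ys]))
print("pooled accuracy:", clf.score(Xt, yt))                      # [right figure]
```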
Main Challenge
Motivation: both the marginal and the conditional distributions of the target domain and the source domain can be significantly different in the original feature space!
How can we get rid of these differences?
Could we find other feature spaces? Could we remove the useless source-domain data?
Main Flow
Step 1: kernel mapping (via kernel discriminant analysis) → Step 2: cluster-based instance selection → Step 3: ensemble.
Kernel Mapping
Although the data are very different in the original feature space, if one can find a proper mapping to act as a bridge, then at least the marginal distributions can become reasonably close in the new feature space.
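A minimal sketch of such a mapping is below. The paper uses kernel discriminant analysis; since scikit-learn has no KDA estimator, KernelPCA serves here as a stand-in kernel mapping, which is an assumption on my part. The key point is that one mapping is fitted on the union of the two domains, so both land in the same feature space:

```python
# Kernel mapping as a "bridge": fit one kernel projection on source + target
# so both domains are embedded in the same new feature space, where their
# marginal distributions can be closer than in the original space.
# KernelPCA is a stand-in; the paper itself uses kernel discriminant analysis.
import numpy as np
from sklearn.decomposition import KernelPCA

def map_domains(X_source, X_target, n_components=2, gamma=1.0):
    kpca = KernelPCA(n_components=n_components, kernel="rbf", gamma=gamma)
    kpca.fit(np.vstack([X_source, X_target]))      # one shared mapping
    return kpca.transform(X_source), kpca.transform(X_target)

# Example: Zs, Zt = map_domains(Xs, Xt) with the synthetic data above.
```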
Instance Selection
1. Not all examples from "two moons" are useful for "two circles"; only those with a similar functional relation, i.e., similar conditional probabilities, can transfer knowledge across domains.
2. Each mapping has its own intrinsic bias, and it is difficult to decide which mapping is optimal. If we combine the predictions from different feature spaces, the result is expected to be better than using any single mapping.
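The selection rule below is a hedged reconstruction under the cluster assumption, not the paper's exact criterion: source points and labeled target points are clustered jointly, and a source example is kept only if its label agrees with the majority label of the labeled target points in its cluster. The helper name `select_source_instances` is hypothetical:

```python
# Cluster-based instance selection (sketch). Under the cluster assumption,
# points in one cluster should share a label, so source examples whose labels
# contradict the labeled target points in their cluster are dropped: their
# conditional probabilities P(y|x) do not appear to transfer.
import numpy as np
from sklearn.cluster import KMeans

def select_source_instances(Zs, ys, Zt_lab, yt_lab, n_clusters=4):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(np.vstack([Zs, Zt_lab]))                # cluster both domains jointly
    cs, ct = km.labels_[:len(Zs)], km.labels_[len(Zs):]
    keep = []
    for i, c in enumerate(cs):
        local = yt_lab[ct == c]                    # labeled target points nearby
        if len(local) and ys[i] == np.bincount(local).argmax():
            keep.append(i)                         # conditionals look consistent
    return np.asarray(keep, dtype=int)
```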
Ensemble
Train one classifier per kernel mapping and combine their predictions, so that no single biased mapping dominates the final decision.
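A sketch of the ensemble step, under the assumption that the mappings differ by kernel choice; the listed kernels and the majority vote are illustrative, not the paper's exact combination scheme:

```python
# Ensemble over several kernel mappings: each mapping gets its own k-NN
# classifier, and the per-mapping predictions on the target domain are
# combined by majority vote, so no single (possibly biased) mapping decides.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import KNeighborsClassifier

KERNELS = ({"kernel": "rbf", "gamma": 0.5},
           {"kernel": "rbf", "gamma": 2.0},
           {"kernel": "poly", "degree": 3})        # illustrative choices

def ensemble_predict(Xs, ys, Xt, kernels=KERNELS):
    votes = []
    for params in kernels:
        kpca = KernelPCA(n_components=2, **params).fit(np.vstack([Xs, Xt]))
        knn = KNeighborsClassifier(n_neighbors=5)
        knn.fit(kpca.transform(Xs), ys)
        votes.append(knn.predict(kpca.transform(Xt)))
    votes = np.asarray(votes)                      # (n_mappings, n_target)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```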
Properties
1. Kernel mapping can reduce the difference between the marginal distributions of the source and target domains [Theorem 2]; both domains are approximately Gaussian after the kernel mapping.
2. Cluster-based instance selection can pick out those source-domain data with similar conditional probabilities. [Cluster Assumption, Theorem 1]
3. The error rate of the proposed approach can be bounded. [Theorem 3]
4. The ensemble can further reduce the transfer risk. [Theorem 4]
Experiment – Data Sets
1. Reuters-21578: 21,578 Reuters news articles.
2. 20 Newsgroups: 20,000 newsgroup articles. The top categories (e.g., comp, rec) give the class labels, while different subcategories populate the two domains: e.g., target domain comp.sys and rec.sport, source domain comp.graphics and rec.auto.
3. SyskillWebert: HTML source of web pages plus one user's ratings of those pages, from 4 different subjects. Target domain: Sheep, Biomedical, Bands-recording; source domain: Goats.
All of them are high-dimensional (>1,000 features)!
Procedure: first fill up the "GAP" between domains, then use a k-NN classifier to do the classification.
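As an illustration of how such a split can be constructed, the sketch below builds a 20 Newsgroups source/target pair with scikit-learn. The subcategory names follow the standard corpus, but the exact splits used in the paper may differ:

```python
# Cross-domain split from 20 Newsgroups: top categories (comp, rec) give the
# class labels; different subcategories populate source vs. target domains.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

source_cats = ["comp.graphics", "rec.autos"]                      # source domain
target_cats = ["comp.sys.ibm.pc.hardware", "rec.sport.baseball"]  # target domain

src = fetch_20newsgroups(subset="all", categories=source_cats)
tgt = fetch_20newsgroups(subset="all", categories=target_cats)

# Shared high-dimensional vocabulary across the two domains.
vec = TfidfVectorizer(max_features=5000).fit(src.data + tgt.data)
Xs, Xt = vec.transform(src.data), vec.transform(tgt.data)
ys, yt = src.target, tgt.target    # 0 = comp.*, 1 = rec.* in both domains
```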
Experiment – Baseline Methods
1. Non-transfer single classifiers.
2. Transfer learning algorithm: TrAdaBoost.
Base classifiers: k-NN, SVM, Naive Bayes.
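For reference, the base classifiers named on the slide map directly onto scikit-learn estimators; the hyperparameters below are illustrative defaults, not the paper's settings, and TrAdaBoost has no scikit-learn implementation, so it is omitted:

```python
# Non-transfer baselines: each base classifier is trained on source-domain
# data only and evaluated directly on the target domain.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

base_classifiers = {
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(kernel="linear"),
    "NaiveBayes": MultinomialNB(),   # suited to sparse nonnegative text features
}

for name, clf in base_classifiers.items():
    clf.fit(Xs, ys)                  # Xs, ys, Xt, yt from the 20NG sketch above
    print(name, "accuracy on target:", clf.score(Xt, yt))
```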
Experiment – Overall Performance
KMapEnsemble → 24 wins, 3 losses! (data sets 1–9)
Conclusion
Domain transfer when the marginal and conditional distributions differ between two domains.
Flow:
Step 1: Kernel mapping – brings the two domains' marginal distributions closer.
Step 2: Cluster-based instance selection – makes the conditional distribution transferable.
Step 3: Ensemble – further reduces the transfer risk.
Code and data are available from the authors.