Published by Grant Morris. Modified over 9 years ago.
1
Predictive Modeling with Heterogeneous Sources
Xiaoxiao Shi (1), Qi Liu (2), Wei Fan (3), Qiang Yang (4), Philip S. Yu (1)
(1) University of Illinois at Chicago
(2) Tongji University, China
(3) IBM T. J. Watson Research Center
(4) Hong Kong University of Science and Technology
2
Why learning with heterogeneous sources?
Standard Supervised Learning
– Training (labeled): New York Times
– Test (unlabeled): New York Times
– Classifier result: 85.5%
3
Why heterogeneous sources?
In Reality…
– Training (labeled): New York Times. Labeled data are insufficient!
– Test (unlabeled): New York Times
– Result: 47.3%
How to improve the performance?
4
Why heterogeneous sources?
– Labeled data from other sources: Reuters
– Target domain test (unlabeled): New York Times
– Result: 82.6% (the slide also shows 47.3%)
1. Different distributions
2. Different outputs
3. Different feature spaces
5
Real world examples
Social Network:
– Can various bookmarking systems help predict social tags for a new system, given that their outputs (social tags) and data (documents) are different?
[Figure labels: Wikipedia, ODP, Backflip, Blink, ……, ?]
6
Real world examples
Applied Sociology:
– Can the suburban housing price census data help predict the downtown housing prices?

#rooms  #bathrooms  #windows  price
5       2           12        XXX
6       3           11        XXX

#rooms  #bathrooms  #windows  price
2       1           4         XXXXX
4       2           5         XXXXX
7
Other examples
Bioinformatics
– Previous years’ flu data → new swine flu
– Drug efficacy data against breast cancer → drug data against lung cancer
– ……
Intrusion detection
– Existing types of intrusions → unknown types of intrusions
Sentiment analysis
– Review from SDM → Review from KDD
8
Learning with Heterogeneous Sources
The paper mainly attacks two sub-problems:
– Heterogeneous data distributions: clustering-based KL divergence and a corresponding sampling technique.
– Heterogeneous outputs (for regression problems): unifying outputs via preserving similarity.
9
Learning with Heterogeneous Sources
General Framework
[Figure: source data and target data shown at two stages, "Unifying data distributions" and "Unifying outputs"]
10
Unifying Data Distributions
Basic idea:
– Combine the source and target data and perform clustering.
– Select the clusters in which the target and source data are similarly distributed, evaluated by KL divergence.
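The two steps above can be sketched in code. This is only an illustration under stated assumptions, not the authors' algorithm: it assumes 1-D data, a tiny deterministic k-means, smoothed histograms as the per-cluster distribution estimate, and a hypothetical threshold of 0.5. All function names and parameters are made up for the example.

```python
import math

def assign(v, centers):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda c: abs(v - centers[c]))

def kmeans_1d(values, k, iters=20):
    """Tiny 1-D k-means with evenly spaced initial centers (an assumption)."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[assign(v, centers)].append(v)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def histogram(values, lo, hi, bins, eps=1e-3):
    """Smoothed, normalized histogram so that the KL divergence stays finite."""
    counts = [eps] * bins
    width = (hi - lo) / bins or 1.0
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def select_source(source, target, k=2, bins=4, threshold=0.5):
    """Keep source points from clusters where source and target look alike."""
    centers = kmeans_1d(source + target, k)  # cluster the combined data
    kept = []
    for c in range(k):
        src = [v for v in source if assign(v, centers) == c]
        tgt = [v for v in target if assign(v, centers) == c]
        if not src or not tgt:
            continue  # cluster holds only one domain: nothing to compare
        lo, hi = min(src + tgt), max(src + tgt)
        kl = kl_divergence(histogram(tgt, lo, hi, bins),
                           histogram(src, lo, hi, bins))
        if kl <= threshold:
            kept.extend(src)
    return kept

# Toy data: the source points near 9 to 10 have no target counterpart.
source = [1.0, 1.2, 1.4, 1.6, 9.0, 9.5, 10.0]
target = [1.1, 1.3, 1.5, 1.7]
print(select_source(source, target))  # [1.0, 1.2, 1.4, 1.6]
```

In this toy run the source points far from any target data are dropped, and only the source points that share a cluster with similarly distributed target data are kept for training.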
11
An Example
[Figure labels: D, T, Combined Data, Adaptive Clustering]
12
Unifying Outputs
Basic idea:
– Generate initial outputs according to the regression model.
– For the instances similar in the original output space, make their new outputs closer.
13
[Figure: values 16, 37, 26.5, 21.25, 31.75; labels: Initial Outputs, Modification]
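The slide does not state the update rule behind these numbers. One reading that reproduces them, offered here only as an assumption: 16 and 37 are the initial outputs of two similar instances, 26.5 is their mean, and each output is moved halfway toward that mean. The function name and the `alpha` parameter below are hypothetical.

```python
def pull_together(outputs, alpha=0.5):
    """Move each output a fraction `alpha` of the way toward the group mean.

    Assumed rule for illustration only; the slide does not give the formula.
    """
    mean = sum(outputs) / len(outputs)
    return [y + alpha * (mean - y) for y in outputs]

initial = [16.0, 37.0]           # initial outputs of two similar instances
print(sum(initial) / 2)          # 26.5
print(pull_together(initial))    # [21.25, 31.75]
```

With `alpha=0` the outputs stay unchanged and with `alpha=1` they collapse to the mean, so the parameter controls how strongly similar instances are pulled together.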
14
Experiment
Bioinformatics data set:
15
Experiment
16
Experiment
Applied sociology data set:
17
Experiment
18
Conclusions
Problem: Learning with Heterogeneous Sources
– Heterogeneous data distributions
– Heterogeneous outputs
Solution:
– Clustering-based KL divergence helps perform sampling.
– Similarity-preserving output generation helps unify outputs.
19
Thanks!