Predictive Modeling with Heterogeneous Sources
Xiaoxiao Shi (1), Qi Liu (2), Wei Fan (3), Qiang Yang (4), Philip S. Yu (1)
(1) University of Illinois at Chicago
(2) Tongji University, China
(3) IBM T. J. Watson Research Center
(4) Hong Kong University of Science and Technology

1/18 Why learning with heterogeneous sources?
Standard supervised learning: a classifier is trained on labeled New York Times data and tested on unlabeled New York Times data. Result: 85.5%.

2/18 Why heterogeneous sources?
In reality, labeled data are insufficient! With too little labeled New York Times training data, the result on the unlabeled New York Times test data drops to 47.3%. How can the performance be improved?

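3/18 Why heterogeneous sources?
Labeled data from other sources (Reuters) are used for training; the target domain test data (unlabeled) are from the New York Times. Result: 82.6%, compared with 47.3%.
The sources can differ from the target in three ways:
1. Different distributions
2. Different outputs
3. Different feature spaces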

4/18 Real world examples
Social network:
– Can various bookmarking systems help predict social tags for a new system, given that their outputs (social tags) and data (documents) are different?
Wikipedia, ODP, Backflip, Blink, …… ?

5/18 Real world examples
Applied sociology:
– Can the suburban housing price census data help predict the downtown housing prices?

#rooms  #bathrooms  #windows  price
5       2           12        XXX
6       3           11        XXX

#rooms  #bathrooms  #windows  price
2       1           4         XXXXX
4       2           5         XXXXX

6/18 Other examples
Bioinformatics
– Previous years' flu data → new swine flu
– Drug efficacy data against breast cancer → drug data against lung cancer
– ……
Intrusion detection
– Existing types of intrusions → unknown types of intrusions
Sentiment analysis
– Review from SDM → review from KDD

7/18 Learning with Heterogeneous Sources
The paper mainly attacks two sub-problems:
– Heterogeneous data distributions: clustering-based KL divergence and a corresponding sampling technique
– Heterogeneous outputs (for regression problems): unifying outputs by preserving similarity

8/18 Learning with Heterogeneous Sources: General Framework
[Diagram: two stages, unifying data distributions and unifying outputs, each shown with source data and target data]

9/18 Unifying Data Distributions
Basic idea:
– Combine the source and target data and perform clustering.
– Select the clusters in which the target and source data are similarly distributed, as measured by KL divergence.
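The two steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's algorithm: the tiny 1-D k-means, the smoothing constant `eps`, the per-cluster log-ratio test, and the threshold `max_log_ratio` are all hypothetical choices made for the example. The slides do not give the exact form of the clustering-based KL divergence or of the sampling rule.

```python
import math
from collections import Counter

def kmeans_1d(values, k, iters=20):
    """Tiny deterministic 1-D k-means (a stand-in for any clustering method)."""
    ordered = sorted(values)
    # Spread the initial centers evenly over the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]

    def assign():
        return [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]

    for _ in range(iters):
        labels = assign()
        for c in range(k):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign()

def cluster_shares(labels, k, eps=1e-6):
    """Smoothed fraction of one data set that falls into each cluster."""
    counts = Counter(labels)
    total = len(labels) + k * eps
    return [(counts.get(c, 0) + eps) / total for c in range(k)]

def select_similar_clusters(source, target, k=3, max_log_ratio=1.0):
    # Step 1: combine source and target data and cluster them together.
    labels = kmeans_1d(list(source) + list(target), k)
    src_labels, tgt_labels = labels[:len(source)], labels[len(source):]
    p_src = cluster_shares(src_labels, k)
    p_tgt = cluster_shares(tgt_labels, k)
    # Step 2: KL(target || source) over cluster memberships; each cluster
    # contributes p_tgt * log(p_tgt / p_src).
    log_ratios = [math.log(t / s) for t, s in zip(p_tgt, p_src)]
    kl = sum(t * r for t, r in zip(p_tgt, log_ratios))
    # Assumed selection rule: keep clusters whose shares are close.
    keep = [c for c in range(k) if abs(log_ratios[c]) <= max_log_ratio]
    kept_source = [x for x, l in zip(source, src_labels) if l in keep]
    return kl, keep, kept_source

# Made-up data: the source has a group near 9 that the target lacks.
source = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 8.8, 8.9, 9.0, 9.1, 9.2, 9.3]
target = [0.8, 1.0, 1.2, 1.3, 4.8, 4.9, 5.2, 5.3]
kl, keep, kept_source = select_similar_clusters(source, target)
print(keep, kept_source)
```

On this toy data the cluster near 9 contains source points only, so it is dropped and the source points near 1 and 5 are kept for training.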

10/18 An Example
[Figure: D, T, combined data, adaptive clustering]

11/18 Unifying Outputs
Basic idea:
– Generate initial outputs according to the regression model.
– For instances that are similar in the original output space, make their new outputs closer.

12/18 [Figure: initial outputs and their modification; values shown: 16, 37, 26.5, 21.25, 31.75]
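A minimal sketch of the "make similar instances' outputs closer" idea follows. It is an assumption-laden illustration, not the paper's update rule: the Gaussian similarity, the `bandwidth`, and the mixing weight `alpha` are hypothetical. The example inputs are chosen so that the results coincide with the numbers on slide 12 (16 and 37 move to 21.25 and 31.75 around the mean 26.5). Whether the slide derives its numbers this way is a guess.

```python
import math

def unify_outputs(initial, original, alpha=0.5, bandwidth=1.0):
    """Pull initial outputs together for instances with similar original outputs.

    initial  -- outputs produced by the regression model (new output space)
    original -- the same instances' outputs in their original output space
    """
    unified = []
    for z_i, y_i in zip(initial, original):
        # Similarity is measured in the ORIGINAL output space (assumed Gaussian).
        weights = [math.exp(-((y_i - y_j) ** 2) / (2 * bandwidth ** 2))
                   for y_j in original]
        neighbor_mean = sum(w * z for w, z in zip(weights, initial)) / sum(weights)
        # Move each output part of the way toward its similarity-weighted mean.
        unified.append((1 - alpha) * z_i + alpha * neighbor_mean)
    return unified

# Two instances with identical original outputs but different initial outputs.
new_outputs = unify_outputs(initial=[16.0, 37.0], original=[1.0, 1.0])
print(new_outputs)
```

Instances whose original outputs are far apart receive near-zero weight for each other, so their new outputs stay close to the initial ones.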

13/18 Experiment: bioinformatics data set

14/18 Experiment

15/18 Experiment: applied sociology data set

16/18 Experiment

17/18 Conclusions
Problem: learning with heterogeneous sources
– Heterogeneous data distributions
– Heterogeneous outputs
Solution:
– Clustering-based KL divergence helps perform sampling
– Similarity-preserving output generation helps unify outputs

18/18 Thanks!
