Heterogeneous Cross Domain Ranking in Latent Space. Bo Wang 1, Jie Tang 2, Wei Fan 3, Songcan Chen 1, Zi Yang 2, Yanzhu Liu 4. 1 Nanjing University of Aeronautics and Astronautics, 2 Tsinghua University, 3 IBM T.J. Watson Research Center, USA, 4 Peking University.

Similar presentations
A Support Vector Method for Optimizing Average Precision

Knowledge Transfer via Multiple Model Local Structure Mapping Jing Gao, Wei Fan, Jing Jiang, Jiawei Han l Motivate Solution Framework Data Sets Synthetic.
Music Recommendation by Unified Hypergraph: Music Recommendation by Unified Hypergraph: Combining Social Media Information and Music Content Jiajun Bu,
Date : 2013/05/27 Author : Anish Das Sarma, Lujun Fang, Nitin Gupta, Alon Halevy, Hongrae Lee, Fei Wu, Reynold Xin, Gong Yu Source : SIGMOD’12 Speaker.
Learning to Cluster Web Search Results SIGIR 04. ABSTRACT Organizing Web search results into clusters facilitates users quick browsing through search.
1 Social Influence Analysis in Large-scale Networks Jie Tang 1, Jimeng Sun 2, Chi Wang 1, and Zi Yang 1 1 Dept. of Computer Science and Technology Tsinghua.
The University of Wisconsin-Madison Universal Morphological Analysis using Structured Nearest Neighbor Prediction Young-Bum Kim, João V. Graça, and Benjamin.
Optimizing Estimated Loss Reduction for Active Sampling in Rank Learning Presented by Pinar Donmez joint work with Jaime G. Carbonell Language Technologies.
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
MANISHA VERMA, VASUDEVA VARMA PATENT SEARCH USING IPC CLASSIFICATION VECTORS.
1 Heterogeneous Cross Domain Ranking in Latent Space Bo Wang 1, Jie Tang 2, Wei Fan 3, Songcan Chen 1, Zi Yang 2, Yanzhu Liu 4 1 Nanjing University of.
Cross Validation Framework to Choose Amongst Models and Datasets for Transfer Learning Erheng Zhong ¶, Wei Fan ‡, Qiang Yang ¶, Olivier Verscheure ‡, Jiangtao.
Relaxed Transfer of Different Classes via Spectral Partition Xiaoxiao Shi 1 Wei Fan 2 Qiang Yang 3 Jiangtao Ren 4 1 University of Illinois at Chicago 2.
Longbiao Kang, Baotian Hu, Xiangping Wu, Qingcai Chen, and Yan He Intelligent Computing Research Center, School of Computer Science and Technology, Harbin.
Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.
TransRank: A Novel Algorithm for Transfer of Rank Learning Depin Chen, Jun Yan, Gang Wang et al. University of Science and Technology of China, USTC Machine.
(ACM KDD 09’) Prem Melville, Wojciech Gryc, Richard D. Lawrence
1 1 Chenhao Tan, 1 Jie Tang, 2 Jimeng Sun, 3 Quan Lin, 4 Fengjiao Wang 1 Department of Computer Science and Technology, Tsinghua University, China 2 IBM.
1 Zi Yang, Wei Li, Jie Tang, and Juanzi Li Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University, China {yangzi,
Dongyeop Kang1, Youngja Park2, Suresh Chari2
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
Predictive Modeling with Heterogeneous Sources Xiaoxiao Shi 1 Qi Liu 2 Wei Fan 3 Qiang Yang 4 Philip S. Yu 1 1 University of Illinois at Chicago 2 Tongji.
Exploring Online Social Activities for Adaptive Search Personalization CIKM’10 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Unsupervised Constraint Driven Learning for Transliteration Discovery M. Chang, D. Goldwasser, D. Roth, and Y. Tu.
Modern Topics in Multivariate Methods for Data Analysis.
Source-Selection-Free Transfer Learning
Transfer Learning Motivation and Types Functional Transfer Learning Representational Transfer Learning References.
Mining Social Network for Personalized Prioritization Language Techonology Institute School of Computer Science Carnegie Mellon University Shinjae.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
A Novel Local Patch Framework for Fixing Supervised Learning Models Yilei Wang 1, Bingzheng Wei 2, Jun Yan 2, Yang Hu 2, Zhi-Hong Deng 1, Zheng Chen 2.
Matching Users and Items Across Domains to Improve the Recommendation Quality Created by: Chung-Yi Li, Shou-De Lin Presented by: I Gde Dharma Nugraha 1.
Comparative Study of Name Disambiguation Problem using a Scalable Blocking-based Framework Byung-Won On, Dongwon Lee, Jaewoo Kang, Prasenjit Mitra JCDL.
Retrieval of Highly Related Biomedical References by Key Passages of Citations Rey-Long Liu Dept. of Medical Informatics Tzu Chi University Taiwan.
Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification John Blitzer, Mark Dredze and Fernando Pereira University.
Carnegie Mellon Novelty and Redundancy Detection in Adaptive Filtering Yi Zhang, Jamie Callan, Thomas Minka Carnegie Mellon University {yiz, callan,
Collecting High Quality Overlapping Labels at Low Cost Grace Hui Yang Language Technologies Institute Carnegie Mellon University Anton Mityagin Krysta.
Pairwise Preference Regression for Cold-start Recommendation Speaker: Yuanshuai Sun
CoNMF: Exploiting User Comments for Clustering Web2.0 Items Presenter: He Xiangnan 28 June School of Computing National.
Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: Group:
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Unsupervised Streaming Feature Selection in Social Media
Predicting User Interests from Contextual Information R. W. White, P. Bailey, L. Chen Microsoft (SIGIR 2009) Presenter : Jae-won Lee.
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.
LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker : Yi-Ling Tai Date : 2010/03/15.
Recommendation in Scholarly Big Data
Bridging Domains Using World Wide Knowledge for Transfer Learning
CNN-RNN: A Unified Framework for Multi-label Image Classification
Cross Domain Distribution Adaptation via Kernel Mapping
Transfer Learning in Astronomy: A New Machine Learning Paradigm
An Empirical Study of Learning to Rank for Entity Search
Correlative Multi-Label Multi-Instance Image Annotation
Collective Network Linkage across Heterogeneous Social Platforms
Learning to Rank Shubhra kanti karmaker (Santu)
Example: Academic Search
Q4 : How does Netflix recommend movies?
Weakly Learning to Match Experts in Online Community
Intent-Aware Semantic Query Annotation
iSRD Spam Review Detection with Imbalanced Data Distributions
MEgo2Vec: Embedding Matched Ego Networks for User Alignment Across Social Networks Jing Zhang+, Bo Chen+, Xianming Wang+, Fengmei Jin+, Hong Chen+, Cuiping.
Citation-based Extraction of Core Contents from Biomedical Articles
Knowledge Transfer via Multiple Model Local Structure Mapping
Example: Academic Search
Feature Selection for Ranking
Relevance and Reinforcement in Interactive Browsing
Jonathan Elsas LTI Student Research Symposium Sept. 14, 2007
Learning to Rank with Ties
Presentation transcript:

1 Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang 1, Jie Tang 2, Wei Fan 3, Songcan Chen 1, Zi Yang 2, Yanzhu Liu 4
1 Nanjing University of Aeronautics and Astronautics, 2 Tsinghua University, 3 IBM T.J. Watson Research Center, USA, 4 Peking University

2 Introduction
The web is becoming more and more heterogeneous. Ranking is a fundamental problem on the web:
– unsupervised vs. supervised
– homogeneous vs. heterogeneous

3 Motivation
Heterogeneous cross domain ranking.
Main challenges:
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?

4 Outline
– Related Work
– Heterogeneous cross domain ranking
– Experiments
– Conclusion

5 Related Work
Learning to rank:
– Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07]
– Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08]
– Ranking adaptation: [Chen, 08]
Transfer learning:
– Instance-based: [Dai, 07] [Gao, 08]
– Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07]
– Model-based: [Bonilla, 08]

6 Outline
– Related Work
– Heterogeneous cross domain ranking
  – Basic idea
  – Proposed algorithm: HCDRank
– Experiments
– Conclusion

7 Basic Idea
[Figure: for the query "data mining", conferences (source domain) and experts (target domain) are mapped into a shared latent space. Note that the set of mis-ranked pairs might be empty, since there may be no labelled data in the target domain.]

8 Learning Task
In the HCD ranking problem, the transfer ranking task is defined as: given a limited number of labeled examples L_T and a large number of unlabeled examples S from the target domain, together with sufficient labeled data L_S from the source domain, learn a ranking function f_T^* for predicting the rank levels of the unlabeled data in the target domain.
Key issues:
– Different feature distributions / different feature spaces
– Different numbers of rank levels
– Very unbalanced numbers of labeled training examples (thousands vs. a few)

9-10 The Proposed Algorithm — HCDRank
The objective combines a loss function in the source domain, a loss function in the target domain, and a penalty term that ties the two domains together (a schematic reconstruction of the slide's formula):

    \min_{w_S, w_T} \; L(w_S; L_S) + C \cdot L(w_T; L_T) + \lambda \cdot \Omega(w_S, w_T)

– Loss function: the number of mis-ranked pairs
– C: cost-sensitive parameter that deals with the imbalance of labeled data between the two domains
– \lambda: balances the empirical loss and the penalty
Two questions posed on the slide: how to define this objective, and how to optimize it? The problem is non-convex and not directly solvable, so it is approached via its dual problem.
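To make the shape of the objective concrete, here is a minimal Python sketch, assuming a hinge surrogate for the number of mis-ranked pairs and a quadratic coupling penalty; both are illustrative assumptions, not the exact HCDRank formulation.

```python
# A minimal sketch, NOT the exact HCDRank formulation: a hinge surrogate
# stands in for the number of mis-ranked pairs, and a quadratic coupling
# term stands in for the penalty relating the two domains.
import numpy as np

def pairwise_hinge_loss(w, X, pairs):
    """Surrogate for mis-ranked pairs; pair (i, j) means item i should
    rank above item j."""
    scores = X @ w
    winners = np.array([i for i, _ in pairs])
    losers = np.array([j for _, j in pairs])
    margins = 1.0 - (scores[winners] - scores[losers])
    return np.maximum(0.0, margins).sum()

def hcdrank_style_objective(w_src, w_tgt, X_src, pairs_src,
                            X_tgt, pairs_tgt, C, lam):
    """Source loss + cost-weighted target loss + coupling penalty."""
    loss_src = pairwise_hinge_loss(w_src, X_src, pairs_src)
    loss_tgt = C * pairwise_hinge_loss(w_tgt, X_tgt, pairs_tgt)
    penalty = lam * np.sum((w_src - w_tgt) ** 2)  # assumed coupling term
    return loss_src + loss_tgt + penalty
```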

11 Algorithm flow and complexity:
– Alternately optimize the matrices M and D: O(2T · sN log N)
– Construct the transformation matrix: O(d³)
– Learn the weight vector of the target domain (learning in the latent space): O(sN log N)
– Apply the learnt weight vector to predict
Total: O((2T + 1) · sN log N + d³), where d is the number of features, N the number of instance pairs for training, and s the number of non-zero features.
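The prediction step at the end of this pipeline is straightforward. A sketch, assuming training yields a transformation matrix M (d × k) into the latent space and a target-domain weight vector w_latent; both names are placeholders, not the API of any released implementation.

```python
# Sketch of the final two steps above: map raw features into the latent
# space via the learned transformation matrix, then score with the learnt
# target-domain weight vector.
import numpy as np

def rank_in_latent_space(X, M, w_latent):
    """X: (n, d) raw features; M: (d, k) transformation; w_latent: (k,)."""
    Z = X @ M                    # project into the k-dimensional latent space
    scores = Z @ w_latent        # apply the learnt weight vector
    return np.argsort(-scores)   # item indices, best first
```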

12 Outline
– Related Work
– Heterogeneous cross domain ranking
– Experiments
  – Ranking on Homogeneous data
  – Ranking on Heterogeneous data
  – Ranking on Heterogeneous tasks
– Conclusion

13 Experiments
Data sets:
– Homogeneous data set: LETOR_TR (50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR, and OHSUMED_TR)
– Heterogeneous academic data set: ArnetMiner.org (14,134 authors, 10,716 papers, and 1,434 conferences)
– Heterogeneous task data set: 9 queries, 900 experts, 450 best supervisor candidates
Evaluation measures:
– MAP: mean average precision
– NDCG: normalized discounted cumulative gain
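For reference, textbook implementations of the two measures follow; these are the standard definitions, and the exact gain/discount variant of NDCG used in the paper may differ slightly.

```python
# Standard definitions of MAP's per-query component (average precision)
# and NDCG@k.
import numpy as np

def average_precision(rels):
    """rels: binary relevance labels of results in ranked order."""
    rels = np.asarray(rels)
    if rels.sum() == 0:
        return 0.0
    precisions = np.cumsum(rels) / np.arange(1, len(rels) + 1)
    return float(precisions[rels == 1].mean())

def ndcg_at_k(rels, k):
    """rels: graded relevance labels of results in ranked order."""
    rels = np.asarray(rels, dtype=float)
    def dcg(r):
        r = r[:k]
        return float(((2 ** r - 1) / np.log2(np.arange(2, len(r) + 2))).sum())
    ideal = dcg(np.sort(rels)[::-1])
    return dcg(rels) / ideal if ideal > 0 else 0.0

# MAP is then the mean of average_precision over all queries.
```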

14 Ranking on Homogeneous data
LETOR_TR:
– We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario
– Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
Baselines (listed in a table on the slide)

15 [Figure: results on the three sub-datasets, with the cosine similarity between source and target domains: OHSUMED_TR = 0.01, TREC2004_TR = 0.23, TREC2003_TR = 0.18]
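The slides do not spell out how this domain cosine similarity is computed; one plausible reading (an assumption, not stated on the slide) compares the mean feature vectors of the two datasets.

```python
# Assumed reading of the slide's "cosine similarity" between two datasets:
# the cosine between their mean feature vectors. The paper may define it
# differently (e.g., over feature distributions).
import numpy as np

def domain_cosine_similarity(X_a, X_b):
    a, b = X_a.mean(axis=0), X_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```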

16 Observations
– Ranking accuracy: HCDRank is +5.6% to +6.1% better in terms of MAP
– Effect of domain difference: when the cosine similarity is high (TREC2004), simply combining the two domains already yields better ranking performance
– Training time: next slide

17 Training Time
[Figure: training-time comparison.] BUT: HCDRank can easily be parallelized, and the training process only needs to be run once on a data set.

18 Ranking on Heterogeneous data
ArnetMiner data set (14,134 authors, 10,716 papers, and 1,434 conferences)
Training and test data:
– 44 most frequently queried keywords from the log file
– Author collection: Libra, Rexa, and ArnetMiner
– Conference collection: Libra, ArnetMiner
Ground truth:
– Conferences: online resources
– Experts: two faculty members and five graduate students from CS provided human judgments for expert ranking

19 Feature Definition

Features   Description
L1-L10     Low-level language model features
H1-H3      High-level language model features
S1         How many years the conference has been held
S2         The sum of citation numbers of the conference during the recent 5 years
S3         The sum of citation numbers of the conference during the recent 10 years
S4         How many years have passed since his/her first paper
S5         The sum of citation numbers of all the publications of one expert
S6         How many papers have been cited more than 5 times
S7         How many papers have been cited more than 10 times

16 features for a conference, 17 features for an expert
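As a small illustration of the conference features S1-S3, here is a sketch under the assumption that S2/S3 count citations to the conference's papers published in the recent 5/10 years; the slide's wording is ambiguous on this point, and the input format (`papers` as (year, citation_count) tuples) is hypothetical.

```python
# Sketch of conference features S1-S3; the interpretation of S2/S3 and the
# input format are assumptions for illustration only.
def conference_features(papers, first_year, current_year):
    """papers: list of (publication_year, citation_count) tuples."""
    s1 = current_year - first_year                           # S1: years held
    s2 = sum(c for y, c in papers if y > current_year - 5)   # S2: 5-year citations
    s3 = sum(c for y, c in papers if y > current_year - 10)  # S3: 10-year citations
    return s1, s2, s3
```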

20 Expert Finding Results

21 Observations
– Ranking accuracy: HCDRank outperforms the baselines, especially the two unsupervised systems
– Feature analysis (next slide): the final weight vector exploits information from both domains and adjusts the weights learned from single-domain data
– Training time: next slide

22 Feature Correlation Analysis

23 Ranking on Heterogeneous tasks
Expert finding task vs. best supervisor finding task
Training and test data:
– Expert finding task: ranking lists from ArnetMiner or annotated lists
– Best supervisor finding task: 9 most frequent queries from the ArnetMiner log file; for each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
Ground truth:
– Collected feedback about the candidates (yes / no / not sure)

24 Best supervisor finding
Training/test set and ground truth:
– 724 emails sent (a fragment of the email is shown on the slide)
– Effective feedbacks: more than 82 (and increasing)
– Each candidate is rated by the definite feedbacks (yes/no)

25 Feature Definition

Features          Description
L1-L10            Low-level language model features
H1-H3             High-level language model features
B1                The year he/she published his/her first paper
B2                The number of papers of an expert
B3                The number of papers in the recent 2 years
B4                The number of papers in the recent 5 years
B5                The number of citations of all his/her papers
B6                The number of papers cited more than 5 times
B7                The number of papers cited more than 10 times
B8                PageRank score
SumCo1-SumCo8     The sum of coauthors' B1-B8 scores
AvgCo1-AvgCo8     The average of coauthors' B1-B8 scores
SumStu1-SumStu8   The sum of his/her advisees' B1-B8 scores
AvgStu1-AvgStu8   The average of his/her advisees' B1-B8 scores
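The coauthor and advisee aggregates are plain sums and averages of the base B1-B8 scores. A minimal sketch, assuming a dict of per-author base-score vectors; the input names are hypothetical.

```python
# Sketch of the SumCo*/AvgCo* (and analogously SumStu*/AvgStu*) features:
# sum and average of the B1-B8 score vectors over an expert's coauthors.
# `base_scores` and `coauthors` are hypothetical inputs for illustration.
import numpy as np

def coauthor_aggregates(base_scores, coauthors):
    """base_scores: {author: length-8 array of B1-B8}; coauthors: names."""
    if not coauthors:
        return np.zeros(8), np.zeros(8)
    mat = np.array([base_scores[c] for c in coauthors])
    return mat.sum(axis=0), mat.mean(axis=0)  # (SumCo1-8, AvgCo1-8)
```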

26 Best supervisor finding results

27 Experimental Results

28 Outline
– Related Work
– Heterogeneous cross domain ranking
– Experiments
– Conclusion

29 Conclusion
– We formally define the problem of heterogeneous cross domain ranking and propose a general framework
– We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains
– Experimental results on three different genres of data sets verify the effectiveness of the proposed algorithm

30 Data Set

31 Ranking on Heterogeneous data
A subset of ArnetMiner (14,134 authors, 10,716 papers, and 1,434 conferences)
– 44 most frequently queried keywords from the log file
– Author collection: for each query, we gathered the top 30 experts from Libra, Rexa, and ArnetMiner
– Conference collection: for each query, we gathered the top 30 conferences from Libra and ArnetMiner
Ground truth:
– Three online resources
– Two faculty members and five graduate students from CS provided human judgments

32 Ranking on Heterogeneous tasks
For the expert finding task, we can use results from ArnetMiner or annotated lists as training data.
For the best supervisor task, the 9 most frequent queries from the ArnetMiner log file are used:
– For each query, we sent emails to 100 researchers: the top 50 researchers by ArnetMiner, plus the top 50 researchers who started publishing papers only in recent years (91.6% of them are currently graduate students or postdoctoral researchers)
– Collection of feedback: 50 best supervisor candidates rated (yes / no / not sure); other candidates could also be added
– Ground truth