Ph.D. Thesis Defense 1 Web Mining Techniques for Query Log Analysis and Expertise Retrieval Hongbo Deng Department of Computer Science and Engineering.

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
1.Accuracy of Agree/Disagree relation classification. 2.Accuracy of user opinion prediction. 1.Task extraction performance on Bing web search log with.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
1 Learning User Interaction Models for Predicting Web Search Result Preferences Eugene Agichtein Eric Brill Susan Dumais Robert Ragno Microsoft Research.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 3 March 23, 2005
Evaluating Search Engine
Chen Cheng1, Haiqin Yang1, Irwin King1,2 and Michael R. Lyu1
ACM Multimedia th Annual Conference, October , 2004
Semantic text features from small world graphs Jure Leskovec, IJS + CMU John Shawe-Taylor, Southampton.
1 Integrating User Feedback Log into Relevance Feedback by Coupled SVM for Content-Based Image Retrieval 9-April, 2005 Steven C. H. Hoi *, Michael R. Lyu.
Investigation of Web Query Refinement via Topic Analysis and Learning with Personalization Department of Systems Engineering & Engineering Management The.
Sigir’99 Inside Internet Search Engines: Search Jan Pedersen and William Chang.
1 PageSim: A Link-based Similarity Measure for the World Wide Web Zhenjiang Lin, Irwin King, and Michael, R., Lyu Computer Science & Engineering, The Chinese.
Affinity Rank Yi Liu, Benyu Zhang, Zheng Chen MSRA.
Information Retrieval
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 30, (2014) BERLIN CHEN, YI-WEN CHEN, KUAN-YU CHEN, HSIN-MIN WANG2 AND KUEN-TYNG YU Department of Computer.
1 A Topic Modeling Approach and its Integration into the Random Walk Framework for Academic Search 1 Jie Tang, 2 Ruoming Jin, and 1 Jing Zhang 1 Knowledge.
CS344: Introduction to Artificial Intelligence Vishal Vachhani M.Tech, CSE Lecture 34-35: CLIR and Ranking in IR.
SIGIR’09 Boston 1 Entropy-biased Models for Query Representation on the Click Graph Hongbo Deng, Irwin King and Michael R. Lyu Department of Computer Science.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Probabilistic Question Recommendation for Question Answering Communities Mingcheng Qu, Guang Qiu, Xiaofei He, Cheng Zhang, Hao Wu, Jiajun Bu, Chun Chen.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Using Hyperlink structure information for web search.
Which of the two appears simple to you? 1 2.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
2015/10/111 DBconnect: Mining Research Community on DBLP Data Osmar R. Zaïane, Jiyang Chen, Randy Goebel Web Mining and Social Network Analysis Workshop.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
1 Discovering Authorities in Question Answer Communities by Using Link Analysis Pawel Jurczyk, Eugene Agichtein (CIKM 2007)
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Giorgos Giannopoulos (IMIS/”Athena” R.C and NTU Athens, Greece) Theodore Dalamagas (IMIS/”Athena” R.C., Greece) Timos Sellis (IMIS/”Athena” R.C and NTU.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Improving Web Search Results Using Affinity Graph Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma Microsoft Research.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Jiafeng Guo(ICT) Xueqi Cheng(ICT) Hua-Wei Shen(ICT) Gu Xu (MSRA) Speaker: Rui-Rui Li Supervisor: Prof. Ben Kao.
Institute of Computing Technology, Chinese Academy of Sciences 1 A Unified Framework of Recommending Diverse and Relevant Queries Speaker: Xiaofei Zhu.
Flickr Tag Recommendation based on Collective Knowledge BÖrkur SigurbjÖnsson, Roelof van Zwol Yahoo! Research WWW Summarized and presented.
Probabilistic Latent Query Analysis for Combining Multiple Retrieval Sources Rong Yan Alexander G. Hauptmann School of Computer Science Carnegie Mellon.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
Hongbo Deng, Michael R. Lyu and Irwin King
Post-Ranking query suggestion by diversifying search Chao Wang.
More Than Relevance: High Utility Query Recommendation By Mining Users' Search Behaviors Xiaofei Zhu, Jiafeng Guo, Xueqi Cheng, Yanyan Lan Institute of.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
A Novel Relational Learning-to- Rank Approach for Topic-focused Multi-Document Summarization Yadong Zhu, Yanyan Lan, Jiafeng Guo, Pan Du, Xueqi Cheng Institute.
11 A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 1, Michael R. Lyu 1, Irwin King 1,2 1 The Chinese.
Leveraging Knowledge Bases for Contextual Entity Exploration Categories Date:2015/09/17 Author:Joonseok Lee, Ariel Fuxman, Bo Zhao, Yuanhua Lv Source:KDD'15.
KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.
1 CS 430: Information Discovery Lecture 5 Ranking.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Advisor-Advisee Relationships from Research Publication.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
CiteData: A New Multi-Faceted Dataset for Evaluating Personalized Search Performance CIKM’10 Advisor : Jia-Ling, Koh Speaker : Po-Hsien, Shih.
Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.
Wen Chan 1 , Jintao Du 1, Weidong Yang 1, Jinhui Tang 2, Xiangdong Zhou 1 1 School of Computer Science, Shanghai Key Laboratory of Data Science, Fudan.
Hao Ma, Dengyong Zhou, Chao Liu Microsoft Research Michael R. Lyu
Xiang Li,1 Lili Mou,1 Rui Yan,2 Ming Zhang1
An Empirical Study of Learning to Rank for Entity Search
Location Recommendation — for Out-of-Town Users in Location-Based Social Network Yina Meng.
A Classification-based Approach to Question Routing in Community Question Answering Tom Chao Zhou 22, Feb, 2010 Department of Computer.
Presentation transcript:

Ph.D. Thesis Defense 1 Web Mining Techniques for Query Log Analysis and Expertise Retrieval Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Date: Sep 2, 2009

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 2 Rich Web Data Web pages One trillion unique URLs Question answer Yahoo! answer Scientific literature DBLP data Query logs AOL query log Web mining techniques

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 3 Query Overview Rich Web Data Information Web pages, images, etc. Query log analysis Web mining techniques Expertise retrieval Relevant experts

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 4 Objective  Combine the content and the graph information  Leverage IR, link analysis, ML in a unified way

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 5 Outline  Background: Web Mining Techniques Information retrieval, link analysis, machine learning  Modeling Bipartite Graph for Query Log Analysis Entropy-biased Models [w/ King-Lyu, SIGIR’09] Co-HITS Algorithm [w/ Lyu-King, KDD’09]  Modeling Expertise Retrieval Baseline and Weighted Model [w/ King-Lyu, ICDM’08] Graph-based Re-ranking Model [w/ Lyu-King, WSDM’09] Enhanced Models with Community [w/ King-Lyu, CIKM’09]  Conclusion and Future Work

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 6  Information retrieval models Vector space model Probabilistic model Language model  Web link analysis PageRank: a link represents a vote HITS: good hubs points to good authorities Other variations  Machine learning Semi-supervised learning Graph-based regularization framework Background

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 7 Query log data: Query-URL bipartite graph Query log data: Query-URL bipartite graph Modeling Bipartite Graph for Query Log Analysis  Many Web data can be modeled as bipartite graphs DBLP data: Author-Paper bipartite graph Netflix data: User-Movie bipartite graph It is very essential to model bipartite graphs for mining these data types. How to weigh the edges of the graph? How to combine the graph with other info.?

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 8 Query Log Analysis Query suggestion Query classification Targeted advertising Ranking  Query log analysis – improve search engine’s capabilities Modeling Bipartite Graph for Query Log Analysis

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 9 Click Graph  Click graph – an important technique A bipartite graph between queries and URLs Edges connect a query with the URLs Capture some semantic relations, e.g., “map” and “travel” How to utilize and model the click graph to represent queries? Traditional model based on the raw click frequency (CF) Propose two kinds of models Entropy-biased framework Co-HITS algorithm Modeling Bipartite Graph for Query Log Analysis

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 10 Outline  Part I: Entropy-biased Framework for Modeling Click Graphs Motivation and Preliminaries Traditional Click Frequency Model Entropy-biased Model Experimental Results Summary

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 11 Motivation  Basic idea Various query-URL pairs should be treated differently  Intuition Common clicks on less frequent but more specific URLs are of greater value than common clicks on frequent and general URLs Is a single click on different URLs equally important? General URL Specific URL Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 12 Preliminaries Query: URL: User: Query instance: Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 13 Traditional Click Frequency Model  Transition probability: Normalize the click frequency (CF) From query to URL:From URL to query: Part I. Entropy-biased Framework Measure the similarity between queries  The most similar query: q2 (“map”)  q1 (“Yahoo”)  More reasonable: q2 (“map”)  q3 (“travel”) Entropy-biased model

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 14 Entropy-biased Model  The more general and highly ranked URL Connect with more queries Increase the ambiguity and uncertainty  The entropy of a URL: Suppose Tend to be proportional to the n(d j ) It would be more reasonable to weigh these two edges differently Part I. Entropy-biased Framework The number of queries that connected with d j Query frequency Transition probability from a URL to a query

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 15 Entropy Entropy  Discriminative Ability  Entropy increase, discriminative ability decrease Be inversely proportional to each other A URL with a high query frequency is less discriminative overall  Inverse query frequency Measure the discriminative ability of the URL Benefits  Reduce the influence of some heavily-clicked URLs  Balance the bias of clicks for those highly ranked URLs  Incorporate with other factors to tune the model Part I. Entropy-biased Framework Constant

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 16 CF-IQF Model  Incorporate the IQF with the click frequency  A high click frequency  A low query frequency  “A” is weighted higher than “B” Part I. Entropy-biased Framework The surface specified by the click frequency, query frequency and cfiqf, with color specified by the cfiqf value. The color is proportional to the surface height.

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 17 CF-IQF Model  Transition probability The most similar query q2 (“map”)  q3 (“travel”) The most similar query q2 (“map”)  q1 (“Yahoo”) Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 18 UF Model and UF-IQF Model  Drawback of CF model Prone to spam by some malicious clicks (if a single user clicked on a certain URL thousands of times)  UF model Utilize user frequency instead of click frequency Improve the resistance against malicious clicks  UF-IQF model Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 19 Connection with TF-IDF  TF-IDF Successfully used in vector space model for text retrieval Try to interpret IDF based on binary independence retrieval (BIR), information entropy and LM TF-IDF has never been explored to bipartite graphs  Entropy-biased framework (CF-IQF) IQF is new CF-IQF is a simplified version of entropy-biased model Share the key point to tune the importance of an edge The scheme can be applied to other bipartite graphs Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 20 Mining Query Log on Click Graph Query clustering Query-to-query similarity Query suggestion Appendix B Models Query-to-query similarity Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 21 Similarity Measurement  Cosine similarity  Jaccard coefficient  The similarity results are reported and analyzed Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 22 Experimental Evaluation  Data collection AOL query log data  Cleaning the data Removing the queries that appear less than 2 times Combining the near-duplicated queries 883,913 queries and 967,174 URLs 4,900,387 edges Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 23 Evaluation: ODP Similarity  A simple measure of similarity among queries using ODP categories (query  category) Definition: Example:  Q1: “United States”  “Regional > North America > United States”  Q2: “National Parks”  “Regional > North America > United States > Travel and Tourism > National Parks and Monuments”  Precision at rank n  300 distinct queries 3/5 Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 24 Experimental Results  Query similarity analysis 1. CF-IQF is better than CF UF-IQF > UF Results: Part I. Entropy-biased Framework Better 2. UF is better than CF UF-IQF > CF-IQF 3. TF-IDF is better than TF

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 25 Experimental Results  Query similarity analysis 4. Jaccard coefficient The improvements are consistent with the Cosine similarity Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 26 Experimental Results  Query similarity analysis 5. UF-IQF achieves best performance in most cases. 6. CF and UF models > TF CF-IQF, UF-IQF > TF-IDF The click graph catches more semantic relations between queries than the query terms Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 27 Summary of Part I  Introduce the inverse query frequency (IQF) to measure the discriminative ability of a URL  Propose the framework of the entropy-biased model for the click graph IQF + CF, IQF + UF Formal model to distinguish the variation on different query-URL pairs in the click graph.  Experimental results show the improvements of the proposed models are consistent and promising Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 28 Outline  Part II: Co-HITS Algorithm Motivation Co-HITS Algorithm  Iterative Framework  Regularization Framework Experimental Results Summary

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 29 Motivation Two kinds of information ContentGraph IR Models for Link Analysis for - VSM - Language Model - etc. - HITS - PageRank - etc. Incorporate Content with Graph - Personalized PageRank (PPR) - Linear Combination - etc. Semantic relationsRelevance Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 30 Preliminaries Graph Hidden links: Explicit links: Content XY Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 31  Basic idea Incorporate the bipartite graph with the content information from both sides Initialize the vertices with the relevance scores x 0, y 0 Propagate the scores (mutual reinforcement) Generalized Co-HITS Initial scoresScore propagation Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 32  Iterative framework Generalized Co-HITS Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 33 Iterative  Regularization Framework  Consider the vertices on one side SmoothnessFit initial scores Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 34 Generalized Co-HITS  Regularization framework R1R1 R2R2 R3R3 W uu W vv Intuition: the highly connected vertices are most likely to have similar relevance scores. Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 35 Generalized Co-HITS  Regularization framework The cost function:Optimization problem: Solution: Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 36 Application to Query-URL Bipartite Graphs  Bipartite graph construction Edge weighted by the click frequency Normalize to obtain the transition matrix  Overall algorithm Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 37 Experimental Evaluation  Data collection AOL query log data  Cleaning the data Removing the queries that appear less than 2 times Combining the near-duplicated queries 883,913 queries and 967,174 URLs 4,900,387 edges 250,127 unique terms Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 38 Evaluation: ODP Similarity  A simple measure of similarity among queries using ODP categories (query  category) Definition: Example:  Q1: “United States”  “Regional > North America > United States”  Q2: “National Parks”  “Regional > North America > United States > Travel and Tourism > National Parks and Monuments”  Precision at rank n  300 distinct queries 3/5 Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 39 Experimental Results  Comparison of iterative framework The initial relevance scores from both sides provide valuable information. The improvements of OSP and CoIter over the baseline (the dashed line) are promising when compared to the PPR. personalized PageRank one-step propagation general Co-HITS Result 1: Part II. Co-HITS Algorithm graphcontentUVincorporate VU

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 40 Experimental Results  Comparison of regularization framework single-sided regularization double-sided regularization SiRegu can improve the performance over the baseline. CoRegu performs better than SiRegu, which owes to the newly developed cost function R 3. Moreover, CoRegu is relatively robust. Result 2: Part II. Co-HITS Algorithm graphcontentIncorporate R3 R1+R2

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 41 Experimental Results  Detailed results The CoRegu-0.5 achieves the best performance. It is very essential and promising to consider the double-sided regularization framework for the bipartite graph. Result 3: Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 42 Summary of Part II  Propose the generalized Co-HITS algorithm Incorporate the bipartite graph with the content information from both sides  Investigate two different frameworks Iterative: include HITS and personalized PageRank as special cases Regularization: build the connection with HITS, develop new cost functions  Experimental results CoRegu is more robust, achieves the best performance Part II. Co-HITS Algorithm

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 43 Outline  Background: Web Mining Techniques Information retrieval, link analysis, machine learning  Modeling Bipartite Graph for Query Log Analysis Entropy-biased Models [w/ King-Lyu, SIGIR’09] Co-HITS Algorithm [w/ Lyu-King, KDD’09]  Modeling Expertise Retrieval Baseline and Weighted Model [w/ King-Lyu, ICDM’08] Graph-based Re-ranking Model [w/ Lyu-King, WSDM’09] Enhanced Models with Community [w/ King-Lyu, CIKM’09]  Conclusion and Future Work

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 44 Modeling Expertise Retrieval  Expertise retrieval (Expert finding) task: Identify people with relevant expertise for a given query A high-level information retrieval DBLP bibliography and its supplemental data Information Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 45 Overview of Expertise Retrieval Part III:  Baseline model  Weighted language model  Graph-based re-ranking Part IV:  Enhanced models with community-aware strategies Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 46 Expertise Modeling  Expert finding p(ca|q): what is the probability of a candidate ca being an expert given the query topic q? Rank candidates ca according to this probability.  Approach: Using Bayes’ theorem, where p(ca, q) is joint probability of a candidate and a query, p(q) is the probability of a query. Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 47 Baseline Model (Document-based Model)  The probability p(ca,q): Language Model Conditionally independent Baseline model Find out documents relevant to the query Aggregate the expertise of an expert candidate from the associated documents Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 48 Weighted Model A query example Weighted model Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 49 General Expert Finding System Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 50 Graph-based Re-ranking  Key issue for expert finding: To retrieve the most relevant documents along with the relevance scores  Intuition Global consistency: Similar documents are most likely to have similar ranking scores with respect to a query  Regularization framework Global consistencyFit initial scores Parameter Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 51  Optimization problem  A closed-form solution  Connection with other methods μ α  0, return the initial scores μ α  1, a variation of PageRank-based model μ α ∈ (0, 1), combine both information simultaneously Graph-based Re-ranking Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 52 Combination of Different Methods Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 53 Experimental Setup  DBLP collection and representation Statistics of the DBLP collection A sample of the DBLP XML records Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 54 Experimental Setup  Assessments Manually created the ground truth through the method of pooled relevance judgments 17 query topics and 17 expert lists  Evaluation metrics Precision at rank n MAP Bpref (Appendix D) Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 55 Experimental Results  Weighted model LM(w) outperforms baseline model LM(bas) Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 56 Experimental Results  The performance can be boosted with Graph-based regularization Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 57 Experimental Results Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 58 Summary of Part III  Present the weighted model for expert finding Take into account both the relevance scores and the importance of the documents  Investigate and integrate the graph-based regularization method with the weighted model  Experimental results are presented to show the effectiveness of proposed models The performance is further boosted by refining the relevance scores of the documents Part III. Modeling Expertise Retrieval

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 59 Summary of Other Contributions  Enhancing Expertise Retrieval Communities could provide valuable insight and distinctive information A new smoothing method using the community context Ranking refinement based on community co-authorship Details in Appendix AAppendix A

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 60 Conclusion

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 61 Future Work  Query log analysis Personalization  General click graph, user’s click graph, session info. Incorporate with other information  Query-flow model, user modeling  Expertise retrieval on the Web Beyond a particular domain or intranet Identify relevant experts/trusted people Create a global expert and friend recommendation  Apply to other applications Entity retrieval Online social media search …

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 62 Selected Publications  Hongbo Deng, Irwin King, Michael R. Lyu. "Entropy-biased Models for Query Representation on the Click Graph." Proceedings of the 32nd Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2009). Pages , Boston, MA, July 19-23, [Acceptance rate = 78/494 = 15.8%]SIGIR 2009  Hongbo Deng, Michael R. Lyu and Irwin King. "A Generalized Co-HITS Algorithm and Its Application to Bipartite Graphs." Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2009). Pages , Paris, France, June 28th - July 1st, [Acceptance rate = 105/561 = 18.7%]KDD 2009  Hongbo Deng, Irwin King, Michael R. Lyu. "Enhancing Expertise Retrieval Using Community-aware Strategies." Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 2009). Hong Kong, China, Nov. 2-6, Short paper. [Acceptance rate = 294/847 = 34.7%]CIKM 2009  Hongbo Deng, Michael R. Lyu and Irwin King. "Effective Latent Space Graph-based Re-ranking Model with Global Consistency." Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM 2009). Pages , Barcelona, Spain, Feb. 9-12, [Acceptance rate = 29/170 = 17%]WSDM 2009  Hongbo Deng, Irwin King and Michael R. Lyu. "Formal Models for Expert Finding on DBLP Bibliography Data." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008). Pages , Pisa, Italy, Dec , [Acceptance rate = 70/723 = 10%]ICDM 2008

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 63 Q&A Thanks!

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 64 Appendix A: Enhancing Expertise Retrieval  Motivation Communities could provide valuable insight and distinctive information  Community-aware strategies A new smoothing method using the community context Ranking refinement based on community co-authorship An example graph with two communities

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 65 Document-based Model  Key challenge Compute the relevance between query and document  Statistical language model Smoothing p(t|θ d ) with the community language model p(t|C d ) instead of the collection language model p(t|G)

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 66 Discovering Authorities in a Community  Co-authorship frequency  Normalized weight  AuthorRank Measure the authority for the authors within a community Co-authorship graph with: (a) co-authorship frequency, and (b) normalized weight

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 67 Community-Sensitive AuthorRank  Relevance  The quantity p(C k )  Community Sensitive AuthorRank Suppose C k be a “virtual” document, it becomes the document-based model Capture the high-level and general aspects for a query

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 68 Ranking Refinement Strategy  Two kinds of ranking results Rd: capture more specific and detailed aspects Rc: reflect more general and abstract aspects  Measure the similarity and diversity  Ranking refinement

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 69 Experimental Results  Comparison of Document-based Models

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 70 Experimental Results  Comparison of Enhanced Models

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 71 Experimental Results  Discussion and Detailed Results The detailed results of the community-sensitive AuthorRank for the query “machine learning.”

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 72 Experimental Results

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 73 Summary  Investigate the smoothing method using community context instead of the whole collection  Introduce the community-sensitive AuthorRank for determining the query-sensitive authorities  Develop an adaptive ranking refinement strategy to aggregate the ranking results  Experimental results shows a significant improvement over the baseline method  Return from Appendix AAppendix A

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 74 Appendix B: Graph-based Random Walk  Query-to-query graph The transition probability from q i to q j  The personalized PageRank Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 75 Experimental Results  Random Walk Evaluation Results: 1. With the increase of n, both models improve their performance. 2. CF-IQF model always performs better than the CF model. Part I. Entropy-biased Framework

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 76 Experimental Results  Random Walk Evaluation In general, the results generated by the CF and the CF-IQF models are similar, and mostly semantically relative to the original query, such as “American airline”. Part I. Entropy-biased Framework CF-IQF model can boost more relevant queries as suggestion and reduce some irrelevant queries.

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 77 Appendix C: Optimization Problem Optimization problem: Differentiating and simplifying: Solution:

Hongbo Deng Department of Computer Science and Engineering The Chinese University of Hong Kong Ph.D. Thesis Defense 78 Appendix D: Evaluation Metrics  Precision at rank n  Mean Average Precision (MAP):  Bpref: The score function of the number of non-relevant candidates