1 Blog site search using resource selection 2008 ACM CIKM Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan Date : 2009.08.04.

Slides:



Advertisements
Similar presentations
A probabilistic model for retrospective news event detection
Advertisements

Term Level Search Result Diversification DATE : 2013/09/11 SOURCE : SIGIR’13 AUTHORS : VAN DANG, W. BRUCE CROFT ADVISOR : DR.JIA-LING, KOH SPEAKER : SHUN-CHEN,
Diversity Maximization Under Matroid Constraints Date : 2013/11/06 Source : KDD’13 Authors : Zeinab Abbassi, Vahab S. Mirrokni, Mayur Thakur Advisor :
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.
DOMAIN DEPENDENT QUERY REFORMULATION FOR WEB SEARCH Date : 2013/06/17 Author : Van Dang, Giridhar Kumaran, Adam Troy Source : CIKM’12 Advisor : Dr. Jia-Ling.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
1 Entity Ranking Using Wikipedia as a Pivot (CIKM 10’) Rianne Kaptein, Pavel Serdyukov, Arjen de Vries, Jaap Kamps 2010/12/14 Yu-wen,Hsu.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Evaluating Search Engine
Search Engines and Information Retrieval
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 12: Language Models for IR.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 11: Probabilistic Information Retrieval.
Learning to Advertise. Introduction Advertising on the Internet = $$$ –Especially search advertising and web page advertising Problem: –Selecting ads.
JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 30, (2014) BERLIN CHEN, YI-WEN CHEN, KUAN-YU CHEN, HSIN-MIN WANG2 AND KUEN-TYNG YU Department of Computer.
Query session guided multi- document summarization THESIS PRESENTATION BY TAL BAUMEL ADVISOR: PROF. MICHAEL ELHADAD.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
1 Opinion Spam and Analysis (WSDM,08)Nitin Jindal and Bing Liu Date: 04/06/09 Speaker: Hsu, Yu-Wen Advisor: Dr. Koh, Jia-Ling.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
SEEKING STATEMENT-SUPPORTING TOP-K WITNESSES Date: 2012/03/12 Source: Steffen Metzger (CIKM’11) Speaker: Er-gang Liu Advisor: Dr. Jia-ling Koh 1.
A Comparative Study of Search Result Diversification Methods Wei Zheng and Hui Fang University of Delaware, Newark DE 19716, USA
1 Retrieval and Feedback Models for Blog Feed Search SIGIR 2008 Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan Date :
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Improved search for Socially Annotated Data Authors: Nikos Sarkas, Gautam Das, Nick Koudas Presented by: Amanda Cohen Mostafavi.
Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, Eric Lo Speaker: Ruirui Li 1 The University of Hong Kong.
Query Routing in Peer-to-Peer Web Search Engine Speaker: Pavel Serdyukov Supervisors: Gerhard Weikum Christian Zimmer Matthias Bender International Max.
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Exploring Online Social Activities for Adaptive Search Personalization CIKM’10 Advisor : Jia Ling, Koh Speaker : SHENG HONG, CHUNG.
Incident Threading for News Passages (CIKM 09) Speaker: Yi-lin,Hsu Advisor: Dr. Koh, Jia-ling. Date:2010/06/14.
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Presented by Chen Yi-Ting.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,
Partially Supervised Classification of Text Documents by Bing Liu, Philip Yu, and Xiaoli Li Presented by: Rick Knowles 7 April 2005.
Automatic Image Annotation by Using Concept-Sensitive Salient Objects for Image Content Representation Jianping Fan, Yuli Gao, Hangzai Luo, Guangyou Xu.
Web Image Retrieval Re-Ranking with Relevance Model Wei-Hao Lin, Rong Jin, Alexander Hauptmann Language Technologies Institute School of Computer Science.
Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval Ben Cartrette and Praveen Chandar Dept. of Computer and Information Science.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,
A Model for Learning the Semantics of Pictures V. Lavrenko, R. Manmatha, J. Jeon Center for Intelligent Information Retrieval Computer Science Department,
Chapter 23: Probabilistic Language Models April 13, 2004.
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
LOGO Identifying Opinion Leaders in the Blogosphere Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng CIKM 2007 Advisor : Dr. Koh Jia-Ling Speaker : Tu.
Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.
AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam.
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
Post-Ranking query suggestion by diversifying search Chao Wang.
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Using Social Annotations to Improve Language Model for Information Retrieval Shengliang Xu, Shenghua Bao, Yong Yu Shanghai Jiao Tong University Yunbo Cao.
1 What Makes a Query Difficult? David Carmel, Elad YomTov, Adam Darlow, Dan Pelleg IBM Haifa Research Labs SIGIR 2006.
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
PERSONALIZED DIVERSIFICATION OF SEARCH RESULTS Date: 2013/04/15 Author: David Vallet, Pablo Castells Source: SIGIR’12 Advisor: Dr.Jia-ling, Koh Speaker:
DISTRIBUTED INFORMATION RETRIEVAL Lee Won Hee.
1 Random Walks on the Click Graph Nick Craswell and Martin Szummer Microsoft Research Cambridge SIGIR 2007.
Learning to Rank: From Pairwise Approach to Listwise Approach Authors: Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li Presenter: Davidson Date:
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
1 Blog Cascade Affinity: Analysis and Prediction 2009 ACM Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan Date :
Reinforcement Learning for Mapping Instructions to Actions S.R.K. Branavan, Harr Chen, Luke S. Zettlemoyer, Regina Barzilay Computer Science and Artificial.
Finding similar items by leveraging social tag clouds Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: SAC 2012’ Date: October 4, 2012.
CiteData: A New Multi-Faceted Dataset for Evaluating Personalized Search Performance CIKM’10 Advisor : Jia-Ling, Koh Speaker : Po-Hsien, Shih.
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
Using Blog Properties to Improve Retrieval Gilad Mishne (ICWSM 2007)
Opinion spam and Analysis 소프트웨어공학 연구실 G 최효린 1 / 35.
LEARNING IN A PAIRWISE TERM-TERM PROXIMITY FRAMEWORK FOR INFORMATION RETRIEVAL Ronan Cummins, Colm O’Riordan (SIGIR’09) Speaker : Yi-Ling Tai Date : 2010/03/15.
CS590I: Information Retrieval
Relevance and Reinforcement in Interactive Browsing
INF 141: Information Retrieval
Presentation transcript:

1 Blog site search using resource selection 2008 ACM CIKM Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan Date :

2 Outline Introduction Resource selection techniques for blog site search 1. Global Representation 2. Query Generation Maximization 3. Pseudo-Cluster based Selection Experiments Customizing the search Conclusion

3 Introduction A blog site consists of many individual blog postings. Current blog search services focus on retrieving postings but there is also a need to identify relevant blog sites. Blog site search is similar to resource selection in distributed information retrieval, in that the target is to find relevant collections of documents.

4 Introduction In this paper, we focus on search techniques for complete blogs rather than postings. Since the term “blog search” often means “posting search” we instead use the term “blog site search”. As an example of the dfference between blog site and blog posting searches, consider the following two queries: Q1: “Nikon D3 review” Q2: “digital camera reviews”

5 Introduction Finding relevant blog sites can be regarded as selecting relevant collections from a number of collections, in that each blog site can be considered as a collection of postings. Thus, in this paper, we study how to apply resource selection techniques to blog site search and further suggest customized methods to improve retrieval performance.

6 Resource selection techniques for blog site search Resource selection in distributed information retrieval is used to select the most relevant collections from a large number of possible collections. We can employ existing resource selection techniques for blog site search. Our goal is to find relevant collections, i.e. blog sites, rather than relevant documents. Of course, we could use blog site search as a technique for improving posting search.

7 Resource selection techniques for blog site search - Global Representation One of the simplest approaches to resource selection treats a collection as a single, large document. For a blog site search, we can generate a virtual document for a blog site by concatenating all postings in a blog. This virtual document D i for a blog site c i can then be represented using a language model and the query likelihood of the document for a query Q is used as a ranking function.

8 Resource selection techniques for blog site search - Global Representation This technique has some problems. One of the problems is that the virtual document might be a mixture of various topics. We call this technique “global representation” and use it as the first baseline for our experiments. q is a query term of query Q tf q,Di is the number of times term q occurs in virtual document D i |D i | is the length of virtual document D i cf q is the number of times term q occurs in the entire collection |C| is the length of the collection

9 Resource selection techniques for blog site search - Query Generation Maximization “unified utility maximization”, does resource selection to maximize a utility function. The utility function for the high-recall problem is defined as follows: c i is a collection, i.e. {d i1,d i2, ···}, 1,2,... is the # of docs. N C is the number of total collections ˜ n i is the number of the returned documents from the collection c i I(c i ) is an indicator function (1 if c i is selected and 0 otherwise) σ is a selection vector, i.e. [I(c1),I(c2), ···,I(cN C )] R(d ij ) is an estimated probability of relevance of the returned document d ij.

10 Resource selection techniques for blog site search - Query Generation Maximization Our goal is finding a selection vector to maximize the utility function with the limited number of selection. The problem is described as follows: Where N σ is the predetermined number for selection. The optimized solution of this problem is selecting N σ collections with the largest expected number of the relevant documents.

11 Resource selection techniques for blog site search - Query Generation Maximization In order to apply this method to blog site search, we simplify the process as follows. Build an index of postings ignoring which blog site the postings are from. Since we already know statistics of each collection, we can directly translate the query likelihood score to the probability of relevance of the document R(d ij ) for a given query without any estimation process. where P( Q|d ij ) is the query likelihood of the document dij for the query Q.

12 Resource selection techniques for blog site search - Query Generation Maximization In this case, the optimized solution is selecting N σ collections with the highest expected generation of the query, i.e. We induce a ranking function based on the maximization. Simply sum the query likelihood scores of postings from the same blog site in the ranked list which is returned from the index.

13 Resource selection techniques for blog site search - Pseudo-Cluster based Selection Distributed information retrieval using clustering is very e ff ective because clustering redistributes documents in collections and makes topic-based sub-collections.  Our goal is not to find relevant documents using resource selection but to find resources themselves. We create “pseudo-clusters” by ranking blog postings and then grouping highly-ranked postings from the same blog. To represent the pseudo-clusters, we borrow a method from cluster-based retrieval.  One of the biggest problems is that the representation of a cluster can be biased by some documents in the cluster.

14 Resource selection techniques for blog site search - Pseudo-Cluster based Selection To avoid such a problem, we customize a new representation method. This method expresses probability distribution of words over clusters using a geometric mean as follows: We can easily compute a query likelihood of blog site ci by a geometric mean of query likelihoods of postings of blog site ci in the ranked list (under a unigram assumption) as follows. w is a word, g is a cluster d j is a document in cluster g N g is the number of documents in cluster g

15 Resource selection techniques for blog site search - Pseudo-Cluster based Selection  Unfair!! Fix to 

16 Experiments - Design We do experiments for three resource selection techniques. For global representation, we built an index of each blog site after concatenating each posting from the same blog site. We used the query likelihood retrieval model as the ranking method for the global representation. Query generation maximization and pseudo-cluster selection require an initial retrieval. We built an index from all postings and used the query likelihood retrieval model for the initial run.

17 Experiments - Training We performed exhaustive grid search to find optimal parameters for each technique. i.e. theμ parameter for Dirichlet smoothing. We used the normalized discounted cumulative gain (NDCG), the mean average precision (MAP) and the precision at the rank 10 as the evaluation measures.

18 Experiments - Retrieval Performance Table 2 presents that two baselines, global representation and query generation maximization showed similar performance. Pseudo-cluster selection significantly outperformed the other techniques. In a practical sense, query generation maximization and pseudo-cluster selection have an advantage over global representation.

19 Customizing the search Blog site search involves somewhat di ff erent strategies compared to resource selection due to specific features of blog sites. For better resource selection, it is desirable to choose collections which include a greater number of relevant documents. We discuss which customizations may be appropriate by first introducing several types of blog sites

20 Customizing the search - Types of Blog Sites We classified blog sites into three types based on how they are managed and the degree of diversity of the topics covered. Type I is the diary type of blog.  In this type, a blogger usually posts descriptions of their daily life.  it is rare that other postings about similar topics are regularly updated in the blog site. Type II is the news blog.  Documents covering a large number of topics are posted, and many of these blogs are managed by an organization or a company.  Many general Web news sites also contain feed links for their subscribers.Must to prevent.

21 Customizing the search - Types of Blog Sites Type III is the topic-focused type of blog.  This is managed by one or a few individuals and concentrates on a small number of topics.  This type of blog site with a topic specialty exists for many topics. The success of our retrieval methods will depend on how well we are able to find this type of blog site for a given query.

22 Customizing the search - Types of Blog Sites To verify the validity of our categories, we manually classified 100 blog sites randomly selected from the pools for relevance judgments. There were some cases that we could not decide which category a blog site is in because it did not match any category. Most of such blog sites were spam sites, We tagged such sites as “Unclassifiable”. e.g., sites which do not contain real contents but instead are mostly advertisement links.

23 Customizing the search - Types of Blog Sites Three annotators independently labeled the blog sites. By majority voting, we assigned the label which more than two annotators agreed with to each blog site. If all annotators had di ff erent labels for a blog site, then we tagged the site as ”Unclassifiable”. As we expected, the majority of relevant blog sites were in the topic-focused category.

24 Customizing the search - Diversity Penalty We need to penalize Type I and Type II blog sites. To do this, we focus on the fact that they are not topic- centric. Accordingly, we considered a method for penalizing blog sites with diverse topics. We have to decide whether or not the blog site is topic- centric at the global level, i.e. the blog site level. Therefore,the penalty should be able to be used at the global level.

25 Customizing the search - Diversity Penalty by Global Representation The query likelihood score from the global representation could be used as a diversity penalty. We compute the score at the global level. Further, if the blog site deals with the diverse topics, then the distribution of the words in the blog site are probably widely scattered.

26 Customizing the search - Diversity Penalty by Global Representation We analyzed the distribution of the number of postings in the returned blog sites according to the above mentioned techniques. As we can see from the histogram in Figure 1, the global representation definitely returned much fewer blog sites which have a large number of postings. In summary, the query likelihood score can be useful as a measure of diversity of blog sites. Furthermore, this score reflects the relevance of the blog site for the given topic.

27 Customizing the search - Diversity Penalty by Global Representation Accordingly, to supplement the other two resource selection techniques, we can use this score as a penalty factor for diversity by multiplying it by the previous ranking function as follows.

28 Customizing the search - Clarity Score as a Penalty Factor We compute the clarity score by using the Kullback- Leibler divergence between a blog site and the whole collection as follows. We also use this score as a penalty factor for diversity by multiplying it by th previous ranking function as follows.

29 Customizing the search - Diversity Penalty by Random Sampling We randomly sample M postings from a blog site to obtain postings independent of any topic. And compute the query likelihoods for the sampled postings with the given query. If the blog site is topic-centric and relevant to the topic, then the postings are likely to relevant to the topic and the query likelihoods have high values. Therefore, the query likelihoods can be used for estimating diversity of a blog site.

30 Customizing the search - Diversity Penalty by Random Sampling We make a diversity penalty factor with the query likelihoods of the randomly sampled postings in the same way as used in pseudo-cluster selection. We compute a geometric mean of the query likelihoods.

31 Customizing the search - Experimental Results

32 Conclusion We defined the properties of blog sites and the goal of blog site search. Based on this goal, we introduced various resource selection algorithms for site search in blog collections. We classified the types of blog sites and claimed that an appropriate penalty factor reflecting the diversity of the topics of each blog site is required. Our experiments demonstrated that pseudo-cluster selection combined with a global representation penalty outperformed the other methods in all situation.