Using Social Annotations to Improve Language Model for Information Retrieval. Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University), Yunbo Cao (Microsoft Research Asia).

Presentation transcript:

Using Social Annotations to Improve Language Model for Information Retrieval. Shengliang Xu, Shenghua Bao, Yong Yu (Shanghai Jiao Tong University), Yunbo Cao (Microsoft Research Asia). CIKM '07 poster.

Introduction The language modeling approach to IR has proven to be an efficient and effective way of modeling the relevance between queries and documents. Two problems are critical in LMIR: data sparseness and the term independence assumption. In recent years, many web sites providing folksonomy services have emerged, e.g. del.icio.us. This paper explores the use of social annotations in addressing these two problems.

Properties of Social Annotations The keyword property –Social annotations can be seen as good keywords that describe the respective documents from various aspects –The concatenation of all the annotations of a document is a summary of the document from the users' perspective The structure property –An annotation may be associated with multiple documents and vice versa –The structure of social annotations can be used to explore two types of similarity: document-document similarity and annotation-annotation similarity

Deriving Data from Social Annotations On the basis of social annotations, three sets of data can be derived –A summary dataset: sum_ann = {d_s1, d_s2, …, d_sn}, where d_si is the summary of the i-th document –A dataset of document similarity: sim_doc = {(doc_i, doc_j, simscore_doc_ij) | 0 ≤ i ≤ j ≤ n} –A dataset of annotation similarity: sim_ann = {(ann_i, ann_j, simscore_ann_ij) | 0 ≤ i ≤ j ≤ m} (Define t as a triple of sim_doc or sim_ann; t[i] denotes the i-th dimension of t)
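As an illustration, deriving the summary dataset and the document-similarity triples can be sketched in Python. The tag data and the cosine measure here are hypothetical stand-ins (the paper itself uses SSR and SMM for similarity, discussed later); sim_ann would be built analogously over annotations.

```python
from math import sqrt

# Hypothetical toy input: doc_id -> list of social annotations (tags).
annotations = {
    "d1": ["python", "tutorial", "programming"],
    "d2": ["python", "programming", "code"],
    "d3": ["travel", "photos"],
}

# Summary dataset sum_ann: the concatenation of a document's annotations
# serves as a user-perspective summary of the document.
sum_ann = {doc: " ".join(tags) for doc, tags in annotations.items()}

def cosine(a, b):
    """Cosine similarity between two bags of tags (stand-in measure)."""
    if not a or not b:
        return 0.0
    common = set(a) & set(b)
    return len(common) / sqrt(len(set(a)) * len(set(b)))

# Document-similarity dataset sim_doc: one scored triple per document pair.
docs = sorted(annotations)
sim_doc = [
    (docs[i], docs[j], cosine(annotations[docs[i]], annotations[docs[j]]))
    for i in range(len(docs)) for j in range(i + 1, len(docs))
]
```

Each triple t in sim_doc then exposes its two documents as t[0], t[1] and its score as t[2], matching the t[i] notation above.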

Language Annotation Model (LAM) Figure. Bayesian network for generating a term in LAM
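The figure shows a query term being generated by a mixture of the sub-models described next. A minimal sketch of such a mixture score follows; the exact wiring of the weights λ_c, λ_a, λ_d among the four sub-models is an assumption here, not read off the figure.

```python
# Sketch of a LAM-style mixture over sub-model term distributions.
# How the paper's three mixture weights (lam_c, lam_a, lam_d) are wired
# in the Bayesian network is assumed, not taken from the slides.
def lam_prob(w, p_cum, p_tcm, p_aum, p_adm,
             lam_c=0.4, lam_a=0.3, lam_d=0.2):
    lam_t = 1.0 - lam_c - lam_a - lam_d  # residual weight for TCM
    return (lam_c * p_cum.get(w, 0.0) + lam_t * p_tcm.get(w, 0.0)
            + lam_a * p_aum.get(w, 0.0) + lam_d * p_adm.get(w, 0.0))
```

If every sub-model is a proper distribution and the weights sum to one, the mixture is again a proper distribution over terms.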

Content Model (CM) Content Unigram Model (CUM) –Matches the query against the literal content of a document Topic Cluster Model (TCM) –Matches the query against the latent topic of a document –Assumes that the documents similar to document d more or less share the latent topic of d –The term distribution over d's topic cluster can be used to smooth d's language model
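The TCM smoothing idea can be sketched as a linear interpolation between a document's maximum-likelihood unigram model and the model of its topic cluster; the interpolation form and the weight are assumptions for illustration, not the paper's estimator.

```python
from collections import Counter

def mle(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cluster_smoothed(doc_tokens, cluster_tokens, lam=0.7):
    """Smooth a document model with its topic cluster's term distribution
    (sketch; the mixture weight lam is an assumed value)."""
    p_doc = mle(doc_tokens)
    p_cluster = mle(cluster_tokens)
    vocab = set(p_doc) | set(p_cluster)
    return {w: lam * p_doc.get(w, 0.0) + (1 - lam) * p_cluster.get(w, 0.0)
            for w in vocab}

doc = "language model retrieval".split()
cluster = "language model retrieval ranking query model".split()
p = cluster_smoothed(doc, cluster)
```

The point of the smoothing is visible in the result: a term like "ranking" that never occurs in the document still receives nonzero probability from the cluster.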

Annotation Model (AM) AM contains two sub-models: an independency model and a dependency model Annotation Unigram Model (AUM) –A unigram language model that matches query terms against annotated summaries Annotation Dependency Model (ADM)
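Using the two probabilities P(q_i|a) and P(a|d_s) named in the parameter-estimation slide, the annotation-based generation of a term can be sketched as marginalizing over annotations, P(w|d_s) = Σ_a P(w|a)·P(a|d_s). All distributions below are toy numbers, not values from the paper.

```python
# Hypothetical term distributions per annotation: P(w|a).
p_w_given_a = {
    "python": {"python": 0.6, "programming": 0.3, "code": 0.1},
    "travel": {"travel": 0.7, "photos": 0.3},
}

# Hypothetical annotation distribution of one summary: P(a|d_s).
p_a_given_ds = {"python": 0.8, "travel": 0.2}

def annotation_prob(word):
    """P(w|d_s) = sum over annotations a of P(w|a) * P(a|d_s)."""
    return sum(p_w_given_a.get(a, {}).get(word, 0.0) * pa
               for a, pa in p_a_given_ds.items())
```

A term such as "programming" that never appears literally in the summary can still be generated here through a related annotation, which is how the model relaxes exact term matching.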

Parameter Estimation Five model probabilities { P_cum(q_i|d), P_aum(q_i|d_s), P_tcm(q_i|d), P(q_i|a), P(a|d_s) } and three mixture parameters (λ_c, λ_a, λ_d) have to be estimated The EM algorithm is used to estimate λ_c, λ_a, and λ_d The Dirichlet prior smoothing method is used for CUM, AUM, and TCM P_tcm(q_i|d) is estimated using a unigram language model on the topic clusters P(a|d_s) is approximated by maximum likelihood estimation P(q_i|a) is also approximated
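Dirichlet prior smoothing has the standard form P(w|d) = (c(w,d) + μ·P(w|C)) / (|d| + μ), where P(w|C) is the collection language model. A short sketch (the μ value is assumed, not the paper's setting):

```python
from collections import Counter

def dirichlet_smooth(doc_tokens, collection_tokens, mu=2000.0):
    """Return a function w -> P(w|d) under Dirichlet prior smoothing:
    (c(w,d) + mu * P(w|C)) / (|d| + mu). mu is an assumed value."""
    c_doc = Counter(doc_tokens)
    c_col = Counter(collection_tokens)
    n_col = sum(c_col.values())
    n_doc = len(doc_tokens)

    def p(w):
        p_c = c_col.get(w, 0) / n_col  # collection model P(w|C)
        return (c_doc.get(w, 0) + mu * p_c) / (n_doc + mu)

    return p
```

Because the pseudo-counts μ·P(w|C) themselves sum to μ over the collection vocabulary, the smoothed probabilities still sum to one over that vocabulary.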

Experiment Setup 1,736,268 web pages with 269,566 distinct annotations were crawled from del.icio.us 80 queries with 497 relevant documents were manually collected by a group of CS students Merged Source Model (MSM) as baseline –Merge each document's annotations into its content and implement a Dirichlet prior smoothed unigram language model on the merged source SocialSimRank (SSR) and Separable Mixture Model (SMM) are used to measure the similarity between documents and between annotations
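The MSM baseline can be sketched as follows: concatenate a document's annotations to its content, then score the query by its log-likelihood under a Dirichlet-smoothed unigram model of the merged text. This is an illustrative reconstruction of the baseline's description, with an assumed μ.

```python
from collections import Counter
from math import log

def msm_score(query_tokens, content, annotations, collection, mu=2000.0):
    """Query log-likelihood under a Dirichlet-smoothed unigram model
    built on the merged content + annotations source (sketch; mu assumed)."""
    merged = content + annotations  # the "merged source"
    c_d = Counter(merged)
    c_col = Counter(collection)
    n_col = sum(c_col.values())
    score = 0.0
    for w in query_tokens:
        p_c = c_col.get(w, 0) / n_col
        if p_c == 0.0:
            continue  # skip terms unseen in the collection
        score += log((c_d.get(w, 0) + mu * p_c) / (len(merged) + mu))
    return score
```

With this baseline, a document whose annotations contain a query term is ranked above an otherwise identical document without it, which is exactly the effect the annotation-aware models refine.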

SSR and SMM Table. Top 3 most similar annotations of 5 sample annotations exploited by SSR and SMM

Retrieval Performance Table. MAP of each model

Conclusions and Future Work The problem of integrating social annotations into LMIR is studied Two properties of social annotations are studied and effectively utilized to alleviate the data sparseness problem and relax the term independence assumption In the future, more features of social annotations and more sophisticated ways of using the annotations will be explored