Published by Maud Goodwin. Modified over 9 years ago.
1
Using Social Annotations to Improve Language Model for Information Retrieval
Shengliang Xu, Shenghua Bao, Yong Yu
Shanghai Jiao Tong University
Yunbo Cao
Microsoft Research Asia
CIKM ’07 poster
2
Introduction
Language modeling for IR (LMIR) has proved to be an efficient and effective way of modeling relevance between queries and documents
Two critical problems in LMIR: data sparseness and the term independence assumption
In recent years, many web sites have emerged that provide folksonomy services, e.g. del.icio.us
This paper explores the use of social annotations in addressing these two problems
3
Properties of Social Annotations
The keyword property
–Social annotations can be seen as good keywords for describing the respective documents from various aspects
–The concatenation of all the annotations of a document is a summary of the document from the users' perspective
The structure property
–An annotation may be associated with multiple documents and vice versa
–The structure of social annotations can be used to explore two types of similarity: document-document similarity and annotation-annotation similarity
4
Deriving Data from Social Annotations
On the basis of social annotations, three sets of data can be derived
–A summary dataset: $sum_{ann} = \{d_{s_1}, d_{s_2}, \dots, d_{s_n}\}$, where $d_{s_i}$ is the summary of the $i$-th document
–A dataset of document similarity: $sim_{doc} = \{(doc_i, doc_j, simscore\_doc_{ij}) \mid 0 \le i \le j \le n\}$
–A dataset of annotation similarity: $sim_{ann} = \{(ann_i, ann_j, simscore\_ann_{ij}) \mid 0 \le i \le j \le m\}$
(Define $t$ as a triple in $sim_{doc}$ or $sim_{ann}$; $t[i]$ means the $i$-th dimension of $t$)
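The three derived datasets can be illustrated with a minimal Python sketch. The toy `bookmarks` input and all function names are hypothetical, and Jaccard overlap is used only as a simple stand-in similarity; it is not the SSR or SMM measure used in the experiments. Self-pairs (i = j) are omitted for brevity.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy input: one (document id, annotation) pair per tagging event.
bookmarks = [
    ("doc1", "python"), ("doc1", "programming"), ("doc1", "python"),
    ("doc2", "python"), ("doc2", "tutorial"),
    ("doc3", "cooking"),
]

def build_summaries(pairs):
    """Summary dataset: concatenate every annotation attached to a document."""
    summaries = defaultdict(list)
    for doc, ann in pairs:
        summaries[doc].append(ann)
    return dict(summaries)

def jaccard(a, b):
    """Stand-in similarity between two sets (assumption, not SSR/SMM)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_triples(neighbours):
    """Return (x_i, x_j, score) triples for unordered pairs with score > 0."""
    triples = []
    for x, y in combinations(sorted(neighbours), 2):
        score = jaccard(neighbours[x], neighbours[y])
        if score > 0:
            triples.append((x, y, score))
    return triples

sum_ann = build_summaries(bookmarks)

# Structure property: documents link to annotations and vice versa.
doc_to_anns = {doc: set(anns) for doc, anns in sum_ann.items()}
ann_to_docs = defaultdict(set)
for doc, ann in bookmarks:
    ann_to_docs[ann].add(doc)

sim_doc = similarity_triples(doc_to_anns)   # document-document similarity
sim_ann = similarity_triples(ann_to_docs)   # annotation-annotation similarity
```

Each triple follows the slide's convention: `t[0]` and `t[1]` are the two items and `t[2]` is the similarity score.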
5
Language Annotation Model (LAM)
Figure. Bayesian network for generating a term in LAM
6
Content Model (CM)
Content Unigram Model (CUM)
–Match the query against the literal content of a document
Topic Cluster Model (TCM)
–Match the query against the latent topic of a document
–Assume that documents similar to document d more or less share the same latent topic as d
–The term distribution over d's topic cluster can be used to smooth d's language model
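One way to picture the topic cluster idea is the sketch below. It is an assumption about how a cluster distribution might be built, not the construction from the slides: the terms of documents similar to d are pooled with d's own terms, weighted by similarity score. The function names and toy data are hypothetical.

```python
from collections import Counter

def topic_cluster_counts(doc_id, doc_terms, sim_doc):
    """Pool term counts of doc_id with those of its similar documents.

    sim_doc holds (doc_i, doc_j, score) triples; each neighbour's counts
    are weighted by its similarity score (an illustrative choice).
    """
    pooled = Counter(doc_terms[doc_id])  # the document itself, weight 1
    for a, b, score in sim_doc:
        if a == b or score <= 0:
            continue
        if a == doc_id:
            other = b
        elif b == doc_id:
            other = a
        else:
            continue
        for term, count in Counter(doc_terms[other]).items():
            pooled[term] += score * count
    return pooled

def unigram_model(counts):
    """Normalise (possibly fractional) counts into a term distribution."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}

# Hypothetical toy data.
doc_terms = {
    "d1": ["social", "tagging"],
    "d2": ["social", "bookmark", "bookmark"],
    "d3": ["recipe"],
}
sim_doc = [("d1", "d2", 0.5)]

p_cluster = unigram_model(topic_cluster_counts("d1", doc_terms, sim_doc))
```

Here "bookmark" never occurs in d1 but receives non-zero probability in d1's cluster distribution, which is the kind of smoothing effect the slide describes.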
7
Annotation Model (AM)
Assume AM contains two sub-models: an independency model and a dependency model
Annotation Unigram Model (AUM)
–A unigram language model that matches query terms against annotated summaries
Annotation Dependency Model (ADM)
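The transcript gives no formula for ADM. One reading that is consistent with the two probabilities listed on the Parameter Estimation slide, $P(q_i \mid a)$ and $P(a \mid d_s)$, is a sum over the annotations in a document's summary. This is an assumption made for illustration, not a formula taken from the slides:

```latex
P_{adm}(q_i \mid d_s) \approx \sum_{a \in d_s} P(q_i \mid a)\, P(a \mid d_s)
```

Under this reading, a query term can match a document through any annotation related to the term, which is how dependencies between terms would enter the model.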
8
Parameter Estimation
5 model probabilities $\{P_{cum}(q_i \mid d), P_{aum}(q_i \mid d_s), P_{tcm}(q_i \mid d), P(q_i \mid a), P(a \mid d_s)\}$ and 3 mixture parameters $(\lambda_c, \lambda_a, \lambda_d)$ have to be estimated
Use the EM algorithm to estimate $\lambda_c$, $\lambda_a$, and $\lambda_d$
Dirichlet prior smoothing for CUM, AUM, and TCM
$P_{tcm}(q_i \mid d)$ is estimated using a unigram language model on the topic clusters
$P(a \mid d_s)$ is approximated by maximum likelihood estimation
Approximate $P(q_i \mid a)$:
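The EM step for mixture weights can be sketched as follows. This is a generic EM update for a flat mixture whose component probabilities are held fixed; how the three parameters are actually nested in LAM is not shown in the transcript, so the flat layout and the function name are assumptions.

```python
def em_mixture_weights(component_probs, n_iter=50):
    """Estimate mixture weights with EM, keeping the components fixed.

    component_probs: one row per observed term; row[k] is the probability
    that component k assigns to that term.
    """
    k = len(component_probs[0])
    weights = [1.0 / k] * k  # start from uniform weights
    for _ in range(n_iter):
        totals = [0.0] * k
        for row in component_probs:
            mix = sum(w * p for w, p in zip(weights, row))
            if mix == 0.0:
                continue  # no component explains this term; skip it
            # E-step: responsibility of each component for this term.
            for j in range(k):
                totals[j] += weights[j] * row[j] / mix
        norm = sum(totals)
        if norm == 0.0:
            break
        # M-step: new weight = average responsibility.
        weights = [t / norm for t in totals]
    return weights

# Hypothetical probabilities of three observed terms under two components.
rows = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
weights = em_mixture_weights(rows)
```

With these toy numbers the first component explains two of the three terms better, so it ends up with the larger weight.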
9
Experiment Setup
1,736,268 web pages with 269,566 different annotations were crawled from del.icio.us
80 queries with 497 relevant documents were manually collected by a group of CS students
Merged Source Model (MSM) as baseline
–Merge each document's annotations into its content and apply a Dirichlet prior smoothed unigram language model to the merged source
SocialSimRank (SSR) and Separable Mixture Model (SMM) are used to measure the similarity between documents and between annotations
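The MSM baseline can be sketched as a query-likelihood scorer over the merged text. The code uses the standard Dirichlet prior smoothing formula, P(w|d) = (c(w, d) + μ·P(w|C)) / (|d| + μ). The toy documents, the function name, and the value of `mu` are hypothetical; the slides do not report the smoothing parameter.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(query, doc_tokens, coll_counts, coll_len, mu):
    """Log query likelihood under a Dirichlet prior smoothed unigram model."""
    counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in query:
        p_coll = coll_counts.get(term, 0) / coll_len
        p = (counts[term] + mu * p_coll) / (doc_len + mu)
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score

# Hypothetical toy collection with content and annotations kept apart.
docs = {
    "d1": {"content": ["language", "model", "retrieval"],
           "annotations": ["ir", "search"]},
    "d2": {"content": ["pasta", "recipe"],
           "annotations": ["cooking", "food"]},
}

# MSM: merge each document's annotations into its content.
merged = {d: parts["content"] + parts["annotations"] for d, parts in docs.items()}

coll_counts = Counter(tok for toks in merged.values() for tok in toks)
coll_len = sum(coll_counts.values())

query = ["ir", "retrieval"]
scores = {d: dirichlet_log_likelihood(query, toks, coll_counts, coll_len, mu=10.0)
          for d, toks in merged.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

The query term "ir" appears only in d1's annotations, so d1 benefits from it only because the annotations were merged into the scored text.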
10
SSR and SMM
Table. Top 3 most similar annotations of 5 sample annotations exploited by SSR and SMM
11
Retrieval Performance
Table. MAP of each model
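For reference, MAP (mean average precision) is the mean over queries of each query's average precision. A minimal sketch of the standard computation follows; the function names and toy rankings are hypothetical and unrelated to the reported results.

```python
def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical toy runs for two queries.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),
    (["x", "y"], {"y"}),
]
map_score = mean_average_precision(runs)
```

For the first query the relevant documents sit at ranks 1 and 3, giving precisions 1/1 and 2/3; for the second, the single relevant document sits at rank 2.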
12
Conclusions and Future Work
The problem of integrating social annotations into LMIR is studied
Two properties of social annotations are studied and effectively utilized to alleviate the data sparseness problem and relax the term independence assumption
In future work, we plan to explore more features of social annotations and more sophisticated ways of using the annotations