Published by Ginger Howard. Modified over 8 years ago.
1
A Generation Model to Unify Topic Relevance and Lexicon-based Sentiment for Opinion Retrieval
Min Zhang, Xinyao Ye (Tsinghua University), SIGIR 2008
Summarized & presented by Jung-Yeon Yang, IDS Lab., 2009. 01. 14.
2
Copyright 2008 by CEBT
Introduction
A growing interest in finding out people’s opinions from web data
  – Product survey
  – Advertisement analysis
  – Political opinion polls
TREC started a special track on blog data in 2006
  – Blog opinion retrieval
  – It was the track with the most participants in 2007
This paper focuses on the problem of searching opinions over general topics
3
Related Work
The popular opinion identification approaches
  – Text classification
  – Lexicon-based sentiment analysis
Opinion retrieval
  – Goal: find documents that are both opinionated and relevant to a user’s query
  – One of the key problems: how to combine the opinion score with the relevance score of each document for ranking
4
Related Work (cont.)
Topic-relevance search is carried out with relevance ranking (e.g. TF*IDF ranking)
Ad hoc solutions combine relevance ranking with the opinion detection result
  – 2 steps: rank by relevance, then re-rank by sentiment score
Most existing approaches use a linear combination
  – α * Score_rel + β * Score_opn
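The two-step linear baseline described on this slide can be sketched as follows. This is a minimal illustration, not code from the paper: the function name `linear_rerank`, the default weights, and the dictionary-based score inputs are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def linear_rerank(
    rel: Dict[str, float],
    opn: Dict[str, float],
    alpha: float = 0.7,   # hypothetical weight for the relevance score
    beta: float = 0.3,    # hypothetical weight for the opinion score
    top_k: int = 1000,
) -> List[Tuple[str, float]]:
    """Rank by relevance, keep the top_k documents, then re-rank them
    by alpha * Score_rel + beta * Score_opn."""
    # Step 1: relevance-only ranking, truncated to top_k candidates.
    candidates = sorted(rel, key=lambda d: rel[d], reverse=True)[:top_k]
    # Step 2: linear combination; documents without an opinion score get 0.
    combined = {d: alpha * rel[d] + beta * opn.get(d, 0.0) for d in candidates}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    rel_scores = {"d1": 0.9, "d2": 0.8, "d3": 0.1}
    opn_scores = {"d1": 0.0, "d2": 1.0, "d3": 1.0}
    for doc_id, score in linear_rerank(rel_scores, opn_scores, 0.5, 0.5, top_k=2):
        print(doc_id, round(score, 3))
```

Note that d3 never reaches the re-ranking step when `top_k=2`, however opinionated it is: the first step filters on relevance alone.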
5
Backgrounds
Statistical Language Model (LM)
A probability distribution over word sequences
  – p(“Today is Wednesday”) ≈ 0.001
  – p(“Today Wednesday is”) ≈ 0.0000000000001
  – Unigram: P(w1,w2,w3) = P(w1)*P(w2)*P(w3)
  – Bigram: P(w1,w2,w3) = P(w1)*P(w2|w1)*P(w3|w2)
Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model
An LM allows us to answer questions like:
  – Given that we see “John” and “feels”, how likely are we to see “happy” as opposed to “habit” as the next word? (speech recognition)
  – Given that we observe “baseball” three times and “game” once in a news article, how likely is it that the article is about “sports”? (text categorization, information retrieval)
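To make the unigram/bigram distinction concrete, here is a small sketch using maximum-likelihood counts on a three-sentence toy corpus. The corpus and the function names are invented for illustration; no smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter

# Toy corpus (an assumption for this example only).
corpus = [
    ["today", "is", "wednesday"],
    ["today", "is", "sunny"],
    ["wednesday", "is", "today"],
]

unigrams = Counter(w for sent in corpus for w in sent)
total_tokens = sum(unigrams.values())
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))
# How often each word appears as the left-hand context of a bigram.
contexts = Counter(a for sent in corpus for a in sent[:-1])


def p_unigram(sentence):
    """P(w1..wn) = product of P(wi); word order is ignored."""
    p = 1.0
    for w in sentence:
        p *= unigrams[w] / total_tokens
    return p


def p_bigram(sentence):
    """P(w1..wn) = P(w1) * product of P(wi | wi-1); word order matters."""
    p = unigrams[sentence[0]] / total_tokens
    for a, b in zip(sentence, sentence[1:]):
        if contexts[a] == 0:
            return 0.0
        p *= bigrams[(a, b)] / contexts[a]
    return p


if __name__ == "__main__":
    natural = ["today", "is", "wednesday"]
    scrambled = ["today", "wednesday", "is"]
    print(p_unigram(natural), p_unigram(scrambled))
    print(p_bigram(natural), p_bigram(scrambled))
```

The unigram model cannot tell the two word orders apart, while the bigram model prefers the natural order, which mirrors the two example probabilities on the slide.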
6
Backgrounds (cont.)
The notion of relevance
Similarity: Relevance(Rep(q), Rep(d))
  – Different rep & similarity
      Vector space model
      Prob. distr. model
      …
Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  – Regression model
  – Generative model
      Doc generation: classical prob. model
      Query generation: LM approach
Probabilistic inference: P(d→q) or P(q→d)
  – Different inference system
      Prob. concept space model
      Inference network model
7
Backgrounds (cont.)
Retrieval as Language Model Estimation
Document ranking based on query likelihood
Retrieval problem ≈ estimation of p(w_i|d)
Smoothing is an important issue
  – Problem: if tf = 0, then p(w_i|d) = 0
  – Smoothing methods try to
      discount the probability of words seen in a document
      re-allocate the extra probability so that unseen words will have a non-zero probability
  – Most use a reference model (collection language model) to discriminate unseen words
Formula annotations: document language model, discounted LM estimate, collection language model
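A minimal sketch of query-likelihood ranking with a collection language model as the reference model. The slide does not say which smoothing method its formula uses; this example assumes Jelinek-Mercer interpolation, p(w|d) = (1 − λ)·p_ml(w|d) + λ·p(w|C), and the function name and toy documents are made up.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """log p(q|d) under a unigram document model smoothed with the
    collection model (Jelinek-Mercer interpolation, assumed here)."""
    tf = Counter(doc)
    score = 0.0
    for w in query:
        p_ml = tf[w] / len(doc) if doc else 0.0              # document language model
        p_coll = collection_counts.get(w, 0) / collection_len  # collection language model
        p = (1.0 - lam) * p_ml + lam * p_coll
        if p == 0.0:
            # The word is unseen even in the collection.
            return float("-inf")
        score += math.log(p)
    return score


if __name__ == "__main__":
    d1 = ["blog", "opinion", "good"]
    d2 = ["blog", "news"]
    counts = Counter(d1 + d2)
    n = len(d1) + len(d2)
    q = ["opinion", "blog"]
    print(query_log_likelihood(q, d1, counts, n))
    print(query_log_likelihood(q, d2, counts, n))
```

Without smoothing, d2 would score p(q|d) = 0 because it never mentions "opinion"; with the collection model mixed in, it receives a small but non-zero probability and can still be ranked.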
8
Generation Model for Opinion Retrieval
The Generation Model
  – To find documents that are both opinionated and relevant, and to rank them
Topic Relevance Ranking
Opinion Generation Model and Ranking
Ranking function of generation model for opinion retrieval
9
The Proposed Generation Model
Document generation model
  – How well the document d “fits” the particular query q
  – p(d|q) ∝ p(q|d) p(d)
In opinion retrieval, p(d|q,s)
  – Users’ information need is restricted to only an opinionated subset of the relevant documents
  – This subset is characterized by sentiment expressions s towards topic q
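The step from p(d|q) to p(d|q,s) follows from Bayes' rule. The worked derivation below uses only the quantities named on this slide; reading the first factor as the opinion part and the remaining factors as the topic-relevance part is an interpretation suggested by the later slide titles, not a formula quoted from the paper.

```latex
\begin{align*}
p(d \mid q, s) &= \frac{p(s \mid d, q)\, p(d \mid q)}{p(s \mid q)} \\
               &\propto p(s \mid d, q)\, p(d \mid q)
                 && \text{($p(s \mid q)$ does not depend on $d$)} \\
               &\propto p(s \mid d, q)\, p(q \mid d)\, p(d)
                 && \text{(using $p(d \mid q) \propto p(q \mid d)\, p(d)$)}
\end{align*}
```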
10
The Proposed Generation Model (cont.)
This work discusses lexicon-based sentiment analysis
Assumptions
  – The latent variable s is estimated with a pre-constructed bag-of-words sentiment thesaurus
  – All sentiment words s_i are uniformly distributed
The final generation model
Formula annotation: quadratic relationship
11
Topic Relevance Ranking I_rel(d,q)
The Binary Independence Retrieval (BIR) model is one of the best-known models in this branch
Heuristic ranking function BM25
  – TREC tests have shown this to be the best of the known probabilistic weighting schemes
Formula annotation: IDF(w)
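For reference, a compact sketch of a BM25 scorer. The slide's own formula did not survive extraction, so this uses a commonly cited Okapi BM25 form with an IDF variant that adds 1 inside the logarithm to keep it non-negative; the parameter values k1 = 1.2 and b = 0.75 are conventional defaults, not values taken from the paper.

```python
import math
from collections import Counter


def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query, using the
    list of all tokenized documents for IDF and average length."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    tf = Counter(doc)
    score = 0.0
    for w in set(query):
        df = sum(1 for d in docs if w in d)            # document frequency
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Term-frequency saturation with document-length normalization.
        denom = tf[w] + k1 * (1.0 - b + b * len(doc) / avgdl)
        score += idf * tf[w] * (k1 + 1.0) / denom
    return score


if __name__ == "__main__":
    docs = [
        ["blog", "opinion", "opinion"],
        ["blog", "news"],
        ["sports", "news"],
    ]
    for d in docs:
        print(d, round(bm25_score(["opinion"], d, docs), 4))
```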
12
Opinion Generation Model I_op(d,q,s)
I_op(d,q,s) focuses on the question: given query q, how likely is document d to generate a sentiment expression s?
Sparseness problem → smoothing
  – Jelinek-Mercer smoothing is applied
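In its generic form, Jelinek-Mercer smoothing linearly interpolates a maximum-likelihood estimate with a background model. Applied to a sentiment word s_i it would look like the expression below; the use of a collection-level background term p(s_i | C) and the symbol λ are assumptions here, since the slide's formula was not preserved.

```latex
\[
p(s_i \mid d, q) \;=\; (1 - \lambda)\, p_{ml}(s_i \mid d, q) \;+\; \lambda\, p(s_i \mid C),
\qquad 0 \le \lambda \le 1
\]
```

With λ > 0, a sentiment word that never co-occurs with the query in d still receives non-zero probability, which addresses the sparseness problem.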
13
Opinion Generation Model I_op(d,q,s) (cont.)
Use the co-occurrence of s_i and q inside d within a window W as the ranking measure of P_ml(s_i|d,q)
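A small sketch of window-based co-occurrence counting. It assumes the window is measured in tokens on either side of a query-term occurrence and returns a raw count; how the paper normalizes this count into P_ml(s_i|d,q) is not shown on the slide, so no normalization is attempted. The function name and example sentence are invented.

```python
def cooccurrence_count(tokens, query_terms, sentiment_word, window=5):
    """Count occurrences of sentiment_word that lie within `window`
    tokens of at least one query-term occurrence in the document."""
    query_positions = [i for i, t in enumerate(tokens) if t in query_terms]
    count = 0
    for i, t in enumerate(tokens):
        if t == sentiment_word and any(abs(i - j) <= window for j in query_positions):
            count += 1
    return count


if __name__ == "__main__":
    text = ("the new camera is great but the battery life "
            "which everyone mentions is terrible")
    tokens = text.split()
    print(cooccurrence_count(tokens, {"camera"}, "great", window=5))
    print(cooccurrence_count(tokens, {"camera"}, "terrible", window=5))
```

With a window of 5, "great" counts towards the query "camera" but "terrible" does not, since it is 11 tokens away and more plausibly describes the battery.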
14
Ranking function of generation model for opinion retrieval
The final ranking function
  – Logarithm normalization reduces the impact of the imbalance between #(sentiment words) and #(query terms)
Formula annotation: only use the topic relevance
15
Experimental Setup
Data set
  – TREC Blog 06 & 07
  – 100,649 blogs collected over 2.5 months
Strategy: find the top 1000 relevant documents, then re-rank the list with the proposed model
Models
  – General linear combination
  – Proposed generation model with smoothing
  – Proposed generation model with smoothing and normalization
16
Experimental Setup (cont.)
Sentimental Lexicons

No.  Thesaurus name    Size   Description
1    HowNet            4621   English translation of pos/neg Chinese words from HowNet
2    WordNet           7426   Selected words from WordNet with seeds
3    Intersection      1413   1 ∩ 2
4    Union             10634  1 ∪ 2
5    General Inquirer  3642   All words in the positive and negative categories
6    SentiWordNet      3133   Words with a positive or negative score above 0.6
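The Intersection and Union lexicon sizes are consistent with the HowNet and WordNet sizes under inclusion-exclusion, |A ∪ B| = |A| + |B| − |A ∩ B|. The short check below uses the sizes from the table; the two tiny word sets are invented purely to show the same set operations on actual words.

```python
# Sizes taken from the lexicon table.
hownet_size = 4621
wordnet_size = 7426
intersection_size = 1413

# Inclusion-exclusion gives the size of the union lexicon.
union_size = hownet_size + wordnet_size - intersection_size
print(union_size)

# Toy lexicons (hypothetical words) illustrating the same operations.
toy_hownet = {"good", "bad", "excellent"}
toy_wordnet = {"good", "awful", "excellent", "poor"}
toy_intersection = toy_hownet & toy_wordnet
toy_union = toy_hownet | toy_wordnet
print(sorted(toy_intersection), sorted(toy_union))
```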
17
Experiment Results
18
Conclusion
Proposed a formal generation model for opinion retrieval
  – Topic relevance and sentiment scores are integrated with a quadratic combination
  – Opinion generation ranking functions are derived
  – Discussed the roles of the sentiment lexicon and the matching window
It is a general model for opinion retrieval
  – Uses domain-independent lexicons
  – No assumption is made about the nature of blog-structured text
19
My opinions
Good points
  – Proposed an opinion retrieval model and ranking function
  – Ran a variety of experiments
Lacking points
  – Only finds documents that contain some opinion
  – Does not tell what kinds of opinions the documents contain
  – Does not use the sentiment polarity of words