CIKM
Opinion Retrieval from Blogs
Wei Zhang (1), Clement Yu (1), Weiyi Meng (2)
(1) Department of Computer Science, University of Illinois at Chicago
(2) Department of Computer Science, Binghamton University

Outline
- Overview of opinion retrieval
- Topic retrieval
- Opinion identification
- Ranking documents by opinion similarity
- Experimental results

Overview of Opinion Retrieval
- Opinion retrieval: given a query, find documents that express subjective opinions about the query
- Example query: “book”
  - Relevant: “This is a very good book.”
  - Irrelevant: “This book has 123 pages.”

Overview of Opinion Retrieval
- Introduced at the TREC 2006 Blog Track
  - 14 groups, 57 submitted runs in TREC 2006
  - 104 runs in TREC 2007 (ongoing)
- Key problems
  - Opinion features
  - Query-related opinions
  - Ranking the retrieved documents

Our Algorithm
[Diagram: Query + Document set -> Retrieved documents -> Opinionative documents -> Query-related opinionative documents]

Topic Retrieval
- Retrieve query-relevant documents; no opinion is involved at this stage
- Features
  - Phrase recognition
  - Query expansion
  - Two document-query similarities

Topic Retrieval – Phrase Recognition
- Captures the semantic relationship among the query words
- Used for the phrase similarity calculation
- 4 phrase types
  - Proper noun: “University of Lisbon”
  - Dictionary phrase: “computer science”
  - Simple phrase: “white car”
  - Complex phrase: “small white car”

Topic Retrieval – Query Expansion
- Find synonyms
  - “wto” -> “world trade organization”
  - Synonyms get the same importance as the original term
- Add additional terms
  - “wto” -> negotiate, agreements, tariffs, ...
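
The two expansion steps above can be sketched as a weighted term map. This is only an illustration: the function name `expand_query`, the dictionary inputs, and the `related_weight` default are hypothetical, and the slide does not say how the additional terms are weighted (only that synonyms get the same importance as the original term).

```python
def expand_query(terms, synonyms, related, related_weight=0.5):
    """Return a {term: weight} map for an expanded query.

    Synonyms receive the same weight as the original term (per the slide).
    The weight of additional related terms is an ASSUMPTION of this sketch.
    """
    weights = {}
    for term in terms:
        weights[term] = 1.0
        for syn in synonyms.get(term, []):
            weights[syn] = 1.0  # same importance as the original term
        for extra in related.get(term, []):
            # Do not overwrite a term that is already an original or synonym.
            weights.setdefault(extra, related_weight)
    return weights


expanded = expand_query(
    ["wto"],
    synonyms={"wto": ["world trade organization"]},
    related={"wto": ["negotiate", "agreements", "tariffs"]},
)
```
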
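
Topic Retrieval – Similarity
- Sim(Query, Doc) = (Sim_P, Sim_T): phrase similarity and term similarity
- Phrase similarity (Sim_P)
  - Based on whether or not the document has each query phrase
  - Sim_P = sum( idf(P_i) ) over the query phrases P_i found in the document
- Term similarity (Sim_T)
  - Sum of the Okapi scores of all the query terms
- Document ranking
  - D1 is ranked higher than D2 if (Sim_P1 > Sim_P2) OR (Sim_P1 = Sim_P2 AND Sim_T1 > Sim_T2)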
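
The ranking rule above is a lexicographic comparison: phrase similarity decides first, and term similarity only breaks ties. A minimal sketch follows, assuming the idf values and Okapi term scores have already been computed elsewhere; the names `DocScore`, `phrase_similarity`, and `rank_documents` are illustrative and not taken from the system.

```python
from typing import NamedTuple


class DocScore(NamedTuple):
    doc_id: str
    phrase_sim: float  # sum of idf over the query phrases present in the doc
    term_sim: float    # sum of Okapi scores of the query terms (precomputed)


def phrase_similarity(query_phrase_idf, doc_phrases):
    """Sum idf(P_i) over the query phrases that occur in the document."""
    return sum(idf for phrase, idf in query_phrase_idf.items()
               if phrase in doc_phrases)


def rank_documents(scores):
    """Sort by phrase similarity first, then by term similarity (descending)."""
    return sorted(scores, key=lambda s: (s.phrase_sim, s.term_sim), reverse=True)


idf = {"computer science": 2.0, "white car": 1.5}
docs = [
    DocScore("d1", phrase_similarity(idf, {"computer science"}), 3.0),
    DocScore("d2", phrase_similarity(idf, {"computer science"}), 5.0),
    DocScore("d3", phrase_similarity(idf, {"white car"}), 9.0),
]
ranking = [d.doc_id for d in rank_documents(docs)]
```

Here `d3` has the largest term similarity but still ranks last, because its phrase similarity is lower than that of `d1` and `d2`.
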

Opinion Identification
[Diagram: Subjective training data + Objective training data -> Feature selection -> SVM classifier; retrieved documents (from topic retrieval) -> SVM classifier -> opinionative documents (to opinion ranking)]

Opinion Identification – Training Data
- Subjective training data
  - Review web sites
  - Documents having opinionative phrases
- Objective training data
  - Dictionary entries
  - Documents not having opinionative phrases

Opinion Identification – Feature Selection
- Features: the words expressing opinions
- Pearson’s chi-square test
  - Tests the independence between the subjectivity label and a word via a contingency table
  - The table entries are counts of sentences
- Candidate features: unigrams and bigrams
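
For a single candidate word, the test above reduces to a 2x2 contingency table of sentence counts (subjective/objective by contains/does not contain). The sketch below uses the standard closed form of Pearson's chi-square statistic for a 2x2 table; the function names and the whitespace tokenization are assumptions made for illustration, and only unigrams are handled (bigrams would be counted the same way).

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]].

    a: subjective sentences containing the term
    b: objective sentences containing the term
    c: subjective sentences without the term
    d: objective sentences without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty; the statistic is undefined
    return n * (a * d - b * c) ** 2 / denom


def term_chi_square(term, subjective_sentences, objective_sentences):
    """Build the table by counting sentences, then score the term."""
    a = sum(term in s.lower().split() for s in subjective_sentences)
    b = sum(term in s.lower().split() for s in objective_sentences)
    c = len(subjective_sentences) - a
    d = len(objective_sentences) - b
    return chi_square_2x2(a, b, c, d)


subjective = ["this is a good book", "a good movie", "bad plot"]
objective = ["the book has pages", "it has chapters"]
score_good = term_chi_square("good", subjective, objective)
```

A larger statistic means the word's occurrence depends more strongly on the subjectivity label, which makes it a better candidate feature.
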

Opinion Identification – Classifier
- A support vector machine (SVM) classifier
[Diagram: Objective sentences + Subjective sentences, with the selected features -> Feature vector representation -> Training -> SVM classifier]
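
One possible feature vector representation is sketched below. It assumes binary presence/absence values over the selected unigram and bigram features; the slide does not state which weighting is used, and the names `ngrams` and `to_feature_vector` are illustrative. The resulting vectors would then be passed to an SVM training routine.

```python
def ngrams(tokens):
    """Return the unigrams and bigrams of a token list."""
    unigrams = list(tokens)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return unigrams + bigrams


def to_feature_vector(sentence, features):
    """Map a sentence to a 0/1 vector, one entry per selected feature.

    Binary values are an ASSUMPTION of this sketch, not the slide's claim.
    """
    present = set(ngrams(sentence.lower().split()))
    return [1 if feature in present else 0 for feature in features]


selected = ["good", "very good", "pages"]
vector = to_feature_vector("This is a very good book", selected)
```
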

Opinion Identification – Classifier
- Apply the SVM classifier to each sentence of a document
[Diagram: Document -> Sentence 1, Sentence 2, ..., Sentence n -> SVM classifier -> Label 1: objective, Label 2: subjective, ..., Label n: objective]
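
The sentence-by-sentence labeling above can be sketched as follows. The regular-expression sentence splitter is a simplification, and `toy_classifier` is a hypothetical keyword-based stand-in for the trained SVM, used only so the example runs on its own.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def label_sentences(text, classify):
    """Apply a sentence-level classifier to every sentence of a document."""
    return [(sentence, classify(sentence)) for sentence in split_sentences(text)]


def toy_classifier(sentence):
    """Hypothetical stand-in for the SVM: flags a few opinion cue words."""
    cues = {"good", "bad", "great", "terrible"}
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    return "subjective" if words & cues else "objective"


document = "This book has 123 pages. This is a very good book."
labeled = label_sentences(document, toy_classifier)
labels = [label for _, label in labeled]
```
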

Opinion Similarity – Query-Related Opinions
- Find the query-related opinions
[Diagram: a document containing the query and an opinionative sentence inside a text window]
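
A minimal sketch of the text-window idea: an opinionative sentence counts as query-related when a query term appears within a window around it. The window unit (sentences), the default `window` size, and the substring matching are all assumptions of this sketch; the slide does not specify them.

```python
def query_related_opinions(sentences, labels, query_terms, window=1):
    """Return indices of subjective sentences that have a query term nearby.

    The window covers `window` sentences on each side of the subjective
    sentence, plus the sentence itself. Matching is a plain lowercase
    substring test, so multi-word query phrases also work.
    """
    related = []
    for i, label in enumerate(labels):
        if label != "subjective":
            continue
        start = max(0, i - window)
        end = min(len(sentences), i + window + 1)
        context = " ".join(sentences[start:end]).lower()
        if any(term.lower() in context for term in query_terms):
            related.append(i)
    return related


sentences = [
    "I bought a book yesterday.",
    "It is really good.",
    "The weather was cold.",
    "I hate rain.",
]
labels = ["objective", "subjective", "objective", "subjective"]
related = query_related_opinions(sentences, labels, ["book"], window=1)
```

Sentence 1 is kept because "book" occurs in the preceding sentence; sentence 3 is subjective but has no query term within its window.
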

Opinion Similarity – Similarity 1
- Assumption 1: higher topic relevance -> higher rank
- OSim_ir = Sim(Query, Doc)

Opinion Similarity – Similarity 2
- Assumption 2: more query-related opinions -> higher rank
- OSim_stcc: total number of query-related opinionative sentences
- OSim_stcs: total score of the query-related opinionative sentences
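
Both measures can be computed from one list of per-sentence scores. The sketch below assumes each query-related opinionative sentence carries a numeric score, for example the classifier's confidence; that interpretation of "score" is an assumption, and the function names are illustrative.

```python
def osim_stcc(sentence_scores):
    """Count of the query-related opinionative sentences in a document."""
    return len(sentence_scores)


def osim_stcs(sentence_scores):
    """Total score of the query-related opinionative sentences."""
    return sum(sentence_scores)


# Hypothetical scores of three query-related opinionative sentences.
scores = [0.5, 0.25, 1.0]
count_based = osim_stcc(scores)
score_based = osim_stcs(scores)
```

The count treats every sentence equally, while the score sum lets strongly subjective sentences contribute more.
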

Opinion Similarity – Similarity 3
- A linear combination of Similarity 1 and Similarity 2
  - a * OSim_ir + (1 - a) * OSim_stcc
  - b * OSim_ir + (1 - b) * OSim_stcs
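
The linear combination can be sketched as below. The same function serves both variants, since only the opinion component (OSim_stcc or OSim_stcs) and the weight (a or b) change. The names and example values are hypothetical, and any score normalization the system may apply is not shown because the slide does not describe it.

```python
def combined_osim(osim_ir, osim_opinion, weight):
    """weight * OSim_ir + (1 - weight) * OSim_opinion, with weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * osim_ir + (1.0 - weight) * osim_opinion


def rank_by_combined(docs, weight):
    """Rank doc ids by combined similarity, highest first.

    docs maps doc_id -> (osim_ir, osim_opinion).
    """
    return sorted(docs,
                  key=lambda doc_id: combined_osim(*docs[doc_id], weight),
                  reverse=True)


docs = {"d1": (4.0, 1.0), "d2": (2.0, 5.0)}
balanced = rank_by_combined(docs, weight=0.5)
topic_only = rank_by_combined(docs, weight=1.0)
```

With `weight=1.0` the ranking falls back to topic relevance alone (Similarity 1); lowering the weight gives the opinion evidence more influence.
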

Opinion Similarity – Experimental Results
- TREC 2006 Blog Track data
  - 50 queries, 3.2 million blog documents
- UIC at TREC 2006 Blog Track
  - Title-only queries: ranked first
- 28% to 32% higher than the best TREC 2006 scores
- Good things learned
  - More training data
  - Combined similarity function

Conclusions
- Designed and implemented an opinion retrieval system
- IR + text classification for opinion retrieval
- The best known retrieval effectiveness on TREC 2006 blog data
- Extend to polarity classification: positive/negative/mixed
- Plan to improve feature selection

Questions?