Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media
J. Bian, Y. Liu, E. Agichtein, and H. Zha
ACM WWW, 2008
Introduction
Question Answering (QA) is a form of information retrieval where the user's information need is specified as a natural language question, and the desired result is a self-contained answer rather than a list of documents. Community Question Answering (CQA) sites are communities organized around QA, such as Yahoo! Answers and Naver, which archive millions of questions and hundreds of millions of answers. CQA can be a more effective alternative to web search, since it connects users to others who are willing to share information directly: users receive direct responses and do not have to browse search-engine results to locate their answers.
Challenges
Searching for existing answers is crucial to avoid duplication and to save users time and effort. However, existing search engines are not designed for answering queries that require deep semantic understanding. Example: consider the query "When is the hurricane season in the Caribbean?". With Yahoo! Search, users still need to click through to web pages to find the information.
Challenges (cont.)
Example (cont.): for the same query, Yahoo! Answers provides one brief, high-quality answer.
Challenges
A large portion of CQA content reflects personal, unsubstantiated opinions of users, which are not useful for factual information. To retrieve correct factual answers to a question, it is necessary to determine both the relevance and the quality of candidate answers. Explicit feedback from users, in the form of "best answer" selections or "thumbs up/down" ratings, can provide a strong indicator of the quality of an answer. However, how to integrate explicit user feedback and relevance into a single ranking remains an open problem.
Proposed solution: a ranking framework that takes advantage of user interaction information to retrieve high-quality, relevant content in social media.
Learning Ranking Functions
Problem definition of QA retrieval: given a user query Q, a set of QA pairs is ordered according to their relevance to Q by learning a ranking function over triples of the form (qr_k, qst_i, ans_ij), where qr_k is the k-th query in a set of queries, qst_i is the i-th question in the CQA system, and ans_ij is the j-th answer to qst_i.
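To make the setup concrete, here is a minimal sketch (my own, not from the paper) of the objects involved: each candidate item is a query-question-answer triple, and a learned scoring function h over the triple's feature vector induces the ranking.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class QATriple:
    query: str      # qr_k: the user's search query
    question: str   # qst_i: a question archived in the CQA system
    answer: str     # ans_ij: the j-th answer posted for qst_i


def rank(triples: List[QATriple],
         featurize: Callable[[QATriple], Sequence[float]],
         h: Callable[[Sequence[float]], float]) -> List[QATriple]:
    """Order triples by the learned scoring function h, highest score first."""
    return sorted(triples, key=lambda t: h(featurize(t)), reverse=True)
```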
User Interactions in CQA
Yahoo! Answers supports effective search of archived questions and answers, and allows its users to ask questions ("Asker"), answer questions ("Answerer"), and evaluate the system ("Evaluator") by voting for other users' answers, marking interesting questions, and reporting abusive behavior.
Features
Each query-question-answer triple is represented by textual features, i.e., the textual similarity between the query, the question, and the answers, and by statistical features, i.e., independent features of the query, the question, and the answers.
Features (cont.)
Each triple is also represented by social features, i.e., user interaction activities and community-based features that can approximate a user's expertise in the QA community.
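A hedged sketch of how the three feature groups might be assembled into one feature vector; the concrete features below (token overlap, text lengths, raw vote counts, an answerer's best-answer count) are illustrative placeholders, not the paper's actual feature set.

```python
from __future__ import annotations


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the token sets of two texts."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def extract_features(query: str, question: str, answer: str,
                     plus_votes: int, minus_votes: int,
                     answerer_best_answers: int) -> list[float]:
    return [
        # Textual: similarity between query, question, and answer.
        token_overlap(query, question),
        token_overlap(query, answer),
        token_overlap(question, answer),
        # Statistical: independent properties of each text.
        float(len(query.split())),
        float(len(question.split())),
        float(len(answer.split())),
        # Social: user-interaction / community signals.
        float(plus_votes),
        float(minus_votes),
        float(answerer_best_answers),
    ]
```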
Preference Data Extraction
Users' evaluation data are extracted as a set of preference data that can be used for ranking answers. For each query qr over the same question qst, consider two existing answers ans_1 and ans_2. Assume ans_1 has p_1 plus votes and m_1 minus votes out of n_1 impressions, whereas ans_2 has p_2 plus votes and m_2 minus votes out of n_2 impressions. To determine whether ans_1 is preferred over ans_2 in terms of relevance to qst, it is assumed that plus votes follow a binomial distribution.
Binomial Distribution
A binomial experiment (i.e., a sequence of Bernoulli trials) is a statistical experiment with the following properties: the experiment consists of N repeated trials; each trial results in just two possible outcomes, success or failure; the probability of success, denoted p, is the same on every trial, and the probability of failure is 1 - p; and the trials are independent, i.e., the outcome of one trial does not affect the outcome of the others. In a binomial experiment that (i) consists of N trials, (ii) results in x successes, and (iii) has success probability p on each individual trial, the binomial probability is B(x; N, p) = C(N, x) p^x (1 - p)^(N - x), where C(N, x) is the binomial coefficient, read as "x out of N".
Binomial Distribution: Example
On a 10-question multiple-choice test with 4 options per question, the probability of getting exactly 5 answers correct by guessing is B(5; 10, 0.25) = C(10, 5) (0.25)^5 (0.75)^5 ≈ 5.8%, where p = 0.25, 1 - p = 0.75, x = 5, and N = 10. Thus, somebody who guesses all 10 answers on a multiple-choice test with 4 options has about a 5.8% chance of getting exactly 5 correct.
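The same arithmetic in code, using only the standard library; binomial_pmf is a generic helper, not code from the paper.

```python
from math import comb


def binomial_pmf(x: int, N: int, p: float) -> float:
    """Probability of exactly x successes in N independent trials with success rate p."""
    return comb(N, x) * p**x * (1 - p)**(N - x)


print(binomial_pmf(5, 10, 0.25))   # ~0.0584, i.e., about 5.8%
```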
Preference Data Extraction
To determine whether the pair ans_1 and ans_2 is significant, i.e., whether there are enough votes to compare the pair, the likelihood ratio test is applied: if λ > threshold, the pair is significant. To determine the preference within a significant pair, if p_1 / (p_1 + m_1 + s) > p_2 / (p_2 + m_2 + s), where s is a positive smoothing constant, then ans_1 is preferred over ans_2 (ans_1 ≻ ans_2); otherwise ans_2 is preferred over ans_1 (ans_2 ≻ ans_1).
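A sketch of the vote-based preference extraction under stated assumptions: the smoothed-ratio comparison mirrors the slide, while the concrete likelihood-ratio statistic (a two-binomial log-likelihood ratio with an illustrative chi-square-style threshold) is my own stand-in, since the slide does not spell out the exact test.

```python
from math import log


def _ll(k: int, n: int, p: float) -> float:
    """Binomial log-likelihood of k successes in n trials at rate p (0*log 0 := 0)."""
    out = 0.0
    if k > 0:
        out += k * log(p)
    if n - k > 0:
        out += (n - k) * log(1 - p)
    return out


def likelihood_ratio(p1: int, n1: int, p2: int, n2: int) -> float:
    """lambda = 2 * (logL under separate plus-vote rates - logL under a pooled rate)."""
    pooled = (p1 + p2) / (n1 + n2)
    alt = _ll(p1, n1, p1 / n1) + _ll(p2, n2, p2 / n2)
    null = _ll(p1, n1, pooled) + _ll(p2, n2, pooled)
    return 2.0 * (alt - null)


def prefer(p1, m1, n1, p2, m2, n2, s=1.0, threshold=3.84):
    """+1 if ans_1 is preferred, -1 if ans_2 is preferred, 0 if the pair is
    not significant (too few votes to compare)."""
    if likelihood_ratio(p1, n1, p2, n2) <= threshold:
        return 0
    r1 = p1 / (p1 + m1 + s)  # smoothed plus-vote rate of ans_1
    r2 = p2 / (p2 + m2 + s)  # smoothed plus-vote rate of ans_2
    return +1 if r1 > r2 else -1
```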
Preference Data Extraction
For two query-question-answer items with the same query, i.e., (qr, qst_1, ans_1) and (qr, qst_2, ans_2), let their feature vectors be X and Y. If ans_1 has a higher labeled grade than ans_2, then the preference X ≻ Y is included; if, on the other hand, ans_2 has a higher labeled grade than ans_1, then the preference Y ≻ X is included. Suppose the set of available preferences is S = {⟨x_i, y_i⟩ | x_i ≻ y_i, i = 1, ..., N}, where x_i and y_i denote the feature vectors of two query-question-answer triples with the same query, and x ≻ y means that x is preferred over y, i.e., x should be ranked higher than y.
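A minimal sketch of building the preference set S from labeled grades; the names and the tuple layout here are illustrative.

```python
def preferences_from_grades(items):
    """items: (query_id, feature_vector, relevance_grade) triples.
    Returns the preference set S as (x, y) pairs, meaning x should rank above y."""
    S = []
    for qid_a, x, grade_x in items:
        for qid_b, y, grade_y in items:
            # Only items that answer the same query are comparable.
            if qid_a == qid_b and grade_x > grade_y:
                S.append((x, y))
    return S
```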
Learning Ranking from Preference Data
The problem of learning ranking functions is cast as computing a ranking function h ∈ H that matches the set of preferences, i.e., h(x_i) ≥ h(y_i) whenever x_i ≻ y_i, for i = 1, ..., N. R(h) is the objective function (a squared hinge loss) that measures the risk of a given ranking function h; a preference x_i ≻ y_i is a contradicting pair w.r.t. h if h(x_i) < h(y_i). H is a function class, chosen to be linear combinations of regression trees. The minimization problem min_{h ∈ H} R(h) is solved using functional gradient descent, an algorithm based on gradient boosting.
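The slide does not reproduce the formula for R(h); a plausible reconstruction, following the standard GBRank formulation with a margin τ > 0, is:

```latex
\mathcal{R}(h) = \frac{1}{2} \sum_{i=1}^{N}
  \bigl( \max\{\, 0,\; h(y_i) - h(x_i) + \tau \,\} \bigr)^{2},
\qquad h^{*} = \arg\min_{h \in \mathcal{H}} \mathcal{R}(h).
```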
Learning Ranking from Preference Data
The ranking function h is learned using gradient boosting (GBRank), an algorithm that optimizes a cost function over function space by iteratively choosing a base (ranking) function, here a regression/decision tree; the number of iterations is determined by cross-validation.
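A hedged sketch of a GBRank-style training loop with regression trees as base learners; scikit-learn's DecisionTreeRegressor stands in for the paper's base learner, and tau, eta, the number of rounds, and the tree depth are illustrative hyperparameters (the paper tunes the number of iterations by cross-validation).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gbrank(X, Y, n_rounds=50, tau=1.0, eta=0.05, max_depth=3):
    """X[i] should rank above Y[i]; X and Y are (N, d) arrays of feature vectors."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    trees, h_x, h_y = [], np.zeros(len(X)), np.zeros(len(Y))
    for k in range(1, n_rounds + 1):
        # Contradicting pairs: the current model violates the margin tau.
        bad = h_x < h_y + tau
        if not bad.any():
            break
        # Regression targets push x up and y down for the contradicting pairs.
        data = np.vstack([X[bad], Y[bad]])
        target = np.concatenate([h_y[bad] + tau, h_x[bad] - tau])
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(data, target)
        trees.append(tree)
        # Convex ensemble update: h_k = (k * h_{k-1} + eta * g_k) / (k + 1).
        h_x = (k * h_x + eta * tree.predict(X)) / (k + 1)
        h_y = (k * h_y + eta * tree.predict(Y)) / (k + 1)
    return trees


def predict(trees, X, eta=0.05):
    """Score new feature vectors with the same convex combination of trees."""
    h = np.zeros(len(X))
    for k, tree in enumerate(trees, start=1):
        h = (k * h + eta * tree.predict(np.asarray(X, float))) / (k + 1)
    return h
```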
Experimental Setup
Datasets: 1,250 factoid questions from the TREC QA benchmark data. QA collection dataset: each query Q is submitted to Yahoo! Answers, up to 10 top-ranked related questions are extracted, and as many answers as are available are retrieved; this yields 89,642 tuples in total, 17,711 relevant and 71,931 non-relevant. Evaluation metrics: MRR, P@K, and MAP. Ranking methods compared: Baseline_Yahoo (ordered by posting date), Baseline_Votes (ordered by Positive_Votes - Negative_Votes), and GBRanking (ranking with the proposed community/social features).
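For reference, minimal implementations of the three evaluation metrics on a single ranked list of 0/1 relevance labels (corpus-level MRR and MAP are just the means of the per-query values):

```python
def reciprocal_rank(ranked):
    """1/rank of the first relevant answer, or 0 if none is relevant."""
    for i, rel in enumerate(ranked, start=1):
        if rel:
            return 1.0 / i
    return 0.0


def precision_at_k(ranked, k):
    """Fraction of relevant answers among the top k."""
    return sum(ranked[:k]) / k


def average_precision(ranked):
    """Mean of precision values at the ranks of the relevant answers."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0
```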
Experimental Results: Ranking Methods Compared
For each TREC query, there is a list of Yahoo! questions (YQ_a, YQ_b, ...), and for each question there are multiple answers (YQ_a1, YQ_a2, ...).
Experimental Results: Ranking Methods
MRR_MAX: calculate the MRR value of each Yahoo! Answers question and choose the highest value as the TREC query's MRR. This simulates an "intelligent" user who always selects the most relevant retrieved Yahoo! question first.
MRR_STRICT: same as MRR_MAX, but take the average of the questions' MRR values as the TREC query's MRR. This simulates a user who blindly follows Yahoo! Answers' ranking and the corresponding ordered answers.
MRR_RR (Round Robin): use YQ_a's first answer as the TREC query's first answer, YQ_b's first answer as the TREC query's second answer, and so on. This simulates a "jumpy" user who believes in first answers.
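A sketch of the three aggregation strategies for a single TREC query, where answer_lists holds each related Yahoo! question's answers as 0/1 relevance labels in ranked order; the round-robin interleaving (all first answers, then all second answers, and so on) is my reading of the slide.

```python
from itertools import chain, zip_longest


def reciprocal_rank(ranked):
    """1/rank of the first relevant answer, or 0 if none is relevant."""
    return next((1.0 / i for i, rel in enumerate(ranked, start=1) if rel), 0.0)


def mrr_max(answer_lists):
    # "Intelligent" user: take the best related question's MRR.
    return max(reciprocal_rank(answers) for answers in answer_lists)


def mrr_strict(answer_lists):
    # User who follows the given ranking: average over the related questions.
    return sum(reciprocal_rank(answers) for answers in answer_lists) / len(answer_lists)


def mrr_rr(answer_lists):
    # "Jumpy" user: interleave answers round-robin across the questions.
    interleaved = [rel for rel in chain(*zip_longest(*answer_lists)) if rel is not None]
    return reciprocal_rank(interleaved)
```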
Experimental Results: Ranking Methods Compared
MAX performs better than the other two aggregation strategies for both baselines (Baseline_Yahoo and Baseline_Votes). GBrank is better still, achieving a relative gain of 18% over MAX.
Experimental Results: Learning Ranking Function
The ranking function is learned using 10-fold cross-validation on 400 of the 1,250 TREC queries.
Experimental Results: Robustness to Noisy Labels
Using 50 manually-labeled queries and 350 randomly selected TREC queries with related questions and answers, the results show that a nearly-optimal model is generated even when trained on noisy relevance labels.
Experimental Results: Study on Feature Set
P@K when the ranking function is learned with each feature category removed in turn. The results show that users' evaluations play a very important role in learning the ranking function.