Finding Similar Questions in Large Question and Answer Archives Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee ACM CIKM ‘05 Presented by Mat Kelly CS895 – Web-based Information Retrieval




Similar presentations
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.

Relevance Feedback Limitations –Must yield result within at most 3-4 iterations –Users will likely terminate the process sooner –User may get irritated.
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.
Chapter 5: Introduction to Information Retrieval
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
SEARCHING QUESTION AND ANSWER ARCHIVES Dr. Jiwoon Jeon Presented by CHARANYA VENKATESH KUMAR.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Probabilistic Language Processing Chapter 23. Probabilistic Language Models Goal -- define probability distribution over set of strings Unigram, bigram,
A Maximum Coherence Model for Dictionary-based Cross-language Information Retrieval Yi Liu, Rong Jin, Joyce Y. Chai Dept. of Computer Science and Engineering.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Evaluating Search Engine
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
Clustering… in General In vector space, clusters are vectors found within  of a cluster vector, with different techniques for determining the cluster.
Modern Information Retrieval
1 CS 430 / INFO 430 Information Retrieval Lecture 12 Probabilistic Information Retrieval.
1 LM Approaches to Filtering Richard Schwartz, BBN LM/IR ARDA 2002 September 11-12, 2002 UMASS.
ITCS 6010 Natural Language Understanding. Natural Language Processing What is it? Studies the problems inherent in the processing and manipulation of.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
Chapter 5: Information Retrieval and Web Search
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Probabilistic Model for Definitional Question Answering Kyoung-Soo Han, Young-In Song, and Hae-Chang Rim Korea University SIGIR 2006.
IR Evaluation Evaluate what? –user satisfaction on specific task –speed –presentation (interface) issue –etc. My focus today: –comparative performance.
A Comparative Study of Search Result Diversification Methods Wei Zheng and Hui Fang University of Delaware, Newark DE 19716, USA
Combining Lexical Semantic Resources with Question & Answer Archives for Translation-Based Answer Finding Delphine Bernhard and Iryna Gurevych Ubiquitous.
Finding Similar Questions in Large Question and Answer Archives Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee Retrieval Models for Question and Answer Archives.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Retrieval Models for Question and Answer Archives Xiaobing Xue, Jiwoon Jeon, W. Bruce Croft Computer Science Department University of Massachusetts, Google,
Chapter 6: Information Retrieval and Web Search
CS 533 Information Retrieval Systems.  Introduction  Connectivity Analysis  Kleinberg’s Algorithm  Problems Encountered  Improved Connectivity Analysis.
Presenter: Shanshan Lu 03/04/2010
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
1 Blog site search using resource selection 2008 ACM CIKM Advisor : Dr. Koh Jia-Ling Speaker : Chou-Bin Fan Date :
August 17, 2005Question Answering Passage Retrieval Using Dependency Parsing 1/28 Question Answering Passage Retrieval Using Dependency Parsing Hang Cui.
Effective Automatic Image Annotation Via A Coherent Language Model and Active Learning Rong Jin, Joyce Y. Chai Michigan State University Luo Si Carnegie.
C.Watterscs64031 Evaluation Measures. C.Watterscs64032 Evaluation? Effectiveness? For whom? For what? Efficiency? Time? Computational Cost? Cost of missed.
AN EFFECTIVE STATISTICAL APPROACH TO BLOG POST OPINION RETRIEVAL Ben He Craig Macdonald Iadh Ounis University of Glasgow Jiyin He University of Amsterdam.
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
1 Adaptive Subjective Triggers for Opinionated Document Retrieval (WSDM 09’) Kazuhiro Seki, Kuniaki Uehara Date: 11/02/09 Speaker: Hsu, Yu-Wen Advisor:
Divided Pretreatment to Targets and Intentions for Query Recommendation Reporter: Yangyang Kang /23.
The Loquacious ( 愛說話 ) User: A Document-Independent Source of Terms for Query Expansion Diane Kelly et al. University of North Carolina at Chapel Hill.
Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.
Evaluation. The major goal of IR is to search document relevant to a user query. The evaluation of the performance of IR systems relies on the notion.
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
Refined Online Citation Matching and Adaptive Canonical Metadata Construction CSE 598B Course Project Report Huajing Li.
1 ICASSP Paper Survey Presenter: Chen Yi-Ting. 2 Improved Spoken Document Retrieval With Dynamic Key Term Lexicon and Probabilistic Latent Semantic Analysis.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
Usefulness of Quality Click- through Data for Training Craig Macdonald, ladh Ounis Department of Computing Science University of Glasgow, Scotland, UK.
Question Answering Passage Retrieval Using Dependency Relations (SIGIR 2005) (National University of Singapore) Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan,
CS791 - Technologies of Google Spring A Web­based Kernel Function for Measuring the Similarity of Short Text Snippets By Mehran Sahami, Timothy.
Text Similarity: an Alternative Way to Search MEDLINE James Lewis, Stephan Ossowski, Justin Hicks, Mounir Errami and Harold R. Garner Translational Research.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
University Of Seoul Ubiquitous Sensor Network Lab Query Dependent Pseudo-Relevance Feedback based on Wikipedia Department of Electronic, Electrical and Computer Engineering, USN Lab G
Bayesian Extension to the Language Model for Ad Hoc Information Retrieval Hugo Zaragoza, Djoerd Hiemstra, Michael Tipping Microsoft Research Cambridge,
An Effective Statistical Approach to Blog Post Opinion Retrieval Ben He, Craig Macdonald, Jiyin He, Iadh Ounis (CIKM 2008)
CS246: Information Retrieval
Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web Yizhe Ge.
Relevance and Reinforcement in Interactive Browsing
INF 141: Information Retrieval
Presentation transcript:

Finding Similar Questions in Large Question and Answer Archives. Jiwoon Jeon, W. Bruce Croft and Joon Ho Lee. ACM CIKM ‘05. Presented by Mat Kelly, CS895 – Web-based Information Retrieval, Old Dominion University, December 13, 2011. Question Answering from Frequently Asked Question Files. Robin D. Burke, Kristian J. Hammond, Vladimir Kulyukin, Steven L. Lytinen, Noriko Tomuro and Scott Schoenberg. AI Magazine, Summer 1997.

What is FAQ Finder? Matches a user's question against question-answer pairs already present in a site's FAQ file. Four assumptions:
1. Information is in QA format.
2. All information needed to determine the relevance of a QA pair can be found in the QA pair itself.
3. The question half of the QA pair is the most relevant for matching the user's question.
4. Broad, shallow knowledge of language is sufficient for question matching.

How Does It Work? Uses the SMART IR system to narrow the focus to relevant FAQ files, then iterates through the QA pairs in each FAQ file, comparing each against the user's question and computing a score from 3 metrics:
– Statistical term-vector similarity score t
– Semantic similarity score s
– Coverage score c
T, S and C are constant weights that adjust how much the system relies on each metric.
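As a rough illustration (the exact combination rule and the weight values are not given on the slide and are assumed here), the three metric scores could be combined as a weighted sum:

```python
def faq_match_score(t, s, c, T=1.0, S=1.0, C=1.0):
    """Combine the term-vector score t, semantic score s, and coverage score c.

    T, S, C are constant weights controlling how much the system relies on
    each metric; the values here are placeholders, not FAQ Finder's settings.
    """
    return (T * t + S * s + C * c) / (T + S + C)

# Example: a QA pair with strong term overlap but weaker semantic/coverage scores
print(faq_match_score(t=0.8, s=0.3, c=0.5))
```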

Calculating Similarity. A QA pair is represented as a term vector with a significance value for each term in the pair. Significance value = tfidf:
– n (term frequency) = number of times the term appears in the QA pair
– m = number of QA pairs in the file in which the term appears
– M = total number of QA pairs in the file
– tfidf = n × log(M/m)
The idf factor evaluates the relative rarity of a term across the file and is used to weight the frequency of the term in the QA pair.
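A minimal sketch of the tfidf weighting described above, assuming M is the total number of QA pairs in the FAQ file (function and variable names are illustrative):

```python
import math
from collections import Counter

def tfidf_vector(qa_terms, all_qa_pairs):
    """Build a significance-value vector for one QA pair.

    n = number of times a term appears in the QA pair
    m = number of QA pairs in the file containing the term
    M = total number of QA pairs in the file
    tfidf = n * log(M / m)
    """
    M = len(all_qa_pairs)
    counts = Counter(qa_terms)
    return {
        term: n * math.log(M / sum(1 for pair in all_qa_pairs if term in pair))
        for term, n in counts.items()
    }

faq = [["reboot", "system", "windows"], ["install", "windows"], ["futon", "plans"]]
print(tfidf_vector(faq[0], faq))
```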

Nuances. There are many ways to express the same question. In large documents, synonymous terms tend to co-occur, so wording variation has little effect. FAQ Finder, however, matches on a small number of terms, so the system needs a means of matching synonyms, e.g.:
– How do I reboot my system?
– What do I do when my computer crashes?
The causal relationship between these is resolved with WordNet.

WordNet. A semantic network of English words. It provides relations between words and synonym sets, and between the synonym sets themselves. FAQ Finder uses it through a marker-passing algorithm that compares each word in the user's question to each word in the FAQ file question.

WordNet (cont…) Not a single semantic network: different sub-networks exist for nouns, verbs, etc. Syntactically ambiguous words (e.g., run) appear in more than one network. Simply relying on the default word sense worked as well as any more sophisticated technique.
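FAQ Finder's marker-passing algorithm is not reproduced here; as a rough stand-in, the sketch below scores word pairs with NLTK's WordNet path similarity using the default (first) sense, in the spirit of the point above (assumes nltk and its WordNet data are installed; names are illustrative):

```python
from nltk.corpus import wordnet as wn

def word_similarity(w1, w2):
    """Approximate semantic relatedness of two words via their default senses."""
    s1, s2 = wn.synsets(w1), wn.synsets(w2)
    if not s1 or not s2:
        return 0.0
    # Default word sense = first synset returned by WordNet
    return s1[0].path_similarity(s2[0]) or 0.0

def question_similarity(user_q, faq_q):
    """Average best-match similarity of each user-question word to the FAQ question."""
    scores = [max(word_similarity(u, f) for f in faq_q) for u in user_q]
    return sum(scores) / len(scores)

print(question_similarity(["reboot", "computer"], ["crash", "system"]))
```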

Coverage score: penalize FAQ questions that lack a corresponding word for some word in the user's question.

Evaluating Performance. Corpus drawn from the log file of the system's use (May–Dec). The logged questions were manually scanned, finding 138 questions with answers in the FAQ files and 103 questions unanswered. Assumes there is a single correct QA pair per question. Because this task differs from the conventional IR problem, recall and precision have to be redefined.

Why Redefine Recall & Precision?
– RECALL is typically the % of the relevant documents in the collection that are retrieved for a query.
– PRECISION is typically the % of the retrieved documents that are relevant.
With only one correct document, the two measures are not independent. E.g., if a query returns 5 QA pairs, FAQ Finder gets either 100% recall and 20% precision, or 0% recall and 0% precision; if no answer exists, precision = 0% and recall is undefined.

Redefining Recall & Precision. New recall = % of questions for which FAQ Finder returns the correct answer when one exists (unlike the original measure, it does not penalize the system when more than one correct answer exists). Instead of precision, calculate rejection: the % of questions with no answer in the file that FAQ Finder correctly reports as unanswered, adjusted by a cutoff point for the minimum allowable match score. There is still a tradeoff between rejection and recall:
– Rejection threshold too high: some correct answers are eliminated.
– Rejection threshold too low: incorrect answers are given to the user when no answer exists.
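A hedged sketch of the redefined measures, assuming each logged question is labeled with whether an answer exists, whether FAQ Finder found it, and whether it was rejected (the data layout is illustrative):

```python
def recall_and_rejection(results):
    """results: list of dicts with keys
       'has_answer'   - a correct QA pair exists in the FAQ file
       'found_answer' - FAQ Finder returned that correct pair
       'rejected'     - FAQ Finder reported that no answer exists
    """
    answerable = [r for r in results if r["has_answer"]]
    unanswerable = [r for r in results if not r["has_answer"]]
    recall = sum(r["found_answer"] for r in answerable) / len(answerable)
    rejection = sum(r["rejected"] for r in unanswerable) / len(unanswerable)
    return recall, rejection

log = [
    {"has_answer": True, "found_answer": True, "rejected": False},
    {"has_answer": True, "found_answer": False, "rejected": False},
    {"has_answer": False, "found_answer": False, "rejected": True},
]
print(recall_and_rejection(log))
```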

Results. The correct file appears within the top 5 files returned 88% of the time, and in first position 48% of the time; this equates to 88% recall and 23% precision. The system confidently returns garbage when there is no correct answer in the file.

Ablation Study. The contribution of each component of the matching scheme is evaluated by disabling the others:
1. QA pairs selected randomly from the FAQ file (baseline)
2. Coverage score used by itself
3. Semantic (WordNet) score used by itself
4. Term-vector comparison used in isolation

Conditions’ Contributions. The WordNet and statistical techniques both contribute strongly; their combination yields results better than either individually.

Where FAQ Finder Fails. The biggest culprit in failed matches is the undue weight given to semantically useless words. E.g., "Where can I find woodworking plans for a futon?": woodworking is weighted as strongly as futon, yet inside the woodworking FAQ futon should be far more important than woodworking, which applies to everything. The other problem: violations of the assumptions about FAQ files.

Conclusion. When an existing collection of questions and answers is available, question answering can be reduced to matching new questions against the QA pairs. The power of the approach comes from FAQ Finder's use of highly organized knowledge sources that are designed to answer commonly asked questions.

The Citing Paper’s Objectives. Find questions in the archive that are semantically similar to the user's question. Problems to resolve:
– Two questions with the same meaning may use very different wording.
– Similarity measures developed for document retrieval work poorly when there is little word overlap.

Approaches Toward the Word Mismatch Problem:
1. Use knowledge databases such as machine-readable dictionaries (the approach of the first paper) – their current quality and structure are insufficient.
2. Employ manual rules and templates – expensive and hard to scale to large collections.
3. Use statistical techniques from IR and natural language processing – the most promising, given enough training data.

Problems with the Statistical Approach. Needed: a large number of semantically similar but lexically different sentence or question pairs, and no such collection exists on a large scale. Researchers artificially generate collections through methods like translation and subsequent reverse translation. This paper proposes an automatic way of building collections of semantically similar questions from existing Q&A archives.

Question & Answer Archives. Naver – the leading portal site in South Korea. Average length of the question title field = 5.8 words; average question body = 49 words; average answer = 179 words. Two test collections were made from the archive:
– A: 6.8M QA pairs across all categories
– B: 68k QA pairs from the "Computer Novice" category
Example QA pair:
– Question title: How to make multi-booting systems?
– Question body: I am using Windows 98. I'd like to multi-boot with Windows XP. How can I do this?
– Answer: You must partition your hard disk, then install Windows 98 first. If there is no problem with Windows 98, then install Windows XP on…

Needed: sets of topics with relevance judgments. Two sets of 50 QA pairs were randomly selected: the first from Collection A across all categories, the second from Collection B's "Computer Novice" category. Each pair was converted to a topic:
– Question title → short query
– Question body → long query
– Answer → supplemental query (used only in the relevance judgment procedure)

Find Relevant QA Pairs. Given a topic, employ the TREC pooling technique: 18 different retrieval results are generated by varying the retrieval algorithm, query type and search field, using retrieval models such as Okapi BM25, query likelihood and the overlap coefficient. The top 20 QA pairs from each run are pooled and judged manually for relevance:
– A QA pair is considered relevant as long as it is semantically identical or very similar to the query.
– If no relevant QA pairs are found for a given topic, the collection is manually browsed to find at least one.
Result: 785 relevant QA pairs for A, 1,557 for B.
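A rough sketch of the pooling step, assuming each of the 18 runs is simply a ranked list of QA-pair ids (the retrieval runs themselves are not shown):

```python
def build_pool(runs, depth=20):
    """Union of the top-`depth` QA pairs from each retrieval run.

    runs: list of ranked lists of QA-pair ids (one list per run)
    Returns the set of candidates to be judged manually for relevance.
    """
    pool = set()
    for ranked_ids in runs:
        pool.update(ranked_ids[:depth])
    return pool

runs = [["qa3", "qa7", "qa1"], ["qa7", "qa9", "qa3"]]
print(build_pool(runs, depth=2))   # {'qa3', 'qa7', 'qa9'}
```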

Verifying Field Importance. Previous research: in the FAQ retrieval task, similarity between questions is more important than similarity between questions and answers.
– Experiment 1: search only the question title field
– Experiment 2: only the question body
– Experiment 3: only the answer
All experiments use the query likelihood model with Dirichlet smoothing and Okapi BM25. Regardless of retrieval model, the best performance comes from searching the question title field, and the performance gaps to the other fields are significant.
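For reference, a compact sketch of query-likelihood scoring with Dirichlet smoothing as used in these field experiments; the choice of μ and the toy data are illustrative, not the paper's settings:

```python
import math
from collections import Counter

def ql_dirichlet_score(query_terms, doc_terms, collection_counts, collection_len, mu=1000):
    """log P(Q|D) with Dirichlet smoothing:
       P(w|D) = (tf(w, D) + mu * P(w|C)) / (|D| + mu)
    """
    doc_tf = Counter(doc_terms)
    score = 0.0
    for w in query_terms:
        p_wc = collection_counts.get(w, 0.5) / collection_len  # background model
        p_wd = (doc_tf[w] + mu * p_wc) / (len(doc_terms) + mu)
        score += math.log(p_wd)
    return score

coll = Counter("how to make multi booting systems with windows xp".split())
doc = "how to make multi booting systems".split()
print(ql_dirichlet_score(["multi", "boot"], doc, coll, sum(coll.values())))
```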

Collecting Semantically Similar Questions. Many people do not search to see if a question has already been asked, and instead ask a semantically similar question. Assumption: if two answers are similar, then the corresponding questions are semantically similar but lexically different. Sample semantically similar questions with little word overlap:
– I'd like to insert music into Powerpoint. / How can I link sounds in Powerpoint?
– How can I shut down my system in Dos-mode. / How to turn off computers in Dos-mode
– Photo transfer from cell phones to computers. / How to move photos taken by cell phones.

Algorithm. Consider 4 popular document similarity measures:
1. Cosine similarity with the vector space model
2. Negative KL divergence between language models
3. Output score of the query likelihood model
4. Score of the Okapi model

Finding a Similarity Measure: The Cosine Similarity Model. The length of answers varies considerably – some are very short (factoids), others very long (copied and pasted from the web). Any similarity measure affected by length is not appropriate.

Finding a Similarity Measure: Negative KL Divergence & Okapi. The values are not symmetric and are not probabilities – a pair of answers with a higher negative KL divergence than another pair does not necessarily have a stronger semantic connection, which makes pairs hard to rank. The Okapi model has similar problems.
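A tiny sketch illustrating the asymmetry problem with negative KL divergence between two smoothed answer language models (the distributions are toy values, not from the collection):

```python
import math

def neg_kl(p, q):
    """Negative KL divergence -KL(p || q) over a shared vocabulary."""
    return -sum(p[w] * math.log(p[w] / q[w]) for w in p)

a = {"windows": 0.6, "boot": 0.3, "disk": 0.1}   # language model of answer A
b = {"windows": 0.2, "boot": 0.2, "disk": 0.6}   # language model of answer B
print(neg_kl(a, b), neg_kl(b, a))   # the two directions give different values
```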

Finding a Similarity Measure: Query Likelihood Model. The score is a probability, so it can be compared across different answer pairs, but the scores are NOT symmetric.

Overcoming Problems. Using ranks instead of scores was more effective: if using answer A as a query retrieves answer B at rank r1, and using answer B retrieves answer A at rank r2, then the similarity between the two answers is the reverse harmonic mean of the two ranks. The query likelihood model is used to calculate the initial ranks.
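A sketch of the rank-based answer similarity, under the assumption that "reverse harmonic mean of the two ranks" means the mean of the two reciprocal ranks; the retrieval that produces r1 and r2 is the query-likelihood step described above and is not shown:

```python
def answer_similarity(r1, r2):
    """Similarity of two answers from their mutual retrieval ranks.

    r1 = rank of answer B when answer A is used as the query
    r2 = rank of answer A when answer B is used as the query
    Assumed reading of the 'reverse harmonic mean': mean of reciprocal ranks.
    """
    return 0.5 * (1.0 / r1 + 1.0 / r2)

print(answer_similarity(1, 3))   # mutually high-ranked answers score high
```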

Experiments & Results. 68,000 × 67,999 / 2 answer pairs are possible from the 68,000 Q&A pairs in Collection B. All pairs are ranked using the established similarity measure, and a threshold is set empirically to judge whether a pair is related or not: a higher threshold gives a smaller but better-quality collection, but to acquire enough training samples the threshold cannot be too high. 331,965 pairs have a score above the threshold.
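A toy sketch of the pair-enumeration and thresholding step, reusing the assumed rank-based similarity from the previous sketch; the threshold value and ranking function here are placeholders:

```python
from itertools import combinations

def collect_similar_pairs(answer_ids, rank_of, threshold=0.3):
    """Score every answer pair and keep those above the threshold.

    rank_of(a, b): rank of answer b when answer a is used as the query
    (assumed to come from the query-likelihood retrieval step).
    """
    kept = []
    for a, b in combinations(answer_ids, 2):   # 68,000 * 67,999 / 2 pairs in Collection B
        sim = 0.5 * (1.0 / rank_of(a, b) + 1.0 / rank_of(b, a))
        if sim >= threshold:
            kept.append((a, b, sim))
    return kept

# Toy ranking function: pretend every retrieval returns the other answer at rank 2
print(collect_similar_pairs(["a1", "a2", "a3"], rank_of=lambda a, b: 2))
```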

Word Translation Probabilities. The question pair collection is treated as a parallel corpus and IBM Model 1 is applied. It does not require any linguistic knowledge of the source/target language and treats every word alignment equally. The translation probability from source word s to target word t is
P(t|s) = (1/λ_s) Σ_{i=1..N} [ P_old(t|s) / (P_old(t|s_1) + … + P_old(t|s_n)) ] × #(t, J_i) × #(s, J_i)
– λ_s = normalization factor, so the probabilities for s sum to 1
– N = number of training samples
– J_i = the i-th pair in the training set

Word Translation Probabilities (cont). {s_1, …, s_n} = the words of the source sentence in J_i; #(t, J_i) = the number of times t occurs in J_i (likewise #(s, J_i) for the source word s). The update still needs the old translation probabilities P_old: the translation probabilities are initialized with random values, new probabilities are estimated, and the procedure is repeated until the probabilities converge; it always converges to the same final solution [1].
[1] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra and R. L. Mercer. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 19(2):263–311, 1993.
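A compact, illustrative EM loop in the spirit of IBM Model 1 over the question pairs; this is a sketch under simplified initialization and stopping rules, not the authors' implementation:

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """pairs: list of (source_words, target_words) question pairs.
    Returns P(t|s) as a nested dict: prob[s][t]."""
    # Uniform initialization over the target vocabulary
    t_vocab = {t for _, tgt in pairs for t in tgt}
    prob = defaultdict(lambda: defaultdict(lambda: 1.0 / len(t_vocab)))
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total_s = defaultdict(float)
        for src, tgt in pairs:
            for t in tgt:
                # Normalize over all source words in this pair (alignment posterior)
                z = sum(prob[s][t] for s in src)
                for s in src:
                    c = prob[s][t] / z
                    count[s][t] += c
                    total_s[s] += c
        # Re-estimate P(t|s) = count(s, t) / lambda_s
        for s in count:
            for t in count[s]:
                prob[s][t] = count[s][t] / total_s[s]
    return prob

pairs = [(["powerpoint", "music"], ["powerpoint", "sounds"]),
         (["dos", "shutdown"], ["dos", "turn", "off"])]
P = train_ibm1(pairs)
print(sorted(P["music"].items(), key=lambda kv: -kv[1])[:3])
```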

Experiments & Results (Word Translation). Stop words were removed. The collection of question pairs was duplicated by switching the source and target parts and then used as input. Usually, the most similar word to a given word is the word itself. Semantic relationships were found: e.g., "bmp" turned out to be similar to "jpg" and "gif".

Question Retrieval. How do we go from word translation probabilities to retrieving question titles? The similarity between a query Q and a document (question title) D is the query likelihood
P(Q|D) = Π_{w in Q} P(w|D)
To avoid zero probabilities and estimate more accurate language models, the document model is smoothed with the collection model, where P_ml(w|C) and P_ml(w|D) are the probabilities of term w being generated from the collection C and the document D:
P(w|D) = (1 − λ) P_ml(w|D) + λ P_ml(w|C)
In the translation model, P_ml(w|D) is replaced by
Σ_t T(w|t) P_ml(t|D)
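A sketch of the translation-based retrieval score assembled from the pieces above, with the smoothing parameter, toy data, and the T(w|t) table all illustrative:

```python
import math
from collections import Counter

def translation_lm_score(query, doc, coll_tf, coll_len, trans, lam=0.2):
    """log P(Q|D) where the document model mixes word translation probabilities:
       P(w|D) = (1 - lam) * sum_t T(w|t) * Pml(t|D) + lam * Pml(w|C)
    trans: dict mapping (w, t) -> T(w|t), the learned word translation probability.
    """
    doc_tf = Counter(doc)
    score = 0.0
    for w in query:
        p_trans = sum(trans.get((w, t), 0.0) * tf / len(doc) for t, tf in doc_tf.items())
        p_coll = coll_tf.get(w, 0.5) / coll_len
        score += math.log((1 - lam) * p_trans + lam * p_coll)
    return score

coll = Counter("shut down computer dos mode turn off".split())
trans = {("shut", "turn"): 0.4, ("shut", "shut"): 0.6, ("down", "off"): 0.5}
doc = "turn off computers in dos mode".split()
print(translation_lm_score(["shut", "down"], doc, coll, sum(coll.values()), trans))
```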

Experiments & Results (Question Retrieval). 50 short queries from Collection B, searching only the title field. Similarities between each query Q and the question titles are calculated. The model's performance is compared with the vector space model with cosine similarity, Okapi BM25, and the query likelihood language model.

Experiments & Results cont… (Question Retrieval). MAP is reported for each model: Cosine, LM, Okapi, and the translation model. The approach outperforms the other baseline models at all recall levels; QL and Okapi show comparable performance. In all evaluations, the approach outperforms the other models.

Conclusions and Seminal Paper Relevance. A retrieval model based on translation probabilities learned from the archive significantly outperforms other approaches in finding semantically similar questions despite lexical mismatch. Using translation probabilities and answer similarity is a much more robust way of resolving similar QA pairs, with fewer prerequisites on the corpus.

References
Burke, R. D., Hammond, K. J., Kulyukin, V. A., Lytinen, S. L., Tomuro, N., & Schoenberg, S. (1997). Question answering from frequently asked question files: Experience with the FAQ Finder system (Tech. Rep.). Chicago, IL, USA.
Jeon, J., Croft, W. B., & Lee, J. H. (2005). Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05). ACM, New York, NY, USA.