Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK
IR4QA: An Unhappy Marriage
Mark A. Greenwood

Outline of Talk
- Background
- ‘Ancient’ History
- Recent Past
- An Uncertain Future
- Possible New Directions

Background
"Although QA is not new, the language processing community has yet to develop a clearly articulated and commonly accepted guiding framework and research methodology, parallel to that of IR, MT, or text summarization. As a result, despite ten years of system evaluations in the TREC QA track for specific kinds of questions and answers, the community does not have a clear idea how much progress was made during that period for QA in general."
(OAQA09 Call for Papers)

Background
- We will focus here on the selection of promising documents which can be subjected to further processing in order to extract exact answers to questions
- The common approach to this problem has been to employ an IR engine to retrieve a small set of relevant documents, a field known as IR4QA
- The rest of this talk will explain
  - How we got to this point
  - Why it is fundamentally flawed
  - Where we might go from here

‘Ancient’ History
- Traditionally IR and QA were separate research areas
  - They had different users and goals
  - The inputs and outputs to both systems were radically different
  - Both had their own strengths and weaknesses

‘Ancient’ History
- Early QA systems were usually just interfaces to structured data
  - LUNAR (Woods, 1973)
  - BASEBALL (Green et al., 1961)
- Those systems which worked over text were usually based around reading comprehension exercises and used scenario templates
  - SAM (Schank and Abelson, 1977)
- Questions varied in length but asked for information which wasn’t known to the user
- Systems were not open-domain, e.g. LUNAR only knew about moon rocks

‘Ancient’ History
- In comparison to QA systems, early IR systems could, in principle, be applied to any document collection, although performance varied from collection to collection
- Queries were usually quite long and described the documents the user was looking for
  - The CACM collection is a good example
- Systems returned full documents, not exact answers
  - As the user already knew what they were looking for, this was OK
  - Full documents don’t help when you don’t know what you are looking for, as you then have to read all the returned documents

Recent Past
- Recent QA research has been guided by the TREC evaluations
- The TREC QA track was originally conceived as a task that would interest both the IR and IE communities
  - Focused IR
  - Open-Domain IE
- It was hoped that over time the two communities would work together to develop new combined approaches
- Unfortunately it would seem that the IR community is not, on the whole, interested in the QA task

Recent Past
- Most, if not all, modern QA systems have adopted a (roughly) three-stage architecture: question analysis, document retrieval, and answer extraction

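The three-stage architecture can be sketched as a simple function composition. This is a minimal illustration only: the function names (`answer_question`, `toy_analyse`, `toy_retrieve`, `toy_extract`), the two-sentence corpus and the keyword-overlap scoring are all assumptions made for the example and do not come from any particular QA system.

```python
from typing import Callable, List, Optional

def answer_question(
    question: str,
    analyse: Callable[[str], List[str]],
    retrieve: Callable[[List[str], int], List[str]],
    extract: Callable[[str, List[str]], Optional[str]],
    top_n: int = 20,
) -> Optional[str]:
    """Run the three stages in order (hypothetical interface)."""
    query_terms = analyse(question)          # stage 1: question analysis
    documents = retrieve(query_terms, top_n)  # stage 2: document retrieval
    if not documents:
        # If retrieval returns nothing, extraction has nothing to work on.
        return None
    return extract(question, documents)       # stage 3: answer extraction

# Toy stand-ins for each stage (assumptions, for illustration only).
CORPUS = [
    "Neil Armstrong walked on the Moon in 1969.",
    "Sheffield is a city in South Yorkshire.",
]

def toy_analyse(question: str) -> List[str]:
    stop = {"who", "what", "where", "the", "on", "in", "is", "a"}
    words = [w.strip("?.,").lower() for w in question.split()]
    return [w for w in words if w and w not in stop]

def toy_retrieve(terms: List[str], top_n: int) -> List[str]:
    # Score each document by how many query terms occur in it (substring match).
    scored = [(sum(t in doc.lower() for t in terms), doc) for doc in CORPUS]
    ranked = sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])
    return [doc for _, doc in ranked[:top_n]]

def toy_extract(question: str, documents: List[str]) -> Optional[str]:
    # Stand-in for real answer extraction: first two tokens of the top document.
    return " ".join(documents[0].split()[:2])

print(answer_question("Who walked on the Moon?", toy_analyse, toy_retrieve, toy_extract))
```

Because the stages run strictly in sequence, any question for which `retrieve` returns no useful document cannot be answered, however good the extraction stage is.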
Recent Past
- IR4QA has not been aggressively researched by the community, yet we know that...
  - IR performance places an upper bound on end-to-end performance; a commonly quoted figure is 60% (Tellex et al., 2003)
  - Even if we look at the top 1000 documents, no relevant documents are returned for 8% of the questions (Hovy et al., 2000)
- Most systems use off-the-shelf IR components with little or no tuning to the task, e.g. Lucene, Okapi...
- Complex multi-query strategies have been tried in an effort to solve the problem, but they only serve to highlight how bad performance at this step actually is

Recent Past
- IR4QA has focused on the development and evaluation of the document retrieval component in such systems
- The main problems are
  - QA researchers are not IR researchers
  - We don’t fully understand the intricate details of IR engines
  - QA and IR are fundamentally different tasks

Recent Past
- The commonly accepted evaluation framework consists of (Roberts and Gaizauskas, 2004)
  - Coverage: the proportion of questions for which at least one answer bearing document is retrieved
  - Redundancy: the average number of answer bearing documents retrieved for a question

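Both measures are straightforward to compute once each question has a ranked list of retrieved documents and a set of known answer bearing documents. The sketch below assumes both are evaluated at a rank cut-off `n`; the function names, the dictionary layout and the example document identifiers are illustrative assumptions.

```python
from typing import Dict, List, Set

def coverage(retrieved: Dict[str, List[str]],
             answer_bearing: Dict[str, Set[str]], n: int) -> float:
    """Proportion of questions with at least one answer bearing
    document among the top n retrieved documents."""
    if not retrieved:
        return 0.0
    hits = sum(
        1 for q, docs in retrieved.items()
        if any(d in answer_bearing.get(q, set()) for d in docs[:n])
    )
    return hits / len(retrieved)

def redundancy(retrieved: Dict[str, List[str]],
               answer_bearing: Dict[str, Set[str]], n: int) -> float:
    """Average number of answer bearing documents per question
    among the top n retrieved documents."""
    if not retrieved:
        return 0.0
    total = sum(
        sum(1 for d in docs[:n] if d in answer_bearing.get(q, set()))
        for q, docs in retrieved.items()
    )
    return total / len(retrieved)

# Hypothetical ranked results and judgements for four questions.
retrieved = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5", "d6"],
    "q3": ["d7", "d8", "d9"],
    "q4": ["d1", "d5", "d9"],
}
answer_bearing = {"q1": {"d1", "d3"}, "q2": {"d6"}, "q3": {"d0"}, "q4": {"d9"}}

print(coverage(retrieved, answer_bearing, 2), redundancy(retrieved, answer_bearing, 2))
print(coverage(retrieved, answer_bearing, 3), redundancy(retrieved, answer_bearing, 3))
```

Coverage at rank n is also the ceiling on end-to-end accuracy for a system that passes only the top n documents to answer extraction.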
Recent Past
- There have been two workshops focused on the problem of IR4QA
  - Sheffield, SIGIR 2004
  - Manchester, Coling 2008
- The main conclusions of both were that
  - IR4QA is very hard
  - Approaches that lead to increased IR performance do not necessarily lead to appreciable increases in end-to-end performance
  - Selection of documents shouldn’t be performed in isolation from the rest of the system

An Uncertain Future
- It seems clear that, on the whole, the IR community are not interested in QA
- Using off-the-shelf IR components has been shown to introduce unacceptable caps on performance
- The IR4QA community need to consider radically different approaches to the problem of selecting relevant documents from large corpora

Possible New Directions
- Answer extraction requires complex text processing
  - Answer extraction techniques don’t scale well
  - Some form of text selection component is required
- There are two orthogonal directions we could take
  - Continue to use traditional IR techniques but discard the traditional view of what makes a document (and/or query)
  - Continue to work with traditional documents but use a radically different selection approach
- We need approaches that scale: working on AQUAINT size collections is nice for self-contained experiments but shouldn’t be the end goal!

What Is A Document?
- Topic Indexing and Retrieval (Ahn and Webber, 2008) throws away the common idea of documents while using a standard IR engine to directly retrieve answers, not text
- Topics are entities that answer questions
  - People, companies, locations etc.
- Topic documents are built by simply joining together all sentences from a corpus that contain the topic (or variations of it, e.g. Bill Clinton and William Clinton)
- QA is then a matter of retrieving the most relevant topic document using an IR engine and returning the associated topic as the answer

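The idea of topic documents can be illustrated with a small sketch. It is not the implementation of Ahn and Webber: the hand-written variant lists, the function names (`build_topic_documents`, `best_topic`) and the term-count scoring, which stands in for a real IR engine, are all assumptions made for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

def build_topic_documents(sentences: List[str],
                          topic_variants: Dict[str, List[str]]) -> Dict[str, str]:
    """Join every sentence that mentions a topic (or a listed variant)
    into a single 'topic document' for that topic."""
    collected = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for topic, variants in topic_variants.items():
            if any(v.lower() in lowered for v in variants):
                collected[topic].append(sentence)
    return {topic: " ".join(parts) for topic, parts in collected.items()}

def best_topic(query_terms: List[str], topic_docs: Dict[str, str]) -> Optional[str]:
    """Return the topic whose document best matches the query.
    Plain term counting is used here in place of an IR engine."""
    def score(text: str) -> int:
        words = re.findall(r"\w+", text.lower())
        return sum(words.count(t.lower()) for t in query_terms)

    best, best_score = None, 0
    for topic, text in topic_docs.items():
        s = score(text)
        if s > best_score:
            best, best_score = topic, s
    return best

sentences = [
    "Bill Clinton was born in Hope, Arkansas.",
    "William Clinton served as the 42nd president.",
    "Tony Blair was born in Edinburgh.",
]
variants = {
    "Bill Clinton": ["Bill Clinton", "William Clinton"],
    "Tony Blair": ["Tony Blair"],
}
topic_docs = build_topic_documents(sentences, variants)
print(best_topic(["42nd", "president"], topic_docs))
```

Note that the retrieval step returns the topic itself, so no separate answer extraction pass over the text is needed in this setting.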
Let The Data Guide You
- A decade of recent QA research has yielded a lot of useful data
- We have lots of example questions (at least a few thousand just from TREC), each of which...
  - Has a known correct answer
  - Is associated with at least one answer bearing document
- We should use this data to guide new selection approaches
  - A simple approach would be to perform query expansion by looking for terms which are often associated with correct answers to certain question types (Derczynski et al., 2008)
  - Look for patterns in the answer bearing documents and index collections based on these patterns rather than words

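One simple way to mine such data for query expansion terms is sketched below. It is a rough illustration under stated assumptions, not the method of Derczynski et al.: the function name `expansion_terms`, the triple layout of the training data, the tiny stopword list and the example questions are all hypothetical. For each question type it counts, across answer bearing texts, the terms that do not already appear in the question.

```python
import re
from collections import Counter
from typing import FrozenSet, Iterable, List, Tuple

def expansion_terms(training: Iterable[Tuple[str, str, str]],
                    question_type: str,
                    k: int = 3,
                    stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """training holds (question_type, question_text, answer_bearing_text).
    Returns the k terms found in the most answer bearing texts for the
    given question type, excluding terms already in the question."""
    counts = Counter()
    for qtype, question, text in training:
        if qtype != question_type:
            continue
        q_terms = set(re.findall(r"\w+", question.lower()))
        doc_terms = set(re.findall(r"\w+", text.lower()))
        counts.update(doc_terms - q_terms - stopwords)
    # Sort by count, then alphabetically, so ties are broken deterministically.
    ranked = sorted(counts.items(), key=lambda p: (-p[1], p[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical training triples.
training = [
    ("WHEN", "When was Mozart born?", "Mozart was born in 1756 in Salzburg."),
    ("WHEN", "When was Einstein born?", "Einstein, born in 1879, was a physicist."),
    ("WHEN", "When did the war end?", "The war came to an end in 1945."),
    ("WHO", "Who wrote Hamlet?", "Hamlet was written by Shakespeare."),
]
stop = frozenset({"a", "an", "to"})

print(expansion_terms(training, "WHEN", k=1, stopwords=stop))
```

The selected terms would then be appended to the query for any new question of the same type before it is passed to the retrieval component.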
Answer By Understanding
- I’ve always been of the opinion that QA is intelligent IR
  - Where intelligence equates to some level of understanding
- This suggests we should index meaning, not just textual content
  - Take into account co-reference when selecting text passages
  - Indexing relations should allow for more focused selection
  - ‘Hybrid’ search that uses annotations and text (Bhagdev et al., 2008)

DISCUSSION
References
Kisuh Ahn and Bonnie Webber. 2008. Topic Indexing and Retrieval for Factoid QA. In Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA).
Ravish Bhagdev, Sam Chapman, Fabio Ciravegna, Vitaveska Lanfranchi and Daniela Petrelli. 2008. Hybrid Search: Effectively Combining Keywords and Semantic Searches. In Proceedings of the 5th European Semantic Web Conference, ESWC 08, Tenerife.
Leon Derczynski, Jun Wang, Robert Gaizauskas and Mark A. Greenwood. 2008. A Data Driven Approach to Query Expansion in Question Answering. In Proceedings of the 2nd Workshop on Information Retrieval for Question Answering (IR4QA).
Bert F. Green, Alice K. Wolf, Carol Chomsky and Kenneth Laughery. 1961. BASEBALL: An Automatic Question Answerer. In Proceedings of the Western Joint Computer Conference, volume 19.
Eduard Hovy, Laurie Gerber, Ulf Hermjakob, Michael Junk and Chin-Yew Lin. 2000. Question Answering in Webclopedia. In Proceedings of the 9th Text REtrieval Conference.
Ian Roberts and Robert Gaizauskas. 2004. Evaluating Passage Retrieval Approaches for Question Answering. In Proceedings of 26th European Conference on Information Retrieval (ECIR’04), University of Sunderland, UK.
Roger C. Schank and Robert Abelson. 1977. Scripts, Plans, Goals and Understanding. Hillsdale.
Stefanie Tellex, Boris Katz, Jimmy Lin, Aaron Fernandes and Gregory Marton. 2003. Quantitative Evaluation of Passage Retrieval Algorithms for Question Answering. In Proceedings of the Twenty-Sixth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, July.
William Woods. 1973. Progress in Natural Language Understanding - An Application to Lunar Geology. In AFIPS Conference Proceedings, volume 42.