Comparing Document Segmentation for Passage Retrieval in Question Answering
Jörg Tiedemann, University of Groningen
Presented by: Moy'awiah Al-Shannaq (malshann@kent.edu)
December 05, 2011
Outline
– Introduction
– Overview of passage retrieval module
– Strategies for passage retrieval in QA
– Document segmentation
– Passage retrieval in Joost
– Experiments: setup, results
– Conclusion
– Future work
– References
Introduction
– Information Retrieval (IR): the area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as searching structured storage, relational databases, and the World Wide Web.*
– Passage Retrieval: retrieves individual passages within documents (one or more sentences or paragraphs).
– Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
– Recall = (number of relevant documents retrieved) / (total number of relevant documents)
– Question Answering (QA): QA systems include a passage retrieval component to reduce the search space for the information extraction modules.
* http://en.wikipedia.org/wiki/Information_retrieval
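As a quick illustration of the two measures above, here is a minimal Python sketch that computes precision and recall over hypothetical sets of document IDs (the sets, names, and values are illustrative, not taken from the paper):

def precision(retrieved, relevant):
    # Fraction of retrieved documents that are relevant.
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    # Fraction of relevant documents that were retrieved.
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}   # hypothetical retrieval result
relevant = {"d2", "d4", "d7"}          # hypothetical relevance judgements
print(precision(retrieved, relevant))  # 2/4 = 0.5
print(recall(retrieved, relevant))     # 2/3 ≈ 0.67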
Passage Types
– Discourse passages: segmentation is based on document structure.
  – Problems with this approach often arise with special structures such as headers, lists and tables, which are easily mixed with other units such as proper paragraphs.
– Semantic passages: documents are split into semantically motivated units using some topical structure.
– Window-based passages: fixed or variable-sized windows are used to segment documents into smaller units.
  – In the simplest case, window-based passages have a fixed length and use non-overlapping parts of the document.
Passage Incorporation Approaches
Two approaches to incorporating passages in information retrieval can be distinguished:
1) Using passage-level evidence to improve document retrieval.
2) Using passages directly as the units to be retrieved.
The paper is interested in the second approach, in order to return small units in QA.
What are the differences between passage retrieval in QA and ordinary IR?
Passage retrieval in QA differs from ordinary IR in at least two points:
1) Queries are generated from user questions and not manually created as in standard IR.
2) The units to be retrieved are usually much smaller than documents in IR.
The division of documents into passages is crucial for two reasons:
1) The textual units have to be big enough to ensure that IR works properly.
2) They have to be small enough to enable efficient and accurate QA.
Strategies for Passage Retrieval in QA
– Search-time passaging: a two-step strategy of retrieving documents first and then selecting relevant passages within these documents.
  – Returns only one passage per relevant document.
– Index-time passaging: a one-step strategy that returns relevant passages from documents directly.
  – Allows multiple passages per relevant document to be returned.
In our QA system we adopt the second strategy, using a standard IR engine to match keyword queries generated from a natural language question with passages.
Document Segmentation
– The experiments work with Dutch data from the QA tasks at the Cross-Language Evaluation Forum (CLEF).
– The document collection used there consists of two daily newspapers from the years 1994 and 1995.
  – It includes about 190,000 documents (newspaper articles).
  – 4 million sentences, including approximately 80 million words.
  – The documents include additional markup to segment them into paragraphs.
– Document boundaries are defined as hard boundaries, i.e., passages may never span more than one document in the collection.
Document Segmentation Strategies
– Window-based passages: documents are split into passages of fixed size (in terms of number of sentences).
– Variable-sized arbitrary passages: passages may start at any sentence in each document and may have variable lengths.
  – This is implemented by adding redundant information to our standard IR index.
  – We create passages starting at every sentence in a document for each length defined.
– Sliding-window passages: a sliding-window approach also adds redundancy to the index by sliding over documents with a fixed-sized window.
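The following Python sketch illustrates the three segmentation strategies on a document represented as a list of sentences; it is a simplified illustration under assumed parameter choices (window size, passage lengths), not the paper's actual indexing code:

def fixed_windows(sentences, size):
    # Non-overlapping window-based passages of a fixed number of sentences.
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]

def sliding_windows(sentences, size, step=1):
    # Overlapping passages produced by sliding a fixed-size window over the document.
    last_start = max(len(sentences) - size, 0)
    return [sentences[i:i + size] for i in range(0, last_start + 1, step)]

def arbitrary_passages(sentences, lengths=(1, 2, 3, 4, 5)):
    # Variable-sized passages starting at every sentence, one per defined length;
    # this adds redundant entries to the index, as described above.
    return [sentences[i:i + n]
            for i in range(len(sentences))
            for n in lengths
            if i + n <= len(sentences)]

doc = ["s1", "s2", "s3", "s4", "s5"]   # a toy document of five sentences
print(len(fixed_windows(doc, 2)))      # 3 passages
print(len(sliding_windows(doc, 2)))    # 4 passages
print(len(arbitrary_passages(doc)))    # 15 passages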
Passage Retrieval in Joost
The Joost QA system includes two strategies:
1) A table-lookup strategy using fact databases that have been created off-line.
2) An on-line answer extraction strategy with passage retrieval and subsequent answer identification and ranking modules.
The paper focuses on the second strategy, in order to examine the passage retrieval component and its impact on QA performance.
Dutch CLEF Corpus
The contents of the CLEF dataset are evidently very diverse: most of the documents are very short, but the longest one contains 625 sentences.
Figure 1: Distribution of document sizes in terms of the sentences they contain in the Dutch CLEF corpus.
Dutch CLEF Corpus
Figure 2: Distribution of paragraph sizes in terms of sentences in the Dutch CLEF corpus.
Dutch CLEF Corpus
Figure 3: Distribution of paragraph sizes in terms of characters in the Dutch CLEF corpus.
Experiment Setup
– The entire Dutch CLEF document collection is used to create the index files with the various segmentation approaches.
– There are 777 questions; each question may have several answers.
– For each setting, 20 passages are retrieved per question using the same query generation strategy.
Evaluation Measures
1) Redundancy: the average number of passages retrieved per question that contain a correct answer.
2) Coverage: the percentage of questions for which at least one passage is retrieved that contains a correct answer.
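A minimal Python sketch of these two measures, assuming each question is represented by a list of booleans marking whether each retrieved passage contains a correct answer (the example data is made up):

def redundancy(per_question_hits):
    # Average number of answer-bearing passages retrieved per question.
    return sum(sum(hits) for hits in per_question_hits) / len(per_question_hits)

def coverage(per_question_hits):
    # Fraction of questions with at least one answer-bearing passage.
    return sum(1 for hits in per_question_hits if any(hits)) / len(per_question_hits)

# Three questions, five retrieved passages each (True = contains a correct answer).
hits = [
    [True, False, True, False, False],
    [False, False, False, False, False],
    [True, True, True, False, False],
]
print(redundancy(hits))  # (2 + 0 + 3) / 3 ≈ 1.67
print(coverage(hits))    # 2 / 3 ≈ 0.67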
Evaluation Measures (continued)
3) Mean reciprocal rank (MRR): the mean of the reciprocal rank of the first retrieved passage that contains a correct answer.
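Continuing the same illustrative representation as above, a sketch of mean reciprocal rank, where a question's reciprocal rank is 1/rank of the first answer-bearing passage and 0 if none was retrieved:

def mean_reciprocal_rank(per_question_hits):
    # Mean over questions of 1/rank of the first answer-bearing passage.
    total = 0.0
    for hits in per_question_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank
                break
    return total / len(per_question_hits)

hits = [
    [False, True, False],   # first hit at rank 2 -> 0.5
    [False, False, False],  # no hit -> 0.0
    [True, False, True],    # first hit at rank 1 -> 1.0
]
print(mean_reciprocal_rank(hits))  # (0.5 + 0.0 + 1.0) / 3 = 0.5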
Coverage and Redundancy
Figure 4: Coverage and redundancy of passages retrieved for various segmentation strategies.
Mean Reciprocal Ranks
Figure 5: Mean reciprocal ranks of passage retrieval (IR MRR) and question answering (QA MRR) for various segmentation strategies.
Conclusion
– Accurate passage retrieval is essential for question answering.
– Discourse-based segmentation into paragraphs works well with standard information retrieval techniques.
– Among the window-based approaches, a segmentation into overlapping passages of variable length performs best, in particular for passages with sizes of 1 to 10 sentences.
– Passage retrieval is more effective than full document retrieval.
Future Work
– Further improvement of discourse-based segmentation.
– Combining several retrieval settings using various segmentation approaches.
References
[1] J. P. Callan. Passage-level evidence in document retrieval. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 302–310, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
[2] CLEF. Multilingual question answering at CLEF. http://clef-qa.itc.it/, 2005.
[3] M. A. Greenwood. Using pertainyms to improve passage retrieval for questions requesting information about a location. In Proceedings of the Workshop on Information Retrieval for Question Answering (SIGIR 2004), Sheffield, UK, 2004.
[4] M. Kaszkiel and J. Zobel. Effective ranking with arbitrary passages. Journal of the American Society for Information Science, 52(4):344–364, 2001.
[5] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, and V. Rus. The structure and performance of an open-domain question answering system, 2000.
[6] I. Roberts and R. Gaizauskas. Evaluating passage retrieval approaches for question answering. In Proceedings of the 26th European Conference on Information Retrieval, 2004.
[7] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC-3. In Text REtrieval Conference, pages 21–30, 1992.
[8] http://en.wikipedia.org/wiki/Information_retrieval

Thank You