Using Semantic Relations to Improve Information Retrieval Tom Morton
Introduction NLP techniques have been largely unsuccessful at information retrieval. Why? Document retrieval has been the primary measure of information retrieval success. Document retrieval reduces the need for NLP techniques. Discourse factors can be ignored. Query words perform word-sense disambiguation. Lack of robustness: NLP techniques are typically not as robust as word indexing.
Introduction Paragraph retrieval for natural-language questions. Paragraphs can be influenced by discourse factors. Correctness of answers to natural language questions can be accurately determined automatically. Standard precursor to TREC question answering task. What NLP technologies might help at this information retrieval task and are they robust enough?
NLP Technologies Question Analysis: Questions tend to specify the semantic type of their answer. This component tries to identify this type. Named-Entity Detection: Named-entity detection determines the semantic type of proper nouns and numeric amounts in text.
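A minimal sketch of what question analysis might look like, assuming a toy rule-based classifier. The slides do not say how the real component works, so the rules, the `HEAD_NOUN_TYPES` lookup, and the function name `predict_answer_type` are all hypothetical; only the category labels come from the examples in this deck.

```python
# Hypothetical lookup from the noun after "what"/"which" to an answer type.
HEAD_NOUN_TYPES = {"party": "ORGANIZATION", "architect": "PERSON"}


def predict_answer_type(question):
    """Guess the semantic type of the answer from the question's first words."""
    words = question.lower().rstrip("?").split()
    if not words:
        return "UNKNOWN"
    if words[:2] == ["how", "long"]:
        return "DURATION"
    if words[0] == "who" and len(words) > 1 and words[1] in ("is", "was"):
        return "DESCRIPTION"
    if words[0] in ("what", "which") and len(words) > 1:
        return HEAD_NOUN_TYPES.get(words[1], "UNKNOWN")
    return "UNKNOWN"


for q in ("What party is John Major in?",
          "How long was Margaret Thatcher the prime minister?",
          "Who is Frank Lloyd Wright?"):
    print(q, "->", predict_answer_type(q))
```

A real component would need far broader coverage than three rules; the point is only that the question alone often determines the answer type.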
How do these technologies help? Question Analysis: The predicted category is appended to the question. Named-Entity Detection: The NE categories found in the text are included as new terms. This approach requires the additional question terms to be present in the paragraph. Example question: What party is John Major in? (ORGANIZATION) Example paragraph: It probably won't be clear for some time whether the Conservative Party has chosen in John Major a truly worthy successor to Margaret Thatcher, who has been a giant on the world stage. Terms added to the paragraph: +ORGANIZATION +PERSON
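The matching idea on this slide can be sketched as plain term overlap. This is an illustration under stated assumptions, not the system's actual search engine: `tokenize`, `index_terms`, `query_terms`, and `overlap` are hypothetical names, and category terms are kept upper-case so that they cannot collide with ordinary lower-cased words.

```python
import re


def tokenize(text):
    """Lower-cased word terms of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def index_terms(paragraph, ne_categories):
    """Paragraph words plus the NE categories detected in it (upper-case)."""
    return tokenize(paragraph) | {c.upper() for c in ne_categories}


def query_terms(question, answer_type):
    """Question words plus the predicted answer category (upper-case)."""
    return tokenize(question) | {answer_type.upper()}


def overlap(q_terms, p_terms):
    """Terms shared by the question and the paragraph."""
    return q_terms & p_terms


paragraph = ("It probably won't be clear for some time whether the "
             "Conservative Party has chosen in John Major a truly worthy "
             "successor to Margaret Thatcher, who has been a giant on the "
             "world stage.")
q = query_terms("What party is John Major in?", "ORGANIZATION")
plain = overlap(q, index_terms(paragraph, []))
augmented = overlap(q, index_terms(paragraph, ["ORGANIZATION", "PERSON"]))
print(sorted(augmented - plain))
```

With the NE categories indexed, the paragraph matches one more query term than it does on words alone, which is the effect the slide describes.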
NLP Technologies Coreference Relations: Interpretation of a paragraph may depend on the context in which it occurs. Syntactically-based Categorical Relation Extraction: Appositive and predicate nominative constructions provide descriptive terms about entities.
How do these technologies help? Coreference: Use coreference relationships to introduce new terms that are referred to but not present in the paragraph’s text. Example question: How long was Margaret Thatcher the prime minister? (DURATION) Example paragraph: The truth, which has been added to over each of her 11 1/2 years in power, is that they don't make many like her anymore. Terms added to the paragraph: +MARGARET +THATCHER +PRIME +MINISTER +DURATION
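A rough sketch of coreference-based term expansion, assuming the coreference chains have already been computed. The `chains` mapping, the tiny `PRONOUNS` and `STOPWORDS` sets, and the chain contents in the example are all hypothetical. Keying chains by mention text is a simplification, since a real system would key them by mention position.

```python
PRONOUNS = {"he", "she", "it", "her", "him", "his", "hers", "its"}
STOPWORDS = {"the", "a", "an"}


def coreference_terms(paragraph_mentions, chains):
    """Collect content words from every mention coreferent with a mention
    in the paragraph, so they can be added as extra index terms."""
    added = set()
    for mention in paragraph_mentions:
        for other in chains.get(mention, []):
            for word in other.lower().split():
                if word not in PRONOUNS and word not in STOPWORDS:
                    added.add(word.upper())
    return added


# Hypothetical chain: "her" in the paragraph refers to an entity that is
# named elsewhere in the document.
chains = {"her": ["Margaret Thatcher", "the prime minister", "her"]}
print(sorted(coreference_terms(["her"], chains)))
```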
How do these technologies help? Categorical Relation Extraction: Identifies the DESCRIPTION category. Allows descriptive terms to be used in term expansion. Example questions: Who is Frank Lloyd Wright? (DESCRIPTION) What architect designed Robie House? (PERSON) Example text: Famed architect Frank Lloyd Wright… (added term: +DESCRIPTION) Buildings he designed include the Guggenheim Museum in New York and Robie House in Chicago. (added terms: +FRANK +LLOYD +WRIGHT +FAMED +ARCHITECT)
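To make the two constructions concrete, here is a toy extractor for appositives ("X, a Y,") and predicate nominatives ("X is a Y."). The source describes the extraction as syntactically based; the regular expressions below are a deliberately crude stand-in, and the pattern names, `extract_descriptions`, and the example sentences are hypothetical.

```python
import re

# A capitalized name, then a lower-case description after a/an/the.
NAME = r"(?P<name>(?:[A-Z][a-z]+ )*[A-Z][a-z]+)"
DESC = r"(?:a|an|the) (?P<desc>[a-z]+(?: [a-z]+)*?)"

APPOSITIVE = re.compile(NAME + r", " + DESC + r"[,.]")
PREDICATE_NOMINATIVE = re.compile(NAME + r" (?:is|was) " + DESC + r"[,.]")


def extract_descriptions(sentence):
    """Return (entity name, descriptive phrase) pairs found in a sentence."""
    pairs = []
    for pattern in (APPOSITIVE, PREDICATE_NOMINATIVE):
        for match in pattern.finditer(sentence):
            pairs.append((match.group("name"), match.group("desc")))
    return pairs


print(extract_descriptions(
    "Frank Lloyd Wright, a famed architect, designed Robie House."))
print(extract_descriptions("Frank Lloyd Wright was an architect."))
```

Each extracted description can then supply expansion terms for the entity, and its presence marks the text as a candidate answer for DESCRIPTION questions.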
Conclusion Developed and evaluated new techniques in: Coreference Resolution. Categorical Relation Extraction. Question Analysis. Integrated these techniques with existing NLP components: NE detection, POS tagging, sentence detection, etc. Demonstrated that these techniques can be used to improve performance in an information retrieval task. Paragraph retrieval for natural language questions.
System overview [Diagram with two stages, Indexing and Retrieval. Labels shown for indexing: Documents, Pre-processing, NE Detection, Coreference Resolution, Categorical Relation Extraction, Paragraphs+. Labels shown for retrieval: Question, Question Analysis, Search Engine, Paragraphs+, Paragraphs.]
Will it work? Will these semantic relations improve paragraph retrieval? Are the implementations robust enough to see a benefit across large document collections and question sets? Are there enough questions where these relationships are required to find an answer? Questions need only be answered once. Short answer: Yes!
How does it work? Coreference: Use the approach described in ACL (Morton 2000). Divide referring expressions into three classes and create a separate resolution approach for each. Singular third-person pronouns: statistical. Proper nouns: rule-based. Definite noun phrases: rule-based. Apply the resolution approaches to the text in an interleaved fashion.
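The interleaving can be sketched as a single pass over mentions in document order, with each mention dispatched to the resolver for its class, so that every resolver sees the entities built so far by the others. Everything below is a simplified assumption: the classifier, the head-word matching rule, and especially `resolve_pronoun`, which is only a placeholder for the statistical model and simply picks the most recently created entity.

```python
SINGULAR_PRONOUNS = {"he", "him", "his", "she", "her", "hers", "it", "its"}


def mention_class(mention):
    """Assign a mention to one of the three classes (toy heuristics)."""
    words = mention.split()
    if mention.lower() in SINGULAR_PRONOUNS:
        return "pronoun"
    if all(w[0].isupper() for w in words):
        return "proper"
    if words[0].lower() == "the":
        return "definite_np"
    return "other"


def resolve_by_head(mention, entities):
    """Rule-based stand-in: link to an entity sharing the same last word."""
    head = mention.split()[-1].lower()
    for entity in reversed(entities):
        if any(m.split()[-1].lower() == head for m in entity):
            return entity
    return None


def resolve_pronoun(mention, entities):
    """Placeholder for the statistical model: most recently created entity."""
    return entities[-1] if entities else None


RESOLVERS = {"pronoun": resolve_pronoun,
             "proper": resolve_by_head,
             "definite_np": resolve_by_head}


def resolve_all(mentions):
    """One interleaved pass; each entity is a list of its mention strings."""
    entities = []
    for mention in mentions:
        resolver = RESOLVERS.get(mention_class(mention))
        target = resolver(mention, entities) if resolver else None
        if target is None:
            entities.append([mention])
        else:
            target.append(mention)
    return entities


print(resolve_all(["Margaret Thatcher", "she", "the Conservative Party",
                   "Thatcher", "the party"]))
```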
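Coreference [Diagram: the pronoun "she" is compared against candidate entities and their extents (John Major, a truly worthy…; Margaret Thatcher, her, …; The Conservative Party; the undoubted exception; Winston Churchill), with the values 20%, 70%, 10%, and 5% shown.] The pronoun is resolved to an entity rather than to the most recent extent.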
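The difference between resolving to the most recent extent and resolving to an entity can be illustrated as follows. The extent positions and the scores are made up for illustration and are not the figures from the slide, whose mapping to candidates is not recoverable here; `most_recent_extent` and `resolve_to_entity` are hypothetical names.

```python
def most_recent_extent(entities):
    """Baseline: pick the entity that owns the latest extent before the pronoun."""
    return max(entities, key=lambda name: max(entities[name]))


def resolve_to_entity(entities, scores):
    """Pick the entity with the highest model score, regardless of recency."""
    return max(entities, key=lambda name: scores[name])


# Entity name -> positions of its extents in the text (hypothetical).
entities = {
    "John Major": [0],
    "Margaret Thatcher": [1, 2],
    "Winston Churchill": [4],
}
# Hypothetical per-entity scores for the pronoun "she".
scores = {"John Major": 0.15, "Margaret Thatcher": 0.60, "Winston Churchill": 0.25}

print(most_recent_extent(entities))
print(resolve_to_entity(entities, scores))
```

Because each entity is scored as a whole, an entity introduced early in the text can still win over one that was mentioned more recently.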
Paragraph Retrieval Results
Future Work Extend answer categories and named-entity detection to include new types. Develop a completely statistical coreference resolution mechanism. Re-run the paragraph retrieval evaluation.