Noun-Phrase Analysis in Unrestricted Text for Information Retrieval

David A. Evans, Chengxiang Zhai
Laboratory for Computational Linguistics, CMU
34th Annual Meeting of the Association for Computational Linguistics

2008. 08. 14.
Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright © 2008 by CEBT

Introduction
• The real challenge in IR
  – To understand and represent appropriately the content of a document and query
• An NLP task in IR
  – Extracting linguistic relations among terms
• Main point
  – An evaluation of noun-phrase analysis techniques to enhance phrase-based IR
      – Phrase-based indexing in the CLARIT system
  – Extraction of meaningful compounds from complex noun phrases using
      1. corpus statistics
      2. linguistic heuristics

Phrase-Based Indexing
• Single words are rarely specific enough to support accurate discrimination
  – Ex) Word-based indexing cannot distinguish the phrases “junior college” and “college junior”
• Related work
  – CLARIT [Evans et al., 1991]
      – uses simplex noun phrases
  – New York University’s TREC system [Strzalkowski et al., 1994]
      – uses head-modifier pairs
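The “junior college” vs. “college junior” point can be illustrated with a minimal sketch. The function names are illustrative and not part of CLARIT, and adjacent word pairs are used here only as a simple stand-in for phrase-based index terms.

```python
from collections import Counter

def word_index(text):
    """Bag-of-words index terms: word order is discarded."""
    return Counter(text.lower().split())

def pair_index(text):
    """Adjacent word pairs, a simple stand-in for phrase terms (an assumption)."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

a, b = "junior college", "college junior"
print(word_index(a) == word_index(b))  # True: the word-level indexes are identical
print(pair_index(a) == pair_index(b))  # False: the ordered pairs differ
```

Because the word-level index treats both phrases as the same set of terms, only an index that keeps some phrase structure can separate them.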

Phrase-Based Indexing (cont’d)
• In CLARIT, the phrase “the quality of surface of treated stainless steel strip” yields index terms such as
      – “treated stainless steel strip” : simplex noun phrase
      – “treated stainless steel” : simplex noun phrase
      – “stainless steel strip” : simplex noun phrase
      – “stainless steel” : simplex noun phrase
  But it cannot yield index terms such as
      – “stainless steel” : lexical atom
      – “strip surface”, “surface quality” : cross-preposition modification pairs
      – “stainless strip”, “steel strip”, “treated strip” : head-modifier pairs
• We aim to augment CLARIT indexing with…

Phrase-Based Indexing (cont’d)
• Four kinds of phrases
  – Lexical atoms
      – By creating new words, we can eliminate the effect of the independence assumption at the word level
      – ‘hot’ and ‘dog’ => ‘hot dog’
  – Phrases that reflect more general linguistic relations
      – Head-modifier pairs
      – Cross-preposition pairs
      – Subcompounds : simplex noun phrases
• It is meaningful to extract these four kinds of small compounds from a large unrestricted corpus
  – A step toward a shallow interpretation of noun phrases

Methodology
• Preprocessing
  – NP extraction by the CLARIT NLP module
• Parsing simplex NPs
  – Multiple phases
      – partial parsing and concatenating
• Generating candidates for all four kinds
  – Lexical atoms : already available
  – Head-modifier pairs : extracted based on the modification relation implied by the structure
  – Subcompounds : substructures of the NP
  – Cross-preposition pairs : generated by enumerating all possible pairs of NP heads
    => needs a more detailed explanation
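Head-modifier pair generation can be sketched as follows. This is only an illustration under stated assumptions: the NP structure is given as a hand-built nested binary tuple (modifier, head), the head of any group is taken to be its rightmost word, and the function names are hypothetical rather than taken from the CLARIT system.

```python
def head(node):
    """Head word of a node; assumes the head is the rightmost word."""
    return node if isinstance(node, str) else head(node[1])

def head_modifier_pairs(node):
    """Collect (modifier head, head) pairs from a binary NP structure."""
    if isinstance(node, str):
        return []
    modifier, main = node
    pairs = head_modifier_pairs(modifier) + head_modifier_pairs(main)
    pairs.append((head(modifier), head(main)))
    return pairs

# Hand-built structure for "general purpose high performance computer".
np_structure = (("general", "purpose"), (("high", "performance"), "computer"))
for pair in head_modifier_pairs(np_structure):
    print(" ".join(pair))
```

Each group contributes one pair linking the head of its modifier part to the head of its main part, so the pairs follow directly from the bracketing.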

Methodology (cont’d)
• Validity testing
  – Lexical atom
      – is difficult to recognize. Ex) ‘Wilson’s disease’
          · In medical documents : lexical atom
          · In news stories : not a lexical atom
      – has a strong association (e.g., proper names or technical terms) -> co-occurs as a phrase and is rarely separated
      – Detection is based on two heuristics
          1. The frequency of the pair must be higher than that of any other pair formed by either word with other words
          2. F(W1, W2) must be much higher than DF(W1, W2)
        => These heuristics do not resolve the difficulty noted above
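The two heuristics can be sketched in code. Several things here are assumptions rather than details from the slides: DF(W1, W2) is read as the count of the two words occurring together but not adjacent, “much higher” is modeled by a hypothetical `ratio` parameter, and all counts are made up for illustration.

```python
def is_lexical_atom(pair, adjacent_freq, separated_freq, ratio=5.0):
    """Apply both heuristics to a candidate word pair.

    adjacent_freq:  dict (w1, w2) -> count of adjacent occurrences, F.
    separated_freq: dict (w1, w2) -> assumed reading of DF (non-adjacent co-occurrence).
    ratio:          hypothetical factor standing in for "much higher".
    """
    w1, w2 = pair
    f = adjacent_freq.get(pair, 0)
    if f == 0:
        return False
    # Heuristic 1: the pair must outrank every other pair sharing either word.
    rivals = [count for other, count in adjacent_freq.items()
              if other != pair and (w1 in other or w2 in other)]
    if any(count >= f for count in rivals):
        return False
    # Heuristic 2: F(W1, W2) must be much higher than DF(W1, W2).
    return f > ratio * separated_freq.get(pair, 0)

# Made-up counts for illustration only.
adjacent = {("hot", "dog"): 40, ("hot", "water"): 12, ("big", "dog"): 9}
separated = {("hot", "dog"): 2, ("hot", "water"): 6}
print(is_lexical_atom(("hot", "dog"), adjacent, separated))    # True
print(is_lexical_atom(("hot", "water"), adjacent, separated))  # False
```

Since both tests depend on corpus counts, the same pair can pass in one collection and fail in another, which matches the ‘Wilson’s disease’ observation.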

Methodology (cont’d)
• Validity testing (cont’d)
  – Bottom-Up Association-Based Parsing (multiple phases)
      – Uses the most recently created lexicon of lexical atoms (reuse)
      – Groups words together to discover the most restrictive structure of an NP
      – Ex) Evidence
          · “high performance” : more reliable association
          · “general purpose” : less reliable association
        Phrase : “general purpose high performance computer”
        Multiple phases of grouping
          · General purpose high performance computer =>
          · General purpose [high=performance] computer =>
          · [General=purpose] [high=performance] computer =>
          · [General=purpose] [[high=performance]=computer] =>
          · [[General=purpose]=[[high=performance]=computer]]
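The grouping sequence for “general purpose high performance computer” can be reproduced with a greedy sketch. This is not the authors' implementation: it merges one adjacent pair per step, scores adjacent units by their rightmost (head) words, and uses made-up scores and a hypothetical default for unseen pairs.

```python
def head(unit):
    """Head word of a unit; assumes the head is the rightmost word."""
    return unit if isinstance(unit, str) else head(unit[1])

def render(unit):
    """Render a unit in the bracket notation used on the slide."""
    if isinstance(unit, str):
        return unit
    return "[" + render(unit[0]) + "=" + render(unit[1]) + "]"

def parse_np(words, scores, default=1.0):
    """Greedy bottom-up grouping; a smaller score means a stronger association."""
    units = list(words)
    steps = []
    while len(units) > 1:
        # Score every adjacent pair of units by the head words of the two units.
        costs = [scores.get((head(a), head(b)), default)
                 for a, b in zip(units, units[1:])]
        i = costs.index(min(costs))
        units[i:i + 2] = [(units[i], units[i + 1])]  # merge the strongest pair
        steps.append(" ".join(render(u) for u in units))
    return units[0], steps

# Made-up association scores for illustration only.
scores = {
    ("high", "performance"): 0.1,
    ("general", "purpose"): 0.2,
    ("performance", "computer"): 0.3,
    ("purpose", "computer"): 0.5,
    ("purpose", "high"): 0.9,
}
tree, steps = parse_np("general purpose high performance computer".split(), scores)
for step in steps:
    print(step)
```

With these scores the four printed steps follow the same order as the grouping phases listed on the slide, ending in `[[general=purpose]=[[high=performance]=computer]]`.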

Methodology (cont’d)
• Scoring the word pairs (smaller score == stronger association)
  – Lexical atom : 0 (the strongest association)
  – Adverb + adjective, past participle, or progressive verb : 0
  – Syntactically impossible pairs (e.g., noun + adjective) : 100 (the weakest association)
  – Other pairs are scored according to the formulas
• Threshold : 0.7
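The fixed-score rules can be sketched as a small dispatch function. The tag names are hypothetical, and because the slides do not give the statistical formulas, the sketch takes that part as a caller-supplied function instead of guessing at it.

```python
# Hypothetical tag names; the slides do not specify a tag set.
IMPOSSIBLE_TAG_PAIRS = {("noun", "adj")}  # the slide's example of an impossible pair
ZERO_SCORE_SECOND_TAGS = {"adj", "past_participle", "progressive_verb"}

def score_pair(w1, w2, tag1, tag2, lexical_atoms, statistical_score):
    """Return an association score; smaller means stronger."""
    if (w1, w2) in lexical_atoms:
        return 0.0    # lexical atoms get the strongest score
    if tag1 == "adv" and tag2 in ZERO_SCORE_SECOND_TAGS:
        return 0.0    # adverb + adjective / past participle / progressive verb
    if (tag1, tag2) in IMPOSSIBLE_TAG_PAIRS:
        return 100.0  # syntactically impossible pairs get the weakest score
    return statistical_score(w1, w2)  # formulas not given in the slides

atoms = {("hot", "dog")}
placeholder = lambda w1, w2: 0.5  # stand-in only, not the paper's formula
print(score_pair("hot", "dog", "adj", "noun", atoms, placeholder))       # 0.0
print(score_pair("highly", "treated", "adv", "past_participle", atoms, placeholder))  # 0.0
print(score_pair("steel", "stainless", "noun", "adj", atoms, placeholder))  # 100.0
print(score_pair("steel", "strip", "noun", "noun", atoms, placeholder))     # 0.5
```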

Experiment
• Evaluation by indexing documents in an actual retrieval task
  – PES : Phrase Extraction System
  – Baseline : standard CLARIT
• Corpus
  – Associated Press newswire stories (AP89: 240 MB)
      – 3 million simplex NPs
• Queries
  – TREC 51-100

Conclusion
• Generating phrase associations using
  – Linguistic heuristics
  – Locality scoring along with statistics
• Showing a positive effect from the use of lexical atoms and other phrase associations across NPs
