Noun-Phrase Analysis in Unrestricted Text for Information Retrieval. David A. Evans, Chengxiang Zhai. Laboratory for Computational Linguistics, CMU.

Presentation transcript:

Noun-Phrase Analysis in Unrestricted Text for Information Retrieval
David A. Evans, Chengxiang Zhai
Laboratory for Computational Linguistics, CMU
34th Annual Meeting of the Association for Computational Linguistics
Presented by Jaehui Park, IDS Lab., Seoul National University

Copyright  2008 by CEBT Introduction  The real challenge in IR To understand and represent appropriately the content of a document and query  An NLP task in IR Extracting linguistic relations among terms  Main point An evaluation of noun-phrase analysis techniques to enhance phrase-based IR – Phrase-based indexing in the CLARIT system Extraction of meaningful compounds from complex noun phrases using 1. corpus statistics 2. linguistic heuristics 2

Copyright  2008 by CEBT Phrase-Based Indexing  Single words are rarely specific enough to support accurate discrimination Ex) word-based indexing cannot distinguish the phrases : “junior college” vs. “college junior”  Related work CLARIT [David et al., 1991] – uses simplex noun phrases New York University’s TREC system [Strzalkowski et al. 1994] – uses head-modifier pairs 3

Copyright  2008 by CEBT Phrase-Based Indexing (cont’d)  In CLARIT, the phrase “the quality of surface of treated stainless steel strip” yields index terms such as – “treated stainless steel strip” : simplex noun phrase – “treated stainless steel” : simplex noun phrase – “stainless steel strip” : simplex noun phrase – “stainless steel” : simplex noun phrase But cannot yields index term such as – “stainless steel” : lexical atom – “strip surface”, “surface quality” : cross-preposition modification pair – “stainless strip”, “steel strip”, “treated strip” : head modifier pair  We aim to augment CLARIT indexing with… 4

Copyright  2008 by CEBT Phrase-Based Indexing (cont’d)  Four kinds of phrases Lexical atoms – By creating new words, we can eliminate the effect of the independence assumption at the word level – ‘hot’ and ‘dog’ => ‘hot dog’ One that reflect more general linguistic relations – Head modifier pairs – Cross-preposition pairs – Subcompounds : simplex noun phrase  It is meaningful to extract the above four small compounds from a large unrestricted corpus A step toward a shallow interpretation of noun phrases 5

Copyright  2008 by CEBT Methodology  Preprocessing NP extraction by CLARIT NLP module  Parsing simplex NPs Multiple phase – partial parsing and concatenating  Generating candidates for all four kinds Lexical atoms : already available Head-modifier pairs are extracted based on the modification relation implied by the structure Subcompounds : substructures of the NP Cross-preposition pairs : are generated by enumerating all possible pairs of the head of each NP => need more detail explanations 6

Copyright  2008 by CEBT Methodology (cont’d)  Validity testing Lexical atom – is difficult to recognize. Ex) ‘Wilson’s disease’  In a medical docs : lexical atom  In a news stories : not a lexical atom – has strong association. (ex. proper names or technical terms) -> co-occurs as a phrase and is rarely separated – Detection based on the heuristics 1. “It is required that the frequency of the pair to be higher than the other pair that is formed by either word with other words” 2. “It is required that F(W1, W2) is much higher than DF(W1, W2)” =>These are not the answers to the difficulty 7

Copyright  2008 by CEBT Methodology (cont’d)  Validity testing (cont’d) Bottom-Up Association-Based Parsing (multiple phase) – Using the most recently created lexicon of lexical atoms (:reusing) – Grouping words together discover the most restrictive structure of a NP – Ex) Evidences  “high performance” : more reliable association  “general purpose” : less reliable association Phrase : “general purpose high performance computer” Multiple phases of grouping  General purpose high performance computer =>  General purpose [high=performance] computer =>  [General=purpose] [high=performance] computer =>  [General=purpose] [[high=performance]=computer] =>  [[General=purpose]=[[high=performance]=computer]] 8

Copyright  2008 by CEBT Methodology (cont’d)  Scoring the word pairs (smaller score == stronger association) Lexical atom : 0 (the highest) Adverb + Adjective, Past participle or Progressive verb : 0 Syntactically impossible pairs (ex. noun adj): 100 (the lowest) Others are scored according to the formulas  Threshold : 0.7 9

Copyright  2008 by CEBT Experiment  Evaluation by indexing document in an actual retrieval task PES : Phrase Extraction System Baseline : standard CLARIT  Corpus Associated Press newswire stories (AP98: 240MB) – 3-million simplx NPs  Queries TREC

Copyright  2008 by CEBT Result  Recall Slightly improved (1%)  Precision 11

Copyright  2008 by CEBT Conclusion  Generating phrase association using Linguistic heuristics Locality scoring along with statistics  Showing positive effect of the use of lexical atoms and other phrase association across NPs 12