1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005 Preslav Nakov EECS, Computer Science Division University.

Presentation transcript:

1 Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
CoNLL-2005
Preslav Nakov, EECS, Computer Science Division, University of California, Berkeley
Marti Hearst, SIMS, University of California, Berkeley

2 Outline
- Introduction
- Related Work
- Models and Features

3 Introduction
- Noun compound bracketing -> noun compound interpretation
- liver cell antibody: [[liver cell] antibody] (left bracketing)
- liver cell line: [liver [cell line]] (right bracketing)
- Same POS sequence, different syntactic trees

4 This Paper
- A highly accurate unsupervised method for making bracketing decisions for noun compounds (NCs)
- Current practice: bigram estimates are used to compute adjacency and dependency scores
- Improvements:
  - the χ² measure
  - a new set of surface features for querying Web search engines
- Evaluated on two domains: encyclopedia and bioscience

5 Related Work
- NC syntax and semantics
  - Still active -> J. of Computer Speech and Language, Special Issue on Multiword Expressions
- Adjacency model
- Probabilistic dependency model, Lauer (1995)
  - Data sparseness (use categories instead)
  - 244 NCs from encyclopedia
  - Inter-annotator agreement 81.5%
  - Baseline 66.8% -> 77.5%
  - Adding POS -> state-of-the-art result of 80.7%

6 2003~2005
- Keller and Lapata (2003)
  - Use Web search engines to obtain frequencies for unseen bigrams
  - (2004) applied to six NLP tasks, including disambiguation of NCs
  - Simpler version (use frequency only) %
- Girju et al. (2005)
  - Supervised (decision tree), 5 WordNet semantic features
  - 83.1%

7 Models and Features
- Adjacency and dependency models
- 1. Right bracketing, w1 w2 w3 -> [w1 [w2 w3]], can arise for two reasons:
  - w2 w3 is a compound (modified by w1): home health care. The adjacency model checks this.
  - w1 and w2 independently modify w3: adult male rat. The dependency model (better) checks this.
- 2. Left bracketing -> only one choice: [law enforcement] agent
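The two models can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual formulation in which the adjacency model compares Assoc(w1, w2) with Assoc(w2, w3), while the dependency model compares Assoc(w1, w2) with Assoc(w1, w3). The association function, the toy counts, and the rule that ties go to "left" are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Any association score works here: raw bigram counts, probabilities, chi-square.
Assoc = Callable[[str, str], float]


def adjacency_model(assoc: Assoc, w1: str, w2: str, w3: str) -> str:
    """Compare Assoc(w1, w2) with Assoc(w2, w3)."""
    # Assumption: ties are resolved as "left".
    return "left" if assoc(w1, w2) >= assoc(w2, w3) else "right"


def dependency_model(assoc: Assoc, w1: str, w2: str, w3: str) -> str:
    """Compare Assoc(w1, w2) with Assoc(w1, w3)."""
    # Assumption: ties are resolved as "left".
    return "left" if assoc(w1, w2) >= assoc(w1, w3) else "right"


# Hypothetical bigram counts, invented purely for illustration.
TOY_COUNTS: Dict[Tuple[str, str], int] = {
    ("adult", "male"): 150,
    ("male", "rat"): 100,
    ("adult", "rat"): 400,
    ("law", "enforcement"): 900,
    ("enforcement", "agent"): 300,
    ("law", "agent"): 40,
}


def toy_assoc(a: str, b: str) -> float:
    """Association score backed by the toy table; unseen pairs score 0."""
    return float(TOY_COUNTS.get((a, b), 0))


if __name__ == "__main__":
    for nc in [("adult", "male", "rat"), ("law", "enforcement", "agent")]:
        print(nc, adjacency_model(toy_assoc, *nc), dependency_model(toy_assoc, *nc))
```

With these invented counts the two models disagree on "adult male rat": only the dependency model sees that "adult" attaches to "rat" rather than to "male", which is the case the slide says it checks.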

8 Computing Probabilities Alternative Calculations

9 χ² measure
- A = #(w_i, w_j)
- B = #(w_i) - A
- C = #(w_j) - A
- D = N - A - B - C
- N = 8T = Google's 8B pages x 1000 words/page
- χ² is better than MI (Yang and Pedersen, 1997)
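The χ² formula itself did not survive in the transcript. The sketch below assumes the standard 2x2 contingency-table form, χ² = N(AD - BC)² / ((A+C)(B+D)(A+B)(C+D)), with A taken to be the co-occurrence count #(w_i, w_j). The function name and the zero-denominator fallback are choices made for this example.

```python
def chi_square(count_i: float, count_j: float, count_ij: float,
               n: float = 8e12) -> float:
    """Chi-square association between two words from (estimated) counts.

    count_i, count_j: frequencies of w_i and w_j
    count_ij:         frequency of the bigram "w_i w_j" (cell A)
    n:                total number of bigrams; the slide estimates
                      8 billion pages x 1000 words/page = 8 trillion
    """
    a = count_ij
    b = count_i - a          # w_i not followed by w_j
    c = count_j - a          # w_j not preceded by w_i
    d = n - a - b - c        # neither word
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Assumption: treat a degenerate table as "no association".
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


if __name__ == "__main__":
    # Small hand-checkable table: A=20, B=10, C=20, D=50, N=100.
    print(chi_square(30, 40, 20, n=100))
```

For bracketing, the score would be computed once per word pair and then plugged into the adjacency or dependency comparison in place of raw frequencies.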

10 蛋包飯 (omelet rice: 蛋 = egg, 包 = wrap, 飯 = rice)
- 蛋
- 蛋包 2217
- 包
- 包飯 3398
- 飯
- χ²: 包飯 > 蛋包 67.32

11 Web-Derived Surface Features (1/2)
- Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning.
- Dash (hyphen)
  - Left bracketing: cell cycle analysis -> cell-cycle analysis
  - Right bracketing (less reliable): donor T-cell
  - fiber optics-system
  - t-cell-depletion
- Possessive marker
  - brain's stem cells (right)
  - brain stem's cells (left)
  - brain's stem-cells
- Internal capitalization
  - Plasmodium vivax Malaria (left), brain Stem cells (right)
  - This feature is disabled for Roman digits and single-letter words: vitamin D deficiency

12 Web-Derived Surface Features (2/2)
- Embedded slashes: leukemia/lymphoma cell
- Brackets: growth factor (beta) or (growth factor) beta; (brain) stem cells
- A comma, a dot or a colon: "health care, provider" or "lung cancer: patients" (weak indicator)
- Dash to an external word: mouse-brain stem cells (weak indicator)
- Unfortunately, Web search engines (SE) ignore punctuation characters: hyphens, brackets, apostrophes, etc.
  - Such features are collected indirectly, by post-processing the resulting summaries (up to 1000 results)
- Some of the above features are clearly more reliable than others, but we do not try to weight them
- Feature verification: surface-feature counts come from the 1000 summaries, whereas the counts returned by the SE (page hits) serve as a proxy for n-gram frequencies
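Since search engines ignore punctuation, the surface features have to be counted in the returned summaries. A minimal sketch of that post-processing step for the hyphen feature alone is shown below. The function name, the regular expressions, and the example snippets are assumptions made for illustration, not the patterns used in the paper.

```python
import re
from typing import Iterable, Tuple


def hyphen_votes(w1: str, w2: str, w3: str,
                 snippets: Iterable[str]) -> Tuple[int, int]:
    """Count hyphen evidence for left vs. right bracketing in snippets.

    "w1-w2 w3" counts as evidence for left bracketing,
    "w1 w2-w3" as evidence for right bracketing.
    Forms with both hyphens ("w1-w2-w3") match neither pattern.
    """
    e1, e2, e3 = (re.escape(w) for w in (w1, w2, w3))
    left_pat = re.compile(rf"\b{e1}-{e2}\s+{e3}\b", re.IGNORECASE)
    right_pat = re.compile(rf"\b{e1}\s+{e2}-{e3}\b", re.IGNORECASE)
    left = right = 0
    for text in snippets:
        left += len(left_pat.findall(text))
        right += len(right_pat.findall(text))
    return left, right


if __name__ == "__main__":
    # Hypothetical summaries standing in for search-engine results.
    summaries = [
        "Cell-cycle analysis of yeast mutants",
        "a cell cycle-analysis toolkit",
        "cell-cycle analysis reveals arrest",
        "cell cycle analysis without any marker",
    ]
    print(hyphen_votes("cell", "cycle", "analysis", summaries))
```

The other markers (possessives, capitalization, slashes, brackets) could be handled the same way, one pattern pair per feature, with each feature casting an unweighted vote.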

13 Other Web-Derived Features
- Abbreviations
  - tumor necrosis factor (NF)
  - tumor necrosis (TN) factor
- Concatenation
  - health care reform -> healthcare, carereform
- Wildcard (*)
  - "health care * reform"
  - "health * care reform"
- Reorder
  - reform health care, care reform health
  - myosin heavy chain, heavy chain myosin
- Internal inflection variability
  - tyrosine kinase activation, tyrosine kinases activation
- Switching
  - For "adult male rat", we would also expect "male adult rat".
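The concatenation, wildcard and reorder features amount to generating query strings from the three words and comparing their frequencies. The sketch below builds those strings. The function name and the grouping into "left" and "right" evidence are assumptions for this example, based on which word pair each variant keeps together.

```python
from typing import Dict, List


def query_variants(w1: str, w2: str, w3: str) -> Dict[str, Dict[str, List[str]]]:
    """Build query strings whose frequencies can be compared pairwise.

    Variants that keep w1 and w2 together are listed under "left",
    variants that keep w2 and w3 together under "right".
    """
    return {
        "concatenation": {
            "left": [f"{w1}{w2}"],
            "right": [f"{w2}{w3}"],
        },
        "wildcard": {
            "left": [f'"{w1} {w2} * {w3}"'],
            "right": [f'"{w1} * {w2} {w3}"'],
        },
        "reorder": {
            "left": [f'"{w3} {w1} {w2}"'],
            "right": [f'"{w2} {w3} {w1}"'],
        },
    }


if __name__ == "__main__":
    for feature, sides in query_variants("health", "care", "reform").items():
        print(feature, sides)
```

Each feature then votes for whichever side returns the higher count. How the counts are obtained is left out here, since it depends on the search engine being queried.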

14 New Findings

15 Paraphrases
- Warren (1978) proposes paraphrases such as:
  - stem cells in the brain
  - cells from the brain stem
- Copula paraphrase
  - office building that/which is a skyscraper
- pain associated with arthritis migraine
- Search engines lack linguistic annotations, so a small set of hand-chosen paraphrases is used:
  associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
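The paraphrase feature can be sketched as query generation over the hand-chosen patterns. The templates below are assumptions inferred from the slide's examples: "w3 PATTERN w1 w2" keeps w1 w2 together (as in "pain associated with arthritis migraine"), while "w2 w3 PATTERN w1" keeps w2 w3 together (as in "stem cells in the brain", here without the determiner). Only a subset of the slide's patterns is listed.

```python
from typing import Dict, Iterable, List

# A subset of the hand-chosen patterns listed on the slide.
PATTERNS = ["associated with", "caused by", "derived from", "found in"]


def paraphrase_queries(w1: str, w2: str, w3: str,
                       patterns: Iterable[str] = PATTERNS) -> Dict[str, List[str]]:
    """Generate paraphrase queries supporting each bracketing.

    left:  "w3 PATTERN w1 w2"   (w1 and w2 stay together)
    right: "w2 w3 PATTERN w1"   (w2 and w3 stay together)
    Simplification: no determiners or inflected forms are generated.
    """
    patterns = list(patterns)
    return {
        "left": [f"{w3} {p} {w1} {w2}" for p in patterns],
        "right": [f"{w2} {w3} {p} {w1}" for p in patterns],
    }


if __name__ == "__main__":
    queries = paraphrase_queries("arthritis", "migraine", "pain")
    print(queries["left"][0])
    print(queries["right"][0])
```

Summing the hit counts on each side gives one more vote to combine with the n-gram and surface-feature evidence.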

16 Evaluations
- Lauer's Dataset (1995)
  - 244 unambiguous 3-noun NCs
- Biomedical Dataset (Nakov et al., 2005, SIG BioLink)
  - Open NLP tools: sentence split, tokenized, POS tagged and shallow parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003)
  - 500 NCs: 361 left, 69 right, 70 ambiguous

17 Experiments
- Used MSN Search statistics for the n-grams and the paraphrases (unless the pattern contained a "*")
  - MSN always returned exact numbers
- Used Google for the surface features
- Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)

18 Tools Mentioned
- UMLS Specialist lexicon: provides the different spellings of biomedical-domain words
  mlslex.html
- Carroll's morphological tools
  morph.html

19 UMLS Lexicon
{base=AAA
entry=E
cat=noun
variants=metareg
variants=uncount
acronym_of=abdominal aortic aneurysmectomy|E
acronym_of=acne-associated arthritis|E
acronym_of=acquired aplastic anemia|E
acronym_of=acute anxiety attack|E
acronym_of=androgenic anabolic agent|E
acronym_of=aneurysm of ascending aorta
acronym_of=aromatic amino acid|E
acronym_of=acute apical abscess|E
abbreviation_of=abdominal aortic aneurysm|E
}
{base=AAMD
spelling_variant=A.A.M.D.
entry=E
cat=noun
variants=groupuncount
acronym_of=American Association on Mental Deficiency|E
}

20

21

22

23 Conclusions and Future Work
- Improved upon the state-of-the-art approaches to NC bracketing
- Future work includes:
  - testing on NCs of more than 3 words
  - recognizing the ambiguous cases
  - including determiners and modifiers
  - applying the approach to other NLP problems
  - refining parser output (parsers typically assume right bracketing)