Towards a semantic extraction of named entities Diana Maynard, Kalina Bontcheva, Hamish Cunningham University of Sheffield, UK.

Introduction
- Challenges posed by the progression from traditional IE to a more semantic representation of NEs
- What techniques are best for the deeper level of analysis necessary?
- Can traditional rule-based methods cope with such a transition, or does the future lie solely with machine learning?

The ACE program
“A program to develop technology to extract and characterise meaning from human language”
Aims:
- produce structured information about entities, events and the relations that hold between them
- promote design of more generic systems rather than those tuned to a very specific domain and text type (as with MUC)

The ACE tasks
- Identification of entities and classification into semantic types (Person, Organisation, Location, GPE, Facility)
- Identification and coreference of all mentions of each entity in the text (name, pronominal, nominal)
- Identification of relations holding between such entities

<entity ID="ft-airlines-27-jul " GENERIC="FALSE" entity_type="ORGANIZATION">
  <entity_mention ID="M003" TYPE="NAME" string="National Air Traffic Services"/>
  <entity_mention ID="M004" TYPE="NAME" string="NATS"/>
  <entity_mention ID="M005" TYPE="PRO" string="its"/>
  <entity_mention ID="M006" TYPE="NAME" string="Nats"/>
</entity>
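
The entity record above groups several mentions (names and a pronoun) under one entity. As a minimal sketch of how such a record could be read, the following Python uses the standard library XML parser. It assumes the mention tags are self-closed and the entity element is closed so that the text parses; the function name `mentions_by_type` is made up for this illustration and is not part of ACE or GATE.

```python
import xml.etree.ElementTree as ET

# Well-formed version of the slide's example (assumption: mentions are
# self-closing). The entity ID is copied exactly as it appears on the slide.
ACE_ENTITY_XML = """
<entity ID="ft-airlines-27-jul " GENERIC="FALSE" entity_type="ORGANIZATION">
  <entity_mention ID="M003" TYPE="NAME" string="National Air Traffic Services"/>
  <entity_mention ID="M004" TYPE="NAME" string="NATS"/>
  <entity_mention ID="M005" TYPE="PRO" string="its"/>
  <entity_mention ID="M006" TYPE="NAME" string="Nats"/>
</entity>
"""

def mentions_by_type(xml_text):
    """Return the entity type and its mention strings grouped by mention TYPE."""
    entity = ET.fromstring(xml_text.strip())
    grouped = {}
    for mention in entity.findall("entity_mention"):
        grouped.setdefault(mention.get("TYPE"), []).append(mention.get("string"))
    return entity.get("entity_type"), grouped

entity_type, grouped = mentions_by_type(ACE_ENTITY_XML)
print(entity_type, grouped)
```

Grouping by `TYPE` separates the name mentions from the pronominal mention, which is the distinction the ACE coreference task draws.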

The MACE System
- Rule-based NE system developed within GATE, adapted from ANNIE
- PRs (processing resources): tokeniser, sentence splitter, POS tagger, gazetteer, semantic tagger, orthomatcher, pronominal and nominal coreferencer
- Also: genre ID, switching controller to select different PRs automatically
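
The idea of a switching controller, where some processing resources run only for certain genres, can be sketched in a few lines of Python. This is a toy illustration and not GATE's API: every function here is a hypothetical stand-in, and the all-caps test used for genre identification is an invented heuristic, since the slides do not say how MACE detects genre.

```python
# Toy stand-ins for processing resources (PRs); none of this is GATE's API.
def tokenise(doc):
    doc["tokens"] = doc["text"].split()

def guess_genre(doc):
    # Hypothetical genre ID: treat all-caps text as a broadcast transcript.
    doc["genre"] = "broadcast" if doc["text"].isupper() else "newswire"

def restore_case(doc):
    # Only useful for genres that arrive without reliable capitalisation.
    doc["tokens"] = [t.capitalize() for t in doc["tokens"]]

def orthomatch(doc):
    # Placeholder: a real orthomatcher links name variants such as NATS / Nats.
    doc["matched"] = True

# Each entry pairs a PR with a predicate deciding whether it runs for this document.
PIPELINE = [
    (tokenise, lambda doc: True),
    (guess_genre, lambda doc: True),
    (restore_case, lambda doc: doc.get("genre") == "broadcast"),
    (orthomatch, lambda doc: True),
]

def run(text):
    """Run the pipeline, skipping PRs whose predicate rejects the document."""
    doc = {"text": text, "ran": []}
    for pr, should_run in PIPELINE:
        if should_run(doc):
            pr(doc)
            doc["ran"].append(pr.__name__)
    return doc

print(run("NATS SAID TODAY")["ran"])
print(run("Nats said today")["ran"])
```

Keeping the conditions next to the components means a new genre only needs a new predicate, not a second pipeline.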

Differences between ANNIE and MACE
- Locations -> Location / GPE
- GPEs have roles (GPE, Per, Org, Loc)
- New type Facility (subsumes some Orgs)
- Metonymy means context is necessary for disambiguation (e.g. England cricket team vs England country)
- No Date, Time, Money, Percent, Address, Identifier
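
The metonymy point can be illustrated with a small context rule: the same gazetteer entry receives a different role depending on the words that follow it. The sketch below is an assumption-laden toy in Python. The name list, the cue words and the two-token window are all invented for illustration, and MACE's actual rules are written in JAPE and are not shown in the slides.

```python
# Hypothetical gazetteer and context cues, invented for this example.
GPE_NAMES = {"England", "France"}
ORG_CUES = {"team", "squad", "government"}  # following nouns suggesting an Org role

def gpe_role(tokens, i):
    """Return the role of the GPE at tokens[i], or None if it is not a known GPE."""
    if tokens[i] not in GPE_NAMES:
        return None
    # Look at up to two following tokens for a cue, e.g. "England cricket team".
    window = [t.lower().strip(".,") for t in tokens[i + 1 : i + 3]]
    if any(t in ORG_CUES for t in window):
        return "Org"
    return "GPE"

print(gpe_role("The England cricket team won".split(), 1))
print(gpe_role("He lives in England".split(), 3))
```

A gazetteer lookup alone would label both occurrences identically, which is why the rule needs the surrounding context.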

What does this mean in practical terms?
- Separation of specific from general information makes adaptation easier
- Reclassification of gazetteers unnecessary
- Changes mainly to semantic grammars to
  - use different gazetteer lookups
  - use more contextual information
  - group rules together differently

Semantic Grammars
- ANNIE uses 21 phases, 187 rules, 9 entity types (av. about 21 rules per entity type, i.e. 187/9)
- MACE uses 15 phases, 180 rules, 5 entity types (av. 36 rules per entity type)
- The important factor is the increased complexity of the new rules, rather than their number
- Rules may be hand-crafted, but an experienced JAPE user can write several rules per minute
- 6 weeks for adaptation

Evaluation (1)
[Table: columns Text, Precision, Recall, F-measure; rows ACE; MUC ENAMEX only. Cell values not preserved.]

Evaluation (2)
- NEWS: 92 articles (business news)
- ACE: 86 broadcast news texts from the September 2002 evaluation
- Difference on ACE task
- MACE on MUC-style annotations
  - GPEs are left as GPE (so count as errors)
  - GPEs are mapped to Locations
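
The two scoring conditions for MACE on MUC-style annotations can be sketched as follows. This is a minimal Python illustration under stated assumptions: scoring is exact match on (start, end, type) triples, the annotations are invented toy data, and the slides do not describe the scorer actually used.

```python
def precision_recall(gold, predicted):
    """Exact-match scoring over sets of (start, end, type) annotations."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

def map_gpe_to_location(annotations):
    """Relabel every GPE annotation as Location, leaving other types alone."""
    return {(s, e, "Location" if t == "GPE" else t) for s, e, t in annotations}

# Invented toy data: the MUC-style gold standard has Location but no GPE type.
gold = {(0, 7, "Location"), (12, 16, "Organization"), (20, 25, "Person")}
system = {(0, 7, "GPE"), (12, 16, "Organization"), (20, 25, "Person")}

print(precision_recall(gold, system))                       # GPEs left as GPE
print(precision_recall(gold, map_gpe_to_location(system)))  # GPEs mapped to Location
```

With GPEs left unmapped, a correctly located span still counts as both a spurious and a missing annotation, so it lowers precision and recall together.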

Comparison of ANNIE vs MACE
- 72% Precision, 84% Recall if GPEs mapped to Locations
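
For reference, precision and recall are commonly combined into an F-measure. The Python sketch below applies the standard formula to the 72% / 84% figures. It assumes the balanced weighting (beta = 1); the slides do not state which weighting was used, and the resulting value is derived here rather than taken from the presentation.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta=1 gives balanced F1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Assumption: balanced F1 on the figures quoted for GPEs mapped to Locations.
f1 = f_measure(0.72, 0.84)
print(round(f1, 3))  # 0.775
```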

Conclusions
- MACE is a rule-based NE system, in contrast with most systems, which use ML
- Advantages: it doesn’t require much training data, and it is fast to adapt because of its robust design
- If large amounts of training data are available, HMM-based systems tend to perform slightly better
- Rule-based systems tend to be good at recall but are sometimes low on precision unless additionally supported by ML methods