Named Entity Recognition without Training Data on a Language you don’t speak
Diana Maynard, Valentin Tablan, Hamish Cunningham
NLP group, University of Sheffield, UK



On 4 March 2003, a bomb exploded in Davao City. The President of the Philippines classified this event as a terrorist attack. 24 hours later, Cebuano was announced as the language to be used in an experiment to create tools and resources for a surprise language. Within 4 days, we had developed a POS tagger for Cebuano, and within 7 days, we had developed an NE system for Cebuano with an F-measure of 77.5%. We did this with no native speaker and no training data.

Are we mad?
Quite possibly. At least, most people thought we were mad to attempt this, and they’re probably right! Our results, however, are genuine. So, what is it all about, and how on earth did we do it?

The Surprise Language Exercise
In the event of a national emergency, how quickly could the NLP community build tools for language processing to support the US government? Typical tools needed: IE, MT, summarisation, CLIR. The main experiment in June 2003 gave sites a month to build such tools. A dry run in March 2003 explored the feasibility of the exercise.

Dry Run
Ran from 5-14 March as a test to: see how feasible such tasks would be; see how quickly the community could collect language resources; and test working practices for communication and collaboration between sites.

What on earth is Cebuano?
Spoken by 24% of the Philippine population and the lingua franca of the southern Philippines (including Davao City). Classified by the LDC as a language of “medium difficulty”. Very few resources are available (large-scale dictionaries, parallel corpora, morphological analysers, etc.). But its Latin script, standard orthography, words separated by white space, many Spanish influences and a lot of English proper nouns make it easier.

Named Entity Recognition
For the dry run, we worked on resource collection and development for NE. NE is useful for many other tasks such as MT, so speed was very important. The task let us test our claims about ANNIE being easy to adapt to new languages and tasks. Being rule-based meant we didn’t need training data. But could we write rules without knowing any Cebuano?

Resources
A collaborative effort between all participants, not just those doing IE. Collection of general tools, monolingual texts, bilingual texts, lexical resources, and other information. Resources came mainly from the web, but others were scanned in from hard copy.

Text Resources
Monolingual Cebuano texts were mainly news articles (some from archives, others downloaded daily). Bilingual texts were available, such as the Bible, but were not very useful for NE recognition because of the domain. One news site had a mixture of English and Cebuano texts, which were useful for mining.

Lexical Resources
A small list of surnames; some small bilingual dictionaries (some with POS information); a list of Philippine cities (provided by Ontotext). But many of these were not available for several days.

Other Resources
It was infeasible to expect to find Cebuano speakers with NLP skills and train them within a week. But an extensive Internet search revealed several native speakers willing to help. One local native speaker was found and used for evaluation. A yahoogroups Cebuano discussion list was found, leading to the provision of new resources etc.

Adapting ANNIE for Cebuano
The default IE system is for English, but some modules can be used directly. We used the tokeniser, splitter, POS tagger, gazetteer, NE grammar and orthomatcher (coreference). The splitter and orthomatcher were used unmodified. We added tokenisation post-processing, a new lexicon for the POS tagger and new gazetteers. We modified the POS tagger implementation and the NE grammars.

Tokenisation
We used the default Unicode tokeniser. Multi-word lexical items meant POS tags couldn’t be attached correctly, so we added a post-processing module to retokenise such words as single Tokens. We created a gazetteer list of such words and a JAPE grammar to combine Token annotations. The modifications took approximately 1 person-hour.
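The retokenisation step can be illustrated with a minimal Python sketch. This is not the GATE gazetteer or JAPE grammar used in the experiment; the function name `merge_multiword_tokens` and the placeholder entries are hypothetical, and the longest-match, case-insensitive strategy is an assumption made for illustration.

```python
def merge_multiword_tokens(tokens, multiword_entries):
    """Merge runs of tokens that match a listed multi-word item into one token.

    tokens: list of token strings, e.g. from a whitespace tokeniser.
    multiword_entries: iterable of multi-word lexical items (hypothetical list).
    Assumption: longest match wins and matching ignores case.
    """
    entries = {tuple(e.lower().split()) for e in multiword_entries}
    max_len = max((len(e) for e in entries), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        match_len = 0
        # Try the longest candidate span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in entries:
                match_len = n
                break
        if match_len:
            # Emit the whole run as a single token.
            merged.append(" ".join(tokens[i:i + match_len]))
            i += match_len
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Placeholder words, not real Cebuano entries.
print(merge_multiword_tokens(["x", "foo", "bar", "y"], ["foo bar"]))
```

Once such items are single tokens, a lexicon-based tagger can attach one POS tag to the whole item instead of tagging its parts separately.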

POS tagger
We used the Hepple tagger but substituted a Cebuano lexicon for the English one. We used an empty ruleset since no training data was available. We used the default heuristics (e.g. return NNP for capitalised words). Very experimental, but with reasonable results.
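A lexicon-plus-heuristics tagger of this kind can be sketched in a few lines of Python. This is an illustrative approximation, not the Hepple tagger's implementation: the function `tag_tokens`, the toy lexicon, and the `CD` and `NN` fallbacks are assumptions; only the "NNP for capitalised words" heuristic comes from the slide.

```python
def tag_tokens(tokens, lexicon):
    """Tag tokens by lexicon lookup, falling back to simple heuristics.

    lexicon: dict mapping a word to a list of possible tags, most likely first
             (a hypothetical stand-in for the Cebuano lexicon).
    With an empty ruleset, no contextual rules are applied afterwards.
    """
    tagged = []
    for tok in tokens:
        tags = lexicon.get(tok) or lexicon.get(tok.lower())
        if tags:
            tag = tags[0]            # first listed tag for a known word
        elif tok[:1].isupper():
            tag = "NNP"              # heuristic named on the slide
        elif tok[:1].isdigit():
            tag = "CD"               # assumed fallback for numbers
        else:
            tag = "NN"               # assumed fallback for unknown words
        tagged.append((tok, tag))
    return tagged


# Placeholder lexicon entries, not real Cebuano words.
toy_lexicon = {"alpha": ["NN"], "beta": ["VB", "NN"]}
print(tag_tokens(["Davao", "alpha", "Beta", "2003", "zzz"], toy_lexicon))
```

Because capitalised unknown words default to NNP, proper nouns are still marked as such when the lexicon is small, which matters for the NE grammars that run afterwards.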

Evaluation of Tagger
No formal evaluation was possible. We estimate around 75% accuracy. The tagger was created in 2 person-days. Results and a tagging service were made available to other participants.

Gazetteer
Perhaps surprisingly, there was very little information on the Web. We mined English texts about the Philippines for names of cities, first names, organisations, etc. We used bilingual dictionaries to create “finite” lists such as days of the week and months of the year. We mined Cebuano texts for “clue words” by a combination of bootstrapping, guessing and bilingual dictionaries. We kept the English gazetteer because there are many English proper nouns and little ambiguity.
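One way to picture the bootstrapping of “clue words” is to count which words tend to appear directly before names that are already known. The slides do not describe the actual procedure, so the sketch below is an assumption: the function `mine_clue_words`, the `min_count` threshold, and the toy sentences are all hypothetical.

```python
from collections import Counter


def mine_clue_words(sentences, known_names, min_count=2):
    """Return words that frequently precede an already-known name.

    sentences: iterable of whitespace-tokenisable strings.
    known_names: set of names taken from an existing gazetteer (seed list).
    min_count: minimum number of occurrences to keep a candidate (assumed).
    """
    counts = Counter()
    for sent in sentences:
        toks = sent.split()
        for prev, cur in zip(toks, toks[1:]):
            # Count the word to the left of a known name, skipping name-name pairs.
            if cur in known_names and prev not in known_names:
                counts[prev.lower()] += 1
    return [word for word, count in counts.most_common() if count >= min_count]


# Toy English stand-ins for the mined texts.
texts = ["the mayor Garcia spoke", "mayor Santos left", "a Garcia"]
print(mine_clue_words(texts, {"Garcia", "Santos"}))
```

Candidates found this way would still need checking, for example against a bilingual dictionary, before being added to a gazetteer list.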

NE grammars
Most English JAPE rules are based on POS tags and gazetteer lookup. Grammars can be reused for languages with similar word order, orthography, etc. There was no time to make a detailed study of Cebuano, but it is very similar in structure to English. Most of the rules were left as for English, with some adjustments, especially to handle dates.
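To show what a rule based on POS tags and gazetteer lookup looks like, here is a minimal Python sketch of one pattern: a title word from a gazetteer followed by one or more NNP tokens yields a Person. It is not JAPE and not one of the ANNIE rules; the function `find_persons` and the title list are hypothetical.

```python
def find_persons(tagged_tokens, title_words):
    """Find Person candidates: a gazetteer title followed by NNP tokens.

    tagged_tokens: list of (token, pos_tag) pairs.
    title_words: set of lower-cased titles without trailing dots (assumed list).
    Returns a list of ("Person", name) tuples; the title itself is excluded.
    """
    entities = []
    i = 0
    n = len(tagged_tokens)
    while i < n:
        token, _ = tagged_tokens[i]
        if token.lower().rstrip(".") in title_words:
            j = i + 1
            # Extend the match over consecutive proper nouns.
            while j < n and tagged_tokens[j][1] == "NNP":
                j += 1
            if j > i + 1:
                name = " ".join(t for t, _ in tagged_tokens[i + 1:j])
                entities.append(("Person", name))
                i = j
                continue
        i += 1
    return entities


tagged = [("Mr.", "NNP"), ("John", "NNP"), ("Smith", "NNP"), ("said", "VBD")]
print(find_persons(tagged, {"mr", "mrs", "president"}))
```

A rule like this depends only on the tag set and the gazetteer contents, which is why swapping in a new lexicon and new lists can be enough to reuse it for a language with similar word order.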

Evaluation (1)
The system annotated 10 news texts and output them as colour-coded HTML. Evaluation was done on paper by a native Cebuano speaker from the University of Maryland. The evaluation was not perfect due to lack of annotator training. Results: 85.1% Precision, 58.2% Recall, 71.65% F-measure. This evaluation is not reusable.
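For reference, the usual balanced F-measure is the harmonic mean of precision and recall. The sketch below uses that standard definition, with `f_measure` as an illustrative helper name. Under this definition the precision and recall above give about 69.1%, while the reported 71.65% equals their simple average, so the slide may be using a different averaging convention.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F-beta).

    beta=1.0 gives the balanced F1 score. Inputs may be fractions or
    percentages, as long as both use the same scale.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


precision, recall = 85.1, 58.2
print(round(f_measure(precision, recall), 2))   # harmonic mean
print(round((precision + recall) / 2, 2))       # simple average
```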

Evaluation (2)
The second evaluation used 21 news texts, hand-tagged on paper and converted to GATE annotations later. System annotations were compared with the “gold standard”. This evaluation is reusable. We also evaluated the English NE system on these texts to get a baseline.
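Comparing system annotations with a gold standard can be sketched as a set comparison. This is an assumption-laden illustration rather than GATE's evaluation tooling: annotations are modelled as hypothetical `(start, end, type)` tuples, and only exact matches on span and type count as correct.

```python
def score_annotations(system, gold):
    """Return (precision, recall) under strict span-and-type matching.

    system, gold: iterables of (start_offset, end_offset, entity_type) tuples.
    Assumption: partial overlaps and type mismatches count as errors.
    """
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy annotations with made-up offsets.
gold = {(0, 5, "Person"), (10, 15, "Location"), (20, 24, "Date")}
system = {(0, 5, "Person"), (10, 15, "Org")}
print(score_annotations(system, gold))
```

Running the same comparison on the output of the unmodified English system over the same texts gives the baseline figures.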

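Results
Table comparing the Cebuano system with the English baseline: columns P, R and F for each system; rows Person, Org, Location, Date and Total (cell values not preserved).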

What did we learn?
Even the most bizarre (and simple) ideas are worth trying. Trying a variety of different approaches from the outset is fundamental. Communication is vital (being nocturnal helps too if you’re in the UK). Good gazetteer lists can get you a long way. Good mechanisms for evaluation need to be factored in.

The future
We learnt a lot about the capabilities of GATE and ANNIE from the experiment. We plan further modifications to GATE to make it more language-agile, and to use other languages for annotation projection experiments (both to improve language agility and the English system).