Sheffield -- Victims of Mad Cow Disease???? Or is it really possible to develop a named entity recognition system in 4 days on a surprise language with no native speakers and no training data?

Named Entity Recognition
- Decided to work on resource collection and development for NE
- Test our claims about ANNIE being easy to adapt to new languages and tasks
- Rule-based meant we didn’t need training data
- But could we write rules without even knowing any Cebuano?

Adapting ANNIE for Cebuano
- Default IE system is for English, but some modules can be used directly
- Used tokeniser, splitter, POS tagger, gazetteer, NE grammar, orthomatcher
- Splitter and orthomatcher unmodified
- Added tokenisation post-processing, new lexicon for POS tagger and new gazetteers
- Modified POS tagger implementation and NE grammars

Gazetteer
- Perhaps surprisingly, very little info on Web
- Mined English texts about Philippines for names of cities, first names, organisations...
- Used bilingual dictionaries to create “finite” lists such as days of week, months of year...
- Mined Cebuano texts for “clue words” by combination of bootstrapping, guessing and bilingual dictionaries
- Kept English gazetteer because many English proper nouns and little ambiguity
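To make the gazetteer idea concrete, here is a minimal Python sketch of longest-match list lookup over tokens. It is not the ANNIE gazetteer implementation: the list names, the entries (including the assumed spellings of the Cebuano day names) and the `lookup` function are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical gazetteer lists. Entries are illustrative assumptions,
# not taken from the lists built for the Cebuano system.
GAZETTEER: Dict[str, Set[str]] = {
    "first_name": {"maria", "jose"},
    "location": {"cebu", "cebu city", "manila"},
    "day": {"lunes", "martes"},  # assumed spellings of day names
}


def lookup(tokens: List[str], gazetteer: Dict[str, Set[str]],
           max_len: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, list_name) for the longest match at each position.

    Scans left to right and prefers longer phrases, so "Cebu City" is
    matched as one entry rather than as "Cebu" alone.
    """
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            for list_name, entries in gazetteer.items():
                if phrase in entries:
                    found = (i, i + n, list_name)
                    break
            if found:
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched phrase
        else:
            i += 1
    return matches


tokens = ["Maria", "visited", "Cebu", "City", "on", "Lunes"]
print(lookup(tokens, GAZETTEER))
```

Lower-casing before lookup is a design choice of this sketch; a real system might keep case, since capitalisation is itself a clue for proper nouns.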

NE grammars
- Most English JAPE rules based on POS tags and gazetteer lookup
- Grammars can be reused for languages with similar word order, orthography etc.
- Most of the rules left as for English, with a few minor adjustments
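As a rough illustration of a rule that combines POS tags with gazetteer lookup, the Python sketch below marks a run of proper nouns as a Person when it follows a token found in a title list. This is not JAPE and not one of the actual ANNIE rules; the token layout, the `title` list name and the `person_rule` function are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# A token is a dict with its text, a POS tag, and an optional gazetteer
# lookup type. This layout is an assumption made for this sketch.
Token = Dict[str, Optional[str]]


def person_rule(tokens: List[Token]) -> List[Tuple[int, int, str]]:
    """Pattern: a 'title' lookup followed by one or more NNP tokens.

    Returns (start, end, "Person") spans covering the proper-noun run,
    with the title itself left outside the span.
    """
    entities = []
    i = 0
    while i < len(tokens):
        if tokens[i].get("lookup") == "title":
            j = i + 1
            # Consume the run of proper nouns after the title.
            while j < len(tokens) and tokens[j].get("pos") == "NNP":
                j += 1
            if j > i + 1:
                entities.append((i + 1, j, "Person"))
                i = j
                continue
        i += 1
    return entities


sentence = [
    {"text": "Mayor", "pos": "NNP", "lookup": "title"},
    {"text": "Juan", "pos": "NNP", "lookup": None},
    {"text": "Cruz", "pos": "NNP", "lookup": None},
    {"text": "spoke", "pos": "VBD", "lookup": None},
]
print(person_rule(sentence))
```

A rule of this shape depends only on word order, capitalised proper nouns and list membership, which is why it could plausibly carry over to a language with similar word order and orthography once the POS lexicon and gazetteers are replaced.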

Evaluation (1)
- System annotated 10 news texts and output as colour-coded HTML
- Evaluation on paper by native Cebuano speaker from UMD
- Evaluation not perfect due to lack of annotator training
- 85.1% Precision, 58.2% Recall, 71.65% F-measure
- Non-reusable
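For readers who want to check the arithmetic, the following Python sketch computes the balanced F-measure (harmonic mean) and the plain average of the precision and recall quoted above. The slides do not say which formula produced the 71.65 figure; the `f_measure` helper is a standard textbook definition, not code from the system. The quoted 71.65 coincides with the arithmetic mean, while the harmonic-mean F1 comes out at roughly 69.1.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F-beta).

    With beta = 1 this is the usual balanced F1.
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures quoted on the slide, in percent.
precision, recall = 85.1, 58.2

harmonic = f_measure(precision, recall)   # balanced F1
arithmetic = (precision + recall) / 2     # simple average of P and R

print(f"harmonic mean (F1): {harmonic:.2f}")
print(f"arithmetic mean:    {arithmetic:.2f}")
```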

Evaluation (2)
- 2nd evaluation used 21 news texts, hand tagged on paper and converted to GATE annotations later
- System annotations compared with “gold standard”
- Reusable
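Comparing system output with a gold standard can be sketched as a set comparison over annotations. The Python below is a minimal illustration using strict matching only (same start offset, end offset and type); it does not reproduce GATE's evaluation tools, and the annotation tuples and the `score` function are hypothetical.

```python
from typing import Iterable, Tuple

# An annotation is (start_offset, end_offset, entity_type).
Annotation = Tuple[int, int, str]


def score(system: Iterable[Annotation],
          gold: Iterable[Annotation]) -> Tuple[float, float]:
    """Return (precision, recall) under strict matching.

    A system annotation counts as correct only if an identical
    annotation (same span and same type) exists in the gold standard.
    """
    system_set, gold_set = set(system), set(gold)
    correct = len(system_set & gold_set)
    precision = correct / len(system_set) if system_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical annotations for one document.
gold = [(0, 5, "Person"), (10, 14, "Location"), (20, 28, "Date")]
system = [(0, 5, "Person"), (10, 14, "Organization")]

p, r = score(system, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Once the gold standard is stored as annotations, the same comparison can be rerun after every change to the rules or gazetteers, which is what makes this kind of evaluation reusable.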

Results

Entity      P    R    F
Person
Org
Location
Date
Total       76   79   78

What did we learn?
- Even the most bizarre (and simple) ideas are worth trying
- Trying a variety of different approaches from the outset is fundamental
- Communication is vital (being nocturnal helps too if you’re in the UK)
- Good mechanisms for evaluation need to be factored in