20th of May 2004, Beatrice Alex, School of Informatics, The University of Edinburgh: Mixed-Lingual Entity Recognition.

Presentation transcript:


Named Entity Recognition
What is a named entity (NE)?
  – A string that refers to a particular kind of object in the world, e.g.
    “John Lennon” = NE of type person
    “T-Mobile” = NE of type organisation
    “Edinburgh” = NE of type location
How are they recognised?
  – Use of internal and external context

NER Methods
– Rule-based: hand-written patterns rely on punctuation, capitalisation and other features in the text
– Statistical: data-driven approaches exploit the statistical properties of real language to learn models
– Hybrid methods
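To make the rule-based idea concrete, here is a minimal Python sketch of one hand-written capitalisation pattern. The rule itself (a run of two or more capitalised words is a candidate NE) and the name `candidate_entities` are assumptions made for illustration; they are not taken from the slides.

```python
import re

# Hypothetical hand-written rule: two or more consecutive capitalised
# words form a candidate named entity. Real rule sets are far richer.
CANDIDATE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_entities(text: str) -> list[str]:
    """Return all spans matched by the capitalisation rule."""
    return CANDIDATE.findall(text)


sentence = "The singer John Lennon moved from Liverpool to New York."
print(candidate_entities(sentence))
```

The single-word name "Liverpool" is missed by this rule, and in German, where all nouns are capitalised, the same rule would overgenerate. Capitalisation alone is therefore a weak cue in mixed-lingual text.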

PhD Proposal
Supervisors: Claire Grover, Stephen Clark
Proposed research topic: mixed-lingual NER, i.e. the detection and classification of NEs in a different language from the base language of the text
Examples:
  „Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.“
  “Germany's Die Welt reports that four people died in the heat wave last week.”

Background and Motivation
– Multi-lingual and language-independent NER: active research area in NLP circles (MET-1/2, CoNLL02/03)
– Many errors in German NER due to amount of foreign language material in German articles (Rössler, 2002)
– Mixed-lingual NER: unspecified or beyond capabilities of existing approaches

Beneficiaries
– Performance improvements of applications where NER is standardly applied (IE, QA, text summarisation, topic identification)
– Valuable information to polyglot TTS synthesis
– Pre-processing tool for MT systems

English phrase: Germany's Die Welt reports...

French MT output:
  La trépointe de la matrice de l'Allemagne signale (The welt of the die of Germany indicates)
  Die Welt de l'Allemagne signale (Die Welt of Germany indicates)
German MT output:
  Welt Deutschlands Würfel berichtet (World Germany's die reports)
  Die Welt Deutschlands berichtet (Die Welt Germany's indicates)

Denglish
– English: dominant language of science & technology, air-traffic control, advertising
– Increasing influence on German
– The live event was really cool. There were tickets, fast food, drinks in the basement.

Preliminary Research
– Analysis of English inclusions in German newspaper articles in different domains: (1) Internet & Telecoms, (2) EU and (3) space travel
– Corpus: 16,000 tokens per domain from a German newspaper (FAZ)
– Automatic classification of English tokens tagged NN (common noun) or FM (foreign material) by means of a simple lookup procedure
– More than 90% of all English inclusions are nouns (Yang, 1999; Yeandle, 2001; Corr, 2003)

1. Lookup Procedure
CELEX lookup (NN|FM) in German and English databases:
– only in German database -> DE
– only in English database -> EN
– in both databases:
    – Computer, Trend, Monster
    – Generation, Union, Mission
    – Art, Tag, Rat, Fall, All
– in neither database -> 2. lookup procedure
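The four-way decision of the first lookup can be sketched in Python as follows. This is only an illustration under stated assumptions: the two small sets stand in for the German and English CELEX databases, and the function name and the labels `BOTH` and `SECOND_LOOKUP` are hypothetical. The slide does not say how tokens found in both databases are finally resolved, so the sketch merely flags them.

```python
# Toy stand-ins for the German and English CELEX databases (assumption:
# real lookups would query the full lexical databases, not small sets).
GERMAN_LEXICON = {"schicksal", "pilot", "computer", "tag", "rat"}
ENGLISH_LEXICON = {"command", "software", "computer", "tag", "rat"}


def lexicon_lookup(token: str) -> str:
    """Classify a token by its membership in the two lexicons."""
    key = token.lower()
    in_german = key in GERMAN_LEXICON
    in_english = key in ENGLISH_LEXICON
    if in_german and in_english:
        return "BOTH"           # ambiguous, e.g. Computer, Tag, Rat
    if in_german:
        return "DE"             # only in the German lexicon
    if in_english:
        return "EN"             # only in the English lexicon
    return "SECOND_LOOKUP"      # in neither: pass on to the second procedure


for token in ["Schicksal", "Command", "Computer", "Homepage"]:
    print(token, lexicon_lookup(token))
```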

2. Lookup Procedure
Google lookup with language preference, for tokens such as:
– German compounds: Mausklick (mouse click)
– English unhyphenated compounds: Homepage
– Mixed-lingual unhyphenated compounds: Shuttleflug (shuttle flight)
– English nouns with German inflections: Receivern
– Abbreviations and acronyms: GPS, UKW
– Words with spelling mistakes: Abruch (misspelling of Abbruch, abort)
– English words with American spelling: Center
Classification based on number of hits
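A hit-count classification of this kind might look like the Python sketch below. The slide only states that classification is based on the number of hits, so the decision rule used here (the language whose search returns more hits wins) is an assumption. No real search API is called: `toy_hit_count` and its numbers are invented purely for illustration.

```python
from typing import Callable

# Invented, illustrative hit counts keyed by (token, language preference).
TOY_HITS = {
    ("Mausklick", "de"): 900,
    ("Mausklick", "en"): 10,
    ("Homepage", "de"): 500,
    ("Homepage", "en"): 5000,
}


def toy_hit_count(token: str, language: str) -> int:
    """Hypothetical stand-in for a web search restricted to one language."""
    return TOY_HITS.get((token, language), 0)


def classify_by_hits(token: str, hit_count: Callable[[str, str], int]) -> str:
    """Assumed rule: pick the language with the larger number of hits."""
    german_hits = hit_count(token, "de")
    english_hits = hit_count(token, "en")
    if german_hits == english_hits:
        return "UNDECIDED"
    return "DE" if german_hits > english_hits else "EN"


for token in ["Mausklick", "Homepage", "Abruch"]:
    print(token, classify_by_hits(token, toy_hit_count))
```

Passing the hit-count function as a parameter keeps the decision rule separate from whatever search backend supplies the counts.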

Results
Output: Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.
EN: Central Command explained, the fate of the pilot is still unclear.
MT: Central Command explained, the fate of the pilot was still unsettled.

[Table: rows Total, Looked up, Classed as EN (Lexicon; Lexicon + Google); columns Tokens and Types for each of the domains Internet & Telecoms, European Union and Space Travel]

English Inclusions

Internet & Telecoms     European Union      Space Travel
Token       FR          Token       FR      Token        FR
Internet    106         DCEI        10      Shuttle      27
Online      64          Cluster     3       Crew         19
UMTS        24          Spreads     1       US           14
Handy       13          Scores      1       Shuttles     7
PC          12          Portfolio   1       Space        2
Software    10          Newcomer    1       Cockpit      2
Pixel       9           talk        1       Shop         2
Megabyte    6           road        1       Spacehab     1
Recorder    5           map         1       Internet     1
Center      5           Small       1       Heliumtanks  1

[A third column per domain, Web, has no values in the transcript.]

Error Analysis
Sources of error:
– Wrong POS tags
– Mixed-lingual unhyphenated compounds
– New internationalisms
– Abbreviations with several expansions
– Unreliable Google hits
– Inclusions from other languages
Needs:
– Better handling of NEs
– Morpheme-level analysis for compounds
– Extension to other POS tags

Future Work
– Collection of more data and annotation for training and evaluation
– Development of sequence modelling classifier, e.g. maximum entropy
– Implementation of other languages
– Application-based evaluation (e.g. MT)