Published by Scott Conley. Modified over 9 years ago.
1
Mixed-Lingual Entity Recognition
Beatrice Alex
School of Informatics, The University of Edinburgh
20th of May 2004
2
Named Entity Recognition
- What is a named entity (NE)? A string that refers to a particular kind of object in the world, e.g.
  - “John Lennon” = NE of type person
  - “T-Mobile” = NE of type organisation
  - “Edinburgh” = NE of type location
- How are they recognised? Use of internal and external context
3
NER Methods
- Rule-based: hand-written patterns rely on punctuation, capitalisation and other features in the text
- Statistical: data-driven approaches exploit the statistical properties of real language to learn models
- Hybrid methods
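A minimal sketch of the rule-based idea, assuming two toy hand-written patterns that are not taken from the talk: a title such as "Mr." followed by capitalised words suggests a person, and a preposition followed by a capitalised word suggests a location. The pattern names and the example sentence are hypothetical; real rule sets are far larger.

```python
import re

# Hypothetical hand-written patterns (illustration only, not the talk's rules).
# External context: a title or a preposition. Internal context: capitalisation.
PERSON = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")
LOCATION = re.compile(r"\b(?:in|from|near)\s+([A-Z][a-z]+)")


def find_entities(text):
    """Return (entity string, entity type) pairs matched by the toy patterns."""
    found = [(m.group(1), "person") for m in PERSON.finditer(text)]
    found += [(m.group(1), "location") for m in LOCATION.finditer(text)]
    return found


example = "Mr. John Lennon was seen in Edinburgh."
entities = find_entities(example)
print(entities)
```

Patterns like these are easy to read and debug, but they depend on the capitalisation and punctuation conventions of one language, which is one reason mixed-lingual text is hard for them.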
4
PhD Proposal
- Supervisors: Claire Grover, Stephen Clark
- Proposed research topic: mixed-lingual NER, i.e. the detection and classification of NEs in a different language from the base language of the text
- Examples:
  - „Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.“
  - “Germany's Die Welt reports that four people died in the heat wave last week.”
5
Background and Motivation
- Multi-lingual and language-independent NER: active research area in NLP circles (MET-1/2, CoNLL02/03)
- Many errors in German NER due to the amount of foreign-language material in German articles (Rössler, 2002)
- Mixed-lingual NER: unspecified or beyond the capabilities of existing approaches
6
Beneficiaries
- Performance improvements of applications where NER is standardly applied (IE, QA, text summarisation, topic identification)
- Valuable information for polyglot TTS synthesis
- Pre-processing tool for MT systems
English phrase: Germany's Die Welt reports...
French MT output:
  - La trépointe de la matrice de l'Allemagne signale (The welt of the die of Germany indicates)
  - Die Welt de l'Allemagne signale (Die Welt of Germany indicates)
German MT output:
  - Welt Deutschlands Würfel berichtet (World Germany's die reports)
  - Die Welt Deutschlands berichtet (Die Welt Germany's indicates)
7
Denglish
- English: dominant language of science & technology, air-traffic control, advertising
- Increasing influence on German
- The live event was really cool. There were tickets, fast food, drinks in the basement.
8
Preliminary Research
- Analysis of English inclusions in German newspaper articles from different domains: (1) Internet & Telecoms, (2) EU and (3) space travel
- Corpus: 16,000 tokens per domain from a German newspaper (FAZ)
- Automatic classification of English tokens (POS tags NN and FM) by means of a simple lookup procedure
- More than 90% of all English inclusions are nouns (Yang, 1999; Yeandle, 2001; Corr, 2003)
9
1. Lookup Procedure
- CELEX lookup (NN|FM) in German and English databases
- only in German database > DE
- only in English database > EN
- in both databases:
  - Computer, Trend, Monster
  - Generation, Union, Mission
  - Art, Tag, Rat, Fall, All
- in neither database > 2. lookup procedure
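The decision logic of this first lookup can be sketched as follows. The two small sets are hypothetical stand-ins for the German and English CELEX databases, and the function name and return labels are assumptions made for illustration. The slide does not say how tokens found in both databases are resolved, so the sketch only flags them.

```python
# Toy stand-ins for the German and English CELEX databases (hypothetical contents).
GERMAN_LEXICON = {"schicksal", "computer", "tag", "union"}
ENGLISH_LEXICON = {"command", "computer", "tag", "union", "shuttle"}


def lexicon_lookup(token, german=GERMAN_LEXICON, english=ENGLISH_LEXICON):
    """Classify a token by membership in the two lexicons.

    Returns "DE", "EN", "BOTH" (ambiguous, left unresolved here) or
    "UNKNOWN" (to be passed on to the second lookup procedure).
    """
    word = token.lower()
    in_german = word in german
    in_english = word in english
    if in_german and in_english:
        return "BOTH"
    if in_german:
        return "DE"
    if in_english:
        return "EN"
    return "UNKNOWN"


for token in ["Schicksal", "Shuttle", "Computer", "Mausklick"]:
    print(token, lexicon_lookup(token))
```

Lower-casing before the lookup is a simplification of this sketch; since German capitalises all nouns, a real system may want to treat case more carefully.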
10
2. Lookup Procedure
- Google lookup with language preference
- German compounds: Mausklick (mouse click)
- English unhyphenated compounds: Homepage
- Mixed-lingual unhyphenated compounds: Shuttleflug (shuttle flight)
- English nouns with German inflections: Receivern
- Abbreviations and acronyms: GPS, UKW
- Words with spelling mistakes: Abruch (abortion)
- English words with American spelling: Center
- Classification based on number of hits
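A rough sketch of classification by hit counts, under the assumption that the language with more hits wins; the slide does not spell out the exact comparison. The `count_hits` callable is hypothetical (no real search API is called here), and the numbers in `FAKE_HITS` are made up purely for illustration.

```python
def classify_by_hits(token, count_hits):
    """Compare hit counts for German and English searches of the same token.

    count_hits(token, lang) -> int is a hypothetical callable standing in
    for a web search restricted to pages in the given language.
    """
    german_hits = count_hits(token, "de")
    english_hits = count_hits(token, "en")
    if german_hits == english_hits:
        return "UNDECIDED"
    return "EN" if english_hits > german_hits else "DE"


# Made-up hit counts, for illustration only (not real search results).
FAKE_HITS = {
    ("Homepage", "de"): 120,
    ("Homepage", "en"): 900,
    ("Mausklick", "de"): 400,
    ("Mausklick", "en"): 3,
}


def fake_count_hits(token, lang):
    """Stub that returns the made-up counts above, or 0 if unknown."""
    return FAKE_HITS.get((token, lang), 0)


for token in ["Homepage", "Mausklick"]:
    print(token, classify_by_hits(token, fake_count_hits))
```

Comparing raw counts is the simplest choice; because the amount of web text differs between languages, a real system might need to normalise the counts before comparing them.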
11
Results
Output:
  Das Central Command erklärte, das Schicksal des Piloten sei noch ungeklärt.
  EN: Central Command explained, the fate of the pilot is still unclear.
  MT: Central Command explained, the fate of the pilot was still unsettled.
Domain | Internet & Telecoms (Tokens / Types) | European Union (Tokens / Types) | Space Travel (Tokens / Types)
Total | 15919 / 4386 | 16028 / 4200 | 16066 / 4126
Looked up | 3780 / 1737 | 3371 / 1616 | 3680 / 1524
Classed as EN | 632 / 181 | 94 / 46 | 340 / 72
Lexicon | 150 / 73 | 15 / 13 | 110 / 29
Lexicon + Google | 482 / 108 | 79 / 33 | 230 / 43
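As a small worked example on the token counts in the table above, the share of tokens classed as English can be computed per domain. In every column the "Lexicon" and "Lexicon + Google" rows add up to the "Classed as EN" row, which the sketch also checks. The dictionary keys and function names are assumptions introduced here.

```python
# Token counts copied from the results table (Tokens columns only).
RESULTS = {
    "Internet & Telecoms": {"total": 15919, "classed_en": 632, "lexicon": 150, "lexicon_google": 482},
    "European Union": {"total": 16028, "classed_en": 94, "lexicon": 15, "lexicon_google": 79},
    "Space Travel": {"total": 16066, "classed_en": 340, "lexicon": 110, "lexicon_google": 230},
}


def english_share(domain):
    """Percentage of all tokens in the domain that were classed as EN."""
    row = RESULTS[domain]
    return 100.0 * row["classed_en"] / row["total"]


def rows_are_consistent(domain):
    """Check that the two lower rows sum to the 'Classed as EN' row."""
    row = RESULTS[domain]
    return row["lexicon"] + row["lexicon_google"] == row["classed_en"]


for name in RESULTS:
    print(f"{name}: {english_share(name):.2f}% of tokens classed as EN")
```

The Internet & Telecoms domain has by far the largest share of English tokens, and the European Union domain the smallest.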
12
English Inclusions
Internet & Telecoms (Token, FR, Web) | European Union (Token, FR, Web) | Space Travel (Token, FR, Web)
Internet 106 | DCEI 10 | Shuttle 27
Online 64 | Cluster 3 | Crew 19
UMTS 24 | Spreads 1 | US 14
Handy 13 | Scores 1 | Shuttles 7
PC 12 | Portfolio 1 | Space 2
Software 10 | Newcomer 1 | Cockpit 2
Pixel 9 | talk 1 | Shop 2
Megabyte 6 | road 1 | Spacehab 1
Recorder 5 | map 1 | Internet 1
Center 5 | Small 1 | Heliumtanks 1
13
Error Analysis
- Sources of error:
  - Wrong POS tags
  - Mixed-lingual unhyphenated compounds
  - New internationalisms
  - Abbreviations with several expansions
  - Unreliable Google hits
  - Inclusions from other languages
- Need for better handling of NEs
- Morpheme-level analysis for compounds
- Extension to other POS tags
14
Future Work
- Collection of more data and annotation for training and evaluation
- Development of a sequence modelling classifier, e.g. maximum entropy
- Implementation for other languages
- Application-based evaluation (e.g. MT)