
totale: Multilingual Tokenisation, Tagging and Lemmatisation
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
JRC Workshop, September 2005

JRC Workshop, September 2005 Tomaž Erjavec: Multilingual Tokenisation, Tagging & Lemmatisation
Overview of the talk
1. Introduction
2. The totale pipeline
3. Training totale
4. Annotating JRC-ACQUIS-sl
5. Conclusions

Introduction
- Hypothesis: to efficiently exploit the JRC-ACQUIS, its texts need to be linguistically pre-processed
- This normalizes (reduces) the data and gives other tools more features to work with

Example
Input: 2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

    TOKEN      TYPE      LEMMA      MSD
    2.         TOK_ENUM  2.         Rmp
    (a)        TOK_ENUM  (a)        Rmp
    Where      TOK       where      Cs
    an         TOK       a          Di
    exporter   TOK       exporter   Ncns
    has        TOK       have       Vaip3s
    declared   TOK       declare    Vmps
    goods      TOK       good       Ncnp
    packaged   TOK       package    Vmis
    using      TOK       use        Vmpp
    automatic  TOK       automatic  Afp
    systems    TOK       system     Ncnp
    for        TOK       for        Sp
    bagging    TOK       bag        Vmpp
    ,          PUN
    canning    TOK       can        Vmpp
    ,          PUN
    bottling   TOK       bottle     Vmpp
    ,          PUN
    etc.       TOK_ABBR  etc.       Rmp

- MSD and LEMMA are context dependent
- MSD is useful for any syntactically oriented further processing (PoS filtering)
- LEMMA is useful for reducing the lexical space (easier searches)
- The task is much harder for inflectionally rich (or agglutinative) languages than for English or most ‘old’ EU languages!
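The two uses named on this slide, PoS filtering on the MSD and lexical reduction via the lemma, can be sketched in a few lines. This is an illustrative Python sketch, not part of totale (which is written in Perl); the function name `pos_filter` is hypothetical, and the only assumption about the tagset is that the first letter of an MSD encodes the part of speech (N noun, V verb, A adjective), as in the rows of the example.

```python
# Word rows copied from the example slide: (token, type, lemma, MSD).
ROWS = [
    ("Where", "TOK", "where", "Cs"),
    ("an", "TOK", "a", "Di"),
    ("exporter", "TOK", "exporter", "Ncns"),
    ("has", "TOK", "have", "Vaip3s"),
    ("declared", "TOK", "declare", "Vmps"),
    ("goods", "TOK", "good", "Ncnp"),
    ("packaged", "TOK", "package", "Vmis"),
    ("using", "TOK", "use", "Vmpp"),
    ("automatic", "TOK", "automatic", "Afp"),
    ("systems", "TOK", "system", "Ncnp"),
    ("for", "TOK", "for", "Sp"),
]


def pos_filter(rows, categories):
    """Return the lemmas of rows whose MSD starts with one of the category letters.

    Hypothetical helper: assumes the first MSD character is the part of speech.
    """
    return [lemma for _token, _type, lemma, msd in rows if msd[0] in categories]


# Keep only nouns, verbs and adjectives, represented by their lemmas.
content_lemmas = pos_filter(ROWS, {"N", "V", "A"})
print(content_lemmas)
```

Searching on `content_lemmas` rather than on surface forms means that "goods" and "good", or "declared" and "declare", fall together, which is the reduction of the lexical space the slide refers to.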

Nagging doubts
- Normalization loses information
- Annotation introduces errors and bias
- Evaluation for IE non-conclusive
- Unsupervised methods!
Still…

Wanted
A tool that would take text in any language and
- tokenise,
- PoS tag and
- lemmatise it.
Should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models (and work under Linux!)

What is out there
- Component software: tokenisers, taggers, (stemmers)
- FS/RE environments: INTEX, CLARK
- Various LT workbenches, most famous GATE
- Alas: Java, time investment, history

Linguistic annotation with totale
- Multilingual tokenisation, tagging and lemmatisation
- Perl program with a simple pipeline architecture
- Input is plain UTF-8 text
- Output is a list of annotated tokens
- Several output formats (tabular, XML)

Example use
    $ totale -l en
    Doctor, can you help?
    ^D
    <TEXT>
    Doctor  TOK       doctor  Ncfs
    ,       PUN
    can     TOK       can     Voip
    you     TOK       you     Pp2
    help    TOK       help    Vmn
    ?       PUN_TERM
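A downstream tool would typically read this tabular output back into records. The following Python sketch shows one way to do that; it assumes (the transcript does not confirm this) that the columns are tab-separated, that punctuation lines carry only token and type, and that lines starting with "<" are markup such as <TEXT>. The names `Token` and `parse_tabular` are hypothetical.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str
    type: str
    lemma: Optional[str]  # absent for punctuation
    msd: Optional[str]    # absent for punctuation


def parse_tabular(output):
    """Parse tabular tagger output into Token records (assumed tab-separated)."""
    tokens = []
    for line in output.splitlines():
        # Skip blank lines and markup lines such as <TEXT>.
        # Simplification: a real token starting with "<" would also be skipped.
        if not line or line.startswith("<"):
            continue
        fields = line.split("\t")
        lemma = fields[2] if len(fields) > 2 else None
        msd = fields[3] if len(fields) > 3 else None
        tokens.append(Token(fields[0], fields[1], lemma, msd))
    return tokens


# The example output from the slide, with tabs assumed between columns.
SAMPLE = (
    "<TEXT>\n"
    "Doctor\tTOK\tdoctor\tNcfs\n"
    ",\tPUN\n"
    "can\tTOK\tcan\tVoip\n"
    "you\tTOK\tyou\tPp2\n"
    "help\tTOK\thelp\tVmn\n"
    "?\tPUN_TERM\n"
)

for token in parse_tabular(SAMPLE):
    print(token)
```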

Totale building blocks
[Diagram: mlToken, TnT and CLOG, each with its multilingual resources, combined in Perl]

Tokenisation in totale
- Perl module mlToken.pm (Camelia Ignat, JRC)
- Multilingual, with resource files for supported languages (also default rules)
- Splits text into tokens, marks token type
- Marks paragraph and sentence boundaries
- Modelled on mtSeg
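To make "splits text into tokens, marks token type" concrete, here is a deliberately small Python sketch. It is not mlToken: the regular expression, the abbreviation set (standing in for a per-language resource file) and the function names are all assumptions, and only the type labels TOK, TOK_ABBR, PUN and PUN_TERM are taken from the examples in this talk. Enumerations, digits, URLs and sentence or paragraph boundaries are not handled.

```python
import re

# Hypothetical stand-in for a per-language abbreviation resource file.
ABBREVIATIONS = {"etc."}
TERMINATORS = {".", "?", "!"}

# A word with an optional trailing period, or a single punctuation character.
TOKEN_RE = re.compile(r"\w+\.?|[^\w\s]")


def classify(form):
    """Attach a token type to a single token string."""
    if form in ABBREVIATIONS:
        return (form, "TOK_ABBR")
    if form in TERMINATORS:
        return (form, "PUN_TERM")
    if not form[0].isalnum():
        return (form, "PUN")
    return (form, "TOK")


def tokenise(text):
    """Split text into (token, type) pairs."""
    tokens = []
    for form in TOKEN_RE.findall(text):
        # A trailing period stays attached only for known abbreviations.
        if len(form) > 1 and form.endswith(".") and form not in ABBREVIATIONS:
            tokens.append(classify(form[:-1]))
            tokens.append(classify("."))
        else:
            tokens.append(classify(form))
    return tokens


print(tokenise("Doctor, can you help?"))
```

Keeping the language-specific knowledge in a data structure such as `ABBREVIATIONS`, separate from the splitting rules, mirrors the slide's point that new languages are supported through resource files with default rules as a fallback.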

Tagging in totale
- Annotating words in the text with their context-disambiguated morphosyntactic descriptions (MSDs)
- Used the tri-gram tagger TnT
- Trainable, fast, with an unknown-word guessing module, and able to accommodate the large morphosyntactic tagsets of various EU languages
- Uses (and induces from an annotated corpus) a lexicon with ambiguity classes and a tri-gram file
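The two resources a tri-gram tagger induces from an annotated corpus can be illustrated with plain counting. The Python sketch below is not TnT and does not reproduce its file formats, smoothing or sentence-boundary handling; the function name `induce_model` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def induce_model(sentences):
    """Count a lexicon and tag tri-grams from sentences of (word, tag) pairs.

    lexicon[word] is a Counter over the tags seen for that word; its keys
    form the word's ambiguity class. trigrams counts tag triples.
    """
    lexicon = defaultdict(Counter)
    trigrams = Counter()
    for sentence in sentences:
        tags = [tag for _word, tag in sentence]
        for word, tag in sentence:
            lexicon[word][tag] += 1
        for i in range(len(tags) - 2):
            trigrams[tuple(tags[i:i + 3])] += 1
    return lexicon, trigrams


# Toy annotated corpus (invented for illustration, tags in MSD style).
CORPUS = [
    [("you", "Pp2"), ("can", "Voip"), ("help", "Vmn")],
    [("a", "Di"), ("good", "Afp"), ("can", "Ncns")],
]

lexicon, trigrams = induce_model(CORPUS)
print(sorted(lexicon["can"]))  # ambiguity class of "can"
print(trigrams[("Pp2", "Voip", "Vmn")])
```

At tagging time the lexicon limits which tags are considered for a known word, and the tri-gram counts supply the contextual probabilities used to choose among them.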

Lemmatisation in totale
- Used CLOG, which learns first-order decision lists (+ list of exceptions)
- Learns lemmatisation rules for each MSD
- CLOG produces Prolog programs, but these are converted into Perl
Reference: Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), 2004.

Example CLOG rule
    # Decision list for one MSD (the sub name suggests Afcfda).
    # Patterns are tried in order; the first matching suffix gives the lemma.
    sub SUB_afcfda {
        my $w = $_[0];   # word-form to lemmatise
        my $lem;
        if    ($w =~ /^(.*)svetlejši$/) { $lem = $1 . "svetel" }
        elsif ($w =~ /^(.*)polnejši$/)  { $lem = $1 . "poln" }
        elsif ($w =~ /^(.*)bši$/)       { $lem = $1 . "b" }
        elsif ($w =~ /^(.*)elejši$/)    { $lem = $1 . "el" }
        elsif ($w =~ /^(.*)ivejši$/)    { $lem = $1 . "iv" }
        elsif ($w =~ /^(.*)anejši$/)    { $lem = $1 . "an" }
        elsif ($w =~ /^(.*)kejši$/)     { $lem = $1 . "ek" }
        elsif ($w =~ /^(.*)tejši$/)     { $lem = $1 . "t" }
        elsif ($w =~ /^(.*)ižji$/)      { $lem = $1 . "izek" }
        elsif ($w =~ /^(.*)enejši$/)    { $lem = $1 . "en" }
        elsif ($w =~ /^(.*)rejši$/)     { $lem = $1 . "er" }
        elsif ($w =~ /^(.*)nejši$/)     { $lem = $1 . "en" }
        else                            { $lem = "???" }   # no rule applies
        return $lem;
    }
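The rule above is an ordered decision list: suffix patterns are tried top to bottom and the first match wins. The same idea can be written as data plus a small interpreter. This Python sketch is an illustration only, not CLOG output; the suffixes are those of the rule above, written with š and ž, and the names `RULES_AFCFDA` and `lemmatise` are hypothetical.

```python
# (suffix to strip, replacement to append), in decision-list order.
RULES_AFCFDA = [
    ("svetlejši", "svetel"),
    ("polnejši", "poln"),
    ("bši", "b"),
    ("elejši", "el"),
    ("ivejši", "iv"),
    ("anejši", "an"),
    ("kejši", "ek"),
    ("tejši", "t"),
    ("ižji", "izek"),
    ("enejši", "en"),
    ("rejši", "er"),
    ("nejši", "en"),
]


def lemmatise(word, rules):
    """Apply the first rule whose suffix matches; return None if none does."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return None


for form in ["slabši", "nižji", "polnejši"]:
    print(form, "->", lemmatise(form, RULES_AFCFDA))
```

Order matters: "polnejši" is caught by the specific second rule and becomes "poln", whereas the general final rule for "nejši" would have produced "polen". Specific cases placed before general ones play the role of the exceptions mentioned on the previous slide.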

Training totale with MULTEXT-East resources
- Learning totale tagging and lemmatisation models
- MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
- Covers mainly Central and Eastern European languages
- Freely available for research use
- Used the MSD-tagged “1984” corpus (100k words) for tagger training
- Used MSD lexica (15k lemmas) for lemmatiser training

Currently supported languages
- English
- Slovene
- Czech
- Romanian
- Serbian
- Estonian
- Hungarian

Processing JRC’s ACQUIS-sl with totale
- sl.tar.gz (03-Sep); sl/slcelex_*.xml = 144M, 7772 files
- Wrapper Perl program, for each file:
  1. extract text (all s except first)
  2. | totale -l sl -f XML |
  3. substitute contents of original s with annotated ones
  4. validate against DTD
- 72 hrs on asterix, but the 10 s startup time per file accounts for 7772 × 10 s = 77720 s ≈ 21 hrs of that
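The startup overhead quoted on this slide follows directly from the file count. The short Python calculation below only restates the slide's figures (7772 files, 10 s startup, 72 hours in total); the variable names are illustrative.

```python
FILES = 7772           # number of sl/slcelex_*.xml files
STARTUP_SECONDS = 10   # totale startup time per invocation
TOTAL_HOURS = 72       # reported wall-clock time for the whole run

startup_seconds = FILES * STARTUP_SECONDS  # one totale start per file
startup_hours = startup_seconds / 3600
startup_share = startup_hours / TOTAL_HOURS

print(f"startup: {startup_seconds} s = {startup_hours:.1f} h "
      f"({startup_share:.0%} of the run)")
```

Since roughly a third of the run is spent starting the tool once per file, feeding several files through a single invocation would be an obvious thing to try, provided the wrapper can still map the annotated output back to the individual files.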

The problem of titles
- Dual role of titles: as text and as name of the document
- Should they contain P at all?
- Many titles untranslated; experiment with TextCat:
    4,964  sl
    1,663  en, “Ni na razpolago v slovenskem jeziku” (“Not available in Slovene”)
    1,074  en
       59  sl or en
       12  en or sl
- Also cases like “ODLOCBA t. 1346/2001/ES …”
- So the titles were not processed

Quantitative results: elements
    / /            7,771  7,771
    <signature>    7,683
    <annex>        3,658
    <P>            1,063,577
    <c>            2,865,307
      #IMPLIED     2,452,541
      TERM         412,766
    <w>            15,934,003
      #IMPLIED     14,393,953
      DIG          1,036,076
      ENUM         331,426
      ABBR         159,022
      MW           11,048
      TAG          2,234
      URL

Lexical analysis
Extracted the MULTEXT lexicon from corpus (count, word-form, lemma, MSD):
    …
     8  rafinacija    rafinacija    Ncfsn
     2  rafinacije    rafinacija    Ncfpa
    40  rafinacije    rafinacija    Ncfsg
     2  rafinacije15  rafinacije15  Mc---d
    26  rafinaciji    rafinacij     Npmpn
     9  rafinaciji    rafinacija    Ncfsl
    17  rafinacijo    rafinacija    Ncfsa
    …
- Number of lexical entries: 381,068
- Different word-forms: 221,876
- Different lemmas: 154,241
- Different MSDs: 970
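A lexicon of this shape can be obtained by counting distinct (word-form, lemma, MSD) triples in the annotated corpus, and the four totals on the slide are then set sizes over its columns. The Python sketch below uses the seven entries shown on the slide as sample data; the function names are hypothetical and this is not the script used for the talk.

```python
from collections import Counter


def extract_lexicon(annotated_tokens):
    """Count (word-form, lemma, MSD) triples from an annotated token stream."""
    return Counter(annotated_tokens)


def lexicon_stats(lexicon):
    """Summarise a lexicon keyed by (word-form, lemma, MSD)."""
    return {
        "entries": len(lexicon),
        "word_forms": len({form for form, _lemma, _msd in lexicon}),
        "lemmas": len({lemma for _form, lemma, _msd in lexicon}),
        "msds": len({msd for _form, _lemma, msd in lexicon}),
    }


# The lexicon excerpt from the slide, with its frequency counts.
SAMPLE = Counter({
    ("rafinacija", "rafinacija", "Ncfsn"): 8,
    ("rafinacije", "rafinacija", "Ncfpa"): 2,
    ("rafinacije", "rafinacija", "Ncfsg"): 40,
    ("rafinacije15", "rafinacije15", "Mc---d"): 2,
    ("rafinaciji", "rafinacij", "Npmpn"): 26,
    ("rafinaciji", "rafinacija", "Ncfsl"): 9,
    ("rafinacijo", "rafinacija", "Ncfsa"): 17,
})

print(lexicon_stats(SAMPLE))
```

Entries such as "rafinacije15" or the lemma "rafinacij" tagged Npmpn stand out immediately in such a listing, which is what makes the extracted lexicon a convenient tool for spotting tokenisation and tagging errors.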

Some problems
- Complex tokenisation: over 15% “weird” words:
    priloge.opomba     priloge.opomba     Ncfsn
    who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
    zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi
- Weak tagging model (likes verbs!):
    3  anion   anion     Ncmsa--n
    4  anion   anion     Ncmsn
    1  anion   anion     Npmsn
    3  anion   anion     Vmp--smp
    6  aniona  anion     Ncmsg
    8  anione  anion     Ncmpa
    1  anioni  anioen    Afpmsny
    1  anioni  anion     Ncmpn
    1  anioni  anioni    Vmp--pmp
    1  anioni  anioniti  Vmip3s--n

Conclusions
- Presented processing with totale on ACQUIS-sl and a quick evaluation
- Further work:
  – methodology of semi-manual annotation (model tweaking)
  – “lexical priming” in totale
- Translations and collocates