totale: Multilingual Tokenisation, Tagging and Lemmatisation
Tomaž Erjavec
Dept. of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
JRC Workshop, 26-27 September 2005

Overview of the talk
1. Introduction
2. The totale pipeline
3. Training totale
4. Annotating JRC-ACQUIS-sl
5. Conclusions

Introduction
- Hypothesis: to exploit the JRC-ACQUIS efficiently, its texts need to be linguistically pre-processed
- This normalizes (reduces) the data and gives other tools more features to work with

Example
Input: 2. (a) Where an exporter has declared goods packaged using automatic systems for bagging, canning, bottling, etc.,

TOKEN      TYPE      LEMMA      MSD
--------------------------------------
2.         TOK_ENUM  2.         Rmp
(a)        TOK_ENUM  (a)        Rmp
Where      TOK       where      Cs
an         TOK       a          Di
exporter   TOK       exporter   Ncns
has        TOK       have       Vaip3s
declared   TOK       declare    Vmps
goods      TOK       good       Ncnp
packaged   TOK       package    Vmis
using      TOK       use        Vmpp
automatic  TOK       automatic  Afp
systems    TOK       system     Ncnp
for        TOK       for        Sp
bagging    TOK       bag        Vmpp
,          PUN
canning    TOK       can        Vmpp
,          PUN
bottling   TOK       bottle     Vmpp
,          PUN
etc.       TOK_ABBR  etc.       Rmp

- MSD and LEMMA are context dependent
- MSD is useful for any syntactically oriented further processing (PoS filtering)
- LEMMA is useful for reducing the lexical space (easier searches)
- The task is much harder for inflectionally rich (or agglutinative) languages than for English or most 'old' EU languages!
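
The two uses named on this slide, PoS filtering on the MSD and reducing the lexical space via the lemma, can be sketched in a few lines. This is only an illustration in Python (totale itself is a Perl program). It assumes the tabular output is tab-separated with the four columns shown above, and the function names `parse_rows` and `lemmas_by_pos` are hypothetical.

```python
from collections import Counter

# A few rows copied from the example above: TOKEN, TYPE, LEMMA, MSD.
# Tab separation is an assumption about the tabular output format.
SAMPLE = "\n".join([
    "Where\tTOK\twhere\tCs",
    "an\tTOK\ta\tDi",
    "exporter\tTOK\texporter\tNcns",
    "has\tTOK\thave\tVaip3s",
    "declared\tTOK\tdeclare\tVmps",
    "goods\tTOK\tgood\tNcnp",
    ",\tPUN",
])


def parse_rows(text):
    """Split tabular output into (token, type, lemma, msd) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        # Punctuation rows carry no lemma or MSD; pad them with None.
        fields += [None] * (4 - len(fields))
        rows.append(tuple(fields[:4]))
    return rows


def lemmas_by_pos(rows, pos_letters):
    """Count lemmas whose MSD starts with one of the given PoS letters."""
    return Counter(
        lemma
        for _token, _type, lemma, msd in rows
        if msd and msd[0] in pos_letters
    )
```

Calling `lemmas_by_pos(parse_rows(SAMPLE), "N")` keeps only the nouns and indexes them by lemma, so "goods" is counted under "good".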

Nagging doubts
- Normalization loses information
- Annotation introduces errors and bias
- Evaluation for IE is inconclusive
- Unsupervised methods!
Still…

Wanted
A tool that would take text in any language and
- tokenise,
- PoS tag and
- lemmatise it.
Should be simple to install and use, robust, fast, and adaptable to new languages, preferably with a large number of already available models (and work under Linux!)

What is out there
- Component software: tokenisers, taggers, (stemmers)
- FS/RE environments: INTEX, CLARK
- Various LT workbenches, the most famous being GATE
- Alas: Java, time investment, history

Linguistic annotation with totale
- Multilingual tokenisation, tagging and lemmatisation
- Perl program with a simple pipeline architecture
- Input is plain UTF-8 text
- Output is a list of annotated tokens
- Several output formats (tabular, XML)

Example use
$ totale -l en
Doctor, can you help?
^D
<TEXT>
Doctor  TOK       doctor  Ncfs
,       PUN
can     TOK       can     Voip
you     TOK       you     Pp2
help    TOK       help    Vmn
?       PUN_TERM

Totale building blocks
[Diagram: mlToken, TnT, CLOG; multilingual resources; Perl]

Tokenisation in totale
- Perl module mlToken.pm (Camelia Ignat, JRC)
- Multilingual, with resource files for supported languages (also default rules)
- Splits text into tokens, marks token type
- Marks paragraph and sentence boundaries
- Modelled on mtSeg
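
To make "splits text into tokens, marks token type" concrete, here is a rough regex-based sketch in Python. It is not mlToken.pm and does not reproduce its rules or resource files. The type names TOK, TOK_ENUM, TOK_ABBR, PUN and PUN_TERM appear in the examples of this talk; `TOK_DIG`, the abbreviation list and the patterns themselves are assumptions made for illustration.

```python
import re

# Illustrative abbreviation list; a real resource file would be per language.
ABBREVIATIONS = {"etc."}

TOKEN_RE = re.compile(r"""
    (?P<ENUM>\(\w\)|\d+\.(?=\s))     # "(a)" or "2." followed by whitespace
  | (?P<DIG>\d+(?:[.,]\d+)*)         # digit sequences such as 1984 or 97.7
  | (?P<WORD>\w+(?:[-']\w+)*\.?)     # a word, optionally with a trailing full stop
  | (?P<PUN>[^\w\s])                 # any other single non-space symbol
""", re.VERBOSE)


def tokenise(text):
    """Return a list of (form, type) pairs for the given text."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, form = match.lastgroup, match.group()
        if kind == "ENUM":
            tokens.append((form, "TOK_ENUM"))
        elif kind == "DIG":
            tokens.append((form, "TOK_DIG"))  # hypothetical type name
        elif kind == "WORD":
            if form.lower() in ABBREVIATIONS:
                tokens.append((form, "TOK_ABBR"))
            elif form.endswith("."):
                # Not a known abbreviation: split off the full stop.
                tokens.append((form[:-1], "TOK"))
                tokens.append((".", "PUN_TERM"))
            else:
                tokens.append((form, "TOK"))
        else:
            tokens.append((form, "PUN_TERM" if form in ".?!" else "PUN"))
    return tokens
```

The sketch does not mark paragraph or sentence boundaries, and a sentence-final number followed by more text would be misread as an enumerator; language-specific resource files exist to handle cases like these.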

Tagging in totale
- Annotating words in the text with their context-disambiguated morphosyntactic descriptions (MSDs)
- Used the tri-gram tagger TnT
- Trainable, fast, with an unknown-word guessing module, and able to accommodate the large morphosyntactic tagsets of various EU languages
- Uses (and induces from an annotated corpus) a lexicon with ambiguity classes and a tri-gram file
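
The last point, inducing a lexicon with ambiguity classes and tag tri-gram counts from an annotated corpus, can be sketched as follows. This is a Python illustration of the general idea only: it does not reproduce TnT's file formats, smoothing or unknown-word guessing, and the boundary markers `<s>` and `</s>` as well as the function names are assumptions.

```python
from collections import Counter, defaultdict


def induce_model(sentences):
    """Collect a lexicon and tag tri-gram counts.

    sentences: list of sentences, each a list of (word, tag) pairs.
    Returns (lexicon, trigrams), where lexicon maps a word to a Counter
    of the tags it was seen with, and trigrams counts tag triples.
    """
    lexicon = defaultdict(Counter)
    trigrams = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            lexicon[word][tag] += 1
        # Pad with assumed boundary markers so every tag has two predecessors.
        tags = ["<s>", "<s>"] + [tag for _word, tag in sentence] + ["</s>"]
        for i in range(2, len(tags)):
            trigrams[tuple(tags[i - 2:i + 1])] += 1
    return lexicon, trigrams


def ambiguity_class(lexicon, word):
    """The sorted set of tags a word was observed with (empty if unseen)."""
    return sorted(lexicon.get(word, ()))
```

A word form such as "rafinacije", seen both as Ncfsg and as Ncfpa, ends up with the ambiguity class `["Ncfpa", "Ncfsg"]`; the tagger then has to pick one of them from context.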

Lemmatisation in totale
- Used CLOG, which learns first-order decision lists (+ list of exceptions)
- Learns lemmatisation rules for each MSD
- CLOG produces Prolog programs, but these are converted into Perl
Tomaž Erjavec and Sašo Džeroski: Machine Learning of Morphosyntactic Structure: Lemmatising Unknown Slovene Words. Applied Artificial Intelligence 18(1), pp. 17-40, 2004.

Example CLOG rule
    sub SUB_afcfda {
        my $w = $_[0]; my $lem;
        # Ordered decision list: the first matching suffix rule wins.
        if    ($w=~/^(.*)svetlejši$/) {$lem=$1."svetel"}
        elsif ($w=~/^(.*)polnejši$/)  {$lem=$1."poln"}
        elsif ($w=~/^(.*)bši$/)       {$lem=$1."b"}
        elsif ($w=~/^(.*)elejši$/)    {$lem=$1."el"}
        elsif ($w=~/^(.*)ivejši$/)    {$lem=$1."iv"}
        elsif ($w=~/^(.*)anejši$/)    {$lem=$1."an"}
        elsif ($w=~/^(.*)kejši$/)     {$lem=$1."ek"}
        elsif ($w=~/^(.*)tejši$/)     {$lem=$1."t"}
        elsif ($w=~/^(.*)ižji$/)      {$lem=$1."izek"}
        elsif ($w=~/^(.*)enejši$/)    {$lem=$1."en"}
        elsif ($w=~/^(.*)rejši$/)     {$lem=$1."er"}
        elsif ($w=~/^(.*)nejši$/)     {$lem=$1."en"}
        else  {$lem="???"}            # no rule matched
        return $lem;
    }
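
The rule above is an ordered decision list: the first suffix pattern that matches decides the lemma, so more specific suffixes must come before more general ones. The same list can be written as data, sketched here in Python rather than in the Perl that CLOG's output is converted to. The suffixes are transcribed from the slide, reading the garbled `#353` as š and `#382` as ž (an assumption about the lost character entities); the names `RULES` and `lemmatise` are hypothetical.

```python
# Ordered (suffix, replacement) pairs transcribed from the slide's rule.
# Order matters: "enejši" must be tried before the more general "nejši".
RULES = [
    ("svetlejši", "svetel"),
    ("polnejši", "poln"),
    ("bši", "b"),
    ("elejši", "el"),
    ("ivejši", "iv"),
    ("anejši", "an"),
    ("kejši", "ek"),
    ("tejši", "t"),
    ("ižji", "izek"),
    ("enejši", "en"),
    ("rejši", "er"),
    ("nejši", "en"),
]


def lemmatise(word, rules=RULES):
    """Apply the first rule whose suffix matches; '???' if none does."""
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return "???"
```

With these rules, "slabši" is mapped to "slab" and "nižji" to "nizek", while a word that matches no suffix yields "???" as in the Perl version.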

Training totale with MULTEXT-East resources
- Learning totale tagging and lemmatisation models
- MULTEXT-East language resources V3, a standardised multilingual dataset for language engineering R&D
- Covers mainly Central and Eastern European languages
- Freely available for research use from http://nl.ijs.si/ME/V3/
- Used the MSD-tagged "1984" corpus (100k words) for tagger training
- Used the MSD lexica (15k lemmas) for lemmatiser training

Currently supported languages
- English
- Slovene
- Czech
- Romanian
- Serbian
- Estonian
- Hungarian

Processing JRC's ACQUIS-sl with totale
- sl.tar.gz 03-Sep-2005 03:51 34.4M
- sl/slcelex_*.xml = 144M, 7772 files
- Wrapper Perl program: for each file
  1. extract text (all <P>s except the first)
  2. | totale -l sl -f XML |
  3. substitute contents of the original <P>s with the annotated ones
  4. validate against DTD
- 72 hrs on asterix, but the 10 s startup time per file alone accounts for 7772 x 10 s = 77,720 s, about 21.6 hrs
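
The wrapper loop and the startup-time arithmetic can be sketched as below, in Python rather than the Perl of the actual wrapper. Several things here are assumptions: that the paragraph element is called `P` (it is named that elsewhere in the talk), that annotated tokens become `<w>` elements with hypothetical `lemma` and `msd` attributes, and that the annotator is passed in as a callable standing in for the `totale -l sl -f XML` pipe. DTD validation is left out because the Python standard library does not provide it.

```python
import xml.etree.ElementTree as ET


def annotate_document(xml_text, annotate, para_tag="P"):
    """Replace the text of every paragraph except the first with tokens.

    annotate: callable taking plain text and returning an iterable of
    (form, lemma, msd) triples; a stand-in for the call to totale.
    """
    root = ET.fromstring(xml_text)
    paragraphs = list(root.iter(para_tag))
    for para in paragraphs[1:]:  # the first paragraph (the title) is skipped
        text = "".join(para.itertext())
        for child in list(para):
            para.remove(child)
        para.text = None
        for form, lemma, msd in annotate(text):
            word = ET.SubElement(para, "w", lemma=lemma, msd=msd)
            word.text = form
    return ET.tostring(root, encoding="unicode")


def startup_overhead_hours(n_files, startup_seconds):
    """Time spent purely on per-file start-up, in hours."""
    return n_files * startup_seconds / 3600
```

`startup_overhead_hours(7772, 10)` gives roughly 21.6, which is the share of the 72 hours that starting the annotator once per file accounts for; keeping one process alive across files would avoid it.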

The problem of titles
- Dual role of titles: as text and as the name of the document
- Should they contain P at all?
- Many titles are untranslated; experiment with TextCat:
  4,964  sl
  1,663  en "Ni na razpolago v slovenskem jeziku" (Slovene for "Not available in the Slovene language")
  1,074  en
     59  sl or en
     12  en or sl
- Also cases like "ODLOCBA t. 1346/2001/ES …"
- So, did not process them.

Quantitative results: elements
[element name lost]   7,771
[element name lost]   7,771
<signature>           7,683
<annex>               3,658
<P>                   1,063,577
<c>                   2,865,307
    #IMPLIED          2,452,541
    TERM              412,766
<w>                   15,934,003
    #IMPLIED          14,393,953
    DIG               1,036,076
    ENUM              331,426
    ABBR              159,022
    MW                11,048
    TAG               2,234
    URL               108
    EMAIL             47

Lexical analysis
Extracted the MULTEXT lexicon from the corpus:
…
 8  rafinacija    rafinacija    Ncfsn
 2  rafinacije    rafinacija    Ncfpa
40  rafinacije    rafinacija    Ncfsg
 2  rafinacije15  rafinacije15  Mc---d
26  rafinaciji    rafinacij     Npmpn
 9  rafinaciji    rafinacija    Ncfsl
17  rafinacijo    rafinacija    Ncfsa
…
- Number of lexical entries: 381,068
- Different word-forms: 221,876
- Different lemmas: 154,241
- Different MSDs: 970
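
Extracting such a lexicon from an annotated corpus amounts to counting distinct (word-form, lemma, MSD) triples and then counting the distinct values in each column. A minimal Python sketch, assuming the annotated tokens are already available as triples; the function names are hypothetical.

```python
from collections import Counter


def extract_lexicon(tokens):
    """Count each distinct (word_form, lemma, msd) triple in the corpus."""
    return Counter(tokens)


def lexicon_stats(lexicon):
    """Summary figures of the kind reported on the slide."""
    return {
        "entries": len(lexicon),
        "word_forms": len({form for form, _lemma, _msd in lexicon}),
        "lemmas": len({lemma for _form, lemma, _msd in lexicon}),
        "msds": len({msd for _form, _lemma, msd in lexicon}),
    }
```

Each lexical entry is one triple, so a word form tagged with two different MSDs, such as "rafinacije" above, contributes two entries but only one word form.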

Some problems
- Complex tokenisation, over 15% "weird" words:
  priloge.opomba     priloge.opomba     Ncfsn
  who/fsf/fos/97.7   who/fsf/fos/97.7   Rgp
  zavarovalnica(-e)  zavarovalnica(-e)  Ncmsi
- Weak tagging model (likes verbs!):
  3  anion   anion     Ncmsa--n
  4  anion   anion     Ncmsn
  1  anion   anion     Npmsn
  3  anion   anion     Vmp--smp
  6  aniona  anion     Ncmsg
  8  anione  anion     Ncmpa
  1  anioni  anioen    Afpmsny
  1  anioni  anion     Ncmpn
  1  anioni  anioni    Vmp--pmp
  1  anioni  anioniti  Vmip3s--n

Conclusions
- Presented processing with totale on ACQUIS-sl and a quick evaluation
- Further work:
  - methodology of semi-manual annotation (model tweaking)
  - "lexical priming" in totale
- Translations and collocates
