Bilingual term extraction revisited: Comparing statistical and linguistic methods for a new pair of languages Špela Vintar Faculty of Arts Dept. of Translation University of Ljubljana
Overview Term identification in a monolingual context: Some known approaches Slovene-English setup: Corpora, tools, resources Multi-word terms: Collocations, nested terms and term variants Bilingual lexicon extraction and term equivalence Extracting semantic information Evaluation & things to improve
Bilingual term extraction: Usual processing sequence. L1-L2 parallel corpus → term candidates L1 + term candidates L2 → finding translation equivalents → bilingual term lexicon
Corpus: Slovene-English parallel corpus of terminological texts (TRANS), ca. 1 million tokens; created within a student project at our Department; aligned with DejaVu (hand-validated); on-line concordancing available. 2 subcorpora: Nuclear Engineering (25,000 tokens), Economic legislation (166,000 tokens)
Linguistic processing: Slovene – tokenization – part-of-speech tagging: TnT (Brants 2000); training corpus creation; tagger training & error correction – lemmatization: Amebis (thanks!) proprietary lemmatization tool; non-disambiguated: je → biti, jesti, on; lemma disambiguation through self-made rules
Linguistic processing II English – using DFKI tools (thanks!) – POS-tagging (TnT) – lemmatization (MMorph) – chunking (Chunkie)
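What is a term: “keywordness”. Measures of keywordness: subcorpus vs. general language corpus: relative corpus frequency; document vs. document collection: tf.idf. Applied to single or multi-word units. $\mathrm{weight}(i,j) = (1 + \log \mathrm{tf}_{i,j}) \, \log \frac{N}{\mathrm{df}_i}$, where $\mathrm{tf}_{i,j}$ is the frequency of term $i$ in document $j$, $\mathrm{df}_i$ the number of documents containing term $i$, and $N$ the number of documents in the collection.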
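The tf.idf weighting on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `tfidf_weights` and the toy documents are assumptions, not part of the original setup.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """documents: list of token lists.
    Returns one dict per document mapping term -> (1 + log tf) * log(N / df)."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))          # document frequency: count each term once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)            # term frequency within this document
        weights.append({
            term: (1 + math.log(count)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy collection (hypothetical data)
docs = [
    ["steam", "generator", "replacement", "steam", "generator"],
    ["the", "generator", "project"],
    ["the", "economic", "legislation"],
]
weights = tfidf_weights(docs)
```

A term that is frequent in one document but rare in the collection ("steam") ends up with a higher weight than one spread over several documents ("generator").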
Other indicators of termness: Acronyms (NPP, SG, RBB...); Unknown words – not found in the reference corpus – unknown to the lemmatizer; Cognates & Named entities: JE Krško = Krško NPP; Konzorcij = Consortium; Siemens/Framatome = Siemens/Framatome
Identifying multi-word units: Collocation extraction techniques – Mutual Information (Church & Hanks 1990) – Log-likelihood ratio (Dunning 1993) – Entropy-based (Shimohata et al. 1997) – Semantic non-compositionality (Pearce 2001). According to Daille (1994), LL is the most appropriate measure. For n > 3: n-gram frequency (+ stopword filtering) also works
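As an illustration of the log-likelihood ratio (Dunning 1993) applied to adjacent word pairs, here is a minimal Python sketch. It computes the G² statistic from a 2×2 contingency table per bigram; the function names and the toy sentence are assumptions for the example, and this is not the tool chain used for the slides.

```python
import math
from collections import Counter

def _g2_sum(observed, expected):
    # Sum of O * ln(O / E); empty cells contribute nothing
    return sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

def bigram_log_likelihood(tokens):
    """Return {(w1, w2): G2 score} for every adjacent word pair in tokens."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    first = Counter(w1 for w1, _ in bigrams.elements())
    second = Counter(w2 for _, w2 in bigrams.elements())
    total = sum(bigrams.values())
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = first[w1] - k11          # w1 followed by another word
        k21 = second[w2] - k11         # another word followed by w2
        k22 = total - k11 - k12 - k21  # bigrams containing neither
        row1, row2 = k11 + k12, k21 + k22
        col1, col2 = k11 + k21, k12 + k22
        observed = [k11, k12, k21, k22]
        expected = [row1 * col1 / total, row1 * col2 / total,
                    row2 * col1 / total, row2 * col2 / total]
        scores[(w1, w2)] = 2 * _g2_sum(observed, expected)
    return scores

tokens = "steam generator replacement of the steam generator was a large project".split()
ll_scores = bigram_log_likelihood(tokens)
```

In this toy text the recurring pair ("steam", "generator") scores higher than a pair seen only once, such as ("of", "the").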
N-gram term weighting: statistically extracted n-grams are not necessarily terms; need for filtering / weighting. 1. Stopword filtering 2. Weighting with tf.idf, ll-rank/core frequency: $\mathrm{weight}(t_{w_1,w_2,w_3}) = \frac{\mathrm{tf.idf}_{w_1} + \mathrm{tf.idf}_{w_2} + \mathrm{tf.idf}_{w_3}}{n} \cdot \frac{1}{\mathrm{rank}}$
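Treatment of nested terms: Local Max of bigram LL-scores
previous steam generator replacements: previous steam 34.17; steam generator 602.05; generator replacements 77.88
steam generator replacement requires: steam generator 602.05; generator replacement 77.88; replacement requires 20.44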
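One possible reading of the Local Max idea, sketched in Python: within a word sequence, keep the adjacent pair whose association score is higher than the scores of both neighbouring pairs. The function name `local_max_bigrams` and the treatment of sequence edges (a missing neighbour counts as minus infinity) are assumptions made for this sketch.

```python
def local_max_bigrams(words, scores):
    """words: n tokens; scores: n-1 association scores for adjacent pairs.
    Returns the pairs whose score exceeds both neighbouring scores."""
    if len(scores) != len(words) - 1:
        raise ValueError("need exactly one score per adjacent word pair")
    peaks = []
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        if s > left and s > right:
            peaks.append((words[i], words[i + 1]))
    return peaks

# The two sequences from the slide, with their bigram LL-scores
seq1 = local_max_bigrams(["previous", "steam", "generator", "replacements"],
                         [34.17, 602.05, 77.88])
seq2 = local_max_bigrams(["steam", "generator", "replacement", "requires"],
                         [602.05, 77.88, 20.44])
```

For both sequences the only local maximum is the pair "steam generator", which is then treated as the core of the nested term.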
Treatment of nested terms: Local Max of bigram LL-scores; C-value (Frantzi & Ananiadou 1996): $\text{C-value}(a) = (\mathrm{length}(a) - 1)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$
n-gram: C-value
compressive force: 10.3
axial compressive: 5.2
axial compressive force: 16.4
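The C-value formula can be written as a small Python function. In this sketch t(a) is taken to be the total frequency of a inside longer candidate terms and c(a) the number of those longer candidates; for a candidate that is not nested anywhere, the second term is dropped, which is the usual convention and an assumption here. The counts in the example are hypothetical and do not reproduce the values on the slide.

```python
def c_value(length, freq, nested_freq=0, n_longer=0):
    """C-value(a) = (length(a) - 1) * (freq(a) - t(a) / c(a)).

    length      -- number of words in candidate a
    freq        -- corpus frequency of a
    nested_freq -- t(a): frequency of a inside longer candidates
    n_longer    -- c(a): number of longer candidates containing a
    """
    if n_longer == 0:
        # a is not nested in any longer candidate: nothing to subtract
        return (length - 1) * freq
    return (length - 1) * (freq - nested_freq / n_longer)

# Hypothetical counts: the trigram occurs 8 times; the bigram occurs 12 times,
# 8 of them inside that single longer candidate.
trigram_score = c_value(length=3, freq=8)
bigram_score = c_value(length=2, freq=12, nested_freq=8, n_longer=1)
```

With these counts the longer term scores 16 and the nested bigram 4.0, so the bigram is demoted because most of its occurrences are explained by the longer term.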
Extracting multi-word terms: Syntactic patterns. Extraction of terminologically relevant part-of-speech patterns (applied as regular expressions or finite state automata) (Heid 1998, 2001; Bourigault 1996; Jacquemin 2001). Patterns enable extraction of single occurrences. Patterns facilitate treatment of term variation (replacement of steam generator = steam generator replacement): NN1 of NN2 NN3 = NN2 NN3 NN1. Patterns facilitate treatment of nesting – head of phrase may be easily established
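A minimal sketch of how part-of-speech patterns can be applied as regular expressions, in Python. Each token's tag is mapped to a one-letter code so that a regex match position equals a token position. The Penn-style tag names, the pattern `A*N+(?:ON+)?` and the helper names are illustrative assumptions, not the pattern grammar used in this work.

```python
import re

# One-letter codes for the tags the pattern cares about (assumed tagset)
TAG_CODES = {"NN": "N", "NNS": "N", "JJ": "A"}

def encode(tagged):
    """Map (word, tag) pairs to a string with one character per token."""
    codes = []
    for word, tag in tagged:
        if word.lower() == "of":
            codes.append("O")                 # the preposition 'of' gets its own code
        else:
            codes.append(TAG_CODES.get(tag, "x"))
    return "".join(codes)

# Optional adjectives, one or more nouns, optionally 'of' plus more nouns
NOUN_PHRASE = re.compile(r"A*N+(?:ON+)?")

def extract_candidates(tagged):
    """Return multi-word pattern instances found in a tagged sentence."""
    code_string = encode(tagged)
    return [" ".join(word for word, _ in tagged[m.start():m.end()])
            for m in NOUN_PHRASE.finditer(code_string)
            if m.end() - m.start() >= 2]

def normalize_of_variant(term):
    """Rewrite 'NN1 of NN2 NN3' as 'NN2 NN3 NN1'; leave other terms unchanged."""
    head, sep, modifier = term.partition(" of ")
    return f"{modifier} {head}" if sep else term

sentence = [("the", "DT"), ("replacement", "NN"), ("of", "IN"), ("steam", "NN"),
            ("generator", "NN"), ("requires", "VBZ"), ("careful", "JJ"), ("planning", "NN")]
candidates = extract_candidates(sentence)
# candidates == ["replacement of steam generator", "careful planning"]
variant = normalize_of_variant(candidates[0])
# variant == "steam generator replacement"
```

Because a pattern match needs only one occurrence, candidates are found even when they appear once in the corpus, and the variant rewrite lets both surface forms count towards the same term.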
Bilingual lexicon extraction word-alignment tools: Twente (Hiemstra 1998), Egypt/Giza (Och 2000), PLUG (Tiedemann 1999) etc. comparison planned; currently using Twente – based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model – outputs translation candidates + scores for each word in the corpus; both ways – using stopword-filtered corpora to improve results
Output of Twente lexicon extraction
Extraction of cognates: string comparison on the level of types in two parallel segments: Perl module String::Approx (Hietaniemi 2002); high precision; cognates override bilingual lexicon; term relevance! Nr. of extracted cognate pairs: NE 364, EL 776
informatika = informatics
infrastrukture = infrastructure
instrumentacija = instrumentation
integracija = integrating
integrala = integral
iterativen = iterative
karakteristik = characteristics
kaskade = cascade
koeficient = coefficient
komponenta = component
koncentracijo = concentration
koncept = concept
konstanta = constant
konvergenca = convergence
koordinat = coordinates
linearne = linear
logistična = logistic
materiali = materials
matrika = Matrix
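The cognate step can be approximated in Python as follows. Note the substitution: instead of the Perl module String::Approx, this sketch uses `difflib.SequenceMatcher` from the Python standard library as the string-similarity measure. The threshold of 0.7 and the function name `cognate_pairs` are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def cognate_pairs(source_types, target_types, threshold=0.7):
    """Pair each source word type with its most similar target type
    from the aligned segment, if the similarity reaches the threshold."""
    pairs = []
    for s in source_types:
        best, best_ratio = None, 0.0
        for t in target_types:
            ratio = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if ratio > best_ratio:
                best, best_ratio = t, ratio
        if best is not None and best_ratio >= threshold:
            pairs.append((s, best, round(best_ratio, 2)))
    return pairs

# Word types from one (hypothetical) pair of aligned segments
found = cognate_pairs(["koeficient", "uparjalnik"],
                      ["coefficient", "steam", "generator"])
```

Only "koeficient" and "coefficient" pass the threshold here; "uparjalnik" has no orthographically similar counterpart and is left to the bilingual lexicon.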
Term alignment: for each source term candidate we collect all single-word equivalents from the bilingual lexicon
projekt zamenjave uparjalnikov: projekt → project 1.00; zamenjave → [null] 0.00; uparjalnikov → steam 0.49, generator 0.33, generators 0.18
among extracted target terms we choose the one with the highest match of words; scores are added up into an equivalence score
steam generator replacement project: 1.82
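The alignment step can be sketched in Python as below. The lexicon entries follow the example on the slide, under the assumption that "project" belongs to "projekt", "[null]" to "zamenjave", and the remaining three words to "uparjalnikov"; the competing target candidates and the function names are hypothetical.

```python
def equivalence_score(source_term, target_term, lexicon):
    """Add up lexicon scores of all single-word equivalents of the
    source words that occur in the target term."""
    target_words = set(target_term.split())
    score = 0.0
    for src_word in source_term.split():
        for tgt_word, prob in lexicon.get(src_word, {}).items():
            if tgt_word in target_words:
                score += prob
    return score

def best_equivalent(source_term, target_candidates, lexicon):
    """Choose the target term candidate with the highest equivalence score."""
    return max(target_candidates,
               key=lambda t: equivalence_score(source_term, t, lexicon))

lexicon = {
    "projekt": {"project": 1.00},
    "zamenjave": {},  # aligned to [null]: no single-word equivalent
    "uparjalnikov": {"steam": 0.49, "generator": 0.33, "generators": 0.18},
}
targets = ["steam generator", "project",
           "steam generator replacement project"]  # hypothetical candidate list

source = "projekt zamenjave uparjalnikov"
best = best_equivalent(source, targets, lexicon)
print(best, f"{equivalence_score(source, best, lexicon):.2f}")
# prints: steam generator replacement project 1.82
```

The word "replacement" contributes nothing to the score because "zamenjave" has only a null alignment; the candidate still wins because it covers the equivalents of both other source words.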
Bilingual term extraction: Statistical model. L1-L2 parallel corpus → single-word terms (tf.idf, cognates, unknown words) and contiguous n-grams (2-4) in each language (log-likelihood, stopword filtering, collapsing nesting, term weighting) → multi-word terms; bilingual lexicon + cognate pairs → term alignment → bilingual term candidates
Bilingual term extraction: Pattern-based model. L1-L2 tagged & lemmatized parallel corpus → single-word terms (nouns only; tf.idf of lemmas, cognates) and multi-word pattern instances (pattern grammar, stopword filtering, term weighting) → multi-word terms; bilingual lexicon + cognate pairs → term alignment → bilingual term candidates
High precision term pairs
Evaluation & Results. Evaluation data: Slovene: one hand-tagged document (tagged by a group of translation students, not domain experts!) – 181 terms (including nestings). Pattern-based term-tagging correctly detects 71 (precision x, recall 39.2%). Reasons for missed terms: – term length > 4 – term variation (low tf.idf) – automatic filtering of nestings too rigid – tagging/lemmatization mistakes (pattern not extracted). Fine-tuning of the weighting scheme needed (currently set to highest possible precision)
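The recall figure follows directly from the two counts on the slide (71 correctly detected terms out of 181 hand-tagged ones). A short Python check, with a precision helper included for completeness; precision itself cannot be computed here because the number of extracted candidates is not given.

```python
def recall(true_positives, gold_total):
    """Share of hand-tagged terms that the extractor found."""
    return true_positives / gold_total

def precision(true_positives, extracted_total):
    """Share of extracted candidates that are hand-tagged terms."""
    return true_positives / extracted_total

reported_recall = recall(71, 181)
print(f"{reported_recall:.1%}")  # prints: 39.2%
```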