Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad de Sevilla *** Universidad de Malaga E-mail: *{l.a.ha,r.mitkov}@wlv.ac.uk, **gfernan@us.es, ***gcorpas@ya.com

Introduction
Terms and Terminology
–Terms: linguistic units which have specialised use.
–Terminology: the system of terms in a subject field.
–Terminology is vital for specialised communication, in both monolingual and multilingual contexts.

Monolingual and multilingual terminology processing
Monolingual terminology processing
–Three steps: extraction, validation, and organisation.
–Automatic extraction approaches: linguistic (may produce noise), statistical (may overlook important but low-frequency terms), and hybrid approaches.
Bilingual/multilingual term extraction
–The same three steps as in monolingual terminology processing: extraction, validation, and organisation.
–Relies on parallel corpora aligned at a certain level.
–Different models are used to align term candidates.
–Alignment is treated as an independent step.

Our approach: mutual bilingual term extraction
–Alignment plays an active role in term extraction.
–Automatic alignment is used to propagate the strengths of terminology extraction from one language into another.
–The approach relies on the availability of parallel corpora aligned at sentence level.

Mutual term extraction: three steps
1: lists of term candidates are extracted for the source and target languages;
2: term candidates from the target language are aligned to those in the source language;
3: if a term candidate in the target language is aligned to a term candidate in the source language, its term score is increased, i.e. this candidate is promoted.
Steps 1-3 can be repeated many times.

Monolingual term extraction
Lexical-syntactic-statistical approach
–Lexical-syntactic POS patterns
English: [AN]*(NP)?[AN]*N
Spanish: N[NA]*(PN)?[NA]*
–Statistical measures
Different measures were tested
Frequency was chosen
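The POS patterns above can be applied as ordinary regular expressions over a string of one-letter tags. The following is a minimal sketch, not the authors' implementation. It assumes that A stands for adjective, N for noun, and P for preposition, that every other tag is mapped to some other single character, and that the helper names (`is_candidate`, `extract_candidates`) and the `max_len` limit are hypothetical.

```python
import re

# Assumed tag alphabet: A = adjective, N = noun, P = preposition.
# Any other part of speech must be mapped to a different single character.
ENGLISH_PATTERN = re.compile(r"[AN]*(NP)?[AN]*N")
SPANISH_PATTERN = re.compile(r"N[NA]*(PN)?[NA]*")


def is_candidate(tags, pattern):
    """Return True if the whole tag sequence matches the POS pattern."""
    return pattern.fullmatch("".join(tags)) is not None


def extract_candidates(tagged_tokens, pattern, max_len=6):
    """Collect every contiguous span whose tag string matches the pattern.

    tagged_tokens is a list of (word, one_letter_tag) pairs.
    max_len is a hypothetical cap on candidate length, not from the slides.
    """
    candidates = []
    n = len(tagged_tokens)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            span = tagged_tokens[start:end]
            if is_candidate([tag for _, tag in span], pattern):
                candidates.append(" ".join(word for word, _ in span))
    return candidates
```

For example, the tag sequence N N ("lymph node") matches the English pattern and N A ("ganglio linfático") matches the Spanish one. The candidates would then be ranked by a statistical measure such as frequency.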

Term alignment
–Contingency table-based method: log-likelihood is used to estimate how likely it is that a term candidate in the source language is translated by a given term candidate in the target language.
–The table is built using a parallel corpus aligned at sentence level.
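As an illustration of the contingency table-based score, the sketch below computes a log-likelihood ratio (G²) from a 2×2 table of aligned segment counts. The slides only name "log-likelihood", so the exact formula is an assumption, and the function name and cell names (`k11` to `k22`) are hypothetical.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed cell meanings, counted over aligned segment pairs:
      k11: source candidate and target candidate both occur
      k12: only the source candidate occurs
      k21: only the target candidate occurs
      k22: neither occurs
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2
```

A table whose cells are exactly what independence predicts scores 0, and the score grows as the two candidates co-occur more consistently. Note that G² alone does not distinguish positive from negative association, so a practical aligner would also check that `k11` exceeds its expected value.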

Contingency table for “lymph node” and “ganglio linfático”

Boosting algorithms
Hypothesis: the term score of a term candidate in one language can be used to improve the term score of its aligned candidate in the other language, and vice versa, via boosting processes.
Given that:
AL(T_1, T_2): alignment score of the two term candidates T_1 and T_2.
TC_s[T]: term score of the candidate T in the source language.
TC_t[T]: term score of the candidate T in the target language.
BT(TC_1, TC_2): boosting function, i.e. how the term score of the aligned term affects the target term score.
Example: simple addition: BT(TC_1, TC_2) = TC_1 + TC_2.

Boosting algorithms (cont.)
Single boosting: the boosting process is performed on the target language only:
  Foreach term candidate T_t in the target language
    T_s = argmax_i(AL(T_t, T_i));
    TC_t[T_t] = BT(TC_s[T_s], TC_t[T_t]);
Double boosting: the boosting process is performed on both source and target languages:
  Foreach term candidate T_s in the source language
    T_t = argmax_i(AL(T_s, T_i));
    TC_s[T_s] = BT(TC_s[T_s], TC_t[T_t]);
  Foreach term candidate T_t in the target language
    T_s = argmax_i(AL(T_t, T_i));
    TC_t[T_t] = BT(TC_s[T_s], TC_t[T_t]);
Recursive boosting: the boosting process is repeated for both languages until the term candidate lists are stabilised.
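The three boosting variants can be sketched in Python as follows. This is one reading of the pseudocode, under several assumptions that are not in the slides: term scores are held in dictionaries, `al(source_term, target_term)` is a caller-supplied alignment function, candidates with no positive alignment score are left unboosted, "stabilised" is taken to mean that the ranked lists stop changing, and `max_rounds` is a hypothetical safety cap. All function names are hypothetical.

```python
def add(score_a, score_b):
    """Simple addition, the boosting function BT given as an example."""
    return score_a + score_b


def best_aligned(term, candidates, al):
    """Return the candidate with the highest positive alignment score, or None."""
    best = max(candidates, key=lambda c: al(term, c), default=None)
    if best is None or al(term, best) <= 0:
        return None
    return best


def boost(scores, other_scores, al, bt=add):
    """Return a boosted copy of `scores`, using the best-aligned term in `other_scores`."""
    boosted = dict(scores)
    for term in scores:
        partner = best_aligned(term, other_scores, al)
        if partner is not None:
            boosted[term] = bt(other_scores[partner], scores[term])
    return boosted


def single_boosting(tc_s, tc_t, al):
    """Boost target-language scores only."""
    return boost(tc_t, tc_s, lambda t, s: al(s, t))


def double_boosting(tc_s, tc_t, al):
    """Boost source scores, then target scores (which see the updated source scores)."""
    new_s = boost(tc_s, tc_t, al)
    new_t = boost(tc_t, new_s, lambda t, s: al(s, t))
    return new_s, new_t


def ranking(scores):
    """Terms ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))


def recursive_boosting(tc_s, tc_t, al, max_rounds=20):
    """Repeat double boosting until both ranked lists stop changing."""
    for _ in range(max_rounds):
        new_s, new_t = double_boosting(tc_s, tc_t, al)
        stable = ranking(new_s) == ranking(tc_s) and ranking(new_t) == ranking(tc_t)
        tc_s, tc_t = new_s, new_t
        if stable:
            break
    return tc_s, tc_t
```

With simple addition the raw scores keep growing on every round, which is why this sketch compares rankings rather than score values when testing for stability.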

Parameters
Factors affecting the outcome of the proposed algorithms: the alignment function AL, the mechanism used to calculate the initial term scores TC_s and TC_t, and the boosting function BT.
Different combinations of these functions have been tested. The best term score function is frequency, and the best boosting function is simple addition.
–In future work, we propose several probabilistic models which would give the boosting function a firmer probabilistic foundation.

Evaluation: data, gold standard, and evaluation metrics
Data
–MedlinePlus parallel texts (English/Spanish) on the topic of cancer
9,250 segments for each language
31,498 English words, 30,344 Spanish words
Aligned with Trados WinAlign, manually corrected
Gold standard
–389 English terms, 442 Spanish terms, and 357 term pairs have been validated and used as a gold standard.
Evaluation metrics
–F-measure
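For reference, the F-measure of a candidate list against a gold standard can be computed as in the sketch below. The slides do not state which variant was used, so the balanced form (beta = 1) is an assumption here, as are the function name and the exact-match comparison between candidates and gold-standard terms.

```python
def f_measure(candidates, gold, beta=1.0):
    """F-measure of extracted candidates against a gold-standard term list.

    Assumes exact string matching and, by default, the balanced F1 variant.
    """
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(candidates)
    recall = true_positives / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

In an evaluation like the one described, this would typically be applied to the top-n candidates of each ranked list for several values of n.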

Evaluation: results
Alignment accuracy
–In total, the algorithm suggests 472 translation pairs, of which 374 are confirmed as correct translations. This gives an alignment accuracy of about 0.8 (374/472 ≈ 0.79).
Term extraction performance: improved by 10 to 25%

Results (cont.)
[Figure: F-measure (y-axis, 0.5 to 0.75) plotted against number of candidates (x-axis, 400 to 800). Series: English TF, Spanish TF, English TF (Boosted), Spanish TF (Boosted), English converge boosted, Spanish converge boosted.]

Conclusion and future directions
A promising approach, but more research will be needed
A better mathematical foundation:
–Probabilistic models
–More experiments
Other domains and language pairs
–Legal
–English-Hindi

Thank you very much Questions? Comments? Criticisms?
