

Mutual bilingual terminology extraction
Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas***
* University of Wolverhampton
** Universidad de Sevilla
*** Universidad de Malaga

Introduction
Terms and terminology
– Terms: linguistic units which have specialised use.
– Terminology: the system of terms in a subject field.
– Terminology is vital for specialised communication, in both monolingual and multilingual contexts.

Monolingual and multilingual terminology processing
Monolingual terminology processing
– Three steps: extraction, validation, and organisation.
– Automatic extraction approaches: linguistic (may produce noise), statistical (may overlook important but low-frequency terms), and hybrid.
Bilingual/multilingual term extraction
– The same three steps as in monolingual terminology processing: extraction, validation, and organisation.
– Relies on parallel corpora aligned at a certain level.
– Different models are used to align term candidates.
– Alignment is treated as an independent step.

Our approach: mutual bilingual term extraction
Alignment plays an active role in term extraction.
Automatic alignment is used to propagate the strengths of terminology extraction from one language into the other.
The approach relies on the availability of parallel corpora aligned at sentence level.

Mutual term extraction: three steps
Step 1: lists of term candidates are extracted for the source and target languages.
Step 2: term candidates from the target language are aligned to those in the source language.
Step 3: if a term candidate in the target language is aligned to a term candidate in the source language, its term score is increased, i.e. the candidate is promoted.
Steps 1-3 can be repeated many times.

Monolingual term extraction
Lexical-syntactic-statistical approach
– Lexical-syntactic POS patterns
  English: [AN]*(NP)?[AN]*N
  Spanish: N[NA]*(PN)?[NA]*
– Statistical measures
  Different measures were tested; frequency was chosen.
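The two POS patterns can be tried out with ordinary regular expressions. The sketch below is only an illustration: it assumes that every token has already been tagged with a one-letter code (A = adjective, N = noun, P = preposition, X = anything else). The slides do not name a tagger or tag set, so the tag alphabet and the function name `extract_candidates` are assumptions.

```python
import re

# Patterns from the slide, applied to a string of one-letter POS tags.
# Assumed tag alphabet: A = adjective, N = noun, P = preposition, X = other.
PATTERNS = {
    "en": re.compile(r"[AN]*(?:NP)?[AN]*N"),
    "es": re.compile(r"N[NA]*(?:PN)?[NA]*"),
}


def extract_candidates(tagged, lang):
    """Return non-overlapping token sequences whose tags match the pattern.

    tagged: list of (token, tag) pairs, one tag letter per token.
    lang:   "en" or "es".
    """
    # One character per token, so match offsets are token offsets.
    tags = "".join(tag for _, tag in tagged)
    candidates = []
    for match in PATTERNS[lang].finditer(tags):
        tokens = [tok for tok, _ in tagged[match.start():match.end()]]
        candidates.append(" ".join(tokens))
    return candidates


english = [("the", "X"), ("enlarged", "A"), ("lymph", "N"), ("node", "N"),
           ("of", "P"), ("the", "X"), ("neck", "N")]
spanish = [("el", "X"), ("ganglio", "N"), ("linfático", "A")]
print(extract_candidates(english, "en"))
print(extract_candidates(spanish, "es"))
```

Because `finditer` returns greedy, non-overlapping matches, nested candidates such as "lymph node" inside "enlarged lymph node" are not listed separately; whether the original system enumerated them is not stated in the slides.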

Term alignment
Contingency table-based method: log-likelihood is used to estimate how likely it is that a term candidate in the source language is translated by a given term candidate in the target language.
The table is built using a parallel corpus aligned at sentence level.

Contingency table for “lymph node” and “ganglio linfático”
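A contingency table-based alignment score of this kind can be sketched as follows. The slides do not give the exact formula, so the code assumes the widely used log-likelihood ratio (G²) over a 2×2 table of sentence-pair counts. The function names and the toy data are illustrative and are not the counts from the slide.

```python
import math


def contingency_table(pairs, src_term, tgt_term):
    """Count aligned sentence pairs by presence of the two candidates.

    pairs: iterable of (source_candidates, target_candidates) sets,
           one entry per aligned sentence pair.
    Returns (a, b, c, d): both present, source only, target only, neither.
    """
    a = b = c = d = 0
    for src_terms, tgt_terms in pairs:
        in_src = src_term in src_terms
        in_tgt = tgt_term in tgt_terms
        if in_src and in_tgt:
            a += 1
        elif in_src:
            b += 1
        elif in_tgt:
            c += 1
        else:
            d += 1
    return a, b, c, d


def log_likelihood(a, b, c, d):
    """Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)) for a 2x2 table."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    observed = [a, b, c, d]
    # Expected counts under independence: row total * column total / n.
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


# Toy corpus of four aligned sentence pairs (invented for illustration).
toy_pairs = [
    ({"lymph node"}, {"ganglio linfático"}),
    ({"lymph node"}, {"ganglio linfático"}),
    ({"tumour"}, {"tumor"}),
    (set(), set()),
]
table = contingency_table(toy_pairs, "lymph node", "ganglio linfático")
print(table, log_likelihood(*table))
```

G² is large for any strong departure from independence, including negative association, so a practical aligner would probably also check that the pair co-occurs more often than expected before accepting it.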

Boosting algorithms
Hypothesis: the term score of a term candidate in one language can be used to improve the term score of its aligned candidate in the other language, and vice versa, via boosting processes.
Given that:
– AL(T1, T2): alignment score of the two term candidates T1 and T2.
– TC_s[T]: term score of the candidate T in the source language.
– TC_t[T]: term score of the candidate T in the target language.
– BT(TC1, TC2): boosting function, i.e. how the term score of the aligned term affects the target term score. Example: simple addition, BT(TC1, TC2) = TC1 + TC2.

Boosting algorithms (cont.)
Single boosting: the boosting process is performed on the target language only:
  Foreach term candidate T_t in the target language
    T_s = argmax over source candidates T_i of AL(T_t, T_i);
    TC_t[T_t] = BT(TC_s[T_s], TC_t[T_t]);
Double boosting: the boosting process is performed on both source and target languages:
  Foreach term candidate T_s in the source language
    T_t = argmax over target candidates T_i of AL(T_s, T_i);
    TC_s[T_s] = BT(TC_s[T_s], TC_t[T_t]);
  Foreach term candidate T_t in the target language
    T_s = argmax over source candidates T_i of AL(T_t, T_i);
    TC_t[T_t] = BT(TC_s[T_s], TC_t[T_t]);
Recursive boosting: the boosting process is repeated for both languages until the term candidate lists are stabilised.
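The three boosting variants can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. Term scores are plain dictionaries, `al` is assumed to be a symmetric scoring function, and a candidate is only boosted when its best alignment score exceeds a threshold `min_al`. "Stabilised" is read as "the ranking no longer changes", with `max_rounds` as a safety limit. The threshold, the stopping rule and all names are hypothetical.

```python
def add(tc_aligned, tc_own):
    """Simple-addition boosting function BT(TC1, TC2) = TC1 + TC2."""
    return tc_aligned + tc_own


def boost(tc_own, tc_other, al, bt=add, min_al=0.0):
    """One pass: update each score in tc_own from its best-aligned partner.

    Assumes tc_other is non-empty and al(x, y) is symmetric.
    """
    boosted = {}
    for term, score in tc_own.items():
        partner = max(tc_other, key=lambda other: al(term, other))
        if al(term, partner) > min_al:  # assumed cut-off for "is aligned"
            boosted[term] = bt(tc_other[partner], score)
        else:
            boosted[term] = score
    return boosted


def single_boost(tc_s, tc_t, al):
    """Boost the target language only."""
    return tc_s, boost(tc_t, tc_s, al)


def double_boost(tc_s, tc_t, al):
    """Boost the source first, then the target from the updated source."""
    new_s = boost(tc_s, tc_t, al)
    new_t = boost(tc_t, new_s, al)
    return new_s, new_t


def ranking(tc):
    """Candidates sorted by descending score, ties broken alphabetically."""
    return sorted(tc, key=lambda term: (-tc[term], term))


def recursive_boost(tc_s, tc_t, al, max_rounds=20):
    """Repeat double boosting until both rankings stop changing."""
    for _ in range(max_rounds):
        new_s, new_t = double_boost(tc_s, tc_t, al)
        stable = (ranking(new_s) == ranking(tc_s)
                  and ranking(new_t) == ranking(tc_t))
        tc_s, tc_t = new_s, new_t
        if stable:
            break
    return tc_s, tc_t


# Toy scores and a single alignment score, invented for illustration.
tc_en = {"lymph node": 5, "patient": 7}
tc_es = {"ganglio linfático": 2, "vez": 6}
pair_scores = {frozenset(("lymph node", "ganglio linfático")): 30.0}


def al(x, y):
    return pair_scores.get(frozenset((x, y)), 0.0)


print(single_boost(tc_en, tc_es, al)[1])
print(double_boost(tc_en, tc_es, al))
```

In the toy data the Spanish candidate "ganglio linfático" starts below the non-term "vez" and overtakes it after one pass, which is the promotion effect described in step 3 of the method.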

Parameters
Factors affecting the outcome of the proposed algorithms: the alignment function AL, the mechanism used to calculate the initial term scores TC_s and TC_t, and the boosting function BT.
Different combinations of these functions were tested. The best term score function is frequency, and the best boosting function is simple addition.
– In future work, we propose several probabilistic models which provide a better probabilistic foundation for the boosting function.

Evaluation: data, gold standard, and evaluation metrics
Data
– MedlinePlus parallel texts (English/Spanish) on the topic of Cancer
  9,250 segments for each language
  31,498 English words, Spanish words
  Aligned with Trados WinAlign, manually corrected
Gold standard
– 389 English terms, 442 Spanish terms, and 357 term pairs were validated and used as the gold standard.
Evaluation metrics
– F-measure
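As a reminder of how the metric is computed against a gold standard, here is a minimal sketch. The slides only say "F-measure", so the balanced form (beta = 1) is an assumption, as are the helper `f_at_n`, which scores the top n entries of a ranked candidate list, and the toy term lists.

```python
def f_measure(extracted, gold, beta=1.0):
    """F-measure of a set of extracted terms against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    if correct == 0:
        return 0.0
    precision = correct / len(extracted)
    recall = correct / len(gold)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def f_at_n(ranked, gold, n):
    """F-measure of the n highest-ranked candidates (hypothetical helper)."""
    return f_measure(ranked[:n], gold)


# Toy example: 2 of 4 extracted candidates are in a gold standard of 3 terms.
ranked_candidates = ["lymph node", "tumour", "time", "patient"]
gold_terms = {"lymph node", "tumour", "biopsy"}
print(f_at_n(ranked_candidates, gold_terms, 4))  # precision 1/2, recall 2/3
```

Scoring successive cut-offs n with `f_at_n` gives a curve of F-measure against the number of candidates.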

Evaluation: results
Alignment accuracy
– In total, the algorithm suggests 472 translation pairs, of which 374 are confirmed as correct translations. This gives an alignment accuracy of about 0.79 (374/472).
Term extraction performance: improved by 10 to 25%

Results (cont.)
[Figure: F-measure against number of candidates. Series: English TF, Spanish TF, English TF (boosted), Spanish TF (boosted), English converge boosted, Spanish converge boosted]

Conclusion and future directions
A promising approach, but more research will be needed
A better mathematical foundation:
– Probabilistic models
– More experiments
Other domains and language pairs
– Legal
– English-Hindi

Thank you very much Questions? Comments? Criticisms?