Acquisition of Character Translation Rules for Supporting SNOMED CT Localizations Jose Antonio Miñarro-Giménez a Johannes Hellrich b Stefan Schulz a a.

Slides:



Advertisements
Similar presentations
Semantic Interoperability in Health Informatics: Lessons Learned 10 January 2008Semantic Interoperability in Health Informatics: Lessons Learned 1 Medical.
Advertisements

Multilinguality & Semantic Search Eelco Mossel (University of Hamburg) Review Meeting, January 2008, Zürich.
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Improving Machine Translation Quality via Hybrid Systems and Refined Evaluation Methods Andreas Eisele DFKI GmbH and Saarland University Helsinki, November.
Andrade et al. Corpus-based Error Detection in a Multilingual Medical Thesaurus HISA ltd. Biography proforma MEDINFO Lygon Street, Brunswick East.
Keyword extraction for metadata annotation of Learning Objects Lothar Lemnitzer, Paola Monachesi RANLP, Borovets 2007.
Course Summary LING 575 Fei Xia 03/06/07. Outline Introduction to MT: 1 Major approaches –SMT: 3 –Transfer-based MT: 2 –Hybrid systems: 2 Other topics.
Jeff Howbert Introduction to Machine Learning Winter Machine Learning Feature Creation and Selection.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department.
Ontology-based Convergence of Medical Terminologies: SNOMED CT and ICD-11 Institute for Medical Informatics, Statistics and Documentation, Medical University.
Statistical Machine Translation Part III – Phrase-based SMT / Decoding Alex Fraser Institute for Natural Language Processing University of Stuttgart
Query Rewriting Using Monolingual Statistical Machine Translation Stefan Riezler Yi Liu Google 2010 Association for Computational Linguistics.
IATE EU tool for translation-oriented terminology work
Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.
The PATENTSCOPE search system: CLIR February 2013 Sandrine Ammann Marketing & Communications Officer.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Semi-Automated Extension of a Specialized Medical Lexicon for French Bruno Cartoni & Pierre Zweigenbaum LIMSI-CNRS, France.
MIRACLE Multilingual Information RetrievAl for the CLEF campaign DAEDALUS – Data, Decisions and Language, S.A. Universidad Carlos III de.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
A School of Information Science, Federal University of Minas Gerais, Brazil b Medical University of Graz, Austria, c University Medical Center Freiburg,
An Investigation of Statistical Machine Translation (Spanish to English) Raghav Bashyal.
Extracting bilingual terminologies from comparable corpora By: Ahmet Aker, Monica Paramita, Robert Gaizauskasl CS671: Natural Language Processing Prof.
Korea Maritime and Ocean University NLP Jung Tae LEE
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Using the Web for Automated Translation Extraction in.
Improving Named Entity Translation Combining Phonetic and Semantic Similarities Fei Huang, Stephan Vogel, Alex Waibel Language Technologies Institute School.
Mutual bilingual terminology extraction Le An Ha*, Gabriela Fernandez**, Ruslan Mitkov*, Gloria Corpas*** * University of Wolverhampton ** Universidad.
Representing nursing in SNOMED CT Proposal for TR or Guideline.
Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,
FROM ONE NOMENCLATURES TO ANOTHER… Drs. Sven Van Laere.
Arnar Thor Jensson Koji Iwano Sadaoki Furui Tokyo Institute of Technology Development of a Speech Recognition System For Icelandic Using Machine Translated.
Assessing SNOMED CT for Large Scale eHealth Deployments in the EU Workpackage 2- Building new Evidence Daniel Karlsson, Linköping University Stefan Schulz,
Web-based acquisition of Japanese katakana variants
contrastive linguistics
RECENT TRENDS IN SMT By M.Balamurugan, Phd Research Scholar,
TYPES OF TRANSLATION.
Using language technology for SNOMED CT localization
Measuring Monolinguality
Can SNOMED CT be harmonized with an upper-level ontology?
CRF &SVM in Medication Extraction
Urdu-to-English Stat-XFER system for NIST MT Eval 2008
Our contribution to SNOMED CT implementation Javier Fernández
KantanNeural™ LQR Experiment
contrastive linguistics
Joint Training for Pivot-based Neural Machine Translation
Natural Language Processing of Knee MRI Reports
Ying He Wuhan University of Technology Twitter: #AMIA2017
Machine Learning Feature Creation and Selection
An ICALL writing support system tunable to varying levels
Lexical ambiguity in SNOMED CT
Reference terminologies vs. Interface terminologies
Yuri Pettinicchi Jeny Tony Philip
Morphoogle - A Multilingual Interface to a Web Search Engine
ITS 2.0 Enriched Terminology Annotation Showcase
SNOMED-CT representation Radiologic report Admission Letter
Improved Word Alignments Using the Web as a Corpus
Statistical Machine Translation Papers from COLING 2004
Statistical vs. Neural Machine Translation: a Comparison of MTH and DeepL at Swiss Post’s Language service Lise Volkart – Pierrette Bouillon – Sabrina.
The SAT, ACT, and SAT Subject Tests
A Path-based Transfer Model for Machine Translation
contrastive linguistics
Statistics Explained goes multilingual
contrastive linguistics
Digital Biomarkers – Patient data mining & precision medicine Stefan Schulz, Medical University of Graz Donausymposium Vienna, March 14, 2018.
Extracting Why Text Segment from Web Based on Grammar-gram
Machine Reading.
Presentation transcript:

Acquisition of Character Translation Rules for Supporting SNOMED CT Localizations Jose Antonio Miñarro-Giménez a Johannes Hellrich b Stefan Schulz a a Institute for Medical Informatics, Statistics and Documentatoin, Medical University of Graz, Austria b Jena University Language and Information Engineering Lab Jena, Germany

 SNOMED CT supports the development of comprehensive high-quality clinical content in health records.  Interoperability of EHR data across languages requires the translation of medical terminologies.  SNOMED CT is currently available fully or partially in English, Spanish, French, Danish, Dutch, Swedish.  Other costs related to the adoption SNOMED CT – Terminology license and participation in IHSTDO SIG – Terminology managment system and infraestructure – Human resources: coordination, terminologists. – Mapping with legacy systems. –... Introduction

 Machine translation techniques in combination with manual curation could reduce the cost of producing term translations.  Statistical machine translation (SMT) systems are based on the existence of parallel text to generate the translation model.  Rule-based translation systems are based on the definition of translation rules.  Medical terminologies contains many terms derivated from Greek or Latin origins which are shared across languages. – Appendicitis → Apendizitis

Methods List of words Rule Extraction Rule Selection Rule Evaluation Full list of transliteration rules Filtered and sorted list of transliteration rules Evaluation results of transliteration rules Training set SMT- translated training set Testing set SMT- translated testing set Evaluation set Curated translation evaluation set

Rule Extraction  Testing all combinations of characters substitution between source and target word.  Limit the total number of combinations by defining: – The max and min allowed length of the substitution strings in the source and target word. – The max number of characters between source and target substitution strings.  A rule is extracted when the source and target substitution strings improved the translation.  A rule Improves a translation when the Levenstheins’ distance between the rule translated word and the SMT translated word is lower than the distance between the source word and the SMT translated word.

Rule Selection  The set of extracted rules is tested to obtain the best list of rule with highest improvement using the testing dataset.  The set of extracted rules are grouped by the overlapping source substitution strings and we select for each group the one which better translatate the testing dataset more times.  The selected rules from each group is sorted based on the overall improvement achieved with the testing dataset.  The rank of selected rules depend on: 1.Highest number of improved translations. 2.Lowest number of deteriorated translations. 3.Highest number of words correctly translated. Overlapping group “ct” → “kt” “vect” → “vekt” “ecto” → “ekto” “ectomy” → “ektomie”

List of Transliteration rules EN → DE Rank RuleExample 1 “ine_” → “in_”“Adenine” → “Adenin” 2 “ate_” → “at_”“Fibrate” → “Fibrat” 3 “ia_” → “ie_”“Anemia” → “Anemie” 4 “ide_” → “id_”“Choride” → “Chlorid” 5 “sis_” → “se_”“Analysis” → “Analyse” 6 “one_” → “on_”“Deoxycortone” → “Deoxycorton” 7 “sm_” → “smus_”“Albinism” → “Albinismus” 8 “ole_” → “ol_”“Phenole” → “Phenol” 9 “hy_” → “hie_”“Hypertrophy” → “Hypertrophie” 10 “my_” → “mie_”Gastronomy” → “Gastronomie”

Rule Evaluation  Gold standard contains 29,790 manually curated list of translated words.  The selected and sorted list of rules is evaluated using the gold standard. 1.Rule-translated words are obtained. 2.Statistical machine translated (SMT) words are obtained. 3.The Levenstheins’ distance is calculated between the rule-translated words and the gold standard and also between the SMT words and the gold stardard. 4.The calculated distances are compared.

Results  A list of 286 rules was created.  Google translate produced 87% of correct translations  Rule translations obtained 60% of correct translations.  Rule approach improved 55% of not correctly translated words by Google translate.  Rule approach correctly translated 27% of not correctly translated words by Google translate.  The 59% of all words in the evaluation dataset have the same in English and German, e.g. “serum”, “escherichia”

Conclusions  Translation rules can be automatically obtained from parallel corpus produced by a statistical machine translation.  Inflection and variability of words in target language (German) complicates the exact translation based on rules.  Rule based approach cannot deal with words that do not share common root.  Statistical machine translation produces better results than rule-based translations. However, Rule approach could improve the translation of words that are not translated by the statistical machine translation. – Low frequency terms in specific domains, such as medicine.

Combined translation approach List of words Parallel corpus SMT Google translate Untranslated words Transliteration rules Rule Translation Rule-translated words SMT words Translated list of words

Questions