Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation G. Craig Murray et al. COLING 2006 Reporter Yong-Xiang.

Slides:



Advertisements
Similar presentations
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Advertisements

LABELING TURKISH NEWS STORIES WITH CRF Prof. Dr. Eşref Adalı ISTANBUL TECHNICAL UNIVERSITY COMPUTER ENGINEERING 1.
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Introducing Formal Methods, Module 1, Version 1.1, Oct., Formal Specification and Analytical Verification L 5.
Why, what were the idea ? 1.Create a data infrastructure, 2.Data + the knowledge products that are produced on the basis of data a) Efficiant access to.
® Towards Using Structural Events To Assess Non-Native Speech Lei Chen, Joel Tetreault, Xiaoming Xi Educational Testing Service (ETS) The 5th Workshop.
Module 1 Dictionary skills Part 1
Suggestion of Promising Result Types for XML Keyword Search Joint work with Jianxin Li, Chengfei Liu and Rui Zhou ( Swinburne University of Technology,
Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.
Evaluating an MT French / English System Widad Mustafa El Hadi Ismaïl Timimi Université de Lille III Marianne Dabbadie LexiQuest - Paris.
Vocabulary Matching for Book Indexing Suggestion in Linked Libraries – A Prototype Implementation & Evaluation Antoine Isaac, Dirk Kramer, Lourens van.
CSE 730 Information Retrieval of Biomedical Data The use of medical lexicon in biomedical IR.
Symmetric Probabilistic Alignment Jae Dong Kim Committee: Jaime G. Carbonell Ralf D. Brown Peter J. Jansen.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
Overview of Search Engines
1 Statistical NLP: Lecture 13 Statistical Alignment and Machine Translation.
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.
LREC Combining Multiple Models for Speech Information Retrieval Muath Alzghool and Diana Inkpen University of Ottawa Canada.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Indexing Knowledge Daniel Vasicek 2014 March 27 Introduction Basic topic is : All Human Knowledge Who Cares? Simple Examples.
CIG Conference Norwich September 2006 AUTINDEX 1 AUTINDEX: Automatic Indexing and Classification of Texts Catherine Pease & Paul Schmidt IAI, Saarbrücken.
Evaluation of the Statistical Machine Translation Service for Croatian-English Marija Brkić Department of Informatics, University of Rijeka
Exploiting Ontologies for Automatic Image Annotation M. Srikanth, J. Varner, M. Bowden, D. Moldovan Language Computer Corporation
Chapter 10: Compilers and Language Translation Invitation to Computer Science, Java Version, Third Edition.
Lecture Four: Steps 3 and 4 INST 250/4.  Does one look for facts, or opinions, or both when conducting a literature search?  What is the difference.
SCONE: reusability, granularity and collection strength Gordon Dunsire & Dennis Nicholson Presented at the Collection Description Focus, Workshop 2, Birmingham,
Advanced Signal Processing 05/06 Reinisch Bernhard Statistical Machine Translation Phrase Based Model.
Learning Phonetic Similarity for Matching Named Entity Translation and Mining New Translations Wai Lam, Ruizhang Huang, Pik-Shan Cheung ACM SIGIR 2004.
Clustering User Queries of a Search Engine Ji-Rong Wen, Jian-YunNie & Hon-Jian Zhang.
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Recognition of spoken and spelled proper names Reporter : CHEN, TZAN HWEI Author :Michael Meyer, Hermann Hild.
Yun-Nung (Vivian) Chen, Yu Huang, Sheng-Yi Kong, Lin-Shan Lee National Taiwan University, Taiwan.
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
Domain-Specific Iterative Readability Computation Jin Zhao 13/05/2011.
A merging strategy proposal: The 2-step retrieval status value method Fernando Mart´inez-Santiago · L. Alfonso Ure ˜na-L´opez · Maite Mart´in-Valdivia.
LIS618 lecture 3 Thomas Krichel Structure of talk Document Preprocessing Basic ingredients of query languages Retrieval performance evaluation.
Grade 8 – Writing Standards Text Types and Purposes (1b) Write arguments to support claims with clear reasons and relevant evidence. Support claim(s) with.
Ground Truth Free Evaluation of Segment Based Maps Rolf Lakaemper Temple University, Philadelphia,PA,USA.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Cluster-specific Named Entity Transliteration Fei Huang HLT/EMNLP 2005.
1 Compiler Design (40-414)  Main Text Book: Compilers: Principles, Techniques & Tools, 2 nd ed., Aho, Lam, Sethi, and Ullman, 2007  Evaluation:  Midterm.
National Taiwan University, Taiwan
Harvesting Social Knowledge from Folksonomies Harris Wu, Mohammad Zubair, Kurt Maly, Harvesting social knowledge from folksonomies, Proceedings of the.
Haitham Elmarakeby.  Speech recognition
Multi-level Bootstrapping for Extracting Parallel Sentence from a Quasi-Comparable Corpus Pascale Fung and Percy Cheung Human Language Technology Center,
CSE 425: Syntax I Syntax and Semantics Syntax gives the structure of statements in a language –Allowed ordering, nesting, repetition, omission of symbols.
1 Minimum Error Rate Training in Statistical Machine Translation Franz Josef Och Information Sciences Institute University of Southern California ACL 2003.
Event-Based Extractive Summarization E. Filatova and V. Hatzivassiloglou Department of Computer Science Columbia University (ACL 2004)
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Pastra and Saggion, EACL 2003 Colouring Summaries BLEU Katerina Pastra and Horacio Saggion Department of Computer Science, Natural Language Processing.
ORGANIZATION OF ELEMENTS OF INFORMATION The Thesaurus.
Copyright © Curt Hill Other Trees Applications of the Tree Structure.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
Overview of the Final Report and Findings from the Review of Sampling Methods in Extrapolated New Base-Year Generation Studies May 11-12, 2004.
ICS312 Introduction to Compilers Set 23. What is a Compiler? A compiler is software (a program) that translates a high-level programming language to machine.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Selecting Relevant Documents Assume: –we already have a corpus of documents defined. –goal is to return a subset of those documents. –Individual documents.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
This Briefing is: UNCLASSIFIED Aha! Analytics 2278 Baldwin Drive Phone: (937) , FAX: (866) An Overview of the Analytic Hierarchy Process.
SERVICE ANNOTATION WITH LEXICON-BASED ALIGNMENT Service Ontology Construction Ontology of a given web service, service ontology, is constructed from service.
METEOR: Metric for Evaluation of Translation with Explicit Ordering An Improved Automatic Metric for MT Evaluation Alon Lavie Joint work with: Satanjeev.
UNIFIED MEDICAL LANGUAGE SYSTEMS (UMLS)
What is a CAT? What is a CAT?.
Statistical NLP: Lecture 13
Eiji Aramaki* Sadao Kurohashi* * University of Tokyo
DESIGN OF EXPERIMENTS by R. C. Baker
Bug Localization with Combination of Deep Learning and Information Retrieval A. N. Lam et al. International Conference on Program Comprehension 2017.
Presentation transcript:

Leveraging Reusability: Cost-effective Lexical Acquisition for Large-scale Ontology Translation G. Craig Murray et al. COLING 2006 Reporter Yong-Xiang Chen

Background and Problem Thesauri and ontologies provide important value in facilitating access to digital archives by Thesauri and ontologies provide important value in facilitating access to digital archives by –representing underlying principles of organization Translation of such resources into multiple languages is an important component Translation of such resources into multiple languages is an important component –Specificity of vocabulary terms in most ontologies precludes fully-automated machine translation using general-domain lexical resources

Research Approach Present an efficient process for leveraging human translations when constructing domain-specific lexical resources Present an efficient process for leveraging human translations when constructing domain-specific lexical resources Evaluate the effectiveness of this process by producing a probabilistic phrase dictionary and translating a thesaurus of 56,000 concepts used to catalogue a large archive of oral histories Evaluate the effectiveness of this process by producing a probabilistic phrase dictionary and translating a thesaurus of 56,000 concepts used to catalogue a large archive of oral histories

Purpose If we need humans to assist in the translation process, how can we maximize access while minimizing cost? If we need humans to assist in the translation process, how can we maximize access while minimizing cost?Reuse!! Useful First!!

Specific Problem Most digital collections of any significant size use a system of organization that facilitates easy access to collection contents Most digital collections of any significant size use a system of organization that facilitates easy access to collection contents –The organizing principles are captured in the form of a controlled vocabulary of keyword phrases –Usually arranged in a hierarchic thesaurus or ontology

Idea Collected 3,000 manual translations of keyword phrases Collected 3,000 manual translations of keyword phrases Reused the translated terms to generate a lexicon for automated translation of the rest of the thesaurus Reused the translated terms to generate a lexicon for automated translation of the rest of the thesaurus Priority is in terms of Priority is in terms of –value in accessing the collection –the reusability of their component terms

Checked and Aligned the Translations Translations collected from one human informant Translations collected from one human informant Checked and aligned to the original English terms by a second informant Checked and aligned to the original English terms by a second informant Induce a probabilistic English-Czech phrase dictionary Induce a probabilistic English-Czech phrase dictionary

Maximizing Value and Reusability To quantify the utility of 3,000 manual translations of keyword phrases To quantify the utility of 3,000 manual translations of keyword phrases –Define two values for each keyword phrase in the thesaurus Thesaurus value, Thesaurus value, –representing the importance of the keyword phrase for providing access to the collection Translation value Translation value –representing the usefulness of having the keyword phrase translated The second is related to the first The second is related to the first

Keyword hierarchy

Meaning of nodes Internal (non-leaf) nodes of the hierarchy are used to organize concepts and support concept browsing Internal (non-leaf) nodes of the hierarchy are used to organize concepts and support concept browsing Leaf nodes are very specific and are only used to index video content Leaf nodes are very specific and are only used to index video content

Example to Thesaurus value The keyword phrase “ Auschwitz IIBirkenau (Poland: Death Camp) ” The keyword phrase “ Auschwitz IIBirkenau (Poland: Death Camp) ” –Which describes a Nazi death camp –Assigned to 17,555 video segments in the collection –Has broader (parent) terms and narrower (child) terms “ German death camps ” is not assigned to any video segments “ German death camps ” is not assigned to any video segments However, “ German death camps ” has very important narrower terms including “ Auschwitz II-Birkenau ” and others However, “ German death camps ” has very important narrower terms including “ Auschwitz II-Birkenau ” and others

Thesaurus value Represents the importance of each keyword phrase to the thesaurus Represents the importance of each keyword phrase to the thesaurus An internal node is valuable in providing access to its children An internal node is valuable in providing access to its children But value a node by the sum value of all its children, grandchildren, etc., the resulting calculation would bias the top of the hierarchy But value a node by the sum value of all its children, grandchildren, etc., the resulting calculation would bias the top of the hierarchy

Thesaurus value Leaf node: the number of video segments to which the concept has been assigned Leaf node: the number of video segments to which the concept has been assigned Parent node: plus the average of the thesaurus value of any child nodes Parent node: plus the average of the thesaurus value of any child nodes The final values quantify how valuable the translation of any given keyword phrase would be in providing access to video segments The final values quantify how valuable the translation of any given keyword phrase would be in providing access to video segments

Translation value Compute the translation value for each word in the vocabulary as the sum of the thesaurus value for every keyword phrase that contains that word Compute the translation value for each word in the vocabulary as the sum of the thesaurus value for every keyword phrase that contains that word

Use these values The end result is a list of vocabulary words and the impact that correct translation of each word would have on the overall value of the translated thesaurus The end result is a list of vocabulary words and the impact that correct translation of each word would have on the overall value of the translated thesaurus We elicited human translations of entire keyword phrases rather than individual vocabulary terms We elicited human translations of entire keyword phrases rather than individual vocabulary terms

Prioritize The value gained by translating any given phrase is more accurately estimated by the total value of any untranslated words it contains The value gained by translating any given phrase is more accurately estimated by the total value of any untranslated words it contains Prioritized the order of keyword phrase translations based on the translation value of the untranslated words in each keyword phrase Prioritized the order of keyword phrase translations based on the translation value of the untranslated words in each keyword phrase Prioritizing their translation based on the assumption that any words contained in a keyword phrase of higher priority would already have been translated Prioritizing their translation based on the assumption that any words contained in a keyword phrase of higher priority would already have been translated

Alignment Obtained professional translations for the top 3000 English keyword phrases Obtained professional translations for the top 3000 English keyword phrases This second informant: This second informant: –tokenized these translations and presented them to another bilingual Czech speaker for verification and alignment Alignment process was then used to build a probabilistic dictionary of words and phrases Alignment process was then used to build a probabilistic dictionary of words and phrases

Example of alignment

Machine Translation It first scans the English input to find the longest matching substring in our dictionary, and replaces it with the most likely Czech translation It first scans the English input to find the longest matching substring in our dictionary, and replaces it with the most likely Czech translation Looks up “ monasteries and convents stills ” in the dictionary Looks up “ monasteries and convents stills ” in the dictionary –finds no translation, backs off to “ monasteries and convents ” backs off to “ monasteries and convents ” – translated to “ kl áš tery ”

Gain rate of access value

Experiment MALACH project MALACH project –an NSF-funded effort to improve multilingual information access to large archives of spoken language Leverages a small set of manually acquired English- Czech translations to translate a large ontology of keyword phrases Leverages a small set of manually acquired English- Czech translations to translate a large ontology of keyword phrases

Evaluation Compared our system output to human reference translations using Bleu (Papineni, et al., 2002) Compared our system output to human reference translations using Bleu (Papineni, et al., 2002) Showed corrected and uncorrected machine translations to Czech speakers and collected subjective judgments of fluency and accuracy Showed corrected and uncorrected machine translations to Czech speakers and collected subjective judgments of fluency and accuracy

Bleu Scores

Subjective Judgment Score selected 418 keyword phrases to be used as target translations selected 418 keyword phrases to be used as target translations

Conclusion Demonstrate that prioritization based on hierarchical position and frequency of use facilitates extremely efficient reuse of human input Demonstrate that prioritization based on hierarchical position and frequency of use facilitates extremely efficient reuse of human input Evaluations show that our technique boost performance of a simple translation system by 65%. Evaluations show that our technique boost performance of a simple translation system by 65%.