The current status of Chinese-English EBMT research: where are we now?
Joy (Ying Zhang), Ralf Brown, Robert Frederking, Erik Peterson
Aug 2001



Language Technologies Institute, School of Computer Science, Carnegie Mellon University

Overview of Ch-En EBMT

Adapting EBMT to Chinese
  – Segmentation of Chinese
Corpus used
  – Hong Kong legal code (from LDC)
  – Hong Kong news articles (from LDC)
In this project:
  – Robert Frederking, Ralf Brown, Joy, Erik Peterson, Stephan Vogel, Alon Lavie, Lori Levin,

Corpus Statistics

Hong Kong Legal Code:
  – Chinese: 23 MB
  – English: 37.8 MB
Hong Kong News (after cleaning): 7,622 documents
  – Dev-test: 1,331,915 bytes, 4,992 sentence pairs
  – Final-test: 1,329,764 bytes, 4,866 sentence pairs
  – Training: 25,720,755 bytes, 95,752 sentence pairs
Corpus Cleaning
  – Converted from Big5 to GB
  – Divided into training set (90%), dev-test (5%) and test set (5%)
  – Sentence-level alignment, using the Gale & Church method (by ISI)
  – Cleaned
  – Convert two-byte Chinese characters to their cognates
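The 90% / 5% / 5% division can be sketched as below. This is only an illustration: the slides do not say how the split was drawn, so the shuffling, the fixed seed, and the name `split_corpus` are assumptions. The reported sentence-pair counts are only roughly 90/5/5, which may mean the split was made over documents; the sketch splits whatever unit it is given.

```python
import random

def split_corpus(items, train_pct=90, dev_pct=5, seed=0):
    """Shuffle aligned units and split them into train / dev-test / final-test.

    `items` can be sentence pairs or whole documents. The percentages follow
    the slide; the shuffle and seed are assumptions made for this sketch.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(items) * train_pct // 100
    n_dev = len(items) * dev_pct // 100
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder becomes the final test set
    return train, dev, test

# Usage with hypothetical placeholder data:
pairs = [("zh%d" % i, "en%d" % i) for i in range(100)]
train, dev, test = split_corpus(pairs)
print(len(train), len(dev), len(test))
```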

Chinese Segmentation

There are no spaces between Chinese words in written Chinese.
The segmentation problem: given a sentence with no spaces, break it into words.
The definition of a Chinese word is vague.

Our Definition of Words/Phrases/Terms

Chinese Characters
  – The smallest unit in written Chinese is a character, which is represented by 2 bytes in GB-2312 code.
Chinese Words
  – A word in natural language is the smallest reusable unit which can be used in isolation.
Chinese Phrases
  – We define a Chinese phrase as a sequence of Chinese words. For each word in the phrase, the meaning of this word is the same as the meaning when the word appears by itself.
Terms
  – A term is a meaningful constituent. It can be either a word or a phrase.

Complicated Constructions

Transliterated foreign words and names
Abbreviations
Chinese names
Chinese numbers

Segmenter

Approaches
  – Statistical approaches:
    Idea: build collocation models for Chinese characters, such as a first-order HMM. Place the space where two characters rarely co-occur.
    Cons:
      – Data sparseness
      – Cross boundary
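The idea of placing a space where two characters rarely co-occur can be illustrated with raw adjacent-character counts. This is a deliberately simplified sketch, not the HMM mentioned on the slide: the function names, the `min_count` threshold, and the three-sentence toy corpus are all assumptions.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count adjacent character pairs in unsegmented training sentences."""
    counts = Counter()
    for s in sentences:
        counts.update(zip(s, s[1:]))
    return counts

def segment_by_cooccurrence(sentence, counts, min_count=2):
    """Put a boundary between two characters seen together fewer than
    `min_count` times; otherwise keep them in the same word."""
    if not sentence:
        return []
    words = [sentence[0]]
    for prev, cur in zip(sentence, sentence[1:]):
        if counts[(prev, cur)] >= min_count:
            words[-1] += cur      # frequent pair: extend the current word
        else:
            words.append(cur)     # rare pair: start a new word
    return words

# Toy corpus (hypothetical); real models need far more data, which is
# where the data-sparseness problem listed on the slide comes from.
counts = train_bigram_counts(["香港法律", "香港新闻", "法律新闻"])
print(segment_by_cooccurrence("香港新闻法律", counts))
```

A pair that straddles a true word boundary but happens to be frequent would be glued together by this rule, which is one way to read the "cross boundary" drawback.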

Segmenter (2)

  – Dictionary-based approaches
    Idea: use a dictionary to find the words in the sentence
    Forward maximum match, backward maximum match, or both directions
    Cons:
      – The size and quality of the dictionary used are of great importance: new words, named entities
      – Maximum (greedy) match may cause mis-segmentations
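Forward and backward maximum match can be sketched as follows. This is a generic textbook version under stated assumptions, not the project's segmenter: the `max_len` limit, the fallback to single characters for unknown text, and the four-word toy dictionary are illustrative choices.

```python
def forward_max_match(sentence, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or cand in dictionary:
                words.append(cand)
                i += size
                break
    return words

def backward_max_match(sentence, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(sentence)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            cand = sentence[j - size:j]
            if size == 1 or cand in dictionary:
                words.append(cand)
                j -= size
                break
    words.reverse()
    return words

# Toy dictionary (hypothetical). The two directions disagree here, which
# shows how greedy matching can mis-segment.
dictionary = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", dictionary))
print(backward_max_match("研究生命起源", dictionary))
```

When the two directions disagree, a segmenter that runs both needs some rule to choose between them; the slides do not say which rule was used.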

Segmenter (3)

  – A combination of dictionary and linguistic knowledge
    Idea: use morphology, POS, grammar and heuristics to aid disambiguation
    Pros: high accuracy (possible)
    Cons:
      – Requires a dictionary with POS and word frequency
      – Computationally expensive

Segmenter (4)

We first used LDC's segmenter.
Currently we are using a forward/backward maximum match segmenter as the baseline.
The word frequency dictionary is from LDC: 43,959 entries.
For HLT 2001, we augmented the frequency dictionary with new words found in the corpus by a statistical method.

Two-threshold method

Two-threshold for tokenization (finding new words from the corpus): for MT Summit VIII

For PI Meeting

Baseline System
  – Using LDC's frequency word dictionary
Full System
  – Tokenize new words from the pre-segmented corpus using the two-threshold method, augment the frequency dictionary with new words to re-segment the corpus
  – Bracket English
  – Using feedback from statDict to adjust segmentation/bracketing
Baseline + Named-Entity
  – Named-entity tagger by Erik Peterson
Multi-corpora System
  – Cluster the documents into sub-corpora according to their topics

Evaluation Issues

Automatic Measures
  – EBMT Source Match
  – EBMT Source Coverage
  – EBMT Target Coverage
  – MEMT (EBMT+DICT) Unigram Coverage
  – MEMT (EBMT+DICT) PER
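Assuming PER here stands for position-independent error rate, one common bag-of-words formulation is sketched below. The slides give no formula, so this exact definition and the function name are assumptions, and the example sentences are invented.

```python
from collections import Counter

def position_independent_error_rate(hypothesis, reference):
    """Word error rate that ignores word order.

    Both arguments are lists of tokens. Matches are counted as the
    multiset intersection of the two token bags; missing and extra
    words count as errors, normalized by the reference length.
    """
    if not reference:
        raise ValueError("reference must contain at least one token")
    matches = sum((Counter(hypothesis) & Counter(reference)).values())
    errors = max(len(hypothesis), len(reference)) - matches
    return errors / len(reference)

# Invented example: two extra words against a three-word reference.
hyp = "the law of hong kong".split()
ref = "hong kong law".split()
print(position_independent_error_rate(hyp, ref))
```

Because order is ignored, a hypothesis containing exactly the reference words in any order scores 0, so the measure reflects lexical choice rather than word order.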

Evaluation Issues (2)

Human Evaluations
  – 4-5 graders each time
  – 6 categories

After PI Meeting (0)

Study of results reported in PI meeting
  – The quality of Named-Entity (cleaned by Erik)
  – Performance difference of EBMT while changing the average length of Chinese word tokens (by changing segmentation)
  – How to evaluate the performance of the system
Experiment of G-EBMT
  – Word clustering

After PI Meeting (1)

Changing the average length of Chinese tokens
  – No bracket on English
  – Use a subset of LDC's frequency dictionary for segmentation
  – Study the performance of the EBMT system on different average Chinese token lengths
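The quantity varied in this experiment, average Chinese token length, can be computed from a segmented corpus as in the sketch below. Measuring length in characters per token is an assumption, as is the function name; the sample sentences are invented.

```python
def average_token_length(segmented_sentences):
    """Mean number of characters per token over a segmented corpus.

    `segmented_sentences` is an iterable of token lists, for example the
    output of a segmenter run with a particular dictionary subset.
    """
    total_chars = 0
    total_tokens = 0
    for tokens in segmented_sentences:
        total_tokens += len(tokens)
        total_chars += sum(len(tok) for tok in tokens)
    return total_chars / total_tokens if total_tokens else 0.0

# A smaller dictionary tends to leave more single-character tokens,
# which lowers the average (invented example).
coarse = [["香港", "法律"], ["新闻"]]
fine = [["香", "港", "法律"], ["新", "闻"]]
print(average_token_length(coarse), average_token_length(fine))
```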

After PI Meeting (2)

Avg. token length vs. PER

After PI Meeting (3)

Type-token curve of Chinese and English
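A type-token curve plots the number of distinct words (types) seen against the number of running words (tokens) read. A minimal sketch of how such a curve could be computed follows; the sampling `step` and the function name are assumptions, and the slide's actual plot is not reproduced here.

```python
def type_token_curve(tokens, step=1):
    """Return (tokens_read, distinct_types) points sampled every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

# Invented example: the curve flattens as words start to repeat.
print(type_token_curve("a b a c b d".split(), step=2))
```

For Chinese the curve depends on the segmentation, since changing the dictionary changes both the token count and the set of types.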

Future Research Plan

Generalized EBMT
  – Word clustering
  – Grammar induction
Using machine learning to optimize the parameters used in MEMT
Better alignment model: integrating segmentation, bracketing and alignment

New Alignment Model (1)

Using both monolingual and bilingual collocation information to segment and align the corpus

References

Tom Emerson. "Segmentation of Chinese Text". MultiLingual Computing & Technology, #38, Volume 12, Issue 2. MultiLingual Computing, Inc.
Ying Zhang, Ralf D. Brown, and Robert E. Frederking. "Adapting an Example-Based Translation System to Chinese". To appear in Proceedings of the Human Language Technology Conference 2001 (HLT-2001).
Ying Zhang, Ralf D. Brown, Robert E. Frederking, and Alon Lavie. "Pre-processing of Bilingual Corpora for Mandarin-English EBMT". Accepted at MT Summit VIII (Santiago de Compostela, Spain, Sep. 2001).