Advisors: Gabor Sarkozy, WPI Andras Kornai, MTA-Sztaki April 23 rd, 2013 Zhongxiu Liu CS 14’ Yidi Zhang CS 13’

Slides:



Advertisements
Similar presentations
Rationale for a multilingual corpus for machine translation evaluation Debbie Elliott Anthony Hartley Eric Atwell Corpus Linguistics 2003, Lancaster, England.
Advertisements

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
The Chinese Room: Understanding and Correcting Machine Translation This work has been supported by NSF Grants IIS Solution: The Chinese Room Conclusions.
Introducing COMPARA The Portuguese-English Parallel Corpus Ana Frankenberg-Garcia ISLA, Lisbon & Diana Santos SINTEF, Oslo.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
Where do we stand? Harold Somers Centre for Computational Linguistics, UMIST, Manchester, England Panel session, MT Summit VIII, September 2001.
A Syntactic Translation Memory Vincent Vandeghinste Centre for Computational Linguistics K.U.Leuven
Chinese Word Segmentation Method for Domain-Special Machine Translation Su Chen; Zhang Yujie; Guo Zhen; Xu Jin’an Beijing Jiaotong University.
Multilingual Information Access in a Digital Library Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy International Institute of Information.
The current status of Chinese-English EBMT research -where are we now Joy, Ralf Brown, Robert Frederking, Erik Peterson Aug 2001.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
1 The Web as a Parallel Corpus  Parallel corpora are useful  Training data for statistical MT  Lexical correspondences for cross-lingual IR  Early.
Research methods in corpus linguistics Xiaofei Lu.
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora Benjamin Arai Computer Science and Engineering Department.
Knowledge Science & Engineering Institute, Beijing Normal University, Analyzing Transcripts of Online Asynchronous.
MACHINE TRANSLATION A precious key to communicate beyond linguistic barriers 1.
Search is not only about the Web An Overview on Printed Documents Search and Patent Search Walid Magdy Centre for Next Generation Localisation School of.
LDMT MURI Data Collection and Linguistic Annotations November 4, 2011 Jason Baldridge, UT Austin Ulf Hermjakob, USC/ISI.
Systems Analysis And Design © Systems Analysis And Design © V. Rajaraman MODULE 14 CASE TOOLS Learning Units 14.1 CASE tools and their importance 14.2.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
Data collection and experimentation. Why should we talk about data collection? It is a central part of most, if not all, aspects of current speech technology.
English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
Ontology-Driven Automatic Entity Disambiguation in Unstructured Text Jed Hassell.
CROSSMARC Web Pages Collection: Crawling and Spidering Components Vangelis Karkaletsis Institute of Informatics & Telecommunications NCSR “Demokritos”
Bug Localization with Machine Learning Techniques Wujie Zheng
Information Retrieval and Web Search Cross Language Information Retrieval Instructor: Rada Mihalcea Class web page:
Accent Colors Accent 1 (RGB: 141,178,24) Accent 2 (RGB: 63,63,63) Accent 3 (RGB: 127,127,127) Accent 4 (RGB: 195,214,155) Accent 5 (RGB: 222,222,222) FLAVIUS.
Enhanced Infrastructure for Creation & Collection of Translation Resources Zhiyi Song, Stephanie Strassel (speaker), Gary Krug, Kazuaki Maeda.
GoogleDictionary Paul Nepywoda Alla Rozovskaya. Goal Develop a tool for English that, given a word, will illustrate its usage.
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Natural Language Processing Course Project: Zhao Hai 赵海 Department of Computer Science and Engineering Shanghai Jiao Tong University
Malay-English Bitext Mapping and Alignment Using SIMR/GSA Algorithms Mosleh Al-Adhaileh Tang Enya Kong Mosleh Al-Adhaileh and Tang Enya Kong Computer Aided.
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources Author: Carmen Banea, Rada Mihalcea, Janyce Wiebe Source:
Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.
Using Surface Syntactic Parser & Deviation from Randomness Jean-Pierre Chevallet IPAL I2R Gilles Sérasset CLIPS IMAG.
Extracting bilingual terminologies from comparable corpora By: Ahmet Aker, Monica Paramita, Robert Gaizauskasl CS671: Natural Language Processing Prof.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Iterative Translation Disambiguation for Cross Language Information Retrieval Christof Monz and Bonnie J. Dorr Institute for Advanced Computer Studies.
Chinese Word Segmentation Adaptation for Statistical Machine Translation Hailong Cao, Masao Utiyama and Eiichiro Sumita Language Translation Group NICT&ATR.
FEISGILTT Dublin 2014 Yves Savourel ENLASO Corporation QuEst Integration in Okapi This presentation was made possible by This project is sponsored by the.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
8 December 1997Industry Day Applications of SuperTagging Raman Chandrasekar.
Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.
Computational Linguistics Courses Experiment Test.
A Multilingual Hierarchy Mapping Method Based on GHSOM Hsin-Chang Yang Associate Professor Department of Information Management National University of.
PROGRAMMING. Computer Programs  A series of instructions to the computer  pre-written/packaged/off-the-shelf, or  custom made  There are 6 steps to.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Evaluating Translation Memory Software Francie Gow MA Translation, University of Ottawa Translator, Translation Bureau, Government of Canada
Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,
ELAN as a tool for oral history CLARIN Oral History Workshop Oxford Sebastian Drude CLARIN ERIC 18 April 2016.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Institute of Informatics & Telecommunications NCSR “Demokritos” Spidering Tool, Corpus collection Vangelis Karkaletsis, Kostas Stamatakis, Dimitra Farmakiotou.
InSilicoLab – Grid Environment for Supporting Numerical Experiments in Chemistry Joanna Kocot, Daniel Harężlak, Klemens Noga, Mariusz Sterzel, Tomasz Szepieniec.
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
NLP Midterm Solution #1 bilingual corpora –parallel corpus (document-aligned, sentence-aligned, word-aligned) (4) –comparable corpus (4) Source.
Arnar Thor Jensson Koji Iwano Sadaoki Furui Tokyo Institute of Technology Development of a Speech Recognition System For Icelandic Using Machine Translated.
RECENT TRENDS IN SMT By M.Balamurugan, Phd Research Scholar,
IT Training Webinar Converting to CAS 3.x
Computational and Statistical Methods for Corpus Analysis: Overview
Statistical n-gram David ling.
Using GOLD to Tracking L2 Development
Potential impact of QT21 Eleanor Cornelius
English project More detail and the data collection system
Presentation transcript:

Advisors: Gabor Sarkozy, WPI Andras Kornai, MTA-Sztaki April 23 rd, 2013 Zhongxiu Liu CS 14’ Yidi Zhang CS 13’

 Background  Methodology  Results & Evaluation  Future Work  User Guide

 Parallel Corpus  What is Parallel Corpus?  Usage  Training data for mathematical and machine learning model in the field of Natural Language Processing

 Dictionary  Automatically created dictionary is in great need  SZTAKI online Dictionary [1]

 Difference between Hungarian and Chinese  Medium-density vs. High-density  Word-based vs. Character based  Other linguistic Differences  Chinese does not have time-tense, plural or other form change of characters  Chinese does not have space between words  Chinese has less ambiguous sentence-ending characters

 Flowchart of our Methodology

 Variety  Literature, Captioning, Religious  Raw Documents “Clean-up”  Unifying file Encoding to UTF-8  Encoding_transform.py  Filtering out incomplete documents and useless  contents such as formatting tags in webpages  Total: 60 Parallel Documents

 Sentence/Word Level Segmentation  Rule-based segmenter: Huntoken [2]  Stemming  Extract root word in Hungarian  Stemming tool: Hunmorph[3]

 Sentence Level Segmentation  Rule-based segmenter: chinese-sentencizor.py  Word Level Segmentation  Stanford Segmentation[4]  Input:  Output:  Traditional Chinese to Simplified Chinese

 Hunalign  Input: Parallel Documents Output:

 Hunalign can gain help from bilingual dictionary  Input dictionary we used:

 Run Hunalign on the 60 normalized parallel documents with upenn_upann_hu-zh.dict  Run on Stemmed Documents -> Resulting in line-number pairs -> Parallel with unstemmed documents

 Hundict [6]  Dice score  Final Options we used: ▪ dice = 0.2 ▪ iter=5

 Run Hundict on the unstemmed sentence- level parallel corpus output by Hunalign  Filtering out name pairs

 Output  Size: sentence pairs  Sample:

 Evaluation  Score output by Hunalign is meaningless  We do manual checking: ▪ Randomly select sample sentence pairs ▪ Hungarian and Chinese native speaker communicate in English

 Output  Size: word pairs name pairs  Example:

 Evaluation: Manual Checking Similarity Score Range % Word PairsAccuracy [0, 0.1)33.2%7/20 ≈ 35% [0.1,0.2)19%12/20 ≈ 60% ( ]20.5%17/20≈ 85% ( ]16.1%18/20≈ 90% ( ]8.6%20/20≈ 100% ( ]2.6%19/20≈ 95%

 Common Error:  Translation Error draga: 亲爱 (dear, darling) Right: 昂贵 (dear, expensive)  Incomplete Translation Sellő : 鱼 (fish) Right: 美人鱼 (mermaid)  Word Phrase Gyalázatos : 行为 (act) Right: 可耻 (shameful)

 Introducing new dictionaries with our methodology  E.g. Hungarian-Japanese, Romanian-Chinese  Collecting larger and better parallel documents  Developing quantified evaluation for parallel corpora and bilingual dictionaries

 Create you own sentence-level parallel corpus and dictionary  Step1: Download our package  Step2: runHunalign.sh  Step3: runHundict.sh

 András Kornai, MTA-SZTAKI  Gábor Sárközy, Worcester Polytechnic Institute  Attila Zséder, MTA-SZTAKI  Judit Ács, MTA-SZTAKI

[1] SZTAKI. [2] [3] Tron, V., Gyepesi, G., Halacsy, P., Kornai, A., Nemeth, L., Varga, D. Hunmorph: Open Source Word Analysis. In Proceedings of the Association of Computational Linguistics. Web ftp://ftp.mokk.bme.hu/LDC/doc/acl05software.pdf [4] Tseng, P., Chang, G., Andrew, G., Jurafsky, D., Manning, C. A Conditional Random Field Word Segmenter. SIGHAN of the Association for Computational Linguistics. Web [5] Varga, D., Halacsy, P., Kornai, A., Nagy, V., Nemeth, L., Tron, V. Parallel Corpora for Medium Density Languages. Proceedings of Recent Advances in Natural Language Processing. Web [6] Zséder, A. SZTAKI.

Thank You!