Designing a Machine Translation Project Lori Levin and Alon Lavie Language Technologies Institute Carnegie Mellon University CATANAL Planning Meeting Barrow,

Presentation transcript:

Designing a Machine Translation Project
Lori Levin and Alon Lavie
Language Technologies Institute, Carnegie Mellon University
CATANAL Planning Meeting, Barrow, Alaska, March 8-9, 2001

Outline
- History of MT: see the May 2000 issue of Wired magazine, available on the web.
- How well does it work?
- Procedure for designing an MT project.
- Choose an application: What do you want to do?
- Identify the properties of your application.
- Methods: knowledge-based, statistical/corpus-based, or hybrid.
- Methods: interlingua, transfer, direct.
- Typical components of an MT system.
- Typical resources required for an MT system.

How well does it work? Example: SpanAm
- Handout: example from the SpanAm system of the Pan American Health Organization.
- Probably the best Spanish-English MT system.
- Around 20 years of development.

How well does it work? Example: Systran
- Try it on the Altavista web page.
- Many language pairs are available.
- Some language pairs might have taken up to a person-century of development.
- Can translate text on any topic.
- Results may be amusing.

How well does it work? Example: KANT
- Translates equipment manuals for Caterpillar.
- Input is controlled English: many ambiguities are eliminated.
- The input is checked carefully for compliance with the rules.
- Around 5 output languages.
- The output might be post-edited.
- The result has to be perfect to prevent accidents with the equipment.

How well does it work? Example: JANUS
- Translates spoken conversations about booking hotel rooms or flights.
- Six languages: English, French, German, Italian, Japanese, Korean (with partners in the C-STAR consortium).
- Input is spontaneous speech spoken into a microphone.
- Output is around 60% correct.
- Task completion is higher than translation accuracy: users can always get their flights or rooms if they are willing to repeat 40% of their sentences.

How well does it work? Speech Recognition
- Jupiter weather information: You can say things like “What cities do you know about in Chile?” and “What will be the weather tomorrow in Santiago?”.
- Communicator flight reservations: CMU-PLAN. You can say things like “I’m travelling to Pittsburgh.”
- Speechworks demo: SAY-DEMO. You can say things like “Sell my shares of Microsoft.”
- These are all in English, and are toll-free only in the US, but they are speaker-independent and should work with reasonable foreign accents.

Different kinds of MT
- Different applications: for example, translation of spoken language or text.
- Different methods: for example, translation rules that are hand-crafted by a linguist or rules that are learned automatically by a machine.
- The work of building an MT program will be very different depending on the application and the methods.

Procedure for planning an MT project
- Choose an application.
- Identify the properties of your application.
- List your resources.
- Choose one or more methods.
- Make adjustments if your resources are not adequate for the properties of your application.

Choose an application: What do you want to do?
- Exchange email or chat in Mapudungun and Spanish.
- Translate Spanish web pages about science into Mapudungun so that kids can read about science in their language.
- Scan the web: “Is there any information about such-and-such new fertilizer and water pollution?” Then if you find something that looks interesting, take it to a human translator.
- Answer government surveys about health and agriculture (spoken or written).
- Ask directions (“Where is the library?”) (spoken).
- Read government publications in Mapudungun.

Identify the properties of your application.
- Do you need reliable, high quality translation?
- How many languages are involved? Two or more?
- Type of input.
- One topic (for example, weather reports) or any topic (for example, calling your friend on the phone to chat).
- Controlled or free input.
- How much time and money do you have?
- Do you anticipate having to add new topics or new languages?

Do you need high quality?
- Assimilation: Translate something into your language so that you can:
  – understand it: may not require high quality.
  – evaluate whether it is important or interesting and then send it off for a better translation: does not require high quality.
  – use it for educational purposes: probably requires high quality.

Do you need high quality?
- Dissemination: Translate something into someone else’s language, e.g., for publication.
- Usually should be high quality.

Do you need high quality?
- Two-Way: e.g., chat room or spoken conversation.
- May not require high reliability on correctness if you have a native language paraphrase.
  – Original input: I would like to reserve a double room.
  – Paraphrase: Could you make a reservation for a double room.

Type of Input
- Formal text: newspaper, government reports, on-line encyclopedia.
  – Difficulty: long sentences
- Formal speech: spoken news broadcast.
  – Difficulty: speech recognition won’t be perfect.
- Conversational speech:
  – Difficulty: speech recognition won’t be perfect
  – Difficulty: disfluencies
  – Difficulty: non-grammatical speech
- Informal text: email, chat
  – Difficulty: non-grammatical speech

Resources
- People who speak the language.
- Linguists who speak the language.
- Computational linguists who speak the language.
- Text on paper.
- Text online.
- Comparable text on paper or online.
- Parallel text on paper or online.
- Annotated text (part of speech, morphology, etc.)
- Dictionaries (monolingual or bilingual) on paper or online.
- Recordings of spoken language.
- Recordings of spoken language that are transcribed.
- Etc.

Methods: Knowledge-Based
- Knowledge-based MT: a linguist writes rules for translation:
  – noun adjective --> adjective noun
- Requires a computational linguist who knows the source and target languages.
- Usually takes many years to get good coverage.
- Usually high quality.
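
To make the "noun adjective --> adjective noun" rule concrete, here is a minimal Python sketch of how one hand-written reordering rule might be applied when translating Spanish into English. The three-word lexicon, the part-of-speech tags, and the function name `translate` are all invented for illustration; a real knowledge-based system would need a full grammar and lexicon.

```python
# Toy illustration of one hand-written rule: NOUN ADJ --> ADJ NOUN.
# The lexicon and tags below are hypothetical, not from any real system.

LEXICON = {  # Spanish word -> (English word, part of speech)
    "la": ("the", "DET"),
    "casa": ("house", "NOUN"),
    "blanca": ("white", "ADJ"),
}


def translate(sentence):
    # Look up each word; unknown words pass through with an UNK tag.
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]
    out = []
    i = 0
    while i < len(tagged):
        # Rule: a noun followed by an adjective is emitted as adjective, noun.
        if (i + 1 < len(tagged)
                and tagged[i][1] == "NOUN"
                and tagged[i + 1][1] == "ADJ"):
            out.extend([tagged[i + 1][0], tagged[i][0]])
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)


print(translate("la casa blanca"))  # the white house
```

Every construction the system should handle needs a rule of this kind, which is one reason good coverage can take years.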

Methods: statistical/corpus-based
- Statistical and corpus-based methods involve computer programs that automatically learn to translate.
- The program must be trained by showing it a lot of data.
- Requires huge amounts of data.
- The data may need to be annotated by hand.
- Does not require a human computational linguist who knows the source and target languages.
- Could be applied to a new language in a few days.
- At the current state of the art, the quality is not very good.

Methods: Interlingua
- An interlingua is a machine-readable representation of the meaning of a sentence.
  – I’d like a double room/Quisiera una habitación doble.
  – request-action+reservation+hotel(room-type=double)
- Good for multi-lingual situations. Very easy to add a new language.
- Probably better for limited domains: meaning is very hard to define.
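
As a rough sketch of the idea, the Python below parses the interlingua string from the slide and generates an English and a Spanish sentence from it. The parsing convention (concepts joined by "+", arguments in parentheses), the templates, and the names `parse_interlingua` and `generate` are assumptions made for this example and are not the format or API of any actual system.

```python
import re

# Hypothetical per-language generation resources for one kind of request.
TEMPLATES = {
    "en": "I'd like a {room} room.",
    "es": "Quisiera una habitación {room}.",
}
VALUES = {
    "en": {"double": "double"},
    "es": {"double": "doble"},
}


def parse_interlingua(text):
    # Split "a+b+c(k=v, ...)" into a tuple of concepts and a dict of arguments.
    match = re.fullmatch(r"([^()]+)(?:\((.*)\))?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized interlingua string: {text!r}")
    concepts = tuple(match.group(1).split("+"))
    args = {}
    if match.group(2):
        for pair in match.group(2).split(","):
            key, value = pair.split("=")
            args[key.strip()] = value.strip()
    return concepts, args


def generate(interlingua, lang):
    concepts, args = parse_interlingua(interlingua)
    if concepts != ("request-action", "reservation", "hotel"):
        raise ValueError("this sketch only covers hotel reservation requests")
    room = VALUES[lang][args["room-type"]]
    return TEMPLATES[lang].format(room=room)


IL = "request-action+reservation+hotel(room-type=double)"
print(generate(IL, "en"))  # I'd like a double room.
print(generate(IL, "es"))  # Quisiera una habitación doble.
```

In this sketch, adding a language means adding one entry to `TEMPLATES` and `VALUES`; nothing about the other languages changes.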

Multilingual Interlingual Machine Translation
[Diagram slide; figure not preserved in the transcript.]

Methods: Transfer
- A transfer rule tells you how a structure in one language corresponds to a different structure in another language:
  – an adjective followed by a noun in English corresponds to a noun followed by an adjective in Spanish.
- Not good when there are more than two languages: you have to write different transfer rules for each pair.
- Better than interlingua for unlimited domain.
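
The cost of writing rules for each pair can be illustrated with a small count. The sketch below assumes that each translation direction needs its own rule set under transfer, and that an interlingua system needs one analyzer and one generator per language; the function names are made up for this example.

```python
def transfer_modules(n_languages):
    # One rule set per ordered (source, target) pair of languages.
    return n_languages * (n_languages - 1)


def interlingua_modules(n_languages):
    # One analyzer into the interlingua and one generator out of it per language.
    return 2 * n_languages


for n in (2, 3, 6):
    print(n, transfer_modules(n), interlingua_modules(n))
# 2 2 4
# 3 6 6
# 6 30 12
```

Under these assumptions, transfer is cheaper for a single language pair, while for six languages the interlingua approach needs 12 modules against 30 rule sets.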

Methods: Direct
- Direct translation does not involve analyzing the structure or meaning of a language.
- For example, look up each word in a bilingual dictionary.
- Results can be hilarious: “the spirit is willing but the flesh is weak” can become “the wine is good, but the meat is lousy.”
- Can be developed very quickly.
- Can be a good back-up when more complicated methods fail to produce output.
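
A word-by-word dictionary lookup can be sketched in a few lines of Python. The four-entry dictionary and the function name `direct_translate` are invented for illustration.

```python
# Hypothetical Spanish-to-English dictionary with four entries.
BILINGUAL_DICT = {
    "quisiera": "I would like",
    "una": "a",
    "habitación": "room",
    "doble": "double",
}


def direct_translate(sentence):
    # Replace each word independently; words not in the dictionary pass through.
    words = sentence.lower().rstrip(".").split()
    return " ".join(BILINGUAL_DICT.get(w, w) for w in words)


print(direct_translate("Quisiera una habitación doble."))
# I would like a room double
```

The output keeps the Spanish word order ("room double") because nothing in the method looks at structure, which shows both why this is quick to build and why the results can be odd.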

Components of a Knowledge-Based Interlingua MT System
- Morphological analyzer: identify prefixes, suffixes, and stem.
- Parser (sentence-to-syntactic structure for source language, hand-written or automatically learned).
- Meaning interpreter (syntax-to-semantics, source language).
- Meaning interpreter (semantics-to-syntax, target language).
- Generator (syntactic structure-to-sentence) for target language.

Resources for a knowledge-based Interlingua MT system
- Computational linguists who know the source and target languages.
- As large a corpus as possible so that the linguists can confirm that they are covering the necessary constructions, but the size of the corpus is not crucial to system development.
- Lexicons for source and target languages, syntax, semantics, and morphology.
- A list of all the concepts that can be expressed in the system’s domain.

Components of Example Based MT: a direct statistical method
- A morphological analyzer and part of speech tagger would be nice, but not crucial.
- An alignment algorithm that runs over a parallel corpus and finds corresponding source and target sentences.
- An algorithm that compares an input sentence to sentences that have been previously translated, or whose translation is known.
- An algorithm that pulls out the corresponding translation, possibly slightly modifying a previous translation.
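
The comparison and retrieval steps might look roughly like the Python sketch below. It assumes the parallel corpus is already aligned at the sentence level, uses the standard library's `difflib.SequenceMatcher` over word tokens as a stand-in similarity measure, and skips the step of modifying the retrieved translation. The three example pairs and the name `best_example` are invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sentence-aligned examples standing in for a parallel corpus.
EXAMPLES = [
    ("I would like a double room", "Quisiera una habitación doble"),
    ("I would like a single room", "Quisiera una habitación individual"),
    ("where is the library", "dónde está la biblioteca"),
]


def best_example(sentence, examples=EXAMPLES):
    tokens = sentence.lower().split()

    def score(pair):
        # Word-level similarity between the input and a stored source sentence.
        return SequenceMatcher(None, tokens, pair[0].lower().split()).ratio()

    best = max(examples, key=score)
    return best, score(best)


(source, target), similarity = best_example("I would like a double room please")
print(source)  # I would like a double room
print(target)  # Quisiera una habitación doble
```

A fuller system would also decide what to do when the best score is low, for example by handing the sentence to another method.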

Resources for Example Based MT
- Lexicons would improve quality of translation, but are not crucial.
- A large parallel corpus (hundreds of thousands of words).

“Omnivorous” Multi-Engine MT: eats any available resources

Approaches we have in mind
- Direct bilingual-dictionary lookup: because it is easy and is a back-up when other methods fail.
- Generalized Example-Based MT: because it is easy and fast and can also be a back-up.
- Instructable Transfer-based MT: a new, untested idea involving machine learning of rules from a human native speaker. Useful when computational linguists don’t know the language, and people who know the language are not computational linguists.
- Conventional, hand-written transfer rules: in case the new method doesn’t work.
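
The "back-up when other methods fail" idea can be sketched as a simple fallback chain in Python: try each engine in order and use the first one that produces output. This is only one possible way to combine engines and is not taken from the slides; the two toy engines, their data, and the names `ENGINES` and `translate` are invented for illustration.

```python
def example_based(sentence):
    # Toy translation memory; returns None when no stored example matches.
    memory = {"where is the library": "dónde está la biblioteca"}
    return memory.get(sentence.lower())


def dictionary_lookup(sentence):
    # Toy word-by-word lookup; unknown words pass through unchanged.
    lexicon = {
        "where": "dónde",
        "is": "está",
        "the": "la",
        "library": "biblioteca",
        "school": "escuela",
    }
    return " ".join(lexicon.get(w, w) for w in sentence.lower().split())


# Engines in order of preference; the last one always produces something.
ENGINES = [("example-based", example_based), ("dictionary", dictionary_lookup)]


def translate(sentence):
    for name, engine in ENGINES:
        result = engine(sentence)
        if result is not None:
            return name, result
    raise RuntimeError("no engine produced output")


print(translate("Where is the library"))  # ('example-based', 'dónde está la biblioteca')
print(translate("Where is the school"))   # ('dictionary', 'dónde está la escuela')
```

Putting the dictionary engine last reflects its role in the list above: it is easy to build and always returns something, even if the quality is low.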