Overview of the Language Technologies Institute and AVENUE Project Jaime Carbonell, Director March 2, 2002.


Language Technologies Institute Director: Jaime Carbonell Founded in 1986 as the Center for Machine Translation.

School of Computer Science Carnegie Mellon University Center for Automated Learning and Discovery Computer Science Department Entertainment Technology Center Human-Computer Interaction Institute Institute for Software Research, International Language Technologies Institute Robotics Institute

LTI Research Areas Machine Translation Information Retrieval, Summarization, Extraction, Topic Detection and Tracking Speech Recognition and Synthesis Computer Assisted Language Instruction Multi-modal Interaction

Members of LTI 18 core faculty over 40 students (working on master's and Ph.D. degrees) approximately 10 courtesy and adjunct faculty (e.g., in machine learning) around 20 funded projects

Funding of LTI Projects Companies: e.g., Caterpillar National Science Foundation DARPA: Defense Advanced Research Projects Agency Other: ATR research institute, Japan

LTI Degree Programs Master's in Language Technologies (MLT): two years. Ph.D. in Language and Information Technologies: usually three years after the master's degree. Graduate Program Director: Robert Frederking

Potential Uses of Language Technologies in Bilingual Education For language instruction For teaching subjects other than language

LT for Language Instruction Speech Recognition: –Pronunciation tutor (Eskenazi) –Reading tutor (Mostow) Grammar checking Dialogue immersion –Adventure game –Blocks world Using authentic materials

LT for bilingual education in indigenous languages Partially automated translation of teaching materials (science, history, etc.) into indigenous languages.

The AVENUE Project

Machine Translation of Indigenous Languages Policy makers have access to information about indigenous people. –Epidemics, crop failures, etc. Indigenous people can participate in –Health care –Education –Government –Internet without giving up their languages.

History of AVENUE Arose from a series of joint workshops of NSF and OAS. Workshop recommendations: –Create multinational projects using information technology to: provide immediate benefits to governments and citizens develop critical infrastructure for communication and collaborative research –training researchers and engineers –advancing science and technology

Resources for MT People who speak the language. Linguists who speak the language. Computational linguists who speak the language. Text on paper. Text on line. Comparable text on paper or on line. Parallel text on paper or on line. Annotated text (part of speech, morphology, etc.) Dictionaries (mono-lingual or bilingual) on paper or on line. Recordings of spoken language. Recordings of spoken language that are transcribed. Etc.

MT for Indigenous Languages Minimal amount of parallel text Possibly competing standards for orthography/spelling Maybe not so many trained linguists Access to native informants possible Need to minimize development time and cost

Two Technical Approaches Generalized EBMT Parallel text 50K-2MB (uncontrolled corpus) Rapid implementation Proven for major L’s with reduced data Transfer-rule learning Elicitation (controlled) corpus to extract grammatical properties Seeded version-space learning

Architecture Diagram (component labels): User; Learning Module (Elicitation Process, SVS Learning Process, Transfer Rules); Run-Time Module (SL Input, SL Parser, Transfer Engine, TL Generator, EBMT Engine, Unifier Module, TL Output)

EBMT Example English: I would like to meet her. Mapudungun: Ayükefun trawüael fey engu. English: The tallest man is my father. Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw. English: I would like to meet the tallest man Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
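The recombination idea behind this example can be sketched in a few lines. This is a toy illustration only, not the AVENUE system's actual code; the fragment table is hand-built from the aligned pairs on this slide, and a real EBMT engine would learn such correspondences from an aligned corpus.

```python
# Toy Example-Based MT recombination sketch (illustrative only).
# Tiny "translation memory" of aligned English-Mapudungun fragments,
# taken from the example pairs on this slide.
fragments = {
    "I would like to meet": "Ayükefun trawüael",
    "the tallest man": "chi doy fütra chi wentru",
}

def ebmt_translate(sentence: str) -> str:
    """Greedily replace known source fragments with their translations."""
    out = sentence
    # Try longer fragments first so larger matches win.
    for src in sorted(fragments, key=len, reverse=True):
        if src.lower() in out.lower():
            # Case-insensitive replacement of the matched span.
            idx = out.lower().index(src.lower())
            out = out[:idx] + fragments[src] + out[idx + len(src):]
    return out

print(ebmt_translate("I would like to meet the tallest man"))
# -> "Ayükefun trawüael chi doy fütra chi wentru"
```

As the slide shows, this naive recombination is close to but not identical to the correct Mapudungun sentence, which is exactly why the learned transfer rules are needed alongside EBMT.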

Version Space Learning Symbolic learning from + and – examples Invented by Mitchell, refined by Hirsh Builds generalization lattice implicitly Bounded by G and S sets Worst-case exponential complexity (in size of G and S) Slow convergence rate
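A minimal sketch of the Mitchell-style idea, tracking only the S boundary (a single most-specific hypothesis over attribute vectors). The attribute values are invented for illustration; this is the generic algorithm, not the seeded variant used for transfer rules in NICE.

```python
# Minimal version-space sketch: maintain the most-specific hypothesis S
# consistent with positive (+) and negative (-) examples.
WILD = "?"  # wildcard attribute value

def matches(h, x):
    """A hypothesis covers an example if every attribute matches or is a wildcard."""
    return all(hv == WILD or hv == xv for hv, xv in zip(h, x))

def min_generalize(h, x):
    """Least generalization of hypothesis h that also covers example x."""
    return tuple(hv if hv == xv else WILD for hv, xv in zip(h, x))

def s_boundary(examples):
    """Compute S by minimally generalizing over the positive examples."""
    pos = [x for x, label in examples if label]
    neg = [x for x, label in examples if not label]
    s = pos[0]
    for x in pos[1:]:
        s = min_generalize(s, x)
    # A consistent S must not cover any negative example.
    assert not any(matches(s, x) for x in neg), "no consistent hypothesis"
    return s

data = [
    (("NP", "sg", "nom"), True),
    (("NP", "pl", "nom"), True),
    (("VP", "sg", "nom"), False),
]
print(s_boundary(data))  # ('NP', '?', 'nom')
```

The exponential blow-up the slide mentions comes from maintaining the full G and S *sets*; this sketch sidesteps it by keeping a single S hypothesis.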

Example of Transfer Rule Lattice

Seeded Version Spaces Generate concept seed from first + example –Generalization-level hypothesis (POS + feature agreement for T-rules in NICE) Generalization/specialization level bounds –Up to k-levels generalization, and up to j-levels specialization. Implicit lattice explored seed-outwards

Complexity of SVS O(g^k) upward search, where g = # of generalization operators O(s^j) downward search, where s = # of specialization operators Since j and k are constants, the SVS runs in polynomial time of order max(j,k) Convergence rates bounded by F(j,k)

Next Steps in SVS Implementation of transfer-rule interpreter (partially complete) Implementation of SVS to learn transfer rules (under way) Elicitation corpus extension for evaluation (under way) Evaluation first on Mapudungun MT (next)

NICE Partners (Language / Country / Institutions): Mapudungun (in place): Chile – Universidad de la Frontera, Institute for Indigenous Studies, Ministry of Education Iñupiaq (advanced discussion): US (Alaska) – Ilisagvik College, Barrow school district, Alaska Rural Systemic Initiative, Trans-Arctic and Antarctic Institute, Alaska Native Language Center Siona (discussion): Colombia – OAS-CICAD, Plante, Department of the Interior

Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile Contributions of IEI –Native language knowledge and linguistic expertise in Mapudungun –Experience in bicultural, bilingual education –Data collection: recording, transcribing, translating –Orthographic normalization of Mapudungun

Agreement between LTI and Institute of Indigenous Studies (IEI), Universidad de la Frontera, Chile Contributions of LTI –Develop MT technology for indigenous languages –Training for data collection and transcription –Partial support for data collection effort pending funding from Chilean Ministry of Education –International coordination, technical and project management

LTI/IEI Agreement Continue collaboration on data collection and machine translation technology. Pursue focused areas of mutual interest, such as bilingual education. Seek additional funding sources in Chile and the US.

The IEI Team Coordinator (leader of a bilingual and multicultural education project): –Eliseo Canulef Distinguished native speaker: –Rosendo Huisca Linguists (one native speaker, one near-native) –Juan Hector Painequeo –Hugo Carrasco Typists/Transcribers Recording assistants Translators Native speaker linguistic informants

MINEDUC/IEI Agreement Highlights: Based on the LTI/IEI agreement, the Chilean Ministry of Education agreed to fund the data collection and processing team for the year. This agreement will be renewed each year, as needed.

MINEDUC/IEI Agreement: Objectives –To evaluate the NICE/Mapudungun proposal for orthography and spelling –To collect an oral corpus that represents the four Mapudungun dialects spoken in Chile. The main domain is primary health care, both traditional and Western.

MINEDUC/IEI Agreement: Deliverables –An oral corpus of 800 recorded hours, proportional to the demography of each currently spoken dialect –120 hours transcribed and translated from Mapudungun to Spanish –A refined proposal for writing Mapudungun

NICE/Mapudungun: Database Writing conventions (Grafemario) Glossary Mapudungun/Spanish Bilingual newspaper, 4 issues Ultimas Familias – memoirs Memorias de Pascual Coña – publishable product with new Spanish translation 35 hours transcribed speech 80 hours recorded speech

NICE/Mapudungun: Other Products Standardization of orthography: Linguists at UFRO have evaluated the competing orthographies for Mapudungun and written a report detailing their recommendations for a standardized orthography for NICE. Training for spoken language collection: In January 2001 native speakers of Mapudungun were trained in the recording and transcription of spoken data.

Underfunded Activities Data collection –Colombia (unfunded) –Chile (partially funded) Travel –More contact between CMU and Chile (UFRO) and Colombia. Training –Train Mapuche linguists in language technologies at CMU. –Extend training to Colombia Refine MT system for Mapudungun and Siona –Current funding covers research on the MT engine and data collection, but not detailed linguistic analysis

Outline History of MT--See Wired magazine May 2000 issue. Available on the web. How well does it work? Procedure for designing an LT project. Choose an application: What do you want to do? Identify the properties of your application. Methods: knowledge-based, statistical/corpus-based, or hybrid. Methods: interlingua, transfer, direct Typical components of an MT system. Typical resources required for an MT system.

How well does it work? Example: SpanAm Possibly the best Spanish-English MT system. Around 20 years of development.

How well does it work? Example: Systran Try it on the Altavista web page. Many language pairs are available. Some language pairs might have taken up to a person-century of development. Can translate text on any topic. Results may be amusing.

How well does it work? Example: KANT Translates equipment manuals for Caterpillar. Input is controlled English: many ambiguities are eliminated. The input is checked carefully for compliance with the rules. Around 5 output languages. The output might be post-edited. The result has to be perfect to prevent accidents with the equipment.

How well does it work? Example: JANUS Translates spoken conversations about booking hotel rooms or flights. Six languages: English, French, German, Italian, Japanese, Korean (with partners in the C-STAR consortium). Input is spontaneous speech spoken into a microphone. Output is around 60% correct. Task Completion is higher than translation accuracy: users can always get their flights or rooms if they are willing to repeat 40% of their sentences.

How well does it work? Speech Recognition Jupiter weather information: You can say things like “what cities do you know about in Chile?” and “What will be the weather tomorrow in Santiago?”. Communicator flight reservations: CMU-PLAN. You can say things like “I’m travelling to Pittsburgh.” Speechworks demo: SAY-DEMO. You can say things like “Sell my shares of Microsoft.” These are all in English, and are toll-free only in the US, but they are speaker-independent and should work with reasonable foreign accents.

Different kinds of MT Different applications: for example, translation of spoken language or text. Different methods: for example, translation rules that are hand crafted by a linguist or rules that are learned automatically by a machine. The work of building an MT program will be very different depending on the application and the methods.

Procedure for planning an MT project Choose an application. Identify the properties of your application. List your resources. Choose one or more methods. Make adjustments if your resources are not adequate for the properties of your application.

Choose an application: What do you want to do? Exchange or chat in Quechua and Spanish. Translate Spanish web pages about science into Quechua so that kids can read about science in their language. Scan the web: “Is there any information about such-and-such new fertilizer and water pollution?” Then if you find something that looks interesting, take it to a human translator. Answer government surveys about health and agriculture (spoken or written). Ask directions (“where is the library?”) (spoken). Read government publications in Quechua.

Identify the properties of your application. Do you need reliable, high quality translation? How many languages are involved? Two or more? Type of input. One topic (for example, weather reports) or any topic (for example, calling your friend on the phone to chat). Controlled or free input. How much time and money do you have? Do you anticipate having to add new topics or new languages?

Do you need high quality? Assimilation: Translate something into your language so that you can: –understand it--may not require high quality. –evaluate whether it is important or interesting and then send it off for a better translation-- does not require high quality. –use it for educational purposes--probably requires high quality.

Do you need high quality? Dissemination: Translate something into someone else’s language e.g., for publication. Usually should be high quality.

Do you need high quality? Two-Way: e.g., chat room or spoken conversation May not require high reliability on correctness if you have a native language paraphrase. –Original input : I would like to reserve a double room. –Paraphrase: Could you make a reservation for a double room.

Type of Input Formal text: newspaper, government reports, on-line encyclopedia. –Difficulty: long sentences Formal speech: spoken news broadcast. –Difficulty: speech recognition won’t be perfect. Conversational speech: –Difficulty: speech recognition won’t be perfect –Difficulty: disfluencies –Difficulty: non-grammatical speech Informal text: email, chat –Difficulty: non-grammatical text

Methods: Knowledge-Based Knowledge-based MT: a linguist writes rules for translation: –noun adjective --> adjective noun Requires a computational linguist who knows the source and target languages. Usually takes many years to get good coverage. Usually high quality.

Methods: statistical/corpus-based Statistical and corpus-based methods involve computer programs that automatically learn to translate. The program must be trained by showing it a lot of data. Requires huge amounts of data. The data may need to be annotated by hand. Does not require a human computational linguist who knows the source and target languages. Could be applied to a new language in a few days. At the current state-of-the-art, the quality is not very good.

Methods: Interlingua An interlingua is a machine-readable representation of the meaning of a sentence. –I’d like a double room/Quisiera una habitacion doble. –request-action+reservation+hotel(room-type=double) Good for multi-lingual situations. Very easy to add a new language. Probably better for limited domains -- meaning is very hard to define.
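The slide's hotel-reservation frame can be made concrete with a toy pipeline. The frame fields, sentence keys, and generator templates below are all invented for illustration; they are not the actual interchange format of any LTI system. The point is architectural: each generator reads only the frame, never the source text, so a new language needs one new generator rather than rules for every language pair.

```python
# Toy interlingua demo: analyze a sentence into a meaning frame,
# then generate from the frame in any supported target language.

# A tiny "analyzer": map known input sentences to a meaning frame.
analysis = {
    "I'd like a double room":
        {"act": "request-action+reservation+hotel", "room-type": "double"},
    "Quisiera una habitacion doble":
        {"act": "request-action+reservation+hotel", "room-type": "double"},
}

# Per-language generators that realize a frame as a sentence.
generators = {
    "en": lambda f: f"I'd like a {f['room-type']} room",
    "es": lambda f: "Quisiera una habitacion "
                    + ("doble" if f["room-type"] == "double" else f["room-type"]),
}

def translate(sentence, target):
    frame = analysis[sentence]         # source-language analysis
    return generators[target](frame)   # target-language generation

print(translate("Quisiera una habitacion doble", "en"))  # I'd like a double room
```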

Multilingual Interlingual Machine Translation

Methods: Transfer A transfer rule tells you how a structure in one language corresponds to a different structure in another language: –an adjective followed by a noun in English corresponds to a noun followed by an adjective in Spanish. Not good when there are more than two languages -- you have to write different transfer rules for each pair. Better than interlingua for unlimited domain.
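The adjective-noun reordering described above can be sketched as a single transfer rule. The POS tags and two-word lexicon are invented for illustration; a real transfer system operates on full parse structures, not flat word lists.

```python
# Sketch of one transfer rule: English ADJ NOUN -> Spanish NOUN ADJ.
lexicon = {"red": ("ADJ", "roja"), "house": ("NOUN", "casa")}

def transfer(words):
    """Translate word-by-word, then apply the ADJ+NOUN reordering rule."""
    tagged = [lexicon[w] for w in words]  # (POS, Spanish gloss) pairs
    out = []
    i = 0
    while i < len(tagged):
        # Transfer rule: ADJ followed by NOUN in English corresponds
        # to NOUN followed by ADJ in Spanish.
        if i + 1 < len(tagged) and tagged[i][0] == "ADJ" and tagged[i + 1][0] == "NOUN":
            out += [tagged[i + 1][1], tagged[i][1]]
            i += 2
        else:
            out.append(tagged[i][1])
            i += 1
    return " ".join(out)

print(transfer(["red", "house"]))  # casa roja
```

Note how this rule is specific to the English-Spanish pair, which is exactly the scaling problem the slide points out for multilingual settings.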

Methods: Direct Direct translation does not involve analyzing the structure or meaning of a language. For example, look up each word in a bilingual dictionary. Results can be hilarious: “the spirit is willing but the flesh is weak” can become “the wine is good, but the meat is lousy.” Can be developed very quickly. Can be a good back-up when more complicated methods fail to produce output.
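Direct translation is simple enough to sketch completely. The dictionary entries below are invented; a real system would use a large bilingual lexicon, and the sketch also shows the pass-through fallback behavior that makes direct MT a robust back-up.

```python
# Direct word-for-word translation: look up each word in a bilingual
# dictionary, with no analysis of structure or meaning.
dictionary = {"the": "el", "spirit": "espiritu", "is": "es", "willing": "dispuesto"}

def direct_translate(sentence):
    # Unknown words pass through unchanged -- a common fallback behavior.
    return " ".join(dictionary.get(w, w) for w in sentence.lower().split())

print(direct_translate("The spirit is willing"))  # el espiritu es dispuesto
```

The hilarious failures the slide quotes come precisely from this word-level substitution: nothing in the method can notice that an idiom should not be translated literally.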

Components of a Knowledge-Based Interlingua MT System Morphological analyzer: identify prefixes, suffixes, and stem. Parser (sentence-to-syntactic structure for source language, hand-written or automatically learned) Meaning interpreter (syntax-to-semantics, source language). Meaning interpreter (semantics-to-syntax, target language). Generator (syntactic structure-to-sentence) for target language.

Resources for a knowledge-based interlingua MT system Computational linguists who know the source and target languages. As large a corpus as possible so that the linguists can confirm that they are covering the necessary constructions, but the size of the corpus is not crucial to system development. Lexicons for source and target languages, syntax, semantics, and morphology. A list of all the concepts that can be expressed in the system’s domain.

Components of Example Based MT: a direct statistical method A morphological analyzer and part of speech tagger would be nice, but not crucial. An alignment algorithm that runs over a parallel corpus and finds corresponding source and target sentences. An algorithm that compares an input sentence to sentences that have been previously translated, or whose translation is known. An algorithm that pulls out the corresponding translation, possibly slightly modifying a previous translation.
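The matching step above can be sketched with a fuzzy string comparison. The translation-memory entries are invented, and `difflib.SequenceMatcher` stands in for whatever similarity measure a real EBMT system would use; real systems also align and adapt sub-sentential chunks rather than whole sentences.

```python
# Sketch of the EBMT matching step: find the stored source sentence
# closest to the input, then return its known translation.
import difflib

memory = {
    "where is the library": "donde esta la biblioteca",
    "where is the hotel": "donde esta el hotel",
}

def best_match(query):
    """Return the (source, translation) pair most similar to the query."""
    score = lambda s: difflib.SequenceMatcher(None, query, s).ratio()
    src = max(memory, key=score)
    return src, memory[src]

src, tgt = best_match("where is the big library")
print(src, "->", tgt)  # where is the library -> donde esta la biblioteca
```

The final component the slide lists, modifying the retrieved translation to fit the new input, is the hard part that this sketch omits.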

Resources for Example Based MT Lexicons would improve quality of translation, but are not crucial. A large parallel corpus (hundreds of thousands of words).

“Omnivorous” Multi-Engine MT: eats any available resources

Approaches we had in mind Direct bilingual-dictionary lookup: because it is easy and is a back-up when other methods fail. Generalized Example-Based MT: because it is easy and fast and can also be a back-up. Instructable Transfer-based MT: a new, untested idea involving machine learning of rules from a human native speaker. Useful when computational linguists don’t know the language, and people who know the language are not computational linguists. Conventional, hand-written transfer rules: in case the new method doesn’t work.