LING 388: Language and Computers Sandiway Fong Lecture 26: 11/29.

Administrivia Homework #5 –due today

Homework 5: Question 1
What do (specific senses of) the following nouns have in common?
 –Umbrella
 –Saucepan
 –Baseball bat
 –Carpet beater
that they do not share with:
 –Giraffe
 –Pretzel
 –Homework

Homework 5: Question 1
Compound nouns are present in wnconnect
 –baseball bat: baseball_bat (‘_’ from Prolog representation)
 –carpet beater: carpet_beater

Homework 5: Question 1 wnconnect is designed to look for links between two concepts –homework can be done this way –but perhaps better to explore with a WordNet browser, e.g. from Princeton

Last Time
Internet search and language
 –information retrieval
   precision: what proportion of the hits returned are relevant?
   recall: what proportion of the truly relevant answers are returned?
 –stemming
   pre-processing stage: find root forms of words to expand search
   can increase recall (perhaps at the expense of precision)
   compromise: selective use of stemming (from Google)
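The two measures above can be sketched in a few lines of Python. This is only an illustration: the document IDs and the two result lists (an exact-match search and a stemmed search) are made-up data, and `precision_recall` is a hypothetical helper name, not something from the course materials.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a list of hits against the true relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # hits that are actually relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: four documents are truly relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}
exact = ["d1", "d2"]                            # exact-match search
stemmed = ["d1", "d2", "d3", "d7", "d8", "d9"]  # search expanded by stemming

print(precision_recall(exact, relevant))    # (1.0, 0.5)
print(precision_recall(stemmed, relevant))  # (0.5, 0.75)
```

In this toy case stemming finds one more relevant document (recall rises from 0.5 to 0.75) but also pulls in three irrelevant ones (precision falls from 1.0 to 0.5), which is the trade-off the slide describes.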

Last Time
Internet search and language
 –compounding
   stemming interacts with compounds: operating systems
   compound identification is important for information retrieval
   semantics is more difficult
    –can have compositional semantics: tea leaf, teabag, teabreak
    –can be idiomatic: bootleg, marshmallow
   structural ambiguity: [computer furniture] design vs. computer [furniture design]

Today’s Topic Statistical Machine Translation (SMT)

Beginnings c (just after WWII)
 –electronic computers invented for
   numerical analysis
   code breaking
Book (Collection of Papers)
 Readings in Machine Translation, Eds. Nirenburg, S. et al. MIT Press
 (Part 1: Historical Perspective)
 –Weaver, Reifler, Yngve, and Bar-Hillel …
Killer Apps: Language comprehension tasks and Machine Translation (MT)

Basis in Cryptanalysis?
Success with computational methods and code-breaking
[Translation. Weaver, W.]
Citing Shannon’s work, Weaver asks:
 “If we have useful methods for solving almost any cryptographic problem, may it not be that with proper interpretation we already have useful methods for translation?”

Statistical Basis
Popular in the early days and has undergone a modern revival
The Present Status of Automatic Translation of Languages (Bar-Hillel, 1951)
 –“I believe this overestimation is a remnant of the time, seven or eight years ago, when many people thought that the statistical theory of communication would solve many, if not all, of the problems of communication”
Bar-Hillel’s criticisms include
 –much valuable time spent on gathering statistics (no longer a bottleneck today)

Statistical Basis
Popular in the early days and has undergone a modern revival
Statistical Methods and Linguistics (Abney, 1996)
 –Chomsky vs. Shannon
   Statistics and low (zero) frequency items
    –Colorless green ideas sleep furiously vs. furiously sleep ideas green colorless (lecture 22)
    –Modern answer: smoothing
   No relation between order of approximation and grammaticality
    –n-th order approximation reflecting degree of grammaticality as n increases
   Parameter estimation problem is intractable (for humans)
    –statistical models involve learning or estimating a very large number of parameters
    –“we cannot seriously propose that a child learns the values of 10^9 parameters in a childhood lasting only 10^8 seconds”
    –see IBM translation reference later (17 million parameters)
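To make the "modern answer: smoothing" point concrete, here is a minimal sketch of add-one (Laplace) smoothing for bigrams. The slides do not say which smoothing method is meant, so add-one is an assumption chosen for simplicity; the toy corpus and the name `bigram_model` are likewise made up for illustration.

```python
from collections import Counter

def bigram_model(tokens):
    """Build unsmoothed (MLE) and add-one smoothed bigram estimators from a token list."""
    contexts = Counter(tokens[:-1])             # counts of words that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent word pairs
    vocab_size = len(set(tokens))

    def mle(prev, word):
        # Relative frequency: zero for any pair never seen in the corpus.
        return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

    def add_one(prev, word):
        # Add-one smoothing: every possible pair receives one extra pseudo-count.
        return (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)

    return mle, add_one

# Hypothetical toy corpus; "colorless green" never occurs in it.
tokens = "green ideas sleep furiously and colorless liquids sleep".split()
mle, add_one = bigram_model(tokens)

print(mle("colorless", "green"))      # 0.0
print(add_one("colorless", "green"))  # 0.125
print(add_one("green", "ideas"))      # 0.25
```

The unseen pair no longer gets probability zero, while a pair that was actually observed still scores higher. Real systems use more refined schemes, but the principle is the same.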

Early Misplaced Optimism (Bar-Hillel, 1951)
Reifler (University of Washington)
 –Unbelievably optimistic claims
 –Compounding:
 –“found moreover that only three matching procedures and four matching steps are necessary to deal effectively with any of these ten types of compounds of any language in which they occur”
 –(compounding: problems, see lecture 25)
 –[i.e. we have heuristics that we think work]
 –“it will not be very long before the remaining linguistic problems in machine translation will be solved for a number of important languages”

Early Misplaced Optimism
[Wiener]
 –“Basic English is the reverse of mechanical and throws upon such words as get a burden which is much greater than most words carry”
[Weaver]
 –Multiple meanings on get: yes
 –but a limited number of two-word combinations: get up, get over, get back
 –2000 words => 4 million two-word combinations
 –not formidable to a “modern” (1947) computer
get is very polysemous
 –WordNet (Miller, 1981) lists 36 senses

Re-emergence of the Statistical Basis
Conditions are different now
 –Computers 10^5 times faster
 –There has been a data revolution
   Gigabytes of storage really cheap
   Large, machine-readable corpora readily available for parameter estimation

Statistical MT
Avoid the explicit construction of linguistically sophisticated models of grammar
 –Not the only way: e.g. Example-based MT (EBMT)
Pioneered by IBM researchers (Brown et al., 1990)
 –Language Model: Pr(S), estimated by n-grams
 –Translation Model: Pr(T|S), estimated through alignment models
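The two models are combined by Bayes' rule: given an observed sentence T, the system looks for the sentence S that maximizes Pr(S) × Pr(T|S). The sketch below shows only that final selection step over a hand-written candidate list. All sentences and probabilities are invented for illustration, and `best_source` is a hypothetical name; a real system estimates both tables from corpora and searches a vastly larger space.

```python
def best_source(observed, candidates, language_model, translation_model):
    """Pick the candidate S that maximizes Pr(S) * Pr(T|S) for the observed T."""
    return max(
        candidates,
        key=lambda s: language_model[s] * translation_model[(observed, s)],
    )

# Hypothetical language model Pr(S): prefers fluent word order.
language_model = {"the dog": 0.6, "dog the": 0.01, "the cat": 0.39}

# Hypothetical translation model Pr(T|S): blind to word order, sensitive to word choice.
translation_model = {
    ("le chien", "the dog"): 0.4,
    ("le chien", "dog the"): 0.4,
    ("le chien", "the cat"): 0.01,
}

print(best_source("le chien", list(language_model), language_model, translation_model))
# the dog
```

Here the translation model cannot tell "the dog" from "dog the", and the language model cannot tell "the dog" from "the cat" very well; only the product of the two singles out the right answer.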

N-grams
 –we’ll talk about this more next time...
idea:
 –collect statistics on co-occurrence of adjacent words
Brown corpus (1 million words):
 –word w    frequency(w)    probability(w)
 –the       69,
 –rabbit
example:
 –Just then, the white
 –expectation is p(white rabbit) > p(white the)
 –but p(the) > p(rabbit)
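The rabbit example can be reproduced on a tiny scale. The sketch below counts single words and adjacent pairs in a made-up ten-word text (not the Brown corpus), so the numbers are illustrative only.

```python
from collections import Counter

# Hypothetical toy text standing in for a real corpus.
tokens = "just then the white rabbit ran past the white fence".split()

unigrams = Counter(tokens)                  # single-word counts
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent-pair counts

def p(word):
    """Unigram probability: relative frequency of the word."""
    return unigrams[word] / len(tokens)

def p_next(prev, word):
    """Conditional probability of word given the preceding word."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("the"), p("rabbit"))       # 0.2 0.1
print(p_next("white", "rabbit"))   # 0.5
print(p_next("white", "the"))      # 0.0
```

On its own "the" is more frequent than "rabbit", yet after "white" the adjacent-word statistics favor "rabbit", which is the intuition behind the slide's example.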

Statistical MT
Parameter estimation by crunching large-scale corpora
Hansard French/English parallel corpus
 –The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches.
(IBM’s experiment: 100 million words, est. 17 million parameters)

The State of the Art
Statistical MT System
[Spinoff from USC/ISI work]
 “Language Weaver’s SMTS system is a significant advancement in the state of the art for machine translation … and [we] are confident that Language Weaver has produced the most commercially viable Arabic translation system available today.”
Metrics: performance determined by competition
 –common test and training data
[Timeline figure: focus languages by era (1960s, 1970s, 1980s, present day): Russian, W. European languages, Japanese, Arabic]

Real Progress or Not?
(2003) MT Summit IX.
 –Proceedings available online
Interesting paper by J. Hutchins: Has machine translation improved? Some historical comparisons.
 “… overall there has been definite progress since the mid 1960s and very probably since the early 1970s. What is more uncertain is whether and where there have been improvements since the early 1980s.”
 –Compared modern day systems against systems from the 1960s, 1970s (e.g. SYSTRAN) and 1980s
 –Difficult: first systems are lost to us
Languages
 –Russian to English
 –French to English
 –German to English

Real Progress or Not?

Real Progress or Not?
[Hutchins, pp.7-8]
“The impediments to the improvement of translation quality are the same now that they have been from the beginning:
 –failures of disambiguation
 –incorrect selection of target language words
 –problems with anaphora
   pronouns (it vs. she/he)
   definite articles (e.g. when translating from Russian and French)
 –inappropriate retention of source language structures
   e.g. verb-initial constructions (from Russian)
   verb-final placements (from German)
   non-English pre-nominal participle constructions (e.g. with interest to be read materials from both Russian and German)
 –problems of coordination
 –numerous and varied difficulties with prepositions
 –in general always problems with any multi-clause sentence”
Roughly echoes what Bar-Hillel said about 50 years earlier

Statistical vs. Traditional
Which ones are commercially deployed?
 –internet translators: traditional
 –new languages: statistical