Development of a German- English Translator Felix Zhang TJHSST Computer Systems Research Lab 2007-2008 Period 5.

Slides:



Advertisements
Similar presentations
Three Basic Problems Compute the probability of a text: P m (W 1,N ) Compute maximum probability tag sequence: arg max T 1,N P m (T 1,N | W 1,N ) Compute.
Advertisements

School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Der Genitiv-Genitive Case
Development of a German- English Translator Felix Zhang.
CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Part-Of-Speech Tagging and Chunking using CRF & TBL
Part of Speech Tagging Importance Resolving ambiguities by assigning lower probabilities to words that don’t fit Applying to language grammatical rules.
Chunk Parsing CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)
LING 388 Language and Computers Lecture 22 11/25/03 Sandiway FONG.
©2012 Paula Matuszek CSC 9010: Text Mining Applications: Text Features Dr. Paula Matuszek (610)
Tagging with Hidden Markov Models. Viterbi Algorithm. Forward-backward algorithm Reading: Chap 6, Jurafsky & Martin Instructor: Paul Tarau, based on Rada.
Part II. Statistical NLP Advanced Artificial Intelligence Part of Speech Tagging Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme Most.
1 A Sentence Boundary Detection System Student: Wendy Chen Faculty Advisor: Douglas Campbell.
Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 20, 2004.
Part-of-speech Tagging cs224n Final project Spring, 2008 Tim Lai.
POS based on Jurafsky and Martin Ch. 8 Miriam Butt October 2003.
Properties of Text CS336 Lecture 3:. 2 Information Retrieval Searching unstructured documents Typically text –Newspaper articles –Web pages Other documents.
Part of speech (POS) tagging
1 Complementarity of Lexical and Simple Syntactic Features: The SyntaLex Approach to S ENSEVAL -3 Saif Mohammad Ted Pedersen University of Toronto, Toronto.
Dept. of Computer Science & Engg. Indian Institute of Technology Kharagpur Part-of-Speech Tagging and Chunking with Maximum Entropy Model Sandipan Dandapat.
Parsing the NEGRA corpus Greg Donaker June 14, 2006.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
SI485i : NLP Set 9 Advanced PCFGs Some slides from Chris Manning.
Evidence from Content INST 734 Module 2 Doug Oard.
Methods in Computational Linguistics II Queens College Lecture 5: List Comprehensions.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
Part II. Statistical NLP Advanced Artificial Intelligence Applications of HMMs and PCFGs in NLP Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme.
Lecture 12: 22/6/1435 Natural language processing Lecturer/ Kawther Abas 363CS – Artificial Intelligence.
6. N-GRAMs 부산대학교 인공지능연구실 최성자. 2 Word prediction “I’d like to make a collect …” Call, telephone, or person-to-person -Spelling error detection -Augmentative.
Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.
Natural Language Processing Lecture 6 : Revision.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
UCREL: from LOB to REVERE Paul Rayson. November 1999CSEG awayday Paul Rayson2 A brief history of UCREL In ten minutes, I will present a brief history.
Development of a German- English Translator Felix Zhang Period Thomas Jefferson High School for Science and Technology Computer Systems Research.
Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop Nizar Habash and Owen Rambow Center for Computational Learning.
Tagset Reductions in Morphosyntactic Tagging of Croatian Texts Željko Agić, Marko Tadić and Zdravko Dovedan University of Zagreb {zagic, mtadic,
Rule Learning - Overview Goal: Syntactic Transfer Rules 1) Flat Seed Generation: produce rules from word- aligned sentence pairs, abstracted only to POS.
1 CSI 5180: Topics in AI: Natural Language Processing, A Statistical Approach Instructor: Nathalie Japkowicz Objectives of.
Playing GWAP with strategies - using ESP as an example Wen-Yuan Zhu CSIE, NTNU.
13-1 Chapter 13 Part-of-Speech Tagging POS Tagging + HMMs Part of Speech Tagging –What and Why? What Information is Available? Visible Markov Models.
Linguistics The eleventh week. Chapter 4 Syntax  4.1 Introduction  4.2 Word Classes.
Syllabus Text Books Classes Reading Material Assignments Grades Links Forum Text Books עיבוד שפות טבעיות - שיעור שבע Partial Parsing אורן גליקמן.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
Tokenization & POS-Tagging
CSA3202 Human Language Technology HMMs for POS Tagging.
DGP 18 i wish that i were old enough to vote because im very concerned about some of the issues in this presidential election.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Supertagging CMSC Natural Language Processing January 31, 2006.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
Chunk Parsing. Also called chunking, light parsing, or partial parsing. Method: Assign some additional structure to input over tagging Used when full.
Pastra and Saggion, EACL 2003 Colouring Summaries BLEU Katerina Pastra and Horacio Saggion Department of Computer Science, Natural Language Processing.
Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.
October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
Word classes and part of speech tagging Chapter 5.
CMU MilliRADD Small-MT Report TIDES PI Meeting 2002 The CMU MilliRADD Team: Jaime Carbonell, Lori Levin, Ralf Brown, Stephan Vogel, Alon Lavie, Kathrin.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
King Faisal University جامعة الملك فيصل Deanship of E-Learning and Distance Education عمادة التعلم الإلكتروني والتعليم عن بعد [ ] 1 جامعة الملك فيصل عمادة.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Minority Languages Katharina Probst Language Technologies Institute Carnegie Mellon.
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
Measuring Monolinguality
Automatic Digitizing.
CSCI 5832 Natural Language Processing
Tagging and Statistically Translating Latin Sentences
Development of a German-English Translator
Presentation transcript:

Development of a German- English Translator Felix Zhang TJHSST Computer Systems Research Lab Period 5

Summary of previous quarters Developed a functional translator for simple German sentences to simple English sentences –All rule-based, no statistical methods, yet –Very specifically geared towards German – language dependent

Scope for 4 th quarter Statistical methods –Part-of-speech tagging –Morphological analysis –Information available in TIGER corpus Accuracy testing –Reliability of statistical methods –Vs. Rule-based

Finishing Rule-Based Translation Convert translation into user-readable format Punctuation and capitalization ~/research $ python proj.py Part of speech tags: [['den', 'art'], ['kurzen', 'adj'], ['Mann', 'nou'], ['machen', 'ver'], ['die', 'art'], ['kleinen', 'adj'], ['Kinder', 'nou']] Morphological analysis: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['1', 'pl'], ['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl'], ['akk', 'pl']]]] Disambiguated after noun-verb agreement: [[['kurzen', 'adj'], [['akk', 'mas'], ['dat', 'pl']]], [['Mann', 'nou'], [['akk', 'mas'], ['dat', 'pl']]], [['machen', 'ver'], [['3', 'pl'], 'pres']], [['kleinen', 'adj'], [['nom', 'pl'], ['akk', 'pl']]], [['Kinder', 'nou'], [['nom', 'pl']]]] Lemmatized: [['kurzen', ['kurz']], ['Mann', ['Mann', 'Man']], ['machen', ['machen']], ['kleinen', ['klein']], ['Kinder', ['Kind']]] Root translated: [['den', 'the'], ['kurzen', 'short'], ['Mann', 'man'], ['machen', 'make'], ['die', 'the'], ['kleinen', 'small'], ['Kinder', 'child']] NP Chunked English: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]]], ['make', 'ver', [['3', 'pl'], 'pres']], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]]]] Assigned an element type: [[['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], [['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Assigned priority: [['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub']] Rearranged to English structure: [['1', ['the', 'art'], ['small', 'adj'], ['child', 'nou', [['nom', 'pl']]], 'sub'], ['2', 'make', 'ver', [['3', 'pl'], 'pres'], 'mverb'], ['5', ['the', 'art'], ['short', 'adj'], ['man', 'nou', [['akk', 'mas'], ['dat', 'pl']]], 'dobj']] Inflected: The small childs make the short man.

Statistical Methods Trivial methods –Part of speech tagging, morphological analysis –Tags based on most frequently occurring tag with the word in corpus –Theoretically, should still achieve reasonable levels of accuracy

Testing Check all 746,660 words in TIGER corpus –Problem: Running time Match: Actual part of speech / linguistic properties of the word in the corpus == part of speech, etc. that program assigns based on maximum likelihood Predictions from research: 90% accuracy for part of speech tags

Accuracy – Part of speech Reaches about 90% percent accuracy, as predicted Match NE NE Nato Match ADV ADV allein Match VAFIN VAFIN sein Match PIAT PIAT kein Match NN NN Zukunftskonzept Match APPR APPR für Match ART ART der Match NN NN Sicherheit Match APPR APPR in Match NE NE Europa Total matches: Total words: Accuracy: %

Accuracy – Morphological Analysis Lower accuracy – Properties are more context- dependent Match allein Match 3.Sg.Pres.Ind 3.Sg.Pres.Ind ist Match Nom.Sg.Neut Nom.Sg.Neut Zukunftskonzept Match für Match Acc.Sg.Fem Acc.Sg.Fem die Match Acc.Sg.Fem Acc.Sg.Fem Sicherheit Match in Match Dat.Sg.Neut Dat.Sg.Neut Europa Total matches: Total words: Accuracy: %

Problems No good way to compare effectiveness of rule-based vs. statistical translations Accuracy not high enough –10% error too large in a 700,000+ word corpus –27% even worse Slow run times

Future Research Combine statistical and rule-based methods together in hybrid program Find way to compare accuracy of two techniques More advanced (and more accurate) statistical methods – Context-based