Semi-Automated Elicitation Corpus Generation The elicitation tool provides a simple interface for bilingual informants with no linguistic training and.

Slides:



Advertisements
Similar presentations
Configuration management
Advertisements

The Application of Machine Translation in CADAL Huang Chen, Chen Haiying Zhejiang University Libraries, Hangzhou, China
The Chinese Room: Understanding and Correcting Machine Translation This work has been supported by NSF Grants IIS Solution: The Chinese Room Conclusions.
Chapter 3: Modules, Hierarchy Charts, and Documentation
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Computers: Tools for an Information Age
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Low-Density Languages Katharina Probst April 5, 2002.
MT Summit VIII, Language Technologies Institute School of Computer Science Carnegie Mellon University Pre-processing of Bilingual Corpora for Mandarin-English.
Models of Computation as Program Transformations Chris Chang
Developing a Basic Web Page with HTML
Chapter Section A: Verb Basics Section B: Pronoun Basics Section C: Parallel Structure Section D: Using Modifiers Effectively The Writer’s Handbook: Grammar.
The LC-STAR project (IST ) Objectives: Track I (duration 2 years) Specification and creation of large word lists and lexica suited for flexible.
1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
(C) 2013 Logrus International Practical Visualization of ITS 2.0 Categories for Real World Localization Process Part of the Multilingual Web-LT Program.
Latin Grammar: Singular and Plural Magister Henderson Latin I.
Assessing Reading Meeting Year 5 Expectations
Kalyani Patel K.S.School of Business Management,Gujarat University.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
Systems Analysis – Analyzing Requirements.  Analyzing requirement stage identifies user information needs and new systems requirements  IS dev team.
© Janice Regan, CMPT 128, Jan CMPT 128 Introduction to Computing Science for Engineering Students Creating a program.
 (Worse) The number of banks charging their customers ATM user fees are increasing.  (Better) The number of banks charging their customers ATM user.
Organizing Information Digitally Norm Friesen. Overview General properties of digital information Relational: tabular & linked Object-Oriented: inheritance.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
UAM CorpusTool: An Overview Debopam Das Discourse Research Group Department of Linguistics Simon Fraser University Feb 5, 2014.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Data Elicitation for AVENUE Lori Levin Alison Alvarez Jeff Good (MPI Leipzig) Bob Frederking Erik Peterson Language Technologies Institute Carnegie Mellon.
Computational Investigation of Palestinian Arabic Dialects
Profile The METIS Approach Future Work Evaluation METIS II Architecture METIS II, the continuation of the successful assessment project METIS I, is an.
Eliciting Features from Minor Languages The elicitation tool provides a simple interface for bilingual informants with no linguistic training and limited.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
ISURF -An Interoperability Service Utility for Collaborative Supply Chain Planning across Multiple Domains Prof. Dr. Asuman Dogac METU-SRDC Turkey METU.
ACE TESOL Diploma Program – London Language Institute OBJECTIVES You will understand: 1. Semantic feature analysis 2. Semantic field analysis You will.
Object-Oriented Modeling Chapter 10 CSCI CSCI 1302 – Object-Oriented Modeling2 Outline The Software Development Process Discovering Relationships.
Finalizing Design Specifications
Copyright © 1994 Carnegie Mellon University Disciplined Software Engineering - Lecture 3 1 Software Size Estimation I Material adapted from: Disciplined.
An ICALL writing support system tunable to varying levels of learner initiative Karin Harbusch 1 & Gerard Kempen 2,3 1 University of Koblenz-Landau, Koblenz,
Transfer-based MT with Strong Decoding for a Miserly Data Scenario Alon Lavie Language Technologies Institute Carnegie Mellon University Joint work with:
UML Class Diagram Trisha Cummings. What we will be covering What is a Class Diagram? Essential Elements of a UML Class Diagram UML Packages Logical Distribution.
THE FUNCTIONS AND PURPOSES OF TRANSLATORS TOPIC 2 CONTENT: 2.1. Types of translators 2.2. Lexical analysis 2.3. Syntax analysis 2.4. Code generation 2.5.
Ideas for 100K Word Data Set for Human and Machine Learning Lori Levin Alon Lavie Jaime Carbonell Language Technologies Institute Carnegie Mellon University.
Carnegie Mellon Goal Recycle non-expert post-editing efforts to: - Refine translation rules automatically - Improve overall translation quality Proposed.
What you have learned and how you can use it : Grammars and Lexicons Parts I-III.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
Chapter 1 Introduction. Chapter 1 - Introduction 2 The Goal of Chapter 1 Introduce different forms of language translators Give a high level overview.
CS 460/660 Compiler Construction. Class 01 2 Why Study Compilers? Compilers are important – –Responsible for many aspects of system performance Compilers.
Communicative and Academic English for the EFL Professional.
Data Elicitation for AVENUE By: Alison Alvarez Lori Levin Bob Frederking Jeff Good (MPI Leipzig) Erik Peterson.
Error Analysis of Two Types of Grammar for the purpose of Automatic Rule Refinement Ariadna Font Llitjós, Katharina Probst, Jaime Carbonell Language Technologies.
Bridging the Gap: Machine Translation for Lesser Resourced Languages
Avenue Architecture Learning Module Learned Transfer Rules Lexical Resources Run Time Transfer System Decoder Translation Correction Tool Word- Aligned.
October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.
Eliciting a corpus of word- aligned phrases for MT Lori Levin, Alon Lavie, Erik Peterson Language Technologies Institute Carnegie Mellon University.
Grammar Chapter 10. What is Grammar? Basic Points description of patterns speakers use to construct sentences stronger patterns - most nouns form plurals.
A Simple English-to-Punjabi Translation System By : Shailendra Singh.
Semi-Automatic Learning of Transfer Rules for Machine Translation of Minority Languages Katharina Probst Language Technologies Institute Carnegie Mellon.
Learning to Generate Complex Morphology for Machine Translation Einat Minkov †, Kristina Toutanova* and Hisami Suzuki* *Microsoft Research † Carnegie Mellon.
LingWear Language Technology for the Information Warrior Alex Waibel, Lori Levin Alon Lavie, Robert Frederking Carnegie Mellon University.
Eliciting a corpus of word-aligned phrases for MT
Business Process and Functional Modeling
Assessing Grammar Module 5 Activity 5.
Lecture – VIII Monojit Choudhury RS, CSE, IIT Kharagpur
Unified Modeling Language
Introduction to Unified Modeling Language (UML)
Part of the Multilingual Web-LT Program
An ICALL writing support system tunable to varying levels
Applied Linguistics Chapter Four: Corpus Linguistics
TECHNICAL REPORTS WRITING
Presentation transcript:

Semi-Automated Elicitation Corpus Generation The elicitation tool provides a simple interface for bilingual informants with no linguistic training and limited computer skills to translate and word-align a corpus in some source language. The output of the elicitation tool is a text file containing triplets of eliciting sentence, elicited sentence, and alignment. The elicitation tool can produce bilingual glossaries based on the aligned corpus. It also has a simple "auto-align" option to add alignments for unambiguous word pairs in the same file. The Elicitation Tool Our Goals Alison Alvarez Lori Levin Robert Frederking Erik Peterson Language Technologies Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, PA Jeff Good (MPI Leipzig) Max Planck Institute for Evolutionary AnthropologyDeutscher Platz Leipzig Prewritten Feature Specification Linguist + GUI + control language Used by Sends corpus definition Feature Structure Production Module Produces Set of surface text sentences and context comments GenKit + Prewritten Grammar and Lexicon GL GenKit Generates Used by Elicitation corpus Translated by Bilingual informant using elicitation tool List of feature structures The start-to-finish process of the corpus generation system. Ovals indicate software components. The page- boxes indicate human or computer generated documents The Generation Process Feature Specification Generation & Output Functional-Typological Corpora ((subj ((np-my-general-type pronoun-type common-noun-type) (np-my-person person-first person-second person-third) (np-my-number num-sg num-pl) (np-my-biological-gender bio-gender-male bio-gender-female) (np-my-function fn-predicatee))) {[(predicate ((np-my-general-type common-noun-type) (np-my-definiteness definiteness-minus) (np-my-person person-third) (np-my-function predicate))) (c-my-copula-type role)] [(predicate ((adj-my-general-type quality-type))) (c-my-copula-type attributive)] [(predicate ((np-my-general-type common-noun-type) (np-my-person person-third) (np-my-definiteness definiteness-plus) (np-my-function predicate))) (c-my-copula-type identity)]} (c-my-secondary-type secondary-copula) (c-my-polarity #all) (c-my-function fn-main-clause)(c-my-general-type declarative) (c-my-speech-act sp-act-state) (c-v-my-grammatical-aspect gram-aspect-neutral) (c-v-my-lexical-aspect state) (c-v-my-absolute-tense past present future) (c-v-my-phase-aspect durative)) “Use all values of polarity” “Multiply out by these lists of values” Disjoint set of copula types and their predicates Control Language A control language is used to define the size and scope of the set of feature structures that will be used by GenKit to generate the corpus The purpose of the feature specification is to define the list of features and corresponding values that are available for producing feature structures. Additionally, the feature specification determines what kind of phrases can use what kinds of features. For example, the polarity feature carries the value of positive and negative but can only be applied at the clause level. We have written the feature specification with XML markup language. The specification itself is realized as a hierarchical structure of values contained within features. Each level also contains markup listing exclusions and further source notes To the left is the specification for the NP feature number and its three values: singular, plural and dual. np-my-number num-sg num-pl num-dual Notes for analysis of data: CS, page 38, seem to imply that some combinations of numbers are more expected than others Overview This research is part of the AVENUE Machine Translation Project. AVENUE is supported by the US National Science Foundation, NSF grant number IIS In the field of Machine Translation fully aligned and tagged translation corpora are considered to be one of the most valuable resources for automatically training translation systems. However, among minority languages such resources are hard to find. It is possible to overcome this obstacle by using techniques inspired by field linguistics. That is, by drawing on bilingual informants to translate and align given sentences. Field linguists have relied on questionnaires that have remained relatively static over a number of years. We want the flexibility to change the questionnaire to reflect different semantic domains, different goals for machine translation systems, different levels of detail, etc. We also want the questionnaire to be available in multiple languages. For example, we would want a version of the questionnaire in Spanish for use by Latin American minority language speakers. We also want flexibility in lexical selection in order to avoid cultural bias and to choose appropriate lexical items for the major language. This paper will look at methods for specifying the scope and depth of an elicitation corpus as well as methods for quick design and implementation of elicitation corpora. The resulting can also be used as a test suite to explore existing machine translation systems or design far-reaching corpora for studying low resource languages. ((subj ((np-my-general-type pronoun-type) (np-my-person person-third) (np-my-number num-sg) (np-my-biological-gender bio-gender-male) (np-my-function fn-predicatee)(np-my-animacy anim-human) (np-my-info-function info-neutral)(np-d-my-distance-from-speaker distance-neutral) (np-pronoun-reflexivity reflexivity-n/a)(np-my-emphasis emph-no-emph) (np-my-semantic-class NEED_VALUES)(np-pronoun-exclusivity exclusivity-n/a) (np-pronoun-antecedent-function antecedent-n/a))) (predicate ((np-my-general-type common-noun-type) (np-my-person person-third) (np-my-function predicate)(np-my-animacy anim-human) (np-my-info-function info-neutral) (np-d-my-distance-from-speaker distance-neutral) (np-pronoun-reflexivity reflexivity-n/a)(np-my-emphasis emph-no-emph) (np-my-number num-sg)(np-my-semantic-class NEED_VALUES) (np-pronoun-exclusivity exclusivity-n/a) (np-pronoun-antecedent-function antecedent-! n/a))) (c-my-copula-type role) (c-my-secondary-type secondary-copula) (c-my-polarity polarity- positive) (c-my-function fn-main-clause) (c-my-general-type declarative)(c-my-speech-act sp- act-state) (c-v-my-grammatical-aspect gram-aspect-neutral) (c-v-my-lexical-aspect state)(c- v-my-absolute-tense past)(c-v-my-phase-aspect durative)(c-my-imperative-degree imp- degree-n/a)(c-my-ynq-type ynq-n/a)(c-my-actor's-sem-role actor-sem-role-neutral)(c-my- minor-type minor-n/a)(c-my-headedness-rc rc-head-n/a)(c-my-answer-type ans-n/a)(c-my- restrictivess-rc rc-restrictive-n/a)(c-my-focus-rc focus-n/a)(c-my-actor's-status actor- neutral)(c-my-gaps-function gap-n/a)(c-my-relative-tense relative-n/a)) Feature Structures They are multi-level sets of feature-value pairs that are used to reflect the grammatical structures intended for elicitation. When paired with an English grammar and lexicon the above feature structure will generate ‘He was a teacher.’ srcsent: I was the teacher. context: I = one_man srcsent: I was interesting. context: I = one_man srcsent: I was a teacher. context: I = one_man srcsent: I am a teacher. context: I = one_man srcsent: I was a teacher. context: I = one_woman A typological-functional corpus is designed to elicit a range of language features (for example, tense, person, number) and explore the way those features are manifested in a target language. AVENUE is a system for learning translation rules from a word aligned bilingual corpus. One phase of rule learning is feature detection, which uses the elicitation corpus to discover morpho-syntactic properties of a minority language. Thus, we expect to generate sentences with high degrees of uniformity that can easily be compared in order to discover typological properties such as whether the verb agrees with the subject, whether nouns have singulars and plurals, etc. In order to discover these properties, we compare sentences like "The child read a book" and "The children read a book" in order to see if the translation of "child" or "read" changes when "child" is understood as plural. An elicitation corpus is the untranslated major language corpus that will be presented to the language informant using the elicitation tool. Although our corpus creation tools allow the creation of any kind of corpus, we have focused on linguistic functions such as cardinality and identifiability rather than on linguistic forms such as suffixes and determiners. Our focus on function is a consequence of the AVENUE rule learning scenario, in which it is possible that nothing is known about the form of the minor language. Our goal is to vary the functions and observe changes in the forms. Generation is performed using GenKit generation software (Tomita et al. 1988). It takes the feature structures along with a corresponding grammar and lexicon and generates a surface string along with a comment. Generated comments are used to show pieces of meaning that might not be evident in the major/source language but may be found in the target/minority language. For example, the first person singular pronoun in English does not carry gender, so a comment will be generated indicating that “I = gender-female” or “I = gender-male”. When using the elicitation tool this information is presented to the bilingual informant using the context field. 1. Tools for semi-automated corpus design: Test suite for MT Structured corpus for input to machine learning 2. A user interface for producing high quality, word-aligned parallel corpora (Elicitation Tool) 3. Automated learning of morpho-syntax for low-resource languages