Ideas for 100K Word Data Set for Human and Machine Learning
Lori Levin, Alon Lavie, Jaime Carbonell
Language Technologies Institute, Carnegie Mellon University


The data set should support
Machine learning
  - Machine learning from small data can work if the data is structured.
Analysis by humans
  - Humans can learn a lot from a small data set if the form-function mappings are clear.

Concrete Suggestions
1. Hand align a portion of the corpus.
2. Include parse trees and feature structures for a portion of the corpus.
3. Include a representative sample of diversity of phrase structures.
4. Include a representative sample of diversity in function/meaning.
5. Include some simple, single sentences.
6. Include some full texts.
7. Look for well-known divergences.
8. Conduct an evaluation to be sure that the corpus elicits what you want it to elicit.

Hand align a portion of the corpus
Automatic alignment algorithms can be bootstrapped from the hand alignments.
A lexicon can be created from the alignments.
Humans can study word usage.
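The point that a lexicon can be created from the alignments can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `build_lexicon`, the `(source_tokens, target_tokens, links)` record layout, and the toy English-Spanish pairs are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_lexicon(aligned_pairs):
    """Count how often each source word is hand-aligned to each target word.

    aligned_pairs: iterable of (source_tokens, target_tokens, links), where
    links is a list of (source_index, target_index) pairs. This layout is a
    hypothetical one chosen for the sketch.
    """
    lexicon = defaultdict(Counter)
    for source_tokens, target_tokens, links in aligned_pairs:
        for i, j in links:
            lexicon[source_tokens[i].lower()][target_tokens[j].lower()] += 1
    return lexicon


# Toy hand-aligned pairs (illustrative only, not taken from any real corpus).
pairs = [
    (["the", "house"], ["la", "casa"], [(0, 0), (1, 1)]),
    (["the", "red", "house"], ["la", "casa", "roja"], [(0, 0), (1, 2), (2, 1)]),
]

lexicon = build_lexicon(pairs)
for word in sorted(lexicon):
    print(word, dict(lexicon[word]))
```

The same counts could serve as a seed for an automatic aligner, and a human can read the entries directly to study word usage.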

Provide parse trees for a portion of the corpus
Parse trees plus alignments can be input to
  - Avenue-style rule learning
  - Automatic treebanking of the minor language
Humans can study the translation of specific structures.
There should be semantic and functional information in addition to structural information. See below.

Include a representative example of structural diversity
Part of the corpus can be structured to include simple, common sub-trees from the English Penn TreeBank.
Learn a collection of structural mappings that is compositional
  - A lot of mileage from small data
Preliminary work with Katharina Probst
  - Raw WSJ data requires editing
  - Need redundant examples of each structure
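The need for redundant examples of each structure suggests a simple bookkeeping check while the corpus is being assembled. The sketch below assumes each corpus sentence has been tagged with a label for the structure it targets. The label strings, the function name `underrepresented`, and the threshold of three examples are hypothetical choices, not values from the slides.

```python
from collections import Counter


def underrepresented(structure_labels, min_examples=3):
    """Return the structure labels that occur fewer than min_examples times."""
    counts = Counter(structure_labels)
    return sorted(label for label, n in counts.items() if n < min_examples)


# One label per corpus sentence (toy data for illustration).
labels = [
    "NP -> DT NN",
    "NP -> DT NN",
    "NP -> DT NN",
    "NP -> DT JJ NN",
    "PP -> IN NP",
    "PP -> IN NP",
]

gaps = underrepresented(labels)
print(gaps)
```

Any structure the check reports would need more sentences before a learner, or a human analyst, could rely on it.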

Include a representative example of function or meaning
Finding out how English structures translate into minor language structures is not enough
  - For example, finding out how to translate English auxiliary verbs is not useful because they have many functions: tense, aspect, epistemics, evidentials, etc.
  - Finding out how to express tense, aspect, epistemics, evidentials, etc. is useful.

Include some multi-sentence texts
In order to observe
  - Temporal sequencing of events
  - Causation
  - Rhetorical relations: contrast, elaboration, etc.
  - Given and new information
  - Co-reference

Look for well-known divergences
E.g., "run across the street" vs. "cross the street running"
But see below for our view of divergences.

Include some simple sentences
So that the form-function mapping is clear to a human without confounding factors
As a seed for machine learning

Evaluation
Test the corpus on a few languages to be sure that the intended structures and functions are elicited.
  - Need to watch out for idiosyncrasies, lexical gaps, special constructions, etc.
  - For example, if you want to elicit a noun modified by a preposition, "the person in the room" will work better than "a bottle of wine".

Hard problems
A body of common phenomena with a tail of phenomena that are individually rare but collectively massive.

Extra slides
Our view of translation divergences
Elaboration on the different roles of structure and function

Our view of divergences, which is divergent from some other views of divergences
Divergences arise when the same function is expressed by a different structure.
Many functions are expressed by specialized constructions that do not translate literally into other languages.
Divergences cannot be neatly grouped into a few classes.
Typological differences between languages are relevant:
  - Embedding vs. serialization
  - Synthetic vs. analytic causative constructions

Coverage: Structure and Function
Structural Diversity
  - Appositives, adjuncts, embedded clauses, coordinate structures, ellipsis, etc.
Functional/Meaning Diversity
  - Temporal relations, rhetorical relations, modality, negation, tense, aspect, etc.

Structure and Function
The way you understand a text is by knowing which structure has which function.
The same function is expressed by different structures in different languages.

What a human needs to know (function)
Who did what to who when?
What happened before/after what?
What caused what?
Is it first hand knowledge, hearsay, or inference?
Is it certain, probable, or improbable?
Did it happen or not?
What do these words mean?

How a human knows these things (structure/grammar)
Who did what to who when?
  - Grammatical relations, coreference, time expressions, pronouns/pro-drop, nominalizations, subordinate clauses, case marking, word order, agreement, tense, aspect
What happened before/after what?
  - Time expressions, temporal connectives, tense and aspect morphemes
What caused what?
  - Markers of rhetorical relations between sentences
Is it first hand knowledge, hearsay, or inference? Is it certain, probable, or improbable?
  - Markers of modality and epistemics
Did it happen or not?
  - Markers of negation and counterfactuals
What do these words mean?
  - Vocabulary
Other
  - Questions, existentials, possessives, coordinate structures

How to make sure the corpus captures what a human needs to know
Organize the corpus by function and then a human can observe the corresponding structure.
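One way to picture organizing the corpus by function is an index from function labels to the sentence pairs that express them, so that an analyst can pull up every example of a function and inspect the structures used. The sketch below is an assumption-laden illustration: the `Entry` record, the label names, and the two English-Spanish pairs are invented for the example and are not part of the proposed data set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    source: str           # sentence in the elicitation language
    target: str           # its translation in the language being studied
    functions: frozenset  # function/meaning labels (hypothetical tag set)


def index_by_function(entries):
    """Group corpus entries under every function label they carry."""
    index = defaultdict(list)
    for entry in entries:
        for function in entry.functions:
            index[function].append(entry)
    return index


# Toy entries for illustration only.
entries = [
    Entry("She did not arrive.", "Ella no llegó.",
          frozenset({"negation", "past"})),
    Entry("Perhaps she arrived.", "Quizás llegó.",
          frozenset({"epistemic:possible", "past"})),
]

index = index_by_function(entries)
for function in sorted(index):
    print(function, [e.target for e in index[function]])
```

Looking up a single label such as `negation` then shows side by side how the target language encodes that function.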

Coverage of data for human analysis: basics
Closed Class and Special Constructions
  - Dates, names, numbers, prices, etc.
  - Pronouns, prepositions, etc.
Encoding of grammatical relations and/or semantic roles
  - How do you know who did what to who?
  - Word order, case marking, agreement
Encoding of old and new information
  - Word order, special constructions (e.g., clefts), etc.
Questions
Negation
Modification
Possession
Coordination
Indirect speech

Coverage of data for human analysis: multi-sentence and multi-clause
Rhetorical relations
  - Cause, elaboration, contrast, etc.
Temporal relations
  - Before, after, during, etc.
Same subject and obviation phenomena
Subordination
  - As subject or object
  - As complement
  - As adjunct

Other grammatically encoded meanings
Modality and Epistemics
  - Certainty, source of information (first hand, second hand, inference), etc.
Conditionals
Comparatives
Existentials
Tense and aspect
Definiteness