Ideas for 100K Word Data Set for Human and Machine Learning
  Lori Levin, Alon Lavie, Jaime Carbonell
  Language Technologies Institute, Carnegie Mellon University
The data set should support
  Machine learning
    Machine learning from small data can work if the data is structured.
  Analysis by humans
    Humans can learn a lot from a small data set if the form-function mappings are clear.
Concrete Suggestions
  1. Hand align a portion of the corpus.
  2. Include parse trees and feature structures for a portion of the corpus.
  3. Include a representative sample of diversity of phrase structures.
  4. Include a representative sample of diversity in function/meaning.
  5. Include some simple, single sentences.
  6. Include some full texts.
  7. Look for well-known divergences.
  8. Conduct an evaluation to be sure that the corpus elicits what you want it to elicit.
Hand align a portion of the corpus
  Automatic alignment algorithms can be bootstrapped from the hand alignments.
  A lexicon can be created from the alignments.
  Humans can study word usage.
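The idea that a lexicon can be created from the alignments can be sketched as follows. This is a minimal illustration, not the authors' method: the data format (token lists plus index pairs), the function names, and the target-side words are all hypothetical placeholders rather than data from a real language.

```python
from collections import Counter, defaultdict

# Hypothetical hand-aligned sentence pairs:
# (source tokens, target tokens, alignment links as (source index, target index)).
# The target words are invented placeholders, not a real language.
aligned_corpus = [
    (["the", "dog", "runs"], ["zu", "kelb", "rakad"], [(0, 0), (1, 1), (2, 2)]),
    (["the", "dog", "sleeps"], ["zu", "kelb", "nam"], [(0, 0), (1, 1), (2, 2)]),
    (["a", "dog", "runs"], ["kelb", "rakad"], [(1, 0), (2, 1)]),
]


def build_lexicon(corpus):
    """Count how often each source word is aligned to each target word."""
    counts = defaultdict(Counter)
    for source_tokens, target_tokens, links in corpus:
        for source_index, target_index in links:
            counts[source_tokens[source_index]][target_tokens[target_index]] += 1
    return dict(counts)


def best_translation(lexicon, word):
    """Return the most frequently aligned target word, or None if the word is unseen."""
    if word not in lexicon:
        return None
    return lexicon[word].most_common(1)[0][0]


lexicon = build_lexicon(aligned_corpus)
```

Because the counts are kept per word pair, the same table could also serve a human studying word usage, for example by listing every target word a given source word was aligned to.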
Provide parse trees for a portion of the corpus
  Parse trees plus alignments can be input to Avenue-style rule learning.
  Automatic treebanking of the minor language
  Humans can study the translation of specific structures.
  There should be semantic and functional information in addition to structural information. See below.
Include a representative example of structural diversity
  Part of the corpus can be structured to include simple, common sub-trees from the English Penn TreeBank.
  Learn a collection of structural mappings that is compositional
  A lot of mileage from small data
  Preliminary work with Katharina Probst
  Raw WSJ data requires editing
  Need redundant examples of each structure
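The need for redundant examples of each structure suggests a simple coverage check. The sketch below assumes, hypothetically, that each corpus item has been labeled with the sub-tree it is meant to elicit, written as a Penn TreeBank-style rule string; the labels, the threshold, and the function name are illustrative only.

```python
from collections import Counter

# Hypothetical labels: one entry per corpus item, naming the sub-tree it illustrates.
corpus_structures = [
    "NP -> DT NN",
    "NP -> DT NN",
    "NP -> DT JJ NN",
    "PP -> IN NP",
    "PP -> IN NP",
    "PP -> IN NP",
]


def under_represented(structures, min_examples=2):
    """Return the structures that occur fewer than min_examples times, sorted by name."""
    counts = Counter(structures)
    return sorted(name for name, count in counts.items() if count < min_examples)


needs_more_examples = under_represented(corpus_structures, min_examples=2)
```

Running the check while the corpus is being assembled would show which structures still lack enough examples, under whatever threshold the corpus designers choose.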
Include a representative example of function or meaning
  Finding out how English structures translate into minor language structures is not enough.
  For example, finding out how to translate English auxiliary verbs is not useful because they have many functions: tense, aspect, epistemics, evidentials, etc.
  Finding out how to express tense, aspect, epistemics, evidentials, etc. is useful.
Include some multi-sentence texts
  In order to observe
    Temporal sequencing of events
    Causation
    Rhetorical relations
      Contrast, elaboration, etc.
    Given and new information
    Co-reference
Look for well-known divergences
  E.g., "run across the street" vs "cross the street running"
  But see below for our view of divergences.
Include some simple sentences
  So that the form-function mapping is clear to a human without confounding factors
  As a seed for machine learning
Evaluation
  Test the corpus on a few languages in order to be sure that the intended structures and functions are elicited.
  Need to watch out for idiosyncrasies, lexical gaps, special constructions, etc.
  For example, if you want to elicit a noun modified by a prepositional phrase, "the person in the room" will work better than "a bottle of wine".
Hard problems
  Body of common phenomena with a tail of phenomena that are individually rare, but collectively massive.
Extra slides
  Our view of translation divergences
  Elaboration on the different roles of structure and function
Our view of divergences, which is divergent from some other views of divergences
  Divergences arise when the same function is expressed by a different structure.
  Many functions are expressed by specialized constructions that do not translate literally into other languages.
  Divergences cannot be neatly grouped into a few classes.
  Typological differences between languages are relevant:
    Embedding vs serialization
    Synthetic vs analytic causative constructions
Coverage: Structure and Function
  Structural Diversity
    Appositives, adjuncts, embedded clauses, coordinate structures, ellipsis, etc.
  Functional/Meaning Diversity
    Temporal relations, rhetorical relations, modality, negation, tense, aspect, etc.
Structure and Function
  The way you understand a text is by knowing which structure has which function.
  The same function is expressed by different structures in different languages.
What a human needs to know (function)
  Who did what to who when?
  What happened before/after what?
  What caused what?
  Is it first hand knowledge, hearsay, or inference?
  Is it certain, probable, or improbable?
  Did it happen or not?
  What do these words mean?
How a human knows these things (structure/grammar)
  Who did what to who when?
    Grammatical relations, coreference, time expressions, pronouns/pro-drop, nominalizations, subordinate clauses, case marking, word order, agreement, tense, aspect
  What happened before/after what?
    Time expressions, temporal connectives, tense and aspect morphemes
  What caused what?
    Markers of rhetorical relations between sentences
  Is it first hand knowledge, hearsay, or inference? Is it certain, probable, or improbable?
    Markers of modality and epistemics
  Did it happen or not?
    Markers of negation and counterfactuals
  What do these words mean?
    Vocabulary
  Other
    Questions, existentials, possessives, coordinate structures
How to make sure the corpus captures what a human needs to know
  Organize the corpus by function and then a human can observe the corresponding structure.
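Organizing the corpus by function can be pictured as an index from function labels to sentence pairs. The sketch below is only one possible layout, under the assumption that each entry carries a single function label; the field names, labels, and target-side strings are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical corpus entries; "target" values are placeholders for real translations.
entries = [
    {"function": "negation", "english": "The dog did not run.", "target": "TARGET-1"},
    {"function": "negation", "english": "Nobody came.", "target": "TARGET-2"},
    {"function": "past tense", "english": "The dog ran.", "target": "TARGET-3"},
    {"function": "hearsay", "english": "They say the dog ran.", "target": "TARGET-4"},
]


def group_by_function(corpus_entries):
    """Index the corpus so that all examples of one function can be read side by side."""
    grouped = defaultdict(list)
    for entry in corpus_entries:
        grouped[entry["function"]].append((entry["english"], entry["target"]))
    return dict(grouped)


by_function = group_by_function(entries)
```

With this layout, a human analyst could read every example of one function, such as negation, together and look for the structure the target language uses to express it.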
Coverage of data for human analysis: basics
  Closed Class and Special Constructions
    Dates, names, numbers, prices, etc.
    Pronouns, prepositions, etc.
  Encoding of grammatical relations and/or semantic roles. How do you know who did what to who?
    Word order, case marking, agreement
  Encoding of old and new information
    Word order, special constructions (e.g., clefts), etc.
  Questions
  Negation
  Modification
  Possession
  Coordination
  Indirect speech
Coverage of data for human analysis: multi-sentence and multi-clause
  Rhetorical relations
    Cause, elaboration, contrast, etc.
  Temporal relations
    Before, after, during, etc.
  Same subject and obviation phenomena
  Subordination
    As subject or object
    As complement
    As adjunct
Other grammatically encoded meanings
  Modality and Epistemics
    Certainty, source of information (first hand, second hand, inference), etc.
  Conditionals
  Comparatives
  Existentials
  Tense and aspect
  Definiteness