Universal Morphological Analysis using Structured Nearest Neighbor Prediction
Young-Bum Kim, João V. Graça, and Benjamin Snyder
University of Wisconsin-Madison
28 July 2011
Unsupervised NLP
- Unsupervised learning in NLP has become popular: 27 papers at this year's ACL and EMNLP.
- Relies on inductive bias, encoded in the model structure or the learning algorithm.
- Example: an HMM for POS induction encodes transitional regularity.
[Diagram: HMM with hidden tags over the sentence "I like to read"]
Inductive Biases
- Typically formulated with weak empirical grounding (or left implicit).
- A single, simple bias for all languages leads to low performance, complicated models, fragility, and language dependence.
- Our approach: learn a complex, universal bias from labeled languages, i.e. empirically learn what the space of plausible human languages looks like, and use it to guide unsupervised learning.
Key Idea
1) Collect labeled corpora (non-parallel) for several training languages.
2) Map each (x, y) pair into a "universal feature space" to allow cross-lingual generalization.
3) Train a scoring function score(·) over the universal feature space, i.e. treat each annotated language as a single data point in a structured prediction problem.
4) Predict the test labels which yield the highest score: y* = argmax_y score(y).
[Diagram repeated across these slides: training languages and the test language mapped into the shared feature space]
Test Case: Nominal Morphology
- Languages differ in morphological complexity: only 4 noun tags in the English Penn Treebank, vs. many more noun tags in the Hungarian corpus (suffixes encode case, number, and gender).
- Our analysis will break each noun into: stem, phonological deletion rule, and suffix.
  - utiskom [ stem = utisak, del = (..ak# → ..k#), suffix = om ]
- Question: Can we use morphologically annotated languages to train a universal morphological analyzer?
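As a concrete illustration (a minimal sketch, not the authors' code), the triple can be applied to regenerate the surface form; the rule notation (..ak# → ..k#) means the stem-final "ak" is rewritten as "k" at the boundary before the suffix attaches:

```python
def apply_analysis(stem, deletion, suffix):
    """Reconstruct a surface form from a (stem, deletion rule, suffix) triple.

    `deletion` is a pair (old_ending, new_ending), e.g. ("ak", "k"),
    encoding a rule like ..ak# -> ..k# applied at the stem boundary;
    None means no phonological deletion.
    """
    if deletion is not None:
        old, new = deletion
        assert stem.endswith(old), "deletion rule must match the stem ending"
        stem = stem[: len(stem) - len(old)] + new
    return stem + suffix

# The Serbian example from the slide: utisak + -om -> utiskom
print(apply_analysis("utisak", ("ak", "k"), "om"))  # utiskom
```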
Our Method
Universal feature space (8 features):
- Size of the stem, suffix, and deletion rule lexicons
- Entropy of the stem, suffix, and deletion rule distributions
- Percentage of suffix-free words, and of words with phonological deletions
Learning algorithm:
- Broad characteristics of morphology are often similar across select language pairs
- Motivates a nearest neighbor approach
- In the structured scenario, learning becomes a search problem over the label space
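A sketch of how these eight features could be computed from a corpus-wide analysis; the feature names follow the slide, while the entropy base and any scaling are assumptions:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def universal_features(analyses):
    """Map a corpus analysis to the 8-dimensional universal feature space.

    `analyses` holds one (stem, deletion_rule, suffix) triple per word type;
    suffix == "" marks a suffix-free word, deletion_rule is None when no
    phonological deletion applies.
    """
    analyses = list(analyses)
    stems = Counter(a[0] for a in analyses)
    dels = Counter(a[1] for a in analyses if a[1] is not None)
    sufs = Counter(a[2] for a in analyses if a[2])
    n = len(analyses)
    return [
        len(stems), len(sufs), len(dels),                  # lexicon sizes
        entropy(stems),                                    # stem entropy
        entropy(sufs) if sufs else 0.0,                    # suffix entropy
        entropy(dels) if dels else 0.0,                    # deletion-rule entropy
        sum(1 for a in analyses if not a[2]) / n,          # % suffix-free words
        sum(1 for a in analyses if a[1] is not None) / n,  # % words with deletions
    ]
```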
Structured Nearest Neighbor
Main idea: predict the analysis for the test language which brings us closest in feature space to a training language.
1) Initialize the analysis of the test language.
2) For each training language: iteratively and greedily update the test-language analysis to bring it closer in feature space to that training language.
3) After T iterations, choose the training language closest in feature space.
4) Predict the analysis associated with that nearest neighbor.
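Schematically, the outer loop might look like the sketch below, assuming hypothetical helpers init_analysis, greedy_step, and distance, and reusing universal_features from the earlier sketch:

```python
def structured_nearest_neighbor(test_words, train_vectors, T,
                                init_analysis, greedy_step, distance):
    """Sketch of the outer loop: run a separate greedy search toward each
    training language, then keep the analysis that lands closest."""
    best_dist, best_analysis = float("inf"), None
    for lang, target in train_vectors.items():
        analysis = init_analysis(test_words)          # step 1
        for _ in range(T):                            # step 2
            analysis = greedy_step(analysis, target)  # move toward `target`
        d = distance(universal_features(analysis), target)
        if d < best_dist:                             # step 3
            best_dist, best_analysis = d, analysis
    return best_analysis                              # step 4
```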
Structured Nearest Neighbor
[Animated diagram over several slides: initialize the test-language labels, iteratively search toward each training language in feature space, then predict with the analysis that ends up closest to a training language]
Morphology Search Algorithm
- Stage 0: Initialization
- Stage 1: Reanalyze Each Word
- Stage 2: Find New Stems
- Stage 3: Find New Suffixes
Based on Goldsmith (2005): he minimizes description length; we minimize distance to a training language.
[Flowchart: each stage proposes candidates and selects the one closest to the training language]
Iterative Search Algorithm
Stage 0: Using "character successor frequency," initialize the stem set T, the suffix set F, and the deletion rule set D.
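Successor frequency counts how many distinct characters can follow a given prefix in the vocabulary; sharp peaks suggest stem/suffix boundaries. A simplified illustration of this Harris-style heuristic (not the paper's exact initialization):

```python
def successor_frequency(words, prefix):
    """Number of distinct characters that follow `prefix` in the vocabulary."""
    return len({w[len(prefix)] for w in words
                if w.startswith(prefix) and len(w) > len(prefix)})

words = ["walk", "walks", "walked", "walking", "wall"]
# A spike after "walk" ('s', 'e', 'i' all follow) hints at a boundary there.
print(successor_frequency(words, "walk"))  # 3
```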
Iterative Search Algorithm
Stage 1: Greedily reanalyze each word, keeping T and F fixed.
Iterative Search Algorithm
Stage 2: Find new stems: greedily analyze unsegmented words, keeping F fixed.
Iterative Search Algorithm
Stage 3: Find new suffixes: greedily analyze unsegmented words, keeping T fixed.
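Putting the stages together, one greedy pass of Stage 1 might look like the sketch below; candidate_splits is a hypothetical helper that ignores deletion rules for brevity, universal_features comes from the earlier sketch, and distance is any metric on the feature space:

```python
def candidate_splits(word, T, F):
    """All (stem, deletion, suffix) splits with stem in T and suffix in F.
    For brevity this sketch ignores deletion rules (deletion = None)."""
    return [(word[:i], None, word[i:]) for i in range(1, len(word) + 1)
            if word[:i] in T and (word[i:] in F or word[i:] == "")]

def reanalyze_words(analyses, T, F, target, distance):
    """Sketch of Stage 1: greedily reanalyze each word with T and F fixed,
    keeping whichever split moves the feature vector closest to `target`."""
    for word in list(analyses):
        best = analyses[word]
        best_d = distance(universal_features(analyses.values()), target)
        for cand in candidate_splits(word, T, F):
            analyses[word] = cand
            d = distance(universal_features(analyses.values()), target)
            if d < best_d:
                best, best_d = cand, d
        analyses[word] = best  # commit the greedy choice
    return analyses
```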
Experimental Setup
Corpus: Orwell's Nineteen Eighty-Four (MULTEXT-East V3)
- Languages: Bulgarian, Czech, English, Estonian, Hungarian, Romanian, Slovene, Serbian
- 94,725 tokens (English). Slight confound: the data is parallel; the method does not assume or exploit this fact.
- All words are tagged with a morpho-syntactic analysis.
Baseline: Linguistica model (Goldsmith 2005)
- Same search procedure, but greedily minimizes description length.
Upper bound: supervised model
- Structured perceptron framework (Collins 2002)
Aggregate Results
Accuracy: fraction of word types with the correct analysis.
[Bar chart: Linguistica baseline at 64.6, our model, oracle, and the supervised upper bound]
- Our model (train with 7, test on 1): an average absolute increase that reduces error by 42%.
- Oracle: each language guided using its own gold-standard feature values.
- Oracle accuracy is still below supervised, due to (1) search errors and (2) the coarseness of the feature space.
Results by Language
[Bar chart per language: Linguistica baseline vs. our model (train with 7, test on 1)]
- Best accuracy: English. Lowest accuracy: Estonian.
- Biggest improvements for Serbian (15 points) and Slovene (22 points).
- For all languages other than English, an improvement over the baseline.
Visualization of Feature Space
Feature space reduced to 2D using MDS.
[Scatter plot comparing three analyses per language: Linguistica, Gold Standard, Our Method]
Visualization of Feature Space
Serbian and Slovene:
- Closely related Slavic languages
- Nearest neighbors under our model's analysis
- Essentially they "swap places"
Visualization of Feature Space
Estonian and Hungarian:
- Highly inflected Uralic languages
- They "swap places"
Visualization of Feature Space
English:
- Failed to find a good neighbor
- Pulled towards Bulgarian (the second least inflected language in the dataset)
Accuracy as Training Languages Are Added
Averaged over all language combinations of various sizes:
- Accuracy climbs as training languages are added.
- Worse than the baseline when only one training language is available.
- Better than the baseline when two or more training languages are available.
Why Does Accuracy Improve with More Languages?
Resulting distance vs. accuracy for all 56 train-test pairs:
- More training languages ⇒ a closer neighbor is found.
- A closer neighbor ⇒ higher accuracy.
Summary
- Main idea: recast unsupervised learning as cross-lingual structured prediction.
- Test case: morphological analysis of 8 languages.
- Formulated a universal feature space for morphology.
- Developed a novel structured nearest neighbor approach.
- Our method yields substantial accuracy gains.
Future Work
Shortcoming:
- Uniform weighting of dimensions in the universal feature space.
- Some features may be more important than others.
Future work: learn a distance metric on the universal feature space.
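For instance, a learned weight vector over the eight dimensions would generalize the current uniform metric; a sketch of the idea, not a method from the talk:

```python
import numpy as np

def weighted_distance(x, y, w):
    """Weighted Euclidean distance over the universal features;
    w = all-ones recovers the current uniform metric."""
    x, y, w = map(np.asarray, (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))
```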
Thank You