1
Joint Semantic Synthesis and Morphological Analysis of the Derived Word
Ryan Cotterell and Hinrich Schütze
2
A Tale of Four Random Variables
For the next 14 minutes, I am going to tell you a tale of four random variables and how they came to meet in a joint distribution.
3
A Tale of Four Random Variables
Morphological Decomposition – Orthography – Meaning (Word Vector) – Word Form (a String)
Our distribution takes the following form. We condition on the input string of a word type – the first random variable – and then jointly model its orthography, morphological decomposition and the meaning of the word, conveniently located in R^n. For the remainder of this talk, I will construct our exact model and explain how we do inference and learning.
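In symbols (as written out later in the deck), this is a conditional distribution over four random variables:

p(v, s, u \mid w)

where w is the observed word form (a string), u its underlying orthographic form, s its morphological decomposition (the canonical segmentation), and v its meaning, a word vector in \mathbb{R}^n.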
4
Old Idea: Surface Morphological Segmentation
I am going to start this tale with an idea that has been bouncing around the NLP community for years – surface morphological segmentation
5
Segment: unachievability → un achiev abil ity
For instance, given an input word string unachievability, we want to segment it into four morphs: un, achiev, abil and ity. Perhaps, with a labeling. This task has attracted a lot of attention over the years, with a number of supervised and unsupervised methods being proposed.
6
Semi-New Idea (NAACL 2016): Canonical Morphological Segmentation
Building on that idea, our recent work has extended this approach to jointly model the orthography of the word.
7
Restore: unachievability → unachieveableity
Segment: unachieveableity → un achieve able ity
That is, given an input string, we want to first restore orthographic and phonological changes that were made during the process of word formation. After this step, we then want to apply a segmentation into canonicalized morphemes, like so. Again, possibly with morphological labels. To point out the differences compared to the last slide, we have added an "e" to "achieve" and mapped "abil" to "able".
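As a concrete data sketch of the two task formats for the running example (the strings are from the slides; the Python representation is just for illustration):

```python
# Running example from the slides: "unachievability".
word = "unachievability"

# Surface segmentation: split the surface string as-is into morphs.
surface_segmentation = ["un", "achiev", "abil", "ity"]

# Canonical segmentation: first restore the underlying (orthographically
# canonical) form, then segment it into canonical morphemes.
underlying_form = "unachieveableity"          # restore "e", map "abil" -> "able"
canonical_segmentation = ["un", "achieve", "able", "ity"]

# The surface morphs concatenate back to the surface word ...
assert "".join(surface_segmentation) == word
# ... while the canonical morphemes concatenate to the underlying form.
assert "".join(canonical_segmentation) == underlying_form
```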
8
Why is canonicalization useful?
Why did we go through the trouble of making our model more complicated? Segmenting words alone is not enough. We eventually need to reason about the relationships between words.
9
Surface Segmenting the Lexicon
Segmenting words alone is not enough. We eventually need to reason about the relationships between words.
10
unachievability achievement underachiever achieves
Let's consider surface segmenting the lexicon.
11
un achiev abil ity
achieve ment
under achiev er
achieve s
12
Are they the same morpheme???
un achiev abil ity
achieve ment
under achiev er
achieve s
13
Canonically Segmenting the Lexicon
When we perform canonical segmentation, it becomes immediately clear which words share morphemes.
14
unachievability achievement underachiever achieves
Now let's compare the canonical segmentations.
15
unachieveableity achievement underachieveer achieves
16
un achieve able ity
achieve ment
under achieve er
achieve s
17
Segmentations are now canonicalized across words
un achieve able ity
achieve ment
under achieve er
achieve s
Better preprocessing, e.g., more meaningful reduction in sparsity and reasoning about compositionality
18
unachievability thinkable accessible untouchable
Segmentation does not happen in isolation. Ideally, we would like to analyze all the words in a language's lexicon.
19
unachieveableity thinkable accessable untouchable
20
un achieve able ity
think able
access able
un touch able
21
un achieve able ity
think able
access able
un touch able
22
Segmentation is Good for Derivational Morphology
3:00 Before we go further, we would like to emphasize an important point about the role of morphological segmentation in the literature.
23
Inflectional Morphology (More Paradigmatic)
form      features
walk      INFIN
walks     3rd PRES SG
walking   GERUND
walked    PAST
Inflectional morphology is easy to represent in a paradigmatic form – a table of word forms and features. Indeed, under this representation, it is not necessary to think about the word as a segmentation of walk + ing, but rather as the form of walk associated with a given feature bundle.
24
Derivational Morphology (More Syntagmatic)
content → contented → discontented → discontentedness
Derivation, on the other hand, is often treated syntagmatically: it forms long chains of related words. It is not straightforward to associate a complex form like discontentedness with its stem "content" plus a bundle of features. For this reason, a segmentation may be better.
dis content ed ness
25
English is Morphologically Rich
English derivational morphology is very complex! Just as complex as derivation in German and Russian.
Derived forms take affixes from multiple substrata: Germanic and Latinate.
Stop saying English is morphologically impoverished: it's inflectionally poor!
The majority of English words are derivationally complex (Light 1996).
Chinese is both inflectionally and derivationally impoverished.
re vital ize ation
You may think English is a morphologically impoverished language, as one of my reviewers did, but I want to set the record straight.
26
A Joint Model of the Word Form (Cotterell et al. 2016)
3:00 Now, I’m going to discuss a joint model over the word form and its orthography – a model introduced at NAACL 2016.
27
The First Three Random Variables
This is a distribution over the first three random variables in our tale.
28
The First Three Random Variables
Our distribution over the first three random variables takes the following form. We condition on the input string of a word type – the surface form – and jointly model its underlying orthographic form and its morphological decomposition. I will now construct our exact model and explain how we do inference and learning.
29
The First Three Random Variables
Word (Surface Form): unachievability
Underlying Form: unachieveableity
Canonical Segmentation: un achieve able ity
30
(s=un achieve able ity, u=unachieveableity)
How good is the segmentation–underlying form pair? (s = un achieve able ity, u = unachieveableity) How good is the underlying form–word pair? (u = unachieveableity, w = unachievability)
We define this model as being proportional to the exponential of a linear score; that is, it is a log-linear model. We can see it as being composed of two different factors. The first factor scores a canonical segmentation–underlying form pair; basically, it asks how good this pair is, for example un achieve able ity and unachieveableity. This is a structured factor and can be seen as the score of a semi-Markov model. The second factor scores an underlying form–surface word pair; again, it asks how good this pair is. Now, this notation belies a bit of the complexity: this factor is, again, structured, and in general we have to encode all possible alignments between the two strings. Luckily, we can encode this as a weighted finite-state machine; the paper explains this in detail. We put them all together and we get our model. The remaining details, such as the feature templates, can be found in the paper.
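A sketch of that factorization in symbols, under my own notation (the feature functions f and g, the weight vector \boldsymbol\eta, and the exact form of the score are illustrative assumptions; the slides only state that the model is log-linear with these two structured factors):

p(s, u \mid w) = \frac{1}{Z(w)} \exp\Big( \boldsymbol{\eta}^\top \mathbf{f}(s, u) + \boldsymbol{\eta}^\top \mathbf{g}(u, w) \Big), \qquad Z(w) = \sum_{s'} \sum_{u'} \exp\Big( \boldsymbol{\eta}^\top \mathbf{f}(s', u') + \boldsymbol{\eta}^\top \mathbf{g}(u', w) \Big)

Here f collects the semi-Markov features over the segmentation–underlying form pair and g collects the alignment (WFST) features over the underlying form–word pair; the sum over all candidate analyses in Z(w) is what makes inference hard.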
31
The Adventure Continues: The Compositional Semantics of Morphology
Now, back to the main plotline of our tale. I’m going to discuss the compositional semantics of morphology.
32
un achieve able ity Semantic Coherence
Let's go back to our running example "un achieve able ity". Let's say that we had a vector that represents the meaning of each morpheme. Now we want to find a composition function f that stitches these vectors together to give us the meaning of the word – a function that stitches together the meanings of the morphemes.
33
Semantic Coherence: un achieve able ity → unachievability
Morphologically complex words obey the Principle of Compositionality.
Morpheme embeddings (un, achieve, able, ity) → Word embedding (unachievability)
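For the running example, the compositionality assumption reads as follows (C_{\boldsymbol\beta} is the composition function used later in the deck; the morpheme-vector notation is mine):

\mathbf{v}_{\textit{unachievability}} \approx C_{\boldsymbol\beta}\big(\mathbf{v}_{\textit{un}}, \mathbf{v}_{\textit{achieve}}, \mathbf{v}_{\textit{able}}, \mathbf{v}_{\textit{ity}}\big)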
34
Semantic Incoherence hear th
35
Not a valid morphological decomposition
Semantic Incoherence: hear th → hearth
Now, let's consider an example of semantic incoherence: segmenting "hearth" into "hear" and "th" is not a valid morphological decomposition.
36
Intuition Behind Joint Model
Good segmentations preserve semantic coherence. How can we exploit that?
37
A Fourth Random Variable: A Joint Model of the Form and its Meaning
To the best of our knowledge, the fully supervised version of this task has never been considered before in the literature, so we introduce a novel joint probability model.
38
unachieveableity unachievability un achieve able ity
p(v, s, u \mid w)
Word Vector (Meaning): v
Canonical Segmentation: s
Underlying Form: u
Word (Surface Form): w
39
(s=un achieve able ity, u=unachieveableity)
How good is the segmentation–underlying form pair? (s = un achieve able ity, u = unachieveableity) How good is the underlying form–word pair? (u = unachieveableity, w = unachievability)
We define this model as being proportional to the exponential of a linear score. We can see this as being composed of two different factors.
40
Distribution over Vectors
p(v, s, u \mid w) = p(v \mid s) \cdot p(s, u \mid w)
p(v \mid s) \propto \exp\left( -\frac{1}{2\sigma^2} \lVert v - C_{\boldsymbol\beta} \rVert_2^2 \right)
The word vector v is Gaussian distributed, with mean vector C_{\boldsymbol\beta} given by the composed morpheme embeddings.
41
recurrent neural network
What is C_{\boldsymbol\beta}?
un achieve able ity → unachievability
Candidates: addition, recurrent neural network
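A minimal sketch, assuming a PyTorch implementation, of the two candidate composition functions: plain addition of morpheme embeddings versus an LSTM run over them. The class names, dimensions, and architecture details are illustrative, not the authors' code:

```python
import torch
import torch.nn as nn

class AdditiveComposer(nn.Module):
    def forward(self, morpheme_vecs):          # (num_morphemes, dim)
        return morpheme_vecs.sum(dim=0)        # simple additive composition

class LSTMComposer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)
    def forward(self, morpheme_vecs):          # (num_morphemes, dim)
        _, (h, _) = self.lstm(morpheme_vecs.unsqueeze(0))
        return h.squeeze(0).squeeze(0)         # final hidden state as the word vector

# Toy usage with random morpheme embeddings for "un achieve able ity".
dim = 8
morphemes = torch.randn(4, dim)                # un, achieve, able, ity
v_add = AdditiveComposer()(morphemes)
v_rnn = LSTMComposer(dim)(morphemes)
```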
42
How do we Get Morpheme Embeddings?
Jointly train morpheme embeddings with the LSTM parameters. The objective encourages these to well-approximate word embeddings. Similar to retrofitting (Faruqui et al. 2015).
While the word embeddings are taken as fixed – the output of a toolkit like word2vec – we actually train our own morpheme embeddings to best approximate them.
Morpheme embeddings → Word embedding
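A minimal sketch of this training signal, again assuming PyTorch: the composed morpheme embeddings are pushed toward the fixed, pre-trained word vector by minimizing squared error, which matches the Gaussian p(v | s) above up to the 1/(2σ²) factor. Additive composition is used here for brevity; an LSTM composer like the one sketched earlier would be plugged in the same way. All names are illustrative:

```python
import torch
import torch.nn as nn

dim, num_morpheme_types = 8, 1000
morpheme_emb = nn.Embedding(num_morpheme_types, dim)    # trained parameters
optimizer = torch.optim.Adagrad(morpheme_emb.parameters(), lr=0.1)

def training_step(morpheme_ids, word_vector):
    # word_vector is the fixed, pre-trained embedding of the whole word
    # (e.g. from word2vec); only the morpheme embeddings are updated.
    predicted = morpheme_emb(morpheme_ids).sum(dim=0)    # additive composition
    loss = ((predicted - word_vector) ** 2).sum()        # squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy call: hypothetical ids for (un, achieve, able, ity) and a random target.
loss = training_step(torch.tensor([0, 1, 2, 3]), torch.randn(dim))
```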
43
Vector Approximation We can approximate vectors for OOVs!
The sum is intractable, so we sample! Open-vocabulary word embeddings: we have a distribution over embeddings for any word through morpheme composition. We believe the idea of open-vocabulary word embeddings is really important, as it allows us to get a word embedding for any input word w, even one that is unknown.
p(v \mid w) = \sum_{s'} \sum_{u'} p(v, s', u' \mid w)
44
Inference and Learning
Inference is intractable! Approximate inference with importance sampling gives an estimate of the gradient. Sampling-based decoding, also with importance sampling. Learning: AdaGrad. See the paper for the full derivation.
Unfortunately, marginal inference in our model is intractable! We explain why in the paper. As the model is globally normalized, even computing a gradient requires inference. To solve this, we rely on an approximation known as importance sampling. At a high level, importance sampling takes samples from an easy distribution and lets the model rescore them. Decoding, a.k.a. MAP inference, is also intractable, but, again, we can approximately solve it with importance sampling. Once we get our approximate gradient using importance sampling, we train the model with AdaGrad.
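A generic, minimal sketch of the importance-sampling idea: estimate an expectation under an unnormalized model by drawing samples from an easy proposal distribution and reweighting them. The candidates, proposal, and model score below are made-up stand-ins, not the estimator from the paper:

```python
import math
import random

def importance_estimate(candidates, q_prob, p_tilde, f, num_samples=1000):
    """Estimate E_p[f] with weights w_i = p_tilde(x_i) / q(x_i), samples x_i ~ q."""
    samples = random.choices(candidates,
                             weights=[q_prob(x) for x in candidates],
                             k=num_samples)
    weights = [p_tilde(x) / q_prob(x) for x in samples]
    return sum(w * f(x) for w, x in zip(weights, samples)) / sum(weights)

# Toy usage: candidate analyses with a uniform proposal and a made-up score.
candidates = ["un|achieve|able|ity", "una|chieve|able|ity", "unachieveable|ity"]
q_prob  = lambda s: 1.0 / len(candidates)            # easy proposal distribution
p_tilde = lambda s: math.exp(s.count("|"))           # unnormalized model score
expected_num_morphemes = importance_estimate(
    candidates, q_prob, p_tilde, f=lambda s: s.count("|") + 1)
```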
45
Experiments
Experiment 1: Canonical Segmentation
Experiment 2: Vector Approximation
Experiment 3: Analysis of Derivational Coherence
46
Experiment 1: Canonical Segmentation
Does modeling semantic coherence help segmentation? Is the joint model p(v, s, u \mid w) better than p(s, u \mid w) without the vector? Evaluated only on the segmentation.
Metrics
Accuracy: is the whole segmentation correct?
F1: morpheme F1 (softer than accuracy)
Edit Distance: edit distance (with morpheme boundaries)
47
Experiment 1: English Results
[Results chart: English canonical segmentation, Cotterell et al. (2016) vs. this work; higher is better for accuracy and F1, lower is better for edit distance.]
48
Experiment 2: Vector Approximation
Our model can be used to approximate vectors for unknown words. How good is it?
Baseline: character-level retrofitting. Is using morphological information better than just using the characters?
Objective: minimize the squared error between the target vector and the output of an RNN over the characters.
Very similar to Pinter et al. (EMNLP 2017), so read that paper, too!
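A minimal sketch of this character-level baseline: an LSTM reads the characters of a word and is trained to minimize squared error against the word's pre-trained vector. The architecture and hyperparameters are illustrative assumptions, not the exact baseline (or Pinter et al.'s model):

```python
import torch
import torch.nn as nn

class CharToVec(nn.Module):
    def __init__(self, num_chars, char_dim, word_dim):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim, batch_first=True)
    def forward(self, char_ids):                      # (num_chars_in_word,)
        _, (h, _) = self.lstm(self.char_emb(char_ids).unsqueeze(0))
        return h.squeeze(0).squeeze(0)                # predicted word vector

model = CharToVec(num_chars=128, char_dim=16, word_dim=8)
char_ids = torch.tensor([ord(c) for c in "unachievability"])
target = torch.randn(8)                               # fixed pre-trained vector
loss = ((model(char_ids) - target) ** 2).sum()        # squared-error objective
```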
49
recurrent neural network
What is C_{\boldsymbol\beta}?
un achieve able ity → unachievability
Candidates: addition, recurrent neural network
50
Experiment 2: English Results
Cosine Similarity to Gold Vectors
51
Experiment 3: Derivational Coherence
Which English affixes are the most semantically coherent? How does this relate to morphological productivity?
52
Conclusion
We presented a joint model of word-form analysis and meaning.
Our model is capable of creating open-vocabulary embeddings.
Empirical validation in three sets of experiments.
53
Fin. Thank You!
54
A Tale of Four Random Variables
p(v, s, u \mid w)
55
Joint Architecture
56
Open Vocabulary Word Embeddings!
Morphological solutions to your OOV problem
p(v \mid w) = \sum_{s'} \sum_{u'} p(v, s', u' \mid w)
57
What is C_{\boldsymbol\beta}? A deterministic composition function over the morphemes: un achieve able ity