Lemmatization Tagging LELA 30922
2/20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in a text Related to morphological processing –Lemmatization merely identifies lemma –Morphological processing would (also) try to interpret the inflection etc. –eg running (lemma = run) (analysis: lex=run, form=prespart)
3/20 Lemmatization – how to? Simplest solution would be to have a list of all possible word forms and associated lemma information Somewhat inefficient (for English) and actually impossible for some other languages And not necessary, since there are many regularly formed inflections in English Of course, list of irregularities needed as well
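The combination described above (a list of irregular forms plus rules for regular inflections) can be sketched in a few lines. This is a deliberately naive illustration, not a method given in the slides: the exception list, the suffix rules and the name `lemmatize` are all hypothetical, and a usable system would need far more of each.

```python
# Hypothetical toy data: a real system would need a far larger exception list.
IRREGULAR = {"went": "go", "mice": "mouse", "better": "good", "was": "be"}

# (suffix, replacement) pairs for regular inflections, tried in this order.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ed", ""), ("ing", "")]


def lemmatize(word):
    """Return a lemma: exception list first, then the first matching suffix rule."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two letters of stem so "is" or "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

With these rules `lemmatize("walked")` gives `walk` and `lemmatize("ponies")` gives `pony`, but `lemmatize("running")` gives `runn`: plain suffix stripping knows nothing about spelling changes, which is why more than a bare rule list is needed.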
4/20 Lemmatization – how to? Computational morphology quite well established now: various methods Brute force: try every possible segmentation of word and see which ones match known stems and affixes Rule-based (simplistic method): Have list of known affixes, see which ones apply Rule-based (more sophisticated): List of known affixes, and knowledge about allowable combinations, eg -ing can only attach to a verb stem
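A minimal sketch of the brute-force method and of the more sophisticated rule-based refinement, assuming a tiny hypothetical lexicon. The stem categories, the suffix table and both function names are invented for illustration.

```python
# Hypothetical toy lexicon: stems with a single category each.
STEMS = {"walk": "verb", "heal": "verb", "old": "adj", "cat": "noun"}

# Each suffix lists the stem categories it is allowed to attach to.
SUFFIXES = {
    "": {"verb", "noun", "adj"},
    "ing": {"verb"},
    "ed": {"verb"},
    "s": {"verb", "noun"},
    "er": {"adj", "verb"},
}


def segmentations(word):
    """Brute force: try every split point, keep pairs where both parts are known."""
    results = []
    for i in range(1, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        if stem in STEMS and suffix in SUFFIXES:
            results.append((stem, suffix))
    return results


def valid_segmentations(word):
    """Refinement: also require that the suffix may attach to the stem's category."""
    return [
        (stem, suffix)
        for stem, suffix in segmentations(word)
        if STEMS[stem] in SUFFIXES[suffix]
    ]
```

Here `segmentations("olding")` returns `[("old", "ing")]` because both pieces are known, while `valid_segmentations("olding")` returns an empty list because -ing is only allowed on verb stems in this toy table.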
5/20 Lemmatization – how to? Problem well studied and understood, though that’s not to say it’s trivial Morphological processes can be quite complex, cf running, falling, hopping, hoping, healing, … Need to deal with derivation as well as inflection Not just suffixes, other types of morphological process (prefix, ablaut, etc.) Plenty of ambiguities –ambiguous morphemes, eg fitter, books –ambiguity between single morph and inflected form, eg flower
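The -ing examples above can be used to show why spelling changes complicate matters. The sketch below, under the assumption of a small hypothetical verb list, generates candidate stems by undoing consonant doubling and e-deletion and keeps those found in the lexicon. It covers only this one suffix and ignores prefixes, ablaut and derivation.

```python
# Hypothetical toy verb list standing in for a real stem lexicon.
VERB_STEMS = {"run", "fall", "hop", "hope", "heal"}


def ing_stem_candidates(word):
    """Return possible stems for an -ing form, undoing common spelling changes."""
    if not word.endswith("ing"):
        return []
    base = word[:-3]
    candidates = [base]                   # heal+ing, fall+ing
    if len(base) >= 2 and base[-1] == base[-2]:
        candidates.append(base[:-1])      # run(n)+ing, hop(p)+ing
    candidates.append(base + "e")         # hop(e)+ing
    return candidates


def lemmatize_ing(word):
    """Keep only the candidate stems that are attested in the lexicon."""
    return [c for c in ing_stem_candidates(word) if c in VERB_STEMS]
```

With this lexicon `lemmatize_ing("running")` returns `["run"]` and `lemmatize_ing("hopping")` returns `["hop"]`, but `lemmatize_ing("hoping")` returns `["hop", "hope"]`. The spurious `hop` shows that generating candidates is not enough: the system also needs to know that hop would have doubled its final consonant.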
6/20 POS Tagging POS = part of speech Familiar (?) from school and/or language learning (noun, verb, adjective, etc.) POS tagsets usually identify more fine-grained distinctions, eg proper noun, common noun, plural noun, etc. In fact POS tagsets often have ~60 different categories, even as many as 400!
7/20 POS Tagging Assigning POS tags to individual words involves a degree of analysis –of the word form itself (cf lemmatization) –of the word in context Individual words are often ambiguous (particularly for English, where huge percentage of words are at least 2-ways ambiguous) Disambiguation often depends on context
8/20 What is a tagger? Lack of distinction between … –Software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger” –The result of running such software, e.g. a tagger for English (based on the such-and-such corpus) Taggers (even rule-based ones) are almost invariably trained on a given corpus “Tagging” usually understood to mean “POS tagging”, but you can have other types of tags (eg semantic tags)
9/20 Simple taggers Default tagger has one tag per word, and assigns it on the basis of dictionary lookup –Tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb Words may be assigned different tags with associated probabilities –Tagger will assign most probable tag unless there is some way to identify when a less probable tag is in fact correct Tag sequences may be defined, and assigned probabilities (including 0 for illegal sequences – negative rules)
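A minimal sketch of the "assign the most probable tag" strategy, assuming a hypothetical lexicon with made-up probabilities and an assumed fallback tag for unknown words. None of the names or figures come from a real tagger.

```python
# Hypothetical lexicon: each word maps to tags with made-up probabilities.
LEXICON = {
    "the": {"DET": 1.0},
    "run": {"VERB": 0.75, "NOUN": 0.25},
    "dogs": {"NOUN": 0.9, "VERB": 0.1},
}
DEFAULT_TAG = "NOUN"  # assumed fallback for words missing from the lexicon


def most_probable_tag(word):
    """Look the word up and return its highest-probability tag."""
    tags = LEXICON.get(word.lower())
    if tags is None:
        return DEFAULT_TAG
    return max(tags, key=tags.get)


def tag(words):
    """Tag each word independently, ignoring context entirely."""
    return [(w, most_probable_tag(w)) for w in words]
```

Because each word is tagged in isolation, `tag(["the", "run"])` labels run as VERB even after a determiner. That is exactly the kind of case where a less probable tag is in fact correct and context is needed to detect it.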
10/20 Rule-based taggers Earliest type of tagging: two stages Stage 1: look up word in lexicon to give list of potential tags Stage 2: Apply rules which certify or disallow tag sequences Rules originally handwritten; more recently Machine Learning methods can be used “Transformation-based tagging” most common example
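The two stages can be sketched as follows, assuming a hypothetical lexicon and a hand-written set of disallowed tag bigrams. Enumerating every candidate sequence is only feasible for short inputs, so this shows the idea rather than a practical implementation; it also assumes every word is in the lexicon.

```python
from itertools import product

# Stage 1 data (hypothetical): each word with its potential tags.
LEXICON = {
    "the": ["DET"],
    "can": ["MODAL", "NOUN", "VERB"],
    "run": ["NOUN", "VERB"],
}

# Stage 2 data (hypothetical): hand-written negative rules over adjacent tags.
FORBIDDEN = {("DET", "VERB"), ("DET", "MODAL")}


def allowed_taggings(words):
    """Return every tag sequence that survives the negative rules."""
    candidates = [LEXICON[w] for w in words]  # stage 1: lexicon lookup
    return [
        seq
        for seq in product(*candidates)       # all combinations of potential tags
        if not any(pair in FORBIDDEN for pair in zip(seq, seq[1:]))  # stage 2
    ]
```

For `["the", "run"]` only `("DET", "NOUN")` survives. For `["the", "can", "run"]` two sequences remain, `("DET", "NOUN", "NOUN")` and `("DET", "NOUN", "VERB")`, so rules of this kind narrow the options without always resolving them.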
11/20 Transformation-based tagging Eric Brill (1993) Start from an initial tagging, and apply a series of transformations Transformations are learned as well, from the training data Captures the tagging data in far fewer parameters than statistical models The transformations learned (often) have linguistic “reality”
12/20 Transformation-based tagging Three stages: –Lexical look-up –Lexical rule application for unknown words –Contextual rule application to correct mis-tags
13/20 Transformation-based learning Change tag a to b when: –Internal evidence (morphology) –Contextual evidence: one or more of the preceding/following words has a specific tag; is a specific word; or has a certain form Order of rules is important –Rules can change a correct tag into an incorrect tag, so another rule might correct that “mistake”
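A minimal sketch of applying "change tag a to b when" rules in order, restricted to one kind of contextual evidence (the tag of the preceding word). The `Rule` type, the two rules and the initial tagging are hypothetical; they are hand-written here rather than learned, and a rule's changes take effect immediately, which is an assumption of this sketch.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    from_tag: str   # tag a
    to_tag: str     # tag b
    prev_tag: str   # contextual condition: tag of the preceding word


# Hypothetical rules, applied in this order.
RULES = [
    Rule("NOUN", "VERB", "TO"),    # e.g. "to run": NOUN becomes VERB after TO
    Rule("VERB", "NOUN", "DET"),   # e.g. "the race": VERB becomes NOUN after DET
]


def apply_rules(tags, rules):
    """Apply each rule in turn across the whole sequence; later rules see earlier changes."""
    tags = list(tags)
    for rule in rules:
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags
```

Starting from the initial tagging `["TO", "NOUN", "DET", "VERB"]` (say for "to run the race"), `apply_rules` yields `["TO", "VERB", "DET", "NOUN"]`. Because every rule operates on the output of the rules before it, reordering the list can change the result.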
14/20 Stochastic taggers Nowadays, pretty much all taggers are statistics-based and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 60s and 70s)
15/20 How do they work? Tagger must be “trained” Many different techniques, but typically … Small “training corpus” hand-tagged Tagging rules learned automatically Rules define most likely sequence of tags Rules based on –Internal evidence (morphology) –External evidence (context) –Probabilities
16/20 What probabilities do we have to learn? Individual word probabilities –P that a given tag is appropriate for a given word –Learned from corpus evidence –Problem of “sparse data” Tag sequence probabilities –P that a given sequence of tags is appropriate –Again, learned from corpus evidence
17/20 Individual word probability Simple calculation –suppose the word run occurs 4800 times in the training corpus: –3600 times as a verb –1200 times as a noun P(verb|run) = 0.75 P(noun|run) = 0.25
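The calculation above is simply each tag's count divided by the word's total count. A short sketch using the slide's figures for run (the function name is hypothetical):

```python
from collections import Counter


def tag_probabilities(tag_counts):
    """Relative frequency estimate: P(tag | word) = count(word, tag) / count(word)."""
    total = sum(tag_counts.values())
    return {tag: count / total for tag, count in tag_counts.items()}


# Counts for "run" as given on the slide: 4800 occurrences in total.
run_counts = Counter({"verb": 3600, "noun": 1200})
run_probs = tag_probabilities(run_counts)
```

`run_probs` is `{"verb": 0.75, "noun": 0.25}`, matching the values on the slide.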
18/20 “Sparse data” What if there is no evidence for a particular combination? Could mean it is impossible, or just that it doesn’t happen to occur Calculations over probabilities (eg products and logarithms) don’t like 0s “Smoothing”: add a tiny amount to all values, so there are no zeros Probabilities of observed events are slightly reduced, and unobserved ones are small but not 0.
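One common way to "add a tiny amount to all values" is add-k smoothing (add-one, or Laplace, when k = 1). The slide does not name a specific method, so this choice is an assumption, as are the function name and the three-tag tagset.

```python
def smoothed_probabilities(tag_counts, tagset, k=1.0):
    """Add-k smoothing: add k to every tag's count before normalizing."""
    total = sum(tag_counts.get(t, 0) for t in tagset) + k * len(tagset)
    return {t: (tag_counts.get(t, 0) + k) / total for t in tagset}


# "run" was never seen as an adjective in the (hypothetical) training corpus.
run_counts = {"verb": 3600, "noun": 1200}
smoothed = smoothed_probabilities(run_counts, ["verb", "noun", "adj"], k=1.0)
```

After smoothing, the verb estimate drops slightly from 0.75 to 3601/4803, and the unseen adjective reading gets 1/4803 instead of 0. The probabilities still sum to 1.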
19/20 Tag sequence probability Probability that a given tag sequence is appropriate for a given word sequence Much too hard to calculate probabilities for all possible sequences Subsequences are more practical Turns out that good accuracy gained just looking at sequences of 2 or 3 tags (bigrams, trigrams)
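A minimal sketch of the bigram idea: estimate P(tag | previous tag) from a hand-tagged corpus, then approximate the probability of a whole tag sequence as the product of its bigram probabilities. The three-sentence corpus, the `<s>` start marker and the function names are hypothetical, and no smoothing is applied.

```python
from collections import Counter

# Hypothetical hand-tagged training data (tag sequences only).
TAGGED = [
    ["DET", "NOUN", "VERB"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["NOUN", "VERB", "DET", "NOUN"],
]


def bigram_model(sequences):
    """Estimate P(cur | prev) as count(prev, cur) / count(prev as a left context)."""
    pair_counts, prev_counts = Counter(), Counter()
    for seq in sequences:
        padded = ["<s>"] + seq  # "<s>" marks the start of a sentence
        for prev, cur in zip(padded, padded[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: n / prev_counts[pair[0]] for pair, n in pair_counts.items()}


def sequence_probability(tags, model):
    """Multiply the bigram probabilities along the sequence; unseen bigrams give 0."""
    p = 1.0
    padded = ["<s>"] + list(tags)
    for prev, cur in zip(padded, padded[1:]):
        p *= model.get((prev, cur), 0.0)
    return p


model = bigram_model(TAGGED)
```

On this corpus `sequence_probability(["DET", "NOUN", "VERB"], model)` is 2/3 × 2/3 × 1 = 4/9, while `["DET", "VERB"]` scores 0 because that bigram never occurs in the training data. A full tagger would combine these sequence probabilities with the individual word probabilities and search for the best-scoring sequence.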
20/20 Tagging – final word Tagging now quite well understood technology Accuracy typically >97% –Hard to imagine how to get improvements of even as much as 1% Many taggers now available for download Sometimes not clear whether “tagger” means –Software enabling you to build a tagger given a corpus –An already built tagger for a given language Because a given tagger (2nd sense) will have been trained on some corpus, it will be biased towards that (kind of) corpus –Question of goodness of match between original training corpus and material you want to use the tagger on