Lemmatization Tagging LELA 30922

2/20 Lemmatization
Basic form of annotation involving identification of the underlying lemmas (lexemes) of the words in a text
Related to morphological processing
– Lemmatization merely identifies the lemma
– Morphological processing would (also) try to interpret the inflection etc.
– e.g. running (lemma = run) (analysis: lex=run, form=prespart)

3/20 Lemmatization – how to?
Simplest solution would be to have a list of all possible word forms and their associated lemma information
Somewhat inefficient (for English) and actually impossible for some other languages
And not necessary, since there are many regularly formed inflections in English
Of course, a list of irregularities is needed as well
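The full-form list approach can be sketched as a plain dictionary lookup; the mini-lexicon here is invented for illustration, not a real resource:

```python
# Full-form lexicon: every word form maps directly to its lemma.
# A real lexicon would hold hundreds of thousands of entries;
# this toy fragment is purely illustrative.
FULL_FORM_LEXICON = {
    "running": "run", "ran": "run", "runs": "run",
    "books": "book", "booked": "book",
    "mice": "mouse",  # irregular forms must be listed explicitly
}

def lemmatize(word):
    """Return the lemma if the form is listed, else the word itself."""
    return FULL_FORM_LEXICON.get(word.lower(), word)

print(lemmatize("running"))  # run
print(lemmatize("mice"))     # mouse
```

The irregular forms (mice, ran) have to be listed one by one, which is exactly why the slide notes a list of irregularities is needed even when regular inflection is handled by rule.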

4/20 Lemmatization – how to?
Computational morphology is quite well established now: various methods
– Brute force: try every possible segmentation of the word and see which ones match known stems and affixes
– Rule-based (simplistic method): have a list of known affixes, see which ones apply
– Rule-based (more sophisticated): list of known affixes, plus knowledge about allowable combinations, e.g. -ing can only attach to a verb stem
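A minimal sketch of the simplistic rule-based method, with an invented toy affix list and stem lexicon; note how even correct stripping rules can overgenerate candidate lemmas:

```python
# Simplistic rule-based lemmatization: strip a known suffix and check
# whether the remainder (possibly repaired) is a known stem.
KNOWN_STEMS = {"run", "hop", "hope", "heal", "fall"}   # toy stem lexicon
SUFFIXES = ["ing", "ed", "s"]                          # toy affix list

def candidate_lemmas(word):
    """Return all known stems reachable by stripping one suffix."""
    candidates = set()
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            # stem[:-1] undoes consonant doubling (hopping -> hop);
            # stem + "e" restores a dropped e (hoping -> hope)
            for repaired in (stem, stem[:-1], stem + "e"):
                if repaired in KNOWN_STEMS:
                    candidates.add(repaired)
    return candidates

print(sorted(candidate_lemmas("hopping")))  # ['hop']
print(sorted(candidate_lemmas("hoping")))   # ['hop', 'hope'] – overgenerates
```

The spurious hop reading for hoping is the kind of thing the "more sophisticated" variant rules out with knowledge about allowable stem–affix combinations.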

5/20 Lemmatization – how to?
Problem well studied and understood, though that’s not to say it’s trivial
Morphological processes can be quite complex, cf. running, falling, hopping, hoping, healing, …
Need to deal with derivation as well as inflection
Not just suffixes: other types of morphological process (prefix, ablaut, etc.)
Plenty of ambiguities
– ambiguous morphemes, e.g. fitter, books
– ambiguity between a single morph and an inflected form, e.g. flower

6/20 POS Tagging
POS = part of speech
Familiar (?) from school and/or language learning (noun, verb, adjective, etc.)
POS tagsets usually identify more fine-grained distinctions, e.g. proper noun, common noun, plural noun, etc.
In fact POS tagsets often have ~60 different categories, even as many as 400!

7/20 POS Tagging
Assigning POS tags to individual words involves a degree of analysis
– of the word form itself (cf. lemmatization)
– of the word in context
Individual words are often ambiguous (particularly in English, where a huge percentage of words are at least 2-ways ambiguous)
Disambiguation often depends on context

8/20 What is a tagger?
Lack of distinction between …
– software which allows you to create something you can then use to tag input text, e.g. “Brill’s tagger”
– the result of running such software, e.g. a tagger for English (based on such-and-such corpus)
Taggers (even rule-based ones) are almost invariably trained on a given corpus
“Tagging” usually understood to mean “POS tagging”, but you can have other types of tags (e.g. semantic tags)

9/20 Simple taggers
Default tagger has one tag per word, and assigns it on the basis of dictionary lookup
– tags may indicate ambiguity but not resolve it, e.g. NVB for noun-or-verb
Words may be assigned different tags with associated probabilities
– tagger will assign the most probable tag unless there is some way to identify when a less probable tag is in fact correct
Tag sequences may be defined, and assigned probabilities (including 0 for illegal sequences – negative rules)
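A minimal sketch of such a most-probable-tag (unigram) tagger; the words, tags, and counts are invented stand-ins for a real tagged corpus:

```python
from collections import Counter

# Invented per-word tag counts, standing in for a real tagged corpus.
TAG_COUNTS = {
    "run":   Counter({"VB": 3600, "NN": 1200}),
    "the":   Counter({"DT": 9000}),
    "flies": Counter({"VBZ": 300, "NNS": 200}),
}

def tag(word, default="NN"):
    """Assign each word its single most probable tag (ignoring context)."""
    counts = TAG_COUNTS.get(word)
    return counts.most_common(1)[0][0] if counts else default

print([tag(w) for w in ["the", "flies", "run"]])  # ['DT', 'VBZ', 'VB']
```

Because context is ignored, this tagger gets flies wrong whenever it is a noun; the sequence-probability machinery of the later slides exists precisely to fix such cases.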

10/20 Rule-based taggers
Earliest type of tagging: two stages
– Stage 1: look up word in lexicon to give a list of potential tags
– Stage 2: apply rules which certify or disallow tag sequences
Rules originally handwritten; more recently, machine learning methods can be used
“Transformation-based tagging” is the most common example
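The two stages can be sketched as follows; the toy lexicon and the list of disallowed tag bigrams are invented for illustration:

```python
# Two-stage rule-based tagging sketch:
# Stage 1 - lexicon lookup gives each word its set of potential tags.
# Stage 2 - handwritten rules disallow certain tag sequences.
LEXICON = {"the": {"DT"}, "can": {"MD", "NN", "VB"}, "rust": {"NN", "VB"}}
ILLEGAL_BIGRAMS = {("DT", "MD"), ("DT", "VB")}  # e.g. no modal/verb right after a determiner

def tag_sentence(words):
    tags = [set(LEXICON.get(w, {"NN"})) for w in words]   # Stage 1
    for i in range(1, len(tags)):                          # Stage 2
        allowed = {t for t in tags[i]
                   if any((p, t) not in ILLEGAL_BIGRAMS for p in tags[i - 1])}
        if allowed:   # never prune a word down to no tags at all
            tags[i] = allowed
    return tags

print(tag_sentence(["the", "can"]))  # [{'DT'}, {'NN'}]
```

Here can starts out three-ways ambiguous, and the sequence rules alone narrow it to the noun reading after a determiner.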

11/20 Transformation-based tagging
Eric Brill (1993)
Start from an initial tagging, and apply a series of transformations
Transformations are themselves learned from the training data
Captures the tagging data in far fewer parameters than statistical models
The transformations learned (often) have linguistic “reality”

12/20 Transformation-based tagging
Three stages:
– lexical look-up
– lexical rule application for unknown words
– contextual rule application to correct mis-tags

13/20 Transformation-based learning
Change tag a to b when:
– internal evidence (morphology)
– contextual evidence, e.g.
  one or more of the preceding/following words has a specific tag
  one or more of the preceding/following words is a specific word
  one or more of the preceding/following words has a certain form
Order of rules is important
– rules can change a correct tag into an incorrect tag, so another rule might correct that “mistake”
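Applying ordered contextual transformations of this kind ("change tag a to b when the preceding word has tag z") can be sketched as below; the two rules are invented examples, not Brill's learned rules:

```python
# Brill-style contextual transformations, applied in order:
# each rule is (from_tag, to_tag, required tag of the preceding word).
# A later rule may undo or correct a change made by an earlier one.
RULES = [
    ("NN", "VB", "TO"),   # noun -> verb after "to" (e.g. "to run")
    ("VB", "NN", "DT"),   # verb -> noun after a determiner ("the run")
]

def apply_transformations(tagged):
    tagged = list(tagged)
    for from_tag, to_tag, prev_tag in RULES:
        for i in range(1, len(tagged)):
            word, tag = tagged[i]
            if tag == from_tag and tagged[i - 1][1] == prev_tag:
                tagged[i] = (word, to_tag)
    return tagged

initial = [("to", "TO"), ("run", "NN")]   # initial (most-frequent-tag) tagging
print(apply_transformations(initial))     # [('to', 'TO'), ('run', 'VB')]
```

Because the rules fire in sequence, swapping their order can change the output, which is why the slide stresses that rule order matters.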

14/20 Stochastic taggers
Nowadays, pretty much all taggers are statistics-based, and have been since the 1980s (or even earlier: some primitive algorithms were already published in the 60s and 70s)

15/20 How do they work?
Tagger must be “trained”
Many different techniques, but typically …
– small “training corpus” hand-tagged
– tagging rules learned automatically
– rules define the most likely sequence of tags
Rules based on
– internal evidence (morphology)
– external evidence (context)
– probabilities

16/20 What probabilities do we have to learn?
Individual word probabilities
– P that a given tag is appropriate for a given word
– learned from corpus evidence
– problem of “sparse data”
Tag sequence probabilities
– P that a given sequence of tags is appropriate
– again, learned from corpus evidence

17/20 Individual word probability
Simple calculation
– suppose the word run occurs 4800 times in the training corpus:
– 3600 times as a verb
– 1200 times as a noun
P(verb|run) = 3600/4800 = 0.75
P(noun|run) = 1200/4800 = 0.25
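The same relative-frequency calculation, directly in code, using the slide's counts for run:

```python
# P(tag | word) estimated by relative frequency.
counts = {"verb": 3600, "noun": 1200}   # occurrences of "run" by tag
total = sum(counts.values())            # 4800

probs = {tag: n / total for tag, n in counts.items()}
print(probs)  # {'verb': 0.75, 'noun': 0.25}
```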

18/20 “Sparse data”
What if there is no evidence for a particular combination?
Could mean it is impossible, or just that it doesn’t happen to occur
Calculations involving products (Π) and sums (Σ) don’t like 0s
“Smoothing”: add a tiny amount to all values, so there are no zeros
Probabilities are reduced, but never 0
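The simplest version of this idea is add-one (Laplace) smoothing; a sketch reusing the previous slide's counts for run plus one tag with no evidence at all:

```python
# Add-one (Laplace) smoothing: pretend every tag was seen k more times,
# so no estimate is exactly zero.
counts = {"verb": 3600, "noun": 1200, "adjective": 0}  # "adjective" unseen

def smoothed_probs(counts, k=1):
    total = sum(counts.values()) + k * len(counts)
    return {tag: (n + k) / total for tag, n in counts.items()}

probs = smoothed_probs(counts)
print(probs["adjective"])  # small but non-zero (1/4803)
print(probs["verb"])       # slightly below the unsmoothed 0.75
```

As the slide says: every probability is reduced a little to make room for the unseen events, but none is 0, so products over long tag sequences no longer collapse.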

19/20 Tag sequence probability
Probability that a given tag sequence is appropriate for a given word sequence
Much too hard to calculate probabilities for all possible sequences
Subsequences are more practical
It turns out that good accuracy is gained just by looking at sequences of 2 or 3 tags (bigrams, trigrams)
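Bigram tag-sequence probabilities can again be estimated by relative frequency; the toy tag sequences below are invented for illustration:

```python
from collections import Counter

# Toy tagged corpus (tag sequences only); "<s>" marks sentence start.
corpus_tags = [
    ["DT", "NN", "VBZ"],
    ["DT", "JJ", "NN", "VBZ"],
    ["NN", "VBZ"],
]

bigrams = Counter()
unigrams = Counter()
for sent in corpus_tags:
    tags = ["<s>"] + sent
    for prev, cur in zip(tags, tags[1:]):
        bigrams[(prev, cur)] += 1
        unigrams[prev] += 1

def p(cur, prev):
    """P(cur tag | prev tag) by relative frequency."""
    return bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0

print(p("NN", "DT"))  # 0.5: half the tags seen after DT are NN
```

A full sequence probability is then just a product of such bigram factors (times the word probabilities of the previous slides), which is what makes the subsequence approximation tractable.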

20/20 Tagging – final word
Tagging is now a quite well understood technology
Accuracy typically >97%
– hard to imagine how to get improvements of even as much as 1%
Many taggers are now available for download
Sometimes not clear whether “tagger” means
– software enabling you to build a tagger given a corpus
– an already-built tagger for a given language
Because a given tagger (2nd sense) will have been trained on some corpus, it will be biased towards that (kind of) corpus
– question of goodness of match between the original training corpus and the material you want to use the tagger on