Natural Language Processing (NLP)
Morphological Analysis
Institute of Southern Punjab, Multan
Department of Computer Science
Previous Lecture Review
Language Processing
Level 1 – Speech sounds (Phonetics & Phonology)
Level 2 – Words & their forms (Morphology, Lexicon)
Level 3 – Structure of sentences (Syntax, Parsing)
Level 4 – Meaning of sentences (Semantics)
Level 5 – Meaning in context & for a purpose (Pragmatics)
Level 6 – Connected sentence processing in a larger body of text (Discourse)
Levels of Text Processing
Word Level: word properties, stop-words, stemming, frequent n-grams, thesaurus (WordNet)
Sentence Level
Document Level
Document-Collection Level
Linked-Document-Collection Level
Application Level
Language Processing Pipeline
speech → Phonetic/Phonological Analysis
text → OCR/Tokenization
→ POS tagging, Morphological and lexical analysis, WSD
→ Shallow parsing / Syntactic analysis / Deep parsing
→ Semantic interpretation
→ Anaphora resolution, Discourse processing, Integration
Some Building Blocks
Source Language Analysis / Target Language Generation
Text Normalization / Text Rendering
Morphological Analysis / Morphological Synthesis
POS Tagging / Phrase Generation
Parsing / Role Ordering
Semantic Analysis / Lexical Choice
Discourse Analysis / Discourse Planning
Lecture Contents
Morphological Analysis
Part-of-Speech (POS) Tagging
NLTK & Python
Morphology
Morphology is the branch of linguistics that studies the structure of words. In English and many other languages, many words can be broken down into parts. For example:
unhappiness → un-happi-ness
horses → horse-s
walking → walk-ing
Morphology
un- carries a negative meaning
-ness expresses a state or quality
-s expresses plurality
-ing conveys a sense of duration
A word like “yes”, however, has no internal grammatical structure. We can analyze the sounds, but none of them has any meaning in isolation.
Morphology
The smallest units with a meaning or grammatical function into which words can be broken down are known as morphemes. So to be clear: “un” is a morpheme; “yes” is also a morpheme, but it also happens to be a word.
Morphology
There are several important distinctions to make when it comes to morphemes:
(1) – Free vs. Bound morphemes
Free morphemes are morphemes which can stand alone. We have already seen the example of “yes”.
Morphology
Bound morphemes never exist as words themselves but are always attached to some other morpheme. We have already seen the example of “un”. When we identify the number and types of morphemes that a given word consists of, we are looking at what is referred to as the structure of a word.
Morphology
Every word has at least one free morpheme, which is referred to as the root, stem, or base. Bound morphemes can be further divided into three categories:
prefix: un-happy
suffix: happi-ness
infix: abso-blooming-lutely
The general term for all three is affix.
Morphology
(2) – Derivational vs. Inflectional morphemes
Derivational morphemes create or derive new words by changing the meaning or the word class of the word. For example: happy → unhappy. Both words are adjectives, but the meaning changes.
Morphology
quick → quickness
The affix changes both meaning and word class, turning an adjective into a noun. In English, derivational morphemes can be either prefixes or suffixes.
Morphology
Inflectional morphemes don’t alter the meaning or word class of a word; instead, they only refine and give extra grammatical information about the word’s existing meaning. For example:
cat → cats
walk → walking
Morphology
In English, inflectional morphemes are all suffixes (by chance; in other languages this is not true). There are only 8 inflectional morphemes in English:
-s (plural), -’s (possessive), -s (3rd person singular), -ed (past tense), -ing (progressive), -en (past participle), -er (comparative), -est (superlative)
Morphology
Inflectional morphemes are required by syntax; that is, they indicate syntactic or semantic relations between different words in a sentence. For example: Nim loves bananas. but They love bananas.
Morphology
Derivational morphemes are different in that syntax does not require their presence; they do, however, indicate semantic relations within a word (that is, they change the meaning of the word). For example: kind → unkind. He is unkind. They are unkind.
What is Morphology?
The study of words: their internal structure and how they are formed. Morphology tries to formulate rules:
washing → wash + -ing
bat → bats, rat → rats
write → writer, browse → browser
Morphology in NLP
Analysis vs. synthesis: what does “dogs” mean? vs. what is the plural of “dog”?
Analysis:
Need to identify the lexeme (tokenization) to access lexical information
Inflections (etc.) carry information that will be needed by other processes (e.g., agreement is useful in parsing; inflections can carry meaning, e.g., tense, number)
Morphology can be ambiguous; other processes may be needed to disambiguate (e.g., German -en)
Synthesis:
Need to generate appropriate inflections from an underlying representation
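The analysis direction above can be sketched with a toy suffix-stripping analyzer. This is a hypothetical illustration only: the suffix rules and feature labels are made up for the example and are nothing like a full morphological analyzer. Note how a single surface form can get several analyses, which is exactly the ambiguity mentioned above.

```python
# Toy inflectional analyzer: maps a surface form to (lemma, feature) pairs.
# The suffix rules and feature names are illustrative assumptions, not a
# complete treatment of English morphology.

SUFFIX_RULES = [
    ("ies", "y", "+PL"),    # ponies -> pony
    ("ing", "", "+PROG"),   # walking -> walk
    ("ed", "", "+PAST"),    # walked -> walk
    ("s", "", "+PL"),       # dogs -> dog
]

def analyze(word):
    """Return candidate (lemma, feature) analyses; ambiguity yields several."""
    analyses = [(word, "+BASE")]          # the word may itself be a lemma
    for suffix, replacement, feature in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            analyses.append((word[: -len(suffix)] + replacement, feature))
    return analyses

print(analyze("dogs"))     # [('dogs', '+BASE'), ('dog', '+PL')]
print(analyze("walking"))  # [('walking', '+BASE'), ('walk', '+PROG')]
```

A later process (a POS tagger or parser) would pick the right analysis from the candidates, as the slide notes for ambiguous cases.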
What is Part-of-Speech (POS)?
Generally speaking, word classes (= POS): Verb, Noun, Adjective, Adverb, Article, …
We can also include inflection:
Verbs: tense, number, …
Nouns: number, proper/common, …
Adjectives: comparative, superlative, …
Parts of Speech
8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
Also called: parts-of-speech, lexical categories, word classes, morphological classes, lexical tags, …
There is much debate within linguistics about the number, nature, and universality of these; we’ll completely ignore this debate.
7 Traditional POS Categories
N (noun): chair, bandwidth, pacing
V (verb): study, debate, munch
ADJ (adjective): purple, tall, ridiculous
ADV (adverb): unfortunately, slowly
P (preposition): of, by, to
PRO (pronoun): I, me, mine
DET (determiner): the, a, that, those
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a collection.
WORD   TAG
the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET
table  N
Penn TreeBank POS Tag Set
Penn Treebank: hand-annotated corpus of the Wall Street Journal, 1M words, 46 tags
Some particularities:
to/TO not disambiguated
Auxiliaries and verbs not distinguished
Penn Treebank Tagset
Why Is POS Tagging Useful?
Speech synthesis: how to pronounce “lead”? INsult vs. inSULT, OBject vs. obJECT, OVERflow vs. overFLOW, DIScount vs. disCOUNT, CONtent vs. conTENT
Stemming for information retrieval: a search for “aardvarks” can also get “aardvark”
Parsing and speech recognition, etc.: possessive pronouns (my, your, her) are followed by nouns; personal pronouns (I, you, he) are likely to be followed by verbs; need to know if a word is an N or V before you can parse
Information extraction: finding names, relations, etc.
Machine translation
Open and Closed Classes
Closed class: a small, fixed membership
Prepositions: of, in, by, …
Auxiliaries: may, can, will, had, been, …
Pronouns: I, you, she, mine, his, them, …
Usually function words (short common words which play a role in grammar)
Open class: new ones can be created all the time
English has 4: nouns, verbs, adjectives, adverbs
Many languages have these 4, but not all!
Open Class Words
Nouns
Proper nouns (Boulder, Granby, Eli Manning): English capitalizes these
Common nouns (the rest): count nouns and mass nouns
Count: have plurals, get counted: goat/goats, one goat, two goats
Mass: don’t get counted (snow, salt, communism) (*two snows)
Adverbs: tend to modify things
Unfortunately, John walked home extremely slowly yesterday
Directional/locative adverbs (here, home, downhill)
Degree adverbs (extremely, very, somewhat)
Manner adverbs (slowly, slinkily, delicately)
Verbs
In English, have morphological affixes (eat/eats/eaten)
Closed Class Words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
Prepositions from CELEX
English Particles
Conjunctions
POS Tagging: Choosing a Tagset
There are many potential part-of-speech distinctions we could draw. To do POS tagging, we need to choose a standard set of tags to work with.
Could pick a very coarse tagset: N, V, Adj, Adv
More commonly used is a finer-grained set, the “Penn TreeBank tagset”, 45 tags: PRP$, WRB, WP$, VBG, …
Even more fine-grained tagsets exist
Using the Penn Tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP …”), except the preposition/complementizer “to”, which is just marked TO.
POS Tagging
Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
(These examples are from Dekang Lin.)
How Hard is POS Tagging? Measuring Ambiguity
Current Performance
How many tags are correct? About 97% currently, but the baseline is already 90%.
Baseline algorithm: tag every word with its most frequent tag; tag unknown words as nouns.
How well do people do? Human annotators agree with each other roughly 97% of the time, so current taggers are close to the human ceiling.
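The baseline algorithm above can be sketched in a few lines of Python. This is a toy illustration; the three-sentence training corpus is made up, and a real baseline would be trained on a tagged corpus like the Penn Treebank.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from a tagged corpus."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, model, unknown_tag="NN"):
    """Tag each word with its most frequent tag; unknown words become nouns."""
    return [(w, model.get(w, unknown_tag)) for w in words]

# Tiny made-up training corpus illustrating the ambiguous word "back".
corpus = [
    [("the", "DT"), ("back", "NN"), ("door", "NN")],
    [("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
]
model = train_baseline(corpus)
print(tag_baseline(["the", "back", "room"], model))
# [('the', 'DT'), ('back', 'NN'), ('room', 'NN')]
```

“back” gets NN (its most frequent tag, 2 of 3 occurrences) regardless of context, which is exactly why this baseline tops out around 90%: it can never get the RB reading in “win the voters back” right.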
Quick Test: Agreement?
the students went to class
plays well with others
fruit flies like a banana
(Tagset: DT = determiner (the, this, that), NN = noun, VB = verb, P = preposition, ADV = adverb)
Quick Test
the students went to class → DT NN VB P NN
plays well with others → VB ADV P NN, or NN NN P DT
fruit flies like a banana → NN NN VB DT NN, or NN VB P DT NN, or NN NN P DT NN, or NN VB VB DT NN
How to Do It? History
1960s: Brown Corpus created (EN-US), 1 million words (Henry Kučera & W. Nelson Francis, 1967): 500 sample texts from about 15 topics; about half of the total vocabulary appears only once. POS tagging was added later, using first the Greene and Rubin tagger (70% success rate), and was considered complete (as far as possible) only by the late seventies. 80 POS tags, plus indicators for compound forms, contractions, foreign words, etc.
Rule-based taggers: Klein and Simmons (1963); Greene and Rubin (1971), rule based, 70% success rate; Hindle (1989).
1970s–80s: LOB Corpus created and tagged (EN-UK), 1 million words; Brown Corpus tagged. HMM tagging (CLAWS), 93%–95%. POS tagging separated from other NLP.
1990s: Penn Treebank corpus (WSJ, 4.5M words). Transformation-based tagging (Eric Brill, 1992), rule based, 95%+ (most probable tag plus 2 heuristics for unknown words gives a 7.9% error rate). DeRose/Church: efficient HMM with sparse data, 95%+. Tree-based statistics (Helmut Schmid), rule based, 96%+. Neural network taggers, 96%+. Trigram tagger (Kempe), 96%+. British National Corpus tagged by CLAWS.
2000s: combined methods, 98%+.
Two Methods for POS Tagging
1. Rule-based tagging (ENGTWOL)
2. Stochastic: probabilistic sequence models
HMM (Hidden Markov Model) tagging
MEMMs (Maximum Entropy Markov Models)
Rule-Based Tagging
Start with a dictionary
Assign all possible tags to words from the dictionary
Write rules by hand to selectively remove tags, leaving the correct tag for each word
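These steps can be sketched as a toy in plain Python. This is a hypothetical miniature, not ENGTWOL: the dictionary covers only the slide’s example sentence and the two hand-written rules are made up for illustration.

```python
# Step 1: a dictionary assigns every possible tag to each word.
LEXICON = {
    "she": {"PRP"},
    "promised": {"VBN", "VBD"},
    "to": {"TO"},
    "back": {"VB", "JJ", "RB", "NN"},
    "the": {"DT"},
    "bill": {"NN", "VB"},
}

def rule_based_tag(words):
    """Look up all candidate tags, then prune them with hand-written rules."""
    candidates = [set(LEXICON.get(w, {"NN"})) for w in words]  # unknown -> NN
    # Step 2: hand-written rules selectively remove tags.
    for i in range(1, len(words)):
        # Rule: after "to", keep only the base-form verb reading if present.
        if words[i - 1] == "to" and "VB" in candidates[i]:
            candidates[i] = {"VB"}
        # Rule: an unambiguous determiner is followed by a noun, not a verb.
        if candidates[i - 1] == {"DT"}:
            candidates[i] -= {"VB", "VBD"}
    return list(zip(words, candidates))

print(rule_based_tag(["she", "promised", "to", "back", "the", "bill"]))
```

After the rules fire, “back” is narrowed from four candidates down to VB and “bill” to NN, while “promised” stays ambiguous between VBN and VBD: real systems need many more rules (or statistics) to finish the job.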
Rule-Based Taggers
Early POS taggers were all hand-coded. Most of these (Harris, 1962; Greene and Rubin, 1971) and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture:
Stage 1: look up each word in a lexicon to get a list of potential POSs
Stage 2: apply rules which certify or disallow tag sequences
Rules were originally handwritten; more recently, machine learning methods can be used.
Start With a Dictionary
she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
… and so on for the ~100,000 words of English with more than one tag
Assign Every Possible Tag
She: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
She promised to back the bill
POS Taggers
Brill’s tagger
TnT tagger
Stanford tagger
SVMTool
GENIA tagger
More complete list at:
Treebanks
Treebanks are corpora in which each sentence has been paired with a parse tree (presumably the right one). These are generally created by first parsing the collection with an automatic parser and then having human annotators correct each parse as necessary. This generally requires detailed annotation guidelines that provide a POS tagset, a grammar, and instructions for how to deal with particular grammatical constructions.
Penn Treebank
The Penn TreeBank is a widely used treebank. Most well known is its Wall Street Journal section: 1M words from the Wall Street Journal.
Treebank Grammars
Treebanks implicitly define a grammar for the language covered in the treebank: simply take the local rules that make up the sub-trees in all the trees in the collection and you have a grammar. It won’t be complete, but with a decent-size corpus you’ll have a grammar with decent coverage.
Treebank Grammars
Such grammars tend to be very flat, because they tend to avoid recursion (to ease the annotators’ burden). For example, the Penn Treebank has 4500 different rules for VPs. Among them...
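Reading a grammar off a treebank can be sketched like this: parse each bracketed tree and count one local rule per internal node. This is a toy with a single made-up tree; a real treebank supplies thousands of trees, and the Counter values then become rule frequencies.

```python
from collections import Counter

def parse_tree(s):
    """Parse a bracketed tree string like (S (NP (DT the) (NN dog)) ...)."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        label = tokens[pos + 1]          # tokens[pos] is "("
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                children.append(child)
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        return (label, children), pos + 1
    tree, _ = read(0)
    return tree

def extract_rules(tree, rules):
    """Collect one local rule per internal node: label -> child labels."""
    label, children = tree
    rhs = [c[0] if isinstance(c, tuple) else c for c in children]
    if any(isinstance(c, tuple) for c in children):   # skip lexical rules
        rules[(label, tuple(rhs))] += 1
        for c in children:
            if isinstance(c, tuple):
                extract_rules(c, rules)

rules = Counter()
extract_rules(parse_tree("(S (NP (DT the) (NN dog)) (VP (VBD barked)))"), rules)
for (lhs, rhs), count in rules.items():
    print(lhs, "->", " ".join(rhs), count)
```

From the single example tree this yields S → NP VP, NP → DT NN, and VP → VBD, each with count 1; over a whole treebank the same procedure produces the flat, high-coverage grammar the slide describes.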
NLTK tagger classes
▫ DefaultTagger
▫ RegexpTagger
▫ N-gram taggers
▫ AffixTagger
▫ BrillTagger
▫ HMM tagger
DefaultTagger
• Assigns the same tag to all words (a good baseline)
tagger = nltk.DefaultTagger('NN')
print(tagger.tag(['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', …]))
[('Pierre', 'NN'), ('Vinken', 'NN'), (',', 'NN'), ('61', 'NN'), ('years', 'NN'), ('old', 'NN'), (',', 'NN'), ('will', 'NN'), ('join', 'NN'), …]
RegexpTagger
• Assigns tags based on regular expressions
▫ E.g. morphological structure
• Patterns are processed in order; the first one that matches is applied
patterns = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*able$', 'JJ'),                # adjectives
    (r'.*ness$', 'NN'),                # nouns formed from adjectives
    (r'.*ly$', 'RB'),                  # adverbs
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # past tense verbs
    (r'^[A-Z].*s$', 'NNPS'),           # plural proper nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^[A-Z].*$', 'NNP'),             # singular proper nouns
    (r'.*', 'NN'),                     # singular nouns (default)
]
tagger = nltk.RegexpTagger(patterns)
print(tagger.tag(…))
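If NLTK is not at hand, the same first-match cascade can be reproduced with the standard `re` module alone. This is a minimal sketch; the patterns follow the slide, and the `regexp_tag` helper is my own illustrative name, not an NLTK API.

```python
import re

# Ordered (pattern, tag) pairs; the first match wins, as in nltk.RegexpTagger.
PATTERNS = [
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*able$', 'JJ'),                # adjectives
    (r'.*ness$', 'NN'),                # nouns formed from adjectives
    (r'.*ly$', 'RB'),                  # adverbs
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # past tense verbs
    (r'^[A-Z].*s$', 'NNPS'),           # plural proper nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^[A-Z].*$', 'NNP'),             # singular proper nouns
    (r'.*', 'NN'),                     # singular nouns (default)
]

def regexp_tag(words):
    """Tag each word with the tag of the first pattern that matches it."""
    def tag_one(word):
        for pattern, tag in PATTERNS:
            if re.match(pattern, word):
                return (word, tag)
    return [tag_one(w) for w in words]

print(regexp_tag(['42', 'happiness', 'quickly', 'walking', 'Pierre']))
# [('42', 'CD'), ('happiness', 'NN'), ('quickly', 'RB'), ('walking', 'VBG'), ('Pierre', 'NNP')]
```

Pattern order matters: “happiness” ends in -s, but the -ness pattern fires first, so it gets NN rather than NNS.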
END OF LECTURE