From Textual Information to Numerical Vectors Chapters 2.7-2.13 Presented by Aaron Hagan.


Text Mining Supplements the human reader with automatic systems undeterred by the text explosion. It involves analyzing a large collection of documents to discover previously unknown information. The information might be relationships or patterns that are buried in the document collection and which would otherwise be extremely difficult, if not impossible, to discover.

What is Covered Part-of-speech tagging classifies words into categories such as noun, verb, or adjective. Word sense disambiguation identifies the meaning of a word, given its usage, from among the multiple meanings that the word may have. Parsing performs a grammatical analysis of a sentence. Shallow parsers identify only the main grammatical elements in a sentence, such as noun phrases and verb phrases, whereas deep parsers generate a complete representation of the grammatical structure of a sentence.

Motivation Up until now we have been dealing with individual words and simple-minded (though useful) notions of which sequences of words are likely. Now we turn to the study of how words:
– are clustered into classes
– group with their neighbors to form phrases and sentences
– depend on other words
Interesting notions: word order, constituency, grammatical relations. Today: syntactic word classes and part-of-speech tagging.

Part-Of-Speech Tagging This step begins once text has been broken into tokens and sentences. If no linguistic analysis is necessary, one might proceed directly to feature generation, in which the "features" are obtained from the tokens. If the goal is more specific, such as recognizing names of people, places, and organizations, it is usually desirable to perform additional linguistic analyses of the text to extract more sophisticated features. The task is to find the POS for each token: words are organized into grammatical classes, or parts of speech. English has nouns, verbs, adjectives, adverbs, prepositions, and conjunctions.

History of POS Tagging Research on part-of-speech tagging has been closely tied to corpus linguistics. The first major corpus of English for computer analysis was the Brown Corpus, developed at Brown University by Henry Kučera and W. Nelson Francis in the mid-1960s. It consists of about 1,000,000 words of running English prose text, made up of 500 samples from randomly chosen publications; each sample is 2,000 words. In the mid-1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech while working to tag the Lancaster-Oslo-Bergen Corpus of British English. HMMs involve counting cases (such as from the Brown Corpus) and making a table of the probabilities of certain sequences.

CORPUS The Corpus of Contemporary American English (COCA) is the first large, balanced corpus of contemporary American English. The corpus contains more than 385 million words of text, including 20 million words from each year since 1990, and it is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts. The interface allows you to search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. You can search for surrounding words (collocates) within a ten-word window (e.g. all nouns somewhere near chain, all adjectives near woman, or all verbs near key). The corpus also allows you to limit searches by frequency and to compare the frequency of words, phrases, and grammatical constructions in at least two main ways:
– By genre: comparisons between spoken, fiction, popular magazines, newspapers, and academic texts, or even between sub-genres (or domains), such as movie scripts, sports magazines, newspaper editorials, or scientific journals
– Over time: compare different years from 1990 to the present

Penn Treebank Tag Set
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NP Proper noun, singular
15. NPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PP Personal pronoun
19. PP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb

Assigning POS to Tokens It is possible to tag POS manually, but ideally we want an automated system to identify POS. The most successful taggers are ones generated automatically by machine-learning algorithms from annotated corpora. Example: a tagger trained on the Wall Street Journal is well suited to certain types of data, but may not be ideal for something like messages. Much military funding has gone to tasks such as processing voluminous news sources; there is not much support for generating large training corpora in other domains.

Part-Of-Speech Dictionaries Dictionaries showing word-POS correspondences can be useful, but they are difficult to apply directly because several parts of speech can be tied to one word. Example: bore (noun), a tiresome person; bore (verb), to pierce with a turning or twisting movement of a tool. Example: Book/VB that/DT flight/NN. Tagging is a type of disambiguation: "book" can be NN or VB ("Can I read a book on this flight?"), and "that" can be a DT or a complementizer ("My travel agent said that there would be a meal on this flight"). The goal of POS tagging is to determine which of these possibilities is realized in a particular text instance.
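The word-to-tags correspondence above can be sketched as a lookup table. This is a minimal illustration with an invented mini-dictionary, not a real lexical resource: it shows why a dictionary alone cannot tag text, since it only lists candidate tags.

```python
# Hypothetical mini-dictionary: one word can map to several POS tags.
POS_DICT = {
    "book": {"NN", "VB"},    # "a book" vs. "book that flight"
    "that": {"DT", "IN"},    # determiner vs. complementizer
    "flight": {"NN"},
    "bore": {"NN", "VB"},
}

def candidate_tags(word):
    """Return the set of possible POS tags for a word (empty if unknown)."""
    return POS_DICT.get(word.lower(), set())

# Words with more than one candidate tag still need disambiguation.
ambiguous = [w for w in ["Book", "that", "flight"] if len(candidate_tags(w)) > 1]
print(ambiguous)  # ['Book', 'that']
```

A tagger's job is precisely to pick one tag from each candidate set using context.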

Approaches to POS Tagging
– Rule-based approach: uses handcrafted sets of rules to tag input sentences
– Statistical approaches: use a training corpus to compute the probability of a tag in a context
– Hybrid systems (e.g. Brill's transformation-based learning)

ENGTWOL (ENGlish TWO Level analysis) Rule-Based Tagger A two-stage architecture: first use a lexicon FST (dictionary) to tag each word with all possible POS, then apply hand-written rules to eliminate tags. The rules eliminate tags that are inconsistent with the context, and should reduce the list of POS tags to a single POS per word.

ENGTWOL Adverbial-that Rule Given input "that":
If the next word is an adjective, adverb, or quantifier, and following that is a sentence boundary, and the previous word is not a verb like "consider" which allows adjectives as object complements,
Then eliminate non-ADV tags,
Else eliminate the ADV tag.
I consider that odd. (that is NOT ADV)
It isn't that strange. (that IS ADV)
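The rule above can be sketched as a function. This is a simplified illustration, not the actual ENGTWOL implementation: the verb list, tag names, and the boolean context flags (which a real tagger would compute from the surrounding tags) are all assumptions.

```python
# Illustrative stand-in for ENGTWOL's list of verbs that allow
# adjectives as object complements.
VERBS_LIKE_CONSIDER = {"consider", "believe", "find"}

def adverbial_that_rule(words, i, next_is_adj_adv_or_quant, boundary_follows):
    """Return the set of tags to KEEP for 'that' at position i."""
    prev = words[i - 1].lower() if i > 0 else ""
    if (next_is_adj_adv_or_quant and boundary_follows
            and prev not in VERBS_LIKE_CONSIDER):
        return {"ADV"}          # eliminate non-ADV tags
    return {"DT", "IN", "WDT"}  # eliminate the ADV tag

# "It isn't that strange." -> 'that' kept as ADV
print(adverbial_that_rule(["It", "isn't", "that", "strange", "."], 2, True, True))
# "I consider that odd." -> ADV eliminated
print(adverbial_that_rule(["I", "consider", "that", "odd", "."], 2, True, True))
```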

Det-Noun Rule: If an ambiguous word follows a determiner, tag it as a noun.
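A minimal sketch of the Det-Noun rule, assuming a toy input format of (word, tag) pairs where an unresolved tag is None; the format and tag names are illustrative, not from a real tagger.

```python
def det_noun_rule(tagged):
    """tagged: list of (word, tag) pairs; tag is None if still ambiguous."""
    resolved = []
    for i, (word, tag) in enumerate(tagged):
        if tag is None and i > 0 and resolved[i - 1][1] == "DT":
            tag = "NN"  # the rule: determiner + ambiguous word -> noun
        resolved.append((word, tag))
    return resolved

print(det_noun_rule([("the", "DT"), ("can", None)]))
# [('the', 'DT'), ('can', 'NN')]
```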

Does it work? This approach does work and produces accurate results. What are the drawbacks? It is extremely labor-intensive.

Statistical Tagging Statistical (or stochastic) taggers use a training corpus to compute the probability of a tag in a context. For a given word sequence, Hidden Markov Model (HMM) taggers choose the tag sequence that maximizes P(word | tag) * P(tag | previous n tags). A bigram HMM tagger chooses the tag t_i for word w_i that is most probable given the previous tag t_{i-1}:
t_i = argmax_j P(t_j | t_{i-1}, w_i)
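The bigram scoring above can be sketched as follows: choose the tag t_j maximizing P(t_j | t_{i-1}) * P(w_i | t_j). The probability tables here are invented for illustration, not estimated from a real corpus.

```python
# Toy transition probabilities P(tag | previous tag) and
# emission probabilities P(word | tag) -- made-up numbers.
TRANS = {("DT", "NN"): 0.5, ("DT", "VB"): 0.1, ("DT", "MD"): 0.01}
EMIT = {("NN", "can"): 0.02, ("VB", "can"): 0.005, ("MD", "can"): 0.2}

def best_tag(prev_tag, word, tags=("NN", "VB", "MD")):
    """Return the tag maximizing P(tag | prev_tag) * P(word | tag)."""
    def score(t):
        return TRANS.get((prev_tag, t), 0.0) * EMIT.get((t, word), 0.0)
    return max(tags, key=score)

# After a determiner, "can" scores highest as a noun.
print(best_tag("DT", "can"))  # NN
```

Note this greedy, one-step choice is only a sketch; a full HMM tagger uses the Viterbi algorithm to maximize over the whole tag sequence at once.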

HMM Example For example, once you've seen an article such as "the", perhaps the next word is a noun 40% of the time, an adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more likely to be a noun than a verb or a modal. The same method can of course be used to benefit from knowledge about following words. More advanced ("higher-order") HMMs learn the probabilities not only of pairs, but of triples or even larger sequences. So, for example, if you've just seen an article and a verb, the next item is very likely a preposition, article, or noun, but much less likely another verb.

Statistical POS Tagging (Example) We can use probability theory for POS tagging. Suppose, with no context, we just want to know whether the word "flies" should be tagged as a noun or as a verb. We use conditional probability for this: we want to know which is greater, PROB(N | flies) or PROB(V | flies). Note the definition of conditional probability: PROB(a | b) = PROB(a & b) / PROB(b), where PROB(a & b) is the probability of the two events a and b occurring simultaneously.

Calculating POS for "flies" We need to know which is greater:
PROB(N | flies) = PROB(flies & N) / PROB(flies)
PROB(V | flies) = PROB(flies & V) / PROB(flies)
We use a corpus as the reference for estimating these probabilities.

Corpus to Estimate The corpus has 1,273,000 words, with 1000 uses of "flies": 400 in the N sense and 600 in the V sense.
PROB(flies) ≈ 1000/1,273,000 = .0008
PROB(flies & N) ≈ 400/1,273,000 = .0003
PROB(flies & V) ≈ 600/1,273,000 = .0005
Our best guess is that "flies" is a V:
PROB(V | flies) = PROB(V & flies) / PROB(flies) = .0005/.0008 = .625
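The estimate above can be reproduced directly from the counts. Note that the slide works with probabilities rounded to four decimals (.0005/.0008 = .625); computing straight from the counts, the corpus-size term cancels and P(V | flies) = 600/1000 = 0.6 exactly. Either way, the verb reading wins.

```python
# Counts taken from the corpus figures above.
TOTAL = 1_273_000
FLIES, FLIES_N, FLIES_V = 1000, 400, 600

p_flies = FLIES / TOTAL            # ~ .0008
p_flies_v = FLIES_V / TOTAL        # ~ .0005
p_v_given_flies = p_flies_v / p_flies  # TOTAL cancels: 600/1000

print(round(p_v_given_flies, 3))   # 0.6 -> tag "flies" as a verb
```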

Phrase Recognition Once tokens have been assigned POS tags, the next step is to group individual tokens into units called phrases. The idea is to create a "partial parse" of a sentence, as a step toward identifying the "named entities" occurring in it. Text parsing systems are supposed to scan a text and mark the beginning and end of phrases.

Phrase Recognition There are a number of conventions for marking, but the most common:
– Mark a word inside a phrase with I-, which can be extended with a code for the phrase type: I-NP, I-VP, etc.
– Mark a word at the beginning of a phrase adjacent to another phrase with B-, which can likewise be extended: B-NP, B-VP, etc.
– Mark a word outside any phrase with O.
A simple statistical approach looks at multiword tokens, searching for particular sequences of words that occur frequently enough in the corpora.
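Decoding the I-/B-/O convention above can be sketched as collecting maximal runs of same-type tags into phrases. The example sentence and its tags are illustrative.

```python
def extract_phrases(tokens, tags):
    """Return a list of (phrase_type, [words]) chunks from I-/B-/O tags."""
    phrases, current = [], None
    for word, tag in zip(tokens, tags):
        if tag == "O":
            current = None                      # outside any phrase
        elif tag.startswith("B-") or current is None or current[0] != tag[2:]:
            current = (tag[2:], [word])         # start a new phrase
            phrases.append(current)
        else:
            current[1].append(word)             # I- tag continues the phrase
    return phrases

toks = ["Book", "that", "flight", "to", "Boston"]
tags = ["I-VP", "I-NP", "I-NP", "O", "I-NP"]
print(extract_phrases(toks, tags))
# [('VP', ['Book']), ('NP', ['that', 'flight']), ('NP', ['Boston'])]
```

The B- branch is what separates two adjacent phrases of the same type, which plain I- tags could not distinguish from one long phrase.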

Named Entity Recognition A specialization of phrase finding: the recognition of particular types of proper noun phrases, specifically persons, organizations, and locations. These recognizers are important for intelligence applications (more on this in Chapter 6).

Parsing into Phrases A full parse of a sentence is usually done only in the most sophisticated kinds of text processing. Each word in the sentence has a relation to all the other words and to the main functions (subject, object, etc.) in the sentence. There are many different kinds of parses, each associated with a linguistic theory of the language.

Context-Free Parses A context-free parse is a tree of nodes in which the leaf nodes are the words of a sentence, the phrases into which the words are grouped are internal nodes, and there is one top node at the root of the tree, which has the label S. A number of algorithms exist for producing such a tree from the words of a sentence, with considerable research on constructing parsers from a statistical analysis of treebanks of sentences parsed by hand. A full parse provides information that phrase identification or partial parsing cannot provide.

Parse Tree Example Johnson was replaced at XYZ Corp by Smith.
(S
  (NP (N Johnson))
  (VP (AUX was) (PPART replaced)
      (PP (PREP at) (NP (PNOUN XYZ) (PNOUN Corp)))
      (PP (PREP by) (NP (PNOUN Smith)))))
From the linear order of phrases in a partial parse, one might wrongly conclude that Johnson replaced Smith; the full parse recovers the passive construction, in which Smith replaced Johnson.
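The parse above can be rebuilt as a nested-tuple tree, a minimal sketch of the data structure a context-free parser produces (node labels follow the slide, not a standard treebank):

```python
# Each node is (label, child, child, ...); leaves are plain strings.
tree = ("S",
        ("NP", ("N", "Johnson")),
        ("VP",
         ("AUX", "was"),
         ("PPART", "replaced"),
         ("PP", ("PREP", "at"), ("NP", ("PNOUN", "XYZ"), ("PNOUN", "Corp"))),
         ("PP", ("PREP", "by"), ("NP", ("PNOUN", "Smith")))))

def leaves(node):
    """Collect the words at the leaves of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:  # node[0] is the label
        words.extend(leaves(child))
    return words

print(" ".join(leaves(tree)))  # Johnson was replaced at XYZ Corp by Smith
```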

Feature Generation The reason for the linguistic processing is to identify features that can be useful for text mining. Features that might be useful in identifying the POS include: whether the first letter is capitalized (indicating a proper noun), whether all the characters are digits, periods, or commas (marking a number), and whether the characters alternate case (usually an abbreviation). A dictionary can also supply the possible parts of speech for a token.
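The token-level features just described can be sketched as a small extractor; the feature names are illustrative, not from any particular system.

```python
def token_features(token):
    """Surface features of a token useful for guessing its POS."""
    return {
        # First letter capitalized: possible proper noun.
        "init_cap": token[:1].isupper(),
        # Only digits, periods, and commas (with at least one digit): a number.
        "numeric": (any(c.isdigit() for c in token)
                    and all(c.isdigit() or c in ".," for c in token)),
        # Upper case after the first character plus lower case somewhere:
        # mixed-case token, often an abbreviation or product name.
        "mixed_case": (any(c.isupper() for c in token[1:])
                       and any(c.islower() for c in token)),
    }

print(token_features("Johnson"))
# {'init_cap': True, 'numeric': False, 'mixed_case': False}
```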

Feature Vector The feature vector for a document is assigned a set of classes. Feature vector examples: classifying periods as end-of-sentence markers, or identifying tokens as instances of titles, such as "Doctor" or "President".

Summary Part-of-speech tagging is an important step in natural language analysis; it is robust and fast, and works with 95-97% accuracy. Parsing (= full syntax analysis) is more error-prone than PoS tagging, but is important for getting to the meaning of a sentence.

References / Applications The Penn Treebank Project annotates naturally-occurring text for linguistic structure. Most notably, it produces skeletal parses showing rough syntactic and semantic information -- a bank of linguistic trees. Stanford Natural Language Processing Group -