Albert Gatt LIN3022 Natural Language Processing Lecture 7.


In this lecture
We consider the task of Part of Speech tagging:
- information sources
- solutions using Markov models
- transformation-based learning

Part 1: POS tagging overview

The task (graphically)
Running text + tagset (list of possible tags) → Tagger → tagged text

The task
Assign each word in continuous text a tag indicating its part of speech. Essentially a classification problem.
Current state of the art: taggers typically have 96-97% accuracy, with the figure evaluated on a per-word basis. In a corpus with sentences of average length 20 words, 96% accuracy can mean one tagging error per sentence.

Ingredients for tagging
Tagset
- the list of possible tags
- tags indicate part of speech and morphosyntactic information
- the tagset represents a decision as to what is morphosyntactically relevant
Tokenisation
- text must be tokenised before it is tagged, because it is individual words (tokens) that are labelled

Some problems in tokenisation
There are a number of decisions that need to be taken, e.g.:
Do we treat the definite article as a clitic, or as a token?
  il-kelb
  DEF-dog
  One token or two?
Do we split nouns, verbs and prepositions with pronominal suffixes?
  qalib-ha
  overturn.3SgM-3SgF
  "he overturned her/it"
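To make the first decision concrete, here is a minimal sketch of the two tokenisation policies for the definite article clitic; the regular expression is an illustrative assumption, not a real Maltese tokeniser:

```python
import re

# The slide's example: "il-kelb" (DEF-dog). Is this one token or two?
word = "il-kelb"

# Policy 1: treat the whole form as a single token.
one_token = [word]

# Policy 2: split the definite article clitic "il-" off as a separate token
# (the pattern below is an illustrative assumption).
m = re.match(r"^(il-)(.+)$", word)
two_tokens = list(m.groups()) if m else [word]

print(one_token)   # ['il-kelb']
print(two_tokens)  # ['il-', 'kelb']
```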

POS Tagging example
From here: Kien tren Ġermaniż, modern u komdu,
...to here: Kien_VA3SMP tren_NNSM Ġermaniż_MJSM ,_PUN modern_MJSM u_CC komdu_MJSM ,_PUN

Sources of difficulty in POS tagging
Mostly due to ambiguity, when words have more than one possible tag:
- we need context to make a good guess about the POS
- but context alone won't suffice
A simple approach which assigns only the most common tag to each word already performs with 90% accuracy!

The information sources
1. Syntagmatic information: the tags of other words in the context of w
   - not sufficient on its own: e.g. Greene & Rubin (1971) describe a context-only tagger with only 77% accuracy
2. Lexical information ("dictionary"): the most common tag(s) for a given word
   - e.g. in English, many nouns can be used as verbs (flour the pan, wax the car…)
   - however, their most likely tag remains NN
   - the distribution of a word's usages across different POSs is uneven: usually one tag is highly likely, the others much less so

Tagging in other languages (than English)
In English, high reliance on context is a good idea, because of fixed word order.
Free word order languages make this assumption harder.
Compensation: these languages typically have rich morphology, which is a good source of clues for a tagger.

Some approaches to tagging
Fully rule-based: involves writing rules which take into account both context and morphological information. There are a few good rule-based taggers around.
Fully statistical: involves training a statistical model on a manually annotated corpus. The tagger is then applied to new data.
Transformation-based: basically a rule-based approach; however, the rules are learned automatically from manually annotated data.

Evaluation
Training a statistical POS tagger requires splitting the corpus into training and test data.
Evaluation is typically carried out against a gold standard, based on accuracy (% correct).
Ideally we compare the accuracy of our tagger with:
- a baseline (lower bound): the standard is to choose the unigram most likely tag
- a ceiling (upper bound): e.g. see how well humans do at the same task
  - humans apparently agree on 96-97% of tags
  - this means it is highly suspect for a tagger to get 100% accuracy
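As a point of reference, the unigram baseline mentioned above can be sketched in a few lines; the toy training data is invented for illustration:

```python
from collections import Counter, defaultdict

# Toy training data (invented): (word, tag) pairs.
train = [("the", "DET"), ("dog", "N"), ("barks", "V"),
         ("the", "DET"), ("cat", "N"), ("sleeps", "V"),
         ("dog", "N"), ("sleeps", "V")]

# Count how often each word receives each tag.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

# Baseline tagger: most frequent tag per word; unknown words default to "N"
# (the default is an arbitrary assumption).
def baseline_tag(word):
    return counts[word].most_common(1)[0][0] if word in counts else "N"

print([baseline_tag(w) for w in ["the", "dog", "sleeps", "unicorn"]])
# ['DET', 'N', 'V', 'N']
```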

Part 2: Markov models (some preliminaries)

Talking about the weather
Suppose we want to predict tomorrow's weather. The possible predictions are:
- sunny
- foggy
- rainy
We might decide to predict tomorrow's outcome based on earlier weather:
- if it's been sunny all week, it's likelier to be sunny tomorrow than if it had been rainy all week
- how far back do we want to go to predict tomorrow's weather?

Statistical weather model
Notation:
- S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i)
- X: a sequence of random variables, each taking a value from S with a certain probability; these model the weather over a sequence of days
- t is an integer standing for time
- (X_1, X_2, X_3, ..., X_T) models the value of a series of such variables; each takes a value from S with a certain probability, and the entire sequence tells us the weather over T days

Statistical weather model
If we want to predict the weather for day t+1, our model might look like this:
P(X_{t+1} = s_k | X_1, ..., X_t)
e.g. P(weather tomorrow = sunny), conditional on the weather in the past t days.
Problem: the larger t gets, the more calculations we have to make.

Markov Properties I: Limited horizon
The probability that we're in state s_i at time t+1 only depends on where we were at time t:
P(X_{t+1} = s_i | X_1, ..., X_t) = P(X_{t+1} = s_i | X_t)
This assumption simplifies life considerably. If we want to calculate the probability of the weather over a sequence of days, X_1, ..., X_n, all we need to do is calculate:
P(X_1, ..., X_n) = P(X_1) × P(X_2 | X_1) × ... × P(X_n | X_{n-1})
We need initial state probabilities (P(X_1)) here.

Markov Properties II: Time invariance
The probability of being in state s_i given the previous state does not change over time:
P(X_{t+1} = s_i | X_t = s_j) is the same for all t.

Concrete instantiation
A table with a row for each of today's states (day t: sunny, rainy, foggy) and a column for each of tomorrow's states (day t+1: sunny, rainy, foggy). This is essentially a transition matrix, which gives us the probabilities of going from one state to the other.
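Such a transition matrix can be written down directly as nested dictionaries. This is only a sketch: the probability values below are illustrative assumptions, not the figures from the original slide; each row sums to 1.

```python
# Transition probabilities P(weather tomorrow | weather today).
# The numbers are illustrative assumptions; each row must sum to 1.
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

# Sanity check: every row is a proper probability distribution.
for today, row in transition.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, today

print(transition["sunny"]["rainy"])  # P(rainy tomorrow | sunny today) = 0.05
```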

Graphical view
Components of the model:
1. states (s)
2. transitions
3. transition probabilities
4. an initial probability distribution for states
Essentially, a non-deterministic finite state automaton.

Example continued
If the weather today (X_t) is sunny, what's the probability that tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy?
By the Markov assumption:
P(X_{t+1} = sunny, X_{t+2} = rainy | X_t = sunny) = P(X_{t+1} = sunny | X_t = sunny) × P(X_{t+2} = rainy | X_{t+1} = sunny)
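Under the Markov assumption the two-step probability is just a product of single-step transitions; a small check using the illustrative matrix from the previous sketch (same assumed values):

```python
# P(X_{t+1}=sunny, X_{t+2}=rainy | X_t=sunny)
#   = P(sunny | sunny) * P(rainy | sunny)      (Markov assumption)
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}

p = transition["sunny"]["sunny"] * transition["sunny"]["rainy"]
print(round(p, 4))  # 0.8 * 0.05 = 0.04
```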

A slight variation on the example
You're locked in a room with no windows. You can't observe the weather directly; you only observe whether the guy who brings you food is carrying an umbrella or not.
We need a model telling us the probability of seeing the umbrella, given the weather: a distinction between observations and their underlying emitting state.
Define:
- O_t as an observation at time t
- K = {+umbrella, -umbrella} as the possible outputs
We're interested in P(O_t = k | X_t = s_i), i.e. the probability of a given observation at time t, given that the weather at t is in state s_i.

Concrete instantiation
Weather (hidden state)   Probability of umbrella (observation)
sunny                    0.1
rainy                    0.8
foggy                    0.3
This is the hidden model, telling us the probability that O_t = k given that X_t = s_i. We call this a symbol emission probability.

Using the hidden model
The model gives: P(O_t = k | X_t = s_i)
Then, by Bayes' Rule, we can compute: P(X_t = s_i | O_t = k) = P(O_t = k | X_t = s_i) P(X_t = s_i) / P(O_t = k)
This generalises easily to an entire sequence.
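A minimal sketch of that computation, using the emission probabilities from the table above; the prior over weather states is an added assumption (uniform), purely for illustration:

```python
# Emission probabilities P(umbrella | weather), from the slide's table.
p_umbrella_given_weather = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}

# Prior P(weather): an assumed uniform distribution, just for illustration.
prior = {"sunny": 1 / 3, "rainy": 1 / 3, "foggy": 1 / 3}

# Bayes' Rule: P(weather | umbrella)
#   = P(umbrella | weather) * P(weather) / P(umbrella)
joint = {w: p_umbrella_given_weather[w] * prior[w] for w in prior}
evidence = sum(joint.values())                    # P(umbrella)
posterior = {w: joint[w] / evidence for w in joint}

for w, p in posterior.items():
    print(f"P({w} | umbrella) = {p:.2f}")
# rainy comes out most probable, as we would expect.
```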

HMM in graphics
- circles indicate states
- arrows indicate probabilistic dependencies between states

HMM in graphics
- green nodes are hidden states
- each hidden state depends only on the previous state (Markov assumption)

Why HMMs?
HMMs are a way of thinking of underlying events probabilistically generating surface events.
Parts of speech: a POS is a class or set of words. We can think of language as an underlying Markov chain of parts of speech from which actual words are generated.

HMMs in POS Tagging
The hidden layer (constructed through training) consists of states such as DET, ADJ, N and V, with transitions between them. It models the sequence of POSs in the training corpus.

HMMs in POS Tagging
In the example, the hidden states DET, ADJ, N and V emit the words the, tall, lady and is respectively.
- Observations are words.
- They are "emitted" by their corresponding hidden state.
- Each state depends on its previous state.

Why HMMs
There are efficient algorithms to train HMMs, using an algorithm called Expectation Maximisation.
General idea:
- the training data is assumed to have been generated by some HMM (parameters unknown)
- we try to learn the unknown parameters from the data
A similar idea is used in finding the parameters of some n-gram models.

Why HMMs
In tasks such as POS tagging, we have a sequence of tokens, and we need to "discover" or "guess" the most likely sequence of tags that gave rise to those tokens.
Given the model learned from data, there are efficient algorithms for finding the most probable sequence of underlying states (tags) that gave rise to the sequence of observations (tokens).

Crucial ingredients (familiar)
- Underlying states: e.g. our tags
- Output alphabet (observations): e.g. the words in our language
- State transition probabilities: e.g. the probability of one tag following another tag

Crucial ingredients (additional)
- Initial state probabilities: tell us the initial probability of each state. E.g. what is the probability that at the start of the sequence we see a DET?
- Symbol emission probabilities: tell us the probability of seeing an observation, given that we were previously in a state x and are now looking at a state y. E.g. given that I'm looking at the word man, and that previously I had a DET, what is the probability that man is a noun?

HMMs in POS Tagging
In the example, the hidden states DET, ADJ and N emit the words the, tall and lady. The arrows between states carry the state transition probabilities; the arrows from states to words carry the symbol emission probabilities.
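Putting the ingredients together, a toy HMM for this example could be written down as follows. This is a sketch: all the probability values are invented for illustration and are not estimated from any corpus.

```python
# A toy HMM for POS tagging. All probabilities below are invented; emission
# rows do not sum to 1 because the remaining mass belongs to words not shown.
states = ["DET", "ADJ", "N"]

initial = {"DET": 0.8, "ADJ": 0.1, "N": 0.1}          # P(tag at position 1)

transition = {                                         # P(tag_i | tag_{i-1})
    "DET": {"DET": 0.0, "ADJ": 0.6, "N": 0.4},
    "ADJ": {"DET": 0.0, "ADJ": 0.1, "N": 0.9},
    "N":   {"DET": 0.2, "ADJ": 0.1, "N": 0.7},
}

emission = {                                           # P(word | tag)
    "DET": {"the": 0.9},
    "ADJ": {"tall": 0.8},
    "N":   {"lady": 0.6, "tall": 0.1},
}

# Probability that the state sequence DET ADJ N generates "the tall lady".
tags, words = ["DET", "ADJ", "N"], ["the", "tall", "lady"]
p = initial[tags[0]] * emission[tags[0]][words[0]]
for i in range(1, len(tags)):
    p *= transition[tags[i - 1]][tags[i]] * emission[tags[i]][words[i]]
print(p)   # 0.8 * 0.9 * 0.6 * 0.8 * 0.9 * 0.6 ≈ 0.19
```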

Part 3: Markov-model taggers

Using Markov models
Basic idea: sequences of tags are a Markov chain.
- Limited horizon assumption: it is sufficient to look at the previous tag for information about the current tag. Note: this is the Markov assumption already encountered in language models!
- Time invariance: the probability of a tag transition remains the same over time.

Implications/limitations
- Limited horizon: ignores long-distance dependencies, e.g. can't deal with WH-constructions. Chomsky (1957): this was one of the reasons cited against probabilistic approaches.
- Time invariance: e.g. P(finite verb | pronoun) is constant, but we may be more likely to find a finite verb following a pronoun at the start of a sentence than in the middle!

Notation
- We let t_i range over tags.
- We let w_i range over words.
- Subscripts denote position in a sequence.
The limited horizon property becomes:
P(t_{i+1} | t_1, ..., t_i) = P(t_{i+1} | t_i)
(the probability of a tag t_{i+1} given all the previous tags t_1..t_i is assumed to be equal to the probability of t_{i+1} given just the previous tag)

Basic strategy
From a training set of manually tagged text, extract the probabilities of tag sequences (state transitions): e.g. using the Brown Corpus, P(NN|JJ) = 0.45, while P(VBP|JJ) is far smaller.
Next step: estimate the word/tag (symbol emission) probabilities P(w_l | t_j).

Training the tagger: basic algorithm
1. Estimate the probability of all possible sequences of 2 tags in the tagset from the training data.
2. For each tag t_j and for each word w_l, estimate P(w_l | t_j).
3. Apply smoothing.
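A minimal sketch of steps 1 and 2 as maximum-likelihood counts over an invented toy corpus (step 3, smoothing, is only indicated):

```python
from collections import Counter, defaultdict

# Toy tagged corpus (invented): sentences as lists of (word, tag) pairs.
corpus = [[("the", "DET"), ("tall", "ADJ"), ("lady", "N")],
          [("the", "DET"), ("lady", "N"), ("sleeps", "V")]]

tag_bigrams, tag_counts, word_tag = Counter(), Counter(), defaultdict(Counter)

for sentence in corpus:
    tags = [t for _, t in sentence]
    for w, t in sentence:
        tag_counts[t] += 1
        word_tag[t][w] += 1
    for prev, nxt in zip(tags, tags[1:]):
        tag_bigrams[(prev, nxt)] += 1

# Step 1: transition probabilities P(t_k | t_j) = C(t_j, t_k) / C(t_j).
def p_trans(t_next, t_prev):
    return tag_bigrams[(t_prev, t_next)] / tag_counts[t_prev]

# Step 2: emission probabilities P(w_l | t_j) = C(w_l, t_j) / C(t_j).
def p_emit(word, tag):
    return word_tag[tag][word] / tag_counts[tag]

print(p_trans("ADJ", "DET"))   # 0.5
print(p_emit("lady", "N"))     # 1.0
# Step 3 (not shown): smooth these estimates to handle unseen events.
```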

Finding the best tag sequence
Given: a sentence of n words.
Find: t_{1..n}, the best n tags (states) that could have given rise to these words.
Conceptually, this means comparing the probability of all possible sequences of tags and choosing the sequence with the highest probability. In practice it is computed efficiently, using the Viterbi algorithm.
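A compact sketch of the Viterbi algorithm over HMMs of the kind defined in the earlier sketches (dictionaries of initial, transition and emission probabilities); the tiny-constant treatment of unseen events is a simplifying assumption, not proper smoothing:

```python
def viterbi(words, states, initial, transition, emission):
    """Return the most probable state (tag) sequence for `words` under an HMM
    given as dictionaries of initial, transition and emission probabilities."""
    UNK = 1e-6  # crude stand-in for smoothing of unseen events
    # delta[t][s]: probability of the best path ending in state s at position t
    # back[t][s]:  the predecessor of s on that best path
    delta = [{s: initial.get(s, UNK) * emission.get(s, {}).get(words[0], UNK)
              for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            prev_best, p_best = max(
                ((prev, delta[t - 1][prev] * transition.get(prev, {}).get(s, UNK))
                 for prev in states),
                key=lambda pair: pair[1])
            delta[t][s] = p_best * emission.get(s, {}).get(words[t], UNK)
            back[t][s] = prev_best
    # Trace back from the most probable final state.
    state = max(delta[-1], key=delta[-1].get)
    path = [state]
    for t in range(len(words) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return list(reversed(path))

# With the toy model from the earlier sketch:
# viterbi(["the", "tall", "lady"], states, initial, transition, emission)
# returns ['DET', 'ADJ', 'N'].
```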

Some observations
The model is a Hidden Markov Model: we only observe words when we tag.
During training, however, we actually have a visible Markov model, because the training corpus provides both words and tags.

Part 4: Transformation-based error-driven learning

Transformation-based learning
Approach proposed by Brill (1995):
- uses quantitative information at the training stage
- the outcome of training is a set of rules
- tagging is then symbolic, using the rules
Components:
- a set of transformation rules
- a learning algorithm

Transformations
General form: t1 → t2, i.e. "replace t1 with t2 if certain conditions are satisfied".
Examples:
- Morphological: change the tag from NN to NNS if the word has the suffix "s"
  dogs_NN → dogs_NNS
- Syntactic: change the tag from NN to VB if the word occurs after "to"
  to_TO go_NN → to_TO go_VB
- Lexical: change the tag to JJ if deleting the prefix "un" results in a word
  uncool_XXX → uncool_JJ
  but uncle_NN does not become uncle_JJ (deleting "un" does not leave a word)
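To make the rule format concrete, here is a minimal sketch of the syntactic transformation above applied to a tagged sentence; the data structures and the example sentence are assumptions for illustration:

```python
def nn_to_vb_after_to(tagged):
    """Apply 'change NN to VB if the previous tag is TO' to a tagged sentence,
    given as a list of (word, tag) pairs."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "NN" and out[i - 1][1] == "TO":
            out[i] = (word, "VB")
    return out

sentence = [("she", "PRP"), ("wants", "VBZ"), ("to", "TO"), ("go", "NN")]
print(nn_to_vb_after_to(sentence))
# [('she', 'PRP'), ('wants', 'VBZ'), ('to', 'TO'), ('go', 'VB')]
```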

Learning
- Unannotated text is passed to an initial state annotator, e.g. one that assigns each word its most frequent tag in a dictionary.
- Truth: a manually annotated version of the corpus against which to compare.
- Learner: learns rules by comparing the initial state to the truth; the output is a set of rules.

Learning algorithm
A simple iterative process:
- apply a rule to the corpus
- compare to the truth
- if the error rate is reduced, keep the results
- continue
A priori specifications:
- how the initial state annotator works
- the space of possible transformations (Brill (1995) used a set of initial templates)
- the function to compare the result of applying the rules to the truth
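The loop can be sketched as a greedy search over candidate rules. This is a simplified assumption rather than Brill's actual implementation: rules here look only at tag sequences, and error rate is plain disagreement with the gold standard.

```python
def errors(current, truth):
    """Number of positions where the current tagging disagrees with the truth."""
    return sum(1 for c, t in zip(current, truth) if c != t)

def tbl_learn(current, truth, candidate_rules, max_rules=10):
    """Greedy transformation-based learning (simplified): repeatedly pick the
    candidate rule that most reduces the error rate, apply it, and continue
    until no rule helps. Rules are functions from a tag list to a tag list."""
    learned = []
    for _ in range(max_rules):
        best_rule, best_gain = None, 0
        for rule in candidate_rules:
            gain = errors(current, truth) - errors(rule(current), truth)
            if gain > best_gain:
                best_rule, best_gain = rule, gain
        if best_rule is None:            # no rule reduces the error rate: stop
            break
        current = best_rule(current)     # keep the result and continue
        learned.append(best_rule)
    return learned, current

# Tiny illustration: one candidate rule retags NN as VB after TO.
def nn_to_vb_after_to(tags):
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == "NN" and out[i - 1] == "TO":
            out[i] = "VB"
    return out

initial_tags = ["PRP", "TO", "NN"]       # output of an initial-state annotator (assumed)
gold_tags    = ["PRP", "TO", "VB"]
rules, result = tbl_learn(initial_tags, gold_tags, [nn_to_vb_after_to])
print(result)                            # ['PRP', 'TO', 'VB']
```

In the real algorithm, the candidate rules are instantiated from the templates described in the following slides, and their conditions can refer to both words and tags.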

Non-lexicalised rule templates
These take only tags into account, not the shape of words.
Change tag a to tag b when:
1. The preceding (following) word is tagged z.
2. The word two before (after) is tagged z.
3. One of the three preceding (following) words is tagged z.
4. The preceding (following) word is tagged z and the word two before (after) is tagged w.
5. …

Lexicalised rule templates
These take into account specific words in the context.
Change tag a to tag b when:
1. The preceding (following) word is w.
2. The word two before (after) is w.
3. The current word is w, the preceding (following) word is w2 and the preceding (following) tag is t.
4. …

Morphological rule templates
Useful for completely unknown words; sensitive to the word's "shape".
Change the tag of an unknown word (from X) to Y if:
1. Deleting the prefix (suffix) x, |x| ≤ 4, results in a word.
2. The first (last) (1,2,3,4) characters of the word are x.
3. Adding the character string x as a prefix (suffix) results in a word (|x| ≤ 4).
4. Word w ever appears immediately to the left (right) of the word.
5. Character z appears in the word.
6. …

Order-dependence of rules
Rules are triggered by environments satisfying their conditions, e.g. "A → B if the preceding tag is A".
Suppose our sequence is "AAAA". There are two possible forms of rule application:
- immediate effect: applications of the same transformation can influence each other; result: ABAB
- delayed effect: the rule is triggered multiple times from the same initial input; result: ABBB. Brill (1995) opts for this solution.
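The two application modes can be reproduced in a few lines, using the rule and the sequence from this slide:

```python
def apply_immediate(seq):
    """'A -> B if the preceding tag is A', re-reading the sequence as we change it."""
    seq = list(seq)
    for i in range(1, len(seq)):
        if seq[i] == "A" and seq[i - 1] == "A":
            seq[i] = "B"
    return "".join(seq)

def apply_delayed(seq):
    """Same rule, but all triggering positions are found on the original input
    and changed afterwards (Brill 1995's choice)."""
    seq = list(seq)
    triggers = [i for i in range(1, len(seq)) if seq[i] == "A" and seq[i - 1] == "A"]
    for i in triggers:
        seq[i] = "B"
    return "".join(seq)

print(apply_immediate("AAAA"))  # ABAB
print(apply_delayed("AAAA"))    # ABBB
```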

More on transformation-based tagging
It can be used for unsupervised learning: as with HMM-based tagging, the only information available is the set of allowable tags for each word.
It takes advantage of the fact that most words have only one tag. E.g. word can = NN in the context AT ___ BEZ, because most other words in this context are NN; therefore, the learning algorithm would learn the rule "change tag to NN in context AT ___ BEZ".
The unsupervised method achieves 95.6% accuracy!

Summary
Statistical tagging relies on Markov assumptions, just as language modelling does.
Today, we introduced two types of Markov models (visible and hidden) and saw the application of HMMs to tagging.
HMM taggers rely on training data to compute probabilities:
- the probability that a tag follows another tag (state transition)
- the probability that a word is "generated" by some tag (symbol emission)
This approach contrasts with transformation-based learning, where a corpus is used during training, but the tagging itself is then rule-based.