Announcements The main CSE file server went down last night –Hand in your homework using ‘submit_cse467’ as soon as you can – no penalty if it is handed in today. By Friday (10/6) each team must tell me: –who is on the team –ideas for the project, with scale-up/scale-down plans Get together as a team to work while I am away: –10/12–10/13: Thursday – Friday next week –10/23–10/27: Monday – Friday
Part of speech (POS) tagging Tagging of words in a corpus with the correct part of speech, drawn from some tagset. Early automatic POS taggers were rule-based. Stochastic POS taggers are reasonably accurate.
Applications of POS tagging Parsing –recovering syntactic structure requires correct POS tags –partial parsing refers to any syntactic analysis which does not result in a full syntactic parse (e.g. finding noun phrases) - “parsing by chunks”
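As a concrete sketch of what “parsing by chunks” can look like, here is a toy noun-phrase chunker over already-tagged input; the pattern (optional determiner, any adjectives, one or more nouns) and the function name np_chunks are simplifications invented for this example, not a standard tool:

```python
def np_chunks(tagged):
    """Greedily collect (DT)? (JJ)* (NN|NNS)+ sequences as noun-phrase chunks."""
    chunks, i = [], 0
    while i < len(tagged):
        start = i
        if tagged[i][1] == "DT":
            i += 1
        while i < len(tagged) and tagged[i][1] == "JJ":
            i += 1
        nouns = 0
        while i < len(tagged) and tagged[i][1] in ("NN", "NNS"):
            i += 1
            nouns += 1
        if nouns:
            chunks.append(" ".join(w for w, _ in tagged[start:i]))
        else:
            i = start + 1          # no noun found here: move on by one word
    return chunks

sentence = [("a", "DT"), ("lovable", "JJ"), ("little", "JJ"), ("kitten", "NN"),
            ("eats", "VBZ"), ("food", "NN")]
print(np_chunks(sentence))   # ['a lovable little kitten', 'food']
```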
Applications of POS tagging Information extraction –fill slots in predefined templates with information –a full parse is not needed for this task, but partial parsing results (phrases) can be very helpful –information extraction uses grammatical category tags as a starting point for finding semantic categories
Applications of POS tagging Question answering –system responds to a user question with a noun phrase Who shot JR? (Kristen Shepard) Where is Starbucks? (UB Commons) What is good to eat here? (pizza)
Background on POS tagging How hard is tagging? –most words have just a single tag: easy –some words have more than one possible tag: harder –many common words are ambiguous Brown corpus: –10.4% of word types are ambiguous –40%+ of word tokens are ambiguous
Disambiguation approaches Rule-based –rely on large set of rules to disambiguate in context –rules are mostly hand-written Stochastic –rely on probabilities of words having certain tags in context –probabilities derived from training corpus Combined –transformation-based tagger: uses stochastic approach to determine initial tagging, then uses a rule-based approach to “clean up” the tags
Determining the appropriate tag for an untagged word Two types of information can be used: syntagmatic information –consider the tags of other words in the surrounding context –a tagger using only such information correctly tagged approx. 77% of words –problem: content words (which are the ones most likely to be ambiguous) typically have many parts of speech, via productive rules (e.g. N → V)
Determining the appropriate tag for an untagged word Lexical information –use information about the word itself (e.g. how likely it is to take each tag) –a baseline for tagger performance is given by a tagger that simply assigns the most common tag to ambiguous words –such a tagger correctly tags about 90% of words Modern taggers use a variety of information sources.
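A minimal sketch of that most-frequent-tag baseline in Python, assuming a toy training corpus (the words, tags, and counts below are invented purely for illustration):

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Count how often each word receives each tag in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    # Keep only each word's single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, most_common_tag, default="NN"):
    """Assign every word its most frequent training tag (a default for unknowns)."""
    return [(w, most_common_tag.get(w.lower(), default)) for w in words]

# Toy training data: 'race' appears twice as NN and once as VB.
corpus = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("long", "JJ"),
          ("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]
model = train_baseline(corpus)
print(tag_baseline(["to", "race"], model))   # [('to', 'TO'), ('race', 'NN')] -- wrong here
```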
Note about accuracy measures Modern taggers claim accuracy rates of around 96% to 97%. This sounds impressive, but how good are they really? Accuracy is measured at the level of individual words, not whole sentences. With 96% accuracy, 1 word out of 25 is tagged incorrectly, which at typical sentence lengths is roughly one tagging error per sentence.
Rule-based POS tagging Two-stage design: –first stage looks up individual words in a dictionary and tags words with sets of possible tags –second stage uses rules to disambiguate, resulting in singleton tag sets
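A toy sketch of the two-stage design in Python; the dictionary entries and the single hand-written rule are invented for illustration and are far simpler than a real rule-based tagger:

```python
# Stage 1: dictionary lookup assigns each word a *set* of possible tags.
LEXICON = {
    "the":  {"DT"},
    "can":  {"MD", "NN", "VB"},   # modal, noun ('a can'), or verb ('to can fruit')
    "fish": {"NN", "VB"},
}

def stage1(words):
    return [LEXICON.get(w, {"NN"}) for w in words]   # unknown words default to NN

# Stage 2: hand-written rules prune each set down to a single tag.
def stage2(words, tag_sets):
    tags = []
    for i, possible in enumerate(tag_sets):
        if len(possible) == 1:
            tags.append(next(iter(possible)))
        elif i > 0 and tags[-1] == "DT" and "NN" in possible:
            tags.append("NN")          # rule: after a determiner, prefer the noun reading
        else:
            tags.append(sorted(possible)[0])   # arbitrary fallback
    return tags

words = ["the", "can"]
print(list(zip(words, stage2(words, stage1(words)))))   # [('the', 'DT'), ('can', 'NN')]
```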
Stochastic POS tagging Stochastic taggers choose the tags that give the highest probability: P(word | tag) * P(tag | previous n tags) More generally, stochastic taggers maximize the probability of the entire tag sequence for a sentence.
Bigram stochastic tagger This kind of tagger “…chooses tag t_i for word w_i that is most probable given the previous tag t_{i-1} and the current word w_i: t_i = argmax_j P(t_j | t_{i-1}, w_i) (8.2)” [page 303] Bayes’ law says: P(T|W) = P(T) P(W|T) / P(W), so P(t_j | t_{i-1}, w_i) = P(t_j) P(t_{i-1}, w_i | t_j) / P(t_{i-1}, w_i) Since we take the argmax over the t_j and the denominator does not depend on t_j, the result is the same as maximizing: P(t_j) P(t_{i-1}, w_i | t_j) Rewriting, under the assumption that the word depends only on its own tag (not on the previous tag): t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)
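A sketch of that final decision rule in Python, assuming the transition probabilities P(t_j | t_{i-1}) and lexical likelihoods P(w_i | t_j) have already been estimated from a tagged training corpus; the table entries below are invented placeholders, not real corpus estimates:

```python
def bigram_tag(prev_tag, word, trans_prob, lex_prob, tagset):
    """Choose the tag t that maximizes P(t | prev_tag) * P(word | t)."""
    return max(tagset,
               key=lambda t: trans_prob.get((prev_tag, t), 0.0) *
                             lex_prob.get((word, t), 0.0))

# Invented placeholder estimates for a toy two-tag decision.
trans_prob = {("DT", "NN"): 0.50, ("DT", "VB"): 0.01}          # P(tag | previous tag)
lex_prob   = {("book", "NN"): 0.0003, ("book", "VB"): 0.0001}  # P(word | tag)

print(bigram_tag("DT", "book", trans_prob, lex_prob, {"NN", "VB"}))  # NN
```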
Example (page 304) What tag do we assign to race? –to/TO race/?? –the/DT race/?? In the first case, if we are choosing between NN and VB as tags for race, the quantities to compare are: –P(VB|TO)P(race|VB) –P(NN|TO)P(race|NN) The tagger will choose whichever tag maximizes the probability.
Example For the first part – look at the tag sequence probabilities: –P(NN|TO) = 0.021 –P(VB|TO) = 0.34 For the second part – look at the lexical likelihoods: –P(race|NN) = 0.00041 –P(race|VB) = 0.00003 Combining these: –P(VB|TO)P(race|VB) = 0.00001 –P(NN|TO)P(race|NN) = 0.000007 So race after to is correctly tagged as a verb.
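The same comparison as straight arithmetic, using the estimates quoted above:

```python
# Plugging the estimates into t_i = argmax_j P(t_j | t_{i-1}) P(w_i | t_j)
p_vb = 0.34  * 0.00003   # P(VB|TO) * P(race|VB)
p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
print(p_vb, p_nn)                      # ~1.0e-05 vs ~8.6e-06
print("VB" if p_vb > p_nn else "NN")   # VB wins: race is tagged as a verb
```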
English syntax What are some properties of English syntax we might want our formalism to capture? This depends on our goal: –processing written or spoken language? –modeling human behavior or not? Context-free grammar formalism
Things a grammar should capture As we have mentioned repeatedly, human language is an amazingly complex system of communication. Some properties of language which a (computational) grammar should reflect include: –Constituency –Agreement –Subcategorization / selectional restrictions
Constituency Phrases are syntactic equivalence classes: –they can appear in the same contexts –they are not semantic equivalence classes: they can clearly mean different things Ex (noun phrases) –Clifford the big red dog –the man from the city –a lovable little kitten
Constituency tests Can appear before a verb: –a lovable little kitten eats food –the man from the city arrived yesterday Other arbitrary word groupings cannot: –*from the arrived yesterday A string of words which is starred, like the one above, is considered ill-formed. Various gradations of acceptability are marked, such as ‘?’, ‘?*’, ‘*’, ‘**’; judgments are subjective.
More tests of constituency They also function as a unit with respect to syntactic processes: –On September seventeenth, I’d like to fly from Atlanta to Denver. –I’d like to fly on September seventeenth from Atlanta to Denver. –I’d like to fly from Atlanta to Denver on September seventeenth. Other groupings of words don’t behave the same: –* On September, I’d like to fly seventeenth from Atlanta to Denver. –* On I’d like to fly September seventeenth from Atlanta to Denver. –* I’d like to fly on September from Atlanta to Denver seventeenth. –* I’d like to fly on from Atlanta to Denver September seventeenth.
Agreement English has subject-verb agreement: –The cats chase that dog all day long. –* The cats chases that dog all day long. –The dog is chased by the cats all day long. –* The dog are chased by the cats all day long. Many languages exhibit much more agreement than English.
Subcategorization Verbs (predicates) require arguments of different types: –∅: The mirage disappears daily. –NP: I prefer ice cream. –NP PP: I leave Boston in the morning. –NP NP: I gave Mary a ticket. –PP: I leave on Thursday.
Alternations want can take either an NP or an infinitival VP: –I want a flight … –I want to fly … find cannot take an infinitival VP: –I found a flight … –* I found to fly …
How can we encode rules of language? There are many grammar formalisms. Most are variations on context-free grammars. Context-free grammars are of interest because they –have well-known properties (e.g. can be parsed in polynomial time) –can capture many aspects of language
Basic context-free grammar formalism A CFG is a 4-tuple (N, Σ, P, S) where –N is a set of non-terminal symbols –Σ is a set of terminal symbols –P is a set of productions, P ⊆ N × (Σ ∪ N)* –S is a start symbol and S ∈ N Each production is of the form A → α, where A is a non-terminal and α is drawn from (Σ ∪ N)*
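A minimal way to write such a 4-tuple down in Python, together with a naive random generator showing how productions expand; the particular grammar, symbols, and function name are toy choices for illustration only:

```python
import random

# P: productions, keyed by non-terminal (the keys form N; everything else is in Σ).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "N", "PP"]],
    "VP":  [["V", "NP"]],
    "PP":  [["P", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["kitten"]],
    "V":   [["chases"]],
    "P":   [["from"]],
}
START = "S"   # S, the start symbol (a member of N)

def generate(symbol=START):
    """Expand a symbol by choosing productions until only terminals remain."""
    if symbol not in GRAMMAR:          # terminal symbol: emit it
        return [symbol]
    rhs = random.choice(GRAMMAR[symbol])
    return [word for part in rhs for word in generate(part)]

print(" ".join(generate()))   # e.g. "the kitten chases a dog from the dog"
```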
Problems with basic formalism Consider a grammar rule like S → Aux NP VP To handle agreement between subject and verb, we could replace that rule with two new ones: S → 3SgAux 3SgNP VP S → Non3SgAux Non3SgNP VP We also need rules like the following: 3SgAux → does | has | can | … Non3SgAux → do | have | can | …
Extensions to formalism Feature structures and unification –feature structures are of the form [f1=v1, f2=v2, …, fn=vn] –feature structures can be partially specified: (a) [Number=Sg, Person=3, Category=NP] (b) [Number=Sg, Category=NP] (c) [Person=3, Category=NP] –(b) unified with (c) is (a) Feature structures can be used to express feature-value constraints across constituents without rule multiplication.
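A minimal sketch of unification for flat feature structures represented as Python dictionaries; real feature structures can be nested and re-entrant, so this toy version only covers the flat case shown above:

```python
def unify(fs1, fs2):
    """Unify two flat feature structures; return None if any feature value conflicts."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None                    # clash: unification fails
        result[feature] = value
    return result

b = {"Number": "Sg", "Category": "NP"}
c = {"Person": 3, "Category": "NP"}
print(unify(b, c))   # {'Number': 'Sg', 'Category': 'NP', 'Person': 3}  -- i.e. (a)

d = {"Number": "Pl", "Category": "NP"}
print(unify(b, d))   # None: the Number values clash
```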
Other formalisms More powerful: tree adjoining grammars –trees, not rules, are fundamental –trees are either initial or auxiliary –two operations: substitution and adjunction Less powerful: finite-state grammars –cannot handle general recursion –can be sufficient to handle real-world data –recursion spelled out explicitly to some level (large grammar)