LIN3022 Natural Language Processing
Lecture 3
Albert Gatt

Reminder: Non-deterministic FSA
An FSA where there can be multiple paths for a single input (tape).
Two basic approaches to recognition:
1. Either take a non-deterministic (ND) machine, convert it to a deterministic (D) machine, and then do recognition with that.
2. Or explicitly manage the process of recognition as a state-space search (leaving the machine as is).

Example
[Figure: input tape b a a a ! with the state sequence q0 q1 q2 q2 q3 q4]

Non-Deterministic Recognition: Search
Given an ND FSA representing some language and given an input string:
– If the input string belongs to the language, the ND FSA will contain at least one path for that string;
– If the input string does not belong to the language, there will be no path;
– Not all paths through the machine for a string in the language lead to an accept state.
A recognition algorithm succeeds if an accepting path is found for the input string; it fails otherwise.

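Recognition as state-space search can be sketched as below. This is a minimal illustration, not the course's own code: the transition table is an assumed encoding of a "b a a+ !" machine in which state q2 has two arcs labelled a, and the names TRANSITIONS, ACCEPT and nd_recognise are hypothetical.

```python
from collections import deque

# Assumed encoding of a non-deterministic machine for b a a+ !
# State q2 has two outgoing arcs labelled "a" (a loop and an arc to q3).
TRANSITIONS = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}
ACCEPT = {"q4"}


def nd_recognise(tape, start="q0"):
    """Return True if at least one path through the machine accepts the tape."""
    # Each search state pairs a machine state with a position on the tape.
    agenda = deque([(start, 0)])
    while agenda:
        state, pos = agenda.pop()  # pop() gives depth-first search
        if pos == len(tape):
            if state in ACCEPT:
                return True
            continue  # dead end: tape finished in a non-accept state
        for nxt in TRANSITIONS.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))
    return False  # agenda exhausted without finding an accepting path
```

Replacing pop() with popleft() would turn the same loop into a breadth-first search; the machine itself is left as is in both cases.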
Words
Finite-state methods are particularly useful in dealing with a lexicon.
Many devices need access to large lists of words:
– Spell checkers
– Syntactic parsers
– Language generators
– Machine translation systems
What sort of knowledge should a computational lexicon contain?
– What we’re mainly concerned with today is morphology (inflection and derivation)

Computational tasks involving morphology
Morphological analysis (parsing):
– ommijiet -> omm+PL
Morphological (lexical) generation:
– omm+PL -> ommijiet
Stemming (map from a word to its stem):
– ommijiet -> omm
– ommijiethom -> omm
– ommna -> omm

Computational tasks involving morphology
Lemmatisation
– Convert words to their “basic” form
– ommijiet -> omm
– ommijiethom -> omm
– ...
Tokenisation
– Split running text into individual tokens
– Qtilt il-kelb -> qtilt, il-, kelb
– Xrobt l-ilma -> xrobt, l-, ilma

Regular, irregular and messy
The problem is to cover everything, of course:
– mouse/mice, goose/geese, ox/oxen (PLURAL)
– go/went, fly/flew (PAST)
– solid/solidify, mechanical/mechanise (VERBALISATION)

Inflection vs Derivation
Inflection is typically fairly straightforward
– Relatively easy to identify regular and irregular cases
Derivational morphology is much messier
– Quasi-systematicity
– Irregular meaning change
– Changes of word class

Derivational Examples
Verbs and adjectives to nouns

Suffix   Base          Derived
-ation   computerise   computerisation
-ee      appoint       appointee
-er      kill          killer
-ness    fuzzy         fuzziness

Derivational Examples
Nouns and verbs to adjectives

Suffix   Base          Derived
-al      computation   computational
-able    embrace       embraceable
-less    clue          clueless

Example: Compute
Many paths are possible…
Start with compute:
– computer -> computerise -> computerisation
– computer -> computerise -> computerisable
But not all paths/operations are equally good:
– clue -> *clueable

Morphology, recognition and FSAs
We’d like to use the machinery provided by FSAs to capture these facts about morphology:
– Accept strings that are in the language
– Reject strings that are not
– ... in an efficient way:
  Without listing every word in the language
  Capturing some generalisations
  Enabling fast search

What does a morphological parser or recogniser need?
Lexicon
– Often not feasible to just list all the words. Maltese has literally thousands of forms for some verbs...
– Some morphological processes are productive; we’re likely to meet completely new formations.
Morphotactics (order of morphemes)
– E.g. English plural morpheme after the noun stem
– E.g. Maltese “accusative” -l before dative pronominal suffixes (e.g. Qatilulna)
– E.g. English -ise before -ation (formalisation)
Orthographic rules
– Needed to handle variations in the spelling of the stem
– E.g. English nouns ending in -y change to -i (city -> cities)

Start Simple
Regular singular nouns are ok
Regular plural nouns have an -s on the end
Irregulars are ok as is (i.e. treat as atomic for now)

Simple Rules
[figure]

Substitute words for word classes
The idea is to be able to use this kind of FSA for recognition.
We’ve replaced classes like “reg-noun” with the actual words.

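One way to picture the substitution is to expand each word class into letter arcs. The sketch below is only an illustration under assumptions: the word lists are invented, spelling changes are ignored, and build_noun_fsa and accepts are hypothetical names.

```python
def build_noun_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Expand word classes into letter arcs; return (transitions, accept states)."""
    transitions = {}  # (state, letter) -> next state; state 0 is the start state
    accept = set()

    def add_word(word):
        state = 0
        for letter in word:
            key = (state, letter)
            if key not in transitions:
                # Every new arc leads to a fresh state.
                transitions[key] = len(transitions) + 1
            state = transitions[key]
        return state

    # Irregular forms are treated as atomic: listed whole, no suffix arc.
    for word in list(irreg_sg) + list(irreg_pl):
        accept.add(add_word(word))
    # Regular nouns are accepted bare (singular) or followed by an -s arc (plural).
    for word in reg_nouns:
        accept.add(add_word(word))
        accept.add(add_word(word + "s"))
    return transitions, accept


def accepts(fsa, word):
    """Run the word through the machine one letter at a time."""
    transitions, accept = fsa
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False  # no arc for this letter: reject
    return state in accept


# Assumed toy word lists.
NOUNS = build_noun_fsa({"cat", "dog", "bird"}, {"goose", "mouse"}, {"geese", "mice"})
```

With these lists, accepts(NOUNS, "cats") and accepts(NOUNS, "geese") succeed, while accepts(NOUNS, "gooses") fails because no -s arc leaves the end of an irregular noun.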
Derivational Rules
[figure]
If everything is an accept state, how do things ever get rejected?

Lexicons
A lexicon can be stored as an FSA.
A base lexicon (with baseforms) can be plugged into a larger FSA to capture morphological rules and morphotactics.

Part 2
Finite state transducers
Morphological parsing
Morphological generation

Parsing/Generation vs. Recognition
We can now run strings through these machines to recognise strings in the language.
But recognition is usually not quite what we need:
– Often if we find some string in the language we might like to assign a structure to it (parsing)
– Or we might have some structure and we want to produce a surface form for it (production/generation)

Finite State Transducers
The simple story:
– Add another tape
– Add extra symbols to the transitions
– On one tape we read “cats”, on the other we write “cat +N +PL”

What is a finite-state transducer?
A transducer is a machine that takes input of a certain form and outputs something of a different form.
– We can think of morphological analysis as transduction (from a word to stem+features)
  ommijietna -> omm + PL + POSS.1PL
– So is morphological generation (from a stem+feature combination to a word)
  omm + PL + POSS.1PL -> ommijietna

Structure of an FST
The easiest way to think of an FST is as a variation on the classic FSA.
A simple FSA has states and transitions, and recognises something on an input tape.
An FST has states and transitions, but works with two tapes.
– One corresponds to the input.
– The other to the output.

The uses of an FST
Recognition:
– Take a pair of strings (on two tapes) as input, output accept or reject
Generation:
– Output a pair of strings from the language
Translation:
– Read one string and output another
Relation between two sets:
– A machine that computes the relation between the set of possible input strings and the set of possible output strings.

FSTs
[figure]

Morphological parsing
Morphological analysis or parsing can either be:
– An important stand-alone component of many applications (spelling correction, information retrieval)
– Or simply a link in a chain of further linguistic analysis
Interestingly, FSTs are bidirectional, i.e. they can be used for both parsing and generation.

Transitions
[Figure: a path of arcs labelled c:c, a:a, t:t, +N:ε, +PL:s]
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
Note the convention: x:y represents an input symbol x and an output symbol y.

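The x:y arcs can be tried out with a small two-tape machine. This is a sketch under assumptions: the state numbers, the single accept state and the name transduce are hypothetical, and only the five arcs listed on the slide are encoded.

```python
EPS = ""  # stands for the empty symbol ε

# Arcs as (source state, lexical symbol, surface symbol, target state).
# State numbers are hypothetical; the labels are c:c, a:a, t:t, +N:ε, +PL:s.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
]
ACCEPT = {5}


def transduce(symbols, invert=False):
    """Return every output symbol sequence the machine pairs with `symbols`.

    invert=False reads the lexical side and writes the surface side (generation);
    invert=True swaps the two tapes (parsing).
    """
    results = []
    agenda = [(0, 0, [])]  # (state, position on input tape, output so far)
    while agenda:
        state, pos, out = agenda.pop()
        if pos == len(symbols) and state in ACCEPT:
            results.append(out)
        for src, lexical, surface, dst in ARCS:
            if src != state:
                continue
            read, write = (surface, lexical) if invert else (lexical, surface)
            new_out = out + [write] if write != EPS else out
            if read == EPS:
                agenda.append((dst, pos, new_out))  # arc consumes no input
            elif pos < len(symbols) and symbols[pos] == read:
                agenda.append((dst, pos + 1, new_out))
    return results


generated = transduce(["c", "a", "t", "+N", "+PL"])  # lexical -> surface
parsed = transduce(list("cats"), invert=True)        # surface -> lexical
```

Here generated holds the single output c a t s and parsed holds c a t +N +PL: the same arcs serve both directions, simply by swapping which side is read and which is written.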
Typical Uses
Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
And we’ll write to the second tape using the other symbols on the transitions.

Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
– It didn’t matter which path was actually traversed
In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result.

Ambiguity
What’s the right parse (segmentation) for unionisable?
– union-ise-able
– un-ion-ise-able
Each represents a valid path through the derivational morphology machine.
Unlike in an FSA, the differences matter! (Some are not legal parses in English.)

Ambiguity
There are a number of ways to deal with this problem:
– Simply take the first output found
– Find all the possible outputs (all paths) and return them all (without choosing)
– Bias the search so that only one or a few likely paths are explored

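The first two options can be compared on the unionisable example. The sketch below is an assumption-laden illustration: it works on surface strings with an invented list of morphs (where "is" stands for -ise with its final e dropped before -able), and segmentations is a hypothetical helper, not part of any library.

```python
# Assumed surface morphs; "is" is -ise with the final e dropped before -able.
MORPHS = {"un", "ion", "union", "is", "able"}


def segmentations(word, morphs):
    """Return every way of splitting `word` into a sequence of listed morphs."""
    if word == "":
        return [[]]  # one way to segment nothing: the empty sequence
    results = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in morphs:
            # Each segmentation of the remainder extends this prefix.
            for rest in segmentations(word[i:], morphs):
                results.append([prefix] + rest)
    return results


all_parses = segmentations("unionisable", MORPHS)  # "return them all"
first_parse = all_parses[0]                        # "take the first output found"
```

With this list, all_parses contains both un-ion-is-able and union-is-able. Taking the first output returns un-ion-is-able here, which shows why an unbiased choice can pick a parse that is not legal English.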
The Gory Details
Of course, it’s not as easy as “cat +N +PL” -> “cats”
As we saw earlier there are geese, mice and oxen
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
– Cats vs dogs
– Fox and foxes

Multi-Tape Machines
To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next
– This gives us a cascade of FSTs
So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape.
^ marks a morpheme boundary
# marks a word boundary

Lexical to Intermediate Level
[figure]

Intermediate to Surface
The “add an e” rule, as in fox^s# -> foxes#

Foxes
[figure]

Note
A key feature of this lower machine is that it has to do the right thing for inputs to which it doesn’t really apply. So...
– fox -> foxes, but bird -> birds

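The behaviour of the lower machine can be approximated with a single rewrite rule. This is a sketch, not the transducer itself: it assumes the rule fires after x, s or z only, uses a regular expression in place of an FST, and strips the boundary symbols from the result; e_insertion is a hypothetical name.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s#' to a surface string."""
    # Assumed context: insert e at a morpheme boundary that follows x, s or z
    # and is itself followed by the plural s and the word boundary.
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "e", intermediate)
    # Where the rule does not apply, the boundary symbols are simply dropped,
    # so the rest of the input passes through unchanged.
    return with_e.replace("^", "").replace("#", "")
```

For example, e_insertion("fox^s#") gives foxes and e_insertion("bird^s#") gives birds: the rule has to pass the second input through untouched apart from removing the boundary symbols.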
Overall Scheme
We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
– Lexical level to intermediate forms
We have a larger set of machines that capture orthographic/spelling rules.
– Intermediate forms to surface forms

Overall Scheme
[figure]

Cascades
This is a common architecture
– Overall processing is divided up into distinct rewrite steps
– The output of one layer serves as the input to the next
– The intermediate tapes may or may not wind up being useful in their own right

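A cascade can be sketched as plain function composition. Everything here is assumed for illustration: the toy lexicon, the feature labels +SG and +PL, the two spelling rules and all function names. Regular expressions stand in for the spelling-rule transducers.

```python
import re

# Hypothetical toy lexicon; the entries are assumptions for illustration.
REGULAR = {"fox", "cat", "city"}
IRREGULAR_PLURAL = {"goose": "geese", "mouse": "mice"}


def lexical_to_intermediate(stem, number):
    """Stage 1: lexicon and morphotactics, e.g. ('fox', '+PL') -> 'fox^s#'."""
    if stem in IRREGULAR_PLURAL:
        form = IRREGULAR_PLURAL[stem] if number == "+PL" else stem
    elif stem in REGULAR:
        form = stem + "^s" if number == "+PL" else stem
    else:
        raise ValueError(f"unknown stem: {stem}")
    return form + "#"


# Stage 2: orthographic rules as (pattern, replacement) pairs, applied in order.
SPELLING_RULES = [
    (r"(?<=[^aeiou])y\^(?=s#)", "ie"),  # city^s# -> cities#
    (r"(?<=[xsz])\^(?=s#)", "e"),       # fox^s#  -> foxes#
    (r"[\^#]", ""),                     # drop remaining boundary symbols
]


def intermediate_to_surface(form):
    for pattern, replacement in SPELLING_RULES:
        form = re.sub(pattern, replacement, form)
    return form


def generate(stem, number):
    """Cascade: the output of stage 1 is the input of stage 2."""
    return intermediate_to_surface(lexical_to_intermediate(stem, number))
```

The intermediate string is visible between the two calls in generate, which is the sense in which an intermediate tape may or may not be useful in its own right.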
Overall Plan
[figure]

Part 3
A brief look at stemming...
A brief look at tokenisation

Stemming
Stemming is the process of stripping affixes from words to reduce them to their stems
– NB: the stem is not necessarily the baseform
– Example: strip “ing” from all word endings
  going -> go
  stripping -> stripp
  ...

Uses of stemming
Often used in Information Retrieval
– The basic technology underlying search engines
– Task: given a query (e.g. keywords), retrieve documents which match it
– Stemming is useful because it increases the likelihood of matches
– E.g. a search for kangaroos returns documents containing kangaroo or kangaroos

The Porter stemmer
Built by Martin Porter in 1980
Still widely used
Very simple FST-based stemmer
No lexicon, just rules
View a demo online here:

Rules in the Porter Stemmer
ATIONAL -> ATE
– E.g. relational -> relate
– But what about rational?
ING -> ε
– E.g. shivering -> shiver
These can be viewed as an FST, but without a lexical layer.

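The two rules can be tried out as bare suffix rewrites. This is not the Porter stemmer: the real algorithm attaches conditions to its rules and applies them in several ordered steps, whereas this sketch applies the two rules above unconditionally. RULES and apply_rules are hypothetical names.

```python
# The two suffix rules from the slide, as (suffix, replacement) pairs.
# Assumption: no conditions on the remaining stem, unlike the real stemmer.
RULES = [
    ("ational", "ate"),
    ("ing", ""),
]


def apply_rules(word):
    """Rewrite the first matching suffix; leave the word alone otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Here apply_rules("relational") gives relate and apply_rules("shivering") gives shiver, but apply_rules("rational") gives rate and apply_rules("sing") gives s. Rules with no lexicon and no conditions overapply in exactly this way.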
Errors in the Porter stemmer
Simple rules like the ones used by the Porter stemmer are error-prone (they miss exceptions)
Errors of commission:
– rational -> ration
Errors of omission:
– European and Europe are not reduced to the same stem

Tokenisation
Defined as the task of splitting running text into component tokens.
Related task: sentence segmentation (split running text into sentences)
Simplest technique:
– Just split on whitespace
But what about:
– Punctuation
– Clitics
– ...

Tokenisation & Sentence segmentation
Often go hand in hand!
Example:
– Full stop can be a sentence boundary or an intra-word boundary
– “Punctuation” marks in numbers: 30,
Can get a long way using simple regular expressions, at least in Indo-European languages.

Example from Maltese
Treat numbers as tokens, with or without decimals:
– \d+(\.\d+)?
Honorifics shouldn’t be broken up at full stops or apostrophes:
– sant['’]|(onor|sra|nru|dott|kap|mons|dr|prof)\.
Definite articles shouldn’t be broken up at hyphens:
– i?[dtlrnsxzżċ]-
– ... [other exceptions]
General rule:
– A word is just a sequence of characters: \w+\s

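The patterns can be combined into one tokeniser by trying the exceptions before the general rule. This is a sketch under assumptions: the ordering of the alternatives, the case-insensitive matching, the final punctuation pattern and the use of \w+ without the trailing \s are choices made here, and "Dr. Borg" is an invented test phrase.

```python
import re

# Alternatives are tried left to right, so exceptions come before the general rule.
TOKEN = re.compile(
    "|".join([
        r"sant['’]|(?:onor|sra|nru|dott|kap|mons|dr|prof)\.",  # honorifics
        r"i?[dtlrnsxzżċ]-",                                    # definite article
        r"\d+(?:\.\d+)?",                                      # numbers
        r"\w+",                                                # any other word
        r"[^\w\s]",                                            # assumed fallback: one punctuation mark
    ]),
    re.IGNORECASE,
)


def tokenise(text):
    """Return the tokens of `text` in order; whitespace is skipped."""
    return TOKEN.findall(text)
```

On the earlier examples, tokenise("Qtilt il-kelb") yields Qtilt, il-, kelb and tokenise("Xrobt l-ilma") yields Xrobt, l-, ilma. The number pattern has to precede \w+ because \w also matches digits.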
Chinese
Words in Chinese are composed of hanzi characters.
Each character usually represents one morpheme.
Words tend to be quite short (ca. 2.4 characters long on average)
There is no whitespace between words!

A simple algorithm for segmenting Chinese
Maximum Matching (MaxMatch)
– Requires a lexicon.
– Start by pointing at the beginning of the input string.
– Choose the longest item in the lexicon that matches the input starting at the pointer, and move the pointer to the end of that item.
– If no word matches, move the pointer one symbol forward.
– Repeat until the end of the input is reached.

MaxMatch example
Imagine an English string with no whitespace:
– The table down there
– thetabledownthere
The first item found by MaxMatch is theta
– Because the algorithm tries to match the longest portion of the input
– Then: bled, own, there
Result: theta bled own there
Luckily, it works better for Chinese, because words there tend to be shorter than in English!

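The greedy procedure can be sketched in a few lines. The lexicon below is an assumed toy word list, and emitting an unmatched symbol as a one-character token is a choice made here (the slides only say to move the pointer forward); max_match is a hypothetical name.

```python
def max_match(text, lexicon):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    tokens = []
    pos = 0
    longest = max(map(len, lexicon), default=1)
    while pos < len(text):
        # Try the longest candidate first, then shorter ones.
        for end in range(min(len(text), pos + longest), pos, -1):
            if text[pos:end] in lexicon:
                tokens.append(text[pos:end])
                pos = end  # continue after the matched word
                break
        else:
            # No lexicon entry starts here: emit one symbol and move on
            # (assumption: the unmatched symbol becomes its own token).
            tokens.append(text[pos])
            pos += 1
    return tokens


# Assumed toy lexicon for the English example.
LEXICON = {"the", "theta", "table", "bled", "down", "own", "there"}
segmented = max_match("thetabledownthere", LEXICON)
```

With this lexicon, segmented is theta, bled, own, there: the greedy choice of theta at the start blocks the intended reading the table down there.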