Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.

Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007

Lecture 1, 7/21/2005Natural Language Processing2 MORPHOLOGY

Lecture 1, 7/21/2005Natural Language Processing3 Finite State Machines  FSAs are equivalent to regular languages  FSTs are equivalent to regular relations (over pairs of regular languages)  FSTs are like FSAs but with complex labels.  We can use FSTs to transduce between surface and lexical levels.

Lecture 1, 7/21/2005Natural Language Processing4 Simple Rules

Lecture 1, 7/21/2005Natural Language Processing5 Adding in the Words

Lecture 1, 7/21/2005Natural Language Processing6 Derivational Rules

Lecture 1, 7/21/2005Natural Language Processing7 Parsing/Generation vs. Recognition  Recognition is usually not quite what we need. Usually if we find some string in the language we need to find the structure in it (parsing) Or we have some structure and we want to produce a surface form (production/generation)  Example From “ cats ” to “ cat +N +PL ” and back

Lecture 1, 7/21/2005Natural Language Processing8 Morphological Parsing  Given the input cats, we’d like to output cat +N +Pl, telling us that cat is a plural noun.  Given the Spanish input bebo, we’d like to output beber +V +PInd +1P +Sg telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.

Lecture 1, 7/21/2005Natural Language Processing9 Morphological Anlayser To build a morphological analyser we need:  lexicon: the list of stems and affixes, together with basic information about them  morphotactics: the model of morpheme ordering (eg English plural morpheme follows the noun rather than a verb)  orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., fly+s = flies)

Lecture 1, 7/21/2005Natural Language Processing10 Lexicon & Morphotactics  Typically list of word parts (lexicon) and the models of ordering can be combined together into an FSA which will recognise the all the valid word forms.  For this to be possible the word parts must first be classified into sublexicons.  The FSA defines the morphotactics (ordering constraints).

Lecture 1, 7/21/2005Natural Language Processing11 Sublexicons to classify the list of word parts reg-nounirreg-pl-nounirreg-sg-nounplural catmicemouse-s foxsheep geesegoose

Lecture 1, 7/21/2005Natural Language Processing12 FSA Expresses Morphotactics (ordering model)

Lecture 1, 7/21/2005Natural Language Processing13 Towards the Analyser  We can use lexc or xfst to build such an FSA  To augment this to produce an analysis we must create a transducer T num which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.

Lecture 1, 7/21/2005Natural Language Processing14 Three Levels of Analysis

Lecture 1, 7/21/2005Natural Language Processing15 1. T num : Noun Number Inflection multi-character symbols morpheme boundary ^ word boundary #

Lecture 1, 7/21/2005Natural Language Processing16 Intermediate Form to Surface  The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g. cat^s  cats fox^s  foxes fly^s  flies  The rules which describe these changes are called orthographic rules or "spelling rules".

Lecture 1, 7/21/2005Natural Language Processing17 More English Spelling Rules  consonant doubling: beg / begging  y replacement: try/tries  k insertion: panic/panicked  e deletion: make/making  e insertion: watch/watches  Each rule can be stated in more detail...

Lecture 1, 7/21/2005Natural Language Processing18 Spelling Rules  Chomsky & Halle (1968) invented a special notation for spelling rules.  A very similar notation is embodied in the "conditional replacement" rules of xfst. E -> F || L _ R which means replace E with F when it appears between left context L and right context R

Lecture 1, 7/21/2005Natural Language Processing19 A Particular Spelling Rule This rule does e-insertion ^ -> e || x _ s#

Lecture 1, 7/21/2005Natural Language Processing20 e insertion over 3 levels The rule corresponds to the mapping between surface and intermediate levels

Lecture 1, 7/21/2005Natural Language Processing21 e insertion as an FST

Lecture 1, 7/21/2005Natural Language Processing22 Incorporating Spelling Rules  Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".  The set of spelling rules is positioned between the surface level and the intermediate level.  Parallel execution of FSTs can be carried out: by simulation: in this case FSTs must first be aligned. by first constructing a a single FST corresponding to their intersection.

Lecture 1, 7/21/2005Natural Language Processing23 Putting it all together execution of FST i takes place in parallel

Lecture 1, 7/21/2005Natural Language Processing24 Kaplan and Kay: The Xerox View FSTi are aligned but separate FSTi intersected together

Lecture 1, 7/21/2005Natural Language Processing25 Finite State Transducers  The simple story Add another tape Add extra symbols to the transitions On one tape we read “ cats ”, on the other we write “ cat +N +PL ”, or the other way around.

Lecture 1, 7/21/2005Natural Language Processing26 FSTs

Lecture 1, 7/21/2005Natural Language Processing27 English Plural surfacelexical catcat+N+Sg catscat+N+Pl foxesfox+N+Pl micemouse+N+Pl sheepsheep+N+Pl sheep+N+Sg

Lecture 1, 7/21/2005Natural Language Processing28 Transitions  c:c means read a c on one tape and write a c on the other  +N: ε means read a +N symbol on one tape and write nothing on the other  +PL:s means read +PL and write an s c:ca:at:t +N:ε + PL:s

Lecture 1, 7/21/2005Natural Language Processing29 Typical Uses  Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).  And we’ll write to the second tape using the other symbols on the transitions.

Lecture 1, 7/21/2005Natural Language Processing30 Ambiguity  Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. Didn’t matter which path was actually traversed  In FSTs the path to an accept state does matter since differ paths represent different parses and different outputs will result

Lecture 1, 7/21/2005Natural Language Processing31 Ambiguity  What’s the right parse for Unionizable Union-ize-able Un-ion-ize-able  Each represents a valid path through the derivational morphology machine.

Lecture 1, 7/21/2005Natural Language Processing32 Ambiguity  There are a number of ways to deal with this problem Simply take the first output found Find all the possible outputs (all paths) and return them all (without choosing) Bias the search so that only one or a few likely paths are explored

Lecture 1, 7/21/2005Natural Language Processing33 The Gory Details  Of course, its not as easy as “ cat +N +PL ” “ cats ”  As we saw earlier there are geese, mice and oxen  But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes Cats vs Dogs Fox and Foxes

Lecture 1, 7/21/2005Natural Language Processing34 Multi-Tape Machines  To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next  So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

Lecture 1, 7/21/2005Natural Language Processing35 Generativity  Nothing really privileged about the directions.  We can write from one and read from the other or vice- versa.  One way is generation, the other way is analysis

Lecture 1, 7/21/2005Natural Language Processing36 Multi-Level Tape Machines  We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

Lecture 1, 7/21/2005Natural Language Processing37 Lexical to Intermediate Level

Lecture 1, 7/21/2005Natural Language Processing38 Intermediate to Surface  The add an “e” rule as in fox^s# foxes#

Lecture 1, 7/21/2005Natural Language Processing39 Foxes

Lecture 1, 7/21/2005Natural Language Processing40 Note  A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply.  Meaning that they are written out unchanged to the output tape.  Turns out the multiple tapes aren’t really needed; they can be compiled away.

Lecture 1, 7/21/2005Natural Language Processing41 Overall Scheme  We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity). Lexical level to intermediate forms  We have a larger set of machines that capture orthographic/spelling rules. Intermediate forms to surface forms

Lecture 1, 7/21/2005Natural Language Processing42 Overall Scheme

Lecture 1, 7/21/2005Natural Language Processing43 http://nltk.sourceforge.net/index.php/Documentation

Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.

Similar presentations

Presentation on theme: "Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.

Similar presentations

Presentation on theme: "Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007."— Presentation transcript:

Similar presentations

About project

Feedback