Download presentation
Presentation is loading. Please wait.
Published byHelena Parrish Modified over 9 years ago
1
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 3 27 July 2007
2
2 Levels of (Formal) Description 6 basic levels (more or less explicitly present in most theories) : and beyond (pragmatics/logic/...) meaning (semantics) (surface) syntax morphology phonology phonetics/orthography Each level has an input and output representation output from one level is the input to the next (upper) level sometimes levels might be skipped (merged) or split
3
3 Phonetics/Orthography Input: acoustic signal (phonetics) / text (orthography) Output: phonetic alphabet (phonetics) / text (orthography) Deals with: Phonetics: consonant & vowel (& others) formation in the vocal tract classification of consonants, vowels,... in relation to frequencies, shape & position of the tongue and various muscles intonation Orthography: normalization, punctuation, etc.
4
4 Phonology Input: sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes] Output: sequence of phonemes (~ (lexical) letters; in an abstract alphabet) Deals with: relation between sounds and phonemes (units which might have some function on the upper level) e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)
5
5 Morphology Input: sequence of phonemes (~ (lexical) letters) Output: sequence of pairs (lemma, (morphological) tag) Deals with: composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding) e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.
6
6 (Surface) Syntax Input: sequence of pairs (lemma, (morphological) tag) Output: sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms Deals with: the relation between lemmas & morphological categories and the sentence structure uses syntactic categories such as Subject, Verb, Object,... e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S
7
7 Meaning (semantics) Input: sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions) Output: sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions) Deals with: relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other cat’s e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
8
8...and Beyond Input: sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions) Output: logical form, which can be evaluated (true/false) Deals with: assignment of objects from the real world to the nodes of the sentence structure e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see( Mark-Twain[SSN:...],Tom-Sawyer[SSN:...] ) [Time:bef 99/9/27/14:15][Place:39ş19’40”N76ş37’10”W]
9
Lecture 1, 7/21/2005Natural Language Processing9 Three Views Three equivalent formal ways to look at what we’re up to (not including tables) Regular Expressions Regular Languages Finite State Automata
10
Lecture 1, 7/21/2005Natural Language Processing10 Transition Finite-state methods are particularly useful in dealing with a lexicon. Lots of devices, some with limited memory, need access to big lists of words. So we’ll switch to talking about some facts about words and then come back to computational methods
11
Lecture 1, 7/21/2005Natural Language Processing11 MORPHOLOGY
12
Lecture 1, 7/21/2005Natural Language Processing12 Morphology Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word) We can usefully divide morphemes into two classes Stems: The core meaning bearing units Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions Prefix: un-, anti-, etc Suffix: -ity, -ation, etc Infix: are inserted inside the stem Tagalog: um + hingi humingi Circumfixes – precede and follow the stem English doesn’t stack more affixes. But Turkish can have words with a lot of suffixes. Languages, such as Turkish, tend to string affixes together are called agglutinative languages.
13
Lecture 1, 7/21/2005Natural Language Processing13 Surface and Lexical Forms The surface level of a word represents the actual spelling of that word. geliyorum eats cats kitabım The lexical level of a word represents a simple concatenation of morphemes making up that word. gel +PROG +1SG eat +AOR cat +PLU kitap +P1SG Morphological processors try to find correspondences between lexical and surface forms of words. Morphological recognition/ analysis – surface to lexical Morphological generation/ synthesis – lexical to surface
14
14 Morphology: Morphemes & Order Handles what is an isolated form in written text Grouping of phonemes into morphemes sequence deliverables deliver, able and s (3 units) Morpheme Combination certain combinations/sequencing possible, other not: deliver+able+s, but not able+derive+s; noun+s, but not noun+ing typically fixed (in any given language)
15
Lecture 1, 7/21/2005Natural Language Processing15 Inflectional & Derivational Morphology We can also divide morphology up into two broad classes Inflectional Derivational Inflectional morphology concerns the combination of stems and affixes where the resulting word Has the same word class as the original Serves a grammatical/semantic purpose different from the original After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change. eat / eats pencil / pencils After a combination with an derivational morpheme, the meaning and the class of the actual stem usually change. compute / computer do / undo friend / friendly Uygar / uygarlaşkapı / kapıcı The irregular changes may happen with derivational affixes.
16
Lecture 1, 7/21/2005Natural Language Processing16 Morphological Parsing Morphological parsing is to find the lexical form of a word from its surface form. cats -- cat +N +PLU cat -- cat +N +SG goose -- goose +N +SG or goose +V geese -- goose +N +PLU gooses -- goose +V +3SG catch -- catch +V caught -- catch +V +PAST or catch +V +PP There can be more than one lexical level representation for a given word. (ambiguity)
17
Lecture 1, 7/21/2005Natural Language Processing17 Morphological Analysis Analyzing words into their linguistic components (morphemes). Morphemes are the smallest meaningful units of language. carscar+PLU givinggive+PROG AsachhilAmaAsA+PROG+PAST+1st I/We was/were coming Ambiguity: More than one alternatives fliesfly VERB +PROG fly NOUN +PLU mAtAla kare
18
Lecture 1, 7/21/2005Natural Language Processing18 Fly + s flys flies (y i rule) Duckling Go-getter get + er Doer do + er Beer ? What knowledge do we need? How do we represent it? How do we compute with it?
19
Lecture 1, 7/21/2005Natural Language Processing19 Knowledge needed Knowledge of stems or roots Duck is a possible root, not duckl We need a dictionary (lexicon) Only some endings go on some words Do + er ok Be + er – not ok In addition, spelling change rules that adjust the surface form Get + er – double the t getter Fox + s – insert e – foxes Fly + s – insert e – flys – y to i – flies Chase + ed – drop e - chased
20
Lecture 1, 7/21/2005Natural Language Processing20 Put all this in a big dictionary (lexicon) Turkish – approx 600 10 6 forms Finnish – 10 7 Hindi, Bengali, Telugu, Tamil? Besides, always novel forms can be constructed Anti-missile Anti-anti-missile Anti-anti-anti-missile …….. Compounding of words – Sanskrit, German
21
21 Morphology: From Morphemes to Lemmas & Categories Lemma: lexical unit, “pointer” to lexicon typically is represented as the “base form”, or “dictionary headword” possibly indexed when ambiguous/polysemous: state 1 (verb), state 2 (state-of-the-art), state 3 (government) from one or more morphemes (“root”, “stem”, “root+derivation”,...) Categories: non-lexical small number of possible values (< 100, often < 5-10)
22
22 Morphology Level: The Mapping Formally: A + 2 (L,C 1,C 2,...,Cn) A is the alphabet of phonemes (A + denotes any non-empty sequence of phonemes) L is the set of possible lemmas, uniquely identified C i are morphological categories, such as: grammatical number, gender, case person, tense, negation, degree of comparison, voice, aspect,... tone, politeness,... part of speech (not quite morphological category, but...) A, L and C i are obviously language-dependent
23
Lecture 1, 7/21/2005Natural Language Processing23 Morphological Analysis (cont.) Relatively simple for English. But for many Indian languages, it may be more difficult. Examples Inflectional and Derivational Morphology. Common tools: Finite-state transducers
24
Lecture 1, 7/21/2005Natural Language Processing24 Bengali Verb Paradigms
25
Lecture 1, 7/21/2005Natural Language Processing25 Bengali Verb morphology for one of the paradigms
26
Lecture 1, 7/21/2005Natural Language Processing26
27
Lecture 1, 7/21/2005Natural Language Processing27
28
Lecture 1, 7/21/2005Natural Language Processing28 Finite State Machines FSAs are equivalent to regular languages FSTs are equivalent to regular relations (over pairs of regular languages) FSTs are like FSAs but with complex labels. We can use FSTs to transduce between surface and lexical levels.
29
Lecture 1, 7/21/2005Natural Language Processing29 Simple Rules
30
Lecture 1, 7/21/2005Natural Language Processing30 Adding in the Words
31
Lecture 1, 7/21/2005Natural Language Processing31 Derivational Rules
32
Lecture 1, 7/21/2005Natural Language Processing32 Parsing/Generation vs. Recognition Recognition is usually not quite what we need. Usually if we find some string in the language we need to find the structure in it (parsing) Or we have some structure and we want to produce a surface form (production/generation) Example From “ cats ” to “ cat +N +PL ” and back
33
Lecture 1, 7/21/2005Natural Language Processing33 Morphological Parsing Given the input cats, we’d like to output cat +N +Pl, telling us that cat is a plural noun. Given the Spanish input bebo, we’d like to output beber +V +PInd +1P +Sg telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.
34
Lecture 1, 7/21/2005Natural Language Processing34 Morphological Anlayser To build a morphological analyser we need: lexicon: the list of stems and affixes, together with basic information about them morphotactics: the model of morpheme ordering (eg English plural morpheme follows the noun rather than a verb) orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., fly+s = flies)
35
Lecture 1, 7/21/2005Natural Language Processing35 Lexicon & Morphotactics Typically list of word parts (lexicon) and the models of ordering can be combined together into an FSA which will recognise the all the valid word forms. For this to be possible the word parts must first be classified into sublexicons. The FSA defines the morphotactics (ordering constraints).
36
Lecture 1, 7/21/2005Natural Language Processing36 Sublexicons to classify the list of word parts reg-nounirreg-pl-nounirreg-sg-nounplural catmicemouse-s foxsheep geesegoose
37
Lecture 1, 7/21/2005Natural Language Processing37 FSA Expresses Morphotactics (ordering model)
38
Lecture 1, 7/21/2005Natural Language Processing38 Towards the Analyser We can use lexc or xfst to build such an FSA (see lex1.lexc) To augment this to produce an analysis we must create a transducer T num which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.
39
Lecture 1, 7/21/2005Natural Language Processing39 Three Levels of Analysis
40
Lecture 1, 7/21/2005Natural Language Processing40 1. T num : Noun Number Inflection multi-character symbols morpheme boundary ^ word boundary #
41
Lecture 1, 7/21/2005Natural Language Processing41 Intermediate Form to Surface The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g. cat^s cats fox^s foxes fly^s flies The rules which describe these changes are called orthographic rules or "spelling rules".
42
Lecture 1, 7/21/2005Natural Language Processing42 More English Spelling Rules consonant doubling: beg / begging y replacement: try/tries k insertion: panic/panicked e deletion: make/making e insertion: watch/watches Each rule can be stated in more detail...
43
Lecture 1, 7/21/2005Natural Language Processing43 Spelling Rules Chomsky & Halle (1968) invented a special notation for spelling rules. A very similar notation is embodied in the "conditional replacement" rules of xfst. E -> F || L _ R which means replace E with F when it appears between left context L and right context R
44
Lecture 1, 7/21/2005Natural Language Processing44 A Particular Spelling Rule This rule does e-insertion ^ -> e || x _ s#
45
Lecture 1, 7/21/2005Natural Language Processing45 e insertion over 3 levels The rule corresponds to the mapping between surface and intermediate levels
46
Lecture 1, 7/21/2005Natural Language Processing46 e insertion as an FST
47
Lecture 1, 7/21/2005Natural Language Processing47 Incorporating Spelling Rules Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned". The set of spelling rules is positioned between the surface level and the intermediate level. Parallel execution of FSTs can be carried out: by simulation: in this case FSTs must first be aligned. by first constructing a a single FST corresponding to their intersection.
48
Lecture 1, 7/21/2005Natural Language Processing48 Putting it all together execution of FST i takes place in parallel
49
Lecture 1, 7/21/2005Natural Language Processing49 Kaplan and Kay The Xerox View FSTi are aligned but separate FSTi intersected together
50
Lecture 1, 7/21/2005Natural Language Processing50 Finite State Transducers The simple story Add another tape Add extra symbols to the transitions On one tape we read “ cats ”, on the other we write “ cat +N +PL ”, or the other way around.
51
Lecture 1, 7/21/2005Natural Language Processing51 FSTs
52
Lecture 1, 7/21/2005Natural Language Processing52 English Plural surfacelexical catcat+N+Sg catscat+N+Pl foxesfox+N+Pl micemouse+N+Pl sheepsheep+N+Pl sheep+N+Sg
53
Lecture 1, 7/21/2005Natural Language Processing53 Transitions c:c means read a c on one tape and write a c on the other +N: ε means read a +N symbol on one tape and write nothing on the other +PL:s means read +PL and write an s c:ca:at:t +N:ε + PL:s
54
Lecture 1, 7/21/2005Natural Language Processing54 Typical Uses Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA). And we’ll write to the second tape using the other symbols on the transitions.
55
Lecture 1, 7/21/2005Natural Language Processing55 Ambiguity Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. Didn’t matter which path was actually traversed In FSTs the path to an accept state does matter since differ paths represent different parses and different outputs will result
56
Lecture 1, 7/21/2005Natural Language Processing56 Ambiguity What’s the right parse for Unionizable Union-ize-able Un-ion-ize-able Each represents a valid path through the derivational morphology machine.
57
Lecture 1, 7/21/2005Natural Language Processing57 Ambiguity There are a number of ways to deal with this problem Simply take the first output found Find all the possible outputs (all paths) and return them all (without choosing) Bias the search so that only one or a few likely paths are explored
58
Lecture 1, 7/21/2005Natural Language Processing58 The Gory Details Of course, its not as easy as “ cat +N +PL ” “ cats ” As we saw earlier there are geese, mice and oxen But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes Cats vs Dogs Fox and Foxes
59
Lecture 1, 7/21/2005Natural Language Processing59 Multi-Tape Machines To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
60
Lecture 1, 7/21/2005Natural Language Processing60 Generativity Nothing really privileged about the directions. We can write from one and read from the other or vice- versa. One way is generation, the other way is analysis
61
Lecture 1, 7/21/2005Natural Language Processing61 Multi-Level Tape Machines We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
62
Lecture 1, 7/21/2005Natural Language Processing62 Lexical to Intermediate Level
63
Lecture 1, 7/21/2005Natural Language Processing63 Intermediate to Surface The add an “e” rule as in fox^s# foxes#
64
Lecture 1, 7/21/2005Natural Language Processing64 Foxes
65
Lecture 1, 7/21/2005Natural Language Processing65 Note A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply. Meaning that they are written out unchanged to the output tape. Turns out the multiple tapes aren’t really needed; they can be compiled away.
66
Lecture 1, 7/21/2005Natural Language Processing66 Overall Scheme We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity). Lexical level to intermediate forms We have a larger set of machines that capture orthographic/spelling rules. Intermediate forms to surface forms
67
Lecture 1, 7/21/2005Natural Language Processing67 Overall Scheme
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.