Published by Evelyn Freeman. Modified over 9 years ago.
1
CS60057 Speech & Natural Language Processing
Autumn 2005
Lecture 3, 27 July 2005
2
The Description of Language
Language = Words and Rules = Dictionary (vocabulary) + Grammar
Dictionary: the set of words defined in the language
  open (dynamic)
  Traditional: paper based
  Electronic: machine-readable dictionaries; can be obtained from paper-based ones
Grammar: the set of rules that describe what is allowable in a language
  Classic grammars
    meant for humans who already know the language
    definitions and rules are mainly supported by examples
    no (or almost no) formal description tools; cannot be programmed
  Explicit grammars (CFG, Dependency Grammars, Link Grammars, ...)
    formal description
    can be programmed and tested on data (texts)
3
Levels of (Formal) Description
6 basic levels (more or less explicitly present in most theories), listed from highest to lowest:
  and beyond (pragmatics/logic/...)
  meaning (semantics)
  (surface) syntax
  morphology
  phonology
  phonetics/orthography
Each level has an input and an output representation
  the output of one level is the input to the next (upper) level
  sometimes levels are skipped (merged) or split
4
Phonetics/Orthography
Input: acoustic signal (phonetics) / text (orthography)
Output: phonetic alphabet (phonetics) / text (orthography)
Deals with:
  Phonetics:
    consonant and vowel (and other) formation in the vocal tract
    classification of consonants, vowels, ... in relation to frequencies, shape and position of the tongue and various muscles
    intonation
  Orthography: normalization, punctuation, etc.
5
Phonology
Input: sequence of phones/sounds (in a phonetic alphabet); or "normalized" text (sequence of (surface) letters in one language's alphabet) [NB: phones vs. phonemes]
Output: sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
Deals with:
  the relation between sounds and phonemes (units which might have some function on the upper level)
  e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)
6
Morphology
Input: sequence of phonemes (~ (lexical) letters)
Output: sequence of pairs (lemma, (morphological) tag)
Deals with:
  composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
  e.g.: quotations ~ quote/V + -ation(der.V->N) + NNS
7
(Surface) Syntax
Input: sequence of pairs (lemma, (morphological) tag)
Output: sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
Deals with:
  the relation between lemmas and morphological categories and the sentence structure
  uses syntactic categories such as Subject, Verb, Object, ...
  e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S
8
Meaning (semantics)
Input: sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
Output: sentence structure (tree) with annotated nodes (semantic lemmas, (morphosyntactic) tags, deep functions)
Deals with:
  the relation between categories such as "Subject", "Object" and (deep) categories such as "Agent", "Effect"; adds other categories
  e.g.: ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
9
...and Beyond
Input: sentence structure (tree) with annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
Output: logical form, which can be evaluated (true/false)
Deals with:
  assignment of objects from the real world to the nodes of the sentence structure
  e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...], Tom-Sawyer[SSN:...]) [Time:bef 99/9/27/14:15] [Place:39°19’40”N 76°37’10”W]
10
Morphology
Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word).
We can usefully divide morphemes into two classes:
  Stems: the core meaning-bearing units
  Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
    Prefix: un-, anti-, etc.
    Suffix: -ity, -ation, etc.
    Infix: inserted inside the stem
      Tagalog: um + hingi → humingi
    Circumfix: precedes and follows the stem
English does not stack many affixes, but Turkish can have words with many suffixes.
Languages such as Turkish, which tend to string affixes together, are called agglutinative languages.
11
Surface and Lexical Forms
The surface level of a word represents the actual spelling of that word.
  geliyorum    eats    cats    kitabım
The lexical level of a word represents a simple concatenation of the morphemes making up that word.
  gel +PROG +1SG    eat +AOR    cat +PLU    kitap +P1SG
Morphological processors try to find correspondences between the lexical and surface forms of words.
  Morphological recognition/analysis: surface to lexical
  Morphological generation/synthesis: lexical to surface
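The two directions can be illustrated with a minimal sketch. It assumes a hand-written table holding only the four surface/lexical pairs from the slide; the names PAIRS, analyze and generate are hypothetical, and a real morphological processor would compute the correspondence rather than look it up.

```python
# Surface/lexical pairs copied from the slide. The table and the
# function names are illustrative assumptions, not a real processor.
PAIRS = [
    ("geliyorum", "gel +PROG +1SG"),
    ("eats", "eat +AOR"),
    ("cats", "cat +PLU"),
    ("kitabım", "kitap +P1SG"),
]


def analyze(surface):
    """Recognition/analysis: surface form -> list of lexical forms."""
    return [lexical for s, lexical in PAIRS if s == surface]


def generate(lexical):
    """Generation/synthesis: lexical form -> list of surface forms."""
    return [s for s, lex in PAIRS if lex == lexical]


print(analyze("cats"))
print(generate("kitap +P1SG"))
```

Both functions return lists because, in general, one form on either level may correspond to several forms on the other.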
12
Morphology: Morphemes & Order
Handles what is an isolated form in written text
Grouping of phonemes into morphemes
  sequence deliverables ~ deliver, able and s (3 units)
Morpheme combination
  certain combinations/sequencings are possible, others are not:
    deliver+able+s, but not able+derive+s; noun+s, but not noun+ing
  typically fixed (in any given language)
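One way to make the ordering constraint concrete is a small table of allowed transitions between morpheme classes. This is only a sketch: the class names (VSTEM, NSTEM, ...), the toy lexicon and the transition table are assumptions made for illustration, not taken from the slides.

```python
# Hypothetical morpheme classes for a toy lexicon.
MORPHEME_CLASS = {
    "deliver": "VSTEM",
    "cat": "NSTEM",
    "able": "ABLE",
    "s": "S",
    "ing": "ING",
}

# Which class may follow which; "END" marks a legal stopping point.
ALLOWED_NEXT = {
    "START": {"VSTEM", "NSTEM"},
    "VSTEM": {"ABLE", "S", "ING", "END"},
    "NSTEM": {"S", "END"},
    "ABLE": {"S", "END"},
    "S": {"END"},
    "ING": {"END"},
}


def is_valid(morphemes):
    """Return True if the morpheme sequence respects the ordering table."""
    state = "START"
    for morpheme in morphemes:
        cls = MORPHEME_CLASS.get(morpheme)
        if cls is None or cls not in ALLOWED_NEXT[state]:
            return False
        state = cls
    return "END" in ALLOWED_NEXT[state]


print(is_valid(["deliver", "able", "s"]))  # deliver+able+s
print(is_valid(["able", "deliver", "s"]))  # affix before the stem
print(is_valid(["cat", "ing"]))            # noun+ing
```

The table is fixed once per language, which matches the observation that morpheme order is typically fixed in any given language.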
13
Inflectional & Derivational Morphology
We can also divide morphology into two broad classes:
  Inflectional
  Derivational
Inflectional morphology concerns the combination of stems and affixes where the resulting word
  has the same word class as the original
  serves a grammatical/semantic purpose different from the original
After combination with an inflectional morpheme, the meaning and class of the stem usually do not change.
  eat / eats    pencil / pencils
After combination with a derivational morpheme, the meaning and class of the stem usually change.
  compute / computer    do / undo    friend / friendly
  uygar / uygarlaş    kapı / kapıcı
Irregular changes may happen with derivational affixes.
14
Morphological Parsing
Morphological parsing finds the lexical form of a word from its surface form.
  cats → cat +N +PLU
  cat → cat +N +SG
  goose → goose +N +SG or goose +V
  geese → goose +N +PLU
  gooses → goose +V +3SG
  catch → catch +V
  caught → catch +V +PAST or catch +V +PP
There can be more than one lexical-level representation for a given word (ambiguity).
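A minimal sketch of such a parser, assuming a tiny hand-written lexicon plus one regular -s rule. The lexicon contents mirror the slide's examples; the function name parse and the rule for blocking a regular plural when an irregular one is listed are illustrative assumptions.

```python
# Base and irregular forms are listed explicitly: surface -> analyses.
LEXICON = {
    "cat": ["cat +N +SG"],
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PLU"],
    "catch": ["catch +V"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}

# Lemmas whose plural is listed as an irregular form ("goose" here);
# the regular -s plural is not offered for them.
IRREGULAR_PLURAL_LEMMAS = {
    analysis.split()[0]
    for analyses in LEXICON.values()
    for analysis in analyses
    if analysis.endswith("+N +PLU")
}


def parse(surface):
    """Return every lexical form found for a surface form (may be several)."""
    analyses = list(LEXICON.get(surface, []))
    if surface.endswith("s"):
        for analysis in LEXICON.get(surface[:-1], []):
            lemma, pos, *rest = analysis.split()
            if pos == "+N" and rest == ["+SG"]:
                if lemma not in IRREGULAR_PLURAL_LEMMAS:
                    analyses.append(f"{lemma} +N +PLU")
            elif pos == "+V" and not rest:
                analyses.append(f"{lemma} +V +3SG")
    return analyses


for word in ["cats", "goose", "geese", "gooses", "caught"]:
    print(word, "->", parse(word))
```

Returning a list rather than a single value is what lets the parser report ambiguity, as with goose and caught.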
15
Morphological Analysis
Analyzing words into their linguistic components (morphemes).
Morphemes are the smallest meaningful units of language.
  cars → car +PLU
  giving → give +PROG
  AsachhilAma → AsA +PROG +PAST +1st (I/we was/were coming)
Ambiguity: more than one alternative
  flies → fly VERB +3SG
          fly NOUN +PLU
  mAtAla kare
16
fly + s → flys → flies (y → i rule)
duckling
go-getter → get + er
doer → do + er
beer → ?
What knowledge do we need?
How do we represent it?
How do we compute with it?
17
Knowledge needed
Knowledge of stems or roots
  duck is a possible root, duckl is not
  we need a dictionary (lexicon)
Only some endings go on some words
  do + er: ok
  be + er: not ok
In addition, spelling-change rules that adjust the surface form
  get + er: double the t → getter
  fox + s: insert e → foxes
  fly + s: flys, then y to i and insert e → flies
  chase + ed: drop e → chased
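The four spelling-change rules can be sketched as a single function that joins a stem and a suffix. This is a rough approximation under stated assumptions: the function name join is hypothetical, the rules are checked in a fixed order, and consonant doubling is limited to stems with a single vowel letter (real English doubling depends on stress, which is not modeled here).

```python
VOWELS = "aeiou"


def join(stem, suffix):
    """Attach a suffix to a stem, applying simplified English spelling rules."""
    # y -> i plus e-insertion: fly + s -> flies
    if (suffix == "s" and stem.endswith("y")
            and len(stem) > 1 and stem[-2] not in VOWELS):
        return stem[:-1] + "ies"
    # e-insertion after sibilant-like endings: fox + s -> foxes
    if suffix == "s" and stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    # e-deletion before a vowel-initial suffix: chase + ed -> chased
    if stem.endswith("e") and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    # consonant doubling for short consonant-vowel-consonant stems:
    # get + er -> getter
    if (suffix[0] in VOWELS
            and len(stem) >= 3
            and sum(ch in VOWELS for ch in stem) == 1
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"),
                     ("chase", "ed"), ("do", "er")]:
    print(stem, "+", suffix, "->", join(stem, suffix))
```

Note that the function knows nothing about which endings go on which words: join("be", "er") still produces a string, so the lexicon and the ending restrictions remain separate pieces of knowledge.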
18
Put all this in a big dictionary (lexicon)?
  Turkish: approx. $600 \times 10^6$ forms
  Finnish: $10^7$
  Hindi, Bengali, Telugu, Tamil?
Besides, novel forms can always be constructed
  anti-missile
  anti-anti-missile
  anti-anti-anti-missile
  ...
Compounding of words: Sanskrit, German
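The anti- example shows why a finite list cannot be complete: the forms can be enumerated without end, yet a two-line rule recognizes all of them. A minimal sketch follows; the function names anti_forms and strip_anti are hypothetical.

```python
from itertools import islice


def anti_forms(stem):
    """Yield anti-stem, anti-anti-stem, ... without ever terminating."""
    form = stem
    while True:
        form = "anti-" + form
        yield form


def strip_anti(word):
    """Recognize any such form by rule: return (prefix count, stem)."""
    count = 0
    while word.startswith("anti-"):
        word = word[len("anti-"):]
        count += 1
    return count, word


print(list(islice(anti_forms("missile"), 3)))
print(strip_anti("anti-anti-anti-missile"))
```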
19
Morphology: From Morphemes to Lemmas & Categories
Lemma: lexical unit, "pointer" to the lexicon
  typically represented as the "base form", or "dictionary headword"
  possibly indexed when ambiguous/polysemous:
    state_1 (verb), state_2 (state-of-the-art), state_3 (government)
  formed from one or more morphemes ("root", "stem", "root+derivation", ...)
Categories: non-lexical
  small number of possible values (< 100, often < 5-10)
20
Morphology Level: The Mapping
Formally: $A^+ \to 2^{(L, C_1, C_2, \ldots, C_n)}$ (each phoneme sequence maps to a set of analyses)
  $A$ is the alphabet of phonemes ($A^+$ denotes any non-empty sequence of phonemes)
  $L$ is the set of possible lemmas, uniquely identified
  $C_i$ are morphological categories, such as:
    grammatical number, gender, case
    person, tense, negation, degree of comparison, voice, aspect, ...
    tone, politeness, ...
    part of speech (not quite a morphological category, but...)
$A$, $L$ and $C_i$ are obviously language-dependent
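Read as a type, the mapping takes a non-empty sequence of phonemes and returns a set of (lemma, categories) tuples. The sketch below only illustrates that shape: it uses letters as a stand-in for phonemes, and the Analysis type, the TABLE contents and the category labels are assumptions made for the example.

```python
from typing import Dict, FrozenSet, NamedTuple, Tuple


class Analysis(NamedTuple):
    lemma: str                   # an element of L
    categories: Tuple[str, ...]  # one value per category C_1 .. C_n


# Hypothetical toy table; a real analyzer would compute these sets.
TABLE: Dict[str, FrozenSet[Analysis]] = {
    "caught": frozenset({
        Analysis("catch", ("V", "PAST")),
        Analysis("catch", ("V", "PP")),
    }),
    "geese": frozenset({Analysis("goose", ("N", "PLU"))}),
}


def analyze(word: str) -> FrozenSet[Analysis]:
    """Map an element of A+ to a subset of (L, C_1, ..., C_n)."""
    if not word:
        raise ValueError("A+ contains only non-empty sequences")
    return TABLE.get(word, frozenset())


print(sorted(analyze("caught")))
print(analyze("xyz"))
```

The return type is a set because the mapping goes into the power set: an ambiguous form yields several tuples and an unknown form yields the empty set.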
21
Morphological Analysis (cont.)
Relatively simple for English.
But for many Indian languages, it may be more difficult.
Examples: inflectional and derivational morphology.
Common tools: finite-state transducers
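As a rough illustration of the finite-state transducer idea, the sketch below hand-codes a deterministic transducer for the single stem cat, mapping lexical symbols to surface letters. The state numbers, the transition table and the function name transduce are assumptions for this example; practical systems build much larger transducers from a lexicon and rules rather than writing them by hand.

```python
# (state, input symbol) -> (output string, next state)
# Lexical side: c a t +N +SG / +PLU; surface side: "cat" / "cats".
TRANSITIONS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "+N"): ("", 4),      # the tag produces no surface letters
    (4, "+SG"): ("", 5),
    (4, "+PLU"): ("s", 5),
}
FINAL_STATES = {5}


def transduce(symbols):
    """Run the transducer; return the surface string, or None if rejected."""
    state = 0
    output = []
    for symbol in symbols:
        step = TRANSITIONS.get((state, symbol))
        if step is None:
            return None
        emitted, state = step
        output.append(emitted)
    return "".join(output) if state in FINAL_STATES else None


print(transduce(["c", "a", "t", "+N", "+PLU"]))
print(transduce(["c", "a", "t", "+N", "+SG"]))
print(transduce(["c", "a", "t"]))
```

This runs in the generation direction (lexical to surface). Swapping the input and output sides of each transition gives the analysis direction, though the empty-string transitions then need a search rather than a simple table lookup.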