CS60057 Speech & Natural Language Processing, Autumn 2005. Lecture 3, 27 July 2005.


1 CS60057 Speech & Natural Language Processing, Autumn 2005, Lecture 3, 27 July 2005

2 The Description of Language
Language = Words and Rules: Dictionary (vocabulary) + Grammar
Dictionary: the set of words defined in the language; open (dynamic)
  - Traditional: paper based
  - Electronic: machine-readable dictionaries; can be obtained from paper-based ones
Grammar: the set of rules that describe what is allowable in a language
Classic grammars
  - meant for humans who know the language
  - definitions and rules are mainly supported by examples
  - no (or almost no) formal description tools; cannot be programmed
Explicit grammars (CFG, Dependency Grammars, Link Grammars, ...)
  - formal description
  - can be programmed and tested on data (texts)

3 Levels of (Formal) Description
6 basic levels (more or less explicitly present in most theories), listed from highest to lowest:
  - and beyond (pragmatics/logic/...)
  - meaning (semantics)
  - (surface) syntax
  - morphology
  - phonology
  - phonetics/orthography
Each level has an input and an output representation
  - output from one level is the input to the next (upper) level
  - sometimes levels might be skipped (merged) or split

4 Phonetics/Orthography
Input:
  - acoustic signal (phonetics) / text (orthography)
Output:
  - phonetic alphabet (phonetics) / text (orthography)
Deals with:
  - Phonetics:
      - consonant & vowel (& others) formation in the vocal tract
      - classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles
      - intonation
  - Orthography: normalization, punctuation, etc.

5 Phonology
Input:
  - sequence of phones/sounds (in a phonetic alphabet); or "normalized" text (sequence of (surface) letters in one language's alphabet) [NB: phones vs. phonemes]
Output:
  - sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
Deals with:
  - relation between sounds and phonemes (units which might have some function on the upper level)
  - e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

6 Morphology
Input:
  - sequence of phonemes (~ (lexical) letters)
Output:
  - sequence of pairs (lemma, (morphological) tag)
Deals with:
  - composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
  - e.g. quotations ~ quote/V + -ation (der. V->N) + NNS

7 (Surface) Syntax
Input:
  - sequence of pairs (lemma, (morphological) tag)
Output:
  - sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
Deals with:
  - the relation between lemmas & morphological categories and the sentence structure
  - uses syntactic categories such as Subject, Verb, Object, ...
  - e.g.: I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

8 Meaning (semantics)
Input:
  - sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
Output:
  - sentence structure (tree) with annotated nodes (semantic lemmas, (morphosyntactic) tags, deep functions)
Deals with:
  - relation between surface categories such as "Subject", "Object" and (deep) categories such as "Agent", "Effect"; adds other categories
  - e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

9 ...and Beyond
Input:
  - sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
Output:
  - logical form, which can be evaluated (true/false)
Deals with:
  - assignment of objects from the real world to the nodes of the sentence structure
  - e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...], Tom-Sawyer[SSN:...]) [Time: bef 99/9/27/14:15] [Place: 39°19′40″N 76°37′10″W]

10 Morphology
Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word).
We can usefully divide morphemes into two classes:
  - Stems: the core meaning-bearing units
  - Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
      - Prefixes: un-, anti-, etc.
      - Suffixes: -ity, -ation, etc.
      - Infixes: inserted inside the stem, e.g. Tagalog: um + hingi -> humingi
      - Circumfixes: precede and follow the stem
English does not stack many affixes, but Turkish can have words with a lot of suffixes.
Languages that tend to string affixes together, such as Turkish, are called agglutinative languages.

11 Surface and Lexical Forms
The surface level of a word represents the actual spelling of that word.
  - geliyorum, eats, cats, kitabım
The lexical level of a word represents a simple concatenation of the morphemes making up that word.
  - gel +PROG +1SG
  - eat +AOR
  - cat +PLU
  - kitap +P1SG
Morphological processors try to find correspondences between lexical and surface forms of words.
  - Morphological recognition/analysis: surface to lexical
  - Morphological generation/synthesis: lexical to surface

12 Morphology: Morphemes & Order
Handles what is an isolated form in written text
Grouping of phonemes into morphemes
  - the sequence deliverables -> deliver, able and s (3 units)
Morpheme combination
  - certain combinations/sequences are possible, others are not:
      - deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing
  - the order is typically fixed (in any given language)
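The ordering constraint on this slide can be sketched as a small table of which morpheme class may follow which. This is only an illustration: the class names, the `MORPHEME_CLASS` and `ALLOWED_NEXT` tables, and the function name are assumptions made for the example, not part of the lecture.

```python
# Hypothetical toy lexicon: morpheme -> morpheme class (assumed for illustration).
MORPHEME_CLASS = {"deliver": "verb_stem", "able": "adj_suffix", "s": "plural"}

# Hypothetical ordering table: which class may follow which.
# "START" and "END" mark the word boundaries.
ALLOWED_NEXT = {
    "START": {"verb_stem"},
    "verb_stem": {"adj_suffix", "END"},
    "adj_suffix": {"plural", "END"},
    "plural": {"END"},
}


def is_valid_sequence(morphemes):
    """Return True if the morphemes appear in an allowed order."""
    state = "START"
    for morpheme in morphemes:
        cls = MORPHEME_CLASS.get(morpheme)
        # Unknown morpheme, or a class that may not follow the current one.
        if cls is None or cls not in ALLOWED_NEXT[state]:
            return False
        state = cls
    # The word may only end after a class that allows "END".
    return "END" in ALLOWED_NEXT[state]


print(is_valid_sequence(["deliver", "able", "s"]))
print(is_valid_sequence(["able", "deliver", "s"]))
```

Under these assumed tables, the first call accepts deliver+able+s and the second rejects the same morphemes in a different order.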

13 Inflectional & Derivational Morphology
We can also divide morphology up into two broad classes:
  - Inflectional
  - Derivational
Inflectional morphology concerns the combination of stems and affixes where the resulting word
  - has the same word class as the original
  - serves a grammatical/semantic purpose different from the original
After combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.
  - eat / eats, pencil / pencils
After combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
  - compute / computer, do / undo, friend / friendly
  - uygar / uygarlaş, kapı / kapıcı
Irregular changes may occur with derivational affixes.

14 Morphological Parsing
Morphological parsing finds the lexical form of a word from its surface form.
  - cats -- cat +N +PLU
  - cat -- cat +N +SG
  - goose -- goose +N +SG or goose +V
  - geese -- goose +N +PLU
  - gooses -- goose +V +3SG
  - catch -- catch +V
  - caught -- catch +V +PAST or catch +V +PP
There can be more than one lexical-level representation for a given word (ambiguity).
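A minimal sketch of morphological parsing as plain table lookup, using only the surface/lexical pairs listed on this slide. The names `ANALYSES`, `INDEX` and `parse` are assumptions for illustration; a real analyzer would not enumerate every word form. The point is that `parse` returns a list, so ambiguous forms naturally yield more than one analysis.

```python
from collections import defaultdict

# Surface form -> lexical form pairs, taken from the slide above.
ANALYSES = [
    ("cats", "cat +N +PLU"),
    ("cat", "cat +N +SG"),
    ("goose", "goose +N +SG"),
    ("goose", "goose +V"),
    ("geese", "goose +N +PLU"),
    ("gooses", "goose +V +3SG"),
    ("catch", "catch +V"),
    ("caught", "catch +V +PAST"),
    ("caught", "catch +V +PP"),
]

# Index the pairs by surface form; one surface form may map to several analyses.
INDEX = defaultdict(list)
for surface_form, lexical_form in ANALYSES:
    INDEX[surface_form].append(lexical_form)


def parse(surface):
    """Return every known lexical form for a surface form (empty list if unknown)."""
    return list(INDEX.get(surface, []))


for word in ("cats", "goose", "caught", "dog"):
    print(word, "->", parse(word))
```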

15 Morphological Analysis
Analyzing words into their linguistic components (morphemes).
Morphemes are the smallest meaningful units of language.
  - cars -> car +PLU
  - giving -> give +PROG
  - AsachhilAma -> AsA +PROG +PAST +1st (I/we was/were coming)
Ambiguity: more than one alternative
  - flies -> fly VERB +3SG
  - flies -> fly NOUN +PLU
  - mAtAla kare

16
  - Fly + s -> flys -> flies (y -> i rule)
  - Duckling
  - Go-getter -> get + er
  - Doer -> do + er
  - Beer -> ?
What knowledge do we need?
How do we represent it?
How do we compute with it?

17 Knowledge needed
Knowledge of stems or roots
  - Duck is a possible root, not duckl
  - We need a dictionary (lexicon)
Only some endings go on some words
  - Do + er: ok
  - Be + er: not ok
In addition, spelling change rules that adjust the surface form
  - Get + er: double the t -> getter
  - Fox + s: insert e -> foxes
  - Fly + s: flys, then y to i and insert e -> flies
  - Chase + ed: drop e -> chased
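The two kinds of knowledge on this slide (a lexicon that says which endings a stem accepts, and spelling rules that adjust the surface form) can be sketched together as follows. The `LEXICON` contents, the function names, and the exact rule conditions are assumptions made for illustration; in particular, the consonant-doubling test is a crude approximation that only looks at the last three letters.

```python
VOWELS = set("aeiou")

# Hypothetical toy lexicon: stem -> endings it is allowed to take.
LEXICON = {
    "get": {"er", "s"},
    "do": {"er"},
    "be": set(),          # "be" + "er" is not allowed
    "fox": {"s"},
    "fly": {"s"},
    "chase": {"ed", "s"},
    "duck": {"s", "ling"},
}


def ends_in_cvc(stem):
    """Crude test: the stem ends in consonant-vowel-consonant (last letter not w, x, y)."""
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    return c1 not in VOWELS and v in VOWELS and c2 not in VOWELS and c2 not in "wxy"


def attach(stem, suffix):
    """Return the surface form of stem + suffix, or None if the lexicon forbids it."""
    if suffix not in LEXICON.get(stem, set()):
        return None
    if suffix == "s":
        # y -> i plus e-insertion: fly + s -> flies
        if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
            return stem[:-1] + "ies"
        # e-insertion after sibilants: fox + s -> foxes
        if stem.endswith(("s", "x", "z", "ch", "sh")):
            return stem + "es"
        return stem + "s"
    if suffix[0] in VOWELS:
        # e-deletion before a vowel-initial suffix: chase + ed -> chased
        if stem.endswith("e"):
            return stem[:-1] + suffix
        # consonant doubling: get + er -> getter
        if ends_in_cvc(stem):
            return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"),
                     ("chase", "ed"), ("do", "er"), ("be", "er")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

Keeping the lexicon check separate from the spelling rules mirrors the slide: the first decides whether a combination is allowed at all, the second only decides how an allowed combination is spelled.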

18 Put all this in a big dictionary (lexicon)
  - Turkish: approx. 600 × 10^6 forms
  - Finnish: 10^7
  - Hindi, Bengali, Telugu, Tamil?
Besides, novel forms can always be constructed
  - Anti-missile
  - Anti-anti-missile
  - Anti-anti-anti-missile
  - ...
Compounding of words: Sanskrit, German
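The anti-missile series shows why a finite word list cannot be complete: a rule that strips a repeatable prefix covers unboundedly many forms with a one-entry stem list. A minimal sketch, where `PREFIXES`, `STEMS` and `analyze` are hypothetical names and the hyphen is assumed to mark every morpheme boundary:

```python
# Hypothetical toy data: one repeatable prefix and one known stem.
PREFIXES = ("anti",)
STEMS = {"missile"}


def analyze(word):
    """Split a hyphenated word into its prefixes and stem, or return None if the stem is unknown."""
    parts = word.lower().split("-")
    prefixes = []
    # Peel off prefixes from the left, always leaving at least one part as the stem.
    while len(parts) > 1 and parts[0] in PREFIXES:
        prefixes.append(parts.pop(0))
    stem = "-".join(parts)
    if stem not in STEMS:
        return None
    return prefixes + [stem]


for word in ("missile", "Anti-missile", "Anti-anti-anti-missile"):
    print(word, "->", analyze(word))
```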

19 Morphology: From Morphemes to Lemmas & Categories
Lemma: lexical unit, "pointer" to lexicon
  - typically represented as the "base form", or "dictionary headword"
  - possibly indexed when ambiguous/polysemous: state_1 (verb), state_2 (state-of-the-art), state_3 (government)
  - formed from one or more morphemes ("root", "stem", "root+derivation", ...)
Categories: non-lexical
  - small number of possible values (< 100, often < 5-10)

20 Morphology Level: The Mapping
Formally: $A^+ \to 2^{(L, C_1, C_2, \ldots, C_n)}$
  - $A$ is the alphabet of phonemes ($A^+$ denotes any non-empty sequence of phonemes)
  - $L$ is the set of possible lemmas, uniquely identified
  - $C_i$ are morphological categories, such as:
      - grammatical number, gender, case
      - person, tense, negation, degree of comparison, voice, aspect, ...
      - tone, politeness, ...
      - part of speech (not quite a morphological category, but ...)
  - $A$, $L$ and $C_i$ are obviously language-dependent

21 Morphological Analysis (cont.)
Relatively simple for English, but for many Indian languages it may be more difficult.
Examples: inflectional and derivational morphology.
Common tools: finite-state transducers
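To give a feel for the finite-state transducers named on this slide, here is a minimal hand-built sketch that reads a surface form letter by letter and emits a lexical form. The state numbers, the arc table and the two-word vocabulary (cat, fox) are all assumptions chosen for illustration; practical systems compile such machines from a lexicon and rule files rather than writing arcs by hand.

```python
# Arcs: (state, input letter) -> list of (output string, next state).
# The machine accepts cat, cats, fox, foxes. The surface "e" in "foxes"
# is consumed with empty output, which undoes the e-insertion spelling rule.
ARCS = {
    (0, "c"): [("c", 1)],
    (1, "a"): [("a", 2)],
    (2, "t"): [("t", 3)],
    (3, "s"): [(" +N +PLU", 4)],
    (0, "f"): [("f", 5)],
    (5, "o"): [("o", 6)],
    (6, "x"): [("x", 7)],
    (7, "e"): [("", 8)],
    (8, "s"): [(" +N +PLU", 4)],
}

# Final states and the output emitted when the input ends there.
FINAL = {3: " +N +SG", 7: " +N +SG", 4: ""}


def transduce(surface):
    """Return all lexical forms the transducer produces for a surface form."""
    results = []

    def walk(state, position, output):
        if position == len(surface):
            if state in FINAL:
                results.append(output + FINAL[state])
            return
        # Follow every arc that matches the next input letter.
        for emitted, next_state in ARCS.get((state, surface[position]), []):
            walk(next_state, position + 1, output + emitted)

    walk(0, 0, "")
    return results


for word in ("cat", "cats", "foxes", "foxs"):
    print(word, "->", transduce(word))
```

Because each arc maps to a list of continuations, the same search would return several analyses if ambiguous arcs were added, which matches the ambiguity discussed on the earlier slides.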

