ICS 482 Natural Language Processing
Words & Transducers: Morphology - 1
Muhammed Al-Mulhem
March 1, 2009



Steps of NLP
–Morphological Analysis: individual words are analyzed into their components.
–Syntactic Analysis: linear sequences of words are transformed into structures that show how the words relate to each other.
–Semantic Analysis: a transformation is made from the input text to an internal representation that reflects the meaning.
–Discourse Analysis: resolving references between sentences.
–Pragmatic Analysis: reinterpreting what was said as what was actually meant.

Morphology
Morphology: the study of the way words are built up from smaller meaning-bearing units, morphemes.
Morpheme: the minimal meaning-bearing unit in a language.
Example
–fox: one morpheme, fox.
–cats: two morphemes, cat and -s.

Morpheme Definitions
There are two broad classes of morphemes:
–Stem: the main morpheme of the word, supplying the main meaning.
–Affix: adds additional meaning of various kinds. Affixes are further divided into prefixes, suffixes, infixes, and circumfixes.

Morpheme Definitions
–Prefixes precede the stem.
–Suffixes follow the stem.
–Circumfixes do both.
–Infixes are inserted inside the stem.

Examples
–Eats: composed of the stem eat and the suffix -s.
–Unbuckle: composed of the stem buckle and the prefix un-.
English doesn’t really have circumfixes, but many other languages do. In German, for example, the past participle of some verbs is formed by adding ge- to the beginning of the stem and -t to the end. For example, the past participle of the verb sagen (to say) is gesagt (said).

Morpheme Definitions
A word can have more than one affix. For example:
–The word rewrites has the prefix re-, the stem write, and the suffix -s.
–The word unbelievably has the stem believe plus three affixes (un-, -able, and -ly).
English rarely stacks more than four or five affixes. Other languages, like Turkish, can have words with ten affixes.

Morpheme Definitions
There are many ways to combine morphemes to create words. Four of these are:
–Inflection
–Derivation
–Compounding
–Cliticization

Inflection
It is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem and usually filling some syntactic function like agreement. For example:
–English has the inflectional morpheme -s for marking the plural on nouns.
–English has the inflectional morpheme -ed for marking the past tense on verbs.

Derivation
It is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning that is hard to predict exactly. For example:
–The verb computerize can take the derivational suffix -ation to produce the noun computerization.

Compounding
It is the combination of multiple word stems. For example:
–The noun doghouse is the concatenation of the morpheme dog with the morpheme house.

Cliticization
It is the combination of a word stem with a clitic. A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached to another word. For example:
–The English morpheme ’ve in the word I’ve is a clitic.

Inflectional Morphology
English nouns have only two kinds of inflection:
–an affix that marks plural, and
–an affix that marks possessive.
Examples: regular and irregular plurals.

Inflectional Morphology
While the regular plural is spelled -s after most nouns, it is spelled -es after words ending in -s (ibis/ibises), -z (waltz/waltzes), -sh (thrush/thrushes), -ch (finch/finches), and sometimes -x (box/boxes). Nouns ending in -y preceded by a consonant change the -y to -i and add -es (butterfly/butterflies).
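The spelling rule above can be sketched in Python. This is only an illustration: the function name regular_plural is hypothetical, irregular nouns are not handled, and the "sometimes" in the -x case is simplified to "always".

```python
VOWELS = "aeiou"

def regular_plural(noun: str) -> str:
    """Spell the regular plural of an English noun (irregular nouns are not handled)."""
    # -es after -s, -z, -sh, -ch, and (simplification) always after -x
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    # consonant + y: change y to i and add -es
    if len(noun) >= 2 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # default: plain -s
    return noun + "s"

for word in ["cat", "ibis", "waltz", "thrush", "finch", "box", "butterfly", "toy"]:
    print(word, "->", regular_plural(word))
```

A rule like this over-generates for irregular nouns (it would produce gooses for goose), which is one reason a morphological parser also needs a lexicon.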

Inflectional Morphology
The possessive suffix is realized by:
–an apostrophe + -s for regular singular nouns (llama’s) and plural nouns not ending in -s (children’s), and
–a lone apostrophe after regular plural nouns (llamas’) and some names ending in -s or -z (Euripides’ comedies).

Inflectional Morphology
English has three kinds of verbs:
–main verbs (eat, sleep, impeach),
–modal verbs (can, will, should), and
–primary verbs (be, have, do).
We will mostly be concerned with the main and primary verbs, because it is these that have inflectional endings.

Inflectional Morphology
Regular verbs (e.g. walk or inspect) have four morphological forms: the stem (walk), the -s form (walks), the -ing participle (walking), and the past form or -ed participle (walked).

Inflectional Morphology
Irregular verbs in English can have as many as eight forms (e.g. the verb be) or as few as three (e.g. cut or hit).

Derivational Morphology
A common kind of derivation in English is the formation of new nouns from verbs or adjectives (nominalization). Adjectives can also be derived from nouns and verbs.

Cliticization Morphology
English clitics include reduced forms of auxiliary verbs, such as ’m, ’re, ’s, ’ve, ’ll, and ’d. Clitics can be ambiguous. Example: she’s can mean she is or she has.

Compounding Morphology
The kind of morphology we have discussed so far, in which a word is composed of a string of morphemes concatenated together, is often called concatenative morphology. A number of languages have extensive non-concatenative morphology, in which morphemes are combined in more complex ways. One kind of non-concatenative morphology is called templatic morphology or root-and-pattern morphology. Example: read Chapter 3.

Agreement
We say that the subject noun and the main verb in English have to agree in number, meaning that the two must either be both singular or both plural. There are other kinds of agreement processes. For example, nouns, adjectives, and sometimes verbs in many languages are marked for gender. A gender is a kind of equivalence class that is used by the language to categorize the nouns; each noun falls into one class.

Agreement
Many languages (for example, Romance languages like French, Spanish, or Italian) have two genders, which are referred to as masculine and feminine. Other languages (like most Germanic and Slavic languages) have three (masculine, feminine, neuter). Gender is sometimes marked explicitly on a noun; for example, Spanish masculine words often end in -o and feminine words in -a.

Agreement
But in many cases the gender is not marked in the letters or phones of the noun itself. Instead, it is a property of the word that must be stored in a lexicon, as in the table below:

Parsing
Taking a surface input and identifying its components and underlying structure.
Morphological parsing: parsing a word into stem and affixes and identifying the parts and their relationships.
–Stem and features:
goose → goose +N +SG or goose +V
geese → goose +N +PL
gooses → goose +V +3SG
–Bracketing: indecipherable → [in [[de [cipher]] able]]

Why parse words?
–For spell-checking: is muncheble a legal word?
–To identify a word’s part-of-speech (POS): for sentence parsing, for machine translation, …
–To identify a word’s stem: for information retrieval
Why not just list all word forms in a lexicon?

What do we need to build a morphological parser?
–Lexicon: stems and affixes (with corresponding part of speech (POS))
–Morphotactics of the language: a model of how morphemes can be affixed to a stem
–Orthographic rules: spelling modifications that occur when affixation occurs, e.g. in → il in the context of l (in- + legal → illegal)
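The orthographic rule mentioned above (in becomes il before l) can be sketched as a one-line spelling adjustment. The function name attach_in_prefix is hypothetical, and only the single context given on the slide is covered.

```python
def attach_in_prefix(stem: str) -> str:
    """Attach the negative prefix in- to a stem, applying the in -> il rule before l."""
    prefix = "il" if stem.startswith("l") else "in"
    return prefix + stem

print(attach_in_prefix("legal"))   # illegal
print(attach_in_prefix("decent"))  # indecent
```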

Syntax and Morphology
–Phrase-level agreement
Subject-verb: Ali studies hard (STUDY+3SG)
–Sub-word phrasal structures
ولحاجاتنا
و + لـ + حاجات + نا
and+for+need+PL+Poss:1PL
And for our needs

Morphotactic Models
English nominal inflection
[Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, irreg-sg-n, irreg-pl-n, and plural (-s)]
Inputs: cats, goose, geese
–reg-n: regular noun
–irreg-pl-n: irregular plural noun
–irreg-sg-n: irregular singular noun
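A minimal sketch of this morphotactic model as a recognizer follows. The arc layout is an assumption, since the figure itself is not reproduced here: reg-n leads from q0 to q1, plural (-s) from q1 to q2, both irregular classes lead from q0 directly to q2, and q1 and q2 are final. The tiny lexicon is also hypothetical.

```python
# Hypothetical mini-lexicon; the class names follow the slide.
LEXICON = {
    "reg-n": {"cat", "fox", "dog"},
    "irreg-sg-n": {"goose", "mouse"},
    "irreg-pl-n": {"geese", "mice"},
}

# Assumed arcs of the morphotactic FSA: (state, arc label) -> next state
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
    ("q1", "plural"): "q2",
}
FINAL_STATES = {"q1", "q2"}

def accepts(word: str) -> bool:
    """Return True if word is a lexicon stem, optionally followed by plural -s."""
    for noun_class, stems in LEXICON.items():
        for stem in stems:
            if not word.startswith(stem):
                continue
            state = TRANSITIONS[("q0", noun_class)]
            rest = word[len(stem):]
            if rest == "s":
                # follow the plural arc if the current state has one
                state = TRANSITIONS.get((state, "plural"))
                rest = ""
            if rest == "" and state in FINAL_STATES:
                return True
    return False

for word in ["cats", "goose", "geese", "gooses"]:
    print(word, accepts(word))
```

Because the model contains no spelling rules, it accepts foxs rather than foxes; the orthographic rules are handled separately.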

Derivational morphology: adjective fragment
[Figure: FSA with states q0–q5 and arcs labeled un-, ε, adj-root1, adj-root2, (-er, -ly, -est), and (-er, -est)]
–adj-root1: clear, happy, real
–adj-root2: big, red

Using FSAs to Represent the Lexicon and Do Morphological Recognition
Lexicon: we can expand each non-terminal in our NFSA into each stem in its class (e.g. adj-root2 = {big, red}) and expand each such stem to the letters it includes (e.g. red → r e d, big → b i g).
[Figure: letter-level FSA with states q0–q7 and arcs labeled r, e, d, b, i, g, and (-er, -est)]

Limitations
To cover all of English will require very large FSAs, with consequent search problems:
–Adding new items to the lexicon means re-computing the FSA
–Non-determinism
FSAs can only tell us whether a word is in the language or not. What if we want to know more?
–What is the stem?
–What are the affixes?
–We used this information to build our FSA: can we get it back?

Parsing with Finite State Transducers
cats → cat +N +PL
Kimmo Koskenniemi’s two-level morphology:
–Words are represented as correspondences between the lexical level (the morphemes) and the surface level (the orthographic word)
–Morphological parsing: building mappings between the lexical and surface levels
Lexical: cat+N+PL
Surface: cats
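The lexical/surface correspondence for cats can be written out as a sequence of symbol pairs, one pair per position. The sketch below is a simplification that skips any intermediate level and assumes +N pairs with the empty string and +PL pairs with s.

```python
EPSILON = ""  # the empty string stands in for "no surface symbol"

# One (lexical symbol, surface symbol) pair per position.
pairs = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPSILON), ("+PL", "s")]

lexical = "".join(lex for lex, _ in pairs)
surface = "".join(surf for _, surf in pairs)

print(lexical)  # cat+N+PL
print(surface)  # cats
```

Reading the pairs from the surface side to the lexical side is parsing; reading them the other way is generation.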

Finite State Transducers
FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets.
In general, FSTs can be used as a:
–Translator (Hello:مرحبا)
–Parser/generator (Hello:How may I help you?)
–Mapping between the lexical and surface levels of Kimmo’s 2-level morphology

An FST is a 5-tuple consisting of
–Q: a set of states {q0, q1, q2, q3, q4}
–Σ: an alphabet of complex symbols, each an i:o pair such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet), so Σ ⊆ I × O
–q0: a start state
–F: a set of final states in Q {q4}
–δ(q, i:o): a transition function mapping Q × Σ to Q
Example: Emphatic Sheep → Quizzical Cow
[Figure: FST with states q0–q4 and arcs labeled b:m, a:o, and !:?]
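The 5-tuple can be sketched as a small Python class. The arc layout of the sheep-to-cow machine is an assumption (b:m, then two a:o arcs, an a:o self-loop on q3, then !:?), and the transition table is keyed on the input symbol only, which works here because the machine is deterministic on its input side.

```python
from dataclasses import dataclass

@dataclass
class FST:
    states: set     # Q
    alphabet: set   # Sigma: a set of (input, output) symbol pairs
    start: str      # q0
    finals: set     # F
    delta: dict     # (state, input symbol) -> (output symbol, next state)

    def transduce(self, text):
        """Return the output string, or None if the input is rejected."""
        state, out = self.start, []
        for ch in text:
            if (state, ch) not in self.delta:
                return None
            symbol, state = self.delta[(state, ch)]
            out.append(symbol)
        return "".join(out) if state in self.finals else None

sheep_to_cow = FST(
    states={"q0", "q1", "q2", "q3", "q4"},
    alphabet={("b", "m"), ("a", "o"), ("!", "?")},
    start="q0",
    finals={"q4"},
    delta={
        ("q0", "b"): ("m", "q1"),
        ("q1", "a"): ("o", "q2"),
        ("q2", "a"): ("o", "q3"),
        ("q3", "a"): ("o", "q3"),  # assumed self-loop for extra a's
        ("q3", "!"): ("?", "q4"),
    },
)

print(sheep_to_cow.transduce("baa!"))   # moo?
print(sheep_to_cow.transduce("baaa!"))  # mooo?
print(sheep_to_cow.transduce("ba!"))    # None
```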

FST for a 2-level Lexicon
Example
–reg-n: c a t
–irreg-pl-n: g o:e o:e s e
–irreg-sg-n: g o o s e
[Figure: letter-level FSTs for cat (states q0–q3) and for goose/geese (states q0–q5)]

FST for English Nominal Inflection
[Figure: FST with states q0–q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:ε, +SG:-#, +PL:-s#, and +PL:^s#]
Combining (by cascade or composition) this FSA with FSAs for each noun type replaces, e.g., reg-n with every regular noun representation in the lexicon.
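The input/output behavior of this transducer can be imitated with a plain function. This is not an FST implementation, only a sketch of the mapping it computes, under these assumptions: ^ marks a morpheme boundary, # marks the end of the word, regular plurals map to ^s#, and irregular plurals are looked up whole. The word lists and the name lexical_to_intermediate are hypothetical.

```python
REG_NOUNS = {"cat", "fox"}
IRREG_SG = {"goose": "goose"}
IRREG_PL = {"goose": "geese"}  # lexical stem -> plural form (g o:e o:e s e)

def lexical_to_intermediate(lexical):
    """Map e.g. 'fox+N+PL' to 'fox^s#'; return None if there is no analysis."""
    parts = lexical.split("+")
    if len(parts) != 3 or parts[1] != "N":
        return None
    stem, _, number = parts
    if stem in REG_NOUNS:
        if number == "SG":
            return stem + "#"        # +N:epsilon, +SG:#
        if number == "PL":
            return stem + "^s#"      # +N:epsilon, +PL:^s#
    if number == "SG" and stem in IRREG_SG:
        return IRREG_SG[stem] + "#"
    if number == "PL" and stem in IRREG_PL:
        return IRREG_PL[stem] + "#"
    return None

for form in ["cat+N+SG", "fox+N+PL", "goose+N+SG", "goose+N+PL"]:
    print(form, "->", lexical_to_intermediate(form))
```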

Orthographic Rules and FSTs
Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
Lexical: fox+N+PL
Intermediate: fox^s#
Surface: foxes
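The step from the intermediate level to the surface level can be sketched with a regular-expression substitution standing in for the rule FST. Only ‘e’ insertion is covered, and the contexts (after x, s, z, ch, or sh) are an assumption based on the plural spelling rule; the function name is hypothetical.

```python
import re

def intermediate_to_surface(form: str) -> str:
    """Apply 'e' insertion, then remove the boundary markers ^ and #."""
    # Insert e between a stem ending in x, s, z, ch or sh and the suffix s
    form = re.sub(r"(x|s|z|ch|sh)\^(s#)", r"\g<1>e\g<2>", form)
    # Remaining boundary symbols have no surface realization
    return form.replace("^", "").replace("#", "")

for form in ["fox^s#", "cat^s#", "watch^s#", "goose#"]:
    print(form, "->", intermediate_to_surface(form))
```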

Note: These FSTs can be used for generation as well as recognition by simply exchanging the input and output alphabets (e.g. ^s#:+PL)
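Exchanging input and output amounts to swapping the two halves of every arc label, in the same way that +PL:^s# becomes ^s#:+PL. The sketch below shows the swap on the small sheep-to-cow machine, whose arc layout is assumed. Inverting a dictionary like this only works when no two arcs leaving the same state share an output symbol; in general the inverted machine can be non-deterministic.

```python
# Assumed sheep-to-cow arcs: (state, input) -> (output, next state)
DELTA = {
    ("q0", "b"): ("m", "q1"),
    ("q1", "a"): ("o", "q2"),
    ("q2", "a"): ("o", "q3"),
    ("q3", "a"): ("o", "q3"),
    ("q3", "!"): ("?", "q4"),
}

def invert(delta):
    """Swap the input and output symbol on every arc."""
    return {(state, out): (inp, nxt) for (state, inp), (out, nxt) in delta.items()}

def run(delta, text, start="q0", finals=frozenset({"q4"})):
    """Transduce text with the given arcs; return None if it is rejected."""
    state, out = start, []
    for ch in text:
        if (state, ch) not in delta:
            return None
        symbol, state = delta[(state, ch)]
        out.append(symbol)
    return "".join(out) if state in finals else None

print(run(DELTA, "baa!"))          # moo?
print(run(invert(DELTA), "moo?"))  # baa!
```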

Administration
–Next Sunday: Quiz 1, 20 minutes, in class
–Assignment 2: What were your findings about Python?
–New Assignment (3)

Assignment 3: Part 1
A genre for your Corpora
Choose a domain for your corpora:
–Technology and Computers
–Management
–Weather
–Sport
–Economics
–Politics
–Education
–Health care
–History
–Traditional Poems
–New Poems
–Other suggested fields

Assignment 3: Part 1
A genre for your Corpora
–Put your choice on the discussion list named 'My Corpora'.
–Read other selections first, and avoid selecting a topic that has already been selected.
–You may suggest an unlisted field, by arrangement with the instructor.
–Collect text files and keep them in one directory as your corpora for future use.
–Suggested total size (sum of sizes of all text files): larger than 10 MB of Arabic text

Assignment 3: Part 2
List text files in a chosen directory
Write a program that allows the user to browse and select a directory, then lists the names of the text files in that directory. This program will be needed for future assignments and the course project. You can use any language you have mastered. However, Python might be a good choice.
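One possible starting point in Python, using only the standard library, is sketched below. It assumes that "text files" means files ending in .txt and leaves out the directory browser; the function name list_text_files is hypothetical.

```python
from pathlib import Path

def list_text_files(directory: str, pattern: str = "*.txt") -> list[str]:
    """Return the sorted names of the files in directory that match pattern."""
    folder = Path(directory)
    if not folder.is_dir():
        raise NotADirectoryError(directory)
    return sorted(p.name for p in folder.glob(pattern) if p.is_file())
```

For the browsing step, the standard library's tkinter.filedialog.askdirectory() opens a directory chooser and returns the selected path, which could then be passed to list_text_files.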

Assignment 3: Part 3
The most used n words & their frequencies in your corpora
After building your corpora, you need to find the 100 most used words in your corpora with their frequencies. You might do that by writing a program that lets the user choose the directory of the corpora where the text files are located and finds the n most used words, where n could be 100.
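A minimal sketch of the counting step follows. It assumes UTF-8 encoded .txt files and treats every run of word characters as a word, which is a crude tokenizer: in Python, \w does not match Arabic diacritic marks, so vocalized text would need extra normalization first. The function name most_common_words is hypothetical.

```python
import re
from collections import Counter
from pathlib import Path

def most_common_words(directory: str, n: int = 100) -> list[tuple[str, int]]:
    """Return the n most frequent words in the .txt files of directory."""
    counts = Counter()
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # \w+ matches runs of Unicode letters, digits and underscores
        counts.update(re.findall(r"\w+", text))
    return counts.most_common(n)
```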

Deliverables
A report that contains the following headings:
–Introduction
–Description & Specification
–Corpora (genre) and the reason
–Design: High Level architecture
–Implementation Issues
–Accomplished Parts
–Unaccomplished Parts
–Problem faced
–Suggested Enhancements
–Test Cases and Screen Shots
–How to compile and run the source code
–Conclusion
–Recommendation
The source code of the program and the executable version of the same code. Please do not include the source code in the report.

Thank you
Peace be upon you, and the mercy of God