9/11/2015 1 Morphology and Finite-state Transducers: Part 1 ICS 482: Natural Language Processing Lecture 5 Husni Al-Muhtaseb.

Slides:



Advertisements
Similar presentations
12/18/20141 Machine Translation ICS 482 Natural Language Processing Lecture 29-2: Machine Translation Husni Al-Muhtaseb.
Advertisements

Finite-state automata and Morphology
Jing-Shin Chang1 Morphology & Finite-State Transducers Morphology: the study of constituents of words Word = {a set of morphemes, combined in language-dependent.
CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
4/14/20151 REGULAR EXPRESSIONS AND AUTOMATA Lecture 3: REGULAR EXPRESSIONS AND AUTOMATA Husni Al-Muhtaseb.
Computational Morphology. Morphology S.Ananiadou2 Outline What is morphology? –Word structure –Types of morphological operation – Levels of affixation.
1 Morphology September 2009 Lecture #4. 2 What is Morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
1 Morphology September 4, 2012 Lecture #3. 2 What is Morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units.
Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.
5/12/20151 N-Gram: Part 1 ICS 482 Natural Language Processing Lecture 7: N-Gram: Part 1 Husni Al-Muhtaseb.
5/16/ ICS 482 Natural Language Processing Words & Transducers-Morphology - 1 Muhammed Al-Mulhem March 1, 2009.
1 Lexicalized and Probabilistic Parsing – Part 1 ICS 482 Natural Language Processing Lecture 14: Lexicalized and Probabilistic Parsing – Part 1 Husni Al-Muhtaseb.
Brief introduction to morphology
6/2/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
CMSC 723 / LING 645: Intro to Computational Linguistics September 15, 2004: Dorr More about FSA’s, Finite State Morphology (J&M 3) Prof. Bonnie J. Dorr.
CSC 9010 Natural Language Processing Lecture 3: Morphology, Finite State Transducers Paula Matuszek Mary-Angela Papalaskari Presentation slides adapted.
6/11/20151 Parsing Context-free grammars – Part 1 ICS 482 Natural Language Processing Lecture 12: Parsing Context-free grammars – Part 1 Husni Al-Muhtaseb.
6/16/20151/1/ Morphology and Finite-state Transducers Part 2 ICS 482: Natural Language Processing Lecture 6 Husni Al-Muhtaseb.
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
NLP Credits and Acknowledgment These slides were adapted from presentations of the Authors of the book SPEECH and LANGUAGE PROCESSING: An Introduction.
Morphological analysis
CS 4705 Lecture 3 Morphology: Parsing Words. What is morphology? The study of how words are composed from smaller, meaning-bearing units (morphemes) –Stems:
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Some Basic Concepts: Morphology.
Finite State Transducers The machine model we will study for morphological parsing is called the finite state transducer (FST) An FST has two tapes –input.
CS 4705 Morphology: Words and their Parts CS 4705 Julia Hirschberg.
CS 4705 Morphology: Words and their Parts CS 4705.
Introduction to English Morphology Finite State Transducers
Semantics: Representing Meaning ICS 482 Natural Language Processing
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 1 21 July 2005.
Chapter 3. Morphology and Finite-State Transducers From: Chapter 3 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech.
Morphology and Finite-State Transducers. Why this chapter? Hunting for singular or plural of the word ‘woodchunks’ was easy, isn’t it? Lets consider words.
1 N-Gram – Part 2 ICS 482 Natural Language Processing Lecture 8: N-Gram – Part 2 Husni Al-Muhtaseb.
Morphology (CS ) By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya.
Morphology: Words and their Parts CS 4705 Slides adapted from Jurafsky, Martin Hirschberg and Dorr.
October 2006Advanced Topics in NLP1 CSA3050: NLP Algorithms Finite State Transducers for Morphological Parsing.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
Introduction Morphology is the study of the way words are built from smaller units: morphemes un-believe-able-ly Two broad classes of morphemes: stems.
10/8/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
Session 11 Morphology and Finite State Transducers Introduction to Speech Natural and Language Processing (KOM422 ) Credits: 3(3-0)
10/13/20151 Representing Meaning Part 4 & Semantic Analysis ICS 482 Natural Language Processing Lecture 21: Representing Meaning Part 4 & Semantic Analysis.
Finite State Transducers
Words: Surface Variation and Automata CMSC Natural Language Processing April 3, 2003.
5/31/20161 Lexical Semantic + Students Presentations ICS 482 Natural Language Processing Lecture 26: Lexical Semantic + Students Presentations Husni Al-Muhtaseb.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
Chapter III morphology by WJQ. Morphology Morphology refers to the study of the internal structure of words, and the rules by which words are formed.
CS 4705 Lecture 3 Morphology. What is morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of a language)
1 Parts of Speech Part 1 ICS 482 Natural Language Processing Lecture 9: Parts of Speech Part 1 Husni Al-Muhtaseb.
29 November Parsing Context-free grammars Part 2 ICS 482 Natural Language Processing Lecture 13: Parsing Context-free grammars Part 2 Husni Al-Muhtaseb.
Natural Language Processing Chapter 2 : Morphology.
12/22/20151 Representing Meaning Part 2 ICS 482 Natural Language Processing Lecture 18: Representing Meaning Part 2 Husni Al-Muhtaseb.
1/11/2016CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
CSA4050: Advanced Topics in NLP Computational Morphology II Introduction 2 Level Morphology.
1/31/20161 Lexical Semantic 2 ICS 482 Natural Language Processing Lecture 29-1: Lexical Semantic 2 Husni Al-Muhtaseb.
October 2004CSA3050 NLP Algorithms1 CSA3050: Natural Language Algorithms Morphological Parsing.
3/7/20161 Representing Meaning Part 3 ICS 482 Natural Language Processing Lecture 20: Representing Meaning Part 3 Husni Al-Muhtaseb.
Chapter 3 Word Formation I This chapter aims to analyze the morphological structures of words and gain a working knowledge of the different word forming.
Two Level Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
CIS, Ludwig-Maximilians-Universität München Computational Morphology
Speech and Language Processing
Morphology: Parsing Words
CSCI 5832 Natural Language Processing
Speech and Language Processing
CSCI 5832 Natural Language Processing
By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya
Information Extraction ICS 482 Natural Language Processing
Morphological Parsing
Lecture 27: Husni Al-Muhtaseb
Context-free grammars ICS 482 Natural Language Processing
Presentation transcript:

9/11/ Morphology and Finite-state Transducers: Part 1 ICS 482: Natural Language Processing Lecture 5 Husni Al-Muhtaseb

9/11/ بسم الله الرحمن الرحيم Morphology and Finite-state Transducers: Part 1 ICS 482: Natural Language Processing Lecture 5 Husni Al-Muhtaseb

NLP Credits and Acknowledgment These slides were adapted from presentations of the Authors of the book SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition and some modifications from presentations found in the WEB by several scholars including the following

NLP Credits and Acknowledgment If your name is missing please contact me muhtaseb At Kfupm. Edu. sa

NLP Credits and Acknowledgment Husni Al-Muhtaseb James Martin Jim Martin Dan Jurafsky Sandiway Fong Song young in Paula Matuszek Mary-Angela Papalaskari Dick Crouch Tracy Kin L. Venkata Subramaniam Martin Volk Bruce R. Maxim Jan Hajič Srinath Srinivasa Simeon Ntafos Paolo Pirjanian Ricardo Vilalta Tom Lenaerts Heshaam Feili Björn Gambäck Christian Korthals Thomas G. Dietterich Devika Subramanian Duminda Wijesekera Lee McCluskey David J. Kriegman Kathleen McKeown Michael J. Ciaraldi David Finkel Min-Yen Kan Andreas Geyer- Schulz Franz J. Kurfess Tim Finin Nadjet Bouayad Kathy McCoy Hans Uszkoreit Azadeh Maghsoodi Khurshid Ahmad Staffan Larsson Robert Wilensky Feiyu Xu Jakub Piskorski Rohini Srihari Mark Sanderson Andrew Elks Marc Davis Ray Larson Jimmy Lin Marti Hearst Andrew McCallum Nick Kushmerick Mark Craven Chia-Hui Chang Diana Maynard James Allan Martha Palmer julia hirschberg Elaine Rich Christof Monz Bonnie J. Dorr Nizar Habash Massimo Poesio David Goss-Grubbs Thomas K Harris John Hutchins Alexandros Potamianos Mike Rosner Latifa Al-Sulaiti Giorgio Satta Jerry R. Hobbs Christopher Manning Hinrich Schütze Alexander Gelbukh Gina-Anne Levow Guitao Gao Qing Ma Zeynep Altan

9/11/ Previous Lectures 1 Pre-start online questionnaire 1 Introduce yourself 2 Introduction to NLP 2 Phases of an NLP system 2 NLP Applications 3 Chatting with Alice 3 Regular Expressions 3 Finite State Automata 3 Regular languages 3 Assignment #2 4 Regular Expressions & Regular languages 4 Deterministic & Non-deterministic FSAs 4 Accept, Reject, Generate terms

9/11/ Objective of Today’s Lecture Morphology –Inflectional –Derivational –Compounding –Cliticization Parsing Finite State Transducers Assignment 3

9/11/ Morphological Analysis Individual words are analyzed into their components Morphological Analysis Individual words are analyzed into their components Syntactic Analysis Linear sequences of words are transformed into structures that show how the words relate to each other Syntactic Analysis Linear sequences of words are transformed into structures that show how the words relate to each other Discourse Analysis Resolving references Between sentences Discourse Analysis Resolving references Between sentences Pragmatic Analysis To reinterpret what was said to what was actually meant Pragmatic Analysis To reinterpret what was said to what was actually meant Semantic Analysis A transformation is made from the input text to an internal representation that reflects the meaning Semantic Analysis A transformation is made from the input text to an internal representation that reflects the meaning Reminder: Stages of NLP

9/11/ Morphologic al Analysis Syntactic Analysis Semantic Analysis Discourse Analysis Pragmatic Analysis Internal representatio n lexicon user Surface form Perform action stems parse tree Resolve references

9/11/ Introduction Finite-state methods are useful in dealing with the lexicon (words) Present some facts about words and computational methods

9/11/ Morphology Morphology: the study of meaningful parts of words and how they are put together Morphemes: are the smallest meaningful spoken units of language Example – books: two morphemes (book and s) but one syllable –Unladylike: three morphemes, four syllables

12 Morpheme Definitions Root –The portion of the word that: is common to a set of derived or inflection forms, if any, when all affixes are removed is not further analyzable into meaningful elements carries the principle portion of meaning of the words Stem –The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.

13 Morpheme Definitions Affix –A bound morpheme that is joined before, after, or within a root or stem. Clitic –a morpheme that functions syntactically like a word, but does not appear as an independent phonological word English: I’ve (the morpheme ‘ve is a clitic)

14 Inflectional vs. Derivational Word Classes –Parts of speech: noun, verb, adjectives, etc. –Word class dictates how a word combines with morphemes to form new words Inflection: –Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast. Doesn’t change the word class Usually produces a predictable meaning. Derivation: –The formation of a new word or inflectable stem from another word or stem.

15 Inflectional Morphology Adds: –tense, number, person Word class doesn’t change Word serves new grammatical role Example –come is inflected for person and number: The pizza guy comes at noon.

16 Derivational Morphology Nominalization (formation of nouns from other parts of speech, primarily verbs in English): –computerization –appointee –killer –fuzziness Formation of adjectives (primarily from nouns) –computational –clueless –Embraceable

17 Concatinative Morphology Morpheme+Morpheme+Morpheme+… Stems: also called lemma, base form, root, lexeme – hope+ing  hopinghop  hopping Affixes –Prefixes: Antidisestablishmentarianism - وسنكتب –Suffixes: Antidisestablishmentarianism - كتبوها –Infixes: hingi (borrow) – humingi (borrower) in Tagalog - كاتب –Circumfixes: sagen (say) – gesagt (said) in German Agglutinative Languages –uygarlaştıramadıklarımızdanmışsınızcasına –uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına –Behaving as if you are among those whom we could not cause to become civilized

9/11/ Templatic Morphology Roots and Patterns مَكتوب ب K T B ?ومَ?? كت كتابة ب ? ?? كت maktoob written kitabah writing اة

9/11/ Templatic Morphology: Root Meaning KTB: writing “stuff” كتب كاتب مكتوب كتاب book مكتبة library مكتب office write writer letter

20 Nouns and Verbs (in English) Nouns –Have simple inflectional morphology –Cat/Cats –Mouse/Mice, Ox, Oxen, Goose, Geese Verbs –More complex morphology –Walk/Walked –Go/Went, Fly/Flew

21 Regular (English) Verbs Morphological Form ClassesRegularly Inflected Verbs Stemwalkmergetrymap -s formwalksmergestriesmaps -ing formwalkingmergingtryingmapping Past form or –ed participlewalkedmergedtriedmapped

22 Irregular (English) Verbs Morphological Form ClassesIrregularly Inflected Verbs Stemeatcatchcut -s formeatscatchescuts -ing formeatingcatchingcutting Past formatecaughtcut -ed participleeatencaughtcut

23 “To love” in Spanish

9/11/ To love in Arabic ?

9/11/ Review: What is morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of a language) –Stems –Affixes (prefixes, suffixes, circumfixes, infixes) Immaterial Trying Gesagt سنلزمكموهاتكاتبا –Concatenative vs. Templatic (non-concatenative) (e.g. Arabic root-and-pattern)

9/11/ Multiple affixes –Unreadable سنلزمكموهاتكاتبا –Agglutinative languages (e.g. Turkish, Japanese) –vs. inflectional languages (e.g. Latin, Russian) –vs. analytic languages (e.g. Mandarin) Review: What is morphology?

9/11/ English Inflectional Morphology Word stem combines with grammatical morpheme –Usually produces word of same class –Usually serves a syntactic function (e.g. agreement) like  likes or liked bird  birds Nominal morphology –Plural forms s or es Irregular forms Mass vs. count nouns ( or s) –Possessives

9/11/ Verbal inflection –Main verbs (sleep, like, fear) verbs are relatively regular -s, ing, ed And productive: ed, instant-messaged, faxed But eat/ate/eaten, catch/caught/caught –Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive Be: am/is/are/were/was/been/being –Irregular verbs few (~250) but frequently occurring –English verbal inflection is much simpler than e.g. Latin Review: What is morphology?

9/11/ English Derivational Morphology Word stem combines with grammatical morpheme –Usually produces word of different class –More complicated than inflectional Example: nominalization –-ize verbs  -ation nouns –generalize, realize  generalization, realization Example: verbs, nouns  adjectives –embrace, pity  embraceable, pitiable –care, wit  careless, witless

9/11/ Example: adjective  adverb –happy  happily More complicated to model than inflection –Less productive: *science-less, *concern-less, *go-able, *sleep-able –Meanings of derived terms harder to predict by rule clueless, careless, nerveless

9/11/ Parsing Taking a surface input and identifying its components and underlying structure Morphological parsing: parsing a word into stem and affixes and identifying the parts and their relationships –Stem and features: goose  goose +N +SG or goose + V geese  goose +N +PL gooses  goose +V +3SG –Bracketing: indecipherable  [in [[de [cipher]] able]]

9/11/ Why parse words? For spell-checking –Is muncheble a legal word? To identify a word’s part-of-speech (POS) –For sentence parsing, for machine translation, … To identify a word’s stem –For information retrieval Why not just list all word forms in a lexicon?

9/11/ What do we need to build a morphological parser? Lexicon: stems and affixes (w/ corresponding Part of Speech (POS)) Morphotactics of the language: model of how morphemes can be affixed to a stem Orthographic rules: spelling modifications that occur when affixation occurs –in  il in context of l (in- + legal)

34 Syntax and Morphology Phrase-level agreement –Subject-Verb Ali studies hard (STUDY+3SG) Sub-word phrasal structures – ولحاجاتنا – و + لـ + حاجات + نا –and+for+need+PL+Poss:1PL –And for our needs

9/11/ Morphotactic Models English nominal inflection q0q2q1 plural (-s) reg-n irreg-sg-n irreg-pl-n Inputs: cats, goose, geese reg-n: regular noun irreg-pl-n: irregular plural noun irreg-sg-n: irregular singular noun

9/11/ Derivational morphology: adjective fragment q3 q5 q4 q0 q1q2 un- adj-root 1 -er, -ly, -est  adj-root 1 adj-root 2 -er, -est Adj-root 1 : clear, happy, real Adj-root 2 : big, red

9/11/ Using FSAs to Represent the Lexicon and Do Morphological Recognition Lexicon: We can expand each non- terminal in our NFSA into each stem in its class (e.g. adj_root 2 = {big, red}) and expand each such stem to the letters it includes (e.g. red  r e d, big  b i g) q0 q1 r e q2 q4 q3 -er, -est d b g q5 q6i q7 

9/11/ Limitations To cover all of English will require very large FSAs with consequent search problems –Adding new items to the lexicon means re- computing the FSA –Non-determinism FSAs can only tell us whether a word is in the language or not – what if we want to know more? –What is the stem? –What are the affixes? –We used this information to build our FSA: can we get it back?

9/11/ Parsing with Finite State Transducers cats  cat +N +PL Kimmo Koskenniemi’s two-level morphology –Words represented as correspondences between lexical level (the morphemes) and surface level (the orthographic word) –Morphological parsing :building mappings between the lexical and surface levels cat+N+PL cats

9/11/ Finite State Transducers FSTs map between one set of symbols and another using an FSA whose alphabet  is composed of pairs of symbols from input and output alphabets In general, FSTs can be used for –Translator (Hello:مرحبا) –Parser/generator (Hello:How may I help you?) –To map between the lexical and surface levels of Kimmo’s 2-level morphology

9/11/ FST is a 5-tuple consisting of –Q: set of states {q0,q1,q2,q3,q4} –  : an alphabet of complex symbols, each is an i/o pair such that i  I (an input alphabet) and o  O (an output alphabet) and  is in I x O –q0: a start state –F: a set of final states in Q {q4} –  (q,i:o): a transition function mapping Q x  to Q –Emphatic Sheep  Quizzical Cow q0 q4 q1q2q3 b:ma:o !:?

9/11/ FST for a 2-level Lexicon Example Reg-nIrreg-pl-nIrreg-sg-n c a tg o:e o:e s eg o o s e q0q1q2 q3 cat q1q3q4q2 se:o e q0 q5 g

9/11/ FST for English Nominal Inflection q0q7 +PL:^s# Combining (cascade or composition) this FSA with FSAs for each noun type replaces e.g. reg- n with every regular noun representation in the lexicon q1q4 q2q5 q3q6 reg-n irreg-n-sg irreg-n-pl +N:  +PL:-s# +SG:-# +N: 

9/11/ Orthographic Rules and FSTs Define additional FSTs to implement rules such as consonant doubling (beg  begging), ‘e’ deletion (make  making), ‘e’ insertion (watch  watches), etc. Lexical fox+N+PL Intermediate fox^s# Surface foxes

9/11/ Note: These FSTs can be used for generation as well as recognition by simply exchanging the input and output alphabets (e.g. ^s#:+PL)

9/11/ Administration Next Sunday: Quiz 1: 20 Minutes In the class Assignment 2: What was your findings about Python? New Assignment (3)

9/11/ Assignment 3: Part 1 A genre for your Corpora Choose a Domain for your Corpora –Technology and Computers –Management –Weather –Sport –Economics –Politics –Education –Health care –Religion –History –Traditional Poems –New Poems –Other suggested fields

9/11/ Assignment 3: Part 1 A genre for your Corpora Put your choice on the discussion list named 'My Corpora'. read other selections before Avoid selecting a topic that has been selected You might need to suggest unlisted field – with the arrangement of the instructor Collect text files and keep them in one directory as your corpora for future use Suggested total size (sum of sizes of all text files) –larger than 10Mbyte of Arabic text

9/11/ Assignment 3: Part 2 List text files in a chosen directory Write a program that allows the user to browse and select a directory, then the program will list the names of the text files in that directory. This program is needed to be used for future assignments and the course project. You can use any language you are mastering. However, Python might be a good choice

9/11/ Assignment 3: Part 3 The most used n words in your corpora After building your corpora, you need to find the most used 100 words in your corpora. You might do that by writing a program that let the user choose the directory of the corpora where the text files are located and find the most use n words. Where n could be 100.

9/11/ Thank you السلام عليكم ورحمة الله