Speech and Language Processing

Speech and Language Processing Chapter 3 of SLP

Today
English Morphology
Finite-State Transducers

Words
Finite-state methods are particularly useful in dealing with a lexicon.
Many devices, most with limited memory, need access to large lists of words, need to perform fairly sophisticated tasks with those lists (e.g. information retrieval / web search, spelling correction), and also need to handle new words.
So we'll first talk about some facts about words and then come back to computational methods.

English vs. Other Languages
Examples in class are from English.
English regular verbs typically have only 4 forms, irregulars up to 8.
We can thus accomplish a lot without morphology, by just listing all word forms (e.g., sit, sits, sat).
Morphology is much more essential for other languages: Finnish verbs have more than 10,000 forms!

English Morphology
Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes.
We can usefully divide morphemes into two classes:
Stems: the core meaning-bearing units
Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions

English Morphology
We can further divide morphology into two broad classes: inflectional and derivational.

Word Classes
By word class, we have in mind familiar notions like noun and verb; we'll go into the gory details in Chapter 5.
Right now we're concerned with word classes because the way that stems and affixes combine is based to a large degree on the word class of the stem.

Computational Morphology
Generation: hang +V +Past → hanged, hung
Recognition/Analysis: leaves → leaf +N +Pl, leave +N +Pl, leave +V +Sg3
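To make the recognition/analysis direction concrete, here is a minimal lookup sketch; the ANALYSES table and analyze function are illustrative stand-ins for a real morphological lexicon, using only the examples on this slide.

```python
# A toy analysis table: one surface form may map to several analyses.
ANALYSES = {
    "leaves": ["leaf +N +Pl", "leave +N +Pl", "leave +V +Sg3"],
    "hanged": ["hang +V +Past"],
    "hung":   ["hang +V +Past"],
}

def analyze(surface):
    """Return every candidate analysis for a surface form (empty list if unknown)."""
    return ANALYSES.get(surface, [])

print(analyze("leaves"))   # ['leaf +N +Pl', 'leave +N +Pl', 'leave +V +Sg3']
```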

To Support Computation
Lexicon: the stems and an affix list.
Morphotactics: words are composed of smaller elements that must be combined in a certain order (piti-less-ness is English; piti-ness-less is not).
Orthographic alternations: spelling may vary depending on the context (pity is realized as piti in pitilessness; die becomes dy in dying).
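A minimal sketch of those three ingredients, assuming invented names (STEMS, SUFFIX_ORDER, spell, well_formed) and only the alternations mentioned on this slide:

```python
# Toy lexicon (stems and suffixes), one morphotactic ordering constraint,
# and two orthographic alternations: pity -> piti-, die -> dy- before -ing.
STEMS = {"pity", "die", "walk"}
SUFFIX_ORDER = ["less", "ness"]   # -less must precede -ness: piti-less-ness, not *piti-ness-less

def well_formed(stem, suffixes):
    """Morphotactics: is the stem known and are -less/-ness in the right order?"""
    order = [SUFFIX_ORDER.index(s) for s in suffixes if s in SUFFIX_ORDER]
    return stem in STEMS and order == sorted(order)

def spell(stem, suffixes):
    """Orthographic alternations, then plain concatenation."""
    if suffixes and stem.endswith("y"):              # pity -> piti-less-ness
        stem = stem[:-1] + "i"
    if suffixes == ["ing"] and stem.endswith("ie"):  # die -> dy-ing
        stem = stem[:-2] + "y"
    return stem + "".join(suffixes)

print(well_formed("pity", ["less", "ness"]), spell("pity", ["less", "ness"]))  # True pitilessness
print(well_formed("pity", ["ness", "less"]))                                   # False (bad order)
print(spell("die", ["ing"]))                                                   # dying
```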

Inflectional Morphology
Inflectional morphology concerns the combination of stems and affixes where the resulting word:
has the same word class as the original, and
serves a grammatical/semantic purpose that is different from the original, but is nevertheless transparently related to the original.

Nouns and Verbs in English
Nouns are simple: markers for plural and possessive.
Verbs are only slightly more complex: markers appropriate to the tense of the verb.

Regulars and Irregulars
Things are complicated a little by the fact that some words misbehave (refuse to follow the rules):
mouse/mice, goose/geese, ox/oxen
go/went, fly/flew
The terms regular and irregular are used to refer to words that follow the rules and those that don't.

Regular and Irregular Verbs
Regulars: walk, walks, walking, walked
Irregulars: eat, eats, eating, ate; catch, catches, catching, caught; cut, cuts, cutting, cut

Inflectional Morphology
So inflectional morphology in English is fairly straightforward, but it is complicated by the fact that there are irregularities.

Derivational Morphology
Derivational morphology is the messy stuff that no one ever taught you:
quasi-systematicity, irregular meaning change, and changes of word class.

Derivational Examples: Verbs and Adjectives to Nouns
-ation: computerize → computerization
-ee: appoint → appointee
-er: kill → killer
-ness: fuzzy → fuzziness

Derivational Examples: Nouns and Verbs to Adjectives
-al: computation → computational
-able: embrace → embraceable
-less: clue → clueless

Example: Compute
Many paths are possible… Start with compute:
computer → computerize → computerization
computer → computerize → computerizable
But not all paths/operations are equally good (allowable?):
clue → *clueable

Morphology and FSAs
We'd like to use the machinery provided by FSAs to capture these facts about morphology:
accept strings that are in the language,
reject strings that are not,
and do so in a way that doesn't require us to, in effect, list all the words in the language.

Start Simple
Regular singular nouns are ok as is.
Regular plural nouns have an -s on the end.
Irregulars are ok as is.
What is the FSA?
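As a sketch only (the word lists and the accepts function are assumptions, not the book's lexicon), the FSA can be pictured like this: a reg-noun arc to an accepting state, an optional plural -s arc, and separate arcs for irregular singulars and plurals.

```python
# Toy word lists standing in for the reg-noun, irreg-sg-noun, and irreg-pl-noun arcs.
REG_NOUNS = {"cat", "dog", "fox"}
IRREG_SG  = {"goose", "mouse", "ox"}
IRREG_PL  = {"geese", "mice", "oxen"}

def accepts(word):
    """Accept regular singulars, regular plurals in -s, and irregulars as is."""
    if word in REG_NOUNS or word in IRREG_SG or word in IRREG_PL:
        return True                                  # accept state reached without the plural arc
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True                                  # reg-noun arc followed by the plural -s arc
    return False

for w in ["cat", "cats", "geese", "gooses", "foxs", "foxes"]:
    print(w, accepts(w))
```

Note that this naive machine accepts foxs and rejects foxes; handling that spelling change is exactly what the transducer cascade later in the lecture is for.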

Now Plug in the Words

Derivational Rules
If everything is an accept state, how do things ever get rejected?

Another Example
Problems?
[FSA diagram for the pattern (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv]

Parsing/Generation vs. Recognition
We can now run strings through these machines to recognize strings in the language.
But recognition is usually not quite what we need:
often, if we find some string in the language, we might like to assign a structure to it (parsing),
or we might have some structure and want to produce a surface form for it (production/generation).
Example: from "cats" to "cat +N +PL".

Finite State Transducers
The simple story: add another tape, and add extra symbols to the transitions.
On one tape we read "cats"; on the other we write "cat +N +PL".

FSTs

Transitions
c:c  a:a  t:t  +N:ε  +PL:s
c:c means read a c on one tape and write a c on the other.
+N:ε means read a +N symbol on one tape and write nothing on the other.
+PL:s means read +PL and write an s.
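A minimal sketch of this machine run in the generation direction, with exactly the arcs listed above (ARCS and transduce are invented names for illustration):

```python
# Arcs are (state, read_symbol, write_symbol, next_state); "" stands for ε.
ARCS = [
    (0, "c",   "c", 1),
    (1, "a",   "a", 2),
    (2, "t",   "t", 3),
    (3, "+N",  "",  4),   # +N:ε  -- read the feature, write nothing
    (4, "+PL", "s", 5),   # +PL:s -- realize plural as a surface s
]
ACCEPT = {5}

def transduce(symbols):
    """Read the lexical tape symbol by symbol and write the surface tape."""
    state, out = 0, []
    for sym in symbols:
        for src, rd, wr, dst in ARCS:
            if src == state and rd == sym:
                out.append(wr)
                state = dst
                break
        else:
            return None                  # no matching arc: reject
    return "".join(out) if state in ACCEPT else None

print(transduce(["c", "a", "t", "+N", "+PL"]))   # cats
```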

Typical Uses
Typically, we'll read from one tape using the first symbol on the machine's transitions (just as in a simple FSA), and we'll write to the second tape using the other symbols on the transitions.

Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state; it didn't matter which path was actually traversed.
In FSTs the path to an accept state does matter, since different paths represent different parses, and different outputs will result.

Ambiguity
What's the right parse (segmentation) for unionizable?
union-ize-able
un-ion-ize-able
Each represents a valid path through the derivational morphology machine.

Ambiguity
There are a number of ways to deal with this problem:
simply take the first output found,
find all the possible outputs (all paths) and return them all, without choosing (see the sketch below), or
bias the search so that only one or a few likely paths are explored.
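That second strategy, returning all outputs, can be sketched as a recursive search over a toy morpheme list; MORPHEMES and segmentations are illustrative assumptions, not a full derivational machine. For unionizable it returns both segmentations from the previous slide.

```python
# Every way of covering the word with known morphemes, in the order found.
MORPHEMES = {"un", "ion", "union", "ize", "able"}

def segmentations(word):
    """Return all segmentations of word into morphemes from MORPHEMES."""
    if word == "":
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        if word[:i] in MORPHEMES:
            for rest in segmentations(word[i:]):
                results.append([word[:i]] + rest)
    return results

print(segmentations("unionizable"))
# [['un', 'ion', 'ize', 'able'], ['union', 'ize', 'able']]
```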

The Gory Details
Of course, it's not as easy as "cat +N +PL" <-> "cats".
As we saw earlier, there are geese, mice and oxen.
But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes:
cats vs. dogs
fox and foxes

Multi-Tape Machines
To deal with these complications, we will add more tapes and use the output of one tape machine as the input to the next.
So to handle irregular spelling changes we'll add intermediate tapes with intermediate symbols.

Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape.

Lexical to Intermediate Level
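A rough sketch of what this first machine computes for regular nouns; lexical_to_intermediate is an invented stand-in for the transducer, handling only +N and +PL.

```python
def lexical_to_intermediate(lexical):
    """Map e.g. 'fox +N +PL' to 'fox^s#': drop +N, realize +PL as ^s, add word boundary #."""
    parts = lexical.split()              # ["fox", "+N", "+PL"]
    out = parts[0]
    for feat in parts[1:]:
        if feat == "+PL":
            out += "^s"                  # plural morpheme preceded by the morpheme boundary ^
        # +N contributes nothing on the intermediate tape (+N:ε)
    return out + "#"                     # word boundary

print(lexical_to_intermediate("fox +N +PL"))   # fox^s#
print(lexical_to_intermediate("cat +N +PL"))   # cat^s#
```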

Intermediate to Surface
The "add an e" rule, as in fox^s# <-> foxes#.
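A sketch of the e-insertion rule on the intermediate tape, narrowed here to stems ending in x, s, or z (the full rule covers more contexts, e.g. ch and sh; e_insertion is an illustrative name):

```python
import re

def e_insertion(intermediate):
    """Apply the add-an-e rule (fox^s# -> foxes), then erase the ^ and # boundary symbols."""
    surface = re.sub(r"([xsz])\^s#", r"\1es", intermediate)   # insert e after x, s, z before ^s#
    return surface.replace("^", "").replace("#", "")          # clean up remaining boundaries

print(e_insertion("fox^s#"))   # foxes
print(e_insertion("cat^s#"))   # cats
```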

Foxes

Exercise: Change to a Transducer
[FSA diagram for the pattern (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv]

Cascades
This is an architecture that we'll see again and again:
overall processing is divided up into distinct rewrite steps,
the output of one layer serves as the input to the next, and
the intermediate tapes may or may not wind up being useful in their own right.
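With the two hypothetical functions sketched earlier, the cascade is just composition: the intermediate tape produced by the first step feeds the spelling-rule step.

```python
# Reuses lexical_to_intermediate and e_insertion from the sketches above
# (both illustrative, not the book's actual FSTs).
def generate_surface(lexical):
    intermediate = lexical_to_intermediate(lexical)   # "fox +N +PL" -> "fox^s#"
    return e_insertion(intermediate)                  # "fox^s#"     -> "foxes"

print(generate_surface("fox +N +PL"))   # foxes
print(generate_surface("cat +N +PL"))   # cats
```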

Related Applications
Stemming
Word and sentence tokenization (English; Chinese via maxmatch)
Spelling error detection (non-word) and correction via minimum edit distance and alignment (e.g. the non-word tweek)
Human morphological processing
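Since minimum edit distance is mentioned above, here is a standard dynamic-programming sketch with unit insert/delete/substitute costs (other cost schemes are common; min_edit_distance is an illustrative name).

```python
def min_edit_distance(source, target):
    """Fill the DP table d[i][j] = cost of turning source[:i] into target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                                    # delete everything in source[:i]
    for j in range(1, m + 1):
        d[0][j] = j                                    # insert everything in target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub)       # substitution (or match)
    return d[n][m]

print(min_edit_distance("tweek", "tweak"))   # 1
```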

Toolkits
OpenFST Library (Google and NYU): http://www.openfst.org
Carmel Toolkit: http://www.isi.edu/licensed-sw/carmel
FSA Toolkit: http://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html
Current research: learning the knowledge