CSCI 5832 Natural Language Processing


CSCI 5832 Natural Language Processing Lecture 4 Jim Martin Spring 2006

Today 1/25
More English Morphology
FSAs and Morphology
Break
FSTs
CSCI 5832 Spring 2007

English Morphology
Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes. We can usefully divide morphemes into two classes:
Stems: the core meaning-bearing units
Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions

Inflectional Morphology
Inflectional morphology concerns the combination of stems and affixes where the resulting word:
Has the same word class as the original
Serves a grammatical/semantic purpose different from the original

Nouns and Verbs (English)
Nouns are simple (not really): markers for plural and possessive.
Verbs are only slightly more complex: markers appropriate to the tense of the verb.

FSAs and the Lexicon
First we'll capture the morphotactics: the rules governing the ordering of affixes in a language. Then we'll add in the actual words.
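As a concrete sketch, the noun morphotactics can be encoded as a tiny recognizer: a regular noun may optionally take plural -s, while irregular singulars and plurals are listed directly in the lexicon. The word lists here are illustrative toy lexicons, not from the slides, and spelling rules are deliberately ignored at this stage.

```python
# Toy FSA for English noun morphotactics:
#   reg-noun (-s)?  |  irreg-sg-noun  |  irreg-pl-noun
REG_NOUNS = {"cat", "fox", "dog"}
IRREG_SG = {"goose", "mouse", "ox"}
IRREG_PL = {"geese", "mice", "oxen"}

def accepts(word: str) -> bool:
    """Recognize a word as a well-formed English noun (toy lexicon)."""
    if word in REG_NOUNS or word in IRREG_SG or word in IRREG_PL:
        return True
    # arc for reg-noun followed by the plural -s affix
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True
    return False

print(accepts("cats"))    # True
print(accepts("geese"))   # True
print(accepts("gooses"))  # False: "goose" is not a reg-noun
```

Note that this is pure recognition: it answers yes/no, which (as discussed below) is usually not quite what we need.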

Simple Rules

Adding the Words

Derivational Rules

Parsing/Generation vs. Recognition
Recognition is usually not quite what we need. Usually, if we find some string in the language, we need to find the structure in it (parsing). Or we have some structure and want to produce a surface form (production/generation).
Example: from "cats" to "cat +N +PL" and back.

Homework
How big is your vocabulary?

Projects
Two styles of projects:
Something no one has done… (you might ask yourself why no one has done it)
Tasks that have benchmarks and current best results from bakeoffs
To get ideas about the latter, go to acl.ldc.upenn.edu and poke around.

Projects
Other ideas…
Anything to do with blogs
Machine learning applied to X: clustering (unsupervised), classification (supervised)
Bioinformatic language sources
Search engines (getting old)
Semantic tagging (getting hot)

Applications
The kind of parsing we're talking about is normally called morphological analysis. It can either be:
An important stand-alone component of an application (spelling correction, information retrieval)
Or simply a link in a chain of processing

Finite State Transducers
The simple story:
Add another tape
Add extra symbols to the transitions
On one tape we read "cats"; on the other we write "cat +N +PL", or the other way around.

FSTs

Transitions
c:c  a:a  t:t  +N:ε  +PL:s
c:c means read a c on one tape and write a c on the other
+N:ε means read a +N symbol on one tape and write nothing on the other
+PL:s means read +PL and write an s
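A minimal way to simulate these transitions in code (a hand-rolled sketch, not a real FST toolkit): represent each arc as (state, input symbol, output symbol, next state), with the empty string standing in for ε.

```python
# FST arcs for "cat +N +PL" <-> "cats": (state, in, out, next_state).
# "" stands for epsilon (write nothing).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),    # read +N, write nothing
    (4, "+PL", "s", 5),  # read +PL, write an s
]
ACCEPT = {5}

def transduce(symbols):
    """Run the FST over a sequence of input symbols; return the
    string written to the other tape, or None on rejection."""
    state, out = 0, []
    for sym in symbols:
        for (q, i, o, r) in ARCS:
            if q == state and i == sym:
                out.append(o)
                state = r
                break
        else:
            return None  # no arc matched
    return "".join(out) if state in ACCEPT else None

# Lexical tape -> surface tape:
print(transduce(["c", "a", "t", "+N", "+PL"]))  # cats
```

Swapping the input and output fields of every arc runs the same machine in the other direction (surface to lexical), which is the point of the next slide.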

Typical Uses
Typically, we'll read from one tape using the first symbol on each machine transition (just as in a simple FSA), and write to the second tape using the other symbol on the transition.

Ambiguity
Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state; it didn't matter which path was actually traversed. In FSTs the path to an accept state does matter, since different paths represent different parses, and different outputs will result.

Ambiguity
What's the right parse for "unionizable"?
Union-ize-able
Un-ion-ize-able
Each represents a valid path through the derivational morphology machine.

Ambiguity
There are a number of ways to deal with this problem:
Simply take the first output found
Find all the possible outputs (all paths) and return them all (without choosing)
Bias the search so that only one or a few likely paths are explored
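The "return all paths" option can be sketched as a depth-first search over a nondeterministic transition table. The tiny machine below is hypothetical, built just to exhibit the two unionizable analyses; a real analyzer would use a compiled FST over a full derivational lexicon.

```python
# Toy nondeterministic transducer with two paths for "unionizable".
# TRANS: state -> list of (input_prefix, output, next_state)
TRANS = {
    0: [("union", "union", 1), ("un", "un-", 2)],
    1: [("izable", "-ize-able", 3)],
    2: [("ion", "ion", 1)],
    3: [],
}
ACCEPT = {3}

def all_parses(s, state=0, out=""):
    """Collect the output of every path that consumes s and accepts."""
    results = []
    if not s and state in ACCEPT:
        results.append(out)
    for prefix, o, nxt in TRANS.get(state, []):
        if s.startswith(prefix):
            results.extend(all_parses(s[len(prefix):], nxt, out + o))
    return results

print(sorted(all_parses("unionizable")))
# ['un-ion-ize-able', 'union-ize-able']
```

Biasing the search (the third option) would amount to ordering or weighting the arcs so that likelier paths are explored first.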

The Gory Details
Of course, it's not as easy as "cat +N +PL" <-> "cats". As we saw earlier, there are geese, mice and oxen. But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes:
Cats vs. Dogs
Fox and Foxes

Multi-Tape Machines
To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next. So to handle irregular spelling changes we'll add intermediate tapes with intermediate symbols.

Generativity
Nothing is really privileged about the directions. We can write from one tape and read from the other, or vice versa. One way is generation, the other is analysis.

Multi-Level Tape Machines
We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes down to the surface tape.

Lexical to Intermediate Level

Intermediate to Surface
The "add an e" rule, as in fox^s# <-> foxes#
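At the intermediate level this rule can be approximated with a regular-expression rewrite. This is a sketch of just this one rule, using ^ for the morpheme boundary and # for the word end as on the slide; the full e-insertion rule in the textbook also covers contexts like "ch" and "sh", which are omitted here.

```python
import re

def e_insertion(intermediate: str) -> str:
    """Insert e before s when the stem ends in x, s, or z
    (fox^s# -> foxes#), then erase the boundary symbols."""
    s = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)  # the e-insertion rule
    return s.replace("^", "").replace("#", "")          # clean up boundaries

print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats (rule does not apply)
```

As the later slide notes, the machine leaves inputs it doesn't apply to unchanged, which is exactly what re.sub does when the pattern fails to match.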

Foxes

Note
A key feature of this machine is that it doesn't do anything to inputs to which it doesn't apply, meaning that such inputs are written out unchanged to the output tape. It turns out the multiple tapes aren't really needed; they can be compiled away.

Overall Scheme
We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity): lexical level to intermediate forms.
We have a larger set of machines that capture orthographic/spelling rules: intermediate forms to surface forms.
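The cascade above can be sketched as plain function composition: one function stands in for the lexicon FST (lexical to intermediate), another for the spelling-rule FSTs (intermediate to surface). The tag-handling here is a toy for regular plurals only; all names are illustrative.

```python
import re

def lexical_to_intermediate(lexical: str) -> str:
    """Lexicon FST (toy): 'fox +N +PL' -> 'fox^s#'."""
    s = lexical.replace(" +N", "")   # +N:epsilon
    s = s.replace(" +PL", "^s")      # +PL:s, with a ^ morpheme boundary
    return s + "#"                   # mark the word end

def intermediate_to_surface(inter: str) -> str:
    """Spelling-rule FST: apply e-insertion, then drop boundaries."""
    s = re.sub(r"([xsz])\^s#", r"\1es#", inter)
    return s.replace("^", "").replace("#", "")

def generate(lexical: str) -> str:
    """Compose the two machines: lexical form -> surface form."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))

print(generate("fox +N +PL"))  # foxes
print(generate("cat +N +PL"))  # cats
```

In a real system the two levels would be composed once into a single transducer, which is what "compiled away" on the previous slide refers to.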

Overall Scheme

Next Time
Finish Chapter 3