CSCI 5832 Natural Language Processing Lecture 5 Jim Martin

Today 1/29

- Review FSTs / composing cascades
- Break
- FST examples
  - Porter stemmer
  - Soundex
- Segmentation
- Edit distance

FST Review

FSTs allow us to:
- take an input and deliver a structure based on it
- or take a structure and create a surface form
- or take a structure and create another structure

FST Review

- What does the transition table for an FSA consist of?
- How does that change with an FST?
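To make the contrast concrete, here is a small sketch (illustrative, not from the slides) in Python. An FSA arc is labeled with a single input symbol; an FST arc is labeled with an input:output pair, so the machine emits as it reads:

    # FSA table: (state, input symbol) -> next state
    fsa_delta = {
        (0, 'b'): 1,
        (1, 'a'): 2,
        (2, 'a'): 2,            # self-loop: accepts b a a ... a
    }

    # FST table: (state, (input, output)) -> next state
    fst_delta = {
        (0, ('b', 'b')): 1,
        (1, ('a', 'a')): 2,
        (2, ('a', '')): 2,      # read 'a', emit epsilon
    }

    def transduce(delta, state, word):
        """Run an FST table over a word, collecting output symbols."""
        out = []
        for sym in word:
            i, o = next((i, o) for (s, (i, o)) in delta
                        if s == state and i == sym)
            out.append(o)
            state = delta[(state, (i, o))]
        return ''.join(out)

    print(transduce(fst_delta, 0, 'baaa'))   # -> 'ba'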

Review

- In many applications it's convenient to decompose the problem into a set of cascaded transducers, where the output of one machine feeds into the input of the next
- We'll see this scheme again for deeper semantic processing

English Spelling Changes

We use one machine to transduce between the lexical and the intermediate levels, and another to handle the spelling changes down to the surface tape.
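As a concrete instance, the textbook's E-insertion rule (insert an e between a stem ending in x, s, or z and the plural -s) operates on an intermediate tape that marks morpheme boundaries. A regex sketch that stands in for the actual spelling-rule transducer:

    import re

    def e_insertion(intermediate):
        """Apply the E-insertion spelling change to an intermediate-tape
        string, where '^' marks a morpheme boundary and '#' the word end,
        e.g. 'fox^s#' -> 'foxes'."""
        s = re.sub(r'([xsz])\^s#', r'\1es', intermediate)
        return s.replace('^', '').replace('#', '')   # drop leftover marks

    print(e_insertion('fox^s#'))   # -> foxes
    print(e_insertion('cat^s#'))   # -> cats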

Overall Plan

Final Scheme

Composition

1. Create a set of new states corresponding to each pair of states from the original machines. (The new states are called (x, y), where x is a state from M1 and y is a state from M2.)
2. Create a new FST transition table for the new machine according to the following intuition…

Composition

There should be a transition between two states in the new machine whenever the output of a transition from a state of M1 is the same as the input to a transition from a state of M2. Or, more formally…

Composition

$\delta_3((x_a, y_a), i:o) = (x_b, y_b)$ iff there exists $c$ such that $\delta_1(x_a, i:c) = x_b$ and $\delta_2(y_a, c:o) = y_b$.
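In Python, that definition turns into a double loop over arcs; a minimal sketch, assuming dictionary-style transition tables that map (state, (input, output)) to a next state:

    # Compose two FSTs M1 and M2: keep a combined arc whenever M1's
    # output symbol matches M2's input symbol (the shared c above).
    def compose(delta1, delta2):
        delta3 = {}
        for (x_a, (i, c1)), x_b in delta1.items():
            for (y_a, (c2, o)), y_b in delta2.items():
                if c1 == c2:
                    delta3[((x_a, y_a), (i, o))] = (x_b, y_b)
        return delta3

    # Example: M1 rewrites a->b, M2 rewrites b->c; M1 ∘ M2 rewrites a->c.
    m1 = {(0, ('a', 'b')): 1}
    m2 = {(0, ('b', 'c')): 1}
    print(compose(m1, m2))   # -> {((0, 0), ('a', 'c')): (1, 1)}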

Lightweight Morphology

- Sometimes you just need to know the stem of a word and you don't care about the structure
- In fact, you may not even care whether you get the right stem, as long as you get a consistent string
- This is stemming… it most often shows up in IR applications

Stemming for Information Retrieval

- Run a stemmer on the documents to be indexed
- Run a stemmer on users' queries
- Match
  - This is basically a form of hashing, where you want collisions (see the sketch below)
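A toy sketch of stemming as hashing with desired collisions (the suffix list and inverted index here are stand-ins for a real stemmer and a real IR index):

    from collections import defaultdict

    def stem(word):
        """Crude stand-in for a real stemmer such as Porter's."""
        for suffix in ('ization', 'ing', 'ers', 'er', 's'):
            if word.endswith(suffix):
                return word[:-len(suffix)]
        return word

    index = defaultdict(set)                 # stem -> document ids
    docs = {1: 'computing power', 2: 'new computers'}
    for doc_id, text in docs.items():
        for w in text.split():
            index[stem(w)].add(doc_id)

    # The query term "collides" with both documents' terms:
    print(index[stem('computer')])           # -> {1, 2}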

Porter

- No lexicon needed
- Basically a series of stages, each a set of rewrite rules that strip suffixes
- Handles both inflectional and derivational suffixes
- Doesn't guarantee that the resulting stem is really a stem (see first bullet)
- The lack of a guarantee doesn't matter for IR

Porter Example

Computerization
  -ization -> -ize    computerize
  -ize -> ε           computer
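The two rewrites above can be sketched as staged regex rules; a minimal illustration of the mechanism, not the full Porter stemmer:

    import re

    STAGES = [
        (re.compile(r'ization$'), 'ize'),   # computerization -> computerize
        (re.compile(r'ize$'), ''),          # computerize -> computer
    ]

    def porter_like(word):
        for pattern, replacement in STAGES:
            word = pattern.sub(replacement, word)
        return word

    print(porter_like('computerization'))   # -> computer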

Porter

The original exposition of the Porter stemmer did not describe it as a transducer, but…
- each stage is a separate transducer
- the stages can be composed to get one big transducer

Next HW

Part 1: Alter your code to take newswire text and segment it into paragraphs, sentences, and words. As in…

  The quick brown fox. This is sentence two.

Readings/Quiz

- First quiz is 2/12 (2 weeks from today)
- It will cover Chapters 2, 3, 4, 5 and parts of 6
- Lectures are based on the assumption that you've read the text before class
- Quizzes are based on the contents of the lectures and the readings

Segmentation

- A significant part of the morphological analysis we've been doing involves segmentation
- Sometimes getting the segments is all we want
  - Morpheme boundaries, words, sentences, paragraphs

Segmentation

- Segmenting words in running text
- Segmenting sentences in running text
- Why not just periods and white-space?
  - Mr. Sherwood said reaction to Sea Containers' proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
  - "I said, 'what're you? Crazy?'" said Sadowsky. "I can't afford to do that."
- Naive splitting produces word tokens like: cents.   said,   positive."   Crazy?

Can't Just Segment on Punctuation

- Word-internal punctuation
  - m.p.h.
  - Ph.D.
  - AT&T
  - 01/02/06
  - Google.com
  - 555,
- Expanding clitics
  - what're -> what are
  - I'm -> I am
- Multi-token words
  - New York
  - rock 'n' roll
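A small sketch of handling a few of these cases (the patterns and clitic table are illustrative only, far from a complete tokenizer):

    import re

    CLITICS = {"what're": 'what are', "I'm": 'I am'}   # assumed table

    def tokenize(text):
        out = []
        for tok in text.split():
            if tok in CLITICS:
                out.extend(CLITICS[tok].split())       # expand clitic
            elif re.fullmatch(r'(\w+\.)+\w*|\w+&\w+|\d+/\d+/\d+', tok):
                out.append(tok)         # keep Ph.D., AT&T, 01/02/06 intact
            else:
                # split trailing punctuation into its own tokens
                m = re.fullmatch(r'(.*?)([.,?!]*)', tok)
                out.append(m.group(1))
                out.extend(m.group(2))
        return [t for t in out if t]

    print(tokenize("I'm at AT&T on 01/02/06."))
    # -> ['I', 'am', 'at', 'AT&T', 'on', '01/02/06', '.']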

Sentence Segmentation

- "!" and "?" are relatively unambiguous
- The period "." is quite ambiguous
  - Sentence boundary
  - Abbreviations like Inc. or Dr.
- General idea: build a binary classifier that
  - looks at a "."
  - decides EndOfSentence/NotEOS
  - could use hand-written rules or machine learning
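A sketch of the hand-written-rules option (the abbreviation list and the rules themselves are assumptions, not a real system):

    ABBREVIATIONS = {'Inc.', 'Dr.', 'Mr.', 'Mrs.', 'etc.'}

    def is_eos(tokens, i):
        """Decide EndOfSentence/NotEOS for the '.' ending tokens[i]."""
        word = tokens[i]
        if not word.endswith('.'):
            return False
        if word in ABBREVIATIONS:
            return False                           # e.g. "Dr. Smith"
        if i + 1 < len(tokens) and tokens[i + 1][0].islower():
            return False                           # next word lowercase
        return True

    toks = ['He', 'met', 'Dr.', 'Smith', 'yesterday.', 'They', 'talked.']
    print([t for k, t in enumerate(toks) if is_eos(toks, k)])
    # -> ['yesterday.', 'talked.']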

Word Segmentation in Chinese

- Some languages don't put spaces between words
  - Chinese, Japanese, Thai, Khmer
- Chinese:
  - Words are composed of characters
  - Characters are generally 1 syllable and 1 morpheme
  - The average word is 2.4 characters long
- Standard segmentation algorithm: Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation

Given a wordlist of Chinese and a string:
1. Start a pointer at the beginning of the string
2. Find the longest word in the dictionary that matches the string starting at the pointer
3. Move the pointer past that word in the string
4. Go to step 2
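A direct Python transcription of the algorithm (the wordlist is an English stand-in so the output matches the example on the next slide):

    def max_match(text, wordlist):
        words, pointer = [], 0
        while pointer < len(text):
            # longest dictionary word starting at the pointer, else 1 char
            for end in range(len(text), pointer, -1):
                if text[pointer:end] in wordlist or end == pointer + 1:
                    words.append(text[pointer:end])
                    pointer = end
                    break
        return words

    wordlist = {'the', 'theta', 'table', 'bled', 'down', 'own', 'there'}
    print(max_match('thetabledownthere', wordlist))
    # -> ['theta', 'bled', 'own', 'there']   (greedy goes wrong here)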

English example (Palmer 00)

- Input (spaces removed): "the table down there"
- Greedy output: "theta bled own there"
- Works astonishingly well in Chinese, far better than this English example suggests
- Modern algorithms do better still: probabilistic segmentation

Spelling Correction and Edit Distance

- Non-word error detection: detecting "graffe"
- Non-word error correction: figuring out that "graffe" should be "giraffe"
- Context-dependent error detection and correction: figuring out that "war and piece" should be "war and peace"

Non-Word Error Detection

- Any word not in a dictionary is a spelling error
- Need a big dictionary!
- What to use? An FST dictionary!

Isolated Word Error Correction

How do I fix "graffe"?
- Search through all words: graf, craft, grail, giraffe, …
- Pick the one that's closest to "graffe"
- What does "closest" mean? We need a distance metric.
  - The simplest one: edit distance (i.e., what Unix diff computes)
  - More sophisticated probabilistic ones: the noisy channel

Edit Distance

The minimum edit distance between two strings is the minimum number of editing operations
- insertion
- deletion
- substitution
needed to transform one string into the other.

Minimum Edit Distance

- If each operation has a cost of 1, the distance between "intention" and "execution" (the textbook's example pair) is 5
- If substitutions cost 2 (Levenshtein), the distance between them is 8

Min Edit Example

Min Edit As Search

But that generates a huge search space:
- the initial state is the word we're transforming
- the operators are insert, delete, substitute
- the goal state is the word we're trying to get to
- the path cost is what we're trying to minimize

Navigating that space in a naive backtracking fashion would be incredibly wasteful. Why? Lots of paths wind up at the same states, but there's no need to keep track of them all. We only care about the shortest path to each of those revisited states.
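A sketch of the resulting dynamic programming algorithm: it computes, exactly once, the cheapest way to reach each state (i, j), i.e. the distance between the first i characters of the source and the first j characters of the target:

    def min_edit_distance(source, target, sub_cost=2):
        """Levenshtein variant: insert/delete cost 1, substitution
        cost 2 by default (0 when the characters already match)."""
        n, m = len(source), len(target)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = i                             # i deletions
        for j in range(1, m + 1):
            D[0][j] = j                             # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i-1] == target[j-1] else sub_cost
                D[i][j] = min(D[i-1][j] + 1,        # delete
                              D[i][j-1] + 1,        # insert
                              D[i-1][j-1] + sub)    # substitute or copy
        return D[n][m]

    print(min_edit_distance('intention', 'execution'))              # -> 8
    print(min_edit_distance('intention', 'execution', sub_cost=1))  # -> 5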

Min Edit Distance

Min Edit Example

Summary

- Minimum edit distance
  - A "dynamic programming" algorithm
  - We will see a probabilistic version of this called "Viterbi"

Summary

- Finite state automata
- Deterministic recognition of FSAs
- Non-determinism (NFSAs)
- Recognition of NFSAs
- FSAs, FSTs and morphology
- Very brief sketch: tokenization
- Minimum edit distance

Next Time

On to Chapter 4 (N-grams) for Thursday.