LING 438/538 Computational Linguistics

LING 438/538 Computational Linguistics Sandiway Fong Lecture 15: 10/17

Administrivia Hang on... haven’t sent out homework acknowledgements yet

Last Time
- Prolog implementation of Finite State Transducers (FST) for morphological processing
- Let's revisit the implementation of the (context-sensitive) spelling rule (3.5):
    ε → e / {x,s,z}^ __ s#
  where {x,s,z}^ is the left context and s# is the right context

Stage 2: Intermediate → Surface Levels
- the context-sensitive spelling rule can also be implemented using an FST
- q0 to q2: left context
- q3 to q0: right context

Stage 2: Intermediate → Surface Levels
- Transition table for FST in 3.14
- Problems:
  - the transition table does not match the machine, e.g. q2 ^:ε -> q0
  - interpretation of "other" (the catch-all case): safely pass through any parts of words that don't play a role in the E-insertion rule (p.78)
  - "other" is global to the machine, not local to the state (as I asserted last time to make the machine diagram work)

Stage 2: Intermediate → Surface Levels in Prolog (simplified)

    q0([],[]).                                         % final state
    q0([^|L1],L2) :- !, q0(L1,L2).                     % ^:ε
    q0([z|L1],[z|L2]) :- !, q1(L1,L2).                 % repeat for s,x
    q0([#|L1],[#|L2]) :- !, q0(L1,L2).
    q0([X|L1],[X|L2]) :- \+ mentioned(X), q0(L1,L2).   % other

- ! is known as the "cut" predicate
- it affects how Prolog searches: it means "cut" the search off
- Prolog will not try any other compatible rule on backtracking

Today's Topic
- Chapter 5: Spelling errors and correction
- Error correction
  - correct
  - Bayesian probability
- Minimum Edit Distance computation
  - Dynamic Programming

Spelling Errors
- Textbook cites (Kukich, 1992):
  - Non-word detection (easiest): graffe (giraffe)
  - Isolated-word (context-free) error correction: graffe (giraffe,…), graffed (gaffed,…)
    - by definition, cannot correct when the error word is itself a valid word
  - Context-dependent error detection and correction (hardest): your an idiot → you're an idiot (Microsoft Word corrects this by default)

Spelling Errors
- OCR: visual similarity, e.g. h→b, e→c, jump→jurnps
- Typing: keyboard distance, e.g. small→smsll, spell→spel
- Graffiti (many HCI studies): stroke similarity
  - Common error characters are: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X
  - Two-stroke characters: B, D, P (error: two characters)
- Cognitive errors: bad spellers, e.g. separate→seperate

correct
- textbook section 5.5 (we will go through it again in more detail when we do chapter 6)
- Kernighan et al. (correct):
  - take typo t (not a word)
  - mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter
  - look up "mutated t" in a dictionary
  - candidates are the "mutated t" that are real words
- example (5.2): t = acress
  C = {actress, cress, caress, access, across, acres, acres}
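
The candidate-generation step above can be sketched in Python. This is only a minimal illustration, not the correct program itself: the function names and the toy dictionary are assumptions made for the example.

```python
import string


def single_edits(typo, alphabet=string.ascii_lowercase):
    """Return every string one deletion, insertion, substitution,
    or transposition away from the typo."""
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1}
    substitutions = {left + ch + right[1:]
                     for left, right in splits if right for ch in alphabet}
    insertions = {left + ch + right
                  for left, right in splits for ch in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {typo}


def candidates(typo, dictionary):
    """Keep only the mutated strings that are real words."""
    return sorted(single_edits(typo) & set(dictionary))


# Hypothetical toy dictionary, just large enough for example (5.2).
TOY_DICTIONARY = {"actress", "cress", "caress", "access",
                  "across", "acres", "giraffe"}

print(candidates("acress", TOY_DICTIONARY))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Because the result is a set, acres shows up once here, whereas the slide lists it twice: it can be reached from acress by two different single edits.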

correct
- formula: ĉ = argmax_{c ∈ C} P(t|c) P(c)   (Bayesian inference)
- C = {actress, cress, caress, access, across, acres, acres}
- Prior: P(c), estimated using frequency information over a large corpus (N words)
  - P(c) = freq(c)/N
  - P(c) = (freq(c)+0.5)/(N+0.5V)
    - avoids zero counts (non-occurrences) by adding a fractional part, 0.5
    - cf. add-one smoothing, here with 0.5 instead of 1 (see chapter 6)
    - V is the vocabulary size of the corpus
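
A minimal sketch of the smoothed prior, P(c) = (freq(c)+0.5)/(N+0.5V). The counts below are invented purely to exercise the function; they are not the textbook's corpus figures, and the function name is an assumption.

```python
def smoothed_prior(word, freq, corpus_size, vocab_size, k=0.5):
    """P(c) = (freq(c) + k) / (N + k*V); unseen words get a small non-zero mass."""
    return (freq.get(word, 0) + k) / (corpus_size + k * vocab_size)


# Hypothetical counts: a 20-word corpus over a 4-word vocabulary,
# where "cress" is in the vocabulary but never occurs.
TOY_FREQ = {"actress": 3, "acres": 6, "across": 11}
N = sum(TOY_FREQ.values())   # 20
V = 4

for word in ["actress", "acres", "across", "cress"]:
    print(word, smoothed_prior(word, TOY_FREQ, N, V))
```

With these toy numbers the four priors still sum to 1, and cress gets 0.5/22 rather than zero.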

correct
- Likelihood: P(t|c), the probability of typo t given candidate word c
- using some corpus of errors, compute the following 4 confusion matrices (Figure 5.3), each a 26 x 26 matrix over a–z:
  - del[x,y] = freq(correct xy mistyped as x)
  - ins[x,y] = freq(correct x mistyped as xy)
  - sub[x,y] = freq(correct x mistyped as y)
  - trans[x,y] = freq(correct xy mistyped as yx)
- P(t|c) = del[x,y]/f(xy) if c related to t by deletion of y
- P(t|c) = ins[x,y]/f(x) if c related to t by insertion of y
- etc…
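
A minimal sketch of how the likelihood and prior combine. Only the deletion and insertion cases are implemented, since those are the two the slide spells out. Every count and prior below is made up to exercise the code and does not come from Figure 5.3 or any real corpus; the names are assumptions too.

```python
def likelihood(edit, x, y, matrices, freq):
    """P(t|c) for one single-character edit, using confusion-matrix counts."""
    if edit == "del":                      # correct xy mistyped as x
        return matrices["del"][(x, y)] / freq[x + y]
    if edit == "ins":                      # correct x mistyped as xy
        return matrices["ins"][(x, y)] / freq[x]
    raise NotImplementedError("sub/trans cases are elided on the slide")


def best_candidate(scored):
    """argmax over c of P(t|c) * P(c); scored maps c -> (likelihood, prior)."""
    return max(scored, key=lambda c: scored[c][0] * scored[c][1])


# Hypothetical counts (NOT real data).
MATRICES = {"del": {("c", "t"): 1}, "ins": {("e", "s"): 4}}
FREQ = {"ct": 100, "e": 200}

scored = {
    # actress -> acress: the t after c was deleted
    "actress": (likelihood("del", "c", "t", MATRICES, FREQ), 0.002),
    # acres -> acress: an s was inserted after e
    "acres": (likelihood("ins", "e", "s", MATRICES, FREQ), 0.003),
}
print(best_candidate(scored))
```

With these invented numbers the products are 0.01 × 0.002 and 0.02 × 0.003, so the sketch picks acres.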

correct
- example: for t = acress, the top candidate is acres (44%)
- despite all the math, this is the wrong result for: "was called a stellar and versatile acress"
- what does Microsoft Word use? Try it on the same sentence.

Minimum Edit Distance
- textbook section 5.7
- general string comparison
- edit operations are insertion, deletion and substitution
- not limited to candidates a single operation away (as in correct)
- we can ask how different string a is from string b, using the minimum edit distance

Minimum Edit Distance
- applications
  - could be used for multi-typo correction
  - used in Machine Translation Evaluation (MTEval)
- example
  - Source: 生産工程改善について
  - Translations:
    - (Standard) For improvement of the production process
    - (MT-A) About a production process betterment
    - (MT-B) About the production process improvement method
  - compute the edit distance between MT-A and Standard, and between MT-B and Standard, in terms of word insertion/substitution etc.
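
The word-level comparison can be sketched by running edit distance over lists of words instead of characters. This is a minimal illustration assuming unit costs for all three operations and plain whitespace tokenization; it is not the scoring procedure of any particular MT evaluation tool.

```python
def edit_distance(source, target):
    """Unit-cost edit distance between two sequences (strings or word lists),
    keeping only the previous row of the table."""
    previous = list(range(len(target) + 1))
    for i, s_item in enumerate(source, start=1):
        current = [i]
        for j, t_item in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_item != t_item)
            current.append(min(previous[j] + 1,      # delete s_item
                               current[j - 1] + 1,   # insert t_item
                               substitution))
        previous = current
    return previous[-1]


standard = "For improvement of the production process".split()
mt_a = "About a production process betterment".split()
mt_b = "About the production process improvement method".split()

print(edit_distance(mt_a, standard), edit_distance(mt_b, standard))  # 5 5
```

Under these assumptions both translations are 5 word edits away from the standard, so a different cost model or tokenization would be needed to separate them.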

Minimum Edit Distance
- cost models
  - Levenshtein: insertion, deletion and substitution all have unit cost
  - Levenshtein (alternate): insertion and deletion have unit cost; substitution is twice as expensive (substitution = one insert followed by one delete)
  - Typewriter: modified by key proximity
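
A minimal sketch of how the first two cost models plug into the same table-filling computation. The function and parameter names are assumptions, and the typewriter model is not covered because it needs a key-proximity table.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Fill the full (len(source)+1) x (len(target)+1) table of minimum costs."""
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):               # source prefix -> empty string
        table[i][0] = table[i - 1][0] + del_cost
    for j in range(1, cols):               # empty string -> target prefix
        table[0][j] = table[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            table[i][j] = min(
                table[i - 1][j] + del_cost,
                table[i][j - 1] + ins_cost,
                table[i - 1][j - 1] + (0 if same else sub_cost),
            )
    return table[-1][-1]


print(min_edit_distance("intention", "execution"))              # Levenshtein: 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # alternate: 8
```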

Minimum Edit Distance
- Dynamic Programming
  - divide-and-conquer: to solve a problem, we divide it into sub-problems
  - sub-problems may be repeated; we don't want to re-solve a sub-problem the 2nd time around
  - idea: put solutions to sub-problems in a table and just look up the solution the 2nd time around, thereby saving time (memoization)

Fibonacci Series
- definition: F(0)=0, F(1)=1, F(n) = F(n-1) + F(n-2) for n≥2
- sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946...
- computation:
    F(5)
    = F(4)+F(3)
    = F(3)+F(2)+F(3)
    = F(2)+F(1)+F(2)+F(2)+F(1)
  (the sub-problems F(3) and F(2) are repeated)
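
A minimal sketch contrasting the naive recursion with a memoized one. The call counter is an illustrative addition, and `functools.lru_cache` stands in for the lookup table.

```python
from functools import lru_cache


def fib_naive(n, counter):
    """Plain recursion: repeated sub-problems are re-solved every time."""
    counter["calls"] += 1
    if n < 2:
        return n
    return fib_naive(n - 1, counter) + fib_naive(n - 2, counter)


@lru_cache(maxsize=None)
def fib_memo(n):
    """Memoized recursion: each F(n) is computed once, then looked up."""
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)


counter = {"calls": 0}
print(fib_naive(5, counter), counter["calls"])  # 5 15
print(fib_memo(20))                             # 6765
```

The naive version needs 15 calls just for F(5); the memoized version computes each of F(0)..F(n) once.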

Minimum Edit Distance
- can be defined incrementally
- example: intention → execution
- cost model: insertion, deletion have unit cost; substitution is twice as expensive
- suppose we have the following minimum costs:
  - intent → execu (9)
  - inten → execut (9)
  - inten → execu (8)
- the minimum cost for intent → execut (8) is given by computing the possibilities, each one edit operation away:
  - intent → execu (9), then insert t to get execut: 9 + 1 = 10
  - delete t from intent to get inten, then inten → execut (9): 1 + 9 = 10
  - inten → execu (8), then substitute t for t (zero cost): 8 + 0 = 8
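
The incremental definition can be sketched as a memoized recursion over prefix lengths, here with unit insertion/deletion cost and substitution cost 2 as on this slide. Names are illustrative assumptions, and a bottom-up table would work equally well.

```python
from functools import lru_cache


def min_cost(source, target, sub_cost=2):
    """Minimum edit cost, defined in terms of the three one-step-smaller problems."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j): minimum cost of turning source[:i] into target[:j]
        if i == 0:
            return j                      # insert the j target characters
        if j == 0:
            return i                      # delete the i source characters
        same = source[i - 1] == target[j - 1]
        return min(
            d(i, j - 1) + 1,                           # ... then insert target[j-1]
            d(i - 1, j) + 1,                           # delete source[i-1], then ...
            d(i - 1, j - 1) + (0 if same else sub_cost),
        )

    return d(len(source), len(target))


print(min_cost("intent", "execu"),
      min_cost("inten", "execut"),
      min_cost("inten", "execu"))        # 9 9 8
print(min_cost("intent", "execut"))      # min(9+1, 9+1, 8+0) = 8
```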

Minimum Edit Distance
- Generally: the formula is right, the table (figure 5.6) is wrong!
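
The formula itself did not survive in this transcript. As a sketch, the standard recurrence is shown below, assuming unit insertion/deletion cost and substitution cost 2 (the alternate Levenshtein model); the symbols D, s and t are chosen here for illustration, with D(i,j) the minimum cost of turning the first i characters of s into the first j characters of t.

```latex
\begin{aligned}
D(i,0) &= i, \qquad D(0,j) = j \\
D(i,j) &= \min
  \begin{cases}
    D(i-1,j) + 1 & \text{(delete } s_i\text{)} \\
    D(i,j-1) + 1 & \text{(insert } t_j\text{)} \\
    D(i-1,j-1) + \mathrm{sub}(s_i,t_j) & \text{(substitute or copy)}
  \end{cases} \\
\mathrm{sub}(s_i,t_j) &=
  \begin{cases}
    0 & \text{if } s_i = t_j \\
    2 & \text{otherwise}
  \end{cases}
\end{aligned}
```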

Minimum Edit Distance
- Generally, for example: for inte → e the min cost should be 3

Minimum Edit Distance Computation
- Microsoft Excel implementation of the formula: try it!
- $ in a cell reference means don't change when copied from cell to cell, e.g.
  - in C$1, the 1 stays the same
  - in $A3, the A stays the same