
1 Globalisation Week 8: Text processes, part 2
   Spelling dictionaries
   Noisy channel model
   Candidate strings
   Prior probability and likelihood
   Lab session: practising regular expressions

2 Spelling dictionaries
   Aim: given a sequence of symbols,
   1. identify misspelled strings
   2. generate a list of possible ‘candidate’ correct strings
   3. select the most probable candidate from the list

3 Spelling dictionaries
   Implementation: probabilistic framework
      Bayes' rule
      noisy channel model

7 Spelling dictionaries
   Types of spelling error
   actual word errors (the misspelling is itself a legal word)
      /piece/ instead of /peace/
      /there/ instead of /their/
   non-word errors (the misspelling is not a legal word)
      /graffe/ instead of /giraffe/
   Of all errors in typewritten texts, 80% are non-word errors.

8 Spelling dictionaries
   non-word errors
   Cognitive errors
      /seperate/ instead of /separate/
      A phonetically equivalent sequence of symbols has been substituted, due to lack of knowledge about spelling conventions.

9 Spelling dictionaries
   non-word errors
   Cognitive errors
   Typographic (‘typo’) errors
      influenced by the keyboard layout
      e.g. substitution of /w/ for /e/ because the two keys are adjacent: /thw/ instead of /the/
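
The keyboard-adjacency idea on this slide can be illustrated with a small check. This is only a sketch: the names `QWERTY_ROWS`, `row_neighbours` and `is_adjacent_substitution` are hypothetical, a QWERTY layout is assumed, and only left/right neighbours on the same row are considered.

```python
# Assumed layout: the three letter rows of a QWERTY keyboard.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]

def row_neighbours(ch):
    """Return the keys directly left and right of ch on its own row."""
    for row in QWERTY_ROWS:
        i = row.find(ch)
        if i != -1:
            return {row[j] for j in (i - 1, i + 1) if 0 <= j < len(row)}
    return set()

def is_adjacent_substitution(typo, word):
    """True if typo differs from word in exactly one position and the
    typed key is a same-row neighbour of the intended key."""
    if len(typo) != len(word):
        return False
    diffs = [(t, w) for t, w in zip(typo, word) if t != w]
    return len(diffs) == 1 and diffs[0][0] in row_neighbours(diffs[0][1])

print(is_adjacent_substitution("thw", "the"))            # typographic error
print(is_adjacent_substitution("seperate", "separate"))  # cognitive error
```

Under these assumptions /thw/ counts as a keyboard slip, while /seperate/ does not, because /e/ and /a/ are not neighbouring keys.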

10 Spelling dictionaries
   non-word errors: noisy channel model
   The actual word has been passed through a noisy communication channel.
   The channel has distorted the word, changing it in some way.
   The misspelled word is the distorted version of the actual word.
   Aim: recover the actual word by hypothesising about the possible ways in which it could have been distorted.

11 Spelling dictionaries
   non-word errors: noisy channel model
   What are the possible distortions?
      insertion
      deletion
      substitution
      transposition
   All of these are viewed as transformations that take place in the noisy channel.
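
The four distortions can be sketched as string operations. The function name `single_edits` and the lowercase a-z alphabet are assumptions made for illustration. Note the direction: to undo a deletion in the channel (/giraffe/ becoming /graffe/), the sketch inserts a letter into the typo.

```python
import string

def single_edits(word, alphabet=string.ascii_lowercase):
    """Return every string one transformation away from word."""
    # All ways of cutting the word into a left part and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {a + b[1:] for a, b in splits if b}
    transpositions = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutions = {a + ch + b[1:] for a, b in splits if b for ch in alphabet}
    insertions = {a + ch + b for a, b in splits for ch in alphabet}
    # The word itself is not a correction of itself.
    return (deletions | transpositions | substitutions | insertions) - {word}

print("giraffe" in single_edits("graffe"))  # insertion undoes a deletion
print("the" in single_edits("thw"))         # substitution
print("the" in single_edits("teh"))         # transposition
```

Most of the strings generated this way are not words, which is why they still need to be checked against a dictionary.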

12 Spelling dictionaries Implementing spelling identification and correction algorithm

13 Spelling dictionaries
   Implementing spelling identification and correction algorithm
   STAGE 1: compare each string in the document with a list of legal strings; if there is no corresponding string in the list, mark it as misspelled
   STAGE 2: generate a list of candidates
      apply any single transformation to the typo string
      filter the list by checking against a dictionary
   STAGE 3: assign a probability value to each candidate in the list
   STAGE 4: select the best candidate
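
Stage 1 can be sketched with a regular expression and a word list. The list `LEGAL_WORDS` below is a tiny made-up stand-in for a real spelling dictionary, and the function name `find_nonwords` is hypothetical.

```python
import re

# Toy list of legal strings, assumed for illustration only.
LEGAL_WORDS = {"the", "giraffe", "ate", "leaves", "piece", "peace"}

def find_nonwords(text, lexicon=LEGAL_WORDS):
    """Return the tokens of text that do not appear in the lexicon."""
    # Lowercase the text and keep runs of the letters a-z as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens if tok not in lexicon]

print(find_nonwords("Thw graffe ate the leaves"))
print(find_nonwords("the piece"))
```

The second call flags nothing: a lookup of this kind only finds non-word errors, so /piece/ typed for /peace/ passes unnoticed.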

14 Spelling dictionaries
   STAGE 3
   prior probability
      Given all the words in English, is this candidate more likely to be what the typist meant than that candidate?
      P(c) = count(c) / N, where count(c) is the number of times candidate c occurs in a corpus and N is the total number of words in that corpus
   likelihood
      Given the possible errors, or transformations, how likely is it that a particular error has operated on candidate c to produce the typo t?
      P(t | c), calculated using a corpus of errors, or transformations
   Bayes' rule: score each candidate by the product of the prior probability and the likelihood
      P(c) × P(t | c)
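
A worked sketch of the scoring step, followed by picking the highest-scoring candidate. All numbers are invented for illustration: `toy_counts` stands in for corpus frequencies (its total plays the role of N) and `toy_channel` stands in for likelihoods estimated from a corpus of errors. The function name `rank_candidates` is hypothetical.

```python
def rank_candidates(typo, candidates, word_counts, channel_prob):
    """Score each candidate c by P(c) * P(t | c), best first."""
    total = sum(word_counts.values())  # N: size of the (toy) corpus
    scored = []
    for c in candidates:
        prior = word_counts.get(c, 0) / total           # P(c) = count(c) / N
        likelihood = channel_prob.get((typo, c), 0.0)   # P(t | c)
        scored.append((prior * likelihood, c))
    return sorted(scored, reverse=True)

# Invented figures, not taken from any real corpus.
toy_counts = {"the": 900, "thaw": 60, "thew": 40}
toy_channel = {
    ("thw", "the"): 0.001,   # substitution of w for e
    ("thw", "thaw"): 0.002,  # deletion of a
    ("thw", "thew"): 0.002,  # deletion of e
}

ranked = rank_candidates("thw", ["the", "thaw", "thew"], toy_counts, toy_channel)
for score, candidate in ranked:
    print(candidate, score)
```

With these made-up figures /the/ comes out on top even though its likelihood is the lowest of the three, because its prior is so much higher.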

15 Spelling dictionaries
   non-word errors
   Implementing spelling identification and correction algorithm
   STAGE 1: identify misspelled words
   STAGE 2: generate list of candidates
   STAGE 3: rank candidates by probability
   STAGE 4: select best candidate
   Implement: noisy channel model, Bayes' rule

16 Spelling dictionaries
   Main reference: Jurafsky and Martin (2000), chapter 5

