Problem 1: Word Segmentation whatdoesthisreferto.

Presentation on theme: "Problem 1: Word Segmentation whatdoesthisreferto."— Presentation transcript:

1 Problem 1: Word Segmentation whatdoesthisreferto

2 Application: Chinese Text

3 Application: Internet Domain Names www.visitbritain.com Visit Britain

4 Statistical Machine Learning
Best segmentation = one with highest probability
Probability of a segmentation = P(first word) × P(rest of segmentation)
P(word) = estimated by counting

5 Statistical Machine Learning
choosespain
Choose Spain
Chooses pain
P(“Choose Spain”) > P(“Chooses Pain”)
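The comparison on this slide can be sketched in a few lines of Python. The counts and corpus size below are invented purely for illustration; they are not taken from the presentation's corpus, and the function names are hypothetical.

```python
from math import prod

# Hypothetical word counts, invented for illustration only.
COUNTS = {"choose": 500, "spain": 300, "chooses": 100, "pain": 400}
TOTAL = 1_000_000  # assumed corpus size (number of word tokens)

def p_word(word):
    """P(word), estimated by counting: occurrences / corpus size."""
    return COUNTS.get(word, 0) / TOTAL

def p_segmentation(words):
    """P(first word) x P(rest), which unrolls to a product over all words."""
    return prod(p_word(w) for w in words)

a = p_segmentation(["choose", "spain"])
b = p_segmentation(["chooses", "pain"])
print(a > b)  # with these toy counts, "choose spain" scores higher
```

With real counts the same comparison decides between the two readings; only the numbers change.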

6 Example
segment(“nowisthetime…”)
P_f(“n”) × P_r(“owisthetime…”)
P_f(“no”) × P_r(“wisthetime…”)
P_f(“now”) × P_r(“isthetime…”)
P_f(“nowi”) × P_r(“sthetime…”)
……

7 Example segment(“nowisthetime…”)
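The enumeration in the example (try every first word, then segment the rest) might be written as the recursive sketch below. This is not the program from the slides, whose code is not in the transcript. The toy counts, the `max_len` limit, and the penalty for unseen words are all assumptions made so the example runs on its own.

```python
from functools import lru_cache

# Hypothetical counts, invented for illustration only.
COUNTS = {"no": 60, "now": 50, "is": 200, "the": 500, "time": 40, "i": 100}
TOTAL = sum(COUNTS.values())

def p_word(word):
    """P(word) by counting; unseen words get a tiny probability that
    shrinks with length (an assumption, not stated on the slides)."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))

def p_words(words):
    """Probability of a segmentation: product of its word probabilities."""
    p = 1.0
    for w in words:
        p *= p_word(w)
    return p

def splits(text, max_len=20):
    """All (first, rest) pairs where first is non-empty, up to max_len chars."""
    return [(text[:i], text[i:]) for i in range(1, min(len(text), max_len) + 1)]

@lru_cache(maxsize=None)  # memoize so each suffix is segmented only once
def segment(text):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=p_words)

print(segment("nowisthetime"))
```

Memoization matters here: without it the same suffixes would be re-segmented many times, and the running time would grow exponentially with the length of the input.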

8 The Complete Program

9 Performance
Accuracy = 98%
Trained on 1.7B words (English)
Typical errors: baseratesoughtto, smallandinsignificant, ginormousego

10 Some Results
whorepresents.com → [“who”, “represents”]
therapistfinder.com → [“therapist”, “finder”]
expertsexchange.com → [“experts”, “exchange”]
speedofart.net → [“speed”, “of”, “art”]
penisland.com → error: expected [“pen”, “island”]

11 Problem 2: Spelling Correction
Mehran Salami
Typical word processor: → Tehran Salami
But Google can …

12

13 Statistical Machine Learning
Best correction = one with highest probability
Probability of a spelling correction c = P(c as a word) × P(original is a typo for c)
P(c as a word) = estimated by counting
P(original is a typo for c) = estimated from the number of changes (the fewer the changes, the higher the probability)
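A minimal sketch of this scoring rule follows. It is not the program from the next slide, whose code is not in the transcript. The counts, the `P_CHANGE` weight, and the restriction to candidates at most one change away are assumptions chosen to keep the example short.

```python
import string

# Hypothetical counts, invented for illustration only.
COUNTS = {"the": 500, "ten": 40, "tea": 30}
TOTAL = sum(COUNTS.values())
P_CHANGE = 0.01  # assumed weight for a candidate one change away

def p_word(word):
    """P(c as a word), estimated by counting."""
    return COUNTS.get(word, 0) / TOTAL

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in letters]
    inserts = [a + c + b for a, b in pairs for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(original):
    """Pick the candidate maximizing P(c as a word) x P(original is a typo for c)."""
    scored = {original: p_word(original) * 1.0}  # zero changes: weight 1
    for c in edits1(original):
        if c != original:
            scored[c] = p_word(c) * P_CHANGE     # one change: smaller weight
    best = max(scored, key=scored.get)
    # If no candidate is a known word, leave the input unchanged.
    return best if scored[best] > 0 else original

print(correct("teh"))
```

A known word is kept as typed unless a neighbour is more than 1 / `P_CHANGE` times as frequent, which is one simple way to trade off the two factors.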

14 The Complete Program

15 Problem 3: Speech Recognition
An informal, incomplete grammar of the English language runs over 1,700 pages.
Invariably, simple models and a lot of data trump more elaborate models based on less data.

16 Problem 3: Speech Recognition
If you have a lot of data, memorisation is a good policy.
For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.

17 Problem 3: Speech Recognition

18

19 “Every time I fire a linguist, the performance of our speech recognition system goes up.” --- Fred Jelinek

20 Problem 4: Machine Translation

21 Conclusion
(Statistical) [Machine] Learning Is The Ultimate Agile Development Tool
Peter Norvig (Director of Research, Google)

