Problem 1: Word Segmentation
  whatdoesthisreferto
Application: Chinese Text
Application: Internet Domain Names
  Visit Britain
Statistical Machine Learning
  Best segmentation = the one with the highest probability
  Probability of a segmentation = P(first word) × P(rest of segmentation)
  P(word) = estimated by counting
Statistical Machine Learning
  choosespain
    Choose Spain
    Chooses pain
  P(“Choose Spain”) > P(“Chooses Pain”)
Example
  segment(“nowisthetime…”)
    P_f(“n”) × P_r(“owisthetime…”)
    P_f(“no”) × P_r(“wisthetime…”)
    P_f(“now”) × P_r(“isthetime…”)
    P_f(“nowi”) × P_r(“sthetime…”)
    ……
Example
  segment(“nowisthetime…”)
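The recursion on the slides above (probability of the first word times the probability of the best segmentation of the rest) can be sketched in a few lines of Python. This is only an illustrative sketch, not the program from the talk: the COUNTS table is a hypothetical toy vocabulary standing in for counts from a large corpus, and the unseen-word penalty in p_word is an assumption made so that unknown strings still get a small, length-dependent probability.

```python
from functools import lru_cache
from math import prod

# Hypothetical toy counts; a real model would count words in a large corpus.
COUNTS = {"now": 50, "is": 200, "the": 500, "time": 40, "no": 60,
          "choose": 20, "chooses": 5, "spain": 10, "pain": 15}
TOTAL = sum(COUNTS.values())

def p_word(word):
    """P(word), estimated by counting. Unseen words get a small
    probability that shrinks with length (an assumed smoothing rule)."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))

def p_words(words):
    """Probability of a segmentation: product of its word probabilities."""
    return prod(p_word(w) for w in words)

def splits(text, max_len=20):
    """All (first, rest) pairs with a non-empty first word."""
    return [(text[:i], text[i:])
            for i in range(1, min(len(text), max_len) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Best segmentation of text, returned as a tuple of words."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=p_words)

print(segment("nowisthetime"))
print(segment("choosespain"))
```

Memoising segment with lru_cache keeps the recursion from re-solving the same suffix many times. For long inputs, summing log probabilities instead of multiplying raw probabilities would avoid floating-point underflow.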
The Complete Program
Performance
  Accuracy = 98%
  Trained on 1.7B words (English)
  Typical errors:
    baseratesoughtto
    smallandinsignificant
    ginormousego
Some Results
  whorepresents.com      [“who”, “represents”]
  therapistfinder.com    [“therapist”, “finder”]
  expertsexchange.com    [“experts”, “exchange”]
  speedofart.net         [“speed”, “of”, “art”]
  penisland.com          error: expected [“pen”, “island”]
Problem 2: Spelling Correction
  Mehran Salami
  Typical word processor: Tehran Salami
  But Google can …
Statistical Machine Learning
  Best correction = the one with the highest probability
  Probability of a spelling correction c = P(c as a word) × P(original is a typo for c)
  P(c as a word) = estimated by counting
  P(original is a typo for c) = estimated from the number of changes (fewer changes, higher probability)
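The scoring rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program from the talk: COUNTS is a hypothetical toy vocabulary, EDIT_PENALTY is an assumed constant factor applied once per single-character change, and candidates are limited to words within two changes of the original.

```python
import string

# Hypothetical toy counts; a real model would count words in a large corpus.
COUNTS = {"the": 500, "then": 40, "they": 60, "spelling": 10, "correction": 8}
TOTAL = sum(COUNTS.values())
EDIT_PENALTY = 0.01  # assumed factor per change, standing in for P(typo)

def p_word(word):
    """P(c as a word), estimated by counting."""
    return COUNTS.get(word, 0) / TOTAL

def edits1(word):
    """All strings one deletion, transposition, replacement or insertion away."""
    letters = string.ascii_lowercase
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in letters]
    inserts = [a + c + b for a, b in pairs for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word):
    """Map each known word within two changes to its smallest change count."""
    found = {}
    if word in COUNTS:
        found[word] = 0
    one_away = edits1(word)
    for c in one_away:
        if c in COUNTS:
            found.setdefault(c, 1)
    for e in one_away:
        for c in edits1(e):
            if c in COUNTS:
                found.setdefault(c, 2)
    return found

def correct(word):
    """Candidate maximising P(c as a word) × P(original is a typo for c)."""
    found = candidates(word)
    if not found:
        return word  # no known word nearby: leave the input unchanged
    return max(found, key=lambda c: p_word(c) * EDIT_PENALTY ** found[c])

print(correct("thw"), correct("speling"))
```

A fixed per-change penalty is the simplest possible error model. A more realistic one would estimate the probability of each specific change from observed typos.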
The Complete Program
Problem 3: Speech Recognition
  An informal, incomplete grammar of the English language runs over 1,700 pages.
  Invariably, simple models and a lot of data trump more elaborate models based on less data.
Problem 3: Speech Recognition
  If you have a lot of data, memorisation is a good policy.
  For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.
Problem 3: Speech Recognition
“Every time I fire a linguist, the performance of our speech recognition system goes up.” --- Fred Jelinek
Problem 4: Machine Translation
Conclusion
  (Statistical) [Machine] Learning Is The Ultimate Agile Development Tool
  Peter Norvig (Director of Research, Google)