Statistical Natural Language Processing. Advanced AI - Part II. Luc De Raedt, University of Freiburg, WS 2005/2006. Many slides taken from Helmut Schmid.
Topic: Statistical Natural Language Processing applies Machine Learning / Statistics to Natural Language Processing. Learning: the ability to improve one’s behaviour at a specific task over time; it involves the analysis of data (statistics). The course follows parts of the book Statistical NLP (Manning and Schuetze), MIT Press, 1999.
Rationalism versus Empiricism. Rationalist: Noam Chomsky - innate language structures. AI: hand coding NLP. Dominant view. Cf. e.g. Steven Pinker’s The Language Instinct (popular science book). Empiricist: the ability to learn is innate. AI: language is learned from corpora. Dominant and becoming increasingly important.
Rationalism versus Empiricism. Noam Chomsky: "But it must be recognized that the notion of 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." Fred Jelinek (IBM, 1988): "Every time a linguist leaves the room the recognition rate goes up." (Alternative: "Every time I fire a linguist the recognizer improves.")
This course: Empiricist approach. The focus will be on probabilistic models for learning of natural language. There is no time to treat natural language in depth (though this would be quite useful and interesting); it deserves a full course by itself. It is covered in more depth in Logic, Language and Learning (SS 05, prob. SS 06).
Ambiguity
NLP and Statistics: Statistical Disambiguation. (1) Define a probability model for the data. (2) Compute the probability of each alternative. (3) Choose the most likely alternative.
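The three steps can be sketched in a few lines of Python. This is only an illustration: the function name `disambiguate` and the toy probabilities for two readings of "bank" are assumptions made up for this example, not values from the course.

```python
def disambiguate(alternatives, prob):
    """Return the alternative with the highest probability under `prob`."""
    return max(alternatives, key=prob)

# Step 1: a (hypothetical) probability model, here just a lookup table.
# The readings and numbers are invented for illustration.
toy_model = {"bank/river": 0.2, "bank/finance": 0.8}

# Steps 2 and 3: score each alternative and keep the most likely one.
best = disambiguate(toy_model.keys(), toy_model.get)
print(best)  # bank/finance
```

In a real system the lookup table would be replaced by a model estimated from a corpus; the selection step stays the same.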
NLP and Statistics: Statistical methods deal with uncertainty. They predict the future behaviour of a system based on the behaviour observed in the past. Statistical methods require training data. The data in Statistical NLP are the corpora.
Corpora. Corpus: a text collection for linguistic purposes. Tokens: how many words are contained in Tom Sawyer? Types: how many different words are contained in Tom Sawyer? Hapax legomena: words appearing only once.
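The three notions can be made concrete with a small counting sketch. The tokenizer (lowercasing and keeping runs of letters and apostrophes) and the example sentence are assumptions for illustration; a real corpus study would need a more careful tokenizer.

```python
import re
from collections import Counter

def corpus_stats(text):
    """Return (number of tokens, number of types, sorted hapax legomena)."""
    # Simplistic tokenizer (an assumption): lowercase, letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hapax = sorted(w for w, c in counts.items() if c == 1)
    return len(tokens), len(counts), hapax

n_tokens, n_types, hapax = corpus_stats(
    "the cat saw the dog and the dog saw a bird"
)
print(n_tokens, n_types, hapax)  # 11 7 ['a', 'and', 'bird', 'cat']
```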
Word Counts. The most frequent words are function words.

word  freq    word  freq
the   3332    in    906
and   2972    that  877
a     1775    he    877
to    1725    I     783
of    1440    his   772
was   1161    you   686
it    1027    Tom   679
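Word Counts. How many words appear f times? (Table of frequency f against n_f, the number of words with that frequency; the values are not recoverable.) About half of the words occur just once. About half of the text consists of the 100 most common words ….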
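The question "how many words appear f times?" amounts to counting the counts. A minimal sketch follows; the function name and the toy token list are assumptions for illustration.

```python
from collections import Counter

def frequency_of_frequencies(tokens):
    """Map each frequency f to n_f, the number of word types seen f times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())

tokens = "a b a c b a d".split()
n_f = frequency_of_frequencies(tokens)
print(dict(n_f))  # two types occur once, one occurs twice, one three times
```

Here `n_f[1]` is the number of hapax legomena.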
Word Counts (Brown corpus)
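Zipf's Law. Table with columns word, f (frequency), r (rank) and f*r; the numeric values are not recoverable. Words in the first column group: the, and, a, he, but, be, there, one, about, more, never, Oh, two. Words in the second column group: turned, you'll, name, comes, group, lead, friends, begin, family, brushed, sins, Could, Applausive. Zipf's Law: f ~ 1/r (f*r = const). Explanation: minimize effort.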
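A table of this kind can be rebuilt from any token list by ranking words by frequency and multiplying frequency by rank. The sketch below uses a made-up token list chosen so that f*r is exactly constant; real text follows the law only approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return rows (word, f, r, f*r), most frequent word first (rank 1)."""
    ranked = Counter(tokens).most_common()
    return [(word, f, r, f * r) for r, (word, f) in enumerate(ranked, start=1)]

# Toy data (an assumption): frequencies 6, 3, 2 at ranks 1, 2, 3.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
```

Words with equal frequency receive consecutive ranks in the order `most_common` returns them, which is one of several possible conventions.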
Some probabilistic models. N-grams: predicting the next word ("Artificial intelligence and machine ….", "Statistical natural language …."). Probabilistic: regular (Markov Models), Hidden Markov Models, Conditional Random Fields, context-free grammars, (stochastic) Definite Clause Grammars.
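Next-word prediction with the simplest n-gram model, a bigram model, can be sketched as follows. The training sentence is invented for illustration, and the estimate is a plain relative frequency without smoothing, which a practical model would need.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(following, word):
    """Return (most likely next word, its relative frequency), or None."""
    counts = following.get(word)
    if not counts:
        return None  # word never seen with a successor
    best, c = counts.most_common(1)[0]
    return best, c / sum(counts.values())

# Toy training text (an assumption, not course data).
corpus = ("statistical natural language processing applies machine "
          "learning to natural language processing").split()
model = train_bigrams(corpus)
print(predict_next(model, "natural"))  # ('language', 1.0)
```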
Illustration: Wall Street Journal Corpus (size in words not given). The correct parse tree for each sentence is known; the trees were constructed by hand. They can be used to derive stochastic context-free grammars (SCFGs). SCFGs assign probabilities to parse trees, so one can compute the most probable parse tree.
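Under an SCFG, the probability of a parse tree is the product of the probabilities of the rules used in it. The sketch below illustrates this with a hypothetical toy grammar; the rules, their probabilities and the tuple encoding of trees are all assumptions made for this example.

```python
# Hypothetical toy grammar: (lhs, rhs) -> probability.
# Probabilities of rules sharing a left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("eats",)): 1.0,
}

def tree_probability(tree):
    """tree = (label, child, ...); each child is a subtree or a word (str)."""
    label, *children = tree
    # The rule's right-hand side: child labels, or the word itself for leaves.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES[(label, rhs)]
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p

tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
p = tree_probability(tree)
print(p)  # approximately 0.24 (0.4 * 0.6)
```

Given several candidate trees for one sentence, the most probable parse would be `max(candidates, key=tree_probability)`; efficient parsers avoid enumerating all candidates, which this sketch does not attempt.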
Conclusions: Overview of some probabilistic and machine learning methods for NLP. Also very relevant to bioinformatics! Analogy between parsing a sentence and parsing a biological string (DNA, protein, mRNA, …).