Language and Statistics


1 11-761 Language and Statistics
Spring 2010 Roni Rosenfeld

2 Course Goals and Style
- Teaching statistical techniques for language technologies
- Plugging gaping holes in LTI grad student education in probability, statistics and information theory
26 December 2018 © Roni Rosenfeld, 2010

3 Course philosophy
- Socratic Method
  - participation strongly encouraged (pls state your name)
- Highly interactive
- Highly adaptable
  - based on how fast we move
- Lots of Probability, Statistics, Information theory
  - not in the abstract, but rather as the need arises
- Lectures emphasize intuition, not rigor or detail
  - background reading will have rigor & detail

4 Course Mechanics
- Highly recommended: learn & use a text processing language like Perl, Python, awk…
- Can you derive Bayes' equation in your sleep?
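
For readers who want to check themselves against the Bayes question, here is one standard derivation, written as a sketch. The event names A and B are placeholders chosen for illustration, not notation taken from the slides, and the result assumes P(B) > 0.

```latex
% Both factorizations of the joint probability follow from the
% definition of conditional probability.
\begin{align*}
P(A, B) &= P(A \mid B)\, P(B) \\
P(A, B) &= P(B \mid A)\, P(A) \\
\intertext{Equating the two right-hand sides and dividing by $P(B) > 0$:}
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align*}
```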

5 Background Material
- No single book exists which covers the course material.
- “Foundations of Statistical NLP”, Manning & Schütze
  - Computational Linguistics perspective
- “Statistical Methods in Speech Recognition”, Jelinek
- “Text Compression”, Bell, Cleary & Witten
  - first 4 chapters; rest is mostly text compression
- “Probability and Statistics”, DeGroot
- “All of Statistics” & “All of nonparametric Statistics”, Wasserman
- Lots of individual articles

6 Syllabus (subject to change)
- Overview and Grand Thoughts
  - What Is All This Good For?
  - source-channel formulation
- Words, Words, Words
  - type vs. token, Zipf, Mandelbrot, heterogeneity of language
- Modeling Word distributions - the unigram:
  - [estimators, ML, zero frequency, G-T]
- N-grams:
  - Deleted Interpolation Model, backoff, toolkit
- Measuring Success:
  - perplexity [entropy, KL-div, MI], the entropy of English, alternatives
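
A few of the syllabus terms above (type vs. token, the maximum-likelihood unigram, the zero-frequency problem, perplexity) can be illustrated with a minimal Python sketch. This is not course material: the function names `unigram_mle` and `perplexity` and the toy sentence are hypothetical, chosen only for illustration, and the model is deliberately left unsmoothed so that the zero-frequency problem shows up.

```python
import math
from collections import Counter

def unigram_mle(tokens):
    """Maximum-likelihood unigram estimate: P(w) = count(w) / N."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def perplexity(model, tokens):
    """Perplexity 2^(-(1/N) * sum(log2 P(w))) of tokens under a unigram model.

    Returns infinity if any token has probability zero, which is what
    an unsmoothed ML estimate assigns to every unseen word.
    """
    log2_sum = 0.0
    for word in tokens:
        p = model.get(word, 0.0)
        if p == 0.0:
            return math.inf
        log2_sum += math.log2(p)
    return 2 ** (-log2_sum / len(tokens))

# Hypothetical toy corpus, for illustration only.
train = "the cat sat on the mat".split()
model = unigram_mle(train)

n_tokens = len(train)   # running words, counting repeats
n_types = len(model)    # distinct words

pp_seen = perplexity(model, "the cat".split())
pp_unseen = perplexity(model, "the dog".split())  # "dog" never seen in training

print(n_tokens, n_types, pp_seen, pp_unseen)
```

The infinite perplexity on the second test string is the zero-frequency problem in miniature: a single unseen word drives the probability of the whole string to zero, which is why an unsmoothed ML estimate is rarely used as is.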

7 Syllabus (continued)
- Clustering:
  - class-based N-grams, hierarchical clustering
- Latent Variable Models, EM
  - Hidden Markov Models, revisiting interpolated and class n-grams
- Part-Of-Speech tagging, Word Sense Disambiguation
- Decision & Regression Trees
- Stochastic Grammars (SCFG, inside-outside alg., Link grammar)
- Maximum Entropy Modeling
  - exponential models, ME principle, feature induction...

8 Syllabus (continued)
- Language Model Adaptation
  - caches, backoff
- Dimensionality reduction
  - latent semantic analysis
- Statistical Parsing
- Statistical Machine Translation
- Statistical Text Segmentation
- Statistical Information Retrieval
- Statistical Information Extraction

