Probabilistic models in Phonology
John Goldsmith, University of Chicago
Tromsø: CASTL, August 2005

Mange takk! ("Many thanks!")

Probabilistic phonology
Why a phonologist should be interested in probabilistic tools for understanding phonology, and analyzing phonological data…
– Because probabilistic models are very powerful, and can tell us much about data even without recourse to structural assumptions, and
– Probabilistic models can be used to teach us about phonological structure.
The two parts of today's talk will address each of these.

Probabilistic models
– Are well-understood mathematically;
– Have powerful methods associated with them for learning parameters from data;
– Are the ultimate formal model for understanding competition.

Essence of probabilistic models: Whenever there is a choice-point in a grammar, we must assign degrees of expectedness to each of the different choices, and we do this in such a way that these quantities add up to 1.0.
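As a minimal illustration (a sketch, not from the slides; the onsets and their probabilities below are invented), a choice point is just a set of alternatives whose assigned probabilities sum to 1.0:

onset_choices = {"t": 0.40, "tr": 0.25, "st": 0.20, "str": 0.15}   # hypothetical onset choices at one point in a grammar
assert abs(sum(onset_choices.values()) - 1.0) < 1e-9               # the degrees of expectedness must sum to 1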

Frequencies and probabilities
– Frequencies are numbers that we observe (or count);
– Probabilities are parameters in a theory.
We can set our probabilities on the basis of the (observed) frequencies, but we do not need to do so. We often do so for one good reason:

Maximum likelihood
A basic principle of empirical success is this:
– Find the probabilistic model that assigns the highest probability to a (pre-established) set of data (observations).
Maximize the probability of the data.
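For the simple segmental models discussed below, the maximum-likelihood solution takes a familiar form: the probability assigned to each segment is just its relative frequency in the corpus. Here is a small Python sketch of that estimation step (the word list is an invented toy corpus, not the talk's data):

from collections import Counter

corpus = ["kat", "tak", "takt", "akt"]                      # toy segmented words (invented)
counts = Counter(seg for word in corpus for seg in word)    # observed frequencies
total = sum(counts.values())
mle_probs = {seg: n / total for seg, n in counts.items()}   # ML estimates = relative frequencies
assert abs(sum(mle_probs.values()) - 1.0) < 1e-9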

Digression on Minimum Description Length (MDL) analysis
Maximizing the probability of the data is not an entirely satisfactory goal: we also need to seek economy of description. Otherwise we risk overfitting the data. We can actually define a better quantity to optimize: this is the description length.

Description Length
The description length of the analysis A of a set of data D is the sum of two things:
– The length of the grammar in A (in "bits");
– The (base-2) logarithm of the probability assigned to the data D by analysis A, times -1 ("log probability of the data").
When the probability is high, the "log probability" is small; when the probability is low, the log probability gets large.
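In symbols (restating the definition just given, with the grammar length measured in bits):

\mathrm{DL}(A, D) \;=\; \underbrace{\lvert\, \mathrm{grammar\ of\ } A \,\rvert_{\mathrm{bits}}}_{\text{length of the grammar}} \;+\; \underbrace{\bigl(-\log_2 \mathrm{pr}_A(D)\bigr)}_{\text{log probability of the data}}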

MDL (continued)
If we aim to minimize the description length, that is, the length of the grammar (as in 1st-generation generative grammar) plus the log probability of the data, then we will seek the best overall grammatical account of the data.

Morphology
Much of my work over the last 8 years has been on applying this framework to the discovery of morphological structure. Today: phonology.

Assume structure? The standard argument for assuming structure in linguistics is to point out that there are empirical generalizations in the data that cannot be accounted for without assuming the existence of the structure.

– Probabilistic models are capable of modeling a great deal of information without assuming (much) structure, and
– They are also capable of measuring exactly how much information they capture, thanks to information theory.

Simple segmental representations
The "unigram" model for French (English, etc.) captures only information about segment frequencies. The probability of a word is the product of the probabilities of its segments. A better measure: the complexity of a word is its average log probability, i.e. complexity(w) = (1/|w|) Σ −log₂ pr(wᵢ), summing over the segments wᵢ of w.
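Here is a short Python sketch of the unigram measure (the probabilities are estimated from a tiny invented word list standing in for a real lexicon; this illustrates the computation, not the model actually fitted in the talk). Lower average log probability means a more expected, more "native-looking" word:

import math
from collections import Counter

corpus = ["kat", "tak", "takt", "akt", "tika"]              # toy stand-in for a lexicon
counts = Counter(seg for word in corpus for seg in word)
total = sum(counts.values())
prob = {seg: n / total for seg, n in counts.items()}        # unigram segment probabilities

def unigram_complexity(word):
    # average positive log probability, in bits per segment
    return sum(-math.log2(prob[seg]) for seg in word) / len(word)

# ranking a few (invented) words: the most "expected" come out first
for w in sorted(["takt", "kat", "tik"], key=unigram_complexity):
    print(w, round(unigram_complexity(w), 3))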

Let’s look at that graphically… Because log probabilities are much easier to visualize. And because the log probability of a whole word is (in this case) just the sum of the log probabilities of the individual phones.

Add (1st-order) conditional probabilities
The probability of a segment is conditioned on the preceding segment. Surprisingly, this is mathematically equivalent to adding something to the "unigram log probabilities" we just looked at: we add the "mutual information" of each successive pair of phonemes.
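The equivalence rests on a one-line identity of bigram models (stated here for completeness; it is standard algebra rather than a quotation from the slides):

\log_2 \mathrm{pr}(w_i \mid w_{i-1}) \;=\; \log_2 \mathrm{pr}(w_i) \;+\; \underbrace{\log_2 \frac{\mathrm{pr}(w_{i-1}\, w_i)}{\mathrm{pr}(w_{i-1})\,\mathrm{pr}(w_i)}}_{\text{mutual information of the pair}}

Summed over a word, the bigram log probability is therefore the unigram log probability plus the sum of the mutual informations of the adjacent pairs.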

Let’s look at that

Complexity = average log probability
Find the model that makes this equation work the best. Rank the words of a language by complexity:
– Words at the top are the "best";
– Words at the bottom are borrowings, onomatopoeia, etc.

Nativization of a word
Gasoil: [gazojl] or [gazọl]? Compare the average log probability (bigram model) of
– [gazojl]
– [gazọl]
This is a huge difference. Nativization decreases the average log probability of a word.
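A sketch of how such a comparison can be carried out (the training words, the smoothing floor, and the two candidate spellings are invented stand-ins; the talk's actual French bigram estimates and figures are not reproduced here):

import math
from collections import Counter

corpus = ["gazon", "galop", "zona", "lagon", "ojo"]         # invented toy lexicon
unigrams = Counter(seg for w in corpus for seg in w)
bigrams = Counter((w[i], w[i + 1]) for w in corpus for i in range(len(w) - 1))
N = sum(unigrams.values())

def bigram_complexity(word, floor=1e-6):
    # first segment scored by its unigram probability; later segments are
    # conditioned on the preceding segment, with a tiny floor for unseen events
    bits = -math.log2(unigrams.get(word[0], 0) / N or floor)
    for a, b in zip(word, word[1:]):
        p = bigrams[(a, b)] / unigrams[a] if unigrams.get(a) else floor
        bits += -math.log2(p or floor)
    return bits / len(word)

for form in ["gazojl", "gazol"]:                            # ASCII stand-ins for the two pronunciations
    print(form, round(bigram_complexity(form), 3))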

Phonotactics
Phonotactics include knowledge of 2nd-order conditional probabilities.

Part 2: Categories
So far we have made no assumptions about categories, except that there are "phonemes" of some sort in a language and that they can be counted. We have made no assumption about phonemes being sorted into categories.