A Bayesian approach to word segmentation: Theoretical and experimental results Sharon Goldwater Department of Linguistics Stanford University

Word segmentation One of the first problems infants must solve when learning language. Infants make use of many different cues.  Phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences. Statistics may provide initial bootstrapping.  Used very early (Thiessen & Saffran, 2003).  Language-independent.

Modeling statistical segmentation Previous work often focuses on how statistical information (e.g., transitional probabilities) can be used to segment speech. Bayesian approach asks what information should be used by a successful learner.  What statistics should be collected?  What assumptions (by the learner) constrain possible generalizations?

Outline 1. Computational model and theoretical results  What are the consequences of using different sorts of information for optimal word segmentation? (joint work with Tom Griffiths and Mark Johnson) 2. Modeling experimental data  Do humans behave optimally? (joint work with Mike Frank, Vikash Mansinghka, Tom Griffiths, and Josh Tenenbaum)

Statistical segmentation Work on statistical segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001). P(syl_i | syl_i-1) is often lower at word boundaries. What do TPs have to say about words? A word is a unit whose beginning predicts its end, but it does not predict other words. Or… A word is a unit whose beginning predicts its end, and it also predicts future words.
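For concreteness, transitional probability is standardly estimated from syllable bigram and unigram counts; a minimal LaTeX rendering of that standard definition (the notation is mine, not taken from the slides):

```latex
% Transitional probability between adjacent syllables (standard definition)
\[
  \mathrm{TP}(\sigma_i \mid \sigma_{i-1}) \;=\;
  P(\mathrm{syl}_i \mid \mathrm{syl}_{i-1}) \;=\;
  \frac{C(\sigma_{i-1}\,\sigma_i)}{C(\sigma_{i-1})}
\]
% where C(.) counts occurrences in the training stream;
% TP tends to dip at word boundaries.
```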

Interpretation of TPs Most previous work assumes words are statistically independent.  Experimental work: Saffran et al. (1996), many others.  Computational work: Brent (1999). What about words predicting other words? tupiro golabu bidaku padoti golabubidakugolabutupiropadotibidakupadotitupi…

Questions If a learner assumes that words are independent units, what is learned (from more realistic input)?  Unigram model: Generate each word independently. What if the learner assumes that words are units that help predict other units?  Bigram model: Generate each word conditioned on the previous word. Approach: use a Bayesian ideal observer model to examine the consequences of each assumption.
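As a sketch of the two assumptions (notation mine, not from the slides), the corresponding probability models factor a word sequence as:

```latex
% Unigram assumption: words are independent units
\[
  P(w_1 \dots w_n) \;=\; \prod_{i=1}^{n} P(w_i)
\]
% Bigram assumption: each word also helps predict the next
% (w_0 taken to be an utterance-start symbol)
\[
  P(w_1 \dots w_n) \;=\; \prod_{i=1}^{n} P(w_i \mid w_{i-1})
\]
```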

Bayesian learning The Bayesian learner seeks to identify an explanatory linguistic hypothesis that accounts for the observed data and conforms to prior expectations. The focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

Bayesian segmentation In the domain of segmentation, we have: Data: unsegmented corpus (transcriptions). Hypotheses: sequences of word tokens. The optimal solution is the segmentation with the highest posterior probability. The likelihood P(d|h) = 1 if concatenating the hypothesized words forms the corpus, 0 otherwise. The prior P(h) encodes the unigram or bigram assumption (also others).
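In equation form (a reconstruction of the slide's formula, using standard Bayesian notation):

```latex
% Posterior over segmentation hypotheses h given the unsegmented data d
\[
  P(h \mid d) \;\propto\; P(d \mid h)\, P(h)
\]
% Likelihood: P(d|h) = 1 if concatenating the hypothesized words yields the corpus, 0 otherwise.
% Prior: P(h) encodes the unigram or bigram assumption (or others).
```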

Brent (1999) Describes a Bayesian unigram model for segmentation.  Prior favors solutions with fewer words, shorter words. Problems with Brent’s system:  Learning algorithm is approximate (non-optimal).  Difficult to extend to incorporate bigram info.

A new unigram model (Dirichlet process) Assume word w_i is generated as follows: 1. Is w_i a novel lexical item? (Fewer word types = higher probability.)

A new unigram model (Dirichlet process) Assume word w_i is generated as follows: 2. If novel, generate its phonemic form x_1…x_m (shorter words = higher probability). If not, choose the lexical identity of w_i from previously occurring words (power law = higher probability).
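A hedged sketch of the two steps on the slides above in equation form, following the standard Dirichlet-process (Chinese-restaurant) predictive distribution. The symbols are my notation (α for the concentration parameter, n for the number of word tokens generated so far, n_ℓ for the count of lexical item ℓ, P_0 for the phoneme-level base distribution), and details such as the word-final stop probability are simplified:

```latex
% Step 1: is w_i a novel lexical item?
\[
  P(w_i \text{ is novel}) = \frac{\alpha}{n + \alpha}, \qquad
  P(w_i \text{ is not novel}) = \frac{n}{n + \alpha}
\]
% Step 2a: if novel, generate its phonemic form from the base distribution
%          (a product over phonemes, so shorter words get higher probability)
\[
  P_0(w_i = x_1 \dots x_m) \;\approx\; \prod_{j=1}^{m} P(x_j)
\]
% Step 2b: if not novel, reuse a previous word in proportion to its count
%          (a rich-get-richer scheme that yields power-law word frequencies)
\[
  P(w_i = \ell \mid \text{not novel}) = \frac{n_\ell}{n}
\]
```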

Unigram model: simulations Same corpus as Brent (Bernstein-Ratner, 1987): 9790 utterances of phonemically transcribed child-directed speech (19-23 months). Average utterance length: 3.4 words. Average word length: 2.9 phonemes. Example input: yuwanttusiD6bUk lUkD*z6b7wIThIzh&t &nd6dOgi yuwanttulUk&tDIs...

Example results

Comparison to previous results Proposed boundaries are more accurate than Brent's, but fewer proposals are made (boundary precision and recall, Brent vs. GGJ). Result: word tokens are less accurate. Token F-score: Brent .68, GGJ .54. Precision: #correct / #found [= hits / (hits + false alarms)]. Recall: #correct / #true [= hits / (hits + misses)]. F-score: the harmonic mean of precision and recall.
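The same evaluation measures in formula form (standard metrics, not specific to these slides):

```latex
\[
  \mathrm{Precision} = \frac{\text{hits}}{\text{hits} + \text{false alarms}}, \qquad
  \mathrm{Recall} = \frac{\text{hits}}{\text{hits} + \text{misses}}, \qquad
  F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]
```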

What happened? The model assumes (falsely) that words have the same probability regardless of context. Positing amalgams allows the model to capture word-to-word dependencies. P(D&t) = .024, P(D&t | WAts) = .46, P(D&t | tu) = .0019

What about other unigram models? Brent’s learning algorithm is insufficient to identify the optimal segmentation.  Our solution has higher probability under his model than his own solution does.  On randomly permuted corpus, our system achieves 96% accuracy; Brent gets 81%. Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Bigram model (hierarchical Dirichlet process) Assume word w_i is generated as follows: 1. Is (w_i-1, w_i) a novel bigram? 2. If novel, generate w_i using the unigram model (almost). If not, choose the lexical identity of w_i from words previously occurring after w_i-1.
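A simplified, hedged sketch of the resulting bigram predictive distribution, ignoring the table-counting bookkeeping of the full hierarchical Dirichlet process; β and the counts n are my notation, not the authors':

```latex
% Each word is drawn from a distribution specific to the previous word,
% which backs off to the shared unigram model P_1 when the bigram is novel.
\[
  P(w_i = w \mid w_{i-1} = w') \;\approx\;
  \frac{n_{(w', w)} + \beta\, P_1(w)}{n_{w'} + \beta}
\]
% n_{(w',w)}: count of the bigram (w', w) so far; n_{w'}: count of contexts w';
% P_1: the unigram (Dirichlet-process) model used when the bigram is novel.
```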

Example results

Quantitative evaluation Compared to the unigram model, more boundaries are proposed, with no loss in accuracy (boundary precision and recall, GGJ unigram vs. bigram). Accuracy is higher than previous models (token and type F-score, Brent unigram vs. GGJ bigram): GGJ (bigram) token F-score .77, type F-score .63.

Summary Two different assumptions about what defines a word are consistent with behavioral evidence. Different assumptions lead to different results.  Beginning of word predicts end of word: Optimal solution undersegments, finding common multi-word units.  Word also predicts next word: Segmentation is more accurate, adult-like.

Remaining questions Is unigram segmentation sufficient to start bootstrapping other cues (e.g., stress)? How prevalent are multi-word chunks in infant vocabulary? Are humans able to segment based on bigram statistics? Is there any evidence that human performance is consistent with Bayesian predictions?

Testing model predictions Goal: compare our model (and others) to human performance in a Saffran-style experiment. Problem: all models have near-perfect accuracy on experimental stimuli. Solution: compare changes in model performance relative to humans as task difficulty is varied. tupiro golabu bidaku padoti golabubidakugolabutupiropadotibidakupadotitupiro…

Experimental method Examine segmentation performance under different utterance-length conditions. Example lexicon: lagi dazu tigupi bavulu kabitudu kipavazi. Conditions varied in # wds/utt, # utts, and tot # wds.

Procedure Training: adult subjects listened to synthesized utterances in one length condition. No pauses between syllables within utterances. 500 ms pauses between utterances. Testing: 2AFC between words and part-word distractors. Lexicon: lagi dazu tigupi bavulu kabitudu kipavazi. Example training utterances: lagitigupibavulukabitudulagikipavazi dazukipavazibavululagitigupikabitudu kipavazitigupidazukabitudulagitigupi …

Human performance

Model comparison Evaluated six different models. Each model trained and tested on the same stimuli as humans. To simulate 2AFC, produce a score s(w) for each word in the choice pair and use the Luce choice rule (shown below). Compute the best linear fit of each model to human data, then calculate the correlation.
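The Luce choice rule referred to above, in its standard two-alternative form (the scores s(·) are whatever each model assigns to a test item):

```latex
% Probability of choosing word w1 over distractor w2 in the 2AFC test
\[
  P(\text{choose } w_1) \;=\; \frac{s(w_1)}{s(w_1) + s(w_2)}
\]
```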

Models used Three local-statistic models, all similar to transitional probabilities (TP): Segment at minima of P(syl_i | syl_i-1). s(w) = minimum TP in w. Swingley (2005): Builds a lexicon using local-statistic and frequency thresholds. s(w) = max threshold at which w appears in the lexicon. PARSER (Perruchet and Vinter, 1998): Incorporates principles of lexical competition and memory decay. s(w) = P(w) as defined by the model. GGJ (our Bayesian model): s(w) = P(w) as defined by the model.

Results: linear fit

Results: words vs. part-words

Summary Statistical segmentation is more difficult when utterances contain more words. Gradual decay in performance is predicted by Bayesian model, but not by others tested. Bayes predicts difficulty is primarily due to effects of competition.  In longer utterances, correct words are less probable because more other possibilities exist.  Local statistic approaches don’t model competition.

Continuing work Experiments with other task modifications will further test our model’s predictions. Vary the length of exposure to training stimulus:  Bayes: longer exposure => better performance.  TPs: no effect of exposure. Vary the number of lexical items:  Bayes: larger lexicon => worse performance.  TPs: larger lexicon => better performance.

Conclusions Computer simulations and experimental work suggest that: The unigram assumption causes ideal learners to undersegment fluent speech. Human word segmentation may approximate Bayesian ideal learning.

Bayesian segmentation
Input data:
whatsthat
thedoggie
wheresthedoggie
...
Some hypotheses:
whatsthat thedoggie wheresthedoggie
wh at sth at thedo ggie wh eres thedo ggie
whats that the doggie wheres the doggie
w h a t s t h a t t h e d o g g i e w h e r e s t h e d o g g i e

Search algorithm The model defines a distribution over hypotheses. We use Gibbs sampling to find a good hypothesis. The iterative procedure produces samples from the posterior distribution P(h|d) over hypotheses h. It is a batch algorithm (but online algorithms are possible, e.g., particle filtering).

Gibbs sampler 1. Consider two hypotheses differing by a single word boundary, e.g. whats.that the.doggie wheres.the.doggie vs. whats.that the.dog.gie wheres.the.doggie 2. Calculate the probabilities of the words that differ, given the current analysis of all other words. The model is exchangeable: the probability of a set of outcomes does not depend on their ordering. 3. Sample one of the two hypotheses according to the ratio of their probabilities.
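To make the boundary-sampling procedure concrete, here is a minimal toy sketch in Python of a unigram-DP Gibbs sampler over word boundaries. It is not the authors' implementation: the parameter values, the uniform phoneme base distribution, and the omission of utterance-boundary terms are all simplifying assumptions.

```python
import random
from collections import Counter
from math import prod

# Toy Gibbs sampler for unigram word segmentation: a simplified sketch of the
# boundary-sampling procedure described above, not the GGJ implementation.
ALPHA = 20.0          # Dirichlet-process concentration parameter (assumed value)
P_PHONEME = 1.0 / 26  # uniform base distribution over phoneme symbols (assumption)
P_STOP = 0.5          # geometric word-length parameter in the base distribution

def p0(word):
    """Base distribution: generate `word` phoneme by phoneme, then stop."""
    return prod(P_PHONEME for _ in word) * P_STOP * (1 - P_STOP) ** (len(word) - 1)

def word_prob(word, counts, total):
    """DP predictive probability of `word` given counts over all other words."""
    return (counts[word] + ALPHA * p0(word)) / (total + ALPHA)

def gibbs_pass(utterances, boundaries, counts):
    """One sweep: resample every potential boundary position in every utterance."""
    for u, bset in zip(utterances, boundaries):
        for pos in range(1, len(u)):
            left = max((b for b in bset if b < pos), default=0)
            right = min((b for b in bset if b > pos), default=len(u))
            whole, w1, w2 = u[left:right], u[left:pos], u[pos:right]
            # Remove the word(s) currently spanning this region from the counts.
            if pos in bset:
                counts[w1] -= 1
                counts[w2] -= 1
            else:
                counts[whole] -= 1
            total = sum(counts.values())
            # Probabilities of the two hypotheses given all other words
            # (exchangeability lets us treat these words as if generated last).
            p_join = word_prob(whole, counts, total)
            p_split = word_prob(w1, counts, total) * \
                (counts[w2] + (1 if w2 == w1 else 0) + ALPHA * p0(w2)) / (total + 1 + ALPHA)
            # Sample one of the two hypotheses in proportion to its probability.
            if random.random() < p_split / (p_split + p_join):
                bset.add(pos)
                counts[w1] += 1
                counts[w2] += 1
            else:
                bset.discard(pos)
                counts[whole] += 1

if __name__ == "__main__":
    utts = ["whatsthat", "thedoggie", "wheresthedoggie"]
    bounds = [set() for _ in utts]   # start with no internal word boundaries
    counts = Counter(utts)           # so each utterance initially counts as one word
    for _ in range(200):
        gibbs_pass(utts, bounds, counts)
    for u, b in zip(utts, bounds):
        cuts = [0] + sorted(b) + [len(u)]
        print(" ".join(u[i:j] for i, j in zip(cuts, cuts[1:])))
```

On a corpus this tiny the output only illustrates the mechanics; with more data (and, in practice, annealing) the sampler moves toward high-probability segmentations.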

Models used Transitional probabilities (TP): Segment at minima of P(syl_i | syl_i-1). s(w) = minimum TP in w. (Equivalently, use the product.) Smoothed transitional probabilities: Avoid zero counts by using add-λ smoothing. Mutual information (MI): Segment where MI between syllables is lowest. s(w) = minimum MI in w. (Equivalently, use the sum.)
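A small Python sketch of how the local-statistic scores above could be computed; the function and variable names are mine, and the add-λ smoothing is shown as in the second model:

```python
from collections import Counter

def make_tp(train_syllables, lam=0.0):
    """Estimate transitional probabilities P(syl_i | syl_i-1) from a syllable
    stream, with optional add-lambda smoothing to avoid zero counts."""
    bigrams = Counter(zip(train_syllables, train_syllables[1:]))
    contexts = Counter(train_syllables[:-1])
    n_types = len(set(train_syllables))
    def tp(prev, cur):
        return (bigrams[(prev, cur)] + lam) / (contexts[prev] + lam * n_types)
    return tp

def s(word_syllables, tp):
    """Score of a candidate word: the minimum TP across its internal transitions
    (single-syllable words have no internal transition, so default to 1)."""
    return min((tp(a, b) for a, b in zip(word_syllables, word_syllables[1:])),
               default=1.0)

# Example: a syllabified training stream and two test items
stream = ["la", "gi", "ti", "gu", "pi", "ba", "vu", "lu", "la", "gi"]
tp = make_tp(stream, lam=0.1)
print(s(["la", "gi"], tp), s(["gi", "ti"], tp))
```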

Models used Swingley (2005)  Builds a lexicon, including syllable sequences above some threshold for both MI and n-gram frequency.  s(w) = max threshold at which w appears in lexicon. PARSER (Perruchet and Vinter, 1998)  Lexicon-based model incorporating principles of lexical competition and memory decay.  s(w) = P(w) as defined by model. GGJ (our Bayesian model)  s(w) = P(w) as defined by model.

Results: linear fit

Continuing work Comparisons to human data and other models: Which words/categories are most robust?  Compare to Frequent Frames predictions (Mintz, 2003).  Compare to corpus data from children’s production. Modeling cue combination: Integrate morphology into syntactic model. Model experimental work on cue combination in category learning.