Distributional Cues to Word Boundaries: Context Is Important
Sharon Goldwater, Stanford University
Tom Griffiths, UC Berkeley
Mark Johnson, Microsoft Research / Brown University

Word segmentation
- One of the first problems infants must solve when learning language.
- Infants make use of many different cues: phonotactics, allophonic variation, metrical (stress) patterns, effects of coarticulation, and statistical regularities in syllable sequences.
- Statistics may provide initial bootstrapping:
  - Used very early (Thiessen & Saffran, 2003).
  - Language-independent.

Distributional segmentation
- Work on distributional segmentation often discusses transitional probabilities (Saffran et al., 1996; Aslin et al., 1998; Johnson & Jusczyk, 2001).
- What do TPs have to say about words? Either:
  1. A word is a unit whose beginning predicts its end, but it does not predict other words; or
  2. A word is a unit whose beginning predicts its end, and it also predicts future words.

Interpretation of TPs
- Most previous work assumes words are statistically independent.
  - Experimental work: Saffran et al. (1996), many others.
  - Computational work: Brent (1999).
- What about words predicting other words?
- Example artificial language (four words: tupiro, golabu, bidaku, padoti), presented as a continuous stream:
  golabubidakugolabutupiropadotibidakupadotitupirobidakugolabutupiropadotibidakutupiro…
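To make the notion concrete, here is a minimal sketch (not part of the original slides) of how forward transitional probabilities could be computed over the syllable stream above; the two-character CV syllabification and the Python code are illustrative assumptions.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Forward TP(y | x) = count(x followed by y) / count(x)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

# The continuous stream above, with CV syllables of two characters each.
stream = ("golabubidakugolabutupiropadotibidakupadoti"
          "tupirobidakugolabutupiropadotibidakutupiro")
syllables = [stream[i:i + 2] for i in range(0, len(stream), 2)]
tps = transitional_probabilities(syllables)

# Within-word TPs are high (go -> la is 1.0 here); across-word TPs are lower.
print(tps[("go", "la")], tps.get(("bu", "bi"), 0.0))
```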

Questions
- If a learner assumes that words are independent units, what is learned (from more realistic input)?
- What if the learner assumes that words are units that help predict other units?
- Approach: use a Bayesian "ideal observer" model to examine the consequences of making these different assumptions. What kinds of words are learned?

Two kinds of models
- Unigram model: words are independent.
  - Generate a sentence by generating each word independently.
[Figure: a single distribution over words (e.g. look .1, that .2, at .4, …) generates the unsegmented utterance "lookatthat".]
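As a small illustration (not from the original slides), the sketch below generates an utterance under a unigram model: the look/that/at probabilities follow the slide's example, and the extra word "the" is an assumption added so the weights sum to one.

```python
import random

rng = random.Random(0)

# Toy unigram lexicon: look/that/at values follow the slide's example;
# "the" (0.3) is an illustrative addition so the probabilities sum to 1.
unigram = {"look": 0.1, "that": 0.2, "at": 0.4, "the": 0.3}

def generate_unigram_utterance(n_words):
    """Sample each word independently of its neighbours, then concatenate."""
    words = rng.choices(list(unigram), weights=unigram.values(), k=n_words)
    return "".join(words), words  # unsegmented surface form, true segmentation

surface, words = generate_unigram_utterance(3)
print(surface, words)  # the learner only ever observes the surface form
```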

Two kinds of models
- Bigram model: words predict other words.
  - Generate a sentence by generating each word, conditioned on the previous word.
[Figure: a set of conditional distributions P(word | previous word), e.g. look .1, that .3, at .5, …, generates the unsegmented utterance "lookatthat".]
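The corresponding bigram sketch, with each word drawn from a distribution conditioned on the previous word; the conditional probabilities and the "<s>" start symbol are illustrative assumptions.

```python
import random

rng = random.Random(1)

# Toy conditional distributions P(word | previous word); values are illustrative.
bigram = {
    "<s>":  {"look": 0.5, "that": 0.3, "at": 0.2},
    "look": {"at": 0.6, "that": 0.4},
    "at":   {"that": 0.7, "look": 0.3},
    "that": {"look": 0.5, "at": 0.5},
}

def generate_bigram_utterance(n_words):
    """Sample each word conditioned on the word before it, then concatenate."""
    words, prev = [], "<s>"
    for _ in range(n_words):
        dist = bigram[prev]
        prev = rng.choices(list(dist), weights=dist.values(), k=1)[0]
        words.append(prev)
    return "".join(words), words

surface, words = generate_bigram_utterance(3)
print(surface, words)  # e.g. a surface form like "lookatthat" plus its segmentation
```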

Bayesian learning
- The Bayesian learner seeks to identify an explanatory linguistic hypothesis that:
  - accounts for the observed data.
  - conforms to prior expectations.
- Focus is on the goal of computation, not the procedure (algorithm) used to achieve the goal.

Bayesian segmentation
- In the domain of segmentation, we have:
  - Data: unsegmented corpus (transcriptions).
  - Hypotheses: sequences of word tokens.
- The hypothesis score is P(h | d) ∝ P(d | h) P(h), where the likelihood P(d | h) = 1 if concatenating the hypothesized words forms the corpus and 0 otherwise, and the prior P(h) encodes the unigram or bigram assumption (among others).
- The optimal solution is therefore the segmentation, among those consistent with the corpus, with the highest prior probability.
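Schematically, the scoring rule amounts to the sketch below; the toy prior is an assumption standing in for the unigram or bigram priors described on the following slides.

```python
def likelihood(hypothesis, corpus_line):
    """P(d | h): 1 if the hypothesized words concatenate to the observed string, else 0."""
    return 1.0 if "".join(hypothesis) == corpus_line else 0.0

def posterior_score(hypothesis, corpus_line, prior):
    """Unnormalized posterior: P(h | d) is proportional to P(d | h) * P(h)."""
    return likelihood(hypothesis, corpus_line) * prior(hypothesis)

# Toy prior (illustrative only): favours hypotheses with fewer and shorter words.
def toy_prior(words):
    return 0.5 ** (len(words) + sum(len(w) for w in words))

print(posterior_score(["look", "at", "that"], "lookatthat", toy_prior))
print(posterior_score(["look", "at", "the"], "lookatthat", toy_prior))  # inconsistent: scores 0
```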

Brent (1999)
- Describes a Bayesian unigram model for segmentation.
  - Prior favors solutions with fewer words, shorter words.
- Problems with Brent's system:
  - Learning algorithm is approximate (non-optimal).
  - Difficult to extend to incorporate bigram info.

A new unigram model
- Assume word w_i is generated as follows:
  1. Is w_i a novel lexical item? (Fewer word types → higher probability.)

A new unigram model
- Assume word w_i is generated as follows:
  2. If novel, generate its phonemic form x_1…x_m. (Shorter words → higher probability.)
     If not, choose the lexical identity of w_i from previously occurring words. (Power-law reuse → higher probability for frequent words.)
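The transcript does not preserve this slide's formulas, so the sketch below is only one way to realize the description: a Chinese-restaurant-process-style generator in which previously used words are reused in proportion to their counts and novel words get short phonemic forms. The concentration parameter ALPHA, the phoneme inventory, and the geometric length distribution are assumptions.

```python
import random
from collections import Counter

ALPHA = 20.0                          # concentration parameter (illustrative value)
PHONEMES = list("aeioubdgklmnprst")   # toy phoneme inventory (an assumption)
P_END = 0.5                           # chance of ending the word after each phoneme

rng = random.Random(0)

def generate_word(counts, total):
    """With prob. ALPHA / (total + ALPHA), create a novel word (short forms favoured);
    otherwise reuse a previous word in proportion to its count (rich-get-richer)."""
    if rng.random() < ALPHA / (total + ALPHA):
        word = rng.choice(PHONEMES)
        while rng.random() > P_END:
            word += rng.choice(PHONEMES)
        return word
    return rng.choices(list(counts), weights=counts.values(), k=1)[0]

counts, total = Counter(), 0
for _ in range(200):
    w = generate_word(counts, total)
    counts[w] += 1
    total += 1
print(counts.most_common(5))   # a handful of short words account for many tokens
```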

Advantages of our model
- Brent: unigram model only, with an approximate (non-optimal) learning algorithm.
- GGJ: both unigram and bigram models, with a learning algorithm that targets the optimal solution.

Unigram model: simulations
- Same corpus as Brent:
  - 9790 utterances of phonemically transcribed child-directed speech (19-23 months).
  - Average utterance length: 3.4 words.
  - Average word length: 2.9 phonemes.
- Example input:
  yuwanttusiD6bUk
  lUkD*z6b7wIThIzh&t
  &nd6dOgi
  yuwanttulUk&tDIs
  ...

Example results

Comparison to previous results
- Proposed boundaries are more accurate than Brent's, but fewer proposals are made. Result: word tokens are less accurate (token F-score: Brent .68, GGJ .54).
- Precision: #correct / #found. Recall: #correct / #true. F-score: the harmonic mean of precision and recall.
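For concreteness, here is a sketch (with invented utterances) of how boundary precision, recall, and F-score can be computed; token scoring works analogously on whole words rather than boundary positions.

```python
def boundary_positions(words):
    """Word-internal boundary offsets (utterance edges are not counted)."""
    positions, offset = set(), 0
    for w in words[:-1]:
        offset += len(w)
        positions.add(offset)
    return positions

def precision_recall_f(found, true):
    correct = len(found & true)
    p = correct / len(found) if found else 0.0
    r = correct / len(true) if true else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean of p and r
    return p, r, f

true_seg = ["you", "want", "to", "see", "the", "book"]
hyp_seg  = ["you", "want", "to", "seethe", "book"]   # one boundary is missed
print(precision_recall_f(boundary_positions(hyp_seg), boundary_positions(true_seg)))
# High boundary precision (all proposed boundaries are real) but lower recall,
# mirroring the pattern described above.
```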

What happened?
- The model assumes (falsely) that words have the same probability regardless of context.
- Positing amalgams allows the model to capture word-to-word dependencies.
  P(D&t) = .024
  P(D&t | WAts) = .46
  P(D&t | tu) = .0019
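These conditional probabilities can be estimated from a segmented corpus as in the sketch below; the toy corpus and orthographic forms ("that", "whats", "to", standing in for D&t, WAts, tu) are invented for illustration.

```python
from collections import Counter

# Invented segmented utterances standing in for child-directed speech.
corpus = [
    ["whats", "that"],
    ["look", "at", "that"],
    ["you", "want", "to", "see", "the", "book"],
    ["whats", "that"],
    ["to", "the", "park"],
]

tokens = [w for utt in corpus for w in utt]
unigrams = Counter(tokens)
bigrams = Counter((utt[i], utt[i + 1]) for utt in corpus for i in range(len(utt) - 1))

p_that = unigrams["that"] / len(tokens)                          # marginal P(that)
p_that_after_whats = bigrams[("whats", "that")] / unigrams["whats"]
p_that_after_to = bigrams[("to", "that")] / unigrams["to"]
print(p_that, p_that_after_whats, p_that_after_to)
# As on the slide, P(that | whats) is far higher than P(that), which in turn is
# far higher than P(that | to) in real child-directed data.
```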

What about other unigram models?
- Brent's learning algorithm is insufficient to identify the optimal segmentation.
  - Our solution has higher probability under his model than his own solution does.
  - On a randomly permuted corpus, our system achieves 96% accuracy; Brent gets 81%.
- Formal analysis shows undersegmentation is the optimal solution for any (reasonable) unigram model.

Bigram model
- Assume word w_i is generated as follows:
  1. Is (w_{i-1}, w_i) a novel bigram?
  2. If novel, generate w_i using the unigram model. If not, choose the lexical identity of w_i from words previously occurring after w_{i-1}.
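A hedged sketch of this generative story: decide whether the bigram is novel and, if so, fall back to a unigram-style generator; otherwise reuse a word already seen after w_{i-1}, in proportion to how often it followed. The parameter BETA, the phoneme inventory, and the simplified fallback (which invents a fresh form rather than drawing from a shared lexicon, as the full hierarchical model does) are assumptions.

```python
import random
from collections import Counter, defaultdict

BETA = 10.0                           # bigram concentration parameter (illustrative)
PHONEMES = list("aeioubdgklmnprst")   # toy phoneme inventory (an assumption)
rng = random.Random(0)

def novel_word():
    """Simplified unigram fallback: invent a new short phonemic form.
    (In the full model this fallback is itself a CRP over a shared lexicon.)"""
    word = rng.choice(PHONEMES)
    while rng.random() > 0.5:
        word += rng.choice(PHONEMES)
    return word

followers = defaultdict(Counter)      # followers[prev][w] = count of bigram (prev, w)

def generate_next(prev):
    """With prob. BETA / (n_prev + BETA) the bigram (prev, w) is novel and w comes
    from the fallback; otherwise w is reused from words already seen after prev."""
    n_prev = sum(followers[prev].values())
    if rng.random() < BETA / (n_prev + BETA):
        w = novel_word()
    else:
        counts = followers[prev]
        w = rng.choices(list(counts), weights=counts.values(), k=1)[0]
    followers[prev][w] += 1
    return w

prev, utterance = "<s>", []
for _ in range(10):
    prev = generate_next(prev)
    utterance.append(prev)
print(utterance)
```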

Example results

Quantitative evaluation
- Compared to the unigram model, more boundaries are proposed, with no loss in boundary precision or recall.
- Accuracy is higher than previous models: against Brent (unigram), GGJ (bigram) reaches a token F-score of .77 and a type F-score of .63.

Conclusion
- Different assumptions about what defines a word lead to different segmentations.
  - Beginning of word predicts end of word: optimal solution undersegments, finding common multi-word units.
  - Word also predicts next word: segmentation is more accurate, adult-like.
- It is important to consider how transitional probabilities and other statistics are used.

Constraints on learning
- Algorithms can impose implicit constraints.
  - Implication: the learning process prevents the learner from identifying the best solutions.
  - The specifics of the algorithm are critical, but their effects are hard to determine.
- The prior imposes explicit constraints.
  - It states general expectations about the nature of language.
  - It assumes humans are good at learning.

Algorithmic constraints
- Venkataraman (2001) and Batchelder (2002) describe unigram model-based approaches to segmentation, with no prior.
  - Venkataraman's algorithm penalizes novel words.
  - Batchelder's algorithm penalizes long words.
- Without these algorithmic constraints, the models would memorize every utterance whole (insert no word boundaries).

Remaining questions
- Are multi-word chunks sufficient as an initial bootstrapping step in humans? (cf. Swingley, 2005)
- Do children go through a stage with many chunks like these? (cf. MacWhinney, ??)
- Are humans able to segment based on bigram statistics?