Published by Καλλιόπη Κουβέλης. Modified over 5 years ago.
1
Modeling infant word segmentation: Another example of discovery fueled by CHILDES
Alejandrina Cristia Laboratoire de Sciences Cognitives et Psycholinguistique @Language Emergence: Competition, Usage, and Analyses,
2
No overt & unambiguous word/morpheme boundaries in the input…
The segmentation problem: there are no silences between words when we speak, just as there would be no spaces between words in unsegmented writing. This means that children, in order to learn the phonological forms of words (which is crucial for learning word meanings), have to be able to find word boundaries.
“no silences” Kuhl 2004
3
… yet by the end of the first year, infants know some words/morphemes
‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
Tincoff & Jusczyk 2012; Bergelson & Swingley 2012; Ngon et al. 2014
4
How to study segmentability?
mommy talking … cute … something shiny go by? Let’s just get to the facts.
5
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
6
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
7
Input representation
Acoustic
+ realistic… provided representations match babies’
- few appropriate corpora (natural discourse & good quality audio)
- only one (reproducible) algorithm
Symbolic (‘Phonological text’)
+ lots of corpora can be used
+ lots of algorithms proposed
+ algorithms represent a wide range of strategies
- assumes babies represent input abstractly, with zero errors
8
Example
Input: *MOT: look at the doggie
Phonologize: lUk At D2 dOgi
Remove word boundaries & unitize: l U k A t D 2 d O g i
Segment with some algorithm: lU kAt D2 dO gi
Evaluate:
Precision = 1 of the 5 words found was a word in the input = .2
Recall = 1 of the 4 words in the input was recovered = .25
Token F-score = 2 * (Precision * Recall) / (Precision + Recall)
Note: one can also unitize at the syllable level:
lUk At D2 dO gi (input)
lUk At D2 dO gi (output)
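The evaluation step on this slide can be sketched in a few lines of Python. This is only an illustration, not the WordSeg implementation: the function names `token_spans` and `token_scores` are made up here, and a token is assumed to count as correct only when both of its edges match the input segmentation. Utterances are assumed to be non-empty.

```python
def token_spans(segmented):
    """Map a space-segmented utterance to the set of (start, end) spans of its tokens.

    Positions index the utterance with spaces removed, so two segmentations
    of the same utterance can be compared directly.
    """
    spans, start = set(), 0
    for token in segmented.split():
        end = start + len(token)
        spans.add((start, end))
        start = end
    return spans


def token_scores(gold, predicted):
    """Return token precision, recall and F-score for one utterance."""
    gold_spans = token_spans(gold)
    predicted_spans = token_spans(predicted)
    hits = len(gold_spans & predicted_spans)  # tokens with both edges correct
    precision = hits / len(predicted_spans)
    recall = hits / len(gold_spans)
    if hits == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# The slide's example: only "D2" is recovered with both edges right.
precision, recall, f_score = token_scores("lUk At D2 dOgi", "lU kAt D2 dO gi")
print(precision, recall, round(f_score, 3))
```

With the slide's values (precision .2, recall .25), the token F-score works out to 2 * .05 / .45, or about .22.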
9
Bernard et al. 2019 Beh Res Meth
Example algorithms
1. Baseline: simplest strategies
Every sentence is a word (SentBase)
Every syllable is a word (SyllBase)
Lignos 2012
2. Sub-lexical: goal is to “cut” using local cues
Transitional Probabilities (TP) x Absolute/Relative threshold: TP_abs, TP_rel
Diphone-Based Segmentation (DiBS)
Daland ; Saksida
3. Lexical: goal is to learn a set of “minimal recombinable units”
Adaptor Grammar (AG)
Phonotactics from Utterances Determine Distributional Lexical Elements (Puddle)
Johnson ; Monaghan
Package: wordseg.readthedocs.io
Preprint: Bernard et al Beh Res Meth
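To make the baseline and sub-lexical strategies concrete, here is a minimal Python sketch. It does not use the WordSeg package and is not its implementation. All function names are hypothetical, the toy corpus is invented, the absolute threshold is assumed to be the mean TP over pair types, and the relative threshold is assumed to cut at local TP minima, with a missing neighbour at an utterance edge treated as infinitely high. Utterances are assumed to be non-empty lists of units.

```python
from collections import Counter


def sent_base(utterances):
    """SentBase: treat every utterance as a single word (no cuts)."""
    return ["".join(units) for units in utterances]


def syll_base(utterances):
    """SyllBase: cut after every unit (units are assumed to be syllables)."""
    return [" ".join(units) for units in utterances]


def forward_tps(utterances):
    """Estimate TP(a -> b) = count(a b) / count(a followed by any unit)."""
    first_counts, pair_counts = Counter(), Counter()
    for units in utterances:
        for pair in zip(units, units[1:]):
            pair_counts[pair] += 1
            first_counts[pair[0]] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment_tp(utterances, threshold="relative"):
    """Cut between two units where their TP is low.

    "absolute": cut when the TP is below the mean TP over all pair types.
    "relative": cut when the TP is lower than both neighbouring TPs
    (assumption: a missing neighbour at an utterance edge counts as infinite).
    """
    tps = forward_tps(utterances)
    mean_tp = sum(tps.values()) / len(tps) if tps else 0.0
    segmented = []
    for units in utterances:
        seq = [tps[pair] for pair in zip(units, units[1:])]
        words, current = [], [units[0]]
        for i, tp in enumerate(seq):
            if threshold == "absolute":
                cut = tp < mean_tp
            else:
                left = seq[i - 1] if i > 0 else float("inf")
                right = seq[i + 1] if i + 1 < len(seq) else float("inf")
                cut = tp < left and tp < right
            if cut:
                words.append("".join(current))
                current = []
            current.append(units[i + 1])
        words.append("".join(current))
        segmented.append(" ".join(words))
    return segmented


# Hypothetical toy corpus, already unitized into syllables.
corpus = [["ba", "by", "go"], ["ba", "by", "up"], ["go", "up"]]
print(sent_base(corpus))
print(syll_base(corpus))
print(segment_tp(corpus, threshold="absolute"))
print(segment_tp(corpus, threshold="relative"))
```

The two baselines bracket the space of possible cuts (none versus all), while the TP variants only cut where the statistics of adjacent units suggest a boundary. The lexical algorithms (AG, Puddle) are not sketched here because they build and reuse a lexicon rather than relying on local cues.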
10
The process in WordSeg
Bernard et al. 2019 Beh Res Meth
11
Sample results: precision, recall, & F-score are correlated
Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES
12
Sample results: Effects of algorithm and input representation
Naima, in Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES
13
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
14
Why look at register? In child-directed speech, probably…
More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
*MOT: Attends!
*MOT: Ouaistuvastemettreausoleilpourtesecherlescheveux!
15
Why look at register? In child-directed speech, probably…
More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
Utterances are more repetitious (+? lexical models)
*MOT: coucoucoucousitufaisaisdespetitssourirestoi.
*MOT: tumefaisdespetitssouriresXXXcoucoumongrand.
*MOT: coucoutumefaisdessouriresoupas.
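The three properties listed on these slides can be measured directly on a transcribed corpus. The sketch below is one possible way to do so, not a method taken from the talk: `register_profile` is a hypothetical helper, the utterances are invented, and type/token ratio is used as a rough stand-in for repetitiousness (lower means more repetition).

```python
def register_profile(utterances):
    """Summarize a list of space-segmented, non-empty utterances."""
    token_lists = [utterance.split() for utterance in utterances]
    tokens = [token for token_list in token_lists for token in token_list]
    n_utterances = len(token_lists)
    return {
        # share of utterances made up of exactly one word
        "prop_single_word": sum(len(t) == 1 for t in token_lists) / n_utterances,
        # mean utterance length in words
        "mean_length": len(tokens) / n_utterances,
        # distinct word types divided by word tokens
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


# Invented examples, for illustration only.
cds = ["look", "look at the doggie", "the doggie", "doggie"]
ads = ["i think the dog needs to go out before dinner tonight"]
print(register_profile(cds))
print(register_profile(ads))
```

Type/token ratio depends heavily on corpus size, so a comparison across registers would only be meaningful on samples matched in number of tokens.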
16
(Ask me about crosslinguistic extensions if curious!)
French: LENA-Lyon corpus (LeNormand et al. HomeBank). Collected with child-worn device worn whole day; adult-directed speech is among caregivers
English: Winnipeg corpus. Collected with child-worn device worn whole day; adult-directed speech is among caregivers
Japanese: Riken corpus. Collected in the lab; adult-directed speech is with experimenter
Bogdan Ludusan, Georgia Loukatou
17
French “wild” ADS on Le Normand, Canault, & Van Thai’s LENA-Lyon corpus
Loukatou Proc Cog Sci
18
CDS-ADS: Conclusions
Overall trend for better performance for child- than adult-directed speech
But: reversed for some algorithms
Effect of register < 15% (in the best controlled cases, 2%)
19
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
20
Why study word segmentation in a bilingual setting?
‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
‘pié’ ‘mamá’ ‘bebé’ …
Bilinguals need to:
Learn words, like monolinguals do, but in two languages
Overall less input in each language
Hoff
Fibla & Cristia (submitted very soon, I hope)
21
Questions & predictions
Are segmentation strategies equally successful when applied to bilingual and monolingual corpora?
→ Measure the performance of previously studied segmentation algorithms in a controlled monolingual versus bilingual corpus.
Possible outcomes:
The confusion hypothesis: variable and inconsistent input → Poorer performance for the bilingual than for the monolingual
The resistant hypothesis: (if switching only at utterance edges) local statistical and lexical cues are still reliable → Similar performance for the bilingual and the monolingual
Note on the resistant hypothesis: our algorithms use transitional information to segment, and we only mix languages at sentence boundaries, so the segmentation strategies could still be robust.
Fibla & Cristia (submitted very soon, I hope)
22
Creating bilingual corpora
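The content of this slide did not survive extraction. As a rough illustration only, the sketch below builds a bilingual corpus that is consistent with the statement elsewhere in the talk that languages are mixed only at sentence boundaries. Everything else is an assumption: the function name `mix_corpora`, the `block_size` parameter, the strict alternation of blocks, and the choice to take an equal number of utterances from each language so that the mixed corpus is no larger than the smaller monolingual one.

```python
def mix_corpora(corpus_a, corpus_b, block_size=1):
    """Alternate blocks of whole utterances from two monolingual corpora.

    Switches happen only between utterances, never inside one.
    """
    # Utterances taken from each language (assumption: equal shares).
    n_per_language = min(len(corpus_a), len(corpus_b)) // 2
    part_a = corpus_a[:n_per_language]
    part_b = corpus_b[:n_per_language]
    mixed = []
    for start in range(0, n_per_language, block_size):
        mixed.extend(part_a[start:start + block_size])
        mixed.extend(part_b[start:start + block_size])
    return mixed


# Invented utterances, for illustration only.
english = ["look at the doggie", "where is the ball", "all done", "go to bed"]
spanish = ["mira el perro", "dónde está la pelota", "ya está", "a dormir"]
print(mix_corpora(english, spanish, block_size=1))
print(mix_corpora(english, spanish, block_size=2))
```

Varying `block_size` changes how often the language switches while keeping the total amount of input in each language fixed.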
24
We had predicted that, if there were a bilingual disadvantage, scores for the bilingual corpus would fall below those of both matching monolingual corpora. In fact, this almost never happens: each bilingual corpus generally yields scores that lie in between those of the corresponding monolingual corpora. There are only three exceptions to this general pattern. While DiBS for Catalan-Spanish followed the general pattern, its scores for the English-Spanish bilingual corpus were significantly lower than those for both the English and the Spanish monolingual corpora. In TP-rel, the scores for the two bilingual corpora overlapped with that for Spanish, and all three yielded lower Precision than Catalan and English. Nonetheless, the scores for DiBS and TP-rel vary little, so a different measure of noise may reveal that even these differences are non-significant. Moreover, these 3 cases are exceptional: the other 11 of the 14 combinations (i.e., 7 algorithms times 2 bilingual corpora) fit the generalization that the bilingual score lies in between those found for the matching monolingual corpora.
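The comparison described here amounts to locating each bilingual score relative to its two matching monolingual scores. A minimal sketch follows, with a hypothetical function name and invented scores. It ignores noise and significance, which the analysis above does take into account.

```python
def classify_bilingual_score(bilingual, mono_a, mono_b):
    """Locate a bilingual score relative to its two matching monolingual scores."""
    low, high = sorted((mono_a, mono_b))
    if bilingual < low:
        return "below both"   # the pattern a bilingual disadvantage would predict
    if bilingual > high:
        return "above both"
    return "in between"


# Invented F-scores, for illustration only.
print(classify_bilingual_score(0.55, mono_a=0.50, mono_b=0.60))
print(classify_bilingual_score(0.45, mono_a=0.50, mono_b=0.60))
```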
25
Three cases of bilingual < monolingual
26
Three cases of bilingual < monolingual
11 cases of bilingual ‘in between’ monolingual
27
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
28
Effects of algorithm and input representation
Size of algorithm x level effect = 40-60%?
Cristia Open Mind
29
Effect of register
Size of register effect < 10%?
on LENA-Lyon corpus
Loukatou Proc Cog Sci
30
Effect of bilingualism
Size of bilingualism effect ~ 0%?
Fibla & Cristia (submitted very soon, I hope)
31
Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences as a function of language properties
… child-directed versus adult-directed register (in Japanese, English, & French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies
32
What may babies be doing? Using CDI results & frequency effects
Larsen Interspeech & in prep
33
What may babies be doing? Using CDI results & frequency effects
Coefficient of determination R² = .1
Larsen Interspeech & in prep
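For readers unfamiliar with the measure, R² = .1 means that about 10% of the variance in the outcome is accounted for by the predictor. The sketch below shows how R² is computed for a simple linear regression. Which variables were actually regressed is not stated on the slide; the pairing of word frequency with the proportion of infants reported to know a word is an assumption based on the slide title, and the numbers are invented.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line predicting y from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # residual and total sums of squares
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Invented values: log frequency of four words in a segmented corpus,
# and the proportion of infants reported to know each word.
log_frequency = [1.0, 2.0, 3.0, 4.0]
proportion_known = [0.1, 0.4, 0.2, 0.5]
print(r_squared(log_frequency, proportion_known))
```

For a single predictor, this value equals the squared Pearson correlation between the two variables.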
37
phoneme-based models
38
syllable-based models
phoneme-based models
39
Cut only at utterance edges frequency of words in isolation
40
To be continued…
41
Thanks to...
Families who agree to be recorded & for their data to be shared
Researchers who record them and share on TalkBank
TalkBank ~ Brian MacWhinney
& you!
42
Japanese “lab” ADS on Reiko Mazuka’s RIKEN corpus
Much of this is in Ludusan et al ACL (now working on journal paper with more material)
43
English “wild” ADS on Melanie Soderstrom’s Winnipeg corpus
Cristia Open Mind