Modeling infant word segmentation: Another example of discovery fueled by CHILDES Alejandrina Cristia Laboratoire de Sciences Cognitives et Psycholinguistique.

Alejandrina Cristia, Laboratoire de Sciences Cognitives et Psycholinguistique
@ Language Emergence: Competition, Usage, and Analyses

No overt & unambiguous word/morpheme boundaries in the input…
The segmentation problem: there are no silences between words when we speak, unlike written language, which puts spaces between words. This means that children, in order to learn the phonological forms of words (which is crucial for learning word meanings), have to be able to find word boundaries. (“no silences”: Kuhl 2004)

… yet by the end of the first year, infants know some words/morphemes
‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
Tincoff & Jusczyk 2012; Bergelson & Swingley 2012; Ngon et al. 2014

How to study segmentability?
mommy talking … cute … something shiny go by? Let’s just get to the facts.

Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for
… child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

Input representation
Acoustic
+ realistic… … provided representations match babies’
- few appropriate corpora (natural discourse & good quality audio)
- only one (reproducible) algorithm
Symbolic (‘Phonological text’)
+ lots of corpora can be used
+ lots of algorithms proposed
+ algorithms represent a wide range of strategies
- assumes babies represent input abstractly, with zero errors

Example
*MOT: look at the doggie
Phonologize: lUk At D2 dOgi
Remove word boundaries & unitize: l U k A t D 2 d O g i
Segment with some algorithm: lU kAt D2 dO gi
Evaluate:
Precision = 1 of the 5 words found was a word in the input = .2
Recall = 1 of the 4 words in the input was recovered = .25
Token F-score = 2 * (Precision * Recall) / (Precision + Recall)
Note: one can also unitize at the syllable level: lUk At D2 dO gi (input), lUk At D2 dO gi (output)
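The evaluation in the example above can be reproduced with a short sketch. It assumes that a token counts as correct only when both of its edges match the gold segmentation; the function names are hypothetical and this is not the WordSeg implementation.

```python
def token_spans(words):
    """Return the set of (start, end) character spans covered by each word."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_scores(gold, predicted):
    """Token precision, recall and F-score for one space-separated utterance.

    Assumption: a predicted token is a hit only if both its left and right
    boundaries coincide with those of a gold token.
    """
    gold_words = gold.split()
    pred_words = predicted.split()
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("gold and predicted must contain the same units")
    gold_spans = token_spans(gold_words)
    pred_spans = token_spans(pred_words)
    hits = len(gold_spans & pred_spans)
    precision = hits / len(pred_spans) if pred_spans else 0.0
    recall = hits / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * (precision * recall) / (precision + recall)
    return precision, recall, f_score


# The example from the slide: only "D2" is recovered with both edges correct.
precision, recall, f_score = token_scores("lUk At D2 dOgi", "lU kAt D2 dO gi")
print(precision, recall, round(f_score, 3))
```

With the slide's example, one of five proposed tokens and one of four gold tokens match, which gives the precision of .2 and recall of .25 stated above.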

Bernard et al. 2019 Beh Res Meth
Example algorithms
1. Baseline: simplest strategies
Every sentence is a word (SentBase)
Every syllable is a word (SyllBase)
Lignos 2012
2. Sub-lexical: goal is to “cut” using local cues
Transitional Probabilities (TP) x Absolute/Relative threshold: TP_abs, TP_rel
Diphone-Based Segmentation (DiBS)
Daland; Saksida
3. Lexical: goal is to learn a set of “minimal recombinable units”
Adaptor Grammar (AG)
Phonotactics from Utterances Determine Distributional Lexical Elements (Puddle)
Johnson; Monaghan
Package: wordseg.readthedocs.io
Preprint: Bernard et al Beh Res Meth
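To make the sub-lexical idea of “cutting” on local cues concrete, here is a minimal sketch of segmentation by forward transitional probabilities. It is not the WordSeg implementation. The threshold definitions are assumptions: the absolute variant cuts wherever the TP falls below a fixed cutoff, and the relative variant cuts wherever the TP is lower than both neighbouring TPs. The function names and the toy syllable corpus are hypothetical.

```python
from collections import Counter


def forward_tps(utterances):
    """Estimate P(next unit | current unit) from a list of unit sequences."""
    pair_counts = Counter()
    left_counts = Counter()
    for units in utterances:
        for left, right in zip(units, units[1:]):
            pair_counts[(left, right)] += 1
            left_counts[left] += 1
    return {pair: n / left_counts[pair[0]] for pair, n in pair_counts.items()}


def segment_tp(units, tps, threshold=None):
    """Segment one utterance (a list of units) into words.

    threshold given -> absolute variant: cut where TP < threshold.
    threshold None  -> relative variant: cut where TP is a local minimum
                       (utterance edges are treated as infinitely high TP,
                       which is an assumption of this sketch).
    """
    if not units:
        return []
    values = [tps[(a, b)] for a, b in zip(units, units[1:])]
    words, current = [], [units[0]]
    for i, tp in enumerate(values):
        if threshold is not None:
            cut = tp < threshold
        else:
            before = values[i - 1] if i > 0 else float("inf")
            after = values[i + 1] if i + 1 < len(values) else float("inf")
            cut = tp < before and tp < after
        if cut:
            words.append("".join(current))
            current = []
        current.append(units[i + 1])
    words.append("".join(current))
    return words


# Hypothetical toy corpus, unitized at the syllable level.
corpus = [
    ["pre", "ty", "ba", "by"],
    ["pre", "ty", "dog"],
    ["ba", "by", "dog"],
    ["dog", "ba", "by"],
]
tps = forward_tps(corpus)
print(segment_tp(corpus[0], tps))                  # relative threshold
print(segment_tp(corpus[0], tps, threshold=0.75))  # absolute threshold
```

In this toy corpus the transition from "ty" to "ba" has a probability of .5 while its neighbours have 1.0, so both variants place the only cut there.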

The process in WordSeg (Bernard et al. 2019 Beh Res Meth)
Package: wordseg.readthedocs.io

Sample results: precision, recall, & F-score are correlated
Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES

Sample results: Effects of algorithm and input representation
Naima, in Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES

Why look at register? In child-directed speech, probably…
More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
*MOT: Attends! (‘Wait!’)
*MOT: Ouaistuvastemettreausoleilpourtesecherlescheveux! (‘Yeah, you are going to sit in the sun to dry your hair!’)

Utterances are more repetitious (+? lexical models)
*MOT: coucoucoucousitufaisaisdespetitssourirestoi. (‘Hello, hello, how about you give some little smiles.’)
*MOT: tumefaisdespetitssouriresXXXcoucoumongrand. (‘You are giving me little smiles XXX hello, my big boy.’)
*MOT: coucoutumefaisdessouriresoupas. (‘Hello, are you giving me smiles or not.’)

(Ask me about crosslinguistic extensions if curious!)
French: LENA-Lyon corpus (LeNormand et al. HomeBank). Collected with child-worn device worn whole day → adult-directed speech is among caregivers
English: Winnipeg corpus. Collected with child-worn device worn whole day → adult-directed speech is among caregivers
Japanese: Riken corpus. Collected in the lab → adult-directed speech is with experimenter
Bogdan Ludusan, Georgia Loukatou

French “wild” ADS on Le Normand, Canault, & Van Thai’s LENA-Lyon corpus
Loukatou Proc Cog Sci

CDS-ADS: Conclusions
Overall trend for better performance for child- than adult-directed speech
But:
reversed for some algorithms
effect of register < 15% (in the best controlled cases, 2%)

Why study word segmentation in a bilingual setting?
‘pié’ ‘mamá’ ‘bebé’ … ‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
Bilinguals need to learn words, like monolinguals do, but in two languages, and with overall less input in each language.
Hoff; Fibla & Cristia (submitted very soon, I hope)

Questions & predictions
Are segmentation strategies equally successful when applied to bilingual and monolingual corpora?
→ Measure the performance of previously studied segmentation algorithms in a controlled monolingual versus bilingual corpus.
Possible outcomes:
The confusion hypothesis: variable and inconsistent input → poorer performance for the bilingual than for the monolingual
The resistant hypothesis: (if switching only at utterance edges) local statistical and lexical cues are still reliable → similar performance for the bilingual and the monolingual. Our algorithms use transitional information to segment, and we only mix languages at sentence boundaries, so the segmentation strategies could still be robust.
Fibla & Cristia (submitted very soon, I hope)

Creating bilingual corpora
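One way such a corpus could be assembled is sketched below. The only property taken from the talk is that languages are mixed at utterance boundaries. Everything else is an assumption made for illustration: drawing an equal number of utterances from each monolingual corpus, alternating in fixed-size blocks, the function name, and the toy utterances.

```python
def make_bilingual(corpus_a, corpus_b, block_size=1):
    """Interleave two monolingual corpora, switching only between utterances.

    Assumptions of this sketch (not taken from the source): each language
    contributes the same number of utterances, and the languages alternate
    in blocks of `block_size` utterances.
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    n = min(len(corpus_a), len(corpus_b)) // 2
    half_a, half_b = corpus_a[:n], corpus_b[:n]
    mixed = []
    for start in range(0, n, block_size):
        mixed.extend(half_a[start:start + block_size])
        mixed.extend(half_b[start:start + block_size])
    return mixed


# Hypothetical toy corpora; each string is one whole utterance.
english = ["look at the doggie", "all done", "to bed", "mommy"]
spanish = ["mamá", "bebé", "pié", "mamá bebé"]

alternating = make_bilingual(english, spanish)
blocked = make_bilingual(english, spanish, block_size=2)
print(alternating)
print(blocked)
```

Because no utterance ever contains both languages, within-utterance statistics are never computed across a language switch, which is the condition the resistant hypothesis relies on.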

We had predicted that, if there were a bilingual disadvantage, then scores for the bilingual corpus should be below those of both matching monolingual corpora. In fact, this almost never happens: each bilingual corpus generally yields scores that lie in between those of the corresponding monolingual corpora. There are only three exceptions to this general pattern. For DiBS, the Catalan-Spanish corpus followed the general pattern, but the scores for the English-Spanish bilingual corpus were significantly lower than those for both the English and the Spanish monolingual corpora. For TP-rel, the scores for the two bilingual corpora overlapped with that for Spanish, and all three yielded lower Precision than Catalan and English. Nonetheless, the scores for DiBS and TP-rel vary by little, so a different measure of noise may reveal that even these differences are non-significant. Moreover, these 3 cases are exceptional: the other 11 of the 14 combinations (i.e., 7 algorithms times 2 bilingual corpora) fit the generalization that the bilingual score lies in between those found for the matching monolingual corpora.
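The comparison described above can be written as a small check. This is a sketch only: it compares point scores and ignores the noise estimates that the significance statements depend on, and the function name and the example scores are hypothetical rather than results from the study.

```python
def classify_bilingual_score(bilingual, mono_a, mono_b):
    """Place a bilingual score relative to its two matching monolingual scores."""
    low, high = sorted((mono_a, mono_b))
    if bilingual < low:
        return "below both"   # what a bilingual disadvantage would predict
    if bilingual > high:
        return "above both"
    return "in between"       # the pattern reported for most combinations


# Hypothetical scores, for illustration only.
print(classify_bilingual_score(0.42, 0.40, 0.50))
print(classify_bilingual_score(0.35, 0.40, 0.50))
```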

Three cases of bilingual < monolingual

11 cases of bilingual ‘in between’ monolingual

Effects of algorithm and input representation
Size of algorithm x level effect = 40-60%?
Cristia Open Mind

Effect of register
Size of register effect < 10%?
on LENA-Lyon corpus
Loukatou Proc Cog Sci

Effect of bilingualism
Size of bilingualism effect ~ 0%?
Fibla & Cristia (submitted very soon, I hope)

Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences as a function of language properties
… child-directed versus adult-directed register (in Japanese, English, & French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

What may babies be doing? Using CDI results & frequency effects
Larsen Interspeech & in prep

Coefficient of determination R2 = .1

[Figure panels: phoneme-based models; syllable-based models]

Cut only at utterance edges → frequency of words in isolation
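A minimal sketch of why this holds: a learner that cuts only at utterance edges (the SentBase strategy, where every sentence is treated as a word) stores whole utterances, so the only true words it can pick up are those that occur as one-word utterances, and their counts are their frequencies in isolation. The function name and the toy utterances are hypothetical.

```python
from collections import Counter


def isolation_frequencies(utterances):
    """Count how often each word occurs as a complete, one-word utterance."""
    counts = Counter()
    for utterance in utterances:
        words = utterance.split()
        if len(words) == 1:
            counts[words[0]] += 1
    return counts


# Hypothetical toy input; each string is one utterance with gold word spaces.
toy = ["mommy", "look at the doggie", "mommy", "all done", "baby"]
freqs = isolation_frequencies(toy)
print(freqs.most_common())
```

Words such as "doggie" that never appear alone get no count at all under this strategy, however frequent they are inside longer utterances.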

To be continued…

Thanks to...
Families who agree to be recorded & for their data to be shared
Researchers who record them and share on TalkBank
TalkBank ~ Brian MacWhinney & you!

Japanese “lab” ADS on Reiko Mazuka’s RIKEN corpus
Much of this is in Ludusan et al ACL (now working on journal paper with more material)

English “wild” ADS on Melanie Soderstrom’s Winnipeg corpus
Cristia Open Mind
