Modeling infant word segmentation: Another example of discovery fueled by CHILDES Alejandrina Cristia Laboratoire de Sciences Cognitives et Psycholinguistique.


1 Modeling infant word segmentation: Another example of discovery fueled by CHILDES
Alejandrina Cristia, Laboratoire de Sciences Cognitives et Psycholinguistique @ Language Emergence: Competition, Usage, and Analyses

2 No overt & unambiguous word/morpheme boundaries in the input…
The segmentation problem! There are no silences between words when we speak: unlike written language, which puts spaces between words, speech does not mark word boundaries. This means that children, in order to learn the phonological forms of words (which is crucial for learning word meanings), have to be able to find word boundaries. “no silences” Kuhl 2004

3 … yet by the end of the first year, infants know some words/morphemes
‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’ Tincoff & Jusczyk 2012; Bergelson & Swingley 2012; Ngon et al. 2014

4 How to study segmentability?
mommy talking … cute … something shiny go by? Let’s just get to the facts.

5 Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

6 Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

7 Input representation
Acoustic:
+ realistic… (… provided representations match babies’)
- few appropriate corpora (natural discourse & good quality audio)
- only one (reproducible) algorithm
Symbolic (‘Phonological text’):
+ lots of corpora can be used
+ lots of algorithms proposed
+ algorithms represent a wide range of strategies
- assumes babies represent input abstractly, with zero errors

8 Example
Input: *MOT: look at the doggie
Phonologize: lUk At D2 dOgi
Remove word boundaries & unitize: l U k A t D 2 d O g i
Segment with some algorithm: lU kAt D2 dO gi
Evaluate:
Precision = 1 of the 5 words found was a word in the input = .2
Recall = 1 of the 4 words in the input was recovered = .25
Token F-score = 2 * (Precision * Recall) / (Precision + Recall)
Note -- one can also unitize at the syllable level: lUk At D2 dO gi (input) lUk At D2 dO gi (output)
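A minimal sketch of the token-level evaluation above, in Python. It assumes each utterance is a string in which spaces mark word boundaries; the function names `token_spans` and `token_scores` are illustrative and are not part of the WordSeg package.

```python
def token_spans(utterance):
    """Return the set of (start, end) character spans of the words in an utterance."""
    spans = set()
    start = 0
    for word in utterance.split():
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def token_scores(gold, predicted):
    """Token precision, recall and F-score of a predicted segmentation.

    A predicted word counts as correct only if both of its edges match
    a word in the gold segmentation.
    """
    gold_spans = token_spans(gold)
    pred_spans = token_spans(predicted)
    hits = len(gold_spans & pred_spans)
    precision = hits / len(pred_spans) if pred_spans else 0.0
    recall = hits / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# The example from the slide: only "D2" is recovered with both edges correct.
precision, recall, f_score = token_scores("lUk At D2 dOgi", "lU kAt D2 dO gi")
print(precision, recall, round(f_score, 2))
```

On the slide's example this gives precision .2, recall .25, and a token F-score of about .22.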

9 Example algorithms (Bernard et al. 2019 Beh Res Meth)
1. Baseline: simplest strategies (Lignos 2012)
Every sentence is a word (SentBase)
Every syllable is a word (SyllBase)
2. Sub-lexical: goal is to “cut” using local cues (Daland ; Saksida)
Transitional Probabilities (TP) x Absolute/Relative threshold: TP_abs, TP_rel
Diphone-Based Segmentation (DiBS)
3. Lexical: goal is to learn a set of “minimal recombinable units” (Johnson ; Monaghan)
Adaptor Grammar (AG)
Phonotactics from Utterances Determine Distributional Lexical Elements (Puddle)
Package: wordseg.readthedocs.io Preprint: Bernard et al Beh Res Meth
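To make the sub-lexical family concrete, here is a sketch of segmentation by forward transitional probabilities with a relative threshold. It assumes that TP(a→b) is estimated as count(ab) / count(a) over the corpus, and that a boundary is posited wherever the TP dips below both neighbouring TPs. The function names and the treatment of utterance edges are illustrative choices, not the WordSeg implementation.

```python
from collections import Counter


def forward_tps(utterances):
    """Estimate forward TPs from a corpus given as lists of units (phonemes or syllables)."""
    first_counts = Counter()
    pair_counts = Counter()
    for units in utterances:
        for a, b in zip(units, units[1:]):
            first_counts[a] += 1
            pair_counts[(a, b)] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def segment_tp_rel(units, tps):
    """Cut wherever a TP is a local minimum (lower than both neighbouring TPs).

    Assumption: transitions at the utterance edges are compared against
    +infinity, so they can be cut if they are lower than their one neighbour.
    Unseen pairs get a TP of 0.
    """
    if not units:
        return []
    tp_seq = [tps.get((a, b), 0.0) for a, b in zip(units, units[1:])]
    words, current = [], [units[0]]
    for i, tp in enumerate(tp_seq):
        left = tp_seq[i - 1] if i > 0 else float("inf")
        right = tp_seq[i + 1] if i + 1 < len(tp_seq) else float("inf")
        if tp < left and tp < right:
            words.append("".join(current))
            current = []
        current.append(units[i + 1])
    words.append("".join(current))
    return words


# Toy syllabified corpus, made up for illustration.
corpus = [
    ["pre", "tty", "ba", "by"],
    ["pre", "tty", "do", "ggie"],
    ["ba", "by", "do", "ggie"],
]
tps = forward_tps(corpus)
print(segment_tp_rel(["pre", "tty", "ba", "by"], tps))
```

In the toy corpus, "tty" is followed by "ba" only half of the time while the word-internal transitions are fully predictable, so the dip falls at the word boundary.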

10 The process in WordSeg
Bernard et al. 2019 Beh Res Meth

11 Sample results: precision, recall, & F-score are correlated
Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES

12 Sample results: Effects of algorithm and input representation
Naima, in Providence corpus (Demuth, Culbertson, & Alter, 2006) on CHILDES

13 Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

14 Why look at register? In child-directed speech, probably…
More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
*MOT: Attends! [“Wait!”]
*MOT: Ouaistuvastemettreausoleilpourtesecherlescheveux! [“Yeah, you are going to sit in the sun to dry your hair!”]

15 Why look at register? In child-directed speech, probably…
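More utterances consist of a single word (+ all models)
Utterances are overall shorter in length (+ all models)
Utterances are more repetitious (+? lexical models)
*MOT: coucoucoucousitufaisaisdespetitssourirestoi. [“Hey, hey, how about you give some little smiles.”]
*MOT: tumefaisdespetitssouriresXXXcoucoumongrand. [“You are giving me little smiles XXX hey, my big boy.”]
*MOT: coucoutumefaisdessouriresoupas. [“Hey, are you giving me smiles or not?”]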
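The three register properties listed above can be measured directly on a transcript. The sketch below assumes orthographic utterances with spaces between words, and it uses the type-token ratio as one simple proxy for repetitiveness (lower means more repetitious). Both the function name `register_stats` and the choice of proxy are assumptions for illustration.

```python
def register_stats(utterances):
    """Summarise a list of space-separated utterances.

    Returns the proportion of single-word utterances, the mean utterance
    length in words, and the type-token ratio.
    """
    word_lists = [u.split() for u in utterances if u.split()]
    if not word_lists:
        raise ValueError("no non-empty utterances")
    tokens = [w for words in word_lists for w in words]
    n_utts = len(word_lists)
    return {
        "prop_single_word": sum(len(words) == 1 for words in word_lists) / n_utts,
        "mean_utt_length": len(tokens) / n_utts,
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }


# Toy input loosely based on the slide's examples, with word boundaries restored.
print(register_stats(["attends", "coucou coucou", "tu me fais des sourires"]))
```

Running the same function on child-directed and adult-directed samples of comparable size would show whether the "probably" in the slide title holds for a given corpus.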

16 (Ask me about crosslinguistic extensions if curious!)
French: LENA-Lyon corpus (LeNormand et al. HomeBank). Collected with child-worn device worn whole day → adult-directed speech is among caregivers
English: Winnipeg corpus. Collected with child-worn device worn whole day → adult-directed speech is among caregivers
Japanese: Riken corpus. Collected in the lab → adult-directed speech is with experimenter
Bogdan Ludusan, Georgia Loukatou

17 French “wild” ADS on Le Normand, Canault, & Van Thai’s LENA-Lyon corpus
Loukatou Proc Cog Sci

18 CDS-ADS: Conclusions Overall trend for better performance for child- than adult-directed speech But: reversed for some algorithms effect of register < 15% (in the best controlled cases, 2%)

19 Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

20 Why study word segmentation in a bilingual setting?
‘pié’ ‘mamá’ ‘bebé’ … ‘Feet’ ‘mommy’ ‘baby’ ‘alldone’ ‘tobed’
Bilinguals need to:
Learn words, like monolinguals do, but in two languages
Overall less input in each language
Hoff; Fibla & Cristia (submitted very soon, I hope)

21 Questions & predictions
Are segmentation strategies equally successful when applied to bilingual and monolingual corpora?
→ Measure the performance of previously studied segmentation algorithms in a controlled monolingual versus bilingual corpus.
Possible outcomes:
The confusion hypothesis: variable and inconsistent input → Poorer performance for the bilingual than for the monolingual corpus
The resistant hypothesis: (if switching only at utterance edges) local statistical and lexical cues are still reliable → Similar performance for the bilingual and the monolingual corpus
Note on the resistant hypothesis: our algorithms use transitional information to segment, and we only mix languages at sentence boundaries, so the segmentation strategies could still be robust.
Fibla & Cristia (submitted very soon, I hope)

22 Creating bilingual corpora
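One way to build such a corpus, sketched under stated assumptions: take two monolingual corpora (lists of utterances), truncate them to the same number of utterances, and alternate between them in blocks so that language switches happen only at utterance edges. The function name `make_bilingual`, the 50/50 mix, and the `block_size` parameter are hypothetical; the slides do not specify the actual mixing procedure.

```python
def make_bilingual(corpus_a, corpus_b, block_size=1):
    """Interleave two monolingual corpora, switching language only between utterances.

    Both corpora are truncated to the length of the shorter one so that each
    language contributes the same number of utterances (an assumption).
    """
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    n = min(len(corpus_a), len(corpus_b))
    a, b = corpus_a[:n], corpus_b[:n]
    mixed = []
    for i in range(0, n, block_size):
        mixed.extend(a[i:i + block_size])  # a block of language A utterances
        mixed.extend(b[i:i + block_size])  # then a block of language B utterances
    return mixed


# Toy utterances, for illustration only.
english = ["look at the doggie", "all done", "where is the baby"]
spanish = ["mira el perrito", "ya está"]
print(make_bilingual(english, spanish))
```

Because no utterance mixes languages, any word-internal or word-to-word statistics are computed within a single language, which is the condition the resistant hypothesis relies on.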

23

24 We had predicted that, if there were a bilingual disadvantage, then scores for the bilingual corpus should be below those of both matching monolingual corpora. In fact, this almost never happens: each bilingual corpus yields scores that lie in between those of the corresponding monolingual corpora. There are only three exceptions to this general pattern. While DiBS for Catalan-Spanish followed that same pattern, the scores for the English-Spanish bilingual corpus were significantly lower than those for both the English and the Spanish monolingual corpora. In TP-rel, the scores for the two bilingual corpora overlapped with that for Spanish, and all three yielded lower Precision than Catalan and English. Nonetheless, the scores for DiBS and TP-rel vary by little, and thus a different measure of noise may come to reveal that even these differences are non-significant. Moreover, these 3 cases are exceptional: the other 11 of the 14 combinations (i.e., 7 algorithms times 2 bilingual corpora) fit the generalization that the bilingual score lies in between those found for the matching monolingual corpora.
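The comparison behind this prediction can be written as a small classifier. This is a sketch only: the function name `bilingual_pattern` is illustrative, the scores passed to it are made-up placeholders rather than results from the study, and it ignores the noise estimates that the note mentions.

```python
def bilingual_pattern(bilingual, mono_a, mono_b):
    """Classify a bilingual score relative to its two matching monolingual scores."""
    low, high = min(mono_a, mono_b), max(mono_a, mono_b)
    if bilingual < low:
        return "below both"   # the pattern a bilingual disadvantage would predict
    if bilingual > high:
        return "above both"
    return "in between"       # includes ties with either monolingual score


# Placeholder numbers, for illustration only.
print(bilingual_pattern(0.30, 0.25, 0.40))
print(bilingual_pattern(0.20, 0.25, 0.40))

# Bookkeeping from the note: 7 algorithms x 2 bilingual corpora, 3 exceptions.
n_combinations = 7 * 2
print(n_combinations - 3)
```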

25 Three cases of bilingual < monolingual

26 Three cases of bilingual < monolingual
11 cases of bilingual ‘in between’ monolingual

27 Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences for child-directed versus adult-directed register (in French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

28 Effects of algorithm and input representation
Size of algorithm x level effect = 40-60%?
Cristia Open Mind

29 Effect of register
Size of register effect < 10%?
On LENA-Lyon corpus; Loukatou Proc Cog Sci

30 Effect of bilingualism
Size of bilingualism effect ~ 0%?
Fibla & Cristia (submitted very soon, I hope)

31 Today’s menu
A methodology for studying word form segmentation using models
Segmentability differences as a function of language properties
… child-directed versus adult-directed register (in Japanese, English, & French)
… bilingual versus monolingual settings (English, Spanish, & Catalan)
Implications for infant studies

32 What may babies be doing? Using CDI results & frequency effects
Larsen Interspeech & in prep

33 What may babies be doing? Using CDI results & frequency effects
Coefficient of determination R² = .1
Larsen Interspeech & in prep
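For readers who want to see how such a coefficient of determination is obtained, here is a sketch. It assumes a simple linear regression of a CDI-based score per word (for example, the proportion of infants reported to know the word) on the log frequency of that word in a model's segmented output. That set-up, the function name `r_squared`, and all numbers below are assumptions for illustration, not the analysis or data behind the slide.

```python
import math


def r_squared(x, y):
    """Coefficient of determination of a simple linear regression of y on x.

    With a single predictor this equals the squared Pearson correlation.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must both vary")
    return (sxy * sxy) / (sxx * syy)


# Placeholder values: word frequencies in a segmented corpus and
# proportions of infants reported to know each word.
frequencies = [120, 45, 300, 8, 60]
proportion_known = [0.40, 0.35, 0.55, 0.30, 0.20]
log_frequencies = [math.log(f) for f in frequencies]
print(round(r_squared(log_frequencies, proportion_known), 2))
```

An R² of .1 would mean that the predictor accounts for about 10% of the variance in the outcome.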

34

35

36

37 phoneme-based models

38 syllable-based models
phoneme-based models

39 Cut only at utterance edges → frequency of words in isolation
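A learner that cuts only at utterance edges treats every utterance as one unit, so the only real words it can find are those that occur as whole utterances. The sketch below counts how often each word appears in isolation, assuming utterances are space-separated strings; the function name `isolated_word_counts` is illustrative.

```python
from collections import Counter


def isolated_word_counts(utterances):
    """Count how often each word occurs as a complete one-word utterance."""
    counts = Counter()
    for utterance in utterances:
        words = utterance.split()
        if len(words) == 1:
            counts[words[0]] += 1
    return counts


# Toy corpus, for illustration only.
corpus = ["doggie", "look at the doggie", "doggie", "mommy", "where is mommy"]
print(isolated_word_counts(corpus))
```

In the toy corpus "doggie" is found twice and "mommy" once, while "look", "at", "the", "where", and "is" are never found, however frequent they are inside longer utterances.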

40 To be continued…

41 Thanks to...
Families who agree to be recorded & for their data to be shared
Researchers who record them and share on TalkBank
TalkBank ~ Brian MacWhinney & you!

42 Japanese “lab” ADS on Reiko Mazuka’s RIKEN corpus
Much of this is in Ludusan et al ACL (now working on journal paper with more material)

43 English “wild” ADS on Melanie Soderstrom’s Winnipeg corpus
Cristia Open Mind

