The role of data sharing in studying language learning: Wordbank and childes-db
Michael C. Frank, Stanford University
The original “big data” for child language: CHILDES (created 1984), an open, shared research environment with open tools and resources.
An explosion of language
18 mo: “happy-b”
19 mo: “blue ball”
23 mo: “spike doggy no food eat dirt”
26 mo: “dada move own body, my need lilbit more space”
(“spike doggy no food eat dirt”: Spike the dog doesn’t eat food, he eats dirt)
“doggie” → doggie: mapping word forms to their referents
Frank, Goodman, & Tenenbaum (2008), NIPS; McMurray et al. (2012), Psych Review; Fazly et al. (2010), Cognitive Science
Understanding this process will require quantitative theories of how learning operates over children’s perceptual input and within their cognitive limitations. In much of my previous work, I have attempted to develop probabilistic models that instantiate these quantitative theories.
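To make the idea concrete, here is a minimal associative sketch of cross-situational word learning, in the spirit of (but much simpler than) the Bayesian model of Frank, Goodman, & Tenenbaum (2008). The scenes and the credit-splitting rule are toy assumptions for illustration only.

```python
from collections import defaultdict

# Each "scene" pairs the words heard with the objects in view (toy data).
scenes = [
    ({"doggie", "ball"}, {"DOG", "BALL"}),
    ({"doggie"}, {"DOG", "CUP"}),
    ({"ball", "cup"}, {"BALL", "CUP"}),
]

# Accumulate word-object association strength, splitting credit evenly
# across the candidate referents present in each scene.
counts = defaultdict(lambda: defaultdict(float))
for words, objects in scenes:
    for w in words:
        for o in objects:
            counts[w][o] += 1.0 / len(objects)

def meaning(word):
    """Normalize association strengths into P(object | word)."""
    total = sum(counts[word].values())
    return {o: c / total for o, c in counts[word].items()}

print(meaning("doggie"))  # mass concentrates on DOG across scenes
```

Even with three ambiguous scenes, probability mass for “doggie” concentrates on DOG (0.5) over BALL and CUP (0.25 each).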
Evaluating learning models
A scientific hypothesis about learning is instantiated as a learning model that maps input predictors to outcomes. But such models are typically evaluated against a handful of binary experimental results, or 20 minutes of annotated video. Theory building requires more data!
Small-scale experiments may not be replicable
Many folks here are likely aware of the recent study by the Open Science Collaboration, which reported the results of 100 independent replications of high-profile studies from 2008. Disappointingly, fewer than half of these replications (by a variety of criteria) produced the same result as the original. I am proud that four of those replications were contributed by students in the first iteration of my graduate lab class. The Open Science Collaboration paper is an important study, but I have come to believe that its implications are more limited than some interpretations suggest. In the remainder of this talk, I want to lay out some of the ways that my students, collaborators, and I have been thinking about this kind of cumulative enterprise. Open Science Collaboration (2015)
Even “simple” analyses may not be reproducible
Sampled 35 articles published in Cognition; 13 were reproducible from the shared data. The other 22 were not reproducible from the data alone. Hardwicke et al. (2018), Royal Soc Open Sci
The MacArthur-Bates Communicative Development Inventory (CDI)
Adapted into many languages, including Spanish, French, Polish, Slovak, and Japanese
Wordbank: http://wordbank.stanford.edu
Frank et al. (2016), Journal of Child Language
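As a sketch of how one might start exploring these data, the snippet below computes median productive vocabulary by age and language, assuming a CSV export of by-administration data downloaded from http://wordbank.stanford.edu; the filename and the language/age/production column names are assumptions here, not a guaranteed schema.

```python
import pandas as pd

# Hypothetical export of by-administration CDI data from Wordbank.
admins = pd.read_csv("wordbank_administrations.csv")

# Median productive vocabulary at each age, separately per language.
growth = (
    admins.groupby(["language", "age"])["production"]
    .median()
    .reset_index()
)
print(growth.head())
```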
Cross-linguistic generalizations about early language
The first words are very consistent across languages
Across children, variability is a constant in early language
There is a noun bias in early language, but verbs vary by language
The growth of grammar is linked to vocabulary growth
https://langcog.github.io/wordbank-book
A framework for evaluating learning models: input predictors → learning model → outcomes
Predicting when words are learned
Braginsky et al. (in press), Open Mind
Predictors (mapped across languages through hand-checked translation equivalents)
Form: number of phonemes
Meaning: concreteness (Brysbaert, Warriner, & Kuperman, 2014); arousal & valence (Warriner, Kuperman, & Brysbaert, 2013); babiness (Perry, Perlman, & Lupyan, 2015)
Input: frequency in speech to children, which requires corpora of child-directed language… Introducing http://childes-db.stanford.edu
Sanchez*, Meylan* et al. (2016), Behavior Research Methods
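A sketch of how the input predictor might be estimated from child-directed speech, assuming a token-level export from http://childes-db.stanford.edu; the filename and the gloss/speaker_role column names are assumptions rather than a guaranteed schema.

```python
import numpy as np
import pandas as pd

# Hypothetical token-level export from childes-db.
tokens = pd.read_csv("childes_tokens_eng.csv")

# Keep caregiver speech only, then compute smoothed log frequencies.
cds = tokens[tokens["speaker_role"] != "Target_Child"]
counts = cds["gloss"].str.lower().value_counts()
log_freq = np.log((counts + 1) / counts.sum())  # add-one smoothing
print(log_freq.head())
```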
Predictors of production
A “first pass” predictive model: a baseline for future modeling work
[Figure: coefficient estimates for each predictor]
Braginsky et al. (in press), Open Mind
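A hedged sketch of what such a “first pass” model could look like: a linear mixed-effects regression of each word’s age of acquisition on the predictors above, with language as a grouping factor. The dataset, file, and variable names are assumptions; this illustrates the modeling approach, not the authors’ exact specification.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical merged dataset: one row per (word, language), with the
# word's age of acquisition (aoa) and its predictor values.
words = pd.read_csv("aoa_predictors.csv")

# Random intercept for language; fixed effects for each predictor.
model = smf.mixedlm(
    "aoa ~ log_frequency + n_phonemes + concreteness + valence + arousal + babiness",
    data=words,
    groups=words["language"],
)
print(model.fit().summary())
```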
Consistency across languages
[Figure: average correlation (r) between predictor estimates across languages, compared to a random baseline]
Braginsky et al. (in press), Open Mind
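To illustrate the consistency measure, the sketch below computes the average pairwise correlation between languages’ predictor estimates and compares it to a shuffled baseline; the coefficient matrix is randomly generated stand-in data, not the paper’s estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coefficient matrix: rows = 10 languages, columns = 6 predictors,
# sharing a common profile plus language-specific noise.
coefs = rng.normal(size=(10, 6)) + np.array([1.0, -0.5, 0.8, 0.1, 0.0, 0.6])

def mean_pairwise_r(mat):
    """Average correlation between every pair of rows (languages)."""
    rs = [
        np.corrcoef(mat[i], mat[j])[0, 1]
        for i in range(len(mat))
        for j in range(i + 1, len(mat))
    ]
    return float(np.mean(rs))

observed = mean_pairwise_r(coefs)
baseline = mean_pairwise_r(rng.permuted(coefs, axis=1))  # shuffle within rows
print(f"observed r = {observed:.2f}, random baseline r = {baseline:.2f}")
```

Shuffling predictor identities within each language destroys the cross-language alignment, so the baseline r hovers near zero while the observed r stays positive.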
Lexical networks
Edges derived from semantic or phonological features (edges can also be derived from word embeddings)
Fourtassi, Bian, & Frank (2018; under review)
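A minimal sketch of constructing such a network from word embeddings: connect two words when the cosine similarity of their vectors exceeds a threshold. The random vectors and the 0.2 cutoff are placeholders for real embeddings and a tuned threshold.

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(1)
words = ["dog", "cat", "ball", "cup", "milk"]
vecs = {w: rng.normal(size=50) for w in words}  # stand-in for real embeddings

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Connect words whose embedding similarity exceeds the threshold.
G = nx.Graph()
G.add_nodes_from(words)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if cosine(vecs[w1], vecs[w2]) > 0.2:
            G.add_edge(w1, w2)

print(G.number_of_edges())
```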
Adding network predictors
Mixed-effects regression, with language as a random effect
PAC = preferential acquisition (growth based on the full network)
10-language sample
Fourtassi, Bian, & Frank (2018; under review)
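A sketch of the preferential acquisition (PAC) predictor under its standard definition: a candidate word’s growth score is its degree in the full (adult) network, independent of which words the child already knows. The toy network here is hypothetical.

```python
import networkx as nx

# Toy "full" lexical network; in practice edges come from the semantic
# or phonological networks described above.
G = nx.Graph([("dog", "cat"), ("dog", "ball"), ("ball", "cup"), ("milk", "cup")])

# PAC score: each word's degree in the full network.
pac = {w: G.degree(w) for w in G.nodes}
print(sorted(pac.items(), key=lambda kv: -kv[1]))
```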
A framework for evaluating learning models: input predictors → learning model → outcomes
An invitation to develop new models!
Klaus W. Jacobs Foundation
http://wordbank.stanford.edu
http://childes-db.stanford.edu
https://langcog.github.io/wordbank-book
http://langcog.stanford.edu