Predicting Children’s Reading Ability using Evaluator-Informed Features
Matthew Black, Joseph Tepperman, Sungbok Lee, and Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory (SAIL)
University of Southern California, Los Angeles, CA, USA
“The Big Picture”
[Overview figure: children 1 through N each read aloud a list of words (e.g., map, rip, met, rub, zap, left, must, frog, flip, snack, mop, lot, fell, fine, rope, rake); pipeline: Feature Extraction → Feature Selection → Supervised Learning]
Children were administered a reading test
Evaluators rated each child’s overall reading ability
– From 1 (“poor”) to 7 (“excellent”)
Each child was assigned the average score
– Deviation from the average was also computed
Evaluators listed the high-level cues they used when making judgments
– These cues informed the feature extraction
Why?
Reading assessment is important in children’s early education
– It has both objective and subjective components
Advantages of automatic reading assessment
– Consistent evaluations across different subjects and over time
– Saves human evaluators time
Previous work
– Concentrated mostly on detecting errors at the phoneme and word level
– Predicting a child’s overall performance is relatively less researched
Corpus: TBALL Project
Technology-Based Assessment of Language and Literacy (TBALL) Project
– Collected in kindergarten through second-grade classrooms in Los Angeles
– Both native English and native Spanish speakers
42 children
– Isolated English word reading task
– Word order identical for each child, in increasing order of difficulty
– Number of words per child in this study ranged from 10 to …
Human evaluation: Explanation
11 engineering graduate students served as evaluators
– No significant difference between students and experts [Tepperman ’06]
Evaluators were not instructed how to grade and were not given examples
– We wanted evaluators to use their own grading criteria
Maintained chronological word order for each child
Randomized child order for each evaluator
Lastly, evaluators listed the criteria they used when judging
Human evaluation: Analysis
Mean pairwise evaluator correlation: …
Dependent variable: ground-truth (GT) scores for each child, calculated by averaging across all 11 evaluators’ scores
3 metrics to compare each evaluator’s scores with the GT scores, each computed between one evaluator and the mean of the remaining 10 (e.g., Evaluator 1 vs. Mean(Evaluators 2:11)); also used later to compare automatic results (a sketch follows this slide):
1. Pearson’s correlation, 2. Mean error (mean absolute difference), 3. Max error

Metric | Mean across evaluators
Pearson’s correlation | 0.899
Mean error | 0.624
Maximum error | 2.227

3 most cited cues that the 11 evaluators used
– Pronunciation correctness, speaking rate, and fluency of speech
– Exact grading method unknown, but it informs potentially useful features
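As a rough illustration of these three agreement metrics, here is a minimal Python sketch; the matrix shape, variable names, and synthetic ratings are assumptions for illustration, not the study’s data.

```python
import numpy as np

def agreement_metrics(ratings, evaluator):
    """ratings: (n_evaluators, n_children) matrix of 1-7 scores."""
    others = np.delete(ratings, evaluator, axis=0).mean(axis=0)  # mean of the rest
    mine = ratings[evaluator]
    corr = np.corrcoef(mine, others)[0, 1]      # 1. Pearson's correlation
    errors = np.abs(mine - others)              # absolute differences per child
    return corr, errors.mean(), errors.max()    # 2. mean error, 3. max error

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(11, 42)).astype(float)  # 11 evaluators, 42 children
print(agreement_metrics(ratings, evaluator=0))
```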
Features: Preliminary work
Acoustic models
– Trained on 12 hours of held-out children’s word-level speech
– Three-state, 16-Gaussian HMMs: 38 monophones, a background (BG) model for silence and noise, and a word-level “garbage” (GG) model
Created 4 dictionaries with various phonemic pronunciations (example entries for “map”; a data-structure sketch follows below):

Dictionary | Avg. # entries | Example entries for “map”
Acceptable | 1.18 | /m ae p/
Reading Error | 2.13 | /m ey p/
Spanish Confusion | 1.09 | /m aa p/
Recognition | 3.71 | /m ae p/, /m ey p/, /m aa p/

Extracted features correlated with the evaluators’ cues at the word level
– Natural, since the children read a list of words in isolation
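One way to picture the four dictionaries is as nested lookup tables, with the Recognition dictionary pooling the other three; this sketch and its names are illustrative assumptions, not the TBALL toolchain’s actual format.

```python
# Hypothetical dictionary layout: word -> list of phonemic pronunciations.
DICTIONARIES = {
    "acceptable":        {"map": ["m ae p"]},
    "reading_error":     {"map": ["m ey p"]},
    "spanish_confusion": {"map": ["m aa p"]},
}
# The Recognition dictionary pools the pronunciations of the other three.
DICTIONARIES["recognition"] = {
    word: sorted({p for d in DICTIONARIES.values() for p in d.get(word, [])})
    for word in {"map"}
}
print(DICTIONARIES["recognition"]["map"])  # ['m aa p', 'm ae p', 'm ey p']
```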
Features: Pronunciation correctness
Speech recognition with the Recognition dictionary
[Grammar figure: Start → BG | GG → {/m ae p/, /m ey p/, /m aa p/} → BG | GG → End]
4 forced alignments over the portion end-pointed as the target word, yielding log-likelihoods LL_acc, LL_read, LL_Span, and LL_GG (for /m ae p/, /m ey p/, /m aa p/, and GG)
Extracted pronunciation correctness features for each word
– 3 binary features (e.g., was an acceptable pronunciation recognized?)
– 10 continuous features (e.g., LL_acc − max{LL_read, LL_Span, LL_GG}); see the sketch below
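A minimal sketch of the quoted continuous feature, plus one plausible reading of a binary feature, assuming the four forced-alignment log-likelihoods are already available; the function name and example values are hypothetical.

```python
def correctness_features(ll_acc, ll_read, ll_span, ll_gg):
    # Continuous feature from the slide: acceptable-pronunciation margin.
    margin = ll_acc - max(ll_read, ll_span, ll_gg)
    # One plausible binary cue: did an acceptable pronunciation score best?
    acceptable_recognized = margin > 0.0
    return margin, acceptable_recognized

# Hypothetical log-likelihoods for one aligned target word
print(correctness_features(ll_acc=-310.2, ll_read=-325.9, ll_span=-330.4, ll_gg=-342.0))
```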
Features: Speaking rate
Extracted several continuous word-level temporal features (sketched below)
– Target word start time relative to when the word was first displayed
– Target word total duration (L_target)
– L_target / number of syllables spoken
– Number of syllables spoken / L_target
– L_target / number of phones spoken
– Number of phones spoken / L_target
Also included the square root of all features listed above
– Resulted in less-skewed distributions
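A minimal sketch of these temporal features, assuming per-word timing and syllable/phone counts are available; all field names and example values are illustrative.

```python
import math

def temporal_features(word_start, word_end, display_time, n_syllables, n_phones):
    length = word_end - word_start                  # L_target, in seconds
    feats = {
        "start_latency": word_start - display_time, # start relative to display
        "length": length,
        "len_per_syllable": length / n_syllables,
        "syllables_per_sec": n_syllables / length,
        "len_per_phone": length / n_phones,
        "phones_per_sec": n_phones / length,
    }
    # Square-root variants give less-skewed distributions
    feats.update({k + "_sqrt": math.sqrt(max(v, 0.0)) for k, v in list(feats.items())})
    return feats

print(temporal_features(word_start=1.4, word_end=2.1, display_time=0.9,
                        n_syllables=1, n_phones=3))
```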
Features: Fluency of speech
Presence of disfluencies is significantly negatively correlated with perceived fluency of children’s speech [Black ’07]
– e.g., hesitations, repetitions, sounding out the words
Re-ran speech recognition with disfluency-specialized grammars
[Grammar figure: Start → BG | GG → {whole or partial /m ae p/, /m ey p/, /m aa p/} → BG | GG → End; partial-word structure for /m ae p/: m- BG ae- p- BG]
Extracted 11 word-level features from the output transcription (sketched below)
– e.g., the number of disfluent phones recognized
– e.g., the square root of the total time end-pointed as disfluent speech
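The two example features above can be computed from a phone-level transcription in which disfluent segments are flagged; this sketch’s tuple format and example tokens are assumptions for illustration.

```python
import math

def fluency_features(phones):
    """phones: list of (phone, start_sec, end_sec, is_disfluent) tuples."""
    disfluent = [(s, e) for (_, s, e, d) in phones if d]
    disfluent_time = sum(e - s for (s, e) in disfluent)
    return {
        "n_disfluent_phones": len(disfluent),
        "disfluencies_sqrt_time": math.sqrt(disfluent_time),
    }

phones = [("m", 0.50, 0.62, True),   # sounded-out onset ("m-") before the word
          ("m", 1.10, 1.20, False),  # whole-word attempt
          ("ae", 1.20, 1.33, False),
          ("p", 1.33, 1.45, False)]
print(fluency_features(phones))
```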
Child-Level Features
Needed to map word-level features to “child-level” features
13 descriptive statistics calculated across words for each child (sketched below)
– Mean, standard deviation, skewness, kurtosis, minimum, minimum location (normalized), maximum, maximum location (normalized), range, lower quartile, median, upper quartile, and interquartile range
Final set of 481 child-level features
– Redundancy in the feature set
– Not obvious which features will be best
Machine learning algorithm needed
– Feature selection to eliminate redundant/irrelevant/noisy features
– Supervised learning to accurately predict the ground-truth overall scores
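A minimal sketch of the 13 statistics applied to one word-level feature across the words a single child read; the sample values are synthetic.

```python
import numpy as np
from scipy import stats

def child_level_stats(x):
    x = np.asarray(x, dtype=float)              # one feature, one value per word
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": x.mean(), "std": x.std(ddof=1),
        "skewness": stats.skew(x), "kurtosis": stats.kurtosis(x),
        "min": x.min(), "min_loc": x.argmin() / (len(x) - 1),   # normalized position
        "max": x.max(), "max_loc": x.argmax() / (len(x) - 1),
        "range": x.max() - x.min(),
        "lower_quartile": q1, "median": med, "upper_quartile": q3,
        "iqr": q3 - q1,
    }

print(child_level_stats([0.8, 1.1, 0.9, 2.4, 1.7, 1.2, 3.0, 1.5, 1.0, 2.1]))
```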
Learning Methods
Chose linear supervised regression techniques
– Simplicity, interpretability, small dataset (42 data points)
– Leave-one-out cross-validation (LOO) to separate train and test sets
– All learning parameters optimized using LOO on each cross-validation training set
Baseline: simple linear regression (LR) with each single child-level feature x
3 feature selection methods within the linear regression framework (the lasso variant is sketched below):
– Forward selection
– Stepwise linear regression
– Lasso (least absolute shrinkage and selection operator); used the LARS (least angle regression) algorithm to implement the lasso efficiently
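A rough sketch of the “lasso, then LR” scheme under LOO, using scikit-learn’s LARS-based lasso; the synthetic 42 × 481 matrix stands in for the real features, and the inner 5-fold tuning is a simplification of the paper’s LOO-on-train-set tuning.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV, LinearRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(42, 481))                                # synthetic child-level features
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=42)  # synthetic GT scores

preds = np.empty_like(y)
for train, test in LeaveOneOut().split(X):
    lasso = LassoLarsCV(cv=5).fit(X[train], y[train])  # tune on the training fold only
    sel = np.flatnonzero(lasso.coef_)                  # features the lasso kept
    if sel.size == 0:                                  # guard: nothing selected
        preds[test] = y[train].mean()
        continue
    lr = LinearRegression().fit(X[train][:, sel], y[train])   # refit plain LR
    preds[test] = lr.predict(X[test][:, sel])

err = np.abs(preds - y)
print("corr:", np.corrcoef(preds, y)[0, 1], "mean |e|:", err.mean(), "max |e|:", err.max())
```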
Results
Lasso/LR method had a higher correlation than all 3 single-feature baselines (p < 0.05)
Lasso/LR method was better than the mean human evaluator on all 3 metrics (not significant)

Method (features) | Corr. | Mean |ε| | Max |ε|
LR (best correctness) | … | … | …
LR (best temporal) | … | … | …
LR (best fluency) | … | … | …
LR (forward, 2 features) | … | … | …
LR (forward, 3 features) | … | … | …
LR (forward, 4 features) | … | … | …
Stepwise LR | … | … | …
Lasso | … | … | …
Lasso, then LR | … | … | …
Mean human evaluator | … | … | …

Child-level features most often selected by the Lasso/LR method:

Type | Word-level feature | Statistic | # times selected | Mean coeff.
Correctness | acceptable_pron_recognized | Mean | 42 / 42 | …
Temporal | target_word_start_sqrt_time | Upper quartile | 42 / 42 | …
Fluency | voiced_disfluencies_sqrt_time | Maximum | 24 / 42 | …
Fluency | disfluencies_sqrt_time | Upper quartile | 18 / 42 | …
Regression Plot: Lasso/LR method
[Scatter plot of predicted vs. ground-truth scores for the 42 children]
Two-thirds of the predictions had errors less than the average human evaluator’s error
Error Analysis: Lasso/LR method
3 outliers identified; the maximum outlier error was still less than the mean maximum human error (2.227)
Conclusions & Future Work
Accurately predicted judgments of children’s overall reading ability for one specific reading assessment task
Extracted features correlated with the cues evaluators used
– Pronunciation correctness, speaking rate, and fluency of speech
Used the lasso algorithm to select the most relevant features and linear regression to learn the ground-truth evaluator scores
Final model:
– Chose, on average, one feature from each of the three feature classes
– Significantly beat all baseline methods that used single features
– Predicted scores within the mean human error for 28 out of 42 children
Future work:
– Improve feature extraction robustness
– Model/predict the subjective judgments of individual expert evaluators
– Extend to different reading assessment tasks
This work was supported in part by the National Science Foundation.
Thank you! Questions?