Predicting Children’s Reading Ability using Evaluator-Informed Features
Matthew Black, Joseph Tepperman, Sungbok Lee, and Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory (SAIL)
University of Southern California, Los Angeles, CA, USA
“The Big Picture”
8.09.2009, Black et al., “Predicting Children's Reading Ability using Evaluator-Informed Features”
Children (Child 1 … Child N) administered a reading test of isolated words (e.g., map, rip, met, rub, zap, left, must, frog, flip, snack, …)
Evaluators rated each child’s overall reading ability
– From 1 (“poor”) to 7 (“excellent”)
Children assigned the average score across evaluators (e.g., 5.55 ± 0.84, 2.91 ± 0.36)
– Deviation from the average also computed
Evaluators listed the high-level cues they used when making judgments
– Informed feature extraction
Pipeline: Feature Extraction → Feature Selection → Supervised Learning
Why?
Reading assessment is important in children’s early education
– Has both objective and subjective components
Advantages of automatic reading assessment
– Consistent evaluations across different subjects and over time
– Saves human evaluators’ time
Previous work
– Concentrated mostly on detecting errors at the phoneme and word level
– Predicting a child’s overall performance is relatively less researched
Corpus: TBALL Project
Technology-Based Assessment of Language and Literacy (TBALL) Project
– Collected in kindergarten-to-second-grade classrooms in Los Angeles
– Both native English and native Spanish speakers
42 children
– Isolated English word reading task
– Word order identical for each child, in increasing order of difficulty
– Number of words per child in this study ranged from 10 to 23
Human evaluation: Explanation
11 engineering graduate students served as evaluators
– No significant difference between students and experts [Tepperman ‘06]
Did not instruct evaluators how to grade or offer examples
– We wanted evaluators to use their own grading criteria
Maintained the chronological word order for each child; randomized the child order for each evaluator
Lastly, evaluators listed the criteria they used when judging
Human evaluation: Analysis
Mean pairwise evaluator correlation: 0.825
Dependent variable: calculated ground-truth (GT) scores for each child by averaging across all 11 evaluators’ scores
3 metrics to compare each evaluator’s scores with the GT scores (used later to compare automatic results):

Metric             Mean across 11 evaluators
Correlation        0.899
Mean Error         0.624
Maximum Error      2.227

Worked example (Evaluator 1 vs. mean of Evaluators 2–11): scores of 2, 4, 3, …, 7, 4 against 2.7, 1.9, 2.9, …, 6.6, 4.5 give absolute differences of 0.7, 2.1, 0.1, …, 0.4, 0.5, from which 1. Pearson’s correlation, 2. mean error, and 3. maximum error are computed.

3 most cited cues that the 11 evaluators used
– Pronunciation correctness, speaking rate, and fluency of speech
– Exact grading method unknown, but informs potentially useful features
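The three comparison metrics above can be sketched in a few lines. A minimal illustration, not the authors’ code; the function name and array inputs are assumptions:

```python
import numpy as np

def evaluator_metrics(evaluator_scores, gt_scores):
    """Compare one evaluator's scores with the ground-truth (GT) scores."""
    e = np.asarray(evaluator_scores, dtype=float)
    g = np.asarray(gt_scores, dtype=float)
    corr = np.corrcoef(e, g)[0, 1]              # 1. Pearson's correlation
    abs_err = np.abs(e - g)
    return corr, abs_err.mean(), abs_err.max()  # 2. mean error, 3. maximum error
```

Applied to the worked example above (2, 4, 3, 7, 4 vs. 2.7, 1.9, 2.9, 6.6, 4.5), the mean error is 0.76 and the maximum error is 2.1.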
Features: Preliminary work
Acoustic models
– 12 hours of held-out children’s word-level speech
– Three-state, 16-Gaussian HMMs: 38 monophones, a background model (BG) for silence and noise, and a word-level “garbage” model (GG)
Created 4 dictionaries with various phonemic pronunciations:

Dictionary          Avg. # Entries   Entries for “map”
Acceptable          1.18             /m ae p/
Reading Error       2.13             /m ey p/
Spanish Confusion   1.09             /m aa p/
Recognition         3.71             /m ae p/, /m ey p/, /m aa p/

Extracted features correlated with the cues at the word level
– Natural, since the children read a list of words in isolation
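The dictionary structure can be sketched as plain Python data. Only the entries for “map” come from the table above; the representation itself is a hypothetical illustration, and the Recognition dictionary is shown as the union of the other three:

```python
# Each dictionary maps a word to a list of phonemic pronunciations.
dictionaries = {
    "acceptable":        {"map": ["m ae p"]},
    "reading_error":     {"map": ["m ey p"]},
    "spanish_confusion": {"map": ["m aa p"]},
}

# The Recognition dictionary pools all pronunciations for each word.
dictionaries["recognition"] = {
    word: sorted({p for d in ("acceptable", "reading_error", "spanish_confusion")
                  for p in dictionaries[d].get(word, [])})
    for word in dictionaries["acceptable"]
}
```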
Features: Pronunciation correctness
Speech recognition with the Recognition dictionary: optional BG or GG, then the target word (for “map”: /m ae p/, /m ey p/, or /m aa p/), then optional BG or GG
4 forced alignments over the portion end-pointed as the target word: /m ae p/, /m ey p/, /m aa p/, and GG, yielding log-likelihoods LL_acc, LL_read, LL_Span, and LL_GG
Extracted pronunciation correctness features for each word
– 3 binary features (e.g., was an acceptable pronunciation recognized?)
– 10 continuous features (e.g., LL_acc - max{LL_read, LL_Span, LL_GG})
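Given the four forced-alignment log-likelihoods, one binary and one continuous feature from the slide can be sketched as follows. The function and argument names are assumptions; the recognizer and forced aligner themselves are not shown:

```python
def pronunciation_features(ll_acc, ll_read, ll_span, ll_gg,
                           recognized_pron, acceptable_prons):
    """One binary and one continuous pronunciation-correctness feature."""
    # Binary: was an acceptable pronunciation recognized?
    acceptable_recognized = recognized_pron in acceptable_prons
    # Continuous: margin of the acceptable alignment over its best competitor.
    ll_margin = ll_acc - max(ll_read, ll_span, ll_gg)
    return {"acceptable_pron_recognized": acceptable_recognized,
            "ll_acc_minus_max_competitor": ll_margin}
```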
Features: Speaking rate
Extracted several continuous word-level temporal features
– Target word start time relative to when the word was first displayed
– Target word total length in time (L_target)
– L_target / number of syllables spoken
– Number of syllables spoken / L_target
– L_target / number of phones spoken
– Number of phones spoken / L_target
Also included the square root of all features listed above
– Resulted in less-skewed distributions
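The six temporal features and their square roots can be sketched as below. A minimal illustration, assuming durations in seconds and syllable/phone counts taken from the recognizer output; the feature names are made up for the sketch:

```python
import math

def temporal_features(start_time, word_length, n_syllables, n_phones):
    """Word-level temporal features plus square-rooted variants."""
    feats = {
        "start_time": start_time,                         # relative to display
        "length": word_length,                            # L_target
        "length_per_syllable": word_length / n_syllables,
        "syllables_per_second": n_syllables / word_length,
        "length_per_phone": word_length / n_phones,
        "phones_per_second": n_phones / word_length,
    }
    # Square-rooting positive, right-skewed features reduces their skewness.
    feats.update({name + "_sqrt": math.sqrt(v) for name, v in list(feats.items())})
    return feats
```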
Features: Fluency of speech
Re-ran speech recognition with disfluency-specialized grammars: optional BG or GG, then a whole or partial pronunciation of the target word (for “map”: /m ae p/, /m ey p/, or /m aa p/), then optional BG or GG
Partial-word structure for /m ae p/: m- BG ae- p- BG (each phone optionally followed by background)
Presence of disfluencies is significantly negatively correlated with perceived fluency of children’s speech [Black ‘07]
– e.g., hesitations, repetitions, sounding out the words
Extracted 11 word-level features from the output transcription
– e.g., the number of disfluent phones recognized
– e.g., square root of the total time end-pointed as disfluent speech
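Two of the word-level fluency features can be sketched from a hypothetical transcription of (label, start, end) segments, where partial-word phones carry a trailing “-” as in the m- BG ae- p- BG example above. The segment format and names are assumptions, not the paper’s actual output:

```python
import math

def fluency_features(segments):
    """Count disfluent phones and sqrt of total disfluent time."""
    disfluent = [(lab, s, e) for lab, s, e in segments if lab.endswith("-")]
    disfluent_time = sum(e - s for _, s, e in disfluent)
    return {"n_disfluent_phones": len(disfluent),
            "disfluencies_sqrt_time": math.sqrt(disfluent_time)}
```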
Child-Level Features
Needed to map word-level features to “child-level” features
13 descriptive statistics calculated across words for each child
– Mean, standard deviation, skewness, kurtosis, minimum, minimum location (normalized), maximum, maximum location (normalized), range, lower quartile, median, upper quartile, and interquartile range
Final set of 481 child-level features
– Redundancy in the feature set
– Not obvious which features will be best
Machine learning algorithm needed
– Feature selection to eliminate redundant/irrelevant/noisy features
– Supervised learning to accurately predict the ground-truth overall scores
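The 13 statistics above can be sketched directly. A minimal illustration over one word-level feature (one value per word for a child); skewness and kurtosis are written out to keep the sketch dependency-light:

```python
import numpy as np

def child_level_stats(values):
    """Map one word-level feature (one value per word) to 13 statistics."""
    x = np.asarray(values, dtype=float)
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd                      # standardized values
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    n = len(x)
    return {
        "mean": mu, "std": sd,
        "skewness": (z ** 3).mean(), "kurtosis": (z ** 4).mean(),
        "min": x.min(), "min_loc": x.argmin() / (n - 1),   # normalized position
        "max": x.max(), "max_loc": x.argmax() / (n - 1),
        "range": x.max() - x.min(),
        "lower_quartile": q1, "median": med, "upper_quartile": q3,
        "iqr": q3 - q1,
    }
```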
Learning Methods
Chose to use linear supervised regression techniques
– Simplicity, interpretability, small dataset (42 data points)
– Leave-one-out cross-validation (LOO) to separate train and test sets
– All learning parameters optimized using LOO on each cross-validation train set
Baseline: simple linear regression with each child-level feature, x
3 feature selection methods within the linear regression framework
– Forward selection
– Stepwise linear regression
– Lasso (least absolute shrinkage and selection operator); used the LARS (least angle regression) algorithm to efficiently implement the lasso
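The LOO protocol with the single-feature linear-regression baseline can be sketched as follows; in the paper the inner fit would instead run feature selection plus lasso/LARS on each training fold. A minimal sketch under those assumptions:

```python
import numpy as np

def loo_predictions(x, y):
    """Leave-one-out predictions from a one-feature linear regression."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    preds = np.empty_like(y)
    for i in range(len(y)):                 # hold out child i
        mask = np.arange(len(y)) != i
        slope, intercept = np.polyfit(x[mask], y[mask], 1)  # fit on the rest
        preds[i] = slope * x[i] + intercept # predict the held-out child
    return preds
```

Correlation, mean error, and maximum error between `preds` and the ground-truth scores then give the same three metrics used for the human evaluators.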
Results
Lasso/LR method had a higher correlation than all 3 baselines (p < 0.05)
Lasso/LR method was better than the mean evaluator on all 3 metrics (not significant)

Method (Features)          Corr.   Mean |ε|   Max |ε|
LR (best correctness)      0.783   0.746      2.852
LR (best temporal)         0.757   0.832      2.669
LR (best fluency)          0.586   1.077      3.270
LR (forward 2 features)    0.876   0.616      2.107
LR (forward 3 features)    0.860   0.651      2.187
LR (forward 4 features)    0.827   0.712      2.275
Stepwise LR                0.880   0.604      2.107
Lasso                      0.886   0.815      2.326
Lasso, then LR             …       …          …
Mean human evaluator       0.899   0.624      2.227

Most frequently selected features (across the 42 LOO folds):

Type         Word-level feature               Statistic      # times selected   Mean coeff.
Correctness  acceptable_pron_recognized       Mean           42 / 42             0.803
Temporal     target_word_start_sqrt_time      Upper Quart.   42 / 42            -0.508
Fluency      voiced_disfluencies_sqrt_time    Maximum        24 / 42            -0.379
Fluency      disfluencies_sqrt_time           Upper Quart.   18 / 42            -0.340
Regression Plot: Lasso/LR method
Two-thirds of the predictions had errors less than the average human evaluator’s errors
Error Analysis: Lasso/LR method
3 outliers
Maximum outlier error = 1.971 < mean maximum human error = 2.227
Conclusions & Future Work
Accurately predicted judgments about children’s overall reading ability for one specific reading assessment task
Extracted features correlated with the cues evaluators used
– Pronunciation correctness, speaking rate, and fluency of speech
Used the lasso algorithm to select the most relevant features and linear regression to learn a ground-truth evaluator’s scores
Final model:
– Chose, on average, one feature from each of the three feature classes
– Significantly beat all baseline methods that used single features
– Predicted scores within the mean human error for 28 out of 42 children
Future work:
– Improve feature extraction robustness
– Model/predict the subjective judgments of individual expert evaluators
– Extend to different reading assessment tasks
This work was supported in part by the National Science Foundation.
Thank you! Questions?