Predicting Children’s Reading Ability using Evaluator-Informed Features
Matthew Black, Joseph Tepperman, Sungbok Lee, and Shrikanth Narayanan
Signal Analysis and Interpretation Laboratory (SAIL)
University of Southern California, Los Angeles, CA, USA

“The Big Picture”
Children administered a reading test of isolated words (e.g., map, rip, met, rub, zap, left, must, frog, flip, snack, mop, lot, fell, fine, rope, rake)
Evaluators rated each child’s overall reading ability
– From 1 (“poor”) to 7 (“excellent”)
Children assigned the average score
– Deviation from average also computed
Evaluators listed the high-level cues they used when making judgments
– Informed feature extraction
Pipeline: Feature Extraction → Feature Selection → Supervised Learning

Why?
Reading assessment is important in children’s early education
– Objective and subjective components
Advantages of automatic reading assessment
– Consistent evaluations across different subjects and over time
– Save human evaluators time
Previous work
– Concentrated mostly on detecting errors at the phoneme and word level
– Predicting a child’s overall performance is relatively less researched

Corpus: TBALL Project
Technology-Based Assessment of Language and Literacy (TBALL) Project
– Collected in kindergarten through second grade classrooms in Los Angeles
– Both native English and native Spanish speakers
42 children
– Isolated English word reading task
– Word order identical for each child
– Words presented in increasing order of difficulty
– Number of words per child in this study varied, starting from 10

Human evaluation: Explanation
11 engineering graduate students
– No significant difference between students and experts [Tepperman ’06]
Did not instruct evaluators how to grade or offer examples
– We wanted evaluators to use their own grading criteria
Maintained chronological word order for each child
Randomized child order for each evaluator
Lastly, evaluators listed the criteria they used when judging

Human evaluation: Analysis
Mean pairwise evaluator correlation:
Dependent variable: calculated ground-truth (GT) scores for each child by averaging across all 11 evaluators’ scores
3 metrics to compare each evaluator’s scores with the GT scores, computed between one evaluator and the mean of the other 10 evaluators:
1. Pearson’s correlation
2. Mean absolute error
3. Maximum absolute error

Metric | Mean
Correlation | 0.899
Mean Error | 0.624
Maximum Error |

These evaluator-level numbers are used later to compare against the automatic results
3 most cited cues that the 11 evaluators used
– Pronunciation correctness, speaking rate, and fluency of speech
– Exact grading method unknown, but informs potentially useful features
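As a minimal sketch (not the authors’ actual scripts), the three per-evaluator agreement metrics above could be computed as follows; the array layout and the randomly generated 1–7 scores are illustrative assumptions.

```python
import numpy as np

def evaluator_agreement(scores):
    """scores: (num_evaluators, num_children) array of 1-7 ratings.
    Returns, per evaluator, the Pearson correlation, mean absolute error,
    and maximum absolute error against the mean of the other evaluators."""
    results = []
    for i in range(scores.shape[0]):
        others = np.delete(scores, i, axis=0).mean(axis=0)  # mean of the other evaluators
        corr = np.corrcoef(scores[i], others)[0, 1]         # Pearson's correlation
        abs_err = np.abs(scores[i] - others)
        results.append((corr, abs_err.mean(), abs_err.max()))
    return results

# Illustrative data only: 11 evaluators x 42 children, random 1-7 scores
rng = np.random.default_rng(0)
demo = rng.integers(1, 8, size=(11, 42)).astype(float)
for corr, mean_err, max_err in evaluator_agreement(demo):
    print(f"corr={corr:.3f}  mean|err|={mean_err:.3f}  max|err|={max_err:.3f}")
```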

Features: Preliminary work
Acoustic models
– Trained on 12 hours of held-out children’s word-level speech
– Three-state, 16-Gaussian HMMs
– 38 monophones, plus a background (BG) model for silence and noise and a word-level “garbage” (GG) model
Created 4 dictionaries with various phonemic pronunciations:

Dictionary | Avg. #Entries | Entries for “map”
Acceptable | 1.18 | /m ae p/
Reading Error | 2.13 | /m ey p/
Spanish Confusion | 1.09 | /m aa p/
Recognition | 3.71 | /m ae p/, /m ey p/, /m aa p/

Extracted features correlated with the evaluators’ cues at the word level
– Natural, since children read a list of words in isolation
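A hedged sketch of how the four pronunciation dictionaries might be represented, using the “map” entries from the table above; the Python structure and variable names are assumptions, not the project’s actual lexicon format.

```python
# Word-level pronunciation dictionaries, illustrated with the entries for "map".
# Phones are written in the ARPAbet-like notation used on the slide.
acceptable = {"map": [["m", "ae", "p"]]}            # correct pronunciation(s)
reading_error = {"map": [["m", "ey", "p"]]}          # common letter-to-sound reading errors
spanish_confusion = {"map": [["m", "aa", "p"]]}      # likely Spanish-influenced pronunciations

# The Recognition dictionary is the union of the three, so the recognizer can
# choose among acceptable, reading-error, and Spanish-confusion variants.
recognition = {
    word: acceptable.get(word, []) + reading_error.get(word, []) + spanish_confusion.get(word, [])
    for word in set(acceptable) | set(reading_error) | set(spanish_confusion)
}
print(recognition["map"])  # [['m', 'ae', 'p'], ['m', 'ey', 'p'], ['m', 'aa', 'p']]
```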

Features: Pronunciation correctness
Speech recognition with the Recognition dictionary
– Grammar: Start → BG | GG → { /m ae p/ | /m ey p/ | /m aa p/ } → BG | GG → End
4 forced alignments over the portion end-pointed as the target word
– Acceptable, Reading Error, Spanish Confusion, and GG alignments, giving log-likelihoods LL_acc, LL_read, LL_Span, and LL_GG
Extracted pronunciation correctness features for each word
– 3 binary features (e.g., was the acceptable pronunciation recognized?)
– 10 continuous features (e.g., LL_acc − max{LL_read, LL_Span, LL_GG})
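A hedged sketch of two of the word-level correctness features: one binary feature marking whether the recognizer output an acceptable pronunciation, and the continuous log-likelihood margin LL_acc − max{LL_read, LL_Span, LL_GG}. The function signature, the per-frame normalization, and the choice of exactly these two features are assumptions; the slide lists 3 binary and 10 continuous features in total.

```python
def correctness_features(ll_acc, ll_read, ll_span, ll_gg, recognized_dict):
    """ll_*: forced-alignment log-likelihoods (assumed per-frame normalized)
    for the Acceptable, Reading Error, Spanish Confusion, and GG models.
    recognized_dict: which dictionary the recognized pronunciation came from.
    Returns a small illustrative subset of the correctness features."""
    return {
        # binary: was an acceptable pronunciation recognized?
        "acceptable_pron_recognized": 1.0 if recognized_dict == "acceptable" else 0.0,
        # continuous: margin of the acceptable alignment over the best competitor
        "ll_margin_acc": ll_acc - max(ll_read, ll_span, ll_gg),
    }

print(correctness_features(-42.1, -47.8, -45.3, -50.0, "acceptable"))
```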

Features: Speaking rate
Extracted several continuous word-level temporal features
– Target word start time relative to when the word was first displayed
– Target word total length in time (L_target)
– L_target / number of syllables spoken
– Number of syllables spoken / L_target
– L_target / number of phones spoken
– Number of phones spoken / L_target
Also included the square root of all features listed above
– Resulted in less-skewed distributions
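An illustrative sketch of the temporal feature computation described above; the function signature and field names are assumptions, with times taken from the alignment output and the word-display timestamp.

```python
import math

def temporal_features(word_start, word_end, display_time, n_syllables, n_phones):
    """Times in seconds; syllable and phone counts come from the recognizer output."""
    latency = word_start - display_time        # start time relative to word display
    length = word_end - word_start             # target word length (L_target)
    feats = {
        "latency": latency,
        "length": length,
        "length_per_syllable": length / max(n_syllables, 1),
        "syllables_per_second": n_syllables / max(length, 1e-6),
        "length_per_phone": length / max(n_phones, 1),
        "phones_per_second": n_phones / max(length, 1e-6),
    }
    # Square roots of all features above, which the slide notes gave
    # less-skewed distributions.
    feats.update({f"sqrt_{k}": math.sqrt(max(v, 0.0)) for k, v in list(feats.items())})
    return feats
```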

Features: Fluency of speech
Presence of disfluencies is significantly negatively correlated with perceived fluency of children’s speech [Black ’07]
– e.g., hesitations, repetitions, sounding out the words
Re-ran speech recognition with disfluency-specialized grammars
– Grammar allows whole-word pronunciations (Whole: /m ae p/, /m ey p/, /m aa p/) as well as partial-word pronunciations
– Partial-word structure for /m ae p/: m- BG ae- p- BG (partial phones separated by background)
Extracted 11 word-level features from the output transcription
– e.g., the number of disfluent phones recognized
– e.g., square root of the total time end-pointed as disfluent speech
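A minimal sketch of deriving two disfluency features from the recognized transcription; the token labels (“partial:”, “whole:”, “BG”) and the exact feature pair are assumptions about the recognizer output format, not the paper’s definitions.

```python
import math

def fluency_features(tokens):
    """tokens: list of (label, start, end) from the disfluency-grammar recognition,
    where label is e.g. 'whole:/m ae p/', 'partial:m-', or 'BG'."""
    disfluent = [(s, e) for (label, s, e) in tokens if label.startswith("partial:")]
    total_disfluent_time = sum(e - s for s, e in disfluent)
    return {
        "num_disfluent_phones": float(len(disfluent)),        # partial-word phones recognized
        "disfluencies_sqrt_time": math.sqrt(total_disfluent_time),
    }

# Illustrative output for a child who sounded out "map" before reading it whole
demo = [("partial:m-", 0.00, 0.15), ("BG", 0.15, 0.40),
        ("partial:ae-", 0.40, 0.55), ("whole:/m ae p/", 0.90, 1.35)]
print(fluency_features(demo))
```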

Child-Level Features
Needed to map word-level features to “child-level” features
13 descriptive statistics calculated across words for each child
– Mean, standard deviation, skewness, kurtosis, minimum, minimum location (normalized), maximum, maximum location (normalized), range, lower quartile, median, upper quartile, and interquartile range
Final set of 481 child-level features
– Redundancy in the feature set
– Not obvious which features will be best
Machine learning algorithm needed
– Feature selection to eliminate redundant/irrelevant/noisy features
– Supervised learning to accurately predict ground-truth overall scores
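A hedged sketch of mapping one word-level feature to the 13 child-level descriptive statistics listed above; it assumes the values for a child form a 1-D array and uses scipy for skewness and kurtosis.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def child_level_stats(values):
    """values: word-level feature values for one child, one value per word read."""
    v = np.asarray(values, dtype=float)
    n = max(len(v) - 1, 1)                       # guard for normalized locations
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    return {
        "mean": v.mean(), "std": v.std(ddof=1),
        "skewness": skew(v), "kurtosis": kurtosis(v),
        "min": v.min(), "min_loc": v.argmin() / n,   # normalized position of the minimum
        "max": v.max(), "max_loc": v.argmax() / n,   # normalized position of the maximum
        "range": v.max() - v.min(),
        "lower_quartile": q1, "median": med, "upper_quartile": q3,
        "iqr": q3 - q1,
    }
```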

Learning Methods
Chose to use linear supervised regression techniques
– Simplicity, interpretability, small dataset (42 data points)
– Leave-one-out cross-validation (LOO) to separate train and test sets
– All learning parameters optimized using LOO on each cross-validation train set
Baseline: simple linear regression with each individual child-level feature x
3 feature selection methods within the linear regression framework
– Forward selection
– Stepwise linear regression
– Lasso (least absolute shrinkage and selection operator); used the LARS (least angle regression) algorithm to efficiently implement the lasso
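A minimal sketch of a “lasso, then linear regression” setup under leave-one-out cross-validation, using scikit-learn; the feature matrix X (42 children x 481 features) and scores y are placeholders, and the inner regularization search is a simplification of the paper’s LARS-based tuning, not the authors’ exact procedure.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV, LinearRegression
from sklearn.model_selection import LeaveOneOut

def loo_lasso_then_lr(X, y):
    """Leave-one-out predictions: select features with the lasso (LARS path),
    then refit ordinary linear regression on the selected features."""
    preds = np.zeros(len(y))
    for train, test in LeaveOneOut().split(X):
        lasso = LassoLarsCV(cv=5).fit(X[train], y[train])   # tune the lasso on the training fold
        selected = np.flatnonzero(lasso.coef_)               # features with nonzero coefficients
        if selected.size == 0:
            preds[test] = y[train].mean()                     # fall back to the training mean
            continue
        lr = LinearRegression().fit(X[train][:, selected], y[train])
        preds[test] = lr.predict(X[test][:, selected])
    return preds

# Evaluation against ground truth, e.g.:
# corr = np.corrcoef(preds, y)[0, 1]; mean_abs_err = np.abs(preds - y).mean()
```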

Results
Lasso/LR method had higher correlation than all 3 baselines (p < 0.05)
Lasso/LR method was better than the mean human evaluator on all 3 metrics (not statistically significant)
Methods compared (each evaluated by correlation, mean |ε|, and max |ε|): LR with the single best correctness, temporal, or fluency feature; LR with forward selection of 2, 3, or 4 features; stepwise LR; lasso; lasso followed by LR; and the mean human evaluator

Most frequently selected child-level features (lasso, then LR):

Type | Word-level feature | Statistic | # times selected | Mean coeff.
Correctness | acceptable_pron_recognized | Mean | 42 / 42 |
Temporal | target_word_start_sqrt_time | Upper quartile | 42 / 42 |
Fluency | voiced_disfluencies_sqrt_time | Maximum | 24 / 42 |
Fluency | disfluencies_sqrt_time | Upper quartile | 18 / 42 |

Regression Plot: Lasso/LR method
Two-thirds of predictions had errors less than the average human evaluator errors

Error Analysis: Lasso/LR method
3 outliers
– Maximum outlier error < mean maximum human error = 2.227

Conclusions & Future Work
Accurately predicted judgments about children’s overall reading ability for one specific reading assessment task
Extracted features correlated with the cues evaluators used
– Pronunciation correctness, speaking rate, and fluency of speech
Used the lasso algorithm to select the most relevant features and linear regression to learn a ground-truth evaluator’s scores
Final model:
– Chose, on average, one feature from each of the three feature classes
– Significantly beat all baseline methods that used single features
– Predicted scores within the mean human error for 28 out of 42 children
Future work:
– Improve feature extraction robustness
– Model/predict the subjective judgments of individual expert evaluators
– Extend to different reading assessment tasks

This work was supported in part by the National Science Foundation.
Thank you! Questions?