Using exemplar responses for training and evaluating automated speech scoring systems
Anastassia Loukina, Klaus Zechner, James Bruno, Beata Beigman Klebanov

Context
- English language proficiency tests often include a speaking section.
- Tasks come in different types: read aloud, sentence repeat, picture description, dialogue tasks, …
- This talk: tasks that elicit about 1 minute of spontaneous speech.
- The response is recorded and sent for scoring.

Automated speech scoring
- The recorded response is processed by an ASR system and by other signal processing (pitch, amplitude, formants).
- Feature computation: number of pauses, similarity to a native model, vocabulary complexity, grammatical complexity, etc.
- Filtering model: a set of rules and classifiers that identifies non-scoreable responses.
- Scoring model: a machine learning algorithm trained on human scores.
- Output: a score, or NaN for non-scoreable responses (a sketch of the pipeline follows below).
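
A minimal Python sketch of how these stages could be wired together. Every object and method name here (`recognize`, `signal_features`, `response_features`, `is_scoreable`, `predict`) is a hypothetical placeholder for illustration, not the actual ETS implementation.

```python
import math

def score_response(audio, asr, feature_extractor, filtering_model, scoring_model):
    """Minimal sketch of the scoring pipeline on this slide.

    The flow is: ASR + signal processing -> feature computation ->
    filtering model -> scoring model -> score (or NaN).
    """
    # Speech recognition plus low-level signal measurements
    # (pitch, amplitude, formants).
    transcript = asr.recognize(audio)
    signal_measurements = feature_extractor.signal_features(audio)

    # Response-level features: number of pauses, similarity to a native
    # model, vocabulary complexity, grammatical complexity, etc.
    features = feature_extractor.response_features(transcript, signal_measurements)

    # Rules and classifiers that flag non-scoreable responses.
    if not filtering_model.is_scoreable(audio, transcript, features):
        return math.nan

    # ML model trained on human scores produces the final score.
    return scoring_model.predict(features)
```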

Problem
- Obtaining reliable human scores for constructed responses is hard:
  - Response factors: a multidimensional production is mapped to a single scale.
  - Rater factors.
- Ensuring reliability for spoken responses is even harder.
- Human-human agreement at the level of the individual response is relatively low (r = 0.55-0.65).

Solution
Human scoring:
- Test-takers answer several questions.
- Responses from the same test taker are scored by different raters.
- The final aggregated score is highly reliable (r > 0.9).
Automated scoring:
- Engines are trained using human scores at the response level.
- Previous studies suggest that removing low-agreement (“hard”) cases from the training set may improve system performance (Beigman Klebanov & Beigman 2014; Jamison & Gurevych 2015).
Research question: can we improve system performance by training on clean data?

Data
Main corpus:
- Randomly sampled from responses to a large-scale language proficiency assessment.
- 6 different types of questions eliciting spontaneous speech (1,140 different questions).
- 683,694 responses scored on a scale of 1-4.
- 8.5% double scored: r = 0.59.
Exemplar corpus:
- Exemplar responses from the same assessment, selected for rater training and monitoring.
- Same 6 types of questions (800 different questions).
- Only includes responses where multiple experts agree on the same score.
- 16,257 responses, sampled to obtain the same score distribution as in the main corpus.
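
The last step of building the exemplar set, sampling so that its score distribution matches the main corpus, could look roughly like the pandas sketch below; the data frames and the `score` column name are assumptions made for illustration.

```python
import pandas as pd

def sample_matching_distribution(exemplar_pool: pd.DataFrame,
                                 main_corpus: pd.DataFrame,
                                 n_total: int,
                                 score_col: str = "score",
                                 seed: int = 0) -> pd.DataFrame:
    """Sample exemplar responses so that their score distribution
    matches the main corpus (column name and frames are illustrative)."""
    # Proportion of each score level (1-4) in the main corpus.
    target_props = main_corpus[score_col].value_counts(normalize=True)

    parts = []
    for score_level, prop in target_props.items():
        pool = exemplar_pool[exemplar_pool[score_col] == score_level]
        n_needed = int(round(prop * n_total))
        # Sample without replacement, capped at what is available.
        parts.append(pool.sample(n=min(n_needed, len(pool)), random_state=seed))
    return pd.concat(parts).reset_index(drop=True)
```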

Model building
- 70 features extracted for each response:
  - Delivery: fluency, pronunciation, prosody.
  - Language use: vocabulary, grammar.
- 7 different machine learning algorithms: OLS, Elastic Net, Random Forest, multilayer perceptron regressor, …
- Separate models trained for each of the 6 question types.
- Train/test partitions: Main train, Exemplar train, Main test, Exemplar test (a sketch of this setup follows below).
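
A minimal scikit-learn sketch of the experimental setup: several regressors trained on one corpus and evaluated (Pearson r against human scores) on both test partitions. The feature matrices and score vectors are hypothetical inputs, and the learners and hyperparameters shown are only stand-ins for the seven algorithms used in the study.

```python
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

# Stand-ins for the learners used in the study (hyperparameters illustrative).
LEARNERS = {
    "OLS": LinearRegression(),
    "ElasticNet": ElasticNet(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=200, random_state=0),
    "MLP": MLPRegressor(hidden_layer_sizes=(50,), max_iter=1000, random_state=0),
}

def evaluate(train, test_sets, learners=LEARNERS):
    """Train each learner on one corpus and report Pearson r on each test set.

    `train` is an (X, y) pair of features and human scores; `test_sets`
    maps a name (e.g. "main", "exemplar") to an (X, y) pair.
    """
    X_train, y_train = train
    results = {}
    for learner_name, model in learners.items():
        model.fit(X_train, y_train)
        for test_name, (X_test, y_test) in test_sets.items():
            r, _ = pearsonr(y_test, model.predict(X_test))
            results[(learner_name, test_name)] = round(r, 3)
    return results

# Example usage with hypothetical arrays for one question type:
# evaluate((X_main_train, y_main_train),
#          {"main": (X_main_test, y_main_test),
#           "exemplar": (X_exemplar_test, y_exemplar_test)})
```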

Model performance: within corpus
- Models trained and evaluated on exemplar responses consistently outperform those trained and evaluated on the main corpus (main: r = 0.66; exemplar: r = 0.79).
- There is no major difference between the different learners.

Model performance: across corpora
- Training on exemplar responses does not improve performance on the main corpus (trained on Main: r = .66; trained on Exemplar: r = .64).
- Training on the main corpus does not degrade performance on exemplar responses (trained on Exemplar: r = .79; trained on Main: r = .8).

Difference in N in the training set
Main corpus:
- Randomly sampled from responses to a large-scale language proficiency assessment.
- 683,694 responses.
Exemplar corpus:
- Only includes responses where multiple experts agree on the same score.
- 16,257 responses, sampled to obtain the same score distribution as in the main corpus.
Main* corpus:
- Randomly sampled from the training partition of the Main corpus.
- Same N as in the training partition of the exemplar corpus: 12,398 responses.

Main* vs. Exemplar
- There is no difference when models are evaluated on the main corpus (trained on Main: r = .66; Exemplar: r = .64; Main*: r = .64).
- Training on exemplar responses leads to better performance on exemplar responses than training on the equally sized Main* sample (Exemplar: r = .79; Main*: r = .77; full Main: r = .8).
- The large training set gives the best performance.

Modelling the differences (N = 4,686,507)

Error² ~ learner + train_set * test_set + (1 | response) + (1 | model)

A linear mixed-effects model predicts the squared scoring error from learner, training set, and test set (with their interaction), with random intercepts for response and model.

Term                                  Coefficient
Intercept                             0.291***
test_set.Exemplar                     -0.105**
train_set.main                        -0.014***
train_set.main*                       -0.002*
learner.HuberRegressor                -0.001***
learner.MLPRegressor
learner.ElasticNet                    0.002***
learner.GradientBoostingRegressor     0.003***
learner.LinearSVR                     0.007***
learner.RandomForestRegressor         0.008***
train_set.main : test_set.Exemplar    0.016**
train_set.main* : test_set.Exemplar   0.018**
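
A sketch of fitting the same crossed random-effects formula from Python with pymer4 (a wrapper around R's lme4); the choice of library, the long-format layout, and all column names are assumptions rather than the authors' actual tooling.

```python
import pandas as pd
from pymer4.models import Lmer  # wrapper around R's lme4 (assumed available)

def fit_error_model(df: pd.DataFrame):
    """Fit Error^2 ~ learner + train_set * test_set + (1|response) + (1|model).

    `df` is a hypothetical long-format frame with one row per prediction
    and columns 'error2', 'learner', 'train_set', 'test_set',
    'response' (response id) and 'model' (model id).
    """
    model = Lmer(
        "error2 ~ learner + train_set * test_set + (1|response) + (1|model)",
        data=df,
    )
    model.fit()          # estimates via lme4::lmer
    print(model.coefs)   # fixed-effect estimates, as in the table above
    return model
```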

What if we had more exemplar responses?
- Training on exemplar responses has an advantage only when the training set is very small.
- The advantage shrinks as the training set grows and disappears once the training set is sufficiently large.
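
The comparison on this slide amounts to a learning curve: performance on a fixed test set as the training data is subsampled to increasing sizes. A rough sketch follows, with illustrative variable names and an arbitrary choice of Elastic Net as the learner.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import ElasticNet

def learning_curve(X_train, y_train, X_test, y_test,
                   sizes=(500, 1000, 2000, 5000, 10000), seed=0):
    """Pearson r on a fixed test set as the training set grows.

    Running this once on the exemplar training data and once on an
    equally sized random sample of the main corpus gives the kind of
    comparison shown on this slide (inputs are hypothetical arrays).
    """
    rng = np.random.default_rng(seed)
    curve = []
    for n in sizes:
        idx = rng.choice(len(y_train), size=min(n, len(y_train)), replace=False)
        model = ElasticNet(alpha=0.1).fit(X_train[idx], y_train[idx])
        r, _ = pearsonr(y_test, model.predict(X_test))
        curve.append((n, round(r, 3)))
    return curve
```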

Do the models generate different predictions?
- Different learners trained on the same data set: r = .97 (min r = .92).
- The same learner trained on different data sets: r = .98 (min r = .95).
- Different learners trained on different corpora produce essentially the same predictions.
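
The agreement numbers on this slide can be computed as pairwise Pearson correlations between the prediction vectors of the different trained models; a small sketch with hypothetical inputs:

```python
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

def prediction_agreement(predictions):
    """Pairwise Pearson correlations between prediction vectors.

    `predictions` maps a label such as (learner, train_set) to a numpy
    array of predicted scores for the same test responses (hypothetical
    inputs). Returns (mean r, min r) over all pairs, the two summary
    numbers reported on this slide.
    """
    rs = [pearsonr(predictions[a], predictions[b])[0]
          for a, b in combinations(predictions, 2)]
    return float(np.mean(rs)), float(np.min(rs))
```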

Conclusions
- As long as there is enough training data, it does not matter whether you train on exemplar responses or on a random sample, at least if the response-level noise is random.
- The choice of evaluation set (exemplar vs. random) has a major effect on the estimates of model performance.
- Unless the number of available responses is really small (~1K), the cost of creating an exemplar corpus is likely to outweigh the benefits.
- "Cleaning up" the evaluation set or collecting a larger set of training responses is likely to be more useful.

Thank you!

Further error analysis (Main corpus)
Sources of error for the 80 responses with the highest scoring error:
- 30%: noise in the human labels.
- 22.5%: errors in the pipeline (inaccurate ASR).
- 30%: responses differentiated along dimensions not measured by the features.