Detecting Prosody Improvement in Oral Rereading

Presentation transcript:

Detecting Prosody Improvement in Oral Rereading
Minh Duong and Jack Mostow, Project LISTEN (www.cs.cmu.edu/~listen), Carnegie Mellon University
The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A080628. The opinions expressed are those of the authors and do not necessarily represent the views of the Institute or the U.S. Department of Education.

Motivation
A tutor should be able to detect improvement in:
- What? Oral reading rate, and prosody (rhythm, intensity, intonation, pauses)
- When? In reading new texts [our AIED 2009 paper], or in rereading the same text [this SLaTE 2009 paper]
Why focus on rereading?
- Improvement is faster than on new text
- Improvement is easier to detect, because controlling for text reduces variance: rereading eliminates variance due to differences in text
- Rereading is a popular, effective form of practice
This work: detect prosody improvement in rereading.

A real example
Text: "Don't you like surprises?"
Recordings (on the original slide, clicking each icon played the recording and showed its prosodic contour abstracted to the word level):
- A second grader's 1st reading
- Her 2nd reading, 5 days later
- Adult narration
The second reading resembled the adult narration much more closely than the first. (The child was at grade level according to paper tests, and she had not heard the adult narration before reading the text.)

Prosodic attributes
- Duration: how long the word takes
- Latency: how long before starting the word
- Production: how long spent saying the word itself
- Mean pitch: how high
- Mean intensity: how loud
To quantify prosody, we looked at various aspects of students' readings: duration, latency, production, pitch, and intensity. For the temporal attributes (duration, latency, and production), we also computed per-letter normalized versions, because duration had a strong linear correlation with word length, as shown in the duration vs. word-length graph (see the note at the end of this transcript).

Raw features
From those basic attributes of prosody, we computed the following raw features for each sentence reading:
- Raw duration, latency, and production: absolute and normalized by word length, averaged within the sentence
- Pause frequency: percent of words with latency > 10 ms (or rejected)
- Pitch variation: standard deviation of word pitch within the sentence
(A sketch of this computation follows below.)
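A rough sketch of how such sentence-level raw features could be computed, assuming a hypothetical word-level table with columns for duration, latency, production (in seconds), letter count, and pitch; the column names and details are illustrative, not taken from the Reading Tutor's actual code.

```python
import pandas as pd

def raw_sentence_features(words: pd.DataFrame) -> pd.Series:
    """Aggregate hypothetical word-level measurements into sentence-level raw features."""
    feats = {}
    for attr in ["duration", "latency", "production"]:
        feats[f"raw_{attr}"] = words[attr].mean()                                # absolute, averaged over words
        feats[f"raw_norm_{attr}"] = (words[attr] / words["n_letters"]).mean()    # per-letter normalized
    # Fraction of words whose latency exceeds the 10 ms threshold from the slide
    # (rejected words would also count as pauses in the slide's definition; not modeled here).
    feats["pause_frequency"] = (words["latency"] > 0.010).mean()
    # Standard deviation of word pitch within the sentence
    feats["pitch_variation"] = words["pitch"].std()
    return pd.Series(feats)

# Usage (column names are assumptions):
# word_table.groupby(["student", "sentence_id", "reading_index"]).apply(raw_sentence_features)
```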

Correlational features
- Expressive reading reflects comprehension
- Expressive readers' prosody resembles adults' [Schwanenflugel et al. '04, '06, '08; Mostow & Duong AIED '09]
Motivated by these findings, we correlate child and adult prosodic contours. For each prosodic attribute (latency, pitch, intensity, ...), represented as its sequence of values for the words in the same sentence, we computed a correlational feature for each sentence reading as the Pearson correlation coefficient between the child's and the adult's sequence of word-level values. E.g., in the example above, the pitch correlation increases from 0.37 to 0.90.
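A minimal sketch of one correlational feature, assuming `child_values` and `adult_values` hold the word-level values of a single prosodic attribute (e.g., pitch) over the words of the same sentence; the variable names are illustrative.

```python
import numpy as np

def contour_correlation(child_values, adult_values):
    """Pearson correlation between child and adult word-level contours for one sentence."""
    child = np.asarray(child_values, dtype=float)
    adult = np.asarray(adult_values, dtype=float)
    if len(child) < 2 or np.std(child) == 0 or np.std(adult) == 0:
        return np.nan  # correlation is undefined for too-short or flat contours
    return np.corrcoef(child, adult)[0, 1]

# e.g., for the pitch contours of the example sentence, the slide reports
# a correlation of 0.37 for the first reading and 0.90 for the second.
```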

Data source: Project LISTEN's Reading Tutor
The Reading Tutor listens to children read aloud and responds with spoken and graphical feedback, both on its own initiative and to the child's help requests. See the Videos page at www.cs.cmu.edu/~listen.

Data set: 29,794 sentence rereadings
- 164 students in grades 2-4 used the Reading Tutor at school in 2005-2006
- They read 4,901 distinct sentences, for 77,693 sentence readings in all, of which 38% were rereadings
- Students averaged 132 rereadings each (range 1-1,891) and reread those sentences 1.66 times on average

Analyses
In our analyses, we sought to answer three research questions:
1. Which features were most sensitive to gains?
2. Can we detect learning?
3. Which features detected individual gains?

1. How to measure feature sensitivity?
We quantified feature sensitivity as the median per-student effect size for a sentence-level feature (e.g., correlation with adult pitch), starting at the sentence level:
- Sentence-level gain: the change in the feature from the 1st to the 2nd reading of the same sentence. E.g., Mary's pitch correlation rises from 0.37 to 0.90, a gain of +0.53; on another sentence her gain is -0.05.
- Per-student effect size: the mean of that student's gains divided by their standard deviation (mean/SD). E.g., Mary's gains {+0.53, -0.05, ...} yield an effect size of 0.08; Joe's yield 0.12.
- Feature sensitivity: the median of the per-student effect sizes, e.g., the median of {0.08, 0.12, ...} is 0.10.
[Note in case someone asks: effect size is an aggregate property; z-scores are not.]
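A minimal sketch of this computation, assuming a hypothetical DataFrame `gains` with one row per (student, sentence) pair and one column per feature holding that sentence's first-to-second-reading gain; the names are illustrative, not from the original analysis code.

```python
import pandas as pd

def median_effect_size(gains: pd.DataFrame, feature: str) -> float:
    """Per-student effect size = mean(gains) / SD(gains); sensitivity = median across students."""
    per_student = gains.groupby("student")[feature].agg(
        lambda g: g.mean() / g.std() if g.std() > 0 else float("nan")
    )
    return per_student.median()

# e.g., Mary's gains {+0.53, -0.05, ...} -> effect size 0.08, Joe's -> 0.12;
# the median over all students (0.10 on the slide) is the feature's sensitivity.
```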

1. Which features were most sensitive?
To find the most sensitive features, we applied the method just described to our dataset. [Chart: median effect size ± 95% confidence interval for each feature; annotated results on the next slide.]

1. Temporal features were most sensitive
[Chart: median effect size ± 95% confidence interval per feature; * marks medians significantly > 0 at the 0.05 level. Features in order: raw duration*, raw norm duration*, raw norm production*, pause frequency*, raw production*, raw norm latency*, raw latency*, correl duration*, correl production*, correl latency*, correl norm latency*, correl norm duration*, pitch variation*, correl norm production*, correl pitch, correl intensity.]
Temporal features, especially the raw ones, were most sensitive, as measured by the median effect size of mean student gain. Why? Correlational features depend not only on the student's speech but also on the particular adult's, which adds variability and measurement noise; they are useful for assessing the quality of students' prosody, but raw features are more sensitive to its improvement. We computed the 95% confidence intervals around the median effect sizes by bootstrapping, assuming normality among the medians of 100 resamples (sampled with replacement).
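A sketch of the bootstrap interval described in the note above, under stated assumptions: resample the per-student effect sizes with replacement 100 times, take the median of each resample, and build a normal-approximation 95% interval from the spread of those medians. The resampling unit and the exact interval construction are assumptions.

```python
import numpy as np

def bootstrap_median_ci(effect_sizes, n_boot=100, seed=0):
    """Normal-approximation 95% CI for the median per-student effect size."""
    rng = np.random.default_rng(seed)
    es = np.asarray(effect_sizes, dtype=float)
    medians = np.array([
        np.median(rng.choice(es, size=len(es), replace=True)) for _ in range(n_boot)
    ])
    center = np.median(es)
    half_width = 1.96 * medians.std(ddof=1)  # normality assumed among the bootstrap medians
    return center - half_width, center + half_width
```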

2. Can we detect learning?
Do gains indicate learning, or just recency effects? We split rereadings into:
- Rereading on the same day: the gain could be a short-term recency effect
- Rereading on a later day: actual "learning"
We then computed effect sizes as before. ("Learning" here does not necessarily mean generalizing to new contexts; it might just be memorization, but at least it is retained overnight.)
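A sketch of the same-day vs. later-day split, assuming each reading record carries the date it was read and that each rereading is paired with the immediately preceding reading of the same sentence by the same student; the field names and the pairing rule are assumptions, not taken from the original analysis.

```python
import pandas as pd

def split_rereadings(readings: pd.DataFrame):
    """Split rereadings into same-day (recency) and later-day (learning) cases."""
    readings = readings.sort_values(["student", "sentence_id", "date"]).copy()
    readings["prev_date"] = readings.groupby(["student", "sentence_id"])["date"].shift(1)
    rereads = readings.dropna(subset=["prev_date"])              # keep only rereadings
    same_day = rereads[rereads["date"] == rereads["prev_date"]]  # possible recency effect
    later_day = rereads[rereads["date"] > rereads["prev_date"]]  # retained overnight: "learning"
    return same_day, later_day
```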

2. Which features detected learning?
[Chart: median effect size of later-day (learning) gains per feature; annotated results and notes on the next slide.]

2. Temporal features detected learning
[Chart: median effect size of later-day (learning) gains per feature, sorted in the same order as the overall-sensitivity graph; * marks features significantly > 0 at the 0.05 level (one-tailed t-test that learning > 0): raw duration*, raw norm duration*, raw norm production*, pause frequency*, raw production*, raw norm latency*, raw latency*, correl duration*, correl production*, correl latency, correl norm latency*, correl norm duration*, pitch variation, correl norm production, correl pitch, correl intensity.]
Again, temporal features were more sensitive to later-day improvement (learning).

2. How did learning compare with recency effects?
[Chart: same-day (recency) vs. later-day (learning) effect sizes per feature; annotated results on the next slide.]

2. Recency was ~2x learning where they differed significantly
[Chart: same-day (recency) vs. later-day (learning) effect sizes per feature; * marks features where recency > learning at the 0.05 level (one-tailed paired t-test for each feature, pairing each student's effect sizes for the two cases): raw duration*, raw norm duration*, raw norm production*, pause frequency*, raw production*, raw norm latency, raw latency, correl duration, correl production, correl latency, correl norm latency, correl norm duration, pitch variation, correl norm production, correl pitch, correl intensity.]
Recency effects were almost twice as large as learning effects for the features where recency was significantly greater than learning.
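A minimal sketch of that test, assuming two hypothetical arrays that hold, for one feature, each student's same-day (recency) and later-day (learning) effect size in matching order; requires SciPy >= 1.6 for the `alternative` argument.

```python
import numpy as np
from scipy import stats

def recency_exceeds_learning(recency_es, learning_es):
    """One-tailed paired t-test of H1: recency effect sizes > learning effect sizes."""
    recency = np.asarray(recency_es, dtype=float)
    learning = np.asarray(learning_es, dtype=float)
    t, p = stats.ttest_rel(recency, learning, alternative="greater")
    return t, p
```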

3. Which features detected individual gains?
The statistical significance tests described above are for overall differences between features or between same-day and later-day rereading. What about for individuals? Are any features sensitive enough to detect improvements that are statistically reliable for particular students? To answer, we carried out a one-tailed paired t-test on each student's set of feature values, repeated for each feature, and measured what % of students gained significantly from one reading to the next:
- Overall
- When rereading on the same day ("recency effect")
- When rereading on a later day ("learning")
(A sketch of the per-student test follows below.)
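A sketch of the per-student test referenced above, under stated assumptions: `first_vals` and `second_vals` are one student's feature values for the first and second readings of the same sentences, in matching order, and `higher_is_better` encodes the direction of improvement (an increase for correlational features, a decrease for durations and latencies); requires SciPy >= 1.6.

```python
from scipy import stats

def student_gain_is_significant(first_vals, second_vals, higher_is_better=True, alpha=0.05):
    """One-tailed paired t-test: did this student improve significantly on this feature?"""
    alternative = "greater" if higher_is_better else "less"
    t, p = stats.ttest_rel(second_vals, first_vals, alternative=alternative)
    return p < alpha
```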

3. Which features detected individual gains? Overall
[Chart: % of students with significant overall gains per feature, in the same feature order as before.]
Features that had greater effect sizes also tended to have a higher % of students with significant improvement.

3. Which features detected individual gains? Same day vs. later day
[Chart: % of students with significant same-day vs. later-day gains per feature, in the same feature order.]
Later-day results didn't lag behind same-day results as much as when computing median effect sizes. Raw normalized duration was best for later-day gains (learning).

3. How does the significance of individual gains depend on the amount of data?
Using raw normalized duration (the most sensitive feature overall), we analyzed how the amount of data affects the significance of individual gains: we ranked the students by their amount of data, split them into 10 bins of equal size, and computed the % of students with significant gains in each bin. (Note: fluent students might not read as much, and so have less data.)
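A sketch of that binning analysis, assuming a hypothetical per-student DataFrame with a count of rereadings and a flag for whether that student's gain was significant; both column names are illustrative.

```python
import pandas as pd

def percent_significant_by_amount_of_data(students: pd.DataFrame, n_bins: int = 10) -> pd.Series:
    """Rank students by amount of data, split into equal-size bins, report % significant per bin."""
    ranked = students.sort_values("n_rereadings")
    # Rank with ties broken by order so qcut can form bins of (nearly) equal size.
    bins = pd.qcut(ranked["n_rereadings"].rank(method="first"), q=n_bins, labels=False)
    return ranked.groupby(bins)["significant"].mean() * 100
```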

Conclusions
- Which features were most sensitive to gains? Temporal features, especially the raw ones.
- Did we detect learning, not just recency effects? Yes, but recency was ~2x stronger where the two differed.
- Did we reliably detect individual gains?
  - Yes, for ~40% of all students
  - No, for students who reread fewer than 21 sentences
  - Yes, for 60% of students who reread 336+ sentences

Limitations and future work
Why didn't pitch or intensity features detect gains?
- Less sensitive to growth than, e.g., word decoding?
- Measurement error, e.g. in pitch tracking?
- Within-student variance, e.g. mic position or noise?
- Across-sentence variance, e.g. adult narration?
Future work:
- Extend to work without adult narrations?
- Extend to detect gains in reading new text?

Thank you! Questions?

[Graph note, referenced from the "Prosodic attributes" slide: long words are slower, averaged over hundreds of thousands of words recognized by the Reading Tutor.]