The Reliability of Formant Measurements in High Quality Audio Data: The Effect of Agreeing Measurement Procedures
Martin Duckworth, Kirsty McDougall, Gea de Jong, Linda Shockey

Introduction
Formant measurement is implicitly a legal requirement in UK speaker comparison cases
Measurements on analogue spectrograms had to be made by hand and eye
Measurements on digital spectrograms can be assisted by formant trackers; LPC tracking is common

Introduction
How replicable are measurements by eye on digital spectrograms?
If LPC tracking is used, what can lead to variability?
−Software settings
−Point at which data is extracted

Study Aims
What is required in order to make measurements more replicable?
If software (but not method) is held constant and the data is high quality, can different laboratories make the same F1-F3 measurements?
If the method of analysis is the same, does this lead to a statistically significant improvement in reliability between laboratories?

Aims continued
We are aiming to find a reliable means of obtaining formant values
We are examining reliability, not validity

Data
read speech from the Cambridge DyViS database
male Standard Southern British English aged speakers:
Set 1 (20 speakers)
Set 2 (20 speakers)

Data
6 monophthongs: /iː, æ, ɑː, ɔː, ʊ, uː/
6 repetitions per vowel per speaker
elicited in hVd contexts in sentences:
It’s a warning we’d better HEED today.
It’s only one loaf, but it’s all Peter HAD today.
We worked rather HARD today.
We built up quite a HOARD today.
He insisted on wearing a HOOD today.
He hates contracting words, but he said a WHO’D today.

Measurements
Analysts from 3 labs: Cambridge, Plymouth, Reading
Task: to measure F1, F2, F3 for each vowel token using Praat
Set 1: using individual, but constrained, methods
Set 2: after a meeting at which a single method is agreed

Set 1 Methods
Measure the formants at a relatively early point in the vowel
Measure formants over no more than 5 glottal pulses
Use either:
−LPC tracking checked against the spectrogram, or
−hand/eye measures

Set 2 Method
Measure towards the start of the vowel
Measure in a relatively steady early part of the vowel
Measure around the vowel's maximum intensity
Use a single time slice

Set 2 Method (continued)
Use the LPC formant tracker adjusted for best visual fit
When values generated by Praat are judged by visual inspection to be incorrect, replace them with correct values from a time slice immediately preceding or following the slice being measured.

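The Set 2 steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure the labs ran in Praat: the names `pick_slice`, `measure` and `early_fraction` are hypothetical, the "early part" of the vowel is approximated as a fixed fraction of its frames, steadiness is not modelled, and a value the analyst has rejected by visual inspection is represented as `None`.

```python
from typing import Optional, Sequence


def pick_slice(intensity: Sequence[float], early_fraction: float = 0.5) -> int:
    """Return the index of the highest-intensity frame in the early part of the vowel.

    early_fraction is an assumed stand-in for "towards the start of the vowel".
    """
    if not intensity:
        raise ValueError("empty intensity track")
    n_early = max(1, int(len(intensity) * early_fraction))
    return max(range(n_early), key=lambda i: intensity[i])


def measure(track: Sequence[Optional[float]], index: int) -> Optional[float]:
    """Return the value at the chosen slice, falling back to a neighbouring slice.

    None marks a tracker value judged incorrect by visual inspection; the
    immediately preceding slice is tried first, then the following one.
    """
    for i in (index, index - 1, index + 1):
        if 0 <= i < len(track) and track[i] is not None:
            return track[i]
    return None


# Hypothetical frame-by-frame values for one vowel token.
intensity_db = [60.0, 68.0, 72.0, 71.0, 69.0, 65.0, 58.0, 50.0]
f3_track_hz = [2500.0, 2510.0, None, 2530.0, 2540.0, 2550.0, 2560.0, 2570.0]

slice_index = pick_slice(intensity_db)
f3_hz = measure(f3_track_hz, slice_index)
print(slice_index, f3_hz)
```

Recording `slice_index` alongside the value is what lets another analyst revisit exactly the same time slice.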
Results: HAD, F1
[Figure: F1 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2]

Statistical Analysis
3 formants × 6 vowels × 2 datasets = 36 tests
Two-way ANOVA
−repeated measures on the factor Lab (3)
−between-groups factor Speaker (20)
If Lab is significant at p < 0.05: pairwise comparisons with Sidak correction

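The arithmetic behind the test count and the Sidak correction can be checked with a short Python sketch. The function names are illustrative. The slides do not say whether the statistics package adjusted the p-values or the threshold, so both equivalent forms are shown.

```python
# Number of separate ANOVAs: one per formant, vowel and dataset.
N_FORMANTS, N_VOWELS, N_DATASETS = 3, 6, 2
n_tests = N_FORMANTS * N_VOWELS * N_DATASETS


def sidak_alpha(alpha: float, m: int) -> float:
    """Per-comparison threshold that keeps the family-wise error rate at alpha."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


def sidak_adjust_p(p: float, m: int) -> float:
    """Sidak-adjusted p-value for one of m comparisons."""
    return 1.0 - (1.0 - p) ** m


# Three labs give three pairwise comparisons: 1 vs 2, 1 vs 3, 2 vs 3.
n_pairs = 3
print(n_tests)                     # 36
print(sidak_alpha(0.05, n_pairs))  # roughly 0.017
```

With three pairwise comparisons, an unadjusted p-value must fall below about 0.017 for the pair to count as significant at the 0.05 family-wise level.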
Results: HAD, F1
[Figure: F1 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2]
Set 1: Lab significant
Set 2: Lab significant, but pairwise comparisons NS

Results: HAD, F2
[Figure: F2 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2]
Lab: not significant (Set 1 NS, Set 2 NS)

Results: HAD, F3
[Figure: F3 of HAD as measured by Lab1, Lab2 and Lab3, for Set 1 and Set 2]
Set 1: Lab significant
Set 2: Lab not significant (NS)

Summary - HAD
Set 1: F1 F2 F3; Set 2: F1 F2 F3
Lab (main effect): sig NS sig NS
Pairwise comparisons:
1 vs 2: sig NS
1 vs 3: sig NS sig NS
2 vs 3: sig NS sig NS
improvement
Set 2: good news

Effect of Lab - 6 vowels
Set 1: F1 F2 F3
heed: sig NS sig
had: sig NS sig
hard: sig
hoard: sig
who’d: sig NS
hood: sig

Effect of Lab - 6 vowels
Set 1: F1 F2 F3; Set 2: F1 F2 F3
heed: sig NS sig NS sig
had: sig NS sig NS
hard: sig NS sig
hoard: sig NS
who’d: sig NS sig
hood: sig NS sig NS

Influence of Speaker
Interaction Lab x Speaker significant (p < 0.05) for F1-F3 of all 6 vowels for both Set 1 and Set 2
certain speakers lead to measurement differences among labs
for example…

F3 of HARD (Set 2): means by speaker
Agreement across labs in most cases, but certain individuals lead to measurement differences among labs

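One way to find the individuals behind a Lab x Speaker interaction is to compare each lab's mean per speaker and flag speakers where the labs diverge. The Python sketch below shows that idea only. The speaker IDs, the values in hertz and the 100 Hz threshold are invented for illustration and are not taken from the study.

```python
from statistics import mean
from typing import Dict, List

# speaker -> lab -> token measurements in Hz (hypothetical layout)
Measurements = Dict[str, Dict[str, List[float]]]


def lab_spread(data: Measurements) -> Dict[str, float]:
    """For each speaker, the range (max minus min) of the per-lab mean values."""
    spread = {}
    for speaker, by_lab in data.items():
        lab_means = [mean(values) for values in by_lab.values()]
        spread[speaker] = max(lab_means) - min(lab_means)
    return spread


def flag_speakers(data: Measurements, threshold_hz: float) -> List[str]:
    """Speakers whose per-lab means differ by more than threshold_hz."""
    return sorted(s for s, r in lab_spread(data).items() if r > threshold_hz)


# Invented F3 values: the labs agree on S01 but not on S02.
f3_hard = {
    "S01": {"Lab1": [2500.0, 2520.0], "Lab2": [2510.0, 2530.0], "Lab3": [2505.0, 2515.0]},
    "S02": {"Lab1": [2400.0, 2420.0], "Lab2": [3100.0, 3120.0], "Lab3": [2410.0, 2430.0]},
}

print(lab_spread(f3_hard))
print(flag_speakers(f3_hard, threshold_hz=100.0))
```

Flagged speakers are the ones whose spectrograms are worth reviewing jointly, as in the difficult cases that follow.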
Difficult cases: subject 42 F3
Subject 42 HARD2 F3 = 2579 Hz
Subject 42 HARD4 F3 = 2219 Hz
Subject 42 HARD6 F3 = 3325 Hz

Difficult cases: subject 43 F3
Subject 43 HARD1 F3?
Subject 43 HARD2 F3?
Visual inspection vs formant tracker

The effect of intraspeaker variability, possibly voice quality
This can affect:
−The visibility of formants
−The functioning of the LPC tracker
for example…

The effect of intraspeaker variability
Subject 37: HAD1 F1 = ??
Subject 37: HAD6 F1
..had today.

Discussion: Laboratory Effects
Do different laboratories produce different formant values? YES
Does replicating the measurement method reduce these differences? YES
Could these be reduced further? YES

Other sources of variability
Settings (e.g. No. of poles; No. of formants in Praat)
The exact point in the vowel at which the measure is taken
The ‘readability’ of the spectrogram, which can be affected by speaker characteristics

Conclusion
Developing standard ways of collecting formant values could assist comparisons between experts in casework
If records are kept relating to time points, software and settings, then the measurement process can be replicated

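A record of time point, software and settings could be kept per measurement, for instance as below. This Python sketch is one possible layout, not a format proposed in the talk: the class name, field names, settings keys and all values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class MeasurementRecord:
    """Everything another analyst would need to repeat one measurement."""
    speaker: str
    token: str
    time_s: float              # time point of the measured slice
    method: str                # e.g. "LPC tracker" or "hand/eye"
    software: str
    settings: Dict[str, float]
    f1_hz: float
    f2_hz: float
    f3_hz: float


def to_json(record: MeasurementRecord) -> str:
    """Serialise a record so it can be stored with the case notes."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> MeasurementRecord:
    """Rebuild a record from its stored JSON form."""
    return MeasurementRecord(**json.loads(text))


# Invented example values, for illustration only.
record = MeasurementRecord(
    speaker="S01",
    token="HAD1",
    time_s=0.412,
    method="LPC tracker",
    software="Praat",
    settings={"number_of_formants": 5, "maximum_formant_hz": 5000},
    f1_hz=700.0,
    f2_hz=1600.0,
    f3_hz=2500.0,
)

stored = to_json(record)
print(stored)
```

Storing the settings with each value, rather than once per project, covers cases where the tracker was adjusted for best visual fit on individual tokens.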
Acknowledgements
IAFPA Research Grant for travel expenses
Economic and Social Research Council UK for funding the DyViS Project ‘Dynamic Variability in Speech: A Forensic Phonetic Study of British English’ [RES ]
Other members of the DyViS project: Francis Nolan and Toby Hudson