IAFPA 2007 Plymouth, July 22-25, 2007
Developments in automatic speaker recognition at the BKA
Michael Jessen, Bundeskriminalamt
Franz Broß, Univ. Applied Sciences Koblenz
Stefan Gfroerer, Bundeskriminalamt

2 Some of our motivations for developing automatic speaker recognition
- For about ten years, general automatic speaker recognition technology has been adapted to meet the demands of forensic applications.
- Substantial increase in casework involving foreign languages; automatic speaker recognition is claimed to be language-independent.
- Using automatic speaker recognition as a check against errors in traditional auditory-acoustic speaker identification (cf. the collaborative exercise by Tina Cambier-Langeveld).

3 Stage 1 (2002): Developing a standard automatic speaker recognition system
Material:
- Various lab-speech data from U Koblenz, 82 male speakers
- Lab-speech experiment „Pool 2010“ at the BKA with 100 male speakers in systematically varied conditions, including Lombard speech
Methods:
- Standard deviation of LPC-cepstral coefficients
- Calculating intraspeaker and interspeaker distances from a total of ca. 12,000 distances in the Koblenz material and 80,000 distances in the BKA material
- Noise reduction with a Wiener filter
- Speech-pause recognition
Results and perspective:
- Error rates too high for forensic applications (EER 28% for good-quality material!)
- Programming of a GMM necessary
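A minimal sketch of the distance computation described above, assuming the per-recording features (standard deviations of LPC-cepstral coefficients) have already been extracted; the function names are illustrative, not the Koblenz/BKA code:

```python
# Illustrative sketch: intraspeaker vs. interspeaker distances between
# per-recording cepstral statistics (not the actual Koblenz/BKA code).
import numpy as np
from itertools import combinations

def recording_feature(cepstra):
    """cepstra: (n_frames, n_coeffs) LPC-cepstral matrix -> per-coefficient std."""
    return np.std(cepstra, axis=0)

def log_euclidean_distance(f1, f2):
    """Logarithm of the Euclidean distance between two feature vectors."""
    return np.log(np.linalg.norm(f1 - f2) + 1e-12)

def intra_inter_distances(features_by_speaker):
    """features_by_speaker: dict speaker_id -> list of per-recording feature vectors."""
    intra, inter = [], []
    speakers = list(features_by_speaker)
    for spk in speakers:
        for a, b in combinations(features_by_speaker[spk], 2):
            intra.append(log_euclidean_distance(a, b))
    for s1, s2 in combinations(speakers, 2):
        for a in features_by_speaker[s1]:
            for b in features_by_speaker[s2]:
                inter.append(log_euclidean_distance(a, b))
    return np.array(intra), np.array(inter)
```

Comparing the two resulting distributions (as on the next slide) shows how well the features separate same-speaker from different-speaker pairs.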

4 [Figure: estimated probability density of intraspeaker and interspeaker distances, plotted over the logarithmic Euclidean distance]

5 Stage 2 (2003/04): Improving forensic significance
Material:
- Increasing forensic relevance by re-recording the Pool 2010 data via real GSM transmissions
- Adding background noise, including natural noise (e.g. traffic, river)
Methods:
- Using MFCC and GMM
- Different enhancements (Wiener filtering for aperiodic disturbances, adaptive deconvolution for periodic ones)
- Calculation of distances between speech samples and GMMs
- Adding world-model compensation
Results:
- Reduction of the equal error rate down to 0.01%!
- Better EER mainly due to:
  - better enhancement
  - GMM based on data from several speaking styles, incl. Lombard
  - world-model compensation (from 3% to 0.01%)
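A rough illustration of GMM scoring with world-model (background-model) compensation, assuming MFCC matrices and a pool of background speech are available; a sketch only, not the SPES implementation:

```python
# Sketch of GMM scoring with world-model compensation for MFCC features
# of shape (n_frames, n_mfcc). Illustrative only, not the SPES code.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(features, n_components=64):
    """Fit a diagonal-covariance GMM to a stack of feature frames."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(features)
    return gmm

def compensated_score(test_features, speaker_gmm, world_gmm):
    """Average per-frame log-likelihood ratio: speaker model vs. world model."""
    return speaker_gmm.score(test_features) - world_gmm.score(test_features)

# Hypothetical usage:
# world_gmm   = train_gmm(np.vstack(background_mfcc_list))
# speaker_gmm = train_gmm(suspect_mfcc)
# llr = compensated_score(questioned_mfcc, speaker_gmm, world_gmm)
```

The subtraction normalises the speaker-model score against what any speaker would produce on the same material, which is the role the world model plays here.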

6 Stage 3 (2005/06): Learning from the professionals
- Selecting and processing a collection of authentic case data, including data from U Trier (Köster)
- Applying BATVOX to the case data
- Getting to know the ASPIC* system from EPFL** (Drygajlo, Meuwly, Alexander etc.)
- Project in which the BKA/Koblenz system (SPES***) was supplemented by procedures from ASPIC
- Testing this new system with case data
- Testing this new system with the NFI-TNO test
* ASPIC = Automatic Speaker Individualisation by Computer
** EPFL = École Polytechnique Fédérale de Lausanne (Swiss Federal Institute of Technology)
*** SPES = Sprechererkennungssystem (speaker identification system)

7 Stage 4 (2006/07): Further technological developments
- Using PLPCC (Perceptual Linear Prediction Cepstral Coefficients) instead of MFCC, inspired by the RASTA-PLP used by Drygajlo et al. This change of parameters led to significant improvements (from 27% to 17% EER).
- Optimisations, including:
  - increasing the number of feature vectors with the number of GMM modules
  - experimenting with different sizes and roll-off values of windows in the frequency domain
  - improvement through averaging across different runs with different parameter settings
- Currently: implementing delta features
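A small sketch of the delta features mentioned as current work, using the common regression formulation over a window of +/- N frames; the slide does not specify which variant SPES implements:

```python
# Delta (dynamic) features over a cepstral matrix of shape (n_frames, n_coeffs),
# computed with the standard regression formula. Illustrative sketch only.
import numpy as np

def delta(features, N=2):
    """Return first-order delta coefficients with the same shape as `features`."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:len(padded) - N + n] -
                    padded[N - n:len(padded) - N - n])
    return out / denom

# Hypothetical usage: augmented = np.hstack([cepstra, delta(cepstra)])
```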

8 Fully automatic vs. manually-guided automatic speaker recognition
Listening to the material:
- fully automatic: not necessary
- manually-guided: necessary
Dividing the suspect material for the Drygajlo method and selecting the reference population:
- fully automatic: automatically, by selection of the reference population based on overall similarity scores
- manually-guided: manually, by using explicit (mis)match criteria in the subdivision of the suspect material and the selection of the reference population (e.g. same language, same type of telephone line, same type of background noise, same speaking style)
Number of validation tests:
- fully automatic: limited only by computational power
- manually-guided: limited also by human intervention
Kinds of validation tests:
- fully automatic: participation in NIST-type tests possible
- manually-guided: participation in NIST-type tests ruled out practically and/or by instruction
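For illustration only, a sketch of the "fully automatic" route above, selecting a reference population from a speaker database by overall similarity to the suspect material; the function names and the similarity measure are assumptions, not the BATVOX or SPES procedure:

```python
# Hypothetical sketch: pick the n database speakers whose models are most
# similar to the suspect material, as a fully automatic reference population.
def select_reference_population(suspect_features, database_models, similarity, n=30):
    """database_models: dict speaker_id -> model;
    similarity(model, features) -> higher means more similar."""
    ranked = sorted(database_models.items(),
                    key=lambda item: similarity(item[1], suspect_features),
                    reverse=True)
    return [speaker_id for speaker_id, _ in ranked[:n]]
```

In the manually-guided approach this step is replaced by explicit matching criteria (language, telephone line, background noise, speaking style) applied by the expert.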

9 Casework experience (conclusion)
- Much better performance is possible for lab speech data than for real case data.
- The more varied the suspect material, the better the applicability of the Drygajlo normalisation and the better the results. So it is an advantage to have control over the recording or compilation of the suspect material and to practice a manually-guided approach.
- Impostor tests are useful, i.e. including edits from a non-relevant conversation partner in the analysis as a control.
- Whether the method is applied to German or another language has so far made no practical difference.

10 Casework experience (continued)
- Only in about 1/3 of the cases can automatic speaker recognition be applied at all; otherwise the signals are too poor or too short, or there are technical/behavioural mismatches between the questioned and the suspect speaker.
- If applicable, the results are usually congruent with those from the auditory-acoustic method.
- Discussion: what to do in case of non-congruent results?

11 Questions?

12 Direct method vs. Drygajlo normalisation*
[Figure: estimated probability density of the distribution of within-speaker similarities and the distribution of between-speaker similarities, with the evidence and the resulting log(LR) indicated]
Drygajlo normalisation: intra-speaker variability is modelled from different recordings of the same speaker in the case (the suspect).
Direct method: intra-speaker variability is modelled from the within-speaker comparisons in a population of speakers unrelated to the case.
* In collaboration with Didier Meuwly, Anil Alexander etc.
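Both variants come down to evaluating the evidence score against a within-speaker and a between-speaker score distribution; they differ only in where the within-speaker scores come from. A minimal sketch of that likelihood-ratio step, assuming the two score distributions are modelled with kernel density estimates (the modelling choice is an assumption, not necessarily what SPES/ASPIC does):

```python
# Likelihood ratio of an evidence score given sampled score distributions,
# using kernel density estimates. Illustrative sketch only.
import numpy as np
from scipy.stats import gaussian_kde

def log10_likelihood_ratio(evidence_score, within_scores, between_scores):
    """within_scores / between_scores: 1-D arrays of comparison scores."""
    p_within = gaussian_kde(within_scores)(evidence_score)[0]
    p_between = gaussian_kde(between_scores)(evidence_score)[0]
    return np.log10(p_within / p_between)
```

Under the Drygajlo normalisation, `within_scores` would come from comparisons among different recordings of the suspect; under the direct method, from within-speaker comparisons in an unrelated population.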

13 Test results with case data
[Figure: proportions of correct classification, non liquet and incorrect classification for SPES with the direct method vs. SPES with Drygajlo normalisation]
General result: compared with the result achieved with lab data (0.01% EER), the EER with real case data rose to 35%-28% (2005 to early 2006).
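Since the results are summarised as equal error rates, here is a minimal sketch of how an EER can be computed from same-speaker and different-speaker scores; illustrative only, not the evaluation code actually used with the case data:

```python
# Equal error rate from target (same-speaker) and non-target
# (different-speaker) scores, by sweeping a decision threshold.
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Return the error rate at which false rejections equal false acceptances."""
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_eer, best_gap = 1.0, np.inf
    for t in thresholds:
        fr = np.mean(target_scores < t)      # false rejection rate
        fa = np.mean(nontarget_scores >= t)  # false acceptance rate
        if abs(fr - fa) < best_gap:
            best_gap, best_eer = abs(fr - fa), (fr + fa) / 2
    return best_eer
```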

14 Analysis of discrepancies between the automatic (SPES) and the auditory-acoustic (BKA) method
1. SPES identical – BKA non-identical: 2x (voices similar, but linguistic/phonetic differences)
2. SPES identical – BKA non liquet: 2x
3. SPES non-identical – BKA identical: 2x
4. SPES non-identical – BKA non liquet: 2x
5. SPES non liquet – BKA identical: 4x
6. SPES non liquet – BKA non-identical: 1x
If SPES gave a discrepant non-identical or non liquet result, it was because of poor technical quality, very short duration, or a technical/behavioural mismatch between the questioned and the suspect speaker.

15 NFI/TNO-Test
[Figure: NFI/TNO test results with MFCC vs. with PLPCC]

16 Drygajlo normalisation
[Diagram of the Drygajlo normalisation set-up, with the following elements: evidence; within-speaker similarity; between-speaker similarity; R or SDB; C or TDB; suspect; questioned recording; population; recording of suspect, spontaneous speech; recording of suspect, read speech; questioned recording (spontaneous); speech data corpus (read and spontaneous speech); read speech samples only]