Published by Aja Corum. Modified over 10 years ago.
1
University of Joensuu, Dept. of Computer Science, P.O. Box 111, FIN-80101 Joensuu, Tel. +358 13 251 7959, fax +358 13 251 7955, www.cs.joensuu.fi
Speaker Recognition
University of Joensuu, Department of Computer Science
PUMS 2003-2004 seminar, 14.10.2004, Turku
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen
2
Research Group
– Pasi Fränti, Professor
– Juhani Saastamoinen, Project manager
– Evgeny Karpov, Project researcher
– Ville Hautamäki, Project researcher
– Tomi Kinnunen, Researcher
– Ismo Kärkkäinen, Clustering algorithms
PUMS project
3
PUMS & JoY Speaker Recognition
PUMS season 2003-2004:
– Identification, no verification
– Port to a mobile phone
– Feature fusion
– Real-time
http://cs.joensuu.fi/pages/pums
4
Application Scenarios
Speaker Recognition covers two tasks:
– Speaker Identification: "Whose voice is this?"
– Speaker Verification: "Is this Bob's voice?" (voice + identity claim; a rejected claim means "Imposter!")
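The difference between the two tasks can be sketched in a few lines. This is a minimal illustration, not the project's code: the speaker names, distance values and threshold are hypothetical, and a lower distance is assumed to mean a better match.

```python
def identify(distances):
    """Identification: pick the enrolled speaker with the smallest distance."""
    return min(distances, key=distances.get)


def verify(distances, claimed, threshold):
    """Verification: accept the claim only if the claimed speaker is close enough."""
    return distances[claimed] < threshold


# Hypothetical distances between one test utterance and each enrolled model.
distances = {"alice": 4.1, "bob": 1.3, "carol": 3.7}

print(identify(distances))              # whose voice is this?
print(verify(distances, "bob", 2.0))    # is this Bob's voice?
print(verify(distances, "carol", 2.0))  # is this Carol's voice?
```

Identification always returns some enrolled speaker, while verification can reject every claim; that is why verification needs a threshold and identification does not.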
5
Identification System
[Block diagram] Speech Audio -> Signal Processing -> Feature Vectors, then one of two paths:
– Training: Speaker Modelling -> add trained speaker profiles to the Speaker Profile Database
– Recognition: use all profiles in recognition -> Decision
Recognition: min. MSE within DB over input speech
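The recognition rule "min. MSE within DB over input speech" can be sketched as follows. The sketch assumes each speaker profile is a small codebook of code vectors and that the error is the mean squared distance from each input vector to its nearest code vector; the profile names and numbers are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mse_to_profile(vectors, codebook):
    """Mean squared distance from each vector to its nearest code vector."""
    total = sum(min(squared_distance(v, c) for c in codebook) for v in vectors)
    return total / len(vectors)


def identify(vectors, database):
    """Use all profiles and return the one with the minimum MSE."""
    return min(database, key=lambda name: mse_to_profile(vectors, database[name]))


# Hypothetical two-dimensional profiles and test feature vectors.
database = {
    "speaker_a": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_b": [[5.0, 5.0], [6.0, 4.0]],
}
test_vectors = [[0.9, 1.1], [0.1, -0.2], [1.2, 0.8]]

print(identify(test_vectors, database))
```

Every profile in the database is scored, so the cost grows with both the number of speakers and the number of input vectors.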
6
Results 2003-2004
[Architecture diagram] Labels:
– Components: sprofiler, ProfMatch, SpeakerProfiler, Winsprofiler, Epocsprofiler, srlib, Fusion, Real-time, Speech features (HY), DB
– User interfaces: console UI (twice), Windows, Series60, TCL/TK (HY)
– Shared layer: common speaker recognition app. interface
7
Planned Results
[Architecture diagram] Labels:
– Results 2003-2004: sprofiler, ProfMatch, SpeakerProfiler, Winsprofiler, Epocsprofiler, srlib, Fusion, Real-time, Speech features (HY), DB
– Planned additions: Segmentation, VAD, Verification
– Applications: Access control, Teleconference, Large scale database, Mobile phone login?
– Shared layer: common speaker recognition app. interface
8
System in Mobile Phone
Port to Symbian OS with Series 60 UI platform
9
Symbian Phones
Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– 32-bit ARM processor
– No floating-point unit!!!
UI platforms shown: Series 80, Series 60, UIQ
10
FFTGEN
Multiplication results must fit in 32 bits: truncate multiplication inputs
FFTGEN: Truncate to 16/16 bits ("16/16 FFT")
– FFT layer input: 16-bit integer
– FFT twiddle factor: 16-bit integer
– Input x twiddle factor = 32-bit multiplication result (part of the FFT layer output): 16 used bits, 16 crop-off bits
– Crop-off for next layer: 16 bits!
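The 16/16 multiplication step can be sketched with plain integers. This is an illustration under assumptions, not the FFTGEN source: values in [-1, 1] are assumed to be quantized with a scale of 2^15 - 1, and the helper names `to_fixed` and `multiply_and_crop` are hypothetical.

```python
import math


def to_fixed(value, bits):
    """Quantize a real value in [-1, 1] to a signed integer of the given width."""
    scale = (1 << (bits - 1)) - 1
    return int(round(value * scale))


def multiply_and_crop(sample, twiddle, crop_bits):
    """Integer multiply, check the 32-bit limit, then crop off the low bits."""
    product = sample * twiddle
    if not -(1 << 31) <= product < (1 << 31):
        raise OverflowError("product does not fit in 32 bits")
    return product >> crop_bits


sample = to_fixed(0.5, 16)                      # 16-bit FFT layer input
twiddle = to_fixed(math.cos(math.pi / 4), 16)   # 16-bit twiddle factor
result = multiply_and_crop(sample, twiddle, 16) # 16 bits kept for the next layer

# Convert back to a real number to compare with the exact product.
scale_after_crop = (32767 * 32767) / float(1 << 16)
print(result / scale_after_crop, 0.5 * math.cos(math.pi / 4))
```

Two 16-bit inputs are the widest pair whose product is guaranteed to fit in 32 bits, which is why both operands are cut to 16 bits in this scheme.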
11
Proposed Information Preserving "22/10 FFT"
Approximate DFT operator F with G
Increase ||F-G||, preserve more signal information
– Minimize maximum relative error in scaled sine values with respect to scale; 980 good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/op)
Multiplication:
– FFT layer input: 32-bit integer, 22 bits used
– FFT twiddle factor: 16-bit integer, 10 bits used
– Input x twiddle factor = 32-bit multiplication result: 22 used bits, 10 crop-off bits
– FFT layer output (part of it): 32-bit integer
– Crop-off for next layer: 10 bits
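The scale criterion above can be read as a small search problem. The sketch below is one possible interpretation, not the authors' procedure: it assumes the twiddle values are sin(2*pi*k/N) rounded after multiplication by an integer scale, that the relative error is taken over the nonzero values of one quarter period, and that candidate scales lie in 512..1023. It is not claimed to reproduce the value 980.

```python
import math


def max_relative_error(scale, fft_size):
    """Largest relative rounding error over the nonzero sine twiddle values."""
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):
        exact = math.sin(2.0 * math.pi * k / fft_size)
        approx = round(scale * exact) / scale
        worst = max(worst, abs(approx - exact) / exact)
    return worst


def best_scale(fft_size, lowest=512, highest=1023):
    """Scale in the assumed range with the smallest maximum relative error."""
    candidates = range(lowest, highest + 1)
    return min(candidates, key=lambda s: max_relative_error(s, fft_size))


# Angles for smaller power-of-two FFT sizes are a subset of those for 1024.
scale = best_scale(1024)
print(scale, max_relative_error(scale, 1024))
print(980, max_relative_error(980, 1024))
```

A scale below 1024 keeps the twiddle magnitude within 10 bits, which leaves 22 bits for the signal inside a 32-bit product.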
12
Scale of Error in Proposed FFT
Log10 of relative error in FFT elements

                      FFTGEN (16/16)   22/10 FFT
average               -0.775           -2.118
standard deviation     0.797            0.590
13
Mobile Phone Results

TIMIT, 100 speakers     recog. rate (%)   std. dev. (%)
FLOAT                   100.0             N/A
FFTGEN                  9.7               1.6
FIXED                   95.8              1.2
MIXED                   100.0             N/A
MIXED2                  98.0              0.6

implementation, signal  recog. rate (%)   std. dev. (%)
FLOAT, Symbian audio    83.2              4.38
FLOAT, PC audio         100.0             N/A
FIXED, Symbian audio    76.0              2.83
FIXED, PC audio         100.0             N/A
14
Improving Accuracy by Information Fusion
[Block diagram] The feature vector is split into feature sets, each with its own classifier:
– Feature set 1 (e.g. 5 MFCCs) -> Classifier 1 -> score 1
– Feature set 2 (e.g. F0 + ΔF0) -> Classifier 2 -> score 2
– Feature set 3 (e.g. formants F1, F2, F3) -> Classifier 3 -> score 3
– ...
Score combiner -> Decision
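The score combiner can be sketched as a weighted sum of normalized scores. This is an assumed combiner for illustration only; the slides do not specify the combination rule. The min-max normalization, the weights and all score values are hypothetical, and lower scores are assumed to mean better matches.

```python
def normalize(scores):
    """Min-max normalize one classifier's scores to the range [0, 1]."""
    lowest, highest = min(scores.values()), max(scores.values())
    span = (highest - lowest) or 1.0
    return {speaker: (s - lowest) / span for speaker, s in scores.items()}


def fuse_scores(score_sets, weights):
    """Weighted sum of the normalized scores of several classifiers."""
    fused = {}
    for scores, weight in zip(score_sets, weights):
        for speaker, s in normalize(scores).items():
            fused[speaker] = fused.get(speaker, 0.0) + weight * s
    return fused


# Hypothetical distance scores from three classifiers (lower is better).
mfcc_scores = {"a": 1.0, "b": 2.0, "c": 4.0}
f0_scores = {"a": 30.0, "b": 10.0, "c": 20.0}
formant_scores = {"a": 0.2, "b": 0.3, "c": 0.9}

fused = fuse_scores([mfcc_scores, f0_scores, formant_scores], [0.5, 0.25, 0.25])
decision = min(fused, key=fused.get)
print(decision, fused)
```

Normalization matters here because the three classifiers produce scores on different scales; without it the classifier with the largest numbers would dominate the sum.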
15
Information Fusion Results
[Table; the cell layout did not survive extraction]
Columns: Decision-level fusion | Score-level fusion | Feature-level fusion | BASELINE: best individual | Feature set combination
Rows: MFCC + ΔMFCC, All feature sets, FMT + ΔFMT, ARCSIN + ΔARCSIN, LPCC + ΔLPCC
Values in extraction order: 14.6, 15.8, 16.8, 15.2, 52.0, 16.8, 14.7, 12.6, 21.2, 16.0, 29.9, 19.4, 18.2, 17.1, 19.8, 16.0
Legend: Fusion successful / Fusion sucks / N/A
16
Real-Time Speaker Identification
[Flow diagram]
Speech input stream
-> Frame blocking (all frames)
-> Silence detection (non-silent frames)
-> Feature extraction (feature vectors)
-> Pre-quantization (reduced set of vectors)
-> Matching against the speaker database (Speaker 1 model ... Speaker N model)
-> Database pruning (list of candidate speakers: active speakers, pruned speakers)
-> Decision? Yes: END. No: fill buffer with new data and continue.
Reducing # vectors (pre-quantization):
1. Averaging
2. Random sampling
3. Decimation
4. Clustering (LBG)
Speed up NN search:
Vantage-point tree (VPT) indexing of the code vectors
Reduce # speakers (database pruning):
1. Static pruning
2. Hierarchical pruning
3. Adaptive pruning
4. Confidence-based pruning
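Two of the listed ideas, pre-quantization and pruning, can be sketched together. The definitions below are assumptions made for illustration: averaging replaces each block of consecutive vectors by its mean, decimation keeps every n-th vector, and the pruning step drops a fixed number of worst-matching speakers after each buffer. The function names, models and frames are hypothetical.

```python
def decimate(vectors, step):
    """Pre-quantization by decimation: keep every step-th vector."""
    return vectors[::step]


def average(vectors, step):
    """Pre-quantization by averaging: one mean vector per block of step vectors."""
    reduced = []
    for i in range(0, len(vectors), step):
        block = vectors[i:i + step]
        reduced.append([sum(column) / len(block) for column in zip(*block)])
    return reduced


def distortion(vectors, codebook):
    """Sum of squared distances from each vector to its nearest code vector."""
    return sum(
        min(sum((x - y) ** 2 for x, y in zip(v, c)) for c in codebook)
        for v in vectors
    )


def identify_with_pruning(buffers, database, drop):
    """After each buffer, drop the `drop` worst speakers (at least one stays)."""
    active = {name: 0.0 for name in database}
    for vectors in buffers:
        for name in active:
            active[name] += distortion(vectors, database[name])
        ranked = sorted(active, key=active.get)
        keep = max(1, len(ranked) - drop)
        active = {name: active[name] for name in ranked[:keep]}
        if len(active) == 1:
            break
    return min(active, key=active.get)


# Hypothetical one-vector speaker models and four input frames.
database = {"s1": [[0.0, 0.0]], "s2": [[3.0, 3.0]], "s3": [[9.0, 9.0]]}
frames = [[2.9, 3.1], [3.1, 2.9], [3.0, 3.2], [2.8, 3.0]]

reduced = average(frames, 2)             # 4 frames -> 2 vectors
buffers = [reduced[:1], reduced[1:]]     # one reduced vector per buffer
print(identify_with_pruning(buffers, database, drop=1))
```

Pre-quantization shrinks the number of vectors scored per speaker, and pruning shrinks the number of speakers scored per buffer, so the two savings multiply when combined.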
17
Results: Baseline System (TIMIT)
(Average length of test utterance = 8.9 s)
Real-time requirement satisfied
4 x realtime
18
Results: Pre-Quantization (TIMIT)
(Codebook size = 64)
– Averaging performs worst, clustering best
– About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy
– 9 x realtime
19
Results: Pruning Variants (TIMIT)
(Codebook size = 64)
– 11 x realtime
– Recommended method: adaptive pruning (AP)
20
Results: PQ, Pruning and PQP (TIMIT)
(Codebook size = 64)
– 33 x realtime
– Recommended method: combination of pre-quantization and pruning (PQP)
21
Results: VQ vs. GMM (TIMIT)
(Average length of test utterance = 8.9 s)
VQ: 13:1 speed-up without degradation
– Best time: 0.27 s = 33 x realtime @ error rate 0.32 %
– Smallest error: 0.00 % @ 0.31 s = 28 x realtime
GMM: 9:1 to 10:1 speed-up without degradation
– Best time: 0.18 s = 49 x realtime @ error rate 0.16 %
– Smallest error: 0.16 % @ 0.18 s = 49 x realtime
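The "x realtime" figures follow from dividing the average utterance length by the identification time. A small check of the TIMIT numbers on this slide, with `realtime_factor` as a hypothetical helper name:

```python
def realtime_factor(utterance_seconds, processing_seconds):
    """How many times faster than the audio duration the processing runs."""
    return utterance_seconds / processing_seconds


AVERAGE_UTTERANCE = 8.9  # seconds, as stated on the slide

for seconds in (0.27, 0.31, 0.18):
    factor = realtime_factor(AVERAGE_UTTERANCE, seconds)
    print(f"{seconds:.2f} s -> {factor:.1f} x realtime")
```

A factor above 1 means the utterance is processed faster than it is spoken, which is the real-time requirement referred to in the baseline results.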
22
Results: VQ vs. GMM (NIST-1999)
(Average length of test utterance = 30.4 s)
VQ: 13:1 to 16:1 speed-up with minor degradation
– Best time: 0.48 s = 63 x realtime @ error rate 19.22 %
– Smallest error: 17.34 % @ 11.4 s = 3 x realtime
GMM: 23:1 to 34:1 speed-up with minor degradation
– Best time: 0.82 s = 37 x realtime @ error rate 19.36 %
– Smallest error: 16.90 % @ 37.9 s = 0.8 x realtime