University of Joensuu Dept. of Computer Science P.O. Box 111 FIN- 80101 Joensuu Tel. +358 13 251 7959 fax +358 13 251 7955 www.cs.joensuu.fi Speaker Recognition.

Speaker Recognition
University of Joensuu, Department of Computer Science
PUMS 2003-2004 seminar, 14.10.2004, Turku
Pasi Fränti, Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, Tomi Kinnunen, Ismo Kärkkäinen

Research Group
– Pasi Fränti: Professor
– Juhani Saastamoinen: Project manager
– Evgeny Karpov: Project researcher
– Ville Hautamäki: Project researcher
– Tomi Kinnunen: Researcher
– Ismo Kärkkäinen: Clustering algorithms
PUMS project

PUMS & JoY Speaker Recognition
PUMS season 2003-2004:
– Identification, no verification
– Port to a mobile phone
– Feature fusion
– Real-time
http://cs.joensuu.fi/pages/pums

Application Scenarios
Speaker recognition covers two tasks:
– Speaker identification: "Whose voice is this?"
– Speaker verification: "Is this Bob's voice?" The speaker makes an identity claim, which is either verified or rejected as an impostor.
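
The difference between the two tasks can be sketched in a few lines. This is only an illustration, not the project's code: it assumes each enrolled speaker already has a distance score against the input (lower is better), and the names `identify`, `verify`, and `threshold` are hypothetical.

```python
def identify(scores):
    """Identification: return the enrolled speaker with the smallest distance."""
    return min(scores, key=scores.get)


def verify(scores, claimed, threshold):
    """Verification: accept the identity claim only if the claimed
    speaker's distance does not exceed the threshold (an assumed parameter)."""
    return scores[claimed] <= threshold


# Hypothetical distances between the input speech and each enrolled speaker.
scores = {"alice": 0.8, "bob": 0.3}
who = identify(scores)                  # answers "Whose voice is this?"
is_bob = verify(scores, "bob", 0.5)     # answers "Is this Bob's voice?"
is_alice = verify(scores, "alice", 0.5) # a claim that would be rejected
```

Identification needs no claim and no threshold, while verification needs both.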

Identification System
[Block diagram] Speech audio -> Signal processing -> Feature vectors -> Speaker modelling -> Speaker profile database -> Decision
– Training: add trained speaker profiles to the database
– Recognition: use all profiles in the database; choose the profile with the minimum MSE over the input speech
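
The minimum-MSE decision can be sketched as follows. The sketch assumes that a speaker profile is a codebook of code vectors and that the MSE is the mean squared distance from each input feature vector to its nearest code vector; the slide does not spell this out, and all function names are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_squared_error(features, codebook):
    """Average, over the input feature vectors, of the squared distance
    to the nearest code vector of one speaker profile."""
    total = sum(min(squared_distance(x, c) for c in codebook) for x in features)
    return total / len(features)


def identify_speaker(features, database):
    """Score the input against every profile and return the best match
    together with all per-speaker errors."""
    errors = {name: mean_squared_error(features, cb) for name, cb in database.items()}
    return min(errors, key=errors.get), errors


# Toy database with two-dimensional code vectors (illustrative values only).
database = {
    "spk1": [[0.0, 0.0], [1.0, 1.0]],
    "spk2": [[5.0, 5.0], [6.0, 6.0]],
}
features = [[0.1, 0.0], [0.9, 1.1]]
best, errors = identify_speaker(features, database)
```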

Results 2003-2004
[Architecture diagram; labels as extracted]
– Library and modules: srlib, Fusion, Speech features (HY), Real-time, DB
– Common speaker recognition application interface
– Applications: sprofiler, ProfMatch, SpeakerProfiler, Winsprofiler, Epocsprofiler
– User interfaces and platforms: console UI, Windows, Series60, TCL/TK (HY)

Planned Results
[Architecture diagram; labels as extracted]
– Results 2003-2004: srlib, Fusion, Speech features (HY), Real-time, DB; sprofiler, ProfMatch, SpeakerProfiler, Winsprofiler, Epocsprofiler; common speaker recognition application interface
– Further labels: Segmentation, VAD, Verification
– Applications: access control, teleconference, large-scale database, mobile phone login?

System in Mobile Phone
Port to Symbian OS with the Series 60 UI platform

Symbian Phones
Platforms shown: Series 60, Series 80, UIQ
Series 60 phone features:
– 16 MB ROM
– 8 MB RAM
– 176 x 208 display
– 32-bit ARM processor
– No floating-point unit!

FFTGEN
Multiplication results must fit in 32 bits, so the multiplication inputs are truncated.
FFTGEN truncates to 16/16 bits ("16/16 FFT"):
– FFT layer input: 16-bit integer
– FFT twiddle factor: 16-bit integer
– Input x twiddle factor = 32-bit multiplication result (part of the FFT layer output)
– Crop-off for the next layer: 16 bits (16 used bits, 16 crop-off bits)
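
The 16/16 multiplication can be modelled with plain integers. This is a minimal sketch of the bit budget described on the slide, not FFTGEN's source: the twiddle scale of 32767 and the helper names are assumptions.

```python
import math


def fits_signed(value, bits):
    """True if value is representable as a signed integer of the given width."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))


def quantize_twiddle(angle, scale=32767):
    """Hypothetical helper: cosine twiddle scaled to a signed 16-bit integer."""
    return round(scale * math.cos(angle))


def multiply_16_16(sample, twiddle):
    """Multiply two signed 16-bit integers and crop the low 16 bits."""
    if not (fits_signed(sample, 16) and fits_signed(twiddle, 16)):
        raise ValueError("both inputs must already be truncated to 16 bits")
    product = sample * twiddle      # magnitude at most 2**30, fits in 32 bits
    assert fits_signed(product, 32)
    return product >> 16            # arithmetic shift drops 16 crop-off bits


w = quantize_twiddle(0.0)           # 32767
large = multiply_16_16(12345, w)    # 6172
small = multiply_16_16(1, w)        # 0: a small sample is lost entirely
```

The second call shows the weakness of this split: cropping 16 bits after every layer wipes out low-amplitude samples.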

Proposed Information Preserving "22/10 FFT"
Approximate the DFT operator F with G
Increase ||F-G||, preserve more signal information:
– Minimize the maximum relative error in scaled sine values with respect to the scale; 980 is good for FFT sizes up to 1024
– Truncate multiplication inputs to 22/10 bits (signal/operator)
Multiplication layout:
– FFT layer input: 32-bit integer, 22 bits used
– FFT twiddle factor: 16-bit integer, 10 bits used
– Input x twiddle factor = 32-bit multiplication result, stored as a 32-bit integer (part of the FFT layer output)
– Crop-off for the next layer: 10 bits (22 used bits, 10 crop-off bits)
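
A rough sketch of the two ideas on this slide follows. It assumes the "maximum relative error" is taken over the first-quadrant twiddle sines of the FFT, and that the signal is a signed 22-bit value and the twiddle magnitude stays below 2^10. Both are guesses at the intended definition, and the function names are hypothetical.

```python
import math


def max_relative_error(scale, fft_size):
    """Worst relative rounding error of round(scale * sin(2*pi*k/N))
    over the nonzero first-quadrant sines of an N-point FFT."""
    worst = 0.0
    for k in range(1, fft_size // 4 + 1):
        exact = scale * math.sin(2 * math.pi * k / fft_size)
        worst = max(worst, abs(round(exact) - exact) / exact)
    return worst


def multiply_22_10(sample, twiddle):
    """Multiply a signed 22-bit sample by a twiddle of magnitude < 2**10
    and crop the low 10 bits; the product always fits in 32 bits."""
    if not -(1 << 21) <= sample < (1 << 21):
        raise ValueError("sample must fit in 22 bits")
    if abs(twiddle) >= (1 << 10):
        raise ValueError("twiddle magnitude must be below 2**10")
    product = sample * twiddle
    assert -(1 << 31) <= product < (1 << 31)
    return product >> 10            # only 10 crop-off bits per layer


err_980 = max_relative_error(980, 1024)
err_1000 = max_relative_error(1000, 1024)   # larger: 1000 * sin(2*pi/1024) rounds poorly
result = multiply_22_10(1_000_000, 980)     # 957031
```

With a scale of 980 the smallest sine of a 1024-point FFT scales to about 6.01, so rounding it to 6 costs very little. That is one plausible reading of why 980 works well.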

Scale of Error in Proposed FFT
[Figure panels: 16/16, 22/10]
Log10 of relative error in FFT elements:
                      FFTGEN    22/10 FFT
average               -0.775    -2.118
standard deviation     0.797     0.590

Mobile Phone Results
TIMIT, 100 speakers       recog. rate (%)   std. dev. (%)
FLOAT                     100.0             N/A
FFTGEN                      9.7             1.6
FIXED                      95.8             1.2
MIXED                     100.0             N/A
MIXED2                     98.0             0.6
implementation, signal    recog. rate (%)   std. dev. (%)
FLOAT, Symbian audio       83.2             4.38
FLOAT, PC audio           100.0             N/A
FIXED, Symbian audio       76.0             2.83
FIXED, PC audio           100.0             N/A

Improving Accuracy by Information Fusion
[Block diagram] The feature vector is split into feature sets; each feature set is scored by its own classifier, and a score combiner turns the scores into a decision.
– Feature set 1 (e.g. 5 MFCCs) -> Classifier 1 -> score 1
– Feature set 2 (e.g. F0 + Δ-F0) -> Classifier 2 -> score 2
– Feature set 3 (e.g. formants F1, F2, F3) -> Classifier 3 -> score 3
– ...
– Score combiner -> Decision
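
A score combiner of this kind can be sketched as below. The slide does not say which combination rule was used. Min-max normalisation followed by a weighted sum is assumed here for score-level fusion, with majority voting as the decision-level counterpart. Scores are treated as distances (lower is better), and all names and numbers are illustrative.

```python
from collections import Counter


def normalize(scores):
    """Min-max normalise one classifier's per-speaker distances to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {spk: (s - lo) / span for spk, s in scores.items()}


def score_level_fusion(score_sets, weights=None):
    """Weighted sum of normalised distances; the smallest total wins."""
    weights = weights or [1.0] * len(score_sets)
    combined = {}
    for w, scores in zip(weights, score_sets):
        for spk, s in normalize(scores).items():
            combined[spk] = combined.get(spk, 0.0) + w * s
    return min(combined, key=combined.get)


def decision_level_fusion(score_sets):
    """Each classifier votes for its own best speaker; the majority wins."""
    votes = Counter(min(scores, key=scores.get) for scores in score_sets)
    return votes.most_common(1)[0][0]


# Hypothetical distances from three classifiers (MFCC, F0, formants).
mfcc = {"a": 1.0, "b": 2.0, "c": 3.0}
f0 = {"a": 5.0, "b": 4.0, "c": 9.0}
fmt = {"a": 0.2, "b": 0.9, "c": 0.5}
by_score = score_level_fusion([mfcc, f0, fmt])
by_vote = decision_level_fusion([mfcc, f0, fmt])
```

Normalisation matters because the classifiers produce distances on different scales; without it the feature set with the largest raw values would dominate the sum.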

Information Fusion Results
[Table; the cell layout was lost in extraction, so labels and values are listed in extracted order]
Column headings: Decision-level fusion; Score-level fusion; Feature-level fusion; BASELINE: Best individual; Feature set combination
Row labels: MFCC + ΔMFCC; All feature sets; FMT + ΔFMT; ARCSIN + ΔARCSIN; LPCC + ΔLPCC
Values: 14.6, 15.8, 16.8 | 15.2, 52.0, 16.8, 14.7 | 12.6, 21.2, 16.0 | 29.9, 19.4 | 18.2, 17.1 | 19.8, 16.0
Legend: fusion successful; fusion unsuccessful; N/A

Real-Time Speaker Identification
[Flowchart] Processing loop:
1. Fill the buffer with new data from the speech input stream
2. Frame blocking (output: all frames)
3. Silence detection (output: non-silent frames)
4. Feature extraction (output: feature vectors)
5. Pre-quantization (output: reduced set of vectors)
6. Matching against the speaker database (Speaker 1 model ... Speaker N model)
7. Database pruning: the list of candidate speakers is split into active speakers and pruned speakers
8. Decision? Yes: END. No: fill the buffer with new data and repeat.
Speed-up techniques:
– Reducing the number of vectors (pre-quantization): 1. Averaging 2. Random sampling 3. Decimation 4. Clustering (LBG)
– Speeding up the NN search: vantage-point tree (VPT) indexing of the code vectors
– Reducing the number of speakers (pruning): 1. Static pruning 2. Hierarchical pruning 3. Adaptive pruning 4. Confidence-based pruning
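
Two of the speed-up ideas can be sketched compactly. The pre-quantization helpers follow the method names on the slide. The pruning loop is only one simple rule assumed for illustration (drop a fixed number of the worst-scoring speakers after each buffer of vectors) and is not necessarily the exact definition of any of the four pruning variants. All function and parameter names are hypothetical.

```python
def pq_decimation(vectors, m):
    """Pre-quantization by decimation: keep every m-th vector."""
    return vectors[::m]


def pq_averaging(vectors, m):
    """Pre-quantization by averaging: replace each run of m vectors by its mean."""
    out = []
    for i in range(0, len(vectors), m):
        group = vectors[i:i + m]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def identify_with_pruning(vectors, database, interval, prune_count):
    """Accumulate nearest-code-vector distances buffer by buffer and drop the
    prune_count worst speakers after each buffer (assumed pruning rule)."""
    active = {name: 0.0 for name in database}
    for start in range(0, len(vectors), interval):
        chunk = vectors[start:start + interval]
        for name in active:
            active[name] += sum(
                min(squared_distance(x, c) for c in database[name]) for x in chunk
            )
        keep = max(1, len(active) - prune_count)
        ranked = sorted(active, key=active.get)
        active = {name: active[name] for name in ranked[:keep]}
        if len(active) == 1:
            break                     # decision reached before all data is used
    return min(active, key=active.get)


database = {"a": [[0.0, 0.0]], "b": [[5.0, 5.0]], "c": [[10.0, 10.0]]}
vectors = [[5.1, 4.9]] * 6
winner = identify_with_pruning(pq_decimation(vectors, 2), database, 1, 1)
```

Pre-quantization shrinks the number of input vectors before matching, and pruning shrinks the number of speaker models still being matched.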

Results: Baseline System (TIMIT)
[Plot] Average length of test utterance = 8.9 s
– Real-time requirement satisfied
– 4 x realtime

Results: Pre-Quantization (TIMIT)
[Plot] Codebook size = 64
– Averaging performs worst, clustering best
– About 2:1 speed-up over full search (no pre-quantization) without degradation in accuracy
– 9 x realtime

Results: Pruning Variants (TIMIT)
[Plot] Codebook size = 64
– 11 x realtime
– Recommended method: adaptive pruning (AP)

Results: PQ, Pruning and PQP (TIMIT)
[Plot] Codebook size = 64
– 33 x realtime
– Recommended method: combination of pre-quantization and pruning (PQP)

Results: VQ vs. GMM (TIMIT)
(Average length of test utterance = 8.9 s)
                  VQ                                          GMM
Speed-up          13:1 without degradation                    9:1 to 10:1 without degradation
Best time         0.27 s = 33 x realtime @ error rate 0.32 %  0.18 s = 49 x realtime @ error rate 0.16 %
Smallest error    0.00 % @ 0.31 s = 28 x realtime             0.16 % @ 0.18 s = 49 x realtime
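
The "x realtime" figures appear to be the average utterance length divided by the identification time. The short check below assumes that reading; the function name is hypothetical.

```python
def realtime_factor(utterance_seconds, processing_seconds):
    """How many times faster than real time the utterance was processed."""
    return utterance_seconds / processing_seconds


# TIMIT figures from the slide: average utterance length 8.9 s.
factor_a = realtime_factor(8.9, 0.27)   # about 33
factor_b = realtime_factor(8.9, 0.18)   # about 49
```

Under this reading the reported factors agree with the listed times up to rounding of the times themselves.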

Results: VQ vs. GMM (NIST-1999)
(Average length of test utterance = 30.4 s)
                  VQ                                           GMM
Speed-up          13:1 to 16:1 with minor degradation          23:1 to 34:1 with minor degradation
Best time         0.48 s = 63 x realtime @ error rate 19.22 %  0.82 s = 37 x realtime @ error rate 19.36 %
Smallest error    17.34 % @ 11.4 s = 3 x realtime              16.90 % @ 37.9 s = 0.8 x realtime
