University of Joensuu Dept. of Computer Science P.O. Box 111 FIN Joensuu Tel fax
Automatic Speaker Recognition for Series 60 Mobile Devices
University of Joensuu, Department of Computer Science
Specom’2004, Sep 20, 2004
Juhani Saastamoinen, Evgeny Karpov, Ville Hautamäki, and Pasi Fränti

Background
• Project in the National FENIX programme
  – New Methods and Applications in Speech Technology
• 7 research institutes
• Project partners: NRC, Lingsoft, National Bureau of Investigation, etc.
• Joensuu: Speaker Recognition

Research Group
• Pasi Fränti: Professor
• Juhani Saastamoinen: Project manager
• Evgeny Karpov: Project researcher
• Ville Hautamäki: Project researcher
• Tomi Kinnunen: Researcher
• Ismo Kärkkäinen: Clustering algorithms
PUMS project

Application Scenarios
• Speaker recognition covers two tasks: speaker verification and speaker identification
• Verification: "Is this Bob’s voice?" A claimed identity is checked against the voice; a false claim is rejected as an impostor
• Identification: "Whose voice is this?"

Project Goal
• Port speaker recognition to a Series 60 mobile phone

Symbian Phones
• Series 60 phone features:
  – 16 MB ROM
  – 8 MB RAM
  – 176 x 208 display
  – ARM processor
  – No floating-point unit!
• Platforms pictured: Series 80, Series 60, UIQ

Symbian OS
• Defined by the Symbian consortium
• Based on EPOC
• Operating system for mobile phones
  – Real-time system
  – Long uptime required
• Multitasking, multithreading

Problems of Porting
• Usual considerations when porting to a phone
  – Event-driven GUI programming
  – Platform-specific programming model
  – Real-time system, exceptions
• Application-specific porting problems
  – Number crunching without a floating-point unit!
  – Signal processing is numerically challenging

Identification System
• Signal Processing / Feature Extraction: speech audio in, feature vectors out
• Speaker Modelling: create a speaker profile from the feature vectors
  – Add speaker profiles to the Speaker Profile Database during training
• Speaker Recognition: classify input speech based on existing profiles
  – Read and use all profiles during recognition
  – Output: decision
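
The recognition step above can be sketched as a nearest-profile search. This is a minimal illustration, assuming each speaker profile is a VQ codebook (the slides mention VQ/GLA later) and that the match score is the average squared quantization distortion; the function names and the toy data are hypothetical, not taken from the project code.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def average_distortion(features, codebook):
    """Mean distance from each feature vector to its nearest code vector."""
    total = sum(min(squared_distance(f, c) for c in codebook) for f in features)
    return total / len(features)


def identify(features, profiles):
    """Return the name of the profile with the smallest average distortion.

    `profiles` maps a speaker name to a codebook (list of code vectors).
    """
    return min(profiles, key=lambda name: average_distortion(features, profiles[name]))


# Toy profile database with two-dimensional integer "features".
profiles = {
    "alice": [[0, 0], [1, 1]],
    "bob": [[10, 10], [11, 11]],
}
best = identify([[10, 9], [11, 12]], profiles)
```

With integer feature vectors the distance computation uses only integer additions and multiplications, which is one reason such a matching step is comparatively easy to run without a floating-point unit.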

MFCC Signal Processing
• Processing chain: digital speech signal frame → Pre-emphasis → Time windowing → DFT → Abs → Filter bank → Log → DCT → feature vector
• Parameters: pre-emphasis coefficient 0.97, Hamming window, 30 triangular mel filters, base-2 logarithm, 12 MFCCs as output
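
The first two stages of the chain can be sketched in floating point as a reference. This is only an illustration: the coefficient 0.97 and the Hamming window come from the slide, while the function names, the handling of the first sample, and the standard 0.54/0.46 Hamming formula are assumptions.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through (assumed)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(n):
    """Standard Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def window_frame(frame):
    """Apply pre-emphasis, then multiply the frame by the Hamming window."""
    return [s * w for s, w in zip(pre_emphasis(frame), hamming(len(frame)))]
```

A fixed-point port would replace the window values and the coefficient with scaled integers; the float version is useful mainly as the "correct" output to compare against.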

Fixed-Point Implementation
• Numerical analysis needed for fixed-point arithmetic implementation
• Truncation and re-scaling to avoid overflows in the converted algorithm
• Minimize information loss caused by computation in fixed-point arithmetic
  – Minimize relative error

FFT, Fixed-Point
• Frequency spectrum of speech
  – Biggest source of numerical error
  – Butterflies have multiplications
  – Layers repeat truncation errors
• Fixed number of bits per element
  – 32, native integer size in many systems
• Reference implementation: FFTGEN

FFTGEN (16/16)
• Multiplication: the product of two 32-bit integers must fit in 32 bits, so the inputs are truncated
• FFTGEN: truncate inputs to 16/16 bits
• FFT layer input (16-bit integer) x FFT twiddle factor (16-bit integer) gives a 32-bit multiplication result
• 32-bit multiplication result: 16 used bits, 16 crop-off bits
• FFT layer output (part of it): crop-off for next layer is 16 bits!
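
The bit budget behind the 16/16 scheme can be checked with a few lines of integer arithmetic. This is a sketch of the general idea only, not FFTGEN's actual code; the function names are hypothetical, and Python's unbounded integers are used to verify explicitly what would fit in a 32-bit register.

```python
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def fits_int32(value):
    """True if the value is representable as a signed 32-bit integer."""
    return INT32_MIN <= value <= INT32_MAX


def mul_16_16(x, w):
    """Multiply two signed 16-bit integers and crop off the 16 low bits.

    The product of two 16-bit operands always fits in 32 bits; the shift
    returns the result to 16 bits for the next FFT layer.
    """
    assert -(2 ** 15) <= x < 2 ** 15 and -(2 ** 15) <= w < 2 ** 15
    product = x * w
    assert fits_int32(product)
    return product >> 16  # arithmetic shift, rounds toward minus infinity


# Two full-width 32-bit operands would overflow a 32-bit result:
overflow_example = (2 ** 31 - 1) * (2 ** 31 - 1)
```

The cost of the scheme is visible in the shift: half of the bits of every product are discarded at every layer.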

Info Preserving FFT (22/10)
• Approximate the DFT operator F with G
• Increase ||F-G||, preserve more signal information
  – Minimize the maximum relative error in scaled sine values with respect to the scale; 980 is good for FFT sizes up to 1024
  – Truncate multiplication inputs to 22/10 bits (signal/operator)
• FFT layer input: 32-bit integer, 22 bits used
• FFT twiddle factor: 16-bit integer, 10 bits used
• 32-bit multiplication result: 22 used bits, 10 crop-off bits
• FFT layer output (part of it): 32-bit integer; crop-off for next layer is 10 bits
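
The trade-off between operator accuracy and signal accuracy can be illustrated on a single multiplication. The sketch below rests on several assumptions that are not in the slides: the 16/16 twiddle factor is scaled by 32767, the 22/10 signal is a 16-bit sample shifted left by 6 bits, and the error is measured against the exact product under each scheme's own scaling. Only the scale 980 and the 22/10 and 16/16 bit splits come from the slides.

```python
import math


def relative_errors(x, k, n=1024, scale=980):
    """Relative truncation error of x * cos(2*pi*k/n) in both schemes.

    Returns (error_16_16, error_22_10) for a signed 16-bit sample x.
    """
    c = math.cos(2 * math.pi * k / n)

    # 16/16: 16-bit sample times 16-bit twiddle, crop 16 bits.
    w16 = round(c * 32767)
    got16 = (x * w16) >> 16
    want16 = x * c * 32767 / 65536

    # 22/10: sample held with 22 bits, twiddle with about 10 bits, crop 10 bits.
    x22 = x << 6
    w10 = round(c * scale)
    got22 = (x22 * w10) >> 10
    want22 = x22 * c * scale / 1024

    return abs(got16 - want16) / abs(want16), abs(got22 - want22) / abs(want22)


# A low-amplitude sample, where cropping 16 bits hurts the most.
err16, err22 = relative_errors(37, 100)
```

In this toy case the coarser twiddle factor costs less than the bits saved on the signal side. For a single multiplication of a large-amplitude sample the 16/16 product can be the more accurate one, so the sketch shows the mechanism rather than the measured benefit over a full FFT.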

FFT Spectrum, Fixed-Point
[Figure: scatter plots of 16/16 and 22/10 absolute values for the original TIMIT signal and for the TIMIT signal x 4]
• x-axis: fixed-point FFT element abs. values
• y-axis: correct FFT element abs. values

Scale of Error in Proposed FFT
[Figure: log10 of relative error in FFT elements, average and standard deviation, for 16/16 and 22/10]

Magnitude Spectrum, Fixed-Point
• Compute complex absolute values using the maximum coordinate and the coordinate ratio
• Suppose |x| > |y| for z = x + i y, then $|z| = |x|\sqrt{1 + (y/x)^2}$
• Denote the squared ratio $(y/x)^2$ by t
• Approximate the square root $\sqrt{1 + t}$ by a polynomial P(t)
• Constant-time algorithm (vs. Newton iteration)
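
The magnitude computation described above can be sketched as follows. The slides do not give the polynomial, so this example assumes a quadratic P(t) that interpolates sqrt(1 + t) at t = 0, 0.5 and 1; the coefficients and function names are illustrative, and the code is in floating point for readability (a fixed-point version would scale the coefficients to integers).

```python
import math

# Quadratic through (0, 1), (0.5, sqrt(1.5)), (1, sqrt(2)): an assumed choice.
_A = math.sqrt(1.5) - 1.0
_B = math.sqrt(2.0) - 1.0
C1 = 4.0 * _A - _B
C2 = 2.0 * _B - 4.0 * _A


def approx_sqrt_one_plus(t):
    """P(t) ~ sqrt(1 + t) for t in [0, 1], evaluated with Horner's rule."""
    return 1.0 + t * (C1 + t * C2)


def approx_magnitude(x, y):
    """|x + iy| ~ max(|x|, |y|) * P(t), where t is the squared coordinate ratio."""
    big, small = max(abs(x), abs(y)), min(abs(x), abs(y))
    if big == 0:
        return 0.0
    ratio = small / big
    return big * approx_sqrt_one_plus(ratio * ratio)
```

Because the ratio is always at most 1, t stays in [0, 1] and a fixed number of multiplications suffices, whereas a Newton iteration needs a data-dependent number of steps.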

Logarithm, Fixed-Point
• Use base 2 instead of base 10
  – Changing the base only multiplies the output by a constant
• Standard technique:
  – Reduce the problem to the interval [1,2)
  – Use linear interpolation from values stored in a look-up table
  – 8 bits used for indexing the look-up table values
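
The look-up technique above can be sketched with integer arithmetic only. The 8-bit table index and the linear interpolation follow the slide; the Q16 output format, the 24-bit mantissa normalization, and the function name are assumptions. The table is built here with `math.log2` for brevity, whereas an embedded version would store it as precomputed constants.

```python
import math

FRAC_BITS = 16  # assumed output format: Q16 fixed point

# 257 entries so that index + 1 is always valid during interpolation.
LOG2_TABLE = [round(math.log2(1 + i / 256) * (1 << FRAC_BITS)) for i in range(257)]


def log2_fixed(value):
    """Return log2(value) in Q16 for a positive integer value."""
    if value <= 0:
        raise ValueError("logarithm needs a positive input")
    exponent = value.bit_length() - 1  # integer part of log2

    # Normalize so the leading one sits at bit 24: mantissa in [2^24, 2^25),
    # which corresponds to the interval [1, 2).
    if exponent >= 24:
        mantissa = value >> (exponent - 24)
    else:
        mantissa = value << (24 - exponent)

    fraction = mantissa - (1 << 24)  # 24 fractional bits of the mantissa
    index = fraction >> 16           # top 8 bits select the table entry
    rest = fraction & 0xFFFF         # low 16 bits drive the interpolation

    low, high = LOG2_TABLE[index], LOG2_TABLE[index + 1]
    interpolated = low + (((high - low) * rest) >> 16)
    return (exponent << FRAC_BITS) + interpolated
```

For example, `log2_fixed(8)` returns `3 << 16`, and dividing any result by 65536 gives the logarithm as an ordinary number.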

Rest of System, Fixed-Point
• No improvement needed in VQ/GLA
• A similar technique as with the FFT should be applied to the other signal processing steps
  – Pre-emphasis: utilize the full 32 bits
  – Time windowing: use fewer bits in the windowing function
  – Filter bank (FB): use fewer bits in the frequency responses
  – DCT: use fewer bits for the cosines

Effect of Signal Processing
• TIMIT data sets, varying number of speakers (N)
• For each N repeat (6x, 5x, 2x) train/recognize cycles (eliminate GLA initial solution randomness)
• FFTGEN: FFT with 16/16 multiplication
• Fixed-point: use proposed 22/10 FFT
• Mixed: floating-point DSP, fixed-point GLA/VQ

Effect of Signal Quality
• GSM/PC data: 16 aligned dual recordings
• All computations in floating-point arithmetic
• Signal recorded with a laptop and PC microphone gives an average recognition rate of 100%
• Signal recorded with a Nokia 3660 results in an average recognition rate of 84.9%

Conclusion
• Speaker identification was ported to a Symbian Series 60 mobile phone
• 22/10 bit usage in multiplication is proposed instead of the “standard” 16/16
• Experiments indicate that recognition accuracy improves from 68% to 95%