Spoken Digit Recognition Yi-Pei Chen 5/2/2016
Motivation
Steps
Pipeline: Speech Acquisition -> Signal Preprocessing -> Feature Extraction -> Classifier -> Output
- Speech Acquisition: continuous speech waveform
- Signal Preprocessing: elimination of background noise, framing, windowing
- Feature Extraction: MFCC
- Classifier: MLP, KNN, SVM
Preprocessing consists of background noise elimination, framing, and windowing. Background noise is removed from the data. The continuous speech is then separated into frames; this step is known as framing. Windowing selects the portion of the speech signal that is analyzed in each frame. Feature extraction identifies the components of the audio signal that are useful for identifying the linguistic content.
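The framing and windowing steps can be sketched in a few lines of Python. This is only an illustration, not the code used in this project: the function names `frame_signal`, `hamming`, and `window_frames`, and the frame and hop lengths in the usage lines, are assumptions chosen for the example. The Hamming window is used because it is the window listed for the dataset.

```python
import math

def frame_signal(samples, frame_len, hop_len):
    """Split a sequence into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(list(samples[start:start + frame_len]))
    return frames

def hamming(n):
    """Symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def window_frames(frames):
    """Multiply every frame by a Hamming window of matching length."""
    if not frames:
        return []
    w = hamming(len(frames[0]))
    return [[s * wi for s, wi in zip(frame, w)] for frame in frames]

# Toy usage (assumed values): 10 samples, frames of 4 samples, hop of 2 samples.
signal = [float(i) for i in range(10)]
frames = frame_signal(signal, frame_len=4, hop_len=2)
windowed = window_frames(frames)
```

Overlapping frames (hop shorter than the frame) are a common choice because the window attenuates the samples near each frame edge; the overlap lets those samples contribute near the center of a neighboring frame.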
Mel Frequency Cepstral Coefficients
Speech: sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to represent this envelope accurately. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced.
Process:
1. Take the Fourier transform of (a windowed excerpt of) a signal.
2. Map the powers of the spectrum onto the mel scale, using triangular overlapping windows.
3. Take the logs of the powers at each of the mel frequencies.
4. Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
5. The MFCCs are the amplitudes of the resulting spectrum.
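The MFCC process above can be sketched for a single windowed frame using only the Python standard library. This is a minimal illustration under stated assumptions, not the extraction code behind the dataset: the mel formula 2595 * log10(1 + f / 700) is one common convention, the default of 26 filters is an assumption, and all function names are hypothetical. A naive DFT is used for clarity; a real implementation would use an FFT.

```python
import cmath
import math

def hz_to_mel(f):
    # One common mel-scale convention (assumed here).
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Step 1: naive DFT power spectrum for bins 0..N/2."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Step 2: triangular overlapping filters, evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points, expressed as (fractional) DFT bin positions.
    edges = [mel_to_hz(top * i / (n_filters + 1)) * n_fft / sample_rate
             for i in range(n_filters + 2)]
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_bins):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def dct2(values, n_out):
    """Step 4: unnormalized DCT-II, keeping the first n_out coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * c * (i + 0.5) / n)
                for i, v in enumerate(values))
            for c in range(n_out)]

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """Compute MFCCs for one already-windowed frame."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    energies = [sum(w * p for w, p in zip(row, spec)) for row in bank]
    # Step 3: log of each mel-band energy (small offset avoids log(0)).
    log_energies = [math.log(e + 1e-10) for e in energies]
    # Step 5: the DCT amplitudes are the MFCCs.
    return dct2(log_energies, n_coeffs)

# Toy usage: a 1 kHz tone sampled at 11025 Hz, 256 samples, Hamming-windowed.
rate, n = 11025, 256
tone = [math.sin(2 * math.pi * 1000.0 * t / rate)
        * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
        for t in range(n)]
coeffs = mfcc_frame(tone, rate)
```

Keeping only the first 13 DCT coefficients retains the smooth spectral envelope and discards fine detail, which matches the 13 MFCCs per frame in the dataset used here.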
Dataset: Spoken Arabic Digit
- Speakers: 44 male and 44 female native Arabic speakers
- Collected by the Laboratory of Automatic and Signals, University of Badji-Mokhtar Annaba, Algeria
- 8800 time series of 13 MFCCs (10 digits x 10 repetitions x 88 speakers)
- Sampling rate: 11025 Hz, 16 bits
- Window applied: Hamming window
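The dataset sizes follow directly from the figures above, and a short sketch makes the arithmetic explicit. The second half of the block is an assumption added for illustration: utterances differ in length, so each is a time series with a varying number of 13-coefficient frames, and averaging every coefficient over time is one simple way to obtain a fixed-length vector. The helper `mean_vector` and the toy utterance are hypothetical and are not taken from the dataset or from this project.

```python
N_DIGITS = 10
N_REPETITIONS = 10
N_SPEAKERS = 88      # 44 male + 44 female
N_COEFFS = 13        # MFCCs per frame

total = N_DIGITS * N_REPETITIONS * N_SPEAKERS   # all utterances
per_digit = N_REPETITIONS * N_SPEAKERS          # utterances of each digit
per_speaker = N_DIGITS * N_REPETITIONS          # utterances from each speaker

def mean_vector(series):
    """Average each coefficient over time: (n_frames x 13) -> 13 values."""
    n_frames = len(series)
    return [sum(frame[c] for frame in series) / n_frames
            for c in range(len(series[0]))]

# Toy stand-in for one utterance: 3 frames of 13 coefficients (not real data).
toy = [[float(f + c) for c in range(N_COEFFS)] for f in range(3)]
fixed = mean_vector(toy)
```

Averaging over time discards the order of the frames, so it is only a baseline representation for classifiers that need fixed-size inputs.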
Current State Working on choosing the best classifier
References
- M. Kalamani, S. Valarmathy, S. Anitha. "Automatic Speech Recognition using ELM and KNN Classifiers." IJIRCCE, Vol. 3, Issue 4, April 2015.
- Richard P. Lippmann. "Neural Network Classifiers for Speech Recognition." The Lincoln Laboratory Journal, Volume 1, Number 1 (1988), 1-18, MIT.
- Jean Hennebert, Martin Hasler, Hervé Dedieu. "Neural Networks in Speech Recognition." Department of Electrical Engineering, Swiss Federal Institute of Technology.
- Abdul Ahad, Ahsan Fayyaz, Tariq Mehmood. "Speech Recognition using Multilayer Perceptron." Students Conference, 2002. ISCON '02. Proceedings. IEEE, Vol. 1, p. 103-109.
- Issam Bazzi. "Using Support Vector Machines for Spoken Digit Recognition." p. 48-49, MIT Laboratory for Computer Science, Spoken Language Systems Group.
- MFCC tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/