Speech Recognition
Created by: Kanjariya Hardik G.
Introduction
Speech recognition technology has recently reached a higher level of performance and robustness, allowing a user to communicate with a machine by talking. Speech recognition is the process of decoding an acoustic speech signal, captured by a microphone or telephone, into a set of words. The whole utterance is recognized word by word.
Types of SR
There are two main types of speaker models: speaker independent and speaker dependent. Speaker independent models recognize the speech patterns of a large group of people. Speaker dependent models recognize the speech patterns of only one person. Both models use mathematical and statistical formulas to yield the best word match for speech. A third variation is now emerging, called speaker adaptive. Speaker adaptive systems usually begin with a speaker independent model and adjust it more closely to each individual during a brief training period.
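One simple way to picture speaker adaptation is as a weighted blend of a speaker independent parameter and the new speaker's own data. The sketch below is only an illustration under stated assumptions: it adapts a single model mean, and the function name `adapt_mean` and the `relevance` weight are hypothetical, not taken from any particular engine.

```python
def adapt_mean(si_mean, speaker_samples, relevance=10.0):
    """Blend a speaker independent mean with one speaker's training samples.

    relevance (assumed value) controls how strongly the prior model is
    trusted: with few samples the result stays near si_mean, and with
    many samples it moves toward the speaker's own average.
    """
    n = len(speaker_samples)
    return (relevance * si_mean + sum(speaker_samples)) / (relevance + n)


# Hypothetical feature values, e.g. an average pitch-like measurement.
no_training = adapt_mean(100.0, [])
brief_training = adapt_mean(100.0, [120.0] * 10)
long_training = adapt_mean(100.0, [120.0] * 1000)
```

With no training data the model is unchanged; a brief training period moves it part of the way toward the speaker, and more data moves it further.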
How does it work?
Speech produces a sound pressure wave, which forms an acoustic signal. The microphone receives the acoustic signal and converts it to an analogue signal. To store the analogue signal, it must be converted to a digital signal. A speech recognizer tries to transform a digitally encoded acoustic signal of natural-language speech into text in that language.
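The analogue-to-digital step can be sketched as sampling at fixed intervals and rounding each sample to an integer level. This is a minimal illustration, assuming a 16 kHz sample rate and 16-bit samples (common choices, but not stated in the slides); the `digitize` and `tone` names are hypothetical.

```python
import math


def digitize(signal, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a continuous signal and quantize it to signed integers.

    signal is a function of time (seconds) returning a value in [-1, 1].
    """
    max_level = 2 ** (bits - 1) - 1          # 32767 for 16-bit samples
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz               # time of the n-th sample
        value = max(-1.0, min(1.0, signal(t)))  # clip out-of-range input
        samples.append(round(value * max_level))
    return samples


def tone(t):
    """Stand-in for the analogue microphone signal: a 440 Hz sine wave."""
    return 0.5 * math.sin(2 * math.pi * 440 * t)


pcm = digitize(tone, duration_s=0.01)        # 10 ms of audio
```

The resulting list of integers is the kind of digitally encoded signal that the recognizer works on.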
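Speech Waveform/Spectrogram
The spectrogram is an alternative way to characterize speech. In the waveform, the louder the sound, the greater the amplitude on the y-axis. The spectrogram instead shows frequency (Hz) against time (s).
[Figure: waveform and spectrogram of the words "speech lab"; axes in Hz and s]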
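To make the difference between the two views concrete, a spectrogram can be computed by cutting the waveform into short frames and taking the magnitude of a Fourier transform of each frame. The sketch below uses a plain (slow) discrete Fourier transform from the standard library so it stays self-contained; the frame size, hop and 8 kHz sample rate are assumptions chosen for the example.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one list of frequency-bin magnitudes per frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # A Hann window reduces leakage at the frame edges.
        windowed = [
            x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_size - 1)))
            for i, x in enumerate(frame)
        ]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # bins from 0 Hz to Nyquist
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * i / frame_size)
                for i, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))
        frames.append(magnitudes)
    return frames


sample_rate = 8000
frame_size = 64
# Test signal: a pure 1000 Hz tone instead of real speech.
wave = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(256)]
spec = spectrogram(wave, frame_size=frame_size)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / frame_size
```

For the pure tone, the strongest bin in each frame sits at the tone's frequency; real speech would show several changing bands over time.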
Speech Recognition Process Flow
The major components
Audio input, Grammar, Acoustic Model, Recognized text
Audio I/O
It is important to understand that this audio stream is rarely pristine. It contains not only the speech data (what was said) but also background noise. This noise can interfere with the recognition process, and the speech engine must handle (and possibly even adapt to) the environment in which the audio is spoken.
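One simple way an engine might separate speech from background noise is to track a running estimate of the noise energy and flag frames that are much louder. This is a sketch of that idea only, not how any specific engine works; the `threshold_ratio` and `adapt_rate` values are assumptions, and the first frame is assumed to contain only background noise.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def detect_speech(frames, threshold_ratio=4.0, adapt_rate=0.1):
    """Label each frame True (speech) or False (background noise)."""
    if not frames:
        return []
    noise_floor = frame_energy(frames[0])  # assumed to be noise only
    labels = []
    for frame in frames:
        energy = frame_energy(frame)
        is_speech = energy > threshold_ratio * noise_floor
        if not is_speech:
            # Adapt slowly to the environment using noise-only frames.
            noise_floor = (1 - adapt_rate) * noise_floor + adapt_rate * energy
        labels.append(is_speech)
    return labels


quiet = [0.01, -0.01] * 4   # hypothetical background-noise frame
loud = [0.5, -0.5] * 4      # hypothetical speech frame
labels = detect_speech([quiet, quiet, loud, loud, quiet])
```

Because the noise floor is updated only on non-speech frames, the threshold follows gradual changes in the environment without being pulled up by the speech itself.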
Acoustic + Grammar
Once the speech data is in the proper format, the engine searches for the best match. It does this by taking into consideration the words and phrases it knows about (the active grammars), along with its knowledge of the environment in which it is operating. The knowledge of the environment is provided in the form of an acoustic model. Once it identifies the most likely match for what was said, it returns what it recognized as a text string.
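The effect of the active grammar on the search can be illustrated with a toy example: the acoustic model scores every candidate phrase, and the grammar removes candidates the application does not expect. The scores, the phrases and the `best_match` function are all hypothetical; a real engine searches over phoneme sequences rather than whole phrases.

```python
def best_match(acoustic_scores, active_grammar):
    """Return the highest-scoring phrase allowed by the active grammar.

    acoustic_scores maps candidate phrases to log-likelihoods
    (higher is better); active_grammar is a set of permitted phrases.
    Returns None when no candidate is in the grammar.
    """
    allowed = {
        phrase: score
        for phrase, score in acoustic_scores.items()
        if phrase in active_grammar
    }
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


# Made-up scores: acoustically, the wrong phrase is slightly ahead.
scores = {
    "wreck a nice beach": -11.8,
    "recognize speech": -12.1,
    "recognise peach": -15.0,
}
grammar = {"recognize speech", "open file"}

text = best_match(scores, grammar)
unconstrained = best_match(scores, set(scores))
```

Here the grammar turns a near miss into the intended result, which is why constraining the vocabulary tends to improve accuracy.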
About SR Engine
SR requires a software application "engine" with built-in logic to decipher and act on the spoken word.
Sound card: converts the analogue signal from the microphone to a digital signal.
Function of the SR engine: converts the digital signal to phonemes, and the phonemes to words.
Different SR engines
CMU Sphinx
Microsoft SAPI
IBM ViaVoice
Decoding process.
Recognition Process Flow Summary
Step 1: User input. The system captures the user's voice in the form of an analogue acoustic signal.
Step 2: Digitization. The analogue acoustic signal is digitized.
Step 3: Phonetic breakdown. The signal is broken into phonemes.
Recognition Process Flow Summary
Step 4: Statistical modeling. Phonemes are mapped to their phonetic representation using a statistical model.
Step 5: Matching. Using the grammar, the phonetic representation and the dictionary, the system returns an n-best list (i.e. candidate words, each with a confidence score).
Grammar: the set of words or phrases that constrains the range of input or output in the voice application.
Dictionary: the table mapping phonetic representations to words (e.g. "thu" and "thee" both map to "the").
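The matching step can be sketched as a dictionary lookup followed by ranking. The example below is illustrative only: apart from "thu" and "thee", the dictionary entries, the scores and the `n_best` function are hypothetical, and the confidence is simply each word's share of the total likelihood among the surviving candidates.

```python
import math

# Phonetic representation -> word. Only "thu"/"thee" come from the
# slides; the other entries are made up for the example.
DICTIONARY = {"thu": "the", "thee": "the", "kat": "cat", "kap": "cap"}


def n_best(phonetic_scores, grammar, n=3):
    """Return up to n (word, confidence) pairs, best first.

    phonetic_scores maps phonetic strings to log-likelihoods.
    Words outside the grammar or the dictionary are discarded.
    """
    word_scores = {}
    for phonetic, log_score in phonetic_scores.items():
        word = DICTIONARY.get(phonetic)
        if word is None or word not in grammar:
            continue
        # Several pronunciations may map to one word: keep the best.
        word_scores[word] = max(word_scores.get(word, -math.inf), log_score)
    total = sum(math.exp(s) for s in word_scores.values())
    if total == 0:
        return []
    ranked = sorted(word_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, math.exp(s) / total) for word, s in ranked[:n]]


result = n_best({"kat": -1.0, "kap": -2.0, "thu": -3.0}, grammar={"cat", "cap"})
```

In this run "the" is dropped because it is not in the grammar, and the two remaining words are returned in order of confidence.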
REPRESENTATION OF SOFTWARE
Challenges and Difficulties of SR
Speech recognition is still a very difficult problem. The main problems are:
Speaker variability: two speakers, or even the same speaker, will pronounce the same word differently.
Channel variability: the quality and position of the microphone and the background environment affect the output.
Current Software Options for PC
Dragon Systems: NaturallySpeaking
Philips: FreeSpeech
IBM: ViaVoice
Lernout & Hauspie: Voice Xpress