Speech Recognition Yonglei Tao. Voice-Activated GPS.

Speech Recognition Yonglei Tao

Voice-Activated GPS

Voice User Interface (VUI)
- A VUI allows human interaction with computers through a voice/speech platform
- Basic components
  - Speech recognition
  - Meaning extraction
  - Response generation
  - Speech output
- Benefits
  - Loosens some physical constraints
  - Provides tools for universal design (disability and situational impairments)
  - Intuitive and efficient

System Architecture

Components  Endpointing  Speech to endpointed utterance  Feature extraction  Endpointed utterance to feature vectors  Recognition  Feature vectors to word string(s)  Natural language understanding  Word string(s) to meaning(s)  Dialog management  Meaning to actions

Typical Recognition Components

Examples  Book, boot  Write, right  Flew, flu, flue  Eight books  Ate books  I scream  Ice cream

Components  Acoustic models  Internal representation of each basic sound  Dictionary  A list of words and pronunciations  Grammar  Defines all possible strings of words the recognizer can handle  Allows to associate a meaning with those strings  Either rule-based or statistical

Recognition  Recognition search  A recognizer searches the recognition model to find the best-matching word string  Confidence measures  A quantitative measure of how confident the recognizer is for the best-matching string  VUI developers can use those measures in several ways  N-Best processing  A recognizer returns several results with a confidence measure for each

Speech Recognition Engines
- Microsoft Visual Studio and CMU Sphinx
  - Grammar
- Android
  - Language model: free form for dictation, or web search for short phrases
- Google Web Speech API for web applications

BNF (Backus-Naur Form)
- Notation for context-free grammars
- Often used to describe the syntax of programming languages
- Also used to specify the words and patterns of words a speech recognizer listens for
- EBNF (Extended Backus-Naur Form)
- ABNF (Augmented Backus-Naur Form)
- Basis for speech grammar specifications
  - ABNF for .NET
  - Regular grammar for Java

Basics
- ::= means "is defined as"
- | means "or"
- < > enclose a category name
- Terminal: a basic component
- <x> ::= a b c (a sequence)
- <x> ::= a | b | c (optional: one of a, b, or c)
- <x> ::= a | a <x> (one or more)

An Example  Grammar for a speech recognition calculator  Reference: Grammar creation in C# https://msdn.microsoft.com/en-us/library/hh538495%28v=office.14%29.aspx

Speech to Text in C#

    using System;
    using System.Speech.Recognition;
    using System.Speech.Synthesis;
    using System.Threading;

    class Program
    {
        // Signaled by the "exit" command so that Main can stop waiting.
        static ManualResetEvent _completed = null;

        static void Main(string[] args)
        {
            _completed = new ManualResetEvent(false);
            SpeechRecognitionEngine _recognizer = new SpeechRecognitionEngine();
            // Load one single-phrase grammar per command.
            _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("test")) { Name = "testGrammar" });
            _recognizer.LoadGrammar(new Grammar(new GrammarBuilder("exit")) { Name = "exitGrammar" });
            _recognizer.SpeechRecognized += _recognizer_SpeechRecognized; // add an event handler
            _recognizer.SetInputToDefaultAudioDevice();
            _recognizer.RecognizeAsync(RecognizeMode.Multiple);
            // …
            _completed.WaitOne();  // wait until speech recognition is completed
            _recognizer.Dispose(); // dispose the speech recognition engine
        }

        // Called when an utterance matches one of the loaded grammars.
        static void _recognizer_SpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            if (e.Result.Text == "test")
            {
                Console.WriteLine("The test was successful!");
            }
            else if (e.Result.Text == "exit")
            {
                _completed.Set();
            }
        }

        // Called when the recognizer cannot match the input with enough confidence.
        static void _recognizer_SpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
        {
            if (e.Result.Alternates.Count == 0)
            {
                Console.WriteLine("Speech rejected. No candidate phrases found.");
                return;
            }
            Console.WriteLine("Speech rejected. Did you mean:");
            foreach (RecognizedPhrase r in e.Result.Alternates)
            {
                Console.WriteLine(" " + r.Text);
            }
        }
    }

Text to Speech in C#

    SpeechSynthesizer _synthesizer = new SpeechSynthesizer();
    _synthesizer.Speak("Now the computer is speaking to you.");
    ...
    _synthesizer.Dispose(); // dispose the SpeechSynthesizer

References  SpeechRecognitionEngine Class  https://msdn.microsoft.com/en- us/library/system.speech.recognition.speechrecognitionengine%28v=vs.1 10%29.aspx?cs-save-lang=1&cs-lang=vb#code-snippet-1 https://msdn.microsoft.com/en- us/library/system.speech.recognition.speechrecognitionengine%28v=vs.1 10%29.aspx?cs-save-lang=1&cs-lang=vb#code-snippet-1  Speech recognition, speech to text, text to speech, and speech synthesis in C#  http://www.codeproject.com/Articles/483347/Speech-recognition- speech-to-text-text-to-speech-a http://www.codeproject.com/Articles/483347/Speech-recognition- speech-to-text-text-to-speech-a

Visual Studio Speech Recognizer

Speech Recognition with Visual Studio
- Examples
  - http://www.phon.ucl.ac.uk/courses/spsci/compmeth/speech/recognition.html
  - http://blogs.msdn.com/b/devschool/archive/2012/02/06/speech-recognition-using-visual-studio-determining-the-bna.aspx
- Grammar Class
  - http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammar.aspx
- GrammarBuilder Class
  - http://msdn.microsoft.com/en-us/library/system.speech.recognition.grammarbuilder.aspx

Speech Recognition for Java
- Sphinx 4
  - A speech recognition engine written entirely in Java
  - Created by CMU, Sun, Mitsubishi, HP, …
  - Open source
  - Compliant with JSpeech Grammar Format
  - Platform- and vendor-independent
  - Programmer's guide: http://cmusphinx.sourceforge.net/sphinx4/
  - An example: https://www.assembla.com/code/sonido/subversion/nodes/4/sphinx4/src/apps/edu/cmu/sphinx/demo/helloworld

Android Speech Recognition

    import android.app.Activity;
    import android.content.Intent;
    import android.os.Bundle;
    import android.speech.RecognizerIntent;
    import android.view.Menu;
    import android.view.View;
    import android.widget.Button;
    import android.widget.TextView;
    import android.widget.Toast;

    import java.util.ArrayList;
    import java.util.Locale;

    public class MainActivity extends Activity {
        private static final int VOICE_RECOGNITION = 1;
        Button speakButton;
        TextView spokenWords;

        @Override
        protected void onCreate(Bundle savedInstanceState) {
            super.onCreate(savedInstanceState);
            setContentView(R.layout.activity_main);
            speakButton = (Button) findViewById(R.id.button1);
            spokenWords = (TextView) findViewById(R.id.textView1);
        }

        @Override
        public boolean onCreateOptionsMenu(Menu menu) {
            // Inflate the menu; this adds items to the action bar if it is present.
            getMenuInflater().inflate(R.menu.main, menu);
            return true;
        }

        // Click handler for the speak button (assumed to be wired up with
        // android:onClick in the layout, which the slides do not show).
        public void btnSpeak(View view) {
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            // Specify free form input
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            intent.putExtra(RecognizerIntent.EXTRA_PROMPT, "Please start speaking");
            intent.putExtra(RecognizerIntent.EXTRA_MAX_RESULTS, 1);
            // EXTRA_LANGUAGE expects a language tag string such as "en".
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE, Locale.ENGLISH.toLanguageTag());
            startActivityForResult(intent, VOICE_RECOGNITION);
        }

        @Override
        protected void onActivityResult(int requestCode, int resultCode, Intent data) {
            if (requestCode == VOICE_RECOGNITION && resultCode == RESULT_OK) {
                ArrayList<String> results =
                        data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
                // TODO Do something with the recognized voice strings
                if (results != null && !results.isEmpty()) {
                    Toast.makeText(this, results.get(0), Toast.LENGTH_SHORT).show();
                    spokenWords.setText(results.get(0));
                }
            }
            super.onActivityResult(requestCode, resultCode, data);
        }
    }

Android and Web Speech Recognition
- Android Voice Recognition Tutorial
  - http://www.javacodegeeks.com/2012/08/android-voice-recognition-tutorial.html
  - http://code4reference.com/2012/07/tutorial-android-voice-recognition/
- Google Web Speech Recognition Examples
  - http://stiltsoft.com/blog/2013/05/google-chrome-how-to-use-the-web-speech-api/
  - http://stackoverflow.com/questions/17635354/developing-a-simple-voice-driven-web-app-using-web-speech-api
  - http://apprentice.craic.com/tutorials/37

Challenges for VUI Design
- People have very little patience for a "machine that does not understand"
- VUIs need to respond to input reliably, or their users will reject them
- Designing a usable VUI requires interdisciplinary talent in computer science, linguistics, and human factors
- The more closely the VUI matches the user's mental model of the task, the easier it is to use with little or no training, which yields both higher efficiency and higher user satisfaction

Natural Language Understanding
- Ambiguity
  - Refers to phrases that look distinct in print but sound similar when spoken, for example:
    - "Wreck a nice beach"
    - "Recognize speech"
  - As the vocabulary and grammar get larger, the potential for ambiguity increases
  - Short words and phrases are harder to recognize than longer ones

Language Understanding (Cont.)  Deviation  Deviating from what the developer expects  For example, an issue with the question “Is that correct?”  Expecting a simple response like “Yes”, “No”, or “Correct”  Southern speakers would respond with “Yes, ma’am” or “No, ma’am”

Discussion  What you would expect if the user asks to start Microsoft Word?  Please start word  Could you start word  Start word  Please open word  Could you open word  Open word

Language Understanding (Cont.)  Keyword Extraction  Important for applications built with a speech recognizer that returns a string containing the actual words spoke by the user  Leaving the application to interpret their semantic meaning  One might say “Computer, find me some information about the flooding in Detroit recently“  Keywords like “find”, “flooding”, and “Detroit” are crucial for an accurate response from the VUI  Others are filler words

Dialog Management  Multi-modelity  Interaction can occur through different mediums  Need to consider when and which part of the application allows to be multi-model  Grammar  There is a close relationship between what a prompt says and what the caller ends up saying to the system  Especially the words used  Configuration files  You may choose the confidence level at which the recognizer will reject the input rather than return the answer  You may also choose parameters for the endpointer, that is, how long it should listen before timing out

Dialog Management (Cont.)  Error handling  Allow the user to be able to recover after errors and get the dialog with the user back on track  Recognition does not always succeed. When it fails, there are a number of messages the recognizer may return to the application.  Voice recognition accuracy  In-grammar data  Out-grammar data

Error Handling  In-grammar data  Correct Accept  the recognizer returned the correct answer  False Accept  the recognizer returned the wrong answer  False Reject  the recognizer could not find match and gave up  Out-of-grammar data  Correct Reject  the recognizer correctly rejected the input  False Accept  the recognizer returned a value that is wrong because the input is not in the grammar  How to handle each categories?

Error Handling in Android
