EEC-693/793 Applied Computer Vision with Depth Cameras
Lecture 12
Wenbing Zhao
wenbing@ieee.org

Outline
- Speech recognition
  - How speech recognition works
  - Exploring the Microsoft Speech API (SAPI)
  - Creating your own grammar and choices for the speech recognition engine
- Draw What I Want: building a speech-enabled application

How Speech Recognition Works
- The Kinect microphone array captures the audio stream and converts the analog audio into digital sound signals
- The digital sound signals are sent to the speech recognition engine for recognition
- The acoustic model of the speech recognition engine analyzes the audio and converts the sound into a sequence of basic speech elements, called phonemes
- The language model then analyzes the content of the speech and matches words by combining the phonemes and looking them up in a built-in dictionary
  - The matching is context sensitive

Types of Speech Recognition
- Command mode: you say one command at a time for the speech recognition engine to recognize
- Sentence mode / dictation mode: you say a sentence to perform an operation, e.g., "mirror the shape"

Microsoft Speech API
- The Kinect SDK comes with the Microsoft Kinect speech recognition language pack

SpeechRecognitionEngine Class
- The InstalledRecognizers method of the SpeechRecognitionEngine class returns the list of recognizers installed on the system; we can filter the list based on the recognizer ID
- The SpeechRecognitionEngine class accepts an audio stream from the Kinect sensor and processes it
- The SpeechRecognitionEngine class raises a sequence of events as it processes the audio stream:
  - SpeechDetected is raised if the audio appears to be speech
  - SpeechHypothesized then fires multiple times as words are tentatively detected
  - Finally, SpeechRecognized is raised when the recognizer finds speech that matches a loaded grammar
  - If speech is detected but does not match properly, or the match has a very low confidence level, the SpeechRecognitionRejected event fires instead

Steps for building speech-enabled apps
1. Enable the Kinect audio source
2. Start capturing the audio data stream
3. Identify the speech recognizer
4. Define the grammar for the speech recognizer
5. Start the speech recognizer
6. Attach the speech audio source to the recognizer
7. Register the event handlers for speech recognition
8. Handle the different events raised by the speech recognition engine

Identify the speech recognizer
    private static RecognizerInfo GetKinectRecognizer()
    {
        foreach (RecognizerInfo recognizer in SpeechRecognitionEngine.InstalledRecognizers())
        {
            string value;
            recognizer.AdditionalInfo.TryGetValue("Kinect", out value);

            // Pick the recognizer that is flagged for Kinect and uses the en-US culture
            if ("True".Equals(value, StringComparison.OrdinalIgnoreCase) &&
                "en-US".Equals(recognizer.Culture.Name, StringComparison.OrdinalIgnoreCase))
            {
                return recognizer;
            }
        }
        return null; // no matching recognizer is installed
    }

    // Usage
    RecognizerInfo recognizerInfo = GetKinectRecognizer();

Define grammar for the speech recognizer
- Use Choices and GrammarBuilder
- Multiple sets of choices can be appended to a GrammarBuilder
- Create a Grammar from the GrammarBuilder
- Load the grammar into the speech recognizer

    var grammarBuilder = new GrammarBuilder { Culture = recognizerInfo.Culture };

    // First set of choices: the color
    var colorObjects = new Choices();
    colorObjects.Add("red");
    colorObjects.Add("green");
    colorObjects.Add("blue");
    colorObjects.Add("yellow");
    colorObjects.Add("gray");
    grammarBuilder.Append(colorObjects);

    // Second set of choices: the object to draw
    grammarBuilder.Append(new Choices("circle", "square", "triangle", "rectangle"));

    // Create the Grammar from the GrammarBuilder
    var grammar = new Grammar(grammarBuilder);

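Each set of choices appended to the builder contributes one word to the phrase, in order, so the grammar above accepts every color followed by every shape. The sketch below enumerates those phrases in plain C# without the speech library. GrammarPhrases and Expand are hypothetical names introduced only for this illustration; they are not part of Microsoft.Speech.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GrammarPhrases
{
    // Hypothetical helper (not part of Microsoft.Speech): expands ordered
    // choice sets into every phrase made of one word from each set, in order.
    public static List<string> Expand(IEnumerable<string[]> choiceSets)
    {
        var phrases = new List<string> { "" };
        foreach (var choices in choiceSets)
        {
            phrases = phrases
                .SelectMany(prefix => choices.Select(word =>
                    prefix.Length == 0 ? word : prefix + " " + word))
                .ToList();
        }
        return phrases;
    }

    static void Main()
    {
        var colors = new[] { "red", "green", "blue", "yellow", "gray" };
        var shapes = new[] { "circle", "square", "triangle", "rectangle" };

        var phrases = Expand(new[] { colors, shapes });
        Console.WriteLine(phrases.Count); // 5 colors x 4 shapes = 20
        Console.WriteLine(phrases[0]);    // red circle
    }
}
```

Adding a third set of choices multiplies the phrase count again, which is worth keeping in mind when a grammar grows.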
Define grammar for the speech recognizer
- A grammar can also be built from an XML (SRGS) file

    // Requires: using Microsoft.Speech.Recognition.SrgsGrammar;
    SrgsDocument grammarDoc = new SrgsDocument("mygrammar.xml");
    Grammar grammar = new Grammar(grammarDoc);

  mygrammar.xml (the root attribute must name a rule defined in the file):

    <?xml version="1.0" encoding="UTF-8" ?>
    <grammar version="1.0" xml:lang="en-US"
             xmlns="http://www.w3.org/2001/06/grammar"
             tag-format="semantics/1.0" root="color">
      <rule id="color" scope="public">
        <one-of>
          <item>red</item>
          <item>green</item>
          <item>blue</item>
        </one-of>
      </rule>
    </grammar>

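To see which words an XML grammar rule accepts without loading it into a recognizer, the file can be read with System.Xml.Linq. This is only a sketch for inspecting the file. SrgsItems and ItemsOfRule are hypothetical names, and the embedded XML is a trimmed copy of the grammar whose root attribute is assumed to point at the color rule.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class SrgsItems
{
    // Hypothetical helper: returns the <item> texts of the rule with the given id.
    public static string[] ItemsOfRule(string srgsXml, string ruleId)
    {
        XNamespace ns = "http://www.w3.org/2001/06/grammar";
        return XDocument.Parse(srgsXml)
            .Descendants(ns + "rule")
            .Where(rule => (string)rule.Attribute("id") == ruleId)
            .Descendants(ns + "item")
            .Select(item => item.Value.Trim())
            .ToArray();
    }

    static void Main()
    {
        // Trimmed copy of the grammar; root="color" is an assumption of this sketch.
        const string xml = @"<grammar version=""1.0"" xml:lang=""en-US""
                 xmlns=""http://www.w3.org/2001/06/grammar"" root=""color"">
          <rule id=""color"" scope=""public"">
            <one-of>
              <item>red</item>
              <item>green</item>
              <item>blue</item>
            </one-of>
          </rule>
        </grammar>";

        Console.WriteLine(string.Join(", ", ItemsOfRule(xml, "color"))); // red, green, blue
    }
}
```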
Define grammar for the speech recognizer
- Multiple grammars can be loaded into the recognizer

Define grammar for the speech recognizer
    private void BuildGrammarforRecognizer(RecognizerInfo recognizerInfo)
    {
        var grammarBuilder = new GrammarBuilder { Culture = recognizerInfo.Culture };

        // First word of the command: "draw"
        grammarBuilder.Append(new Choices("draw"));

        // Second word: the color
        var colorObjects = new Choices();
        colorObjects.Add("red");
        colorObjects.Add("green");
        colorObjects.Add("blue");
        colorObjects.Add("yellow");
        colorObjects.Add("gray");
        grammarBuilder.Append(colorObjects);

        // Third word: the object to draw
        grammarBuilder.Append(new Choices("circle", "square", "triangle", "rectangle"));

        // Create the Grammar from the GrammarBuilder
        var grammar = new Grammar(grammarBuilder);

        // Create a second grammar for closing the application
        var newGrammarBuilder = new GrammarBuilder { Culture = recognizerInfo.Culture };
        newGrammarBuilder.Append("close the application");
        var grammarClose = new Grammar(newGrammarBuilder);

        // Start the speech recognizer and load both grammars into it
        speechEngine = new SpeechRecognitionEngine(recognizerInfo.Id);
        speechEngine.LoadGrammar(grammar);
        speechEngine.LoadGrammar(grammarClose);

        // Attach the speech audio source to the recognizer
        // (16 kHz, 16-bit, mono PCM)
        int samplesPerSecond = 16000;
        int bitsPerSample = 16;
        int channels = 1;
        int averageBytesPerSecond = 32000;
        int blockAlign = 2;
        speechEngine.SetInputToAudioStream(
            audioStream,
            new SpeechAudioFormatInfo(EncodingFormat.Pcm, samplesPerSecond, bitsPerSample,
                                      channels, averageBytesPerSecond, blockAlign, null));

        // Register the event handlers for speech recognition
        speechEngine.SpeechRecognized += speechRecognized;
        speechEngine.SpeechHypothesized += speechHypothesized;
        speechEngine.SpeechRecognitionRejected += speechRecognitionRejected;

        // Keep recognizing until recognition is cancelled or stopped
        speechEngine.RecognizeAsync(RecognizeMode.Multiple);
    }

- RecognizeAsync(): performs a single, asynchronous recognition operation
- RecognizeAsync(RecognizeMode.Multiple), as used here, does not stop after the first result; it keeps recognizing until it is cancelled or stopped
- In both cases the recognizer performs the operation against its loaded and enabled speech recognition grammars

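The audio format values passed to SpeechAudioFormatInfo are not independent: for uncompressed PCM, the block alignment and the average byte rate follow from the sample rate, the sample size, and the channel count. The sketch below recomputes them. PcmFormat and its methods are hypothetical names used only for this illustration.

```csharp
using System;

static class PcmFormat
{
    // Bytes in one sample frame (one sample for every channel).
    public static int BlockAlign(int bitsPerSample, int channels)
    {
        return channels * bitsPerSample / 8;
    }

    // Bytes of audio data produced per second of uncompressed PCM.
    public static int AverageBytesPerSecond(int samplesPerSecond, int bitsPerSample, int channels)
    {
        return samplesPerSecond * BlockAlign(bitsPerSample, channels);
    }

    static void Main()
    {
        // Values from the slide: 16 kHz, 16-bit, mono
        Console.WriteLine(BlockAlign(16, 1));                   // 2
        Console.WriteLine(AverageBytesPerSecond(16000, 16, 1)); // 32000
    }
}
```

Deriving the two values instead of hard-coding them keeps the format consistent if the sample rate or channel count is changed later.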
Handle the different events raised by the speech recognition engine
    private void speechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
    {
        // Intentionally empty: rejected speech is ignored
    }

    private void speechHypothesized(object sender, SpeechHypothesizedEventArgs e)
    {
        // Show the tentative words while the user is still speaking
        wordsTenative.Text = e.Result.Text;
    }

    private void speechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        wordsRecognized.Text = e.Result.Text;
        confidenceTxt.Text = e.Result.Confidence.ToString();

        // Only act on results the recognizer is reasonably confident about
        float confidenceThreshold = 0.6f;
        if (e.Result.Confidence > confidenceThreshold)
        {
            CommandsParser(e);
        }
    }

    private void CommandsParser(SpeechRecognizedEventArgs e)
    {
        var result = e.Result;
        Color objectColor;
        Shape drawObject;
        System.Collections.ObjectModel.ReadOnlyCollection<RecognizedWordUnit> words = result.Words;

        if (words[0].Text == "draw")
        {
            // Second word: the color
            string colorObject = words[1].Text;
            switch (colorObject)
            {
                case "red": objectColor = Colors.Red; break;
                case "green": objectColor = Colors.Green; break;
                case "blue": objectColor = Colors.Blue; break;
                case "yellow": objectColor = Colors.Yellow; break;
                case "gray": objectColor = Colors.Gray; break;
                default: return;
            }

            // Third word: the shape
            var shapeString = words[2].Text;
            switch (shapeString)
            {
                case "circle":
                    drawObject = new Ellipse();
                    drawObject.Width = 100;
                    drawObject.Height = 100;
                    break;
                case "square":
                    drawObject = new Rectangle();
                    drawObject.Width = 100;
                    drawObject.Height = 100;
                    break;
                case "rectangle":
                    drawObject = new Rectangle();
                    drawObject.Width = 100;
                    drawObject.Height = 60;
                    break;
                case "triangle":
                    var polygon = new Polygon();
                    polygon.Points.Add(new Point(0, 30));
                    polygon.Points.Add(new Point(-60, -30));
                    polygon.Points.Add(new Point(60, -30));
                    drawObject = polygon;
                    break;
                default:
                    return;
            }

            // Replace whatever is on the canvas with the new shape
            canvas1.Children.Clear();
            drawObject.SetValue(Canvas.LeftProperty, 80.0);
            drawObject.SetValue(Canvas.TopProperty, 80.0);
            drawObject.Fill = new SolidColorBrush(objectColor);
            canvas1.Children.Add(drawObject);
        }

        if (words[0].Text == "close" && words[1].Text == "the" && words[2].Text == "application")
        {
            this.Close();
        }
    }

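The decision logic of the handler and the parser (reject low-confidence results, then check the three words) can be exercised without a Kinect or WPF. The sketch below is a simplified stand-in, not the code from the slides: DrawCommand and Parse are hypothetical names, and the result is returned as a "color shape" string instead of being drawn on the canvas.

```csharp
using System;
using System.Collections.Generic;

static class DrawCommand
{
    static readonly HashSet<string> Colors =
        new HashSet<string> { "red", "green", "blue", "yellow", "gray" };
    static readonly HashSet<string> Shapes =
        new HashSet<string> { "circle", "square", "triangle", "rectangle" };

    // Hypothetical helper: returns "color shape" for an accepted draw command,
    // or null when the confidence is too low or the words do not form one.
    public static string Parse(IList<string> words, float confidence, float threshold = 0.6f)
    {
        if (confidence <= threshold) return null;            // same test as the handler
        if (words.Count != 3 || words[0] != "draw") return null;
        if (!Colors.Contains(words[1]) || !Shapes.Contains(words[2])) return null;
        return words[1] + " " + words[2];
    }

    static void Main()
    {
        Console.WriteLine(Parse(new[] { "draw", "red", "circle" }, 0.9f));         // red circle
        Console.WriteLine(Parse(new[] { "draw", "red", "circle" }, 0.3f) == null); // True
        Console.WriteLine(Parse(new[] { "draw", "pink", "circle" }, 0.9f) == null); // True
    }
}
```

Keeping the word checks in lookup sets rather than switch statements means a new color or shape is a one-line change, at the cost of a separate step to map the string to a WPF Color or Shape.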
Build KinectAudio App
1. Create a new C# WPF project named DrawShapeFromSpeech
2. Add a reference to Microsoft.Kinect
3. Add a reference to Microsoft.Speech (not System.Speech!)
4. Design the GUI
5. Add the code

Add Microsoft.Speech assembly
GUI Design Canvas
Adding Code
- Import namespaces

    using Microsoft.Kinect;
    using Microsoft.Speech.Recognition;
    using Microsoft.Speech.AudioFormat;
    using System.IO;

- Add member variables

    KinectSensor sensor;
    Stream audioStream;
    SpeechRecognitionEngine speechEngine;

- Register the WindowLoaded event handler programmatically

    public MainWindow()
    {
        InitializeComponent();
        Loaded += new RoutedEventHandler(WindowLoaded);
    }

Adding Code: WindowLoaded
    private void WindowLoaded(object sender, RoutedEventArgs e)
    {
        // Start the sensor and begin capturing the audio stream
        this.sensor = KinectSensor.KinectSensors[0];
        this.sensor.Start();
        audioStream = this.sensor.AudioSource.Start();

        RecognizerInfo recognizerInfo = GetKinectRecognizer();
        if (recognizerInfo == null)
        {
            MessageBox.Show("Could not find Kinect speech recognizer");
            return;
        }

        BuildGrammarforRecognizer(recognizerInfo); // provided earlier
        statusBar.Text = "Speech Recognizer is ready";
    }

Adding Code: code provided earlier
- Add the event handler for speechHypothesized
- Add the event handler for speechRecognized
  - CommandsParser() is invoked, which draws the spoken shape
  - You can close the app by saying "close the application"
- Add the event handler for speechRecognitionRejected (left empty)

Challenge Tasks
For advanced students, improve the app in the following ways:
- Enable both the color image and skeleton data streams
- Display color image frames (but not the skeleton)
- Modify the grammar so that you can add a particular shape at a particular joint location, e.g., "draw a red circle at the right hand"
- Enable drawing with the right (or left) hand, using the color and shape specified in the voice command
6/16/2018 EEC492/693/793 - iPhone Application Development