1
Gesture recognition using deep learning
By Chinmaya R. Naguri, under the guidance of Assoc. Prof. Razvan C. Bunescu
2
Introduction – Evolution of User Interaction with Computers
Punch cards – until the 1950s
Electric computer keyboard – 1948
Mouse – 1960s (Douglas Engelbart), popular in the 1980s
Touch screens – wide adoption/popularity in the 2000s
Gesture & voice recognition – recently (2010s)
Our purpose is to give users and developers a platform to develop custom interaction gestures.
3
Introduction – Reasons for Slow Growth in Gesture Input
Lack of compatible applications / interfaces
Reduced usability
Low recognition accuracy
This is changing now:
Hardware – e.g. sensors, GPUs, media platforms
Software – algorithmic models, e.g. deep learning
We have been hearing the term gesture recognition for a while now. The reasons for its slow growth are the lack of applications built to use it, reduced usability (it is not easy enough to use), and inaccurate recognition. This is all changing now with developments in hardware (sensors, GPUs, etc.) and in software.
4
Introduction – Leap Motion Controller
We are using the Leap Motion controller for this project. Here is the controller’s view of the hand; the center of the controller is the origin of the coordinate system. Finger positions and speeds, along with the frame rate (FPS), are shown in the right corner of the image.
5
Architecture
6
Architecture
[Pipeline diagram: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)]
This is the overview of our system. I’ll explain each module individually, in detail, in the following slides. First let us see what each module is designed for.
7
Architecture
[Pipeline diagram: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)]
Leap Motion gives data frames (similar to video frames), but these frames contain information about the fingers, such as positions and velocities. These frames are passed on to Gesture Detection (GD). Its job is to decide whether a given frame is inside a gesture region or outside. All the blue frames here are outside a gesture region and the yellow ones are inside a gesture region. Next, these frames are passed on to Post Processing (PP), which corrects a few of the frame labels to help Gesture Classification (GC). PP is rule-based and task-specific. Finally, the detected gesture region is sent to GC to classify it as a gesture. So, gesture recognition as a whole is divided into two modules: GD (detect whether any gesture is present at all) and GC (find the type of gesture).
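A minimal sketch of how the three stages could be wired together. The function names (detect_frames, post_process, classify_region) are hypothetical placeholders for the modules described above, not the project's actual code.

```python
def recognize_gestures(frames, detect_frames, post_process, classify_region):
    """frames: list of per-frame feature vectors from the Leap Motion controller."""
    labels = detect_frames(frames)        # GD: one inside(1)/outside(0) label per frame
    labels = post_process(labels)         # PP: rule-based cleanup of the label sequence
    # Group consecutive "inside" frames into gesture regions
    regions, start = [], None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i
        elif lab == 0 and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(labels)))
    # GC: classify each detected region
    return [(s, e, classify_region(frames[s:e])) for s, e in regions]

# Toy usage with stand-in stage functions
frames = list(range(12))
print(recognize_gestures(
    frames,
    detect_frames=lambda fs: [1 if 3 <= i <= 8 else 0 for i in range(len(fs))],
    post_process=lambda labs: labs,
    classify_region=lambda region: 'circle'))   # [(3, 9, 'circle')]
```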
8
Data Collection
[Pipeline diagram: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)]
To train our machine learning models, we need example data. Let us see how we collect and manually label it.
9
Data Collection
[Figure: frames in a recording, with the data of a single frame expanded – e.g. index finger position (x, y, z) and thumb velocity (xv, yv, zv), listed per finger]
This is data from one recording. A recording consists of a series of frames containing multiple gestures. Here is the data for a single frame. Out of all this data we will only be using the positions and velocities of the fingers.
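A minimal sketch of turning one frame into a fixed-length feature vector under that assumption (5 fingers × position + velocity = 30 values per frame); the dict layout is an illustrative stand-in for the actual Leap Motion data structures.

```python
import numpy as np

def frame_to_features(frame):
    """frame: dict mapping finger name -> {'position': (x, y, z),
    'velocity': (xv, yv, zv)}; returns a 30-dimensional vector."""
    fingers = ['thumb', 'index', 'middle', 'ring', 'pinky']
    feats = []
    for name in fingers:
        feats.extend(frame[name]['position'])   # 3 position values
        feats.extend(frame[name]['velocity'])   # 3 velocity values
    return np.asarray(feats, dtype=np.float32)  # shape: (30,)

# Example frame with made-up numbers (Leap coordinates, mm and mm/s)
example = {f: {'position': (0.0, 150.0, 0.0), 'velocity': (0.0, 0.0, 0.0)}
           for f in ['thumb', 'index', 'middle', 'ring', 'pinky']}
print(frame_to_features(example).shape)  # (30,)
```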
10
Data Collection – Video Recording of the Gesture
While collecting data (recording gestures) we also record a video to help with annotation. Let us see a sample video now to get a fair idea of how we do this. At the beginning of the video, look for the key press `Q`; at that point we begin recording the Leap Motion data frames. After that we perform some gestures to record, then stop. This is how we perform a gesture, and we use the video for time-based annotation: we say, for example, that a circle gesture started at 10 s 340 ms and ended at 11 s 300 ms.
11
Data Collection – Data Annotation
[Figure: timeline of data frames with their timestamps (…, 120, 240, 360, 480, … milliseconds)]
Let us see how data annotation is done. What we are trying to do is label the frames as inside or outside of a gesture region: say, from frame 200 to 280 a swipe gesture happened, and so on. This is why we need the video, to make the job easier. We already have the data frames with their associated timestamps (we get them from the controller). We then enter the begin and end times of a gesture into our program, which internally maps them to frames based on the timestamps. Looking at this picture: first we record the start time of the frame data collection; next we mark a transition gesture; later we mark a circle gesture. We do this for all the recorded examples, which finishes the data annotation.
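A minimal sketch of that timestamp-to-frame mapping, assuming each frame carries a millisecond timestamp; the (start_ms, end_ms, label) annotation format is an illustrative choice, not necessarily the project's exact one.

```python
def label_frames(frame_timestamps, annotations, outside_label='outside'):
    """frame_timestamps: per-frame times in ms (monotonically increasing).
    annotations: list of (start_ms, end_ms, gesture_label) triples.
    Returns one label per frame."""
    labels = [outside_label] * len(frame_timestamps)
    for start_ms, end_ms, gesture in annotations:
        for i, t in enumerate(frame_timestamps):
            if start_ms <= t <= end_ms:
                labels[i] = gesture
    return labels

# Example: a circle gesture annotated from 10340 ms to 11300 ms
frames = list(range(10000, 12000, 10))            # one frame every 10 ms
print(label_frames(frames, [(10340, 11300, 'circle')]).count('circle'))
```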
12
Gesture Detection
[Pipeline diagram: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)]
Now, our next module is Gesture Detection (GD). Just to recall, GD labels each frame as outside or inside a gesture region.
13
Gesture Detection – Architecture
[Figure: frames F1, F2, F3, …, Fn fed one by one into LSTM cells; each cell's output goes through a softmax producing outside/inside probabilities, e.g. 0.7 | 0.3, 0.6 | 0.4, 0.2 | 0.8, 0.8 | 0.2]
Let us see the basic architecture of GD. A frame is passed to the LSTM cell. The LSTM cell takes three inputs: the memory of the previous cell, the output of the previous cell, and the current frame. We update the memory of the cell by looking at the current frame, and the output is passed to a softmax to classify whether the frame is outside or inside a gesture region. We do the same for the next frame, and so on until the end of the frames.
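A minimal sketch of this per-frame LSTM + softmax detector in PyTorch (the framework is an assumption; the slide does not fix one). The input size 30 matches the 5-finger position + velocity features; the hidden size is an arbitrary illustrative choice.

```python
import torch
import torch.nn as nn

class GestureDetector(nn.Module):
    """One inside/outside prediction per frame."""
    def __init__(self, input_size=30, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 2)   # 2 classes: outside / inside

    def forward(self, frames):
        # frames: (1, n_frames, input_size) – one recording at a time
        hidden_states, _ = self.lstm(frames)   # (1, n_frames, hidden_size)
        logits = self.out(hidden_states)       # (1, n_frames, 2)
        return logits                          # softmax applied in the loss

# Example: 500 frames of 30-dimensional features
model = GestureDetector()
probs = torch.softmax(model(torch.randn(1, 500, 30)), dim=-1)
print(probs.shape)  # torch.Size([1, 500, 2])
```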
14
Gesture Detection – Training
For each epoch:
  For each training example:
    For each frame: pass it to the LSTM cell and its output to the softmax
    The softmax outputs the outside/inside probabilities
    Use Adadelta optimization to update the parameters so as to improve the cost function (-log p(true label))
    Reset the cell state/memory to zero
During training of GD, for each example, and for each frame in each example, we do as described above. Resetting the cell state is important, because the next example is independent of the current one: its starting frames are not a continuation of the current example's ending frames, so keeping the memory would confuse the system rather than help it. We do this for all the training examples, then iterate for another epoch until an early stop or until all epochs are finished. A sketch of this loop follows below.
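A minimal training-loop sketch matching that description, continuing the PyTorch assumption from the previous sketch; per-frame cross-entropy is the -log p(true label) cost, Adadelta is the optimizer, and the cell state is reset by starting each example with a fresh hidden state.

```python
import torch
import torch.nn as nn

def train_detector(model, examples, n_epochs=20):
    """examples: list of (frames, labels) pairs, where frames is a
    (n_frames, 30) float tensor and labels is a (n_frames,) long tensor of 0/1."""
    optimizer = torch.optim.Adadelta(model.parameters())
    loss_fn = nn.CrossEntropyLoss()            # -log p(true label), per frame
    for epoch in range(n_epochs):
        for frames, labels in examples:
            optimizer.zero_grad()
            # A fresh forward pass starts from zero hidden/cell state,
            # so memory does not leak between independent examples.
            logits = model(frames.unsqueeze(0)).squeeze(0)   # (n_frames, 2)
            loss = loss_fn(logits, labels)
            loss.backward()
            optimizer.step()
```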
15
Post Processing
[Pipeline diagram: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)]
The next module is Post Processing. We alter a few of the frame labels to help GC.
16
Post Processing
[Figure: frames vs. index-finger speed, before and after post processing]
If two regions are close (say within 10 frames), we join them into a single larger gesture region.
Boundaries are extended by a few frames (say 10) to include more information about the beginning and end of a gesture region.
A region is ignored if it is shorter than a threshold (say 10 frames).
This graph plots frame index against the speed of the index finger. The bold black line is the true gesture region from manual labeling. We see some gaps in between; they are filled after PP. Boundaries are missing from the output of GD; they are extended after PP. This helps the next module by giving it ample information to classify correctly. A sketch of these rules follows below.
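A minimal sketch of those three rules applied to a binary (0 = outside, 1 = inside) label sequence; the thresholds are the illustrative values from the slide (10 frames each), not tuned numbers.

```python
def post_process(labels, merge_gap=10, extend=10, min_length=10):
    """labels: list of 0/1 per-frame labels from Gesture Detection."""
    n = len(labels)
    # Find (start, end) of each run of 1s (end exclusive)
    regions, start = [], None
    for i, lab in enumerate(labels + [0]):
        if lab == 1 and start is None:
            start = i
        elif lab != 1 and start is not None:
            regions.append([start, i])
            start = None
    # Rule 1: merge regions separated by a small gap
    merged = []
    for reg in regions:
        if merged and reg[0] - merged[-1][1] <= merge_gap:
            merged[-1][1] = reg[1]
        else:
            merged.append(reg)
    # Rule 2: extend boundaries; Rule 3: drop regions that are too short
    out = [0] * n
    for s, e in merged:
        if e - s >= min_length:
            for i in range(max(0, s - extend), min(n, e + extend)):
                out[i] = 1
    return out

# Two nearby regions get merged and extended into one large region
print(sum(post_process([0]*5 + [1]*20 + [0]*5 + [1]*20 + [0]*5)))  # 55
```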
17
Gesture Classification
[Pipeline diagram: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)]
Now, let us look at GC. Here, our goal is to classify the given gesture region into one of the interaction gestures.
18
Gesture Classification
Architecture
[Figure: frames F1, F2, F3, …, Fn fed into LSTM cells; the cell outputs are mean-pooled and passed through a softmax producing gesture probabilities over Transition / Circle / Swipe / Correct / Wrong / Push / Pull, e.g. 0.01 | 0.95 | 0.00 | 0.01 | 0.02 | … | 0.005 for a circle]
Let us see the basic architecture of GC. A sequence of frames (a gesture region) is given to us. We first send each frame to an LSTM cell; just as before, it takes three inputs: the memory from the previous cell, the output of the previous cell, and the current input frame. Inside the LSTM cell the memory gets updated based on the current inputs. Next, the cell outputs are passed to mean pooling (the mean in each dimension). Then the pooled output is passed to a softmax to classify the given region as one of the gestures.
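A minimal PyTorch sketch of the classifier (same framework assumption as before): an LSTM over the region's frames, mean pooling over time, and a linear + softmax layer over the seven gesture classes.

```python
import torch
import torch.nn as nn

GESTURES = ['transition', 'circle', 'swipe', 'correct', 'wrong', 'push', 'pull']

class GestureClassifier(nn.Module):
    def __init__(self, input_size=30, hidden_size=64, n_classes=len(GESTURES)):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, n_classes)

    def forward(self, region):
        # region: (1, n_frames, input_size) – one detected gesture region
        hidden_states, _ = self.lstm(region)     # (1, n_frames, hidden_size)
        pooled = hidden_states.mean(dim=1)       # mean pooling over frames
        return self.out(pooled)                  # (1, n_classes) logits

model = GestureClassifier()
probs = torch.softmax(model(torch.randn(1, 80, 30)), dim=-1)
print(GESTURES[probs.argmax().item()])
```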
19
Gesture Classification
Training
For each epoch:
  For each training example:
    For each gesture region:
      Send the frames to the LSTM cells and their outputs to mean pooling
      The mean-pooled output goes to the softmax classifier (which outputs the gesture probabilities)
      Use Adadelta optimization to update the parameters so as to improve the cost function (-log p(true label))
Let us see how we train GC: for each gesture region in a training example, we do as listed above. A training-loop sketch follows below.
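A minimal sketch of the corresponding training loop, under the same PyTorch assumption; each training item is a (region_frames, gesture_index) pair, and the loss is again -log p(true label), now one per gesture region rather than per frame.

```python
import torch
import torch.nn as nn

def train_classifier(model, regions, n_epochs=20):
    """regions: list of (frames, gesture_idx) pairs, where frames is a
    (n_frames, 30) float tensor and gesture_idx indexes GESTURES."""
    optimizer = torch.optim.Adadelta(model.parameters())
    loss_fn = nn.CrossEntropyLoss()                 # -log p(true label)
    for epoch in range(n_epochs):
        for frames, gesture_idx in regions:
            optimizer.zero_grad()
            logits = model(frames.unsqueeze(0))     # (1, n_classes)
            loss = loss_fn(logits, torch.tensor([gesture_idx]))
            loss.backward()
            optimizer.step()
```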
20
Evaluation
21
Evaluation – Data Folds
[Figure: the data split into folds, with each fold in turn serving as test data while the remainder is divided into training and validation data]
We split our data into multiple datasets using data folds. This ensures that all the data is used for testing and that the test data is never seen during training.
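A minimal sketch of such a fold-based split (plain Python, no library assumed); the number of folds and the choice of the next fold as validation data are illustrative, not necessarily the exact protocol used in the project.

```python
def make_folds(examples, n_folds=5):
    """Yield (train, validation, test) splits so every example is tested once."""
    folds = [examples[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        validation = folds[(k + 1) % n_folds]          # illustrative choice
        train = [ex for i, f in enumerate(folds)
                 if i not in (k, (k + 1) % n_folds) for ex in f]
        yield train, validation, test

for train, val, test in make_folds(list(range(20)), n_folds=5):
    print(len(train), len(val), len(test))   # 12 4 4 on every iteration
```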
22
Evaluation – Gesture Detection
These are the results for GD, before PP on the left and after PP on the right. We used several evaluation metrics to see how well our model performs:
Accuracy – percentage of frame labels computed that are correct
Precision – percentage of inside frames extracted by the system that are truly inside frames of a gesture
Recall – percentage of true inside frames that the system also identified as inside frames of a gesture
F-score – harmonic mean of precision and recall
What we can observe is that recall improves, which is desired: we want to label as many gesture regions as possible so as not to miss any region, and the model can recover in GC by classifying a non-gesture region as a transition gesture. Precision decreases, which is acceptable: we label a few regions as containing gestures even though they do not, and just as before, the model can recover in GC. A sketch of these frame-level metrics follows below.
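A minimal sketch of these frame-level metrics, treating "inside" as the positive class; plain Python over the predicted and true label sequences.

```python
def frame_metrics(true_labels, pred_labels, inside=1):
    """true_labels, pred_labels: per-frame 0/1 sequences of equal length."""
    tp = sum(1 for t, p in zip(true_labels, pred_labels) if t == inside and p == inside)
    fp = sum(1 for t, p in zip(true_labels, pred_labels) if t != inside and p == inside)
    fn = sum(1 for t, p in zip(true_labels, pred_labels) if t == inside and p != inside)
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    accuracy = correct / len(true_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return accuracy, precision, recall, f_score

print(frame_metrics([0, 1, 1, 1, 0, 0], [0, 1, 1, 0, 1, 0]))
```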
23
Evaluation – Gesture Classification
Let us look at the results for GC. The evaluation metrics are:
Recall – percentage of all the gestures of a given type that are classified by the system as belonging to that type
Precision – percentage of gestures that are truly of a given type out of all the gestures that the system classified as belonging to that type
F-score – harmonic mean of precision and recall
Accuracy – percentage of gestures that the system correctly classified
24
Evaluation – Gesture Classification
Confusion Matrix
25
Evaluation – Gesture Recognition
A gesture that is extracted (GD) and classified (GC) by the gesture recognition system is considered correct if the following two conditions are satisfied (see the sketch below):
At least 50% of the frames of the system gesture are contained in a true gesture.
The label assigned by the system is the same as the label of the true gesture.
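A minimal sketch of that correctness check; gestures are represented here as (start_frame, end_frame, label) triples, which is an illustrative representation.

```python
def is_correct(system_gesture, true_gestures):
    """system_gesture: (start, end, label) with end exclusive.
    true_gestures: list of (start, end, label) manually annotated gestures."""
    s_start, s_end, s_label = system_gesture
    s_len = s_end - s_start
    for t_start, t_end, t_label in true_gestures:
        overlap = max(0, min(s_end, t_end) - max(s_start, t_start))
        # Condition 1: at least 50% of the system gesture's frames lie in the true gesture
        # Condition 2: the labels match
        if overlap >= 0.5 * s_len and s_label == t_label:
            return True
    return False

print(is_correct((100, 180, 'circle'), [(90, 150, 'circle')]))   # True (62.5% overlap)
print(is_correct((100, 180, 'circle'), [(90, 150, 'swipe')]))    # False (label differs)
```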
26
Future work & Conclusion
The model is tuned to only one person because the data comes from a single user; using an auto-encoder before GC would be helpful if the data came from multiple users.
Labeling the data takes a lot of time; this needs improvement.
A few transition gestures are being labelled as interaction gestures; updating the transition-gesture dataset with these would help.
A user feedback model would be nice (an active learning setting).
A combined model, such as voice & gesture recognition, would be more interactive.
Our model's F-score is nearly 97%, which is considered high enough for real-time use.
27
Questions?