Gesture recognition using deep learning

Presentation transcript:

Gesture recognition using deep learning
By Chinmaya R Naguri, under the guidance of Assoc. Prof. Razvan C. Bunescu

Introduction
Evolution of user interaction with computers:
Punch cards – until the 1950s
Electric computer keyboard – 1948
Mouse – 1960s (Douglas Engelbart), widely popular by the 1980s
Touch screens – wide adoption/popularity in the 2000s
Gesture & voice recognition – recent (2010s)
Our purpose is to give users and developers a platform for developing custom interaction gestures.

Introduction
Reasons for the slow growth of gesture input:
Lack of compatible applications/interfaces
Reduced usability
Low recognition accuracy
This is changing now, thanks to hardware (e.g. sensors, GPUs, media platforms) and software (algorithmic models, e.g. deep learning).
We have been hearing the term gesture recognition for a while now. Growth has been slow because few applications were built to use it, it was not easy enough to use, and recognition was not accurate. This is all changing now with developments in hardware, such as sensors and GPUs, and with developments in software.

Introduction: Leap Motion Controller
We are using the Leap Motion Controller for this project. Here is the controller’s view of the hand: the center of the controller is the origin of the coordinate system. Finger positions and speeds can be seen in the right corner of the image, along with the frame rate (FPS).

Architecture

Architecture
Pipeline: Leap Motion frames → Gesture Detection → Post Processing → Gesture Classification (Circle / Swipe / Wrong / Correct / Pull / Push)
This is the overview of our system. I’ll explain each module individually in detail in the following slides. First, let us see what each module is designed for.

Architecture
Pipeline: Leap Motion frames → Gesture Detection (GD) → Post Processing (PP) → Gesture Classification (GC): Circle / Swipe / Wrong / Correct / Pull / Push
The Leap Motion controller gives data frames (similar to video frames), but these frames contain information about the fingers, such as positions and velocities. These frames are passed to GD, whose job is to decide whether a given frame lies inside or outside a gesture region. (In the diagram, blue frames are outside a gesture region and yellow frames are inside one.) The frame labels are then passed to post processing, which corrects a few of them to help GC; PP is rule-based and task-specific. Finally, each detected gesture region is sent to GC to be classified as a gesture. So gesture recognition as a whole is divided into two modules: GD (detect whether any gesture is present at all) and GC (determine the type of gesture).
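As an illustration only (not code from the project), here is a minimal Python sketch of how the stages could be chained for one recording; `detect`, `post_process`, and `classify` stand for the GD, PP, and GC modules sketched on later slides and are passed in as functions, so all names here are hypothetical.

```python
def regions_from_mask(inside_mask):
    """Turn a per-frame inside/outside mask into (start, end) index pairs."""
    regions, start = [], None
    for i, inside in enumerate(inside_mask):
        if inside and start is None:
            start = i
        elif not inside and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(inside_mask) - 1))
    return regions

def recognize(frames, detect, post_process, classify):
    """End-to-end pass over one recording; the three stages are injected as functions."""
    inside_mask = detect(frames)                              # GD: inside/outside per frame
    regions = post_process(regions_from_mask(inside_mask))    # PP: merge / extend / filter regions
    return [(s, e, classify(frames[s:e + 1])) for s, e in regions]  # GC: one label per region
```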

Data Collection
To train our machine learning models, we need example data. Let us see how we collect it and manually label it.

Data collection
The figure shows the frames in a recording and the data of a single frame, for example the index finger position (x, y, z) and the thumb velocity (xv, yv, zv). This is data from one recording; a recording consists of a series of frames and contains multiple gestures. Out of all the data in a frame, we will only be using the positions and velocities of all the fingers.
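A minimal sketch of how a per-frame feature vector (5 fingers × position + velocity = 30 values) might be assembled; the dictionary layout of a parsed frame is an assumption made for illustration, not the actual Leap Motion data format.

```python
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]

def frame_features(frame):
    """Build a 30-dimensional feature vector for one frame:
    (x, y, z) position and (xv, yv, zv) velocity for each of the 5 fingers.
    `frame` is assumed to be a dict like
    {"thumb": {"position": (x, y, z), "velocity": (xv, yv, zv)}, ...}
    produced by whatever code reads the recorded Leap Motion frames."""
    feats = []
    for name in FINGERS:
        feats.extend(frame[name]["position"])   # 3 position values
        feats.extend(frame[name]["velocity"])   # 3 velocity values
    return feats                                 # 5 fingers * 6 values = 30
```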

Data collection: video recording for the gesture
While collecting data, i.e. recording gestures, we also record a video to help with annotation. Let us watch a sample video to get a fair idea of how this is done. At the beginning of the video, look for the keypress `Q`; at that point we begin recording the Leap Motion data frames. After that we perform some gestures to record, then stop. This is how a gesture is performed, and we use the video for time-based annotation: for example, "at 10 s 340 ms the circle gesture started, and it ended at 11 s 300 ms".

Data collection: data annotation
(Figure: data frames aligned to a timeline of timestamps in milliseconds, e.g. 120, 240, 360, 480, ….)
Let us see how data annotation is done. The goal is to label each frame as inside or outside a gesture region, e.g. "from frame 200 to frame 280 a swipe gesture happened", and so on. This is why the video makes the job easier. We already have the data frames with their timestamps attached (we get them from the controller). We then enter the begin and end times of each gesture into our program, which internally maps them to frames based on the timestamps. Looking at the picture: first we note the start time of the frame data collection, next we mark a transition gesture, and later we mark a circle gesture. We do this for all the recorded examples, which completes data annotation.
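A small sketch of this time-based annotation step, assuming the per-frame timestamps and the (begin, end) times noted from the video are available in milliseconds; the function and field names are made up for illustration.

```python
def label_frames(timestamps_ms, annotations):
    """timestamps_ms: per-frame timestamps from the controller (ms, relative to
    the start of the recording).
    annotations: list of (begin_ms, end_ms, gesture_name) tuples noted from the video.
    Returns one label per frame: 'outside' or the gesture name."""
    labels = ["outside"] * len(timestamps_ms)
    for begin_ms, end_ms, gesture in annotations:
        for i, t in enumerate(timestamps_ms):
            if begin_ms <= t <= end_ms:
                labels[i] = gesture
    return labels

# e.g. label_frames(ts, [(10340, 11300, "circle")]) marks every frame whose
# timestamp falls between 10 s 340 ms and 11 s 300 ms as part of the circle gesture.
```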

Gesture Detection
Our next module is GD. Just to recall, GD labels each frame as outside or inside a gesture region.

Gesture Detection: Architecture
(Figure: frames F1, F2, F3, …, Fn are fed to a chain of LSTM cells; each cell's output goes through a softmax that yields outside/inside probabilities, e.g. 0.7 | 0.3, 0.6 | 0.4, 0.2 | 0.8, 0.8 | 0.2.)
Let us see the basic architecture for GD. A frame is given and passed to the LSTM cell. The LSTM cell takes three inputs: the memory from the previous cell, the output of the previous cell, and the current input frame. We update the memory of the cell by looking at the current frame, and the output is passed to a softmax to classify the frame as outside or inside a gesture region. We do the same for the next frame, and so on until the end of the sequence of frames.
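A minimal PyTorch sketch of this per-frame architecture; the slides do not state the framework or layer sizes, so the 30-dimensional input (5 fingers × position + velocity) and the 64 hidden units are assumptions.

```python
import torch
import torch.nn as nn

class GestureDetector(nn.Module):
    """Per-frame outside/inside classifier: one LSTM cell stepped over the frames,
    followed by a softmax (here returned as logits) for each frame."""
    def __init__(self, n_features=30, n_hidden=64):
        super().__init__()
        self.cell = nn.LSTMCell(n_features, n_hidden)
        self.out = nn.Linear(n_hidden, 2)          # outside / inside logits

    def forward(self, frames):
        # frames: (seq_len, n_features) tensor for a single recording
        h = frames.new_zeros(1, self.cell.hidden_size)
        c = frames.new_zeros(1, self.cell.hidden_size)
        logits = []
        for x in frames:                            # one frame at a time
            h, c = self.cell(x.unsqueeze(0), (h, c))  # update the cell memory
            logits.append(self.out(h))              # per-frame outside/inside logits
        return torch.cat(logits, dim=0)             # (seq_len, 2)
```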

Gesture Detection: Training
For each epoch:
  For each training example:
    For each frame: pass it to the LSTM cell and pass the cell's output to the softmax; the softmax outputs the outside/inside probabilities
    Use Adadelta optimization to update the parameters so as to improve the cost function (-log p(true label))
    Reset the cell state/memory to zero
During training of GD, for each example and for each frame within it, we proceed as above. Resetting the cell state is important because the next example is independent of the current one: its starting frames are not a continuation of the current example's ending frames, so keeping the memory would confuse the system rather than help it. We do this for all the training examples and then iterate for another epoch, until an early stop or until all epochs are finished. A training-loop sketch follows below.
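A corresponding training-loop sketch (illustrative only, using the GestureDetector sketched above) with Adadelta and the -log p(true label) cost; the cell state starts from zero for every example because forward() re-initialises it, and early stopping on validation data is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def train_detector(model, recordings, labels, n_epochs=20):
    """recordings: list of (seq_len_i, 30) float tensors, one per training example.
    labels: list of (seq_len_i,) long tensors, 0 = outside, 1 = inside."""
    opt = torch.optim.Adadelta(model.parameters())
    for epoch in range(n_epochs):                     # early stopping omitted here
        for frames, target in zip(recordings, labels):
            opt.zero_grad()
            logits = model(frames)                    # state reset to zero per example
            loss = F.cross_entropy(logits, target)    # mean -log p(true label) over frames
            loss.backward()
            opt.step()
    return model
```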

Post Processing
The next module is post processing. We alter a few of the frame labels to help GC.

Post Processing
(Figure: before/after graphs of frame index vs. index-finger speed.)
Rules:
If two regions are close to each other (say within 10 frames), we join them into a single larger gesture region.
Boundaries are extended by a few frames (say 10) to include more information about the beginning and end of a gesture region.
A region is ignored if it is shorter than a threshold (say 10 frames).
The graph plots frames vs. the speed of the index finger. The bold black line is the true gesture region from manual labeling. We see some gaps in between; they are filled after PP. Boundaries are missing from the output of GD; they are extended after PP. This helps the next module by giving it ample information to classify correctly. A sketch of these rules appears after this slide.
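The three rules can be written down in a few lines. This is a sketch only: it assumes GD's output has already been converted into (start, end) frame regions, and the 10-frame thresholds are the example values from the slide, not tuned settings.

```python
def post_process(regions, gap=10, pad=10, min_len=10, n_frames=None):
    """regions: list of (start, end) frame indices predicted 'inside' by GD."""
    merged = []
    for start, end in sorted(regions):
        # 1) join regions separated by fewer than `gap` frames
        if merged and start - merged[-1][1] <= gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    out = []
    for start, end in merged:
        # 2) extend boundaries by `pad` frames on each side
        start = max(0, start - pad)
        end = end + pad if n_frames is None else min(n_frames - 1, end + pad)
        # 3) ignore regions shorter than `min_len` frames
        if end - start + 1 >= min_len:
            out.append((start, end))
    return out
```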

Gesture Classification
Now, let us look at GC. Here, our goal is to classify a given gesture region as one of the interaction gestures: Circle / Swipe / Wrong / Correct / Pull / Push.

Gesture Classification: Architecture
(Figure: frames F1, F2, F3, …, Fn feed a chain of LSTM cells; the cell outputs go through mean pooling and a softmax that yields gesture probabilities over Transition / Circle / Swipe / Correct / Wrong / Push / Pull, e.g. 0.01 | 0.95 | 0.00 | 0.01 | 0.02 | 0.005 | 0.005 for a circle.)
Let us see the basic architecture of GC. A sequence of frames, i.e. a gesture region, is given to us. We first send each frame to an LSTM cell; just as before, the cell takes three inputs: the memory from the previous cell, the output of the previous cell, and the current input frame. Inside the LSTM cell, the memory gets updated based on the current inputs. Next, the cell outputs are passed to mean pooling (the mean in each dimension), and that output is passed to a softmax that classifies the given region as one of the gestures.
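A minimal PyTorch sketch of this architecture with mean pooling over the LSTM outputs; the hidden size is an assumption, and the seven classes follow the slide (Transition, Circle, Swipe, Correct, Wrong, Push, Pull).

```python
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Gesture-region classifier: LSTM over the frames of one region,
    mean pooling over time, then a softmax (returned here as logits)."""
    def __init__(self, n_features=30, n_hidden=64, n_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.out = nn.Linear(n_hidden, n_classes)

    def forward(self, region):
        # region: (seq_len, n_features) frames of one detected gesture region
        outputs, _ = self.lstm(region.unsqueeze(0))   # (1, seq_len, n_hidden)
        pooled = outputs.mean(dim=1)                  # mean pooling in each dimension
        return self.out(pooled)                       # (1, n_classes) logits
```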

Gesture Classification: Training
For each epoch:
  For each training example:
    For each gesture region:
      Send the frames to the LSTM cells and their outputs to mean pooling
      Pass the mean-pooled output to the softmax classifier (which outputs the gesture probabilities)
      Use Adadelta optimization to update the parameters so as to improve the cost function (-log p(true label))
Training of GC works just like GD: for each gesture region in a training example, we proceed as above.

Evaluation

Evaluation: Data folds
(Figure: the data is partitioned into folds; in each configuration one part is the test data, one part the validation data, and the rest the training data, with the test part rotating across folds.)
We split our data into multiple datasets using data folds. This ensures that all of the data is used for testing at some point and that the test data is never seen during training.
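The slides do not spell out the exact fold scheme; one plausible sketch, in which the test fold rotates and the neighbouring fold serves as validation data, is shown below (the number of folds and the rotation rule are assumptions).

```python
def make_folds(examples, n_folds=5):
    """Rotate which fold is the test data; the next fold is the validation data
    and the remaining folds are the training data, so every example is tested
    once and test data is never seen during training."""
    folds = [examples[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        val = folds[(k + 1) % n_folds]
        train = [ex for i, f in enumerate(folds)
                 if i not in (k, (k + 1) % n_folds) for ex in f]
        splits.append((train, val, test))
    return splits
```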

Evaluation – Gesture Detection
Metrics:
Accuracy – percentage of computed frame labels that are correct
Precision – percentage of inside frames extracted by the system that are truly inside frames of a gesture
Recall – percentage of true inside frames that the system also identified as inside frames of a gesture
F-measure – harmonic mean of precision and recall
These are the results for GD, with results before PP on the left and after PP on the right. We used different evaluation metrics to see how well our model is performing. What we can observe is that recall improves after PP, which is desired: we want to label as many gesture regions as possible so as not to miss any region, and the model can recover in GC by classifying a non-gesture region as a transition gesture. Precision decreases, but that is acceptable: even though we label a few regions as containing gestures when they do not, the model can again recover in GC. A sketch of the frame-level metrics is shown below.
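For reference, the frame-level metrics can be computed as follows (a plain-Python sketch, with 1 = inside and 0 = outside).

```python
def frame_metrics(pred, true):
    """pred, true: per-frame labels of equal length, 1 = inside a gesture, 0 = outside."""
    tp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, true) if p == 0 and t == 1)
    correct = sum(1 for p, t in zip(pred, true) if p == t)
    accuracy = correct / len(true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, precision, recall, f_measure
```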

Evaluation – Gesture Classification
Let us see the results for GC. The evaluation metrics, defined per gesture type:
Recall – percentage of all the gestures of that type that are classified by the system as belonging to that type
Precision – percentage of gestures that are truly of that type out of all the gestures that the system classified as belonging to that type
F-measure – harmonic mean of precision and recall
Accuracy – percentage of gestures that the system correctly classified

Evaluation – Gesture Classification Confusion Matrix
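Per-gesture precision and recall can be read off the confusion matrix. A sketch, assuming confusion[i][j] counts gestures whose true class is i and whose predicted class is j (the orientation of the actual matrix in the slides is not stated).

```python
def per_class_metrics(confusion):
    """confusion: square list of lists; returns (precision, recall, f) per class."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        pred_k = sum(confusion[i][k] for i in range(n))   # column sum: predicted as k
        true_k = sum(confusion[k][j] for j in range(n))   # row sum: truly k
        precision = tp / pred_k if pred_k else 0.0
        recall = tp / true_k if true_k else 0.0
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        metrics.append((precision, recall, f))
    return metrics
```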

Evaluation – Gesture Recognition
A gesture that is extracted (GD) and classified (GC) by the gesture recognition system is considered correct if the following two conditions are satisfied:
At least 50% of the frames of the system gesture are contained in a true gesture
The label assigned by the system is the same as the label of the true gesture
A sketch of this check is given below.
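A sketch of this correctness check for a single pair of system and true gestures, using inclusive frame indices; the argument layout is illustrative, not the project's actual evaluation code.

```python
def gesture_correct(sys_start, sys_end, sys_label, true_start, true_end, true_label):
    """A system gesture is correct if at least 50% of its frames fall inside
    the true gesture and the two labels match."""
    overlap_start = max(sys_start, true_start)
    overlap_end = min(sys_end, true_end)
    overlap = max(0, overlap_end - overlap_start + 1)   # frames shared with the true gesture
    sys_len = sys_end - sys_start + 1                   # length of the system gesture
    return overlap >= 0.5 * sys_len and sys_label == true_label
```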

Future work & Conclusion
The model is tuned to only one person because the data comes from a single user; using an auto-encoder before GC would be helpful if the data came from multiple users.
Labeling the data takes a lot of time; improvement is needed here.
A few transition gestures are being labelled as interaction gestures; updating the transition-gesture dataset with these examples will help.
A user feedback model would be nice (an active learning setting).
A combined model, such as voice & gesture recognition, would be more interactive.
Our model’s F-score is nearly 97%, which is considered high enough for real-time use.

Questions ?