1
UltraGesture: Fine-Grained Gesture Sensing and Recognition
Kang Ling†, Haipeng Dai†, Yuntang Liu†, and Alex X. Liu†‡ Nanjing University†, Michigan State University‡ Good morning, everyone. My name is Kang Ling, from Nanjing University. It’s my honor to be here to present our recent work “UltraGesture: Fine-Grained Gesture Sensing and Recognition”. The coauthors Haipeng Dai, Yuntang Liu, and Alex X. Liu are from Nanjing University and Michigan State University. In this paper, we present UltraGesture, a Channel Impulse Response (CIR) based ultrasonic gesture perception and recognition system. SECON'18 June 12th, 2018
2
Outline Motivation Doppler vs. FMCW vs. CIR Solution Evaluation
This slide shows the outline of my presentation. First, I will give a brief introduction to the background and related work. Then, I will present our basic idea and the drawbacks of the existing Doppler and FMCW based methods. Finally, I will show the details of our solution, as well as the evaluation results.
3
Outline Motivation Doppler vs. FMCW vs. CIR Solution Evaluation
4
Motivation (a) AR/VR require new UI technology
(b) Smartwatch with tiny input screen User gesture recognition is an important technology in the Human Computer Interaction (HCI) area, and it gains greater attention with the rise of AR and VR technologies. Meanwhile, with the miniaturization of mobile devices such as smartwatches, finger input on a tiny screen gradually becomes cumbersome. In addition, trying to use smart devices while wearing gloves or with greasy hands often leads to an annoying user experience. (c) Some inconvenient scenarios
5
Related works (a) RF-Idraw, SigComm’14 (b) Kinect
Many user gesture recognition systems have been proposed. Traditional approaches use wearable sensors, cameras, or Radio Frequency (RF) signals. However, wearable sensor based approaches are normally burdensome because of the inconvenience of wearing sensors. Camera based approaches suffer from bad lighting conditions and occlusion. RF based approaches use either Wi-Fi or specialized devices. However, Wi-Fi measurements are too coarse grained for recognizing minor gestures because of the inherently long wavelength (for example, 6 centimeters for the 5 GHz band and 12.5 centimeters for the 2.4 GHz band). Moreover, specialized devices such as USRP or WARP SDR are often not cost effective. (c) Wisee, MobiCom’13 (d) Wifinger, UbiComp’16
6
Related works (ultrasound)
(a) Doplink, UbiComp’13 (b) Soundwave, CHI’12 (c) AudioGest, UbiComp’16 Several ultrasound based gesture recognition schemes have been proposed to deal with this problem too. However, these works, such as Doplink, Soundwave, and AudioGest, mostly rely on the Doppler effect, which falls short due to its low frequency resolution. In addition, many works have focused on ultrasound tracking in recent years. For example, LLAP [36] leverages the phase changes in the received baseband signals. FingerIO [37] uses chirps to track a finger or a palm with 8 mm 2D tracking error. Although these technologies show great progress in tracking accuracy, their biggest limitation is that they cannot handle multiple targets, especially when these targets do not move in the same direction. (d) LLAP, MobiCom’16 (e) FingerIO, CHI’16
7
Outline Motivation Doppler vs. FMCW vs. CIR Solution Evaluation
To deal with these problems, we present a CIR based method.
8
Principle Gesture recognition in the ultrasound area relies on either speed estimation or distance estimation. speed distance The principle of an ultrasound based gesture recognition system lies in measuring distance changes or speed changes. [click] For example, for a simple push-pull gesture, [click] we may have a speed estimation like this, [click] or a distance estimation like this. Existing ultrasound based gesture recognition systems rely on the Doppler effect to measure speed or FMCW chirps to measure distance. However, both measurement methods have their own drawbacks.
9
Doppler Effect
Δf = f_c · v / c  (f_c = 20 kHz carrier, c = 340 m/s)
Δf_min = F_s / W = 48000 / 2048 ≈ 23.43 Hz
v_min = c · Δf_min / f_c = 340 × 23.43 / 20000 ≈ 0.398 m/s
In Doppler effect based measurement, a moving object induces a frequency shift in the corresponding received signal. Let’s see the minimal speed resolution we can get here. [click] First, we have a relation between the frequency shift and the moving speed, where Fc is the carrier frequency and c is the speed of sound. [click] Then, to get the frequency shift, we have to perform a Fourier transform, and the frequency resolution is limited by the sampling frequency and the Fourier transform window. [click] Combining these, we obtain the minimal speed resolution as in the equation above. [click] Substituting the parameters, we get a minimal speed resolution of about 0.398 m/s. (Note that this resolution cannot be improved by zero padding the STFT window, as no information is added, and the window length W is set to 2048 samples, which corresponds to about 43 ms and thus greatly exceeds the coherence time (about 8.5 ms) in our scenario.)
10
FMCW chirp
d_min = c / (2B) = 340 / (2 × 4000) ≈ 4.25 cm  (B = 4 kHz ultrasound bandwidth, c = 340 m/s, W = 2048)
Similarly, for FMCW, we have a minimal distance resolution of about 4.25 centimeters because of the narrow ultrasound bandwidth. Clearly, neither a 0.398 m/s speed resolution nor a 4.25 centimeter distance resolution is sufficient for recognizing finger-scale motion gestures.
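As a quick sanity check (this is not the authors' code, just the arithmetic from the slides), the two resolution bounds above can be reproduced from the slide parameters:

```python
# Resolution bounds from the slide parameters (48 kHz sampling, 2048-sample
# window, 20 kHz carrier, 4 kHz bandwidth, 340 m/s speed of sound).
c = 340.0       # speed of sound (m/s)
fs = 48000.0    # sampling rate (Hz)
fc = 20000.0    # ultrasound carrier frequency (Hz)
W = 2048        # STFT window length (samples)
B = 4000.0      # usable ultrasound bandwidth (Hz)

# Doppler: the FFT frequency resolution fs / W maps to a speed resolution.
df_min = fs / W               # ~23.43 Hz
v_min = c * df_min / fc       # ~0.398 m/s, following the slide's relation

# FMCW: the range resolution is limited by the sweep bandwidth.
d_min = c / (2 * B)           # ~0.0425 m, i.e. about 4.25 cm

print(f"Doppler speed resolution: {v_min:.3f} m/s")
print(f"FMCW distance resolution: {d_min * 100:.2f} cm")
```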
11
Channel Impulse Response - CIR
The received sound signal can be classified into three components in the time domain: direct sound, early reflections, and late reverberation. direct sound & early reflections late reverberation In this paper, we propose to use a channel impulse response (CIR) based method to estimate the changes caused by finger gestures. In sound transmission, the received sound signal can be classified into three components: direct sound, early reflections, and late reverberation. This sound transmission property is usually modeled as a linear system, y(t) = x(t) * h(t) + n(t), where the direct sound and early reflections are captured by the channel impulse response and the late reverberation is treated as noise. [click] Here the star symbol denotes convolution, h(t) is the channel impulse response, or CIR, which is what we want, and n(t) is the noise. Obviously, different finger gestures lead to different reflection change patterns, so if we can obtain the reflection change pattern, we can infer the underlying gesture.
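As a toy illustration of this linear-system view (not the authors' implementation; the sparse CIR below is made up), the received signal can be simulated as the transmitted signal convolved with a channel impulse response plus noise:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.choice([-1.0, 1.0], size=64)   # toy transmitted baseband frame
h = np.zeros(16)                       # toy channel impulse response h(t)
h[0], h[3], h[7] = 1.0, 0.5, 0.2       # direct path plus two early reflections

# y(t) = x(t) * h(t) + n(t): direct sound and early reflections enter through
# the convolution, late reverberation is lumped into the noise term n(t).
n = 0.01 * rng.standard_normal(len(x) + len(h) - 1)
y = np.convolve(x, h) + n
```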
12
Channel Impulse Response - CIR
We use a Least Square (LS) method to estimate the CIR and get a theoretical distance resolution of 7.08 mm. h(t) x(t) y(t) Linear system Time resolution: 1/Fs = 1/48000 ≈ 0.02 ms Distance resolution: c/Fs = 340/48000 ≈ 7.08 mm To estimate the CIR of this linear system, we use a least squares estimation method. The theoretical time resolution of the CIR is one sample, which corresponds to a distance resolution of about 7.1 millimeters.
13
Doppler vs. FMCW vs. CIR Direct sound + static reflections
For an intuitive comparison, we compare these methods on a push-and-pull gesture. [click] First, the Doppler based speed measurement. [click] Then, the FMCW based distance measurement. [click] Then, the CIR based distance measurement. [click] Note that the red line in the CIR figure is the direct sound and static reflections. We remove this line by subtracting the last frame's result from the current frame. [click] We call the result dCIR. Now we can see the resolution difference among these methods.
14
Outline Motivation Doppler vs. FMCW vs. CIR Solution Evaluation
Now, I’m going to present the detailed solution of our system.
15
System overview Our system includes a training stage and a recognition stage. We propose to use a CNN model to recognize the dCIR images. Our system includes a training stage and a recognition stage. We propose to use a CNN model to recognize the dCIR images. Before gesture recognition, we perform down-conversion, LS estimation, and gesture detection on the CIR measurements.
16
CIR - Barker Code We choose the Barker code as our baseband data because of its ideal autocorrelation property. (a) Barker code 13 (b) Barker code autocorrelation In our system design, we choose the Barker code as the transmitted baseband data. Barker codes with length N equal to 11 and 13 are used in direct-sequence spread spectrum (DSSS) and pulse compression radar systems because of their autocorrelation properties (other sequences, such as complementary sequences, which have a similar property, would work well too; we just choose the Barker code in this paper). The top two figures show a Barker code and its autocorrelation. In practice, we copy it twice and upsample it by a factor of 12 to meet the bandwidth and frame time requirements. The frame time is 10 milliseconds in our design. The two figures at the bottom are our sent and received data in baseband. (c) Sent baseband S[n] (d) Received baseband R[n]
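A minimal sketch of how the transmitted baseband frame could be constructed from the Barker-13 sequence, repeated twice and upsampled by 12 as described; the interpolation filter and the zero padding to a 10 ms frame are assumptions, not necessarily the authors' exact pulse shaping.

```python
import numpy as np
from scipy.signal import resample_poly

# Barker code of length 13 and its autocorrelation (peak 13, sidelobes <= 1).
barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)
acorr = np.correlate(barker13, barker13, mode="full")

# Copy the code twice and upsample by 12 (polyphase interpolation assumed).
frame = np.tile(barker13, 2)
baseband = resample_poly(frame, up=12, down=1)      # 13 * 2 * 12 = 312 samples

# Assumed: zero pad to 480 samples, i.e. one 10 ms frame at 48 kHz.
frame_10ms = np.pad(baseband, (0, 480 - len(baseband)))
```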
17
CIR - up-down conversion
The baseband signal is modulated to 20 kHz to meet the inaudibility requirement. A bandpass filter is used to avoid frequency leakage. To meet the inaudibility requirement, we modulate the baseband signal into the ultrasound band, which is 20 kHz in our experiments. We perform the modulation by multiplying with a single frequency cosine wave and process the signal with a bandpass filter to avoid frequency leakage. The down-conversion process is the reverse.
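A hedged sketch of the up-conversion step: multiply the baseband frame by a 20 kHz cosine carrier and band-pass filter the result. The filter order and passband width are assumptions; down-conversion would be the reverse (multiply by the carrier, then low-pass filter).

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs, fc, bw = 48000, 20000, 4000       # sample rate, carrier, assumed bandwidth
rng = np.random.default_rng(0)
baseband = rng.standard_normal(480)   # placeholder for one Barker-coded frame

t = np.arange(len(baseband)) / fs

# Up-conversion: shift the baseband frame to the inaudible 20 kHz band.
modulated = baseband * np.cos(2 * np.pi * fc * t)

# Band-pass filter around the carrier to suppress frequency leakage
# (a 4th-order Butterworth filter is an assumption, not the paper's filter).
b, a = butter(4, [fc - bw / 2, fc + bw / 2], btype="bandpass", fs=fs)
tx = filtfilt(b, a, modulated)
```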
18
CIR - channel estimation
We estimate the CIR in each frame (10 ms) through a Least-Square (LS) method. After getting the received signal in baseband, we estimate the CIR through a least squares method. In the LS method, we solve the following matrix equation to get the CIR. The left part is a pre-defined training matrix which can be constructed from the sent signal. The right part is the received baseband signal. After this process, for each frame, we get a CIR measurement which indicates the distances of the surrounding reflectors. The LS estimation outputs a 140-tap complex CIR every 10 milliseconds.
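The LS step can be sketched as solving a linear system whose columns are delayed copies of the known sent baseband; the 140-tap CIR per 10 ms frame follows the slides, but the exact construction of the training matrix below is an assumption.

```python
import numpy as np

def estimate_cir_ls(sent, received, n_taps=140):
    """Least-squares CIR estimate for one 10 ms frame.

    Builds a training matrix A from delayed copies of the known sent
    baseband and solves A @ h ~= received for the n_taps complex CIR values.
    """
    n = len(received)
    A = np.zeros((n, n_taps), dtype=complex)
    for k in range(n_taps):
        A[k:, k] = sent[: n - k]        # column k = sent signal delayed by k samples
    h, *_ = np.linalg.lstsq(A, received, rcond=None)
    return h
```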
19
CIR - gesture image Assembling the CIRs along time, we get a CIR image. Subtracting the previous frame's CIR values reveals the moving part; we call this image the dCIR image. As we have 100 frames every second, we can assemble these CIR measurements into a matrix, [click] or an image. As we have stated, the CIR information indicates the reflector distances, so naturally we can reveal the moving parts by a differential operation. [click] We call the differential result dCIR. The dCIR reveals where the gesture takes place and how the fingers move in the air.
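A sketch of turning per-frame CIRs into a dCIR image (100 frames per second, 140 taps, as in the slides). Whether the frame difference is taken on the complex CIR or on its magnitude is an assumption here.

```python
import numpy as np

def build_dcir_image(cir_frames):
    # cir_frames: complex array of shape (n_frames, 140), one CIR per 10 ms frame.
    cir = np.asarray(cir_frames)

    # Subtracting the previous frame removes the static part (direct sound and
    # static reflections); only the moving reflections remain.
    dcir = np.abs(np.diff(cir, axis=0))     # |CIR_t - CIR_{t-1}|
    return dcir
```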
20
dCIR samples Some dCIR samples are shown in this slide.
For example, a double click gesture includes four moving stages: fingers tapping together, fingers separating, and then the same two stages again.
21
dCIR - gesture detection
We leverage the variance of dCIR samples along time to perform gesture detection. We also use the dCIR as a gesture detection indicator. By calculating the variance along the CIR index for each frame, we can easily detect a gesture by setting an empirical threshold.
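Gesture detection as described above can be sketched by thresholding the per-frame variance of the dCIR; the threshold value below is a placeholder, not the empirical value used in the paper.

```python
import numpy as np

def detect_gesture_frames(dcir, threshold=1e-4):
    # dcir: array of shape (n_frames, n_taps) with dCIR magnitudes.
    frame_var = np.var(dcir, axis=1)    # variance along the CIR index per frame
    return frame_var > threshold        # True where a gesture is likely present
```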
22
CNN model - data tensor We can actually get a dCIR measurement at each microphone; combining all microphones' data together, we get a data tensor. We collect dCIR images from up to 4 microphones in our self-designed speaker-microphone kit. Since we get a dCIR measurement at each microphone, combining all microphones' data together gives us a 3D data tensor. The three dimensions are microphone index, dCIR index, and time, respectively.
23
CNN model We choose a CNN (Convolutional Neural Network) as the classifier in our work. Classifiers such as SVM and KNN may miss valuable information in the feature extraction process. CNNs are good at classifying high dimensional tensor data, e.g., image classification. We choose a CNN as the classifier in our work for two reasons. First, classifiers such as SVM and KNN require feature extraction or directly treat every pixel as a feature, which may miss valuable information or destroy valuable relationships between neighboring pixels. In contrast, a CNN is suitable for classifying high dimensional tensors and shows good performance in the image classification area.
24
CNN model - input layer We set the input layer of our CNN model as a 160*140*4 data tensor. Input data tensor: 160 frames (1.6 seconds) 140 dCIR indexes 4 microphones We set the input layer of our CNN model as a 3-dimensional data tensor. The length of each dimension is shown in this figure: 160 frames represent 1.6 seconds in time, 140 dCIR indexes are determined by our frame design, and there are 4 microphones, which is tunable according to the recording device.
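Building the 160 x 140 x 4 input tensor from the per-microphone dCIR images could look like this sketch (variable names are illustrative):

```python
import numpy as np

def build_input_tensor(dcir_per_mic):
    # dcir_per_mic: list of 4 arrays, each of shape (160, 140) --
    # 160 frames (1.6 s) by 140 dCIR indexes for one microphone.
    x = np.stack(dcir_per_mic, axis=-1)     # shape (160, 140, 4)
    assert x.shape == (160, 140, 4)
    return x
```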
25
CNN model - structure CNN Architecture:
Input → [ Conv → Relu → Pooling ] * layers → [ FC ] → Output Our UltraGesture recognition model takes about 2.48 M parameters when there are 5 layers. Our CNN structure includes an input layer, a fully connected layer, an output layer, and several repeated convolution, ReLU, and pooling layers. We set the number of repetitions to 5 in our implementation, which makes our CNN model take about 2.48 million parameters in memory.
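A hedged Keras sketch of this architecture: five repeated Conv -> ReLU -> Pooling stages followed by a fully connected output over the 12 gesture classes. The filter counts and kernel sizes are assumptions and will not exactly reproduce the reported 2.48 M parameters.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_ultragesture_cnn(num_classes=12, num_stages=5):
    model = keras.Sequential()
    model.add(keras.Input(shape=(160, 140, 4)))   # 160 frames x 140 dCIR taps x 4 mics
    filters = 16                                  # assumed starting filter count
    for _ in range(num_stages):
        # One repeated stage: Conv -> ReLU -> Pooling
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(2))
        filters *= 2
    model.add(layers.Flatten())
    model.add(layers.Dense(num_classes, activation="softmax"))   # FC output layer
    return model

model = build_ultragesture_cnn()
model.summary()
```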
26
Outline Motivation Doppler vs. FMCW vs. CIR Solution Evaluation
Now, I will introduce the evaluation results of this work.
27
Evaluations - devices We implemented our system and collected data from a Samsung S5 mobile phone and a self-designed kit. Samsung S5: 1 speaker and 2 microphones Self-designed kit: 2 speakers (1 used) and 4 microphones We implemented our system and collected data from a Samsung S5 mobile phone and a self-designed kit. The Samsung S5 has 1 speaker and 2 microphones. The self-designed kit has 2 speakers, of which we only use the central one (spk1 in the right figure), and 4 microphones. (a) Samsung S5 (b) Self-designed kit
28
Evaluations - gestures
We collected data for 12 basic gestures performed by 10 volunteers (8 males and 2 females) with a time span of over a month under different scenarios. We collected data for 12 basic gestures performed by 10 volunteers (8 males and 2 females) with a time span of over a month under different scenarios.
29
Evaluations - 1 The average recognition accuracy is 97.92% for a ten-fold cross validation result. Data source: Samsung S5 Normal office environment With noise from air conditioners and servers We achieve an average recognition accuracy of 97.92% in a ten-fold cross validation. The analyzed gesture data was obtained on the Samsung S5 in a normal office environment with noise from air conditioners and servers.
30
Evaluations - 2 We test the performance under different scenarios:
The recognition accuracy improves from 92% (1 microphone) to 98% (4 microphones). The overall gesture recognition accuracy drops slightly from 98% to 93% when the noise level increases from 55 dB to 65 dB. We test the performance in terms of different microphone numbers and different noise levels. In the figure below, each color in a group represents one type of gesture. In terms of microphone number, the recognition accuracy improves from 92% to 98% when we increase the number of microphones from 1 to 4. In terms of noise level, the recognition accuracy drops as the noise level increases.
31
Evaluations - 3 We test the system performance under some typical usage scenarios: New users: 92.56% Left hand: 98.00% With gloves on: 97.33% Occlusion: 85.67% Using UltraGesture while playing music: 88.81% We test the system performance under some typical usage scenarios. For example, for a new user who has not contributed training data in the training process, the mean accuracy is 92.56%. We notice that performing the gesture with the left hand or with gloves on does not significantly affect the recognition accuracy, but performing the gestures under occlusion or while playing music at the same time has a notable influence.
32
Conclusion The contributions of our work can be summarized as follows:
We analyzed the inherent drawbacks of existing Doppler and FMCW based methods. We proposed to use CIR measurements to achieve a higher distance resolution. We combined the deep learning framework CNN to achieve high recognition accuracy. In conclusion, we made the following three contributions: first, we analyzed the inherent drawbacks of existing Doppler and FMCW based methods; second, we proposed to use CIR measurements to achieve a higher distance resolution; third, we combined a CNN deep learning model to achieve high recognition accuracy.
33
Q & A That’s all. Thank you for listening! Any questions?