Action Recognition in Video


1 Action Recognition in Video
Advanced Topics In Computer Vision, Spring 2016 Presented by: Sima Sabah

2 Action Recognition Pipeline
Video Representation. I want to start by presenting the traditional action recognition pipeline. For each video we extract spatio-temporal features, possibly pre-process them to make them more separable between actions, use pooling to keep only the discriminative features, and normalize in case some of the features have different scales. In the end we have a video representation that we can feed into a classifier that determines the action in the video, in this case fencing. Feature extraction usually consists of the following steps: detect spatio-temporal interest points in the video; each point defines a space-time patch around it, or a trajectory over a few frames, and we compute descriptors on these volumes. Feature encoding, for example with bag of words, starts by taking a set of training videos and extracting features from them. Those features are quantized using k-means or Gaussian Mixture Models, which are a generalization of k-means. The quantized features form a codebook, and we can represent a video by a histogram of these words. Image credits: Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice, Xiaojiang Peng, Limin Wang, Xingxing Wang, Yu Qiao [Peng, X., et al. 2014]; [Jan van Gemert, UvA]
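The bag-of-words encoding described in these notes can be sketched roughly as follows. This is only an illustration in plain Python: it assumes the codebook has already been learned (for example by k-means), and the function names and the toy 2-D descriptors are hypothetical, not taken from any of the cited papers.

```python
import math

def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor (Euclidean)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bow_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest word; return an L1-normalized histogram."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy codebook with three "visual words" (assumed to come from k-means).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
# Toy descriptors extracted from one video.
video_descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]
hist = bow_histogram(video_descriptors, codebook)
print(hist)
```

The resulting histogram is the fixed-length video representation that would be handed to the classifier.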

3 Hand-made Features HOG (Histogram of Oriented Gradients) [Dalal and Triggs, 2005] What kind of spatio-temporal features did people use? We’ll start with hand-made features. The first one is actually a spatial filter rather than a temporal one: Histogram of Oriented Gradients. Given an image, we look at the gradients in different blocks and calculate a histogram of the gradient direction, quantized to 8 directions.
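A minimal sketch of the orientation histogram for a single block, assuming central-difference gradients, 8 equal bins over the full circle, and magnitude-weighted votes. These choices and the function name are assumptions for illustration; the original HOG descriptor has further details (cells, block normalization) that are not shown here.

```python
import math

def orientation_histogram(image, bins=8):
    """Histogram of gradient directions over one block, weighted by gradient magnitude."""
    hist = [0.0] * bins
    rows, cols = len(image), len(image[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = image[y][x + 1] - image[y][x - 1]  # central difference in x
            gy = image[y + 1][x] - image[y - 1][x]  # central difference in y
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % (2 * math.pi)  # direction in [0, 2*pi)
            bin_index = int(angle / (2 * math.pi) * bins) % bins
            hist[bin_index] += magnitude
    return hist

# Toy block: intensity increases from left to right, so every gradient points along +x.
block = [[0, 1, 2, 3],
         [0, 1, 2, 3],
         [0, 1, 2, 3],
         [0, 1, 2, 3]]
hist = orientation_histogram(block)
print(hist)
```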

4 Hand-made Features HOG (Histogram of Oriented Gradients)
Optical Flow and Histogram of Optical Flow (HOF) [Laptev et al. 2008] Optical flow: given two images, we want to know where each interest point moved to. There are 2 assumptions: color constancy and small motion. The optical flow is a vector representing the direction and magnitude of the motion of each pixel, and we can compute a histogram of optical flow the same way we do with the image gradients for HOG. Another useful representation is a color-coded version, with hue for orientation and saturation for magnitude: hue is the orientation atan2(vy, vx); saturation is the magnitude sqrt(vx^2 + vy^2).
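The color coding of a flow vector can be sketched with the standard-library `colorsys` module. The function name `flow_to_rgb`, the normalization by a `max_magnitude` parameter, and the fixed value channel of 1.0 are assumptions made for this illustration.

```python
import math
import colorsys

def flow_to_rgb(vx, vy, max_magnitude):
    """Map one flow vector to RGB: hue encodes orientation, saturation encodes magnitude."""
    hue = (math.atan2(vy, vx) % (2 * math.pi)) / (2 * math.pi)   # orientation in [0, 1)
    saturation = min(math.hypot(vx, vy) / max_magnitude, 1.0)    # magnitude in [0, 1]
    return colorsys.hsv_to_rgb(hue, saturation, 1.0)

static_pixel = flow_to_rgb(0.0, 0.0, max_magnitude=1.0)   # no motion: white
moving_right = flow_to_rgb(1.0, 0.0, max_magnitude=1.0)   # full-speed motion along +x: red
print(static_pixel, moving_right)
```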

5 Hand-made Features HOG (Histogram of Oriented Gradients)
Optical Flow and HOF Trajectories Trajectories: the position of each tracked point along multiple frames. The tracking can be done by median filtering of the optical flow; for dense trajectories the median filter kernel is 3x3.
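As a rough sketch of tracking by median-filtered flow: each point is moved by the 3x3 median of the flow around it. The flow fields are plain nested lists of (u, v) pairs, the function names are hypothetical, and the sketch assumes the tracked point stays inside the frame.

```python
from statistics import median

def median_flow_at(flow, x, y):
    """3x3 median of each flow component around (x, y), clipped at the image borders."""
    rows, cols = len(flow), len(flow[0])
    us, vs = [], []
    for yy in range(max(0, y - 1), min(rows, y + 2)):
        for xx in range(max(0, x - 1), min(cols, x + 2)):
            u, v = flow[yy][xx]
            us.append(u)
            vs.append(v)
    return median(us), median(vs)

def track_point(point, flows):
    """Follow one point through consecutive flow fields (assumes it stays in the frame)."""
    x, y = point
    trajectory = [(x, y)]
    for flow in flows:
        u, v = median_flow_at(flow, int(round(x)), int(round(y)))
        x, y = x + u, y + v
        trajectory.append((x, y))
    return trajectory

# Toy 5x5 flow field: everything moves 1 pixel to the right, with one outlier in the center.
flow = [[(1.0, 0.0)] * 5 for _ in range(5)]
flow[2][2] = (9.0, 9.0)
trajectory = track_point((2, 2), [flow, flow])
print(trajectory)
```

The median makes the track ignore the single outlier vector, which is the reason for filtering the flow before following it.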

6 Hand-made Features HOG (Histogram of Oriented Gradients)
Optical Flow and HOF Trajectories MBH (Motion Boundary Histogram) [Dalal et al. ECCV 2006] MBH: these are the spatial derivatives of the optical flow, computed separately for the x component and the y component of the flow. They have a few interesting properties: they are invariant to constant camera motion, and when there is no motion the motion boundaries disappear.
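The invariance to constant camera motion follows directly from taking derivatives, as this small sketch shows for one flow component. The central-difference scheme and the function name are assumptions; the histogram step on top of the derivatives is the same as for HOG and is omitted.

```python
def spatial_derivatives(component):
    """Central-difference (d/dx, d/dy) of one flow component at the interior pixels."""
    rows, cols = len(component), len(component[0])
    return [[((component[y][x + 1] - component[y][x - 1]) / 2.0,
              (component[y + 1][x] - component[y - 1][x]) / 2.0)
             for x in range(1, cols - 1)]
            for y in range(1, rows - 1)]

# Horizontal flow component: static background (0) and an object on the right moving (2).
u = [[0, 0, 2, 2],
     [0, 0, 2, 2],
     [0, 0, 2, 2]]
# Same scene with a constant camera motion of 5 pixels added everywhere.
u_with_camera = [[value + 5 for value in row] for row in u]

boundaries = spatial_derivatives(u)
boundaries_with_camera = spatial_derivatives(u_with_camera)
uniform = spatial_derivatives([[3, 3, 3, 3]] * 3)  # pure camera motion, no object
print(boundaries, uniform)
```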

7 Hand-made Features HOG (Histogram of Oriented Gradients)
Optical Flow and HOF Trajectories MBH (Motion Boundary Histogram)

8 Learned Features Neural Networks
Learned features are usually computed with neural networks. As a simplification for people who don’t know anything about deep learning, this is how I want you to think about it. This was our traditional action recognition pipeline. In the case of deep learning, the video representation and the classifier are learned together, and the features are extracted according to the video representation. A typical neural network starts with convolution layers, which we can look at as the feature extraction part. They compute convolutions of local areas in the video with different learned kernels. The more layers there are, the larger the areas of the video they look at, and the more complex the extracted features can be. There are also non-linear functions, normalization and pooling between those layers, which aren’t always explicitly mentioned and which add to the complexity of those features. Following the convolutional layers there is a fully connected (FC) layer or layers, which we can look at as computing the video representation: it takes the local features and combines them into a single vector representing the video. In the end we have a final fully connected layer followed by a softmax layer, which together represent our classifier. The last FC layer outputs a number for each action class, and the softmax layer translates these numbers into the probability that the video shows each action. Since we said the second-to-last FC layer is the representation of the video, we can also feed its output to an external classifier like an SVM. That may not seem to make a lot of sense, since the representation is learned such that the network’s own classifier performs best, but it is useful if we want to combine representations from different frameworks.
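The softmax step at the end of the network can be illustrated in a few lines. The class scores below are made-up numbers standing in for the output of the last FC layer, and the class names are hypothetical.

```python
import math

def softmax(scores):
    """Turn per-class scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical outputs of the last FC layer for three action classes.
class_names = ["fencing", "diving", "kicking"]
probs = softmax([2.0, 1.0, 0.1])
predicted = class_names[probs.index(max(probs))]
print(predicted, probs)
```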

9 Action Recognition with Improved Trajectories
Heng Wang, Cordelia Schmid ICCV 2013 The first work I want to present is Action Recognition with Improved Trajectories, but before we get to the improvements I want to start with the baseline they improved.

10 Dense Trajectories [Wang et al. IJCV’13]
Densely sample points, track non-homogeneous regions for at most 15 frames, and then resample to keep the trajectories dense. Ignore static trajectories and trajectories with sudden jumps. From the rest, extract spatio-temporal tubes of size 32x32x15 (N=32) and divide each into a 2x2x3 grid of space-time patches of size 16x16x5, on which the descriptors are calculated. The trajectory itself is also used as a descriptor, as a sequence of normalized displacement vectors. Sampling is done at up to 8 spatial scales, spaced by a factor of 1/sqrt(2).
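A sketch of the trajectory descriptor as normalized displacement vectors. It assumes the normalization divides by the sum of the displacement magnitudes, and it returns `None` for a static trajectory; the function name and the toy points are hypothetical.

```python
import math

def trajectory_descriptor(points):
    """Displacement vectors between consecutive points, normalized by their total length."""
    displacements = [(x1 - x0, y1 - y0)
                     for (x0, y0), (x1, y1) in zip(points, points[1:])]
    total = sum(math.hypot(dx, dy) for dx, dy in displacements)
    if total == 0:
        return None  # static trajectory: nothing to describe, it would be ignored
    return [(dx / total, dy / total) for dx, dy in displacements]

descriptor = trajectory_descriptor([(0, 0), (3, 4), (3, 9)])
static = trajectory_descriptor([(1, 1), (1, 1), (1, 1)])
print(descriptor, static)
```

Normalizing by the total length makes the descriptor depend on the shape of the motion rather than on its overall speed.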

11 Camera Motion Estimation
SURF (green) and optical flow (red). Motivation: 2 actions can look similar if the camera motion is similar, and camera motion also corrupts the foreground motion descriptors. We extract SURF features, which are robust to blur (and respond to blob-like patches), and match them with nearest neighbors. We also choose “good features to track” from the optical flow, which correspond to corner-like patches (thresholding the smallest eigenvalue of the autocorrelation matrix: if we have 2 strong eigenvalues, it’s a corner). Then we use RANSAC to find the homography.

12 Warp Optical Flow
First row: 2 consecutive frames overlaid. Second row: the original optical flow. Third row: the warped optical flow after camera motion estimation. Fourth row: in white, the trajectories that are consistent with the camera motion, which we would like to remove. Warping improves HOF a lot, MBH less.

13 Removed Trajectories Successful examples Failure cases
Here in white you can see the optical flow that is consistent with the camera motion and will be removed, for different camera motions. Given the homography, we can warp the optical flow to “undo” the camera motion and ignore the background trajectories and descriptors. In the last row there are failure cases. Left: motion blur, so there aren’t enough good matches. Right: the homography was fitted to the human motion. Removed trajectories are shown in white and foreground ones in green.
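The idea of “undoing” the camera motion can be sketched as subtracting, at each point, the displacement that the homography alone would cause. This is a simplification for illustration: it assumes the homography `H` is already estimated, it corrects individual flow vectors rather than warping a whole frame, and the 1-pixel background threshold and all function names are hypothetical.

```python
import math

def apply_homography(H, x, y):
    """Map the point (x, y) through the 3x3 homography H."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def warp_flow(flow_vector, point, H):
    """Subtract the camera-induced displacement at `point` from the measured flow."""
    x, y = point
    hx, hy = apply_homography(H, x, y)
    return (flow_vector[0] - (hx - x), flow_vector[1] - (hy - y))

def is_background(flow_vector, point, H, threshold=1.0):
    """A vector is treated as background if almost nothing is left after warping."""
    u, v = warp_flow(flow_vector, point, H)
    return math.hypot(u, v) < threshold

# Toy camera motion: a pure translation of 2 pixels to the right.
H = [[1, 0, 2],
     [0, 1, 0],
     [0, 0, 1]]
background = warp_flow((2, 0), (10, 10), H)   # moves exactly with the camera
foreground = warp_flow((5, 1), (10, 10), H)   # has motion of its own
print(background, foreground)
```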

14 Human Detection Part-based Human detector [Prest et al. 2012]
As we’ve seen in the failure case, the matches on the human are outliers in the homography computation, so we would like to use the human bounding box as a mask and exclude the matches inside it. They use a human detector to eliminate the human matches from the homography estimation. The detection is made more robust by tracking the human with the average optical flow for at most 15 frames forward and backward (or until there is a 50% intersection with another detection). After warping the optical flow, the descriptors are recomputed, except for HOG.

15 Results – Compare Features
It can be seen that the combination of all the descriptors is the best. HOF and MBH are complementary, as they represent zero-order and first-order motion information. RmTrack: remove optical flow descriptors of background but use background descriptors. WarpFlow: warp the optical flow after motion estimation (for all the descriptors, including background). Both RmTrack and WarpFlow help; WarpFlow contributes more; combining them (ITF) works best.

16 Results – Encoding and Classifier
Bag of words: k-means, SVM with RBF-χ² kernel. Fisher Vector: GMM, linear SVM. RBF: used to normalize different types of descriptors. χ²: distance function to measure distances between histograms. Fisher vector is a better representation than bag of features. Different types of descriptors were normalized and concatenated into a single vector, and a linear SVM classifier was applied to it.

17 Feature Encoding
Bag of Words + SVM with RBF-χ² kernel:
χ² distance function: $D(P,Q) = \chi^2(P,Q) = \frac{1}{2}\sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i}$
Kernel: $K(x_i, x_j) = \exp\left(-\sum_c \frac{1}{A_c} D(x_i^c, x_j^c)\right)$, where $c$ runs over the channels (descriptor types) and $A_c$ is the per-channel normalization factor.
Sample 100,000 features of each type and use k-means for the dictionary. The size of the codebook is 4000. SVM C=100. The χ² kernel is suited to histogram distances; the RBF kernel handles multiple channels (descriptor types).
Fisher Vector + Linear SVM:
2DK vector for each descriptor type, where D is the descriptor dimension after factor-2 PCA and K is the number of Gaussians (256). Different descriptor types are concatenated after normalization. The Fisher vector encodes both first-order and second-order statistics between the video descriptors and the GMM. Randomly sample 260,000 features from the training set to estimate the GMM.
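A small sketch of the χ² distance and of a multi-channel RBF-χ² kernel of the kind described on this slide. The dictionary layout (channel name mapped to histogram), the function names, and the toy normalization factors `A_c` are assumptions for illustration.

```python
import math

def chi2_distance(p, q):
    """Chi-squared distance between two histograms; empty bin pairs are skipped."""
    return 0.5 * sum((pi - qi) ** 2 / (pi + qi)
                     for pi, qi in zip(p, q) if pi + qi > 0)

def rbf_chi2_kernel(x_i, x_j, mean_distances):
    """Multi-channel kernel: exp(-sum_c D(x_i^c, x_j^c) / A_c)."""
    total = sum(chi2_distance(x_i[c], x_j[c]) / mean_distances[c] for c in x_i)
    return math.exp(-total)

# Two toy videos, each described by one histogram per descriptor type (channel).
video_a = {"HOG": [1.0, 0.0], "HOF": [0.5, 0.5]}
video_b = {"HOG": [0.0, 1.0], "HOF": [0.5, 0.5]}
A = {"HOG": 1.0, "HOF": 1.0}   # hypothetical per-channel normalization factors

d_hog = chi2_distance(video_a["HOG"], video_b["HOG"])
k_same = rbf_chi2_kernel(video_a, video_a, A)
k_diff = rbf_chi2_kernel(video_a, video_b, A)
print(d_hog, k_same, k_diff)
```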

18 Results Even without human detection, they surpassed the state of the art when the paper was published in 2013. We’ll talk later about how they compare to the current state of the art. HMDB: 6.8k videos, 51 categories

19 Two-Stream Convolutional Networks for Action Recognition in Videos
Karen Simonyan, Andrew Zisserman NIPS 2014 As we can see from the title we’re going to talk about convolutional networks, but not a standard one.

20 Network Architecture 224x224x3 224x224x2L
The input video is fed into 2 separate networks. The spatial stream basically does action recognition from a single color image. The temporal stream is fed with multi-frame optical flow and does action recognition from temporal information. The results of the two networks are fused either by averaging or by training an SVM classifier. Although this network learns the representation of the video, it uses optical flow, which is a hand-made feature, as input. The 2 networks are trained separately, and the architecture is CNN-M-2048: ReLU in each hidden layer, max pooling over 3x3 spatial windows with stride 2, and local normalization. The difference between the temporal and spatial networks: the second normalization layer is removed in the temporal network to reduce memory consumption.

21 Optical Flow Representation
Optical Flow Stacking, Trajectory Stacking. There are 2 types of optical flow representations. Optical flow stacking: taking the optical flow at the same spatial point in each frame. Trajectory stacking: taking the optical flow at each point along the trajectory. For each of these there can also be 2 variations. Bi-directional: L/2 frames in each direction, instead of L consecutive frames starting with the current frame. Mean flow subtraction: to account for global camera motion, done between each 2 frames on d.
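Optical flow stacking and mean flow subtraction can be sketched with nested lists. The channel order (x component then y component for each frame pair) and the function names are assumptions made for this illustration.

```python
def stack_optical_flow(flows):
    """Stack L flow fields (H x W grids of (dx, dy)) into 2L channels: dx_1, dy_1, ..."""
    channels = []
    for flow in flows:
        channels.append([[dx for dx, _ in row] for row in flow])
        channels.append([[dy for _, dy in row] for row in flow])
    return channels

def subtract_mean_flow(flow):
    """Remove the mean displacement of one flow field (a crude global-motion correction)."""
    n = sum(len(row) for row in flow)
    mean_dx = sum(dx for row in flow for dx, _ in row) / n
    mean_dy = sum(dy for row in flow for _, dy in row) / n
    return [[(dx - mean_dx, dy - mean_dy) for dx, dy in row] for row in flow]

# Toy input: L = 3 flow fields of size 2x2 where everything moves by (1, 2).
flows = [[[(1.0, 2.0), (1.0, 2.0)],
          [(1.0, 2.0), (1.0, 2.0)]] for _ in range(3)]
stacked = stack_optical_flow(flows)
compensated = subtract_mean_flow(flows[0])
print(len(stacked), compensated)
```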

22 Implementation
Resize video to maximum spatial size 256.
Training. Spatial: randomly crop and flip, 224x224x3. Temporal: randomly crop and flip, 224x224x2L, with L=1,5,10, with/without mean subtraction, and uni/bi-directional optical flow. Multi-task learning: 2 loss layers.
Testing. Sample 25 frames, with 10 crops each (corners and center, plus flipped versions). Class score: average over frames and crops. Fusion: average score or SVM.
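The test-time averaging and the simpler of the two fusion options (averaging) can be sketched as follows. The score values are invented, only two crops and two classes are shown to keep the example short, and the function name is hypothetical.

```python
def average_scores(score_lists):
    """Element-wise average of several per-class score vectors."""
    n = len(score_lists)
    return [sum(column) / n for column in zip(*score_lists)]

# Hypothetical per-crop class scores from each stream (2 classes, 2 crops).
spatial_crop_scores = [[0.5, 0.5], [0.75, 0.25]]
temporal_crop_scores = [[0.25, 0.75], [0.25, 0.75]]

spatial_score = average_scores(spatial_crop_scores)     # class score of the spatial stream
temporal_score = average_scores(temporal_crop_scores)   # class score of the temporal stream
fused = average_scores([spatial_score, temporal_score])  # fusion by averaging
predicted_class = fused.index(max(fused))
print(fused, predicted_class)
```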

23 Relation to Hand-made Features
HOF, MBH: can be learned by a single convolutional layer (containing orientation-sensitive filters) followed by rectification and pooling layers. The convolutional layer computes the derivatives, the rectification thresholds the activations of close directions, and average pooling can create the histogram, although I think they used max pooling. Trajectory: can be an input, using trajectory stacking. Still missing: local pooling over spatio-temporal tubes centered at the trajectories (the features don’t correspond to the trajectories), and camera motion compensation, which is currently handled only by mean displacement subtraction.

24 Results – Temporal ConvNet
UCF-101 – optical flow representation HMDB-51

25 Results - UCF-101 Spatial ConvNet Two-stream ConvNet
The spatial network is pre-trained on ImageNet. For HMDB, adding 78 classes from UCF-101 that don’t intersect with the HMDB classes helps, since HMDB is 2.6 times smaller than UCF-101. UCF-101: 13,320 videos, 101 human actions. SVM fusion is trained on the softmax outputs.

26 Results – UCF-101 UCF-101 – optical flow representation
Two-Stream ConvNet

27 Results UCF101: 13K videos, 101 actions; HMDB: 6.8K videos, 51 actions

28 Actions ~ Transformations
X. Wang, A. Farhadi, A. Gupta CVPR 2016 Most action recognition works focus on finding appearance and motion features that can discriminate between different actions. But this is not the essence of an action.

29 Motivation For example, look at this baby pushing a cart. If I ask you to describe what it means to push something, you wouldn’t talk about the pose I should be in or what kind of motion I need to make. Pushing is an action where there is some still object, and my interaction with it moves it away from me. Basically, the essence of an action is the transformation it applies to the environment, between before and after it happens. If we ask the two-stream network to represent this video and show us a video with a similar representation, we get this baby, who is obviously (to us) crawling towards his mother and not pushing anything. But we can’t really blame the network, because the appearance and pose are similar, and the motion is in the same direction. The problem is the representation.

30 Actions as Transformations
We would like to segment the video into the precondition, which is the state before the action, and the effect, which is the state after the action, such that applying the action to the precondition gets us to the effect. For example, for a kick we have a ball on the grass and a player approaching it, and after the action of the kick the ball flies away. Given this representation, we can train a network using triplets of the precondition sequence, the effect sequence and the action, to learn a representation for the precondition, the effect and the transformations, and then test it with the precondition and effect sequences. We can also test its ability to generalize: if we trained a network using videos of climbing a tree, we expect the network to understand that climbing a wall is the same action. Or in this example, a long jump is the same as a high jump, even if they are in different directions and with different motions. We can also predict what the effect of an action will look like. For example, if we see someone diving with the sea in the background, we expect that the next thing we’ll see is the person in the air above the water, and then the splash when the person hits it. We would like our network to understand this too.

31 Modeling Actions as Transformations
Given video: $X = \{x_1, x_2, \ldots, x_t\}$
$X_p = \{x_1, x_2, \ldots, x_{z_p}\}$ are the precondition frames.
$X_e = \{x_{z_e}, x_{z_e+1}, \ldots, x_t\}$ are the effect frames.
$d$-dimensional embeddings: $f_p(X_p)$, $f_e(X_e)$
Actions as transformations: $T = \{T_1, T_2, \ldots, T_n\}$ is a set of $d \times d$ matrices.
Goal: $T_y f_p(X_p) = f_e(X_e)$
Unknowns: $T$, $f_p(X_p)$, $f_e(X_e)$, $z_p$, $z_e$
2-step iterative EM algorithm: learning the model parameters, and estimating the latent variables.

32 Network Architecture: Siamese Network
Siamese network with VGG16. Each frame is fed separately through the network until the average pooling. Once we have the representations of all the frames, we average over time, and after another fully connected layer we get our representation: $f_p(X_p)$ for the precondition and $f_e(X_e)$ for the effect. We can then multiply the precondition representation by each of the transformation matrices and compare the result to the effect representation. The loss function consists of 2 terms: the distance for the right transformation should be small, and the distance for the rest should be big, but if it’s bigger than the margin $M$ you don’t need to change the network parameters ($M=0.5$: if an action is far enough, you don’t have to change the network to take it further away).
Loss: $\min\; D\left(T_y f_p(X_p), f_e(X_e)\right) + \sum_{i \neq y}^{n} \max\left(0, M - D\left(T_i f_p(X_p), f_e(X_e)\right)\right)$

33 Network Architecture: Two-Stream Siamese
[Figure: temporal stream ConvNet and spatial stream ConvNet.] Each of the networks is trained separately, with different parameters, and the results are fused. Temporal stream: optical flow stacking of 10 consecutive optical flow fields starting from the current frame (uni-directional).

34 Learning Algorithm
Initialize the network weights with a pre-trained Two-Stream Network, then repeat:
Forward propagation and feature computation for each frame.
Search for the latent variables by brute force over all the combinations: $(z_p^*, z_e^*) = \arg\min_{z_p, z_e} D\left(T_y f_p(X_p), f_e(X_e)\right)$, under a constraint on $z_p$ and $z_e$.
Calculate the joint loss: $\min\; D\left(T_y f_p(X_p), f_e(X_e)\right) + \sum_{i \neq y}^{n} \max\left(0, M - D\left(T_i f_p(X_p), f_e(X_e)\right)\right)$
Perform back-propagation.
Training is done for each stream separately!
Distance function (1 minus the cosine of the angle between the vectors): $D(v_1, v_2) = 1 - \frac{v_1 \cdot v_2}{\|v_1\| \|v_2\|}$
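The cosine distance and the two-term loss can be written out in a few lines. This sketch uses tiny 2-D embeddings and hand-picked transformation matrices purely for illustration; in the actual method both the embeddings and the matrices are learned, and all names below are hypothetical.

```python
import math

def mat_vec(T, v):
    """Multiply a d x d matrix (list of rows) by a d-dimensional vector."""
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def cosine_distance(v1, v2):
    """D(v1, v2) = 1 - cos(angle between v1 and v2)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return 1.0 - dot / (norm1 * norm2)

def transformation_loss(transforms, y, f_p, f_e, margin=0.5):
    """Distance for the correct action y, plus hinge terms pushing the other actions away."""
    loss = cosine_distance(mat_vec(transforms[y], f_p), f_e)
    for i, T in enumerate(transforms):
        if i != y:
            loss += max(0.0, margin - cosine_distance(mat_vec(T, f_p), f_e))
    return loss

# Toy setup: action 0 rotates the embedding by 90 degrees, action 1 leaves it unchanged.
transforms = [[[0, -1], [1, 0]],
              [[1, 0], [0, 1]]]
f_p, f_e = [1, 0], [0, 1]   # precondition and effect embeddings
loss_correct = transformation_loss(transforms, 0, f_p, f_e)
loss_wrong = transformation_loss(transforms, 1, f_p, f_e)
print(loss_correct, loss_wrong)
```

With the correct label the first term is zero and the other action is already farther than the margin, so nothing would need to change; with the wrong label both terms contribute.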

35 Inference
Objective: $\min_{y, z_p, z_e} D\left(T_y f_p(X_p), f_e(X_e)\right)$, solved by brute-force search over $(y, z_p, z_e)$.
Model fusion: 2 x TemporalScore + SpatialScore.
[Figure: the temporal stream ConvNet and the spatial stream ConvNet each produce a distance, which gives the temporal score and the spatial score.]
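The brute-force inference can be sketched as a triple loop over the action label and the two split points. The embedding here is just the mean of toy per-frame vectors, which is an assumption standing in for the learned networks; the candidate split points and all names are hypothetical.

```python
import math
from itertools import product

def mat_vec(T, v):
    return [sum(t * x for t, x in zip(row, v)) for row in T]

def cosine_distance(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in v1)) *
                        math.sqrt(sum(b * b for b in v2)))

def mean_vector(frames):
    """Toy embedding: average the per-frame feature vectors over time."""
    return [sum(column) / len(frames) for column in zip(*frames)]

def infer(frames, transforms, candidates_p, candidates_e):
    """Return (distance, y, z_p, z_e) minimizing D(T_y f_p(X_p), f_e(X_e))."""
    best = None
    for y, z_p, z_e in product(range(len(transforms)), candidates_p, candidates_e):
        f_p = mean_vector(frames[:z_p])   # precondition: the first z_p frames
        f_e = mean_vector(frames[z_e:])   # effect: frames from index z_e onward
        d = cosine_distance(mat_vec(transforms[y], f_p), f_e)
        if best is None or d < best[0]:
            best = (d, y, z_p, z_e)
    return best

# Toy video whose per-frame features flip from [1, 0] to [0, 1] halfway through.
frames = [[1, 0], [1, 0], [0, 1], [0, 1]]
transforms = [[[1, 0], [0, 1]],     # action 0: identity
              [[0, -1], [1, 0]]]    # action 1: 90-degree rotation
result = infer(frames, transforms, candidates_p=[1, 2], candidates_e=[2, 3])
print(result)
```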

36 Temporal Segmentation Results

37 Attention

38 Recognition Results UCF-101 HMDB51 ACT

39 Recognition Results On HMDB, the two-stream VGG isn’t better than the regular VGG, because the original two-stream uses an SVM for fusion while their implementation averages the total scores. The two-stream features are complementary to ours: they are discriminative motion and appearance features, while “ours” are more semantic. To combine them, concatenate the two embeddings (the precondition after transformation, and the effect), train a softmax classifier on them, and average the results with the two-stream results.

40 Cross-Category Generalization
ACT: 11,648 videos, 43 classes, 16 super-classes

41 Cross-Category Generalization
Average features Two-stream – similar appearance and motion Actions as transformations – semantically related

42 Visual Prediction

43 Summary We talked about the traditional pipeline of action recognition using hand-made features, and about the different features that are used. We then talked about dense trajectories and how to improve them by warping the flow after camera motion estimation, with human detection to make the estimation more robust. Then we moved to deep networks and introduced the two-stream network, in which the spatial stream performs action recognition on still images and the temporal stream takes as input a sequence of optical flow fields concatenated in different ways. The fusion between the networks was done either by averaging their scores or by using SVM/Fisher vectors to learn the fusion. Next, we introduced a new notion of what an action is, by representing actions as linear transformations between the embeddings of the precondition and the effect of an action, and saw that we can generalize very well, and even predict the effect given a precondition and an action. We saw the difference between the semantic representation of this network and the discriminative motion and appearance features of two-stream networks.

44 Take Home Message Compensating for camera motion
Hand-made features and traditional 2D CNNs capture discriminative appearance and motion features, not semantic features.

45 Questions?

46 Learning Spatiotemporal Features with 3D Convolutional Networks
D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri ICCV 2015

47 2D ConvNets vs. 3D ConvNets
[Figure panels: 2D convolution; 2D convolution on multiple frames; 3D convolution.] 2D convolution loses the temporal information after 1 convolution. In the two-stream network, the temporal information is collapsed after the first convolution layer, even though multiple frames (of optical flow) are fed into the network.
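The difference in output shape can be seen in a toy implementation. Both functions below use all-ones kernels and no padding, which is an assumption made to keep the sketch short; only the shapes matter here, and the function names are hypothetical.

```python
def conv2d_multiframe(clip, k):
    """2D convolution with the L frames treated as input channels: one 2D map comes out."""
    L, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [[sum(clip[t][y + i][x + j]
                 for t in range(L) for i in range(k) for j in range(k))
             for x in range(W - k + 1)]
            for y in range(H - k + 1)]

def conv3d(clip, k, d):
    """3D convolution with a k x k x d kernel: the output keeps a temporal axis."""
    L, H, W = len(clip), len(clip[0]), len(clip[0][0])
    return [[[sum(clip[t + s][y + i][x + j]
                  for s in range(d) for i in range(k) for j in range(k))
              for x in range(W - k + 1)]
             for y in range(H - k + 1)]
            for t in range(L - d + 1)]

# Toy clip: 8 frames of 5x5 pixels, all ones.
clip = [[[1] * 5 for _ in range(5)] for _ in range(8)]
out2d = conv2d_multiframe(clip, k=3)   # shape 3 x 3: time has been summed away
out3d = conv3d(clip, k=3, d=3)         # shape 6 x 3 x 3: time is still an axis
print(len(out2d), len(out3d), len(out3d[0]))
```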

48 Goals Find a good architecture for a 3D convolutional network
Evaluate the quality of the network and of the fully-connected representation on different tasks.

49 Finding Network Architecture
[Figure: architecture-search network: Conv1 (64), pool1, Conv2 (128), pool2, Conv3 (256), pool3, Conv4, pool4, Conv5, pool5, fc6 (2048), fc7, softmax, with pooling strides labeled 1x2x2 and 2x2x2.] Convolution kernel: 3x3xd. Homogeneous depth-d: d=1, 3, 5, 7. Increasing depth: 3, 3, 5, 5, 7. Decreasing depth: 7, 5, 5, 3, 3. The number on each layer is the number of filters. The numbers underneath the pooling layers are the strides; the convolutional layers have no stride (1x1x1). The first pooling is only spatial, to keep the temporal information, because we can only pool temporally 4 times. Note on the number of parameters: the expressiveness shouldn’t be affected, because the difference is negligible. The largest difference is 51K parameters, which is 0.3% of the 17.5 million parameters of the network.

50 Network Architecture
[Figure: the architecture-search network (Conv1 64, pool1, Conv2 128, pool2, Conv3 256, pool3, Conv4, pool4, Conv5, pool5, fc6 2048, fc7, softmax) next to the final network (Conv1a 64, pool1, Conv2a 128, pool2, Conv3a 256, Conv3b, pool3, Conv4a 512, Conv4b, pool4, Conv5a, Conv5b, pool5, fc6 4096, fc7, softmax), with pooling strides labeled 1x2x2 and 2x2x2.] Note: they enlarged the network as much as they could within the GPU limitations.

51 Implementation
Frames are resized to 128x171 (approximately half resolution). Training: split each video into non-overlapping 16-frame clips, then randomly crop and flip spatially to 112x112, so the input size is 112x112x16. Test: 10 random clips are extracted, each with a single center crop, and the classification is averaged over the clips.
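Splitting a video into non-overlapping 16-frame clips amounts to simple index arithmetic. The sketch below returns (start, end) frame ranges; dropping an incomplete final clip is an assumption, and the function name is hypothetical.

```python
def split_into_clips(num_frames, clip_length=16):
    """Return (start, end) index pairs of non-overlapping clips; a short tail is dropped."""
    return [(start, start + clip_length)
            for start in range(0, num_frames - clip_length + 1, clip_length)]

clips = split_into_clips(50)        # a 50-frame video gives three full clips
too_short = split_into_clips(10)    # fewer than 16 frames gives no clip
print(clips, too_short)
```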

52 Results Sports-1M: 1.1 million sports videos, 487 categories
Trained on Sports-1M: 1.1 million videos in 487 sports categories (5 times more categories and 100 times more videos than the dataset used for choosing the architecture).

53 C3D Video Descriptor Generic Compact Efficient Simple
Using fc6 activations (4096)

54 C3D Video Descriptor Generic Compact Efficient Simple
Using fc6 activations (4096) Brox’s algorithm is used by Two-Stream network for computing optical flow.

55 Features Visualization

56 Results UCF-101 13,320 Videos 101 Human Actions
iDT and Two-stream actually work better than C3D. The iDT features are complementary: they capture low-level appearance and motion instead of high-level semantics.

57 Results Also: Action Similarity Labeling, Scene and Object Recognition

