Week 3 Volodymyr Bobyr.


Goals – Week 3
Goal: Completed (YES / Partial / NO)
C3D Understanding & Implementation: YES
C3D Baseline on UCF101
Read & Understand I3D Paper
I3D Implementation: Partial
I3D Baseline on UCF101: NO

Overview
C3D Structure & Performance
UCF101 Dataset
C3D Experimental Results
I3D Overview
I3D Implementation
Goals for next week

C3D
# Parameters: 78.41 Million
Optimizer: SGD
Pioneered:
  3D Convolutions (Temporal)
  Spatio-Temporal Pooling (Temporal)
Tran, Du, et al. “Learning Spatiotemporal Features with 3D Convolutional Networks.” 2015 IEEE International Conference on Computer Vision (ICCV), 2015
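As a rough sanity check on the parameter count above, the sketch below tallies weights and biases layer by layer. The layer widths (eight 3x3x3 convolutions, an 8192-dimensional flattened feature, two 4096-unit fully connected layers, and a 101-way output for UCF101) are assumptions based on the commonly used C3D layout, not something stated on the slide.

```python
# Assumed C3D layout: (in_channels, out_channels) for each 3x3x3 convolution.
CONV_LAYERS = [(3, 64), (64, 128), (128, 256), (256, 256),
               (256, 512), (512, 512), (512, 512), (512, 512)]
# Assumed fully connected layers: (in_features, out_features); 101 = UCF101 classes.
FC_LAYERS = [(8192, 4096), (4096, 4096), (4096, 101)]


def conv3d_params(c_in, c_out, k=3):
    """Weights plus biases of a k x k x k 3D convolution."""
    return k * k * k * c_in * c_out + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def count_c3d_params():
    conv_total = sum(conv3d_params(c_in, c_out) for c_in, c_out in CONV_LAYERS)
    fc_total = sum(linear_params(n_in, n_out) for n_in, n_out in FC_LAYERS)
    return conv_total + fc_total


total = count_c3d_params()
print(f"{total / 1e6:.2f} M parameters")
```

Under these assumptions most of the parameters sit in the first fully connected layer, not in the 3D convolutions.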

C3D Performance
Achieved strong results compared with state-of-the-art models of the time
Tran, Du, et al. “Learning Spatiotemporal Features with 3D Convolutional Networks.” 2015 IEEE International Conference on Computer Vision (ICCV), 2015

C3D Performance – UCF101
Tran, Du, et al. “Learning Spatiotemporal Features with 3D Convolutional Networks.” 2015 IEEE International Conference on Computer Vision (ICCV), 2015

UCF101
Simple actions
Realistic Environments
Groups share:
  Backgrounds
  Actors
  Other features
Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR abs/ (2012)

C3D – Experimental Results
Dataset: UCF101
Augmentation:
  Resized to 171x128
  Randomly Cropped 112x112
Learning Rate: 0.01
Batch Size: 11
From Scratch
No fine-tuning
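The resize-then-crop augmentation listed above can be sketched as follows. This is a minimal NumPy illustration, not the code used for the experiments: the nearest-neighbour resize stands in for a real image library, the (T, H, W, C) layout, the 16-frame clip length and the 320x240 source frames are assumptions, and 171x128 is read as width x height.

```python
import numpy as np


def resize_nearest(clip, out_h, out_w):
    """Nearest-neighbour resize of a (T, H, W, C) clip (stand-in for a real resizer)."""
    _, h, w, _ = clip.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return clip[:, rows[:, None], cols[None, :], :]


def random_crop(clip, size, rng):
    """Take the same random size x size spatial window from every frame."""
    _, h, w, _ = clip.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return clip[:, top:top + size, left:left + size, :]


rng = np.random.default_rng(0)
# Hypothetical input: 16 RGB frames of 240 (height) x 320 (width).
clip = rng.integers(0, 256, size=(16, 240, 320, 3), dtype=np.uint8)

resized = resize_nearest(clip, out_h=128, out_w=171)
cropped = random_crop(resized, size=112, rng=rng)
print(resized.shape, cropped.shape)
```

The crop offsets are drawn once per clip so that all frames stay spatially aligned.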

C3D – Experimental Results
Dataset: UCF101
Learning Rate:
Batch Size: 13
Pre-trained on Kinetics
Much better performance

I3D – Ideas
Two-Stream Inflated 3D ConvNet
Builds on state-of-the-art image classifiers
Performance far exceeds the state of the art when pretrained on the Kinetics dataset:
  400 human actions
  At least 400 video clips per action
  Each video clip approx. 10s long
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

Previous Models
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

Previous Video-Classification Models
LSTM + Conv2D:
  expensive to train due to backprop through time
  difficult to capture fine low-level motion
3D-ConvNet:
  can’t reuse 2D-ConvNet weights
  a lot of parameters
Two-Stream:
  reuses 2D-ConvNet weights, because the RGB stream takes a single frame as input
  averages predictions from 1 RGB frame and a stack of 10 optical flow frames
  good performance
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

Previous Video-Classification Models
3D ConvNet + Two-Stream:
  similar to Two-Stream, but passes the output of the 2D ConvNets through a 3D ConvNet before the output
  incrementally better performance than plain Two-Stream
Two-Stream 3D ConvNet:
  samples 5 RGB frames 10 frames apart and calculates the corresponding optical flows
  passes both through 3D ConvNets and averages the output
  I3D follows this structure, using an inflated Inception V1
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

I3D – Inflation
Pre-trains Conv3D feature extractors on ImageNet
Takes a single image and copies it N times to form a fake video with N frames
Trains state-of-the-art 2D classifiers on this dataset of fake videos to learn features
Inflates 2D Pooling layers by adding a temporal stride
Raised challenges with the stride
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr
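In the I3D paper the "fake video" idea is realized on the weights: each pretrained 2D filter is repeated N times along the time axis and divided by N, so that a video made of N copies of one image produces the same response as the 2D filter on that image. The sketch below checks that property on a single filter-sized patch; the 3x3 kernel, N = 3 and all names are illustrative assumptions.

```python
import numpy as np


def inflate_kernel(kernel_2d, n_frames):
    """Repeat a (kh, kw) kernel n_frames times along time and rescale by 1 / n_frames."""
    return np.repeat(kernel_2d[None, :, :], n_frames, axis=0) / n_frames


rng = np.random.default_rng(0)
kernel_2d = rng.standard_normal((3, 3))   # stands in for a pretrained 2D filter
patch = rng.standard_normal((3, 3))       # one image patch under the filter
n = 3                                     # temporal extent of the inflated filter

kernel_3d = inflate_kernel(kernel_2d, n)
boring_video = np.repeat(patch[None, :, :], n, axis=0)  # the same patch in every frame

response_2d = float(np.sum(kernel_2d * patch))
response_3d = float(np.sum(kernel_3d * boring_video))
print(np.isclose(response_2d, response_3d))
```

Because the responses match on static videos, the inflated network starts from the behaviour of the ImageNet-trained 2D network instead of from random weights.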

I3D – Challenges
Setting 3D Pooling temporal stride:
  Depends on the framerate of the video
  If the temporal stride is too high, the model will lose track of edges, as an object in movement will move too far
  If the temporal stride is too low, the model will not capture scene dynamics
*initially not understood
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

I3D – Input
RGB:
  Pure feedforward
  Allows short-term fine feature learning
  64 Frames
Optical Flow:
  Recurrent in its nature
  Used to capture the flow of motion
  Somewhat compensates for non-RNN structure
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

I3D – Details
Uses Inception V1 as its backbone for feature extraction – pretrained on ImageNet
Each convolutional layer (except for the last) is followed by BatchNorm & ReLU activation
Data augmentation:
  Random Cropping (spatial & temporal)
  Looping for short videos
  Horizontal flipping
  Photometric video augmentation
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr
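The first three augmentations listed above can be sketched in NumPy as below. This is an illustrative sketch only: the (T, H, W, C) layout, the 224x224 crop size, the 0.5 flip probability and the 40-frame example clip are assumptions, and photometric augmentation is left out.

```python
import numpy as np


def loop_to_length(clip, n_frames):
    """Repeat a short (T, H, W, C) clip until it has at least n_frames frames."""
    t = clip.shape[0]
    if t >= n_frames:
        return clip
    reps = -(-n_frames // t)  # ceiling division
    return np.concatenate([clip] * reps, axis=0)


def augment(clip, n_frames, size, rng):
    """Loop short clips, then apply a random temporal + spatial crop and a random flip."""
    clip = loop_to_length(clip, n_frames)
    t, h, w, _ = clip.shape
    start = rng.integers(0, t - n_frames + 1)   # temporal crop offset
    top = rng.integers(0, h - size + 1)         # spatial crop offsets
    left = rng.integers(0, w - size + 1)
    out = clip[start:start + n_frames, top:top + size, left:left + size, :]
    if rng.random() < 0.5:                      # horizontal flip, same for all frames
        out = out[:, :, ::-1, :]
    return out


rng = np.random.default_rng(0)
# Hypothetical short video: 40 frames, shorter than the 64-frame input.
short_clip = rng.integers(0, 256, size=(40, 256, 340, 3), dtype=np.uint8)
sample = augment(short_clip, n_frames=64, size=224, rng=rng)
print(sample.shape)
```

The flip decision and crop offsets are shared across frames so that motion within the clip stays consistent.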

I3D – Details
Testing:
  Convolutional model application
  Averaging of output
TV-L1 Optical flow algorithm
Optimizer: SGD
# Parameters: 25M (less than C3D)
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr
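One way to read "averaging of output" at test time is sketched below: each stream produces logits at several temporal positions of the video, these are averaged over time, and the two streams' class probabilities are then averaged. Whether averaging happens on logits or on probabilities is an assumption here, as are the random stand-in logits, the 7 temporal positions and the 101 classes.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def video_prediction(rgb_logits, flow_logits):
    """Average (positions, classes) logits over time per stream, then average the streams."""
    rgb_probs = softmax(rgb_logits.mean(axis=0))
    flow_probs = softmax(flow_logits.mean(axis=0))
    return (rgb_probs + flow_probs) / 2


rng = np.random.default_rng(0)
# Hypothetical network outputs: 7 temporal positions, 101 classes per stream.
rgb_logits = rng.standard_normal((7, 101))
flow_logits = rng.standard_normal((7, 101))

probs = video_prediction(rgb_logits, flow_logits)
print(probs.shape, int(probs.argmax()))
```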

I3D – Visualization
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

I3D – Performance
Carreira, Joao, and Andrew Zisserman. “Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, doi: /cvpr

Project
Task: Using Audio for Self-Supervised Threat Detection
Dataset: Police Body-Camera Footage
Subtask: Action classification (9 classes)
Challenges: unstable camera movement, unclear beginning/end of actions
Self-supervision: sound as a measure of danger

Project
Proposed Structure:
  First stream: I3D
  Second stream: VGG-like audio network
  Third stream (possibly): caption analysis
  Perform late fusion to predict escalation

Additional Notes – CRCV Bot
Automatic notifications while the model is running
Implemented through Google’s API

Goals – Week 4
Read papers related to utilization of sound for improved video classification – VGG
Implement a model which uses audio & video for classification
