Self-Supervised Cross-View Action Synthesis


Self-Supervised Cross-View Action Synthesis
Kara Schatz
Advisor: Dr. Yogesh Rawat
UCF CRCV – REU, Summer 2019

Project Goal: Synthesize a video from an unseen view. The goal of this project is to be able to synthesize a video from an unseen view.

Project Goal: Synthesize a video from an unseen view. Given: a video of the same scene from a different viewpoint, and appearance conditioning from the desired viewpoint. To achieve this, our approach will use a video of the same scene from a different viewpoint as well as appearance conditioning from the desired viewpoint.

Approach. This diagram shows the approach we are using to accomplish our goal. The overall idea is to use one network to learn the appearance of the desired view and another network to learn a representation of the 3D pose from a different view of the video. We then feed both into a video generator that reconstructs the video from the desired view. During training, we run the network on two different views and reconstruct both viewpoints. Once trained, we only need to provide one view of the video and one frame of the desired view.

Dataset: NTU RGB+D. 56K+ videos; 3 camera angles: -45°, 0°, +45°; resize and randomly crop to 112x112. The dataset we will use is NTU RGB+D, which contains over 56,000 videos taken from 3 different camera angles, giving us the different viewpoints. The videos are resized and randomly cropped to 112x112 before being passed to the network; a preprocessing sketch follows.
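A minimal per-frame preprocessing sketch using torchvision; the intermediate resize size before the crop is an assumption, since the slides only specify the final 112x112 crop:

import torchvision.transforms as T

# Resize each frame, then take a random 112x112 crop, as described above.
# The intermediate size of 128 is an assumption; the slides only give
# the final crop size.
preprocess = T.Compose([
    T.Resize(128),
    T.RandomCrop(112),
    T.ToTensor(),   # PIL HxWxC [0,255] -> tensor CxHxW [0,1]
])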

Training inputs: There are many different ways to utilize the dataset for training; here are the setups I plan to use (a view-pair sampling sketch follows this list):
- The 0° view paired with either the -45° or +45° view. Theoretically, I expect this to be the easiest setup, since it provides the minimum change in angle.
- Only the -45° and +45° views, without the 0° view. This should be more difficult, since the change in angle is doubled. Performing experiments both ways will hopefully give insight into the impact of angle change on the success of the model.
- Any two randomly chosen views. This may be less successful than the first case, but it should hopefully improve on the second, more difficult case.
- Eventually, I would like to set up the network so that you can input the desired viewpoint, so I want to see whether the network is successful when trained on 2 views and asked to generate a third view it has not learned at all.
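A small sketch of the four strategies above; the function name and strategy labels are my own, and only the three camera angles come from the dataset description:

import random

VIEWS = [-45, 0, +45]  # the three NTU camera angles

def sample_view_pair(strategy):
    """Return a (view A, view B) pair for one training setup.
    Names and labels here are illustrative, not from the slides."""
    if strategy == "center_plus_side":   # easiest: 45-degree change
        return 0, random.choice([-45, +45])
    if strategy == "sides_only":         # harder: 90-degree change
        return -45, +45
    if strategy == "random_pair":        # any two distinct views
        a, b = random.sample(VIEWS, 2)
        return a, b
    if strategy == "hold_out_third":     # train on 2 views, test on the 3rd
        return -45, +45                  # e.g., 0-degree view held out
    raise ValueError(f"unknown strategy: {strategy}")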

Network Design This week, I implemented the entire network in PyTorch. It consists of 3 different networks pipelined together.

Appearance Encoder: VGG16. Input: 112x112x3; Output: 7x7x512; pretrained on ImageNet. The first component is the Appearance Encoder, for which I am using the VGG16 network. Since we are not doing classification at all, I removed the fully connected and softmax layers. The reference diagram shows a 224x224 input, but we have a 112x112 input; since we still wanted a 7x7 output feature map, I also removed the final max pooling layer. https://neurohive.io/en/popular-networks/vgg16/
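A hedged PyTorch sketch of this modification: take torchvision's ImageNet-pretrained VGG16, drop the classifier entirely, and drop the final max pool so a 112x112 input yields a 7x7x512 map:

import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(pretrained=True)  # ImageNet weights

# Keep only the convolutional feature extractor (no FC / softmax) and
# drop its final MaxPool2d: with 4 pools instead of 5, a 112x112 input
# is downsampled 16x to a 7x7 map, matching the usual 224 -> 7 case.
appearance_encoder = nn.Sequential(*list(vgg.features.children())[:-1])

frame = torch.randn(1, 3, 112, 112)     # one conditioning frame
app_feat = appearance_encoder(frame)    # -> (1, 512, 7, 7)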

3D Pose Estimation: I3D. Input: 112x112x8x3; Output: 7x7x1x1024; pretrained on Charades. For the 3D Pose Estimation network, I am using I3D, again with some modifications to get the desired output size. I removed the final logit layers as well as the average pooling at the end, and I modified the final max pooling layer to resize only the depth, not the height and width. The temporal dimension of the output can then be removed, since it is just 1, and the result can be concatenated with the output of the VGG network to form the input to the generator. "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset"
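I3D is not in torchvision, so the encoder itself is not reproduced here; below is only a hedged sketch of the concatenation step just described, assuming channels-first PyTorch layout for the slide's stated shapes:

import torch

# Assumed channels-first shapes: the slides give 7x7x1x1024 for the
# modified I3D and 7x7x512 for the truncated VGG16.
pose_feat = torch.randn(1, 1024, 1, 7, 7)   # stand-in for I3D output
app_feat = torch.randn(1, 512, 7, 7)        # stand-in for VGG16 output

pose_feat = pose_feat.squeeze(2)            # drop the size-1 temporal dim
fused = torch.cat([app_feat, pose_feat], dim=1)  # -> (1, 1536, 7, 7)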

Video Generator: self-designed. Input: 7x7x1536; Output: 112x112x8x3. Layers: Conv2D → Upsampling → add temporal dimension → [Conv3D x2 → Upsampling] x3 → Conv3D. This is the third and final network component, and it is self-designed. I start with a 2D convolutional layer and an upsampling layer, then add the temporal dimension back in and proceed with pairs of 3D convolutions followed by upsampling layers to get back up to the desired output dimensions, which match the original input.
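A hedged sketch of this generator: the layer pattern and the 7x7x1536 → 112x112x8x3 shapes come from the slide, while the intermediate channel widths, kernel sizes, and ReLU activations are assumptions:

import torch
import torch.nn as nn

def block3d(cin, cout):
    """Conv3D x2 followed by 2x trilinear upsampling (time, H, W)."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(cout, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
    )

class VideoGenerator(nn.Module):
    def __init__(self, in_ch=1536):
        super().__init__()
        self.conv2d = nn.Conv2d(in_ch, 512, kernel_size=3, padding=1)
        self.up2d = nn.Upsample(scale_factor=2, mode='bilinear',
                                align_corners=False)   # 7x7 -> 14x14
        self.block1 = block3d(512, 256)   # 1x14x14 -> 2x28x28
        self.block2 = block3d(256, 128)   # 2x28x28 -> 4x56x56
        self.block3 = block3d(128, 64)    # 4x56x56 -> 8x112x112
        self.out = nn.Conv3d(64, 3, kernel_size=3, padding=1)

    def forward(self, fused):             # fused: (B, 1536, 7, 7)
        x = self.up2d(self.conv2d(fused))
        x = x.unsqueeze(2)                # add the temporal dimension back
        x = self.block3(self.block2(self.block1(x)))
        return self.out(x)                # -> (B, 3, 8, 112, 112)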

Loss Function: Mean Squared Error. For the loss function, I am using Mean Squared Error. But because of the network design, there are actually 3 different losses. All 3 are MSE losses, so I use their sum as the overall loss (a sketch follows).
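A minimal sketch of the summed loss; the slides only state that three MSE terms are summed, so which generated/ground-truth video pairs each term compares is left generic here:

import torch.nn.functional as F

def total_loss(recons, targets):
    """Sum of the three MSE reconstruction losses. `recons` and
    `targets` are assumed to be matched lists of three video tensors."""
    assert len(recons) == len(targets) == 3
    return sum(F.mse_loss(r, t) for r, t in zip(recons, targets))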

Next Steps: Write the data loader; train the network. Now that I have the network fully implemented, my next step is to hopefully get access to the data next week and write a data loader (a sketch follows). Then, I can train the network on the specific inputs I am interested in.
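Since the data loader is still to be written, this is only a hedged sketch of what a paired-view dataset might look like; the file layout, clip length, and the `load_clip` stub are all hypothetical:

import torch
from torch.utils.data import Dataset

def load_clip(path):
    # Stub: a real implementation would decode 8 frames from `path` and
    # apply the resize + random-crop transform described earlier.
    return torch.randn(3, 8, 112, 112)

class NTUPairedViews(Dataset):
    """Hypothetical paired-view dataset: each item is an 8-frame clip of
    the same scene from two camera angles plus one conditioning frame
    from the target view."""
    def __init__(self, samples):
        self.samples = samples      # e.g., list of (path_v1, path_v2) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path_v1, path_v2 = self.samples[idx]
        clip_v1 = load_clip(path_v1)
        clip_v2 = load_clip(path_v2)
        cond_frame = clip_v2[:, 0]  # one frame of the desired view
        return clip_v1, clip_v2, cond_frame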

Next Steps: Make improvements to the networks, to the loss function, and to the data input strategy. After that, I can start making changes to hopefully improve the model: I can change the networks, the loss function, and the data input strategies to see how each impacts performance.