Self-Supervised Cross-View Action Synthesis


Self-Supervised Cross-View Action Synthesis
Kara Schatz
Advisor: Dr. Yogesh Rawat
UCF CRCV – REU, Summer 2019

Project Goal: Synthesize a video from an unseen view. The goal of this project is to be able to synthesize a video from an unseen view.

Project Goal: Synthesize a video from an unseen view. Given: a video of the same scene from a different viewpoint, and appearance conditioning from the desired viewpoint. To achieve this, our approach will use a video of the same scene from a different viewpoint as well as appearance conditioning from the desired viewpoint.

Approach. This diagram shows the approach we are using to accomplish our goal. The overall idea is to use one network to learn the appearance of the desired view and another network to learn a representation of the 3D pose from a different view of the video. We then feed both into a video generator that reconstructs the video from the desired view. During training, we run the network on two different views and reconstruct both viewpoints. Once trained, we only need to provide one view of the video and one frame of the desired view.

Dataset: NTU RGB+D. 56K+ videos; 3 camera angles: -45°, 0°, +45°; resize and randomly crop to 112x112. The dataset we will use is NTU RGB+D, which contains over 56,000 videos taken from 3 different camera angles, giving us the different viewpoints. The videos are resized and randomly cropped to 112x112 before being passed to the network; a preprocessing sketch follows.
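A minimal per-frame preprocessing sketch using torchvision; the intermediate resize size before the crop is an assumption, since the slides only specify the final 112x112 crop:

import torchvision.transforms as T

# Resize each frame, then take a random 112x112 crop, as described above.
# The intermediate size of 128 is an assumption; the slides only give
# the final crop size.
preprocess = T.Compose([
    T.Resize(128),
    T.RandomCrop(112),
    T.ToTensor(),   # PIL HxWxC [0,255] -> tensor CxHxW [0,1]
])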

Training inputs: There are many different ways to utilize the dataset for training; here are the setups I plan to use (a view-pair sampling sketch follows this list):
- The 0° view paired with either the -45° or +45° view. Theoretically, I expect this to be the easiest setup, since it provides the minimum change in angle.
- Only the -45° and +45° views, without the 0° view. This should be more difficult, since the change in angle is doubled. Performing experiments both ways will hopefully give insight into the impact of angle change on the success of the model.
- Any two randomly chosen views. This may be less successful than the first case, but it should hopefully improve on the second, more difficult case.
- Eventually, I would like to set up the network so that you can input the desired viewpoint, so I want to see whether the network is successful when trained on 2 views and asked to generate a third view it has not learned at all.
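A small sketch of the four strategies above; the function name and strategy labels are my own, and only the three camera angles come from the dataset description:

import random

VIEWS = [-45, 0, +45]  # the three NTU camera angles

def sample_view_pair(strategy):
    """Return a (view A, view B) pair for one training setup.
    Names and labels here are illustrative, not from the slides."""
    if strategy == "center_plus_side":   # easiest: 45-degree change
        return 0, random.choice([-45, +45])
    if strategy == "sides_only":         # harder: 90-degree change
        return -45, +45
    if strategy == "random_pair":        # any two distinct views
        a, b = random.sample(VIEWS, 2)
        return a, b
    if strategy == "hold_out_third":     # train on 2 views, test on the 3rd
        return -45, +45                  # e.g., 0-degree view held out
    raise ValueError(f"unknown strategy: {strategy}")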

Network Design This week, I implemented the entire network in PyTorch. It consists of 3 different networks pipelined together.

Appearance Encoder: VGG16. Input: 112x112x3; Output: 7x7x512; pretrained on ImageNet. The first component is the Appearance Encoder, for which I am using the VGG16 network. Since we are not doing classification at all, I removed the fully connected and softmax layers. The reference diagram shows a 224x224 input, but we have a 112x112 input; since we still wanted a 7x7 output feature map, I also removed the final max pooling layer. https://neurohive.io/en/popular-networks/vgg16/
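A hedged PyTorch sketch of this modification: take torchvision's ImageNet-pretrained VGG16, drop the classifier entirely, and drop the final max pool so a 112x112 input yields a 7x7x512 map:

import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(pretrained=True)  # ImageNet weights

# Keep only the convolutional feature extractor (no FC / softmax) and
# drop its final MaxPool2d: with 4 pools instead of 5, a 112x112 input
# is downsampled 16x to a 7x7 map, matching the usual 224 -> 7 case.
appearance_encoder = nn.Sequential(*list(vgg.features.children())[:-1])

frame = torch.randn(1, 3, 112, 112)     # one conditioning frame
app_feat = appearance_encoder(frame)    # -> (1, 512, 7, 7)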

3D Pose Estimation: I3D. Input: 112x112x8x3; Output: 7x7x1x1024; pretrained on Charades. For the 3D Pose Estimation network, I am using I3D, again with some modifications to get the desired output size. I removed the final logit layers as well as the average pooling at the end, and I modified the final max pooling layer to resize only the depth, not the height and width. The temporal dimension of the output can then be removed, since it is just 1, and the result can be concatenated with the output of the VGG network to form the input to the generator. "Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset"
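I3D is not in torchvision, so the encoder itself is not reproduced here; below is only a hedged sketch of the concatenation step just described, assuming channels-first PyTorch layout for the slide's stated shapes:

import torch

# Assumed channels-first shapes: the slides give 7x7x1x1024 for the
# modified I3D and 7x7x512 for the truncated VGG16.
pose_feat = torch.randn(1, 1024, 1, 7, 7)   # stand-in for I3D output
app_feat = torch.randn(1, 512, 7, 7)        # stand-in for VGG16 output

pose_feat = pose_feat.squeeze(2)            # drop the size-1 temporal dim
fused = torch.cat([app_feat, pose_feat], dim=1)  # -> (1, 1536, 7, 7)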

Video Generator: self-designed. Input: 7x7x1536; Output: 112x112x8x3. Layers: Conv2D → Upsampling → add temporal dimension → [Conv3D x2 → Upsampling] x3 → Conv3D. This is the third and final network component, and it is self-designed. I start with a 2D convolutional layer and an upsampling layer, then add the temporal dimension back in and proceed with pairs of 3D convolutions followed by upsampling layers to get back up to the desired output dimensions, which match the original input.
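A hedged sketch of this generator: the layer pattern and the 7x7x1536 → 112x112x8x3 shapes come from the slide, while the intermediate channel widths, kernel sizes, and ReLU activations are assumptions:

import torch
import torch.nn as nn

def block3d(cin, cout):
    """Conv3D x2 followed by 2x trilinear upsampling (time, H, W)."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv3d(cout, cout, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
    )

class VideoGenerator(nn.Module):
    def __init__(self, in_ch=1536):
        super().__init__()
        self.conv2d = nn.Conv2d(in_ch, 512, kernel_size=3, padding=1)
        self.up2d = nn.Upsample(scale_factor=2, mode='bilinear',
                                align_corners=False)   # 7x7 -> 14x14
        self.block1 = block3d(512, 256)   # 1x14x14 -> 2x28x28
        self.block2 = block3d(256, 128)   # 2x28x28 -> 4x56x56
        self.block3 = block3d(128, 64)    # 4x56x56 -> 8x112x112
        self.out = nn.Conv3d(64, 3, kernel_size=3, padding=1)

    def forward(self, fused):             # fused: (B, 1536, 7, 7)
        x = self.up2d(self.conv2d(fused))
        x = x.unsqueeze(2)                # add the temporal dimension back
        x = self.block3(self.block2(self.block1(x)))
        return self.out(x)                # -> (B, 3, 8, 112, 112)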

Loss Function: Mean Squared Error. For the loss function, I am using Mean Squared Error. But because of the network design, there are actually 3 different losses. All 3 are MSE losses, so I use their sum as the overall loss (a sketch follows).
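A minimal sketch of the summed loss; the slides only state that three MSE terms are summed, so which generated/ground-truth video pairs each term compares is left generic here:

import torch.nn.functional as F

def total_loss(recons, targets):
    """Sum of the three MSE reconstruction losses. `recons` and
    `targets` are assumed to be matched lists of three video tensors."""
    assert len(recons) == len(targets) == 3
    return sum(F.mse_loss(r, t) for r, t in zip(recons, targets))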

Next Steps: Write the data loader; train the network. Now that I have the network fully implemented, my next step is to hopefully get access to the data next week and write a data loader (a sketch follows). Then, I can train the network on the specific inputs I am interested in.
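Since the data loader is still to be written, this is only a hedged sketch of what a paired-view dataset might look like; the file layout, clip length, and the `load_clip` stub are all hypothetical:

import torch
from torch.utils.data import Dataset

def load_clip(path):
    # Stub: a real implementation would decode 8 frames from `path` and
    # apply the resize + random-crop transform described earlier.
    return torch.randn(3, 8, 112, 112)

class NTUPairedViews(Dataset):
    """Hypothetical paired-view dataset: each item is an 8-frame clip of
    the same scene from two camera angles plus one conditioning frame
    from the target view."""
    def __init__(self, samples):
        self.samples = samples      # e.g., list of (path_v1, path_v2) pairs

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path_v1, path_v2 = self.samples[idx]
        clip_v1 = load_clip(path_v1)
        clip_v2 = load_clip(path_v2)
        cond_frame = clip_v2[:, 0]  # one frame of the desired view
        return clip_v1, clip_v2, cond_frame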

Next Steps: Make improvements to the networks, to the loss function, and to the data input strategy. After that, I can start making changes to hopefully improve the model: I can change the networks, the loss function, and the data input strategies to see how each impacts performance.