Self-Supervised Cross-View Action Synthesis Kara Schatz Advisor: Dr. Yogesh Rawat UCF CRCV – REU, Summer 2019
Project Goal Synthesize a video from an unseen view. The goal of this project is to be able to synthesize a video from an unseen view.
Given: a video of the same scene from a different viewpoint, plus appearance conditioning from the desired viewpoint. In order to achieve this, our approach will use a video of the same scene from a different viewpoint as well as appearance conditioning from the desired viewpoint.
Approach This diagram shows the approach that we are using to accomplish our goal. The overall idea is to use one network to learn the appearance of the desired view and another network to learn a representation of the 3D pose from a different view of the video. Then, we will take both of those and input them into a video generator that will reconstruct the video from the desired view. To do the training, we will run the network on two different views and reconstruct both viewpoints. Once trained, we will only need to give one view of the video and one frame of the desired view. A rough sketch of this pipeline is shown below.
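As a sketch of how this two-view training could be wired up (the module names and the channel-wise concatenation are my placeholders, not the finalized implementation):

```python
import torch

# Placeholder names for the three sub-networks described on the next slides;
# this is a sketch of the training-time pass, not the finalized code.
def forward_pass(appearance_encoder, pose_encoder, generator,
                 clip_a, clip_b, frame_a, frame_b):
    feat_a = appearance_encoder(frame_a)   # appearance of view A (from one frame)
    feat_b = appearance_encoder(frame_b)   # appearance of view B
    pose_a = pose_encoder(clip_a)          # 3D pose representation from view A's clip
    pose_b = pose_encoder(clip_b)          # 3D pose representation from view B's clip
    # Cross the views: reconstruct each view from the other view's pose
    # plus its own appearance (features are concatenated channel-wise).
    recon_a = generator(torch.cat([feat_a, pose_b], dim=1))
    recon_b = generator(torch.cat([feat_b, pose_a], dim=1))
    return recon_a, recon_b
```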
Dataset: NTU RGB+D 56K+ videos 3 camera angles: -45°, 0°, +45° Resize and randomly crop to 112x112 The dataset that we will use is the NTU RGB+D dataset, which contains over 56 thousand videos taken from 3 different camera angles, giving us the different viewpoints. The videos will be resized and randomly cropped to 112x112 before being passed to the network.
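A minimal sketch of that preprocessing using torchvision transforms; the intermediate resize target of 128 before the 112x112 crop is my assumption:

```python
import torchvision.transforms as T

# Resize each frame, then take a random 112x112 crop.
frame_transform = T.Compose([
    T.Resize(128),       # resize target is an assumption
    T.RandomCrop(112),
    T.ToTensor(),
])
```

In practice the same crop parameters would be shared across all frames of a clip so the video stays spatially consistent.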
Training inputs: There are many different ways to utilize the dataset for training, so here are some of the approaches I plan on using.
First, I want to use the 0 degree view and either the -45 or +45 degree view. Theoretically, I expect that this should be the easiest setup since it provides the minimum change in angle.
Then, I want to use only the -45 and +45 degree views without the 0 degree view. This should be more difficult since the change in angle is doubled. Performing experiments in both of these ways will hopefully give insight into the impact of angle change on the success of the model.
Next, I want to try with just any two randomly chosen views. This may be less successful compared to the first case, but it should hopefully improve on the second, more difficult case.
Eventually, I would like to set up the network such that you could input the desired viewpoint, so I want to see if the network is successful when trained on 2 views and trying to generate a third view that it has not learned at all. A small sketch of how I might sample view pairs for these setups follows.
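The strategy names below are just my labels for the three cases above:

```python
import random

VIEWS = [-45, 0, 45]

def sample_view_pair(strategy):
    """Return (view_a, view_b) in degrees for one training sample."""
    if strategy == "zero_plus_side":   # 0° paired with either -45° or +45°
        return random.sample([0, random.choice([-45, 45])], 2)
    if strategy == "sides_only":       # -45° and +45° only
        return random.sample([-45, 45], 2)
    if strategy == "any_two":          # any two randomly chosen views
        return random.sample(VIEWS, 2)
    raise ValueError(f"unknown strategy: {strategy}")
```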
Network Design This week, I implemented the entire network in PyTorch. It consists of 3 different networks pipelined together.
Appearance Encoder VGG16 Input: 112x112x3 Output: 7x7x512 Pretrained on ImageNet The first is the Appearance Encoder. I am using the VGG16 network for this part of the task. Since we are not doing classification at all, I removed the fully connected and softmax layers. The standard VGG16 takes a 224x224 input, but we have a 112x112 input. We still wanted to get a 7x7 output for the feature map, so I also removed the final max pooling layer. https://neurohive.io/en/popular-networks/vgg16/
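A sketch of that truncation using torchvision's VGG16, dropping the classifier and the final max pool so a 112x112 input still yields a 7x7x512 feature map:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Keep only the convolutional features and drop the last max pool,
# so four (not five) poolings take 112x112 down to 7x7.
vgg = models.vgg16(pretrained=True)
appearance_encoder = nn.Sequential(*list(vgg.features.children())[:-1])

frame = torch.randn(1, 3, 112, 112)
print(appearance_encoder(frame).shape)  # torch.Size([1, 512, 7, 7])
```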
3D Pose Estimation I3D Input: 112x112x8x3 Output: 7x7x1x1024 Pretrained on Charades For the 3D Pose Estimation network, I am using I3D, again with some modifications to get the desired output size. I removed the final logit layers as well as the average pooling at the end, and I modified the final max pooling layer to only resize the depth, not the height and width. Now, we can remove the temporal dimension of the output since it's just 1, and it can be concatenated with the output of the VGG network to serve as input to the Generator ("Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset").
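At the tensor level, dropping the singleton temporal dimension and concatenating with the appearance features looks roughly like this (the I3D output is a placeholder tensor here, since the modified I3D itself isn't shown):

```python
import torch

appearance = torch.randn(1, 512, 7, 7)         # VGG16 feature map (7x7x512)
pose = torch.randn(1, 1024, 1, 7, 7)           # modified I3D output (7x7x1x1024)
pose = pose.squeeze(2)                         # drop the temporal dim -> (1, 1024, 7, 7)
generator_input = torch.cat([appearance, pose], dim=1)  # (1, 1536, 7, 7)
```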
Video Generator Self-Designed Generator Input: 7x7x1536 Output: 112x112x8x3 Layer sequence: Conv2D, Upsampling, add temporal dimension, Conv3D x2, Upsampling, Conv3D x2, Upsampling, Conv3D x2, Upsampling, Conv3D. This is the 3rd and final network component. This network was self-designed. I start with a 2D convolutional layer, then I add the temporal dimension back in and proceed with 3D convolution and upsampling layers to get back up to the desired output dimensions, which match the original input.
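Here is a rough sketch following that layer sequence; the channel widths, upsampling factors, and activations are my assumptions rather than the exact design:

```python
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    """Sketch: decode a (B, 1536, 7, 7) feature map into a (B, 3, 8, 112, 112) clip."""
    def __init__(self, in_channels=1536):
        super().__init__()
        self.conv2d = nn.Conv2d(in_channels, 512, 3, padding=1)  # 7x7 -> 7x7
        self.up2d = nn.Upsample(scale_factor=2)                   # 7x7 -> 14x14

        def block(cin, cout):
            # Two 3D convs followed by an upsampling that doubles T, H, and W.
            return nn.Sequential(
                nn.Conv3d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv3d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode='trilinear', align_corners=False),
            )

        self.block1 = block(512, 256)   # T: 1 -> 2,  14x14 -> 28x28
        self.block2 = block(256, 128)   # T: 2 -> 4,  28x28 -> 56x56
        self.block3 = block(128, 64)    # T: 4 -> 8,  56x56 -> 112x112
        self.final = nn.Conv3d(64, 3, 3, padding=1)  # back to RGB

    def forward(self, x):               # x: (B, 1536, 7, 7)
        x = self.up2d(self.conv2d(x))   # (B, 512, 14, 14)
        x = x.unsqueeze(2)              # add temporal dim: (B, 512, 1, 14, 14)
        x = self.block3(self.block2(self.block1(x)))
        return self.final(x)            # (B, 3, 8, 112, 112)
```

With these factors, three trilinear upsamplings take the temporal dimension from 1 to 8 while the spatial size grows from 14x14 to 112x112, matching the original input size.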
Loss Function For the loss function, I am using Mean Squared Error. But because of the network design, there are actually 3 different losses. All 3 are MSE losses, so I use the sum of them as the overall loss.
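As a hedged sketch of the combined objective (which prediction/target pairs the three terms compare is not detailed here, so the tensors below are placeholders):

```python
import torch
import torch.nn.functional as F

def total_loss(predictions, targets):
    # Sum of the three MSE reconstruction terms.
    return sum(F.mse_loss(p, t) for p, t in zip(predictions, targets))

# Dummy usage with three 8-frame 112x112 clips:
preds = [torch.randn(1, 3, 8, 112, 112) for _ in range(3)]
truth = [torch.randn(1, 3, 8, 112, 112) for _ in range(3)]
loss = total_loss(preds, truth)
```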
Next Steps Write Data Loader Train the network Now that I have the network fully implemented, my next step is to hopefully get access to the data next week and write a data loader. Then, I can train the network on the specific inputs I was interested in.
Make improvements: to the networks, to the loss function, to the data input strategy. After that, I can start making changes to hopefully improve the model. I can make changes to the networks, the loss function I am using, and the data input strategies to see how those impact performance.