Self-Supervised Cross-View Action Synthesis
Kara Schatz
Advisor: Dr. Yogesh Rawat
UCF CRCV – REU, Summer 2019
Project Goal
Synthesize a video from an unseen view.
Given:
  a video of the same scene from a different viewpoint
  a single image from the desired viewpoint
Motivation
Humans can do this easily. Can machines too?
Cross-view image synthesis has been done.
Cross-view video synthesis has not.
Datasets
NTU: 13K+ training videos, 5K+ testing videos, 3 camera angles (-45°, 0°, +45°)
Panoptic: ~4000 training samples, ~500 testing samples, 100 cameras
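Because the task pairs a video from one viewpoint with a target from another, the camera views listed above can be enumerated as source/target pairs. The following is a minimal sketch, assuming every ordered pair of distinct views is usable; the slides do not say how pairs were actually chosen, and `view_pairs` is a hypothetical helper.

```python
from itertools import permutations

# Camera angles in degrees, as listed on the slide for NTU.
NTU_ANGLES = [-45, 0, 45]

def view_pairs(views):
    """Return all ordered (source, target) pairs of distinct views."""
    return list(permutations(views, 2))

if __name__ == "__main__":
    for source, target in view_pairs(NTU_ANGLES):
        print(f"source view {source:+d} deg, target view {target:+d} deg")
```

Under this assumption, three views give six ordered pairs, and the pair count grows quadratically with the number of cameras.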
Approach
[Diagram labels: Key Point Extraction, Key-points, Transformation, viewpoint, Estimated Key-points, Consistency losses]
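The key-point extraction and consistency-loss stages named on this slide could be sketched roughly as follows. This is only an illustration under assumptions the slides do not state: a key-point is taken as the soft-argmax of a heatmap of raw scores (in the spirit of Jakab et al., 2018), and the consistency loss is the mean squared distance between the key-points estimated for a view and the key-points extracted from that view directly. The names `soft_argmax` and `consistency_loss` are hypothetical.

```python
import math

def soft_argmax(heatmap):
    """Return the expected (x, y) location of a 2D heatmap of raw scores.

    The scores are converted to a probability map with a softmax, and the
    key-point is the probability-weighted mean of the pixel coordinates.
    """
    peak = max(max(row) for row in heatmap)  # subtracted for numerical stability
    weights = [[math.exp(value - peak) for value in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * col for row in weights for col, w in enumerate(row)) / total
    y = sum(w * r for r, row in enumerate(weights) for w in row) / total
    return (x, y)

def consistency_loss(estimated, extracted):
    """Mean squared Euclidean distance between two lists of (x, y) key-points."""
    if not estimated or len(estimated) != len(extracted):
        raise ValueError("key-point lists must be non-empty and of equal length")
    squared = [(ex - tx) ** 2 + (ey - ty) ** 2
               for (ex, ey), (tx, ty) in zip(estimated, extracted)]
    return sum(squared) / len(squared)
```

The soft-argmax is used here because it is differentiable, which lets a loss on key-point positions train the extractor; whether the project used this exact formulation is an assumption.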
Total Loss vs. Epochs
Dataset = NTU, Batch size = 20, Frame count = 16, Skip rate = 2
[Plot: total loss per epoch for the old network and the new network]
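The frame count and skip rate listed for this run could correspond to clip sampling like the following. This is a sketch under the assumption that a skip rate of 2 means keeping every second frame; the slides do not define the term, and `sample_frame_indices` and `clip_span` are hypothetical helpers.

```python
def sample_frame_indices(start, frame_count=16, skip_rate=2):
    """Indices of `frame_count` frames, stepping by `skip_rate` from `start`."""
    if frame_count <= 0 or skip_rate <= 0:
        raise ValueError("frame_count and skip_rate must be positive")
    return [start + i * skip_rate for i in range(frame_count)]

def clip_span(frame_count=16, skip_rate=2):
    """Number of source frames covered from the first to the last sampled frame."""
    return (frame_count - 1) * skip_rate + 1

if __name__ == "__main__":
    print(sample_frame_indices(0))
    print(clip_span())
```

With these defaults, a clip starting at frame 0 uses the even-numbered frames 0 through 30, so a source video needs at least 31 frames to supply one clip.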
Output Frames: NTU
[Figure: input, output, and ground-truth frames for Network 1 and Network 2]
Output Frames: Panoptic
[Figure: input, output, and ground-truth frames for Network 1 and Network 2]
Output Frames: NTU
[Figure: ground-truth and output frames for Frame 1, Frame 2, ..., Frame 15, Frame 16]
Output Frames: Panoptic
[Figure: ground-truth and output frame sequences]
Next Step
[Diagram labels: Key Point Extraction, Key-points, Transformation, viewpoint, Estimated Key-points, Consistency losses]
Improve key-point prediction and transformation, in the hope of capturing the actions in the videos.
Reference: Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of object landmarks through conditional image generation, 2018.