Actor-Object Relation in Videos
Volodymyr Bobyr and Aayushjungbahadur Rana
Task
Input: a video containing:
- Actors: adult, child, dog
- Objects: toys, furniture, etc.
- Actions: "holding", "in front", "talking to", etc.
Output: spatial & temporal pixel-perfect localization of actors, objects, and actions
Dataset: VidOR – 10,000 video clips
Approach
Convolutional encoder/decoder network (sketched below):
- Encoder backbone: I3D pretrained on Kinetics
- Decoder: feature pyramid network with dilated convolutions and side connections
4 stages:
- Actor & object spatial segmentation
- Centroid detection
- Action spatial segmentation
- Temporal connection (postprocessing)
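A minimal Keras sketch of the multi-head design. The small 3D-conv encoder and single dilated-conv decoder block are stand-ins for the real I3D backbone and FPN decoder, which the slides do not detail; the head activations (softmax vs. sigmoid) are also assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_frames=8, n_actor_object=80, n_action=52):
    clip = layers.Input((n_frames, 224, 224, 3))
    # Stand-in encoder: two strided 3D convs take 224x224 down to 56x56.
    x = layers.Conv3D(32, 3, strides=(1, 2, 2), padding="same", activation="relu")(clip)
    x = layers.Conv3D(64, 3, strides=(1, 2, 2), padding="same", activation="relu")(x)
    # Stand-in decoder block with dilated convolutions.
    x = layers.Conv3D(64, 3, padding="same", dilation_rate=2, activation="relu")(x)
    # One 1x1x1 head per stage; output shapes match the "Details" slide.
    actor_object = layers.Conv3D(n_actor_object, 1, activation="softmax",
                                 name="actor_object")(x)   # (n_frames, 56, 56, 80)
    centroid = layers.Conv3D(1, 1, activation="sigmoid",
                             name="centroid")(x)           # (n_frames, 56, 56, 1)
    action = layers.Conv3D(n_action, 1, activation="sigmoid",
                           name="action")(x)               # (n_frames, 56, 56, 52)
    return tf.keras.Model(clip, [actor_object, centroid, action])

model = build_model()
model.summary()
```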
Details
Input: (n_frames, 224, 224, 3)
Output:
- Actor/object segmentation: (n_frames, 56, 56, 80)
- Centroid detection: (n_frames, 56, 56, 1)
- Action segmentation: (n_frames, 56, 56, 52)
Class imbalance:
- People: 56% of all objects
- Background: present in every video clip
- Solution: class weights (see the sketch below)
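A minimal sketch of pixel-wise class weighting. The inverse-frequency weighting scheme is an assumption; the slides only say "class weights", not how they were computed.

```python
import numpy as np
import tensorflow as tf

def inverse_frequency_weights(masks, n_classes=80):
    """masks: integer label maps of shape (n_samples, n_frames, 56, 56)."""
    counts = np.bincount(masks.ravel(), minlength=n_classes).astype(np.float64)
    # Rare classes get large weights; frequent ones (background, people) small.
    weights = counts.sum() / (n_classes * np.maximum(counts, 1))
    return tf.constant(weights, dtype=tf.float32)

def weighted_categorical_crossentropy(weights):
    def loss(y_true, y_pred):  # y_true one-hot: (..., 56, 56, n_classes)
        ce = -y_true * tf.math.log(tf.clip_by_value(y_pred, 1e-7, 1.0))
        # Per-class weights broadcast over the last (class) axis.
        return tf.reduce_sum(ce * weights, axis=-1)
    return loss
```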
Metrics
IoU: mean Intersection over Union over the pixels in each frame (see the sketch below)
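A minimal sketch of per-frame mean IoU, assuming integer label maps and that classes absent from both prediction and ground truth are skipped; the slides do not specify the exact averaging rules.

```python
import numpy as np

def mean_iou(pred, target, n_classes=80):
    """pred, target: integer label maps of shape (H, W) for one frame."""
    ious = []
    for c in range(n_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```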
Data Preparation & Output Example
[Figure: original vs. augmented image, original vs. augmented centroids, and an experimental segmentation output]
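A minimal sketch of a paired augmentation, assuming a random horizontal flip; the actual augmentations used are not listed on the slide. The point the figure makes is that the centroid map must be transformed together with the image.

```python
import numpy as np

def flip_pair(image, centroids, rng=np.random.default_rng()):
    """image: (224, 224, 3); centroids: (56, 56, 1) heatmap for one frame."""
    if rng.random() < 0.5:  # apply the same geometric transform to both
        image = image[:, ::-1]
        centroids = centroids[:, ::-1]
    return image, centroids
```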
Experimental Results
Loss progression:
- Initially: binary cross-entropy
- Later: categorical cross-entropy
- Now: categorical cross-entropy + augmentation tweaks (see the sketch below)
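A minimal sketch of the loss switch using the Keras built-ins, on a single toy pixel. Binary cross-entropy scores each channel as an independent yes/no decision, while categorical cross-entropy scores the full distribution over mutually exclusive classes, which fits the actor/object segmentation labels.

```python
import tensorflow as tf

y_true = tf.one_hot([2], depth=4)             # one pixel, class 2 of 4
y_pred = tf.constant([[0.1, 0.1, 0.7, 0.1]])  # predicted class probabilities

# Per-channel independent binary decisions.
bce = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred)

# One distribution over mutually exclusive classes.
cce = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)

print(float(bce), float(cce))
```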