1
Actor-Object Relation in Videos
Volodymyr Bobyr and Aayushjungbahadur Rana
2
Task
Input: A video with:
Actors: Adult, Child, Dog
Objects: toys, furniture, etc.
Actions: “holding”, “in front”, “talking to”, etc.
Output: Spatial & temporal pixel-perfect localization of actors, objects, and actions
Dataset: VidOR – 10,000 video clips
3
Approach
Convolutional encoder/decoder network:
Encoder backbone: I3D pretrained on Kinetics
Decoder: Feature pyramid network with dilated convolutions and side connections
4 Stages:
Actor & object spatial segmentation
Centroid detection
Action spatial segmentation
Temporal connection – postprocessing
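The slides do not say how the temporal connection stage works. As one possible reading, the sketch below links per-frame centroids into tracks by greedy nearest-neighbour matching. The function name `link_centroids`, the `max_dist` threshold, and the greedy strategy are all assumptions made for illustration, not the authors' method.

```python
import numpy as np

def link_centroids(frames, max_dist=5.0):
    """Greedily link per-frame centroids into tracks (illustrative assumption).

    frames: list with one entry per frame, each a (k, 2) array of (row, col).
    Returns a list of tracks; each track is a list of (frame_idx, row, col).
    """
    tracks = []   # every track seen so far
    active = []   # indices of tracks that were extended in the previous frame
    for t, pts in enumerate(frames):
        pts = np.asarray(pts, dtype=float).reshape(-1, 2)
        unused = set(range(len(pts)))
        next_active = []
        for ti in active:
            if not unused:
                break
            _, r, c = tracks[ti][-1]
            cand = sorted(unused)
            dist = np.linalg.norm(pts[cand] - np.array([r, c]), axis=1)
            j = int(np.argmin(dist))
            if dist[j] <= max_dist:
                k = cand[j]
                tracks[ti].append((t, float(pts[k, 0]), float(pts[k, 1])))
                unused.remove(k)
                next_active.append(ti)
        # Any centroid that matched no existing track starts a new one.
        for k in sorted(unused):
            tracks.append([(t, float(pts[k, 0]), float(pts[k, 1]))])
            next_active.append(len(tracks) - 1)
        active = next_active
    return tracks
```

A track ends as soon as no centroid lies within `max_dist` of its last position; a more robust variant might tolerate short gaps or use optimal assignment instead of greedy matching.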
4
Details
Input: (n_frames, 224, 224, 3)
Output:
Actor/Object Segmentation: (n_frames, 56, 56, 80)
Centroid Detection: (n_frames, 56, 56, 1)
Action Segmentation: (n_frames, 56, 56, 52)
Class Imbalance:
People: 56% of all objects
Background: in every video clip
Solution: class weights
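The slide names class weights as the remedy for imbalance but does not give the weighting scheme. A minimal sketch, assuming inverse-frequency weights and a per-pixel weighted categorical cross-entropy over an output shaped like (n_frames, 56, 56, n_classes); both function names are hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes, eps=1e-6):
    """Assumed scheme: weight each class by 1 / pixel frequency.

    labels: integer array of per-pixel class ids (any shape).
    Returns an (n_classes,) array; classes that never occur get weight 0.
    """
    counts = np.bincount(labels.ravel(), minlength=n_classes).astype(float)
    freq = counts / counts.sum()
    weights = 1.0 / (freq + eps)
    present = counts > 0
    weights[~present] = 0.0
    # Rescale so the classes that do occur have an average weight of 1.
    weights[present] /= weights[present].mean()
    return weights

def weighted_categorical_cross_entropy(probs, onehot, weights, eps=1e-7):
    """probs, onehot: (n_frames, H, W, C); weights: (C,). Returns the mean loss."""
    probs = np.clip(probs, eps, 1.0)
    per_pixel = -(onehot * np.log(probs) * weights).sum(axis=-1)
    return float(per_pixel.mean())
```

With weights like these, a frequent class such as the background contributes less per pixel than a rare object class, so the rare classes are not drowned out during training.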
5
IoU Metrics
Mean Intersection over Union among pixels in each frame
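For a single class, IoU is the number of pixels where prediction and ground truth both show the class, divided by the number of pixels where either does. The sketch below averages that over the classes present in a frame, then over frames. Skipping classes absent from both maps is an assumption; the slides do not say how such classes are handled.

```python
import numpy as np

def frame_mean_iou(pred, target, n_classes):
    """pred, target: (H, W) integer class maps for one frame."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps: skipped (assumption)
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

def video_mean_iou(pred, target, n_classes):
    """pred, target: (n_frames, H, W) integer class maps."""
    per_frame = [frame_mean_iou(p, t, n_classes) for p, t in zip(pred, target)]
    return float(np.nanmean(per_frame))
```

The class maps would typically come from an argmax over the channel axis of the segmentation output.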
6
Data Preparation & Output Example
[Figure panels: original image, augmented image, experimental segmentation output, original centroids, augmented centroids]
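The paired image and centroid panels suggest that geometric augmentations are applied to a frame and its centroid targets together. The slides do not list the augmentations used, so the sketch below takes a horizontal flip as an assumed example; `hflip_with_centroids` is a hypothetical helper.

```python
import numpy as np

def hflip_with_centroids(image, centroids):
    """Flip a frame left-right and move its centroids to match.

    image: (H, W, C) array; centroids: (k, 2) array of (row, col) pixel indices.
    """
    width = image.shape[1]
    flipped_image = image[:, ::-1]
    flipped_centroids = np.array(centroids).reshape(-1, 2)
    # Column x maps to (width - 1 - x); rows are unchanged by a horizontal flip.
    flipped_centroids[:, 1] = width - 1 - flipped_centroids[:, 1]
    return flipped_image, flipped_centroids
```

If the image and its targets are transformed separately, the centroid heatmap no longer lines up with the frame, which is why the two are handled in one function here.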
7
Experimental Results
In the past:
Loss: Binary Cross-Entropy
8
Experimental Results
Before:
Loss: Categorical Cross-Entropy
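The results slides name the two losses without giving their formulation. As a rough illustration, binary cross-entropy scores every class channel independently through a sigmoid, while categorical cross-entropy makes the channels compete through a softmax. The sketch below assumes raw logits and one-hot targets with the class axis last; the function names are illustrative.

```python
import numpy as np

def binary_cross_entropy(logits, onehot, eps=1e-7):
    """Independent sigmoid per channel, averaged over all entries."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    probs = np.clip(probs, eps, 1.0 - eps)
    loss = -(onehot * np.log(probs) + (1.0 - onehot) * np.log(1.0 - probs))
    return float(loss.mean())

def categorical_cross_entropy(logits, onehot, eps=1e-7):
    """Softmax over the last axis, averaged over pixels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    loss = -(onehot * np.log(np.clip(probs, eps, 1.0))).sum(axis=-1)
    return float(loss.mean())
```

With many channels and one-hot targets, the binary form is dominated by the easy "not this class" terms, which may be one reason to prefer the categorical form for single-label pixel segmentation.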
9
Experimental Results
Now:
Categorical Cross-Entropy + Augmentation Tweaks