Weak-supervision based Multi-Object Tracking
Alex Ruiz, Jyoti Kini, Dr. Mubarak Shah
University of Central Florida

Project Description
Goal: To solve the Multiple Object Tracking (MOT) problem in a sequence of frames by finding dense correspondences between pairs of images in a weakly supervised manner.
Steps:
1. Obtain dense pixel-to-pixel matches between a set of images using the proposed similarity model.
2. Key-point matching: extract a set of relevant key-points (image pixels) per object.
3. Find the tracklets by mapping the pixels to the object detections.
4. Tracklet association: introduce tracklet-history-based cascading to account for occlusions and to reduce identity switches.
The proposed network tracks object detections by extracting pixel-to-pixel correspondences and then generates a tracklet per object by associating labels with the objects.

Architecture
Training:
ResNet101 extracts features for each input image.
A self-attention module retains geometric patterns and long-range dependencies.
A 4D correlation tensor represents the similarity matrix between a pair of image features.
A Neighbourhood Consensus Network using 4D convolutions eliminates inconsistent matches to further refine the affinity between image pairs.
Loss: For a training pair (I_a, I_b), the weakly supervised loss is
$\mathcal{L}(I_a, I_b) = -y\,(\bar{s}_a + \bar{s}_b)$
where $\bar{s}_a$ and $\bar{s}_b$ are the mean matching scores. Positive pairs are labeled y = 1 and negative pairs y = -1.

Dataset
The MOT17 dataset comprises 14 video sequences with crowded scenes, camera motion, varying viewpoints, and challenging weather conditions, with a balanced distribution of crowd density across the training and test sets. The dataset also provides object detections from three existing detectors: DPM, FRCNN and SDP. Each data point in the detection CSV file has the format: frame number, identity number, bounding box coordinates (left, top, width, height), confidence score, class, and visibility.
Detection annotation: [figure]

Quantitative Results
The loss decreases and converges in both the training and testing phases, and the self-attention module improves the performance of the network.

Qualitative Results
[figures]

Conclusion
The proposed model, a self-attention module followed by the Neighbourhood Consensus Network, captures both global long-range dependencies and local context to enable Multiple Object Tracking.

Future Work
We intend to implement a multi-head attention module to further improve the key-point matches.

References
[1] Rocco, I., Cimpoi, M., Arandjelović, R., Torii, A., Pajdla, T., & Sivic, J. (2018). Neighbourhood consensus networks. In Advances in Neural Information Processing Systems.
[2] Milan, A., Leal-Taixé, L., Reid, I., Roth, S., & Schindler, K. (2016). MOT16: A benchmark for multi-object tracking. arXiv preprint.
[3] Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2018). Self-attention generative adversarial networks. arXiv preprint.
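To make the 4D correlation tensor and the weakly supervised loss from the Architecture section concrete, here is a minimal NumPy sketch. It is not the poster's implementation. The function names, the cosine normalization, and the softmax-then-max definition of the mean matching scores are assumptions loosely following the Neighbourhood Consensus Networks formulation in reference [1]. The 4D-convolution filtering step and the self-attention module are omitted.

```python
import numpy as np


def correlation_4d(feat_a, feat_b):
    """Cosine similarity between every pair of feature locations.

    feat_a: (C, Ha, Wa), feat_b: (C, Hb, Wb) -> tensor of shape (Ha, Wa, Hb, Wb).
    """
    fa = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + 1e-8)
    fb = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + 1e-8)
    return np.einsum("cij,ckl->ijkl", fa, fb)


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def mean_matching_scores(corr):
    """Assumed definition: average probability of the best match per location."""
    ha, wa, hb, wb = corr.shape
    flat = corr.reshape(ha * wa, hb * wb)
    # For each location in image A, a distribution over locations in image B.
    s_a = softmax(flat, axis=1).max(axis=1).mean()
    # For each location in image B, a distribution over locations in image A.
    s_b = softmax(flat, axis=0).max(axis=0).mean()
    return float(s_a), float(s_b)


def weak_loss(corr, y):
    """L(I_a, I_b) = -y * (s_a + s_b), with y = +1 (positive) or -1 (negative)."""
    s_a, s_b = mean_matching_scores(corr)
    return -y * (s_a + s_b)
```

Minimizing this loss raises the matching scores for positive pairs (y = 1) and lowers them for negative pairs (y = -1), which is why only image-level pair labels are needed.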
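The detection CSV layout described in the Dataset section could be read along the following lines. This is a sketch using only the standard library. The `Detection` class, its field names, and `parse_detections` are hypothetical, and the field order simply follows the order listed on the poster.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Detection:
    # Field order follows the poster's description of one CSV row.
    frame: int
    identity: int
    left: float
    top: float
    width: float
    height: float
    confidence: float
    cls: int
    visibility: float


def parse_detections(text):
    """Parse comma-separated detection rows into Detection records."""
    detections = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:
            continue  # skip blank lines
        f = [value.strip() for value in fields]
        detections.append(
            Detection(
                frame=int(f[0]),
                identity=int(f[1]),
                left=float(f[2]),
                top=float(f[3]),
                width=float(f[4]),
                height=float(f[5]),
                confidence=float(f[6]),
                cls=int(float(f[7])),
                visibility=float(f[8]),
            )
        )
    return detections
```

For a file on disk, the same function can be applied to the result of `open(path).read()`, where `path` points at whichever detection file is being used.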
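Step 3 of the project description, mapping matched pixels to object detections, can be illustrated with a simple voting scheme. The sketch below is an assumption, not the poster's algorithm: each pixel match that starts inside a box in frame A and ends inside a box in frame B casts one vote, and each box in frame B greedily takes the most-voted unused identity. The tracklet-history cascading from step 4 is not modeled, and all names are hypothetical.

```python
from collections import Counter


def inside(point, box):
    """True if (x, y) lies in a box given as (left, top, width, height)."""
    x, y = point
    left, top, width, height = box
    return left <= x < left + width and top <= y < top + height


def propagate_ids(matches, boxes_a, ids_a, boxes_b, next_id):
    """Carry track identities from frame A to frame B by key-point votes.

    matches: list of ((xa, ya), (xb, yb)) pixel correspondences.
    Returns the identities for boxes_b and the updated next free identity.
    """
    votes = [Counter() for _ in boxes_b]
    for point_a, point_b in matches:
        for index_a, box_a in enumerate(boxes_a):
            if not inside(point_a, box_a):
                continue
            for index_b, box_b in enumerate(boxes_b):
                if inside(point_b, box_b):
                    votes[index_b][ids_a[index_a]] += 1

    ids_b = []
    used = set()
    for counter in votes:
        chosen = None
        for track_id, _count in counter.most_common():
            if track_id not in used:
                chosen = track_id
                break
        if chosen is None:
            # No supporting matches: start a new tracklet.
            chosen = next_id
            next_id += 1
        used.add(chosen)
        ids_b.append(chosen)
    return ids_b, next_id
```

A greedy assignment keeps the example short. A global assignment over the vote counts would be a natural alternative when boxes overlap heavily.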