Siamese Instance Search for Tracking


1 Siamese Instance Search for Tracking
University of Amsterdam, the Netherlands. Ran Tao, Efstratios Gavves, Arnold W.M. Smeulders. QUVA Lab, Science Park 904, 1098 XH Amsterdam. CVPR 2016

2 Outline Introduction Siamese network architecture
Siamese Instance Search Tracker Experiments Conclusion

3 Outline Introduction Siamese network architecture
Siamese Instance Search Tracker Experiments Conclusion

4 Introduction At the core of many tracking algorithms is the function by which the image of the target is matched to the incoming frames We learn the matching mechanism from external videos that contain various disturbing factors the invariances without modeling these invariances Once the matching function has been learnt on the external data we do not adapt it anymore do not learn the tracking targets we apply it as is to new tracking videos of previously unseen target objects

5 Introduction Without explicit occlusion detection , model updating , tracker combination , forget mechanisms and other We simply match the initial target in the first frame with the candidates in a new frame

6 Outline Introduction Siamese network architecture
Siamese Instance Search Tracker Experiments Conclusion

7 Siamese network architecture
Siam (now Thailand). Siamese twins. Siamese network [3]. Chang and Eng (1811–1874), the conjoined twins from Siam who performed in the circus and gave the term its name. [3] J. Bromley, J. W. Bentz, L. Bottou, I. Guyon, Y. LeCun, C. Moore, E. Sackinger, and R. Shah. Signature verification using a siamese time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.

8 Siamese network architecture
Cosine distance. Shared weights. Time Delay Neural Networks. Target output: 1 for genuine pairs, −1 for forgery pairs.
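The shared-weight idea on this slide can be sketched in a few lines: the same parameters embed both inputs, and the cosine of the two embeddings scores the pair. This is a toy stand-in (a single linear layer with tanh) for the time-delay network of Bromley et al., not the original architecture:

```python
import numpy as np

def embed(x, W):
    # Shared-weight "network": one linear layer + tanh, standing in for
    # the time-delay network of Bromley et al. (an illustrative sketch).
    return np.tanh(W @ x)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))          # weights shared by both branches
x1 = rng.standard_normal(16)
x2 = x1 + 0.01 * rng.standard_normal(16)  # near-duplicate: a "genuine" pair
x3 = rng.standard_normal(16)              # unrelated: a "forgery" pair

s_genuine = cosine_similarity(embed(x1, W), embed(x2, W))
s_forgery = cosine_similarity(embed(x1, W), embed(x3, W))
```

Because both branches share W, a genuine pair maps to nearby embeddings and scores near 1, while an unrelated pair scores lower.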

9 Siamese network architecture
Loss function: l(d(x_j, x_k)), where d(x_j, x_k) = ||f(x_j) − f(x_k)||_2. The Siamese network applies the same CNN f, with shared weights W, to patch x_j and patch x_k to produce f(x_j) and f(x_k).

10 Outline Introduction Siamese network architecture
Siamese Instance Search Tracker Experiments Conclusion

11 Siamese Instance Search Tracker
Stages: (1) candidate sampling, (2) matching, (3) box refinement.

12 AlexNet vs. VGGNet
Max pooling reduces location precision. Lower layers respond to low-level visual patterns, such as edges and angles; higher layers get activated the most on more complex patterns, such as faces and wheels. Training pairs: randomly pick one frame, then randomly pick another frame. Overlap > ρ+: positive (y = 1); overlap < ρ−: negative (y = 0).

13 Siamese Instance Search Tracker
y_jk ∈ {0, 1}. ε: the minimum distance margin. D: Euclidean distance between the two embeddings.
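The symbols on this slide suggest the standard margin contrastive loss. A minimal numpy sketch, assuming the classic Hadsell-style form (the paper's exact variant may place the square differently):

```python
import numpy as np

def margin_contrastive_loss(f_j, f_k, y, eps=1.0):
    """Margin contrastive loss on one pair of embeddings.

    y = 1: positive pair, penalize the squared Euclidean distance D^2.
    y = 0: negative pair, penalize only if D falls below the margin eps.
    """
    d = np.linalg.norm(f_j - f_k)  # Euclidean distance D
    return 0.5 * y * d**2 + 0.5 * (1 - y) * max(0.0, eps - d) ** 2

f_a = np.array([0.0, 0.0])
f_b = np.array([3.0, 4.0])  # D = 5
```

A positive pair at distance 5 is penalized quadratically; a negative pair at the same distance already exceeds the margin and contributes zero loss.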

14 Siamese Instance Search Tracker
Tracking Inference. Candidate sampling: radius sampling strategy. Box refinement: as in R-CNN, train a regressor for the {x, y, w, h} of the box based on the first frame.
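One inference step can be sketched as: sample candidates around the previous box, score each against the fixed first-frame query with the matching function, and keep the best. Here both the sampler (a small grid rather than the paper's radius sampling) and the "embedding" (mean crop intensity rather than the trained Siamese CNN) are toy stand-ins; the names are illustrative:

```python
import numpy as np

def sample_candidates(box, step=10, n=3):
    # Toy stand-in for radius sampling: shift the previous box on a
    # small grid around its location.
    x, y, w, h = box
    return [(x + dx * step, y + dy * step, w, h)
            for dx in range(-n, n + 1) for dy in range(-n, n + 1)]

def feature(img, box):
    # Toy "embedding": mean intensity of the crop. In SINT this would be
    # the output of the trained Siamese branch.
    x, y, w, h = box
    return img[y:y + h, x:x + w].mean()

def track_step(query_feat, frame, prev_box):
    # Score every candidate against the fixed first-frame query feature
    # and keep the best match; the matching function is never updated.
    cands = sample_candidates(prev_box)
    dists = [abs(feature(frame, b) - query_feat) for b in cands]
    return cands[int(np.argmin(dists))]
```

Note that only the query feature is carried forward; there is no model update between frames, matching the paper's design.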

15 Outline Introduction Siamese network architecture
Siamese Instance Search Tracker Experiments Conclusion

16 Experiments Implementation Details Candidate Sampling
10 radial and 10 angular divisions Search radius : the longer axis of the initial box in the first frame Scales : { , 1, 2 }
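The radial sampling grid above can be sketched as follows. The first value of the scale set is missing from the slide, so the set (0.5, 1, 2) used here is an assumption, not the paper's value:

```python
import numpy as np

def radius_sampling(cx, cy, radius, n_rad=10, n_ang=10,
                    scales=(0.5, 1.0, 2.0)):
    """Candidate (center_x, center_y, scale) triples on a polar grid.

    n_rad radial and n_ang angular divisions, as on the slide; the
    scale set is a guessed placeholder since the slide elides it.
    """
    cands = []
    for i in range(1, n_rad + 1):
        r = radius * i / n_rad
        for j in range(n_ang):
            a = 2 * np.pi * j / n_ang
            for s in scales:
                cands.append((cx + r * np.cos(a), cy + r * np.sin(a), s))
    cands.append((cx, cy, 1.0))  # also keep the un-shifted box
    return cands
```

With 10 radii, 10 angles, and 3 scales this yields 301 candidates per frame including the un-shifted box.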

17 Experiments Implementation Details Network training
Load the pre-trained network parameters (ImageNet) and fine-tune: learning rate and weight decay = 0.001, with the learning rate decreased by a factor of 10 after every 2 epochs. Trained on the ALOV dataset [41], removing the 12 videos that also appear in OTB. Positive pairs: IoU > 0.7. Negative pairs: IoU < 0.5.
Training: 60,000 pairs of frames, 128 pairs of boxes per pair of frames. Validation: 2,000 pairs of frames.
[41] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: an experimental survey. TPAMI, 36(7):1442–1468, 2014.

18 Experiments Dataset
Online tracking benchmark (OTB) [52]: 50 videos.
Evaluation metrics. Success plot: bounding-box IoU (AUC). Precision plot: bounding-box center distance.
[52] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, 2013.
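The two OTB metrics can be sketched as below; the threshold grid and the 20-pixel precision operating point follow the usual OTB convention, and the function names are illustrative:

```python
import numpy as np

def success_auc(ious, thresholds=np.linspace(0, 1, 21)):
    # Success plot: fraction of frames whose IoU exceeds each threshold;
    # the reported AUC is the mean of that curve over the thresholds.
    ious = np.asarray(ious)
    curve = [(ious > t).mean() for t in thresholds]
    return float(np.mean(curve))

def precision_at(centers_pred, centers_gt, thresh=20.0):
    # Precision plot: fraction of frames whose predicted box center lies
    # within `thresh` pixels of the ground-truth center.
    d = np.linalg.norm(np.asarray(centers_pred) - np.asarray(centers_gt),
                       axis=1)
    return float((d <= thresh).mean())
```

Success AUC rewards consistent overlap across all thresholds, while precision only looks at center localization.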

19 Experiments Design evaluation
Network pre-tuned on ImageNet vs. network tuned generically on external video data vs. network fine-tuned target-specifically on the first frame. To max pool or not to max pool? Multi-layer features vs. single-layer features. Large net vs. small net.

20 Experiments
Target-specific fine-tuning on the first frame yields only a marginal improvement; networks fine-tuned on a large amount of external data are to be preferred.

21 Experiments To max pool or not to max pool?

22 Experiments Multi-layer features vs. single-layer features
SINT uses the outputs of layers conv4, conv5 and fc6 as features.

23 Experiments Large net vs. small net

24 Experiments State-of-the-art comparison
29 trackers included in the benchmark [52] e.g. TLD [23], Struck [14] and SCM [57] recent trackers TGPR [9], MEEM [56], SO-DLT [47], KCFDP [19] and MUSTer [18] [18] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, and D. Tao. Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking. In CVPR, 2015

25 Experiments SINT+: uses the adaptive candidate sampling strategy suggested by [48] and optical flow [4]. The sampling range adapts to the image resolution: 30/512 × w, where w is the image width. Motion-inconsistent candidates are filtered out: using the estimated optical flow, remove the candidate boxes that contain less than 25% of the pixels moved by the flow. SINT itself focuses on the matching function for tracking. [48] N. Wang, J. Shi, D.-Y. Yeung, and J. Jia. Understanding and diagnosing visual tracking systems. In ICCV, 2015. [4] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. TPAMI, 33(3):500–513, 2011.
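The two SINT+ modifications above are simple to express. The radius formula is from the slide; the flow filter is a sketch in which `flow_mask` is assumed to mark pixels moved there by the estimated flow:

```python
import numpy as np

def adaptive_radius(image_width, base=30.0, ref_width=512.0):
    # Sampling range scaled with image resolution: 30/512 * w, as on
    # the slide.
    return base / ref_width * image_width

def flow_consistent(box, flow_mask, min_frac=0.25):
    # Keep a candidate only if at least `min_frac` of its pixels are
    # covered by flow-moved pixels (a sketch; the paper uses the
    # large-displacement optical flow of [4]).
    x, y, w, h = box
    return flow_mask[y:y + h, x:x + w].mean() >= min_frac
```

So a 1024-pixel-wide frame gets twice the sampling radius of the 512-pixel reference, and candidates that the flow says the target cannot have reached are discarded.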

26 Experiments

27 Experiments Temporal and spatial robustness

28 Experiments Per-distortion-type comparison
Occlusion and deformation; motion blur, fast motion, in-plane rotation, out of view, and low resolution.

29 Experiments Failure modes of SINT
A similar, confusing object (another player in the same uniform); occlusion (the target occluded by the lighter).

30 Experiments Additional sequences and re-identification
6 newly collected sequences from YouTube

31 Experiments
Sequences: Fishing, Rally, BirdAttack, Soccer, GD, Dancing.
Challenges: scale change, fast motion, out-of-plane rotation, non-rigid deformation, low contrast, illumination variation, poorly textured targets.

32 Experiments

33 Experiments We furthermore observe that, provided a window sampling over the whole image using [58], SINT is accurate in target re-identification after the target has been missing from the video for a significant amount of time (Yoda). [58] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
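Re-identification reduces to scoring every whole-image proposal (e.g. from Edge Boxes) against the fixed first-frame query and keeping the closest one. The absence threshold `max_dist` here is illustrative, not a value from the paper:

```python
import numpy as np

def reidentify(query_feat, proposal_feats, max_dist=1.0):
    """Return the index of the proposal whose embedding is closest to
    the first-frame query, or None if even the best candidate is
    farther than `max_dist` (target judged absent)."""
    d = np.linalg.norm(np.asarray(proposal_feats) - np.asarray(query_feat),
                       axis=1)
    best = int(np.argmin(d))
    return best if d[best] <= max_dist else None
```

Because the query never changes, the same call works whether the target vanished one frame ago or a whole shot ago.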

34 1500-frame, 12-shot Star Wars video
The target appears in 6 of the shots. Red: present; black: absent.

35 Outline Introduction Siamese network architecture
Siamese Instance Search Tracker Experiments Conclusion

36 Conclusion Learn a matching function that matches the initial target in the first frame with candidates in a new frame. The matching function is learned on ALOV with the proposed Siamese network. There is no overlap between the training videos and any of the videos used for evaluation. We do not aim to do any pre-learning of the tracking targets.

37 Conclusion Reaches state-of-the-art performance on OTB without target updating, tracker combination, occlusion detection, and the like. Further, SINT allows for target re-identification after the target was absent for a complete shot.

38 Thanks for listening!

