1 University of Central Florida
Correcting Cuboid Corruption For Action Recognition In Complex Environment
Syed Zain Masood, Adarsh Nagaraja, Nazar Khan, Jiejie Zhu and Marshall Tappen
University of Central Florida

2 Action Sequences
Action sequences can be broadly divided into:
Activity: the person of interest performing the action
Background: context and/or clutter
Simple datasets: the background is uninteresting
Complex datasets: context can be useful

3 Complex Action Sequences
Most action recognition approaches treat the action recognition problem holistically: systems are designed to make intelligent decisions when selecting features, and complexity is added until the goal is achieved.

4 Issues with Holistic Methods
There is little understanding of the decision-making process inside these complex systems.
Most complex datasets have strong contextual cues.
How well would a system perform on actions with unrelated, complex backgrounds?

5 Our Approach
Goal:
Examine action recognition in a way that separates the action from its context.
Is the system able to make an intelligent decision when confronted with adverse context?
Purpose:
Measure how much context matters.
Help improve the handling of background clutter.
Avoid complexity that is unnecessary for recognition performance, and thus gain efficiency.

6 Our Approach
Problem: current datasets contain strong contextual cues.
Solution: create a new dataset in which the activity appears without strong, relevant context.
Basing the new dataset on older ones makes it possible to benchmark against older work.

7 UCF Weizmann Dynamic Dataset
Simple actions from the Weizmann Action Dataset
Complex backgrounds from YouTube
Matte the action onto the complex background [1] (a compositing sketch follows)
Dataset available at:
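The matting step composites each Weizmann actor onto a YouTube frame using its alpha matte. Below is a minimal compositing sketch, assuming an alpha matte (e.g. produced by closed-form matting [1]) is already available; the function and variable names are illustrative, not the authors' code.

```python
import cv2
import numpy as np

def composite_frame(actor_frame, alpha_matte, background_frame):
    """Blend one actor frame onto a complex background via its alpha matte.

    actor_frame:      H x W x 3 uint8 frame from the simple-background video
    alpha_matte:      H x W float matte in [0, 1] (1 = actor, 0 = background)
    background_frame: uint8 frame from the YouTube background clip
    """
    h, w = actor_frame.shape[:2]
    bg = cv2.resize(background_frame, (w, h)).astype(np.float32)
    alpha = alpha_matte[..., None]                       # broadcast over the color channels
    out = alpha * actor_frame.astype(np.float32) + (1.0 - alpha) * bg
    return np.clip(out, 0, 255).astype(np.uint8)
```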

8 UCF Weizmann Dynamic
No humans in the background.
Backgrounds are selected randomly for matting, which ensures the background is unhelpful.
In some cases it may even be detrimental, e.g. different actions sharing the same complex background.

9 Testing Methodology: Baseline Performance
A basic "bag-of-words" system, tuned to perform as well as a number of recently published systems (a minimal sketch of such a pipeline follows the table).

Dataset     Our Baseline    STIPS (HOF) [2]    Liu et al. [3]
Weizmann    98%             92%                91%
KTH         93.5%           93.8%              -
YouTube     65%             -                  71.2%
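To make the baseline concrete, here is a minimal bag-of-words sketch of the kind of pipeline described above: cuboid descriptors are quantized against a k-means vocabulary, each video becomes a visual-word histogram, and an SVM classifies the histograms. The descriptor extraction is assumed to happen elsewhere (the descriptor-set inputs are hypothetical); the actual tuned baseline is not specified on this slide.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_histograms(descriptor_sets, codebook):
    """Quantize each video's cuboid descriptors and return one normalized
    visual-word histogram per video."""
    hists = []
    for descs in descriptor_sets:                        # descs: (num_cuboids, dim)
        words = codebook.predict(descs)
        h, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
        hists.append(h / max(h.sum(), 1))
    return np.vstack(hists)

def train_and_evaluate(train_sets, y_train, test_sets, y_test, vocab_size=1000):
    # Build the visual vocabulary from all training descriptors.
    codebook = KMeans(n_clusters=vocab_size, random_state=0).fit(np.vstack(train_sets))
    X_train = build_histograms(train_sets, codebook)
    X_test = build_histograms(test_sets, codebook)
    clf = SVC(kernel="rbf").fit(X_train, y_train)        # bag-of-words classifier
    return clf.score(X_test, y_test)
```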

10 Baseline Performance
Significant drop in performance: the baseline is completely unable to deal with the clutter.

Dataset                 Our Baseline
Weizmann                98%
UCF Weizmann Dynamic    36.5%

11 Why does performance degrade?
Is it the matting process? No: tests on action sequences matted onto a gray background show 94% recognition.
The change from a simple to a complex background is the only difference between the two datasets, so background cues must be contributing significantly to the recognition process.

12 How to remove the effect of background?
Experiment #1: isolate the actor using the available ground-truth masks and prune background interest points (see the sketch below).
With no background clutter, the results should be comparable to those on the original Weizmann dataset.
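A minimal sketch of the pruning step, assuming per-frame binary actor masks and (x, y, t) interest point locations; the names are illustrative rather than the authors' implementation.

```python
import numpy as np

def prune_background_points(points, masks):
    """Keep only the interest points that fall on the actor.

    points: (N, 3) integer array of (x, y, t) interest-point locations
    masks:  sequence of H x W binary masks, one per frame (nonzero = actor)
    """
    kept = [(x, y, t) for x, y, t in points if masks[t][y, x]]
    return np.array(kept)
```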

13 Experiment #1: Background Pruning
Interestingly, the results improve, but not by as much as they dropped. Simple background pruning alone is not enough.

System                    Recognition Accuracy
Original Baseline System  36.5%
With Pruning              68%

14 Background Pruning Limitations
Even at "good" interest point locations on the actor, out-of-place actions still leave background clutter inside the cuboids. Interest point pruning eliminates spatial, but not temporal, background clutter.

15 How to overcome this limitation?
Removing background information within the cuboids themselves might help.
Experiment #2, Cuboid Masking: zero out the background content inside each cuboid (see the sketch below).
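A sketch of the cuboid-masking idea under the same assumptions as before: the binary actor masks are cut into the same spatio-temporal window as the cuboid, and everything the mask marks as background is zeroed before the descriptor is computed. This is an illustration, not the paper's exact procedure.

```python
import numpy as np

def mask_cuboid(cuboid, mask_block):
    """cuboid:     T x H x W block of pixels around an interest point
    mask_block: T x H x W binary block cut from the actor masks"""
    return cuboid * (mask_block > 0)         # background voxels are set to zero
```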

16 Cuboid Masking Results
System                    Recognition Accuracy
Original Baseline System  36.5%
With Pruning              68%
Cuboid Masking            89%

Comparable results are achieved by combining background pruning of interest points with masking inside the "good" interest point cuboids.

17 The Next Steps
All of the above experiments were conducted using ground-truth annotations. Now that we have identified the problem:
We need to do away with ground-truth actor masks and implement automatic localization of the actor.
We need to test the system on a well-known complex dataset where context might be helpful.

18 Automatic Localization
We combine:
An off-the-shelf human detector [4,5]
A saliency detection method [6]
A rough sketch of the combination appears below.
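The sketch below shows one way to combine the two components into a per-frame actor mask. detect_person and compute_saliency are hypothetical stand-ins for the part-based detector [4,5] and the context-aware saliency method [6]; the paper's actual combination rule may differ.

```python
import numpy as np

def localize_actor(frame, detect_person, compute_saliency, sal_thresh=0.5):
    """Return an approximate binary actor mask for one frame."""
    h, w = frame.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    box = detect_person(frame)                # (x0, y0, x1, y1) or None
    saliency = compute_saliency(frame)        # H x W map scaled to [0, 1]
    if box is not None:
        x0, y0, x1, y1 = box
        # keep salient pixels inside the detected person box
        mask[y0:y1, x0:x1] = saliency[y0:y1, x0:x1] > sal_thresh
    else:
        mask = saliency > sal_thresh          # fall back to saliency alone
    return mask
```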

19 Automatic Localization
System                                                             Accuracy
Original Baseline System                                           36.5%
Automatic Localization + Interest Point Pruning                    41%
Automatic Localization + Interest Point Pruning + Cuboid Masking   48%

Automatic localization does not work as well on the UCF Weizmann Dynamic dataset. Still, the best performance is achieved by combining interest point pruning and cuboid masking on top of the automatic localization.

20 UCF Sports Dataset
Reasons for selecting this dataset:
Small size
Good resolution
Ground-truth actor masks available

21 UCF Sports Dataset Experiment using ground-truth masks:
Using both techniques gives the best performance.

System                                    Accuracy
Original Baseline System                  68%
Interest Point Pruning                    79%
Interest Point Pruning + Cuboid Masking   85%

22 UCF Sports Dataset Experiment using automatic localization:
Automatic localization results are not as degraded as on the UCF Weizmann Dynamic dataset. Again, adding cuboid masking gives the best performance.

System                                    Accuracy
Original Baseline System                  68%
Interest Point Pruning                    77%
Interest Point Pruning + Cuboid Masking   80%

23 What Have We Learned?
Holistic approaches suffer without good context.
Localization is important, and thus localization methods need to improve.
Correct use of localization is essential.
Once we can localize well, we can bring context back as an additional cue.

24 References
[1] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:228–242, 2008.
[2] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In Computer Vision and Pattern Recognition (CVPR), IEEE Conference on, pages 1–8, 2008.
[3] J. Liu, J. Luo, and M. Shah. Recognizing realistic actions from videos "in the wild". In Computer Vision and Pattern Recognition (CVPR), IEEE Computer Society Conference on, pages 461–468, 2009.
[4] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester. Discriminatively trained deformable part models, release 4. pff/latent-release4/.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1627–1645, 2010.
[6] S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detection. In CVPR, pages 2376–2383. IEEE, 2010.

25 Q & A
This work was supported by NSF grants IIS and IIS.

