Improving Human Action Recognition using Score Distribution and Ranking
Minh Hoai Nguyen
Joint work with Andrew Zisserman
Inherent Ambiguity: When does an action begin and end?
Precise Starting Moment?
- Hands are being extended?
- Hands are in contact?
When Does the Action End?
- Action extends over multiple shots
- Camera shows a third person in the middle
Action Location as Latent Information
Diagram labels: video clip; latent location of action; consider subsequences; HandShake classifier; HandShake scores; Max; recognition score (in testing); update the classifier (in training)
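A minimal sketch of the Max pipeline on this slide: every candidate subsequence is scored and the highest score is taken as the recognition score, with the winning window standing in for the latent action location. The linear classifier on the mean frame feature, the `min_len` parameter, and all function names are illustrative assumptions, not details given in the slides.

```python
import numpy as np

def subsequences(n_frames, min_len):
    """Enumerate all (start, end) frame windows with at least min_len frames."""
    return [(s, e)
            for s in range(n_frames)
            for e in range(s + min_len, n_frames + 1)]

def max_score(frame_feats, w, b, min_len=2):
    """Return the best subsequence score and its window.

    Assumption: a subsequence is represented by the mean of its frame
    features and scored with a linear classifier (w, b).
    """
    best, best_win = -np.inf, None
    for s, e in subsequences(len(frame_feats), min_len):
        score = float(frame_feats[s:e].mean(axis=0) @ w + b)
        if score > best:
            best, best_win = score, (s, e)
    return best, best_win

# Toy clip: frames 2 and 3 carry the action signal.
feats = np.array([[0.0], [0.0], [5.0], [5.0], [0.0], [0.0]])
score, window = max_score(feats, w=np.array([1.0]), b=0.0)
```

In testing, `score` would serve as the clip's recognition score; in training, `window` would be the latent location used to update the classifier.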
Poor Performance of Max
Table: Whole vs. Max on Hollywood and TVHID, measured in Mean Average Precision (higher is better)
Possible reasons:
- The learned action classifier is far from perfect
- The output scores are noisy
- The maximum score is not robust
- Action recognition is … a hard problem
Can We Use Mean Instead?
Diagram labels: video clip; latent location of action; considered subsequences; HandShake classifier; HandShake scores; Mean
On Hollywood2, Mean is generally better than Max
But not always (table: Whole, Max, Mean on Hollywood2-Handshake)
Another HandShake Example
- The proportion of HandShake is small
- For Whole and Mean, the signal-to-noise ratio is small
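The trade-off between the two pooling rules can be illustrated with made-up numbers. In the sketch below, the score values and clip sizes are hypothetical, chosen only to show that Mean is diluted when the action covers a small part of the clip, while Max cannot tell a true action from a single spurious high score.

```python
import numpy as np

def pooled_scores(scores):
    """Pool a list of subsequence scores with Mean and with Max."""
    scores = np.asarray(scores, dtype=float)
    return {"mean": float(scores.mean()), "max": float(scores.max())}

# Hypothetical positive clip: only 2 of 20 subsequences overlap the action.
positive_clip = [-1.0] * 18 + [2.0, 2.0]
# Hypothetical negative clip: one subsequence gets a spuriously high score.
negative_clip = [-1.0] * 19 + [2.0]

pos = pooled_scores(positive_clip)
neg = pooled_scores(negative_clip)

# Mean separates the two clips only by a small margin (the signal is diluted);
# Max gives both clips the same score (one noisy value is enough).
mean_gap = pos["mean"] - neg["mean"]
max_gap = pos["max"] - neg["max"]
```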
Proposed Method: Use the Distribution
Diagram labels: video clip; latent location of action; sampled subsequences; base HandShake classifier; HandShake scores; sort; distribution-based classification; improved HandShake score
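The first half of this pipeline (sample subsequences, score them with the base classifier, sort the scores) might look roughly like the sketch below. The uniform random sampling scheme, the linear base classifier on mean frame features, and the names `sample_subsequences` and `score_distribution` are assumptions for illustration only.

```python
import numpy as np

def sample_subsequences(n_frames, n_samples, min_len, rng):
    """Draw random (start, end) windows with at least min_len frames."""
    windows = []
    for _ in range(n_samples):
        start = int(rng.integers(0, n_frames - min_len + 1))
        end = int(rng.integers(start + min_len, n_frames + 1))
        windows.append((start, end))
    return windows

def score_distribution(frame_feats, w, b, n_samples=50, min_len=2, seed=0):
    """Score sampled subsequences and return the scores sorted in descending order.

    Assumption: the base classifier is linear on the mean frame feature.
    """
    rng = np.random.default_rng(seed)
    windows = sample_subsequences(len(frame_feats), n_samples, min_len, rng)
    scores = np.array([float(frame_feats[s:e].mean(axis=0) @ w + b)
                       for s, e in windows])
    return np.sort(scores)[::-1]

# Toy clip with the action in frames 2 and 3.
feats = np.array([[0.0], [0.0], [5.0], [5.0], [0.0], [0.0]])
dist = score_distribution(feats, w=np.array([1.0]), b=0.0)
```

The sorted vector `dist` is a fixed-length description of the clip that a second, distribution-based classifier can consume.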
Learning Formulation
Equation annotations: subsequence-score distribution; video label; weights; bias; hinge loss
Weights for Distribution: emphasize the relative importance of classifier scores
Special cases:
- Case 1: equivalent to using Mean
- Case 2: equivalent to using Max
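The two special cases can be checked numerically. The sketch below assumes the distribution-based score is a weighted sum of the sorted subsequence scores plus a bias, and that training penalizes it with a standard hinge loss; the slide names these components, but the full objective (regularizer, any constraints on the weights) is not recoverable from it and is not reproduced here. All names and numbers are illustrative.

```python
import numpy as np

def ssd_score(sorted_scores, weights, bias=0.0):
    """Weighted sum of sorted subsequence scores plus a bias."""
    return float(np.dot(weights, sorted_scores) + bias)

def hinge_loss(score, label):
    """Standard hinge loss for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)

# Hypothetical sorted score distribution for one video (descending order).
d = np.array([2.0, 0.5, -1.0, -1.5])
n = len(d)

# Case 1: uniform weights reproduce Mean pooling.
w_mean = np.full(n, 1.0 / n)
# Case 2: all weight on the top-ranked score reproduces Max pooling.
w_max = np.zeros(n)
w_max[0] = 1.0

score_as_mean = ssd_score(d, w_mean)
score_as_max = ssd_score(d, w_max)
loss_mean = hinge_loss(score_as_mean, label=+1)
loss_max = hinge_loss(score_as_max, label=+1)
```

Learning the weights lets the classifier land anywhere between these two extremes.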
Controlled Experiments
Synthetic video with random action location
Two controlled parameters:
- The action percentage
- The separation between non-action and action features
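One way such a synthetic video could be generated is sketched below. The slides do not say how the features are drawn, so the unit-variance Gaussian noise, the single contiguous action segment, and the mean shift by `separation` are all assumptions, as are the function and parameter names.

```python
import numpy as np

def synthetic_video(n_frames, action_pct, separation, rng, dim=1):
    """Generate frame features with an action segment at a random location.

    Assumptions: non-action frames are standard Gaussian noise; action frames
    are the same noise shifted by `separation`; the action is one contiguous
    segment covering `action_pct` of the frames.
    """
    feats = rng.normal(0.0, 1.0, size=(n_frames, dim))
    n_action = int(round(action_pct * n_frames))
    start = int(rng.integers(0, n_frames - n_action + 1))
    feats[start:start + n_action] += separation
    return feats, (start, start + n_action)

rng = np.random.default_rng(0)
feats, (start, end) = synthetic_video(n_frames=100, action_pct=0.2,
                                      separation=3.0, rng=rng)
```

Sweeping `action_pct` and `separation` would give the two controlled axes named on the slide.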
Controlled Experiments
Hollywood2 – Progress over Time
Chart: best published results, Mean Average Precision (higher is better); value label: 9.3%
Hollywood2 – State-of-the-art Methods
Chart: Mean Average Precision (higher is better) for:
- Dataset Introduction (STIP + scene context)
- Deep Learning features
- Mined compound features
- Dense Trajectory Descriptor (DTD)
- Improved DTD (better motion est.)
- DTD + saliency
- same
Results on TVHI Dataset
Chart: Mean Average Precision (higher is better)
Weights for SSD classifiers
AnswerPhone Example 1
AnswerPhone Example 2
The End