1
Learning and Recognizing Activities in Streams of Video
Dinesh Govindaraju
2
Motivation
Activity recognition from video enables higher-level functionality, for example:
- who is presenting an agenda item
- attendee interest levels
3
Motivation
We want recognition to be automatic, without hand-generated models, because hand generation is:
- impractical when there are many activities
- less versatile, since you may be constrained to particular aspects of the problem
4
Problem Definition
- Video data: the observations are movement deltas extracted via face tracking
- Hand-label the training segments
- Learn the underlying models from the training segments
- Carry out activity recognition
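A minimal sketch of the observation step. It assumes that the face tracker returns one (x, y) face center per frame and that each movement delta is quantized into a small discrete alphabet. The slides do not say how deltas become symbols, so the 3x3 sign quantization, the 2-pixel `threshold`, and all function names here are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def movement_deltas(centers: List[Point]) -> List[Point]:
    """Frame-to-frame displacement (dx, dy) of the tracked face center."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]


def quantize_delta(dx: float, dy: float, threshold: float = 2.0) -> int:
    """Map a (dx, dy) delta to one of 9 discrete symbols (0..8).

    Each axis becomes -1, 0 or +1 depending on whether the movement exceeds
    `threshold` pixels (a hypothetical value). The pair is packed into one int.
    """
    def sign(v: float) -> int:
        if v > threshold:
            return 1
        if v < -threshold:
            return -1
        return 0

    return 3 * (sign(dx) + 1) + (sign(dy) + 1)


def observation_sequence(centers: List[Point], threshold: float = 2.0) -> List[int]:
    """Discrete observation stream built from consecutive face positions."""
    return [quantize_delta(dx, dy, threshold) for dx, dy in movement_deltas(centers)]
```

For example, centers `[(100.0, 50.0), (100.5, 50.0), (106.0, 49.0), (106.0, 43.0)]` give the symbols `[4, 7, 3]`: no movement, a move to the right, then a move up the image.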
5
Approach - Learning
- Assume the underlying models can be approximated by HMMs
- Use Baum-Welch to learn the best model from the training segments
- Need to determine the observation space and the number of states
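The learning step can be sketched as a standard Baum-Welch (EM) fit of one discrete HMM to several training segments at once, using the scaled forward-backward recursions from Rabiner's tutorial. This is an illustrative NumPy implementation and not the author's code. The function names, the iteration count, and the `eps` pseudo-count are assumptions.

```python
import numpy as np


def forward_backward(A, B, pi, obs):
    """Scaled forward-backward pass for a discrete HMM.

    Returns scaled alpha and beta (each row of alpha * beta sums to 1) and the
    per-step scaling factors c, whose logs sum to log P(obs | model).
    """
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    c = np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    return alpha, beta, c


def baum_welch(seqs, n_states, n_symbols, n_iter=50, seed=0, eps=1e-6):
    """Fit one discrete HMM to several training segments with Baum-Welch.

    `eps` is a small pseudo-count (an assumption, not from the slides) that
    keeps every probability strictly positive. Returns (A, B, pi, log_lik).
    """
    rng = np.random.default_rng(seed)
    A = rng.random((n_states, n_states))
    A /= A.sum(axis=1, keepdims=True)
    B = rng.random((n_states, n_symbols))
    B /= B.sum(axis=1, keepdims=True)
    pi = rng.random(n_states)
    pi /= pi.sum()
    for _ in range(n_iter):
        A_acc = np.full_like(A, eps)
        B_acc = np.full_like(B, eps)
        pi_acc = np.full_like(pi, eps)
        for seq in seqs:
            obs = np.asarray(seq)
            alpha, beta, c = forward_backward(A, B, pi, obs)
            gamma = alpha * beta  # state posteriors, shape (T, N)
            pi_acc += gamma[0]
            for t in range(len(obs) - 1):
                # Expected transition counts xi_t(i, j) in scaled form.
                A_acc += A * np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
            for k in range(n_symbols):
                B_acc[:, k] += gamma[obs == k].sum(axis=0)
        A = A_acc / A_acc.sum(axis=1, keepdims=True)
        B = B_acc / B_acc.sum(axis=1, keepdims=True)
        pi = pi_acc / pi_acc.sum()
    log_lik = sum(np.log(forward_backward(A, B, pi, np.asarray(s))[2]).sum() for s in seqs)
    return A, B, pi, float(log_lik)
```

Changing `seed` changes the random initialization. That is the knob a set of random restarts would vary.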
6
Approach - Learning
HMMs: [diagram not recovered from the slide]
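In the standard notation of Rabiner's tutorial (cited later in the talk), a discrete HMM with Q states S_1, ..., S_Q and M observation symbols v_1, ..., v_M is the triple λ = (A, B, π). For one activity with training segments O^(1), ..., O^(R), Baum-Welch converges to a local maximum of their joint likelihood. The definitions are restated below. Reading the slides' λ_a in this notation is an assumption.

```latex
\begin{aligned}
\lambda &= (A, B, \pi), \qquad 1 \le i, j \le Q,\ 1 \le k \le M \\
a_{ij} &= P(q_{t+1} = S_j \mid q_t = S_i) \\
b_j(k) &= P(O_t = v_k \mid q_t = S_j) \\
\pi_i &= P(q_1 = S_i) \\
\lambda_a &\approx \arg\max_{\lambda} \prod_{r=1}^{R} P\big(O^{(r)} \mid \lambda\big)
  \quad \text{(local maximum reached by Baum-Welch)}
\end{aligned}
```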
7
Approach - Learning
To find the observation space:
- Run through all training segments and collect every observation that occurs
- When a new observation appears during recognition, augment the learned observation matrices
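A sketch of this bookkeeping. It assumes that observations are hashable symbols and that each activity's observation matrix B is stored as a list of rows, one per state. The slides do not say how much probability an unseen symbol receives, so the `floor` value and all function names are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def build_vocabulary(segments: Sequence[Sequence[Hashable]]) -> Dict[Hashable, int]:
    """Assign a column index to every distinct observation in the training segments."""
    vocab: Dict[Hashable, int] = {}
    for seg in segments:
        for o in seg:
            vocab.setdefault(o, len(vocab))
    return vocab


def augment_emissions(B: List[List[float]], floor: float = 1e-6) -> List[List[float]]:
    """Append a column for one unseen symbol and renormalize every row.

    `floor` is a hypothetical small probability given to the new symbol.
    """
    out = []
    for row in B:
        new_row = row + [floor]
        total = sum(new_row)
        out.append([p / total for p in new_row])
    return out


def encode(observation: Hashable, vocab: Dict[Hashable, int],
           emissions: Dict[str, List[List[float]]]) -> int:
    """Map an observation to its column, growing every activity's B if it is new."""
    if observation not in vocab:
        vocab[observation] = len(vocab)
        for activity in emissions:
            emissions[activity] = augment_emissions(emissions[activity])
    return vocab[observation]
```

All activity models are augmented together, so each observation matrix keeps one column per symbol in the shared vocabulary.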
8
Approach - Learning
To find the number of states, Q (for each activity):
- Set the upper bound to the length of the longest training segment
- Iterate over the candidate values of Q and, for each, generate the most likely model using Baum-Welch
9
Approach - Learning
To find the number of states, Q (for each activity), continued:
- Choose the best Q by N-fold cross-validation, using discriminative power as the criterion
- With the best Q, run Baum-Welch from several sets of randomly initialized parameters to obtain λ_a, the model for activity a
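The selection procedure on the last two slides can be sketched separately from the HMM code by passing the training and scoring routines in as functions. The slides do not define "discriminative power". Here it is assumed to be the held-out margin between the model's per-frame log-likelihood on its own activity's segments and on the other activities' segments. `train`, `score`, and all function names are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Segment = Sequence[int]
# Hypothetical plug-in interfaces (not from the slides):
#   train(segments, n_states, seed) -> model, e.g. a Baum-Welch fit
#   score(model, segment) -> log P(segment | model)
Train = Callable[[List[Segment], int, int], object]
Score = Callable[[object, Segment], float]


def discriminative_power(train: Train, score: Score, segments: Dict[str, List[Segment]],
                         activity: str, q: int, n_folds: int) -> float:
    """Mean held-out margin: per-frame log-likelihood on the activity's own
    segments minus that on the other activities' segments (assumed criterion).
    Needs >= 2 segments for `activity`, n_folds >= 2, and another activity."""
    own = segments[activity]
    others = [s for a, segs in segments.items() if a != activity for s in segs]
    n_folds = min(n_folds, len(own))
    margins = []
    for k in range(n_folds):
        held_out = own[k::n_folds]
        train_set = [s for i, s in enumerate(own) if i % n_folds != k]
        model = train(train_set, q, 0)
        own_ll = mean(score(model, s) / len(s) for s in held_out)
        other_ll = mean(score(model, s) / len(s) for s in others)
        margins.append(own_ll - other_ll)
    return mean(margins)


def select_num_states(train: Train, score: Score, segments: Dict[str, List[Segment]],
                      activity: str, n_folds: int = 4) -> int:
    """Sweep Q from 1 up to the longest training segment and keep the best."""
    max_q = max(len(s) for s in segments[activity])
    return max(range(1, max_q + 1),
               key=lambda q: discriminative_power(train, score, segments, activity, q, n_folds))


def best_of_restarts(train: Train, score: Score, segs: List[Segment], q: int,
                     n_restarts: int = 5) -> object:
    """Refit with several random initializations and keep the most likely model."""
    models = [train(segs, q, seed) for seed in range(n_restarts)]
    return max(models, key=lambda m: sum(score(m, s) for s in segs))
```

In practice `train` would wrap a Baum-Welch fit and `score` a forward-procedure log-likelihood. The sweep then costs one Baum-Welch run per fold per candidate Q.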
10
Approach - Recognition
- Define a window width, w
- From the beginning, sequentially consider the windows of observations O_t, ..., O_{t+w-1} for t = 1, ..., L - w + 1 (where L is the length of the entire sequence)
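A minimal sketch of the window enumeration. It uses 0-based indices, so window t covers observations t through t + w - 1 and t runs from 0 to L - w. The function name is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sliding_windows(obs: Sequence[T], w: int) -> List[Sequence[T]]:
    """All length-w windows of obs, in order (empty if w exceeds len(obs))."""
    if w < 1:
        raise ValueError("window width must be positive")
    return [obs[t:t + w] for t in range(len(obs) - w + 1)]
```

Consecutive windows overlap in w - 1 observations, so a sequence of length L yields L - w + 1 windows.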
11
Approach - Recognition
- Calculate the likelihood of each window segment under each activity model
L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 1989.
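The likelihood of one window under one activity model can be computed with the forward procedure described in the cited Rabiner tutorial. This pure-Python sketch assumes the model is given as nested lists A, B, pi indexed by integer states and symbols. Rescaling alpha by its sum at each step avoids numerical underflow on long windows. The function name is hypothetical.

```python
import math
from typing import List, Sequence


def window_log_likelihood(A: List[List[float]], B: List[List[float]],
                          pi: List[float], window: Sequence[int]) -> float:
    """log P(window | lambda) via the forward procedure, rescaled at each step."""
    n = len(pi)
    log_lik = 0.0
    alpha: List[float] = []
    for t, o in enumerate(window):
        if t == 0:
            alpha = [pi[i] * B[i][o] for i in range(n)]
        else:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
        c = sum(alpha)  # c equals P(o_t | o_1..o_{t-1})
        if c == 0.0:
            return float("-inf")  # window impossible under this model
        log_lik += math.log(c)
        alpha = [a / c for a in alpha]
    return log_lik
```

Scoring each window then means calling this once per activity model λ_a.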
12
Approach - Recognition
- Label the middle frame of each window with the activity whose model gives the highest likelihood
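This sketch takes one score dictionary per window, mapping each activity name to its log-likelihood. It assumes an odd window width, so the middle frame sits at offset w // 2. Frames near either end of the sequence are not the middle of any window. The slides do not say how those frames were labeled, so the sketch leaves them as None. Names are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def label_middle_frames(window_scores: Sequence[Dict[str, float]],
                        L: int, w: int) -> List[Optional[str]]:
    """window_scores[t] holds each activity's log-likelihood for the window starting at t."""
    labels: List[Optional[str]] = [None] * L
    for t, scores in enumerate(window_scores):
        labels[t + w // 2] = max(scores, key=scores.get)  # most likely activity
    return labels
```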
13
Evaluation and Results
Activities being observed: [list not recovered from the slide]
14
Evaluation and Results
- The observation stream was obtained from an 87-second image sequence of 1296 individual frames
- Example frames after face detection: [images not recovered from the slide]
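As a quick consistency check, these two numbers imply the following capture rate, assuming the frames are evenly spaced in time:

```latex
\frac{1296\ \text{frames}}{87\ \text{s}} \approx 14.9\ \text{frames/s}
```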
15
Evaluation and Results
- The observation sequence was first hand-labeled
- Segments showing the same activity were extracted
- 4 training segments were used to learn each activity
16
Evaluation and Results
17
- Once the underlying models were learned, the likelihoods were calculated using the sliding window
- A window width of w = 21 was used, as this was the average length of the training segments
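The window-width choice can be written out explicitly. Let ℓ_k be the length of the k-th training segment and K the total number of training segments. The middle-frame offset below follows from w = 21. The time span assumes the roughly 14.9 frames/s implied by 1296 frames in 87 s.

```latex
\begin{aligned}
w &= \frac{1}{K} \sum_{k=1}^{K} \ell_k \approx 21 \\
\text{middle-frame offset} &= \frac{w - 1}{2} = 10 \\
\text{window span} &\approx \frac{21 \times 87}{1296}\ \text{s} \approx 1.41\ \text{s}
\end{aligned}
```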
18
Evaluation and Results
19
- Carry out recognition by using the likelihoods to assign an activity to each frame
- Compare against the hand-assigned labels
- Accuracy: approximately 76%
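Frame-level accuracy can be computed as the fraction of frames whose algorithm label matches the hand label. This sketch excludes frames left unlabeled by the sliding window (None) from the count. That exclusion is an assumption, since the slides do not say how the reported ~76% treated the ends of the sequence. The function name is hypothetical.

```python
from typing import Optional, Sequence


def frame_accuracy(predicted: Sequence[Optional[str]], hand: Sequence[str]) -> float:
    """Fraction of labeled frames whose predicted activity equals the hand label."""
    if len(predicted) != len(hand):
        raise ValueError("label sequences must have the same length")
    pairs = [(p, h) for p, h in zip(predicted, hand) if p is not None]
    if not pairs:
        raise ValueError("no labeled frames to score")
    return sum(p == h for p, h in pairs) / len(pairs)
```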
20
Evaluation and Results
Algorithm-assigned labels (figure legend: different from hand label; same as hand label)
21
Evaluation and Results
Hand-assigned labels (figure legend: different from algorithm label; same as algorithm label)
22
Future Work
- Learn the underlying model that generates the sequence of activities itself
- Standardize the lengths of the training segments using Dynamic Time Warping, and use that length as the window width
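Dynamic Time Warping aligns two sequences of different lengths by finding a minimum-cost monotone pairing of their elements. A textbook sketch follows. It uses a 0/1 mismatch cost, which suits discrete observation symbols (an assumed choice). The slide leaves open how the aligned lengths would then be standardized. One option, also an assumption, is to warp every training segment onto a chosen reference segment and take that reference's length as w. Function names are hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def mismatch(a: Hashable, b: Hashable) -> float:
    """0/1 local cost for discrete symbols (an assumed choice)."""
    return 0.0 if a == b else 1.0


def dtw_align(x: Sequence[Hashable], y: Sequence[Hashable],
              cost: Callable[[Hashable, Hashable], float] = mismatch
              ) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total DTW cost and the warping path as (i, j) index pairs."""
    n, m = len(x), len(y)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = cost(x[i - 1], y[j - 1]) + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    # Backtrack from (n, m), preferring the diagonal step on ties.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, move = min((D[i - 1][j - 1], 0), (D[i - 1][j], 1), (D[i][j - 1], 2))
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return D[n][m], path
```

For example, aligning `[1, 2, 3]` with `[1, 2, 2, 3]` costs nothing: the middle symbol of the shorter sequence is stretched over two frames of the longer one.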
23
The End
Questions?