1
Recognizing Humans: Action Recognition
So, coming from pose recognition, we can go further and recognize what people are doing from their poses.
2
Overview: Data, Human localization, Action representation, Classification
I’ll try to give a quick, high-level overview of action recognition across a couple of papers, then focus on the paper that was posted. Generally, what people do to recognize actions is: first get some data, find where the people in that data are (if necessary), represent each class of actions somehow, and train a classifier on those representations. So let’s talk a bit about what kind of data people use.
3
Getting Data
4
Recognizing Human Actions: A Local SVM Approach (2004)
There were some early attempts with very controlled settings; this one is a set of videos of people performing actions in a field or a room. Most of the people are centered in the frame, all the actions are exaggerated, and there’s no background noise. These videos are easy to label, since there are only six actions.
5
Learning Actions From the Web (2009)
In the paper you all read, they used images from the web as their source of data (but evaluated on video).
6
Learning Realistic Human Actions from Movies (2008)
Movies can also be a good source of data for actions. The labeling is based on the movies’ scripts: they look at the times of actions in the scripts (for example, “Rick sits down with Ilsa”), which they have specific times for (since it’s a script), and they match those time snippets with the action label (“sit down”).
7
Describing Common Human Visual Actions in Images (2015)
And now the size of datasets has been growing; this paper uses the MS COCO captioned set. It’s not strictly actions, but actions are a subset of the images. These are mostly used in deep learning settings, I think. (M. R. Ronchi and P. Perona, “Describing Common Human Visual Actions in Images,” BMVC 2015, Swansea, Wales, September 2015.)
8
Human Localization
Localization may be necessary if there’s a lot of noise in the data, that is, if the people are not necessarily the foreground objects.
9
A Discriminatively Trained, Multiscale, Deformable Part Model (2008)
The posted paper used the deformable parts model, which basically uses a pyramid of features (which we’ll talk about in a bit) and builds on the idea that objects are made up of a bunch of parts that can be deformed relative to a root.
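As a rough sketch of the model’s scoring function (notation approximated from the paper; H is the HOG feature pyramid, p_0 the root placement, and p_1, ..., p_n the part placements):

\mathrm{score}(p_0, \ldots, p_n) \;=\; \sum_{i=0}^{n} F_i \cdot \phi(H, p_i) \;-\; \sum_{i=1}^{n} d_i \cdot \left(dx_i,\; dy_i,\; dx_i^2,\; dy_i^2\right)

That is, filter responses at each placement, minus a quadratic deformation cost for each part’s displacement from its anchor position.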
10
Action Representation (Features)
11
Recognizing Human Actions: A Local SVM Approach (2004)
Since this paper used video, they incorporated spatio-temporal features.
12
That is, they’re taking the gradient with respect to x, y, and time; a minimal sketch of that idea follows.
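A minimal sketch in Python/NumPy (not the paper’s actual space-time interest-point descriptors, just the gradient computation they rest on): treat the clip as a 3-D array and differentiate along every axis.

    import numpy as np

    def spatiotemporal_gradients(video):
        # video: array of shape (T, H, W), grayscale frames stacked over time
        # np.gradient returns one derivative array per axis, in order t, y, x
        dt, dy, dx = np.gradient(video.astype(np.float64))
        return dx, dy, dt

    clip = np.random.rand(10, 64, 64)   # hypothetical 10-frame, 64x64 clip
    dx, dy, dt = spatiotemporal_gradients(clip)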
13
Learning Actions From the Web (2009)
Histogram of Oriented Gradients (HOG) features. I think another group will be presenting more in depth on this, but here’s one set of features that you can use. Essentially, the idea is to divide the image into a grid of cells, take the image gradients in different directions, and represent each cell with the distribution (i.e., histogram) of those gradient orientations. These descriptors are affected by noisy responses, so the posted paper first applied a “probability of boundary” (Pb) operator, which roughly extracts the outlines of the foreground objects. A quick illustration of HOG extraction is below.
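A quick illustration using recent scikit-image (the parameter values here are common defaults, not the paper’s):

    from skimage import data
    from skimage.feature import hog

    image = data.astronaut()        # sample image shipped with scikit-image
    features = hog(
        image,
        orientations=9,             # number of bins in each orientation histogram
        pixels_per_cell=(8, 8),     # size of each grid cell
        cells_per_block=(2, 2),     # cells grouped together for normalization
        channel_axis=-1,            # last axis holds the color channels
    )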
14
Action Representation (Poses)
Jae Sung already talked a lot about poses, so I won’t spend too much time here.
15
Recognizing Human Actions from Still Images with Latent Poses (2010)
But you can take the HOG features further by associating specific portions of the images, like looking at arms or legs. Each of these rows specifies a “poselet”, which is learned from HOG features and an SVM; a rough sketch of that recipe is below.
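A rough sketch of that HOG-plus-linear-SVM recipe (the data here is hypothetical random noise standing in for annotated body-part crops; real poselets are trained on keypoint-aligned patches):

    import numpy as np
    from skimage.feature import hog
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(0)
    patches = rng.random((200, 64, 64))        # hypothetical 64x64 grayscale crops
    labels = np.array([1] * 100 + [0] * 100)   # 1 = crop shows the target body part

    # HOG descriptor per patch, then a linear SVM on top of the descriptors
    X = np.array([hog(p, pixels_per_cell=(8, 8), cells_per_block=(2, 2)) for p in patches])
    clf = LinearSVC(C=1.0).fit(X, labels)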
16
Same paper. They then take those further and generalize, treating the joints as parts of a larger object, so a “pose” is defined as the legs, arms, and upper body together with their locations. Other ways to represent actions may be to use attributes, like we saw last class.
17
Classification: Linear / non-linear SVMs, Regression, Clustering, Deep learning methods
After representation, you can use some method like these to classify your actions.
18
Learning Actions from the Web
Now let’s look at some parts of the posted paper that we haven’t talked about yet.
19
Steps: Query and collect images of actions from search engines; find a representation for the images; build action models for each action; classify actions and annotate video.
I think their paper can be broken down into the four main steps above. Querying is straightforward enough, though they used a really small set of actions: running, walking, sitting, playing golf, and dancing. We also talked about the representation used, which was the Pb-HOG features.
20
Cleaning the Noise: Use a simple classifier for common foreground features (background stuff shouldn’t influence the action label). Separate features into background and foreground; keep images with a high probability of foreground, toss images with a low one; iterate through the query results and retrain the classifier.
Since web queries (at the time) returned pretty noisy results, they attempted to clean up the data. Using the HOG features we talked about earlier, they separated features based on whether they were in the background or the foreground, which is decided by whether or not the pose-estimation box detected a person. They then used logistic regression with L2 regularization to reduce overfitting: we want to minimize the negative log-likelihood, where y = 1 is foreground, y = -1 is background, x is the feature vector, and w is the weight vector, as written out below.
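Written out, that objective is the standard L2-regularized logistic loss (the regularization weight \lambda is an assumption; the slide doesn’t name it):

\min_{w} \;\; \lambda \lVert w \rVert_2^2 \;+\; \sum_{i} \log\!\left(1 + e^{-y_i\, w^{\top} x_i}\right)

With y_i \in \{+1, -1\}, each term is small when w^{\top} x_i agrees in sign with y_i.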
21
Combining different poses of the same action
NMF (non-negative matrix factorization) to decompose the data; cluster images that share the same max-response basis vector (they choose k = 5, for different viewing directions).
Since the data is also variable in the different poses of each action, we want to somehow say that certain poses are the same action, so we can use non-negative matrix factorization to cluster these poses. From the poses, we can then train a local classifier on the actions for our annotation task. A minimal sketch of the clustering step follows.
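A minimal sketch of that clustering step with scikit-learn (random non-negative features stand in for the paper’s image descriptors):

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    X = rng.random((200, 500))   # hypothetical: 200 images, 500-dim non-negative features

    # decompose X ~ W @ H with k = 5 basis vectors (one per pose cluster)
    nmf = NMF(n_components=5, init='nndsvd', max_iter=500)
    W = nmf.fit_transform(X)           # per-image response to each basis vector
    clusters = W.argmax(axis=1)        # assign each image to its max-response basis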
22
Annotating Video
For annotations, they track bounding boxes for each person they find in the frames. They also do some smoothing based on a dynamic programming approach, optimizing for the highest classification probability; a rough sketch of that kind of smoothing follows.
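A Viterbi-style sketch of that smoothing, assuming per-frame class probabilities and a hypothetical constant switching penalty (the slide doesn’t spell out the paper’s exact cost):

    import numpy as np

    def smooth_labels(probs, switch_penalty=1.0):
        # probs: (T, K) per-frame class probabilities for one tracked person;
        # returns the label sequence maximizing log-probability minus switch costs
        T, K = probs.shape
        logp = np.log(probs + 1e-12)
        score = logp[0].copy()
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            # trans[j, k]: score of being in label j at t-1 then moving to label k
            trans = score[:, None] - switch_penalty * (1 - np.eye(K))
            back[t] = trans.argmax(axis=0)
            score = trans.max(axis=0) + logp[t]
        labels = [int(score.argmax())]
        for t in range(T - 1, 0, -1):
            labels.append(int(back[t, labels[-1]]))
        return labels[::-1]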
23
Evaluation: Improving query returns; Annotating video
24
Evaluation - query. Though I’m not sure how they determine the relevancy of each image; I think it’s by hand?
25
Evaluation - video. And here’s the confusion matrix for their video annotations, per frame. What jumps out is that a lot of the running poses get misclassified as dancing, and they say that the “dancing poses look similar to running”. Do you think that makes sense?
26
Extra stuff Neural Talk and Walk
Trained on MS COCO image/caption pairs. This is some extra stuff if there’s time. The system in this video uses a deep net model trained on the MS COCO image set with captions, and tries to caption things as it walks by. It’s not all actions, but since the set does include actions, it’ll try to classify those, like this one.
27
Links to papers:
Recognizing Human Actions: A Local SVM Approach
Recognizing Human Actions from Still Images with Latent Poses
Recognizing Realistic Actions from Videos “In the Wild”
Recognizing Human Actions by Attributes
Learning Realistic Human Actions from Movies
Learning Actions from the Web
A Discriminatively Trained, Multiscale, Deformable Part Model