1
Parsing Human Motion with Stretchable Models
Ben Sapp, David Weiss, Ben Taskar
Hi, my name is Ben Sapp, and I'll be presenting our paper, titled "Parsing Human Motion with Stretchable Models"; joint work with my colleague David Weiss and my advisor, Ben Taskar.
2
Parsing Human Motion Input Desired Output
The goal is to determine the joint locations of all body parts in every frame, as seen on the right. Note #1: the output shown here is ground truth. Note #2: this is an offline processing setting. Note #3: no hand-initialization is required.
3
What to Model Detecting joints in isolation is hard
Where are the elbows? This is an extremely difficult problem: the joints in isolation are very hard to detect due to ambiguous appearance. We can do our best to capture their appearance using standard features such as discriminative HoG templates, optical flow, and skin/clothing detectors, but the signal is inherently weak. To improve this, we can model the pairwise relationships between kinematically connected parts, using larger HoG templates, contour support, and geometric properties such as limb length and angle. There are still more useful joint-pair relations we can model to be more accurate: left-right symmetric joint-pairs can measure the color consistency of symmetric parts and their relative distance, and enforce a sensible left-right ordering of joints. [Zoom in on elbow.]
4
What to Model Detecting joints in isolation is hard
Need to describe relationships between joints. Joints: HoG, skin color, optical flow. Limbs: HoG, contour support, length, angle. Left-right symmetry: color similarity, distance, left-right ordering.
5
What to Model: Frame t, Frame t+1.
Joints: HoG, skin color, optical flow. Limbs: HoG, length, angle, contour support. Left-right symmetry: color similarity, distance, left-right ordering. Temporal persistence: color tracking, joint motion.
Finally, we can exploit temporal cues to further improve the model, by encoding the fact that temporal changes in appearance and location are smooth.
6
= Full Model, unrolled over time (…, t-1, t, t+1, …).
Joints: HoG, skin color, optical flow. Limbs: length, angle, contour support. Symmetry: color similarity, distance, ordering. Time persistence: color tracking, joint motion.
All of these features and joint relationships capture essentially everything we could think of to describe and constrain a pose configuration of a person throughout time.
7
= Full Model: a CRF. INTRACTABLE. (Unrolled over time: t-1, t, t+1, …, T.)
N joints, T frames. This description of the problem naturally defines a Conditional Random Field. We let y denote the locations of all joints in all frames, and x the observed video data. Then we can evaluate the probability of any particular joint configuration as a product of unary and pairwise terms, based on the features we have just described. We can also specify the best possible configuration as the argmax solution of this distribution. The only problem is that, because of the cyclic nature of the graph structure, we are forced to reason over an exponential number of possibilities, making the problem intractable.
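To make the intractability concrete, here is a hypothetical toy sketch (not the authors' code; the node count, state counts, and scores are made up): scoring one configuration is a cheap sum of unary and pairwise terms, but exact MAP in a cyclic model forces enumeration over every joint assignment.

```python
import itertools

def crf_score(y, unary, pairwise, edges):
    """Score of configuration y: sum of per-joint unary scores
    plus per-edge pairwise scores (log-potentials)."""
    s = sum(unary[i][y[i]] for i in range(len(y)))
    s += sum(pairwise[(i, j)][y[i]][y[j]] for (i, j) in edges)
    return s

def brute_force_map(num_states, unary, pairwise, edges):
    """Exact argmax by exhaustive enumeration: exponential in the
    number of joints, which is why the cyclic model is intractable."""
    n = len(unary)
    return max(itertools.product(range(num_states), repeat=n),
               key=lambda y: crf_score(y, unary, pairwise, edges))
```

Even this 3-node toy enumerates 2^3 configurations; with 6 joints per frame, thousands of candidate pixel states, and T frames, enumeration is hopeless.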
8
Sidestepping Intractability
Filtering / greedy methods: make hard decisions at each time step to keep around just a few possibilities; can't correct for mistakes made in early frames.
Tracking by detection: Monocular 3D Pose Estimation and Tracking by Detection, Andriluka et al., CVPR'10; Tracking as Repeated Figure/Ground Segmentation, Ren & Malik, CVPR'07.
Particle filtering: An MCMC-based Particle Filter for Tracking Multiple Interacting Targets, Khan et al., ECCV'04.
So how do most people handle the computational intractability of tracking multiple, correlated parts through time? They are forced to resort to approximate inference techniques of various kinds, each with its own limitations. Sampling methods typically only reason forward through time, and cannot correct for mistakes made in earlier time steps.
9
Sidestepping Intractability
Loopy belief propagation: costly iterative message passing, may not converge; forced to use simple edge relationships which allow fast inference tricks (convolutions, distance transforms).
Measure Locally, Reason Globally: Occlusion-sensitive Articulated Pose Estimation, Sigal and Black, CVPR'06; Progressive Search Space Reduction for Human Pose Estimation, Ferrari et al., CVPR'08.
Loopy belief propagation requires iteratively re-computing messages through the graph, and may never converge. Due to the computational cost, most methods restrict themselves to simple pairwise relationships that allow for fast message-passing techniques, at the cost of losing some important cues that would help the problem.
10
Tree Decomposition + Agreement Algorithms
How do we do it? Tree decomposition + agreement algorithms: efficient, and not greedy. So how do we address this problem? We want all the rich pairwise interactions we discussed, but we need to maintain tractability. In this paper, we propose a way to decompose the full model into a collection of trees, and introduce novel ways of making them agree.
11
Sidestepping Intractability
frame t, frame t+1: Cyclic vs. Tree. While inference in the full cyclic model is exponential in the number of joints (✗), inference in a tree graph is only *linear* in the number of time steps (✔). HOWEVER, in any one tree we cannot cover all the edges: we lose some of the interesting interactions which we wanted to capture (✗).
12
Sidestepping Intractability:
Complete graph is equivalent to the product of M submodel distributions. So, instead of just a single tree, we’re going to decompose our full model into a collection of M submodels, with each submodel covering some of the edges of the original graph. For example, here is the submodel which tracks the left and right elbows together, and also tracks the right elbow through time.
13
Model (dis)agreement. For any given pose configuration y, the models are equivalent: the submodel scores combine to give the full model score. But in general, different submodels m and m' do not agree on what the best solution is.
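In symbols (a reconstruction consistent with the surrounding description; the slide's original equations did not survive extraction), the decomposition preserves scores configuration by configuration, while the maximizers can differ:

```latex
s(y, x) \;=\; \sum_{m=1}^{M} s_m(y, x) \quad \text{for every fixed } y,
\qquad \text{but in general} \qquad
\arg\max_y s_m(y, x) \;\neq\; \arg\max_y s_{m'}(y, x).
```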
14
Model (dis)agreement different tree models
15
Degree of Agreement / Computational Cost
The Value of Agreement. We can force agreement to varying degrees: No Agreement, Single Variable Agreement, Single Frame Agreement, Full Agreement (Dual Decomposition). In this talk, we'll go over a variety of techniques to come up with a single best guess. The techniques lie on a continuum of trade-offs between computational effort and the degree to which we force the tree models to agree on a decision. Guess which one works the best? It's not what you think! Does more agreement equal better performance?
16
= Storyline Tree decomposition Rich pairwise features Model Agreement
× Rich pairwise features (unary, pairwise); Model Agreement; Results (ours vs. prev. work). The remainder of the talk will go into details in three parts. First, we discuss the specifics of our different inference and learning techniques. Second, we describe all the features that go into our system. Third, we show our results against previous work on our new VideoPose dataset.
17
A stretchable model: joint-based versus limb-based. The joint-based representation allows fine-grained variability in limb angle and length; the limb-based representation discretizes angles into 15°+ increments, and only 1-3 scales.
We call our model stretchable because the joint-based representation allows us fine-grained variability in the length and angle of limbs. This is CRUCIAL when it comes to fitting foreshortening. This is in comparison to previous pictorial structures representations, which use a rectangular part representation and are forced to discretize the number of angles and scales coarsely. Stretchability is also a major selling point of another CVPR'11 talk: Articulated Pose Estimation with Flexible Mixtures-of-Parts, Yang and Ramanan.
18
Features: Frame t, Frame t+1.
Joints: HoG, skin color, optical flow. Limbs: HoG, length, angle, contour support. Left-right symmetry: color similarity, distance, left-right ordering. Temporal persistence: color tracking, joint motion.
Due to time restrictions, we can only show a few of our interesting features. For the rest, please read the paper.
19
HoG Limb Detectors: the part detector of choice for most pose models: [Andriluka et al., CVPR09], [Felzenszwalb et al., CVPR08+], [Bourdev et al., ECCV10], [Yang and Ramanan, CVPR11], [Wang et al., CVPR11], and many more!
HoG right elbow SVM detector heat map; white circle = true elbow location. To get a sense of how difficult it is to detect individual joints, we can look at the HoG detector map for the right elbow, with the ground truth labeled as a white circle. HoG part detectors are a standard representation for many popular parts-based models, and for many they comprise the only source of image information. We can see from the video that this is a very weak cue, and HoG alone is not going to solve the problem for us. (Colormap: low to high. Unary features.)
20
Color similarity Left-right symmetry: Temporal appearance persistence:
Left-right symmetry: L0-norm between patches in quantized color space. Temporal appearance persistence: the same distance between patches across frames.
Now we can look at a few features used by some of the more interesting pairwise connections. On the left, we show a left-right symmetry cue: we compare the color similarity of the patch labeled by the solid magenta box around the wrist to every other patch in the image. We see that often the other wrist, shown as a dotted magenta box, has a high similarity score. On the right, we show color tracking through time, based on L0-norm patch distance: the similarity of the patch labeled by the solid magenta box around the wrist in the previous frame is compared to every patch in the current frame. (Symmetry features; temporal features.)
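A minimal sketch of such a patch cue (hypothetical code, not the paper's implementation; the bin count and uint8 color assumption are mine): quantize colors coarsely, then count the fraction of pixels whose quantized colors disagree.

```python
import numpy as np

def l0_patch_distance(patch_a, patch_b, bins=8):
    """L0-style distance between two equally sized uint8 color patches:
    quantize each channel into `bins` levels, then return the fraction
    of pixels whose quantized colors differ."""
    step = 256 // bins
    qa = patch_a // step
    qb = patch_b // step
    differs = np.any(qa != qb, axis=-1)  # a pixel differs if any channel does
    return float(differs.mean())
```

Identical patches score 0.0 and completely different patches score 1.0, so a low score between the two wrist patches is evidence of left-right symmetry (or, across frames, of temporal persistence).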
21
Hand detectors. Legend: skin color detection, optical flow magnitude, edges. We learn linear SVM hand filters over the skin, flow, and edge maps, and convolve them with the input to get filter response maps.
Finally, one more gratuitous feature visualization: on the left video, in red we show an estimate of skin color based on face detection in each frame, and in cyan, optical flow motion discontinuities. We can take each of these sources of information and learn linear filters from them in order to detect hands. Convolving these filters with the input, we obtain two different sources of information for where the hands might be. (Wrist features.)
22
Ablative Analysis: Which features matter?
(Bar chart: joint accuracy, AUC. Removing individual features costs roughly a 1% to 4.5% drop each; removing the features that use non-kinematic interactions costs an 8% drop beyond typical kinematic relationships.)
23
= Storyline Tree decomposition Rich pairwise features Model Agreement
× Rich pairwise features (unary, pairwise); Model Agreement; Results (ours vs. prev. work).
24
Agreement Methods Toy example: 2 models in disagreement, 3 frames.
25
Agreement: Dual Decomposition
model 1 model 2 Frame 1 Frame 2 Frame 3 Force all models to agree on every joint in every frame. [Bertsekas, 1999], [Komodakis et al., 2007]
26
Dual Decomposition Subgradient descent on dual to reach agreement:
while (!converged) {
1. run modified inference in all M submodels
2. adjust dual variables
}
Cost: it may never converge (in which case, we round), and it typically takes 100 to 500 iterations, making it that many times slower than a single round of submodel inference!
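A deliberately minimal sketch of the subgradient loop (hypothetical code: one shared discrete variable, two submodels given as per-state score vectors, and a 1/t step size; the paper's version instead runs full tree inference inside each iteration):

```python
import numpy as np

def dual_decomposition(scores_a, scores_b, iters=100):
    """Toy dual decomposition: two submodels score the SAME discrete
    variable. Each maximizes its score plus/minus a dual vector lam;
    a subgradient step nudges lam until the two argmaxes agree."""
    lam = np.zeros_like(scores_a, dtype=float)
    for t in range(1, iters + 1):
        ya = int(np.argmax(scores_a + lam))  # modified inference, model a
        yb = int(np.argmax(scores_b - lam))  # modified inference, model b
        if ya == yb:
            return ya                        # full agreement reached
        g = np.zeros_like(lam)               # subgradient of the dual
        g[ya] -= 1.0
        g[yb] += 1.0
        lam += (1.0 / t) * g                 # diminishing step size
    return ya                                # no convergence: round
```

On disagreement, lam lowers the score of model a's current pick and raises model b's, pulling the two argmaxes together; the iteration count is what makes full agreement so expensive at scale.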
27
Single Frame Agreement
??? = unconstrained / don’t care ??? ??? ??? ??? model 1 model 2 Frame 1 Frame 2 Frame 3 Force all models to agree on every joint in a single frame. Do this for every joint in turn at test time.
28
Single Frame Agreement
??? Naively, this would require running dual decomposition T times, and once was too much! But, through clever dynamic programming, we can do this efficiently. Cost: inference in the M submodels, plus exact inference in each frame.
29
Single Variable Agreement
? ? ? model 1 model 2 Frame 1 Frame 2 Frame 3 ? = don’t care / unconstrained Force all models to agree on a single joint in a single frame. Do this for every joint in turn at test time.
30
Single Variable Agreement
? For each joint, find the best scoring location such that all submodels agree: compute max-marginals in each submodel separately, add together the max-marginal scores per node, and take the highest scoring sum per node. Cost: negligible over running inference in the M submodels!
31
Model Agreement Roundup
Methods and running times in practice:
Dual Decomposition: many times longer than submodel inference (4 hours on a 30-frame clip).
Single Frame Agreement: about twice as long in practice (1 fps).
Single Variable Agreement: no more expensive than standard inference in the M submodels (about 2 fps in practice).
32
Wrist Localization. (Plot: % of joints within pixel error threshold, thresholds 15-40 px.) Methods compared: Single Variable Agreement, Single Frame Agreement, Dual Decomposition, Wrist-Tracking Tree, Sapp et al. ECCV10, Eichner & Ferrari BMVC09, and Yang & Ramanan CVPR2011. The gap over prior work is thanks to stretchability and rich temporal+symmetric cues; accuracy saturates at the limits of pixel matching.
33
Elbow Localization. (Plot: % of joints within pixel error threshold, thresholds 15-40 px.) Methods compared: Single Variable Agreement, Single Frame Agreement, Dual Decomposition, Elbow-Tracking Tree, Sapp et al. ECCV10, Eichner & Ferrari BMVC09, and Yang & Ramanan CVPR2011.
34
Takeaways: Stretchability and rich temporal and symmetric pairwise cues significantly outperform previous work. Single variable agreement: a little agreement goes a long way. Dual decomposition takes orders of magnitude longer without bringing us any added value.
35
Final Results: VideoPose2.0 test set. Ours (Single Variable Agreement) vs. Eichner et al., BMVC09. Thank you!
39
VideoPose2.0: highly articulated, with a significant amount of foreshortening; scale and location normalized; 2 shows, 44 short clips, 1286 frames. (Shown: lower-arm pixel-length histogram, and a ground-truth scatterplot of wrist, elbow, and shoulder locations.)
40
Max-marginals. Max scoring pose (MAP assignment), and max-marginal score: the score of the best pose which constrains joint y_i to be at pixel location p. In the same time it takes to compute the max scoring pose, we can also compute the max-marginal score for every joint in every frame.
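Written out (reconstructed notation; the slide's equations were lost in extraction):

```latex
y^\star \;=\; \arg\max_{y}\; s(y, x),
\qquad
m_i(p) \;=\; \max_{y \,:\, y_i = p}\; s(y, x),
```

where s(y, x) is the configuration score, y* is the MAP assignment, and m_i(p) is the best score attainable with joint y_i pinned to pixel location p.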
41
Single Frame Agreement
Before: the max-marginal for a single variable y_i. Now: the max-marginal for a set of variables y_t at locations p. We can extend the notion of the max-marginal of a single variable to a max-marginal over a set of variables. In particular, we'd like to examine the max-marginal distribution for all the joints in a single frame. Let y_t denote the set of 6 joint variables in frame t; then we write the max-marginal for y_t at a set of locations p accordingly.
42
Sidestepping Intractability: Decomposition without Agreement
frame t, frame t+1. HOWEVER, we can decompose the original, cyclic graph into a collection of SIX trees, which together cover all the edges we care about.
43
Sidestepping Intractability: How we do it
= time persistent edge The complete graph has a natural decomposition into trees: each tree is responsible for tracking a single joint through time, and maintains kinematic and symmetric edges in each frame
44
Evaluation: Pixel Distance Threshold
(Plot: % of wrists within pixel error threshold, with 15 px, 20 px, and 40 px marked.) Read as: "half the wrists are within 20 pixels of the groundtruth."
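The curve's y-axis is straightforward to sketch (hypothetical code; the array shapes are assumptions):

```python
import numpy as np

def pct_within_threshold(pred, gt, thresh):
    """Percentage of predicted joints whose Euclidean distance to the
    ground-truth location is at most `thresh` pixels. pred and gt are
    (num_joints, 2) arrays of pixel coordinates."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    return 100.0 * float(np.mean(d <= thresh))
```

Sweeping `thresh` over a range of pixel values produces the accuracy-vs-threshold curves used in the wrist and elbow localization plots.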
45
Our Contributions Efficient ensemble of tree models which capture a complex set of pairwise features A stretchable 2D layout model based on joints VideoPose2.0 dataset, containing a high degree of motion, pose and foreshortening variability Significant improvement over the best single frame models
46
Setup. T = # of frames; N = # of unique joints (in our case, 6). The state space for each joint is sparse, thanks to cascaded pictorial structures [Sapp et al., ECCV 2010]. As previously mentioned, we denote the observed, input video sequence with the variable x. We model every joint's location in every frame of video and denote these as a vector y; each state denotes the pixel location of a particular joint in a particular frame.
47
Tree submodels … mth tree submodel:
Score for pose configuration y in the mth submodel: a weighted sum of features (weights × features). We use a linear model as our scoring function for any possible placement of joints. Because the weighted sum decomposes according to the edges of the graph, we can perform inference efficiently using dynamic programming to find the best configuration of joints. With N joints and T frames of video, computing the best assignment takes time O(…).
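For a single temporal chain, the dynamic program is standard Viterbi; here is a minimal hypothetical sketch for one joint tracked over T frames with S candidate states and a shared transition score (the real submodels use richer, per-edge pairwise scores):

```python
import numpy as np

def chain_map(unary, pairwise):
    """Exact MAP in a chain by max-product dynamic programming.
    unary: (T, S) per-frame scores; pairwise: (S, S) transition
    scores shared across steps. Runs in O(T * S^2), linear in T."""
    T, S = unary.shape
    score = unary[0].astype(float).copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + pairwise      # (prev_state, cur_state)
        back[t] = np.argmax(cand, axis=0)     # best predecessor per state
        score = cand[back[t], np.arange(S)] + unary[t]
    path = [int(np.argmax(score))]            # best final state
    for t in range(T - 1, 0, -1):             # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Runtime is linear in T rather than exponential, which is the whole point of decomposing the cyclic model into trees.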
48
Rest of talk. Part I: The Value of Agreement. Part II: Features (unary, pairwise). Part III: Results (ours vs. prev. work).
50
Rest of talk. Part I: Models & Agreement. Part II: Features (unary, pairwise). Part III: Results (ours vs. prev. work).
51
Single Frame Agreement
Find the best scoring pose in each frame, considering all submodels: the best configuration of joints p for all submodels in frame t, where each submodel m contributes the score of its best full configuration after fixing the joints in frame t to be at locations p.
52
Contour Support. Legend: left lower limbs well-aligned with contours.
53
Sidestepping Intractability:
= × The complete graph is equivalent to the product of the M submodel distributions.
54
Geometry frame t frame t+1 distance travelled: length α Δx
55
Figure-from-flow (limb & joint features). Pairwise: average flow along the limb line. Unary: flow map value at the joint.
56
Computing Single Frame Agreement
Compute incoming messages to frame t from all submodels. Incorporate the messages into the unary potentials for frame t. Perform exact inference in the frame-t subgraph by triangulating it into cliques of size 3, yielding a junction chain of 3-cliques over the six joints A-F (e.g., {A,B,D}, {A,C,D}, {C,D,F}, {C,E,F}).
So it turns out that by precomputing messages in each submodel, we can compute agreement exactly in a single frame, and it only takes as long as inference in the original submodels. Cost: inference in the M submodels, plus exact inference in each frame; in practice, both terms are equally fast!