1
Machine learning and imprecise probabilities for computer vision
Fabio Cuzzolin IDIAP, Martigny, 19/4/2006
2
Myself
Master's thesis on gesture recognition at the University of Padova
Visiting student, ESSRL, Washington University in St. Louis
Ph.D. thesis on the theory of evidence
Young researcher in Milan with the Image and Sound Processing group
Post-doc at UCLA in the Vision Lab
3
My research
Computer vision: object and body tracking, data association, gesture and action recognition
Discrete mathematics: linear independence on lattices
Belief functions and imprecise probabilities: geometric approach, algebraic analysis, combinatorial analysis
4
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
5
Approach
Problem: recognizing an example of a known category of gestures from image sequences.
Combination of HMMs (for dynamics) and size functions (for pose representation).
Continuous hidden Markov models; EM algorithm for parameter learning (Moore).
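A minimal sketch of this training step, assuming the hmmlearn library and a matrix of per-frame feature vectors; names and parameter values are illustrative, not the original implementation:

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # EM-based training of continuous HMMs

# X: (T, d) array of per-frame feature vectors for one training gesture,
# e.g. size-function descriptors; 4 hidden states = 4 canonical poses.
X = np.random.randn(200, 8)            # placeholder training sequence
hmm = GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
hmm.fit(X)                             # Baum-Welch (EM) parameter learning
print(hmm.transmat_)                   # learnt transition matrix A
```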
6
Example
Transition matrix A -> gesture dynamics; state-output matrix C -> collection of hand poses.
The gesture is represented as a sequence of transitions between a small set of canonical poses.
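Stated explicitly (a hedged reading of the slide, assuming Gaussian state-output densities whose means are the columns of C):

```latex
% Continuous HMM assumed above: transition matrix A, state-output matrix C
A_{ij} = P(x_{t+1} = j \mid x_t = i), \qquad
p(y_t \mid x_t = i) = \mathcal{N}\!\left(y_t;\, C_i,\, \Sigma_i\right)
```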
7
Size functions
Hand poses are represented through their contours.
(Figure: real image, measuring function, family of lines, size function table.)
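A minimal sketch of a size function on a sampled contour, assuming the standard definition (l(x, y) counts the connected components of the sublevel set {phi <= y} that contain a point with phi <= x); phi could be, e.g., the distance of each contour point from the hand centroid. Function and variable names are illustrative:

```python
import numpy as np

def size_function(phi, x, y, closed=True):
    """Size function l(x, y) of a sampled contour with measuring-function
    values phi (numpy array of length N): the number of connected components
    of {phi <= y} containing at least one point with phi <= x."""
    below_y = phi <= y
    n = len(phi)
    labels = -np.ones(n, dtype=int)              # component label of each point
    current = -1
    for i in range(n):
        if below_y[i]:
            if i > 0 and below_y[i - 1]:
                labels[i] = current              # continue the current run
            else:
                current += 1                     # start a new run
                labels[i] = current
    # on a closed contour the last run wraps around into the first one
    if closed and below_y[0] and below_y[-1] and labels[0] != labels[-1]:
        labels[labels == labels[-1]] = labels[0]
    # keep only the components that reach below the lower level x
    return len({labels[i] for i in range(n) if below_y[i] and phi[i] <= x})
```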
8
Gesture classification
The EM algorithm is used to learn the HMM parameters from an input feature sequence.
A new sequence is fed to the learnt gesture models (HMM 1, HMM 2, …, HMM n); each produces a likelihood, and the most likely model is chosen (if above a threshold).
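A sketch of this classification rule, again assuming hmmlearn models (one trained GaussianHMM per gesture class); the rejection threshold is illustrative:

```python
def classify_gesture(models, Y, threshold=-1e4):
    """Score a new feature sequence Y (T, d) against the learnt gesture HMMs
    (dict label -> trained model) and return the most likely label, or None
    if every log-likelihood falls below the rejection threshold."""
    loglik = {label: hmm.score(Y) for label, hmm in models.items()}
    best = max(loglik, key=loglik.get)
    return best if loglik[best] > threshold else None
```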
9
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
10
Composition of HMMs
Compositional behavior of HMMs: the model of the action of interest is embedded in the overall model.
Clustering: states of the original model are grouped in clusters, and the transition matrix is recomputed accordingly (see the sketch below).
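The original slide ends with a formula that did not survive extraction; one plausible reconstruction (an assumption, not necessarily the talk's exact formula) aggregates the transition mass over clusters, weighting each source state by its occupancy probability:

```python
import numpy as np

def cluster_transition_matrix(A, pi, clusters):
    """Aggregate an HMM transition matrix A (n x n) over a partition of the
    states (`clusters`: list of index lists), weighting each source state by
    its occupancy probability pi, then renormalising the rows."""
    k = len(clusters)
    A_red = np.zeros((k, k))
    for I, src in enumerate(clusters):
        w = pi[src] / pi[src].sum()              # weights of states inside cluster I
        for J, dst in enumerate(clusters):
            A_red[I, J] = (w[:, None] * A[np.ix_(src, dst)]).sum()
    return A_red / A_red.sum(axis=1, keepdims=True)
```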
11
State clustering
Effect of clustering on HMM topology.
(Figures: "cluttered" model for the two overlapping motions; reduced model for the "fly" gesture extracted through clustering.)
12
Kullback-Leibler comparison
We used the KL distance to measure the similarity between models extracted from clutter and in its absence.
(Plots: KL distances between "fly" (solid) and "fly from clutter" (dashed); KL distances between "fly" and "cycle".)
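A hedged sketch of one standard way to approximate the KL divergence between two HMMs (the Monte-Carlo estimator of Juang and Rabiner); whether the talk used exactly this estimator is an assumption:

```python
def kl_hmm(hmm_p, hmm_q, T=1000, n_seq=10):
    """Monte-Carlo estimate of the KL divergence rate D(p || q) between two
    trained hmmlearn models: sample from p and compare log-likelihoods."""
    total = 0.0
    for _ in range(n_seq):
        X, _ = hmm_p.sample(T)                   # sequence drawn from model p
        total += (hmm_p.score(X) - hmm_q.score(X)) / T
    return total / n_seq
```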
13
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
14
Volumetric action recognition
Problem: recognizing the action performed by a person viewed by a number of cameras.
2D approaches: features are extracted from single views -> viewpoint dependence.
Volumetric approach: features are extracted from a volumetric reconstruction of the moving body.
15
3D feature extraction
k-means clustering to separate body parts.
Linear discriminant analysis (LDA) to estimate the direction of motion as the direction of maximal separation between the legs.
Locally linear embedding (LLE) to find a topological representation of the moving body.
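A minimal sketch of this pipeline with scikit-learn, on an (N, 3) array of occupied voxel centres; treating the two lowest clusters as the legs, the vertical axis, and all parameter values are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import LocallyLinearEmbedding

def volumetric_features(voxels, n_parts=5):
    """Sketch of the 3D feature-extraction pipeline (z assumed vertical)."""
    # 1. k-means clustering to separate body parts
    parts = KMeans(n_clusters=n_parts, n_init=10).fit_predict(voxels)

    # 2. LDA on the two lowest clusters (assumed to be the legs): the
    #    discriminant direction approximates the direction of motion
    heights = [voxels[parts == k][:, 2].mean() for k in range(n_parts)]
    legs = np.argsort(heights)[:2]
    mask = np.isin(parts, legs)
    lda = LinearDiscriminantAnalysis(n_components=1).fit(voxels[mask], parts[mask])
    direction = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

    # 3. Locally linear embedding: low-dimensional topological representation
    embedding = LocallyLinearEmbedding(n_neighbors=10, n_components=2).fit_transform(voxels)
    return parts, direction, embedding
```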
16
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
17
Uncertainty descriptions
A number of formalisms have been proposed to extend or replace classical probability: e.g. possibilities, fuzzy sets, random sets, monotone capacities, gambles, upper and lower previsions.
The theory of evidence (A. Dempster, G. Shafer): probabilities are replaced by belief functions; Bayes' rule is replaced by Dempster's rule; families of domains allow multiple representations of evidence.
18
Belief function s: 2^Θ -> [0,1]
Probability on a finite set Θ: a function p: 2^Θ -> [0,1] with p(A) = Σ_{x∈A} m(x), where m: Θ -> [0,1] is a mass function meeting the normalization constraint Σ_{x∈Θ} m(x) = 1.
Probabilities are additive: if A ∩ B = ∅ then p(A ∪ B) = p(A) + p(B).
A belief function s: 2^Θ -> [0,1] is defined as s(A) = Σ_{B⊆A} m(B), where m is a mass function on 2^Θ such that m(∅) = 0 and Σ_{B⊆Θ} m(B) = 1. Belief functions are not additive.
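A small sketch of the non-additive set function just defined, with the mass function stored as a dict over frozensets (an illustrative encoding):

```python
def belief(A, m):
    """bel(A) = sum of m(B) over all non-empty focal elements B contained in A."""
    return sum(mass for B, mass in m.items() if B and B <= A)

m = {frozenset({"a"}): 0.5, frozenset({"a", "b"}): 0.3, frozenset({"a", "b", "c"}): 0.2}
print(belief(frozenset({"a", "b"}), m))                            # 0.8
print(belief(frozenset({"a"}), m) + belief(frozenset({"b"}), m))   # 0.5: not additive
```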
19
Dempster's rule
In the theory of evidence, new information encoded as a belief function is combined with old beliefs in a revision process; belief functions are combined through Dempster's rule, based on the intersection of focal elements:
m(A) = Σ_{A_i ∩ B_j = A} m1(A_i) m2(B_j) / (1 - Σ_{A_i ∩ B_j = ∅} m1(A_i) m2(B_j))
20
Example of combination
Frame Θ = {a1, a2, a3, a4}.
s1: m({a1}) = 0.7, m({a1, a2}) = 0.3
s2: m(Θ) = 0.1, m({a2, a3, a4}) = 0.9
s1 ⊕ s2: m({a1}) = 0.7*0.1/0.37 = 0.19, m({a2}) = 0.3*0.9/0.37 = 0.73, m({a1, a2}) = 0.3*0.1/0.37 = 0.08
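A sketch of Dempster's rule that reproduces the numbers above; the encoding of mass functions as dicts over frozensets is illustrative:

```python
def dempster(m1, m2):
    """Dempster's rule of combination for mass functions given as
    dicts {frozenset: mass}; returns the normalised combined masses."""
    combined, conflict = {}, 0.0
    for A, mA in m1.items():
        for B, mB in m2.items():
            C = A & B
            if C:
                combined[C] = combined.get(C, 0.0) + mA * mB
            else:
                conflict += mA * mB              # mass assigned to the empty set
    return {C: v / (1.0 - conflict) for C, v in combined.items()}

theta = frozenset({"a1", "a2", "a3", "a4"})
s1 = {frozenset({"a1"}): 0.7, frozenset({"a1", "a2"}): 0.3}
s2 = {theta: 0.1, frozenset({"a2", "a3", "a4"}): 0.9}
print(dempster(s1, s2))   # {a1}: ~0.19, {a2}: ~0.73, {a1,a2}: ~0.08, as on the slide
```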
21
JPDA with shape info
JPDA model: independent targets; shape model: rigid links; the two are fused via Dempster's rule.
Robustness: clutter does not meet the shape constraints. Occlusions: occluded targets can be estimated.
22
Body tracking Application: tracking of feature points on a moving human body
23
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
24
Pose estimation
Estimating the "pose" (internal configuration) of a moving body from the available images; salient image measurements: features.
(Figure: camera views at t = 0, …, t = T.)
25
Model-based estimation
If you have an a-priori model of the object, you can exploit it to help (or drive) the estimation. Example: kinematic model.
26
Model-free estimation
If you do not have any information about the body, the only way to do inference is to learn a map between features and poses directly from the data; this can be done in a training stage.
27
Collecting training data
Motion capture system: 3D locations of markers = pose.
28
Training data
When the object performs some "significant" movements in front of the camera, a finite collection of configuration values q_1, …, q_T is provided by the motion capture system, while a sequence of features y_1, …, y_T is computed from the image(s).
29
Learning feature-pose maps
Hidden Markov models provide a way to build feature-pose maps from the training data: a Gaussian density for each state is set up on the feature space -> approximate feature space; each region is mapped to the set of training poses q_k whose feature value y_k falls inside it.
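A hedged sketch of this construction with hmmlearn: fit a Gaussian HMM on the training features, use the most likely state of each training frame as its feature-space region, and attach to each region the training poses observed inside it (state count and covariance type are illustrative):

```python
from hmmlearn.hmm import GaussianHMM

def feature_pose_map(Y, Q, n_states=6):
    """Learn an approximate feature-pose map from synchronised training data:
    Y (T, d) feature vectors, Q (T, p) poses from the motion-capture system."""
    hmm = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=30).fit(Y)
    states = hmm.predict(Y)                      # region assigned to each y_k
    region_to_poses = {s: Q[states == s] for s in range(n_states)}
    return hmm, region_to_poses
```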
30
Evidential model
The approximate feature spaces and the approximate parameter space form a family of compatible frames: the evidential model.
31
Human body tracking
Two experiments, two views: four markers on the right arm; six markers on both legs.
32
Feature extraction
Three steps: original image, color segmentation, bounding box.
33
Performances
Comparison of three models: left view only, right view only, both views.
(Plots: estimates associated with the "right" and "left" models, ground truth, and the pose estimate yielded by the overall model.)
34
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
35
GaitID
The problem: recognizing the identity of humans from their gait.
Typical approaches: PCA on image features, HMMs; people typically use silhouette data.
Issue: view-invariance. It can be addressed via 3D representations, but 3D tracking is difficult and sensitive.
36
Bilinear models
From view-invariance to "style" invariance.
In a dataset of sequences, each motion possesses several labels: action, identity, viewpoint, emotional state, etc. Bilinear models (Tenenbaum) can be used to separate the influence of two of those factors, called "style" and "content" (the label to classify).
y^{SC} is a training set of k-dimensional observations with labels S and C; b^C is a parameter vector representing content, while A^S is a style-specific linear map from the content space onto the observation space.
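For reference, the asymmetric bilinear model referred to above can be written (in Tenenbaum and Freeman's notation) as:

```latex
% Asymmetric bilinear model: observation with style S and content C
y^{SC} = A^{S} b^{C}, \qquad
y^{SC}_{k} = \sum_{j} a^{S}_{kj}\, b^{C}_{j}
```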
37
Content classification of unknown style
Consider a training set in which persons (content = ID) are seen walking from different viewpoints (style = viewpoint): an asymmetric bilinear model can be learned from it through the SVD of a stacked observation matrix.
When new motions are acquired in which a known person is seen walking from a different viewpoint (unknown style), an iterative EM procedure can be set up to classify the content (identity):
E step -> estimation of p(c|s), the probability of the content given the current estimate s of the style;
M step -> estimation of the linear map for the unknown style s.
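A minimal sketch of the SVD training step, following Tenenbaum and Freeman's asymmetric model; the data layout, with one k-dimensional mean observation per (style, content) pair stacked into Y, is an assumption, and the EM step for an unknown style is a separate procedure not shown here:

```python
import numpy as np

def fit_asymmetric_bilinear(Y, k):
    """Asymmetric bilinear model via SVD.
    Y: stacked mean observations, shape (S*k, C) for S styles, C contents,
    k-dimensional observations. Returns style maps A (S, k, J) and
    content vectors B (J, C), with J the model dimension."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    J = len(s)
    A = (U * s).reshape(-1, k, J)    # style-specific linear maps A^S
    B = Vt                           # content vectors b^C as columns of B
    return A, B                      # so that Y[S*k:(S+1)*k, C] ~= A[S] @ B[:, C]
```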
38
Three-layer model
Feature representation: projection of the contour of the silhouette on a sheaf of lines passing through the center.
Each sequence is encoded as a Markov model, its C matrix is stacked into an observation vector, and a bilinear model is trained over those vectors.
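One possible reading of that feature, sketched under the assumption that "projection on a sheaf of lines through the centre" means recording, for each of a fixed set of directions, the largest projection of the contour points onto that direction:

```python
import numpy as np

def sheaf_projection(contour, n_lines=36):
    """Feature sketch: for n_lines directions through the silhouette centre,
    record the largest projection of the contour (N, 2) onto each direction."""
    center = contour.mean(axis=0)
    angles = np.linspace(0.0, 2 * np.pi, n_lines, endpoint=False)
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # (n_lines, 2)
    return ((contour - center) @ dirs.T).max(axis=0)            # (n_lines,)
```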
39
MOBO database
The MOBO database: 25 people performing 4 different walking actions, viewed from 6 cameras. Each sequence has three labels: action, identity, view.
We set up four experiments in which one label was chosen as content, another as style, and the remaining one was treated as a nuisance factor:
Content = id, style = view -> view-invariant gaitID
Content = id, style = action -> action-invariant gaitID
Content = action, style = view -> view-invariant action recognition
Content = action, style = id -> style-invariant action recognition
40
Results Compared performances with “baseline” algorithm and straight k-NN on sequence HMMs
41
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
42
Distances between dynamical models
Problem: motion classification. Approach: representing each movement as a linear dynamical model; for instance, each image sequence can be mapped to an ARMA or AR linear model.
Classification is then reduced to finding a suitable distance function in the space of dynamical models; we can use this distance in any of the popular classification schemes: k-NN, SVM, etc.
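A minimal sketch, assuming a scalar feature sequence per motion: least-squares identification of an AR(2) model and nearest-neighbour classification under a pluggable model distance (Euclidean here, as a stand-in for the distances discussed next):

```python
import numpy as np

def fit_ar2(x):
    """Least-squares identification of an AR(2) model
    x_t = a1*x_{t-1} + a2*x_{t-2} + e_t for a scalar sequence x."""
    X = np.column_stack([x[1:-1], x[:-2]])
    a, *_ = np.linalg.lstsq(X, x[2:], rcond=None)
    return a                                      # parameter vector (a1, a2)

def classify_nn(train_models, train_labels, test_model,
                dist=lambda p, q: np.linalg.norm(p - q)):
    """1-NN classification of a dynamical model under a pluggable distance."""
    d = [dist(test_model, m) for m in train_models]
    return train_labels[int(np.argmin(d))]
```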
43
Riemannian metrics
Some distances have been proposed: Martin's distance, subspace angles, gap metric, Fisher metric. However, it makes no sense to choose a single distance for all possible classification problems.
When some a-priori information is available (a training set), we can learn in a supervised fashion the "best" metric for the classification problem. Feasible approach: volume minimization of pullback metrics.
44
Learning pullback metrics
Many unsupervised algorithms take a dataset as input and map it to an embedded space, but they fail to learn a full metric.
Consider then a family of diffeomorphisms F_λ between the original space M and a metric space N: each diffeomorphism F_λ induces on M a pullback metric.
(Diagram: the map F_λ between M and N.)
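A worked form of the pullback construction mentioned above (the standard definition, stated here for reference):

```latex
% Pullback metric induced on M by a diffeomorphism F_lambda : M -> N
% carrying a metric g_N on N:
\left(F_\lambda^{*} g_N\right)_p(u, v) \;=\;
g_N\!\big(dF_\lambda|_p\, u,\; dF_\lambda|_p\, v\big),
\qquad p \in M,\; u, v \in T_pM .
```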
45
Space of AR(2) models
Given an input sequence, we can identify the parameters of the linear model which best describes it; we chose the class of autoregressive models of order 2, AR(2).
Fisher metric on AR(2); compute the geodesics of the pullback metric on M.
46
Results
Scalar feature; AR(2) and ARMA models; NN algorithm to classify new sequences.
(Plots: identity recognition; action recognition.)
47
Results - 2
Recognition performance of the second-best distance and the optimal pullback metric; the whole dataset is considered, regardless of the view.
(Plots: View 1; View 5.)
48
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
49
Geometric approach to the ToE
Belief functions can be seen as points of a Euclidean space of dimension 2^N - 2, with N the size of the frame.
Belief space: the space of all the belief functions on a given frame; it has the shape of a simplex, with one coordinate s(A) for each subset A.
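A small sketch of that coordinatisation: the 2^N - 2 coordinates of a belief function are its values on the non-empty proper subsets of the frame (the encoding of masses as a dict over frozensets is illustrative):

```python
from itertools import chain, combinations

def belief_coordinates(theta, m):
    """Coordinates of a belief function in the belief space: one value bel(A)
    per non-empty proper subset A of the frame theta (2^N - 2 coordinates;
    bel(empty set) = 0 and bel(theta) = 1 are fixed). m: {frozenset: mass}."""
    elems = sorted(theta)
    subsets = chain.from_iterable(combinations(elems, r) for r in range(1, len(elems)))
    return {frozenset(A): sum(v for B, v in m.items() if B <= frozenset(A))
            for A in subsets}
```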
50
Geometry of Dempster's rule
Dempster's rule can be studied in the geometric setup too (conditional subspaces): it is a geometric operator mapping pairs of points of the belief space onto another point of the belief space.
51
Probabilistic approximation
Problem: given a belief function s, finding the "best" probabilistic approximation of s; this can be solved in the geometric setup.
Compositional criterion: the approximation behaves like s when combined through Dempster's rule.
Comparative study of all the proposed probabilistic approximations.
52
Imprecise probabilities
Computer Vision: HMMs and size functions for gesture recognition; compositional behavior of hidden Markov models; volumetric action recognition; data association with shape information; evidential models for pose estimation; bilinear models for view-invariant gaitID; Riemannian metrics for motion classification.
Imprecise probabilities: geometric approach; algebraic analysis.
53
Lattice structure
Families of frames have the algebraic structure of a lattice; order relation: existence of a refining.
F is a locally Birkhoff (semimodular with finite length) lattice, bounded below.
(Diagram: Hasse diagram with 1_F, two frames Θ and Ω, their maximal coarsening Θ ⊕ Ω, and their minimal refinement.)
54
Total belief theorem
A generalization of the total probability theorem: an a-priori constraint and a conditional constraint.
The whole graph of candidate solutions; connections with combinatorics and linear systems.
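For reference, the classical law of total probability that the theorem above generalizes to belief functions:

```latex
% Law of total probability, for a partition {B_1, ..., B_n} of the sample space:
P(A) \;=\; \sum_{i=1}^{n} P(A \mid B_i)\, P(B_i)
```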