1
Good afternoon everyone
Good afternoon everyone. My name is Danhang Tang, from Imperial College London. Today I'm gonna introduce our work "Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests”. This is a joint work with Tsz-Ho Yu from Cambridge University and my supervisor Dr. T-K Kim. And it's sponsored by Samsung. Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests Danhang Tang Tsz-Ho Yu T-K Kim Sponsored by
2
Before I start, let me show you a short clip of our results so you can have an intuitive idea of our work
3
Motivation Multiple cameras with inverse kinematics
[Bissacco et al. CVPR2007] [Yao et al. IJCV2012] [Sigal IJCV2011] Specialized hardware (e.g. structured light sensor, TOF camera) [Shotton et al. CVPR’11] [Baak et al. ICCV2011] [Ye et al. CVPR2011] [Sun et al. CVPR2012] Learning-based (regression) [Navaratnam et al. BMVC2006] [Andriluka et al. CVPR2010] The motivation behind this is that there are many successful existing human pose estimation methods.
4
Motivation Discriminative approaches (RF) have achieved great success in human body pose estimation. Efficient – real-time Accurate – frame-basis, not relying on tracking Require a large dataset to cover many poses Train on synthetic, test on real data Don’t exploit kinematic constraints Examples: Shotton et al. CVPR’11, Girshick et al. ICCV’11, Sun et al. CVPR’12 Among them, methods using Random Forests on depth data have recently become very successful. They are very efficient and can run in real time. They are also very accurate; even single-frame-based estimation gives satisfying results. However, because they are learning-based methods, they require a large dataset to cover as many poses as possible. People often train on synthetic data and test on real data. Also, they normally don't exploit kinematic constraints.
5
Challenges for Hand? Viewpoint changes and self occlusions
Discrepancy between synthetic and real data is larger than for the human body Labeling is difficult and tedious!
6
Our method
Viewpoint changes and self occlusions → Hierarchical Hybrid Forest
Discrepancy between synthetic and real data is larger than for the human body → Transductive Learning
Labeling is difficult and tedious! → Semi-supervised Learning
7
Existing Approaches Generative approaches Discriminative approaches
Generative approaches: model-fitting, no training is required, but slow and needs initialisation and tracking. Oikonomidis et al. ICCV2011, De La Gorce et al. PAMI2010, Hamer et al. ICCV2009; motion capture: Ballan et al. ECCV2012.
Discriminative approaches: similar solutions to human body pose estimation; performance on real data remains challenging. Wang et al. SIGGRAPH2009, Stenger et al. IVC2007, Keskin et al. ECCV2012, Xu and Cheng ICCV2013.
Moreover, I’d like to point out that in this year’s ICCV, Xu and Cheng also proposed a solution to address these issues and achieved good results on real data. Unlike our method, they tried to model sensor noise explicitly.
8
Our method
Viewpoint changes and self occlusions → Hierarchical Hybrid Forest
Discrepancy between synthetic and real data is larger than for the human body
Labeling is difficult and tedious!
9
Hierarchical Hybrid Forest
Viewpoint Classification: Qa The idea is to break the problem of pose estimation down into a 3-level coarse-to-fine search. At each level we use a term to measure the quality of split functions. At the first level, we label training data with their viewpoint, which is the normal to the palm. The measuring term Qa is set to the information gain of a split. STR forest: Qa – Viewpoint classification quality (Information gain) Qapv = αQa + (1-α)βQp + (1-α)(1-β)Qv
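To make the split-quality idea concrete, here is a minimal Python sketch of an information-gain score in the spirit of Qa, assuming viewpoint labels are given as integer class ids for the samples reaching a node; the function names are illustrative, not taken from the authors' code.

import numpy as np

def entropy(labels):
    # Shannon entropy of a discrete label set
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, go_left):
    # Qa-style score: entropy reduction achieved by splitting the samples
    # into left/right children according to the boolean mask go_left
    left, right = labels[go_left], labels[~go_left]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    w = len(left) / len(labels)
    return entropy(labels) - (w * entropy(left) + (1 - w) * entropy(right))

# Example: viewpoint labels at a node and one candidate split
viewpoints = np.array([0, 0, 0, 1, 1, 2, 2, 2])
go_left = np.array([True, True, True, True, False, False, False, False])
print(information_gain(viewpoints, go_left))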
10
Hierarchical Hybrid Forest
Viewpoint Classification: Qa Finger joint Classification: Qp At the second level, pixels are classified into one of the 16 finger joints. Therefore the measuring term Qp is also an information gain, computed on this classification. STR forest: Qa – Viewpoint classification quality (Information gain) Qp – Joint label classification quality (Information gain) Qapv = αQa + (1-α)βQp + (1-α)(1-β)Qv
11
Hierarchical Hybrid Forest
Viewpoint Classification: Qa Finger joint Classification: Qp Pose Regression: Qv At the third level, in order to vote for occluded joints, we use regression and Hough voting. Qv is designed to minimise the variance of voting vectors to all joints. STR forest: Qa – Viewpoint classification quality (Information gain) Qp – Joint label classification quality (Information gain) Qv – Compactness of voting vectors (measured from the covariance of the votes) Qapv = αQa + (1-α)βQp + (1-α)(1-β)Qv
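As a rough illustration of the regression term, the sketch below scores a split by how much it reduces the spread of 3D voting vectors, using the trace of their covariance as the compactness measure; the exact measure is an assumption kept in the spirit of Qv rather than copied from the paper.

import numpy as np

def vote_spread(votes):
    # Spread of 3D voting vectors at a node: trace of their covariance matrix
    if len(votes) < 2:
        return 0.0
    return float(np.trace(np.cov(votes, rowvar=False)))

def regression_gain(votes, go_left):
    # Qv-style score: reduction in weighted vote spread after the split
    left, right = votes[go_left], votes[~go_left]
    w = len(left) / len(votes)
    return vote_spread(votes) - (w * vote_spread(left) + (1 - w) * vote_spread(right))

# Example: offsets (in mm) from sampled pixels to one joint, and a candidate split
rng = np.random.default_rng(0)
votes = rng.normal(scale=20.0, size=(200, 3))
go_left = votes[:, 0] > 0
print(regression_gain(votes, go_left))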
12
Hierarchical Hybrid Forest
Viewpoint Classification: Qa Finger Joint Classification: Qp Pose Regression: Qv Using all three terms together is slow. Therefore we design an adaptive switching scheme to control the coefficients in a binary form. For example, at the first level we keep monitoring the purity of viewpoints at each node; if it is pure enough, we switch to the next term, which is finger joint classification. STR forest: Qa – Viewpoint classification quality (Information gain) Qp – Joint label classification quality (Information gain) Qv – Compactness of voting vectors (measured from the covariance of the votes) (α,β) – Margin measures of viewpoint labels and joint labels Qapv = αQa + (1-α)βQp + (1-α)(1-β)Qv
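The switching itself can be sketched as below, where α and β act as binary gates chosen from how pure a node already is; the purity test and threshold are illustrative assumptions standing in for the margin measures used in the paper.

import numpy as np

def purity(labels):
    # Fraction of samples at the node belonging to the most frequent class
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()

def q_apv(view_labels, joint_labels, q_a, q_p, q_v, thresh=0.95):
    # Qapv = alpha*Qa + (1-alpha)*beta*Qp + (1-alpha)*(1-beta)*Qv,
    # with alpha, beta switched off (set to 0) once the corresponding labels are pure enough
    alpha = 0.0 if purity(view_labels) >= thresh else 1.0
    beta = 0.0 if purity(joint_labels) >= thresh else 1.0
    return alpha * q_a + (1 - alpha) * beta * q_p + (1 - alpha) * (1 - beta) * q_v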
13
Our method
Viewpoint changes and self occlusions
Discrepancy between synthetic and real data is larger than for the human body → Transductive Learning
Labeling is difficult and tedious! → Semi-supervised Learning
14
Transductive learning
Source space (Synthetic data S) Synthetic data S: Generated from an articulated hand model. All labeled. Target space (Realistic data R) Realistic data R: Captured with a PrimeSense depth sensor. A small part of R, Rl, is labeled manually (the rest forms the unlabeled set Ru). The idea of transductive learning is to transfer knowledge from one domain to another. In our case, the first domain is the synthetic training data, which is generated with an articulated CAD model; therefore we have labels for all pixels. The second domain is the realistic data, which is captured with a PrimeSense depth sensor. Only a small part of the real data is manually labeled. Training data D = {Rl (labeled), Ru (unlabeled), S}
15
Transductive learning
Source space (Synthetic data S) Target space (Realistic data R) As mentioned before, the hierarchical hybrid forest divides the feature space in a coarse-to-fine manner. Training data D = {Rl, Ru, S}: Realistic data R: Captured from Kinect. A small part of R, Rl, is labeled manually (unlabeled set Ru). Synthetic data S: Generated from an articulated hand model, where |S| >> |R|
16
Transductive learning
Source space (Synthetic data S) Target space (Realistic data R) To transfer the knowledge, during training, for each labeled data point in the realistic domain we use nearest-neighbour search to find the closest data point in the synthetic domain and form a pairing relationship between them. If a split function separates such a pair into two child nodes, a penalty is given to it. With this transductive pairing term, the training process tends to select the split functions that best satisfy both domains. Training data D = {Rl, Ru, S}: Similar data points in Rl and S are paired (if a split function separates a pair, it is penalised)
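A minimal sketch of the pairing idea follows, assuming plain feature vectors and a brute-force nearest-neighbour search; a candidate split is then scored by the fraction of (real, synthetic) pairs it keeps in the same child, so separated pairs act as a penalty. The names and the exact form of the penalty are illustrative, not the paper's.

import numpy as np

def build_pairs(real_features, synth_features):
    # Pair each labeled real sample with its nearest synthetic sample (Euclidean)
    dists = np.linalg.norm(real_features[:, None, :] - synth_features[None, :, :], axis=2)
    return list(enumerate(dists.argmin(axis=1)))

def pair_preservation(pairs, real_go_left, synth_go_left):
    # Fraction of paired (real, synthetic) samples sent to the same child;
    # a low value means the candidate split incurs a large transductive penalty
    kept = sum(real_go_left[r] == synth_go_left[s] for r, s in pairs)
    return kept / len(pairs)

# Example with tiny random feature sets and a random candidate split
rng = np.random.default_rng(1)
real = rng.normal(size=(10, 5))
synth = rng.normal(size=(50, 5))
pairs = build_pairs(real, synth)
print(pair_preservation(pairs, rng.random(10) > 0.5, rng.random(50) > 0.5))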
17
Semi-supervised learning
Source space (Synthetic data S) Target space (Realistic data R) To make use of unlabelled real data, we adopt a semi-supervised term that minimises the appearance variance of the patches at each node. With this semi-supervised term and the pairing term working together, we can transfer knowledge from synthetic to real and from labeled to unlabeled, and train a classifier that works well on both in one go. Training data D = {Rl, Ru, S}: Similar data points in Rl and S are paired (if a split function separates a pair, it is penalised). A semi-supervised term makes use of unlabeled real data when evaluating split functions.
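For the semi-supervised term, a rough sketch is shown below: unlabeled real patches contribute by rewarding splits whose children have low appearance variance, so similar-looking patches stay together even without labels. Measuring that variance as the trace of the patch-descriptor covariance is an assumption made for illustration.

import numpy as np

def appearance_variance(patch_features):
    # Spread of the patch descriptors reaching a node (trace of their covariance)
    if len(patch_features) < 2:
        return 0.0
    return float(np.trace(np.cov(patch_features, rowvar=False)))

def semi_supervised_gain(patch_features, go_left):
    # Reduction in appearance variance achieved by a split, computed over
    # labeled and unlabeled real patches alike (no labels are needed here)
    left, right = patch_features[go_left], patch_features[~go_left]
    w = len(left) / len(patch_features)
    return appearance_variance(patch_features) - (
        w * appearance_variance(left) + (1 - w) * appearance_variance(right))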
18
Kinematic refinement Due to memory restrictions, trees cannot be grown too deep, so the voting results can sometimes be ambiguous. To select correct positions from the finger joint proposals produced by the STR forest, we adopt a data-driven kinematic refinement step afterwards. The idea is: for each joint, we fit a 2-part GMM to the votes and measure the Euclidean distance between the two modes. If the distance is smaller than a threshold, we say that this joint is certain; otherwise it is uncertain. With this method, we can separate the finger joint proposals into a certain set and an uncertain set. We then use the certain joints to query a large joint-position database in a nearest-neighbour manner, and choose the uncertain joint positions that are close to the result of the query.
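A minimal sketch of the certainty test is given below, using scikit-learn's GaussianMixture to fit the 2-part GMM to one joint's votes; the threshold value and the choice of library are assumptions, and the database query is only indicated in a comment.

import numpy as np
from sklearn.mixture import GaussianMixture

def is_certain(joint_votes, threshold=20.0):
    # Fit a 2-component GMM to the 3D votes for one joint and compare its two modes;
    # a small distance between the modes means the votes agree and the joint is "certain"
    gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0).fit(joint_votes)
    mode_distance = np.linalg.norm(gmm.means_[0] - gmm.means_[1])
    return mode_distance < threshold

# Certain joints are then used to query a large joint-position database by
# nearest neighbour, and the retrieved pose supplies the uncertain joint positions.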
19
Experiment settings
Training data: Synthetic data (337.5K images); Real data (81K images, <1.2K labeled)
Evaluation data: three different testing sequences
Sequence A – single viewpoint (450 frames)
Sequence B – multiple viewpoints, with slow hand movements (1000 frames)
Sequence C – multiple viewpoints, with fast hand movements (240 frames)
21
Self comparison experiment
Self comparison (Sequence A): This graph shows the joint classification accuracy on Sequence A. The realistic and synthetic baselines produce similar accuracies. Using the transductive term is better than simply augmenting real data with synthetic data. Using all terms together achieves the best results.
22
Multiview experiments
Multiview experiment (Sequence C): This slide shows the results of the two multiview sequences. The first column is the mean error across all joints. The second column is the error of the palm, which is a certain joint. The third column is the error of the index fingertip, which is an uncertain joint. We can see that the kinematic refinement can improve uncertain joints significantly in some frames. I would also like to point out that in Sequence C, an abrupt motion change happens at around frame #200. In this case the tracking-based method from FORTH fails and errors accumulate, but our single-frame-based method can recover quickly. Even on the frames where tracking is not lost, our method performs significantly better.
23
Conclusion A 3D hand pose estimation algorithm
STR forest: Semi-supervised and transductive regression forest A data-driven refinement scheme to rectify the shortcomings of the STR forest Real-time (25Hz on an Intel i7 PC without CPU/GPU optimisation) Works better than the state of the art Makes use of unlabelled data, requiring less manual annotation More accurate in real scenarios
24
Video demo
25
Thank you! http://www.iis.ee.ic.ac.uk/icvl