Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests
Danhang Tang, Tsz-Ho Yu, Tae-kyun Kim
Imperial College London, UK

Introduction

※ These slides excerpt parts of the authors' oral presentation at ICCV 2013.

Challenges for Hand?
Labeling is difficult and tedious!
Viewpoint changes and self occlusions
The discrepancy between synthetic and real data is larger than for the human body

Method

Hierarchical Hybrid Forest
Transductive Learning
Semi-supervised Learning
Existing Approaches

Generative Approach: use explicit hand models to recover the hand pose
- optimization; depends on the previous result to optimize the current hypothesis

Oikonomidis et al. ICCV 2011
De La Gorce et al. PAMI 2010
Hamer et al. ICCV 2009

Motion capture
Ballan et al. ECCV 2012
Xu and Cheng ICCV 2013

Discriminative Approach: learn a mapping from visual features to the target parameter space, such as joint labels or joint coordinates (i.e. hand poses), from a labelled training dataset.
- classification, regression, each frame independent, error recovery

Wang et al. SIGGRAPH 2009
Stenger et al. IVC 2007
Keskin et al. ECCV 2012
Discriminative Approach

Has achieved great success in human body pose estimation.
Efficient: real-time
Accurate: frame-based, does not rely on tracking

Requires a large dataset to cover many poses
Trained on synthetic data, tested on real data
Hierarchical Hybrid Forest

STR forest:
Q_a – Viewpoint classification quality (Information gain)
Q_p – Joint label classification quality (Information gain)
Q_v – Compactness of voting vectors (Determinant of covariance trace)
(α, β) – Margin measures of viewpoint labels and joint labels

Viewpoint Classification: Q_a
»Evaluates the classification performance over all the viewpoint labels in the dataset
Finger Joint Classification: Q_p
»Measures the performance of classifying individual patches
Pose Regression: Q_v

Q_apv = αQ_a + (1-α)βQ_p + (1-α)(1-β)Q_v

Using all three terms together is slow.
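
The weighted combination of the three quality terms can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `entropy`, `information_gain` and `q_apv` are hypothetical, information gain is computed as the usual Shannon-entropy reduction, and α and β are simply passed in as numbers rather than derived from margin measures.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


def q_apv(q_a, q_p, q_v, alpha, beta):
    """Weighted combination of the viewpoint, joint-label and regression terms."""
    return alpha * q_a + (1 - alpha) * beta * q_p + (1 - alpha) * (1 - beta) * q_v


# Toy split: four patches with viewpoint labels, perfectly separated by the split.
views = ["front", "front", "side", "side"]
q_a = information_gain(views, views[:2], views[2:])

# With alpha = 1 only the viewpoint term contributes;
# with alpha = 0 and beta = 1 only the joint-label term does.
# The values 0.3 and 0.7 are made-up placeholders for Q_p and Q_v.
print(q_apv(q_a, 0.3, 0.7, alpha=1.0, beta=0.0))
print(q_apv(q_a, 0.3, 0.7, alpha=0.0, beta=1.0))
```

When α and β are restricted to 0 or 1, each node only needs to evaluate one of the three terms, which is one possible way to avoid the cost of computing all of them at every split.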
Transductive Learning

Training data D = {R_l, R_u, S} (R_l: labeled, R_u: unlabeled)

Target space (Realistic data R):
»Captured from a Primesense depth sensor
»A small part of R, R_l, is labeled manually (the rest is the unlabeled set R_u)

Source space (Synthetic data S):
»Generated from an articulated hand model, where |S| >> |R|
»All labeled
Transductive Term Q_t

Training data D = {R_l, R_u, S}

Similar data points in R_l and S are paired (a penalty is given if a split function separates a pair)
Q_t is the ratio of associations preserved after a split

[Figure: source space (synthetic data S) and target space (realistic data R), linked by nearest neighbour]
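
The ratio of preserved associations can be illustrated with a short Python sketch. The names `transductive_term` and `goes_left` are hypothetical, patches are reduced to single numbers for readability, and returning 1.0 for an empty pair list is an assumption rather than something stated on the slides.

```python
def transductive_term(pairs, goes_left):
    """Fraction of (real, synthetic) pairs whose two members land on the same side.

    pairs     -- list of (real_patch, synthetic_patch) associations
    goes_left -- hypothetical split function: patch -> bool
    """
    if not pairs:
        return 1.0  # assumption: with no associations, nothing can be broken
    preserved = sum(1 for r, s in pairs if goes_left(r) == goes_left(s))
    return preserved / len(pairs)


# Toy patches are single feature values; the split thresholds the value at 0.5.
pairs = [(0.1, 0.2), (0.4, 0.6), (0.8, 0.9), (0.7, 0.95)]
split = lambda x: x < 0.5
print(transductive_term(pairs, split))  # 3 of 4 pairs stay together -> 0.75
```

A split that sends both members of every pair to the same child scores 1.0, so candidate splits that cut through associated real and synthetic patches are ranked lower.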
Semi-supervised Term Q_u

Training data D = {R_l, R_u, S}

Q_u evaluates the appearance similarity of all realistic patches R within a node

[Figure: source space (synthetic data S) and target space (realistic data R)]
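
One way to score appearance similarity within a node is sketched below. The slides do not give the formula, so everything here is an assumption made for illustration: patches are represented by small descriptor tuples, spread is measured as the trace of the covariance, and the score `1 / (1 + spread)` is a hypothetical choice that equals 1 when every child holds identical-looking patches.

```python
def total_variance(vectors):
    """Sum of per-dimension variances (the trace of the covariance matrix)."""
    n = len(vectors)
    if n == 0:
        return 0.0
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    return sum(sum((v[d] - means[d]) ** 2 for v in vectors) / n for d in range(dims))


def appearance_similarity(left, right):
    """Assumed score in (0, 1]: higher when each child holds similar-looking patches."""
    n = len(left) + len(right)
    if n == 0:
        return 1.0
    spread = (len(left) / n) * total_variance(left) + (len(right) / n) * total_variance(right)
    return 1.0 / (1.0 + spread)


# Real patches (labeled or not) as toy 2-D appearance descriptors.
tight = appearance_similarity([(0.0, 0.0), (0.0, 0.0)], [(1.0, 1.0), (1.0, 1.0)])
mixed = appearance_similarity([(0.0, 0.0), (1.0, 1.0)], [(0.0, 0.0), (1.0, 1.0)])
print(tight, mixed)  # the split that groups similar patches scores higher
```

Because no labels are needed to compute this score, the unlabeled set R_u can contribute to split selection alongside R_l.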
Kinematic Refinement

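1. For each joint, the votes are fitted with a GMM, and the Euclidean distance between the two Gaussian modes is measured.
2. Each joint is classified as High Confidence or Low Confidence.
3. The High Confidence joints are used to query a large joint position database; for the uncertain joints, the positions close to the result of the query are chosen.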
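
The three steps above can be sketched as follows. This is a rough illustration, not the authors' method: a plain 2-means clustering stands in for the GMM fit, the `threshold` that separates high from low confidence is an assumed value, taking the midpoint of the two modes for confident joints is an assumption, and the database is a hypothetical list of dictionaries mapping joint names to 3-D positions.

```python
import math


def mean(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def two_modes(votes, iters=10):
    """Stand-in for a 2-component GMM: plain 2-means, returning the two centres."""
    a = votes[0]
    b = max(votes, key=lambda v: math.dist(v, a))  # farthest vote seeds the second mode
    for _ in range(iters):
        near_a = [v for v in votes if math.dist(v, a) <= math.dist(v, b)]
        near_b = [v for v in votes if math.dist(v, a) > math.dist(v, b)]
        if near_a:
            a = mean(near_a)
        if near_b:
            b = mean(near_b)
    return a, b


def refine(votes_per_joint, database, threshold=0.5):
    """Return one 3-D position per joint; `threshold` is an assumed confidence cut-off."""
    modes = {j: two_modes(v) for j, v in votes_per_joint.items()}
    # Steps 1-2: joints whose two modes nearly coincide are treated as high confidence.
    pose = {j: mean([a, b]) for j, (a, b) in modes.items() if math.dist(a, b) < threshold}
    # Step 3: retrieve the database pose closest to the high-confidence joints ...
    match = min(database, key=lambda p: sum(math.dist(p[j], pose[j]) for j in pose))
    # ... and, for each uncertain joint, keep the mode closer to the retrieved pose.
    for j, (a, b) in modes.items():
        if j not in pose:
            pose[j] = a if math.dist(a, match[j]) <= math.dist(b, match[j]) else b
    return pose


votes = {
    "thumb_tip": [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)],
    "index_tip": [(1.0, 0.0, 0.0), (1.1, 0.0, 0.0), (3.0, 0.0, 0.0), (3.1, 0.0, 0.0)],
}
database = [
    {"thumb_tip": (0.0, 0.0, 0.0), "index_tip": (1.0, 0.0, 0.0)},
    {"thumb_tip": (5.0, 5.0, 5.0), "index_tip": (3.0, 0.0, 0.0)},
]
result = refine(votes, database)
print(result["index_tip"])  # the mode near x = 1 is kept, guided by the confident thumb
```

In this toy case the thumb votes agree, so the thumb is used to retrieve the first database pose, which in turn resolves the ambiguous index-finger votes.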
Experimental Settings

Evaluation data: Three different testing sequences
1. Sequence A --- Single viewpoint (450 frames)
2. Sequence B --- Multiple viewpoints, with slow hand movements (1000 frames)
3. Sequence C --- Multiple viewpoints, with fast hand movements (240 frames)

Training data:
»Synthetic data (337.5K images)
»Real data (81K images, <1.2K labeled)
Self comparison experiment
»This graph shows the joint classification accuracy on Sequence A.
»The realistic and synthetic baselines produced similar accuracies.
»Using the transductive term is better than simply augmenting the real data with synthetic data.
»Using all terms together achieves the best results.
References
[1] Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture, CVPR, 2014
[2] A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering, 2010
[3] Motion Capture of Hands in Action using Discriminative Salient Points, ECCV, 2012