Su-A Kim, 3rd June 2014
Danhang Tang, Tsz-Ho Yu, Tae-Kyun Kim, Imperial College London, UK
Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests

Presentation transcript:

Real-time Articulated Hand Pose Estimation using Semi-supervised Transductive Regression Forests
Danhang Tang, Tsz-Ho Yu, Tae-Kyun Kim, Imperial College London, UK
Outline: Introduction, Methodology, Experiments
※ The slides excerpt parts of the authors' oral presentation at ICCV 2013.

Challenges for the Hand?
» Viewpoint changes and self-occlusions.
» The discrepancy between synthetic and real data is larger than for the human body.
» Labeling is difficult and tedious!

Method
The same challenges (viewpoint changes and self-occlusions, the synthetic-to-real discrepancy, and tedious labeling) motivate the three components of the method:
» Hierarchical Hybrid Forest
» Transductive Learning
» Semi-supervised Learning

Existing Approaches
Generative approach: use explicit hand models to recover the hand pose.
» Optimization-based; optimizing the current hypothesis depends on the previous result.
» Oikonomidis et al. ICCV 2011; De La Gorce et al. PAMI 2010; Hamer et al. ICCV 2009; motion capture: Ballan et al. ECCV 2012, Xu and Cheng ICCV 2013.
Discriminative approach: learn a mapping from visual features to the target parameter space, such as joint labels or joint coordinates (i.e. hand poses), from a labelled training dataset.
» Classification or regression; each frame is processed independently, which allows error recovery.
» Wang et al. SIGGRAPH 2009; Stenger et al. IVC 2007; Keskin et al. ECCV 2012.

Discriminative Approach
Discriminative approaches have achieved great success in human body pose estimation.
» Efficient: real-time.
» Accurate: operates on a per-frame basis, without relying on tracking.
» But: requires a large dataset to cover many poses.
» Typically trained on synthetic data and tested on real data.

Hierarchical Hybrid Forest
The STR forest is trained with a hybrid split quality that combines viewpoint classification (Q_a), finger joint classification (Q_P), and pose regression (Q_V):

Q_apv = αQ_a + (1-α)βQ_P + (1-α)(1-β)Q_V

» Q_a – viewpoint classification quality (information gain); evaluates how well a split separates the viewpoint labels in the dataset.
» Q_P – finger joint label classification quality (information gain); measures how well individual patches are classified.
» Q_V – compactness of the voting vectors for pose regression, measured from the covariance of the votes (determinant/trace).
» (α, β) – margin measures of the viewpoint labels and joint labels. Because using all three terms together everywhere is slow, these weights shift the emphasis between the terms as training proceeds.
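As a concrete illustration, the hybrid split quality can be sketched in Python. This is a minimal sketch, not the authors' implementation: the compactness measure is a simplified stand-in (size-weighted negative covariance trace of the child votes), and all function names are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(labels, left):
    """Information gain of splitting `labels` by the boolean mask `left`."""
    n, nl = len(labels), left.sum()
    if nl == 0 or nl == n:
        return 0.0
    return (entropy(labels)
            - nl / n * entropy(labels[left])
            - (n - nl) / n * entropy(labels[~left]))

def vote_compactness(votes, left):
    """Q_V stand-in: negative size-weighted trace of the vote covariance
    in each child, so tighter vote clusters score higher."""
    n, score = len(votes), 0.0
    for mask in (left, ~left):
        child = votes[mask]
        if len(child) > 1:
            score -= len(child) / n * np.trace(np.cov(child.T))
    return score

def q_apv(view_labels, joint_labels, votes, left, alpha, beta):
    """Hybrid split quality: Q_apv = a*Q_a + (1-a)*b*Q_P + (1-a)*(1-b)*Q_V."""
    q_a = info_gain(view_labels, left)
    q_p = info_gain(joint_labels, left)
    q_v = vote_compactness(votes, left)
    return alpha * q_a + (1 - alpha) * beta * q_p + (1 - alpha) * (1 - beta) * q_v
```

With α = 1 the measure reduces to pure viewpoint classification; pushing α and β toward 0 moves the emphasis to joint classification and finally to regression, which mirrors the top-down role of the hierarchy.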

Transductive Learning
Training data D = {R_l, R_u, S}, spanning a source space (synthetic data S) and a target space (realistic data R):
» Synthetic data S: generated from an articulated hand model; fully labeled, with |S| >> |R|.
» Realistic data R: captured with a PrimeSense depth sensor; only a small part R_l is labeled manually, leaving the unlabeled set R_u.

Transductive Term Q_t
Similar data points in R_l and S are paired by nearest-neighbour search between the target space (realistic data R) and the source space (synthetic data S); a split function that separates a pair is penalised. Q_t is the ratio of associations preserved after a split.
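The transductive term above can be sketched directly, assuming the realistic-synthetic pairs have already been formed by nearest-neighbour matching (function and parameter names are hypothetical):

```python
def transductive_quality(pairs, goes_left):
    """Q_t sketch: the fraction of realistic-synthetic nearest-neighbour
    pairs that a candidate split keeps on the same side. Splits that
    separate many pairs are penalised by a low Q_t.

    pairs     : iterable of (real_id, synth_id) associations
    goes_left : mapping from sample id -> True if the split sends it left
    """
    pairs = list(pairs)
    if not pairs:
        return 1.0  # no associations to break
    preserved = sum(goes_left[r] == goes_left[s] for r, s in pairs)
    return preserved / len(pairs)
```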

Semi-supervised Term Q_u
Q_u evaluates the appearance similarity of all realistic patches R within a node, so that the unlabeled realistic data R_u also shapes the tree structure.
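The appearance-similarity term can be sketched the same way. The coherence score below (negative covariance trace of the patch descriptors) is a simplified stand-in chosen for illustration, not the measure used in the paper:

```python
import numpy as np

def unsupervised_quality(patch_descriptors):
    """Q_u sketch: appearance coherence of the realistic patches (labeled
    or not) that fall into a node. Tighter descriptor clusters give a
    higher (less negative) score."""
    x = np.asarray(patch_descriptors, dtype=float)
    if len(x) < 2:
        return 0.0
    # np.cov expects variables in rows, hence the transpose.
    return -float(np.trace(np.atleast_2d(np.cov(x.T))))
```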

Kinematic Refinement
1. For each joint, the voting vectors are modelled with a GMM, and the Euclidean distance between the two Gaussian modes is measured.
2. Based on that distance, joints are divided into high-confidence and low-confidence sets.
3. The high-confidence joints are used to query a large database of joint positions, and the uncertain joint positions are replaced by those closest to the result of the query.
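Step 1 can be sketched as follows. The paper fits a GMM per joint; as a lightweight stand-in, this sketch finds the two modes with a two-centre k-means and measures the distance between them. The confidence threshold is illustrative, not taken from the paper:

```python
import numpy as np

def mode_distance(votes, iters=20, seed=0):
    """Distance between the two dominant modes of a joint's voting
    distribution, using two-centre k-means as a stand-in for the two
    Gaussian means of a fitted GMM."""
    votes = np.asarray(votes, dtype=float)
    rng = np.random.default_rng(seed)
    centres = votes[rng.choice(len(votes), size=2, replace=False)].copy()
    for _ in range(iters):
        # Assign every vote to its nearest centre, then recompute centres.
        dist = np.linalg.norm(votes[:, None, :] - centres[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        for k in range(2):
            if (assign == k).any():
                centres[k] = votes[assign == k].mean(axis=0)
    return float(np.linalg.norm(centres[0] - centres[1]))

def is_confident(votes, threshold_mm=30.0):
    """High confidence when the two modes (nearly) coincide, i.e. the
    votes agree on a single joint position."""
    return mode_distance(votes) < threshold_mm
```

A joint whose votes form one tight cluster yields a small mode distance (high confidence); votes split between two distant positions yield a large distance, flagging the joint for refinement from the pose database.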

Experimental Settings
Training data:
» Synthetic data: 337.5K images
» Real data: 81K images, of which fewer than 1.2K are labeled
Evaluation data: three testing sequences
1. Sequence A: single viewpoint (450 frames)
2. Sequence B: multiple viewpoints, slow hand movements (1,000 frames)
3. Sequence C: multiple viewpoints, fast hand movements (240 frames)

Self-comparison Experiment
» The graph shows the joint classification accuracy on Sequence A.
» The realistic and synthetic baselines produce similar accuracies.
» Using the transductive term is better than simply augmenting real and synthetic data.
» Using all terms together achieves the best results.


References
[1] Latent Regression Forest: Structured Estimation of 3D Articulated Hand Posture, CVPR, 2014.
[2] A Survey on Transfer Learning, IEEE Transactions on Knowledge and Data Engineering, 2010.
[3] Motion Capture of Hands in Action using Discriminative Salient Points, ECCV, 2012.