1
Real-Time Human Pose Recognition in Parts from Single Depth Images
Zihang Huang CS2310 seminar
2
References

Main paper:
[1] J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, M. Cook, R. Moore. Real-time human pose recognition in parts from single depth images. Communications of the ACM, vol. 56, no. 1, January 2013.

Sub papers:
[2] J. Shotton, R. Girshick, A. Fitzgibbon, T. Sharp, M. Cook, M. Finocchio, R. Moore, P. Kohli, A. Criminisi, A. Kipman, A. Blake. Efficient human pose estimation from single depth images. IEEE Trans. PAMI.
[3] T. B. Moeslund, A. Hilton, V. Krüger. A survey of advances in vision-based human motion capture and analysis. Computer Vision and Image Understanding, vol. 104, no. 2, November 2006.
[4] R. Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, vol. 108, no. 1-2, pp. 4-18, October 2007.
3
1. Motivation 2. Approach 3. Experiments 4. Conclusion 5. References
4
What is articulated body pose estimation?
Introduction What is articulated body pose estimation? Recovering the pose of an articulated body, which consists of joints and parts, from image-based observations. Research in pose recognition has been ongoing for 20+ years. Many approaches rely on strong assumptions: multiple cameras, manual initialization, controlled/simple backgrounds.
5
Application
6
Application Ubiquitous surveillance cameras in public places
More than 60M CCTV cameras in China. A person in London is monitored 300 times/day on average. Current CCTV systems mainly record video streams without understanding the human actions and events in the video. Understanding human activity is important for intelligent surveillance systems.
7
Challenges Human pose is deformable. Need a way to recognize many poses across different body shapes, scales, and clothing types. Must decide how to identify parts of the body and how detailed the labeling should be. Must be invariant to backgrounds, light levels, color, and texture.
8
Main Idea
9
Using conventional intensity cameras
Previous Approaches Using conventional intensity cameras: Learn an initial pose => learn variations of that pose. Estimate locations of body segments from which to build up the body. Using depth cameras: Build 3D models, divide the model into parts, then search for parts of the body. CPU expensive.
10
This paper's approach Two steps: 1. Find body parts - distinguished by depth 2. Compute joint positions - random decision forests. Large and varied dataset: both synthetic and motion-captured data. Object recognition approach: intermediate body part representation; pose estimation reduced to per-pixel classification; create scored proposals of body joints.
11
Data Gathering Depth image benefits: Color and texture invariance. Good performance in low light levels. Accurate scale estimation. Easy background subtraction. Reduced silhouette ambiguity.
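One of the benefits above, easy background subtraction, follows directly from having metric depth per pixel. A minimal sketch (the threshold value and array shapes are illustrative; a real system such as the Kinect pipeline uses a learned background model, not a single global cutoff):

```python
import numpy as np

def subtract_background(depth, max_depth=3000, bg_value=0):
    """Keep pixels closer than max_depth (in mm); mark the rest as background.

    depth == 0 is treated as an invalid sensor reading.
    """
    fg_mask = (depth > 0) & (depth < max_depth)
    return np.where(fg_mask, depth, bg_value), fg_mask

# Toy 2x3 depth map in millimetres: a person at ~1.5 m, a wall at ~4 m.
depth = np.array([[1500, 1520, 4000],
                  [1480, 4100, 0]])
fg, mask = subtract_background(depth)
```

With depth, foreground/background separation reduces to a per-pixel comparison, which is what removes most silhouette ambiguity in the first place.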
12
Data Types
13
Synthetic Data Goals: realism and variety. Uses a randomized rendering pipeline that produces fully labeled samples for training. Learning provides invariance to camera position, body pose, body size, and body shape. Other variations include height, weight, mocap frame, camera noise, clothing, hairstyle, etc.
14
Real Data (Motion Capture Data)
A large database is built using motion capture of human actions related to the target application (dancing, running, etc.). The classifier is expected to generalize to unseen poses. A wide range of poses is captured rather than all possible combinations; many redundant poses are discarded based on the initial data and furthest-neighbor clustering.
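The furthest-neighbor idea can be sketched as greedy furthest-point selection: repeatedly keep the pose farthest from everything already kept, so the retained set stays diverse. The pose representation and distance below are simplified assumptions (plain Euclidean distance over flattened joint coordinates), not the paper's exact metric:

```python
import numpy as np

def furthest_point_subset(poses, k):
    """Greedily select k mutually distant poses.

    poses: array-like of flattened joint coordinates, one row per pose.
    Returns the indices of the selected poses.
    """
    poses = np.asarray(poses, dtype=float)
    chosen = [0]                                      # seed with the first pose
    dist = np.linalg.norm(poses - poses[0], axis=1)   # distance to chosen set
    while len(chosen) < k:
        nxt = int(dist.argmax())                      # farthest remaining pose
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(poses - poses[nxt], axis=1))
    return chosen

# Two near-duplicate pairs plus one distinct pose: duplicates are skipped.
sel = furthest_point_subset([[0, 0], [0, 0.1], [10, 0], [10, 0.1], [5, 5]], 3)
```

Near-duplicate poses never get picked because their distance to an already-chosen neighbor is tiny, which is exactly why redundant mocap frames can be discarded.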
15
The Rendering Pipeline
Necessary to account for pose variation and model variation. Start with a base character and pose, then transform: rotation and translation, hair and clothing, weight and height variations, camera position and orientation, camera noise. Add the transformed samples to the dataset.
16
Different Renderings
17
System overview From a single input depth image, a per-pixel body part distribution is inferred. (Colors indicate the most likely part labels at each pixel and correspond to the joint proposals.) Local modes of this signal are estimated to give high-quality proposals for the 3D locations of body joints, even for multiple users. Finally, the joint proposals are input to skeleton fitting, which outputs the 3D skeleton for each user.
18
Joint Position Proposal
Density estimates are depth invariant. Depending on the application, inferred body parts can be pre-accumulated. A mean shift algorithm finds modes efficiently. Final joint confidence: the sum of the pixel weights reaching each mode. Multiple body parts covering the same area can be merged to form a localized joint.
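The mode-finding step above can be sketched as weighted mean shift in 3D. The bandwidth, weights, and starting point below are illustrative assumptions; in the paper the per-pixel weights also include a depth-dependent correction:

```python
import numpy as np

def mean_shift_mode(points, weights, start, bandwidth=0.05, iters=50, tol=1e-6):
    """Climb to a local mode of the weighted kernel density estimate.

    points:  (N, 3) world coordinates of pixels labelled as one body part.
    weights: (N,) per-pixel confidence for that part label.
    """
    x = np.asarray(start, dtype=float)
    for _ in range(iters):
        d2 = np.sum((points - x) ** 2, axis=1)
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian kernel
        x_new = (k[:, None] * points).sum(axis=0) / k.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

# A tight cluster of confident pixels at ~2 m plus one low-weight outlier.
points = np.array([[0, 0, 2.0], [0.01, 0, 2.0], [0, 0.01, 2.0], [1, 1, 2.5]])
weights = np.array([1.0, 1.0, 1.0, 0.1])
mode = mean_shift_mode(points, weights, start=[0.1, 0.1, 2.0])
```

The sum of kernel weights `k` at convergence plays the role of the joint's confidence score: pixels whose shift paths reach the mode contribute their weight to it.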
19
Body Part Labeling The body is broken down into an intermediate representation of 31 body parts (object by parts). Observations: parts should be small enough to localize individual body joints, yet few enough in number that no classifier capacity is wasted.
20
Depth Image Features Features are weak individually. Solution: combine them with a decision forest. The solution is efficient: one feature reads at most 3 image pixels, performs at most 5 arithmetic operations, and can be implemented on the GPU.
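The feature in question is the paper's depth comparison: f(I, x) = d(x + u/d(x)) - d(x + v/d(x)), where dividing the probe offsets u, v by the depth at x makes the feature depth invariant. A sketch (image size, probe offsets, and the large out-of-bounds constant are illustrative):

```python
import numpy as np

BIG = 1e6  # large depth returned for probes that fall off the image/background

def depth_feature(depth, x, u, v):
    """Depth-comparison feature: reads at most 3 pixels of the depth map.

    x is a pixel (col, row); u, v are offsets scaled by 1/depth(x).
    """
    h, w = depth.shape
    def probe(px, py):
        if 0 <= py < h and 0 <= px < w and depth[py, px] > 0:
            return depth[py, px]
        return BIG
    d0 = depth[x[1], x[0]]
    ux = (int(round(x[0] + u[0] / d0)), int(round(x[1] + u[1] / d0)))
    vx = (int(round(x[0] + v[0] / d0)), int(round(x[1] + v[1] / d0)))
    return probe(*ux) - probe(*vx)

# Flat 5x5 scene at 2 m, with one deeper and one shallower probe target.
depth = np.full((5, 5), 2.0)
depth[2, 4] = 3.0
depth[2, 0] = 1.0
f = depth_feature(depth, (2, 2), (4.0, 0.0), (-4.0, 0.0))
```

Three reads, two offset computations, and one subtraction: this is why each feature evaluation stays within the "3 pixels, 5 arithmetic operations" budget and maps well onto the GPU.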
21
Depth Image Features Real-time 3D body joints can be obtained directly from the Kinect via the random forest algorithm.
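Forest inference itself is simple: each tree routes a pixel by thresholding depth features until a leaf, and the leaf distributions over part labels are averaged across trees. The tiny two-tree forest below is a hand-built illustration (the part names, thresholds, and probabilities are made up, and a real tree splits on a different feature at every node):

```python
def tree_predict(node, f_value):
    """Walk one decision tree: split nodes threshold a depth feature value,
    leaf nodes hold a distribution over body part labels."""
    while "leaf" not in node:
        node = node["left"] if f_value < node["thresh"] else node["right"]
    return node["leaf"]

def forest_predict(trees, f_value):
    """Average the leaf distributions of all trees and return the best label."""
    dists = [tree_predict(t, f_value) for t in trees]
    mean = {k: sum(d[k] for d in dists) / len(dists) for k in dists[0]}
    return max(mean, key=mean.get)

trees = [
    {"thresh": 0.0,
     "left":  {"leaf": {"head": 0.9, "torso": 0.1}},
     "right": {"leaf": {"head": 0.2, "torso": 0.8}}},
    {"thresh": 0.5,
     "left":  {"leaf": {"head": 0.7, "torso": 0.3}},
     "right": {"leaf": {"head": 0.1, "torso": 0.9}}},
]
```

Because every pixel is classified independently with a handful of cheap feature reads, the whole per-pixel labelling pass parallelizes trivially, which is what makes real-time operation possible.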
22
Number of decision trees
23
Experiments Test data: a set of 5000 synthesized depth images; a real dataset of 8808 frames from more than 15 different subjects; 28 depth image sequences ranging from short motions to full actions. Parameters: 3 trees, depth 20, 300k training images/tree, 2000 training example pixels/image, 2000 candidate features, 50 candidate thresholds/feature.
24
Results
25
Conclusions No temporal information: operates frame by frame. Local pose estimates of parts: each pixel and each body joint treated independently. Very fast (super real-time estimation) under a limited compute budget. Simple depth image features. Decision forest classifier.
26
THANKS! Any questions?