Proxemics Recognition: Presentation transcript
1 Proxemics Recognition
Hello, everyone. Thank you for attending my talk. Today’s topic is Proxemics Recognition, a novel problem in the field of computer vision. This is joint work with Simon, Anitha, and Deva, who is my supervisor at UC Irvine. First, what is proxemics? Yi Yang1, Simon Baker, Anitha Kannan, Deva Ramanan1 1Department of Computer Science, UC Irvine

2 Proxemics Proxemics: the study of spatial arrangement of people as they interact - anthropologist Edward T. Hall in 1963 The term “proxemics” was first introduced by anthropologist Edward Hall in 1963, when he studied the spatial arrangement of people as they interact. In our daily life, proxemics occurs almost all the time, whenever multiple people share roughly the same space. For example, right now.

3 In his classic study, he found that we humans normally maintain about four different spatial zones. From a person’s own point of view,

4 the farthest is the public space: imagine that we stand in the open and other people are too far away for any possible direct communication with us.

5 In the social space, it is possible for us to talk to a person but hard to touch them.

6 As people come closer and closer, they start to enter our personal space. Normally that means we can talk to them face to face and touch them with our hands.

7 Once they enter our intimate space, they can use all of their body parts to interact with us. So we can walk with them side by side, hug them, or even kiss them.

8 Brother and Sister Holding Hands Friends Walking Side by Side
During this summer internship, we focused on proxemics in the personal space and the intimate space, which are much more interesting to study in computer vision because many complicated interactions happen in these two spaces. In the slides, you can see examples from top left to bottom right, such as holding hands, walking side by side, holding a baby, and hugging while holding hands. Mom Holding Baby Husband Hugging and Holding Wife’s Hand

9 Touch Code Hand Touch Hand Hand Touch Shoulder Shoulder Touch Shoulder
Arm Touch Torso According to anthropologist Edward Hall, such complex interactions can be explained using a small collection of basic elements called touch codes, such as hand touch hand, hand touch shoulder, shoulder touch shoulder, and arm touch torso. The set is small, but the number of combinations can be exponentially large: with k touch codes, a pair of people can exhibit any of 2^k subsets. For example, the top-right proxemic can be described as shoulder touch shoulder, while the bottom-right one can be described as the combination of hand touch hand and hand touch shoulder.

10 Applications
Personal Photo Search: find a specific interesting photo
Analysis of TV shows and movies: locate interesting scenes
Kinect
Web Search
Auto-Movie/Auto-Slideshow
Once we understand how people interact with each other in images, it enables many potential applications that involve multiple-person interaction. We can discuss them after the talk.

11 Proxemics Dataset 200 training images 150 testing images
Collected from Simon, Bing, Google, Getty Images, Flickr No video data No Kinect 3D depth data To study this problem, we built a dataset with 200 training images and 150 testing images collected from Simon’s personal photos, Bing Images, Google Images, Getty Images, and Flickr. One point we need to make: our system’s input is pure image pixels. There is no video data and no Kinect depth data.

12 Proxemics Dataset Complexity Number of People
Here is a simple illustration of our dataset. The x-axis is the number of people in an image. The y-axis is the proxemics complexity, judged subjectively by myself, so you may have a different opinion. As you can see, our dataset contains various kinds of human interactions: from far away to close by, from 2 people to many people, and from simple touches to very complex ones. Recall that our goal is to recognize the touch codes and interactions between pairs of people without help from Kinect. How, then, can we handle such a complicated problem?

13 A Naïve Approach Input Image
Well, suppose we had a Kinect that worked on still images, and suppose this image Kinect could report the full pose of the humans in this picture of two boys. Then we could directly use the predicted locations of the two boys’ hands and easily report that they are holding hands, based on the hand distance. Hence a naïve approach to recognizing human interactions in images would be: given an input image with people,

14 A Naïve Approach Input Image Image Feature
We first extract discriminative image features, such as edge features,

15 A Naïve Approach Input Image Image Feature Pose Estimation i.e. Find Skeletons
and apply current state-of-the-art human pose estimation methods to find the skeleton and joint locations for each person independently.

16 A Naïve Approach Input Image Image Feature Pose Estimation Interaction Recognition i.e. Hand touch Hand
Then, given the predicted joint locations, we build a classifier on top of them and output the final interaction recognition label, e.g. hand touch hand. This is a very straightforward pipeline, and you would hope that it would work. A minimal code sketch of this pipeline follows.
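Here is a minimal sketch of the naïve pipeline in Python. All names are illustrative assumptions: `estimate_pose` stands in for an off-the-shelf pose estimator run independently per person (e.g. Yang & Ramanan, CVPR 2011), and the classifier is a simple SVM over distances between predicted joints; this is not the exact implementation from the talk.

```python
import numpy as np
from sklearn.svm import SVC

def joint_features(pose_a, pose_b):
    """Each pose is a dict mapping joint names to 2D image locations.
    Build a simple feature vector from cross-person joint distances."""
    return np.array([
        np.linalg.norm(pose_a["r_hand"] - pose_b["l_hand"]),      # hand-to-hand
        np.linalg.norm(pose_a["r_hand"] - pose_b["l_shoulder"]),  # hand-to-shoulder
    ])

# Hypothetical training flow: poses are estimated independently per person
# (estimate_pose is a stand-in, not a real API), then an SVM decides the
# interaction label from the predicted joint locations.
# X = [joint_features(estimate_pose(img, box1), estimate_pose(img, box2))
#      for img, box1, box2 in labeled_pairs]
# clf = SVC(kernel="linear").fit(X, y)   # y: "hand-touch-hand" vs. background
```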

17 Naïve Approach Results
But when we look at the experimental results, we are very disappointed. We show two of the precision-recall curves, for recognizing hand touch hand and hand touch shoulder, tested on our dataset. You can see that this naïve approach performs almost as badly as a random guess. Hand touch Hand Hand touch Shoulder

18 Human Pose Estimation Not bad when there is no real interaction between people
To find out the reasons for the failure, we go back and look at the pose estimation results. One potential reason is simply that we do not have a Kinect that works on images, but what we found is that, when two people have no real interaction, the current pose estimation algorithm is not that bad. - Y. Yang & D. Ramanan, CVPR 2011

19 Interactions Hurt Pose Estimation
Occlusion + Ambiguous Parts However, when we look at the more interesting human interaction images, we find that independent human pose estimation fails to recover the correct poses because of heavy occlusion and ambiguous parts. To make it work, we would have to reason about whether a part is occluded by another person, or whether a predicted part belongs to another person. However, explicitly reasoning about occlusion and ambiguous parts can be very expensive, so we choose another way to handle this problem directly.

20 Our Approach Direct Proxemics Recognition
Input Image Image Feature Given an input image with its edge features extracted,

21 Our Approach Direct Proxemics Recognition
Input Image Image Feature we skip the independent pose estimation step and directly decide whether there is a human interaction or not. Then what is the underlying model? Hand touch Hand Interaction Recognition

22 Pictorial Structure Model
$I$: image; $l_i$: location of part $i$ To explain our approach, we need to trace back to the well-known pictorial structure model for human detection in computer vision. In the pictorial structure model, a human is represented as a set of parts, such as head, torso, arms, and legs, with springs encoding the relative locations between parts. A scoring function is defined based on the locations of these parts. We write $l_i$ for the 2D location of part $i$. We score a configuration of parts with a summation of two terms. - P. Felzenszwalb et al., PAMI 2009

23 Pictorial Structure Model
$\alpha_i$: unary template for part $i$; $\phi(I, l_i)$: local image features at location $l_i$ The first term scores a local match for placing a part template at a particular image location. We write $\alpha_i$ for the part template and $\phi$ for the image features extracted at that location. We use edge features and score the template with a dot product.

24 Pictorial Structure Model
$\psi(l_i, l_j)$: spatial features between $l_i$ and $l_j$; $\beta_{ij}$: pairwise spring between part $i$ and part $j$ The second term is a deformation model that evaluates the relative locations of pairs of parts. We write $\psi$ for the spatial features between two part locations and $\beta_{ij}$ for the parameters of a spring that favors certain offsets over others.
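Putting the two terms together, the full scoring function for a configuration of parts $L = \{l_i\}$ takes the standard pictorial-structure form (reconstructed here from the definitions above; the paper’s exact notation may differ slightly):

$$S(I, L) = \sum_{i \in V} \alpha_i \cdot \phi(I, l_i) + \sum_{(i,j) \in E} \beta_{ij} \cdot \psi(l_i, l_j)$$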

25 “Two Head Monster” Model
i.e. “Hand-Touching-Hand” Our direct proxemics model can be viewed as an extension of the previous model, except that now we have two humans in one model. We give it a nickname: the “two-head monster” model. Using hand-touch-hand as an example, our model detects a two-head monster with an enforced hand-hand spring connection.

26 “Two Head Monster” Model
i.e. “Hand-Touching-Hand” Based on the two full-body skeletons, we develop a simplified version that crops out the uninformative parts and keeps only the two heads and two arms. That is our model for recognizing hand touch hand.
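A plausible form of the joint score, sketched from the description above (an assumption, not the paper’s exact formulation): sum the two single-person pictorial-structure scores and add one cross-person spring between the two touching parts, e.g. the two hands for hand-touch-hand:

$$S(I, L^{A}, L^{B}) = S(I, L^{A}) + S(I, L^{B}) + \beta_{\text{hh}} \cdot \psi(l^{A}_{\text{hand}}, l^{B}_{\text{hand}})$$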

27 Shoulder touch Shoulder
The models Hand touch Hand Hand touch Shoulder Shoulder touch Shoulder Arm touch Torso We build four models in our experimental study: hand touch hand, hand touch shoulder, shoulder touch shoulder, and arm touch torso. For each of them, there is one extra spring connection between the two people, capturing the relative locations in the interaction.

28 Match Model to Image
Inference: $\max_L S(I,L)$, computed with an efficient dynamic programming algorithm. Learning: structural SVM solver. Given an image, inference in our model can be done very efficiently using dynamic programming, as sketched below. The model parameters are learned from supervised data with a structural SVM solver.
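To illustrate why inference is efficient, here is a minimal sketch of max-sum dynamic programming on a tree-structured part model, assuming each part has been reduced to K discrete candidate locations with precomputed unary and spring scores. All names are illustrative; the actual system scores locations densely over the image.

```python
import numpy as np

def tree_dp(children, unary, pairwise, root=0):
    """Max-sum dynamic programming on a tree of parts.
    children[i]: child part indices of part i (tree rooted at `root`);
    unary[i]: (K_i,) scores of part i's candidate locations;
    pairwise[(i, j)]: (K_i, K_j) spring scores between parent i and child j.
    Returns the best total score and the best candidate index per part."""
    subtree = [None] * len(unary)  # subtree[i][k]: best score of i's subtree, i at k
    backptr = {}                   # backptr[j][k]: best location of j, parent at k

    def collect(i):                # bottom-up message passing
        total = unary[i].astype(float)
        for j in children[i]:
            collect(j)
            s = pairwise[(i, j)] + subtree[j][None, :]  # (K_i, K_j)
            backptr[j] = s.argmax(axis=1)
            total += s.max(axis=1)
        subtree[i] = total

    collect(root)

    best = {root: int(np.argmax(subtree[root]))}        # top-down readout
    stack = [root]
    while stack:
        i = stack.pop()
        for j in children[i]:
            best[j] = int(backptr[j][best[i]])
            stack.append(j)
    return float(subtree[root].max()), best

# Toy usage: 3 parts (0=head, 1=torso, 2=hand), 4 candidate locations each.
rng = np.random.default_rng(0)
score, locations = tree_dp(
    children=[[1], [2], []],
    unary=[rng.standard_normal(4) for _ in range(3)],
    pairwise={(0, 1): rng.standard_normal((4, 4)),
              (1, 2): rng.standard_normal((4, 4))},
)
```

In this discrete form each edge costs O(K_i * K_j), so inference is linear in the number of parts; pictorial-structure implementations typically use distance transforms to make the spring terms fast over dense pixel grids.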

29 Refinements + Extensions
Sub-categories: because of symmetry, 4 models for hand-hand, etc. R. Hand L. Hand L. Hand R. Hand L. Hand L. Hand R. Hand R. Hand There is still some ambiguity in our model because of the left-right symmetry of arms and shoulders. Hence, as a refinement, we build 4 sub-models for hand-hand to capture the different visual appearances.

30 Refinements + Extensions
Sub-categories: because of symmetry, 4 models for hand-hand, etc. Co-occurrence of proxemics: reduce redundancy by mapping Multi-Label -> Multi-Class R. Hand L. Hand L. Hand R. Hand L. Hand L. Hand R. Hand R. Hand Another key point is that not all proxemics can happen simultaneously between two people. We take that into account as well, as the sketch below illustrates.
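A toy illustration of the multi-label-to-multi-class mapping (the labels and code below are made up for illustration, not taken from the paper): instead of treating each touch code as an independent binary label, keep only the label combinations that actually co-occur in training and treat each combination as one class.

```python
# Hypothetical annotations: each pair of people carries a set of touch codes.
train_labels = [
    frozenset({"hand-hand"}),
    frozenset({"hand-shoulder"}),
    frozenset({"hand-hand", "hand-shoulder"}),  # codes that do co-occur
    frozenset({"shoulder-shoulder"}),
]
classes = sorted(set(train_labels), key=sorted)  # observed combinations only
class_id = {combo: k for k, combo in enumerate(classes)}
# 4 classes here, versus 2**4 = 16 possible subsets of 4 touch codes.
print(class_id[frozenset({"hand-hand", "hand-shoulder"})])
```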

31 Naïve Approach Results
We run experiments and compare our model against the previous naïve approach. Hand touch Hand Hand touch Shoulder

32 Direct Approach Results
As you can see from the precision-recall curves, shown as the blue lines, our model significantly outperforms the baseline for both hand touch hand and hand touch shoulder. Hand touch Hand Hand touch Shoulder

33 Quantitative Results [1] Y. Yang & D. Ramanan, CVPR 2011
For a quantitative comparison, you can see there is a 30% boost in average precision from using our direct model, which is again very significant. [1] Y. Yang & D. Ramanan, CVPR 2011

34 Improves Pose Estimation
Y & D CVPR 2011 Our Model The direct proxemics model can, in turn, improve pose estimation as well as recognize human interactions. Here we show three examples for hand-touching-hand. By recognizing that the people are touching hands and exploiting the constraint that their two hands must be very close, the arm locations are predicted much more accurately.

35 Improves Pose Estimation
Y & D CVPR 2011 Our Model A similar phenomenon occurs for hand touching shoulder. The independent pose estimation algorithm fails to handle the occlusion and reason about the ambiguous arm, while our model can even predict the occluded part locations by exploiting the constraint that a hand should be near a shoulder.

36 Improves Pose Estimation
Y & D CVPR 2011 Our Model For arm touch torso, similar improvements happen again. There are a number of photos of Simon holding a baby, but somehow they are not in the highest-scoring list.

37 Conclusion Proxemics and touch codes for human interaction
In conclusion, we introduce a novel concept to the computer vision literature: proxemics and touch codes for modeling human interactions.

38 Conclusion Proxemics and touch codes for human interaction
Directly recognizing proxemics significantly outperforms the naïve approach We develop a novel model that directly recognizes proxemics, and it significantly outperforms the naïve approach that uses pose estimation as a preprocessing step.

39 Conclusion Proxemics and touch codes for human interaction
Directly recognizing proxemics significantly outperforms the naïve approach Recognizing proxemics helps pose estimation And finally, we show that recognizing proxemics also helps pose estimation.

40 Thank Simon and MSR for the internship
Acknowledgements That’s all. I thank Simon and MSR for giving me this exciting internship experience. Thank Simon and MSR for the internship

41 Thank Anitha for a lot of suggestions
Acknowledgements I thank Anitha for many wonderful suggestions about both my research and life. Thank Anitha for a lot of suggestions

42 Thank Anarb for Getty Images
Acknowledgements I also thank Anarb, my labmate, for telling me about Getty Images, which let me download many useful images for this study. Thank Anarb for Getty Images

43 Thank Eletee for her beautiful smile
Acknowledgements I also thank Eletee, another labmate, for her beautiful smile that fills our lab with sunshine every day. Thank Eletee for her beautiful smile

44 Thank everybody for not falling asleep
Acknowledgements Finally, I thank everybody for not falling asleep during my talk. Thank everybody for not falling asleep

45 Thank you Thank you very much.

46 Articulated Pose Estimation
- Yi Yang & Deva Ramanan, CVPR 2011

47 Inference & Learning
Inference: $\max_{L,M} S(I,L,M)$; for a tree graph $(V,E)$: dynamic programming. Learning: given labeled positives $\{I_n, L_n, M_n\}$ and negatives $\{I_n\}$, write $z_n = (L_n, M_n)$ and $S(I,z) = w \cdot \phi(I,z)$; then solve
$$\min_w \tfrac{1}{2} \|w\|^2 \quad \text{s.t.} \quad \forall n \in \text{pos}: \; w \cdot \phi(I_n, z_n) \ge 1; \quad \forall n \in \text{neg}, \forall z: \; w \cdot \phi(I_n, z) \le -1$$
In order to train our model, we assume a fully supervised training dataset, where we are given positive and negative images with part locations and mixture types. Since the scoring function is linear in its parameters, we can learn the model using an SVM solver. This means we learn the local part templates, the springs, and the mixture co-occurrences simultaneously.

