
1 RET Computer Vision: Predicting the location and actions in an online video. Christopher Stanley

2 Goal: Predicting the Where and What of actors and actions through Online Action Localization (Figure 1)

3 Conditional probability review
We start from the definition of conditional probability, P(A|B) = P(A and B) / P(B). We are interested in finding the reverse conditional probability P(B|A), so we apply some basic algebra to solve for that probability. We end up with the following equation: P(B|A) = P(A|B) P(B) / P(A). We'll work an example to show the formula in action.

4 Cards example
If a card drawn from a standard deck is a King, what's the probability that it's a heart? We're going to do this in a tedious way, with the intention of showing that the formula does work. This example leads to a more complicated problem using the same ideas. First set up the condition backwards: P(King|Heart) = 1/13, with P(Heart) = 13/52 and P(King) = 4/52. Solving for P(Heart|King) we get P(Heart|King) = P(King|Heart) P(Heart) / P(King) = (1/13)(13/52) / (4/52) = (1/52) / (4/52) = 1/4.
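
A quick sanity check of this arithmetic, sketched in standard-library Python:

    from fractions import Fraction

    # Probabilities from a standard 52-card deck
    p_heart = Fraction(13, 52)             # 13 hearts in the deck
    p_king = Fraction(4, 52)               # 4 kings in the deck
    p_king_given_heart = Fraction(1, 13)   # one king among the 13 hearts

    # Bayes' theorem: P(Heart|King) = P(King|Heart) * P(Heart) / P(King)
    p_heart_given_king = p_king_given_heart * p_heart / p_king
    print(p_heart_given_king)  # prints 1/4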

5 Face detection: review of Bayes' theorem
Example of how Bayes' theorem can be used in determining if an image has a face. Obtain 50,000 images: 25,000 images have faces and 25,000 do not, so P(Face) = P(No Face) = 25,000/50,000 = 1/2. (Figure 2)

6 Face detection continued
What we know:
- There are 25,000 facial images.
- There are 25,000 non-facial images.
- The computer randomly generates 100,000 feature wavelets (pattern boxes). Some of these wavelets have features that match a human face; visually they look like common QR codes.
- Each pattern box is tested in every image, and we record whether the pattern is found in that image.

7 Reverse conditioning (Bayes' theorem)
We have images labeled as either faces or non-faces; that is given. So we can measure P(Xi|Face) and P(Xi|No Face) directly from the data, but what we need is the reverse: P(Face|Xi) and P(No Face|Xi). Use algebra (Bayes' theorem) to obtain
P(Face|Xi) = P(Xi|Face) P(Face) / P(Xi) and P(No Face|Xi) = P(Xi|No Face) P(No Face) / P(Xi).

8 Facial recognition continued
There are 100,000 random features, and each feature is independent of the others; let's call them X1, X2, ..., X100000. To estimate the chance that an image has a face, we'll start by finding P(Xi|Face), the probability that an image with a face will have the ith feature pattern. P(Face) is the probability that an image has a face, and P(Xi) is the probability that an image has the ith random feature.

9 Facial recognition continued: two examples, X1 and X2
Since X1's pattern resembles the eyes of a human, it shows up in 22,000 of the 25,000 facial images and in 5,000 of the 25,000 non-facial images we have, so P(X1|Face) = 22,000/25,000 = 0.88 and P(X1|No Face) = 5,000/25,000 = 0.20. X2's pattern doesn't seem to correspond to a facial feature; it showed up in 2,500 of the facial images and 6,500 of the non-facial images, so P(X2|Face) = 0.10 and P(X2|No Face) = 0.26. A likelihood is found for every random pattern, and since the patterns are independent, all the probabilities can be multiplied together. Comparing this result against the corresponding product for the non-face class forms a number that can be programmed as an algorithm that detects faces in images.
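
A minimal sketch of these calculations in Python, assuming equal priors of 1/2 for the face and non-face classes (matching the 25,000/25,000 split above):

    # Likelihoods from the counts on this slide
    p_x1_face, p_x1_noface = 22_000 / 25_000, 5_000 / 25_000    # 0.88, 0.20
    p_x2_face, p_x2_noface = 2_500 / 25_000, 6_500 / 25_000     # 0.10, 0.26
    p_face = p_noface = 0.5                                     # equal priors

    # Posterior for a single feature: P(Face|X1) via Bayes' theorem
    posterior = (p_x1_face * p_face) / (p_x1_face * p_face + p_x1_noface * p_noface)
    print(round(posterior, 3))  # ~0.815: seeing X1 makes "face" much more likely

    # Independence lets us multiply features: compare class scores when
    # both X1 and X2 are observed in the same image
    face_score = p_x1_face * p_x2_face * p_face          # 0.044
    noface_score = p_x1_noface * p_x2_noface * p_noface  # 0.026
    print("face" if face_score > noface_score else "no face")  # prints "face"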

10 Goal of the program: Predicting the Where and What of actors and actions through Online Action Localization

11 Basics of an image, or frames of a video
Before learning how the program locates and predicts the action in a video, we'll discuss some basic components of a video's frames. A video is made up of multiple frames. For instance, if a video records at 30 frames per second, then you'd have 30 frames for every second of video recorded to work with; if the video records for 10 seconds, we'd have 300 unique frames to analyze, with very little difference from frame to frame. Each frame can be thought of as an individual image, and every frame is made up of multiple pixels. Each pixel has a numbering system that represents the color of that particular pixel. In the next few slides, we'll discuss the pixels in an image and how the number system that determines the color works.

12 How do we determine the total number of pixels in an image or frame of a video?
The resolution of an image represents how many total pixels the image has; it is given as the number of pixels in width and height. For example, the image to the left has a 750 x 1334 resolution. Therefore, there are 750 pixels across horizontally and 1334 pixels up and down vertically. We can multiply those two numbers to obtain the total number of pixels in the image, so this picture has 750 x 1334 = 1,000,500, roughly one million pixels total. More pixels means a higher resolution, and therefore a clearer image, but an image with more pixels will take up more storage.

13 How the computer labels colors in an image
Each pixel in an image is given a number triple (B,G,R) that represents what color that pixel is. Each channel ranges from 0 to 255; the higher the number, the more of that respective color the pixel contains. For example, a pixel labeled (0,0,255) is red and a pixel labeled (255,0,0) is blue. Why is this numbering system important in our program? Computer programs can be written to group similar numbers (similar colors) together: the beginnings of teaching the computer how to separate the important features in the frame from the less important background, and ultimately how to locate and predict actions.
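
For example, OpenCV stores color images in exactly this (B,G,R) channel order; a small sketch (the file name is a placeholder):

    import cv2  # OpenCV loads color images in (B, G, R) channel order

    img = cv2.imread("frame.jpg")  # placeholder path to one video frame
    height, width, channels = img.shape
    print(f"{width} x {height} = {width * height:,} pixels")

    b, g, r = img[0, 0]  # the (B, G, R) values of the top-left pixel
    print(f"Top-left pixel: B={b}, G={g}, R={r}")  # (0, 0, 255) would be pure red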

14 Superpixels
In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels), also known as superpixels. The goal of segmentation is to simplify and change the representation of an image into something that is more meaningful and easier to analyze. Superpixels are grouped together by color and brightness, and once an image is partitioned into superpixels, it's easier to locate objects and boundaries in the image. (Figure 3)
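
One common way to compute superpixels is the SLIC algorithm; here is a sketch using scikit-image (an assumption on my part, since the slide doesn't name the exact method used):

    import numpy as np
    from skimage import io, segmentation

    img = io.imread("frame.jpg")  # placeholder path
    # Partition the image into roughly 200 superpixels grouped by color and position
    labels = segmentation.slic(img, n_segments=200, compactness=10)

    # Draw the superpixel boundaries for inspection
    outlined = segmentation.mark_boundaries(img, labels)  # float image in [0, 1]
    io.imsave("superpixels.png", (outlined * 255).astype(np.uint8))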

15 Goal of the program: Predicting the Where and What of actors and actions through Online Action Localization

16 Steps for the first 5 frames (test video)
- Frame 1: Pose -> Superpixels -> build Appearance Model -> Bayes' theorem -> Location -> Prediction
- Frame 2: Pose -> Superpixels -> use same Appearance Model -> Bayes' theorem -> Location -> new Prediction
- Frame 3: Pose -> Superpixels -> use same Appearance Model -> Bayes' theorem -> Location -> new Prediction
- Frame 4: Pose -> Superpixels -> use same Appearance Model -> Bayes' theorem -> Location -> new Prediction
- Frame 5: Pose -> Superpixels -> create new Appearance Model -> Bayes' theorem -> Location -> new Prediction
Continue this process, creating a new appearance model every 5th frame.
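
A high-level loop sketch of this schedule; every function name here (estimate_pose, extract_superpixels, and so on) is a hypothetical placeholder for a step described on other slides:

    def process_video(frames):
        model = None
        for i, frame in enumerate(frames):
            pose = estimate_pose(frame)               # slide 19
            superpixels = extract_superpixels(frame)  # slides 14 and 19
            if i % 5 == 0:  # refresh the appearance model every 5th frame
                model = build_appearance_model(frame, superpixels, pose)  # slides 20-21
            confidence = bayes_confidence(superpixels, pose, model)  # slide 18
            location = localize(confidence)            # heat map, slides 32-33
            prediction = predict_action(location)      # SVM scores, slide 34
            yield location, prediction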

17 Predicting the action takes place in a series of steps, illustrated below. (Figure 4)

18 How does this project relate to statistics? Bayes' theorem
Bayes' theorem is used to create a confidence score for each superpixel in each frame. This score calculates the likelihood that any given pixel in the frame is "important to the story", given information that we knew from the past frames and the current frame. Once every pixel receives a confidence score, a heat map is generated, showing darker colors for more important pixels and lighter colors for less important pixels. The confidence scores are calculated with the Bayes theorem formula: for a pixel location X, superpixel evidence s, and pose evidence p (treated as independent),
P(X|s,p) = P(s|X) P(p|X) P(X) / (P(s) P(p)).
These calculations are implemented as new code in the program.

19 Input video: extract superpixels and pose estimation
Once the video is started, computer code extracts superpixels from each frame as well as a pose estimation. This information is used later in Bayes' rule to create the confidence scores for the heat map; this step is where the P(s) and P(p) terms in the Bayes formula are generated. (Figure 5)

20 Appearance model
The appearance model is a very important part of the process for locating the action in the video after each frame. After the appearance model is constructed, it compares results from within the model with each of the next 4 frames. The results generate the P(s|X) and P(p|X) terms of the Bayes theorem formula. A new appearance model is created every 5th frame to take into account major changes in the video. (Figure 6)

21 Appearance model construction
How is the appearance model constructed?
- Based on the initially extracted superpixels, clusters of superpixels are created according to the colors of those superpixels. Each cluster may contain a different number of superpixels.
- There is also a bounding box, created from the previous pose-estimation information; it's the yellow outline of the player.
In this appearance model there could be 5 clusters, where each cluster is a group of superpixels with similar colors.

22 Clustering superpixels: example of clusters that could be formed
- Cluster 1: each superpixel is made up primarily of brown colors
- Cluster 2: each superpixel is made up primarily of green colors
- Cluster 3: each superpixel is made up primarily of white colors
- Cluster 4: each superpixel is made up primarily of tan colors
- Cluster 5: each superpixel is made up primarily of red colors

23 K-means
(Scatter diagram of the five clusters: 1 brown, 2 green, 3 white, 4 tan, 5 red, each with a marked centroid.) Every point represents an average color from one superpixel within that cluster.
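
A sketch of this clustering step using scikit-learn's KMeans (one reasonable choice; the random colors below are placeholders for real superpixel statistics):

    import numpy as np
    from sklearn.cluster import KMeans

    # One average (B, G, R) color per superpixel, shape (n_superpixels, 3)
    rng = np.random.default_rng(0)
    mean_colors = rng.integers(0, 256, size=(120, 3)).astype(float)

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(mean_colors)
    print(kmeans.labels_[:10])      # cluster assignment of the first 10 superpixels
    print(kmeans.cluster_centers_)  # the centroid color of each cluster

    # A cluster's "radius" (used on the next slides) can be taken as the largest
    # distance from its centroid to any member superpixel -- an assumption here.
    radii = [np.linalg.norm(mean_colors[kmeans.labels_ == k] - c, axis=1).max()
             for k, c in enumerate(kmeans.cluster_centers_)]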

24 Superpixel-based foreground likelihood: finding confidence scores for each superpixel
The function below is used to find P(s|X). This formula produces a confidence score for each superpixel in a frame; once all scores are found, a heat map is generated.

25 The first term of the function calculates the distance from every centroid in the appearance model to a superpixel in the current frame (frames 1-4 share one model). Once the smallest distance is determined, that number gets divided by the radius of that cluster, and the result is multiplied by that cluster's percent of overlap with the bounding box. This process happens for every superpixel in each frame, and every superpixel is assigned a confidence based on that result.
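
A hedged sketch of that computation: centroids and radii come from the k-means sketch above, overlaps (each cluster's fraction of overlap with the bounding box) is assumed given, and the exp() step converting distance to similarity is an assumption, since the slide's formula image was not preserved:

    import numpy as np

    def superpixel_likelihood(sp_color, centroids, radii, overlaps):
        # Distance from this superpixel's average color to every cluster centroid
        dists = np.linalg.norm(centroids - sp_color, axis=1)
        k = int(np.argmin(dists))          # the nearest cluster
        normalized = dists[k] / radii[k]   # small when the color fits the cluster well
        # Weight by that cluster's overlap with the bounding box; exp(-d) turns
        # the normalized distance into a similarity (an assumption, see above)
        return overlaps[k] * np.exp(-normalized)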

26 Pose-based foreground likelihood
The function below is used to find P(p|X). (Figure 7)

27 Pose-based foreground likelihood estimation procedure
For frame i, P(pose|X) asks: given a pixel location X, what's the probability the pose is at that location? Let (x,y) be the center of the bounding box. The x and y values are assumed to follow a normal curve, so here we have a 2-dimensional Gaussian over the image, whose peak sits at the bounding-box center (x,y).
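
A sketch of that 2-D Gaussian; the spreads sigma_x and sigma_y are assumptions (for instance, they could be tied to the bounding-box size):

    import numpy as np

    def pose_likelihood(width, height, cx, cy, sigma_x, sigma_y):
        # A 2-D Gaussian over the frame, peaked at the box center (cx, cy)
        xs, ys = np.meshgrid(np.arange(width), np.arange(height))
        return np.exp(-((xs - cx) ** 2 / (2 * sigma_x ** 2) +
                        (ys - cy) ** 2 / (2 * sigma_y ** 2)))

    lik = pose_likelihood(640, 480, cx=320, cy=240, sigma_x=40, sigma_y=80)
    print(lik[240, 320], lik[0, 0])  # ~1.0 at the center, ~0 in the corner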

28 Pose-based foreground likelihood estimation procedure (continued)
Analysis of two pixel locations, X1 and X2. P(pose|X1) is close to 1: there is a good chance that the pose is part of the X1 location. P(pose|X2) is very small: the probability that the pose is part of pixel location X2 is very small.

29 P(X): the probability that a location X in an image has a person in it (tracking locations as time varies)
Procedure:
- In the bounding box of the first frame, find the middle (x,y) coordinate.
- Assume that the x and y values follow a normal distribution.
- As time increases, we expect the middle of the bounding box to move depending on the action, so the middle coordinates are tracked in every frame and modeled linearly.
- Around each frame's center (x,y), a 2-dimensional Gaussian model is placed.
(Figure 8)

30 P(X) in Bayes' theorem
(Diagram: box centers (x1,y1) through (x6,y6) for frames 1 through 6.) Each center (x,y) is modeled linearly over time, and at each center a Gaussian distribution is applied. In this example, we see small changes in the x values and almost no change in the y values. A pixel location (x,y) farther from the middle falls farther away (in standard deviations) from the mean of the Gaussian model, creating a smaller value for P(X), while pixels near the middle are closer to the mean, generating a larger value for P(X).
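
A sketch of this prior, reusing pose_likelihood from the earlier sketch: fit each center coordinate linearly in time, extrapolate the next center, and place a Gaussian there (the example centers and sigmas are made up):

    import numpy as np

    frames = np.array([1, 2, 3, 4, 5])
    cx = np.array([100, 104, 109, 113, 118])  # example box-center x per frame
    cy = np.array([240, 240, 241, 240, 240])  # example box-center y per frame

    # Fit x(t) and y(t) with straight lines and predict frame 6's center
    fx = np.polyfit(frames, cx, 1)
    fy = np.polyfit(frames, cy, 1)
    next_cx, next_cy = np.polyval(fx, 6), np.polyval(fy, 6)

    # P(X) for frame 6: a 2-D Gaussian around the predicted center
    prior = pose_likelihood(640, 480, next_cx, next_cy, sigma_x=30, sigma_y=30)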

31 P(Superpixel) and P(Pose)
P(s) and P(p) in the Bayes theorem are fixed numbers, the same for every pixel location X. Since these numbers do not change, they have no effect on how the Bayes theorem probability varies from pixel to pixel.

32 Heat map
A heat map is generated for each frame. The colors in the map are determined by the confidence scores calculated for each superpixel from the Bayes theorem calculations. Blue portions of the heat map indicate a lower confidence score, while higher scores produce redder colors in the image. The location of the action should show up darker (redder) in the image. (Figure 9, Figure 10)
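
A sketch of rendering such a map with matplotlib's "jet" colormap (blue for low scores, red for high); the random scores are placeholders for real Bayes confidences:

    import numpy as np
    import matplotlib.pyplot as plt

    confidence = np.random.rand(480, 640)  # placeholder per-pixel confidence scores
    plt.imshow(confidence, cmap="jet")     # blue = low confidence, red = high
    plt.colorbar(label="confidence score")
    plt.savefig("heatmap.png")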

33 Action localization
Once the heat map is generated, the darker colors on the heat map localize the action. The scores generated from Bayes theorem give the probability that a pixel is important to the story, given known information from past and current frames.

34 An action prediction is made after every frame
After the action is localized with the heat map, we move on to predict the action. The action label is predicted within the localized bounding box through dynamic programming, using scores from Support Vector Machines (SVMs) [1]. The computer trains an SVM on previous videos, with action scores computed over 1-second intervals (0 to 1, 1 to 2, etc.). Then dynamic programming is performed to match the scores from the new video against the SVM scores; when there is a higher frequency of matches, the prediction is made.
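
A rough sketch of the per-interval SVM scoring, using scikit-learn as a stand-in for the paper's SVM setup (the features, labels, and shapes here are all placeholders, and the dynamic-programming match over these scores is not shown):

    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(200, 64))     # one feature vector per 1-second interval
    y_train = rng.integers(0, 10, size=200)  # one of 10 action labels each

    clf = LinearSVC().fit(X_train, y_train)

    # Score every action class for each 1-second interval of the new video;
    # dynamic programming would then run over this score matrix
    new_intervals = rng.normal(size=(8, 64))
    scores = clf.decision_function(new_intervals)  # shape (8 intervals, 10 classes)
    print(scores.argmax(axis=1))  # the best-matching action per interval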

35 Analysis of the action prediction
Since the program makes a prediction after each frame, we can analyze the accuracy of the predictions as time varies. We'd expect the accuracy to increase as more of the video is played. The figure below displays the accuracy of the predictions: after the computer reads the entire video, its accuracy is roughly 70%. (Figure 11)

36 Detailed accuracy for each sport (Figure 12)

37 Citations
[1] Khurram Soomro, Haroon Idrees, Mubarak Shah (2016). Predicting the Where and What of Actors and Actions through Online Action Localization. Retrieved from crcv.ucf.edu/papers/cvpr2016/soomro_CVPR2016.pdf
Images: Figures 1, 4, 5, 7-12: screenshots retrieved from Khurram Soomro's training PowerPoint. Figure 2: six screenshots taken from web image searches.

38 Images continued
Figure 2 continued. Figure 3: screenshot. Figure 6: screenshot.

