Self-paced Learning for Latent Variable Models
M. Pawan Kumar, Ben Packer, Daphne Koller, Stanford University
Presented by Zhou Yu
Aim: To learn an accurate set of parameters for latent variable models.
Intuitions from human learning: presenting all the information at once may be confusing and lead to bad local minima, so start with "easy" examples the learner is prepared to handle. Hand-designing such a schedule is task-specific and onerous on the user, and "easy for a human" is not the same as "easy for a computer", just as "easy for Learner A" is not the same as "easy for Learner B". In self-paced learning, the schedule of examples is set automatically by the learner itself.
(Adapted from Kumar's poster)
Latent Variable Models
x : input (observed) variables
y : output variables
h : hidden/latent variables
Why use a latent variable model? Object localization usually requires running a sliding window and then filtering out unreasonable detections in a separate step. A latent variable model integrates the two steps by introducing the hidden structure directly into the objective function; here the hidden structure is the bounding box.
Latent Variable Models
Latent variable models can be used for many tasks: object localization, action recognition, human pose detection. Their advantage is that they integrate a two-step pipeline into a single objective function.
Example: x = entire image, y = "Deer", h = bounding box.
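To make the x/y/h setup concrete, here is a minimal Python sketch of scoring and predicting with a linear latent-variable model for localization. This is not the paper's code: the descriptor, helper names, and default sizes are illustrative assumptions (the paper uses a HOG descriptor extracted from the box).

import numpy as np

def extract_box_features(x, h, feat_dim=16):
    # Toy stand-in for a real descriptor: crop the box h = (x0, y0, x1, y1)
    # and summarize the patch with a coarse intensity histogram.
    # Assumes pixel values of x lie in [0, 1].
    x0, y0, x1, y1 = h
    patch = x[y0:y1, x0:x1]
    hist, _ = np.histogram(patch, bins=feat_dim, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def phi(x, y, h, n_classes, feat_dim=16):
    # Joint feature vector Phi(x, y, h): the box descriptor placed in the
    # block corresponding to class y.
    feats = np.zeros(n_classes * feat_dim)
    feats[y * feat_dim:(y + 1) * feat_dim] = extract_box_features(x, h, feat_dim)
    return feats

def predict(x, w, candidate_boxes, n_classes, feat_dim=16):
    # Inference: jointly pick the label y and box h that maximize w . Phi.
    scored = [(w @ phi(x, y, h, n_classes, feat_dim), y, h)
              for y in range(n_classes) for h in candidate_boxes]
    _, y_hat, h_hat = max(scored)
    return y_hat, h_hat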
Learning Latent Variable Models
Goal: Given D = {(x_1, y_1), …, (x_n, y_n)}, learn parameters w.
Expectation Maximization: maximize the log likelihood
\[
\max_w \sum_i \log P(x_i, y_i; w) = \max_w \sum_i \Big( \log P(x_i, y_i, h_i; w) - \log P(h_i \mid x_i, y_i; w) \Big)
\]
Iterate:
Find the expected value of the hidden variables using the current w.
Update w to maximize the log likelihood subject to this expectation.
How do we solve the latent variable model? The most intuitive way is to use EM, but the problem with EM is that it can easily get stuck in a local optimum.
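As a concrete illustration of the E-step / M-step alternation and its sensitivity to the starting point, here is a tiny EM for a two-component 1-D Gaussian mixture. This is not the paper's model, just a minimal sketch under my own toy assumptions:

import numpy as np

rng = np.random.default_rng(0)
# Toy data: two 1-D Gaussian clusters.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

mu = np.array([0.0, 0.1])     # initial means; a poor initialization can
sigma = np.array([1.0, 1.0])  # leave EM stuck near a local optimum
pi = np.array([0.5, 0.5])     # mixing weights

for _ in range(100):
    # E-step: posterior responsibility of each component for each point
    dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the expected assignments
    nk = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

print(mu, sigma, pi)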
Learning Latent Variable Models
Goal: Given D = {(x_1, y_1), …, (x_n, y_n)}, learn parameters w.
Latent Structural SVM. Solver: concave-convex procedure (CCCP).
So here we introduce the latent structural SVM to solve this latent variable model. The main difference from the standard structural SVM is the joint feature vector Φ(x, y, h); for instance, in our deer model it can be a HOG descriptor extracted from the bounding box h. ŷ is the predicted output given w, and the slack ξ_i can be shown to be an upper bound on the risk. The trouble with the latent structural SVM, compared to the standard one, is that it is a non-convex problem. The classical solver is the concave-convex procedure (CCCP), whose idea is to alternate between imputing the hidden variables and optimizing the remaining parameters.
Example: x = entire image, y = "Deer", h = bounding box.
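For reference, a standard way to write the latent structural SVM objective solved by CCCP (this is the usual formulation from the literature, restated here rather than copied from the slides) is:
\[
\min_{w}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i(w),
\qquad
\xi_i(w) = \max_{\hat{y},\hat{h}}\Big[ w^\top\Phi(x_i,\hat{y},\hat{h}) + \Delta(y_i,\hat{y},\hat{h}) \Big]
\;-\; \max_{h}\ w^\top\Phi(x_i,y_i,h).
\]
Each ξ_i(w) is a difference of two convex (pointwise-max) functions of w, which is why CCCP applies: impute h for the ground-truth label to linearize the concave part, solve the resulting convex structural SVM problem, and repeat until convergence.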
Self-Paced Learning
Now we make a small change to the objective function: we introduce a criterion so that the iterative procedure processes easy samples first and then gradually moves on to hard examples. Here v is the indicator of easiness.
Note: v_i = 1 means example i is easy, v_i = 0 means it is hard.
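The modified objective, written in the general form with r(w) the regularizer and f(x_i, y_i; w) the per-example upper bound on the risk (the exact notation here is my restatement of the paper's formulation), is:
\[
\min_{w,\ v\in[0,1]^n}\ r(w) + \sum_{i=1}^{n} v_i\, f(x_i, y_i; w) \;-\; \frac{1}{K}\sum_{i=1}^{n} v_i .
\]
The last term rewards including examples, so an example is worth including only when its loss is small enough; K controls how small "small enough" has to be.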
Self-Paced Learning
[Figure: easy vs. hard examples over iterations 1-3, comparing CCCP (all at once) with self-paced learning (easy first)]
Here K sets the threshold of easiness: only samples whose loss falls below 1/K are treated as easy, so the larger K is, the easier a sample must be to be selected. In the figure, green means easy and red means hard. Samples far from the margin are predicted with more confidence, so we consider them easy. The two examples below show the difference. The elephant image is easy: already in the first iteration the right location is found. The deer image is hard: CCCP never finds the right location (the red box marks a wrong result), but self-paced learning reaches the desired result in a later iteration.
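Concretely, with w fixed the optimal indicator for each example has a closed form (this follows directly from the objective above):
\[
v_i^{*} =
\begin{cases}
1 & \text{if } f(x_i, y_i; w) < \tfrac{1}{K},\\[2pt]
0 & \text{otherwise,}
\end{cases}
\]
so K really is a threshold on the loss: decreasing K during training raises the threshold 1/K and lets progressively harder examples in.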
Optimization in Self-Paced Learning using ACS
Initialize K to be large.
Iterate:
  Run inference over h.
  Alternately update w and v (ACS, alternate convex search):
    set v by sorting the losses l_i(w) and comparing them to the threshold 1/K;
    perform the normal update for w over the selected subset of data;
  until convergence.
  Anneal K ← K/μ.
Until all v_i = 1 and the objective cannot be reduced within tolerance.
Because the relaxed problem is biconvex, we can alternately optimize w (the model) and v (which determines which samples are considered easy or hard). Note that μ is a constant here, which is not entirely reasonable; a more reasonable scheme might be based on evaluating each sample's distance to the margin, so that a sample whose loss drops quickly enough can be reclassified as easy.
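A minimal Python sketch of this outer loop (the helper names losses_fn and update_w_fn, the toy least-squares demo, and all hyperparameter values are my own illustrative assumptions, not the paper's code):

import numpy as np

def self_paced_train(losses_fn, update_w_fn, w0, K0=1.0, mu=1.3,
                     max_outer=50, tol=1e-4):
    # losses_fn(w): per-example losses l_i(w).
    # update_w_fn(w, v): normal parameter update restricted to v_i = 1 examples.
    w, K = w0, K0
    for _ in range(max_outer):
        prev_obj = np.inf
        while True:                                      # ACS: alternate v and w
            v = (losses_fn(w) < 1.0 / K).astype(float)   # easy iff loss < 1/K
            w = update_w_fn(w, v)                        # update on the easy subset
            obj = (losses_fn(w) * v).sum() - v.sum() / K  # regularizer omitted here
            if prev_obj - obj < tol:                     # inner loop converged
                break
            prev_obj = obj
        if v.all():                                      # every example included: stop
            break
        K /= mu                                          # anneal: raise threshold 1/K
    return w

# Toy usage: 1-D robust regression where a few outliers play the "hard" examples.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
y[:10] += 5.0                                            # hard (outlier) examples

losses_fn = lambda w: (X @ w - y) ** 2
def update_w_fn(w, v):
    Xv = X * v[:, None]                                  # weighted least squares
    return np.linalg.pinv(Xv.T @ X) @ (Xv.T @ y)

print(self_paced_train(losses_fn, update_w_fn, np.zeros(3)))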
Initialization
How do we get w_0? Initially set v_i = 1 for all samples and run the original CCCP on the latent structural SVM for a fixed number of iterations T_0.
Concern: it seems unreasonable to use a model we believe is not good as the initial value, and different initializations can lead to different final performance. In practice, however, the CCCP result, which the authors themselves consider not good, is used as the initialization, which is something of a chicken-and-egg problem.
Experiment: Object Localization
6 different mammals (approximately 45 images per mammal).
[Figure: images ranked from easy to hard under CCCP and under self-paced learning]
Experiment: Pascal VOC 2007
5 categories out of 20, with a random sample of 50 percent of the data; performance measured by average precision (AP).
A: Use human-labeled information to decide which samples are easy (non-truncated, non-occluded objects are easy) and use this as the initialization of the self-paced learning model.
B: Use the CCCP result as the initialization of the self-paced learning model.
C: CCCP.
Experiment: Some random cat images from Google.
[Figure: original images and localization results after 10 iterations]
Conclusion
Latent variable models and the latent structural SVM apply to many problems, e.g. object detection, human pose estimation, human action recognition, tracking.
Initialization remains an open issue: what is a good initialization? Maybe multiple initializations?