
1 Advanced Computer Vision (Module 5F16) Carsten Rother Pushmeet Kohli

2 Syllabus (updated)
L1&2: Intro
– Probabilistic models
– Different approaches for learning
– Generative/discriminative models, discriminative functions
L3&4: Labelling Problems in Computer Vision
– Graphical models
– Expressing vision problems as labelling problems
L5&6: Optimization
– Message passing (BP, TRW)
– Submodularity and graph cuts
– Move-making algorithms (Expansion/Swap/Range/Fusion)
– LP relaxations
– Dual decomposition

3 Syllabus (updated)
L7&8 (8.2): Optimization and Learning
– Compare max-margin vs. maximum likelihood
L9&10 (15.2): Case Studies – tbd … Decision Trees and Random Fields, Kinect person detection
L11&12 (22.2): Optimization Comparison, Case Studies (tbd)

4 Books
1. Advances in Markov Random Fields for Computer Vision. MIT Press, 2011. (Edited by Andrew Blake, Pushmeet Kohli and Carsten Rother)
2. Pattern Recognition and Machine Learning. Springer, 2006, by Chris Bishop
3. Structured Learning and Prediction in Computer Vision. Sebastian Nowozin and Christoph H. Lampert; Foundations and Trends in Computer Graphics and Vision, now publishers, 2011.
4. Computer Vision. Springer, 2010, by Rick Szeliski

5 A gentle Start: Interactive Image Segmentation and Probabilities

6 Probabilities
Probability distribution P(x): ∑_x P(x) = 1, P(x) ≥ 0; discrete x ∈ {0,…,L}
Joint distribution: P(x,z)
Conditional distribution: P(x|z)
Sum rule: P(x) = ∑_z P(x,z)
Product rule: P(x,z) = P(x|z) P(z)
Bayes' rule: P(x|z) = P(z|x) P(x) / P(z)
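
A small illustration of these rules (not from the slides): the joint table P(x,z) below is made up for binary x and z, and the sum rule, product rule and Bayes' rule are checked numerically.

```python
import numpy as np

# hypothetical joint distribution P(x,z); rows indexed by x, columns by z
P_xz = np.array([[0.30, 0.10],    # x = 0
                 [0.20, 0.40]])   # x = 1
assert np.isclose(P_xz.sum(), 1.0)

P_x = P_xz.sum(axis=1)              # sum rule:     P(x) = sum_z P(x,z)
P_z = P_xz.sum(axis=0)              #               P(z) = sum_x P(x,z)
P_x_given_z = P_xz / P_z            # product rule: P(x|z) = P(x,z) / P(z)
P_z_given_x = P_xz / P_x[:, None]   #               P(z|x) = P(x,z) / P(x)

# Bayes' rule: P(x|z) = P(z|x) P(x) / P(z)
bayes = P_z_given_x * P_x[:, None] / P_z
assert np.allclose(bayes, P_x_given_z)
```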

7 Interactive Segmentation
Goal: given an image z = (R,G,B)^n, infer the unknown variables x ∈ {0,1}^n.
Posterior probability: P(x|z) = P(z|x) P(x) / P(z) ∝ P(z|x) P(x)
– P(z|x): likelihood (data-dependent)
– P(x): prior (data-independent)
– P(z): constant with respect to x
Maximum a posteriori (MAP): x* = argmax_x P(x|z)
We will express this as an energy minimization problem: x* = argmin_x E(x)
(User-specified pixels are fixed and not optimized over.)

8 Likelihood
P(x|z) ∝ P(z|x) P(x)
[Figure: foreground and background colour distributions plotted in the Red-Green plane]

9 Likelihood
P(x|z) ∝ P(z|x) P(x)
Maximum likelihood: x* = argmax_x P(z|x) = argmax_x ∏_i P(z_i|x_i)
[Figure: per-pixel likelihoods p(z_i|x_i=0) and p(z_i|x_i=1)]

10 Prior
P(x|z) ∝ P(z|x) P(x)
P(x) = 1/f ∏_{i,j ∈ N4} θ_ij(x_i,x_j)
f = ∑_x ∏_{i,j ∈ N4} θ_ij(x_i,x_j)   "partition function"
θ_ij(x_i,x_j) = exp{-|x_i-x_j|}   "Ising prior"   (exp{-1} ≈ 0.37; exp{0} = 1)

11 Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: best and worst solutions sorted by probability]
"A smoothness prior needs the likelihood."

12 Prior – 4x4 Grid: Distribution
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: probability of all 2^16 configurations, with samples]
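
A minimal sketch (not from the slides) that reproduces this 4x4 Ising-prior experiment by brute force: it enumerates all 2^16 labellings, evaluates the unnormalized probability ∏ exp{-w|x_i-x_j|} over the 4-connected edges, normalizes by the partition function f, and sorts. The function name and the weight parameter w are my own; w = 1 corresponds to this slide, w = 10 to slides 13/14.

```python
import itertools
import numpy as np

def ising_prior_distribution(w=1.0, size=4):
    # 4-connected edges of a size x size grid
    edges = [((r, c), (r, c + 1)) for r in range(size) for c in range(size - 1)]
    edges += [((r, c), (r + 1, c)) for r in range(size - 1) for c in range(size)]

    configs, scores = [], []
    for bits in itertools.product([0, 1], repeat=size * size):
        x = np.array(bits).reshape(size, size)
        # number of disagreeing neighbour pairs
        disagreements = sum(abs(x[a] - x[b]) for a, b in edges)
        configs.append(x)
        scores.append(np.exp(-w * disagreements))

    f = sum(scores)                      # partition function
    probs = np.array(scores) / f
    order = np.argsort(-probs)           # best solutions first
    return [configs[i] for i in order], probs[order]

configs, probs = ising_prior_distribution(w=1.0)
print("best labelling:\n", configs[0], "with P =", probs[0])
print("worst labelling:\n", configs[-1], "with P =", probs[-1])
```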

13 Prior – 4x4 Grid
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-10|x_i-x_j|}
[Figure: best and worst solutions sorted by probability]

14 Prior – 4x4 Grid: Distribution
Pure prior model: P(x) = 1/f ∏_{i,j ∈ N4} exp{-10|x_i-x_j|}
[Figure: probability of all 2^16 configurations, with samples]

15 Putting it together…
Joint: P(x,z) = P(z|x) P(x)   (… let us look at this later)
Posterior: P(x|z) = P(z|x) P(x) / P(z)
P(x|z) = 1/P(z) · 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|} · ∏_i p(z_i|x_i)
Rewriting it…
P(x|z) = 1/f(z) exp{-( ∑_{i,j ∈ N4} |x_i-x_j| + ∑_i -log p(z_i|x_i) )}
       = 1/f(z) exp{-( ∑_{i,j ∈ N4} |x_i-x_j| + ∑_i [-log p(z_i|x_i=0)(1-x_i) - log p(z_i|x_i=1) x_i] )}
       = 1/f(z) exp{-E(x,z)}   "Gibbs distribution"
with f(z) = ∑_x exp{-E(x,z)}

16 Gibbs Distribution is more general
The Gibbs distribution does not have to decompose into prior and likelihood:
P(x|z) = 1/f(z) exp{-E(x,z)}   with f(z) = ∑_x exp{-E(x,z)}
Energy: E(x,z) = ∑_i θ_i(x_i,z) + w ∑_{i,j} θ_ij(x_i,x_j,z) + ∑_{i,j,k} θ_ijk(x_i,x_j,x_k,z) + …
– Unary term: "encoded our dependency on the data"
– Pairwise term: "encoded our prior knowledge over labellings"
– Higher-order terms
In our case: θ_i(x_i,z_i) = -log p(z_i|x_i=0)(1-x_i) - log p(z_i|x_i=1) x_i,   θ_ij(x_i,x_j) = |x_i-x_j|
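
A minimal sketch (assumed, not from the slides) of the energy E(x,z) = ∑_i θ_i(x_i,z_i) + w ∑_{ij ∈ N4} |x_i-x_j| for the binary case. The function name and the p_fg/p_bg likelihood callables are my own; they stand in for the per-pixel likelihoods p(z_i|x_i=1) and p(z_i|x_i=0).

```python
import numpy as np

def energy(x, z, w, p_fg, p_bg):
    """x: HxW binary labelling; z: HxW observations;
    p_fg(z), p_bg(z): per-pixel likelihoods p(z_i|x_i=1), p(z_i|x_i=0)."""
    eps = 1e-12
    # unary term: -log p(z_i|x_i=1) x_i - log p(z_i|x_i=0) (1-x_i)
    unary = -np.log(p_fg(z) + eps) * x - np.log(p_bg(z) + eps) * (1 - x)
    # pairwise Ising term over the 4-connected grid: count disagreeing neighbours
    pairwise = np.sum(np.abs(x[:, 1:] - x[:, :-1])) + np.sum(np.abs(x[1:, :] - x[:-1, :]))
    return np.sum(unary) + w * pairwise
```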

17 Energy Minimization
P(x|z) = 1/f(z) exp{-E(x,z)}   with f(z) = ∑_x exp{-E(x,z)}
-log P(x|z) = -log(1/f(z)) + E(x,z)
Maximum a posteriori (MAP) solution: x* = argmax_x P(x|z) = argmin_x E(x,z)
The minimum-energy solution is the same as the MAP solution.
[Figure: energy landscape with the global minimum of E marked as the MAP solution]

18 Recap
Posterior, likelihood, prior: P(x|z) = P(z|x) P(x) / P(z)
Gibbs distribution: P(x|z) = 1/f(z) exp{-E(x,z)}
Energy minimization is the same as MAP estimation: x* = argmax_x P(x|z) = argmin_x E(x)

19 Weighting of Unary and Pairwise Terms
E(x,z,w) = ∑_i θ_i(x_i,z_i) + w ∑_{i,j} θ_ij(x_i,x_j)
[Figure: segmentation results for w = 0, 10, 40, 200]

20 Learning versus Optimization/Prediction
Gibbs distribution: P(x|z,w) = 1/f(z,w) exp{-E(x,z,w)}
Training phase: infer w, which does not depend on a test image z: {x_t, z_t} => w
Testing phase: infer x, which does depend on the test image z: z, w => x
[Figure: training pairs (z_t, x_t) and a new test image z]

21 A Simple Procedure to Learn w
1. Iterate w = 0,…,400:
   a. Compute x*_t for all training images {x_t, z_t}
   b. Compute the average error Er = 1/|T| ∑_t Δ(x_t, x*_t), with the loss function Δ being the Hamming error (number of misclassified pixels)
2. Take the w with smallest Er (see the sketch below)
Questions:
– Is this the best and only way?
– Can we over-fit to the training data?
[Figure: error Er as a function of w]
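
A minimal sketch (assumed, not from the slides) of this exhaustive search. The predict(z, w) routine is a hypothetical helper that returns the MAP labelling x* for weight w, e.g. by the brute-force enumeration sketched earlier.

```python
import numpy as np

def hamming_error(x_true, x_pred):
    return np.sum(x_true != x_pred)          # number of misclassified pixels

def learn_w_exhaustive(train_pairs, predict, w_values=np.linspace(0, 400, 81)):
    """train_pairs: list of (x_t, z_t); predict(z, w) -> x* (hypothetical helper)."""
    best_w, best_err = None, np.inf
    for w in w_values:
        err = np.mean([hamming_error(x_t, predict(z_t, w)) for x_t, z_t in train_pairs])
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```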

22 Big Picture: Statistical Models in Computer Vision
Model:
– Discrete or continuous variables?
– Discrete or continuous space?
– Dependence between variables?
– …
Optimisation/Prediction/Inference:
– Combinatorial optimization: e.g. graph cut
– Message passing: e.g. BP, TRW
– Iterated Conditional Modes (ICM)
– LP relaxation: e.g. cutting-plane
– Problem decomposition + subgradient
– …
Learning:
– Maximum likelihood learning
– Pseudo-likelihood approximation
– Loss-minimizing parameter learning
– Exhaustive search
– Constraint generation
– …
Applications:
– 2D/3D image segmentation
– Object recognition
– 3D reconstruction
– Stereo matching
– Image denoising
– Texture synthesis
– Pose estimation
– Panoramic stitching
– …

23 Machine Learning View: Structured Learning and Prediction
"Normal" machine learning: f : Z → ℕ (classification), f : Z → ℝ (regression)
– Input: image, text; Output: real number(s)
Structured output prediction: f : Z → X
– Input: image, text; Output: a complex structured object (labelling, parse tree)
[Figure: examples – parse tree of a sentence, image labelling, chemical structure]

24 Structured Output
Ad hoc definition (from [Nowozin et al. 2011]): data that consists of several parts, where not only the parts themselves contain information, but also the way in which the parts belong together.

25 Learning: A Simple Toy Problem
Label generation: a 2x2 foreground (white) square at an arbitrary position, with small deviations.
Data generation:
1. Foreground pixels are white, background pixels black
2. Flip the label of a few random pixels
3. Add some Gaussian noise
Example: man-made object detection [Nowozin and Lampert 2011]

26 A Possible Model for the Data
Ising model on a 4x4 grid graph:
P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i) + (1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
– Unary term: ∑_i (z_i(1-x_i) + (1-z_i)x_i)
– Pairwise term: w ∑_{i,j ∈ N4} |x_i-x_j|
[Figure: example data z and label x]

27 Decision Theory
Assume w has been learned and P(x|z,w) is given.
Which solution x* would you choose?
[Figure: distribution over all 2^16 configurations; best and worst solutions sorted by probability]

28 How to Make a Decision
Assume the model P(x|z,w) is known.
The risk R is the expected loss: R = ∑_x P(x|z,w) Δ(x,x*), where Δ is the loss function.
Goal: choose the x* which minimizes the risk R.

29 Decision Theory
0/1 loss: Δ(x,x*) = 0 if x* = x, 1 otherwise
Risk: R = ∑_x P(x|z,w) Δ(x,x*)
The risk is minimized by the MAP solution: x* = argmax_x P(x|z,w)
[Figure: best and worst solutions sorted by probability]

30 Decision Theory
Risk: R = ∑_x P(x|z,w) Δ(x,x*)
Hamming loss: Δ(x,x*) = ∑_i [x_i ≠ x_i*]   (number of misclassified pixels)
The risk is minimized by maximizing the marginals: x_i* = argmax_{x_i} P(x_i|z,w)
[Figure: best and worst solutions sorted by probability]

31 Decision Theory
Maximize marginals: x_i* = argmax_{x_i} P(x_i|z,w)
Marginal: P(x_i = k) = ∑_{x_j, j≠i} P(x_1,…,x_i = k,…,x_n)
Computing marginals is sometimes called "probabilistic inference", as opposed to MAP inference.
[Figure: best and worst solutions sorted by probability]
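
A minimal sketch (assumed, not from the slides) contrasting the two decisions on the 4x4 toy model: the MAP labelling (optimal for the 0/1 loss) and the max-marginal labelling (optimal for the Hamming loss). The posterior P(x|z,w) is computed by brute-force enumeration with the unary term from slide 26; all function names are my own.

```python
import itertools
import numpy as np

def enumerate_posterior(z, w, size=4):
    edges = [((r, c), (r, c + 1)) for r in range(size) for c in range(size - 1)]
    edges += [((r, c), (r + 1, c)) for r in range(size - 1) for c in range(size)]
    configs, energies = [], []
    for bits in itertools.product([0, 1], repeat=size * size):
        x = np.array(bits).reshape(size, size)
        unary = np.sum(z * (1 - x) + (1 - z) * x)           # toy unary from slide 26
        pairwise = sum(abs(x[a] - x[b]) for a, b in edges)  # Ising term
        configs.append(x)
        energies.append(unary + w * pairwise)
    p = np.exp(-np.array(energies))
    return configs, p / p.sum()

def map_and_marginal_decisions(z, w):
    configs, probs = enumerate_posterior(z, w)
    x_map = configs[int(np.argmax(probs))]                  # minimizes 0/1 risk
    # per-pixel marginals P(x_i = 1 | z, w)
    marginals = sum(p * x for p, x in zip(probs, configs))
    x_marg = (marginals > 0.5).astype(int)                  # minimizes Hamming risk
    return x_map, x_marg, marginals

z = (np.random.rand(4, 4) > 0.5).astype(float)              # some toy observation
x_map, x_marg, m = map_and_marginal_decisions(z, w=0.8)
```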

32 Recap
A different loss function gives a very different solution!

33 Two Different Approaches to Learning
1. Probabilistic parameter learning: "P(x|z,w) is needed"
2. Loss-based parameter learning: "E(x,z,w) is sufficient"

34 Probabilistic Parameter Learning
Training: given a training database {x_t, z_t}, learn the weights by regularized maximum likelihood estimation:
w* = argmin_w ∑_t -log P(x_t|z_t,w) + |w|²
(It is: P(w|z_t,x_t) ∝ P(x_t|w,z_t) P(w|z_t), i.e. the regularizer acts as a prior on w.)
Choose a loss and construct the decision function:
– 0/1 loss: x* = argmax_x P(x|z,w)
– Hamming loss: x_i* = argmax_{x_i} P(x_i|z,w)
Test time: optimize the decision function for a new test image z, e.g. x* = argmax_x P(x|z,w).

35 ML Estimation for Our Toy Image
Model: P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i)+(1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
Train: w* = argmin_w ∑_t -log P(x_t|z_t,w)
[Figure: training images z_t and labels x_t; plot of 1/|T| ∑_t -log P(x_t|z_t,w) as a function of w]
How many training images are needed?
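
A minimal sketch (assumed, not from the slides) of this training curve: for each candidate w, evaluate the average negative log-likelihood 1/|T| ∑_t -log P(x_t|z_t,w), with the partition function f(z,w) computed by brute-force enumeration, which is feasible for a 4x4 grid. Function names are my own.

```python
import itertools
import numpy as np

def toy_energy(x, z, w, edges):
    unary = np.sum(z * (1 - x) + (1 - z) * x)
    pairwise = sum(abs(x[a] - x[b]) for a, b in edges)
    return unary + w * pairwise

def avg_neg_log_likelihood(train_pairs, w, size=4):
    edges = [((r, c), (r, c + 1)) for r in range(size) for c in range(size - 1)]
    edges += [((r, c), (r + 1, c)) for r in range(size - 1) for c in range(size)]
    all_x = [np.array(b).reshape(size, size)
             for b in itertools.product([0, 1], repeat=size * size)]
    nll = 0.0
    for x_t, z_t in train_pairs:
        energies = np.array([toy_energy(x, z_t, w, edges) for x in all_x])
        log_f = np.log(np.sum(np.exp(-energies)))           # log partition function f(z_t,w)
        nll += toy_energy(x_t, z_t, w, edges) + log_f       # -log P(x_t|z_t,w)
    return nll / len(train_pairs)

# Example sweep over w (train_pairs is a list of (x_t, z_t) arrays):
# ws = np.linspace(0.0, 2.0, 21); curve = [avg_neg_log_likelihood(train_pairs, w) for w in ws]
```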

36 ML Estimation for Our Toy Image
Model: P(x|z,w) = 1/f(z,w) exp{-( ∑_i (z_i(1-x_i)+(1-z_i)x_i) + w ∑_{i,j ∈ N4} |x_i-x_j| )}
Train (exhaustive search): w* = argmin_w ∑_t -log P(x_t|z_t,w) = 0.8
Testing (1000 images):
1. MAP (0/1 loss): av. error 0/1: 0.99; av. error Hamming: 0.32
2. Marginals (Hamming loss): av. error 0/1: 0.92; av. error Hamming: 0.17

37 ML Estimation for Our Toy Image
So, probabilistic inference is better than MAP inference here … since it uses the better loss function.
[Figure: example test results]

38 Two Different Approaches to Learning
1. Probabilistic parameter learning: "P(x|z,w) is needed"
2. Loss-based parameter learning: "E(x,z,w) is sufficient"

39 Loss-based Parameter Learning
Minimize the risk R = ∑_x P(x|z,w) Δ(x,x*), where Δ is the loss function.
Replace the expectation by samples from the true distribution, i.e. the training data:
R ≈ 1/|T| ∑_t Δ(x_t, x*_t)   with x*_t = argmax_x P(x|z_t,w)
How much training data is needed?

40 Loss-based Parameter Learning
Minimize R = 1/|T| ∑_t Δ(x_t, x*_t)   with x*_t = argmax_x P(x|z_t,w)
[Figure: search over w for the 0/1 loss and for the Hamming loss]
Testing:
1. 0/1 loss (w = 0.2): error 0/1: 0.69; error Hamming: 0.11
2. Hamming loss (w = 0.1): error 0/1: 0.70; error Hamming: 0.10

41 Loss-based Parameter Learning
[Figure: example test results for the 0/1 loss and for the Hamming loss]

42 Which Approach is Better?
Hamming test error:
1. ML: MAP (0/1 loss) – error 0.32
2. ML: marginals (Hamming loss) – error 0.17
3. Loss-based: MAP (0/1 loss) – error 0.11
4. Loss-based: MAP (Hamming loss) – error 0.10
Why are loss-based methods so much better?
Model mismatch: our model cannot represent the true distribution of the training data! … and we probably always have that in vision.
Comment: marginals also give an uncertainty for every pixel, which can be used in a bigger system.

43 Check: Sample from the True Model (w = 0.8)
[Figure: sampled labels and data, compared with my toy data labelling]
Re-training on data sampled from the model gives w = 0.8.

44 A Real-World Application: Image Denoising
Model: 4-connected graph with 64 labels and 128 weights in total.
Training data: images z_1..m with ground-truth labels x_1..m.
[Figure: true test image, noisy input test image, and results (with zoom) for ML training with MAP (image 0/1 loss) and MMSE (pixel-wise squared loss)]
[See details in: Putting MAP back on the map, Pletscher et al., DAGM 2010]

45 Example – Image Denoising
[Figure: true test image, noisy input test image, and the result of loss-based MAP training (pixel-wise squared loss); training images z_1..m with ground truths x_1..m]

46 Comparison of the Two Pipelines: Models
Both the loss-minimizing and the probabilistic pipeline use the same model:
– Unary potential: |z_i - x_i|
– Pairwise potential: |x_i - x_j|
[Figure: data z and label x]

47 Comparison of the Two Pipelines
[Figure: prediction error as a function of the deviation from the true model]
[See details in: Putting MAP back on the map, Pletscher et al., DAGM 2010]

48 Recap
– Loss functions
– Two pipelines for parameter learning: loss-based and probabilistic
– MAP inference is good, if trained well

49 Another Machine Learning View
We can identify 3 different approaches [see details in Bishop, page 42ff]:
– Generative (probabilistic) models
– Discriminative (probabilistic) models
– Discriminative functions

50 Generative Model
Models that explicitly (or implicitly) model the distribution of the input and output.
Joint probability: P(x,z) = P(z|x) P(x)   (likelihood × prior)
Pros:
1. Most elaborate model
2. Possible to sample both x and z
Cons: it might not always be possible to write down the full distribution (it involves a distribution over images).

51 Generative Model: Example
P(x,z) = P(z|x) P(x), with P(z|x) modelled by GMMs and the Ising prior P(x) = 1/f ∏_{i,j ∈ N4} exp{-|x_i-x_j|}
[Figure: true image, most likely sample, and further samples of (x,z)]
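
A minimal sketch (assumed, not from the slides) of sampling from such a generative model: a labelling x is drawn from the Ising prior with a simple single-site Gibbs sampler, then an image z is drawn from a per-label Gaussian likelihood, used here as a stand-in for the GMM colour models on the slide. Function names and parameter values are my own.

```python
import numpy as np

def sample_ising_prior(size=4, w=1.0, sweeps=200, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x = rng.integers(0, 2, size=(size, size))
    for _ in range(sweeps):
        for r in range(size):
            for c in range(size):
                # Ising energy contribution of pixel (r,c) for labels 0 and 1
                nbrs = [x[rr, cc] for rr, cc in [(r-1, c), (r+1, c), (r, c-1), (r, c+1)]
                        if 0 <= rr < size and 0 <= cc < size]
                e = np.array([w * sum(abs(0 - n) for n in nbrs),
                              w * sum(abs(1 - n) for n in nbrs)])
                p = np.exp(-e)
                x[r, c] = rng.choice([0, 1], p=p / p.sum())
    return x

def sample_image(x, mu=(0.2, 0.8), sigma=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    means = np.where(x == 1, mu[1], mu[0])   # per-label mean intensity
    return means + sigma * rng.normal(size=x.shape)

x = sample_ising_prior(w=1.0)
z = sample_image(x)
```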

52 Why Does Segmentation Still Work?
We use the posterior, not the joint, so the image z is given: P(x|z) = 1/P(z) P(z,x)
Remember: P(x|z) = 1/f(z) exp{-E(x,z)}
Comments:
– A better likelihood p(z|x) may give a better model.
– When you test models, keep in mind that data is never random; it is very structured!
[Figure: samples of z and x from the toy model (with strong likelihood)]

53 Discriminative Model
Models that model the posterior directly are discriminative models: P(x|z) = 1/f(z) exp{-E(x,z)}
We later call them "conditional random fields".
Pros:
1. Simpler to write down (no need to model z) and goes directly for the desired output x
2. The probability can be used in bigger systems
Cons: we cannot sample images z.

54 Discriminative Model – Example
Gibbs: P(x|z) = 1/f(z) exp{-E(x,z)}
E(x) = ∑_i θ_i(x_i,z_i) + ∑_{i,j ∈ N4} θ_ij(x_i,x_j,z_i,z_j)
Edge-dependent (contrast-sensitive) pairwise term: θ_ij(x_i,x_j,z_i,z_j) = |x_i-x_j| exp{-β||z_i-z_j||}, with β = 2 (Mean(||z_i-z_j||²))⁻¹
[Figure: θ_ij as a function of ||z_i-z_j||, Ising versus edge-dependent]
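
A minimal sketch (assumed, not from the slides) of computing these edge-dependent weights for a colour image. The function name is my own; the exponent uses the squared colour difference, following the common GrabCut-style form, and β follows the slide's convention β = 2/Mean(||z_i-z_j||²) (other formulations use 1/(2·Mean) instead, so treat the exact constant as an assumption).

```python
import numpy as np

def contrast_weights(z):
    """z: H x W x 3 colour image; returns weights for horizontal/vertical N4 edges."""
    dh = np.sum((z[:, 1:] - z[:, :-1]) ** 2, axis=-1)   # ||z_i - z_j||^2, horizontal edges
    dv = np.sum((z[1:, :] - z[:-1, :]) ** 2, axis=-1)   # ||z_i - z_j||^2, vertical edges
    beta = 2.0 / np.mean(np.concatenate([dh.ravel(), dv.ravel()]))
    # per-edge factor multiplied with |x_i - x_j| in the energy
    return np.exp(-beta * dh), np.exp(-beta * dv), beta
```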

55 Discriminative Functions
Models that model the classification problem via a function E(x,z): Lⁿ → ℝ, with prediction x* = argmin_x E(x,z).
Examples:
– an energy that has been loss-based trained
– support vector machines
– decision trees
Pros: most direct approach to model the problem
Cons: no probabilities

56 Recap
– Generative (probabilistic) models
– Discriminative (probabilistic) models
– Discriminative functions

57 Image segmentation … the full story … a meeting with the Queen

58 Segmentation [Boykov & Jolly, ICCV '01]
E(x) = ∑_{p ∈ V} F_p x_p + B_p (1-x_p) + ∑_{pq ∈ E} w_pq |x_p-x_q|,   x ∈ {0,1}^n
w_pq = w_i + w_c exp(-w_β ||z_p-z_q||²)
User input gives hard constraints: F_p = ∞, B_p = 0 for pixels marked as background; F_p = 0, B_p = ∞ for pixels marked as foreground.
Output: x* = argmin_x E(x)
Graph cut: global optimum in polynomial time, ~0.3 sec for a 1-MPixel image [Boykov, Kolmogorov, PAMI '04]
How do we prevent the trivial solution?
[Figure: image z with user input, and the output segmentation]
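
A minimal sketch (assumed, not from the slides) of how this binary energy is minimized exactly by an s-t minimum cut. It uses networkx for clarity; real systems use the much faster Boykov-Kolmogorov max-flow code. The function name and the weight-array layout are my own.

```python
import networkx as nx
import numpy as np

def graph_cut_segment(F, B, w_h, w_v):
    """F[p], B[p]: unary costs for labelling pixel p foreground (1) / background (0).
    w_h (H x W-1), w_v (H-1 x W): pairwise weights on horizontal/vertical N4 edges."""
    H, W = F.shape
    G = nx.DiGraph()
    s, t = "source", "sink"
    for r in range(H):
        for c in range(W):
            p = (r, c)
            G.add_edge(s, p, capacity=float(B[r, c]))   # cut iff p is labelled 0
            G.add_edge(p, t, capacity=float(F[r, c]))   # cut iff p is labelled 1
            if c + 1 < W:                               # pairwise terms, both directions
                G.add_edge(p, (r, c + 1), capacity=float(w_h[r, c]))
                G.add_edge((r, c + 1), p, capacity=float(w_h[r, c]))
            if r + 1 < H:
                G.add_edge(p, (r + 1, c), capacity=float(w_v[r, c]))
                G.add_edge((r + 1, c), p, capacity=float(w_v[r, c]))
    cut_value, (src_side, _) = nx.minimum_cut(G, s, t)
    x = np.zeros((H, W), dtype=int)
    for p in src_side:
        if p not in (s, t):
            x[p] = 1                                    # source side = foreground
    return x, cut_value                                 # cut_value equals E(x*)
```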

59 What is a Good Segmentation?
Objects (fore- and background) are self-similar with respect to appearance.
E_unary(x,θ^F,θ^B) = -log p(z|x,θ^F,θ^B) = ∑_{p ∈ V} -log p(z_p|θ^F) x_p - log p(z_p|θ^B)(1-x_p)
[Figure: input image and three segmentation options with their foreground/background colour models θ^F, θ^B; E_unary = 460000, 482000, 483000]

60 GrabCut [Rother, Kolmogorov, Blake, SIGGRAPH '04]
E(x,θ^F,θ^B) = ∑_{p ∈ V} F_p(θ^F) x_p + B_p(θ^B)(1-x_p) + ∑_{pq ∈ E} w_pq |x_p-x_q|,   x ∈ {0,1}^n
F_p(θ^F) = -log p(z_p|θ^F),   B_p(θ^B) = -log p(z_p|θ^B),   with GMM colour models θ^F, θ^B
Hard constraints from the user input (e.g. F_p = ∞, B_p = 0 outside the rectangle).
Problem: joint optimization of x, θ^F, θ^B is NP-hard.
[Figure: image z with user input, foreground/background/"others" GMMs in R-G space, and the output segmentation]

61 GrabCut: Optimization [Rother, Kolmogorov, Blake, SIGGRAPH '04]
Iterate min E(x,θ^F,θ^B), starting from an initial segmentation x:
1. Learning of the colour distributions: minimize over θ^F, θ^B
2. Graph cut to infer the segmentation: minimize over x
[Figure: image z with user input, and the initial segmentation x]
A sketch of this alternating loop is given below.
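
A minimal sketch (assumed, not the authors' implementation) of the alternation: fit GMM colour models to the current foreground/background pixels, convert them to unary costs, then re-run the graph cut. It reuses the hypothetical contrast_weights() and graph_cut_segment() helpers sketched above, plus scikit-learn's GaussianMixture; the weight w and component count are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def grabcut_iterations(z, x_init, w=50.0, n_iter=5, n_components=5):
    """z: H x W x 3 image, x_init: H x W initial binary segmentation (both labels present)."""
    x = x_init.copy()
    w_h, w_v, _ = contrast_weights(z)                       # edge-dependent pairwise weights
    pixels = z.reshape(-1, 3)
    for _ in range(n_iter):
        # step 1: learn the colour distributions for the current segmentation
        gmm_f = GaussianMixture(n_components).fit(pixels[x.ravel() == 1])
        gmm_b = GaussianMixture(n_components).fit(pixels[x.ravel() == 0])
        F = -gmm_f.score_samples(pixels).reshape(x.shape)   # -log p(z_p | theta_F)
        B = -gmm_b.score_samples(pixels).reshape(x.shape)   # -log p(z_p | theta_B)
        # step 2: graph cut to infer the segmentation
        x, _ = graph_cut_segment(F, B, w * w_h, w * w_v)
    return x
```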

62 GrabCut: Optimization
[Figure: energy after each iteration (0-4) and the corresponding results]

63 GrabCut: Optimization
Iterated graph cut.
[Figure: foreground and background colour models in the R-G plane, before and after iterating]

64 Summary
– Intro: probabilistic models
– Two different approaches for learning
– Generative/discriminative models, discriminative functions
– Advanced segmentation system: GrabCut

