1
CS 231A Section 1: Linear Algebra & Probability Review
Jonathan Krause 9/28/2012
2
Topics
- Support Vector Machines
- Boosting
  - Viola-Jones face detector
- Linear Algebra Review
  - Notation
  - Operations & Properties
  - Matrix Calculus
- Probability
  - Axioms
  - Basic Properties
  - Bayes Theorem, Chain Rule
3
Which hyperplane is best?
Linear classifiers: find a linear function (hyperplane), defined by parameters w and b, that separates the positive and negative examples. Many hyperplanes do this, so which one is best?
4
Support vector machines
Find the hyperplane that maximizes the margin between the positive and negative examples. The examples that lie closest to the hyperplane are the support vectors, and their distance from the hyperplane defines the margin.
5
Support Vector Machines (SVM)
We wish to perform binary classification, i.e. find a linear classifier. We are given data x_1, ..., x_n in R^d with labels y_1, ..., y_n, where each y_i is in {-1, +1}. When the data is linearly separable we can find our linear classifier by solving the optimization problem
min_{w,b} (1/2) ||w||^2   subject to   y_i (w^T x_i + b) >= 1 for all i.
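For small, linearly separable problems this primal is just a quadratic program. Below is a minimal MATLAB sketch using quadprog from the Optimization Toolbox; the data matrix X (n-by-d) and label vector y (n-by-1, entries in {-1,+1}) are assumed inputs, and this is an illustration rather than how you would solve SVMs in practice.
% Variables are z = [w; b]. Minimize (1/2)||w||^2 subject to y_i*(w'*x_i + b) >= 1.
[n, d] = size(X);
H = blkdiag(eye(d), 0);                  % quadratic term: only w is penalized
f = zeros(d + 1, 1);
A = -[bsxfun(@times, y, X), y];          % -y_i*[x_i', 1]*z <= -1  <=>  y_i*(w'*x_i + b) >= 1
b = -ones(n, 1);
z = quadprog(H, f, A, b);
w = z(1:d); b0 = z(d+1);                 % separating hyperplane: predict sign(w'*x + b0)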
6
Nonlinear SVMs Datasets that are linearly separable work out great:
But what if the dataset is just too hard? We can map it to a higher-dimensional space, e.g. mapping each 1-D point x to the 2-D point (x, x^2). This is one way to deal with non-separable problems. Slide credit: Andrew Moore
7
Nonlinear SVMs General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable, via a lifting transformation Φ: x → φ(x). Slide credit: Andrew Moore
8
SVM – l1 regularization What if data is not linearly separable?
We can use regularization to handle this: we solve a new optimization problem with slack variables ξ_i,
min_{w,b,ξ} (1/2) ||w||^2 + C Σ_i ξ_i   subject to   y_i (w^T x_i + b) >= 1 - ξ_i,  ξ_i >= 0,
and "tune" the regularization parameter C. This is another way to deal with non-separable problems.
9
Solving the SVM There are many different packages for solving SVMs
In PS0 we have you use the liblinear package. This is an efficient implementation, but it can only use a linear kernel. If you wish to have more flexibility in your choice of kernel, you can use the LIBSVM package. There are also other tricks, e.g. for large-scale problems.
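As a sketch of how liblinear's MATLAB interface is typically called (the option string, variable names, and exact output handling here are assumptions; check the package's README for authoritative usage):
% Assumes liblinear's MATLAB wrappers (train/predict) are compiled and on the path.
% Xtrain: n-by-d feature matrix, ytrain: n-by-1 labels in {-1,+1}.
model = train(ytrain, sparse(Xtrain), '-c 1');          % '-c' sets the regularization parameter C
[pred, stats, ~] = predict(ytest, sparse(Xtest), model); % predicted labels and accuracy statistics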
10
Topics
- Support Vector Machines
- Boosting
  - Viola-Jones face detector
- Linear Algebra Review
  - Notation
  - Operations & Properties
  - Matrix Calculus
- Probability
  - Axioms
  - Basic Properties
  - Bayes Theorem, Chain Rule
11
Boosting
Boosting is a sequential procedure that builds a complex classifier out of simpler ones by combining them additively. Each data point x_t has a class label y_t ∈ {+1, -1} and a weight, initially w_t = 1.
Y. Freund and R. Schapire, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence, 14(5), September 1999.
12
Toy example
Weak learners come from the family of lines. Each data point has a class label y_t ∈ {+1, -1} and a weight, initially w_t = 1. A classifier h with p(error) = 0.5 is at chance.
13
Toy example
Each data point has a class label y_t ∈ {+1, -1} and a weight, initially w_t = 1. This line seems to be the best. It is a 'weak classifier': it performs slightly better than chance.
14
Toy example (repeated over several rounds)
Each data point has a class label y_t ∈ {+1, -1}. After each round we update the weights: w_t ← w_t · exp(-y_t H_t(x_t)), so points the current classifier gets wrong receive more weight in the next round.
18
Toy example
The strong (non-linear) classifier is built as the combination of all the weak (linear) classifiers f1, f2, f3, f4.
19
Boosting
Defines a classifier using an additive model:
H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + ...
where H(x) is the strong classifier, each h_t(x) is a weak classifier, α_t is its weight, and x is the feature vector.
20
Boosting
Defines a classifier using an additive model:
H(x) = α_1 h_1(x) + α_2 h_2(x) + α_3 h_3(x) + ...
We need to define a family of weak classifiers from which the h_t(x) are chosen.
21
Why boosting?
- A simple algorithm for learning robust classifiers (Freund & Schapire, 1995; Friedman, Hastie & Tibshirani, 1998)
- Provides an efficient algorithm for sparse visual feature selection (Tieu & Viola, 2000; Viola & Jones, 2003)
- Easy to implement; doesn't require external optimization tools.
22
Boosting - mathematics
Weak learners: h_t(x) = +1 if f_t(x) > θ_t and -1 otherwise, where f_t(x) is the value of a rectangle feature and θ_t is a threshold.
Final strong classifier: H(x) = sign(Σ_t α_t h_t(x)).
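The generic AdaBoost loop behind this is short. Below is a minimal MATLAB sketch with single-threshold weak learners over a precomputed feature matrix; the function name, the inputs F, y, T, and the brute-force stump search are illustrative assumptions, not the Viola-Jones training code (save as adaboost_sketch.m to run).
% F: n-by-d matrix of feature values, y: n-by-1 labels in {-1,+1}, T: number of rounds.
function [stumps, alphas] = adaboost_sketch(F, y, T)
  [n, d] = size(F);
  w = ones(n, 1) / n;                            % example weights
  stumps = zeros(T, 3);                          % each row: [feature index, threshold, polarity]
  alphas = zeros(T, 1);
  for t = 1:T
    best_err = inf;
    for j = 1:d                                  % try every feature, threshold, and polarity
      for thr = F(:, j)'
        for p = [1 -1]
          h = p * sign(F(:, j) - thr); h(h == 0) = p;
          err = sum(w(h ~= y));                  % weighted training error
          if err < best_err
            best_err = err; best = [j, thr, p];
          end
        end
      end
    end
    alpha = 0.5 * log((1 - best_err) / max(best_err, eps));
    j = best(1); thr = best(2); p = best(3);
    h = p * sign(F(:, j) - thr); h(h == 0) = p;
    w = w .* exp(-alpha * y .* h);               % up-weight misclassified examples
    w = w / sum(w);                              % renormalize
    stumps(t, :) = best; alphas(t) = alpha;
  end
end
% Prediction on a new feature vector: H(x) = sign(sum_t alphas(t) * h_t(x)).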
23
Weak classifier
Four kinds of rectangle filters. Value = Σ(pixels in white area) - Σ(pixels in black area).
For real problems, results are only as good as the features used; this is the main piece of ad-hoc (or domain) knowledge. Rather than the raw pixels, we select from a very large set of simple functions that are sensitive to edges and other critical features of the image, at multiple scales. Since the final classifier is a perceptron, it is important that the features be non-linear; otherwise the final classifier will be a simple perceptron. We introduce a threshold to yield binary features. Slide credit: S. Lazebnik
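For efficiency, the Viola-Jones paper evaluates these rectangle sums in constant time with an integral image. A minimal MATLAB sketch follows; the file name and rectangle coordinates are made-up examples, and the image is assumed to be grayscale.
img = double(imread('face.png'));              % assumed grayscale test image
s = cumsum(cumsum(img, 1), 2);                 % running 2D sums
ii = zeros(size(img) + 1);                     % pad with a zero row and column
ii(2:end, 2:end) = s;                          % ii(r+1,c+1) = sum of img(1:r,1:c)
% Sum of pixels in the rectangle with top-left (r1,c1) and bottom-right (r2,c2):
boxsum = @(r1, c1, r2, c2) ii(r2+1, c2+1) - ii(r1, c2+1) - ii(r2+1, c1) + ii(r1, c1);
% A two-rectangle (edge-like) feature: white box on top, black box below.
value = boxsum(10, 10, 17, 25) - boxsum(18, 10, 25, 25);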
24
Weak classifier (figure: source image and rectangle-filter response). Slide credit: S. Lazebnik
25
Viola & Jones algorithm
1. Evaluate each rectangle filter on each example; thresholding the filter value gives a weak classifier.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
26
Viola & Jones algorithm
For a 24x24 detection region, the number of possible rectangle features is very large (on the order of 160,000), far more than the number of pixels.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
27
Viola & Jones algorithm
2. Select the best filter/threshold combination:
- Normalize the weights.
- For each feature j, evaluate its weighted error with respect to the current weights.
- Choose the classifier h_t with the lowest error.
3. Reweight the examples.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
28
Viola & Jones algorithm
4. The final strong classifier is a weighted linear combination of the T weak hypotheses, where each classifier's weight grows as its training error shrinks (in the paper, α_t = log((1 - ε_t) / ε_t)).
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
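As a worked example of that weighting (illustrative numbers, natural logarithms): a weak classifier with weighted error ε_t = 0.1 receives weight log(0.9/0.1) ≈ 2.2, while one with ε_t = 0.4 receives only log(0.6/0.4) ≈ 0.4.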
29
Boosting for face detection
For each round of boosting:
- Evaluate each rectangle filter on each example
- Select the best filter/threshold combination
- Reweight the examples
30
The implemented system
Training data:
- 5000 faces, all frontal, rescaled to 24x24 pixels
- 300 million non-face windows sampled from 9500 non-face images
- Faces are normalized for scale and translation
- Many variations: across individuals, illumination, pose
This situation with negative examples is actually quite common: negative examples are essentially free.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
31
System performance Training time: “weeks” on 466 MHz Sun workstation
- 38 layers, 6061 features in total
- Average of 10 features evaluated per window on the test set
- "On a 700 MHz Pentium III processor, the face detector can process a 384 by 288 pixel image in about .067 seconds" (roughly 15 Hz)
- 15 times faster than the previous detector of comparable accuracy (Rowley et al., 1998)
- (And 2001 is forever ago.)
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
32
Output of Face Detector on Test Images
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. CVPR 2001.
33
Topics
- Support Vector Machines
- Boosting
  - Viola-Jones face detector
- Linear Algebra Review
  - Notation
  - Operations & Properties
  - Matrix Calculus
- Probability
  - Axioms
  - Basic Properties
  - Bayes Theorem, Chain Rule
You probably know most of this; it is just a review.
34
Linear Algebra in Computer Vision
Representation:
- 3D points in the scene
- 2D points in the image (images are matrices)
Transformations:
- Mapping 2D to 2D
- Mapping 3D to 2D
This also explains why so many people use Matlab.
35
Notation
We write A ∈ R^(m x n) for a real-valued matrix with m rows and n columns. We write x ∈ R^(n x 1) for a column vector and x ∈ R^(1 x n) for a row vector. For vectors, the 'by 1' is implicit, so we usually just write x ∈ R^n.
36
Notation
To indicate the element in the i-th row and j-th column of a matrix A we write a_ij. Similarly, to indicate the i-th entry of a vector x we write x_i.
37
Norms
Intuitively, the norm of a vector is a measure of its "length". The l2 norm is defined as ||x||_2 = sqrt(Σ_i x_i^2). In this class we will use the l2 norm unless otherwise noted, so we drop the 2 subscript for convenience. Note that ||x||^2 = x^T x. There are other norms, e.g. the l1 norm ||x||_1 = Σ_i |x_i|. Formally, norms have to satisfy some properties (positive scalability, triangle inequality, p(v) = 0 implies v = 0), but that's not important now.
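In MATLAB these norms can be checked with the built-in norm function; the vector here is just an example.
x = [3; 4];
norm(x)          % l2 norm: sqrt(3^2 + 4^2) = 5
norm(x, 1)       % l1 norm: |3| + |4| = 7
norm(x, Inf)     % l-infinity norm: max(|3|, |4|) = 4
x' * x           % equals norm(x)^2 = 25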
38
Linear Independence and Rank
A set of vectors is linearly independent if no vector in the set can be represented as a linear combination of the remaining vectors in the set. The rank of a matrix is the maximal number of linearly independent columns (or rows) of the matrix. A^T A is called the Gram matrix.
39
Range and Nullspace
The range of a matrix A ∈ R^(m x n) is the span of its columns: R(A) = {v ∈ R^m : v = Ax for some x ∈ R^n}. The nullspace of A is the set of vectors that, when multiplied by the matrix, result in 0: N(A) = {x ∈ R^n : Ax = 0}.
40
Eigenvalues and Eigenvectors
Given a square matrix A ∈ R^(n x n), λ and x (with x ≠ 0) are said to be an eigenvalue and the corresponding eigenvector of the matrix if Ax = λx. We can solve for the eigenvalues by finding the roots of the characteristic polynomial det(A - λI).
41
Eigenvalue Properties
- The rank of a (diagonalizable) matrix is equal to the number of its non-zero eigenvalues.
- The eigenvalues of a diagonal matrix D = diag(d_1, ..., d_n) are simply the diagonal entries d_1, ..., d_n.
- A matrix A is said to be diagonalizable if we can write A = X Λ X^(-1), where Λ is a diagonal matrix whose elements are the eigenvalues and the columns of X contain the corresponding eigenvectors.
42
Eigenvalues & Eigenvectors of Symmetric Matrices
Eigenvalues of symmetric matrices are real. Eigenvectors of symmetric matrices can be chosen to be orthonormal. Consider the optimization problem max_x x^T A x subject to ||x||_2 = 1, where A is symmetric: the maximizing x is the eigenvector corresponding to the largest eigenvalue of A.
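A quick MATLAB sanity check of the last point on a random symmetric matrix (illustrative only, not from the slide):
A = randn(5); A = (A + A') / 2;          % random symmetric matrix
[V, D] = eig(A);
[lam_max, i] = max(diag(D));
x = V(:, i);                             % unit eigenvector for the largest eigenvalue
x' * A * x                               % equals lam_max, the maximum of x'*A*x over ||x|| = 1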
43
Generalized Eigenvalues
Generalized eigenvalue problem: Ax = λBx. Generalized eigenvalues must satisfy det(A - λB) = 0. This reduces to the ordinary eigenvalue problem B^(-1)Ax = λx when B^(-1) exists. Generalized eigenvalues are used in Fisherfaces.
44
Singular Value Decomposition (SVD)
The SVD of a matrix A ∈ R^(m x n) is given by A = U Σ V^T, where the columns u_i of U are called the left singular vectors, Σ is a diagonal matrix whose values σ_1 ≥ σ_2 ≥ ... ≥ 0 are called the singular values, and the columns v_i of V are called the right singular vectors.
45
SVD
If the matrix A has rank r, then A has r non-zero singular values. The left singular vectors u_1, ..., u_r are an orthonormal basis for the range of A. The non-zero singular values of A are the square roots of the non-zero eigenvalues of A^T A (or A A^T).
46
Matlab
[V,D] = eig(A): the eigenvectors of A are the columns of V; D is a diagonal matrix whose entries are the eigenvalues of A.
[V,D] = eig(A,B): the generalized eigenvectors are the columns of V; D is a diagonal matrix whose entries are the generalized eigenvalues.
[U,S,V] = svd(X): the columns of U are the left singular vectors of X; S is a diagonal matrix whose entries are the singular values of X; the columns of V are the right singular vectors of X. Recall X = U*S*V'.
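A small script verifying these relationships on a random matrix (illustrative only):
X = randn(4, 3);
[U, S, V] = svd(X);
norm(X - U*S*V')                       % ~0: the factorization reconstructs X
sort(sqrt(eig(X'*X)), 'descend')       % matches the singular values diag(S)
[Q, D] = eig(X'*X);
norm(Q'*Q - eye(3))                    % ~0: eigenvectors of a symmetric matrix are orthonormal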
47
Matrix Calculus -- Gradient
Let f : R^(m x n) → R; then the gradient of f with respect to A is the m x n matrix of partial derivatives, (∇_A f(A))_ij = ∂f/∂a_ij. The gradient is always the same size as A; thus if we just have a vector x ∈ R^n, the gradient is simply the n-vector with entries ∂f/∂x_i. The gradient is really just a way to express many partial-derivative equations at once, and it is hugely useful in ML and optimization.
48
Gradients
Gradients are built from partial derivatives. Some common gradients: ∇_x (a^T x) = a, ∇_x (x^T A x) = (A + A^T) x (which is 2Ax when A is symmetric), and ∇_x ||x||_2^2 = 2x.
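These identities are easy to spot-check numerically. The MATLAB sketch below compares the closed-form gradient of x^T A x with central finite differences; A and x are arbitrary examples.
A = randn(3); x = randn(3, 1);
f = @(x) x' * A * x;                        % scalar-valued function of x
g_exact = (A + A') * x;                     % closed-form gradient
g_fd = zeros(3, 1);
h = 1e-6;
for i = 1:3
  e = zeros(3, 1); e(i) = h;
  g_fd(i) = (f(x + e) - f(x - e)) / (2*h);  % central difference approximation
end
norm(g_exact - g_fd)                        % should be tiny (roundoff level)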
49
Topics
- Support Vector Machines
- Boosting
  - Viola-Jones face detector
- Linear Algebra Review
  - Notation
  - Operations & Properties
  - Matrix Calculus
- Probability
  - Axioms
  - Basic Properties
  - Bayes Theorem, Chain Rule
50
Probability in Computer Vision
Foundation for algorithms that solve:
- Tracking problems
- Human activity recognition
- Object recognition
- Segmentation
51
Probability Axioms
Sample space Ω: the set of all the outcomes of a random experiment.
Event space F: a set whose elements A ∈ F (called events) are subsets of Ω.
Probability measure: a function P : F → R that satisfies P(A) ≥ 0 for all A ∈ F, P(Ω) = 1, and P(∪_i A_i) = Σ_i P(A_i) for disjoint events A_1, A_2, .... Rigorously this is part of measure theory (the event space is a sigma-algebra). Note that a probability measure operates on sets of outcomes (events), not on individual outcomes.
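As a concrete illustration (not from the original slide): for a fair six-sided die, the sample space is Ω = {1, 2, 3, 4, 5, 6}, the event "the roll is even" is the subset A = {2, 4, 6}, and the probability measure assigns P(A) = 3/6 = 1/2; the axioms are easy to verify for this case.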
52
Basic Properties
All derivable from the axioms, for example: P(∅) = 0; if A ⊆ B then P(A) ≤ P(B); P(A^c) = 1 - P(A); P(A ∪ B) = P(A) + P(B) - P(A ∩ B) ≤ P(A) + P(B).
53
Conditional Probability
The conditional probability of A given B is defined as P(A | B) = P(A ∩ B) / P(B) (for P(B) > 0); conditional probability is a definition! Two events are independent if P(A ∩ B) = P(A)P(B); there are other equivalent ways to express independence, e.g. P(A | B) = P(A). Conditional independence: A and B are conditionally independent given C if P(A ∩ B | C) = P(A | C) P(B | C).
54
Product Rule
From the definition of conditional probability we can write P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A). From the product rule we can derive the chain rule of probability: P(A_1 ∩ ... ∩ A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1, A_2) ... P(A_n | A_1, ..., A_(n-1)).
55
Bayes Theorem
P(A | B) = P(B | A) P(A) / P(B), where P(B | A) is the likelihood, P(A) is the prior probability, P(A | B) is the posterior probability, and P(B) is the normalizing constant. Trivial to prove: it is just one step beyond the definition of conditional probability, and it is the foundation of Bayesian statistics.
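As an illustrative (hypothetical) calculation in the detection setting: suppose 1% of windows contain a face, so P(face) = 0.01, and a detector fires with P(+ | face) = 0.95 and P(+ | no face) = 0.05. Then P(face | +) = (0.95)(0.01) / ((0.95)(0.01) + (0.05)(0.99)) = 0.0095 / 0.059 ≈ 0.16, so even a good detector yields a modest posterior when the prior is small.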