Linear Discriminators Chapter 20 From Data to Knowledge
Concerns Generalization Accuracy Efficiency Noise Irrelevant features Generality: when does this work?
Linear Model Let f1, …fn be the feature values of an example. Let class be denoted {+1, -1}. Define f0 = -1. (bias weight) Linear model defines weights w0,w1,..wn. –-w0 is the threshold Classification rule: –If w0*f0+w1*f1..+wn*fn> 0, predict class + else predict class -. Briefly: W*F>0 where * is inner product of weight vector and feature weights and F has been augmented with extra 1.
Augmentation Trick Suppose data defined features f1 and f2. 2* f1 + 3*f2 > 4 is classifier Equivalently: * > 0 Mapping data to allows learning/representing threshold as just another featuer. Mapping data into higher dimensions is key idea behind SVMs
Mapping to enable Linear Separation Let xi be m vectors in R^N. Map xi into R^{N+M} by xi -> where 1 in n+i position. For any labelling of xi by classes +/-, the embedding makes data linearly separable. –Define wi = 0 i<N –w(i+n) = 1 if xi is + else 0. –W(i+n) = -1 if xi is negative else 0.
Representational Power “Or” of n features – Wi = 1, threshold = 0 “And” of n features –Wi = 1 threshold = n -1 K of n features (prototype) –Wi =1 threshold = k -1 Can’t do XOR Combining linear threshold units yields any boolean function.
Classical Perceptron Goal: Any W which separates the data. Algorithm (X is augmented with 1) W = 0 Repeat –If X positive and W*X wrong, W = W+X; –Else if X negative & W*X wrong, W = W-X. Until no errors or very large number of times.
Classical Perceptron Theorem: If concept linearly separable, then algorithm finds a solution. Training time can be exponential in number of features. Epoch is single pass through entire data. Convergence can take exponentially many epochs. If |xi|<R and margin = m, then number of mistake is < R^2/m^2.
Neural Net view Goal: minimize Squared-error = Err^2. Let class yi be 1 or -1. Let Err = sum(W*Xi –Yi) where Xi is ith example. This is a function only of the weights. Use Calculus; take partial derivates wrt Wj. To move to lower value, move in direction of negative gradient, i.e. change in Xi is -2*Err*Xj
Neural Net View This is an optimization problem. The solution is by hill-climbing so there is no guarantee of finding the optimal solution. While derivates tell you the direction (the negative gradient) they do not tell you how much to change each Xi. On the plus side it is fast. On the negative side, no guarantee of separation
Support Vector Machine Goal: maximize the margin. Assuming the line separates the data, the margin is the minimum of the closest positive and negative example to the line. Good News: This can be solved by quadratic program. Implemented in Weka as SOM. If not linearly separable, SVM will add more features.
If not Linearly Separable 1.Add more nodes: Neural Nets 1.Can Represent any boolean function: why? 2.No guarantees about learning 3.Slow 4.Incomprehensible 2.Add more features: SVM 1.Can represent any boolean function 2.Learning guarantees 3.Fast 4.Semi-comprehensible
Adding features Suppose pt (x,y) is positive if it lies in the unit disk else negative. Clearly very unlinearly separable Map (x,y) -> (x,y, x^2+y^2) Now in 3-space, easily separable. This works for any learning algorithm, but SVM will almost do it for you. (set parameters).