Topics on Final
– Perceptrons
– SVMs
– Precision/Recall/ROC
– Decision Trees
– Naive Bayes
– Bayesian networks
– Adaboost
– Genetic algorithms
– Q learning
Not on the final: MLPs, PCA
Rules for Final
– Open book, notes, computer, calculator
– No discussion with others
– You can ask me or Dona general questions about a topic
– Read each question carefully
– Hand in your own work only
– Turn in to box at CS front desk or to me (hardcopy or ) by 5pm Wednesday, March 21. No extensions
Short recap of important topics
Perceptrons
Training a perceptron
1. Start with random weights, w = (w1, w2, ..., wn).
2. Select training example (xk, tk).
3. Run the perceptron with input xk and weights w to obtain output o.
4. Let η be the learning rate (a user-set parameter). Now update each weight:
   wi ← wi + η (tk − o) xk,i
5. Go to 2.
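A minimal Python sketch of this loop. The sign-threshold output rule, the small random initial weights, and the OR training data are illustrative assumptions, not from the slides:

```python
import random

def train_perceptron(examples, eta=0.1, epochs=200):
    """Train a perceptron on (x, t) pairs with targets t in {-1, +1}.
    x is a tuple of n features; w[0] is a bias weight with fixed input 1."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]  # 1. random weights
    for _ in range(epochs):
        x, t = random.choice(examples)                       # 2. pick an example
        s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
        o = 1 if s > 0 else -1                               # 3. run the perceptron
        w[0] += eta * (t - o)                                # 4. update each weight:
        for i, xi in enumerate(x):                           #    w_i <- w_i + eta*(t - o)*x_k,i
            w[i + 1] += eta * (t - o) * xi
    return w

# Example: learning logical OR with -1/+1 targets.
data = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(data))
```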
Support Vector Machines
Here, assume positive and negative instances are to be separated by the hyperplane
Equation of line: w1 x1 + w2 x2 + b = 0
Intuition: the best hyperplane (for future generalization) will “maximally” separate the examples
Definition of Margin
Minimizing ||w||
Find w and b by doing the following minimization:
   minimize (1/2) ||w||² subject to yi (w · xi + b) ≥ 1 for all training examples (xi, yi)
This is a quadratic optimization problem. Use “standard optimization tools” to solve it.
Dual formulation: It turns out that w can be expressed as a linear combination of a small subset of the training examples xi, namely those that lie exactly on the margin (minimum distance to the hyperplane):
   w = Σi αi xi, where the xi lie exactly on the margin.
These training examples are called “support vectors”. They carry all relevant information about the classification problem.
The results of the SVM training algorithm (involving solving a quadratic programming problem) are the αi and the bias b. The support vectors are all xi such that αi > 0. Clarification: In the slides below we use αi to denote |αi| yi, where yi ∈ {−1, 1}.
For a new example x, we can now classify x using the support vectors:
   class(x) = sign( Σi αi (xi · x) + b )
This is the resulting SVM classifier.
SVM review
Equation of line: w1 x1 + w2 x2 + b = 0
Define margin using the two parallel hyperplanes w · x + b = +1 and w · x + b = −1.
Margin distance: 1 / ||w|| from the separating line to each margin line.
To maximize the margin, we minimize ||w|| subject to the constraint that positive examples fall on one side of the margin, and negative examples on the other side:
   yi (w · xi + b) ≥ 1 for all i
We can relax this constraint using “slack variables”.
SVM review
To do the optimization, we use the dual formulation. The results of the optimization “black box” are the αi and b. The support vectors are all xi such that αi ≠ 0.
SVM review
Once the optimization is done, we can classify a new example x as follows:
   class(x) = sign( Σi αi (xi · x) + b ), summing over the support vectors xi
That is, classification is done entirely through a linear combination of dot products with training examples. This is a “kernel” method.
Example
Input to SVM optimizer: training examples with columns x1, x2, class (values shown in the slide figure)
Output from SVM optimizer:
   Support vector    α
   (−1, 0)          −0.208
   (1, 1)            0.416
   (0, −1)          −0.208
   b = −0.376
Weight vector: w = Σi αi xi = (0.624, 0.624)
Separation line: 0.624 x1 + 0.624 x2 − 0.376 = 0, i.e., x2 = −x1 + 0.603
Classifying a new point x: compute sign( Σi αi (xi · x) + b ).
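As a check, here is a small Python sketch that classifies a new point directly from the support vectors and b reported above; the test point (2, 2) is our own illustration:

```python
# Support vectors and signed coefficients (alpha_i = |alpha_i| * y_i) from the
# svm_light output above.
svs    = [(-1, 0), (1, 1), (0, -1)]
alphas = [-0.208, 0.416, -0.208]
b      = -0.376

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def classify(x):
    # class(x) = sign( sum_i alpha_i (x_i . x) + b )
    s = sum(a * dot(sv, x) for a, sv in zip(alphas, svs)) + b
    return 1 if s > 0 else -1

# Recover the weight vector w = sum_i alpha_i x_i:
w = [sum(a * sv[j] for a, sv in zip(alphas, svs)) for j in range(2)]
print(w)                 # [0.624, 0.624]
print(classify((2, 2)))  # +1: 0.624*2 + 0.624*2 - 0.376 = 2.12 > 0
```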
Precision/Recall/ROC
Creating a Precision/Recall Curve
Results of classifier: a table with columns Threshold, Accuracy, Precision, Recall, starting at threshold ∞ and lowering it row by row. [Table values shown in the slide.]
Creating a ROC Curve
Results of classifier: a table with columns Threshold, Accuracy, TPR, FPR, starting at threshold ∞ and lowering it row by row. [Table values shown in the slide.]
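Both curves come from the same kind of threshold sweep. A minimal Python sketch; the classifier scores and labels below are hypothetical:

```python
def curve_points(scores, labels):
    """Sweep a decision threshold from high to low over classifier scores.
    labels are 1 (positive) or 0 (negative). Returns, for each threshold,
    (threshold, accuracy, precision, recall, TPR, FPR); recall = TPR."""
    P = sum(labels)
    N = len(labels) - P
    points = []
    for thresh in sorted(set(scores), reverse=True):
        preds = [1 if s >= thresh else 0 for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        tn = N - fp
        accuracy  = (tp + tn) / (P + N)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall    = tp / P          # true positive rate
        fpr       = fp / N          # false positive rate
        points.append((thresh, accuracy, precision, recall, recall, fpr))
    return points

# Hypothetical scores and true labels:
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    1,   0,   0]
for row in curve_points(scores, labels):
    print(row)
```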
Precision/Recall versus ROC curves
Decision Trees
Naive Bayes
Naive Bayes classifier: Assume
   P(a1, a2, ..., an | cj) = Πi P(ai | cj)
Given this assumption, here’s how to classify an instance x = ⟨a1, ..., an⟩:
   class(x) = argmax over cj of P(cj) Πi P(ai | cj)
We can estimate the values of these various probabilities over the training set.
In-class example
Training set (columns a1 a2 a3, class): [table shown in the slide; the legible rows include 1 1 0 → − and 1 0 0 → −]
What class would be assigned by a NB classifier to 1 1 1?
Laplace smoothing (also called “add-one” smoothing)
For each class cj and attribute ai with value z, add one “virtual” instance. That is, recalculate:
   P(ai = z | cj) = (nz + 1) / (nc + k)
where nz is the number of class-cj training instances with ai = z, nc is the number of class-cj training instances, and k is the number of possible values of attribute ai.
Training set (a1 a2 a3, class):
   0 0 1   +
   1 1 1   −
   1 1 0   −
   1 0 1   −
Exercise:
   Smoothed P(a1 = 1 | +) =
   Smoothed P(a1 = 0 | +) =
   Smoothed P(a1 = 1 | −) =
   Smoothed P(a1 = 0 | −) =
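A minimal Python sketch of a Naive Bayes classifier with add-one smoothing, using the four training rows from this slide (the in-class example's table may differ, so treat the query result as illustrative):

```python
from collections import defaultdict

# Training set from the slide above: (a1, a2, a3) -> class
data = [((0, 0, 1), '+'), ((1, 1, 1), '-'), ((1, 1, 0), '-'), ((1, 0, 1), '-')]
k = 2  # each attribute takes one of k = 2 values (0 or 1)

counts = defaultdict(int)          # (class, attribute index, value) -> count
class_counts = defaultdict(int)
for x, c in data:
    class_counts[c] += 1
    for i, v in enumerate(x):
        counts[(c, i, v)] += 1

def smoothed(c, i, v):
    # Laplace ("add-one") smoothing: (n_z + 1) / (n_c + k)
    return (counts[(c, i, v)] + 1) / (class_counts[c] + k)

def classify(x):
    best, best_p = None, -1.0
    for c in class_counts:
        p = class_counts[c] / len(data)     # prior P(c)
        for i, v in enumerate(x):
            p *= smoothed(c, i, v)          # conditional independence assumption
        if p > best_p:
            best, best_p = c, p
    return best

print(smoothed('+', 0, 1))   # Smoothed P(a1=1 | +) = (0+1)/(1+2) = 1/3
print(smoothed('-', 0, 1))   # Smoothed P(a1=1 | -) = (3+1)/(3+2) = 4/5
print(classify((1, 1, 1)))   # '-'
```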
Bayesian Networks
Methods used in computing probabilities
– Definition of conditional probability: P(A | B) = P(A, B) / P(B)
– Bayes theorem: P(A | B) = P(B | A) P(A) / P(B)
– Semantics of Bayesian networks: P(A ^ B ^ C ^ D) = P(A | Parents(A)) P(B | Parents(B)) P(C | Parents(C)) P(D | Parents(D))
– Calculating marginal probabilities
What is P(Cloudy | Sprinkler)?
What is P(Cloudy | WetGrass)?
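Both queries can be answered exactly by summing over the joint distribution. A Python sketch by enumeration; the CPT numbers are the standard textbook values for this network, which the slide shows only in a figure, so treat them as assumptions:

```python
import itertools

# CPTs for the Cloudy -> {Sprinkler, Rain} -> WetGrass network
# (assumed standard textbook values).
def joint(c, s, r, w):
    ps = (0.1 if s else 0.9) if c else 0.5                   # P(Sprinkler | Cloudy)
    pr = (0.8 if r else 0.2) if c else (0.2 if r else 0.8)   # P(Rain | Cloudy)
    pw = 0.99 if (s and r) else 0.9 if (s or r) else 0.0     # P(WetGrass | S, R)
    return 0.5 * ps * pr * (pw if w else 1.0 - pw)           # P(Cloudy) = 0.5

def query(target, evidence):
    """P(target = true | evidence), by summing the joint over all assignments."""
    num = den = 0.0
    for c, s, r, w in itertools.product([True, False], repeat=4):
        a = {'Cloudy': c, 'Sprinkler': s, 'Rain': r, 'WetGrass': w}
        if all(a[k] == v for k, v in evidence.items()):
            p = joint(c, s, r, w)
            den += p
            if a[target]:
                num += p
    return num / den

print(query('Cloudy', {'Sprinkler': True}))   # P(Cloudy | Sprinkler = true) ~ 0.167
print(query('Cloudy', {'WetGrass': True}))    # P(Cloudy | WetGrass = true) ~ 0.576
```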
Markov Chain Monte Carlo Algorithm
Markov blanket of a variable Xi: its parents, children, and children’s other parents.
MCMC algorithm: For a given set of evidence variables {Xj = xk}:
Repeat for NumSamples:
–Start with a random sample from the variables, with evidence variables fixed: (x1, ..., xn). This is the current “state” of the algorithm.
–Next state: Randomly sample a value for one non-evidence variable Xi, conditioned on the current values in the “Markov blanket” of Xi.
Finally, return the estimated distribution of each non-evidence variable Xi.
Example
Query: What is P(Sprinkler = true | WetGrass = true)?
MCMC:
–Random sample, with evidence variables fixed: [Cloudy, Sprinkler, Rain, WetGrass] = [true, true, false, true]
–Repeat:
1. Sample Cloudy, given current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose the result is false. New state: [false, true, false, true]. Note that current values of the Markov blanket remain fixed.
2. Sample Sprinkler, given current values of its Markov blanket: Cloudy = false, Rain = false, WetGrass = true. Suppose the result is true. New state: [false, true, false, true].
Each sample contributes to the estimate for the query P(Sprinkler = true | WetGrass = true). Suppose we perform 50 such samples, 20 with Sprinkler = true and 30 with Sprinkler = false. Then the answer to the query is Normalize(⟨20, 30⟩) = ⟨0.4, 0.6⟩.
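A minimal Python sketch of this procedure for the sprinkler query, reusing the joint() CPT function from the enumeration sketch above. For a network this small, sampling a variable conditioned on its Markov blanket is equivalent to sampling it conditioned on all other variables, which we can do directly from the joint:

```python
import random

def mcmc_sprinkler(num_samples=100000):
    """Estimate P(Sprinkler = true | WetGrass = true) by Gibbs sampling.
    Requires joint() from the enumeration sketch above."""
    state = {'Cloudy': True, 'Sprinkler': True,        # initial sample from the
             'Rain': False, 'WetGrass': True}          # slide; evidence stays fixed
    nonevidence = ['Cloudy', 'Sprinkler', 'Rain']
    count = 0
    for _ in range(num_samples):
        var = random.choice(nonevidence)               # resample one non-evidence var
        p = {}
        for val in (True, False):                      # P(var = val | everything else)
            s = dict(state, **{var: val})
            p[val] = joint(s['Cloudy'], s['Sprinkler'], s['Rain'], s['WetGrass'])
        state[var] = random.random() < p[True] / (p[True] + p[False])
        if state['Sprinkler']:
            count += 1
    # Normalize(<count, num_samples - count>), as in the 20/30 example above
    return count / num_samples

print(mcmc_sprinkler())   # converges to about 0.43
```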
Adaboost
Sketch of algorithm
Given data S and learning algorithm L:
– Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hT.
– At each step, derive St from S by choosing examples probabilistically according to probability distribution wt. Use St to learn ht.
– At each step, derive wt+1 by giving more probability to examples that were misclassified at step t.
– The final ensemble classifier H is a weighted sum of the ht’s, with each weight being a function of the corresponding ht’s error on its training set.
Adaboost algorithm
Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}
Initialize w1(i) = 1/N. (Uniform distribution over data)
For t = 1, ..., T:
–Select new training set St from S with replacement, according to wt
–Train L on St to obtain hypothesis ht
–Compute the training error εt of ht on S:
   εt = the sum of wt(i) over all examples xi that ht misclassifies
–If εt ≥ 0.5, break from loop.
–Compute coefficient:
   αt = (1/2) ln((1 − εt) / εt)
–Compute new weights on data:
   wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt
where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:
   Zt = Σi wt(i) exp(−αt yi ht(xi))
At the end of T iterations of this algorithm, we have h1, h2, ..., hT. We also have α1, α2, ..., αT, where αt = (1/2) ln((1 − εt) / εt).
Ensemble classifier:
   H(x) = sign( Σt αt ht(x) )
Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
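Assembled into code, a minimal Python sketch of the whole loop. The learner argument stands in for any weak learner (svm_light in the example below); the eps == 0 guard is our own addition to avoid a division by zero:

```python
import math
import random

def adaboost(S, learner, T):
    """AdaBoost sketch. S is a list of (x, y) pairs with y in {+1, -1};
    learner(sample) must return a hypothesis h with h(x) in {+1, -1}."""
    N = len(S)
    w = [1.0 / N] * N                                 # uniform initial weights
    hs, alphas = [], []
    for t in range(T):
        St = random.choices(S, weights=w, k=N)        # sample S_t according to w_t
        h = learner(St)
        # Training error of h_t on S: total weight of misclassified examples.
        eps = sum(wi for wi, (x, y) in zip(w, S) if h(x) != y)
        if eps >= 0.5 or eps == 0:                    # eps == 0 guard avoids log(inf)
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        hs.append(h)
        alphas.append(alpha)
        # Reweight: misclassified examples go up by e^alpha, correct ones down,
        # then normalize by Z_t so the weights form a distribution again.
        w = [wi * math.exp(-alpha * y * h(x)) for wi, (x, y) in zip(w, S)]
        Z = sum(w)
        w = [wi / Z for wi in w]
    def H(x):  # ensemble classifier: sign of the alpha-weighted vote
        return 1 if sum(a * h(x) for a, h in zip(alphas, hs)) > 0 else -1
    return H
```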
A Simple Example
t = 1
S = Spam8.train: x1, x2, x3, x4 (class +1); x5, x6, x7, x8 (class −1)
w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8}
S1 = {x1, x2, x2, x5, x5, x6, x7, x8}
Run svm_light on S1 to get h1
Run h1 on S. Classifications: {1, −1, −1, −1, −1, −1, −1, −1}
Calculate error: x2, x3, x4 are misclassified, so ε1 = 1/8 + 1/8 + 1/8 = 0.375
Calculate α’s: α1 = (1/2) ln((1 − 0.375) / 0.375) ≈ 0.255
Calculate new w’s: each misclassified example’s weight is multiplied by e^α1, each correctly classified one by e^(−α1), and the weights are renormalized.
t = 2
w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102}
S2 = {x1, x2, x2, x3, x4, x4, x7, x8}
Run svm_light on S2 to get h2
Run h2 on S. Classifications: {1, 1, 1, 1, 1, 1, 1, 1}
Calculate error: x5, x6, x7, x8 are misclassified, so ε2 = 4 × 0.102 ≈ 0.408
Calculate α’s: α2 = (1/2) ln((1 − 0.408) / 0.408) ≈ 0.186
Calculate w’s: as before, reweight and renormalize.
t = 3
w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125}
S3 = {x2, x3, x3, x3, x5, x6, x7, x8}
Run svm_light on S3 to get h3
Run h3 on S. Classifications: {1, 1, −1, 1, −1, −1, 1, −1}
Calculate error: x3 and x7 are misclassified, so ε3 = 0.139 + 0.125 = 0.264
Calculate α’s: α3 = (1/2) ln((1 − 0.264) / 0.264) ≈ 0.513
Ensemble classifier: H(x) = sign(0.255 h1(x) + 0.186 h2(x) + 0.513 h3(x))
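A quick sanity check of the three coefficients from the errors computed above:

```python
import math

# Recompute each round's coefficient alpha_t = (1/2) ln((1 - eps_t) / eps_t)
for t, eps in enumerate([0.375, 0.408, 0.264], start=1):
    print(f"alpha_{t} = {0.5 * math.log((1 - eps) / eps):.3f}")
# alpha_1 = 0.255, alpha_2 = 0.186, alpha_3 = 0.513
```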
On test examples 1–8: [table shown in the slide giving the predictions of h1, h2, and h3 on each test example xi and the resulting weighted vote] Test accuracy: 3/8.
Genetic Algorithms
Selection methods
– Fitness proportionate selection
– Rank selection
– Elite selection
– Tournament selection
Example
Fitness:
   individual 1: 30
   individual 2: 20
   individual 3: 50
   individual 4: 10
Fitness proportionate probabilities? Rank probabilities? Elite probabilities (top 50%)?
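A Python sketch computing all three answers. Rank-selection conventions vary; this one makes the selection probability proportional to fitness rank, with the fittest individual receiving the highest rank:

```python
fitness = {1: 30, 2: 20, 3: 50, 4: 10}
total = sum(fitness.values())

# Fitness proportionate: p_i = f_i / sum_j f_j
prop = {i: f / total for i, f in fitness.items()}

# Rank selection: p_i proportional to the individual's rank (1 = worst).
ranks = {ind: r for r, ind in enumerate(sorted(fitness, key=fitness.get), start=1)}
rank_p = {i: ranks[i] / sum(ranks.values()) for i in fitness}

# Elite selection (top 50%): only the best half are eligible, uniformly.
elite = sorted(fitness, key=fitness.get, reverse=True)[:len(fitness) // 2]
elite_p = {i: (1 / len(elite) if i in elite else 0.0) for i in fitness}

print(prop)     # {1: ~0.27, 2: ~0.18, 3: ~0.45, 4: ~0.09}
print(rank_p)   # {1: 0.3, 2: 0.2, 3: 0.4, 4: 0.1}
print(elite_p)  # {1: 0.5, 2: 0.0, 3: 0.5, 4: 0.0}
```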
Reinforcement Learning / Q Learning
Q learning algorithm
–For each (s, a), initialize Q(s, a) to be zero (or a small value).
–Observe the current state s.
–Do forever:
   Select an action a and execute it.
   Receive immediate reward r.
   Learn:
   –Observe the new state s´
   –Update the table entry for Q(s, a) as follows:
      Q(s, a) ← Q(s, a) + η (r + γ max a´ Q(s´, a´) − Q(s, a))
   s ← s´
Simple illustration of Q learning
– C gives reward of 5 points. Each action has reward of −1. No other rewards or penalties.
– States are numbered squares (grid shown in the slide, with the robot R and the cheese C).
– Actions (N, E, S, W) are selected at random.
– Assume γ = 0.8, η = 1
Step 1
Current state s = 1
Select action a = Move South
Reward r = −1
New state s´ = 4
Learn: Q(s, a) ← Q(s, a) + η (r + γ max a´ Q(s´, a´) − Q(s, a)), so Q(1, S) ← 0 + 1 · (−1 + 0.8 · 0 − 0) = −1
All other entries in the Q(s, a) table (actions N, S, E, W) remain 0.
Update state: Current state = 4
Step 2
Current state s = 4
Select action a =
Reward r =
New state s´ =
Learn: Q(s, a) ← Q(s, a) + η (r + γ max a´ Q(s´, a´) − Q(s, a))
Update state: Current state =
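A complete run of this procedure, as a minimal Python sketch. The slide shows the grid only as a figure, so the 2 × 3 layout, the goal square, and the restart-after-cheese behavior are assumptions:

```python
import random

# Assumed layout (hypothetical): a 2 x 3 grid of states
#    1 2 3
#    4 5 6
# with the robot starting at 1 and the cheese C at 6. Reaching C pays +5 and,
# like every action, costs 1; the episode then restarts at state 1.
GAMMA, ETA = 0.8, 1.0
COLS, ROWS, GOAL, N_STATES = 3, 2, 6, 6
ACTIONS = {'N': -COLS, 'S': COLS, 'E': 1, 'W': -1}

def step(s, a):
    """Apply action a in state s; bumping into a wall leaves s unchanged."""
    row, col = divmod(s - 1, COLS)
    if (a == 'N' and row == 0) or (a == 'S' and row == ROWS - 1) \
       or (a == 'E' and col == COLS - 1) or (a == 'W' and col == 0):
        return s
    return s + ACTIONS[a]

Q = {(s, a): 0.0 for s in range(1, N_STATES + 1) for a in ACTIONS}
s = 1
for _ in range(20000):
    a = random.choice(list(ACTIONS))            # actions selected at random
    s2 = step(s, a)
    r = -1 + (5 if s2 == GOAL else 0)           # -1 per action, +5 for reaching C
    Q[(s, a)] += ETA * (r + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS) - Q[(s, a)])
    s = 1 if s2 == GOAL else s2                 # restart after the cheese
print(Q[(1, 'S')])   # converged value; the very first update set Q(1, S) = -1, as in Step 1
```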