
Topics on Final: Perceptrons, SVMs, Precision/Recall/ROC, Decision Trees, Naive Bayes, Bayesian networks, Adaboost, Genetic algorithms, Q learning. Not on the final: MLPs, PCA.

Rules for Final
– Open book, notes, computer, calculator
– No discussion with others
– You can ask me or Dona general questions about a topic
– Read each question carefully
– Hand in your own work only
– Turn in to the box at the CS front desk or to me (hardcopy or email) by 5pm Wednesday, March 21. No extensions.

Short recap of important topics

Perceptrons

Training a perceptron
1. Start with random weights, w = (w1, w2, ..., wn).
2. Select a training example (xk, tk).
3. Run the perceptron with input xk and weights w to obtain output o.
4. Let η be the learning rate (a user-set parameter). Update each weight:
   wi ← wi + η (tk − o) xk,i
5. Go to 2.
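The loop above can be sketched in code. This is a minimal sketch, not the course's implementation: `train_perceptron` is a hypothetical helper that learns logical AND, assuming a {−1, +1} target coding and the bias folded in as w[0].

```python
import random

def train_perceptron(data, eta=0.1, epochs=100, seed=0):
    """data: list of (x, t) pairs, t in {-1, +1}. Returns weights; w[0] is bias."""
    rng = random.Random(seed)
    n = len(data[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for _ in range(epochs):
        for x, t in data:
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if s > 0 else -1
            # Perceptron rule: w_i <- w_i + eta * (t - o) * x_i
            w[0] += eta * (t - o)
            for i, xi in enumerate(x):
                w[i + 1] += eta * (t - o) * xi
    return w

# Logical AND is linearly separable, so training converges
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_data)

def predict(x):
    return 1 if w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)) > 0 else -1
```

Because weights change only on mistakes, the loop stops moving once the data is classified correctly.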

Support Vector Machines

Here, assume positive and negative instances are to be separated by the hyperplane w1x1 + w2x2 + b = 0 (a line in the (x1, x2) plane).

Intuition: the best hyperplane (for future generalization) will “maximally” separate the examples

Definition of Margin

Minimizing ||w|| Find w and b by doing the following minimization: minimize (1/2)||w||² subject to yi(w · xi + b) ≥ 1 for every training example (xi, yi). This is a quadratic optimization problem. Use "standard optimization tools" to solve it.

Dual formulation: It turns out that w can be expressed as a linear combination of a small subset of the training examples xi: w = Σi αi xi, where the sum ranges over those xi that lie exactly on the margin (minimum distance to the hyperplane). These training examples are called "support vectors". They carry all relevant information about the classification problem.

The results of the SVM training algorithm (involving solving a quadratic programming problem) are the αi and the bias b. The support vectors are all xi such that αi > 0. Clarification: In the slides below we use αi to denote |αi| · yi, where yi ∈ {−1, 1}.

For a new example x, we can now classify x using the support vectors: h(x) = sgn(Σi αi (xi · x) + b), where the sum ranges over the support vectors. This is the resulting SVM classifier.

SVM review
Equation of line: w1x1 + w2x2 + b = 0
Define the margin using the two parallel lines w · x + b = +1 and w · x + b = −1; the margin distance (from the separating line to either margin line) is 1/||w||.
To maximize the margin, we minimize ||w|| subject to the constraint that positive examples fall on one side of the margin, and negative examples on the other side: yi(w1xi,1 + w2xi,2 + b) ≥ 1 for all i.
We can relax this constraint using "slack variables".

SVM review To do the optimization, we use the dual formulation, in which the optimization variables are the αi rather than w. The results of the optimization "black box" are the αi and b. The support vectors are all xi such that αi ≠ 0.

SVM review Once the optimization is done, we can classify a new example x as follows: h(x) = sgn(Σi αi (xi · x) + b). That is, classification is done entirely through a linear combination of dot products with training examples. This is a "kernel" method.

Example

Example
Input to SVM optimizer: a table of training examples with columns x1, x2, class.

Example
Input to SVM optimizer: a table of training examples with columns x1, x2, class.
Output from SVM optimizer:
Support vector   α
(−1, 0)    −.208
(1, 1)      .416
(0, −1)    −.208
b = −.376

Example
Input to SVM optimizer: a table of training examples with columns x1, x2, class.
Output from SVM optimizer:
Support vector   α
(−1, 0)    −.208
(1, 1)      .416
(0, −1)    −.208
b = −.376
Weight vector: w = Σi αi xi = −.208·(−1, 0) + .416·(1, 1) − .208·(0, −1) = (.624, .624)

Example
Input to SVM optimizer: a table of training examples with columns x1, x2, class.
Output from SVM optimizer:
Support vector   α
(−1, 0)    −.208
(1, 1)      .416
(0, −1)    −.208
b = −.376
Weight vector: w = Σi αi xi = (.624, .624)
Separation line: .624 x1 + .624 x2 − .376 = 0

Example Classifying a new point:
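The classification step can be checked numerically against the optimizer output above. This sketch hard-codes the slide's support vectors, α's, and b; `svm_decision` and `classify` are hypothetical helper names.

```python
# Support vectors and (signed) alphas from the slide's optimizer output
svs    = [(-1, 0), (1, 1), (0, -1)]
alphas = [-0.208, 0.416, -0.208]
b      = -0.376

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_decision(x):
    """f(x) = sum_i alpha_i (x_i . x) + b; the class is the sign of f(x)."""
    return sum(a * dot(sv, x) for a, sv in zip(alphas, svs)) + b

def classify(x):
    return 1 if svm_decision(x) > 0 else -1

# Recover the weight vector w = sum_i alpha_i x_i from the support vectors
w = [sum(a * sv[j] for a, sv in zip(alphas, svs)) for j in range(2)]
```

Note that `classify` never forms w explicitly: it works entirely through dot products with the support vectors, which is what makes the kernel trick possible.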

Precision/Recall/ROC

Creating a Precision/Recall Curve
Results of classifier: a table with columns Threshold, Accuracy, Precision, Recall, one row per threshold from ∞ downward.

Creating a ROC Curve
Results of classifier: a table with columns Threshold, Accuracy, TPR, FPR, one row per threshold from ∞ downward.
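The two constructions above differ only in which columns are derived from the confusion counts at each threshold. A sketch with hypothetical scores and labels (`curve_points` is an assumed helper) that produces one (threshold, accuracy, precision, recall, FPR) row per threshold:

```python
def curve_points(scores, labels, thresholds):
    """Predict + when score >= threshold; compute accuracy, precision,
    recall (= TPR), and FPR from the confusion counts at each threshold."""
    points = []
    for th in thresholds:
        tp = fp = tn = fn = 0
        for s, y in zip(scores, labels):
            pred = 1 if s >= th else -1
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == -1:
                fp += 1
            elif pred == -1 and y == -1:
                tn += 1
            else:
                fn += 1
        acc  = (tp + tn) / len(labels)
        prec = tp / (tp + fp) if tp + fp else 1.0  # convention at threshold infinity
        rec  = tp / (tp + fn) if tp + fn else 0.0
        fpr  = fp / (fp + tn) if fp + tn else 0.0
        points.append((th, acc, prec, rec, fpr))
    return points

# Hypothetical classifier scores and true labels
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, -1, 1, 1, -1, -1, -1]
pts = curve_points(scores, labels, thresholds=[float("inf"), 0.7, 0.5, 0.0])
```

Plotting (recall, precision) pairs gives the precision/recall curve; plotting (FPR, TPR) pairs gives the ROC curve.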

Precision/Recall versus ROC curves


Decision Trees

Naive Bayes

Naive Bayes classifier: Assume the attributes are conditionally independent given the class:
P(a1, a2, ..., an | c) = Πi P(ai | c)
Given this assumption, here's how to classify an instance x = ⟨a1, a2, ..., an⟩:
class(x) = argmax over c of P(c) Πi P(ai | c)
We can estimate the values of these various probabilities over the training set.

In-class example
Training set: a table of instances over binary attributes a1, a2, a3, each labeled + or − (the last two rows are 1 1 0 → − and 1 0 0 → −).
What class would be assigned by a NB classifier to a1a2a3 = 111?

Laplace smoothing (also called "add-one" smoothing)
For each class cj and attribute ai with value z, add one "virtual" instance. That is, recalculate:
P(ai = z | cj) = (nz + 1) / (n + k)
where n is the number of training instances in class cj, nz is the number of those with ai = z, and k is the number of possible values of attribute ai.
Exercise, using the training set above: compute the smoothed values of P(a1=1 | +), P(a1=0 | +), P(a1=1 | −), and P(a1=0 | −).
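Naive Bayes with add-one smoothing can be sketched end-to-end. The training rows below are hypothetical (not the in-class table), and `train_nb` is an assumed helper; it assumes binary attributes, so k = 2.

```python
from collections import defaultdict

def train_nb(rows, k=2):
    """rows: list of (attributes, label). Uses Laplace ("add-one") smoothing:
    P(a_i = z | c) = (count(a_i = z in class c) + 1) / (count(c) + k)."""
    class_counts = defaultdict(int)
    attr_counts = defaultdict(int)     # key: (attr index, value, label)
    for attrs, label in rows:
        class_counts[label] += 1
        for i, v in enumerate(attrs):
            attr_counts[(i, v, label)] += 1

    def classify(x):
        total = len(rows)
        best, best_p = None, -1.0
        for c, nc in class_counts.items():
            p = nc / total             # prior P(c)
            for i, v in enumerate(x):
                p *= (attr_counts[(i, v, c)] + 1) / (nc + k)
            if p > best_p:
                best, best_p = c, p
        return best

    return classify

# Hypothetical binary training set
rows = [((1, 1, 1), "+"), ((1, 0, 1), "+"), ((0, 0, 1), "+"),
        ((1, 1, 0), "-"), ((1, 0, 0), "-"), ((0, 1, 0), "-")]
classify = train_nb(rows)
```

Smoothing matters here: P(a3=1 | −) has a raw count of zero, and without the virtual instance a single attribute would veto the entire product.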

Bayesian Networks

Methods used in computing probabilities
Definition of conditional probability: P(A | B) = P(A, B) / P(B)
Bayes theorem: P(A | B) = P(B | A) P(A) / P(B)
Semantics of Bayesian networks: P(A ^ B ^ C ^ D) = P(A | Parents(A)) · P(B | Parents(B)) · P(C | Parents(C)) · P(D | Parents(D))
Calculating marginal probabilities

What is P(Cloudy| Sprinkler)?

What is P(Cloudy| Wet Grass)?
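Both queries can be answered exactly by summing the full joint, using the network semantics above. The CPT numbers below are the standard textbook values for the sprinkler network (an assumption; substitute the slide's numbers if they differ), and `joint` and `query` are hypothetical helpers.

```python
from itertools import product

# CPTs for Cloudy -> {Sprinkler, Rain} -> WetGrass (assumed textbook values)
P_C = {True: 0.5, False: 0.5}
P_S = {True: 0.1, False: 0.5}                      # P(Sprinkler=true | Cloudy)
P_R = {True: 0.8, False: 0.2}                      # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,    # P(WetGrass=true | S, R)
       (False, True): 0.90, (False, False): 0.0}

def joint(c, s, r, w):
    """Network semantics: product of each node's CPT entry given its parents."""
    p = P_C[c]
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

def query(target, evidence):
    """P(target = true | evidence), by enumerating the full joint."""
    names = ("C", "S", "R", "W")
    num = den = 0.0
    for vals in product([True, False], repeat=4):
        assign = dict(zip(names, vals))
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = joint(*vals)
        den += p
        if assign[target]:
            num += p
    return num / den

p_cloudy_given_sprinkler = query("C", {"S": True})   # = 0.05 / 0.30 = 1/6
```

With these CPTs, P(Cloudy | Sprinkler = true) = P(S | C) P(C) / P(S) = .05 / .30 ≈ .167, and conditioning on WetGrass instead raises the probability of Cloudy above one half.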

Markov Chain Monte Carlo Algorithm
Markov blanket of a variable Xi: its parents, children, and children's other parents.
MCMC algorithm, for a given set of evidence variables {Xj = xk}:
– Start with a random sample of the variables, with evidence variables fixed: (x1, ..., xn). This is the current "state" of the algorithm.
– Repeat for NumSamples: obtain the next state by randomly sampling a value for one non-evidence variable Xi, conditioned on the current values in the Markov blanket of Xi.
– Finally, return the estimated distribution of each non-evidence variable Xi.

Example Query: What is P(Sprinkler = true | WetGrass = true)?
MCMC:
– Random sample, with evidence variables fixed: [Cloudy, Sprinkler, Rain, WetGrass] = [true, true, false, true]
– Repeat:
1. Sample Cloudy, given current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose the result is false. New state: [false, true, false, true]. Note that current values of the Markov blanket remain fixed.
2. Sample Sprinkler, given current values of its Markov blanket: Cloudy = false, Rain = false, WetGrass = true. Suppose the result is true. New state: [false, true, false, true].

Each sample contributes to the estimate for the query P(Sprinkler = true | WetGrass = true). Suppose we perform 50 such samples, 20 with Sprinkler = true and 30 with Sprinkler = false. Then the answer to the query is Normalize(⟨20, 30⟩) = ⟨.4, .6⟩.
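The sampling procedure can be sketched as code. The CPT values are again the standard textbook sprinkler numbers (an assumption), and `mcmc` is a hypothetical helper; for this tiny network, sampling a variable given its Markov blanket is done by renormalizing the full joint over the variable's two settings, which is equivalent.

```python
import random

# CPTs for the sprinkler network (assumed textbook values)
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                      # P(Sprinkler=true | Cloudy)
P_R = {True: 0.8, False: 0.2}                      # P(Rain=true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,
       (False, True): 0.90, (False, False): 0.0}   # P(WetGrass=true | S, R)

def joint(c, s, r, w):
    p = P_C if c else 1 - P_C
    p *= P_S[c] if s else 1 - P_S[c]
    p *= P_R[c] if r else 1 - P_R[c]
    p *= P_W[(s, r)] if w else 1 - P_W[(s, r)]
    return p

def mcmc(query_var, evidence, n_samples=20000, seed=0):
    """Gibbs sampling: resample one non-evidence variable per step,
    conditioned on the rest; tally the query variable after each step."""
    rng = random.Random(seed)
    state = {v: True for v in "CSRW"}   # any positive-probability start works
    state.update(evidence)
    free = [v for v in "CSRW" if v not in evidence]
    n_true = 0
    for _ in range(n_samples):
        v = rng.choice(free)
        probs = {}
        for val in (True, False):
            state[v] = val
            probs[val] = joint(state["C"], state["S"], state["R"], state["W"])
        state[v] = rng.random() < probs[True] / (probs[True] + probs[False])
        n_true += state[query_var]
    return n_true / n_samples

p = mcmc("S", {"W": True})
```

With these CPTs the exact answer is P(Sprinkler = true | WetGrass = true) ≈ .43, and the estimate approaches it as the number of samples grows.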

Adaboost

Sketch of algorithm
Given data S and learning algorithm L:
Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hT.
At each step, derive St from S by choosing examples probabilistically according to probability distribution wt. Use St to learn ht.
At each step, derive wt+1 by giving more probability to examples that were misclassified at step t.
The final ensemble classifier H is a weighted sum of the ht's, with each weight being a function of the corresponding ht's error on its training set.

Adaboost algorithm
Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X and yi ∈ {+1, −1}.
Initialize w1(i) = 1/N (uniform distribution over the data).

For t = 1, ..., T:
– Select new training set St from S with replacement, according to wt
– Train L on St to obtain hypothesis ht
– Compute the training error εt of ht on S:
  εt = Σi wt(i) · I(ht(xi) ≠ yi)
– If εt ≥ 0.5, break from loop.
– Compute coefficient αt = (1/2) ln((1 − εt) / εt)

– Compute new weights on the data:
  wt+1(i) = wt(i) exp(−αt yi ht(xi)) / Zt
where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:
  Zt = Σi wt(i) exp(−αt yi ht(xi))

At the end of T iterations of this algorithm, we have h1, h2, ..., hT. We also have α1, α2, ..., αT, where αt = (1/2) ln((1 − εt) / εt).
Ensemble classifier: H(x) = sgn(Σt αt ht(x))
Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
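The loop above can be sketched in code. Two substitutions keep the sketch deterministic and self-contained, and both are assumptions rather than the slides' setup: the weak learner is a decision stump instead of svm_light, and it minimizes weighted error on S directly instead of learning from a resampled St.

```python
import math

def best_stump(X, y, w):
    """Weak learner: single-feature threshold rule minimizing weighted error."""
    best, best_err = None, float("inf")
    for f in range(len(X[0])):
        for th in sorted({x[f] for x in X}):
            for sign in (1, -1):
                h = (lambda f, th, s: lambda x: s if x[f] >= th else -s)(f, th, sign)
                err = sum(wi for x, yi, wi in zip(X, y, w) if h(x) != yi)
                if err < best_err:
                    best, best_err = h, err
    return best, best_err

def adaboost(X, y, T=3):
    N = len(X)
    w = [1.0 / N] * N
    ensemble = []                                  # (alpha_t, h_t) pairs
    for _ in range(T):
        h, eps = best_stump(X, y, w)
        if eps >= 0.5:
            break                                  # no better than chance
        alpha = 0.5 * math.log((1 - eps) / max(eps, 1e-12))
        ensemble.append((alpha, h))
        # Reweight: misclassified examples gain probability, then normalize
        w = [wi * math.exp(-alpha * yi * h(x)) for wi, x, yi in zip(w, X, y)]
        Z = sum(w)
        w = [wi / Z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

# Logical OR: no single stump is perfect, but a weighted vote of three is
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, 1]
H = adaboost(X, y, T=3)
```

Each round's stump fixes the mistakes the previous rounds up-weighted, which is exactly why the ensemble can represent a function no individual weak learner can.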

A Simple Example
t = 1
S = Spam8.train: x1, x2, x3, x4 (class +1); x5, x6, x7, x8 (class −1)
w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8}
S1 = {x1, x2, x2, x5, x5, x6, x7, x8}
Run svm_light on S1 to get h1.
Run h1 on S. Classifications: {1, −1, −1, −1, −1, −1, −1, −1}
Calculate error: x2, x3, x4 are misclassified, so ε1 = 3 · (1/8) = .375

Calculate α's: α1 = (1/2) ln((1 − .375) / .375) ≈ .255
Calculate new w's: w2(i) = w1(i) exp(−α1 yi h1(xi)) / Z1

t = 2
w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102}
S2 = {x1, x2, x2, x3, x4, x4, x7, x8}
Run svm_light on S2 to get h2.
Run h2 on S. Classifications: {1, 1, 1, 1, 1, 1, 1, 1}
Calculate error: x5, x6, x7, x8 are misclassified, so ε2 = 4 · 0.102 = .408

Calculate α's: α2 = (1/2) ln((1 − .408) / .408) ≈ .187
Calculate w's: w3(i) = w2(i) exp(−α2 yi h2(xi)) / Z2

t = 3
w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125}
S3 = {x2, x3, x3, x3, x5, x6, x7, x8}
Run svm_light on S3 to get h3.
Run h3 on S. Classifications: {1, 1, −1, 1, −1, −1, 1, −1}
Calculate error: x3 and x7 are misclassified, so ε3 = 0.139 + 0.125 = .264

Calculate α's: α3 = (1/2) ln((1 − .264) / .264) ≈ .513
Ensemble classifier: H(x) = sgn(.255 h1(x) + .187 h2(x) + .513 h3(x))

On test examples 1–8: tabulate each ht's classification of each test example; combining the columns with the α's above gives the ensemble's predictions. Test accuracy: 3/8.

Genetic Algorithms

Selection methods Fitness proportionate selection Rank selection Elite selection Tournament selection

Example
Fitness values: individual 1: 30; individual 2: 20; individual 3: 50; individual 4: 10.
Fitness-proportionate probabilities? Rank probabilities? Elite probabilities (top 50%)?
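The three selection schemes can be computed for exactly these fitness values. `proportionate_probs`, `rank_probs`, `elite_probs`, and `select` are hypothetical helper names; rank selection here assigns probability proportional to rank with the worst individual ranked 1, which is one common convention among several.

```python
import random

fitness = {1: 30, 2: 20, 3: 50, 4: 10}

def proportionate_probs(fit):
    """Fitness-proportionate: p_i = f_i / sum_j f_j."""
    total = sum(fit.values())
    return {i: f / total for i, f in fit.items()}

def rank_probs(fit):
    """Rank selection: p proportional to rank, worst individual ranked 1."""
    ranked = sorted(fit, key=fit.get)          # ascending fitness
    denom = sum(range(1, len(fit) + 1))
    return {ind: (r + 1) / denom for r, ind in enumerate(ranked)}

def elite_probs(fit, frac=0.5):
    """Elite selection: the top frac share probability uniformly; rest get 0."""
    n_keep = int(len(fit) * frac)
    keep = sorted(fit, key=fit.get, reverse=True)[:n_keep]
    return {i: (1 / n_keep if i in keep else 0.0) for i in fit}

def select(probs, rng):
    """Draw one individual according to a probability table."""
    inds, ps = zip(*probs.items())
    return rng.choices(inds, weights=ps, k=1)[0]
```

For the fitness values on the slide: proportionate gives 30/110, 20/110, 50/110, 10/110; rank gives .3, .2, .4, .1; elite (top 50%) gives .5 each to individuals 3 and 1.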

Reinforcement Learning / Q Learning

Q learning algorithm
– For each (s, a), initialize Q(s, a) to zero (or a small value).
– Observe the current state s.
– Do forever:
  Select an action a and execute it.
  Receive immediate reward r.
  Observe the new state s′.
  Learn: update the table entry for Q(s, a) as follows:
    Q(s, a) ← Q(s, a) + η (r + γ maxa′ Q(s′, a′) − Q(s, a))
  s ← s′


Simple illustration of Q learning
C gives a reward of 5 points. Each action has reward −1. No other rewards or penalties. States are numbered squares; the grid also shows the agent R and the goal square C. Actions (N, E, S, W) are selected at random. Assume γ = 0.8, η = 1.

Step 1
Current state s = 1
Q(s, a) table (one row per state, columns N, S, E, W), initially all zero.

Step 1
Current state s = 1
Select action a = Move South

Step 1
Current state s = 1
Select action a = Move South
Reward r = −1
New state s′ = 4

Step 1
Current state s = 1
Select action a = Move South
Reward r = −1
New state s′ = 4
Learn: Q(s, a) ← Q(s, a) + η (r + γ maxa′ Q(s′, a′) − Q(s, a)), so Q(1, S) ← 0 + 1 · (−1 + 0.8 · 0 − 0) = −1

Step 1
Current state s = 1
Select action a = Move South
Reward r = −1
New state s′ = 4
Learn: Q(1, S) ← 0 + 1 · (−1 + 0.8 · 0 − 0) = −1
Update state: Current state = 4

Step 2
Current state s = 4
Select action a =
Reward r =
New state s′ =
Learn: Q(s, a) ← Q(s, a) + η (r + γ maxa′ Q(s′, a′) − Q(s, a))
Update state: Current state =
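The full loop can be run to convergence on a small grid. The 2×3 layout, state numbering, and goal placement below are assumptions (the slide's grid is not fully reproduced here); with η = 1 and a deterministic world, the update reduces to Q(s, a) ← r + γ maxa′ Q(s′, a′).

```python
import random

# Hypothetical 2x3 grid, states 0..5 laid out as
#   0 1 2
#   3 4 5
# Reaching goal state C = 5 gives reward +5 and ends the episode;
# every other move costs -1, matching the slide's reward scheme.
GOAL = 5

def step(s, a):
    """Deterministic transition; bumping a wall leaves the state unchanged."""
    r, c = divmod(s, 3)
    if a == "N" and r > 0:
        s -= 3
    if a == "S" and r < 1:
        s += 3
    if a == "E" and c < 2:
        s += 1
    if a == "W" and c > 0:
        s -= 1
    reward = 5 if s == GOAL else -1
    return s, reward

def q_learn(episodes=500, gamma=0.8, eta=1.0, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(6) for a in "NSEW"}
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = rng.choice("NSEW")     # actions selected at random, as on the slide
            s2, r = step(s, a)
            best_next = max(Q[(s2, a2)] for a2 in "NSEW")
            # Q(s,a) <- Q(s,a) + eta * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += eta * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Q = q_learn()
greedy = {s: max("NSEW", key=lambda a: Q[(s, a)]) for s in range(5)}
```

After convergence the greedy policy read off the table heads toward the goal from every state, even though the behavior policy never stopped acting randomly; that off-policy property is the point of Q learning.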