Classification (continued)
Prediction and Classification
Last week we discussed the classification problem, using the Naïve Bayes method.
Today we will go into more detail. But first: how do we evaluate a classifier?
Abstract Binary Classification Problem
Given n data samples $(x_1, y_1), \dots, (x_n, y_n)$, where $x_i$ is a data vector and $y_i \in \{-1, 1\}$ is its label.
The aim is to learn a function $f: X \to \{-1, 1\}$ such that f is “accurate” on unseen data. [The problem is ill-specified as defined.]
Algorithms to Learn a Classifier
We can use an algorithm A to learn the function $f: X \to Y$. We then write f as $f_A$.
One example of A is Naïve Bayes. Other examples: Logistic Regression, Neural Networks, Support Vector Machines, Decision Trees, Random Forests, …
Training vs. Test Data
In practice, to take care of the “unseen” part, we split the data into a training set and a test set.
We learn $f_A$ on the training set using an algorithm A.
The learned function $f_A$ is then evaluated on the test set.
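The split described above can be sketched in a few lines of Python. This is only an illustration: the helper name `split_train_test`, the 25% test fraction, the fixed seed and the toy data are all assumptions, not part of the lecture.

```python
import random

def split_train_test(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and return (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Toy labelled data (x_i, y_i) with y_i in {-1, 1}; made up for illustration.
data = [(i, 1 if i % 2 == 0 else -1) for i in range(8)]
train, test = split_train_test(data)
print(len(train), "training samples,", len(test), "test samples")
```

The classifier would be learned from `train` only; `test` is held back and used solely for evaluation.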
Example
Suppose we learn a function F on the training set. Our test set consists of four data points: (z1, 1), (z2, -1), (z3, 1), (z4, -1).
We apply F to the four data points (without their labels) and get F(z1) = 1, F(z2) = 1, F(z3) = -1 and F(z4) = -1.
Then F correctly classified z1 and z4 but incorrectly classified z2 and z3.
Confusion Matrix

                       Actual Label (1)        Actual Label (-1)
Predicted Label (1)    True Positives (N1)     False Positives (N2)
Predicted Label (-1)   False Negatives (N3)    True Negatives (N4)

Label 1 is called Positive; Label -1 is called Negative.
Let the number of test samples be N, so N = N1 + N2 + N3 + N4.
True Positive Rate (TPR) = N1/(N1+N3)
True Negative Rate (TNR) = N4/(N4+N2)
False Positive Rate (FPR) = N2/(N2+N4)
False Negative Rate (FNR) = N3/(N1+N3)
Accuracy = (N1+N4)/(N1+N2+N3+N4)
Precision = N1/(N1+N2)
Recall = N1/(N1+N3)
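As a rough sketch, the four counts N1 to N4 can be tallied directly from the actual and predicted labels. The function name `confusion_counts` is hypothetical; the data is the four-point example with z1 to z4 from the earlier slide.

```python
def confusion_counts(actual, predicted):
    """Return (N1, N2, N3, N4) = (TP, FP, FN, TN), treating label 1 as positive."""
    tp = fp = fn = tn = 0
    for y, y_hat in zip(actual, predicted):
        if y_hat == 1 and y == 1:
            tp += 1   # predicted 1, actually 1
        elif y_hat == 1 and y == -1:
            fp += 1   # predicted 1, actually -1
        elif y_hat == -1 and y == 1:
            fn += 1   # predicted -1, actually 1
        else:
            tn += 1   # predicted -1, actually -1
    return tp, fp, fn, tn

actual = [1, -1, 1, -1]      # true labels of z1..z4
predicted = [1, 1, -1, -1]   # F(z1)..F(z4)
counts = confusion_counts(actual, predicted)
print(counts)
```

Here z1 is a true positive, z2 a false positive, z3 a false negative and z4 a true negative.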
Example

                       Actual Label (1)   Actual Label (-1)
Predicted Label (1)    10                 3
Predicted Label (-1)   2                  20

TPR = 10/12 = 5/6; TNR = 20/23; FPR = 3/23; FNR = 2/12 = 1/6; Accuracy = 30/35 = 6/7
Precision = 10/13 and Recall = 10/12 = 5/6
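The arithmetic in this example can be checked with exact fractions. The sketch below assumes the counts N1 = 10, N2 = 3, N3 = 2, N4 = 20 read off the table; the function name `rates` is made up for illustration.

```python
from fractions import Fraction

def rates(n1, n2, n3, n4):
    """Evaluation measures from confusion-matrix counts (TP, FP, FN, TN)."""
    return {
        "TPR": Fraction(n1, n1 + n3),
        "TNR": Fraction(n4, n4 + n2),
        "FPR": Fraction(n2, n2 + n4),
        "FNR": Fraction(n3, n1 + n3),
        "Accuracy": Fraction(n1 + n4, n1 + n2 + n3 + n4),
        "Precision": Fraction(n1, n1 + n2),
        "Recall": Fraction(n1, n1 + n3),
    }

m = rates(10, 3, 2, 20)
for name, value in m.items():
    print(name, "=", value)
```

Note that TPR and Recall are the same quantity, and that TPR + FNR = 1 and TNR + FPR = 1.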
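ROC (Receiver Operating Characteristic) Curves
Generally a learning algorithm A will return a real number (a score), but what we want is a label (1 or -1).
We can apply a threshold T to the score to turn it into a label; each choice of T gives its own TPR and FPR.
[Figure: two panels, each with columns A, T and True Label, for two different thresholds. One threshold gives TPR = 3/4, FPR = 2/5; the other gives TPR = 2/4, FPR = 2/5.]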
ROC Curve
An ROC curve is a plot with FPR on the x-axis and TPR on the y-axis; for each threshold t, the curve contains the point (FPR(t), TPR(t)).
Let's look at the Wikipedia ROC entry.
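A minimal sketch of how the points of an ROC curve could be computed. The scores and labels below are invented, and the convention "predict 1 when score >= t" is an assumption; the slides do not fix which side of the threshold is positive.

```python
def roc_points(scores, labels):
    """Return a list of (threshold, FPR, TPR), predicting 1 when score >= threshold."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    # Start above every score (nothing predicted positive), then use each distinct score.
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == -1)
        points.append((t, fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, -1, 1, -1, -1]
points = roc_points(scores, labels)
for t, fpr, tpr in points:
    print(f"t={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

As the threshold is lowered, more samples are labelled positive, so both FPR and TPR can only stay the same or increase; the curve runs from (0, 0) to (1, 1).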
Discussion
If F: Symptoms → {Disease, No-Disease}
– Do we want higher recall or higher precision?
– What is the relative cost of a misdiagnosis (and in which direction)?
If F: Banner Ad → {Click, No-Click}
– Does higher precision mean more revenue?
Random Variables
A random variable (r.v.) is a numerical quantity associated with events in an experiment.
Suppose we roll two dice and let X be the sum of the two faces. X can take values in {2, …, 12}.
P(X = 12) = 1/36. Why?
– The event associated with X = 12 is {(6,6)}.
P(X = 7) = 6/36 = 1/6.
– Associated event: {(1,6), (6,1), (2,5), (5,2), (3,4), (4,3)}
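The two probabilities can be verified by enumerating all 36 equally likely outcomes of the two dice. This is a small sketch using only the standard library; the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes (first die, second die).
outcomes = list(product(range(1, 7), repeat=2))

# Count how many outcomes map to each value of X = sum of the faces.
counts = Counter(a + b for a, b in outcomes)
pmf = {k: Fraction(c, len(outcomes)) for k, c in counts.items()}

print("P(X=12) =", pmf[12])
print("P(X=7)  =", pmf[7])
```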
Random Variable
A random variable X can take values in a set which is:
– Discrete and finite. Toss a coin and let X = 1 if it is a head and X = 0 if it is a tail. X is a random variable.
– Discrete and infinite (countable). Let X be the number of accidents in Sydney in a day. Then X = 0, 1, 2, …
– Infinite (uncountable). Let X be the height of a Sydney-sider. Then X = 150, …
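Random Variable Properties
Let X be a discrete-valued random variable taking values in a set S.
The expected (average) value of X is $E(X) = \sum_{x \in S} x \, P(X = x)$.
The variance is $\mathrm{Var}(X) = E\big[(X - E(X))^2\big] = \sum_{x \in S} (x - E(X))^2 \, P(X = x)$.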
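Examples
Let X be a random variable which takes the value 1 with probability p and 0 with probability 1-p. Then
$E(X) = 1 \cdot p + 0 \cdot (1 - p) = p$ and $\mathrm{Var}(X) = (1 - p)^2 \, p + (0 - p)^2 (1 - p) = p(1 - p)$.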
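For a discrete random variable, the expected value is the probability-weighted sum of its values and the variance is the probability-weighted sum of squared deviations from that expected value. The sketch below applies those two definitions to the 0/1 variable above; the value p = 3/10 is an arbitrary choice for illustration and the function names are hypothetical.

```python
from fractions import Fraction

def expectation(pmf):
    """E(X) = sum over x of x * P(X = x); pmf maps value -> probability."""
    return sum(x * prob for x, prob in pmf.items())

def variance(pmf):
    """Var(X) = sum over x of (x - E(X))^2 * P(X = x)."""
    mu = expectation(pmf)
    return sum((x - mu) ** 2 * prob for x, prob in pmf.items())

p = Fraction(3, 10)            # arbitrary success probability
bernoulli = {1: p, 0: 1 - p}   # X = 1 with probability p, X = 0 otherwise

print("E(X)   =", expectation(bernoulli))
print("Var(X) =", variance(bernoulli))
```

With exact fractions the results agree with p and p(1-p) for this choice of p.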
Examples
Let X be a random variable which denotes the number of spam emails in a batch of n emails, assuming each email is spam independently with probability p.
For example, with n = 5, X takes values in {0, 1, 2, 3, 4, 5}.
X follows a binomial distribution with parameters (n, p): X ~ Binomial(n, p)
– E(X) = np; Var(X) = np(1-p)
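A quick numerical check of E(X) = np and Var(X) = np(1-p), using the standard binomial probability P(X = k) = C(n, k) p^k (1-p)^(n-k). The values n = 5 and p = 1/5 are assumptions chosen only for this sketch.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Map k -> P(X = k) for X ~ Binomial(n, p)."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

n, p = 5, Fraction(1, 5)   # illustrative values, not from the lecture
pmf = binomial_pmf(n, p)

mean = sum(k * q for k, q in pmf.items())
var = sum((k - mean) ** 2 * q for k, q in pmf.items())

print("E(X)   =", mean)
print("Var(X) =", var)
```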
Examples
Let X be a random variable which denotes the number of TCP packets that arrive in a unit of time.
Then X can be modeled as following a Poisson distribution with rate λ: $P(X = k) = \lambda^k e^{-\lambda} / k!$ for k = 0, 1, 2, …
E(X) = Var(X) = λ
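The claim that the mean and the variance of a Poisson variable are both λ can be checked numerically from the standard Poisson probability P(X = k) = λ^k e^(-λ) / k!. In this sketch λ = 3 is an arbitrary choice, and the infinite sum is truncated at k = 59, which leaves a negligible tail for this λ.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0        # illustrative arrival rate per unit time
ks = range(60)   # truncation of the infinite support (an approximation)

mean = sum(k * poisson_pmf(k, lam) for k in ks)
var = sum((k - mean) ** 2 * poisson_pmf(k, lam) for k in ks)

print("E(X)   ~", round(mean, 6))
print("Var(X) ~", round(var, 6))
```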
Continuous Distribution
Of course, the most common continuous distribution is the Normal (Gaussian) distribution, denoted $N(\mu, \sigma^2)$, where μ is the mean and σ² is the variance.
How to use r.v. for classification
To use random variables in classification, we have to make an assumption.
– For example, Sepal Length follows a Normal distribution.
– Is this a good/reasonable assumption?
Then we use data to estimate the parameters of the distribution.
– The parameters of a Normal distribution are the mean and the variance (the square of the standard deviation).
– For the moment we can just use Matlab (or another program) to do that.
– Once we have the parameters, we can use the distribution to estimate the “probability” of Sepal Length taking a new value.
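The two steps (estimate the parameters, then evaluate the fitted distribution at a new value) can be sketched with Python's `statistics.NormalDist`. The sepal lengths below are made-up numbers, not the Iris data. Note that for a continuous variable the fitted distribution gives a density rather than a probability, which is why “probability” is in quotes above.

```python
from statistics import NormalDist

# Hypothetical sepal lengths in cm (illustrative values only).
sepal_lengths = [5.1, 4.9, 6.3, 5.8, 7.1, 6.5, 5.0, 6.0]

# Step 1: estimate the mean and standard deviation from the data.
fitted = NormalDist.from_samples(sepal_lengths)
print("mean =", round(fitted.mean, 3), "std =", round(fitted.stdev, 3))

# Step 2: evaluate the fitted density at a new value.
density_at_6 = fitted.pdf(6.0)
print("density at 6.0 =", round(density_at_6, 3))
```

`from_samples` uses the sample mean and the sample standard deviation (with n - 1 in the denominator).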
Fitting Distributions: Examples
Data: 0, 1, 0, 1, 0, 0
– Assume the data come from a Binomial distribution with 6 trials and 2 successes.
– In Matlab: >> binofit(2,6)
Data: …, 20, 5, 3, 3, 100
– Assume the data are from a Poisson distribution.
– >> X = [ … ];
– >> poissfit(X)
– Ans: …
What is happening? We are just taking sample averages (in the first example, 2/6 ≈ 0.33). The more data we have, the more reliable these estimates become.
Suppose we take Sepal Length as the data vector x:
>> [mu, sigma] = normfit(x)
Result: mu = 5.8, sigma = 0.81
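To make the "sample averages" point concrete, here is a Python sketch of the same kind of estimates. The coin-style data 0, 1, 0, 1, 0, 0 is from the slide; the list `counts` is hypothetical, because the Poisson data on the slide is incomplete.

```python
from statistics import mean

# Binomial example from the slide: 6 trials, 2 successes.
trials = [0, 1, 0, 1, 0, 0]
p_hat = sum(trials) / len(trials)   # fraction of successes
print("estimated p =", round(p_hat, 4))

# Poisson example with made-up counts per unit time (not the slide's data).
counts = [4, 2, 7, 3, 4]
lam_hat = mean(counts)              # the estimate of the rate is the sample mean
print("estimated lambda =", lam_hat)
```

In both cases the estimate is an average over the observed data, which is why more data makes it more reliable.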
Return to the Iris Example
We will redo the Iris classification example, but now we will use “continuous” values for the attributes.