Statistical Classification. Rong Jin.

1 Statistical Classification Rong Jin

2 Classification Problems. Input X, output Y. Given input X = {x_1, x_2, …, x_m}, predict the class label y ∈ Y. Y = {-1, 1}: binary classification problems. Y = {1, 2, 3, …, c}: multi-class classification problems. Goal: learn the function f: X → Y.

3 Examples of Classification Problems. Text categorization. Input features X: word frequencies {(campaigning, 1), (democrats, 2), (basketball, 0), …}. Class label y: y = +1 for ‘politics’, y = -1 for ‘non-politics’. Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, … Topic: politics or non-politics?

5 Examples of Classification Problems. Image classification. Input features X: color histogram {(red, 1004), (blue, 23000), …}. Class label y: y = +1 for ‘bird image’, y = -1 for ‘non-bird image’. Which images are birds, and which are not?

7 Classification Problems. Input X, output Y. Doc: Months of campaigning and weeks of round-the-clock efforts in Iowa all came down to a final push Sunday, … Politics or not-politics? f: doc → topic. Birds or not-birds? f: image → topic. How to obtain f? Learn the classification function f from examples.

8 Learning from Examples. Training examples are assumed independent and identically distributed (i.i.d.): each training example is drawn independently from the same source, so training examples are similar to testing examples.

10 Learning from Examples. Given training examples, the goal is to learn a classification function f(x): X → Y that is consistent with the training examples. What is the easiest way to do it?

11 K Nearest Neighbor (kNN) Approach. (Figure: neighborhoods for k = 1 and k = 4.) How many neighbors should we count?
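
The kNN rule on this slide can be sketched in a few lines. This is a minimal illustration, not the lecturer's code: the function name `knn_predict`, the use of Euclidean distance, and the toy data are all assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training points. `train` is a list of (features, label) pairs."""
    # Sort training points by Euclidean distance to the query (an assumption;
    # the slides do not fix a distance measure).
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: two clusters labelled -1 and +1.
train = [((0.0, 0.0), -1), ((0.2, 0.1), -1), ((0.1, 0.3), -1),
         ((1.0, 1.0), +1), ((0.9, 1.2), +1), ((1.1, 0.8), +1)]

label_near_origin = knn_predict(train, (0.1, 0.1), k=1)
label_near_corner = knn_predict(train, (1.0, 0.9), k=3)
```

Sorting the whole training set costs O(n log n) per query, which is fine for a sketch but is the reason kNN is slow at query time on large datasets.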

12 Cross Validation. Divide the training examples into two sets: a training set (80%) and a validation set (20%). Predict the class labels of the examples in the validation set using the examples in the training set. Choose the number of neighbors k that maximizes the classification accuracy on the validation set.
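
The 80/20 procedure above could look roughly like the following sketch. The helper names (`knn_predict`, `validation_accuracy`, `choose_k`) and the toy data are hypothetical, and the sketch assumes the examples are already in random order, so it simply splits by position.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_accuracy(train, valid, k):
    """Fraction of validation examples whose label kNN predicts correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in valid)
    return correct / len(valid)

def choose_k(examples, candidate_ks, train_fraction=0.8):
    """Split by position (assumes `examples` is already shuffled) and return
    the candidate k with the highest validation accuracy."""
    split = int(len(examples) * train_fraction)
    train, valid = examples[:split], examples[split:]
    return max(candidate_ks, key=lambda k: validation_accuracy(train, valid, k))

# Toy 1-D data, made up for illustration. The point at 0.15 is a noisy +1
# sitting inside the -1 cluster; the last two examples form the validation set.
examples = [((0.0,), -1), ((0.1,), -1), ((0.2,), -1), ((0.3,), -1),
            ((0.15,), +1), ((1.0,), +1), ((1.1,), +1), ((1.2,), +1),
            ((0.16,), -1), ((1.05,), +1)]

best_k = choose_k(examples, candidate_ks=[1, 3])
```

With a single noisy neighbor in the training set, k = 1 is misled on the first validation point while k = 3 outvotes the noise, which is the effect the validation set is meant to detect.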

13 Leave-One-Out Method
For k = 1, 2, …, K:
  Err(k) = 0
  1. Randomly select a training data point that has not been tested yet and hide its class label.
  2. Use the remaining data and the given k to predict the class label of the held-out data point.
  3. Err(k) = Err(k) + 1 if the predicted label differs from the true label.
  Repeat steps 1-3 until all training examples have been tested.
Choose the k whose Err(k) is minimal.

18 Leave-One-Out Method (example). Running the procedure gives Err(1) = 3, Err(2) = 2, Err(3) = 6, so k = 2 is chosen.
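
The leave-one-out procedure can be sketched as below. The function names and the toy data are assumptions for illustration, and the data are not the points pictured on the slides. The sketch visits the points in order rather than at random, which gives the same error counts because every point is held out exactly once.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_errors(examples, candidate_ks):
    """Return {k: Err(k)}, where Err(k) counts the held-out points that kNN
    misclassifies when trained on all the other points."""
    errors = {k: 0 for k in candidate_ks}
    for k in candidate_ks:
        for i, (x, y) in enumerate(examples):
            rest = examples[:i] + examples[i + 1:]  # hide example i
            if knn_predict(rest, x, k) != y:
                errors[k] += 1
    return errors

# Toy 1-D data, made up for illustration; 0.15 is a noisy +1 point.
examples = [((0.0,), -1), ((0.1,), -1), ((0.2,), -1), ((0.3,), -1),
            ((0.15,), +1), ((1.0,), +1), ((1.1,), +1), ((1.2,), +1)]

errors = leave_one_out_errors(examples, candidate_ks=[1, 3])
best_k = min(errors, key=errors.get)  # the k with the smallest Err(k)
```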

19 Probabilistic Interpretation of kNN. Estimate the conditional probability Pr(y|x) around the location of x from the count of data points of class y in the neighborhood of x. Bias and variance tradeoff: a small neighborhood → large variance → unreliable estimation; a large neighborhood → large bias → inaccurate estimation.

20 Weighted kNN. Weight the contribution of each close neighbor based on its distance from the query point. The method is specified by a weight function and a prediction rule.
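
As a hedged illustration of weighted kNN: the sketch below assumes a Gaussian weight exp(-‖x - x_i‖² / (2σ²)) and a prediction given by the normalized sum of weights per class. Both are common choices but are assumptions here, since the transcript does not preserve the slide's formulas; the name `weighted_knn_proba` and the toy data are also hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn_proba(train, query, k, sigma2):
    """Estimate Pr(y | query) from the k nearest neighbors, each weighted by
    an assumed Gaussian kernel exp(-d^2 / (2 * sigma2))."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    scores = defaultdict(float)
    for x, label in nearest:
        d2 = math.dist(x, query) ** 2
        scores[label] += math.exp(-d2 / (2.0 * sigma2))
    # Note: for a very small sigma2 every weight can underflow to 0.0,
    # in which case this normalization would divide by zero.
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

# Toy 1-D data, made up for illustration: one close -1 point, two far +1 points.
train = [((0.0,), -1), ((1.0,), +1), ((1.2,), +1)]

narrow = weighted_knn_proba(train, (0.1,), k=3, sigma2=0.05)
wide = weighted_knn_proba(train, (0.1,), k=3, sigma2=100.0)
```

With a narrow kernel the single close neighbor dominates, while a very wide kernel makes all weights nearly equal and the estimate falls back to the plain majority vote. This is why the bandwidth σ² has to be chosen with care.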

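21 Estimate σ² in the Weight Function. Leave-one-out cross validation: the training dataset D is divided into two sets, a validation set (a single held-out example) and a training set (the remaining examples). Compute the prediction for the validation example using the training set.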

22 Estimate σ² in the Weight Function. Pr(y | x_1, D_{-1}) is a function of σ².

24 Estimate σ² in the Weight Function. In general, we can write such an expression for each example, with that example as the validation set and the rest as the training set. Estimate σ² by maximizing the likelihood.
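
One way to write the leave-one-out likelihood is sketched below. It assumes a Gaussian weight function and a normalized-weight prediction, neither of which is preserved in the transcript, so the exact form on the slides may differ. Here D_{-i} denotes the dataset with example i removed and δ(y, y_j) is 1 when y = y_j and 0 otherwise.

```latex
% Assumed weight function (Gaussian kernel with bandwidth sigma^2)
w(x, x_j) = \exp\!\left(-\frac{\lVert x - x_j \rVert^2}{2\sigma^2}\right)

% Leave-one-out prediction for example i, using all other examples
\Pr(y \mid x_i, D_{-i}) =
  \frac{\sum_{j \neq i} w(x_i, x_j)\,\delta(y, y_j)}{\sum_{j \neq i} w(x_i, x_j)}

% Log-likelihood to be maximized over sigma^2
l(\sigma^2) = \sum_{i=1}^{n} \log \Pr(y_i \mid x_i, D_{-i})
```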

26 Optimization. The objective is a DC function (a difference of two convex functions).
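
To see where a DC structure can come from, here is a sketch under assumptions that the transcript does not confirm: a Gaussian weight exp(-λ d_ij) with d_ij = ‖x_i - x_j‖² and λ = 1/(2σ²), and a leave-one-out log-likelihood over the n training examples. It also assumes every example has at least one other example of the same class, so that no logarithm is taken of zero.

```latex
l(\lambda)
  = \sum_{i=1}^{n} \log
    \frac{\sum_{j \neq i,\; y_j = y_i} e^{-\lambda d_{ij}}}
         {\sum_{j \neq i} e^{-\lambda d_{ij}}}
  = \underbrace{\sum_{i=1}^{n} \log \sum_{j \neq i,\; y_j = y_i} e^{-\lambda d_{ij}}}_{\text{convex in } \lambda}
  \;-\;
    \underbrace{\sum_{i=1}^{n} \log \sum_{j \neq i} e^{-\lambda d_{ij}}}_{\text{convex in } \lambda}
```

Each term is a log-sum-exp of functions that are linear in λ, and log-sum-exp of linear functions is convex, so the objective is a difference of two convex functions.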

27 Challenges in Optimization. Convex functions are the easiest to optimize. Single-mode (unimodal) functions are the second easiest. Multi-mode functions are difficult to optimize.

28 Gradient Ascent

29 Gradient Ascent (cont’d). Compute the derivative of l(λ), i.e., dl(λ)/dλ. Update λ ← λ + t · dl(λ)/dλ. How to decide the step size t?

30 Gradient Ascent: Line Search. Excerpt from the slides by Stephen Boyd.

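31 Gradient Ascent. Stop criterion: ε is a predefined small value. Start with λ = 0; define the line-search parameters α and β and the tolerance ε. Compute the derivative of l(λ), choose the step size t via backtracking line search, and update λ. Repeat until the stop criterion is met.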
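
The loop above can be sketched for a one-dimensional objective as follows. This is an illustration under assumptions, not the slides' exact algorithm: the stop criterion is taken to be |dl/dλ| ≤ ε, the sufficient-increase test is the usual Armijo condition adapted to ascent, and the defaults for `alpha`, `beta`, and `eps` are arbitrary example values.

```python
def gradient_ascent(f, grad, x0=0.0, alpha=0.3, beta=0.5, eps=1e-6,
                    max_iter=1000):
    """Maximize a smooth 1-D function f with derivative grad, starting at x0.
    alpha and beta are backtracking parameters; eps is the stop tolerance."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) <= eps:  # assumed stop criterion: derivative is nearly zero
            break
        t = 1.0
        # Backtracking line search: shrink t until the step achieves a
        # sufficient increase in f (Armijo condition, ascent version).
        while f(x + t * g) < f(x) + alpha * t * g * g:
            t *= beta
        x += t * g  # gradient ascent update
    return x

# Example objective, made up for illustration: a concave parabola whose
# maximum is at x = 2.
def objective(x):
    return -(x - 2.0) ** 2

def objective_grad(x):
    return -2.0 * (x - 2.0)

x_max = gradient_ascent(objective, objective_grad)
```

Backtracking avoids hand-tuning a fixed step size: a step that overshoots is halved (for beta = 0.5) until the objective is guaranteed to improve.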

33 ML = Statistics + Optimization. Modeling: Pr(y|x; θ), where θ is the parameter(s) involved in the model. Search for the best parameter θ by maximum likelihood estimation: construct a log-likelihood function l(θ), then search for the optimal solution θ.

34 Instance-Based Learning (Ch. 8). Key idea: just store all training examples. k nearest neighbor: given a query example, take a vote among its k nearest neighbors (if the target function is discrete-valued), or take the mean of the f values of its k nearest neighbors (if the target function is real-valued).

35 When to Consider Nearest Neighbor? Lots of training data; fewer than 20 attributes per example. Advantages: training is very fast; learns complex target functions; doesn’t lose information. Disadvantages: slow at query time; easily fooled by irrelevant attributes.

36 KD Tree for NN Search. Each node contains information about its children and the tightest box that bounds all the data points within the node.

37 NN Search by KD Tree
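
A minimal sketch of nearest-neighbor search with a KD tree whose nodes store their children and a tight bounding box. The class name `KDNode`, the median split along the widest dimension, and the pruning rule are illustrative assumptions, since the slides' own figures for this part are not in the transcript.

```python
import math

class KDNode:
    """A KD-tree node holding its children and the tightest axis-aligned
    box around its points. `points` must be a non-empty list of tuples."""
    def __init__(self, points):
        dims = range(len(points[0]))
        self.lo = [min(p[d] for p in points) for d in dims]
        self.hi = [max(p[d] for p in points) for d in dims]
        self.point = points[0] if len(points) == 1 else None  # leaf payload
        self.children = []
        if len(points) > 1:
            # Assumed split rule: median along the widest dimension.
            axis = max(dims, key=lambda d: self.hi[d] - self.lo[d])
            ordered = sorted(points, key=lambda p: p[axis])
            mid = len(ordered) // 2
            self.children = [KDNode(ordered[:mid]), KDNode(ordered[mid:])]

    def box_distance(self, q):
        """Distance from q to the bounding box (0 if q lies inside it)."""
        return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                             for lo, hi, x in zip(self.lo, self.hi, q)))

def nearest(node, q, best=(math.inf, None)):
    """Return (distance, point) for the stored point closest to q."""
    # Prune: nothing inside this box can beat the best distance so far.
    if node.box_distance(q) >= best[0]:
        return best
    if not node.children:
        return (math.dist(node.point, q), node.point)
    # Visit the closer child first so the other one is more likely pruned.
    for child in sorted(node.children, key=lambda c: c.box_distance(q)):
        best = nearest(child, q, best)
    return best

# Toy 2-D points, made up for illustration.
points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (0.5, 0.5), (0.9, 0.9)]
tree = KDNode(points)
dist, point = nearest(tree, (0.45, 0.55))
```

The bounding box gives a lower bound on the distance to any point in a subtree, so whole subtrees are skipped once a close enough neighbor has been found.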







44 Curse of Dimensionality. Imagine instances described by 20 attributes, of which only 2 are relevant to the target function. Curse of dimensionality: nearest neighbor is easily misled when X is high-dimensional. Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and consider the nearest-neighbor estimate at the origin. For fixed N, the distance from the origin to the closest data point approaches 1, the radius of the ball, as p grows.
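
The effect can be quantified with a short sketch. For N points uniform in the p-dimensional unit ball, the probability that all of them lie farther than r from the origin is (1 - r^p)^N, which gives closed forms for both the median and the mean of the nearest-neighbor distance. Which expression the original slide showed is not recoverable from the transcript, so the function names and the choice to show both are assumptions.

```python
import math
import random

def median_nn_distance(n_points, dim):
    """Median distance from the origin to the closest of n_points uniform
    points in the unit ball: solve (1 - r**dim)**n_points = 1/2 for r."""
    return (1.0 - 0.5 ** (1.0 / n_points)) ** (1.0 / dim)

def mean_nn_distance(n_points, dim):
    """Mean of the same distance: the integral of (1 - r**dim)**n_points over
    [0, 1], which equals Gamma(1 + 1/dim) Gamma(N + 1) / Gamma(N + 1 + 1/dim).
    Log-gamma is used to avoid overflow for large n_points."""
    a = 1.0 / dim
    return math.exp(math.lgamma(1.0 + a) + math.lgamma(n_points + 1.0)
                    - math.lgamma(n_points + 1.0 + a))

def simulated_mean_nn_distance(n_points, dim, trials=300, seed=0):
    """Monte Carlo check. The distance of a uniform point in the unit ball
    from the origin is distributed as U**(1/dim) with U uniform on [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() ** (1.0 / dim) for _ in range(n_points))
    return total / trials

median_10d = median_nn_distance(500, 10)
mean_10d = mean_nn_distance(500, 10)
simulated_10d = simulated_mean_nn_distance(500, 10)
```

With 500 points in 10 dimensions the closest point is typically more than halfway to the boundary of the ball, so the "nearest" neighbor is not local at all.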
