
1 Learning from Data COMP61011 : Machine Learning and Data Mining
Dr Gavin Brown Machine Learning and Optimization Research Group

2 Learning from Data Data is recorded from some real-world phenomenon.
What might we want to do with that data?
Prediction - what can we predict about this phenomenon?
Description - how can we describe/understand this phenomenon in a new way?
Optimization - how can we control and optimize this phenomenon for our own objectives?

3 Prediction - Lecturer: Dr Gavin Brown
This module is the first of three in the theme:
Period 1 (Oct/Nov): COMP61011 Machine Learning & Data Mining - Prediction (Lecturer: Dr Gavin Brown)
Period 2 (Nov/Dec): COMP61021 Modeling & Visualization of High Dimensional Data
Period 3 (Feb/Mar): COMP61032 Optimization for Learning, Planning & Problem Solving

4 Machine Learning and Data Mining
Medical Records / Novel Drugs
What characteristics of a patient indicate they may react well/badly to a new drug? How can we predict whether it will potentially hurt rather than help them?
AstraZeneca Project Research Bursaries: limited number of eligible MSc projects, announced Dec 2011.

5 Machine Learning and Data Mining
Handwriting Recognition Google Books is currently digitizing millions of books. Smartphones need to process non-European handwriting to tap into the Asian market. How can we recognize handwritten digits in a huge variety of handwriting styles, in real-time?

6 Learning from Data: where does all this fit?
[Diagram: "Learning from Data" at the overlap of Artificial Intelligence, Statistics / Mathematics, Data Mining, Computer Vision and Robotics]
(No definition of a field is perfect – the diagram is just one interpretation, mine ;-)

7 Learn your trade…

8 Learning from Data ….. Prerequisites
MATHEMATICS: This is a mathematical subject. You must be comfortable with probabilities and algebra. A maths primer is on the course website for review.
PROGRAMMING: You must be able to program, and pick up a new language relatively easily. We use Matlab for the first 2 modules. In the 3rd module, you may use any language.
Module codes in this theme: 61011 (prediction), 61021 (description), 61032 (optimization).

9 COMP61011 topic structure
Week 1: Some Data and Simple Predictors
Week 2: Support Vector Machines / Model Selection
Week 3: Decision Trees / Feature Selection
Week 4: Bayes' Theorem / Probabilistic Classifiers
Week 5: Ensemble Methods / Industry Guest Lectures
Week 6: No lecture.

10 COMP61011 assessment structure
50% January exam
50% coursework, broken down as: 10% + 10% lab exercises (weeks 2, 3) and 30% mini-project (weeks 4, 5, 6).
Lab exercises will be marked at the START of the following lab session. You should NOT still be working on the previous week's exercise. Extensions will require a medical note.

11 Matlab (MATrix LABoratory)
Interactive scripting language
Interpreted (i.e. no compiling)
Objects possible, not compulsory
Dynamically typed
Flexible GUI / plotting framework
Large libraries of tools
Highly optimized for maths
Available free from the University, but usable only when connected to our network (e.g. via VPN). Module-specific software is supported on school machines only.

12 Books (not compulsory purchase, but recommended)
“Introduction to Machine Learning” by Ethem Alpaydin - Technical. Contains all necessary material for modules 1+2 of this theme.
“Very Short Introduction to Statistics” by David Hand - Not technical at all. More of a motivational, big-picture read.

13 Some Data, and Simple Predictors

14 A Problem to Solve Distinguish rugby players from ballet dancers.
You are provided with some data: Fallowfield rugby club (16 people) and Rusholme ballet troupe (10 people).
Task: Generate a program which will correctly classify ANY player/dancer in the world.
Hint: We shouldn't "fine-tune" our system too much so it only works on the local clubs.

15 Taking measurements….
We have to process the people with a computer, so the data needs to be in a computer-readable form. What are the distinguishing characteristics? Height, weight, shoe size, sex.

16 Taking measurements….
[Table of persons 1-5 and 16-20 with their weights (kg) and heights (cm), plotted as points in the height/weight plane]

17 The Nearest Neighbour Rule
"TRAINING" DATA: [table of persons 1-5 and 16-20, each with a weight (kg), height (cm) and class label -1 or +1, plotted as points in the height/weight plane]
"TESTING" DATA: Who's this guy - player or dancer? height = 180cm, weight = 78kg

18 The Nearest Neighbour Rule
"TRAINING" DATA: [the same table and plot as before]
Testing point: height = 180cm, weight = 78kg. Find the nearest neighbour; assign the same class.

19 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before]
For a testing point x:
  for each training datapoint x', measure distance(x, x')
  end
  sort the distances
  select the K nearest
  assign the most common class among them
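A minimal Matlab sketch of this rule (my own illustration with hypothetical names: Xtrain is an N-by-2 matrix of [height, weight] rows, ytrain the N-by-1 vector of +1/-1 labels, x a 1-by-2 test point):

function label = knn_classify(Xtrain, ytrain, x, K)
% Majority vote among the K training points nearest to x.
N = size(Xtrain, 1);
d = sqrt(sum((Xtrain - repmat(x, N, 1)).^2, 2));   % Euclidean distance to every training point
[~, idx] = sort(d);                                % nearest first
label = mode(ytrain(idx(1:K)));                    % most common class among the K nearest
end

For example, knn_classify(Xtrain, ytrain, [180 78], 3) would classify the 180cm / 78kg test point with K=3.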

20 Quick reminder: Pythagoras’ theorem
[Diagram: right-angled triangle with sides a, b and hypotenuse c drawn between two points in the height/weight plane]
measure distance(x, x') - a.k.a. "Euclidean" distance
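Written out for two points x and x' in the height/weight plane, the distance is

d(x, x') = \sqrt{(x_{height} - x'_{height})^2 + (x_{weight} - x'_{weight})^2}

so, as an illustrative example with made-up numbers, points at (180cm, 78kg) and (174cm, 70kg) are \sqrt{6^2 + 8^2} = 10 apart.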

21 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before, with the k-NN pseudocode]
Seems sensible. But what are the disadvantages?

22 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before]
Here I chose k=3. What would happen if I chose k=5? What would happen if I chose k=26?

23 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before]
Any point on the left of this "boundary" is closer to the red circles. Any point on the right of this "boundary" is closer to the blue crosses. This is called the "decision boundary".

24 Where's the decision boundary?
Not always a simple straight line!
[Plot of the data in the height/weight plane]

25 Where’s the decision boundary?
Not always contiguous!
[Plot of the data in the height/weight plane]

26 So, we have our first “machine learning” algorithm
The K-Nearest Neighbour Classifier. Make your own notes on its advantages / disadvantages.
For a testing point x:
  for each training datapoint x', measure distance(x, x')
  end
  sort the distances
  select the K nearest
  assign the most common class among them

27 The most important concept in Machine Learning

28 The most important concept in Machine Learning
Looks good so far…

29 The most important concept in Machine Learning
Looks good so far… Oh no! Mistakes! What happened?

30 The most important concept in Machine Learning
Looks good so far… Oh no! Mistakes! What happened? We didn’t have all the data. We can never assume that we do. This is called “OVER-FITTING” to the small dataset.

31 Overfitting
Overfitting happens when the classifier is too "flexible" for the problem. If we'd drawn a simpler decision boundary, maybe a straight line, we might have achieved lower error.

32 Break for 10 mins
Possible uses of your break: Ensure you have a working login for the computer lab this afternoon. Talk to me or a demonstrator about the material. Read ahead in the notes. Go get a coffee.

33 A simpler, more compact rule?
[Plot of the data in the height/weight plane]

34 What’s an algorithm to find a good threshold?
while ( numMistakes != 0 ) { numMistakes = error( ) }
[Plot of the data in the height/weight plane]
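One concrete way to fill in that loop, sketched in Matlab (this exhaustive search is my own illustration, not necessarily the algorithm in the course notes; x is a column vector of one feature, e.g. weight, and y the +1/-1 labels):

function [bestT, bestErr] = fit_stump(x, y)
% Try each midpoint between neighbouring sorted feature values as a threshold,
% predicting +1 above it and -1 below, and keep the one with fewest mistakes.
xs = sort(x);
candidates = (xs(1:end-1) + xs(2:end)) / 2;
bestErr = inf;  bestT = candidates(1);
for t = candidates(:)'
    pred = ones(size(y));
    pred(x < t) = -1;             % predict -1 below the threshold
    err = sum(pred ~= y);         % count mistakes
    if err < bestErr
        bestErr = err;  bestT = t;
    end
end
end

(A stump that predicts -1 above the threshold could be handled by also trying the flipped rule.)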

35 We have our second Machine Learning procedure.
The threshold classifier (also known as a “Decision Stump”) while ( numMistakes != 0 ) { numMistakes = error( ) }

36 Three “ingredients” of a Machine Learning procedure
“Model” - The final product, the thing you have to package up and send to a customer. A piece of code with some parameters that need to be set.
“Error function” - The performance criterion: the function you use to judge how well the parameters of the model are set.
“Learning algorithm” - The algorithm that optimises the model parameters, using the error function to judge how well it is doing.

37 Three “ingredients” of a Threshold Classifier
[The while-loop code annotated with the three ingredients: the threshold is the Model, error( ) is the Error function, and the loop that adjusts the threshold is the Learning algorithm]
while ( numMistakes != 0 ) { numMistakes = error( ) }

38 What’s the “model” for the k-NN classifier?
For the k-NN, the model is the training data itself!
- very good accuracy :-)
- very computationally intensive! :-(
[Plot of the data in the height/weight plane, plus the k-NN pseudocode from before]

39 Our model does not match the problem!
New data: what's an algorithm to find a good threshold? 1 mistake…
[Plot of the data in the height/weight plane with the new datapoint]

40 New data: what's an algorithm to find a good threshold?
But our current model cannot represent this…
[Plot of the data in the height/weight plane]

41 We need a more sophisticated model…
The Linear Classifier
[Plot of the data in the height/weight plane]

42 The Linear Classifier
[Two plots in the height/weight plane, each showing a different decision boundary]
Changing the weights and the bias changes the position of the DECISION BOUNDARY.
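Assuming the usual form of a two-feature linear classifier (the exact notation here is my reconstruction), the model is

f(x) = sign(w_1 x_1 + w_2 x_2 + w_0),

with x_1 = height, x_2 = weight, weights w_1, w_2 and bias w_0. The decision boundary is the set of points where w_1 x_1 + w_2 x_2 + w_0 = 0.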

43 Geometry of the Linear Classifier (1)
In 2-d, this is a line. In higher dimensions, it is a decision "hyper-plane". Any point on the plane evaluates to 0. Points not on the plane evaluate to +1/-1. The decision boundary is always ORTHOGONAL to the weight vector. See if you can prove this for yourself before going to the notes.
[Diagram: the boundary f(x)=0, with the regions on either side labelled f(x)=+1 and f(x)=-1]

44 Geometry of the Linear Classifier (2)
We can rearrange the decision rule:
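Assuming the boundary equation w_1 x_1 + w_2 x_2 + w_0 = 0 from above, one standard rearrangement (taking w_2 \neq 0) is

x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}.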

45 Geometry of the Linear Classifier (3)
On the plane, f(x)=0. In 2-dimensions, this now follows the geometry of a straight line y = mx + c, with slope and intercept determined by the weights (see below).
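Comparing the rearranged boundary with y = mx + c (again under the standard form assumed above) gives

m = -\frac{w_1}{w_2}, \qquad c = -\frac{w_0}{w_2},

so the ratio of the weights sets the slope of the boundary, and the bias (scaled by w_2) sets where it crosses the vertical axis.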

46 The Linear Classifier
[Diagram: the linear classifier's Model, Error function and Learning algorithm laid out over the height/weight plot]
Note the terminology! See notes for details!

47 Gradient Descent Follow the NEGATIVE gradient.
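In the usual notation, "follow the negative gradient" means repeatedly applying

w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i},

where E is the error function and \eta is a small positive learning rate.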

48 Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1
for n = 1 to NUM_ITERATIONS
  for each training example (x, y)
    calculate the update for each weight i (see the sketch below)
  end
end
(η = a small constant, the "learning rate")
Convergence theorem: If the data is linearly separable, then application of the learning rule will find a separating decision boundary within a finite number of iterations.
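A Matlab sketch of this loop for the linear model with squared error E = \frac{1}{2}(f(x) - y)^2, whose per-example update is w_i \leftarrow w_i - \eta (f(x) - y) x_i (my own minimal illustration, with hypothetical names):

function w = train_linear(X, y, eta, NUM_ITERATIONS)
% X: N-by-d data matrix, y: N-by-1 labels (+1/-1), eta: learning rate.
[N, d] = size(X);
Xb = [X, ones(N, 1)];              % append a constant 1 so the bias is the last weight
w = 2*rand(d+1, 1) - 1;            % random initial weights in [-1, +1]
for n = 1:NUM_ITERATIONS
    for i = 1:N
        f = Xb(i, :) * w;                         % linear model output for example i
        w = w - eta * (f - y(i)) * Xb(i, :)';     % step down the squared-error gradient
    end
end
end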

49 A problem…
initialise weight values to random numbers in range -1 to +1 . . .
Depending on the random initialisation, the linear classifier will converge to one of the valid boundaries… but randomly!
[Plot of the data in the height/weight plane showing several valid boundaries]

50 Break for 30 mins
Possible uses of your break: Ensure you have a working login for the computer lab this afternoon. Talk to me or a demonstrator about the material. Read ahead in the notes. Go get a coffee.

51 Another model : logistic regression
Our model f(x) has range plus/minus INFINITY! Is this really necessary? What is the confidence of our decisions? Can we estimate PROBABILITIES?
Logistic regression estimates p( y=1 | x ). Output in range [0,1]. Sigmoid function.
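The sigmoid referred to here is the standard one,

p(y = 1 \mid x) = \sigma(w^T x + w_0) = \frac{1}{1 + e^{-(w^T x + w_0)}},

which squashes the unbounded linear output into a value between 0 and 1 that we can read as a probability.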

52 Another error : cross entropy
In the cross-entropy error we assume y is either 0 or 1. It is derived from the statistical principle of Likelihood. We'll see this again in a few weeks.
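The standard cross-entropy error for one example, writing p = p(y=1 \mid x), is

E = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big],

summed over the training examples; minimising it is equivalent to maximising the likelihood of the data under the model.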

53 Gradient Descent
Follow the NEGATIVE gradient. SAME update as for squared error!
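The claim can be checked by differentiating: with p = \sigma(w^T x + w_0), the standard result is

\frac{\partial E}{\partial w_i} = (p - y)\, x_i,

the same (prediction minus target) times input form as the squared-error update for the plain linear model.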

54 Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1
for n = 1 to NUM_ITERATIONS
  for each training example (x, y)
    calculate the update for each weight i, now using the sigmoid output (see the sketch below)
  end
end
(η = a small constant, the "learning rate")
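The same loop as before with the sigmoid inserted; only the line computing the model output changes (again my own sketch with hypothetical names; note y is coded 0/1 here, as on the cross-entropy slide):

function w = train_logistic(X, y, eta, NUM_ITERATIONS)
% X: N-by-d data matrix, y: N-by-1 labels coded 0/1, eta: learning rate.
[N, d] = size(X);
Xb = [X, ones(N, 1)];               % constant 1 for the bias weight
w = 2*rand(d+1, 1) - 1;             % random initial weights in [-1, +1]
for n = 1:NUM_ITERATIONS
    for i = 1:N
        p = 1 / (1 + exp(-Xb(i, :) * w));         % sigmoid output, estimate of p(y=1|x)
        w = w - eta * (p - y(i)) * Xb(i, :)';     % cross-entropy gradient step
    end
end
end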

55 A natural 'pairing' of error function to model
(squared error with the linear model; cross entropy with logistic regression)

56 Still a problem…
initialise weight values to random numbers in range -1 to +1 . . .
Depending on the random initialisation, the logistic regression classifier will converge to one of the valid boundaries… but randomly!
[Plot of the data in the height/weight plane showing several valid boundaries]

57 Geometry of Linear Models (see notes)

58 Our model does not match the problem!
Another problem - new data…. the classes are now "non-linearly separable". We'll deal with this next week!
[Plot of the data in the height/weight plane]

59 End of Day 1 Now… read the notes. Read the “Surrounded by Statistics” chapter in the handouts. The fog will clear. This afternoon… learn MATLAB. This week’s exercise is unassessed, but you are highly advised to get as much practice in as you can….

