
1 Learning from Data COMP61011 : Machine Learning and Data Mining
Dr Gavin Brown Machine Learning and Optimization Research Group

2 Learning from Data Data is recorded from some real-world phenomenon.
What might we want to do with that data?
Prediction - what can we predict about this phenomenon?
Description - how can we describe/understand this phenomenon in a new way?
Optimization - how can we control and optimize this phenomenon for our own objectives?

3 Prediction - Lecturer: Dr Gavin Brown
This module is the first of three in the theme:
Period 1 (Oct/Nov): COMP61011 Machine Learning & Data Mining - Prediction (Lecturer: Dr Gavin Brown)
Period 2 (Nov/Dec): COMP61021 Modeling & Visualization of High Dimensional Data
Period 3 (Feb/Mar): COMP61032 Optimization for Learning, Planning & Problem Solving

4 Machine Learning and Data Mining
Medical Records / Novel Drugs
What characteristics of a patient indicate they may react well/badly to a new drug? How can we predict whether it will potentially hurt rather than help them?
AstraZeneca Project Research Bursaries: limited number of eligible MSc projects, announced Dec 2011.

5 Machine Learning and Data Mining
Handwriting Recognition Google Books is currently digitizing millions of books. Smartphones need to process non-European handwriting to tap into the Asian market. How can we recognize handwritten digits in a huge variety of handwriting styles, in real-time?

6 Learning from Data: where does all this fit?
[Diagram: "Learning from Data" at the overlap of Artificial Intelligence, Statistics / Mathematics, Data Mining, Computer Vision and Robotics]
(No definition of a field is perfect – the diagram is just one interpretation, mine ;-)

7 Learn your trade…

8 Learning from Data ….. Prerequisites
MATHEMATICS: This is a mathematical subject. You must be comfortable with probabilities and algebra. A maths primer is on the course website for review.
PROGRAMMING: You must be able to program, and pick up a new language relatively easily. We use Matlab for the first 2 modules. In the 3rd module, you may use any language.
Module codes in this theme: 61011 (prediction), 61021 (description), 61032 (optimization).

9 COMP61011 topic structure
Week 1: Some Data and Simple Predictors
Week 2: Support Vector Machines / Model Selection
Week 3: Decision Trees / Feature Selection
Week 4: Bayes' Theorem / Probabilistic Classifiers
Week 5: Ensemble Methods / Industry Guest Lectures
Week 6: No lecture.

10 COMP61011 assessment structure
50% January exam
50% coursework, broken down as: 10% + 10% lab exercises (weeks 2, 3) and 30% mini-project (weeks 4, 5, 6).
Lab exercises will be marked at the START of the following lab session. You should NOT still be working on the previous week's exercise. Extensions will require a medical note.

11 Matlab (MATrix LABoratory)
Interactive scripting language
Interpreted (i.e. no compiling)
Objects possible, not compulsory
Dynamically typed
Flexible GUI / plotting framework
Large libraries of tools
Highly optimized for maths
Available free from the University, but usable only when connected to our network (e.g. via VPN). Module-specific software is supported on school machines only.

12 Books (not compulsory purchase, but recommended)
“Introduction to Machine Learning” by Ethem Alpaydin - Technical. Contains all necessary material for modules 1+2 of this theme.
“Very Short Introduction to Statistics” by David Hand - Not technical at all. More of a motivational, big-picture read.

13 Some Data, and Simple Predictors

14 A Problem to Solve Distinguish rugby players from ballet dancers.
You are provided with some data: Fallowfield rugby club (16 people) and Rusholme ballet troupe (10 people).
Task: Generate a program which will correctly classify ANY player/dancer in the world.
Hint: We shouldn't "fine-tune" our system too much so it only works on the local clubs.

15 Taking measurements….
We have to process the people with a computer, so the data needs to be in a computer-readable form. What are the distinguishing characteristics? Height, weight, shoe size, sex.

16 Taking measurements….
[Table of persons 1-5 and 16-20 with their weights (kg) and heights (cm), plotted as points in the height/weight plane]

17 The Nearest Neighbour Rule
"TRAINING" DATA: [table of persons 1-5 and 16-20, each with a weight (kg), height (cm) and class label -1 or +1, plotted as points in the height/weight plane]
"TESTING" DATA: Who's this guy - player or dancer? height = 180cm, weight = 78kg

18 The Nearest Neighbour Rule
"TRAINING" DATA: [the same table and plot as before]
Testing point: height = 180cm, weight = 78kg. Find the nearest neighbour; assign the same class.

19 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before]
For a testing point x:
  for each training datapoint x', measure distance(x, x')
  end
  sort the distances
  select the K nearest
  assign the most common class among them
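A minimal Matlab sketch of this rule (my own illustration with hypothetical names: Xtrain is an N-by-2 matrix of [height, weight] rows, ytrain the N-by-1 vector of +1/-1 labels, x a 1-by-2 test point):

function label = knn_classify(Xtrain, ytrain, x, K)
% Majority vote among the K training points nearest to x.
N = size(Xtrain, 1);
d = sqrt(sum((Xtrain - repmat(x, N, 1)).^2, 2));   % Euclidean distance to every training point
[~, idx] = sort(d);                                % nearest first
label = mode(ytrain(idx(1:K)));                    % most common class among the K nearest
end

For example, knn_classify(Xtrain, ytrain, [180 78], 3) would classify the 180cm / 78kg test point with K=3.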

20 Quick reminder: Pythagoras’ theorem
[Diagram: right-angled triangle with sides a, b and hypotenuse c drawn between two points in the height/weight plane]
measure distance(x, x') - a.k.a. "Euclidean" distance
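Written out for two points x and x' in the height/weight plane, the distance is

d(x, x') = \sqrt{(x_{height} - x'_{height})^2 + (x_{weight} - x'_{weight})^2}

so, as an illustrative example with made-up numbers, points at (180cm, 78kg) and (174cm, 70kg) are \sqrt{6^2 + 8^2} = 10 apart.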

21 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before, with the k-NN pseudocode]
Seems sensible. But what are the disadvantages?

22 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before]
Here I chose k=3. What would happen if I chose k=5? What would happen if I chose k=26?

23 The K-Nearest Neighbour Classifier
"TRAINING" DATA: [the same table and plot as before]
Any point on the left of this "boundary" is closer to the red circles. Any point on the right of this "boundary" is closer to the blue crosses. This is called the "decision boundary".

24 Where's the decision boundary?
Not always a simple straight line!
[Plot of the data in the height/weight plane]

25 Where’s the decision boundary?
Not always contiguous!
[Plot of the data in the height/weight plane]

26 So, we have our first “machine learning” algorithm
The K-Nearest Neighbour Classifier. Make your own notes on its advantages / disadvantages.
For a testing point x:
  for each training datapoint x', measure distance(x, x')
  end
  sort the distances
  select the K nearest
  assign the most common class among them

27 The most important concept in Machine Learning

28 The most important concept in Machine Learning
Looks good so far…

29 The most important concept in Machine Learning
Looks good so far… Oh no! Mistakes! What happened?

30 The most important concept in Machine Learning
Looks good so far… Oh no! Mistakes! What happened? We didn’t have all the data. We can never assume that we do. This is called “OVER-FITTING” to the small dataset.

31 Overfitting
Overfitting happens when the classifier is too "flexible" for the problem. If we'd drawn a simpler decision boundary, maybe a straight line, we might have achieved lower error.

32 Break for 10 mins
Possible uses of your break: Ensure you have a working login for the computer lab this afternoon. Talk to me or a demonstrator about the material. Read ahead in the notes. Go get a coffee.

33 A simpler, more compact rule?
[Plot of the data in the height/weight plane]

34 What’s an algorithm to find a good threshold?
while ( numMistakes != 0 ) { numMistakes = error( ) }
[Plot of the data in the height/weight plane]
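One concrete way to fill in that loop, sketched in Matlab (this exhaustive search is my own illustration, not necessarily the algorithm in the course notes; x is a column vector of one feature, e.g. weight, and y the +1/-1 labels):

function [bestT, bestErr] = fit_stump(x, y)
% Try each midpoint between neighbouring sorted feature values as a threshold,
% predicting +1 above it and -1 below, and keep the one with fewest mistakes.
xs = sort(x);
candidates = (xs(1:end-1) + xs(2:end)) / 2;
bestErr = inf;  bestT = candidates(1);
for t = candidates(:)'
    pred = ones(size(y));
    pred(x < t) = -1;             % predict -1 below the threshold
    err = sum(pred ~= y);         % count mistakes
    if err < bestErr
        bestErr = err;  bestT = t;
    end
end
end

(A stump that predicts -1 above the threshold could be handled by also trying the flipped rule.)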

35 We have our second Machine Learning procedure.
The threshold classifier (also known as a “Decision Stump”) while ( numMistakes != 0 ) { numMistakes = error( ) }

36 Three “ingredients” of a Machine Learning procedure
“Model” - The final product, the thing you have to package up and send to a customer. A piece of code with some parameters that need to be set.
“Error function” - The performance criterion: the function you use to judge how well the parameters of the model are set.
“Learning algorithm” - The algorithm that optimises the model parameters, using the error function to judge how well it is doing.

37 Three “ingredients” of a Threshold Classifier
[The while-loop code annotated with the three ingredients: the threshold is the Model, error( ) is the Error function, and the loop that adjusts the threshold is the Learning algorithm]
while ( numMistakes != 0 ) { numMistakes = error( ) }

38 What’s the “model” for the k-NN classifier?
For the k-NN, the model is the training data itself!
- very good accuracy :-)
- very computationally intensive! :-(
[Plot of the data in the height/weight plane, plus the k-NN pseudocode from before]

39 Our model does not match the problem!
New data: what's an algorithm to find a good threshold? 1 mistake…
[Plot of the data in the height/weight plane with the new datapoint]

40 New data: what's an algorithm to find a good threshold?
But our current model cannot represent this…
[Plot of the data in the height/weight plane]

41 We need a more sophisticated model…
The Linear Classifier
[Plot of the data in the height/weight plane]

42 The Linear Classifier
[Two plots in the height/weight plane, each showing a different decision boundary]
Changing the weights and the bias changes the position of the DECISION BOUNDARY.
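Assuming the usual form of a two-feature linear classifier (the exact notation here is my reconstruction), the model is

f(x) = sign(w_1 x_1 + w_2 x_2 + w_0),

with x_1 = height, x_2 = weight, weights w_1, w_2 and bias w_0. The decision boundary is the set of points where w_1 x_1 + w_2 x_2 + w_0 = 0.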

43 Geometry of the Linear Classifier (1)
In 2-d, this is a line. In higher dimensions, it is a decision "hyper-plane". Any point on the plane evaluates to 0. Points not on the plane evaluate to +1/-1. The decision boundary is always ORTHOGONAL to the weight vector. See if you can prove this for yourself before going to the notes.
[Diagram: the boundary f(x)=0, with the regions on either side labelled f(x)=+1 and f(x)=-1]

44 Geometry of the Linear Classifier (2)
We can rearrange the decision rule:
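Assuming the boundary equation w_1 x_1 + w_2 x_2 + w_0 = 0 from above, one standard rearrangement (taking w_2 \neq 0) is

x_2 = -\frac{w_1}{w_2} x_1 - \frac{w_0}{w_2}.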

45 Geometry of the Linear Classifier (3)
On the plane, f(x)=0. In 2-dimensions, this now follows the geometry of a straight line y = mx + c, with slope and intercept determined by the weights (see below).
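Comparing the rearranged boundary with y = mx + c (again under the standard form assumed above) gives

m = -\frac{w_1}{w_2}, \qquad c = -\frac{w_0}{w_2},

so the ratio of the weights sets the slope of the boundary, and the bias (scaled by w_2) sets where it crosses the vertical axis.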

46 The Linear Classifier
[Diagram: the linear classifier's Model, Error function and Learning algorithm laid out over the height/weight plot]
Note the terminology! See notes for details!

47 Gradient Descent Follow the NEGATIVE gradient.
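In the usual notation, "follow the negative gradient" means repeatedly applying

w_i \leftarrow w_i - \eta \frac{\partial E}{\partial w_i},

where E is the error function and \eta is a small positive learning rate.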

48 Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1
for n = 1 to NUM_ITERATIONS
  for each training example (x, y)
    calculate the update for each weight i (see the sketch below)
  end
end
(η = a small constant, the "learning rate")
Convergence theorem: If the data is linearly separable, then application of the learning rule will find a separating decision boundary within a finite number of iterations.
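A Matlab sketch of this loop for the linear model with squared error E = \frac{1}{2}(f(x) - y)^2, whose per-example update is w_i \leftarrow w_i - \eta (f(x) - y) x_i (my own minimal illustration, with hypothetical names):

function w = train_linear(X, y, eta, NUM_ITERATIONS)
% X: N-by-d data matrix, y: N-by-1 labels (+1/-1), eta: learning rate.
[N, d] = size(X);
Xb = [X, ones(N, 1)];              % append a constant 1 so the bias is the last weight
w = 2*rand(d+1, 1) - 1;            % random initial weights in [-1, +1]
for n = 1:NUM_ITERATIONS
    for i = 1:N
        f = Xb(i, :) * w;                         % linear model output for example i
        w = w - eta * (f - y(i)) * Xb(i, :)';     % step down the squared-error gradient
    end
end
end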

49 A problem…
initialise weight values to random numbers in range -1 to +1 . . .
Depending on the random initialisation, the linear classifier will converge to one of the valid boundaries… but randomly!
[Plot of the data in the height/weight plane showing several valid boundaries]

50 Break for 30 mins
Possible uses of your break: Ensure you have a working login for the computer lab this afternoon. Talk to me or a demonstrator about the material. Read ahead in the notes. Go get a coffee.

51 Another model : logistic regression
Our model f(x) has range plus/minus INFINITY! Is this really necessary? What is the confidence of our decisions? Can we estimate PROBABILITIES?
Logistic regression estimates p( y=1 | x ). Output in range [0,1]. Sigmoid function.
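The sigmoid referred to here is the standard one,

p(y = 1 \mid x) = \sigma(w^T x + w_0) = \frac{1}{1 + e^{-(w^T x + w_0)}},

which squashes the unbounded linear output into a value between 0 and 1 that we can read as a probability.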

52 Another error : cross entropy
In the cross-entropy error we assume y is either 0 or 1. It is derived from the statistical principle of Likelihood. We'll see this again in a few weeks.
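The standard cross-entropy error for one example, writing p = p(y=1 \mid x), is

E = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big],

summed over the training examples; minimising it is equivalent to maximising the likelihood of the data under the model.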

53 Gradient Descent
Follow the NEGATIVE gradient. SAME update as for squared error!
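The claim can be checked by differentiating: with p = \sigma(w^T x + w_0), the standard result is

\frac{\partial E}{\partial w_i} = (p - y)\, x_i,

the same (prediction minus target) times input form as the squared-error update for the plain linear model.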

54 Stochastic Gradient Descent
initialise weight values to random numbers in range -1 to +1
for n = 1 to NUM_ITERATIONS
  for each training example (x, y)
    calculate the update for each weight i, now using the sigmoid output (see the sketch below)
  end
end
(η = a small constant, the "learning rate")
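The same loop as before with the sigmoid inserted; only the line computing the model output changes (again my own sketch with hypothetical names; note y is coded 0/1 here, as on the cross-entropy slide):

function w = train_logistic(X, y, eta, NUM_ITERATIONS)
% X: N-by-d data matrix, y: N-by-1 labels coded 0/1, eta: learning rate.
[N, d] = size(X);
Xb = [X, ones(N, 1)];               % constant 1 for the bias weight
w = 2*rand(d+1, 1) - 1;             % random initial weights in [-1, +1]
for n = 1:NUM_ITERATIONS
    for i = 1:N
        p = 1 / (1 + exp(-Xb(i, :) * w));         % sigmoid output, estimate of p(y=1|x)
        w = w - eta * (p - y(i)) * Xb(i, :)';     % cross-entropy gradient step
    end
end
end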

55 A natural 'pairing' of error function to model
(squared error with the linear model; cross entropy with logistic regression)

56 Still a problem…
initialise weight values to random numbers in range -1 to +1 . . .
Depending on the random initialisation, the logistic regression classifier will converge to one of the valid boundaries… but randomly!
[Plot of the data in the height/weight plane showing several valid boundaries]

57 Geometry of Linear Models (see notes)

58 Our model does not match the problem!
Another problem - new data…. the classes are now "non-linearly separable". We'll deal with this next week!
[Plot of the data in the height/weight plane]

59 End of Day 1 Now… read the notes. Read the “Surrounded by Statistics” chapter in the handouts. The fog will clear. This afternoon… learn MATLAB. This week’s exercise is unassessed, but you are highly advised to get as much practice in as you can….

