David Corne and Nick Taylor, Heriot-Watt University - These slides and related resources:

Presentation transcript:

Data Mining (and machine learning)
DM Lecture 6: Confusion, Correlation and Regression

Today

A note on ‘accuracy’: confusion matrices
A bit more basic statistics: Correlation
– Understanding whether two fields of the data are related
– Can you predict one from the other?
– Or is there some underlying cause that affects both?
Basic Regression:
– Very, very often used
– Given that there is a correlation between A and B (e.g. hours of study and performance in exams; height and weight; radon levels and cancer, etc.), this is used to predict B from A
– Linear Regression (predict the value of B from the value of A)
– But be careful …

80% ‘accuracy’ … means our classifier predicts the correct class 80% of the time, right?

80% ‘accuracy’ … could be with this confusion matrix:

True \ Pred    A     B     C
A             80    10    10
B             10    80    10
C              5    15    80

Individual class accuracies: 80%, 80%, 80%

Or this one:

True \ Pred    A     B     C
A            100     0     0
B              0   100     0
C              0    40    60

Individual class accuracies: 100%, 100%, 60%
Mean class accuracy: 86.7%

Another example / unbalanced classes

True \ Pred    A     B     C
A            100     0     0
B              …     …     …
C              …     …     …
(the counts for rows B and C were lost in transcription)

Individual class accuracies: 100%, 30%, 80%
Overall accuracy: ( … ) / … ≈ 75.7%
Mean class accuracy: 70%

Yet another example / unbalanced classes

True \ Pred    A     B     C
A            100     0     0
B              0     …     0
C              …     …     …
(some counts were lost in transcription)

Individual class accuracies: 100%, 100%, 20%
Overall accuracy: ( … ) / … ≈ 21.6%
Mean class accuracy: 73.3%

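These overall-vs-per-class calculations are easy to script. Below is a minimal Python sketch with an assumed confusion matrix: class sizes of 100/100/500 with per-class accuracies of 100%, 30% and 80% reproduce the 75.7% overall and 70% mean-class figures quoted above, but the off-diagonal splits are made up for illustration.

```python
def class_accuracies(cm):
    """Per-class accuracy: diagonal cell / row total (rows = true class)."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]

def overall_accuracy(cm):
    """All correct predictions / all predictions made."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def mean_class_accuracy(cm):
    """Unweighted mean of the per-class accuracies."""
    accs = class_accuracies(cm)
    return sum(accs) / len(accs)

cm = [
    [100,  0,   0],  # true A: 100 examples, all predicted A
    [ 35, 30,  35],  # true B: 100 examples, 30 correct (split assumed)
    [ 50, 50, 400],  # true C: 500 examples, 400 correct (split assumed)
]
print(class_accuracies(cm))              # [1.0, 0.3, 0.8]
print(round(overall_accuracy(cm), 3))    # 0.757
print(round(mean_class_accuracy(cm), 3)) # 0.7
```

Note how the large class C dominates the overall figure, while the mean class accuracy weights every class equally.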
Correlation

Phone use (hrs) vs life expectancy
[data table lost in transcription]

Are these two things correlated?

Correlation

What about these (web credit)

Correlation Measures

It is easy to calculate a number that tells you how well two things are correlated. The most common is “Pearson’s r”. The r measure is:
r = 1 for perfectly positively correlated data (as A increases, B increases, and the line exactly fits the points)
r = −1 for perfectly negative correlation (as A increases, B decreases, and the line exactly fits the points)

Correlation Measures

r = 0: no correlation – there seems to be not the slightest hint of any relationship between A and B.
More general and usual values of r:
if r >= 0.9 (or r <= −0.9) – a ‘strong’ correlation
else if r >= 0.65 (or r <= −0.65) – a moderate correlation
else if r >= 0.2 (or r <= −0.2) – a weak correlation

Calculating r

You will remember the sample standard deviation: when you have a sample of n different values x_1, …, x_n whose mean is x̄, the sample std is the square root of the sample variance:

    s = √( Σ_i (x_i − x̄)² / (n − 1) )

Calculating r

If we have pairs of (x, y) values, Pearson’s r is:

    r = Σ_i (x_i − x̄)(y_i − ȳ) / ( (n − 1) · s_x · s_y )

Interpretation of this should be obvious (?)
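The formula can be sketched directly in plain Python (the data points below are made up for illustration):

```python
from math import sqrt

def mean(vals):
    return sum(vals) / len(vals)

def sample_std(vals):
    """Sample standard deviation (divide by n - 1)."""
    m = mean(vals)
    return sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))

def pearson_r(xs, ys):
    """Pearson's r for paired samples xs, ys."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return num / (sample_std(xs) * sample_std(ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 3))   # 0.775: a moderate positive correlation
```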

Correlation (Pearson’s r) and covariance

As we just saw, Pearson’s r is:

    r = Σ_i (x_i − x̄)(y_i − ȳ) / ( (n − 1) · s_x · s_y )

Correlation (Pearson’s r) and covariance

The numerator,

    cov(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / (n − 1),

is the covariance between x and y, often called cov(x, y) – so r = cov(x, y) / (s_x · s_y).

[Two example scatter plots: one with cov(x, y) = 6.15, one with r = −0.556 and cov(x, y) = −4.35. With the same data rescaled, r is unchanged (still −0.556) but the covariances become 61.5 and −43.5: covariance depends on the scale of the data, while r does not.]
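That scale-dependence is easy to verify: multiply x by 10 and the covariance grows tenfold, while r is untouched (made-up data):

```python
from math import sqrt

def mean(vals):
    return sum(vals) / len(vals)

def cov(xs, ys):
    """Sample covariance (divide by n - 1); cov(x, x) is the variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pearson_r(xs, ys):
    return cov(xs, ys) / (sqrt(cov(xs, xs)) * sqrt(cov(ys, ys)))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
xs_scaled = [10 * x for x in xs]          # the same data in different units

print(cov(xs, ys), cov(xs_scaled, ys))    # 1.5 15.0 : covariance scales
print(round(pearson_r(xs, ys), 3),
      round(pearson_r(xs_scaled, ys), 3)) # 0.775 0.775 : r does not
```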

So, we calculate r and we think there is a correlation. What next?

We find the `best’ line through the data

Linear Regression

AKA simple regression, basic regression, etc.
Linear regression simply means: find the `best’ line through the data.
What could `best line’ mean?

Best line means …

We want the “closest” line: the one that minimizes the error, i.e. the degree to which the actual points stray from the line. For any given point, its distance from the line is called a residual.

We assume that there really is a linear relationship (e.g. for a top long-distance runner, time = miles × 5 minutes), but we expect that there is also random `noise’ that moves points away from the line (e.g. different weather conditions or different physical form when collecting the different data points). We want this noise (the deviations from the line) to be as small as possible. We also want the noise to be unrelated to A (as in predicting B from A) – in other words, we want the correlation between A and the noise to be 0.

Noise/Residuals: the regression line

[figure]

Noise/Residuals: the residuals

[figure]

Same again, with the key bits:

We want the line that minimizes the sum of the (squared) residuals, and we want the residuals themselves to have zero correlation with the variables.

The general solution to multiple regression

Suppose you have a numeric dataset, e.g.:

[table of numbers: each row is an example; the input columns are X, the final (target) column is y]

A linear classifier or regressor is:

    y_n = β1·x_n1 + β2·x_n2 + … + βm·x_nm

So, how do you get the β values?

The general solution to multiple regression

By straightforward linear algebra. Treat the data as matrices and vectors, and the solution is:

    β = (XᵀX)⁻¹ Xᵀ y
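A sketch of that solution in plain Python (no libraries): rather than forming the inverse explicitly, the code solves the equivalent normal equations (XᵀX)β = Xᵀy by Gaussian elimination. The data is made up, generated from y = 2·x1 + 3·x2 with no noise, so the recovered βs should come out as 2 and 3.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_betas(X, y):
    """beta = (X^T X)^(-1) X^T y, via the normal equations."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(r * v for r, v in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)

X = [[1, 1], [2, 1], [3, 2], [4, 5]]          # four examples, two input fields
y = [2 * a + 3 * b for a, b in X]             # target generated as 2*x1 + 3*x2
print([round(b, 6) for b in fit_betas(X, y)]) # [2.0, 3.0]
```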

And that’s general multivariate linear regression. If the data are all numbers, you can get a prediction for your ‘target’ field by deriving the beta values with linear algebra.

Note that this is regression, meaning that you predict an actual real-number value, such as wind-speed, IQ, weight, etc., rather than classification, where you predict a class value, such as ‘high’, ‘low’, ‘clever’, …

But … it is easy to turn this into classification … how?
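One common answer (an assumption here, not something the slides prescribe): encode a two-class target as 0/1, fit the regression as usual, and then threshold the real-valued prediction, e.g. at 0.5:

```python
def classify(regression_output, threshold=0.5):
    """Turn a real-valued regression prediction into a 0/1 class label."""
    return 1 if regression_output >= threshold else 0

# e.g. regression outputs for three hypothetical test examples:
print([classify(p) for p in [0.12, 0.49, 0.93]])   # [0, 0, 1]
```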

Pros and Cons

Arguably, there’s no ‘learning’ involved, since the beta values are obtained by direct mathematical calculation on the data. It’s pretty fast to get the beta values too.

However: in many, many cases the ‘linear’ assumption is a poor one – the predictions simply will not be as good as, for example, decision trees, neural networks, k-nearest-neighbour, etc. And the maths/implementation involves getting the inverses of large matrices, which sometimes explodes.

Back to ‘univariate’ regression, with one field …

When there is only one x, it is called univariate regression, and it is closely related to the correlation between x and y. In this case the mathematics turns out to be equivalent to simply working out the correlation, plus a bit more.

Calculating the regression line in univariate regression

We get both of those things (m and c) like this. First, recall that any line through (x, y) points can be represented by:

    y = mx + c

where m is the gradient and c is where the line crosses the y-axis (at x = 0). This is just the ‘beta’ equation in the case of a single x.

Calculating the regression line

To get m, we can work out Pearson’s r and then calculate:

    m = r · s_y / s_x

To get c, we just do this:

    c = ȳ − m · x̄

Job done.
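A quick check of this recipe (m = r·s_y/s_x, c = ȳ − m·x̄) on made-up points generated from y = 3x + 1, so m and c should come out as 3 and 1:

```python
from math import sqrt

def mean(vals):
    return sum(vals) / len(vals)

def sample_std(vals):
    m = mean(vals)
    return sqrt(sum((v - m) ** 2 for v in vals) / (len(vals) - 1))

def pearson_r(xs, ys):
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return num / (sample_std(xs) * sample_std(ys))

xs = [1, 2, 3, 4]
ys = [3 * x + 1 for x in xs]              # 4, 7, 10, 13: exactly on a line, so r = 1
r = pearson_r(xs, ys)
m = r * sample_std(ys) / sample_std(xs)   # gradient
c = mean(ys) - m * mean(xs)               # intercept
print(round(m, 6), round(c, 6))           # 3.0 1.0
```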

Now we can:

Try to gain some insight from m – e.g. “since m = −4, this suggests that every extra hour of watching TV leads to a reduction in IQ of 4 points”, or “since m = 0.05, it seems that an extra hour per day of mobile phone use leads to a 5% increase in the likelihood of developing brain tumours”.

BUT

Be careful: it is easy to calculate the regression line, but always remember what the r value actually is, since this gives an idea of how accurate the assumption is that the relationship really is linear. So, the regression line might suggest “… an extra hour per day of mobile phone use leads to a 5% increase in the likelihood of developing brain tumours”, but if r ≈ 0.3 (say) we can’t say we are very confident about this. Either there is not enough data, or the relationship is different from linear.

Prediction

Commonly, we can of course use the regression line for prediction. So, the predicted life expectancy for 3 hrs per day is 82 – fine.

BUT II

We cannot extrapolate beyond the data we have! So, the predicted life expectancy for 11.5 hrs per day is 68 …

BUT II

We cannot extrapolate beyond the data we have! Extrapolation outside the range of the data is invalid – not allowed, not statistically sound. Why?

What would we predict for 8 hrs if we did regression based only on the first cluster?

So, always be careful

When we look at data like that, we can do piecewise linear regression instead. What do you think that is? Often we may think some kind of curve fits best, so there are things called polynomial regression, logistic regression, etc., which deal with that.

Example from current research

In a current project I have to ‘learn’ 432 different regression classifiers every week. Each of those 432 classifiers is learned from a dataset that has about 50,000 fields. E.g. one of these classifiers learns to predict the value of wind-speed 6 hours ahead of now at Findhorn, Scotland. The 50,000 fields are weather observations (current and historic) from several places within 100 km of Findhorn. I use multiple regression for this, but first I do feature selection (next lecture).

CW2