Data Mining – Algorithms: Linear Models Chapter 4, Section 4.6.

Numeric Attributes
Numeric prediction and/or numeric attributes as predictors
Linear regression is a well-established statistical technique
–Designed to predict a numeric value based on numeric attributes
–Determines the optimal set of coefficients (weights) for the linear equation: pred = w_0 + w_1 a_1 + w_2 a_2 + … + w_n a_n
–Optimal means the sum of the squared prediction errors is minimized
–For data mining, the fitting is done on training data so that the result can be tested on test data
–I hope that a CSC major could read a statistics book and then write the code to do this
–However, there is no need to, since this method is so widely available, unless you are seeking to create an improved version of it
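The least-squares fit on this slide can be sketched in a few dozen lines. This is a minimal sketch rather than a production implementation: the function names (`solve`, `fit_linear`, `predict`) and the toy data are assumptions made for illustration, and solving the normal equations directly is only one of several ways to compute the weights.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is non-singular, i.e. no attribute is an exact linear
    combination of the others (otherwise a division by zero occurs).
    """
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear(rows, targets):
    """Return weights [w_0, w_1, ..., w_n] minimizing the sum of squared errors."""
    X = [[1.0] + list(r) for r in rows]  # leading 1 pairs with the intercept w_0
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    return solve(XtX, Xty)  # normal equations: (X^T X) w = X^T y


def predict(w, row):
    """pred = w_0 + w_1 a_1 + ... + w_n a_n"""
    return w[0] + sum(wi * ai for wi, ai in zip(w[1:], row))


# Hypothetical training data generated from pred = 1 + 2*a_1 + 3*a_2.
train_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
train_targets = [9, 8, 19, 18, 26]

weights = fit_linear(train_rows, train_targets)
print(weights)                   # close to [1, 2, 3]
print(predict(weights, (6, 2)))  # a new "test" instance
```

Because the toy targets were generated from an exact linear equation, the fit recovers its coefficients up to rounding; real data would leave a nonzero error to measure on a test set.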

Example
<Show Basketball Spreadsheet – Baskball sheet>
NOTE – input values, weights, prediction vs. actual
<Show testReg sheet – test on separate instances>
NOTE – how it did – prediction vs. actual – difference, correlation
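The two measures noted for the test sheet, the difference between prediction and actual and their correlation, can be computed as below. This is a sketch under assumptions: the spreadsheet values are not reproduced here, so the numbers are hypothetical, and mean absolute error is used as one reasonable way to summarize the differences.

```python
from math import sqrt


def mean_abs_error(predicted, actual):
    """Average size of the difference between prediction and actual."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def correlation(predicted, actual):
    """Pearson correlation coefficient between predictions and actual values."""
    n = len(actual)
    mp = sum(predicted) / n
    ma = sum(actual) / n
    cov = sum((p - mp) * (a - ma) for p, a in zip(predicted, actual))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov / sqrt(var_p * var_a)


# Hypothetical predicted vs. actual values for four test instances.
predicted = [0.42, 0.35, 0.51, 0.28]
actual = [0.45, 0.30, 0.55, 0.33]

print(mean_abs_error(predicted, actual))
print(correlation(predicted, actual))
```

A correlation near 1 with a small error suggests that the weights learned on the training sheet carry over to the separate test instances.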

Using Regression for Classification
Perform a regression for each class
–Set the output to be predicted = 1 for training instances that belong to the class
–Set the output to be predicted = 0 for training instances that do NOT belong to the class
Do this for each class, and you will have a “membership function” equation for each class
On test, plug the new instance into each equation; the class whose equation produces the highest value is the prediction
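The one-regression-per-class procedure above might look like the following. It is a sketch only: the helper names (`fit_membership`, `classify`), the class labels, and the two-attribute toy data are assumptions for illustration, and the least-squares solver is a plain normal-equations implementation that assumes the attributes are not perfectly correlated.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (assumes A is non-singular)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear(rows, targets):
    """Least-squares weights [w_0, ..., w_n] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    return solve(XtX, Xty)


def predict(w, row):
    return w[0] + sum(wi * ai for wi, ai in zip(w[1:], row))


def fit_membership(rows, labels):
    """One regression per class: target 1 for members, 0 for non-members."""
    return {
        c: fit_linear(rows, [1.0 if lab == c else 0.0 for lab in labels])
        for c in sorted(set(labels))
    }


def classify(models, row):
    """Plug the instance into every equation; the highest value wins."""
    return max(models, key=lambda c: predict(models[c], row))


# Hypothetical training instances with two numeric attributes.
rows = [(1, 1), (1, 2), (2, 1), (5, 1), (6, 1), (5, 2), (3, 5), (3, 6), (4, 5)]
labels = ["low"] * 3 + ["medium"] * 3 + ["high"] * 3

models = fit_membership(rows, labels)
for test_row in [(1.5, 1.5), (5.5, 1.0), (3.0, 5.5)]:
    print(test_row, classify(models, test_row))
```

The toy classes are deliberately well separated; with overlapping classes, or classes that lie along a single line, the membership values can be much less decisive.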

Example
<Show discretized sheet>
NOTE – prep of data – into low, medium, high
NOTE – Weights for 3 regressions: high, med, low
<Show Test sheet>
NOTE – Calcs Hi, Med, Low (doesn’t do that well; I suspect that the data may not be from the same source (NBA), and that the discretization was a bit of a problem (very few low))

More sophisticated
Do as many pairwise competitions as necessary
Training – two classes against each other:
–temporarily toss training instances that are not one of the two
–Set output = 1 for the class to be predicted and -1 for the other
Test – do all pairwise competitions, winner of each gets a vote
–E.g. say:
–Medium beats High
–Medium beats Low
–High beats Low
–Medium wins 2-1-0
Conservative approach would be to predict nothing if no prediction dominates
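The pairwise scheme can be sketched as follows. The names (`fit_pairwise`, `vote`, `classify`) and the toy data are assumptions for illustration, and "no prediction dominates" is interpreted here as a tie for the most votes, which is one possible reading of the conservative approach.

```python
from collections import Counter
from itertools import combinations


def solve(A, b):
    """Gaussian elimination with partial pivoting (assumes A is non-singular)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_linear(rows, targets):
    """Least-squares weights [w_0, ..., w_n] via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    Xty = [sum(x[i] * y for x, y in zip(X, targets)) for i in range(k)]
    return solve(XtX, Xty)


def predict(w, row):
    return w[0] + sum(wi * ai for wi, ai in zip(w[1:], row))


def fit_pairwise(rows, labels):
    """One regression per pair of classes, trained only on those two classes."""
    models = {}
    for a, b in combinations(sorted(set(labels)), 2):
        pair = [(r, lab) for r, lab in zip(rows, labels) if lab in (a, b)]
        targets = [1.0 if lab == a else -1.0 for _, lab in pair]
        models[(a, b)] = fit_linear([r for r, _ in pair], targets)
    return models


def vote(models, row):
    """Run every pairwise competition; the winner of each gets one vote."""
    votes = Counter()
    for (a, b), w in models.items():
        votes[a if predict(w, row) > 0 else b] += 1
    return votes


def classify(models, row):
    """Return the class with the most votes, or None when the top spot is tied."""
    ranked = vote(models, row).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # conservative: predict nothing
    return ranked[0][0]


# Hypothetical training instances with two numeric attributes.
rows = [(1, 1), (1, 2), (2, 1), (5, 1), (6, 1), (5, 2), (3, 5), (3, 6), (4, 5)]
labels = ["low"] * 3 + ["medium"] * 3 + ["high"] * 3

models = fit_pairwise(rows, labels)
print(vote(models, (5.5, 1.0)))
print(classify(models, (5.5, 1.0)))
```

With three classes there are three competitions, so a class that wins both of its own contests takes two votes, as Medium does in the slide's example.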

In Context
Has been used for decades for various applications (e.g. social science research)
Bias – only searches for linear equations – no squares, cubes, etc.
To work well, data must fit a linear model – e.g. for classification, the classes must be “linearly separable” – able to be divided by a line in 2D (a plane in 3D, a hyperplane in multi-D)
To work well, attributes should not be highly correlated with each other
Depends on numeric attributes

Let’s Look at WEKA
Linear Regression with Basketball data
No correctness measures; instead:
–Correlations
–Error
Discretize Points per minute
–Try logistic regression – a categorical prediction approach

End Section 4.6
