Data Mining (and machine learning)
Correlation and Coursework 3
David Corne and Nick Taylor, Heriot-Watt University
Today

A bit more basic statistics: correlation
- Understanding whether two fields of the data are related. Can you predict one from the other? Or is there some underlying cause that affects both?

Feature selection, using correlation
- Some datasets are very big, with too many fields, making machine learning impossible. We have to select features before we can do DM; which ones?
- Another reason to select features: some are no help in prediction, and may make results worse.

Confusion matrices
- A simple thing that supports interpretation of ML results.

Coursework 3
Correlation
Co-relation: when two fields seem related to each other, e.g. when an instance has a high value for field A, it tends to have a high value for field B.
Correlation
Are these two things correlated?

Phone use (hrs)   Life expectancy
      1                 84
      2                 78
      3                 91
      4                 79
      5                 69
      6                 80
      7                 76
      8                 (missing)
      9                 75
     10                 70
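It is worth computing rather than eyeballing. A minimal Python sketch (assuming the missing 8-hour value is simply skipped):

```python
import numpy as np

# The table's pairs; the life-expectancy value for 8 hrs is missing
# in the source, so that row is skipped.
hours = np.array([1, 2, 3, 4, 5, 6, 7, 9, 10])
life  = np.array([84, 78, 91, 79, 69, 80, 76, 75, 70])

print(np.corrcoef(hours, life)[0, 1])  # roughly -0.6: a moderate negative r
```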
Correlation
[figure]
What about these? (web credit)
[figure]
What about these? (web credit)
[figure]
Correlation Measures

It is easy to calculate a number that tells you how well two things are correlated. The most common is "Pearson's r".

The r measure is:
- r = 1 for perfectly positively correlated data (as A increases, B increases, and the line exactly fits the points)
- r = -1 for perfectly negative correlation (as A increases, B decreases, and the line exactly fits the points)
Correlation Measures
- r = 0 for no correlation: there seems not the slightest hint of any relationship between A and B. E.g. if the numbers in field A or B (or both) were generated randomly, you would expect r close to 0.

More general and usual values of r:
- if r >= 0.9 (or r <= -0.9): a 'strong' correlation
- else if r >= 0.65 (or r <= -0.65): a moderate correlation
- else if r >= 0.2 (or r <= -0.2): a weak correlation
- otherwise: nothing worth calling a correlation
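These rules of thumb translate into a tiny helper; a minimal sketch (the thresholds are the slide's rough cut-offs, and the labels are illustrative):

```python
def correlation_strength(r: float) -> str:
    """Label an r value using the slide's rough thresholds."""
    a = abs(r)
    if a >= 0.9:
        return "strong"
    elif a >= 0.65:
        return "moderate"
    elif a >= 0.2:
        return "weak"
    return "none"

print(correlation_strength(-0.68))  # "moderate"
```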
Calculating r

You will remember the sample standard deviation: for a sample of n values $x_1, \ldots, x_n$ whose mean is $\bar{x}$, the sample standard deviation is

$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$
Calculating r

If we have n pairs of (x, y) values, Pearson's r is:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}$

where $s_x$ and $s_y$ are the sample standard deviations of the x and y values. Interpretation of this should be obvious (?)
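As a check on the formula, a minimal sketch in Python (NumPy's built-in corrcoef is used only to confirm the hand-rolled version; the data is illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r exactly as in the formula above (n-1 conventions)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    cov = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
    return cov / (x.std(ddof=1) * y.std(ddof=1))

x = [1, 2, 3, 4, 5]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
print(pearson_r(x, y))          # close to 1: nearly perfect positive
print(np.corrcoef(x, y)[0, 1])  # NumPy agrees
```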
Calculating r

Looking at it another way: after z-normalisation,

$r = \frac{1}{n-1} \sum_{i=1}^{n} X_i Y_i$

where $X_i$ is the z-normalised x value in the sample, indicating how many standard deviations it is from the mean ($X_i = (x_i - \bar{x})/s_x$), and the same for $Y_i$.

So when x and y tend to be the same number of standard deviations from their means, in the same (different) direction, the contribution is positive (negative). If they are uncorrelated, positive and negative contributions will tend to cancel out.
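The same quantity computed through z-scores, which makes the sign argument concrete: each product $X_i Y_i$ is one instance's contribution (a sketch, equivalent to the function above):

```python
import numpy as np

def pearson_r_z(x, y):
    """r as the (n-1)-averaged sum of products of z-scores."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return np.sum(zx * zy) / (len(x) - 1)

# Perfectly anti-correlated data: every product zx*zy is negative,
# so the contributions add up to r = -1.
print(pearson_r_z([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))
```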
From the communities and crime dataset (names file)

[Table: for each attribute (Population, Householdsize, Racepctblack, racePctWhite, ..., PctLess9thGrade), the names file gives min, max, mean, std, correlation with the target ViolentCrimesPerPop, median and mode. Recoverable examples: racePctWhite has mean 0.75, std 0.24, correlation -0.68; Racepctblack has mean 0.18, std 0.25, correlation 0.63.]
The top 20 (although the first doesn't count, being the target field's correlation with itself)

Attribute              r (with ViolentCrimesPerPop)
ViolentCrimesPerPop     1.00
PctIlleg                0.74
PctKids2Par            -0.74
PctFam2Par             -0.71
racePctWhite           -0.68
PctYoungKids2Par       -0.67
PctTeen2Par            -0.66
racepctblack            0.63
pctWInvInc             -0.58
pctWPubAsst             0.57
FemalePctDiv            0.56
TotalPctDiv             0.55
PctPolicBlack           (not recoverable)
MalePctDivorce          0.53
PctPersOwnOccup        -0.53
PctPopUnderPov          0.52
PctUnemployed           (not recoverable)
PctHousNoPhone          (not recoverable)
PctPolicMinor           (not recoverable)
PctNotHSGrad            (not recoverable)
What use is correlation in data mining?
Feature Selection

Feature = field. Some datasets have hundreds or thousands of features, and this is a big problem for data mining and machine learning. Too many features:
- make machine learning methods much too slow
- may confuse machine learning methods with irrelevant features, so they give poor results on test sets

So, dataminers do feature selection: reducing the size of the dataset by choosing what seem to be the most relevant features for the machine learning task.
Feature selection: how?

The simplest and most common approach, e.g. to choose 100 features from a dataset with 1000 features, is simply to work out the correlation of each feature with the class field, and choose the 100 with the highest absolute correlation values.
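A minimal sketch of that ranking (the function name, shapes, and default k are illustrative, not from the slides):

```python
import numpy as np

def select_top_k(X, y, k=100):
    """Rank features by |Pearson correlation| with the class field and
    keep the k best. X: (n_samples, n_features), y: (n_samples,)."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    Xc = X - X.mean(axis=0)           # centre each feature column
    yc = y - y.mean()                 # centre the class field
    # Pearson r of each column with y; the (n-1) factors cancel out
    corrs = (Xc * yc[:, None]).sum(axis=0) / np.sqrt(
        (Xc**2).sum(axis=0) * (yc**2).sum())
    top = np.argsort(-np.abs(corrs))[:k]
    return top, corrs[top]

# usage: indices, r_values = select_top_k(X, y, k=100)
```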
Can anyone see a potential problem with this?
Feature selection: how? (II)

Work out the correlation between every distinct pair of features in the dataset. If two features are strongly correlated with each other, is there any value to the machine learning process in including both of them?
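One way to act on that observation is the greedy filter sketched below; it is a common heuristic rather than a method the slides prescribe, and the 0.9 threshold is illustrative:

```python
import numpy as np

def drop_redundant(X, threshold=0.9):
    """Walk the features in order and drop any feature whose absolute
    correlation with an already-kept feature exceeds `threshold`."""
    corr = np.abs(np.corrcoef(np.asarray(X, float), rowvar=False))
    kept = []
    for j in range(corr.shape[0]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return kept  # column indices of the surviving features
```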
Feature selection methods

There are many FS methods, all of which try to reduce datasets by including only features that carry relevant information for the machine learning task, and which try to avoid having many different versions of the same feature. There is no agreement or theory that prefers some methods over others. So, if you need to do FS, you might as well do it by correlation-based ranking of features. However …
Feature selection III

In some cases it is interesting or important to find a classifier that uses as small a number of features as possible. In gene expression datasets, there may be ~10,000 features. But if you can find an accurate classifier that uses only 3 or 4 features, this means you have potentially identified, for example, the 3 or 4 genes that are most relevant to a particular form of leukaemia, or hepatitis B, etc. For that type of case, FS is not a preparatory step; it is the main part of the DM/ML process. Read the paper that I point to on the www site.
Confusion matrices

Suppose you have four classes in the dataset, and overall accuracy is 80%. Sounds good, but …
Maybe the situation on the test set is this:

[confusion matrix: predicted class A-D across, actual class A-D down; the cell counts from the original slide are not recoverable]
Or maybe it is this, which is rather better?

[a second confusion matrix with the same layout; cell counts not recoverable]
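To make the point concrete, a sketch with hypothetical counts (not the slides' numbers): a classifier can score 80% overall while one class is never predicted correctly, and the per-class recall row exposes this.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows = actual class, columns = predicted class."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

# Hypothetical test set of 100 instances, classes 0..3 standing for A..D.
# Overall accuracy is 80%, yet class D is never predicted correctly.
y_true = [0]*40 + [1]*30 + [2]*20 + [3]*10
y_pred = [0]*40 + [1]*30 + [2]*10 + [0]*10 + [0]*10

m = confusion_matrix(y_true, y_pred, 4)
print(m)
print("overall accuracy:", np.trace(m) / m.sum())         # 0.8
print("per-class recall:", m.diagonal() / m.sum(axis=1))  # [1. 1. 0.5 0.]
```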
CW3