
1 Data Mining: Concepts and Techniques — Chapter 3 — Cont.
More on Feature Selection: the Chi-squared test and Principal Component Analysis

2 Attribute Selection
[Table: the 14-record weather data set with attributes Outlook, Temperature, Humidity, Windy and the class attribute Play.]
Question: Are attributes A1 and A2 independent?
If they are very dependent, we can remove either A1 or A2.
If A1 is independent of the class attribute A2, we can remove A1 from our training data.

3 Deciding to remove attributes in feature selection
Compare an attribute A1 against A2 = the class attribute:
Dependent (chi-squared is large): keep A1.
Independent (chi-squared is small): A1 can be removed.

4 Chi-Squared Test (cont.)
Question: Are attributes A1 and A2 independent?
These features are nominal-valued (discrete).
Null hypothesis: we expect independence.
Example attributes: Outlook (Sunny, Cloudy) and Temperature (High, Low).

5 The Weather example: Observed Count
temperature Outlook High Low Outlook Subtotal Sunny 2 Cloudy 1 Temperature Subtotal: Total count in table =3 Outlook Temperature Sunny High Cloudy Low

6 The Weather example: Expected Count
If the attributes were independent, then the counts implied by the subtotals (the expected counts) would be:

            Temperature
Outlook     High            Low             Subtotal
Sunny       2*2/3 = 1.33    2*1/3 = 0.67    2
Cloudy      1*2/3 = 0.67    1*1/3 = 0.33    1
Subtotal    2               1               Total count in table = 3

7 Question: How different are the observed counts from the expected ones?
If the chi-squared value is very large, then A1 and A2 are not independent, that is, they are dependent!
Degrees of freedom: if the table has n*m cells, then degrees of freedom = (n-1)*(m-1).
In our example, degrees of freedom = 1.
Chi-squared = ?
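The difference is measured by the standard chi-squared statistic, summed over every cell of the table:

```latex
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

where O_ij is the observed count and E_ij the expected count in cell (i, j).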

8 Chi-Squared Table: what does it mean?
If the calculated value is much greater than the value in the table, then you have reason to reject the independence assumption.
When your calculated chi-squared value is greater than the chi-squared value shown in the 0.05 column of the table (3.84), you are 95% certain that the attributes are actually dependent!
That is, there is only a 5% probability that your calculated chi-squared value would occur by chance.

9 Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
We don't have to have a two-dimensional count table (also known as a contingency table).
Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Honours class over the past ten years there have been 80 females and 40 males.
Question: Is this a significant departure from the 1:1 expectation?

Observed (Honours)    Male    Female    Total
                      40      80        120

10 Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Honours class over the past ten years there have been 80 females and 40 males. Question: Is this a significant departure from the 1:1 expectation?
Note: the expected counts are filled in from the 1:1 expectation, not calculated from subtotals.

Expected (Honours)    Male    Female    Total
                      60      60        120

11 Chi-Squared Calculation
                        Female    Male     Total
Observed numbers (O)    80        40       120
Expected numbers (E)    60        60       120
O - E                   20        -20
(O - E)^2               400       400
(O - E)^2 / E           6.67      6.67     Sum = 13.34 = X^2
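A minimal Python sketch of this calculation (the observed and expected counts are those in the table above):

```python
# Chi-squared calculation for the Honours example.
observed = [80, 40]   # females, males
expected = [60, 60]   # 1:1 expectation over 120 students

# (O - E)^2 / E for each category, then sum.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)    # 13.33..., which the slide rounds to 6.67 + 6.67 = 13.34
```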

12 Chi-Squared Test (Cont.)
Then, check the chi-squared table for significance: compare our X^2 value with a chi-squared value from a table of chi-squared with n-1 degrees of freedom, where n is the number of categories (2 in our case: males and females). So we have only one degree of freedom (n-1 = 1).
From the chi-squared table, we find a critical value of 3.84 for p = 0.05.
Since 13.34 > 3.84, the expectation (that Male:Female in the Honours class is 1:1) is rejected!
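The same test can also be run directly with SciPy (a sketch under the assumption that SciPy is available; the slides themselves use a printed chi-squared table):

```python
from scipy.stats import chi2, chisquare

# Goodness-of-fit test of the observed counts against the 1:1 expectation.
stat, p_value = chisquare(f_obs=[80, 40], f_exp=[60, 60])
print(stat, p_value)         # statistic ~13.33, p-value well below 0.05

# Critical value for p = 0.05 with 1 degree of freedom (the 3.84 from the table).
print(chi2.ppf(0.95, df=1))  # ~3.84
```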

13 Chi-Squared Test in Weka: weather.nominal.arff

14 Chi-Squared Test in Weka

15 Chi-Squared Test in Weka

16 Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}
[Figure: decision tree that splits on A4, then A1 and A6, with leaves labeled Class 1 and Class 2.]
Reduced attribute set: {A1, A4, A6}

17 Principal Component Analysis
Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data.
The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions).
Each data vector Xj is a linear combination of the c principal component vectors Y1, Y2, ..., Yc:
Xj = m + w1*Y1 + w2*Y2 + ... + wc*Yc,   j = 1, 2, ..., N
where m is the mean vector of the data set, w1, w2, ... are the component weights, and Y1, Y2, ... are the eigenvectors.
Works for numeric data only.
Used when the number of dimensions is large.

18 Principal Component Analysis
See online tutorials for more detail.
[Figure: data points in the (X1, X2) plane with the principal directions Y1 and Y2 drawn through them.]
Note: Y1 is the first eigenvector, Y2 is the second; Y2 is ignorable.
Key observation: the first direction Y1 is the direction along which the variance is largest!

19 Principal Component Analysis: one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 35
Question: how much spread is in the data along the axis (distance to the mean)?
Variance = (standard deviation)^2
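A small sketch of this computation with NumPy (the slides show no code; the numbers are the temperatures listed above):

```python
import numpy as np

temperature = np.array([42, 40, 24, 30, 15, 18, 35])

mean = temperature.mean()
variance = temperature.var(ddof=1)   # sample variance (divide by n - 1)
std_dev = temperature.std(ddof=1)    # standard deviation = sqrt(variance)

print(mean, variance, std_dev)
```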

20 Now consider two dimensions
X = Temperature, Y = Humidity
[Table: paired Temperature and Humidity values for several records.]
Covariance measures the correlation between X and Y:
cov(X,Y) = 0: the attributes are uncorrelated (independent attributes have zero covariance)
cov(X,Y) > 0: X and Y move in the same direction
cov(X,Y) < 0: X and Y move in opposite directions
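A sketch of the covariance computation with NumPy; the paired values below are illustrative placeholders, not the full table from the slide:

```python
import numpy as np

# Illustrative Temperature/Humidity pairs (placeholder data).
temperature = np.array([40.0, 30.0, 15.0])
humidity    = np.array([90.0, 80.0, 70.0])

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(X, Y).
cov_xy = np.cov(temperature, humidity)[0, 1]
print(cov_xy)   # positive here: the two attributes move in the same direction
```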

21 More than two attributes: covariance matrix
Contains the covariance values between all possible pairs of dimensions (= attributes).
Example for three attributes (x, y, z) shown below.
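For reference, the covariance matrix for three attributes has the standard form:

```latex
C =
\begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix}
```

The diagonal entries are the variances of the individual attributes, and the matrix is symmetric because cov(x, y) = cov(y, x).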

22 Background: eigenvalues and eigenvectors
Eigenvectors e and eigenvalues λ of a matrix C satisfy C e = λ e.
How to calculate e and λ: compute det(C - λI), which yields a polynomial of degree n; the roots of det(C - λI) = 0 are the eigenvalues λ.
Check out any math book, such as Elementary Linear Algebra by Howard Anton (John Wiley & Sons), or any math package such as MATLAB.

23 Steps of PCA
Let m be the mean vector (taking the mean of all rows).
1. Adjust the original data by the mean: X' = X - m
2. Compute the covariance matrix C of the adjusted data X'
3. Find the eigenvectors and eigenvalues of C: for matrix C, an eigenvector is a (column) vector e such that Ce has the same direction as e, i.e. Ce = λe; λ is called an eigenvalue of C. Ce = λe is equivalent to (C - λI)e = 0.
Most data mining packages do this for you.

24 Steps of PCA (cont.)
Calculate the eigenvalues λ and eigenvectors e of the covariance matrix:
Eigenvalue λj corresponds to the variance along component j; thus, sort the components by λj.
Take the first n eigenvectors ei, where n is the number of top eigenvalues.
These are the directions with the largest variances.
A minimal code sketch of these steps appears below.
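A minimal NumPy sketch of these steps (a sketch only, not the original course code; the example rows are the two-attribute data set of the next slide):

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top n_components principal directions."""
    # 1. Adjust the data by the mean of each attribute (column).
    X_adj = X - X.mean(axis=0)

    # 2. Covariance matrix of the adjusted data (columns = attributes).
    C = np.cov(X_adj, rowvar=False)

    # 3. Eigenvalues and eigenvectors of C.
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # 4. Sort by eigenvalue (largest variance first) and keep the top directions.
    order = np.argsort(eigenvalues)[::-1]
    top = eigenvectors[:, order[:n_components]]

    # 5. Transform: each row becomes its coordinates along the kept directions.
    return X_adj @ top

X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)
print(pca(X, 1))   # one value per record (sign may differ from the slide)
```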

25 An Example
Mean1 = 24.1, Mean2 = 53.8

X1    X2    X1'     X2'
19    63    -5.1     9.25
39    74    14.9    20.25
30    87     5.9    33.25
30    23     5.9   -30.75
15    35    -9.1   -18.75
15    43    -9.1   -10.75
15    32    -9.1   -21.75
30    73     5.9    19.25

26 Covariance Matrix
C = | 75   106 |
    | 106  482 |
Using MATLAB, we find out the eigenvectors and eigenvalues:
e1 = (-0.98, -0.21), λ1 = 51.8
e2 = (0.21, -0.98), λ2 = 560.2
Thus the second eigenvector e2 is more important!

27 If we only keep one dimension: e2
yi: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
We keep the dimension of e2 = (0.21, -0.98).
We obtain the final (one-dimensional) data as yi = 0.21*X1' - 0.98*X2'.
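A short sketch of this projection, using the centered columns X1' and X2' from slide 25 and e2 = (0.21, -0.98):

```python
import numpy as np

# Centered data (X1', X2') from the example table.
x1_adj = np.array([-5.1, 14.9, 5.9, 5.9, -9.1, -9.1, -9.1, 5.9])
x2_adj = np.array([9.25, 20.25, 33.25, -30.75, -18.75, -10.75, -21.75, 19.25])

# Keep only the direction e2 = (0.21, -0.98): yi = 0.21 * X1' - 0.98 * X2'.
y = 0.21 * x1_adj - 0.98 * x2_adj
print(y)   # approximately -10.14, -16.72, -31.35, 31.37, 16.46, 8.62, 19.40, -17.63
```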

28 Using MATLAB to figure it out

29 PCA in Weka

30 Weather Data from the UCI Dataset (comes with the Weka package)

31 PCA in Weka (I)

32

33 Summary of PCA
PCA is used for reducing the number of numerical attributes.
The key is in the data transformation:
Adjust the data by the mean
Find the eigenvectors of the covariance matrix
Transform the data
Note: PCA uses only linear combinations of the data (weighted sums of the original attributes).

34 Missing and Inconsistent values
Linear regression: data are modeled to fit a straight line; the least-squares method fits Y = a + bX.
Multiple regression: Y = b0 + b1*X1 + b2*X2.
Many nonlinear functions can be transformed into the above.
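A sketch of a least-squares fit with NumPy (the age/height pairs are illustrative placeholders, not data from the slides):

```python
import numpy as np

# Illustrative age/height pairs.
age    = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
height = np.array([4.2, 5.9, 8.1, 9.8, 12.1])

# Least-squares fit of the straight line Y = a + b*X.
b, a = np.polyfit(age, height, deg=1)   # polyfit returns [slope, intercept] for deg=1
print(a, b)

# A missing height can then be filled in from the fitted line.
print(a + b * 6.0)
```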

35 Regression
[Figure: scatter plot of Age (X1) against Height, with the fitted line y = x + 1; a point Y1 and its predicted value Y1' on the line.]

36 Clustering for Outlier detection
Outliers can be incorrect data.
Clusters capture the majority behavior; points that do not fit any cluster well are candidate outliers.
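A minimal sketch of the idea, treating the bulk of the data as a single cluster whose center is the mean (a real pipeline would run a clustering algorithm such as k-means to get several centers):

```python
import numpy as np

# One tight group of points plus one suspicious value (illustrative data).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0], [1.1, 1.2],
              [0.8, 0.9], [9.0, 0.5]])

# Center of the "cluster" formed by the majority of the data.
center = X.mean(axis=0)
dist = np.linalg.norm(X - center, axis=1)

# Points much farther from the center than the typical point are candidate outliers.
outliers = X[dist > 3 * np.median(dist)]
print(outliers)   # [[9.  0.5]]
```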

37 Data Reduction with Sampling
Allow a mining algorithm to run with complexity that is potentially sub-linear in the size of the data.
Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skewed (uneven) classes.
Develop adaptive sampling methods, e.g. stratified sampling: approximate the percentage of each class (or subpopulation of interest) in the overall database; used in conjunction with skewed data (see the sketch below).
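A sketch of stratified sampling with pandas (the use of pandas is an assumption, the slides name no tool): each class is sampled in proportion to its share of the data.

```python
import pandas as pd

# Toy data set with a skewed class distribution (illustrative).
df = pd.DataFrame({
    "value": range(1000),
    "cls":   ["rare"] * 50 + ["common"] * 950,
})

# Stratified sample: take 10% of each class, so the rare class keeps roughly
# the same share it has in the full data.
sample = df.groupby("cls", group_keys=False).sample(frac=0.1, random_state=0)
print(sample["cls"].value_counts())   # about 5 rare, 95 common
```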

38 Sampling
[Figure: the raw data sampled in two ways: SRSWOR (simple random sample without replacement) and SRSWR (simple random sample with replacement).]
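Both schemes in a short NumPy sketch (not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.arange(100)   # stands in for the raw data

# SRSWOR: simple random sample without replacement (no record is drawn twice).
srswor = rng.choice(data, size=10, replace=False)

# SRSWR: simple random sample with replacement (a record may be drawn again).
srswr = rng.choice(data, size=10, replace=True)

print(srswor)
print(srswr)
```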

39 Sampling Example
[Figure: the raw data and a corresponding cluster/stratified sample.]

40 Summary
Data preparation is a big issue for data mining.
Data preparation includes:
Data warehousing
Data reduction and feature selection
Discretization
Missing values
Incorrect values
Sampling

