LINEAR CLASSIFICATION METHODS STAT 597 E Fengjuan Xuan Caimiao Wei Bogdan Ilie.

Introduction
The dataset we work with (“BUPA liver disorders”) was collected by BUPA Medical Research Ltd. It consists of 345 observations on 7 variables. The first 5 variables are blood-test measurements thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The sixth variable is the number of alcoholic drinks consumed per day. Each observation is the record of a single male individual. The seventh variable is a selector that splits the dataset into two sets and indicates class identity. Of the 345 observations, 145 belong to the liver-disorder group (selector value 2) and 200 belong to the liver-normal group.

Description of variables
1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alanine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: number of half-pint equivalents of alcoholic beverages drunk per day
7. selector: field used to split the data into two sets; a binary categorical variable with values 1 and 2 (2 corresponding to liver disorder)

Matrix Plot of the variables

Logistic regression in the full space
Coefficients:
                    Value   Std. Error    t value
(Intercept)    5.99024204  2.684250011   2.231626
mcv           -0.06398345  0.029631551  -2.159301
alk           -0.01952510  0.006756806  -2.889694
sgpt          -0.06410562  0.012283808  -5.218709
sgot           0.12319769  0.024254150   5.079448
gammagt        0.01894688  0.005589619   3.389656
drinks        -0.06807958  0.040358528  -1.686870
So the classification rule is: G(x)=
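A fitted logistic model of this kind is commonly turned into a classification rule by thresholding the fitted probability at 0.5. The sketch below illustrates that convention using the coefficients reported above. It is an assumption, not the rule from the slides: the function names are hypothetical, the example subject is invented, and the slides do not state which class the fit treats as the "success" outcome, so the returned labels are placeholders.

```python
import math

# Coefficient estimates copied from the table above.
INTERCEPT = 5.99024204
COEF = {
    "mcv": -0.06398345,
    "alk": -0.01952510,
    "sgpt": -0.06410562,
    "sgot": 0.12319769,
    "gammagt": 0.01894688,
    "drinks": -0.06807958,
}


def linear_predictor(x):
    """Intercept plus the sum of coefficient * value for one subject."""
    return INTERCEPT + sum(COEF[name] * x[name] for name in COEF)


def fitted_probability(x):
    """Logistic transform of the linear predictor."""
    return 1.0 / (1.0 + math.exp(-linear_predictor(x)))


def classify(x, threshold=0.5):
    """Assumed rule: predict the 'success' class when p > threshold.

    Which class is coded as 'success' is not stated in the slides.
    """
    if fitted_probability(x) > threshold:
        return "success class"
    return "other class"


# Hypothetical subject; the values are chosen only for illustration.
subject = {"mcv": 90, "alk": 70, "sgpt": 30, "sgot": 25, "gammagt": 40, "drinks": 3}
print(fitted_probability(subject), classify(subject))
```

With a threshold of 0.5, the rule is equivalent to checking the sign of the linear predictor, which is what makes the decision boundary linear in the original variables.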

Classification error rate
Classification error on the whole training data set:
error rate: 0.2956
Sensitivity: 0.825
Specificity: 0.5379
Error rate and its standard error obtained by 10-fold cross-validation:
error rate (standard error): 0.307461384336384 (0.0271)
Sensitivity (standard error): 0.816280482802222 (0.0203)
Specificity (standard error): 0.531134992458522 (0.0699)
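As a consistency check on the training-set figures, the three rates can be recomputed from confusion counts. The counts below are not given in the slides. They are inferred from the reported sensitivity and specificity together with the class sizes of 200 and 145 from the introduction, treating the group of 200 as the "positive" class. The function name is hypothetical.

```python
def rates(tp, fn, tn, fp):
    """Error rate, sensitivity and specificity from confusion counts."""
    total = tp + fn + tn + fp
    return {
        "error_rate": (fn + fp) / total,
        "sensitivity": tp / (tp + fn),   # correct among the positive class
        "specificity": tn / (tn + fp),   # correct among the negative class
    }


# Inferred counts (an assumption): 165 of 200 and 78 of 145 classified correctly.
training = rates(tp=165, fn=35, tn=78, fp=67)
print(training)
```

Under these inferred counts, 102 of the 345 subjects are misclassified, which agrees with the reported training error rate to four decimal places.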

Backward stepwise model selection based on AIC
Five variables are selected after stepwise model selection; the first variable, mcv, is deleted.
error rate (standard error): 0.329460817156602 (0.03051)
Sensitivity (standard error): 0.792109881015521 (0.03433)
Specificity (standard error): 0.507341628959276 (0.03863)
Comment: this model has a larger classification error rate than the full model, so stepwise selection does not improve classification.
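The slides do not show how the stepwise search was run. The sketch below shows the general shape of backward elimination by AIC, where AIC = 2k - 2 log L for a model with k parameters. The `aic_of` callable is a hypothetical stand-in for refitting the model on a subset of variables, and the gains in the toy example are made-up numbers, not estimates from the BUPA data.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * log_likelihood


def backward_stepwise(variables, aic_of):
    """Greedy backward elimination.

    At each step, drop the variable whose removal gives the lowest AIC;
    stop as soon as no single removal lowers the current AIC.
    """
    current = list(variables)
    current_aic = aic_of(tuple(current))
    while len(current) > 1:
        candidates = [
            (aic_of(tuple(v for v in current if v != drop)), drop)
            for drop in current
        ]
        best_aic, best_drop = min(candidates)
        if best_aic >= current_aic:
            break
        current.remove(best_drop)
        current_aic = best_aic
    return current, current_aic


# Toy stand-in for model fitting: each variable adds a fixed, invented
# amount to the log-likelihood. These numbers are purely illustrative.
TOY_GAIN = {"mcv": 0.5, "alk": 4.0, "sgpt": 12.0, "sgot": 11.0,
            "gammagt": 5.0, "drinks": 1.5}


def toy_aic(subset):
    log_lik = -230.0 + sum(TOY_GAIN[v] for v in subset)
    return aic(log_lik, n_params=len(subset) + 1)  # +1 for the intercept


selected, final_aic = backward_stepwise(list(TOY_GAIN), toy_aic)
print(selected, final_aic)
```

Because AIC rewards fit rather than predictive accuracy on held-out data, a model chosen this way is not guaranteed to have a lower cross-validated error rate.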

Scree plot for the PCA

The performance of logistic regression on the reduced space
The reduced space is obtained by selecting the first three principal components. The standard error is obtained by 10-fold cross-validation.
error rate (standard error): 0.456256232089833 (0.023414)
Sensitivity (standard error): 0.372869939127443 (0.031675)
Specificity (standard error): 0.783003663003663 (0.030785)
Comment: the classification error rate is about 46%, which is not much better than random guessing.
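The slides do not say how the cross-validation standard errors were computed. One common convention, assumed here, is to take the standard deviation of the per-fold rates divided by the square root of the number of folds. The sketch below shows the fold split and that summary; the function names and the seed are hypothetical.

```python
import random
import statistics


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_mean_and_se(fold_values):
    """Mean of per-fold values and its standard error (sd / sqrt(k))."""
    k = len(fold_values)
    mean = statistics.mean(fold_values)
    se = statistics.stdev(fold_values) / k ** 0.5
    return mean, se


# 345 observations split into 10 folds of 34 or 35 observations each.
folds = kfold_indices(345, 10)
print([len(f) for f in folds])
```

Each fold is held out once while the model, including the principal components, is refit on the remaining nine folds, and the held-out error rates are then passed to `cv_mean_and_se`.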

The classification plot on the plane of the first two principal components

Linear Discriminant Analysis
LDA assumes a multivariate normal distribution, so we apply log transformations to some of the variables:
Y1 = mcv
Y2 = log(alk)
Y3 = log(sgpt)
Y4 = log(sgot)
Y5 = log(gammagt)
Y6 = log(drinks + 1)
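The transformations can be sketched as a small function. This is an illustration under stated assumptions: the natural logarithm is used because the slides do not give the base, Y4 is taken to be the log of sgot (the fourth variable), and the function name and example record are hypothetical.

```python
import math


def transform(row):
    """Map one record of raw measurements to Y1..Y6.

    The blood-test values must be strictly positive for the log to exist;
    drinks may be zero, which is why 1 is added before taking the log.
    """
    return {
        "Y1": row["mcv"],
        "Y2": math.log(row["alk"]),
        "Y3": math.log(row["sgpt"]),
        "Y4": math.log(row["sgot"]),
        "Y5": math.log(row["gammagt"]),
        "Y6": math.log(row["drinks"] + 1),
    }


# Hypothetical record, for illustration only.
example = {"mcv": 90, "alk": 70, "sgpt": 30, "sgot": 25, "gammagt": 40, "drinks": 0}
print(transform(example))
```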

The histogram of the sgpt variable and its log transformation

The performance of LDA on the transformed data
error rate: 0.263768115942029
Sensitivity: 0.865
Specificity: 0.558620689655172
Comment: the classification error is the smallest among all methods, and the sensitivity is higher than for any of the logistic regression models. The log transformation makes the assumption of multivariate normality more reasonable, so the classification improves.
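For reference, the textbook two-class LDA rule under the assumption of Gaussian classes with a common covariance matrix can be written as below. The notation is the standard one and is not taken from the slides: x is the vector of transformed variables, and the hatted quantities are the estimated class means, pooled covariance matrix, and class proportions.

```latex
\delta_k(x) = x^{\top} \hat{\Sigma}^{-1} \hat{\mu}_k
            - \tfrac{1}{2}\, \hat{\mu}_k^{\top} \hat{\Sigma}^{-1} \hat{\mu}_k
            + \log \hat{\pi}_k , \qquad k = 1, 2,
\qquad
\hat{G}(x) = \arg\max_{k} \, \delta_k(x).
```

Each discriminant function is linear in x, so the boundary between the two classes is a hyperplane in the transformed space.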

LDA after PCA
error rate: 0.411594202898551
Sensitivity: 0.88
Specificity: 0.186206896551724
Comment: the performance is not improved by PCA.

Conclusion
Four different methods are applied to the liver disorder data set. LDA on the transformed variables works best, and logistic regression on the original data set comes second. Classification based on the principal components does not work well: although the first three principal components account for more than 97% of the variation, they may still lose the information most important for classification. Transformations can make LDA work better in some cases. LDA assumes normality, which is a very strong assumption for many data sets. For example, in our data, all variables except the first are seriously skewed, which is why the log transformation helps.
