Statistical Techniques

Ethnicity Classification Through Analysis of Facial Features in SAS
By: Remy Welch
Faculty Advisor: Dr. Cuixian Chen

Motivation

Objective: classify facial images using Linear Discriminant Analysis (LDA), K-Nearest Neighbor (KNN), and k-means clustering in SAS.
Applications: visual surveillance, market research, online photo albums, etc.

Dataset

The dataset consisted of 540 faces, which were categorized into 5 ethnicities (Asian, Black, Hispanic, Indian, White) based on self-report.

Statistical Techniques

Experiment: The facial images were classified into the 5 ethnicity categories using LDA and, separately, using KNN. Two cross-validation (CV) techniques were used to test the classifications: leave-one-out (LOO) and 5-fold CV. In addition, the data were grouped using a clustering procedure.

LDA: a classification tool used to linearly separate a dataset into different groups.
[Figure: the best linear discriminant function separating Group 1 data from Group 2 data]

KNN: a classification tool in which an observation is classified based on the majority vote of its k nearest neighbors.

K-means clustering: separates the data into clusters based on each data point's proximity to the clusters' means. It does not factor in the ground truth (what the people's ethnicities actually are).
[Figure: k-means example showing Cluster 1, Cluster 2, and Cluster 3]

5-fold CV: The data are separated into 5 partitions. Four fifths of the data are used to train the model (for example, the LDA), and the remaining fifth is used to test the predictions. Each partition is used as the testing group once.

LOO CV: One observation is used to test the model, and the rest are used to train it. This is repeated until every observation has been used to test the model.

Results

Technique             Accuracy (%)
LDA/LOO               96.1 ± 41.7
LDA/5-fold CV         94.4 ± 1.8
KNN/5-fold CV, k=3    97.9 ± 2.0
KNN/LOO, k=3          70.0 ± 28.5

Only Black and White (c=5)
[Table: CLUSTER by ethnicity (frequency, row pct, col pct); columns Black, White, Total; rows clusters 1 to 5]

When only Black and White faces were used, approximately 71% of the Black faces were reliably grouped into one cluster for c=2 to 5. No more than 61% of White faces were ever grouped into a single cluster. When c=5, the largest share of White faces in a single cluster was only 36% of the total number of White faces.

All Ethnicities Included (c=5)
[Table: CLUSTER by ethnicity (frequency, row pct, col pct); columns Asian, Black, Hispanic, Indian, White, Total; rows clusters 1 to 5]

When all ethnicities were included in the k-means clustering, a good separation was seen between the five clusters. However, the distribution among those clusters does not reflect the true ethnicity distribution of the data. When c=5, only the Black and Asian faces had a majority placed in a single cluster.

Conclusions

The statistical technique that produced the most accurate classification of ethnicity was KNN when cross-validated using 5-fold CV. When LOO CV was used, LDA produced the most accurate classification; however, because 5-fold CV is a more valid assessment of a procedure's accuracy, it can be concluded that KNN was better at predicting ethnicity. It should also be noted that for KNN/LOO, White faces were classified with 100% accuracy.

Clustering did not prove to be very effective at classifying the ethnicities. A large part of this may have been due to the small representation of the Asian, Hispanic, and Indian ethnicities. When only Black and White faces were considered, the clustering procedure separated the Black faces fairly well.
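The KNN majority vote and the two cross-validation schemes described above can be sketched in a few lines. This is a minimal illustration only: the poster's analysis was done in SAS, whereas this sketch uses plain Python, and the two-class synthetic data, the function names (knn_predict, cross_validate, make_synthetic), and the squared Euclidean distance are all assumptions made for the example, not the author's method or data.

```python
import random
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # `train` is a list of (features, label) pairs.
    # Squared Euclidean distance is an assumption made for this sketch.
    nearest = sorted(
        train,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(data, n_folds, k=3, seed=0):
    """Return one accuracy per fold; n_folds == len(data) gives leave-one-out."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled rows into n_folds partitions.
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test in enumerate(folds):
        # Every partition except the i-th is used for training.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(correct / len(test))
    return accuracies


def make_synthetic(n_per_class=30, seed=1):
    """Hypothetical stand-in data: two well-separated 2-D Gaussian classes."""
    rng = random.Random(seed)
    data = []
    for label, (cx, cy) in (("A", (0.0, 0.0)), ("B", (10.0, 10.0))):
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    return data


data = make_synthetic()
five_fold = cross_validate(data, n_folds=5)        # 5 partitions, each tested once
loo = cross_validate(data, n_folds=len(data))      # one observation per test fold
print("5-fold mean accuracy:", sum(five_fold) / len(five_fold))
print("LOO mean accuracy:", sum(loo) / len(loo))
```

Treating LOO as the special case where the number of folds equals the number of observations keeps both schemes in one function. Note that each LOO fold tests a single observation, so its per-fold accuracy is always either 0 or 1.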
