Practice Project Overview CSCE 4143: Data Mining Yueyang Wang Spring 2019
Data: Adult dataset
Description of dataset. Figure 1: Boxplots of numeric attributes. Online source: http://www.dataminingmasters.com/uploads/studentProjects/Earning_potential_report.pdf
Data Preprocessing: Remove records with unknown (?) values from both the train and test data sets
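The removal step can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual script: the helper names `read_records` and `drop_unknown` and the three sample records are hypothetical, and the sketch assumes comma-separated fields with `?` as the unknown marker.

```python
import csv
import io


def read_records(text):
    """Parse comma-separated text into lists of stripped fields, skipping blank lines."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [[field.strip() for field in row] for row in reader if row]


def drop_unknown(rows, missing="?"):
    """Keep only the records in which no field equals the missing marker."""
    return [row for row in rows if missing not in row]


# Hypothetical sample records (illustrative values, reduced set of columns).
SAMPLE = """39, State-gov, Bachelors, Never-married, Adm-clerical, <=50K
54, ?, HS-grad, Married-civ-spouse, ?, >50K
28, Private, Masters, Married-civ-spouse, Prof-specialty, >50K
"""

records = read_records(SAMPLE)
clean = drop_unknown(records)
print(len(records), "records read,", len(clean), "kept")
```

The same filter would be applied to the train file and the test file separately, so that both sets end up free of unknown values.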
Data Preprocessing: Remove all continuous attributes
Q1.a Build a decision tree classifier (single tree) and report per-class performance (TP rate, FP rate, precision, recall, F1) on the test data. Apply Weka
Q1.a Build a decision tree classifier (single tree) and report per-class performance (TP rate, FP rate, precision, recall, F1) on the test data. Use Scikit-Learn
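The per-class measures requested above can be computed by hand from the true and predicted labels, whichever tool trained the model. The sketch below is plain Python with a hypothetical helper `per_class_metrics` and made-up labels. It treats one class as "positive" at a time, and the TP rate is the same quantity as recall.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Return TP rate, FP rate, precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp_rate": recall, "fp_rate": fp_rate,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for eight test records (illustrative only).
y_true = [">50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K"]
y_pred = [">50K", ">50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K"]

for label in (">50K", "<=50K"):
    print(label, per_class_metrics(y_true, y_pred, positive=label))
```

Calling the helper once per class gives the "by class" table: each income class takes its turn as the positive class while the other is treated as negative.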
Q1.b Build a naïve Bayesian classifier and report per-class performance (TP rate, FP rate, precision, recall, F1) on the test data. Apply Weka
Q1.b Build a naïve Bayesian classifier and report per-class performance (TP rate, FP rate, precision, recall, F1) on the test data. Use Scikit-Learn
Data Preprocessing: Use one-hot encoding to transform multi-domain categorical attributes. Apply Weka
Data Preprocessing: Use one-hot encoding to transform multi-domain categorical attributes
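One-hot encoding replaces each categorical attribute with one 0/1 column per distinct value. The sketch below shows the idea in plain Python. The function name `one_hot_encode` and the sample rows are hypothetical, and ordering each attribute's values alphabetically is an assumption made only to keep the output repeatable.

```python
def one_hot_encode(rows, columns):
    """One-hot encode the given column indexes; other columns pass through unchanged."""
    # The domain of a column is the sorted set of distinct values seen in it.
    domains = {c: sorted({row[c] for row in rows}) for c in columns}
    encoded = []
    for row in rows:
        out = []
        for c, value in enumerate(row):
            if c in domains:
                # One 0/1 indicator per value in this column's domain.
                out.extend(1 if value == v else 0 for v in domains[c])
            else:
                out.append(value)
        encoded.append(out)
    return encoded, domains


# Hypothetical records with two categorical attributes (workclass, education).
rows = [
    ["Private", "Bachelors"],
    ["State-gov", "HS-grad"],
    ["Private", "Masters"],
]

encoded, domains = one_hot_encode(rows, columns=[0, 1])
for original, vector in zip(rows, encoded):
    print(original, "->", vector)
```

In practice the domains would be collected from the train data and reused for the test data, so that both sets end up with the same columns in the same order.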
Data Preprocessing: For each numerical attribute, use the mean value as a threshold to transform it into a binary attribute. Use Python
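A minimal sketch of the mean-based transformation follows. The slide does not say which side of the mean maps to 1, so mapping values strictly above the mean to 1 and everything else to 0 is an assumption here. The function name `binarize_by_mean` and the sample ages are hypothetical.

```python
def binarize_by_mean(values):
    """Map each value to 1 if it is strictly above the column mean, else 0."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values], mean


# Hypothetical values of one numerical attribute (age).
ages = [25, 38, 52, 45, 30]

bits, mean_age = binarize_by_mean(ages)
print("mean:", mean_age)
print("binary:", bits)
```

To keep train and test comparable, one option is to compute the mean on the train data only and apply that same threshold to the test data.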
Q2.a Run the k-means clustering algorithm over the train data with varied k values (3, 5, 10), based on your chosen distance function, and report the centroids of the clusters. Results for K=3, K=5, K=10
Q2.a Run the k-means clustering algorithm over the train data with varied k values (3, 5, 10), based on your chosen distance function, and report the centroids of the clusters. Transform income (<=50K, >50K) to binary
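The clustering step can be illustrated with a small plain-Python version of Lloyd's k-means. This is a sketch under stated assumptions, not the code used for the reported results: it uses squared Euclidean distance as the chosen distance function, seeds the centroids with the first k points so that runs are repeatable, and runs on six made-up points with k=2 instead of the real train data with k = 3, 5, 10.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm; centroids are seeded with the first k points."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged, assignments no longer change
            break
        centroids = updated
    return centroids


# Made-up 2-D points forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]

centroids = kmeans(points, k=2)
for c in centroids:
    print([round(x, 3) for x in c])
```

When every attribute is 0/1, as after one-hot encoding and mean-based binarization, each centroid component is the fraction of cluster members that have that attribute set, which makes the reported centroids easy to read.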
Q2.b Take the last 10 records from the test data and use the kNN algorithm (with varied k values: 3, 5, 10) to report the prediction accuracy. Results for K=3, K=5, K=10
Q2.b Take the last 10 records from the test data and use the kNN algorithm (with varied k values: 3, 5, 10) to report the prediction accuracy.
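A kNN prediction is a majority vote among the k train records closest to the query. The sketch below assumes binary attribute vectors and Hamming distance, which is one possible choice of distance function. The helper names and the toy data are hypothetical, and the real task would pass the train data and the last 10 test records instead.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k):
    """Majority label among the k train records nearest to `query`."""
    # sorted() is stable, so records at equal distance keep their train order.
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: hamming(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k):
    """Fraction of test records whose predicted label matches the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Toy binary data; label 1 stands for >50K and 0 for <=50K.
train_x = [[1, 0, 1], [1, 1, 1], [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 1, 1]]
train_y = [1, 1, 1, 0, 0, 0]
test_x = [[1, 1, 0], [0, 0, 1]]
test_y = [1, 0]

for k in (3, 5):
    print("k =", k, "accuracy =", accuracy(train_x, train_y, test_x, test_y, k))
```

With two classes, an even k such as 10 can produce tied votes. In this sketch a tie goes to the label that appears first among the nearest neighbours, which is an arbitrary choice.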
Q3. Using the train data sets from step 2, build an SVM classifier and report the prediction accuracy on the test data. Apply Weka
Q3. Using the train data sets from step 2, build an SVM classifier and report the prediction accuracy on the test data.
Q4. Using the train data sets from step 2, build a neural network classifier and report the prediction accuracy on the test data. Apply Weka
Q4. Using the train data sets from step 2, build a neural network classifier and report the prediction accuracy on the test data.
Questions?