Practice Project Overview

Practice Project Overview CSCE 4143: Data Mining Yueyang Wang Spring 2019

Data: Adult dataset

Description of dataset. Figure 1: Boxplots of numeric attributes. Online source: http://www.dataminingmasters.com/uploads/studentProjects/Earning_potential_report.pdf

Data Preprocessing: Remove records with unknown (?) values from both the train and test data sets

Data Preprocessing: Remove all continuous attributes
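A minimal pandas sketch of these two preprocessing steps, using a tiny made-up sample in place of the real Adult train/test files:

```python
import pandas as pd

# Tiny stand-in for the Adult dataset; the real files use '?' for unknown values.
df = pd.DataFrame({
    "age":       [39, 50, 38, 53],
    "workclass": ["State-gov", "Self-emp-not-inc", "?", "Private"],
    "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    "income":    ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Step 1: remove records containing unknown ('?') values.
clean = df[~df.isin(["?"]).any(axis=1)]

# Step 2: remove all continuous (numeric) attributes.
categorical = clean.select_dtypes(exclude="number")

print(categorical.shape)  # one row dropped, the numeric 'age' column removed
```

Apply the same two steps to both the train and the test file so their columns stay aligned.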

Q1.a Build a decision tree classifier (single tree) and report accuracy by class, including TP rate, FP rate, precision, recall, and F1, on the test data. Apply Weka.

Q1.a Build a decision tree classifier (single tree) and report accuracy by class, including TP rate, FP rate, precision, recall, and F1, on the test data. Use Scikit-Learn.
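A scikit-learn sketch for this step, using a toy one-hot table in place of the preprocessed Adult data. `classification_report` covers precision, recall (= TP rate), and F1; the FP rate is not in the report, so it is derived from the confusion matrix:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-in for the encoded train data with a binary income label.
train = pd.DataFrame({
    "workclass_Private": [1, 0, 1, 0, 1, 0],
    "education_HS-grad": [0, 1, 1, 0, 0, 1],
    "income":            [0, 0, 1, 1, 0, 1],
})
X, y = train.drop(columns="income"), train["income"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = clf.predict(X)  # in the assignment, predict on the held-out test set instead

# Precision, recall (= TP rate), and F1 per class:
print(classification_report(y, pred, zero_division=0))

# FP rate per class = FP / (FP + TN), taken from the confusion matrix.
cm = confusion_matrix(y, pred)
fp = cm.sum(axis=0) - cm.diagonal()
tn = cm.sum() - cm.sum(axis=0) - cm.sum(axis=1) + cm.diagonal()
fpr = fp / (fp + tn)
print("FP rate per class:", fpr)
```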

Q1.b Build a naïve Bayesian classifier and report accuracy by class, including TP rate, FP rate, precision, recall, and F1, on the test data. Apply Weka.

Q1.b Build a naïve Bayesian classifier and report accuracy by class, including TP rate, FP rate, precision, recall, and F1, on the test data. Use Scikit-Learn.
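A sketch using scikit-learn's `CategoricalNB`, which works directly on integer-coded categorical attributes (via `OrdinalEncoder`), with a made-up sample in place of the cleaned Adult columns:

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for the cleaned categorical Adult attributes.
X = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "Bachelors"],
})
y = ["<=50K", ">50K", "<=50K", ">50K"]

# CategoricalNB expects integer category codes, hence the encoder.
enc = OrdinalEncoder()
nb = CategoricalNB().fit(enc.fit_transform(X), y)

query = pd.DataFrame({"workclass": ["Private"], "education": ["HS-grad"]})
pred = nb.predict(enc.transform(query))
print(pred)  # ['<=50K'] on this toy sample
```

The per-class metrics can then be reported with `classification_report` and a confusion matrix, exactly as in Q1.a.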

Data Preprocessing: Use one-hot encoding to transform each multi-valued categorical attribute. Apply Weka.

Data Preprocessing: Use one-hot encoding to transform each multi-valued categorical attribute.
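In pandas this step is a one-liner with `get_dummies`; each value in the attribute's domain becomes its own binary column (toy column shown, not the real data):

```python
import pandas as pd

# A multi-valued categorical attribute becomes one binary column per value.
df = pd.DataFrame({"workclass": ["Private", "State-gov", "Private"]})
onehot = pd.get_dummies(df, columns=["workclass"], dtype=int)
print(list(onehot.columns))  # ['workclass_Private', 'workclass_State-gov']
```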

Data Preprocessing: For each numerical attribute, use the mean value to transform it into a binary attribute. Use Python.
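A sketch of one common thresholding rule (1 if the value is above the column mean, else 0; the exact tie-handling is your choice), on made-up numbers:

```python
import pandas as pd

# Toy numeric attributes standing in for the Adult columns.
df = pd.DataFrame({"age": [25, 40, 60], "hours-per-week": [20, 40, 60]})

# Binarize each column against its own mean.
binary = (df > df.mean()).astype(int)
print(binary["age"].tolist())  # [0, 0, 1]  (mean age here is ~41.7)
```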

Q2.a Run the k-means clustering algorithm over the train data with varied k values (3, 5, 10), based on your chosen distance function, and report the centroids of the clusters.

Q2.a Run the k-means clustering algorithm over the train data with varied k values (3, 5, 10), based on your chosen distance function, and report the centroids of the clusters. Transform income (<=50K, >50K) to binary.
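A scikit-learn sketch on a random binary matrix standing in for the fully binarized train data. Note that scikit-learn's `KMeans` is fixed to Euclidean distance; if you want a different distance function, Weka's SimpleKMeans or your own implementation is the place to choose it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random binary records standing in for the binarized Adult train data.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 6)).astype(float)

for k in (3, 5, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Each row of cluster_centers_ is one centroid to report.
    print(f"k={k}: centroid matrix shape {km.cluster_centers_.shape}")
```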

Q2.b Use the last 10 records from the test data and apply the kNN algorithm (with varied k values: 3, 5, 10) to report the prediction accuracy.

Q2.b Use the last 10 records from the test data and apply the kNN algorithm (with varied k values: 3, 5, 10) to report the prediction accuracy.
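A sketch with `KNeighborsClassifier`, again on random binary stand-ins for the encoded train/test splits; the slicing shows how to take only the last 10 test records:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random binary stand-ins for the encoded train and test data.
rng = np.random.default_rng(1)
X_train = rng.integers(0, 2, size=(100, 6)).astype(float)
y_train = rng.integers(0, 2, size=100)
X_test  = rng.integers(0, 2, size=(30, 6)).astype(float)
y_test  = rng.integers(0, 2, size=30)

# The "last 10 records" of the test data, as the question asks.
X_last, y_last = X_test[-10:], y_test[-10:]

for k in (3, 5, 10):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_last, y_last)
    print(f"k={k}: accuracy {acc:.2f}")
```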

Q3. Use the train dataset from step 2, build an SVM classifier, and report the prediction accuracy on the test data. Apply Weka.

Q3. Use the train dataset from step 2, build an SVM classifier, and report the prediction accuracy on the test data.
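A scikit-learn sketch with `SVC` on random binary stand-ins for the step-2 data (in Weka the corresponding classifier is SMO):

```python
import numpy as np
from sklearn.svm import SVC

# Random binary stand-ins for the step-2 train and test data.
rng = np.random.default_rng(2)
X_train = rng.integers(0, 2, size=(100, 6)).astype(float)
y_train = rng.integers(0, 2, size=100)
X_test  = rng.integers(0, 2, size=(30, 6)).astype(float)
y_test  = rng.integers(0, 2, size=30)

svm = SVC(kernel="rbf").fit(X_train, y_train)
acc = svm.score(X_test, y_test)
print("test accuracy:", acc)
```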

Q4. Use the train dataset from step 2, build a neural network classifier, and report the prediction accuracy on the test data. Apply Weka.

Q4. Use the train dataset from step 2, build a neural network classifier, and report the prediction accuracy on the test data.
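A sketch with scikit-learn's `MLPClassifier` (Weka's analogue is MultilayerPerceptron); the hidden-layer size and iteration budget below are illustrative choices, not prescribed by the assignment:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Random binary stand-ins for the step-2 train and test data.
rng = np.random.default_rng(3)
X_train = rng.integers(0, 2, size=(100, 6)).astype(float)
y_train = rng.integers(0, 2, size=100)
X_test  = rng.integers(0, 2, size=(30, 6)).astype(float)
y_test  = rng.integers(0, 2, size=30)

# A small one-hidden-layer network trained with the default Adam optimizer.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0)
acc = mlp.fit(X_train, y_train).score(X_test, y_test)
print("test accuracy:", acc)
```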

Questions?