Don’t Get Kicked – Machine Learning Predictions for Car Buying. Albert Ho, Robert Romano, Xin Alice Wu – Department of Mechanical Engineering,

Presentation transcript:

Don’t Get Kicked – Machine Learning Predictions for Car Buying
Albert Ho, Robert Romano, Xin Alice Wu – Department of Mechanical Engineering, Stanford University – CS229: Machine Learning

Introduction
When you go to an auto dealership with the intent to buy a used car, you want a good selection to choose from. Auto dealerships purchase their used cars through auto auctions, and they want the same things: to buy as many cars as they can in the best condition possible. Our task was to use machine learning to help auto dealerships avoid bad car purchases, called “kicked cars”, at auto auctions.

Data Preprocessing/Visualization
Data Characteristics: All of our data was obtained from the Kaggle.com challenge “Don’t Get Kicked” hosted by CARVANA. Its characteristics are itemized in a numbered list later in this transcript.
Preprocessing: The steps we took to preprocess our data changed throughout the project; they are itemized in a numbered list later in this transcript.
Visualization: [figure not included in the transcript]

Algorithm Selection
MATLAB: Our initial attempts to analyze the data occurred primarily in MATLAB. Because the data was categorized into two labels, good or bad car purchases, we used logistic regression and libLINEAR [1] (version not preserved in this transcript). Initial attempts at classification went poorly due to heavy overlap between our good and bad training sets. We decided to follow a different approach based on the concept of boosting, which combines various weak classifiers to create a strong classifier [3].
Weka: To use boosting algorithms, we used the software package Weka [2] (version not preserved in this transcript). Using Weka, we could apply libLINEAR and naïve Bayes along with a slew of boosting algorithms such as adaBoostM1, logitBoost, and ensemble selection.

Performance Evaluation
Performance Metric: Initially, we evaluated the success of our algorithms based on correctly classified instances (%), but soon realized that even the null hypothesis (predicting every car to be a good purchase) could achieve 87.7%, because good cars make up 87.7% of the data set. We then switched our metrics to AUC, a generally accepted metric for classification performance, and F1, which accounts for the tradeoff between precision and recall. The FN and FP counts may be more important metrics in application because they have a direct impact on profit and loss for a car dealership, as illustrated by the profit relations at the end of this transcript.
Final Result: Based on the metrics of AUC and F1, LogitBoost did the best for both the balanced and unbalanced data sets.
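To make the metric switch concrete, here is a minimal Python sketch (not the authors' MATLAB or Weka workflow) of why accuracy is misleading on this class split. It assumes a hypothetical toy set of 1000 cars with the 87.7% / 12.3% split stated on the poster, treats "kicked" as the positive class (label 1), and scores the always-predict-good baseline. All function and variable names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN with 1 = kicked (bad buy) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of correctly classified instances."""
    return (tp + tn) / (tp + fp + tn + fn)


def f1_score(tp, fp, tn, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Probability that a random kicked car outscores a random good car.

    Ties count as one half. Quadratic in the data size, fine for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data mirroring the stated split: 877 good, 123 kicked.
y_true = [0] * 877 + [1] * 123
baseline_pred = [0] * 1000        # "null hypothesis": every car is good
baseline_scores = [0.0] * 1000    # the same constant score for every car

counts = confusion_counts(y_true, baseline_pred)
baseline_accuracy = accuracy(*counts)
baseline_f1 = f1_score(*counts)
baseline_auc = auc(y_true, baseline_scores)
```

On this toy set the baseline reaches 87.7% accuracy while its F1 for the kicked class is 0 and its AUC is 0.5, which is the gap the poster's choice of AUC and F1 is meant to expose.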
Discussion
Future Work:
1) Evaluate models on separated data
2) Run RUSBoost, which improves classification performance when training data is skewed
3) Purchase server farms on which to run Weka

References
[1] R.-E. Fan, et al. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008). Software available at [URL not preserved in this transcript].
[2] Mark Hall, et al. The WEKA Data Mining Software: An Update. SIGKDD Explorations, Volume 11, Issue 1 (2009).
[3] Jerome Friedman, et al. “Additive logistic regression: a statistical view of boosting”. The Annals of Statistics 28.2 (2000).

Results table (the numeric values are not preserved in this transcript)
Columns: Algorithms; Performance on Unbalanced Training Set (Correctly Classified Instances (%), AUC, F1 Score); Performance on Balanced Training Set (Correctly Classified Instances (%), AUC, F1 Score)
Rows: naïve Bayes; libLinear; logistic; logitBoost (a); logitBoost (b); logitBoost (c); adaBoostM1 (a); ensemble (e); ensemble (d, e)
Footnotes: a. Decision Stump; b. Decision Stump, 100 Iterations; c. Decision Table; d. J48 Decision Tree; e. Maximize for ROC

Data Characteristics (list):
1) Contained 32 features (the sample count is not preserved in this transcript)
2) Contained binary, nominal, and numeric data
3) Good cars were heavily overrepresented, constituting 87.7% of our entire data set
4) Data was highly inseparable/overlapping

Preprocessing (list):
1) Converting nominal data to numeric and filling in missing data fields
2) Normalizing numeric data from 0 to 1
3) Balancing the data

Acknowledgement
We would like to thank Professor Andrew Ng and the TAs for all their help on this project, and Kaggle and CARVANA for providing the data set.

Profit relations:
Total Profit = TN*Gross Profit + FN*Loss
Opportunity Cost = FP*Gross Profit
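Preprocessing steps 2 and 3 (normalizing numeric data to the range 0 to 1 and balancing the data) can be sketched in Python as follows. The poster does not say how the data was balanced, so random undersampling of the majority class is an assumption made here for illustration, as are the function names and the toy odometer values.

```python
import random


def min_max_normalize(values):
    """Rescale a numeric column linearly so min -> 0.0 and max -> 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def undersample(rows, labels, seed=0):
    """Balance classes by randomly dropping rows from the larger classes.

    ASSUMPTION: undersampling is one possible balancing method; the
    poster only states that the data was balanced.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    n = min(len(group) for group in by_class.values())
    balanced = []
    for label, group in by_class.items():
        for row in rng.sample(group, n):
            balanced.append((row, label))
    rng.shuffle(balanced)
    return balanced


# Hypothetical numeric feature (odometer readings).
odometer = [40000, 60000, 100000]
scaled = min_max_normalize(odometer)

# Hypothetical toy set: 8 good cars (label 0) and 2 kicked cars (label 1).
rows = list(range(10))
labels = [0] * 8 + [1] * 2
balanced = undersample(rows, labels)
```

Undersampling discards most of the good-car rows, which is one reason a method such as RUSBoost, named under Future Work, combines it with boosting rather than applying it once.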
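The two profit relations can be evaluated directly from a confusion matrix. The Python sketch below assumes that "kicked" is the positive class, so TN counts good cars that were bought, FN counts kicked cars that were bought by mistake, and FP counts good cars that were passed over. It also assumes Loss is a signed (negative) amount, since the poster's formula adds it. All dollar figures and counts are hypothetical.

```python
def total_profit(tn, fn, gross_profit, loss):
    """Total Profit = TN*Gross Profit + FN*Loss, as stated on the poster.

    ASSUMPTION: `loss` is negative, so each kicked car bought by
    mistake reduces the total.
    """
    return tn * gross_profit + fn * loss


def opportunity_cost(fp, gross_profit):
    """Opportunity Cost = FP*Gross Profit: profit missed on good cars
    that the classifier flagged as kicked."""
    return fp * gross_profit


# Hypothetical figures for illustration only.
gross_profit = 1000   # profit per good car purchased
loss = -3000          # signed loss per kicked car purchased
tn, fn, fp = 800, 20, 77

profit = total_profit(tn, fn, gross_profit, loss)
missed = opportunity_cost(fp, gross_profit)
```

With these made-up numbers, 20 false negatives cost 60,000 against 800,000 earned on true negatives, while 77 false positives forgo 77,000, which shows how the two error types trade off in dealership terms.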