Restaurant Revenue Prediction using Machine Learning Algorithms


Restaurant Revenue Prediction using Machine Learning Algorithms Rajani Suryavanshi Toshit Patil Gaurav Wani Under the guidance of: - Prof. Meiliu Lu

Problem Statement New restaurant sites require large investments of time and capital to get up and running. When the wrong location is chosen for a restaurant brand, the site typically closes within 18 months and operating losses are incurred. TFI organized a competition on Kaggle to predict restaurant revenue: there are 100,000 regional locations whose revenue must be predicted from the data fields provided in the dataset. The main goal of this project is to predict the revenue of the restaurants in the given test dataset from data on already established restaurants, using machine learning algorithms.

Literature Review Read discussion forums that described the problem statement in more detail. Found multiple approaches that could be used to solve the challenge. This helped us realize in advance that our dataset would need a lot of pre-processing. Discovered an over-fitted solution submitted with an RMSE of 0.

Dataset The dataset consists of the following fields:
Id: restaurant id.
Open Date: opening date of the restaurant.
City: city the restaurant is in.
City Group: type of city (Big Cities or Other).
Type: type of restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile.
P1 - P37: the obfuscated p-variables, which measure demographic data (population, age, gender distribution, development scale), real estate data (m² of the location, front façade, car parking, etc.) and commercial data (schools, banks, etc.).
Revenue: a (transformed) revenue of the restaurant in a given year; this is the target of the predictive analysis.

Insights of the Training Data There are 137 rows and 43 features in the training data. The training data shows variation in revenue depending upon City, Open Date and Type. There are 38 unique cities, and Unicode characters appear in the city names. The revenue column is right-skewed; its distribution after taking the logarithm is shown below:
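The log transform of the skewed revenue column can be sketched as follows (the revenue values here are hypothetical, not the competition data):

```python
import math

# Revenue has a long right tail: one very large value dominates the scale.
# Taking the logarithm compresses large values, making the distribution
# more symmetric and easier to model.
revenue = [1.2e6, 1.5e6, 1.8e6, 2.1e6, 9.5e6]   # hypothetical revenues
log_revenue = [math.log(r) for r in revenue]

# The transform preserves the ordering of the values but shrinks the
# relative gap between the extreme value and the rest.
```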

Variation in P-Variables The p-variables encode real estate data, commercial data and population-related information. The chart below shows the obfuscated values of the p-variables.

Representation of City Vs Revenue http://athena.ecs.csus.edu/~patiltr/geomap.html

Insights of the Testing Data There are 100,000 rows and 42 features in the test data. There are 54 unique cities, and Unicode characters appear in the city names. More than 20 cities appear only in the test data, along with one additional restaurant type, MB (Mobile).

Score Measure: RMSE The score measure used in the competition is the root mean squared error of the test-set revenue: RMSE = sqrt((1/n) * sum over i of (y_hat_i - y_i)^2), where y_hat_i is the predicted revenue of the i-th restaurant and y_i is its actual revenue.
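The scoring metric is easy to compute directly; a minimal sketch (function name and toy values are illustrative):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Toy check with hypothetical revenues: errors are 2 and 0,
# so RMSE = sqrt((4 + 0) / 2) = sqrt(2) ≈ 1.4142.
print(rmse([3.0, 5.0], [1.0, 5.0]))
```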

PRE-PROCESSING Since the training dataset is very small and we have to make the best of what is available, a lot of data preprocessing was required. This was the most difficult step of our project.

Formatted Open Date into three columns: month, day and year. Using the Boruta package, we found that year plays a crucial role in predicting the revenue.

Open Date (mm/dd/yyyy)   Month   Day   Year
07/17/1999               07      17    1999
03/09/2013               03      09    2013
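The Open Date split above can be sketched in pure Python (the slides did this in R; column names here are illustrative):

```python
from datetime import datetime

# Parse mm/dd/yyyy strings and split them into month, day and year
# columns, as in the preprocessing step above.
open_dates = ["07/17/1999", "03/09/2013"]
parsed = [datetime.strptime(d, "%m/%d/%Y") for d in open_dates]
months = [d.month for d in parsed]
days = [d.day for d in parsed]
years = [d.year for d in parsed]
```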

To work around the additional type in the test data, we replaced MB with FC and DT with IL.
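This remapping is a simple dictionary lookup; a minimal sketch (the mapping follows the slide, the sample values are hypothetical):

```python
# MB never appears in the training data, so it is collapsed onto a type
# the model has seen (FC); DT is likewise mapped to IL.
type_map = {"MB": "FC", "DT": "IL"}
test_types = ["FC", "MB", "DT", "IL"]
remapped = [type_map.get(t, t) for t in test_types]
```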

Removed outliers in the revenue column.
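The slides do not say which outlier rule was used; the IQR filter below is one common, illustrative choice (all values hypothetical):

```python
import statistics

# Drop revenue values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# a standard rule of thumb for flagging extreme observations.
revenue = [1.0, 1.1, 1.2, 1.3, 9.0]   # hypothetical, arbitrary units
q1, _, q3 = statistics.quantiles(revenue, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [r for r in revenue if lo <= r <= hi]
```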

Converted all selected train and test columns to numeric: the predictors' inputs must be numeric, so all categorical columns were converted. Used the Boruta package for feature selection. Other pre-processing approaches were tried and found inefficient:
MICE imputation for removing zeroes
Reducing the values of the obfuscated p-variables
Principal component analysis of the p-variables
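The categorical-to-numeric conversion amounts to label encoding; a minimal pure-Python sketch (the project did this in R, and the city names here are just examples):

```python
# Map each distinct category to an integer code, so the resulting
# column is numeric and can be fed to the regression models.
cities = ["İstanbul", "Ankara", "İstanbul", "İzmir"]
levels = {c: i for i, c in enumerate(sorted(set(cities)))}
encoded = [levels[c] for c in cities]
```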

SVM Regression Support Vector Machines can be applied not only to classification problems but also to regression. Model in R using SVM (svm() from the e1071 package):

fit <- svm(x = as.matrix(train_cols), y = labels, cost = 10, scale = TRUE, type = "eps-regression")
predict_svm <- as.data.frame(predict(fit, newdata = testdata))

RMSE: Public Score: 1947658.58962, Private Score: 2259411.40638

Random Forest Algorithm Random Forest is one of the most useful techniques for building an ensemble of decision trees. Since our dataset contains many attributes to consider for the prediction, we built a model with it. Model in R using a conditional-inference forest (cforest() from the party package):

rf <- cforest(train$revenue ~ ., data = train_cols, controls = cforest_unbiased(ntree = 1000))
Prediction <- predict(rf, testdata, OOB = TRUE, type = "response")

RMSE: Public Score: 1847620.47650, Private Score: 2159509.40783

Gradient Boosting Machine One of the most efficient algorithms: gradient boosting = gradient descent + boosting. It is similar to AdaBoost in that more weak learners are introduced to compensate for the shortcomings of the existing ones, but gradient boosting was introduced to handle a variety of loss functions. Consecutive trees are fit to the net loss (residuals) of the prior trees, and the result of each new tree is partially added to the overall solution. RMSE: Public Score: 1735157.84896, Private Score: 1809970.61757
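The residual-fitting loop described above can be shown with a toy pure-Python boosting model using depth-1 stumps (this is not the R gbm package used in the project; all data and names are illustrative):

```python
def fit_stump(x, r):
    """Best single-split stump minimizing squared error on residuals r."""
    best = None
    for s in sorted(set(x)):
        left = [r[i] for i in range(len(x)) if x[i] <= s]
        right = [r[i] for i in range(len(x)) if x[i] > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r[i] - (lm if x[i] <= s else rm)) ** 2 for i in range(len(x)))
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda v: lm if v <= s else rm

def gbm(x, y, n_trees=50, lr=0.1):
    base = sum(y) / len(y)                 # start from the mean prediction
    stumps = []                            # the stored ensemble of weak learners
    pred = [base] * len(y)
    for _ in range(n_trees):
        # Residuals are the negative gradient of squared loss: each new
        # stump is fit to what the current ensemble still gets wrong.
        resid = [y[i] - pred[i] for i in range(len(y))]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        # Only a fraction (lr) of each stump's output is added in.
        pred = [pred[i] + lr * stump(x[i]) for i in range(len(y))]
    return base, stumps, pred

x = [1, 2, 3, 4, 5, 6]                     # toy feature (e.g. a p-variable)
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]         # toy target (e.g. scaled revenue)
base, stumps, fitted = gbm(x, y)
```

After 50 rounds the ensemble's training error is far below that of the constant mean predictor, which is the "shortcoming compensation" the bullets describe.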

Conclusion Random Forest works best on our preprocessed data, and Gradient Boosting Machine works much better than SVM. Preprocessing of the data is very important: it improved our RMSE scores drastically.

Algorithm        Public Score     Private Score
GBM              1892936.84870    1866654.72370
Random Forest    1765716.60593    1812835.346
SVM              1801175.32591    2104348.55376

References
“Dataset: Restaurant Revenue Prediction”, https://www.kaggle.com/c/restaurant-revenue-prediction
Blog: http://rohanrao91.blogspot.com/2015/05/tfi-restaurant-revenue-prediction.html
Bikash Agrawal, “Kaggle Walkthrough: Restaurant Sales Prediction”, Boost AI, University of Stavanger.
Nataasha Raul, et al., “Restaurant Revenue Prediction using Machine Learning”, International Journal of Engineering Science, Vol. 6, Issue 4.
Sauptik Dhar, Vladimir Cherkassky, “Visualization and Interpretation of SVM Classifiers”, Wiley Interdisciplinary Reviews, Vol. 2.
V2 Maestros, “Applied Data Science with R” [Lecture Notes].