Restaurant Revenue Prediction using Machine Learning Algorithms


Restaurant Revenue Prediction using Machine Learning Algorithms Rajani Suryavanshi Toshit Patil Gaurav Wani Under the guidance of: - Prof. Meiliu Lu

Problem Statement New restaurant sites require large investments of time and capital to get up and running. When the wrong location is chosen for a restaurant brand, the site typically closes within 18 months and operating losses are incurred. TFI organized a competition on Kaggle to predict restaurant revenue: there are 100,000 regional locations whose revenue must be predicted from the data fields provided in the dataset. The main goal of this project is to predict the revenue of the restaurants in the given test dataset from data on already established restaurants, using machine learning algorithms.

Literature Review Read discussion forums that described the problem statement in more detail. Found multiple approaches that could be used to solve the challenge. This helped us realize in advance that our dataset would need a lot of pre-processing. Discovered an over-fitted solution submitted with an RMSE of 0.

Dataset The dataset consists of the following fields:
Id: restaurant id.
Open Date: opening date of the restaurant.
City: city the restaurant is in.
City Group: type of city (Big Cities or Other).
Type: type of restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile.
P1 - P37: the obfuscated p-variables, which measure demographic data (population, age, gender distribution, development scale), real estate data (m² of the location, front façade, car parking, etc.) and commercial data (schools, banks, etc.).
Revenue: a (transformed) revenue of the restaurant in a given year; this is the target of the predictive analysis.

Insights of the Training Data There are 137 rows and 43 features in the training data. The training data shows variation in revenue depending upon City, Open Date and Type. There are 38 unique cities, and Unicode characters appear in the city names. The revenue column is right-skewed; its distribution after taking the logarithm is shown below:
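The log transform of the skewed revenue column can be sketched as follows (the revenue values here are hypothetical, not the competition data):

```python
import math

# Revenue has a long right tail: one very large value dominates the scale.
# Taking the logarithm compresses large values, making the distribution
# more symmetric and easier to model.
revenue = [1.2e6, 1.5e6, 1.8e6, 2.1e6, 9.5e6]   # hypothetical revenues
log_revenue = [math.log(r) for r in revenue]

# The transform preserves the ordering of the values but shrinks the
# relative gap between the extreme value and the rest.
```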

Variation in P-Variables The p-variables encode real estate data, commercial data and population-related information. The chart below shows the obfuscated values of the p-variables.

Representation of City Vs Revenue http://athena.ecs.csus.edu/~patiltr/geomap.html

Insights of the Testing Data There are 100,000 rows and 42 features in the test data. There are 54 unique cities, and Unicode characters appear in the city names. More than 20 cities appear only in the test data, along with one additional restaurant type, MB (Mobile).

Score Measure: RMSE The score measure used in the competition is the root mean squared error of the test-set revenue: RMSE = sqrt((1/n) * sum over i of (y_hat_i - y_i)^2), where y_hat_i is the predicted revenue of the i-th restaurant and y_i is its actual revenue.
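The scoring metric is easy to compute directly; a minimal sketch (function name and toy values are illustrative):

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Toy check with hypothetical revenues: errors are 2 and 0,
# so RMSE = sqrt((4 + 0) / 2) = sqrt(2) ≈ 1.4142.
print(rmse([3.0, 5.0], [1.0, 5.0]))
```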

PRE-PROCESSING Since the training dataset is very small and we have to make the best of what is available, a lot of data preprocessing was required. This was the most difficult step of our project.

Formatted Open Date into three columns: month, day and year. Using the Boruta package, we found that year plays a crucial role in predicting the revenue.

Open Date (mm/dd/yyyy)   Month   Day   Year
07/17/1999               07      17    1999
03/09/2013               03      09    2013
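The Open Date split above can be sketched in pure Python (the slides did this in R; column names here are illustrative):

```python
from datetime import datetime

# Parse mm/dd/yyyy strings and split them into month, day and year
# columns, as in the preprocessing step above.
open_dates = ["07/17/1999", "03/09/2013"]
parsed = [datetime.strptime(d, "%m/%d/%Y") for d in open_dates]
months = [d.month for d in parsed]
days = [d.day for d in parsed]
years = [d.year for d in parsed]
```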

To work around the additional type in the test data, we replaced MB with FC and DT with IL.
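This remapping is a simple dictionary lookup; a minimal sketch (the mapping follows the slide, the sample values are hypothetical):

```python
# MB never appears in the training data, so it is collapsed onto a type
# the model has seen (FC); DT is likewise mapped to IL.
type_map = {"MB": "FC", "DT": "IL"}
test_types = ["FC", "MB", "DT", "IL"]
remapped = [type_map.get(t, t) for t in test_types]
```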

Removed outliers in the revenue column.
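The slides do not say which outlier rule was used; the IQR filter below is one common, illustrative choice (all values hypothetical):

```python
import statistics

# Drop revenue values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR],
# a standard rule of thumb for flagging extreme observations.
revenue = [1.0, 1.1, 1.2, 1.3, 9.0]   # hypothetical, arbitrary units
q1, _, q3 = statistics.quantiles(revenue, n=4, method="inclusive")
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
cleaned = [r for r in revenue if lo <= r <= hi]
```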

Converted all selected train and test columns to numeric: the predictors' inputs must be numeric, so all categorical columns were converted. Used the Boruta package for feature selection. Other pre-processing approaches were tried and found inefficient:
MICE imputation for removing zeroes
Reducing the values of the obfuscated p-variables
Principal component analysis of the p-variables
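The categorical-to-numeric conversion amounts to label encoding; a minimal pure-Python sketch (the project did this in R, and the city names here are just examples):

```python
# Map each distinct category to an integer code, so the resulting
# column is numeric and can be fed to the regression models.
cities = ["İstanbul", "Ankara", "İstanbul", "İzmir"]
levels = {c: i for i, c in enumerate(sorted(set(cities)))}
encoded = [levels[c] for c in cities]
```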

SVM Regression Support Vector Machines can be applied not only to classification problems but also to regression. Model in R using SVM (svm() from the e1071 package):

fit <- svm(x = as.matrix(train_cols), y = labels, cost = 10, scale = TRUE, type = "eps-regression")
predict_svm <- as.data.frame(predict(fit, newdata = testdata))

RMSE: Public Score: 1947658.58962, Private Score: 2259411.40638

Random Forest Algorithm Random Forest is one of the most useful techniques for building an ensemble of decision trees. Since our dataset contains many attributes to consider for the prediction, we built a model with it. Model in R using a conditional-inference forest (cforest() from the party package):

rf <- cforest(train$revenue ~ ., data = train_cols, controls = cforest_unbiased(ntree = 1000))
Prediction <- predict(rf, testdata, OOB = TRUE, type = "response")

RMSE: Public Score: 1847620.47650, Private Score: 2159509.40783

Gradient Boosting Machine One of the most efficient algorithms: gradient boosting = gradient descent + boosting. It is similar to AdaBoost in that more weak learners are introduced to compensate for the shortcomings of the existing ones, but gradient boosting was introduced to handle a variety of loss functions. Consecutive trees are fit to the net loss (residuals) of the prior trees, and the result of each new tree is partially added to the overall solution. RMSE: Public Score: 1735157.84896, Private Score: 1809970.61757
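The residual-fitting loop described above can be shown with a toy pure-Python boosting model using depth-1 stumps (this is not the R gbm package used in the project; all data and names are illustrative):

```python
def fit_stump(x, r):
    """Best single-split stump minimizing squared error on residuals r."""
    best = None
    for s in sorted(set(x)):
        left = [r[i] for i in range(len(x)) if x[i] <= s]
        right = [r[i] for i in range(len(x)) if x[i] > s]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r[i] - (lm if x[i] <= s else rm)) ** 2 for i in range(len(x)))
        if best is None or err < best[0]:
            best = (err, s, lm, rm)
    _, s, lm, rm = best
    return lambda v: lm if v <= s else rm

def gbm(x, y, n_trees=50, lr=0.1):
    base = sum(y) / len(y)                 # start from the mean prediction
    stumps = []                            # the stored ensemble of weak learners
    pred = [base] * len(y)
    for _ in range(n_trees):
        # Residuals are the negative gradient of squared loss: each new
        # stump is fit to what the current ensemble still gets wrong.
        resid = [y[i] - pred[i] for i in range(len(y))]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        # Only a fraction (lr) of each stump's output is added in.
        pred = [pred[i] + lr * stump(x[i]) for i in range(len(y))]
    return base, stumps, pred

x = [1, 2, 3, 4, 5, 6]                     # toy feature (e.g. a p-variable)
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]         # toy target (e.g. scaled revenue)
base, stumps, fitted = gbm(x, y)
```

After 50 rounds the ensemble's training error is far below that of the constant mean predictor, which is the "shortcoming compensation" the bullets describe.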

Conclusion Random Forest works best on our preprocessed data, and Gradient Boosting Machine works much better than SVM. Preprocessing of the data is very important: it improved our RMSE scores drastically.

Algorithm        Public Score     Private Score
GBM              1892936.84870    1866654.72370
Random Forest    1765716.60593    1812835.346
SVM              1801175.32591    2104348.55376

References
“Dataset: Restaurant Revenue Prediction”, https://www.kaggle.com/c/restaurant-revenue-prediction
Blog: http://rohanrao91.blogspot.com/2015/05/tfi-restaurant-revenue-prediction.html
Bikash Agrawal, “Kaggle Walkthrough: Restaurant Sales Prediction”, Boost AI, University of Stavanger.
Nataasha Raul, et al., “Restaurant Revenue Prediction using Machine Learning”, International Journal of Engineering Science, Vol. 6, Issue 4.
Sauptik Dhar, Vladimir Cherkassky, “Visualization and Interpretation of SVM Classifiers”, Wiley Interdisciplinary Reviews, Vol. 2.
V2 Maestros, “Applied Data Science with R” [Lecture Notes].