Creative Activity and Research Day (CARD)
Forecasting Smart Meter Energy Usage using Distributed Systems and Machine Learning
Chris Dong, Lingzhi Du, Feiran Ji, Amber Song, Vanessa Zheng (USF Master of Data Science)
Feiran: We are team Sparkling Smartwater, and we are going to talk about our project on forecasting smart meter energy usage with distributed systems and machine learning. I am Feiran.
Why?
Data | MongoDB | Spark | ML Outcome | Conclusion
We chose smart meter data because:
- Smart meters are planned for every home in London
- High electricity prices make optimizing energy usage valuable
- Smart meters help consumers understand their own energy consumption
Feiran: The government plans to install smart meters in every home in London. With electricity prices high, this data can help optimize energy usage and help consumers understand their own consumption.
Outline
1. Data Pipeline
2. MongoDB
3. Spark
4. ML Model & Outcome
5. Conclusion
Feiran
1. Data Pipeline
Choose data, save to S3, import to EC2
Chris
Chris: Here is the data pipeline. The data flows from Kaggle to our local machines, then to S3, to EC2, to MongoDB, and finally into Spark.
Amazon S3: Wide Data
Chris: On the left we see the half-hourly data in long format. Ultimately we decided to use the wide format to make preprocessing and feature engineering easier. Each block represents a neighborhood.
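The long-to-wide reshape can be sketched with a pandas pivot; the column names and values below are illustrative stand-ins, not the exact schema of the Kaggle files:

```python
import pandas as pd

# Toy long-format data: one row per household, day, and half-hour slot.
# (The full dataset has 48 half-hour slots per day; 4 shown here.)
long_df = pd.DataFrame({
    "household": ["MAC000002"] * 4,
    "day": ["2012-02-01"] * 4,
    "half_hour": [0, 1, 2, 3],
    "energy_kwh": [0.10, 0.12, 0.09, 0.11],
})

# Pivot to wide format: one row per household-day, one column per half-hour.
wide_df = long_df.pivot_table(
    index=["household", "day"], columns="half_hour", values="energy_kwh"
)
wide_df.columns = [f"hh_{c}" for c in wide_df.columns]
print(wide_df.reset_index())
```

With all 48 slots, each household-day becomes a single row with 48 usage columns, which is what makes the later feature engineering straightforward.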
Importing data from S3 to EC2

aws s3 cp s3://smart-meters/halfhourly/ . --recursive --acl public-read
aws s3 cp s3://smart-meters/hhblock_dataset/ . --recursive --acl public-read
aws s3 cp s3://smart-meters/other/ . --recursive --acl public-read

Chris: Here we see the data loaded onto EC2, spanning block 0 to block 111.
2. MongoDB
Importing from S3; database size; MongoDB query
Chris
Importing data into MongoDB from S3
Add a new column indicating the filename

for i in *.csv; do mongoimport -d smart -c energy --type csv --file "$i" --headerline; done
for i in *.csv; do mongoimport -d wide -c energy --type csv --file "$i" --headerline; done
for i in *.csv; do mongoimport -d other -c "${i%.csv}" --type csv --file "$i" --headerline; done

Chris: Here we import the data into separate MongoDB databases and run a simple query. I also added a new column indicating the source filename.
3. Spark
Create RDD; Spark DataFrame; instance specs
Creating RDD from MongoDB on EC2
Amber
Querying Spark DataFrame data on EC2
Amber
AWS Instance Specs
- EC2, 1 instance (t2.large): 2 cores, 8 GB RAM, 300 GB storage, $0.09/hour
- Standalone, 5 instances, 4 workers (c3.2xlarge): 8 cores, 15 GB RAM, 160 GB storage, $2.10/hour
- YARN, 5 instances, 4 workers (c3.2xlarge): 16 cores, 15 GB RAM, 160 GB storage, $2.10/hour
- YARN, 1 instance (c3.8xlarge): 32 cores, 60 GB RAM, 640 GB storage, $1.68/hour
Amber
4. Machine Learning
Data overview; analytical goals; feature engineering; model; specs
Data Overview
Analytical Goals
Goal: predict bi-hourly energy usage one day ahead (12 data points)
Model approach: Spark ML RandomForestRegressor()
Challenge: solving a time series problem with standard ML
Key: feature engineering
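The "12 data points" come from collapsing a day's 48 half-hour readings into twelve 2-hour windows. A minimal sketch, assuming each bi-hourly value is the sum of its four half-hourly readings (the values here are made up):

```python
import numpy as np

# Stand-in for one day of 48 half-hourly readings.
half_hourly = np.arange(48, dtype=float)

# Collapse into 12 bi-hourly targets: 4 half-hours per 2-hour window.
bi_hourly = half_hourly.reshape(12, 4).sum(axis=1)
print(len(bi_hourly))  # 12 targets, one per 2-hour window
```

Each of the 12 targets then gets its own regression model, as described in the feature-engineering slide.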
Feature Engineering
* day1 = one day before the label day
* day2 = two days before the label day
* e.g., if the label day is 2012-02-01, day1 = 2012-01-31 and day2 = 2012-01-30
Lingzhi: Let me talk about how we did feature engineering and built our training dataset. Basically, we pick one timestamp as our response and use the data before that timestamp to build our features. The features include the daily average energy usage for several days before the timestamp, which captures the trend of the time series. We also include weather data for those days, because temperature is highly correlated with energy usage. Then we add hourly average energy usage: for example, the average usage at 12 am over day 1 through day 7 is one feature. There are 48 half-hours, so this gives 48 features; they act as a seasonal component capturing the daily seasonality of the series. We also include neighborhood information as a categorical feature, fed into the random forest with label encoding. For the training response, since we predict every 2 hours, there are 12 values on the label day. So we build 12 different random forest models that share the same features but are trained separately.
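The lag features described above can be sketched in pandas. The data layout is hypothetical (one row per day, columns hh_0..hh_47 for the 48 half-hour slots, random values as stand-ins):

```python
import numpy as np
import pandas as pd

# Hypothetical wide layout: one row per day, 48 half-hour columns.
rng = np.random.default_rng(0)
days = pd.date_range("2012-01-24", periods=8)  # 7 history days + 1 label day
usage = pd.DataFrame(rng.random((8, 48)),
                     index=days,
                     columns=[f"hh_{i}" for i in range(48)])

label_day = days[-1]
history = usage.loc[:label_day].iloc[:-1]  # the 7 days before the label day

# Trend features: daily average usage for day1 (yesterday) and day2.
day1_avg = history.iloc[-1].mean()
day2_avg = history.iloc[-2].mean()

# Seasonal features: per-half-hour averages over the past 7 days (48 features).
hh_avgs = history.mean(axis=0)
print(len(hh_avgs))  # 48
```

In the actual pipeline these features would be joined with the weather data and the label-encoded neighborhood before training.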
Training Validation
As you may have noticed, we are picking only one timestamp so far, which could cause the model to overfit to that one day. To make it generalize to the future, we repeat the feature construction over many different label days rather than a single one.
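Sampling many label days, each paired with its own window of history, can be sketched as follows (a hypothetical illustration; dates and window size are for demonstration, though the 7-day history matches the feature-engineering slide):

```python
from datetime import date, timedelta

# Instead of one fixed label day, sample several label days, each paired
# with the 7 days of history before it, so the model cannot overfit to
# a single date.
start, n_samples = date(2012, 2, 1), 5
samples = []
for k in range(n_samples):
    label_day = start + timedelta(days=k)
    history = [label_day - timedelta(days=d) for d in range(1, 8)]
    samples.append((label_day, history))

print(len(samples))   # 5 (label day, history) pairs
print(samples[0][0])  # 2012-02-01
```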
Model Performance
Model approach: Spark ML RandomForestRegressor()
Evaluation metric: RMSE
Energy Usage Predictions
Time per Model

Attempt | Data preprocessing | Model training | Number of trees in RF
1 | Spark SQL | Spark ML | 10
2 | Pandas | Spark ML | 10
3 | Pandas | Spark ML | 300

Attempt | Single EC2 instance | Standalone, 5 instances | YARN, 5 instances | YARN, 3 instances | YARN, 1 instance
1 | forever | 3602 s | 3478 s | 3186 s | —
2 | — | 34.8 s | 36.7 s | 35.3 s | 43.6 s
3 | error | — | 500.0 s | 546.0 s | 937.4 s

Amber
Cost per Model
- EC2, 1 instance: not enough memory
- YARN, 5 instances, 4 workers: 8.3 min/model, $0.29/model
- YARN, 3 instances, 2 workers: 9.1 min/model, $0.38/model
- YARN, 1 instance: 15.7 min/model, $0.44/model
Amber
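The per-model costs follow from the per-model times and the hourly rates on the instance-spec slide (the 5-instance YARN cluster at $2.10/hour, the single c3.8xlarge at $1.68/hour):

```python
# Cost per model = (minutes per model / 60) * cluster hourly rate.
def cost_per_model(minutes, hourly_rate):
    return minutes / 60 * hourly_rate

print(round(cost_per_model(8.3, 2.10), 2))   # ≈ 0.29 for the 5-instance YARN cluster
print(round(cost_per_model(15.7, 1.68), 2))  # ≈ 0.44 for the single c3.8xlarge
```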
5. Conclusion
Research Result
Distributed systems offer significant computational advantages when applying machine learning algorithms to large-scale data. However, when the amount of data falls below a certain threshold, the coordination overhead of a distributed system can outweigh its benefits.
Amber
Thanks!
GitHub link: https://github.com/LenzDu/Smart-Meter