1
Creative Activity and Research Day (CARD)
Forecasting Smart Meter Energy Usage using Distributed Systems and Machine Learning
Chris Dong, Lingzhi Du, Feiran Ji, Amber Song, Vanessa Zheng (USF Master of Data Science)
Feiran: We are team sparkling smartwater, and we are going to talk about our project. I am Feiran. (...)
2
Why? Data | MongoDB | Spark | ML Outcome | Conclusion
We chose smart meter data because:
* The government plans to install smart meters in every home in London
* Optimize energy usage
* High electricity prices
* Helps consumers understand their own energy consumption
Speaker: Feiran
3
Outline
1. Data Pipeline
2. MongoDB
3. Spark
4. ML Model & Outcome
5. Conclusion
Speaker: Feiran
4
Data Pipeline
1. Data Pipeline: choose data, save to S3, import to EC2
Speaker: Chris
5
Chris: So here is the data pipeline. The data goes from Kaggle to our local machines, then to S3, to EC2, to MongoDB, and finally to Spark.
6
Amazon S3
[Figure: half-hourly data in long format (left); panel label: Wide]
Chris: On the left we see the half-hourly data in long format. Ultimately we decided to use the wide data to make preprocessing and feature engineering easier. Each block represents a neighborhood.
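The long-to-wide reshaping mentioned in the notes can be sketched in plain Python. This is only an illustration under stated assumptions, not the team's preprocessing code: the record layout (meter id, day, half-hour slot, energy), the function name `long_to_wide`, and the toy values are all hypothetical.

```python
from collections import defaultdict


def long_to_wide(rows, slots_per_day=48):
    """Pivot long records into one row per (meter, day).

    Each input record is a hypothetical tuple
    (meter_id, day, slot, energy), where slot is the half-hour
    index 0..47. Half-hours with no record stay as None.
    """
    wide = defaultdict(lambda: [None] * slots_per_day)
    for meter_id, day, slot, energy in rows:
        wide[(meter_id, day)][slot] = energy
    return dict(wide)


# Toy data with invented values, only to show the change of shape.
long_rows = [
    ("meter_a", "day_1", 0, 0.10),
    ("meter_a", "day_1", 1, 0.12),
    ("meter_a", "day_2", 0, 0.09),
]
wide_rows = long_to_wide(long_rows)
```

In the wide layout each meter-day is a single row of 48 values, which is convenient when features are averages over whole days or over the same half-hour slot across days.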
7
Importing data from S3 to EC2
    aws s3 cp s3://smart-meters/halfhourly/ . --recursive --acl public-read
    aws s3 cp s3://smart-meters/hhblock_dataset/ . --recursive --acl public-read
    aws s3 cp s3://smart-meters/other/ . --recursive --acl public-read
Chris: Here we see the data loaded into EC2, which goes from block 0 to block 111.
8
MongoDB
2. MongoDB: importing from S3; database size; MongoDB query
Speaker: Chris
9
Importing data into MongoDB from S3
Add a new column indicating the filename
    # Import every CSV in the current directory into database "smart", collection "energy"
    for i in *.csv; do mongoimport -d smart -c energy --type csv --file "$i" --headerline; done
    # Same, into database "wide", collection "energy"
    for i in *.csv; do mongoimport -d wide -c energy --type csv --file "$i" --headerline; done
    # Into database "other", one collection per file (named after the file)
    for i in *.csv; do mongoimport -d other -c "$i" --type csv --file "$i" --headerline; done
Chris: Here we import the data into separate MongoDB databases and run a simple query. I also add a new column indicating the filename.
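The slide mentions adding a filename column but does not show how it was done. One possible way, sketched here with Python's standard `csv` module, is to rewrite each file before import. The function name `add_filename_column`, the column name `source_file`, and the sample file `block_0.csv` are assumptions made for this example.

```python
import csv
import os
import tempfile


def add_filename_column(src_path, dst_path, column="source_file"):
    """Copy a CSV and append a column holding the source file's name."""
    name = os.path.basename(src_path)
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)              # first row is the header
        writer.writerow(header + [column])
        for row in reader:
            writer.writerow(row + [name])


# Small demonstration on a throwaway file (contents are invented).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "block_0.csv")
    dst = os.path.join(tmp, "block_0_tagged.csv")
    with open(src, "w", newline="") as f:
        f.write("id,energy\nm1,0.5\n")
    add_filename_column(src, dst)
    with open(dst, newline="") as f:
        tagged = list(csv.reader(f))
```

Tagging rows before import keeps the block of origin available after all files are loaded into the same collection.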
10
Spark
3. Spark: create RDD; Spark DataFrame; instance specs
11
Creating RDD from MongoDB on EC2
Speaker: Amber
12
Querying Spark DataFrame data on EC2
Speaker: Amber
13
AWS Instance Specs
* EC2, 1 instance: t2.large, 2 cores, 8 GB RAM, 300 GB storage, $0.09/hour
* Standalone, 5 instances (4 workers): C3.2xlarge, 8 cores, 15 GB RAM, 160 GB storage, $2.1/hour
* YARN, 5 instances (4 workers): C3.2xlarge, 16 cores, 15 GB RAM, 160 GB storage, $2.1/hour
* YARN, 1 instance: C3.8xlarge, 32 cores, 60 GB RAM, 640 GB storage, $1.68/hour
Speaker: Amber
14
Machine Learning
4. Machine Learning: data overview; analytic goals; feature engineering; model; specs
15
Data Overview
16
Analytical Goals
* Goal: predict bi-hourly (every two hours) energy usage for one day ahead (12 data points)
* Model approach: Spark ML RandomForestRegressor()
* Challenge: solving a time series problem with ML
* Key: feature engineering
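A day of half-hourly readings has 48 values, so 12 two-hour targets means 4 readings per target. The sketch below shows that grouping. Whether the team summed or averaged the readings is not stated; summing is an assumption here, and the function name `to_two_hourly` is hypothetical.

```python
def to_two_hourly(half_hourly):
    """Collapse 48 half-hourly readings into 12 two-hour totals.

    Summing is an assumption; averaging would only change the scale.
    """
    if len(half_hourly) != 48:
        raise ValueError("expected 48 half-hourly readings")
    # Each two-hour window covers 4 consecutive half-hour slots.
    return [sum(half_hourly[i:i + 4]) for i in range(0, 48, 4)]


day = [1.0] * 48          # invented flat profile
targets = to_two_hourly(day)
```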
17
Feature Engineering
* day1 = one day before label day
* day2 = two days before label day
* e.g., if label day is , day1 = , day2 =
Lingzhi: Let me talk about how we do feature engineering and build our training dataset. Basically, we pick one timestamp as our response and use the data before this timestamp to build our features.
The features include the daily average energy usage for some days before the timestamp, which captures the trend of the time series. We also include weather data for these days because temperature is highly correlated with energy usage.
Then we add the average energy usage by time of day. For example, we calculate the average energy usage at 12 am over day 1 to day 7 as one feature. There are 48 half-hours in a day, so there are 48 such features. This works as a seasonal component that captures the daily seasonality of the time series.
We also include neighborhood info as a categorical feature and feed it into the random forest with label encoding.
For the training response, since we predict a value every 2 hours, we have 12 values for the label date. So we build 12 different random forest models that share the same features but are trained separately.
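The feature construction described in the notes can be sketched in plain Python. This is a simplified illustration, not the team's Spark code: weather features are left out, the same 7-day window is assumed for both the daily averages and the per-slot averages, and the names `build_features` and `label_encode` are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted order)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def build_features(history, block, block_codes, n_days=7):
    """Build one feature vector for a label day.

    history[0] is day1 (the day before the label day), history[1] is
    day2, and so on. Each entry holds 48 half-hourly readings.
    """
    days = history[:n_days]
    if len(days) < n_days or any(len(d) != 48 for d in days):
        raise ValueError("need n_days days of 48 readings each")
    # Trend: one daily average per previous day.
    daily_avg = [sum(d) / 48 for d in days]
    # Daily seasonality: average of each half-hour slot across the days.
    slot_avg = [sum(d[s] for d in days) / n_days for s in range(48)]
    # Neighborhood (block) as a label-encoded categorical feature.
    return daily_avg + slot_avg + [block_codes[block]]


# Invented data: day1 is all 1.0, day2 all 2.0, ..., day7 all 7.0.
block_codes = label_encode(["block_1", "block_0"])
history = [[float(d)] * 48 for d in range(1, 8)]
features = build_features(history, "block_1", block_codes)
```

The same feature vector would be paired with each of the 12 two-hour targets, giving 12 separately trained models over shared features.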
18
Training Validation
As you may have noticed, we are picking only one timestamp now, and this may lead our model to overfit to that day only. To make it generalize to the future,
19
Model Performance
* Model approach: Spark ML RandomForestRegressor()
* Evaluation metric: RMSE
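For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it in plain Python on invented numbers; in Spark ML the same metric is available through `RegressionEvaluator` with `metricName="rmse"`, though the slides do not show how the team computed it.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Invented values: only the last prediction is off, by 2.0.
score = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because errors are squared before averaging, a few large misses raise RMSE more than many small ones.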
20
Energy Usage Predictions
21
Time per model
Attempt | Data preprocessing | Model training | Number of trees in RF
1 | Spark SQL | Spark ML | 10
2 | Pandas | |
3 | | | 300
Time by attempt (columns: single EC2 instance; Standalone, 5 instances; YARN, 5 instances; YARN, 3 instances; YARN, 1 instance):
* Attempt 1: forever, 3602 s, 3478 s, 3186 s
* Attempt 2: 34.8 s, 36.7 s, 35.3 s, 43.6 s
* Attempt 3: error, 500.0 s, 546.0 s, 937.4 s
Speaker: Amber
22
AWS Instance Specs
* EC2, 1 instance: not enough memory
* YARN, 5 instances (4 workers): 8.3 min/model, $0.29/model
* YARN, 3 instances (2 workers): 9.1 min/model, $0.38/model
* YARN, 1 instance: 15.7 min/model, $0.44/model
Speaker: Amber
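The cost-per-model figures appear to follow from training time multiplied by the hourly price. The sketch below checks this for the two configurations whose hourly rate is listed on the earlier instance-spec slide (2.1/hour for the 5-instance YARN cluster, 1.68/hour for the single C3.8xlarge). Treating those rates as US dollars for the whole cluster is an assumption, and the function name `cost_per_model` is hypothetical. The 3-instance configuration is left out because no hourly rate is given for it.

```python
def cost_per_model(minutes_per_model, cluster_rate_per_hour):
    """Cost of training one model: time in hours times hourly price."""
    return minutes_per_model / 60 * cluster_rate_per_hour


# Times from this slide; hourly rates from the instance-spec slide
# (assumed to be the total cluster price in USD per hour).
yarn_5_instances = cost_per_model(8.3, 2.1)
yarn_1_instance = cost_per_model(15.7, 1.68)
```

Rounded to cents, these reproduce the $0.29 and $0.44 per model shown above.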
23
Conclusion
5. Conclusion
24
Research Result
* There are significant computational advantages to using distributed systems when applying machine learning algorithms to large-scale data.
* Distributed systems can be computationally burdensome when the amount of data being processed is below a certain threshold.
Speaker: Amber
25
Github link: https://github.com/LenzDu/Smart-Meter
Thanks!