Creative Activity and Research Day (CARD)

Presentation on theme: "Creative Activity and Research Day (CARD)"— Presentation transcript:

1 Creative Activity and Research Day (CARD)
Forecasting Smart Meter Energy Usage Using Distributed Systems and Machine Learning. Chris Dong, Lingzhi Du, Feiran Ji, Amber Song, Vanessa Zheng (USF Master of Data Science). Feiran: We are team Sparkling Smartwater, and we are going to talk about our project on forecasting smart meter energy usage. I am Feiran. (...)

2 Why? Data | MongoDB | Spark | ML | Outcome | Conclusion
Feiran: We chose smart meter data because the government plans to install smart meters in every home in London, electricity prices are high, and the data helps consumers understand and optimize their own energy consumption.

3 Outline
Data Pipeline; MongoDB; Spark; ML Model & Outcome; Conclusion. Feiran

4 Data Pipeline
1 Data Pipeline: Choose data, save to S3, import to EC2. Chris

5
Chris: So here is the data pipeline. The data comes from Kaggle to our local machines, then to S3, to EC2, into MongoDB, and finally into Spark.

6 Amazon S3: Wide Data
Chris: On the left we see the half-hourly data in long format. Ultimately we decided to use the wide format to make preprocessing and feature engineering easier. Each block represents a neighborhood.

7 Importing data from S3 to EC2
aws s3 cp s3://smart-meters/halfhourly/ . --recursive --acl public-read
aws s3 cp s3://smart-meters/hhblock_dataset/ . --recursive --acl public-read
aws s3 cp s3://smart-meters/other/ . --recursive --acl public-read
Chris: Here we see the data loaded into EC2, which goes from block 0 to block 111.

8 MongoDB
2 MongoDB: Importing from S3; database size; MongoDB query. Chris

9 Importing data into MongoDB from S3
Add a new column indicating the filename.
for i in *.csv; do mongoimport -d smart -c energy --type csv --file $i --headerline; done
for i in *.csv; do mongoimport -d wide -c energy --type csv --file $i --headerline; done
for i in *.csv; do mongoimport -d other -c $i --type csv --file $i --headerline; done
Chris: Here we import the data into separate MongoDB databases and run a simple query. I also add a new column indicating the filename.
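The "simple query" mentioned above is not captured in the transcript. A minimal sketch of what one might look like in Python with pymongo, assuming a local mongod, the `smart` database populated by the import above, and the Kaggle dataset's field names (the meter id `MAC000002` is a hypothetical example):

```python
# Filter for all readings from one smart meter. The field name "LCLid"
# follows the Kaggle smart-meter dataset; the id is a hypothetical example.
query = {"LCLid": "MAC000002"}

def first_reading(uri="mongodb://localhost:27017"):
    # Deferred import so the sketch can be read without pymongo installed.
    from pymongo import MongoClient
    client = MongoClient(uri)
    # find_one returns a single matching document, or None if no match.
    return client["smart"]["energy"].find_one(query)
```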

10 Spark
3 Spark: Create RDD; Spark DataFrame; Instance specs.

11 Creating RDD from MongoDB on EC2
Amber
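The slide content is not captured in the transcript. A minimal sketch of creating an RDD from MongoDB in PySpark via the mongo-spark connector, assuming the connector jar is on the classpath and the database/collection names from the import step:

```python
def load_energy_rdd(uri="mongodb://localhost:27017/smart.energy"):
    """Sketch: load the 'energy' collection into Spark and expose the RDD.
    Assumes a Spark installation and the mongo-spark connector jar."""
    # Deferred import: requires pyspark to be installed.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("smart-meters")
             .config("spark.mongodb.input.uri", uri)
             .getOrCreate())
    # The connector yields a DataFrame; .rdd exposes the underlying RDD.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
    return df.rdd
```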

12 Querying Spark DataFrame data on EC2
Amber
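The query shown on this slide is likewise not in the transcript. A sketch of one plausible DataFrame query, assuming the Kaggle half-hourly column names (`LCLid`, `energy(kWh/hh)`):

```python
def peak_usage_by_meter(df):
    """Sketch: maximum half-hourly reading per meter, highest first.
    Column names follow the Kaggle dataset and are an assumption."""
    # Deferred import: requires pyspark to be installed.
    from pyspark.sql import functions as F
    return (df.groupBy("LCLid")
              .agg(F.max(F.col("energy(kWh/hh)")).alias("peak_kwh"))
              .orderBy(F.desc("peak_kwh")))
```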

13 AWS Instance Specs
EC2: 1 instance, t2.large: 2 cores, 8 GB RAM, 300 GB storage, $0.09/hour
Standalone: 5 instances (4 workers), c3.2xlarge: 8 cores, 15 GB RAM, 160 GB storage, $2.1/hour
YARN: 5 instances (4 workers), c3.2xlarge: 8 cores, 15 GB RAM, 160 GB storage, $2.1/hour
YARN: 1 instance, c3.8xlarge: 32 cores, 60 GB RAM, 640 GB storage, $1.68/hour
Amber

14 Machine Learning
4 Machine Learning: Data Overview; Analytical Goals; Feature Engineering; Model; Specs.

15 Data Overview

16 Analytical Goals
To predict bi-hourly energy usage for one day ahead (12 data points). Model approach: Spark ML RandomForestRegressor(). Challenge: solving a time series problem with ML. Key: feature engineering.

17 Feature Engineering
* day1 = one day before label day
* day2 = two days before label day
* e.g., if label day is , day1 = , day2 =
Lingzhi: Let me talk about how we do feature engineering and build our training dataset. Basically, we pick one timestamp as our response and use the data before this timestamp to build our features. The features include daily average energy usage for several days before the timestamp, which captures the trend of the time series. We also include weather data for those days, because temperature is highly correlated with energy usage. Then we add hourly average energy usage: for example, the average energy usage at 12 am over day 1 to day 7 is one feature. There are 48 half-hours, so there are 48 such features; they work as a seasonal component capturing the daily seasonality of the time series. We also include neighborhood info as a categorical feature and feed it into the random forest with label encoding. For the training response, since we are predicting every 2 hours, we have 12 values on that date, so we build 12 different random forest models; they share the same features but are trained separately.

18 Training Validation
As you noticed, we are picking only one timestamp now, and this may lead our model to overfit to that day only. To make it generalize to the future, we validate the model on a later, held-out label day.
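One way to make the setup generalize is to draw training examples from several label days and hold the latest one out for validation. A sketch of that rolling-origin-style split, under that assumption (the dates are illustrative only):

```python
from datetime import date, timedelta

def rolling_label_days(last_day, n_train=5):
    """Generate consecutive label days: the earlier ones for training,
    the latest held out for validation (an assumed rolling-origin scheme)."""
    days = [last_day - timedelta(days=i) for i in range(n_train, -1, -1)]
    return days[:-1], days[-1]          # (train_days, validation_day)

train_days, val_day = rolling_label_days(date(2014, 2, 28))  # example date
```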

19 Model Performance
Model approach: Spark ML RandomForestRegressor(). Evaluation metric: RMSE.
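For reference, the evaluation metric in plain Python:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - yhat)^2))."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )

# Errors of 0, 0, and 2 give sqrt(4/3) ~= 1.155.
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```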

20 Energy Usage Predictions

21 Time per model
Attempt | Data preprocessing | Model training | Number of trees in RF
1 | Spark SQL | Spark ML | 10
2 | Pandas | Spark ML | 10
3 | Pandas | Spark ML | 300

Attempt | Single EC2 instance | Standalone, 5 instances | YARN, 5 instances | YARN, 3 instances | YARN, 1 instance
1 | forever | 3602 s | 3478 s | 3186 s | —
2 | 34.8 s | 36.7 s | 35.3 s | 43.6 s | —
3 | error | — | 500.0 s | 546.0 s | 937.4 s
Amber

22 AWS Instance Specs
EC2, 1 instance: not enough memory
YARN, 5 instances (4 workers): 8.3 min/model, $0.29/model
YARN, 3 instances (2 workers): 9.1 min/model, $0.38/model
YARN, 1 instance: 15.7 min/model, $0.44/model
Amber
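The per-model cost follows from the per-model time and the hourly rate on the earlier specs slide: cost = (minutes / 60) × $/hour. A quick check for two of the configurations (the 5-instance YARN cluster at $2.1/hour and the single c3.8xlarge at $1.68/hour):

```python
def cost_per_model(minutes, dollars_per_hour):
    # cost = (minutes / 60) * hourly rate, rounded to cents
    return round(minutes / 60 * dollars_per_hour, 2)

print(cost_per_model(8.3, 2.1))    # 5-instance YARN cluster -> 0.29
print(cost_per_model(15.7, 1.68))  # single c3.8xlarge -> 0.44
```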

23 Conclusion
5 Conclusion

24 Research Result
There are significant computational advantages to using distributed systems when applying machine learning algorithms to large-scale data. However, distributed systems can add more overhead than they save when the amount of data being processed is below a threshold. Amber

25 Thanks! Github link: https://github.com/LenzDu/Smart-Meter

