Creative Activity and Research Day (CARD)


Creative Activity and Research Day (CARD)
Forecasting Smart Meter Energy Usage using Distributed Systems and Machine Learning
Chris Dong, Lingzhi Du, Feiran Ji, Amber Song, Vanessa Zheng (USF Master of Data Science)

Feiran: We are team Sparkling Smartwater, and we are going to talk about our project on forecasting smart meter energy usage. I am Feiran. (...)

Why?

Feiran: We chose smart meter data because:
- The government plans to install smart meters in every home in London
- High electricity prices make it worth optimizing energy usage
- Forecasts help consumers understand their own energy consumption

Outline

1. Data Pipeline
2. MongoDB
3. Spark
4. ML Model & Outcome
5. Conclusion

1. Data Pipeline

Chris: Choose the data, save it to S3, import it into EC2.

Chris: Here is the data pipeline. The data goes from Kaggle to our local machines, then to S3, to EC2, into MongoDB, and finally into Spark.

Amazon S3: Wide Format

Chris: On the left we see the half-hourly data in long format. Ultimately we decided to use the wide format to make preprocessing and feature engineering easier. Each block represents a neighborhood.
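The long-to-wide pivot mentioned above can be sketched in plain Python. This is only a minimal illustration (the project itself worked in Spark and pandas); the meter ID, day, and column layout here are illustrative assumptions, not the project's actual schema:

```python
from collections import defaultdict

def long_to_wide(rows):
    """Pivot half-hourly readings from long format (one row per
    meter/timestamp) to wide format (one row per meter/day with
    48 half-hour slots)."""
    wide = defaultdict(lambda: [None] * 48)
    for meter_id, day, half_hour, kwh in rows:
        wide[(meter_id, day)][half_hour] = kwh
    return dict(wide)

# Three half-hourly readings for one (hypothetical) meter on one day
long_rows = [
    ("MAC000002", "2012-02-01", 0, 0.42),
    ("MAC000002", "2012-02-01", 1, 0.38),
    ("MAC000002", "2012-02-01", 2, 0.35),
]
wide = long_to_wide(long_rows)
print(wide[("MAC000002", "2012-02-01")][:3])  # [0.42, 0.38, 0.35]
```

In the wide layout each row already lines up the 48 half-hour slots of a day, which is why downstream feature engineering becomes simpler.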

Importing data from S3 to EC2

aws s3 cp s3://smart-meters/halfhourly/ . --recursive --acl public-read
aws s3 cp s3://smart-meters/hhblock_dataset/ . --recursive --acl public-read
aws s3 cp s3://smart-meters/other/ . --recursive --acl public-read

Chris: Here we see the data loaded into EC2, which goes from block 0 to block 111.

2. MongoDB

Chris: Importing from S3; database size; MongoDB query.

Importing data into MongoDB from S3

Add a new column indicating the filename.

for i in *.csv; do mongoimport -d smart -c energy --type csv --file $i --headerline; done
for i in *.csv; do mongoimport -d wide -c energy --type csv --file $i --headerline; done
for i in *.csv; do mongoimport -d other -c $i --type csv --file $i --headerline; done

Chris: Here we import the data into separate MongoDB databases and run a simple query. I also add a new column indicating the filename.
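Adding the filename column before import could be done with a short script like the following. This is a hedged sketch, not the project's actual code; the function name, the `file` column name, and the sample data are assumptions made for illustration:

```python
import csv
import io

def add_filename_column(csv_text, filename):
    """Append a 'file' column holding the source filename to every
    row of a CSV, so the originating block survives the import."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["file"])
    for row in reader:
        writer.writerow(row + [filename])
    return out.getvalue()

sample = "LCLid,energy\nMAC000002,0.42\n"
print(add_filename_column(sample, "block_0.csv"))
# LCLid,energy,file
# MAC000002,0.42,block_0.csv
```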

3. Spark

Create RDD; Spark DataFrame; instance specs.

Creating an RDD from MongoDB on EC2 (Amber)

Querying a Spark DataFrame on EC2 (Amber)

AWS Instance Specs

EC2 (1 instance, t2.large): 2 cores, 8 GB RAM, 300 GB storage, $0.09/hour
Standalone (5 instances, 4 workers, c3.2xlarge): 8 cores, 15 GB RAM, 160 GB storage, $2.10/hour
YARN (5 instances, 4 workers, c3.2xlarge): 16 cores, 15 GB RAM, 160 GB storage, $2.10/hour
YARN (1 instance, c3.8xlarge): 32 cores, 60 GB RAM, 640 GB storage, $1.68/hour

4. Machine Learning

Data overview; analytical goals; feature engineering; model; specs.

Data Overview

Analytical Goals

Goal: predict bi-hourly energy usage for one day ahead (12 data points)
Model approach: Spark ML RandomForestRegressor()
Challenge: solving a time series problem with ML
Key: feature engineering

Feature Engineering

* day1 = one day before the label day
* day2 = two days before the label day
* e.g., if the label day is 2012-02-01, day1 = 2012-01-31 and day2 = 2012-01-30

Lingzhi: Let me talk about how we do feature engineering and build our training dataset. Basically, we pick one timestamp as our response and use the data before that timestamp to build our features. The features include the daily average energy usage for several days before the timestamp, which captures the trend of the time series. We also include weather data for those days, because temperature is highly correlated with energy usage. Then we add half-hourly average energy usage: for example, the average usage at 12 am across day 1 to day 7 is one feature. There are 48 half-hours, so there are 48 such features; they work as a seasonal component capturing the daily seasonality of the time series. We also include neighborhood information as a categorical feature and feed it into the random forest with label encoding. For the training response, since we are predicting every 2 hours, we have 12 values on the label day. So we build 12 different random forest models; they share the same features but are trained separately.
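The trend and seasonality features described above can be sketched in plain Python. This is a simplified illustration under assumed inputs (wide-format history only; the weather and neighborhood features from the talk are omitted, and the function name and window length are made up for the sketch):

```python
from statistics import mean

def build_features(history, label_day_index, n_days=7):
    """Build lag features for one label day.

    history: list of days, oldest to newest, each a list of 48
    half-hourly kWh readings. From the n_days days before the label
    day we compute (a) the daily average usage, capturing the trend,
    and (b) the per-half-hour average across those days, capturing
    daily seasonality (48 features)."""
    window = history[label_day_index - n_days : label_day_index]
    daily_avg = [mean(day) for day in window]            # n_days features
    half_hour_avg = [mean(day[h] for day in window)      # 48 features
                     for h in range(48)]
    return daily_avg + half_hour_avg

# Ten synthetic days: day d reads a constant 0.1*(d+1) kWh per half hour
history = [[round(0.1 * (d + 1), 1)] * 48 for d in range(10)]
features = build_features(history, label_day_index=9)
print(len(features))  # 7 daily averages + 48 half-hourly averages = 55
```

The 12 bi-hourly responses would then each get their own random forest trained on these same features.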

Training Validation

As you noticed, we are picking only one timestamp now, and this may lead our model to overfit to that day only. To make it generalize to the future, (...)

Model Performance

Model approach: Spark ML RandomForestRegressor()
Evaluation metric: RMSE

Energy Usage Predictions

Time per Model

Attempt | Data preprocessing | Model training | Trees in RF
1       | Spark SQL          | Spark ML       | 10
2       | Pandas             | Spark ML       | 10
3       | Pandas             | Spark ML       | 300

Attempt | Single EC2 | Standalone (5) | YARN (5) | YARN (3) | YARN (1)
1       | forever    | 3602 s         | 3478 s   | 3186 s   | —
2       | 34.8 s     | 36.7 s         | 35.3 s   | 43.6 s   | —
3       | error      | —              | 500.0 s  | 546.0 s  | 937.4 s

AWS Instance Specs: Cost per Model

EC2 (1 instance): not enough memory
YARN (5 instances, 4 workers): 8.3 min/model, $0.29/model
YARN (3 instances, 2 workers): 9.1 min/model, $0.38/model
YARN (1 instance): 15.7 min/model, $0.44/model
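The per-model dollar figures above follow from runtime times the cluster's hourly rate (using the rates from the earlier instance-spec slide):

```python
def cost_per_model(minutes_per_model, rate_per_hour):
    """Dollar cost of training one model: runtime in hours
    multiplied by the cluster's hourly rate."""
    return minutes_per_model / 60 * rate_per_hour

# 5-instance YARN cluster at $2.10/hour, 8.3 minutes per model
print(round(cost_per_model(8.3, 2.10), 2))   # 0.29
# single c3.8xlarge at $1.68/hour, 15.7 minutes per model
print(round(cost_per_model(15.7, 1.68), 2))  # 0.44
```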

5. Conclusion

Research Result

Amber: There are significant computational advantages to using distributed systems when applying machine learning algorithms to large-scale data. However, distributed systems can be computationally burdensome when the amount of data being processed is below a threshold.

Thanks! GitHub link: https://github.com/LenzDu/Smart-Meter