Cloud Big Data Decision Support System for Machine Learning on AWS


1 Cloud Big Data Decision Support System for Machine Learning on AWS
Alex Kaplunovich, Yelena Yesha
University of Maryland Baltimore County
December 2017

2 Rationale
- Where to run big data machine learning jobs
- How to optimize expenses on
  - Computing
  - Memory
- How to optimize running time
- Collect data on the fly
- Analyze and recommend using Analytics of Analytics
  - AWS instance
  - Predict time to execute

3 Machine Learning Methods
Name                              Library                 Class name
K-Means Clustering                sklearn.cluster         KMeans
Mini Batch K-Means Clustering     sklearn.cluster         MiniBatchKMeans
Birch Clustering                  sklearn.cluster         Birch
Hierarchical Clustering           sklearn.cluster         AgglomerativeClustering
DBSCAN Clustering                 sklearn.cluster         DBSCAN
Naive Bayes Classification        sklearn.naive_bayes     GaussianNB
Decision Tree Classification      sklearn.tree            DecisionTreeClassifier
Random Forest Classification      sklearn.ensemble        RandomForestClassifier
Polynomial Regression             sklearn.preprocessing   PolynomialFeatures
Support Vector Regression (SVR)   sklearn.svm             SVR
Decision Tree Regression          sklearn.tree            DecisionTreeRegressor
Random Forest Regression          sklearn.ensemble        RandomForestRegressor
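The table maps naturally onto a small lookup keyed by method name. The sketch below is one possible way to organize it, not the authors' code: `ML_METHODS`, `load_class`, and `make_estimator` are hypothetical names. Classes are resolved lazily with `importlib`, so scikit-learn is only needed when a method is actually instantiated.

```python
import importlib

# (module path, class name) pairs taken from the table of methods.
ML_METHODS = {
    "K-Means Clustering": ("sklearn.cluster", "KMeans"),
    "Mini Batch K-Means Clustering": ("sklearn.cluster", "MiniBatchKMeans"),
    "Birch Clustering": ("sklearn.cluster", "Birch"),
    "Hierarchical Clustering": ("sklearn.cluster", "AgglomerativeClustering"),
    "DBSCAN Clustering": ("sklearn.cluster", "DBSCAN"),
    "Naive Bayes Classification": ("sklearn.naive_bayes", "GaussianNB"),
    "Decision Tree Classification": ("sklearn.tree", "DecisionTreeClassifier"),
    "Random Forest Classification": ("sklearn.ensemble", "RandomForestClassifier"),
    "Polynomial Regression": ("sklearn.preprocessing", "PolynomialFeatures"),
    "Support Vector Regression (SVR)": ("sklearn.svm", "SVR"),
    "Decision Tree Regression": ("sklearn.tree", "DecisionTreeRegressor"),
    "Random Forest Regression": ("sklearn.ensemble", "RandomForestRegressor"),
}


def load_class(module_path, class_name):
    """Import `module_path` and return the attribute named `class_name`."""
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


def make_estimator(method_name, **kwargs):
    """Instantiate the class registered for `method_name` (hypothetical helper)."""
    module_path, class_name = ML_METHODS[method_name]
    return load_class(module_path, class_name)(**kwargs)
```

With scikit-learn installed, `make_estimator("K-Means Clustering", n_clusters=3)` would return a `KMeans` instance.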

4 Datasets (selected)
Name                               Size     Rows
Speed_Camera_Violations_6.csv      14.8MB   112606
olations_14.json                   39.9MB
Crimes_-_2001_to_present_19.csv    1.4GB
Taxi_Trips_17.csv                  41GB
reviews_Automotive_5.json          14MB     20473
Tools_and_Home_Improvement_5.js    107MB    134476

5 Why Cloud?
- No need to procure and maintain hardware
- Do not pay for idle time: pay only for what you use
- Scalability
  - Easy to create more powerful machine/cluster
  - Increase disk size
  - Run several tasks independently
- Maintainability
  - No patching
  - Software installation (almost)

6 Why AWS (Amazon Web Services)
- Leading cloud provider (Gartner)
- Most features and services
- The oldest (11 years old)
- Automation
- Easy to use
- Secure

7 AWS EC2 Instance Types (what does it mean?)
- General Purpose
- Compute Optimized
- GPU Instances
- Memory Optimized
- Storage Optimized

Example:
Name         vCPU   ECU   Memory (GiB)   Instance Storage (GB)   Price
i3.large     2      7     15.25          1 x 475 NVMe SSD        $0.156 per Hour
i3.xlarge    4      13    30.5           1 x 950 NVMe SSD        $0.312 per Hour
i3.2xlarge   8      27    61             1 x 1900 NVMe SSD       $0.624 per Hour
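The hourly prices make the cost side of the trade-off easy to work out: a larger instance costs more per hour but may still be cheaper overall if it finishes sooner. The sketch below uses the three i3 prices from the example table; per-second proration is a simplifying assumption, and the function names are hypothetical.

```python
# Hourly prices from the example table (USD per hour).
HOURLY_PRICE_USD = {
    "i3.large": 0.156,
    "i3.xlarge": 0.312,
    "i3.2xlarge": 0.624,
}


def run_cost_usd(instance_type, seconds):
    """Cost of running `instance_type` for `seconds`, prorated by the second."""
    return HOURLY_PRICE_USD[instance_type] * seconds / 3600.0


def cheapest_instance(runtime_seconds):
    """Given {instance_type: measured runtime in seconds}, return the
    (instance_type, cost) pair with the lowest total cost."""
    costs = {t: run_cost_usd(t, s) for t, s in runtime_seconds.items()}
    best = min(costs, key=costs.get)
    return best, costs[best]
```

For instance, if a job took a full hour on `i3.large` but only ten minutes on `i3.2xlarge` (made-up runtimes), `cheapest_instance({"i3.large": 3600, "i3.2xlarge": 600})` would pick `i3.2xlarge`.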

8 What do we do?
- Run assorted ML methods on big datasets (different formats) on AWS instances
- Collect analytical data in NoSQL DynamoDB
- Run regression methods on collected analytics to
  - Predict time
  - Recommend instance
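The last step, regression over the collected analytics, can be illustrated with a very small model. The slides do not say which features feed the regression, so this sketch assumes the dataset row count is the only feature and swaps in plain one-variable least squares for the sklearn regressors. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_seconds(model, rows):
    """Predicted running time for a dataset with `rows` rows."""
    slope, intercept = model
    return slope * rows + intercept


def recommend_instance(models, rows):
    """Given {instance_type: fitted model}, return the instance type
    with the lowest predicted running time for `rows` rows."""
    return min(models, key=lambda t: predict_seconds(models[t], rows))
```

One model would be fitted per instance type (and per ML method) from the timings already stored, and the recommendation re-evaluated as new runs extend the data.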

9 Architecture
[Diagram components: EBS, DynamoDB, CLI, IAM, EC2, Deep Learning AMI, S3, CloudFormation]

10 Challenges
- Override AWS defaults
  - Disk sizes
  - Instance number limits
- Data cleaning
  - Verify, filter nulls and errors
- Systematic approach
  - All instances
  - New instance integration
  - Results
  - Expandability

11 Languages and Tools
- Python
  - Boto3 – AWS integration
  - Matplotlib – graphing
  - Sklearn – ML methods implementations
  - Nltk – natural language processing
  - Pandas – csv processing
  - Ijson – json processing
  - Numpy – scientific computing library
- CloudFormation automation

12 Limitations
- CloudFormation does not support loops
  - Solution: spawn multiple CloudFormation stacks from a loop in Python
- AWS imposes instance launch limits
  - Create an AWS service ticket to increase limits
  - Wait until the limit is increased
- AWS console does not allow removing multiple CloudFormation stacks at once
  - Run a Python program that destroys multiple stacks
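The first workaround, one stack per instance type created from a Python loop, might look like the sketch below. It assumes a boto3 CloudFormation client is passed in and that the template declares a parameter named `InstanceTypeParameterG` (the name used in the slide 14 example). The `ml-test` prefix and both function names are hypothetical.

```python
def stack_name_for(instance_type, prefix="ml-test"):
    """Build a stack name; CloudFormation stack names allow only
    letters, digits and hyphens, so the dot in 'i3.large' is replaced."""
    return f"{prefix}-{instance_type.replace('.', '-')}"


def launch_stacks(client, template_body, instance_types):
    """Create one CloudFormation stack per instance type.

    `client` is expected to behave like boto3.client('cloudformation').
    Returns {instance_type: StackId}.
    """
    stack_ids = {}
    for instance_type in instance_types:
        response = client.create_stack(
            StackName=stack_name_for(instance_type),
            TemplateBody=template_body,
            Parameters=[
                {
                    "ParameterKey": "InstanceTypeParameterG",
                    "ParameterValue": instance_type,
                }
            ],
        )
        stack_ids[instance_type] = response["StackId"]
    return stack_ids


# Example wiring (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("cloudformation", region_name="us-east-1")
#   launch_stacks(client, open("template.yaml").read(), ["i3.large", "i3.xlarge"])
```

Taking the client as an argument keeps the loop easy to exercise with a stub before any real instances are launched.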

13 Optimization and Automation
- CloudFormation AWS service (json or yaml)
  - Pass parameters
  - Spawn instances
  - Pass shell scripts
  - Launch python jobs
  - Attach
    - Disks
    - Images
    - Security Groups
  - Define output resources
  - Terminate instance upon test completion

14 CloudFormation example (yaml)
Resources:
  EC2:
    Type: "AWS::EC2::Instance"
    DeletionPolicy: Delete
    Properties:
      ImageId: ami-228dbc34
      InstanceType: !Ref InstanceTypeParameterG
      SecurityGroupIds:
        - sg-6d413ccc
      KeyName: testkey
Outputs:
  instance:
    Description: Created Instance
    Value: !Join ["", [!Ref InstanceTypeParameterG, ' ip ', !GetAtt EC2.PublicIp, ' reg ', !Ref "AWS::Region"]]

15 Destroy stacks from python
import boto3

# Credentials are passed inline as on the slide; the values are placeholders.
client = boto3.client('cloudformation', region_name='us-east-1',
                      aws_access_key_id='AAAKK2HPQ',
                      aws_secret_access_key='dasfdag/Gv8oKrsdg')

# Request deletion of every stack returned by list_stacks().
response = client.list_stacks()
for summary in response['StackSummaries']:
    print('Destr ', summary['StackName'])
    client.delete_stack(StackName=summary['StackName'])
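A slightly more defensive variant is sketched below. It is an assumption about how the cleanup could be narrowed, not the authors' program: it pages through `list_stacks` with the boto3 paginator, skips stacks that are already deleted by filtering on status, and only touches stacks whose names start with a given prefix. The `name_prefix` argument and the status list are illustrative choices.

```python
# Statuses of stacks that still exist and are safe to request deletion for.
ACTIVE_STATUSES = ["CREATE_COMPLETE", "ROLLBACK_COMPLETE", "UPDATE_COMPLETE"]


def delete_stacks(client, name_prefix):
    """Delete every active stack whose name starts with `name_prefix`.

    `client` is expected to behave like boto3.client('cloudformation').
    Returns the list of stack names for which deletion was requested.
    """
    deleted = []
    paginator = client.get_paginator("list_stacks")
    for page in paginator.paginate(StackStatusFilter=ACTIVE_STATUSES):
        for summary in page["StackSummaries"]:
            name = summary["StackName"]
            if name.startswith(name_prefix):
                client.delete_stack(StackName=name)
                deleted.append(name)
    return deleted
```

Restricting deletion to a prefix avoids removing unrelated stacks that happen to live in the same account and region.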

16 Implementation
- Each run includes the following steps for each data file
  - Clean data
  - Run machine learning methods on data
    - Create model
    - Split data into test and train
    - Fit
    - Predict
  - Record detailed timing results into DynamoDB for each ML method
    - Load
    - Processing
    - Preprocessing
  - Generate graphs on collected data using regression
    - Expected time
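The fit/predict timing and the DynamoDB record can be sketched as follows. This is a minimal illustration under stated assumptions: the model is any object with `fit` and `predict` (as sklearn estimators have), the item attribute names are hypothetical, and `table` is expected to behave like a boto3 DynamoDB `Table` resource. Timings are converted to `Decimal` because the boto3 DynamoDB resource layer does not accept Python floats.

```python
import time
from decimal import Decimal


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def time_method(model, X_train, y_train, X_test):
    """Time the fit and predict steps of one ML method."""
    _, fit_seconds = timed(model.fit, X_train, y_train)
    predictions, predict_seconds = timed(model.predict, X_test)
    return predictions, {"fit": fit_seconds, "predict": predict_seconds}


def to_item(data_file, method, instance_type, timings):
    """Build a DynamoDB item; attribute names here are hypothetical."""
    item = {
        "data_file": data_file,
        "method": method,
        "instance_type": instance_type,
    }
    for step, seconds in timings.items():
        # Store timings as Decimal, rounded to microseconds.
        item[step + "_seconds"] = Decimal(str(round(seconds, 6)))
    return item


def record(table, item):
    """Write one timing record; `table` is a boto3 DynamoDB Table resource."""
    table.put_item(Item=item)
```

The load and preprocessing steps could be wrapped with `timed` in the same way and added to the `timings` dictionary before the item is written.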

17 Result Processing
- Once data is collected
  - Use analytical methods to predict
    - Execution time
    - Price
  - New test will just extend DynamoDB data
    - Predictions more precise
- Clustering
  - Data is divided into clusters and graph is created
- NLP Classification
  - Model is trained and test set reviews are classified
  - Confusion matrix created (5 stars – 5x5 matrix)
  - Correct prediction ratios are saved
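For the NLP classification step, a 5x5 confusion matrix over star ratings can be built as in the sketch below. The slides do not define how the "correct prediction ratios" are computed; the sketch assumes a per-class ratio (diagonal count divided by the row total), with rows as actual ratings and columns as predicted ratings. Function names are hypothetical.

```python
def confusion_matrix(actual, predicted, labels=(1, 2, 3, 4, 5)):
    """Count (actual, predicted) pairs; rows are actual, columns predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def correct_ratios(matrix):
    """Per-class share of correct predictions (diagonal / row total).

    Classes with no test examples get a ratio of 0.0.
    """
    ratios = []
    for i, row in enumerate(matrix):
        total = sum(row)
        ratios.append(row[i] / total if total else 0.0)
    return ratios
```

With scikit-learn available, `sklearn.metrics.confusion_matrix` produces the same row/column layout.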

18-21 Results
[Figures: result graphs]

23 Lessons Learnt
- Nltk is slow for stemming
  - 3-4 mins for 1000 reviews
- Large memory constraints for training
  - Hierarchical Clustering
  - DBSCAN
  - AgglomerativeClustering
- Prefer using csv files to json
- Optimize code
- Be skeptical about online examples
- Experiment
- Automate when possible

24 Future Work
- Serverless Microservices
  - Lambda
- Neural Networks
  - TensorFlow
  - MxNet
  - Torch
- NLP
  - Word databases
- Spark clusters

