Cloud Big Data Decision Support System for Machine Learning on AWS
Alex Kaplunovich, Yelena Yesha
University of Maryland Baltimore County
akaplun1@umbc.edu, yeyesha@umbc.edu
December 2017
Rationale
- Where to run Big Data machine learning jobs
- How to optimize expenses on computing and memory
- How to optimize running time
- Collect data on the fly
- Analyze and recommend an AWS instance using "analytics of analytics"
- Predict time to execute
Machine Learning Methods (all driven through sklearn's common fit/predict interface, see the sketch below)

| Name                            | Library               | Class name              |
|---------------------------------|-----------------------|-------------------------|
| K-Means Clustering              | sklearn.cluster       | KMeans                  |
| Mini Batch K-Means Clustering   | sklearn.cluster       | MiniBatchKMeans         |
| Birch Clustering                | sklearn.cluster       | Birch                   |
| Hierarchical Clustering         | sklearn.cluster       | AgglomerativeClustering |
| DBSCAN Clustering               | sklearn.cluster       | DBSCAN                  |
| Naive Bayes Classification      | sklearn.naive_bayes   | GaussianNB              |
| Decision Tree Classification    | sklearn.tree          | DecisionTreeClassifier  |
| Random Forest Classification    | sklearn.ensemble      | RandomForestClassifier  |
| Polynomial Regression           | sklearn.preprocessing | PolynomialFeatures      |
| Support Vector Regression (SVR) | sklearn.svm           | SVR                     |
| Decision Tree Regression        | sklearn.tree          | DecisionTreeRegressor   |
| Random Forest Regression        | sklearn.ensemble      | RandomForestRegressor   |
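A minimal sketch of how the listed scikit-learn classes can be benchmarked in one loop; the shared fit/predict interface is what makes uniform timing practical. The data, cluster counts and estimator parameters below are placeholders, not the settings used in the experiments.

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, Birch, AgglomerativeClustering, DBSCAN
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(1000, 4)           # placeholder feature matrix
y = np.random.randint(0, 2, 1000)     # placeholder labels / targets

# clustering methods: every class exposes fit_predict
for model in [KMeans(n_clusters=5), MiniBatchKMeans(n_clusters=5), Birch(n_clusters=5),
              AgglomerativeClustering(n_clusters=5), DBSCAN(eps=0.3)]:
    labels = model.fit_predict(X)

# classifiers and regressors share fit / predict
for model in [GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier(),
              SVR(), DecisionTreeRegressor(), RandomForestRegressor()]:
    model.fit(X, y)
    preds = model.predict(X)

# PolynomialFeatures is a transformer used before a linear fit
X_poly = PolynomialFeatures(degree=2).fit_transform(X)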
Datasets (selected; see the loading sketch below)

| Name                              | Size   | Rows      |
|-----------------------------------|--------|-----------|
| Speed_Camera_Violations_6.csv     | 14.8MB | 112606    |
| Speed_Camera_Violations_14.json   | 39.9MB |           |
| Crimes_-_2001_to_present_19.csv   | 1.4GB  | 6347694   |
| Taxi_Trips_17.csv                 | 41GB   | 110841806 |
| reviews_Automotive_5.json         | 14MB   | 20473     |
| Tools_and_Home_Improvement_5.json | 107MB  | 134476    |
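A sketch of reading the two dataset formats with the tools listed later in the deck (pandas for csv, ijson for streaming json). The parsing prefix 'item' assumes a top-level JSON array; the real files may need a different prefix or line-by-line parsing.

import pandas as pd
import ijson

df = pd.read_csv('Speed_Camera_Violations_6.csv')   # 14.8MB, 112606 rows
print(df.shape)

with open('reviews_Automotive_5.json', 'rb') as f:
    for record in ijson.items(f, 'item'):           # stream records instead of loading the whole file
        pass                                         # clean / featurize each record here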
Why Cloud?
- No need to procure and maintain hardware
- Do not pay for idle: pay only for what you use
- Scalability
  - Easy to create a more powerful machine/cluster
  - Increase disk size
  - Run several tasks independently
- Maintainability
  - No patching
  - Software installation (almost)
Why AWS (Amazon Web Services)?
- Leading cloud provider (Gartner)
- Most features and services
- The oldest (11 years old)
- Automation
- Easy to use
- Secure
AWS EC2 Instance Types (what does it mean?)
- General Purpose
- Compute Optimized
- GPU Instances
- Memory Optimized
- Storage Optimized

Example (see the cost sketch below):

| Name       | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | Price           |
|------------|------|-----|--------------|-----------------------|-----------------|
| i3.large   | 2    | 7   | 15.25        | 1 x 475 NVMe SSD      | $0.156 per Hour |
| i3.xlarge  | 4    | 13  | 30.5         | 1 x 950 NVMe SSD      | $0.312 per Hour |
| i3.2xlarge | 8    | 27  | 61           | 1 x 1900 NVMe SSD     | $0.624 per Hour |
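A small sketch of the cost/time trade-off behind the instance recommendation, using the i3 hourly prices from the table above. The runtimes are hypothetical placeholders.

# hourly prices taken from the table above
PRICE_PER_HOUR = {'i3.large': 0.156, 'i3.xlarge': 0.312, 'i3.2xlarge': 0.624}

def run_cost(instance_type, runtime_seconds):
    """Dollar cost of a job: hourly price times hours used."""
    return PRICE_PER_HOUR[instance_type] * runtime_seconds / 3600.0

# hypothetical measured runtimes for the same ML job on each instance
for itype, seconds in [('i3.large', 5400), ('i3.xlarge', 3000), ('i3.2xlarge', 1700)]:
    print(itype, round(run_cost(itype, seconds), 3), 'USD')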
What do we do?
- Run assorted ML methods on big datasets (different formats) on AWS instances
- Collect analytical data in NoSQL DynamoDB
- Run regression methods on the collected analytics (see the sketch below) to
  - Predict time
  - Recommend an instance
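A sketch of the "analytics of analytics" step: a regression model is fit on previously collected runs to predict execution time for a new (dataset, instance) combination. The feature choice here (rows, file size in MB, vCPU, memory in GiB) and the runtimes are illustrative assumptions, not the exact schema used.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# collected analytics: [rows, size_mb, vcpu, memory_gib] -> runtime in seconds
history_X = np.array([[112606,   14.8, 2, 15.25],
                      [112606,   14.8, 4, 30.5],
                      [6347694, 1400.0, 8, 61.0]])
history_y = np.array([420.0, 250.0, 3100.0])    # hypothetical measured runtimes

model = RandomForestRegressor(n_estimators=100).fit(history_X, history_y)
predicted_seconds = model.predict([[6347694, 1400.0, 4, 30.5]])
print('predicted runtime (s):', predicted_seconds[0])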
Architecture
- EBS
- DynamoDB
- CLI
- IAM
- EC2 (Deep Learning AMI)
- S3
- CloudFormation
Challenges
- Overwrite AWS defaults
  - Disk sizes
  - Instance number limits
- Data cleaning (see the sketch below)
  - Verify, filter nulls and errors
- Systematic approach
  - All instances
  - New instance integration
  - Results
  - Expandability
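A minimal sketch of the data-cleaning step (verify, filter nulls and errors) with pandas; the column name used here is an assumption standing in for whichever numeric field a given dataset has.

import pandas as pd

df = pd.read_csv('Speed_Camera_Violations_6.csv')
df = df.dropna()                                                       # drop rows with missing values
df = df[pd.to_numeric(df['VIOLATIONS'], errors='coerce').notnull()]    # keep rows whose numeric field parses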
Languages and Tools
- Python
- CloudFormation automation
- Boto3 – AWS integration
- Matplotlib – graphing
- Sklearn – ML method implementations
- Nltk – natural language processing
- Pandas – csv processing
- Ijson – json processing
- Numpy – scientific computing library
Limitations
- CloudFormation does not support loops
  - Solution: spawn multiple CloudFormation stacks from a loop in Python (see the sketch below)
- AWS imposes instance launch limits
  - Create an AWS service ticket to increase the limits
  - Wait until the limit is increased
- The AWS console does not allow removing multiple CloudFormation stacks at once
  - Run a Python program that destroys multiple stacks
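A sketch of working around the missing loop construct: the same template is launched once per instance type from a Python loop. The parameter name InstanceTypeParameterG matches the CloudFormation example later in the deck; the template file name, region and instance list are assumptions.

import boto3

client = boto3.client('cloudformation', region_name='us-east-1')
template_body = open('benchmark_stack.yaml').read()   # hypothetical template file

for i, instance_type in enumerate(['i3.large', 'i3.xlarge', 'i3.2xlarge']):
    client.create_stack(
        StackName='ml-benchmark-{}'.format(i),
        TemplateBody=template_body,
        Parameters=[{'ParameterKey': 'InstanceTypeParameterG',
                     'ParameterValue': instance_type}])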
Optimization and Automation
- CloudFormation AWS service (json or yaml)
  - Pass parameters
  - Spawn instances
  - Pass shell scripts
  - Launch python jobs
  - Attach disks, images, security groups
  - Define output resources
  - Terminate the instance upon test completion
CloudFormation example (yaml)

Resources:
  EC2:
    Type: "AWS::EC2::Instance"
    DeletionPolicy: Delete
    Properties:
      ImageId: ami-228dbc34
      InstanceType: !Ref InstanceTypeParameterG
      SecurityGroupIds:
        - sg-6d413ccc
      KeyName: testkey
Outputs:
  instance:
    Description: Created Instance
    Value: !Join ["", [!Ref InstanceTypeParameterG, ' ip ', !GetAtt EC2.PublicIp, ' reg ', !Ref "AWS::Region"]]
Destroy stacks from python

import boto3

# The access keys shown on the slide are truncated placeholders; real runs use
# configured AWS credentials or an IAM role instead of hard-coded keys.
client = boto3.client('cloudformation', region_name='us-east-1',
                      aws_access_key_id='AAAKK2HPQ',
                      aws_secret_access_key='dasfdag/Gv8oKrsdg')

response = client.list_stacks()
for summary in response['StackSummaries']:
    print('Destr ', summary['StackName'])
    client.delete_stack(StackName=summary['StackName'])
Implementation
Each run includes the following steps for each data file (see the timing sketch below):
- Clean data
- Run machine learning methods on the data
  - Create model
  - Split data into test and train
  - Fit
  - Predict
- Record detailed timing results into DynamoDB for each ML method
  - Load
  - Preprocessing
  - Processing
- Generate graphs on the collected data using regression
  - Expected time
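A sketch of the timing-collection step: one ML run is timed and the result is written to DynamoDB with boto3. The table name, key format and attribute names are illustrative assumptions; the real schema records load, preprocessing and processing times per method.

import time
from decimal import Decimal
import boto3
import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.rand(10000, 4)    # placeholder for the cleaned, split dataset

table = boto3.resource('dynamodb', region_name='us-east-1').Table('ml_timings')   # hypothetical table

start = time.time()
KMeans(n_clusters=5).fit(X_train)
processing_seconds = time.time() - start

table.put_item(Item={
    'run_id': 'i3.large#Speed_Camera_Violations_6.csv#KMeans',   # hypothetical key format
    'instance_type': 'i3.large',
    'method': 'KMeans',
    'processing_seconds': Decimal(str(processing_seconds)),      # DynamoDB requires Decimal, not float
})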
Result Processing
- Once data is collected, use analytical methods to predict
  - Execution time
  - Price
- A new test simply extends the DynamoDB data, making predictions more precise
- Clustering: data is divided into clusters and a graph is created
- NLP classification: the model is trained and the test-set reviews are classified
  - A confusion matrix is created (5 stars: a 5x5 matrix; see the sketch below)
  - Correct-prediction ratios are saved

[[1289   73   12    0    0]
 [ 384   55   12    0    0]
 [ 103   23    5    2    1]
 [  51    4    0    0    0]
 [  33    5    0    0    1]]
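A sketch of how a 5x5 confusion matrix like the one above and the correct-prediction ratio can be produced with sklearn; y_test and y_pred stand in for the held-out star ratings and the classifier output, and are random placeholders here.

import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.random.randint(1, 6, 2053)   # placeholder true star ratings (1-5)
y_pred = np.random.randint(1, 6, 2053)   # placeholder predicted ratings

cm = confusion_matrix(y_test, y_pred, labels=[1, 2, 3, 4, 5])
correct_ratio = np.trace(cm) / cm.sum()  # diagonal = correctly classified reviews
print(cm)
print('correct prediction ratio:', round(correct_ratio, 3))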
Results (four slides of graphs generated from the collected data)
Lessons Learnt
- Nltk is slow for stemming: 3-4 minutes for 1000 reviews (see the sketch below)
- Large memory constraints for training
  - Hierarchical Clustering (AgglomerativeClustering)
  - DBSCAN
- Prefer csv files to json
- Optimize code
- Be skeptical about online examples
- Experiment
- Automate when possible
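A sketch of the kind of nltk stemming loop whose cost (3-4 minutes per 1000 reviews) motivated the lesson above; the stemmer choice and the whitespace tokenization are assumptions, not necessarily the exact pipeline used.

import time
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
reviews = ['great product, works as expected'] * 1000   # placeholder review texts

start = time.time()
stemmed = [[stemmer.stem(word) for word in review.split()] for review in reviews]
print('stemmed 1000 reviews in', round(time.time() - start, 2), 's')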
Future Work
- Serverless microservices (Lambda)
- Neural networks (TensorFlow, MxNet, Torch)
- NLP (word databases)
- Spark clusters