Cloud Big Data Decision Support System for Machine Learning on AWS
Alex Kaplunovich, Yelena Yesha
University of Maryland Baltimore County
akaplun1@umbc.edu, yeyesha@umbc.edu
December 2017
Rationale
- Where to run Big Data machine learning jobs
- How to optimize expenses on computing and memory
- How to optimize running time
- Collect data on the fly
- Analyze and recommend an AWS instance using "analytics of analytics"
- Predict time to execute
Machine Learning Methods (all driven through sklearn's common fit/predict interface, see the sketch below)

| Name                            | Library               | Class name              |
|---------------------------------|-----------------------|-------------------------|
| K-Means Clustering              | sklearn.cluster       | KMeans                  |
| Mini Batch K-Means Clustering   | sklearn.cluster       | MiniBatchKMeans         |
| Birch Clustering                | sklearn.cluster       | Birch                   |
| Hierarchical Clustering         | sklearn.cluster       | AgglomerativeClustering |
| DBSCAN Clustering               | sklearn.cluster       | DBSCAN                  |
| Naive Bayes Classification      | sklearn.naive_bayes   | GaussianNB              |
| Decision Tree Classification    | sklearn.tree          | DecisionTreeClassifier  |
| Random Forest Classification    | sklearn.ensemble      | RandomForestClassifier  |
| Polynomial Regression           | sklearn.preprocessing | PolynomialFeatures      |
| Support Vector Regression (SVR) | sklearn.svm           | SVR                     |
| Decision Tree Regression        | sklearn.tree          | DecisionTreeRegressor   |
| Random Forest Regression        | sklearn.ensemble      | RandomForestRegressor   |
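A minimal sketch of how the listed scikit-learn classes can be benchmarked in one loop; the shared fit/predict interface is what makes uniform timing practical. The data, cluster counts and estimator parameters below are placeholders, not the settings used in the experiments.

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans, Birch, AgglomerativeClustering, DBSCAN
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(1000, 4)           # placeholder feature matrix
y = np.random.randint(0, 2, 1000)     # placeholder labels / targets

# clustering methods: every class exposes fit_predict
for model in [KMeans(n_clusters=5), MiniBatchKMeans(n_clusters=5), Birch(n_clusters=5),
              AgglomerativeClustering(n_clusters=5), DBSCAN(eps=0.3)]:
    labels = model.fit_predict(X)

# classifiers and regressors share fit / predict
for model in [GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier(),
              SVR(), DecisionTreeRegressor(), RandomForestRegressor()]:
    model.fit(X, y)
    preds = model.predict(X)

# PolynomialFeatures is a transformer used before a linear fit
X_poly = PolynomialFeatures(degree=2).fit_transform(X)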
Datasets (selected; see the loading sketch below)

| Name                              | Size   | Rows      |
|-----------------------------------|--------|-----------|
| Speed_Camera_Violations_6.csv     | 14.8MB | 112606    |
| Speed_Camera_Violations_14.json   | 39.9MB |           |
| Crimes_-_2001_to_present_19.csv   | 1.4GB  | 6347694   |
| Taxi_Trips_17.csv                 | 41GB   | 110841806 |
| reviews_Automotive_5.json         | 14MB   | 20473     |
| Tools_and_Home_Improvement_5.json | 107MB  | 134476    |
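A sketch of reading the two dataset formats with the tools listed later in the deck (pandas for csv, ijson for streaming json). The parsing prefix 'item' assumes a top-level JSON array; the real files may need a different prefix or line-by-line parsing.

import pandas as pd
import ijson

df = pd.read_csv('Speed_Camera_Violations_6.csv')   # 14.8MB, 112606 rows
print(df.shape)

with open('reviews_Automotive_5.json', 'rb') as f:
    for record in ijson.items(f, 'item'):           # stream records instead of loading the whole file
        pass                                         # clean / featurize each record here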
Why Cloud?
- No need to procure and maintain hardware
- Do not pay for idle: pay only for what you use
- Scalability
  - Easy to create a more powerful machine/cluster
  - Increase disk size
  - Run several tasks independently
- Maintainability
  - No patching
  - Software installation (almost)
Why AWS (Amazon Web Services)?
- Leading cloud provider (Gartner)
- Most features and services
- The oldest (11 years old)
- Automation
- Easy to use
- Secure
AWS EC2 Instance Types (what does it mean?)
- General Purpose
- Compute Optimized
- GPU Instances
- Memory Optimized
- Storage Optimized

Example (see the cost sketch below):

| Name       | vCPU | ECU | Memory (GiB) | Instance Storage (GB) | Price           |
|------------|------|-----|--------------|-----------------------|-----------------|
| i3.large   | 2    | 7   | 15.25        | 1 x 475 NVMe SSD      | $0.156 per Hour |
| i3.xlarge  | 4    | 13  | 30.5         | 1 x 950 NVMe SSD      | $0.312 per Hour |
| i3.2xlarge | 8    | 27  | 61           | 1 x 1900 NVMe SSD     | $0.624 per Hour |
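A small sketch of the cost/time trade-off behind the instance recommendation, using the i3 hourly prices from the table above. The runtimes are hypothetical placeholders.

# hourly prices taken from the table above
PRICE_PER_HOUR = {'i3.large': 0.156, 'i3.xlarge': 0.312, 'i3.2xlarge': 0.624}

def run_cost(instance_type, runtime_seconds):
    """Dollar cost of a job: hourly price times hours used."""
    return PRICE_PER_HOUR[instance_type] * runtime_seconds / 3600.0

# hypothetical measured runtimes for the same ML job on each instance
for itype, seconds in [('i3.large', 5400), ('i3.xlarge', 3000), ('i3.2xlarge', 1700)]:
    print(itype, round(run_cost(itype, seconds), 3), 'USD')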
What do we do?
- Run assorted ML methods on big datasets (different formats) on AWS instances
- Collect analytical data in NoSQL DynamoDB
- Run regression methods on the collected analytics (see the sketch below) to
  - Predict time
  - Recommend an instance
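A sketch of the "analytics of analytics" step: a regression model is fit on previously collected runs to predict execution time for a new (dataset, instance) combination. The feature choice here (rows, file size in MB, vCPU, memory in GiB) and the runtimes are illustrative assumptions, not the exact schema used.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# collected analytics: [rows, size_mb, vcpu, memory_gib] -> runtime in seconds
history_X = np.array([[112606,   14.8, 2, 15.25],
                      [112606,   14.8, 4, 30.5],
                      [6347694, 1400.0, 8, 61.0]])
history_y = np.array([420.0, 250.0, 3100.0])    # hypothetical measured runtimes

model = RandomForestRegressor(n_estimators=100).fit(history_X, history_y)
predicted_seconds = model.predict([[6347694, 1400.0, 4, 30.5]])
print('predicted runtime (s):', predicted_seconds[0])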
Architecture
- EBS
- DynamoDB
- CLI
- IAM
- EC2 (Deep Learning AMI)
- S3
- CloudFormation
Challenges
- Overwrite AWS defaults
  - Disk sizes
  - Instance number limits
- Data cleaning (see the sketch below)
  - Verify, filter nulls and errors
- Systematic approach
  - All instances
  - New instance integration
  - Results
  - Expandability
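A minimal sketch of the data-cleaning step (verify, filter nulls and errors) with pandas; the column name used here is an assumption standing in for whichever numeric field a given dataset has.

import pandas as pd

df = pd.read_csv('Speed_Camera_Violations_6.csv')
df = df.dropna()                                                       # drop rows with missing values
df = df[pd.to_numeric(df['VIOLATIONS'], errors='coerce').notnull()]    # keep rows whose numeric field parses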
Languages and Tools
- Python
- CloudFormation automation
- Boto3 – AWS integration
- Matplotlib – graphing
- Sklearn – ML method implementations
- Nltk – natural language processing
- Pandas – csv processing
- Ijson – json processing
- Numpy – scientific computing library
Limitations
- CloudFormation does not support loops
  - Solution: spawn multiple CloudFormation stacks from a loop in Python (see the sketch below)
- AWS imposes instance launch limits
  - Create an AWS service ticket to increase the limits
  - Wait until the limit is increased
- The AWS console does not allow removing multiple CloudFormation stacks at once
  - Run a Python program that destroys multiple stacks
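A sketch of working around the missing loop construct: the same template is launched once per instance type from a Python loop. The parameter name InstanceTypeParameterG matches the CloudFormation example later in the deck; the template file name, region and instance list are assumptions.

import boto3

client = boto3.client('cloudformation', region_name='us-east-1')
template_body = open('benchmark_stack.yaml').read()   # hypothetical template file

for i, instance_type in enumerate(['i3.large', 'i3.xlarge', 'i3.2xlarge']):
    client.create_stack(
        StackName='ml-benchmark-{}'.format(i),
        TemplateBody=template_body,
        Parameters=[{'ParameterKey': 'InstanceTypeParameterG',
                     'ParameterValue': instance_type}])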
Optimization and Automation
- CloudFormation AWS service (json or yaml)
  - Pass parameters
  - Spawn instances
  - Pass shell scripts
  - Launch python jobs
  - Attach disks, images, security groups
  - Define output resources
  - Terminate the instance upon test completion
CloudFormation example (yaml)

Resources:
  EC2:
    Type: "AWS::EC2::Instance"
    DeletionPolicy: Delete
    Properties:
      ImageId: ami-228dbc34
      InstanceType: !Ref InstanceTypeParameterG
      SecurityGroupIds:
        - sg-6d413ccc
      KeyName: testkey
Outputs:
  instance:
    Description: Created Instance
    Value: !Join ["", [!Ref InstanceTypeParameterG, ' ip ', !GetAtt EC2.PublicIp, ' reg ', !Ref "AWS::Region"]]
Destroy stacks from python

import boto3

# The access keys shown on the slide are truncated placeholders; real runs use
# configured AWS credentials or an IAM role instead of hard-coded keys.
client = boto3.client('cloudformation', region_name='us-east-1',
                      aws_access_key_id='AAAKK2HPQ',
                      aws_secret_access_key='dasfdag/Gv8oKrsdg')

response = client.list_stacks()
for summary in response['StackSummaries']:
    print('Destr ', summary['StackName'])
    client.delete_stack(StackName=summary['StackName'])
Implementation
Each run includes the following steps for each data file (see the timing sketch below):
- Clean data
- Run machine learning methods on the data
  - Create model
  - Split data into test and train
  - Fit
  - Predict
- Record detailed timing results into DynamoDB for each ML method
  - Load
  - Preprocessing
  - Processing
- Generate graphs on the collected data using regression
  - Expected time
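A sketch of the timing-collection step: one ML run is timed and the result is written to DynamoDB with boto3. The table name, key format and attribute names are illustrative assumptions; the real schema records load, preprocessing and processing times per method.

import time
from decimal import Decimal
import boto3
import numpy as np
from sklearn.cluster import KMeans

X_train = np.random.rand(10000, 4)    # placeholder for the cleaned, split dataset

table = boto3.resource('dynamodb', region_name='us-east-1').Table('ml_timings')   # hypothetical table

start = time.time()
KMeans(n_clusters=5).fit(X_train)
processing_seconds = time.time() - start

table.put_item(Item={
    'run_id': 'i3.large#Speed_Camera_Violations_6.csv#KMeans',   # hypothetical key format
    'instance_type': 'i3.large',
    'method': 'KMeans',
    'processing_seconds': Decimal(str(processing_seconds)),      # DynamoDB requires Decimal, not float
})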
Result Processing
- Once data is collected, use analytical methods to predict
  - Execution time
  - Price
- A new test simply extends the DynamoDB data, making predictions more precise
- Clustering: data is divided into clusters and a graph is created
- NLP classification: the model is trained and the test-set reviews are classified
  - A confusion matrix is created (5 stars: a 5x5 matrix; see the sketch below)
  - Correct-prediction ratios are saved

[[1289   73   12    0    0]
 [ 384   55   12    0    0]
 [ 103   23    5    2    1]
 [  51    4    0    0    0]
 [  33    5    0    0    1]]
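A sketch of how a 5x5 confusion matrix like the one above and the correct-prediction ratio can be produced with sklearn; y_test and y_pred stand in for the held-out star ratings and the classifier output, and are random placeholders here.

import numpy as np
from sklearn.metrics import confusion_matrix

y_test = np.random.randint(1, 6, 2053)   # placeholder true star ratings (1-5)
y_pred = np.random.randint(1, 6, 2053)   # placeholder predicted ratings

cm = confusion_matrix(y_test, y_pred, labels=[1, 2, 3, 4, 5])
correct_ratio = np.trace(cm) / cm.sum()  # diagonal = correctly classified reviews
print(cm)
print('correct prediction ratio:', round(correct_ratio, 3))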
Results (four slides of graphs generated from the collected data)
Lessons Learnt
- Nltk is slow for stemming: 3-4 minutes for 1000 reviews (see the sketch below)
- Large memory constraints for training
  - Hierarchical Clustering (AgglomerativeClustering)
  - DBSCAN
- Prefer csv files to json
- Optimize code
- Be skeptical about online examples
- Experiment
- Automate when possible
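A sketch of the kind of nltk stemming loop whose cost (3-4 minutes per 1000 reviews) motivated the lesson above; the stemmer choice and the whitespace tokenization are assumptions, not necessarily the exact pipeline used.

import time
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
reviews = ['great product, works as expected'] * 1000   # placeholder review texts

start = time.time()
stemmed = [[stemmer.stem(word) for word in review.split()] for review in reviews]
print('stemmed 1000 reviews in', round(time.time() - start, 2), 's')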
Future Work
- Serverless microservices (Lambda)
- Neural networks (TensorFlow, MxNet, Torch)
- NLP (word databases)
- Spark clusters