Open Systems Technologies Data Analyst Internship:
AWS Recommendation System Masters Project

Joe McCartney
Data Science and Analytics, Grand Valley State University, Allendale, Michigan 49401

Introduction

Open Systems Technologies (OST) is an integrated, cross-functional business technology firm bringing together strategy & insights, digital experiences, connected products, data center transformation and enterprise managed services for clients to optimize and grow their businesses. I had the opportunity to be part of the connected products team as a data analytics intern. My main project was designing and creating a data pipeline for a recommendation system for Herman Miller using Amazon Web Services (AWS).

Pipeline Steps

1) AWS RDS
An AWS RDS Microsoft SQL Server instance holds all of the data. The database contains 20 tables, 14 of which are currently used in modeling. The data is used through routine extracts rather than a live connection.

2) AWS Glue
The Glue step contains 4 parts:
A Glue Connection allows Glue to access the data in RDS.
A scheduled Glue Crawler checks for any changes to the database's schema.
The Glue Job transfers the data from RDS to S3 as CSV files.
A scheduled Glue Trigger routinely launches the Glue Job.
This ETL process is needed to change the format and structure of the data for use in SageMaker.

4) AWS SageMaker
A Jupyter notebook trains a model that is hosted on an endpoint. The model is currently built with XGBoost. The resulting model can easily be referenced to make new predictions. SageMaker simplifies machine learning by being highly customizable and by connecting to the rest of the AWS platform.

6) Website
A user can input what kind of product they are looking for and the main drivers behind their choice. The system then takes those inputs and returns a Herman Miller product based on the model.
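The SageMaker and website steps above can be sketched as a small client that sends a user's inputs to the hosted endpoint and maps the prediction back to a product. This is a minimal sketch, not the project's actual code: it assumes the XGBoost model accepts one CSV row of numeric features and returns a single class index, and the endpoint name, feature encoding, and product labels are all hypothetical. The runtime client is passed in, so in practice it would be `boto3.client("sagemaker-runtime")`.

```python
def build_csv_payload(features):
    """Encode one row of numeric features as a CSV line for the endpoint."""
    return ",".join(str(value) for value in features)


def parse_class_index(raw_body):
    """Parse the endpoint response, assumed to be one class index such as b'2.0'."""
    return int(float(raw_body.decode("utf-8").strip()))


def recommend(runtime_client, endpoint_name, features, product_names):
    """Ask the hosted model for a prediction and return the matching product label.

    runtime_client is expected to behave like boto3.client("sagemaker-runtime").
    product_names is a hypothetical list mapping class index to product label.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,   # hypothetical endpoint name
        ContentType="text/csv",       # assumed input format
        Body=build_csv_payload(features),
    )
    class_index = parse_class_index(response["Body"].read())
    return product_names[class_index]
```

Passing the client as an argument keeps the sketch testable without AWS credentials: any object with a compatible `invoke_endpoint` method can stand in for the real runtime client.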
Pipeline Overview

1) Data is stored in an RDS MS SQL database
2) Glue functions transfer data from RDS to S3
3) An S3 bucket holds data from RDS for SageMaker
4) SageMaker uses machine learning to create a model, and the result is hosted on an endpoint
5) Lambda and CloudFormation are used to recreate the pipeline in other AWS environments
6) The website uses the endpoint to make recommendations to users

3) AWS S3
The S3 bucket contains a different folder and CSV file for each of the tables from RDS. The bucket is needed because SageMaker ingests its data from S3, so it acts as intermediate storage for the process. A batch process is used so that the pipeline works with a known quantity of data and does not interrupt users.

5) AWS Lambda & CloudFormation
Lambda and CloudFormation generate most of the pipeline in any AWS environment. It takes less than a minute to generate all Glue and S3 components and portions of SageMaker. The code is fully dynamic, so setting values takes little time. This will allow for quick creation of other recommendation projects.

What’s Next

New machine learning algorithm
ETL improvements
Retraining the model
Scheduling with CloudWatch or Step Functions
Allow the model to include user feedback
CloudFormation improvements

Figure 1. The overall pipeline of the AWS project.
Figure 2. An example of the algorithm making a recommendation based on certain values.
Figure 3. CloudFormation allows for the duplication of AWS processes in the Herman Miller AWS environment based on what was made in OST’s environment (diagram labels: OST AWS, HMI AWS).
Figure 4. The website interface that utilizes the endpoint to recommend products to users.

Acknowledgments

I thank all of the members of the Connected Products team at OST for their help and guidance.
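The Lambda and CloudFormation step above describes generating pipeline resources from dynamic values. The following is a minimal sketch of that idea, assuming a Python function builds a parameterized CloudFormation template; it is not the project's actual template. The logical IDs (`ExtractBucket`, `ExtractTrigger`), the parameter names, and the default schedule are hypothetical, and the Glue job named by `GlueJobName` is assumed to exist already.

```python
import json


def build_pipeline_template(schedule="cron(0 2 * * ? *)"):
    """Return a CloudFormation template (as a dict) with an S3 bucket and a
    scheduled Glue trigger. Names and the default schedule are illustrative."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # Values supplied per environment, so the same template can be reused.
            "BucketName": {"Type": "String"},
            "GlueJobName": {"Type": "String"},
        },
        "Resources": {
            "ExtractBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"BucketName": {"Ref": "BucketName"}},
            },
            "ExtractTrigger": {
                "Type": "AWS::Glue::Trigger",
                "Properties": {
                    "Type": "SCHEDULED",
                    "Schedule": schedule,
                    "StartOnCreation": True,
                    # Launches an existing Glue job on the schedule above.
                    "Actions": [{"JobName": {"Ref": "GlueJobName"}}],
                },
            },
        },
    }


# Serialize to JSON, the form CloudFormation accepts as a template body.
template_json = json.dumps(build_pipeline_template(), indent=2)
```

Because the environment-specific values are template parameters, the same JSON could be deployed to a second AWS account by changing only the parameter values at stack creation.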