Data Warehousing the Easy Way with AWS Redshift: Case Study
About Field Nation
Field Nation is the contingent work platform for business. We are the business hub helping enterprises get their critical work done through freelancers, service providers, and their own workforce.
  Landed $30M growth equity with Susquehanna in Q4 2015
  Tekne Award winner for Top Information Technology Services (MHTA, 2015)

About Me
  Data Scientist at Field Nation
  Worked on a variety of data warehouse teams as a consultant or employee
  M.S. in Predictive Analytics
Agenda
  Background
  Introduction to Amazon Redshift
  Solution Approach
  Solution Stages
    Data Pipeline
    Data Staging
    Data Presentation
  Advantages
Background
Situation
  Multiple data warehouse platforms
  Data not accessible to users
  Short timeline for results
Lynn Langit
  Big Data and Cloud Architect. Technical author.
  Community technical education partner awards from AWS, Google, and Microsoft
  https://lynnlangit.com/ @lynnlangit
Big Relational
  Use a cloud data warehouse platform
  Postpones the need to introduce the complexities of Hadoop
Origin story
  Microsoft and Redshift with custom scripting
  Main dashboard was an Excel spreadsheet
http://www.kdnuggets.com/2015/02/big-data-trends-strata-hadoop-san-jose.html
Redshift: Amazon’s hosted data warehouse platform
Fully managed
  AWS handles backups, resizing, fault tolerance, etc.
  Strong partner network
  Integrates well with other AWS services such as S3 and Kinesis
Familiar interface
  Acts like a standard Postgres relational database
  Use SQL for querying and management
Optimized for performance
  Column store
  Massively Parallel Processing (MPP) architecture
  Can scale up to multiple petabytes
  Data compression
  Interleaved sorting
Similar to Azure SQL Data Warehouse and Google BigQuery
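To make the performance features above concrete, here is a minimal sketch that assembles a Redshift CREATE TABLE statement with per-column compression (ENCODE), a distribution key, and an interleaved sort key. The table, columns, and encoding choices are hypothetical and chosen only for illustration; the talk does not show the actual DDL used.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    sql_type: str
    encoding: str = "raw"  # Redshift column compression encoding


def create_table_ddl(table, columns, dist_key, sort_keys):
    """Build a Redshift CREATE TABLE statement as a string."""
    col_lines = ",\n".join(
        f"    {c.name} {c.sql_type} ENCODE {c.encoding}" for c in columns
    )
    return (
        f"CREATE TABLE {table} (\n{col_lines}\n)\n"
        f"DISTKEY ({dist_key})\n"
        f"INTERLEAVED SORTKEY ({', '.join(sort_keys)});"
    )


# Hypothetical table and columns, for illustration only.
ddl = create_table_ddl(
    "presentation.work_orders",
    [
        Column("work_order_id", "BIGINT"),
        Column("company_id", "INTEGER"),
        Column("status", "VARCHAR(32)", "lzo"),
        Column("created_at", "TIMESTAMP"),
    ],
    dist_key="company_id",
    sort_keys=["company_id", "created_at"],
)
print(ddl)
```

The generated statement would still need to be run against a cluster with a Postgres-compatible client; sensible keys and encodings depend on the actual query patterns.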
Solution Approach
Philosophy
  Optimize user experience over data processing complexity
  Spend the bulk of the time solving unique business problems that can’t be outsourced
  Any commodity work that can be automated or outsourced to other parties should be
Assumptions
  Data sources are accessible from the cloud
  Budget exists to cover software licensing fees
  Data is structured
  Storage / computational complexity
Your data is stored in the cloud, either in hosted databases like Amazon Web Services (AWS) Relational Database Service (RDS) or in popular SaaS platforms such as Salesforce and Zendesk.
Data Pipeline: Getting data into Redshift
  Moves data from raw data sources into Redshift
  Service provided by a third-party vendor
  Can load from a variety of data sources
  Each data source is loaded into a different schema
First of three stages
Data Pipeline Example First of three stages
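The "one schema per data source" layout can be sketched as follows. In the approach described, a third-party pipeline service does the loading, so this only illustrates the resulting namespace; the source and table names are hypothetical.

```python
# Hypothetical mapping of raw data sources to the tables each one lands.
SOURCES = {
    "mysql_app": ["work_orders", "companies"],
    "salesforce": ["account", "opportunity"],
    "zendesk": ["tickets"],
}


def schema_statements(sources):
    """One CREATE SCHEMA statement per data source, in sorted order."""
    return [f"CREATE SCHEMA IF NOT EXISTS {name};" for name in sorted(sources)]


def qualified_tables(sources):
    """Schema-qualified table names, e.g. 'zendesk.tickets'."""
    return [
        f"{schema}.{table}"
        for schema in sorted(sources)
        for table in sources[schema]
    ]


for stmt in schema_statements(SOURCES):
    print(stmt)
print(qualified_tables(SOURCES))
```

Keeping each source in its own schema means two sources can both have a table called `account` without colliding, and raw data stays clearly separated from anything transformed later.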
Data Pipeline Vendors
Databases
  MySQL / Aurora
  Postgres
  SQL Server
  MongoDB
  Elasticsearch
SaaS
  Salesforce
  Zendesk
  Google Analytics
  Snowplow
  MailChimp
  Mixpanel
Data Staging: Transforming data inside Redshift
  All raw data is available inside Redshift schemas
  Use a transform tool to convert the data into a staging area
  Results in a clean, normalized schema
Data Staging Example Second of three stages
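A staging transform of this kind might look like the sketch below. SQLite stands in for Redshift so the example runs anywhere, and attached in-memory databases mimic separate schemas; the table, columns, and cleaning rules are assumptions, not the ones used in the case study.

```python
import sqlite3

# SQLite stands in for Redshift here; ATTACH mimics separate schemas.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("ATTACH DATABASE ':memory:' AS raw_app")
conn.execute("ATTACH DATABASE ':memory:' AS staging")

# Hypothetical raw table as a pipeline might land it: untyped and messy.
conn.execute(
    "CREATE TABLE raw_app.work_orders (id TEXT, status TEXT, amount TEXT)"
)
conn.executemany(
    "INSERT INTO raw_app.work_orders VALUES (?, ?, ?)",
    [
        ("1", " Completed ", "120.50"),
        ("1", " Completed ", "120.50"),  # duplicate row
        ("2", "ASSIGNED", "80"),
    ],
)

# The transform step: cast types, normalize text, drop duplicates.
conn.execute("""
    CREATE TABLE staging.work_orders AS
    SELECT DISTINCT
        CAST(id AS INTEGER)  AS work_order_id,
        LOWER(TRIM(status))  AS status,
        CAST(amount AS REAL) AS amount
    FROM raw_app.work_orders
""")

rows = conn.execute(
    "SELECT work_order_id, status, amount "
    "FROM staging.work_orders ORDER BY work_order_id"
).fetchall()
print(rows)
```

Because the transform is plain SQL running inside the warehouse, the same statement could be issued by an ELT tool or a script.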
Data Staging Tools
ETL tools
  Talend
  SnapLogic
  Informatica
ELT tools
  Matillion
Scripts
  Python scripts
  SQL
Big data
  Spark
Data Presentation: Making data available for users
  Use a fully denormalized schema
    Simple to query for end users
    Joins are expensive in Redshift
  Star schema is unnecessary
    Eliminates the need for slowly changing dimensions, bridge tables, etc.
Data Presentation Example Third of three stages
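As a rough illustration of a fully denormalized presentation table, the sketch below joins two staging tables once at build time so that end users never write a join. SQLite again stands in for Redshift, and all table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical staging tables (normalized).
    CREATE TABLE stg_companies (company_id INTEGER, company_name TEXT, region TEXT);
    CREATE TABLE stg_work_orders (work_order_id INTEGER, company_id INTEGER, amount REAL);
    INSERT INTO stg_companies VALUES (10, 'Acme', 'Midwest'), (20, 'Globex', 'West');
    INSERT INTO stg_work_orders VALUES (1, 10, 120.5), (2, 10, 80.0), (3, 20, 45.0);

    -- Presentation table: the join is paid once, at build time.
    CREATE TABLE pres_work_orders AS
    SELECT w.work_order_id, w.amount, c.company_name, c.region
    FROM stg_work_orders AS w
    JOIN stg_companies AS c ON c.company_id = w.company_id;
""")

# End users query a single wide table with no joins.
totals = conn.execute("""
    SELECT region, SUM(amount)
    FROM pres_work_orders
    GROUP BY region
    ORDER BY region
""").fetchall()
print(totals)
```

The trade-off is repeated values such as `company_name` on every row, which a column store with compression tends to handle well.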
Data Presentation Tools
Advantages
Low start-up effort
  Can leverage a robust partner network
Easy to make changes or additions
  Wide selection of tools
  Can build temporary presentation views before committing to building full ETL
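The last point, building a temporary presentation view before committing to full ETL, could be sketched like this. SQLite stands in for Redshift and the table and view names are hypothetical; the idea is that the user-facing query stays the same when the view is later replaced by a built table of the same name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical staging table.
    CREATE TABLE stg_tickets (ticket_id INTEGER, status TEXT, hours_open REAL);
    INSERT INTO stg_tickets VALUES (1, 'open', 5.0), (2, 'closed', 30.0), (3, 'open', 12.0);

    -- Step 1: a quick view gives users something to query right away.
    CREATE VIEW pres_open_tickets AS
    SELECT ticket_id, hours_open FROM stg_tickets WHERE status = 'open';
""")

USER_QUERY = "SELECT COUNT(*), MAX(hours_open) FROM pres_open_tickets"
from_view = conn.execute(USER_QUERY).fetchone()

# Step 2: once the shape is agreed on, replace the view with a built table
# of the same name, as a full ETL job would.
conn.executescript("""
    CREATE TABLE pres_open_tickets_built AS SELECT * FROM pres_open_tickets;
    DROP VIEW pres_open_tickets;
    ALTER TABLE pres_open_tickets_built RENAME TO pres_open_tickets;
""")
from_table = conn.execute(USER_QUERY).fetchone()

print(from_view, from_table)
```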
Questions
eric.ness@fieldnation.com
https://www.linkedin.com/in/ericnessdata