Published by Lynette Allison. Modified over 8 years ago.
1
Building a Data Lake for Digital Music Dominance
Joe Caserta, President
Elliott Cordo, Chief Architect
September 30, 2015, Javits Center, New York City
2
Big Data Strategy Innovation Technical Implementation Awards and Recognition
5
The Music Maze
7
Build a Dynamic Platform – Paradigm Shift
OLD WAY: Structure → Ingest → Analyze; Fixed Capacity; Monolith
NEW WAY: Ingest → Analyze → Structure; Dynamic Capacity; Ecosystem
RECIPE: Cloud; Data Lake; Polyglot Warehouse
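The "new way" ordering (ingest first, structure last) can be illustrated with a small schema-on-read sketch. This is an illustration only, not code from the talk: the sample events, the field names, and the `read_with_schema` helper are all hypothetical.

```python
import json

# Raw events are stored exactly as they arrived (hypothetical sample data).
RAW_LINES = [
    '{"track": "a", "plays": "3", "device": "ios"}',
    '{"track": "b", "plays": "5"}',
    '{"track": "a", "plays": "4", "campaign": "fall"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep and cast only the fields this analysis needs."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record[name]) for name, cast in schema.items() if name in record}


# Structure is decided by the question being asked, not at ingestion time.
plays_by_track = {}
for row in read_with_schema(RAW_LINES, {"track": str, "plays": int}):
    plays_by_track[row["track"]] = plays_by_track.get(row["track"], 0) + row["plays"]

print(plays_by_track)
```

Fields that a given analysis does not ask for (`device`, `campaign`) are simply ignored, so new attributes can land in the lake without any upfront modeling work.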
8
Move to the Cloud
Existing On-Premise Solution
– Challenges with operations of Hadoop servers in Data Center
– Increasing infrastructure complexity
– Keeping up with data growth
Cloud Advantages
– Reduced upfront capital investment
– Faster speed to value
– Elasticity
“Those that go out and buy expensive infrastructure find that the problem scope and domain shift really quickly. By the time they get around to answering the original question, the business has moved on.” - Matt Wood, AWS
9
Cost savings of dynamic capacity
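A rough way to see where the savings come from is to compare a fleet sized for peak demand against one resized every hour. The demand curve and the price below are made-up numbers for illustration, not figures from the talk.

```python
# Hypothetical hourly demand (nodes needed) over one day; not from the talk.
HOURLY_DEMAND = [2, 2, 2, 2, 2, 2, 4, 6, 8, 10, 10, 8,
                 6, 6, 8, 10, 12, 10, 6, 4, 2, 2, 2, 2]
NODE_HOUR_PRICE = 0.50  # assumed price per node-hour, in dollars


def fixed_capacity_cost(demand, price):
    """Cost when the fleet is sized for the peak hour and runs all day."""
    return max(demand) * len(demand) * price


def elastic_capacity_cost(demand, price):
    """Cost when the fleet is resized every hour to match demand."""
    return sum(demand) * price


fixed = fixed_capacity_cost(HOURLY_DEMAND, NODE_HOUR_PRICE)
elastic = elastic_capacity_cost(HOURLY_DEMAND, NODE_HOUR_PRICE)
print(f"fixed: ${fixed:.2f}, elastic: ${elastic:.2f}, saved: {1 - elastic / fixed:.0%}")
```

The gap between the two grows with the spikiness of the workload: the further the peak sits above the average, the more a fixed fleet pays for idle capacity.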
10
Elasticity not only saves money
11
Essentially, Servers Suck
But more importantly, think:
– Infrastructure as code
– Your servers should be API calls
– Use stateless processes
– Make all resources ephemeral
– Make everything scalable and elastic!
12
Ephemeral?
Disposable:
– Processing Fleets
– Elastic MapReduce Clusters
– Redshift Clusters
Use distributed services and systems to maintain state and preserve your data:
– Cassandra, Dynamo
– S3
13
Anatomy of our Processing Fleet
Diagram labels: S3 Input Buckets, Auto-scaling, Queuing service, S3 Output Buckets
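The fleet pattern on this slide can be sketched with in-memory stand-ins. This is a simplified illustration under stated assumptions: plain dictionaries stand in for the S3 buckets, a `deque` stands in for the queuing service, and `desired_workers`, `process`, and `run_worker` are hypothetical names rather than the presenters' code.

```python
import math
from collections import deque


def desired_workers(queue_depth, msgs_per_worker=100, min_workers=1, max_workers=20):
    """Scale the fleet with the backlog, within fixed bounds (assumed policy)."""
    needed = math.ceil(queue_depth / msgs_per_worker)
    return max(min_workers, min(max_workers, needed))


def process(raw):
    """Placeholder transformation; a real worker would parse and map the file."""
    return raw.upper()


def run_worker(queue, input_bucket, output_bucket):
    """Stateless worker: all state lives in the queue and the buckets."""
    while queue:
        key = queue.popleft()
        output_bucket[key] = process(input_bucket[key])


# In-memory stand-ins for the S3 buckets and the queuing service.
input_bucket = {"plays/2015/09/30/a.json": "track-a", "plays/2015/09/30/b.json": "track-b"}
output_bucket = {}
queue = deque(input_bucket)  # one message per newly landed object

run_worker(queue, input_bucket, output_bucket)
print(output_bucket)
```

Because a worker holds nothing between messages, any instance can be terminated or added at will, which is what makes scaling on queue depth safe.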
14
Elastic MapReduce
– Hadoop on Demand
– No Operations: your cluster dies, so what?
– Bootstrap whatever processing engine makes sense
– Programmatically estimate instance type and cluster size
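Estimating instance type and cluster size programmatically might look like the sketch below. The talk does not describe its heuristic, so everything here is an assumption: the two-entry instance catalogue, the throughput figures, and the size threshold are invented for illustration.

```python
import math

# Hypothetical catalogue: instance name -> GB of input one node can process per hour.
INSTANCE_THROUGHPUT_GB_PER_HOUR = {"small": 20, "large": 80}


def estimate_cluster(input_gb, deadline_hours, large_threshold_gb=500):
    """Pick an instance type and node count so the job fits within the deadline."""
    instance_type = "large" if input_gb >= large_threshold_gb else "small"
    per_node = INSTANCE_THROUGHPUT_GB_PER_HOUR[instance_type] * deadline_hours
    nodes = max(1, math.ceil(input_gb / per_node))
    return instance_type, nodes


print(estimate_cluster(100, 2))
print(estimate_cluster(1000, 2))
```

In practice the throughput figures would come from measurements of earlier runs, and the result would be passed to whatever API launches the cluster.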
15
You May Need Some Persistent Servers
If at all possible they should be inherently scalable, distributed, and elastic
16
Move to a Data Lake Paradigm
Technology:
– Scalable distributed storage: S3
– Pluggable fit-for-purpose processing: EMR
Functional Capabilities:
– Remove barriers from data ingestion and analysis
– Storage and processing for all data
– Tunable Governance
17
Putting it together: The Big Data Pyramid
Columns: Usage Pattern, Data Governance
Labels: Ingest Raw Data; Organize, Define, Complete; Munging, Blending; Machine Learning; Data Quality and Monitoring; Metadata, ILM, Security; Data Catalog; Data Integration; Fully Governed (trusted); Arbitrary/Ad-hoc Queries and Reporting
18
Data Ingestion and Onboarding
Incoming to S3:
– Lightweight API wrapper
– Web front end
– Direct writes to S3
Ingest the data in a reasonable partitioning schema: Bucket and Keys
Turn analysts and data scientists loose
Late bind analytics
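One possible shape for a "reasonable partitioning schema" is a key built from the source, the dataset, and the event date. The layout below (including the `year=/month=/day=` convention and the sample names) is an assumption for illustration; the talk does not specify its key format.

```python
from datetime import datetime, timezone


def build_key(source, dataset, event_time, filename):
    """Build an object key partitioned by source, dataset, and event date."""
    return (
        f"{source}/{dataset}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"{filename}"
    )


key = build_key(
    "streaming-partner",  # hypothetical data provider
    "plays",              # hypothetical dataset name
    datetime(2015, 9, 30, tzinfo=timezone.utc),
    "part-0001.json",
)
print(key)  # streaming-partner/plays/year=2015/month=09/day=30/part-0001.json
```

Keys laid out this way let a job read a single day or a single provider by prefix, without scanning the whole bucket.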
19
But we need to feed the cash register
Data needs to be refined and mapped:
– Processing Fleet
– EMR
80/20 rule: metadata driven when possible
Abstract away “Big Data”
And make sure it’s right!
– Automated data quality checks using HAMBOT, soon to be open sourced
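A metadata-driven data quality check can be sketched as rules held in data and a small generic checker. This is not HAMBOT, whose design the talk does not describe; the rule format, column names, and `check_row` helper are hypothetical.

```python
# Hypothetical rule metadata: one entry per column.
RULES = {
    "track_id": {"required": True},
    "play_count": {"required": True, "min": 0},
    "country": {"required": False, "allowed": {"US", "GB", "DE"}},
}


def check_row(row, rules):
    """Return a list of human-readable violations for one record."""
    problems = []
    for column, rule in rules.items():
        value = row.get(column)
        if value is None:
            if rule.get("required"):
                problems.append(f"{column}: missing")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{column}: {value} below minimum {rule['min']}")
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{column}: {value!r} not allowed")
    return problems


rows = [
    {"track_id": "t1", "play_count": 10, "country": "US"},
    {"track_id": None, "play_count": -5, "country": "XX"},
]
report = {i: check_row(row, RULES) for i, row in enumerate(rows)}
print(report)
```

Keeping the rules in data rather than in code means a new feed can be covered by adding a rule set, which fits the slide's "metadata driven when possible" guideline.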
20
Think Data Ecosystem, Not Tech Stack
“…any decent sized enterprise will have a variety of different data technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it.” - Martin Fowler
21
Polyglot in Practice
Best practices from traditional EDW:
– Consolidation
– Data Governance
– Master Data
– Tuned for analytics
Applied to:
– Fit-for-purpose technologies and approaches
– Relational, MPP, Graph, KV, TimeseriesDB, Data Lake
Apply “tunable governance” and traditional principles
Use the right tool for the job
22
The Landscape for Digital Dominance
Landing Queue Data Lake BDW Data Science API Data Providers Near Real-time Batch Data Science Clusters EDW Graph RDS Metastore
23
Thank You
Joe Caserta, President, Caserta Concepts, joe@casertaconcepts.com, @joe_Caserta
Elliott Cordo, Chief Architect, Caserta Concepts, elliott@casertaconcepts.com
Award-winning company
– Transformative Data Strategies
– Modern Data Engineering
– Advanced Architecture
Innovation Partner
– Strategic Consulting
– Advanced Technical Design
– Build & Deploy Solutions
BDW Meetup, New York City
– 3,000+ members
– Knowledge sharing
Data is not important, it’s what you do with it that’s important!