Published by Cornelia Watkins. Modified over 6 years ago.
1
How to build consistent, scalable workspaces for data science teams
Elaine Lee
2
Data science is hard. Doing data science is even harder.
Managing dependencies
Ensuring enough resources
3
Nail it down
Identify system requirements for base Docker image
Stabilize dependencies for data science work environment
Increase test coverage
Get continuous integration (CI) platform on the same page
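One way to read "stabilize dependencies" is to require an exact version for every package before the base image is built. The sketch below assumes a pip-style requirements file, which the slides do not confirm (the deck also mentions an R interface); the function name `unpinned` and the package versions in the usage note are made up for illustration.

```python
def unpinned(requirements_text):
    """Return requirement lines that are not pinned to an exact version.

    Assumption: pip-style syntax, where 'name==version' is an exact pin.
    """
    loose = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "==" not in line:
            loose.append(line)
    return loose
```

A CI job could fail the build when `unpinned(...)` returns a non-empty list, for example on a file containing `pandas>=0.18` or a bare `scipy`.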
4
Scale it up
Create a pool of worker machines ready to accept jobs
Set up an asynchronous task queue
Provide a simple command line interface for data scientists
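The worker pool and task queue can be sketched in a single process. This is only an illustration of the pattern: threads stand in for worker machines and the standard-library `queue.Queue` stands in for whatever queue service the team actually used, which the slides do not name. `run_worker` and `run_jobs` are hypothetical names.

```python
import queue
import threading


def run_worker(tasks, results):
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = tasks.get()
        if job is None:
            tasks.task_done()
            break
        job_id, func, args = job
        try:
            results[job_id] = ("ok", func(*args))
        except Exception as exc:
            # A failing job is recorded, and the worker keeps serving the queue.
            results[job_id] = ("error", repr(exc))
        finally:
            tasks.task_done()


def run_jobs(jobs, n_workers=3):
    """Submit (func, args) pairs to a pool of workers and collect results by job id."""
    tasks = queue.Queue()
    results = {}
    workers = [
        threading.Thread(target=run_worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    for job_id, (func, args) in enumerate(jobs):
        tasks.put((job_id, func, args))
    for _ in workers:
        tasks.put(None)  # one shutdown sentinel per worker
    tasks.join()
    for worker in workers:
        worker.join()
    return results
```

Calling `run_jobs([(pow, (2, 3)), (divmod, (7, 0))])` records a successful result for the first job and an error for the second without stopping the pool.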
5
Putting it all together
[Diagram]
Continuous integration: Commit pushed to GitHub → Pull changes → Start Docker container → Run test suite → Report Pass/Fail → Export image for commit
Workers: Request arrives in queue → Get image for commit → Start container from image → Run task → Report result
Diagram labels: 123abc…, S3, dockeRization, image
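The hand-off between CI and the workers in this diagram can be sketched as two functions that share an image store keyed by commit. Everything here is an assumption made for illustration: a plain dict stands in for the S3 bucket, a string stands in for the exported Docker image, and the key layout in `image_key` is invented.

```python
def image_key(commit_sha):
    """Hypothetical storage key for the image exported for a commit."""
    return f"images/{commit_sha}.tar"


def ci_step(commit_sha, run_tests, store):
    """Run the test suite and export an image only when it passes."""
    passed = run_tests()
    if passed:
        # Stand-in for exporting the container image to shared storage.
        store[image_key(commit_sha)] = f"image built at {commit_sha}"
    return passed


def worker_step(commit_sha, task, store):
    """Fetch the image for a commit and run a task against it."""
    key = image_key(commit_sha)
    if key not in store:
        raise LookupError(f"no image exported for commit {commit_sha}")
    image = store[key]
    # Stand-in for starting a container from the image and running the task.
    return task(image)
```

Because a worker can only start from an image that CI exported after a passing test run, a task requested for an untested or failing commit is rejected instead of running in an unknown environment.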
6
Benefits
Flexible to any composition of EC2 instances
  Extensible to EMR
One-time configuration
  EC2 AMI
Task environment guaranteed
  Isolated from other tasks
  Identical to conditions at time of development
Extensible command line interface
  R interface
  Cluster management
  Job monitoring
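An extensible command line interface is often built from subcommands, so that features such as job monitoring can be added later without changing existing ones. The sketch below uses Python's `argparse`; the program name `jobs`, the `submit` and `status` subcommands, and their arguments are all hypothetical and are not taken from the talk.

```python
import argparse


def build_parser():
    """Build a hypothetical 'jobs' CLI with one subcommand per feature."""
    parser = argparse.ArgumentParser(prog="jobs")
    sub = parser.add_subparsers(dest="command", required=True)

    # Submit a script to run against the image built for a given commit.
    submit = sub.add_parser("submit", help="queue a task")
    submit.add_argument("script")
    submit.add_argument("--commit", required=True)

    # Job monitoring: look up a previously submitted job.
    status = sub.add_parser("status", help="show the state of a job")
    status.add_argument("job_id")
    return parser
```

For example, `build_parser().parse_args(["submit", "train.R", "--commit", "deadbeef"])` yields a namespace with `command="submit"`, which a dispatcher could map to the function that places the request on the queue.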
7
Use case: Quality assurance
CI testing
Other tests
  Data validation
  Model consistency
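The two "other tests" could look roughly like the checks below. This is a sketch under assumptions: rows are plain dicts, a schema is a mapping from column name to Python type, and "model consistency" is taken to mean that a rebuilt model reproduces earlier predictions within a tolerance. The slides do not define either check, and both function names are hypothetical.

```python
import math


def validate_rows(rows, required):
    """Return (row_index, problem) pairs for rows that violate the schema."""
    problems = []
    for i, row in enumerate(rows):
        for col, typ in required.items():
            if col not in row:
                problems.append((i, f"missing {col}"))
            elif not isinstance(row[col], typ):
                problems.append((i, f"{col} is not {typ.__name__}"))
    return problems


def predictions_consistent(old, new, tol=1e-6):
    """Check that two prediction lists agree element by element within tol."""
    if len(old) != len(new):
        return False
    return all(math.isclose(a, b, abs_tol=tol) for a, b in zip(old, new))
```

Either check can run as a queued task inside the image for a given commit, so a failure points at a specific code version.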
8
Use case: Parallelizable tasks
Data manipulation
Feature engineering
Model builds
  Advanced machine learning algorithms
  Hyperparameter search
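Hyperparameter search parallelizes naturally because each parameter combination can be scored independently. The sketch below fans a grid out over a local thread pool; in the setup the slides describe, each combination would presumably be a separate job on the task queue instead. The scoring function `toy_score` and the parameter names `depth` and `rate` are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def expand_grid(grid):
    """Turn {'name': [values]} into a list of one dict per combination."""
    names = sorted(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


def toy_score(params):
    """Hypothetical objective that is highest at depth=3, rate=0.1."""
    return -((params["depth"] - 3) ** 2) - abs(params["rate"] - 0.1)


def search(grid, score, n_workers=4):
    """Score every combination in parallel and return the best one."""
    candidates = expand_grid(grid)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        scores = list(pool.map(score, candidates))  # results keep input order
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

Calling `search({"depth": [1, 3, 5], "rate": [0.01, 0.1]}, toy_score)` evaluates six combinations and returns the one with `depth=3` and `rate=0.1`.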
9
Elaine Lee
Data Engineer
elaine@elaineklee.com
@elaineklee
avant.com