How to build consistent, scalable workspaces for data science teams

1 How to build consistent, scalable workspaces for data science teams
Elaine Lee

2 Data science is hard. Doing data science is even harder.
Managing dependencies
Ensuring enough resources

3 Nail it down
Identify system requirements for base Docker image
Stabilize dependencies for data science work environment
Increase test coverage
Get continuous integration (CI) platform on the same page
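One way to keep dependencies stable is to check the running environment against its pinned versions, for example as a step in CI. The following is a minimal Python sketch, assuming pins are written as `name==version` lines (the slides do not say how pins are stored). `parse_pins`, `installed_version`, and `find_drift` are hypothetical helper names.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if not sep:
            # Anything without an exact pin is treated as an error here.
            raise ValueError(f"unpinned requirement: {line!r}")
        pins[name.strip().lower()] = version.strip()
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_drift(pins, lookup=installed_version):
    """Return {name: (pinned, found)} for every package that differs from its pin."""
    drift = {}
    for name, pinned in pins.items():
        found = lookup(name)
        if found != pinned:
            drift[name] = (pinned, found)
    return drift
```

An empty result from `find_drift` means the environment matches its pins; a non-empty one could be used to fail the build.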

4 Scale it up
Create a pool of worker machines ready to accept jobs
Set up an asynchronous task queue
Provide a simple command line interface for data scientists
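The worker pool and task queue can be illustrated in miniature with Python's standard library. This is a single-process stand-in, not the distributed setup the slides describe: threads play the role of worker machines and `queue.Queue` plays the role of the task queue. `run_jobs` is a hypothetical name.

```python
import queue
import threading


def run_jobs(jobs, n_workers=4):
    """Run {job_id: callable} on a pool of worker threads; return {job_id: outcome}."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()
            if item is None:          # sentinel: no more work for this worker
                tasks.task_done()
                return
            job_id, func = item
            try:
                outcome = ("ok", func())
            except Exception as exc:  # record the failure, keep the worker alive
                outcome = ("error", repr(exc))
            with lock:
                results[job_id] = outcome
            tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for job_id, func in jobs.items():
        tasks.put((job_id, func))
    for _ in threads:
        tasks.put(None)               # one sentinel per worker
    tasks.join()
    for t in threads:
        t.join()
    return results
```

Submitting work is decoupled from running it: the caller only puts items on the queue, and whichever worker is free picks up the next one.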

5 Putting it all together
[Diagram: dockeRization workflow]
On commit: Commit pushed to GitHub -> Pull changes -> Start Docker container -> Run test suite -> Report Pass/Fail -> Export image for commit
On request: Request arrives in queue -> Get image for commit -> Start container from image -> Run task -> Report result
Other labels: workers, s3, image, 123abc…
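The flow on this slide can be sketched as two command lists, one for the CI side and one for a worker. This is an illustration under several assumptions, not the speaker's implementation: it uses `docker build` in place of "pull changes, start container", moves images as tar archives with `docker save` / `docker load`, and copies them with `aws s3 cp`. The repository name, bucket name, and function names are hypothetical, and nothing is executed; the functions only build argument lists.

```python
def image_tag(repo, commit):
    """Tag the image with the commit hash so each commit maps to one image."""
    return f"{repo}:{commit}"


def ci_commands(repo, commit, bucket, test_cmd):
    """Commands a CI job might run after a commit is pushed (assumed flow)."""
    tag = image_tag(repo, commit)
    archive = f"{commit}.tar"
    return [
        ["docker", "build", "-t", tag, "."],               # image for this commit
        ["docker", "run", "--rm", tag] + list(test_cmd),   # run the test suite
        ["docker", "save", "-o", archive, tag],            # export image
        ["aws", "s3", "cp", archive, f"s3://{bucket}/{archive}"],
    ]


def worker_commands(repo, commit, bucket, task):
    """Commands a worker might run when a request for `commit` arrives."""
    tag = image_tag(repo, commit)
    archive = f"{commit}.tar"
    return [
        ["aws", "s3", "cp", f"s3://{bucket}/{archive}", archive],  # get image
        ["docker", "load", "-i", archive],                         # restore it
        ["docker", "run", "--rm", tag] + list(task),               # run the task
    ]
```

A real pipeline would presumably skip the export step when the tests fail; each list could be passed to `subprocess.run(cmd, check=True)` one command at a time to get that behavior.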

6 Benefits
Flexible to any composition of EC2 instances
Extensible to EMR
One-time configuration
EC2 AMI
Task environment guaranteed
Isolated from other tasks
Identical to conditions at time of development
Extensible command line interface
R interface
Cluster management
Job monitoring
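An extensible command line interface is often built from subcommands, so that features such as job monitoring can be added without changing existing ones. A minimal Python sketch with `argparse` follows; the program name `dsjobs` and the `submit` and `status` subcommands are hypothetical and do not come from the slides.

```python
import argparse


def build_parser():
    """Build a CLI parser where each feature is its own subcommand."""
    parser = argparse.ArgumentParser(prog="dsjobs")
    sub = parser.add_subparsers(dest="command", required=True)

    # Submit a task to run against the image built for a given commit.
    submit = sub.add_parser("submit", help="queue a task for a commit")
    submit.add_argument("commit")
    submit.add_argument("task", nargs="+")

    # Job monitoring: look up a previously submitted job.
    status = sub.add_parser("status", help="check on a queued job")
    status.add_argument("job_id")
    return parser
```

Adding cluster management would then be one more `sub.add_parser(...)` call, leaving `submit` and `status` untouched.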

7 Use case: Quality assurance
CI testing
Other tests
Data validation
Model consistency
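A data validation check of the kind listed here can be as small as a function that scans rows for missing or out-of-range values. The sketch below is illustrative only; the column names and bounds used with it are hypothetical, and `validate_rows` is not a function from the talk.

```python
def validate_rows(rows, required, ranges):
    """Return a list of human-readable problems found in a list of dict rows."""
    problems = []
    for i, row in enumerate(rows):
        # Required columns must be present and not None.
        for col in required:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
        # Numeric columns must fall inside their inclusive [lo, hi] bounds.
        for col, (lo, hi) in ranges.items():
            value = row.get(col)
            if value is not None and not (lo <= value <= hi):
                problems.append(f"row {i}: {col}={value} outside [{lo}, {hi}]")
    return problems
```

Run as a queued task, a check like this can fail loudly before a bad dataset reaches a model build.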

8 Use case: Parallelizable tasks
Data manipulation
Feature engineering
Model builds
Advanced machine learning algorithms
Hyperparameter search
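Hyperparameter search parallelizes well because each candidate is scored independently. The sketch below fans a grid out over a local thread pool as a stand-in for the worker machines; `grid` and `search` are hypothetical names, and the scoring function is supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def grid(space):
    """Expand {name: [values]} into a list of parameter dicts."""
    names = sorted(space)
    return [dict(zip(names, combo)) for combo in product(*(space[n] for n in names))]


def search(score, space, max_workers=4):
    """Score every grid point in parallel; return (best_params, best_score)."""
    candidates = grid(space)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to candidates[i].
        scores = list(pool.map(score, candidates))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In a distributed version, each candidate would become one queued job, and the final `max` would run once all results have been reported.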

9 Elaine Lee
Data Engineer
elaine@elaineklee.com
@elaineklee
avant.com

