Presentation is loading. Please wait.

Presentation is loading. Please wait.

Starting a Data Science Team Dr. Jonathan D. Adler

Similar presentations


Presentation on theme: "Starting a Data Science Team Dr. Jonathan D. Adler"— Presentation transcript:

1 Starting a Data Science Team Dr. Jonathan D. Adler jonathandadler@outlook.com http://jadler.info

2 About me Degrees: BS – Mathematical Sciences (Worcester Polytechnic Institute) MS – Applied Mathematics (Worcester Polytechnic Institute) PhD – Industrial Engineering (Arizona State University) Research – real-time optimization in situations with uncertainty Jobs: Vistaprint – Forecasting, customer segmentation, product recommendation engine Boeing – Market forecasting Promontory Growth and Innovation – Data science consulting Created the data science team from scratch Microsoft Studios – Xbox and Windows 10 analytics 2

3 The realization When a company gets to a certain size, they often realize a few things: They have a lot of data, and lots of questions This data probably has useful information in it There are people out there who can turn the data into useful insights “So lets hire some people to turn that data into useful information” 3

4 There are a LOT of open questions What can we really do with data science? What makes a data science project successful? What are the skills the employees should have? Who should be our first hire? Our second? Our fifth? What are best practices for running the team? 4

5 Outline 1.What can we really do with data science? 2.What makes a data science project successful? 3.What are the skills the employees should have? 4.Who should be our first hire? Our second? Our fifth? 5.What are best practices for running the team? 5

6 1. What you can do with data science 6

7 Types of data science work Data science tends to fall into three broad categories: Investigating – aggregating and inspecting data to get basic insights on what is currently happening Predicting – taking the data and using it to understand what will happen in the future Optimizing – using the data to choose what the best choice of actions will be 7 Complex Simple

8 Investigating Look at historic data to answer direct questions: If you have two products, which is selling better? How many people are buying both? How frequently do customers order? How are sales changing each month? These questions are generally quick to answer and don’t require a mathematical model Difficulty is in knowing which measures to use and how to visualize/represent them Unfortunately, they don’t tell you much (“so what?”) 8

9 Predicting Look at historic data to predict: How likely is a customer going to come back? How will a customer respond to a sale? Is revenue going to increase over time? This information is a lot more meaningful; you are more likely to be able to act on it Requires mathematical modeling: regressions, clustering algorithms, etc. Sometimes the data isn’t there to make the prediction, sometimes the prediction is wrong, or requires more skill to do well 9

10 Optimizing Look at the historic data to make the best decisions How much inventory should be held, and when should you reorder? Which product should you recommend to a website visitor for the most profit? What price should you set for a product? When should it go on sale? These problems are the hardest to get right They also directly provide the most profit 10

11 Things you can’t do with data science Solve problems where the core drivers aren’t in the data, or the signal is too weak in the noise Which start-up companies are going to succeed When is the next recession going to hit Having data isn’t sufficient for success Knowing the last 10,000 flips of a fair coin won’t help me predict the next flip 11

12 2. What makes a data science project successful 12

13 Case Study 1: an e-commerce company A company that sound brownies online wanted to improve their marketing. They had two types of customers: Consumers ordering for friends and family Businesses ordering for their clients Wanted to target their customers differently, but couldn’t consistently tell if a customer was a business or a consumer Data science approach: analyze the text on the gift message to determine if it had “business” or “consumer words” Result: a continuously running script determined if each new order was for a business or consumer, and the customer was put into one of the two categories 13 Gift message Probability of business THANK YOU ALL FOR YOUR AND HARD WORK, IT IS TRULY APPRECIATED BY THE MANAGEMENT TEAM0.989084 CONGRATULATIONS AND BEST OF LUCK ON YOUR NEW JOB! WE ARE VERY PROUD OF YOU! LOVE MOM0.019581

14 The data science process 14 QuestionsData Analysis Modeling ResultAbort Productionize All analyses begin with questions A data scientist will take the question and investigate the data Cases: it’s clear the data isn’t right to answer the question advanced modeling is required the answer to the question can be found immediately in the data If a result is found It can be productionized It can raise more questions

15 Case Study 2: a manufacturing company A manufacturing company made highly customized machines “build a machine that builds lightbulbs” The company needs to makes quotes for the price without ever having made that machine before Costs could be substantially higher or lower than quoted Problem wasn’t with price but with quality of estimates Data science used to better predict how much a machine would cost 15 QuestionsData Analysis Modeling ResultAbort Productionize Question: how can we predict the cost of a machine Data: features of previous machines, their estimates, and the actual costs Analysis: there is a relationship between features and the estimate/actual ratio Model: a GAMLSS to predict the true cost and the possible error band Result: model successfully predicted costs better Productionize: a simple GUI for the company, and a contract for refitting the model

16 The data science process: problems 16 QuestionsData Analysis Modeling ResultAbort Productionize Possible problems Data inconsistent or poorly formatted Question ill formed Model can be faulty – overfitting or incorrect assumptions Signs the model won’t work are ignored Stuck in a “what about” loop Model not built in a way that makes for simple productionizing

17 Case Study 3: a distributor Situation: A company sells products in bulk to state governments Want to discount each quote to a price that brings in the most profit Data Science used to determine which price Problems: No clear relationship found between customer and chance of accepting a quote A highly advanced model was used that found a relationship, but was unintuitive and not robust, and therefore hard to tell if working correctly Model was built on training data that was a different format from the production data, so entire model had to be rebuilt to productionize Was stuck in a “what about loop:” continuously cutting the data in different ways to satisfy customer End result: project was over-budget and under-delivered, and now is extremely difficult to maintain 17

18 3. Skills needed to be a data scientist 18

19 The five skills needed to do data science Technical Skills Statistics and Math – The different techniques used on data: regressions, clustering algorithms, time series models Software Development – How to write code, how to manage a code base, how to store data in a database Business Experience – Where companies waste money, what makes a project succeed, how to get data from within a company Personal Skills Leadership – How to help other data scientists, how to train them, and how to work with a customer to produce good results Adaptability – the ability to figure out a solution when presented with an entirely new problem 19

20 Data scientist archetypes Junior data scientist (J) – Has a BS, and less than three years of experience in industry. Tends to know the only simple statistical techniques. Requires a lot of guidance, but is happy to do the less interesting work. ($) 20 Expert junior scientist (E) – A junior data scientist who has been working for 5+ years. Gets very comfortable doing the simple stuff and knows deeply about their business area. May have gotten an MS to help career. ($$)

21 Data scientist archetypes 21 Senior data scientist (S) – A person with an advanced degree, and enough business experience to know what to do with it. Understands coding well enough to do things right. Is still less willing to do work than a junior data scientist, but will if no one else is around. Big difference between senior and expert junior is ability to independently learn. ($$$) Principal data scientist (P) – Just like a senior data scientist, but also with experience leading a team and a project. Difficult to find. ($$$$)

22 Data scientist archetypes: danger 22 Business intelligence analyst (B) – Understands a lot about the business and the data powering it. Doesn’t know much about statistics or what to do with the data. Can be dangerous without proper guidance. ($) Academic (A) – Has an MS/PhD and not too much business experience. Loves to think about interesting problems, but is less willing to spend the time doing the mundane work to get a project done (and might not know how!). ($$$)

23 4. Who you should hire 23

24 Hiring You only need to hire one data scientist, they’ll hire the rest Who you choose for the first hire dramatically alters how your team will end up 24

25 First hire choice [Expert] Junior data scientist: “the blind leading the blind.” This team will know how to do simple data science but won’t know how to do advanced work. Often won’t even know the advanced things exist. Often very inefficient because they don’t know any better. Academic: “the ivory tower.” A team of people who look at only the most complex problems, and spend tons of time talking about how they are interesting and innovative solutions for them. Won’t product very many solutions. 25 EJ JJ JJ AAAA

26 First hire choice Senior data scientist: “the very strong and expensive team.” This team will be efficient and generally produce good results. The team members won’t enjoy doing the simpler work, but it’ll get done. Principal data scientist: “the balanced.” the principal data scientist will be expensive, but the people they hire won’t be. They will set the groundwork for the team to run efficiently, but will be able to support a junior team. 26 SSSS P S JEJJ S

27 5. Best practices 27

28 From data to a result 28 QuestionsData Analysis Modeling ResultAbort Productionize Tables in a database A presentation Tools are needed for this process

29 From data to a result: work streams Least efficient Tables in a database 1.Database queries to aggregate and join the data 2.SAS code to analyze the aggregated data and run a model 3.Excel worksheets to visualize the data 4.PowerPoint to make the presentation A.pptx file Most efficient Tables in a database 1.A single set of R code to aggregate and join the data, run the model, visualize the output, and make a presentation A.pdf file 29 Firms underestimate the total loss from an inefficient work stream

30 Storing knowledge Data science produces lots of types of files Raw data from the client Processed intermediate data Code to do the analysis Base results Finalized reports and presentations Often an analysis is done once May never be looked at again In a year, someone might ask to do the analysis again with changes 30

31 Storing knowledge: methods Least robust Each data scientist has a folder on a share drive for each project containing all of the data, code, and results  Doesn’t make clear what files are used for  Doesn’t track changes over time  Doesn’t indicate what was delivered to client Most robust Materials split into three components 1.Input data is stored in folders for each project sharing a consistent scheme 2.Code for analysis is stored in version control to track changes 3.Output is stored in folders with marked versions connected the code  Anything delivered to the client is marked with how it was created  Allows for clear change logs to see differences in versions  Splitting input data allows for easy data updates 31

32 Project management 32 QuestionsData Analysis Modeling ResultAbort Productionize Data science process involves many small tasks Finding data Initial analysis Attempting multiple models With multiple projects and multiple people, coordination is non-trivial

33 Project management: methods For project set up, have a standard expected timeline, example: Initial investigation: 2 weeks Modeling: 4 weeks Result validation: 2 weeks Productionizing: 4 weeks In the timeline, have set points to meet with the client and review Use a card-based tool like trello to track the process of individual steps. 33

34 Conclusion 34

35 Conclusion 35 1.What can we really do with data science? Many problems can be solved that rely on data. From simple investigation of the data to building predictive models and optimization algorithms. 2.What makes a data science project successful? Having a clear path from the data to the result, and ensuring the project gets completed (or aborted) at the right time. 3.What are the skills the employees should have? Statistics, software development, business experience, leadership, and adaptability. 4.Who should be our first hire? Our second? Our fifth? Someone with all five of those skills, or failing that, someone with all of them but leadership. 5.What are best practices for running the team? Have a clear, efficient process for doing data science and storing the results.

36 Questions? 36

37 Appendix 37

38 Hiring roadmap If you can find a principal data scientist Hire him or her, have them set up the groundwork for the team 3 months in, hire a senior data scientist 6 months in, hire a junior data scientist By 18 months in, have a team of 5-6 people If you can’t find a principal data scientist Hire a senior data scientist to work independently Every 3 months hire an additional senior data scientist If at any point there seems to be too much simple work, start hiring junior data scientists and assign them senior data scientists as mentors 38

39 Ensuring candidates have these skills During the hiring process, check: Statistics and Math – Do they know how to use a linear regression? What overfitting is? Supervised vs. unsupervised learning? Software Development – Have they used: R, python, or MATLAB? Have they used source control? Have they pulled data from a SQL database, and understand how to do joins? Business Experience – Do they have experience working in a company? Have they seen a project through to completion? Can they reflect on why a project succeeded or failed? Leadership – Have they managed a project? Have they lead employees? Adaptability – Do they have experience in figuring out a solution to an entirely new problem without substantial guidance? 39


Download ppt "Starting a Data Science Team Dr. Jonathan D. Adler"

Similar presentations


Ads by Google