Scaling the Data Scientist Dr. Ira Cohen, Chief Data Scientist, HP Software
Data Science HPSW
HP-Software and Data Science
– HP-Software products collect huge amounts of IT data
– System Monitoring, Events, Defects, Incidents, Logs, Changes, Configuration, Test data, Requirements, Security events, Network data, App Monitoring
– Customers want us to transform the data to actionable information
– “Big Data & Predictive Analytics: The Future of IT Management”, Mike Gualtieri, Forrester
Need
Expertise
– Expertise in machine learning
– Expertise in the products domain
Infrastructure
– Data platforms
– Development Tools
A tale of two worlds
Data Scientists
– Few
– Limited domain knowledge
– Tools: R, Matlab, Mahout, Knime, Weka, Sas, …
Developers/SMEs
– Plentiful
– Limited data science knowledge
– Tools: IDEs, Excel
Our solution
Developer
Data analytics specialist
How?
– Training, Mentoring, Community
– Data infrastructure, New Dev tool
Training: Practical Machine Learning
– 4 day training
– Commitment to complete first project
Practical Machine Learning: Ohad Assulin, Efrat Egozi Levi, Ira Cohen
– Automatic Event Prioritization: Anat Levinger & Roy Wallerstein
– Automatic Vulnerability Categorization: Barak Raz & Ben Feher
– Classifying Security Events: Yoni Roit & Omer Weissman
– Early detection of anomalous behavior in IT systems: Yonatan Ben Simhon & Yaneeve Shekel
– Cloud Delivery Optimization (CDO): Ran, Levi
– URL to Action Classification: Boaz Shor & Eyal Kenigsberg
– Predictive Analytics in Release Management: Sigalit Sade
– Sales Pipeline Early Warning: Gabriel, Alvarado
Pushing My Buttons Gil Zieder, Ofer Eliassaf, Boris Kozorovitzky
The work
Problem definition
– As a Pusher or DevOps of a project you would like to know if the given change set is safe to push into the production branch.
Data
– 9 open source projects, 8806 individual commits
– Get labels of “good” or “bad” commit by running tests after each commit
– “good” – tests pass, “bad” – tests fail
Processing (Attribute construction, Normalization)
– 80 attributes per commit
– Source control, previous commits, and code complexity based attributes: e.g., average change frequency, previous commit state, cyclomatic complexity
Filtering (Attribute selection)
– Rank based attribute selection
Learning (Supervised Classification)
– Classification algorithms: K-NN, SVM, Decision Tree, Random Forest, …
Testing (Minimize false negatives)
– 87% Accuracy with K-NN
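The Filtering, Learning and Testing steps above can be sketched in plain Python. This is a minimal illustration on synthetic data, not the project's code. The six made-up attributes, the mean-gap ranking score, k = 5, the train/test split, and the choice to count a "bad" commit predicted as "good" as a false negative are all assumptions. The slides name rank based attribute selection and K-NN but do not give these details.

```python
import math
import random
from collections import Counter


def make_commit(rng, label):
    """Synthetic commit (assumption): attributes 0 and 1 depend on the label,
    attributes 2..5 are pure noise."""
    shift = 3.0 if label == "bad" else 0.0
    informative = [rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
    return informative + noise


def rank_attributes(rows, labels):
    """Rank attribute indices, best first, by the gap between the class means
    measured in units of the attribute's overall standard deviation."""
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        good = [v for v, y in zip(col, labels) if y == "good"]
        bad = [v for v, y in zip(col, labels) if y == "bad"]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col)) or 1.0
        gap = abs(sum(good) / len(good) - sum(bad) / len(bad))
        scores.append((gap / std, j))
    return [j for _, j in sorted(scores, reverse=True)]


def knn_predict(train_rows, train_labels, query, k=5):
    """Majority vote among the k training rows closest to the query (Euclidean)."""
    pairs = sorted(zip(train_rows, train_labels),
                   key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in pairs[:k])
    return votes.most_common(1)[0][0]


def evaluate(train_X, train_y, test_X, test_y, keep, k=5):
    """Return (accuracy, false negatives) using only the attributes in `keep`.
    A false negative here is a "bad" commit predicted as "good"."""
    train_sel = [[r[j] for j in keep] for r in train_X]
    preds = [knn_predict(train_sel, train_y, [r[j] for j in keep], k)
             for r in test_X]
    accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
    false_neg = sum(p == "good" and t == "bad" for p, t in zip(preds, test_y))
    return accuracy, false_neg


rng = random.Random(0)
labels = ["bad" if i % 3 == 0 else "good" for i in range(300)]
rows = [make_commit(rng, y) for y in labels]
train_X, test_X = rows[:200], rows[200:]
train_y, test_y = labels[:200], labels[200:]

selected = rank_attributes(train_X, train_y)[:2]  # keep the 2 top-ranked attributes
accuracy, false_negatives = evaluate(train_X, train_y, test_X, test_y, selected)
print("selected attributes:", sorted(selected))
print("accuracy:", accuracy, "false negatives:", false_negatives)
```

Reporting false negatives next to accuracy follows the "Minimize false negatives" goal: under the assumption above, a missed "bad" commit is the costly error, so it is counted separately rather than folded into a single accuracy figure.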
Analytic specialist program: Results
– > 70 developers trained (Before: 4)
– > 30 new capabilities since April 2013 (Before: 1)
– 1 Data scientist per 10 new capabilities (Before: 1:1)
– Development time reduced by 70% (Before: 12 months)
Can we do better?
Yes. From months to days!
How?
– Create a simple tool for analytic specialists
– Automate the data scientist as much as possible
Project Titan
Titan: Demo
Scaling the data scientist
Analytic specialists
– Develop using standard machine learning
– Use simplified tool
Data Scientist
– Provides expert advice
– Develops new types of machine learning solutions