Scaling the Data Scientist Dr. Ira Cohen, Chief Data Scientist, HP Software.


2 Data Science HPSW HP-Software and Data Science HP-Software products collect huge amounts of IT data Customers want us to transform the data to actionable information System Monitoring Events Defects Incidents Logs Changes Configuration Test data Requirements “Big Data & Predictive Analytics: The Future of IT Management” Mike Gualtieri, Forrester Security events Network data App Monitoring

Need
- Expertise: expertise in machine learning, expertise in the products domain
- Infrastructure: data platforms, development tools

A tale of two worlds
Data Scientists
- Few
- Limited domain knowledge
- Tools: R, Matlab, Mahout, Knime, Weka, SAS, …
Developers/SMEs
- Plentiful
- Limited data science knowledge
- Tools: IDEs, Excel

Our solution
Developer -> Data analytics specialist

How?
- Training, Mentoring, Community
- Data infrastructure, New Dev tool

Training: Practical Machine Learning
- 4 day training
- Commitment to complete first project

Practical Machine Learning: Ohad Assulin, Efrat Egozi Levi, Ira Cohen
- Automatic Event Prioritization: Anat Levinger & Roy Wallerstein
- Automatic Vulnerability Categorization: Barak Raz & Ben Feher
- Classifying Security Events: Yoni Roit & Omer Weissman
- Early detection of anomalous behavior in IT systems: Yonatan Ben Simhon & Yaneeve Shekel
- Cloud Delivery Optimization (CDO): Ran, Levi
- URL to Action Classification: Boaz Shor & Eyal Kenigsberg
- Predictive Analytics in Release Management: Sigalit Sade
- Sales Pipeline Early Warning: Gabriel, Alvarado

Pushing My Buttons Gil Zieder, Ofer Eliassaf, Boris Kozorovitzky

The work
Steps on the slide: Problem definition, Data, Attribute construction, Normalization, Processing, Attribute selection, Filtering, Supervised Classification, Learning, Minimize false negatives, Testing
- Problem definition: As a Pusher or DevOps of a project you would like to know if the given change set is safe to push into the production branch.
- Data: 9 open source projects, 8806 individual commits. Get labels of “good” or “bad” commit by running tests after each commit: “good” if tests pass, “bad” if tests fail.
- Attributes: 80 attributes per commit. Source control, previous commits, and code complexity based attributes, e.g., average change frequency, previous commit state, cyclomatic complexity.
- Attribute selection: rank based
- Classification algorithms: K-NN, SVM, Decision Tree, Random Forest, …
- Result: 87% accuracy with K-NN
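The steps named on this slide (normalization, rank based attribute selection, K-NN classification, and counting false negatives) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's actual code: the toy commits, the three attributes, the class-mean-gap ranking score, the choice of k = 3, and every function name below are hypothetical, and "false negative" is assumed to mean a "bad" commit predicted as "good".

```python
import math
from collections import Counter

# Hypothetical toy data. Each row is one commit with three illustrative
# attributes: average change frequency, previous commit failed (0/1),
# cyclomatic complexity. The slide reports 80 attributes per commit.
train_x = [
    [1.0, 0, 5], [2.0, 0, 8], [1.5, 0, 6],      # tests passed
    [6.0, 1, 30], [7.0, 1, 25], [5.5, 1, 28],   # tests failed
]
train_y = ["good", "good", "good", "bad", "bad", "bad"]
test_x = [[1.2, 0, 7], [6.5, 1, 27], [2.5, 0, 9]]
test_y = ["good", "bad", "good"]


def fit_min_max(rows):
    """Learn per-attribute min and range; return a function scaling a row to [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]

    def scale(row):
        return [(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]

    return scale


def rank_attributes(rows, labels):
    """Order attribute indices by the gap between class means (an assumed score)."""
    scores = []
    for j in range(len(rows[0])):
        good = [r[j] for r, y in zip(rows, labels) if y == "good"]
        bad = [r[j] for r, y in zip(rows, labels) if y == "bad"]
        scores.append(abs(sum(good) / len(good) - sum(bad) / len(bad)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def knn_predict(rows, labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    nearest = sorted(zip(rows, labels), key=lambda p: math.dist(p[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Normalize using training statistics only, then keep the two top-ranked attributes.
scale = fit_min_max(train_x)
train_scaled = [scale(r) for r in train_x]
ranking = rank_attributes(train_scaled, train_y)
top = ranking[:2]


def project(row):
    """Keep only the selected attributes of a scaled row."""
    return [row[j] for j in top]


train_proj = [project(r) for r in train_scaled]
predictions = [knn_predict(train_proj, train_y, project(scale(r))) for r in test_x]

accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
# Assumed definition: a false negative is a "bad" commit predicted "good".
false_negatives = sum(1 for p, y in zip(predictions, test_y) if y == "bad" and p == "good")
print(predictions, accuracy, false_negatives)
```

Scaling with statistics learned from the training commits only keeps test information out of the model. A real evaluation on the 8806 commits would also need a proper train/test split or cross-validation, which this sketch leaves out.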

Analytic specialist program: Results
- > 70 developers trained (before: 4)
- > 30 new capabilities since April 2013 (before: 1)
- 1 data scientist per 10 new capabilities (before: 1:1)
- Development time reduced by 70% (before: 12 months)

Can we do better?
Yes. From months to days!
How?
- Create a simple tool for analytic specialists
- Automate the data scientist as much as possible

Project Titan

Titan: Demo

Scaling the data scientist
Analytic specialists
- Develop using standard machine learning
- Use a simplified tool
Data Scientist
- Provides expert advice
- Develops new types of machine learning solutions