Data science and machine learning at scale, powered by Jupyter

Presentation transcript:

PayPal Notebooks: Data science and machine learning at scale, powered by Jupyter
Romit Mehta, Praveen Kanamarlapudi • August 24, 2018
© 2018 PayPal Inc. Confidential and proprietary.

Agenda
Introductions
PayPal key metrics and analytics ecosystem
Enabling data science at scale
Jupyter platform
PPMagics
Collaboration and deployment
Data access with Gimel
Big data enhancements
Open source plans

Introductions
Romit Mehta (romehta@paypal.com, https://www.linkedin.com/in/romit-mehta): Product manager, data processing products at PayPal. 20 years in data and analytics across networking, semiconductors, telecom, security and fintech industries. Data warehouse developer, BI program manager, data product manager.
Praveen Kanamarlapudi (pkanamarlapudi@paypal.com, https://www.linkedin.com/in/praveenkanamarlapudi): Software engineer, big data platform engineering at PayPal. 5 years in building distributed and scalable applications. PayPal Notebooks lead engineer.

PayPal Key Metrics

PayPal Customers, Transactions and Growth
From: PayPal’s Q2 2018 Investor Update

PayPal Analytics Ecosystem

PayPal Big Data Platform
13 prod clusters, 12 non-prod clusters
GPU co-located with Hadoop
75,000+ YARN jobs/day
160+ PB data
One of the largest Aerospike, Teradata, Hortonworks and Oracle installations
Compute supported: MR, Pig, Hive, Spark, Beam

Developer, Data scientist, Analyst, Operator
User Experience and Access: Gimel SDK, Notebooks, R Studio, BI tools
Compute Framework and APIs: Gimel Data Platform, PCatalog, Data API
Application Lifecycle Management
Infrastructure services leveraged for elasticity and redundancy: Multi-DC, Predictive resource allocation, Public cloud
Logging, Monitoring, Alerting, Security

PayPal Notebooks Platform
2016: Zeppelin, individual use
Q2 2017: Jupyter deployed
Q3 2017: PayPal Notebooks Beta, ~50 users
Feb 2018: PayPal Notebooks Generally Available, ~100 users
Today: PayPal Notebooks, ~1,300 users
SQL, Spark/PySpark, Python, R

Demo: PayPal Notebooks in action

Two customer segments
Analyst, tracking payments by geo: POC with static (csv) data; collaborate with team; visualize results; switch to live data; deploy/productionalize.
Data scientist, build and train models: fetch data; prep/cleanse data; use algorithm to build model; tweak model; finalize model.

Jupyter deployed as a platform: From Jupyter to PayPal Notebooks

PayPal Notebooks Platform
Highly available JupyterHub: SSO + 2FA integration; grid of JupyterHub hosts, highly available and distributed; Kerberos + LDAP integration.
Standalone Docker: container image with all PPExtensions; required to deploy across various security zones at PayPal; foundation for open sourcing PPExtensions.
GPU integration: enable deep learning through notebooks; distributed TensorFlow training enabled with dynamic GPU resource management.

PPExtensions: Set of extensions to improve user experience and reduce time to market

PPMagics
%hive, %teradata: Query data from Hive (or Teradata); insert data using csv/dataframes; publish to Tableau.
%run, %run_pipeline: Run any notebook from another notebook; run multiple notebooks in parallel; execute a pipeline of notebooks.
%csv: Run SQL on csv files.
%presto: Query data from Presto.
%sts: Query data from Spark Thrift Server; includes progress bar for SQL execution.
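
The idea behind a magic such as %csv (running SQL on csv files) can be sketched with the Python standard library alone. This is an illustrative sketch, not the PPExtensions implementation: the helper name run_sql_on_csv, the table name and the sample data are all hypothetical, and SQLite stands in for whatever SQL engine the real magic uses.

```python
import csv
import io
import sqlite3


def run_sql_on_csv(csv_text, table_name, query):
    """Load CSV text into an in-memory SQLite table and run a query on it.

    Hypothetical helper for illustration only; not part of PPExtensions.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first row supplies the column names
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
        # Remaining rows are inserted as text; cast in SQL where numbers are needed.
        conn.executemany(
            f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Made-up sample data: payments by country.
sample = "country,amount\nUS,10\nDE,5\nUS,7\n"
rows = run_sql_on_csv(
    sample,
    "payments",
    "SELECT country, SUM(CAST(amount AS INTEGER)) "
    "FROM payments GROUP BY country ORDER BY country",
)
print(rows)
```

In a notebook, a magic would typically wrap a helper like this so that the query can be written directly in a cell and the result returned as a dataframe.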

Collaboration and deployment
Github sharing: Push notebook to common org-wide repo; view full fidelity notebook on Github; share link to notebook instead of .ipynb file or code snippets.
Project collaboration: Share notebook to personal and team repos; resolve conflicts between remote and local notebooks with nbdime.
Tableau publishing: Seamlessly publish to Tableau; download TDE to use Tableau Desktop, or directly publish as a data source.
Deployment/scheduling: Integrate with Airflow; set up frequency, alerts, optionally push to Github after every run; add Celery executor for scalability.

Gimel: Unified Data API to access any data store

Simplified access to big data systems with GSQL
Single unified data API to access any data store
SQL capabilities against any data store
Switch between interactive, batch and streaming modes
Centralized metadata catalog (PCatalog) to abstract the physical complexities of accessing data
Open sourced in April: gimel.io
Integrated with Jupyter through GSQL
Dataset browser in notebooks powered by PCatalog
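
How a centralized catalog can hide the physical details of a data store behind a single read call might be sketched as follows. This is a toy model of the pattern only: the classes Catalog, DatasetEntry and UnifiedDataAPI, the dataset names and the in-memory "stores" are all hypothetical and do not reflect Gimel's or PCatalog's actual API.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class DatasetEntry:
    """Catalog metadata: which kind of store holds the data, and where."""
    storage_type: str
    location: str


class Catalog:
    """Maps logical dataset names to physical storage details (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, DatasetEntry] = {}

    def register(self, name: str, entry: DatasetEntry) -> None:
        self._entries[name] = entry

    def lookup(self, name: str) -> DatasetEntry:
        return self._entries[name]


class UnifiedDataAPI:
    """Single read() entry point that dispatches on the catalog entry."""

    def __init__(self, catalog: Catalog) -> None:
        self._catalog = catalog
        self._readers: Dict[str, Callable[[str], List[Row]]] = {}

    def register_reader(self, storage_type: str,
                        reader: Callable[[str], List[Row]]) -> None:
        self._readers[storage_type] = reader

    def read(self, name: str) -> List[Row]:
        entry = self._catalog.lookup(name)
        return self._readers[entry.storage_type](entry.location)


# Two stand-in "stores": CSV text and an in-memory key/value table.
CSV_FILES = {"/data/payments.csv": "country,amount\nUS,10\nDE,5\n"}
KV_TABLES = {"users": [{"id": "1", "name": "alice"}]}

catalog = Catalog()
catalog.register("pay.payments", DatasetEntry("csv", "/data/payments.csv"))
catalog.register("pay.users", DatasetEntry("kv", "users"))

api = UnifiedDataAPI(catalog)
api.register_reader("csv", lambda loc: list(csv.DictReader(io.StringIO(CSV_FILES[loc]))))
api.register_reader("kv", lambda loc: KV_TABLES[loc])

# Callers use logical names only; the storage type never appears here.
payments = api.read("pay.payments")
users = api.read("pay.users")
print(payments, users)
```

The design point is that moving a dataset to another store changes only its catalog entry, not the code that reads it.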

Big data and machine learning

Big data and machine learning
Apache Spark: Updates to sparkmagic; enabled progress bar for Spark jobs.
Apache Livy: Enabled SQL session support with Apache Livy.
TensorFlow: Integrated TensorFlow for distributed model training; enabled TensorFlow with GPU.
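
For context on the Livy item, a SQL session in Apache Livy's public REST API is created by posting a session of kind "sql" and then posting statements to it. The sketch below only builds the two HTTP requests with the standard library and does not send them. The endpoint URL, session id and query are placeholders, and this shows the generic Livy API rather than PayPal's sparkmagic changes.

```python
import json
import urllib.request

# Hypothetical Livy endpoint; replace with a real host to actually send requests.
LIVY_URL = "http://livy.example.com:8998"


def _post_json(url, payload):
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_sql_session_request(base_url):
    """Request that asks Livy to start an interactive session of kind 'sql'."""
    return _post_json(f"{base_url}/sessions", {"kind": "sql"})


def submit_statement_request(base_url, session_id, sql):
    """Request that runs one SQL statement in an existing session."""
    return _post_json(
        f"{base_url}/sessions/{session_id}/statements", {"code": sql}
    )


session_req = create_sql_session_request(LIVY_URL)
# Session id 0 is a placeholder; a real id comes from the create-session response.
statement_req = submit_statement_request(LIVY_URL, 0, "SELECT 1")

print(session_req.get_method(), session_req.full_url, session_req.data)
print(statement_req.get_method(), statement_req.full_url, statement_req.data)
```

Passing either request to urllib.request.urlopen would send it; a client would then poll the statement's URL until its state reports completion.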

PayPal Notebooks Open Source Plans

Open sourcing PPExtensions
PPMagics: available now
Gimel: available now
Airflow integration
Github sharing
Project collaboration
Tableau integration
Config UI
Dataset browser
Pipelines
Data Science Workbench

Open source links and info
Install: pip install ppextensions
Github: http://ppextensions.io
Google Group: https://groups.google.com/d/forum/ppextensions
Slack: https://ppextensions.slack.com
Gimel: http://gimel.io

Q&A