Databricks: the new kid on the block

Databricks: the new kid on the block Antonio Abalos Castillo antonioa@avanade.com http://www.sqlsaturday.com/746/Sessions/Details.aspx?sid=78633

A big thanks to all of our sponsors!

… the new kid on the block [informal]: someone who is new in a place or organization and has many things to learn about it. Well, we are actually the ones who really need to learn about it! https://dictionary.cambridge.org/es/diccionario/ingles/new-kid-on-the-block?q=the-new-kid-on-the-block https://www.phrases.org.uk/meanings/255875.html

OK, this is about BI and data science… and failure rates for analytics, BI, and big data projects = 85%!! https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes/ http://www.digitaljournal.com/tech-and-science/technology/big-data-strategies-disappoint-with-85-percent-failure-rate/article/508325 https://twitter.com/nheudecker/status/928720268662530048

Who is already “in the block”? Azure Data Factory, Azure Import/Export service, Azure SQL DB, Azure Cosmos DB, Azure SQL Data Warehouse, Azure Analysis Services, Power BI, Azure Storage blobs, Azure Data Lake Store, Azure ML, ML Server, Azure Databricks, Azure Data Lake Analytics, Azure HDInsight, Azure IoT Hub, Azure Event Hubs, Kafka on Azure HDInsight, Azure Search, Azure Data Catalog, Azure Stream Analytics, Cognitive Services, Bot Service, Azure Active Directory, Azure network security groups, Azure Key Management Service, Azure ExpressRoute, Operations Management Suite, Azure Functions, Visual Studio.

More precisely, on big data: HDInsight. Includes Jupyter and Zeppelin notebooks; remote API for job management; integrated with Blob storage, Event Hubs for streaming, and Power BI for analytics; quick to deploy and scale. https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview

HDInsight, other considerations. Provisioning (template: 101-hdinsight-spark-linux): https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql Clusters have to be created (about 20 minutes) and deleted after use; admins have to decide what to do with the disks and files; Data Factory can be used to automate the process (on-demand). https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query

Azure Machine Learning Studio: serverless; web-based; integrated with Active Directory; notebooks; limited regions (West Europe).

Azure Machine Learning Studio https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml

Azure Machine Learning Workbench (aka Machine Learning Services), in preview as of August 2018: desktop application; Python-based and Git-compatible; built-in Jupyter notebooks; integrated with Azure AD; deploys and runs models via Docker containers (Azure Machine Learning Experimentation service). https://docs.microsoft.com/en-us/azure/machine-learning/service/ https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/experimentation-service-configuration https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml

Azure Machine Learning Workbench and Jupyter notebooks https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks

Microsoft Machine Learning Server: previously known as “R Server”; extends R with parallel tools for big data processing; available in HDInsight; runs models via Hadoop or Spark; can publish models as a web service; can run Python too. http://blog.revolutionanalytics.com/2016/01/microsoft-r-open.html https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server https://docs.microsoft.com/en-us/machine-learning-server/ https://docs.microsoft.com/en-us/machine-learning-server/operationalize/quickstart-publish-r-web-service#b-publish-model-as-a-web-service

What is the point with notebooks? https://www.svds.com/why-notebooks-are-super-charging-data-science/

Isn’t everything about Jupyter? Azure Machine Learning Studio, Azure Machine Learning Workbench, Data Science VM, HDInsight, Databricks. https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks https://notebooks.azure.com/

What does the technology framework look like?

Some tools in the Azure technology framework for data science. Data preparation: Azure Notebooks, Azure Machine Learning Workbench, Azure Machine Learning Studio, other tools (R Studio, Visual Studio Code, …), Data Factory/Data Lake Analytics. Model execution: Spark on HDInsight, Docker, Machine Learning Server, SQL Server (yes!), Azure Machine Learning web service. https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning

Big data architectures

Big data architectures https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/

Big data and advanced analytics scenarios. Modern data warehousing: “We want to integrate all our data, including ‘big data’, with our data warehouse.” Advanced analytics: “We are trying to predict when our customers churn.” Real-time analytics: “We are trying to get insights from our devices in real time.”

Databricks: fast, easy, and collaborative Apache Spark-based analytics platform

OK, but what is Databricks? Best of Databricks, best of Microsoft: the leading Apache Spark analytics platform. “It is not so often in the software industry that the most widely used tool is also the best available platform to choose from” (Dr. Veljko Krunic)

Databricks foundations. What is Apache Spark? Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets on top of an existing Hadoop Distributed File System (HDFS) infrastructure. What is Hadoop? Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.

What do we get with Spark? Allows programmers to develop complex, multi-step data pipelines; in-memory data sharing across different jobs (unlike Hadoop, which shares data through HDFS files); more than just Map and Reduce functions; optimizes arbitrary operator graphs; lazy evaluation of big data queries; concise and consistent APIs in Scala, Java, and Python; interactive shell for Scala and Python; support for SQL and R.
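
To make "lazy evaluation" and "multi-step pipelines" concrete, here is a minimal sketch in plain Python. It is a toy stand-in, not the Spark API: the `LazyPipeline` class and its `map`, `filter`, and `collect` methods are hypothetical names chosen for illustration. Transformations are only recorded; the recorded chain runs when an action (`collect`) is called, which is the general idea behind Spark building an operator graph before executing it.

```python
from typing import Callable, Iterable, Iterator, List


class LazyPipeline:
    """Toy lazily evaluated dataset (illustration only, NOT the Spark API)."""

    def __init__(self, source: Iterable, steps: tuple = ()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn: Callable) -> "LazyPipeline":
        # Record the step and return a new pipeline; nothing runs here.
        return LazyPipeline(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyPipeline":
        # Same as map: only the plan grows.
        return LazyPipeline(self._source, self._steps + (("filter", pred),))

    def _iterate(self) -> Iterator:
        # Chain the recorded steps into one streaming pass over the source.
        stream: Iterator = iter(self._source)
        for kind, fn in self._steps:
            stream = map(fn, stream) if kind == "map" else filter(fn, stream)
        return stream

    def collect(self) -> List:
        # The "action": only now is the recorded chain executed.
        return list(self._iterate())


calls = []


def square(x):
    calls.append(x)  # track when the function really runs
    return x * x


pipeline = LazyPipeline(range(6)).map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has executed yet, only a plan exists
result = pipeline.collect()       # runs map and filter in a single pass
```

Because the whole chain is known before anything executes, an engine is free to reorder or fuse steps; this toy version simply streams each element through every step once.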

OK wait, I like Spark but… I don’t want Databricks. Azure still has HDInsight with Spark on top, but: cluster management is up to you; notebook integration has to be configured (Jupyter or Zeppelin); it lacks the memory and performance enhancements. Some good things still remain: Anaconda comes preloaded by default; Azure integration with other services (Data Lake, Machine Learning, Power BI); REST APIs for service deployment and job management (Livy). https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
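
As an illustration of job management through Livy, the sketch below builds (but does not send) a batch submission request using only the Python standard library. It assumes the usual HDInsight endpoint shape `https://<cluster>.azurehdinsight.net/livy/batches` with basic authentication; the cluster name, credentials, and script path are placeholders, and `build_livy_batch_request` is a hypothetical helper name.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password, app_file, args=()):
    """Build (but do not send) a POST /batches request for Livy on HDInsight."""
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"
    # "file" is the application to run; "args" are passed to it.
    payload = {"file": app_file, "args": list(args)}
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
            "X-Requested-By": user,
        },
    )


# Placeholder values throughout; replace with a real cluster and script.
request = build_livy_batch_request(
    "mycluster", "admin", "secret",
    "wasbs:///example/jobs/wordcount.py", args=["input", "output"],
)
# urllib.request.urlopen(request) would actually submit the job.
```

Keeping request construction separate from sending makes the payload easy to inspect before anything reaches the cluster.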

Why Databricks then? Unified platform for data science and data engineering; easy to promote experiments to “products”; unified security model, encryption, and auditing; optimized version of Spark, claimed to run 10 to 40x faster than open-source Spark.

Azure Databricks (diagram labels): data engineer, data scientist, business analyst; collaborative workspace; deploy production jobs and workflows (multi-stage pipelines, job scheduler, notifications and logs); optimized Databricks Runtime engine (Databricks I/O, Apache Spark, Serverless, REST APIs); IoT / streaming data, cloud storage, data warehouses, Hadoop storage; machine learning models, BI tools, data exports. Themes: enhance productivity, build on a secure and trusted cloud, scale without limits.

Speaker notes: Databricks, founded by the team that created Apache Spark, is a unified analytics platform that accelerates innovation by unifying data science, engineering, and business. 75% of the code committed to Apache Spark comes from Databricks. Unified runtime: create clusters in seconds and dynamically scale them up and down; enhancements to the Spark engine make it 10x faster than open-source Spark; Serverless gives auto-configured multi-user clusters and reliable sharing with fault isolation. Unified collaboration: overall, a simple and collaborative environment that enables your entire team to use Spark and interact with your data simultaneously. Data engineers (DE): improved ETL performance, zero-management clusters, production code executed from within notebooks. Data scientists (DS): easy data exploration in notebooks. Business SMEs: interactive dashboards empower teams to create dynamic reports. Enterprise security: encryption, fine-grained role-based access control (files, clusters, code, applications, dashboards), compliance. REST APIs. DE: DBIO, Spark, APIs, jobs. DS: Spark and Serverless, interactive data science. Data products: everything. Other talking points: creators of Spark, training, people, number of customers. Production jobs: ingest, ETL, scheduling, monitoring (workflow: schedule / run / monitor, execute, troubleshoot, debug).

Databricks in Azure: control plane managed by Databricks; data plane controlled by Azure; deployed as IaaS using as many nodes as required.

Control plane: notebooks, jobs, clusters, users, and ACLs are managed from the control plane. These services store data in dedicated Databricks databases (not accessible to external users). The control plane is accessible from the Databricks UX and the Databricks API.
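
As a rough illustration of reaching the control plane through the Databricks API, the sketch below prepares a request for the Clusters API list endpoint (`/api/2.0/clusters/list`) with a bearer token and parses a response body. The workspace URL and token are placeholders, the sample response is hand-written rather than captured from a real workspace, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_clusters_list_request(workspace_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for the Clusters API list endpoint."""
    return urllib.request.Request(
        f"{workspace_url.rstrip('/')}/api/2.0/clusters/list",
        method="GET",
        headers={"Authorization": f"Bearer {token}"},
    )


def cluster_names(response_body: str) -> list:
    """Pull the cluster names out of a clusters/list JSON response."""
    return [c["cluster_name"] for c in json.loads(response_body).get("clusters", [])]


# Placeholder workspace URL and token; substitute your own values.
request = build_clusters_list_request(
    "https://westeurope.azuredatabricks.net/", "dapi-PLACEHOLDER"
)

# Hand-written sample body, used here instead of a live call.
sample_body = '{"clusters": [{"cluster_id": "0001", "cluster_name": "demo", "state": "RUNNING"}]}'
names = cluster_names(sample_body)
```

Sending the request with `urllib.request.urlopen(request)` and passing the decoded body to `cluster_names` would list the clusters the token is allowed to see.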

Data plane: the Spark clusters are deployed to the customer’s Azure subscription. Each workspace and its associated clusters are created in dedicated VNETs. Access to the VNETs is restricted by network security groups (NSGs).

Databricks setup: how to provision Databricks from Azure

Databricks setup – Creating workspace https://azure.microsoft.com/en-us/pricing/details/databricks/

Databricks setup – Creating workspace: control plane provisioned

Databricks setup – Creating workspace: control plane. So far, nothing to worry about.

Databricks setup – Creating clusters

Databricks setup – Creating clusters: provisioning time is approx. 8 minutes
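
The slides create the cluster through the portal UI. As a hedged sketch of the same step done programmatically, the snippet below assembles a request body for the Clusters API create endpoint (`POST /api/2.0/clusters/create`). The cluster name, runtime version string, and VM size are placeholder values, and `build_cluster_spec` is a hypothetical helper, not something shown in the talk.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id,
                       min_workers, max_workers, autotermination_minutes):
    """Return a request body for POST /api/2.0/clusters/create."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("need 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,   # a runtime version offered by the workspace
        "node_type_id": node_type_id,     # Azure VM size used for the nodes
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Shut the cluster down after this many idle minutes.
        "autotermination_minutes": autotermination_minutes,
    }


# Placeholder values; pick ones that exist in your own workspace.
spec = build_cluster_spec(
    name="demo-cluster",
    spark_version="4.2.x-scala2.11",
    node_type_id="Standard_DS3_v2",
    min_workers=2,
    max_workers=8,
    autotermination_minutes=30,
)
body = json.dumps(spec, indent=2)
```

Setting an auto-termination timeout is one way to avoid paying for an idle cluster that nobody remembered to stop.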

Databricks setup – Testing setup: impressive results!! ;)

Databricks setup – Behind the scenes: here are the cost drivers. A separate resource group, managed from the control plane: network, VMs, storage, disks.
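
A back-of-the-envelope sketch of how the compute part of the bill might be estimated. It assumes the model described on the Azure Databricks pricing page, where each node-hour incurs a VM charge plus a Databricks Unit (DBU) charge; every rate below is a made-up placeholder, not a real price, and storage, disks, and networking are left out.

```python
def estimate_cluster_cost(nodes, hours, vm_rate_per_hour,
                          dbu_per_node_hour, dbu_rate):
    """Estimate compute cost as VM charge + DBU charge (storage/network excluded)."""
    node_hours = nodes * hours
    vm_cost = node_hours * vm_rate_per_hour
    dbu_cost = node_hours * dbu_per_node_hour * dbu_rate
    return round(vm_cost + dbu_cost, 2)


# Hypothetical rates for illustration only; look up the current prices.
cost = estimate_cluster_cost(
    nodes=3,
    hours=8,
    vm_rate_per_hour=0.50,    # placeholder VM price per node-hour
    dbu_per_node_hour=0.75,   # placeholder DBUs consumed per node-hour
    dbu_rate=0.40,            # placeholder price per DBU
)
```

With these placeholder numbers the VM part is 3 × 8 × 0.50 = 12.00 and the DBU part is 3 × 8 × 0.75 × 0.40 = 7.20, giving 19.20 in total.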

Databricks setup – Behind the scenes: cluster terminated. Virtual machines and networks are removed; the storage account remains.

Other resources https://azure.microsoft.com/en-us/services/databricks/ https://blogs.msdn.microsoft.com/sqlcat/2016/08/18/migrating-data-to-azure-sql-data-warehouse-in-practice/ https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/ https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/ https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK3377 https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK4016 https://databricks.com/product/azure https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices

Help deciding which Machine Learning technology to use: https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-more-machine-learning

Thank you!!