Databricks: the new kid on the block
Antonio Abalos Castillo
antonioa@avanade.com
http://www.sqlsaturday.com/746/Sessions/Details.aspx?sid=78633
A big thanks to all of our sponsors!
… the new kid on the block
[informal] Someone who is new in a place or organization and has many things to learn about it
Well, actually we are the ones who really need to learn about it!
https://dictionary.cambridge.org/es/diccionario/ingles/new-kid-on-the-block?q=the-new-kid-on-the-block
https://www.phrases.org.uk/meanings/255875.html
OK, this is about BI and data science… and failure rates for analytics, BI, and big data projects = 85%!!
https://designingforanalytics.com/resources/failure-rates-for-analytics-bi-iot-and-big-data-projects-85-yikes/
http://www.digitaljournal.com/tech-and-science/technology/big-data-strategies-disappoint-with-85-percent-failure-rate/article/508325
https://twitter.com/nheudecker/status/928720268662530048
Who is already “in the block”?
AZURE DATA FACTORY, AZURE IMPORT EXPORT SERVICE, AZURE SQL DB, AZURE COSMOS DB, AZURE SQL DATA WAREHOUSE, AZURE ANALYSIS SERVICES, POWER BI, AZURE STORAGE BLOBS, AZURE DATA LAKE STORE, AZURE ML, ML SERVER, AZURE DATABRICKS, AZURE DATA LAKE ANALYTICS, AZURE HDINSIGHT, AZURE IOT HUB, AZURE EVENT HUBS, KAFKA ON AZURE HDINSIGHT, AZURE SEARCH, AZURE DATA CATALOG, AZURE STREAM ANALYTICS, COGNITIVE SERVICES, BOT SERVICE, AZURE ACTIVE DIRECTORY, AZURE NETWORK SECURITY GROUPS, AZURE KEY MANAGEMENT SERVICE, AZURE EXPRESSROUTE, OPERATIONS MANAGEMENT SUITE, AZURE FUNCTIONS, VISUAL STUDIO
More precisely, on big data: HDInsight
- Includes Jupyter and Zeppelin notebooks
- Remote API for job management
- Integrated with Blob storage, Event Hubs for streaming, and Power BI for analytics
- Quick to deploy and scale
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
HDInsight, other considerations
- Provisioning (template: 101-hdinsight-spark-linux): https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-spark-sql
- Clusters have to be created (about 20 minutes) and deleted after use
- Admins have to decide what to do with the disks and files
- Data Factory can be used to automate the process (on-demand clusters)
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-load-data-run-query
Azure Machine Learning Studio
- Serverless
- Web based
- Active Directory integrated
- Notebooks
- Limited regions (West Europe)
Azure Machine Learning Studio https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
Azure Machine Learning Workbench (aka Machine Learning Services)
(In preview as of August 2018)
- Desktop application
- Python based and Git compatible
- Built-in Jupyter notebooks
- Integrated with Azure AD
- Deploys and runs models via Docker containers (Azure Machine Learning Experimentation service)
https://docs.microsoft.com/en-us/azure/machine-learning/service/
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/experimentation-service-configuration
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
Azure Machine Learning Workbench and Jupyter notebooks https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks
Microsoft Machine Learning Server
- Previously known as “R Server”
- Extends R with parallel tools for big data processing
- Available in HDInsight
- Runs models via Hadoop or Spark
- Can publish models via web service
- Can run Python too
http://blog.revolutionanalytics.com/2016/01/microsoft-r-open.html
https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server
https://docs.microsoft.com/en-us/machine-learning-server/
https://docs.microsoft.com/en-us/machine-learning-server/operationalize/quickstart-publish-r-web-service#b-publish-model-as-a-web-service
What is the point with notebooks? https://www.svds.com/why-notebooks-are-super-charging-data-science/
Isn’t everything about Jupyter?
- Azure Machine Learning Studio
- Azure Machine Learning Workbench
- Data Science VM
- HDInsight
- Databricks
https://docs.microsoft.com/en-us/azure/machine-learning/desktop-workbench/how-to-use-jupyter-notebooks
https://notebooks.azure.com/
What does the technology framework look like?
Some tools in the Azure technology framework for data science
Data preparation
- Azure Notebooks
- Azure Machine Learning Workbench
- Azure Machine Learning Studio
- Other tools (R Studio, Visual Studio Code, …)
- Data Factory/Data Lake Analytics
Model execution
- Spark on HDInsight
- Docker
- Machine Learning Server
- SQL Server (yes!)
- Azure Machine Learning web service
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning
Big data architectures
Big data architectures https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/
Big data and advanced analytics scenarios
- Modern Data Warehousing: “We want to integrate all our data including ‘big data’ with our data warehouse”
- Advanced Analytics: “We are trying to predict when our customers churn”
- Real-time Analytics: “We are trying to get insights from our devices in real-time”
Databricks
Fast, easy, and collaborative Apache Spark-based analytics platform
Ok, but what is Databricks?
Best of Databricks
Best of Microsoft
The leading Apache Spark analytics platform
“It is not so often in the software industry that the most widely used tool is also the best available platform to choose from” Dr. Veljko Krunic
Databricks foundations
What is Apache Spark?
Apache Spark is an open source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets on top of an existing Hadoop Distributed File System (HDFS) infrastructure.
What is Hadoop?
Apache Hadoop is an open source software platform for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. Hadoop services provide for data storage, data processing, data access, data governance, security, and operations.
What do we get with Spark?
- Allows programmers to develop complex, multi-step data pipelines
- In-memory data sharing across different jobs (unlike Hadoop, which shares data through HDFS files)
- More than just Map and Reduce functions
- Optimizes arbitrary operator graphs
- Lazy evaluation of big data queries
- Provides concise and consistent APIs in Scala, Java and Python
- Interactive shell for Scala and Python
- Support for SQL and R
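The "lazy evaluation" point can be illustrated with a small pure-Python sketch. This is not Spark's API: `LazyDataset`, `traced_square` and the sample data are hypothetical names made up for illustration. The idea it tries to show is that transformations only record work, and nothing runs until an action (`collect`) is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark)."""

    def __init__(self, source: Iterable[Any],
                 steps: Tuple[Tuple[str, Callable], ...] = ()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: only records the step and returns a new dataset.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: runs every recorded step, item by item.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def traced_square(x: int) -> int:
    calls.append(x)  # lets us observe when the function really runs
    return x * x


pipeline = LazyDataset(range(6)).filter(lambda x: x % 2 == 0).map(traced_square)
calls_before_action = len(calls)  # still 0: nothing has been evaluated yet
result = pipeline.collect()       # the action triggers the whole pipeline
```

Because the whole chain of steps is known before anything executes, an engine built this way has room to optimize the operator graph as a whole, which is the benefit the slide refers to.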
Ok wait, I like Spark but… I don’t want Databricks
Azure still has HDInsight with Spark on top, but:
- Cluster management is up to you
- Notebook integration has to be configured (Jupyter or Zeppelin)
- Lacks memory and performance enhancements
Some good things still remain:
- Anaconda comes preloaded by default
- Azure integration with other services (Data Lake, Machine Learning, Power BI)
- REST APIs for service deployment and job management (Livy)
https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-overview
Why Databricks then?
- Unified platform for data science and data engineering
- Easy to promote experiments to “products”
- Unified security model, encryption and auditing
- Optimized version of Spark, running 10 to 40x faster
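As a rough illustration of job management through Livy on HDInsight, the sketch below builds (but does not send) a batch submission request. The cluster name, credentials, jar path and class name are hypothetical placeholders. It assumes the usual HDInsight Livy endpoint `https://<cluster>.azurehdinsight.net/livy/batches` and the `file`, `className` and `args` fields of the Livy batch API.

```python
import base64
import json
import urllib.request
from typing import Sequence


def build_livy_batch_request(cluster_name: str, user: str, password: str,
                             jar_path: str, class_name: str,
                             args: Sequence[str]) -> urllib.request.Request:
    """Build a POST request that would submit a Spark batch job via Livy."""
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"
    body = {"file": jar_path, "className": class_name, "args": list(args)}
    # HDInsight cluster gateways use HTTP basic authentication.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
            "X-Requested-By": user,  # expected by Livy when CSRF protection is on
        },
        method="POST",
    )


# All values below are placeholders, not real resources.
request = build_livy_batch_request(
    cluster_name="my-spark-cluster",
    user="admin",
    password="not-a-real-password",
    jar_path="wasbs:///example/jars/my-app.jar",
    class_name="com.example.MyApp",
    args=["2018-09-01"],
)
payload = json.loads(request.data)
# To actually submit: urllib.request.urlopen(request)
```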
Azure Databricks
[Diagram]
- Sources: IoT / streaming data, cloud storage, data warehouses, Hadoop storage
- Collaborative workspace: data engineer, data scientist, business analyst
- Deploy production jobs & workflows: multi-stage pipelines, job scheduler, notification & logs
- Optimized Databricks Runtime Engine: Databricks I/O, Apache Spark, Serverless, REST APIs
- Outputs: machine learning models, BI tools, data exports, data warehouses
- Enhance productivity. Build on secure & trusted cloud. Scale without limits.
[Notes]
Databricks, founded by the team that created Apache Spark, is a unified analytics platform that accelerates innovation by unifying data science, engineering and business. 75% of the code committed to Apache Spark comes from Databricks.
Unified runtime
- Create clusters in seconds, dynamically scale them up and down
- They have made enhancements to the Spark engine to make it 10x faster than open source Spark
- Serverless: auto-configured multi-user cluster, reliable sharing with fault isolation
Unified collaboration
- Overall: a simple and collaborative environment that enables your entire team to use Spark and interact with your data simultaneously
- Data engineers: improve ETL performance, zero-management clusters, execute production code from within notebooks
- Data scientists: easy data exploration in notebooks
- Business SMEs: interactive dashboards empower teams to create dynamic reports
Enterprise security
- Encryption
- Fine-grained role-based access control (files, clusters, code, application, dashboard)
- Compliance
REST APIs
- Data engineering: DBIO, Spark, APIs, jobs
- Data science: Spark and Serverless, interactive data science
- Data products: everything
Production jobs: ingest, ETL, workflow, scheduling, running, monitoring, troubleshooting, debugging
Talking points: creators of Spark, training, people, number of customers
Databricks in Azure
- Control plane managed by Databricks
- Data plane controlled by Azure
- Deployed as IaaS using as many nodes as required
Control plane
- Notebooks, jobs, clusters, users and ACLs are managed from the control plane
- These services store data in dedicated Databricks databases (not accessible to external users)
- The control plane is accessible from the Databricks UX and the Databricks API
Data plane
- The Spark clusters are deployed to the customer’s Azure subscription
- Each workspace and its associated clusters are created in dedicated VNETs
- Access to the VNETs is restricted by network security groups (NSG)
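To make the "Databricks API" entry point more concrete, here is a minimal sketch that prepares (without sending) an authenticated call to list the clusters of a workspace. The workspace URL and the token are hypothetical placeholders. The sketch assumes the `/api/2.0/clusters/list` endpoint of the Databricks REST API and a personal access token passed as a bearer token.

```python
import urllib.request


def build_clusters_list_request(workspace_url: str,
                                token: str) -> urllib.request.Request:
    """Build a GET request for the cluster list of a Databricks workspace."""
    endpoint = f"{workspace_url.rstrip('/')}/api/2.0/clusters/list"
    return urllib.request.Request(
        endpoint,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


# Placeholder values: use your own workspace URL and personal access token.
list_request = build_clusters_list_request(
    workspace_url="https://westeurope.azuredatabricks.net/",
    token="dapi-not-a-real-token",
)
# To actually call the control plane: urllib.request.urlopen(list_request)
```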
Databricks setup
How to provision Databricks from Azure
Databricks setup – Creating workspace https://azure.microsoft.com/en-us/pricing/details/databricks/
Databricks setup – Creating workspace Control plane provisioned
Databricks setup – Creating workspace Control plane So far, nothing to worry about
Databricks setup – Creating clusters
Databricks setup – Creating clusters
Provisioning time is approx. 8 minutes
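The cluster created through the UI can also be described as a JSON document for the Databricks REST API (`POST /api/2.0/clusters/create`). The sketch below only builds that document. The cluster name, runtime version string and VM size are placeholder assumptions and should be checked against the values your own workspace offers.

```python
import json
from typing import Any, Dict


def build_cluster_spec(name: str, spark_version: str, node_type_id: str,
                       min_workers: int, max_workers: int,
                       idle_minutes: int) -> Dict[str, Any]:
    """Return a request body for the Databricks clusters/create endpoint."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("expected 0 < min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # Let the cluster scale between the two bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Terminate the cluster after this many idle minutes to limit cost.
        "autotermination_minutes": idle_minutes,
    }


# Placeholder values chosen for illustration only.
spec = build_cluster_spec(
    name="demo-cluster",
    spark_version="4.2.x-scala2.11",
    node_type_id="Standard_DS3_v2",
    min_workers=2,
    max_workers=4,
    idle_minutes=30,
)
body = json.dumps(spec, indent=2)
```

Setting an auto-termination timeout is a design choice worth making early, since an idle cluster keeps its virtual machines running.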
Databricks setup – Testing setup Impressive results!! ;)
Databricks setup – Behind the scenes
- A separate resource group, managed from the control plane
- It contains the network, VMs, storage and disks
- These are the cost drivers
Databricks setup – Behind the scenes
- Cluster terminated
- Virtual machines and networks removed
- Storage account remains
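A back-of-the-envelope estimate of the compute part of the bill can be sketched as follows. It assumes the Azure Databricks pricing model in which each node is charged for its virtual machine plus a number of Databricks Units (DBUs) per hour. Every price and DBU rate in the example is an invented placeholder, not a real quote, and storage, disks and network are left out.

```python
from typing import Dict


def estimate_cluster_cost(nodes: int, hours: float,
                          vm_price_per_hour: float,
                          dbu_per_node_hour: float,
                          price_per_dbu: float) -> Dict[str, float]:
    """Estimate VM and DBU charges for a cluster running for some hours."""
    if nodes < 1 or hours < 0:
        raise ValueError("nodes must be >= 1 and hours must be >= 0")
    vm_cost = nodes * hours * vm_price_per_hour
    dbu_cost = nodes * hours * dbu_per_node_hour * price_per_dbu
    return {"vm": vm_cost, "dbu": dbu_cost, "total": vm_cost + dbu_cost}


# Placeholder numbers only: one driver plus two workers, running for two hours.
estimate = estimate_cluster_cost(
    nodes=3,
    hours=2.0,
    vm_price_per_hour=0.5,
    dbu_per_node_hour=0.75,
    price_per_dbu=0.4,
)
```

Real rates depend on the VM size, the region and the Databricks tier, so the pricing page is the place to look up the inputs.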
Other resources
https://azure.microsoft.com/en-us/services/databricks/
https://blogs.msdn.microsoft.com/sqlcat/2016/08/18/migrating-data-to-azure-sql-data-warehouse-in-practice/
https://blogs.msdn.microsoft.com/sqlcat/2017/05/17/azure-sql-data-warehouse-loading-patterns-and-strategies/
https://blogs.msdn.microsoft.com/sqlcat/2017/09/05/azure-sql-data-warehouse-workload-patterns-and-anti-patterns/
https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK3377
https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK4016
https://databricks.com/product/azure
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices
Help deciding which Machine Learning tool to use
https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/data-science-and-machine-learning
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-what-is-azure-ml
https://docs.microsoft.com/en-us/azure/machine-learning/service/overview-more-machine-learning
Thank you!!