BRK3226: Build interactive data analysis environments using Apache Spark
Maxim Lukiyanov, Senior Program Manager, Big Data
Microsoft 2016
© 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Agenda
- How it all fits together
- Components: Apache Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
- Resource management
What is your top concern for big data projects?
#1: Length of development cycle
Length of development cycle
- Universal metric to track and improve
- Affects productivity
- Predicts project risk
Development phases
- Data exploration and experimentation
- Data sharing
- Development of production code
- Debugging
Interactive Spark on Azure
[Architecture diagram; labels recovered from the slide]
- Clients: Jupyter notebooks, IntelliJ/Eclipse, Command line, BI Tools
- Access paths: Livy server (REST), SSH, Thrift server (ODBC)
- YARN: Default Queue and Thrift Queue, hosting the Spark Applications
- Storage: Local HDFS, Blob Storage, Data Lake Store
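The Livy server (REST) path in this architecture can be illustrated with a small sketch. It assumes a Livy endpoint at a hypothetical URL and uses the `/sessions` and `/sessions/{id}/statements` routes of the Livy REST API. The host name, session id, and code string are placeholders that do not come from the slides, and the requests are only built, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; substitute the Livy URL of your own cluster.
LIVY_URL = "https://example-cluster.example.net/livy"


def livy_request(path, payload):
    """Build (but do not send) a JSON POST request for the Livy REST API."""
    return Request(
        url=LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session(kind="pyspark", queue="default"):
    """Request an interactive Spark session in the given YARN queue."""
    return livy_request("/sessions", {"kind": kind, "queue": queue})


def run_statement(session_id, code):
    """Request execution of a code snippet inside an existing session."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})
```

A notebook kernel such as sparkmagic performs calls of this shape on the user's behalf. Sending the request (for example with `urllib.request.urlopen`) and handling authentication are left out of the sketch.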
Components
Apache Spark
- Interactive compute engine
  - Interactive on small datasets
  - Interactive on large datasets on large clusters with in-memory or SSD caching
  - Built-in sampling
- New in Spark 2.0
  - Tungsten Phase 2 (3-10x speedup)
  - Structured Streaming
- Great momentum
  - Active and large community
  - Supported by all major big data vendors
  - Fast release cadence
Evolution of big data
[Diagram; surviving label: Data Sources]
Spark on Azure Cloud (HDInsight)
- Fully Managed Service
  - 100% open source Apache Spark and Hadoop bits
  - Latest releases of Spark (2.0 is coming later this week)
  - Fully supported by Microsoft and Hortonworks
  - 99.9% Azure Cloud SLA
  - Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
- Tools for data exploration, experimentation and development
  - Jupyter Notebooks (Scala, Python, automatic data visualizations)
  - IntelliJ/Eclipse plugin (job submission, remote debugging)
  - ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.
Demo: Components in action Maxim Lukiyanov
Resource Management
Interactive Spark on Azure
[Architecture diagram; labels recovered from the slide]
- Clients: Jupyter notebooks, IntelliJ/Eclipse, Command line, BI Tools
- Access paths: Livy server (REST), SSH, Thrift server (ODBC)
- YARN: Default Queue and Thrift Queue, hosting the Spark Applications
- Storage: Local HDFS, Blob Storage, Data Lake Store
YARN resource management
- Dynamic resource allocation (Thrift)
  - Thrift server adds executors when processing SQL queries
  - After a timeout it shrinks back
- Resource preemption (between queues)
  - Thrift will take resources from other apps during activity, and vice versa
  - When multiple apps are active, the resources are shared fairly
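The grow-and-shrink behavior described for the Thrift server corresponds to Spark's dynamic allocation settings. The sketch below assembles the relevant `spark.dynamicAllocation.*` properties and renders them as `--conf` arguments. The property names are standard Spark configuration keys, but the executor counts and the idle timeout are illustrative assumptions, not values from the talk.

```python
def dynamic_allocation_conf(min_executors=1, max_executors=20, idle_timeout_s=60):
    """Spark properties that let an application grow and shrink its executors.

    The numeric defaults are placeholders chosen for illustration.
    """
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this are handed back to YARN.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }


def to_submit_args(conf):
    """Render a property dict as a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args
```

The resulting list can be appended to a launcher command line for the Thrift server or any other Spark application that should release resources when idle.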
YARN resource management: Limitations
- Bugs
  - Capacity resource scheduler + Default resource calculator configuration works
  - Dominant resource calculator breaks preemption logic
- Limitations
  - No resource preemption between applications
  - No application sharing between notebooks in Livy
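The working combination named above (Capacity scheduler with the Default resource calculator, plus preemption between queues) could be expressed in Hadoop configuration roughly as follows. This is a sketch under assumptions: the queue name `thrift` and the 50/50 capacity split are hypothetical, while the property keys and the calculator class name are standard YARN settings.

```python
import xml.etree.ElementTree as ET

# capacity-scheduler.xml: two queues using the Default resource calculator.
# Queue name "thrift" and the capacity percentages are illustrative only.
CAPACITY_SCHEDULER = {
    "yarn.scheduler.capacity.resource-calculator":
        "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator",
    "yarn.scheduler.capacity.root.queues": "default,thrift",
    "yarn.scheduler.capacity.root.default.capacity": "50",
    "yarn.scheduler.capacity.root.thrift.capacity": "50",
    # Allow either queue to borrow the whole cluster while the other is idle.
    "yarn.scheduler.capacity.root.default.maximum-capacity": "100",
    "yarn.scheduler.capacity.root.thrift.maximum-capacity": "100",
}

# yarn-site.xml: turn on the scheduler monitor that drives preemption.
YARN_SITE = {"yarn.resourcemanager.scheduler.monitor.enable": "true"}


def to_hadoop_xml(props):
    """Render a property dict in Hadoop's <configuration> XML layout."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")
```

Calling `to_hadoop_xml(CAPACITY_SCHEDULER)` yields a fragment that can be compared against a cluster's existing `capacity-scheduler.xml`. On a managed service these values are normally edited through the cluster management UI, not by hand.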
Summary
Components
- Apache Spark
- Jupyter + sparkmagic kernel (or Zeppelin)
- Livy job server
- Apache YARN resource management using queues and preemption
- Columnar file formats (Parquet, ORC)
- IntelliJ IDEA + plugin for HDInsight
- [Non-OSS]
- BI Tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
- Azure Cloud
Techniques
- Sample, sample, sample
- CACHE TABLE (or auto-caching using Alluxio)
- Scale out on demand using elasticity of the cloud
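The first two techniques, sampling and CACHE TABLE, can be combined in a single Spark SQL statement. The helper below only builds the statement text; the table names are hypothetical, and the statement relies on Spark SQL's `CACHE TABLE ... AS SELECT` and `TABLESAMPLE (n PERCENT)` syntax. It is a minimal sketch, not the approach shown in the session's demo.

```python
def cache_sample_sql(source_table, cached_name, percent):
    """Build a Spark SQL statement that caches a percentage sample of a table.

    Table names are interpolated directly, so only plain identifiers
    are accepted.
    """
    for name in (source_table, cached_name):
        if not name.isidentifier():
            raise ValueError(f"not a plain identifier: {name!r}")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return (
        f"CACHE TABLE {cached_name} AS "
        f"SELECT * FROM {source_table} TABLESAMPLE ({percent} PERCENT)"
    )
```

In a notebook the returned string would be passed to `spark.sql(...)`. Later exploratory queries can then run against the small cached table and be repeated on the full table once the logic is settled.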
Resources
- SparkMagic kernel for Jupyter notebook: https://github.com/jupyter-incubator/sparkmagic
- Livy job server: https://github.com/cloudera/livy
- IntelliJ IDEA plug-in documentation: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-intellij-tool-plugin/
- NYTaxi data science notebooks: https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-spark-overview/
Q & A Maxim Lukiyanov