Microsoft Machine Learning & Data Science Summit

September 26 – 27 | Atlanta, GA

Big, fast and data-furious… with Spark
Maxim Lukiyanov, Senior Program Manager, Big Data, Microsoft

Session objectives and takeaways
Session objectives:
  Discover tools and techniques enabling interactive data analysis on Spark
  Explore interactive Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
  Discuss problems and solutions of resource management in Spark
Key takeaway 1: Productivity of data scientists is bound by the speed of the development cycle
Key takeaway 2: Speed of the development cycle in big data projects can be maintained at a high level as long as the right tools and techniques are utilized

What is your top concern for big data projects?

Length of Development Cycle
#1: Length of Development Cycle

Length of development cycle
Universal metric to track and improve
Affects productivity
Predicts project risk

Development phases
Data exploration and experimentation
Data sharing
Development of production code
Debugging

Interactive Spark on Azure
[Architecture diagram. Clients: Jupyter notebooks, IntelliJ/Eclipse, command line (SSH), BI tools (ODBC). Servers: Livy server (REST), Thrift server. YARN queues: Default queue, Thrift queue, hosting Spark applications. Storage: local HDFS, Blob Storage, Data Lake Store.]

Components

Apache Spark
Interactive compute engine
  Interactive on small datasets
  Interactive on large datasets on large clusters with in-memory or SSD caching
  Built-in sampling
Upcoming in Spark 2.0
  Tungsten Phase 2 (3-10x speedup)
  Structured Streaming
Great momentum
  Active and large community
  Supported by all major big data vendors
  Fast release cadence
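The sampling and caching points above can be sketched as a pair of Spark SQL statements. This is only an illustration: the table names `trips` and `trips_sample` are hypothetical, and the helper merely builds statement text that could be run from a notebook or through the Thrift server. It assumes a Spark SQL version that accepts `TABLESAMPLE (n PERCENT)` and `CACHE TABLE ... AS SELECT`.

```python
from typing import List


def sampled_cache_statements(source_table: str, percent: float,
                             cached_name: str) -> List[str]:
    """Build Spark SQL that caches a percentage sample of a table.

    The first statement materializes the sample in the cache; the second
    is a cheap query against the cached sample.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return [
        f"CACHE TABLE {cached_name} AS "
        f"SELECT * FROM {source_table} TABLESAMPLE ({percent:g} PERCENT)",
        f"SELECT COUNT(*) FROM {cached_name}",
    ]


# Hypothetical table names, for illustration only.
statements = sampled_cache_statements("trips", 1, "trips_sample")
for statement in statements:
    print(statement)
```

Working against a small cached sample keeps each query interactive; the same statements can later be pointed at the full table once the logic is settled.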

Evolution of big data Data Sources

Spark on Azure Cloud (HDInsight)
Fully Managed Service
  100% open source Apache Spark and Hadoop bits
  Latest releases of Spark
  Fully supported by Microsoft and Hortonworks
  99.9% Azure Cloud SLA
  Certifications: PCI, ISO 27018, SOC, HIPAA, EU-MC
Tools for data exploration, experimentation and development
  Jupyter Notebooks (Scala, Python, automatic data visualizations)
  IntelliJ/Eclipse plugin (job submission, remote debugging)
  ODBC connector for Power BI, Tableau, Qlik, SAP, Excel, etc.

Demo: Components in action
Maxim Lukiyanov

Resource management

Interactive Spark on Azure
[Architecture diagram. Clients: Jupyter notebooks, IntelliJ/Eclipse, command line (SSH), BI tools (ODBC). Servers: Livy server (REST), Thrift server. YARN queues: Default queue, Thrift queue, hosting Spark applications. Storage: local HDFS, Blob Storage, Data Lake Store.]

Yarn resource management
Dynamic resource allocation (Thrift)
  Thrift server adds executors when processing SQL queries
  After timeout it shrinks back
Resource preemption (between queues)
  Thrift will take resources from other apps during activity and vice versa
  When multiple apps are active the resources are shared fairly
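The grow-and-shrink behavior described above is governed by Spark's dynamic allocation properties. The sketch below assembles them as `--conf` arguments of the kind accepted by `spark-submit` (and by the Thrift server start script, which takes the same options). The numeric values and the queue name `thrift` are assumptions chosen for illustration, not settings taken from the talk.

```python
from typing import Dict, List


def dynamic_allocation_conf(min_executors: int, max_executors: int,
                            idle_timeout_s: int, queue: str) -> Dict[str, str]:
    """Spark properties that let an application add and release executors."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service keeps shuffle files available
        # after an idle executor has been released.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for this long are given back to YARN.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
        # YARN queue the application is submitted to.
        "spark.yarn.queue": queue,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Flatten a property dict into repeated --conf key=value arguments."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only.
conf = dynamic_allocation_conf(min_executors=0, max_executors=20,
                               idle_timeout_s=60, queue="thrift")
print(" ".join(to_submit_args(conf)))
```

A minimum of zero executors lets an idle application release almost all of its resources, at the cost of a short ramp-up when the next query arrives.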

Yarn resource management: Limitations
Bugs
  Capacity resource scheduler + Default resource calculator configuration works
  Dominant resource calculator breaks preemption logic
Limitations
  No resource preemption between applications
  No application sharing between notebooks in Livy
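A minimal sketch of the configuration the slide reports as working: the Capacity scheduler with the Default resource calculator and preemption between two queues. The property names are standard YARN settings; the 50/50 capacity split and the queue names `default` and `thrift` are assumptions for illustration. On a managed cluster these values would normally be set through the cluster's management tooling rather than by editing files by hand.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def scheduler_files(queue_capacities: Dict[str, int]) -> Dict[str, Dict[str, str]]:
    """Return YARN properties grouped by the config file they belong to."""
    if sum(queue_capacities.values()) != 100:
        raise ValueError("queue capacities must add up to 100 percent")
    yarn_site = {
        "yarn.resourcemanager.scheduler.class":
            "org.apache.hadoop.yarn.server.resourcemanager."
            "scheduler.capacity.CapacityScheduler",
        # Enable the monitor that performs preemption between queues.
        "yarn.resourcemanager.scheduler.monitor.enable": "true",
        "yarn.resourcemanager.scheduler.monitor.policies":
            "org.apache.hadoop.yarn.server.resourcemanager."
            "monitor.capacity.ProportionalCapacityPreemptionPolicy",
    }
    capacity = {
        # The Default calculator is the combination reported as working.
        "yarn.scheduler.capacity.resource-calculator":
            "org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator",
        "yarn.scheduler.capacity.root.queues": ",".join(queue_capacities),
    }
    for name, percent in queue_capacities.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        capacity[f"{prefix}.capacity"] = str(percent)
        # Let a busy queue borrow idle capacity from the other queue.
        capacity[f"{prefix}.maximum-capacity"] = "100"
    return {"yarn-site.xml": yarn_site, "capacity-scheduler.xml": capacity}


def to_hadoop_xml(properties: Dict[str, str]) -> str:
    """Render properties in Hadoop's <configuration> XML layout."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative queue split only.
files = scheduler_files({"default": 50, "thrift": 50})
print(to_hadoop_xml(files["capacity-scheduler.xml"]))
```

With `maximum-capacity` at 100, either queue can expand into idle capacity, and the preemption policy reclaims it when the other queue becomes active.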

Summary
Components
  Apache Spark
  Jupyter + sparkmagic kernel (or Zeppelin)
  Livy job server
  Apache YARN resource management using queues and preemption
  Columnar file formats (Parquet, ORC)
  IntelliJ/Eclipse + plugin for HDInsight [Non-OSS]
  BI Tools: Power BI, Tableau, Qlik, SAP, Excel, etc.
  Azure Cloud
Techniques
  Sample, sample, sample
  CACHE TABLE (or auto-caching using Alluxio)
  Scale out on demand using elasticity of the cloud
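To make the Livy job server component concrete, here is a hedged sketch of the two REST calls an interactive client makes: one to create a session and one to run a statement in it. The endpoints `/sessions` and `/sessions/{id}/statements` are part of Livy's REST API; the host name is hypothetical, and authentication and any cluster-specific headers are left out. The code only builds the requests and does not send them.

```python
import json
import urllib.request
from typing import Any, Dict


def livy_post(base_url: str, path: str,
              payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def create_session_request(base_url: str,
                           kind: str = "pyspark") -> urllib.request.Request:
    """Request a new interactive session of the given kind."""
    return livy_post(base_url, "/sessions", {"kind": kind})


def run_statement_request(base_url: str, session_id: int,
                          code: str) -> urllib.request.Request:
    """Request execution of a code snippet inside an existing session."""
    return livy_post(base_url, f"/sessions/{session_id}/statements",
                     {"code": code})


# Hypothetical endpoint; 8998 is Livy's default port.
LIVY_URL = "http://livy.example:8998/"

session_req = create_session_request(LIVY_URL)
statement_req = run_statement_request(LIVY_URL, 0, "1 + 1")
print(session_req.get_method(), session_req.full_url)
print(statement_req.get_method(), statement_req.full_url)

# To actually send a request against a reachable server:
#     with urllib.request.urlopen(session_req) as response:
#         print(json.load(response))
```

The sparkmagic kernel issues calls of this kind on the notebook's behalf, which is why a notebook can drive a Spark application running on a remote cluster.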

In review: session objectives and takeaways
Session objectives:
  Discover tools and techniques enabling interactive data analysis on Spark
  Explore interactive Spark, Notebooks, Job submission server, BI Tools, Developer Tools, Azure Cloud
  Discuss problems and solutions of resource management in Spark
Key takeaway 1: Productivity of data scientists is bound by the speed of the development cycle
Key takeaway 2: Speed of the development cycle can remain high even in big data projects as long as the right tools and techniques are utilized

Related content
SparkMagic kernel for Jupyter notebook
Livy job server
IntelliJ IDEA plug-in documentation
Azure Spark Documentation
