BigDL Deep Learning Library on HDInsight (THR3040)
Microsoft Ignite, September 2017
Xiaoyong Zhu, Microsoft; Sergey Ermolin, Intel
© Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
BigDL within the Spark framework
End-to-end big data analytics with deep learning functionality, directly on Spark.
- Natively integrated with the big data (Hadoop/Spark) ecosystem
- Massively distributed, scale-out
- Sends compute to data
- Fault tolerance
- Elasticity
- Incremental scaling
- Dynamic resource sharing
https://software.intel.com/bigdl
BigDL features (component diagram)
- APIs: Python API, Scala API
- Examples and documents
- Models: Seq2Seq, VGG, ResNet, LeNet, Inception
- Training: SGD, Adagrad, cross entropy, distributed training
- Layers: batch normalization, spatial convolution, ReLU, LRN, RNN, pooling, and 100+ others
- Core: Tensor, MKL integration
BigDL features
- Distributed deep learning applications (training, fine-tuning, and prediction) on Apache Spark
- No changes to existing Hadoop/Spark clusters needed
BigDL benefits
- Lets you write deep learning applications as standard Spark programs
- Runs on top of existing Spark or Hadoop/Hive clusters
- Adds rich deep learning functionality to Apache Spark
- Feature parity with Caffe and TensorFlow
- High performance through Intel MKL and multi-threaded programming
- Efficient scale-out with all-reduce communication on Spark
- Open source since 2016
https://github.com/intel-analytics/BigDL
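The "all-reduce" point can be made concrete with a small pure-Python simulation. This is only a sketch of the general idea, not BigDL's implementation: each worker is assumed to own one slice of the parameters, average that slice of every worker's gradient, and then share the result so that all workers apply the same update. The function names (`split_slices`, `all_reduce_mean`, `sgd_step`) and the toy numbers are hypothetical.

```python
from typing import List


def split_slices(n_params: int, n_workers: int) -> List[range]:
    """Partition parameter indices into n_workers contiguous slices."""
    base, extra = divmod(n_params, n_workers)
    slices, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        slices.append(range(start, start + size))
        start += size
    return slices


def all_reduce_mean(local_grads: List[List[float]]) -> List[float]:
    """Average gradients across workers, one parameter slice per worker.

    Step 1 (reduce-scatter): the owner of a slice averages that slice of
    every worker's local gradient.
    Step 2 (all-gather): the averaged slices together form the full
    gradient that every worker then sees.
    """
    n_workers = len(local_grads)
    n_params = len(local_grads[0])
    reduced = [0.0] * n_params
    for indices in split_slices(n_params, n_workers):
        for i in indices:  # in a real cluster, done by the slice owner
            reduced[i] = sum(g[i] for g in local_grads) / n_workers
    return reduced


def sgd_step(weights: List[float], local_grads: List[List[float]],
             lr: float = 0.1) -> List[float]:
    """Apply one synchronous SGD update using the averaged gradient."""
    avg = all_reduce_mean(local_grads)
    return [w - lr * g for w, g in zip(weights, avg)]


# Two simulated workers, five parameters (toy values).
weights = [1.0, 2.0, 3.0, 4.0, 5.0]
local_grads = [
    [1.0, 0.0, 2.0, 0.0, 4.0],  # gradient computed on worker 0's data
    [3.0, 2.0, 0.0, 0.0, 0.0],  # gradient computed on worker 1's data
]
new_weights = sgd_step(weights, local_grads, lr=0.5)
print(new_weights)
```

Because every worker handles only its own slice, the aggregation work is spread evenly instead of funnelling through a single driver or parameter server.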
BigDL can reuse and fine-tune models from other frameworks
- Loads existing Caffe, Torch, and TensorFlow models
- Allows transition from single-node to distributed application deployment
- Useful for inference
- Allows minor model tuning
- Allows model sharing between data scientists and production engineers
- Scoring can be done outside of Spark, as a Java application
Diagram labels: Caffe, TensorFlow, and Torch model files (load); BigDL; BigDL model file (load, save); storage.
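A minimal sketch of how loading might look from the BigDL Python API. The `Model.load_caffe_model`, `Model.load_torch`, `Model.load_tensorflow`, and `Model.load` calls are believed to exist in the BigDL 0.x Python package (`bigdl.nn.layer`), but their exact signatures should be checked against the installed version. The wrapper `load_pretrained` and its keyword names are hypothetical, and a running SparkContext with the BigDL engine initialized is assumed before any load is attempted.

```python
def load_pretrained(kind, **paths):
    """Load a model saved by another framework into BigDL.

    kind  : "caffe", "torch", "tensorflow", or "bigdl"
    paths : file locations; the key names used here are hypothetical.
    """
    supported = ("caffe", "torch", "tensorflow", "bigdl")
    if kind not in supported:
        raise ValueError("unsupported model kind: %r" % (kind,))

    # Imported lazily so the argument check above works even where
    # BigDL is not installed. Assumes the BigDL 0.x Python package.
    from bigdl.nn.layer import Model

    if kind == "caffe":
        # Caffe needs both the network definition and the weights file.
        return Model.load_caffe_model(paths["def_path"], paths["model_path"])
    if kind == "torch":
        return Model.load_torch(paths["path"])
    if kind == "tensorflow":
        # TensorFlow graphs also need the input and output node names.
        return Model.load_tensorflow(
            paths["path"], paths["inputs"], paths["outputs"])
    # A model previously saved by BigDL itself.
    return Model.load(paths["path"])
```

A call such as `load_pretrained("caffe", def_path=..., model_path=...)` would then return a BigDL model that can be fine-tuned or used for distributed prediction like any model built natively.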
BigDL integration with Spark Streaming
Integration with Spark Streaming for runtime training and prediction.
Diagram labels: HDFS/S3, Kafka, Flume, Kinesis, Twitter (sources); Spark Streaming; RDDs; Train; BigDL Model; Predict; Evaluator; StreamWriter.
Python API support
Based on PySpark, the BigDL Python API allows use of existing Python libraries:
- NumPy
- SciPy
- pandas
- scikit-learn
- Matplotlib
Install with: $ pip install bigdl
Jupyter Notebook support
Run BigDL applications directly in Jupyter notebooks.
- Share and reproduce: notebooks can be shared with others and are easy to reproduce and track
- Rich content: text, images, videos, LaTeX, and JavaScript; code can also produce rich content
- Rich toolbox: Apache Spark from Python, R, and Scala; pandas, scikit-learn, ggplot2, dplyr, etc.
Visualization of the optimization process with TensorBoard
- BigDL integrates with TensorBoard
- TensorBoard is a suite of web applications from Google for visualizing and understanding deep learning applications
HDInsight on Linux Overview
HDInsight (Linux) supports:
- Hive, Hive LLAP, and standard Hadoop: ETL, reporting, ad hoc queries, data mining and analysis, log analysis, data warehousing
- Spark: real-time analysis, streaming analysis, machine learning, ETL, graph analysis, real-time SQL queries
- R Server: advanced analytics over big data, machine learning, statistical analysis
- HBase and Phoenix: NoSQL storage with SQL-friendly interfaces (Phoenix), suitable for key-value stores or schema-changing logs
- Storm: real-time streaming analysis
- Kafka: high-throughput data ingestion engine
Scale compute and storage independently
Diagram labels: gateway nodes; head, worker, edge, and ZooKeeper nodes; Azure Blob Storage or Azure Data Lake Store.
Demo
Train a CNN model on the MNIST dataset
- Install BigDL on HDInsight (easy as 1-2-3)
- Configure Spark settings
- Set up BigDL parameters
- Set up network topologies
- Run, train, and see results
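The "Configure Spark settings" and "Set up BigDL parameters" steps can be sketched as follows. The demo's actual values are not recoverable from the slides, so everything here is an assumption: the four Spark properties are the ones believed to ship in BigDL's bundled `spark-bigdl.conf`, the session fields follow the Livy session format that HDInsight notebooks are understood to accept through `%%configure`, and the rule that the batch size should be a multiple of the total core count reflects BigDL's documented guidance as best recalled. The helper names and cluster sizes are hypothetical.

```python
import json

# Assumption: settings believed to match BigDL's bundled spark-bigdl.conf.
BIGDL_SPARK_CONF = {
    "spark.shuffle.reduceLocality.enabled": "false",
    "spark.shuffle.blockTransferService": "nio",
    "spark.scheduler.minRegisteredResourcesRatio": "1.0",
    "spark.speculation": "false",
}


def round_up_batch_size(requested, num_executors, executor_cores):
    """Round a batch size up to a multiple of the total core count."""
    total_cores = num_executors * executor_cores
    # Ceiling division, then scale back up to a multiple of total_cores.
    return -(-requested // total_cores) * total_cores


def session_config(num_executors, executor_cores, executor_memory="8g"):
    """Build a Livy-style session body (field names per the Livy REST API)."""
    return {
        "numExecutors": num_executors,
        "executorCores": executor_cores,
        "executorMemory": executor_memory,
        "conf": dict(BIGDL_SPARK_CONF),
    }


# Hypothetical cluster: 4 executors with 8 cores each (32 cores in total).
config = session_config(num_executors=4, executor_cores=8)
batch_size = round_up_batch_size(100, num_executors=4, executor_cores=8)

print(json.dumps(config, indent=2, sort_keys=True))
print("batch size:", batch_size)
```

In a notebook, the printed JSON is the kind of body that would follow a `%%configure -f` cell magic; the BigDL jar path would also need to be added and is left out here because it depends on the installation.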
Set up HDInsight Cluster in a few steps
Monitor HDInsight Cluster via Ambari GUI
BigDL is easily installed and built (“Deploy to Azure”)
Spark Session configuration
Network Layout
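The screenshot for this slide is not recoverable, so the following is a sketch under stated assumptions rather than the demo's network: it traces tensor shapes through a LeNet-style CNN for 28x28 MNIST images, using the layer sizes believed to match BigDL's LeNet-5 example (5x5 convolutions with 6 and 12 feature maps, 2x2 max pooling, a 100-unit hidden layer, and 10 output classes). The helpers `conv_out` and `lenet_shapes` are hypothetical names.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * pad - kernel) // stride + 1


def lenet_shapes(height=28, width=28):
    """Return (layer name, output shape) pairs for a LeNet-style network."""
    shapes = [("input", (1, height, width))]

    h, w = conv_out(height, 5), conv_out(width, 5)
    shapes.append(("conv1 5x5, 6 maps", (6, h, w)))

    h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)
    shapes.append(("pool1 2x2", (6, h, w)))

    h, w = conv_out(h, 5), conv_out(w, 5)
    shapes.append(("conv2 5x5, 12 maps", (12, h, w)))

    h, w = conv_out(h, 2, stride=2), conv_out(w, 2, stride=2)
    shapes.append(("pool2 2x2", (12, h, w)))

    # Flatten the feature maps before the fully connected layers.
    shapes.append(("flatten", (12 * h * w,)))
    shapes.append(("fc1", (100,)))
    shapes.append(("fc2 (class scores)", (10,)))
    return shapes


for name, shape in lenet_shapes():
    print("%-20s %s" % (name, shape))
```

Tracing the shapes this way shows where the size of the first fully connected layer's input comes from: it is the number of feature maps times the spatial size left after the second pooling step.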
To learn more about BigDL + HDInsight
- github.com/intel-analytics/BigDL
- software.intel.com/bigdl
- https://blogs.msdn.microsoft.com/azuredatalake/2017/03/17/how-to-use-bigdl-on-apache-spark-for-azure-hdinsight/