11/21/2018 11:32 PM BRK3316 Operationalizing Microsoft Cognitive Toolkit and TensorFlow models with HDInsight Spark Mary Wahl Data Scientist, AI Enablement Artificial Intelligence & Research @ Microsoft © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Common customer request: Train a DNN at scale on a huge pool of collected images, and apply it in real time to new images.
On further investigation: Very few of those images are labeled, and the customer would like the model’s predictions on the rest.
Machine Learning, Analytics, & Data Science Conference
Session Goals
- Introduce an example use case
- Explain methods for DNN operationalization with PySpark
  - Using Cognitive Toolkit (CNTK) and TensorFlow (TF) APIs
  - Using MMLSpark
- Highlight common and insidious errors
- Enable attendees to adapt the methods
Example use case: aerial image classification
Land use classification of aerial imagery
- Large, freely available, labeled datasets
  - Imagery: National Agriculture Imagery Program, every two years
  - Labels: National Land Cover Database, every five years (with delay)
- Common need in industry and government
  - Enforce regulations, collect taxes, geopolitical surveillance
  - Monitor crop performance, property value estimation, marketing
- Example image labels shown: Barren, Forested, Shrub, Cultivated, Grassland, Developed
Selecting training and validation data
Training method: transfer learning
- Adapts pretrained models for new tasks
  - Used AlexNet and 52-layer ResNet pretrained on ImageNet classification task
- Accommodates smaller training datasets
  - Avoids overfitting by retraining only part of the model
  - Used a balanced training set of 44k labeled images
- Lower computation burden
  - Performed retraining in under one hour on a single-GPU Windows Data Science Virtual Machine
Data readers offer huge benefits during training
- Minibatching
  - Makes efficient use of multiple cores
  - Improves gradient estimation
  - Faster convergence (potentially)
- Queuing
  - Pre-load the next minibatch while the GPU processes the current one
- Distributed training
  - Partition data between workers
- Transformations
  - Add diversity through random cropping/scaling/colorization
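The minibatching and queuing ideas above can be sketched in plain Python. This is an illustrative stand-in, not how the CNTK or TensorFlow readers are implemented; the names `minibatches` and `prefetch` and the `depth` parameter are hypothetical.

```python
import queue
import threading


def minibatches(samples, batch_size):
    """Yield consecutive lists of at most batch_size samples."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, minibatch
        yield batch


def prefetch(batches, depth=2):
    """Load up to `depth` minibatches ahead on a background thread."""
    buffer = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for batch in batches:
            buffer.put(batch)  # blocks while the buffer is full
        buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buffer.get()
        if batch is done:
            return
        yield batch


# Ten samples in minibatches of four, loaded ahead of the consumer.
batches = list(prefetch(minibatches(range(10), batch_size=4)))
```

The bounded queue is what lets loading overlap with computation: the producer keeps filling the buffer while the consumer (the GPU, in the slide's scenario) works on the current minibatch.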
Most commonly used data readers
- Cognitive Toolkit (CNTK): “MAP file” lists filename and label for each image in the training set
  - Read by MinibatchSource
- TensorFlow: “TFRecords” are binary files containing images and labels
  - Read by TFRecordReader
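As a rough illustration of the MAP file idea, the sketch below writes one line per image. It assumes the common layout of a file path and an integer label index separated by a tab; the helper name `write_map_file`, the file paths, and the alphabetical class ordering are all hypothetical.

```python
import io


def write_map_file(labeled_files, class_names, out):
    """Write one 'path<TAB>label index' line per image."""
    class_index = {name: i for i, name in enumerate(class_names)}
    for path, class_name in labeled_files:
        out.write(f"{path}\t{class_index[class_name]}\n")


# Hypothetical class ordering and file paths, for illustration only.
classes = ["Barren", "Cultivated", "Developed", "Forested", "Grassland", "Shrub"]
files = [("train/Developed/tile_0001.png", "Developed"),
         ("train/Forested/tile_0002.png", "Forested")]

buffer = io.StringIO()  # stands in for an open text file
write_map_file(files, classes, buffer)
map_text = buffer.getvalue()
```

Whatever label ordering is chosen at training time has to be reused when interpreting predictions at scoring time.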
Quick look: data preparation and use in training
Batch scoring with CNTK and TF models on HDInsight Spark
Motivation for operationalizing DNNs on Spark
- Reduces image data transfer latency
  - Cluster and images can be located on the same Azure Data Lake Store (HDFS)
- Even scoring with DNNs is a time-intensive task
  - Often 100s of milliseconds per image on CPU
- Split scoring task over arbitrarily many worker nodes
  - No interdependency -> “Embarrassingly parallel” scoring is possible
- Familiar Python interface to Cognitive Toolkit/TensorFlow
Operationalization architecture on Azure
- Diagram components: Azure HDInsight Spark; Azure Data Lake Store (HDFS) - or - Azure storage account
Replicating image loading steps on Spark
- Can’t use the data readers that we used during training:
  - Cognitive Toolkit: MinibatchSource expects local file access to images listed in MAP files
  - TensorFlow: Can’t realistically write TFRecords for all files
- Alternative: match the data loading steps that each reader performed during training with custom code
Image pre-processing with OpenCV
- Color channels loaded in “BGR” order
  - Many other packages load images in RGB order
- Image data dimensions: “# color channels x width x height”
  - Many other packages load images with dimensions “width x height x # channels”
- Data type (float vs. int, precision) may also differ
- NB: some mistakes have a surprisingly small effect on prediction accuracy!
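The three mismatches above (channel order, axis order, data type) can each be fixed in one step. The sketch below assumes an image that was loaded by a package returning RGB, channels-last, 8-bit data, and a model that expects BGR, channels-first, 32-bit floats; the function name `to_model_input` is hypothetical. NumPy indexes rows (height) before columns (width), so the axes are written here as height and width. Any scaling or mean subtraction applied during training would need to be matched as well and is not shown.

```python
import numpy as np


def to_model_input(rgb_hwc):
    """Convert an RGB, height x width x channels, uint8 image to
    BGR, channels x height x width, float32."""
    bgr_hwc = rgb_hwc[:, :, ::-1]               # reverse channel order: RGB -> BGR
    bgr_chw = np.transpose(bgr_hwc, (2, 0, 1))  # move the channel axis to the front
    return np.ascontiguousarray(bgr_chw, dtype=np.float32)


image = np.zeros((2, 3, 3), dtype=np.uint8)  # 2 rows, 3 columns, 3 channels
image[..., 0] = 255                          # pure red when read as RGB
model_input = to_model_input(image)
```

After conversion the red values sit in the last channel plane (`model_input[2]`), which is where a BGR model expects them.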
Split scoring task appropriately among workers
- Divide images into n partitions:
    image_rdd = sc.binaryFiles('adl://account_name.azuredatalakestore.net/images/*.png', minPartitions=num_workers).coalesce(num_workers)
- Map partitions to workers:
    labeled_images = image_rdd.mapPartitions(image_scoring_func).collect()
- Workers access data through a tuple generator:
    def image_scoring_func(file_generator):
        predicted_labels = []  # one entry per scored image
        for file in file_generator:
            # file is a two-tuple: [0] filename, [1] byte data
            ...
        return predicted_labels
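The per-partition scoring pattern can be tried without a cluster. The sketch below is an assumption-laden stand-in: `score_partition`, `fake_load_model`, and `fake_decode` are hypothetical names, and the "model" is a toy rule rather than a CNTK or TensorFlow network. It shows the two properties that matter for this pattern: the model is loaded once per partition, and results are produced lazily from the (filename, bytes) generator.

```python
def score_partition(files, load_model, decode):
    """Score one partition; `files` yields (filename, bytes) tuples."""
    model = load_model()  # once per partition, not once per image
    for filename, raw_bytes in files:
        yield filename, model(decode(raw_bytes))


# Stand-ins so the sketch runs without a cluster or a trained model.
def fake_load_model():
    return lambda pixels: "Developed" if sum(pixels) > 100 else "Forested"


def fake_decode(raw_bytes):
    return list(raw_bytes)


# One simulated partition, shaped like the tuples sc.binaryFiles produces.
partition = iter([("a.png", bytes([90, 90])), ("b.png", bytes([1, 2]))])
results = list(score_partition(partition, fake_load_model, fake_decode))
```

On a real cluster, a function of this shape could be wrapped in a lambda and handed to `mapPartitions`, with the stand-ins replaced by real model loading and image decoding.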
Demo: Batch scoring on Azure HDInsight Spark
Results: Parallelization and processing time
- Measured time required to score an entire balanced test set of 11,760 images
- From 38 minutes to <1 minute through parallelization (using CPU-only workers)
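A quick back-of-the-envelope check on these figures, assuming the 38-minute measurement corresponds to scoring the images one after another and treating "<1 minute" as an upper bound of one minute:

```python
n_images = 11_760
sequential_minutes = 38
parallel_minutes = 1  # upper bound; the slide reports "<1 minute"

# Average time spent per image when scoring sequentially.
seconds_per_image = sequential_minutes * 60 / n_images

# Lower bound on the speedup, since the parallel run took under a minute.
min_speedup = sequential_minutes / parallel_minutes

print(f"{seconds_per_image * 1000:.0f} ms per image")
print(f"at least {min_speedup:.0f}x faster")
```

Under that assumption the per-image cost comes out just under 200 ms, in line with the earlier "100s of milliseconds per image on CPU" estimate.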
Results: overall classification accuracy
- ~80% for both CNTK and TensorFlow
Operationalizing CNTK models with Microsoft Machine Learning for Apache Spark
Microsoft Machine Learning for Apache Spark (MMLSpark)
- Easily ingest and preprocess images from HDFS
  - Seamless integration with CNTK and OpenCV
- Featurize images and other inputs with pretrained DNNs
  - Bring your own model (BYOM) or use one of many pretrained CNTK models
  - Can use a GPU edge node to accelerate this process
- Train classifiers on featurized images
  - Fast form of transfer learning that does not require GPU compute
Demo: Training and scoring with DNNs using MMLSpark
Results: Identifying newly-developed regions
- Figure panels: 2010, 2016
Results: Predicting land use in Middlesex County, MA in 2016
- Most recent ground-truth labels are from 2011
- Red: developed; white: cultivated; green: all others (undeveloped)
- Come visit us at Microsoft’s NERD Center!
Where to learn more:
- End-to-end tutorial covering the aerial image classification use case, with sample data/code/models: aka.ms/aerialimageclassification
- Download MMLSpark and examples from: https://github.com/Azure/mmlspark
- You can reach me (Mary Wahl) at mawah@Microsoft.com
Please evaluate this session
- From your PC or tablet: visit MyIgnite https://myignite.microsoft.com/evaluations
- From your phone: download and use the Microsoft Ignite mobile app https://aka.ms/ignite.mobileapp
- Your input is important!