Deep Dive with Microsoft Cognitive Toolkit (CNTK)


Deep Dive with Microsoft Cognitive Toolkit (CNTK)
Cha Zhang, Principal Applied Science Manager, Microsoft AI & Research

Introduction

Microsoft Cognitive Toolkit (CNTK)
Microsoft's open-source deep-learning toolkit: https://github.com/Microsoft/CNTK
Created by Microsoft Speech researchers in 2012 as the "Computational Network Toolkit"
Open-sourced on GitHub since Jan 2016 (MIT license)
12,000+ GitHub stars; third among all DL toolkits
140+ contributors from MIT, Stanford, Nvidia, Intel, and many others

Microsoft Cognitive Toolkit (CNTK)
Runs over 80% of Microsoft's internal DL workload:
Cortana, HoloLens, Bing and Bing Ads, Skype Translator, Windows, Office, Microsoft Cognitive Services, Microsoft Research
Internal == External

Microsoft Cognitive Toolkit (CNTK)
1st-class on Linux and Windows; Docker support
Rich API support:
Mostly implemented in C++ (train and eval)
Low-level + high-level Python API
R and C# API for train and eval
UWP, Java and Spark support
Built-in readers for distributed learning
Keras backend support (beta)
Model compression (fast binarized evaluation)
Latest version: v2.2 (Sep. 15, 2017)

Toolkit Benchmark
Benchmarking by HKBU (http://dlbench.comp.hkbu.edu.hk/), version 8; single Tesla K80 GPU, CUDA 8.0, cuDNN v5.1
Toolkit versions: Caffe 1.0rc5 (39f28e4), CNTK 2.0 Beta10 (1ae666d), MXNet 0.93 (32dc3a2), TensorFlow 1.0 (4ac9c09), Torch 7 (748f5e3)

Time per minibatch, lower is better; the LSTM row shows v7 benchmark results in parentheses:

| Network (batch) | Caffe     | Cognitive Toolkit   | MXNet                | TensorFlow    | Torch                  |
| FCN5 (1024)     | 55.329ms  | 51.038ms            | 60.448ms             | 62.044ms      | 52.154ms               |
| AlexNet (256)   | 36.815ms  | 27.215ms            | 28.994ms             | 103.960ms     | 37.462ms               |
| ResNet (32)     | 143.987ms | 81.470ms            | 84.545ms             | 181.404ms     | 90.935ms               |
| LSTM (256)      | -         | 43.581ms (44.917ms) | 288.142ms (284.898ms)| - (223.547ms) | 1130.606ms (906.958ms) |

Production-Ready Scalability

Microsoft Cognitive Toolkit: First Deep Learning Framework Fully Optimized for Pascal
Toolkit delivering near-linear multi-GPU scaling
(chart: AlexNet performance, images/sec, 170x faster vs. CPU server)
AlexNet training, batch size 128, Grad Bit = 32; dual-socket E5-2699v4 CPUs (44 cores total); CNTK 2.0b3 (to be released) includes cuDNN 5.1.8, NCCL 1.6.1, NVLink enabled

Scale with NCCL 2.0 (GTC, May 2017); released in CNTK v2.2 (Sep. 2017)

Scale to 1,000 GPUs

CNTK Overview

Installation
$ pip install <url>
https://docs.microsoft.com/en-us/cognitive-toolkit/Setup-CNTK-on-your-machine

Tutorials

Tutorials on Azure (Free Pre-Hosted) https://notebooks.azure.com/cntk/libraries/tutorials

https://github.com/Microsoft/CNTK

CNTK expresses (nearly) arbitrary neural networks by composing simple building blocks into complex computational networks, supporting relevant network types and applications.

What is CNTK? Example: a 2-hidden-layer feed-forward NN, written as math (left) and as CNTK code (right):

h1 = σ(W1 x + b1)             →  h1 = sigmoid(x @ W1 + b1)
h2 = σ(W2 h1 + b2)            →  h2 = sigmoid(h1 @ W2 + b2)
P = softmax(Wout h2 + bout)   →  P = softmax(h2 @ Wout + bout)

with input x ∈ R^M and one-hot label y ∈ R^J, and the cross-entropy training criterion

ce = y^T log P                →  ce = cross_entropy(P, y)

maximized over the corpus: Σ_corpus ce = max

What is CNTK?
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, y)
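For readers following along in code, here is a minimal runnable sketch of these four lines using the CNTK Python API primitives. The dimensions (784 inputs, hidden layers of 400 and 200, 10 classes) are assumptions for illustration, and the real API folds the softmax into cross_entropy_with_softmax rather than taking log P explicitly:

import cntk as C

# Assumed sizes (e.g. MNIST-like data); the slides leave M and J open.
M, H1, H2, J = 784, 400, 200, 10

x = C.input_variable(M)     # input
y = C.input_variable(J)     # one-hot label

W1   = C.parameter((M, H1), init=C.glorot_uniform()); b1   = C.parameter(H1)
W2   = C.parameter((H1, H2), init=C.glorot_uniform()); b2   = C.parameter(H2)
Wout = C.parameter((H2, J), init=C.glorot_uniform()); bout = C.parameter(J)

h1 = C.sigmoid(C.times(x, W1) + b1)
h2 = C.sigmoid(C.times(h1, W2) + b2)
z  = C.times(h2, Wout) + bout                # logits
ce = C.cross_entropy_with_softmax(z, y)      # softmax folded into the loss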

What is CNTK?
h1 = sigmoid(x @ W1 + b1)
h2 = sigmoid(h1 @ W2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, y)
(diagram: the four lines drawn as a computational graph, from inputs x and y and parameters W1, b1, W2, b2, Wout, bout through •, +, sigmoid (s), softmax and cross_entropy nodes up to ce)

What is CNTK?
Nodes: functions (primitives); can be composed into reusable composites
Edges: values, incl. tensors and sparse data
Automatic differentiation: ∂F/∂in = ∂F/∂out ∙ ∂out/∂in
Deferred computation → execution engine
Editable, clonable
LEGO-like composability allows CNTK to support a wide range of networks & applications

CNTK Workflow
A script configures and executes through the CNTK Python APIs:
reader: minibatch source, task-specific deserializer, automatic randomization, distributed reading (fed from the corpus)
network: model function, criterion function, CPU/GPU execution engine, packing/padding (the model)
trainer: SGD (momentum, Adam, …), minibatching

As Easy as 1-2-3

from cntk import *

# reader
def create_reader(path, is_training):
    ...

# network
def create_model_function():
    ...
def create_criterion_function(model):
    ...

# trainer (and evaluator)
def train(reader, model):
    ...
def evaluate(reader, model):
    ...

# main function
model = create_model_function()

reader = create_reader(..., is_training=True)
train(reader, model)

reader = create_reader(..., is_training=False)
evaluate(reader, model)

Workflow
Prepare data
Configure reader, network, learner (Python)
Train:
python my_cntk_script.py

Prepare Data: Reader

def create_reader(map_file, mean_file, is_training):
    # image preprocessing pipeline
    transforms = [
        ImageDeserializer.crop(crop_type='Random', ratio=0.8, jitter_type='uniRatio'),
        ImageDeserializer.scale(width=image_width, height=image_height,
                                channels=num_channels, interpolations='linear'),
        ImageDeserializer.mean(mean_file)
    ]
    # deserializer
    return MinibatchSource(ImageDeserializer(map_file, StreamDefs(
        features = StreamDef(field='image', transforms=transforms),
        labels   = StreamDef(field='label', shape=num_classes)
    )), randomize=is_training,
        epoch_size=INFINITELY_REPEAT if is_training else FULL_DATA_SWEEP)

Automatic on-the-fly randomization is important for large data sets.
Readers compose, e.g. image → text caption.

Prepare Network: Multi-Layer Perceptron

Model:
z = model(x):
    h1 = Dense(400, act=relu)(x)
    h2 = Dense(200, act=relu)(h1)
    r = Dense(10, act=None)(h2)
    return r

Loss: cross_entropy_with_softmax(z, Y)
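A runnable version of this pseudocode, assuming the slide's act stands for the Layers API's activation parameter and an MNIST-like 784-dimensional input (an assumption; the slide leaves the input size open):

import cntk as C
from cntk.layers import Dense

x = C.input_variable(784)    # assumed input size
y = C.input_variable(10)

def create_model(features):
    h1 = Dense(400, activation=C.relu)(features)
    h2 = Dense(200, activation=C.relu)(h1)
    return Dense(10, activation=None)(h2)    # logits; softmax lives in the loss

z = create_model(x)
loss = C.cross_entropy_with_softmax(z, y)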

Prepare Network: Convolutional Neural Network

(input: 28×28-pixel images)
z = model(x):
    h = Convolution2D((5,5), filt=8, …)(x)
    h = MaxPooling(…)(h)
    h = Convolution2D((5,5), filt=16, …)(h)
    r = Dense(output_classes, act=None)(h)
    return r
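A runnable sketch of this CNN under the same caveats: filt on the slide maps to num_filters in the Layers API, the elided … arguments are filled with common choices, and 10 output classes is an assumption:

import cntk as C
from cntk.layers import Convolution2D, MaxPooling, Dense, Sequential

x = C.input_variable((1, 28, 28))            # 1 channel, 28x28 pixels
model = Sequential([
    Convolution2D((5, 5), num_filters=8, pad=True, activation=C.relu),
    MaxPooling((2, 2), strides=(2, 2)),      # assumed pooling geometry
    Convolution2D((5, 5), num_filters=16, pad=True, activation=C.relu),
    Dense(10, activation=None),              # output_classes assumed to be 10
])
z = model(x)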

A Sequence Example (many-to-many, 1:1 aligned)
Problem: tagging entities in Air Travel Information System (ATIS) data
show → O, burbank → From_city, to → O, seattle → To_city, flights → O, tomorrow → Date
http://karpathy.github.io/2015/05/21/rnn-effectiveness/

Prepare Network: Recurrent Neural Network

(diagram: one-hot text token x(t), 1×943 → Embedding E: in 943, out 150 → LSTM L: in 150, out 300, with h(t-1) → h(t), a = sigmoid → Dense D: in 300, out 129 → class label y(t))

z = model():
    return Sequential([
        Embedding(emb_dim=150),
        Recurrence(LSTM(hidden_dim=300), go_backwards=False),
        Dense(num_labels=129)
    ])
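The same tagger as a runnable sketch; the slide's emb_dim/hidden_dim/num_labels keywords are descriptive labels, while the Layers API takes plain shapes, and the sequence axis comes from sequence.input_variable:

import cntk as C
from cntk.layers import Embedding, Recurrence, LSTM, Dense, Sequential

vocab_size, num_labels = 943, 129            # sizes from the diagram
x = C.sequence.input_variable(vocab_size, is_sparse=True)  # one-hot tokens

model = Sequential([
    Embedding(150),
    Recurrence(LSTM(300), go_backwards=False),
    Dense(num_labels)
])
z = model(x)                                 # one 129-way prediction per token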

Prepare Learner
Many built-in learners: SGD, SGD with momentum, Adagrad, RMSProp, Adam, Adamax, AdaDelta, etc.
Specify a learning rate schedule and momentum schedule; if wanted, specify a minibatch size schedule:

lr_schedule = C.learning_rate_schedule([0.05]*3 + [0.025]*2 + [0.0125],
                                       C.UnitType.minibatch, epoch_size=100)
sgd_learner = C.sgd(z.parameters, lr_schedule)

Overall Train Workflow

z = model():
    return Sequential([
        Embedding(emb_dim=150),
        Recurrence(LSTM(hidden_dim=300), go_backwards=False),
        Dense(num_labels=129)
    ])

ATIS train minibatch: 96 samples; input feature: 96 × x(t); one-hot encoded label Y: 96 × 129 per sample (or per word in sequence)
Loss: cross_entropy_with_softmax(z, Y)
Error: classification_error(z, Y)
Choose a learner (SGD, Adam, Adagrad, etc.), then:
trainer = Trainer(model, (loss, error), learner)
trainer.train_minibatch({X, Y})
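A minimal sketch of this trainer wiring, reusing x and z from the tagger sketch above; the Adam settings are assumptions, not the deck's values:

import cntk as C

y = C.sequence.input_variable(129)           # one-hot label per token

loss  = C.cross_entropy_with_softmax(z, y)
error = C.classification_error(z, y)

lr      = C.learning_rate_schedule(0.1, C.UnitType.minibatch)   # assumed rate
learner = C.adam(z.parameters, lr, momentum=C.momentum_schedule(0.9))
trainer = C.Trainer(z, (loss, error), [learner])

# per step: fetch a minibatch (e.g. from a MinibatchSource) and train on it:
# trainer.train_minibatch({x: batch_features, y: batch_labels})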

Distributed Training
Prepare data
Configure reader, network, learner (Python)
Train, distributed:
mpiexec --np 16 --hosts server1,server2,server3,server4 \
    python my_cntk_script.py
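In the script itself, distribution amounts to wrapping the learner; a sketch based on the CNTK 2.x distributed API, reusing z, loss and error from above (the quantization setting is illustrative; see 1-bit SGD later in this deck):

import cntk as C

learner = C.sgd(z.parameters,
                lr=C.learning_rate_schedule(0.1, C.UnitType.minibatch))
distributed_learner = C.train.distributed.data_parallel_distributed_learner(
    learner, num_quantization_bits=1)        # gradient quantization, see below
trainer = C.Trainer(z, (loss, error), [distributed_learner])
# ... train as usual; each MPI rank processes its share of every minibatch ...
C.train.distributed.Communicator.finalize()  # clean MPI shutdown at the end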

CNTK Deep Dive

CNTK Unique Features
Symbolic loops over sequences with dynamic scheduling
Turning the graph into a parallel program through minibatching
Unique parallel training algorithms (1-bit SGD, Block Momentum)

Symbolic Loops over Sequential Data
Extend our example to a recurrent network (RNN); note the code has no explicit notion of time:

h1(t) = σ(W1 x(t) + R1 h1(t-1) + b1)   →  h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2(t) = σ(W2 h1(t) + R2 h2(t-1) + b2)  →  h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P(t) = softmax(Wout h2(t) + bout)      →  P = softmax(h2 @ Wout + bout)
ce(t) = L^T(t) log P(t)                →  ce = cross_entropy(P, L)
Σ_corpus ce(t) = max

Symbolic Loops over Sequential Data

h1 = sigmoid(x @ W1 + past_value(h1) @ R1 + b1)
h2 = sigmoid(h1 @ W2 + past_value(h2) @ R2 + b2)
P = softmax(h2 @ Wout + bout)
ce = cross_entropy(P, L)

(diagram: the corresponding computational graph, with z^-1 delay nodes feeding h1 and h2 back through R1 and R2)
CNTK automatically unrolls cycles at execution time; cycles are detected with Tarjan's algorithm. Efficient and composable.
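past_value is an ordinary op, so the delay can be inspected directly; a tiny sketch, assuming the top-level alias C.past_value matches the bare past_value used in the code above and a zero initial state:

import numpy as np
import cntk as C

x = C.sequence.input_variable(1)
delayed = C.past_value(x, initial_state=0.0)   # one-step delay along the sequence

seq = np.array([[1.], [2.], [3.]], dtype=np.float32)
print(delayed.eval({x: [seq]}))    # [[0.], [1.], [2.]]: each step sees the previous one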

Batch-Scheduling of Variable-Length Sequences
Minibatches containing sequences of different lengths are automatically packed and padded; batching is fully transparent.
(diagram: time steps computed in parallel across parallel sequences 1-7, with padding filling the unused slots; a short sequence scheduled into the same slot as a longer one may come for free)
CNTK handles the special cases:
the past_value operation correctly resets state and gradient at sequence boundaries
non-recurrent operations just pretend there is no padding ("garbage-in/garbage-out")
sequence reductions mask out the padding
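CNTK does this packing internally, but the idea is easy to reproduce; a toy numpy sketch of padding plus a mask so that a per-sequence reduction ignores the padding:

import numpy as np

# Three sequences of different lengths packed into one (3 x T) batch.
seqs = [np.array([1., 2., 3., 4.]), np.array([5., 6.]), np.array([7., 8., 9.])]
T = max(len(s) for s in seqs)
batch = np.zeros((len(seqs), T))
mask  = np.zeros((len(seqs), T))
for i, s in enumerate(seqs):
    batch[i, :len(s)], mask[i, :len(s)] = s, 1.0

# Masked per-sequence mean: the padding contributes nothing.
means = (batch * mask).sum(axis=1) / mask.sum(axis=1)
print(means)    # [2.5  5.5  8. ]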

Data-Parallel Training
Data-parallelism: distribute each minibatch over workers, then all-reduce the partial gradients.
(diagram: a minibatch of sample frames split across nodes 1-3; each node computes the sub-gradient of one layer for its sub-minibatch, and the sub-gradients are summed (Σ) via all-reduce)
Ring all-reduce: each gradient is cut into K stripes, one per node, giving (K-1) concurrent transfers of 1/K-th of the data.
Step 1: each node owns aggregating one stripe.
Step 2: the aggregated stripes are redistributed.
Ring algorithm cost: O(2 (K-1)/K · M) → O(1) with respect to K.
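A toy single-process simulation of the two phases (reduce-scatter of stripes, then all-gather); real training uses MPI/NCCL, and it is the ring communication pattern, not simulated here, that yields the O(2 (K-1)/K · M) transfer volume:

import numpy as np

def ring_allreduce(grads):                   # grads: one array per "node"
    K = len(grads)
    stripes = [np.array_split(g, K) for g in grads]
    # step 1 (reduce-scatter): node k ends up owning the sum of stripe k
    summed = [sum(stripes[n][k] for n in range(K)) for k in range(K)]
    # step 2 (all-gather): the aggregated stripes are redistributed to all nodes
    return [np.concatenate(summed) for _ in range(K)]

parts = [np.ones(8) * (i + 1) for i in range(4)]   # fake per-node gradients
print(ring_allreduce(parts)[0])                    # every node sees the sum: all 10s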

Data-Parallel Training
Is O(1) communication enough? Example: DNN, minibatch size 1024, 160M model parameters
compute per minibatch: ≈ 1/7 second
communication per minibatch: ≈ 1/9 second (640 MB, i.e. 160M parameters × 4 bytes, at 6 GB/s)
We can't even parallelize to 2 GPUs: communication cost already dominates!

Data-Parallel Training
How to reduce communication cost?
Communicate less each time: 1-bit SGD [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "1-Bit Stochastic Gradient Descent ... Distributed Training of Speech DNNs", Interspeech 2014]
Quantize gradients to 1 bit per value; the trick: carry the quantization error over to the next minibatch
Communicate less often:
Automatic minibatch sizing [F. Seide, H. Fu, J. Droppo, G. Li, D. Yu: "On Parallelizability of Stochastic Gradient Descent ...", ICASSP 2014]
Block momentum [K. Chen, Q. Huo: "Scalable training of deep learning machines by incremental block training ...", ICASSP 2016]: a very recent, very effective parallelization method that combines model averaging with the error-residual idea
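The 1-bit idea in miniature: a toy numpy sketch of sign quantization with error feedback as described above (the per-call mean scaling is one common choice; the paper describes per-column variants):

import numpy as np

def quantize_1bit(grad, residual):
    g = grad + residual                       # add back last step's quantization error
    pos = g >= 0
    scale_pos = g[pos].mean() if pos.any() else 0.0
    scale_neg = g[~pos].mean() if (~pos).any() else 0.0
    q = np.where(pos, scale_pos, scale_neg)   # 1 bit/value + two scales transmitted
    return q, g - q                           # new residual = quantization error

residual = np.zeros(6)
for step in range(3):
    grad = np.random.randn(6)
    q, residual = quantize_1bit(grad, residual)   # residual carries to next minibatch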

CNTK C# and R Bindings

C# API Overview
A basic C# training API was added in the CNTK 2.2 release:
Supports building of DNNs
Supports DNN training
Supports DNN evaluation
End-to-end examples in C#

CNTK C# API Highlights
Over 100 basic operations to build a computation network: Sigmoid, Tanh, ReLU, Plus, Minus, Convolution, Pooling, BatchNormalization, …
For example, a logistic regression loss function:

Function z = CNTKLib.Times(weightParam, input) + biasParam;
Function loss = CNTKLib.CrossEntropyWithSoftmax(z, labelVariable);

The CNTK Function is the primitive element used to build a DNN, through composition of basic operations. For example, to build a ResNet node:

Function conv = CNTKLib.Pooling(CNTKLib.Convolution(convParam, input),
                                PoolingType.Average, poolingWindowShape);
Function resNetNode = CNTKLib.ReLU(CNTKLib.Plus(conv, input));

CNTK C# API Highlights
Batching support: provides MinibatchSource and MinibatchData to help with data loading and batching
Training support: supports the SGD optimizers commonly seen in the DNN literature: MomentumSGDLearner, AdamLearner, AdaGradLearner, etc.
For example, to train a model with an Adam stochastic optimizer:

var parameterLearners = new List<Learner>() {
    Learner.AdamLearner(classifierOutput.Parameters(), learningRate, momentum) };
var trainer = Trainer.CreateTrainer(classifierOutput, trainingLoss, prediction,
                                    parameterLearners);

Evaluation support

Handwriting Data Classification (MNIST)
Build a convolution model:

Function conv = CNTKLib.ReLU(CNTKLib.Convolution(convParams, features, strides));
Function pooling = CNTKLib.Pooling(conv, PoolingType.Max, poolingWindow, stride, padding);
Function classifier = TestHelper.Dense(pooling, numClasses, device, Activation.None);

Prepare data:

var minibatchSource = MinibatchSource.TextFormatMinibatchSource("Train_cntk_text.txt",
    streamConfigurations, MinibatchSource.InfinitelyRepeat);

Handwriting Data Classification (MNIST), cont.
Training (one iteration):

var minibatchData = minibatchSource.GetNextMinibatch(minibatchSize, device);
var arguments = new Dictionary<Variable, MinibatchData>
{
    { input, minibatchData[featureStreamInfo] },
    { labels, minibatchData[labelStreamInfo] }
};
trainer.TrainMinibatch(arguments, device);

CNTK C# API Examples
More examples at: https://github.com/Microsoft/CNTK/tree/master/Examples/TrainingCSharp
CifarResNetClassifier: shows how to do image classification using ResNet
TransferLearning: demonstrates transfer learning using a pretrained ResNet model
LSTMSequenceClassifier: shows how to build and train a recurrent neural network model

CNTK R-Binding
devtools::install_github("Microsoft/CNTK-R")
Comprehensive coverage of CNTK functionality:
Readers (CTF, Image, HTK)
Layers library (fully-connected, convolution, pooling, dropout, LSTM, GRU, …)
Loss functions
Normalizers
Optimizers (including 1-bit SGD)
Model saving and loading
Exposed via idiomatic R functional APIs
Interop with R objects and packages like ggplot
Fluent syntax for constructing layers based on magrittr
High performance, including support for multi-node and multi-GPU

Implementation
R bindings based on the reticulate package (https://rstudio.github.io/reticulate/):
Enables importing and calling Python functions from R
Ensures full parity with functionality in the Python bindings (such as the layers library)
Enables interop between R objects and NumPy
However, it requires a full installation of the Python bindings
Converting back and forth between R and Python objects is expensive; this is not an issue, though, if it is only done at the beginning and end of training, or when using the native Python readers and deserializers

CIFAR-10 Training Example
Import libraries
Define the reader function

CIFAR-10 Training Example
Define feature and label variables in the graph
Low-level operations start with an "op" prefix; functional notation is used instead of operator overloading
Normalize the inputs

CIFAR-10 Training Example
Define the network with four convolution and two max-pool layers using the Layers library
Notice the use of the "For" loop construct from CNTK

CIFAR-10 Training Example
Define the loss function
Initialize the learning algorithm
The z$parameters notation is used for accessing properties of a Python object

CIFAR-10 Training Example
Run the training algorithm until convergence
Notice the use of fluent syntax via the magrittr pipe notation, "%>%"

CIFAR-10 Training Example Convergence plot using ggplot

Future Work
More examples and vignettes, particularly for text and NLP
Tighter R integration: formula objects; data frames and matrices; integration of trained models with the summary and predict generic functions
Integration with TensorBoard

More Information
Source code: https://github.com/Microsoft/CNTK-R
Vignettes and documentation: https://microsoft.github.io/CNTK-R/
Issues and feedback: https://github.com/Microsoft/CNTK-R/issues and https://docs.microsoft.com/en-us/cognitive-toolkit/Feedback-Channels

CNTK Moving Forward

Open Neural Network Exchange (ONNX)
An open-source intermediate representation (IR) of computation graphs (https://github.com/onnx/onnx), with defined common ops and their semantics
Released on Sep. 7, 2017; a collaboration between Microsoft and Facebook
A shared library, with a Caffe2 example as reference
Permissive MIT license and no patents

ONNX Motivation
Allow interoperability between frameworks, starting with CNTK, Caffe2 and PyTorch
Allow hardware vendors to focus on one IR in their backend optimization
Allow training in one toolkit and deployment in another

ONNX Vision

Current Status
V1 released on GitHub, focused on the basics: inference only, no loops and no RNNs
Caffe2 and PyTorch support now; CNTK support in early October
In V2, we expect to add: RNN support; loop and control constructs; a model zoo
We look forward to more support from other frameworks

CNTK Upcoming New Features
Model optimization and codegen for fast inference:
Model compression via SVD, quantization, etc.
Fast inference via codegen
C# API for training:
Low-level C# API released in v2.2 (Sep. 2017)
High-level API design in progress

CNTK Upcoming New Features (cont.)
CrossTalk: porting models created in other toolkits (Caffe, Caffe2 and TensorFlow) to CNTK; will use ONNX as the intermediate representation
Dynamic graph support: data-dependent graphs; ease of debugging
More examples, better documentation

Conclusions

Conclusions
Performance
Speed: faster than other toolkits, 5-10x faster on recurrent networks
Scalability: a few lines of change to scale to thousands of GPUs
Built-in readers: efficient distributed readers
Programmability
Powerful C++ library for enterprise users
Intuitive and performant Python APIs
Support for C#/.NET, UWP, R, Java, etc.
Extensible via user-defined custom layers, learners, readers, etc.

Thank You!
https://github.com/Microsoft/CNTK

