SINGA: Putting Deep Learning into the Hands of Multimedia Users

SINGA: Putting Deep Learning into the Hands of Multimedia Users http://singa.apache.org/ Wei Wang, Gang Chen, Tien Tuan Anh Dinh, Jinyang Gao, Beng Chin Ooi, Kian-Lee Tan, and Sheng Wang

Outline: Introduction (multimedia data and applications); Motivations (deep learning models and training, and design principles); SINGA (usability, scalability, implementation); Experiments.

Introduction. Multimedia data (audio, image/video, text) from social media, e-commerce and health-care applications. Startups built around multimedia analytics: VocalIQ (acquired by Apple), Madbits (acquired by Twitter), Perceptio (acquired by Apple), LookFlow (acquired by Yahoo! Flickr), Deepomatic (e-commerce product search), Descartes Labs (satellite images), Clarifai (tagging), Ldibon, ParallelDots, AlchemyAPI (acquired by IBM), Semantria (NLP tasks in more than 10 languages). We are in the era of Big Data, and most of it is multimedia in nature. Huge amounts of data are generated at unprecedented rates by modern applications, in a variety of formats. This creates many opportunities for startups that analyze the data, and deep learning has been noted as one of the most effective techniques for handling the complexity of these problems and extracting value from the data. Deep learning has been noted for its effectiveness for multimedia applications!

Motivations: Model Categories. Feedforward models (CNN, MLP, auto-encoder) are used for image/video classification. For example, the convolutional neural network (CNN), a feedforward model, has shown remarkable improvements in image classification (Krizhevsky, Sutskever, and Hinton, 2012; Szegedy et al., 2014; Simonyan and Zisserman, 2014a).

Motivations: Model Categories. Energy models (DBN, RBM, DBM) are used for speech recognition. The deep belief network (DBN), an energy model, is very effective for speech recognition (Dahl et al., 2012).

Motivations: Model Categories. Recurrent neural networks (RNN, LSTM, GRU) are used for natural language processing. Recurrent neural networks have shown promising performance in modelling sequential data and in NLP applications (Mikolov et al., 2010; Cho et al., 2014).

Motivations: Design Goal I (Usability): easy to implement various models. To recap the model categories: feedforward models (CNN, MLP, auto-encoder) for image/video classification, energy models (DBN, RBM, DBM) for speech recognition, and recurrent neural networks (RNN, LSTM, GRU) for natural language processing. As in other domains, different applications require different models, and these models are not easy to implement and are hard to tune. Therefore, one of the main goals of SINGA is to provide a general programming model that helps users implement various applications. Our aim is to make life easier for users.

Motivations: Training Process. Training updates the model parameters to minimize the prediction error. The training algorithm is mini-batch Stochastic Gradient Descent (SGD), with gradients computed by back-propagation (BP) or contrastive divergence (CD). Training time = (time per SGD iteration) x (number of SGD iterations). It takes a long time to train large models over large datasets, e.g., 2 weeks to train OverFeat (Sermanet et al.), as reported by Intel (https://software.intel.com/sites/default/files/managed/74/15/SPCS008.pdf). To deploy a deep learning model for an online application, we must first train the model. Training typically follows the SGD algorithm and is very slow, because it requires many iterations to converge and each iteration is costly. It is so computation intensive that it can take weeks or even months to train a large model on a large dataset. Distributed training is therefore a good way to address this issue.
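To make the cost structure of one SGD iteration concrete, here is a toy, self-contained C++ sketch of mini-batch SGD on a one-parameter model. It is not SINGA code; the hand-written gradient stands in for what BP or CD would compute in a real network.

#include <cstdio>
#include <vector>

// Toy example: fit y = w*x by minimizing squared error with mini-batch SGD.
// In a real deep learning system the gradient would come from BP or CD.
int main() {
  std::vector<double> xs = {1, 2, 3, 4}, ys = {2, 4, 6, 8};  // toy dataset
  double w = 0.0, lr = 0.01;              // parameter and learning rate
  const int batch = 2;                    // mini-batch size
  for (int iter = 0; iter < 1000; ++iter) {       // SGD iterations
    double grad = 0.0;
    for (int i = 0; i < batch; ++i) {             // one mini-batch
      int k = (iter * batch + i) % static_cast<int>(xs.size());
      grad += 2 * (w * xs[k] - ys[k]) * xs[k];    // d/dw of the squared error
    }
    w -= lr * grad / batch;                       // parameter update
  }
  std::printf("learned w = %f\n", w);             // converges to about 2.0
  return 0;
}

The total training time is exactly the product on the slide: the cost of the loop body times the number of iterations needed to converge.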

Motivations: Distributed Training Frameworks. Synchronous training (Google Sandblaster, Dean et al., 2012; Baidu AllReduce, Wu et al., 2015) reduces the time per iteration; it scales well on a single node with multiple GPUs but cannot scale to a large cluster. Asynchronous training (Google Downpour, Dean et al., 2012; Hogwild!, Recht et al., 2011) reduces the number of iterations per machine; it scales to big clusters of commodity (CPU) machines but is not stable. There are also hybrid frameworks. Design Goal II (Scalability): not just flexible, but also efficient and adaptive enough to run different training frameworks. We can use distributed training to reduce the training time by using more computing resources. Scalability depends on the parallelism scheme, that is, the training framework. There are basically two categories of distributed training frameworks, synchronous and asynchronous, and both have strengths and weaknesses. Synchronous training improves the efficiency of each iteration and tends to work well on a single node with multiple GPUs, but it cannot scale to large clusters. Asynchronous training reduces the number of training iterations and is more suitable for a cluster of CPU machines, but it is not as stable as synchronous training, possibly due to conflicting and delayed parameter updates (this is validated both in Google's paper and in our experiments). Different frameworks suit different application scenarios, so we need a system that is not just flexible but also adaptive, able to run different training frameworks in a scalable fashion.
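The following toy C++ sketch (not SINGA code; the gradient function and learning rate are invented for illustration) contrasts the two update schemes: the synchronous step aggregates all workers' gradients computed against the same parameter version, while the asynchronous step applies each gradient as soon as it arrives, so workers may compute against newer or staler values.

#include <numeric>
#include <vector>

double compute_gradient(double w, int worker) {   // stand-in for BP on a partition
  return (w - 2.0) * (worker + 1);                // toy gradient pulling w towards 2
}

// Synchronous: all workers' gradients are aggregated before one update.
void synchronous_step(double& w, int nworkers, double lr) {
  std::vector<double> grads;
  for (int i = 0; i < nworkers; ++i)
    grads.push_back(compute_gradient(w, i));      // all computed against the same w
  double sum = std::accumulate(grads.begin(), grads.end(), 0.0);
  w -= lr * sum / nworkers;                       // single aggregated update
}

// Asynchronous: each gradient is applied immediately (Downpour/Hogwild style),
// so later workers see a w that has already moved.
void asynchronous_step(double& w, int nworkers, double lr) {
  for (int i = 0; i < nworkers; ++i)
    w -= lr * compute_gradient(w, i);
}

int main() {
  double w_sync = 0.0, w_async = 0.0;
  for (int iter = 0; iter < 100; ++iter) {
    synchronous_step(w_sync, 4, 0.05);
    asynchronous_step(w_async, 4, 0.05);
  }
  return 0;
}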

SINGA: A Distributed Deep Learning Platform. We now present the design and system architecture of SINGA. In this overview, workers compute parameter gradients against the user-defined neural net in each SGD iteration, servers receive the parameter gradients and update the parameters, and the cluster topology specifies the training framework.

Usability: Abstraction. Layer is the core abstraction in SINGA; it carries out feature transformations.

class Layer {
  vector<Blob> data, grad;
  vector<Param*> param;
  ...
  void Setup(LayerProto& conf, vector<Layer*> src);
  void ComputeFeature(int flag, vector<Layer*> src);
  void ComputeGradient(int flag, vector<Layer*> src);
};
Driver::RegisterLayer<FooLayer>("Foo"); // register new layers

Unlike the programming models of other systems, which separate layer operations from layer features and parameters, a SINGA layer carries both features and parameters. In this way, a neural net can be constructed simply by connecting a set of layers. The abstraction also simplifies neural net partitioning, which is discussed below. Common layers are implemented as built-in layers and classified into five categories: input layers load raw data (and labels); output layers output features (and prediction results); neuron layers apply non-linear feature transformations, e.g., convolution and pooling; loss layers measure the training loss, e.g., cross-entropy loss; connection layers connect layers when the neural net is partitioned.
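As an illustration of this abstraction, below is a hedged sketch of a user-defined "Foo" layer written against a simplified version of the interface above. Blob, LayerProto and the base Layer here are trivial stand-ins so that the example compiles on its own; the real SINGA signatures differ in detail.

#include <algorithm>
#include <string>
#include <vector>

struct Blob { std::vector<float> data; };       // stand-in for SINGA's Blob
struct LayerProto { std::string name; };        // stand-in for the protobuf config

struct Layer {                                  // simplified base class
  std::vector<Blob> data, grad;
  virtual ~Layer() = default;
  virtual void Setup(const LayerProto& conf, const std::vector<Layer*>& src) = 0;
  virtual void ComputeFeature(int flag, const std::vector<Layer*>& src) = 0;
  virtual void ComputeGradient(int flag, const std::vector<Layer*>& src) = 0;
};

// A ReLU-like "Foo" layer: the forward pass clamps negatives to zero,
// the backward pass masks the gradient coming from the layer above.
struct FooLayer : Layer {
  void Setup(const LayerProto& conf, const std::vector<Layer*>& src) override {
    data.resize(1);
    grad.resize(1);
    data[0].data.resize(src[0]->data[0].data.size());
    grad[0].data.resize(src[0]->data[0].data.size());
  }
  void ComputeFeature(int flag, const std::vector<Layer*>& src) override {
    const auto& in = src[0]->data[0].data;
    for (size_t i = 0; i < in.size(); ++i)
      data[0].data[i] = std::max(0.0f, in[i]);
  }
  void ComputeGradient(int flag, const std::vector<Layer*>& src) override {
    const auto& in = src[0]->data[0].data;
    for (size_t i = 0; i < in.size(); ++i)
      src[0]->grad[0].data[i] = (in[i] > 0) ? grad[0].data[i] : 0.0f;
  }
};

int main() {
  FooLayer source, foo;                          // use a FooLayer as a dummy source
  source.data.resize(1); source.grad.resize(1);
  source.data[0].data = {-1.0f, 0.5f, 2.0f};
  source.grad[0].data.resize(3);
  LayerProto conf{"foo"};
  foo.Setup(conf, {&source});
  foo.ComputeFeature(0, {&source});              // data becomes {0, 0.5, 2}
  foo.grad[0].data = {1.0f, 1.0f, 1.0f};
  foo.ComputeGradient(0, {&source});             // source.grad becomes {0, 1, 1}
  return 0;
}

In SINGA the new layer would then be registered with Driver::RegisterLayer<FooLayer>("Foo") and referenced by its name in the job configuration.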

Usability: Neural Net Representation. SINGA uses a uniform neural net representation for all three categories of models: a neural network consists of uni-directionally connected layers (for a feedforward model such as a CNN, for example, input, hidden and loss layers, with the loss layer also reading the labels). Feedforward models map directly onto this representation. An RNN is represented by unrolling its recurrent connections. To represent an energy model such as an RBM, each undirected connection is replaced with two directed connections.

Usability: TrainOneBatch. The TrainOneBatch function calls the layer functions to compute parameter gradients. Currently, we have implemented the back-propagation (BP) algorithm for feedforward models and RNNs, and the contrastive divergence (CD) algorithm for RBMs. Other training algorithms can be supported simply by overriding the TrainOneBatch function; this is what gives SINGA its usability.
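The sketch below (again with a stripped-down Layer stand-in, not SINGA's real classes) shows what a BP-style TrainOneBatch boils down to: one forward sweep over the layers in topological order followed by one backward sweep in reverse order. A CD-style override would instead run the positive and negative phases over an RBM's layers.

#include <cstdio>
#include <vector>

struct Layer {                                              // stripped-down stand-in
  virtual ~Layer() = default;
  virtual void ComputeFeature(const std::vector<Layer*>& src) = 0;   // forward
  virtual void ComputeGradient(const std::vector<Layer*>& src) = 0;  // backward
};

// BP-style TrainOneBatch: forward pass in topological order, then backward
// pass in reverse order; srcs[i] lists the source layers of net[i].
void TrainOneBatch(const std::vector<Layer*>& net,
                   const std::vector<std::vector<Layer*>>& srcs) {
  for (size_t i = 0; i < net.size(); ++i)
    net[i]->ComputeFeature(srcs[i]);        // compute features and the loss
  for (size_t i = net.size(); i-- > 0; )
    net[i]->ComputeGradient(srcs[i]);       // compute parameter and feature gradients
}

struct DummyLayer : Layer {                 // prints the call order for illustration
  const char* name;
  explicit DummyLayer(const char* n) : name(n) {}
  void ComputeFeature(const std::vector<Layer*>&) override { std::printf("fwd %s\n", name); }
  void ComputeGradient(const std::vector<Layer*>&) override { std::printf("bwd %s\n", name); }
};

int main() {
  DummyLayer a("input"), b("hidden"), c("loss");
  std::vector<Layer*> net = {&a, &b, &c};
  std::vector<std::vector<Layer*>> srcs = {{}, {&a}, {&b}};
  TrainOneBatch(net, srcs);   // prints fwd input, fwd hidden, fwd loss, bwd loss, ...
  return 0;
}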

Scalability: Partitioning for Distributed Training. NeuralNet partitioning options: 1. partition the layers into different subsets; 2. partition each single layer on the batch dimension; 3. partition each single layer on the feature dimension; 4. a hybrid strategy combining 1, 2 and 3. To distribute training onto multiple nodes, one way is to partition the model, i.e., the neural net; another is to partition the dataset. In SINGA, we partition the neural net and assign a subset of layers to each worker. First, we can assign different layers to different workers to parallelize them. Second, we can distribute the data of one mini-batch onto different workers. Third, we can let different workers compute different parts of each feature. Fourth, a hybrid partition is useful for some models, e.g., deep CNNs. Users just need to configure the partitioning scheme (1, 2, 3 or 4), and SINGA does the neural net partitioning automatically (i.e., it slices and connects the layers).
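A minimal sketch of options 2 and 3, assuming a mini-batch laid out as a samples-by-features matrix. SINGA performs this slicing, and inserts the required connection layers, automatically from the configured scheme; the code below only illustrates how the shapes are divided among workers.

#include <cstdio>
#include <vector>

struct Shape { int rows, cols; };   // rows = samples in the mini-batch, cols = features

// Split one layer's data across workers either along the batch (row) dimension
// or the feature (column) dimension.
std::vector<Shape> Partition(const Shape& s, int nworkers, bool on_batch_dim) {
  std::vector<Shape> parts;
  for (int w = 0; w < nworkers; ++w) {
    if (on_batch_dim) {
      int rows = s.rows / nworkers + (w < s.rows % nworkers ? 1 : 0);
      parts.push_back({rows, s.cols});   // each worker gets a subset of the samples
    } else {
      int cols = s.cols / nworkers + (w < s.cols % nworkers ? 1 : 0);
      parts.push_back({s.rows, cols});   // each worker computes a subset of the features
    }
  }
  return parts;
}

int main() {
  Shape layer{256, 1024};                           // 256 images, 1024 features
  for (const auto& p : Partition(layer, 4, true))   // batch-dimension partition
    std::printf("worker slice: %d x %d\n", p.rows, p.cols);   // 64 x 1024 each
  return 0;
}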

Scalability: Training Framework (Cluster Topology). (Figure: a worker group holding the neural net and a server group holding the parameters; legend: worker, server, node group, inter-node communication.) The cluster topology decides the training framework, which in turn affects scalability. With a single worker group and a single server group, SINGA workers run synchronously, each computing over a partition of the neural net. A model typically cannot be partitioned into too many pieces, so synchronous training cannot scale to a large group size.

Scalability: Training Framework (Cluster Topology). We can add more machines and more worker groups. However, if all worker groups communicate with a single server group, communication becomes the bottleneck. As always, parallelism breaks the bottleneck: we use multiple server groups to distribute the update computation and reduce the communication cost.

Scalability: Training Framework (Cluster Topology). By configuring the cluster topology, this design can run different existing training frameworks: synchronous frameworks such as (a) Sandblaster from Google and (b) AllReduce from Baidu, and asynchronous frameworks such as (c) Downpour and (d) Distributed Hogwild. With this, SINGA is able to configure most known frameworks, and it is extensible, efficient and scalable.
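The mapping from cluster topology to framework can be summarised roughly as in the sketch below. The field names are illustrative, not SINGA's actual configuration keys; the rule of thumb is that a single worker group trains synchronously, multiple worker groups train asynchronously, and servers either run on dedicated nodes or are co-located with the workers.

#include <cstdio>

struct ClusterTopology {     // illustrative fields, not SINGA's real config keys
  int nworker_groups;        // groups each training a replica of the model
  int nserver_groups;        // groups maintaining (replicas of) the parameters
  bool separate_servers;     // servers on dedicated nodes, or co-located with workers?
};

const char* Framework(const ClusterTopology& t) {
  if (t.nworker_groups == 1)                         // synchronous training
    return t.separate_servers ? "Sandblaster-style (sync)" : "AllReduce-style (sync)";
  return t.separate_servers ? "Downpour-style (async)"     // asynchronous training
                            : "Distributed-Hogwild-style (async)";
}

int main() {
  std::printf("%s\n", Framework({1, 1, true}));      // Sandblaster-style (sync)
  std::printf("%s\n", Framework({32, 32, false}));   // Distributed-Hogwild-style (async)
  return 0;
}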

Implementation: SINGA Software Stack. SINGA components: Driver, Stub, Worker and Server, with built-in CNN, RBM and RNN models; optional components include HDFS, local disk files, Mesos, Zookeeper and Docker, running on Ubuntu, CentOS or MacOS, locally or on remote nodes. Driver::Train() starts the job; the main thread runs Stub::Run(), each worker thread runs Worker::TrainOneBatch() in a loop until the stop condition is met, and each server thread runs Server::Update(). SINGA implements workers and servers as threads, so distributed training can run on a single node or in a cluster. SINGA has been seamlessly integrated with cloud computing software for easy management of computing resources and deployment.
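Below is a hedged sketch of this thread layout using plain std::thread; the function names mirror the slide but this is not SINGA's real implementation. Worker threads loop over a TrainOneBatch step, server threads loop over an Update step, and the main thread stands in for Stub::Run(), which in SINGA routes messages between workers, servers and remote nodes.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};   // the stop condition checked by every loop

void WorkerThread(int id) {
  while (!stop) {                // stands in for Worker::TrainOneBatch() in a loop
    std::printf("worker %d: TrainOneBatch\n", id);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}

void ServerThread(int id) {
  while (!stop) {                // stands in for Server::Update() in a loop
    std::printf("server %d: Update\n", id);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int i = 0; i < 2; ++i) threads.emplace_back(WorkerThread, i);  // worker threads
  threads.emplace_back(ServerThread, 0);                              // server thread
  // The main thread plays the role of the stub; here it just waits briefly
  // before signalling the stop condition.
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  stop = true;
  for (auto& t : threads) t.join();
  return 0;
}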

Deep Learning as a Service (DLaaS): SINGA's Rafiki. Third-party apps (web, mobile, ...) access Rafiki through an API, while developers use a GUI in the browser; both communicate with the Rafiki server via HTTP requests. The Rafiki server handles user, job, model and node management, routing (load balancing), a database, and a file storage system (e.g., HDFS), and dispatches work to Rafiki agents, which drive SINGA instances through Timon (a C++ wrapper). This service layer has been built on top of SINGA to support both developers and users, providing a GUI for selecting and launching built-in models and applications. Our aim is simple but ambitious: 1. to improve the usability of SINGA; 2. to level the playing field for those who want to work on or exploit deep learning, by taking care of the complex system plumbing as well as reliability, efficiency and scalability. SINGA can also be used for repeatability studies of other models.

Comparison: Features of the Systems (comparison with other open source projects; MXNet as of 28/09/15). The feature matrix compares SINGA, Caffe, CXXNET, cuda-convnet and H2O along the following dimensions: deep learning models (feed-forward CNN and MLP, energy models such as RBM, recurrent networks such as RNN), distributed training frameworks (synchronous, asynchronous, hybrid), hardware (CPU, GPU; GPU support in SINGA arrives with v0.2.0), cloud software (HDFS, resource management, virtualization), and language bindings (ongoing Python for SINGA, Python and Matlab for Caffe, Python for cuda-convnet, Python and R for H2O). There are a few open source systems out there, and here we compare SINGA against them in terms of features. Caffe has been well known for convolutional neural network training since 2014 (winner of the Open Source Software Competition 2014). CXXNET is another CNN training system written in C++. cuda-convnet was written by the author of the deep CNN paper and is specifically optimized for deep CNNs. H2O is written in Java and integrates well with cloud software. As can be seen, SINGA supports all the listed models and training frameworks, and the GPU version will be released next month.

Experiment --- Usability. We used SINGA to train three known models and verified the results. To verify the correctness of the system, we trained three different models and checked them against published results/benchmarks; I shall not go into the details here. Here we ran SINGA to train RBMs and deep auto-encoders (Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, Vol. 313, No. 5786, pp. 504-507, 28 July 2006).

Experiment --- Usability. Here we ran SINGA to train feedforward models (CNN, MLP), including a deep multi-modal neural network (W. Wang, X. Yang, B. C. Ooi, D. Zhang, Y. Zhuang: Effective Deep Learning Based Multi-Modal Retrieval. VLDB Journal, special issue of VLDB'14 best papers, 2015; W. Wang, B. C. Ooi, X. Yang, D. Zhang, Y. Zhuang: Effective Multi-Modal Retrieval based on Stacked Auto-Encoders. Int'l Conference on Very Large Data Bases (VLDB), 2014).

Experiment --- Usability. Here we ran SINGA to train RNN models (Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model, INTERSPEECH 2010, Makuhari, Chiba, JP). The graph shows perplexity versus training iterations. Perplexity measures the performance of a language model; it is computed from the accuracy of predicting the next word given the current word in a sentence. The authors' code has many hard-coded training settings that we did not follow exactly, so the performance curves are not identical, but they reach the same performance in the end.

Experiment --- Efficiency and Scalability. Train a DCNN over CIFAR-10 (https://code.google.com/p/cuda-convnet). Single node: 4 NUMA nodes (Intel Xeon 7540, 2.0 GHz), each with 6 cores and hyper-threading enabled, 500 GB memory. Cluster: quad-core Intel Xeon 3.1 GHz CPU and 8 GB memory per node, 1 Gbps switch, 32 nodes, 4 workers per node. Since SINGA was designed for scalability and efficiency, we also conducted a performance analysis against existing systems; here we test synchronous training. The figure on the left shows that our distributed training is more scalable than Caffe/CXXNET, because SINGA fully parallelizes the training among multiple workers/cores, whereas Caffe/CXXNET use OpenBLAS, which parallelizes only part of the operations. The figure on the right compares SINGA with Petuum, which runs Caffe as an application; SINGA clearly scales better. With more than 64 workers, the scalability benefit is less obvious. This is a limitation of synchronous training: a model typically cannot be partitioned into too many pieces. For example, a mini-batch typically has fewer than 256 images; partitioned over 32 workers, each worker gets only 8 images, so the computation per worker becomes small relative to the communication cost. For reference (Caffe on a single GTX 970), it takes about 260 ms per iteration (batch size, i.e., images processed per iteration, = 512).

Experiment --- Scalability. Train a DCNN over CIFAR-10 (https://code.google.com/p/cuda-convnet), on a single node and on a cluster, using asynchronous training. We also compared SINGA with Caffe using in-memory asynchronous training. Since both systems run the in-memory Hogwild algorithm, both scale well: the time to reach a given accuracy decreases as more workers are launched, and SINGA runs a bit faster than Caffe in reaching the same accuracy. The figure on the right shows training in a cluster, which is not as stable as in-memory asynchronous training, possibly due to conflicting and delayed parameter updates; the same behaviour was observed in the Google Brain paper. The training can still converge by running a single worker in the last training stage, and the accuracy is similar to that of in-memory training. In this test we ran SINGA with a hybrid training framework in which the number of groups is fixed; larger groups run each SGD iteration faster and therefore take less time to finish training. (Added on 24 Oct.: a single GTX 970 takes about 60 min, i.e., 3600 s, longer than asynchronous training with 4 workers per group and 32 groups in total.)

Conclusions. Programming model, abstraction and system architecture: easy to implement different models; flexible and efficient to run different training frameworks. Experiments: trained models from different categories; scalability tests for different training frameworks. SINGA is usable, extensible, efficient and scalable. Apache SINGA v0.1.0 has been released, and v0.2.0 (with GPU-CPU support, DLaaS, and more features) will be out next month. In summary, we proposed a programming model based on the layer abstraction to support the implementation of different models and to enable running different training frameworks. A thorough experimental study was conducted: we trained different models to verify correctness and ran scalability and efficiency tests. SINGA is being used for various applications, including healthcare analytics, product search and business analytics.

Thank You! Acknowledgement: Apache SINGA Team (ASF mentors, contributors, committers, and users) + funding agencies (NRF, MOE, ASTAR)