1
SINGA: Putting Deep Learning into the Hands of Multimedia Users
Wei Wang, Gang Chen, Tien Tuan Anh Dinh, Jinyang Gao, Beng Chin Ooi, Kian-Lee Tan, and Sheng Wang
2
Introduction: multimedia data and applications
Motivations: deep learning models and training, and design principles
SINGA: usability, scalability, implementation
Experiments
3
Introduction
Multimedia data (audio, image/video, text) drives applications in social media, e-commerce, and health-care, and a wave of startups has been built on it: VocalIQ (acquired by Apple), Madbits (acquired by Twitter), Perceptio (acquired by Apple), LookFlow (acquired by Yahoo! Flickr), Deepomatic (e-commerce product search), Descartes Labs (satellite images), Clarifai (tagging), Idibon, ParallelDots, AlchemyAPI (acquired by IBM), and Semantria (NLP tasks in more than 10 languages).
We are in the era of Big Data, and most of it is multimedia in nature. Modern applications generate huge amounts of data at unprecedented rates and in a variety of formats. This creates many opportunities for startups that analyze the data, and deep learning has been noted as one of the most effective techniques for handling the complexity of these problems and extracting value from the data. In short, deep learning has been noted for its effectiveness for multimedia applications.
4
Motivations: Model Categories
Feedforward models (CNN, MLP, auto-encoder) are used for image/video classification. For example, the convolutional neural network (CNN), a feedforward model, has brought remarkable improvements in image classification (Krizhevsky, Sutskever, and Hinton, 2012; Szegedy et al., 2014; Simonyan and Zisserman, 2014a).
5
Motivations: Model Categories
Energy models (DBN, RBM, DBM) are used for speech recognition. The deep belief network (DBN), an energy model, is very effective for speech recognition (Dahl et al., 2012).
6
Recurrent Neural Networks
Motivations: Model Categories
Recurrent neural networks (RNN, LSTM, GRU) are used for natural language processing. The recurrent neural network has shown promising performance for modelling sequential data and for NLP applications (Mikolov et al., 2010; Cho et al., 2014).
7
Motivations: Design Goal I (Usability)
Usability: easy to implement various models. As in other domains, different applications require different models, and these models are not easy to implement and are hard to tune. Therefore, one of the main goals of SINGA is to provide a general programming model that helps users implement various applications. Our aim is to make life easier for users.
8
Motivations: Training Process
Update model parameters to minimize the prediction error. Training algorithm: mini-batch stochastic gradient descent (SGD), with gradients computed by back-propagation (BP) or contrastive divergence (CD). Training time = (time per SGD iteration) × (number of SGD iterations). It takes a long time to train large models over large datasets, e.g., two weeks to train OverFeat (Sermanet et al.), as reported by Intel. To deploy a deep learning model for an online application, we must first train it. Training typically follows the SGD algorithm and is slow, very slow in fact, because it requires many iterations to converge and each iteration is costly. It is so computation-intensive that it can take weeks or even months to train a large model on a large dataset. Distributed training is therefore a good way to address this issue.
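To make the cost structure concrete, below is a minimal mini-batch SGD sketch in C++ (a toy linear model, not SINGA code; all names are hypothetical). The loop makes explicit why total training time is the per-iteration cost multiplied by the number of iterations.

    // Minimal mini-batch SGD sketch (hypothetical, not SINGA code).
    // Fits y = w*x + b on synthetic data to show the cost structure:
    // total time = (time per iteration) x (number of iterations).
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
      std::mt19937 rng(0);
      std::normal_distribution<float> noise(0.f, 0.1f);
      std::uniform_real_distribution<float> ux(-1.f, 1.f);

      // Synthetic dataset: y = 3x + 1 + noise.
      const int N = 10000;
      std::vector<float> x(N), y(N);
      for (int i = 0; i < N; ++i) { x[i] = ux(rng); y[i] = 3.f * x[i] + 1.f + noise(rng); }

      float w = 0.f, b = 0.f, lr = 0.1f;
      const int batch = 32, iterations = 2000;
      std::uniform_int_distribution<int> pick(0, N - 1);

      for (int it = 0; it < iterations; ++it) {    // number of SGD iterations
        float gw = 0.f, gb = 0.f;                  // gradient over one mini-batch
        for (int k = 0; k < batch; ++k) {          // per-iteration cost grows with
          int i = pick(rng);                       // batch size and model size
          float err = (w * x[i] + b) - y[i];
          gw += err * x[i];
          gb += err;
        }
        w -= lr * gw / batch;                      // parameter update
        b -= lr * gb / batch;
      }
      std::printf("learned w=%.3f b=%.3f\n", w, b); // ~3 and ~1
      return 0;
    }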
9
Motivations: Distributed Training Frameworks
Synchronous training (Google Sandblaster, Dean et al., 2012; Baidu AllReduce, Wu et al., 2015): reduces the time per iteration; scales well on a single node with multiple GPUs; cannot scale to a large cluster.
Asynchronous training (Google Downpour, Dean et al., 2012; Hogwild!, Recht et al., 2011): reduces the number of iterations per machine; scales to big clusters of commodity machines (CPUs); not stable.
Hybrid frameworks combine the two.
Design Goal II (Scalability): not just flexible, but also efficient and adaptive, to run different training frameworks.
We can use distributed training to reduce training time by using more computing resources. Scalability depends on the parallelism scheme, i.e., the training framework. There are basically two categories of distributed training frameworks, synchronous and asynchronous, and both have strengths and weaknesses. Synchronous frameworks improve the efficiency per iteration; they tend to work well on a single node with multiple GPUs, but they cannot scale to large clusters. Asynchronous frameworks reduce the number of iterations per machine; they are more suitable for a cluster of CPU machines, but they are not as stable as synchronous training, possibly due to conflicting and delayed parameter updates (validated both in Google's paper and in our experiments). There are also hybrid frameworks. Different frameworks suit different application scenarios, so we need a system that is not just flexible but also adaptive, to run different training frameworks in a scalable fashion.
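As a rough illustration (a hedged sketch, not SINGA or Google code; every name here is hypothetical), the two families differ mainly in when worker gradients reach the shared parameters:

    #include <thread>
    #include <vector>

    static std::vector<float> params(1000, 0.f);  // shared model parameters

    // Synchronous step: gradients from every worker are gathered and averaged,
    // then applied once. Each iteration is cheaper, but the slowest worker and
    // the aggregation gate everyone, so scaling to a large cluster is hard.
    void sync_step(const std::vector<std::vector<float>>& worker_grads, float lr) {
      for (size_t j = 0; j < params.size(); ++j) {
        float g = 0.f;
        for (const auto& grad : worker_grads) g += grad[j];
        params[j] -= lr * g / worker_grads.size();
      }
    }

    // Asynchronous (Hogwild-style) worker: applies its own update as soon as its
    // gradient is ready, without locking. More updates proceed in parallel, but
    // they can conflict or arrive stale, so convergence is less stable.
    void async_worker(int steps, float lr) {
      std::vector<float> grad(params.size(), 0.01f);  // placeholder gradient
      for (int s = 0; s < steps; ++s)
        for (size_t j = 0; j < params.size(); ++j)
          params[j] -= lr * grad[j];
    }

    int main() {
      // One synchronous step over gradients from 4 workers.
      std::vector<std::vector<float>> grads(4, std::vector<float>(params.size(), 0.01f));
      sync_step(grads, 0.01f);
      // Four asynchronous workers racing on the shared parameters.
      std::vector<std::thread> workers;
      for (int w = 0; w < 4; ++w) workers.emplace_back(async_worker, 100, 0.01f);
      for (auto& t : workers) t.join();
      return 0;
    }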
10
SINGA: A Distributed Deep Learning Platform
Now we present the design and system architecture of SINGA. In each SGD iteration, workers compute parameter gradients over the user-defined neural net, and servers receive the gradients and update the parameters. The cluster topology specifies the training framework.
11
Usability: Abstraction
The core abstraction in SINGA is the Layer, which conducts feature transformations:

    class Layer {
      vector<Blob> data, grad;
      vector<Param*> param;
      ...
      void Setup(LayerProto& conf, vector<Layer*> src);
      void ComputeFeature(int flag, vector<Layer*> src);
      void ComputeGradient(int flag, vector<Layer*> src);
    };
    Driver::RegisterLayer<FooLayer>("Foo"); // register new layers

Unlike the programming models of other systems, which separate layer operations from layer features and parameters, a SINGA layer carries both its features and its parameters. A neural net can therefore be constructed simply by connecting a set of layers, and this abstraction also simplifies neural net partitioning, which will be discussed soon. Common layers are implemented as built-in layers and fall into five categories:
Input layers load raw data (and labels).
Output layers output features (and prediction results).
Neuron layers apply non-linear feature transformations, e.g., convolution and pooling.
Loss layers measure the training loss, e.g., cross-entropy loss.
Connection layers connect layers after neural net partitioning.
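To illustrate the intended extension point, here is a small, self-contained sketch of a user-defined layer written against a simplified stand-in for the interface above. Only the method names follow the slide; Blob, Param, LayerProto and the ReLULayer itself are hypothetical stubs, not SINGA's real types.

    #include <vector>
    using std::vector;

    // Simplified stand-ins for the types on the slide (not SINGA's real types).
    struct Blob { vector<float> values; };
    struct Param { vector<float> values, grads; };
    struct LayerProto { /* layer configuration */ };

    class Layer {
     public:
      vector<Blob> data, grad;
      vector<Param*> param;
      virtual ~Layer() {}
      virtual void Setup(LayerProto& conf, vector<Layer*> src) = 0;
      virtual void ComputeFeature(int flag, vector<Layer*> src) = 0;
      virtual void ComputeGradient(int flag, vector<Layer*> src) = 0;
    };

    // A hypothetical neuron layer: element-wise ReLU over the source layer's feature.
    class ReLULayer : public Layer {
     public:
      void Setup(LayerProto& conf, vector<Layer*> src) override {
        data.resize(1);
        grad.resize(1);
        data[0].values.resize(src[0]->data[0].values.size());
        grad[0].values.resize(src[0]->data[0].values.size());
      }
      void ComputeFeature(int flag, vector<Layer*> src) override {
        const vector<float>& in = src[0]->data[0].values;
        for (size_t i = 0; i < in.size(); ++i)
          data[0].values[i] = in[i] > 0.f ? in[i] : 0.f;   // forward pass
      }
      void ComputeGradient(int flag, vector<Layer*> src) override {
        const vector<float>& in = src[0]->data[0].values;
        for (size_t i = 0; i < in.size(); ++i)             // backward pass
          src[0]->grad[0].values[i] = in[i] > 0.f ? grad[0].values[i] : 0.f;
      }
    };

Registering such a layer by name, as in the Driver::RegisterLayer call shown on the slide, would then make it usable from a net configuration.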
12
Usability: Neural Net Representation
SINGA has a uniform neural net representation for all three categories of models: a neural network consists of uni-directionally connected layers. Feedforward models (e.g., CNN) map onto this representation directly, as a chain of input, hidden, and loss layers (plus label layers). An RNN is represented by unrolling its recurrent connections into connected layers. To represent an energy model such as an RBM, each undirected connection is replaced with two directed connections.
13
Usability: TrainOneBatch
The TrainOneBatch function calls layer functions to compute parameter gradients. Currently, we have implemented the back-propagation (BP) algorithm for feedforward models and RNNs, and the contrastive divergence (CD) algorithm for RBMs. Other algorithms can be supported simply by overriding the TrainOneBatch function, which is what gives SINGA its usability.
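Continuing the toy stand-in from the abstraction slide (again hypothetical, not SINGA's actual implementation), a BP-style TrainOneBatch is essentially a forward sweep followed by a backward sweep over a topologically ordered layer list:

    // Hypothetical BP-style TrainOneBatch over the toy Layer stand-in above.
    // srcs[i] lists the source layers feeding net[i]; net is topologically ordered.
    void TrainOneBatchBP(std::vector<Layer*>& net,
                         std::vector<std::vector<Layer*>>& srcs) {
      for (std::size_t i = 0; i < net.size(); ++i)
        net[i]->ComputeFeature(/*flag=*/0, srcs[i]);   // forward: features and loss
      for (std::size_t i = net.size(); i-- > 0; )
        net[i]->ComputeGradient(/*flag=*/0, srcs[i]);  // backward: gradients
      // A CD-based TrainOneBatch for RBMs would instead alternate positive and
      // negative phases (Gibbs sampling) before computing parameter gradients.
    }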
14
Scalability: Partitioning for Distributed Training
NeuralNet partitioning strategies:
1. Partition layers into different subsets.
2. Partition each single layer on the batch dimension.
3. Partition each single layer on the feature dimension.
4. A hybrid strategy combining 1, 2 and 3.
To distribute training onto multiple nodes, one way is to partition the model, i.e., the neural net; another is to partition the dataset. In SINGA, we partition the neural net and assign a subset of layers to each worker. First, different layers can be placed on different workers to parallelize them. Second, the data of one mini-batch can be distributed across workers. Third, different workers can compute different parts of each feature. Fourth, a hybrid partitioning is useful for some models, e.g., deep CNNs. Users simply configure the partitioning scheme (1, 2, 3 or 4), and SINGA performs the neural net partitioning automatically, i.e., it slices the layers and connects them back with connection layers.
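For intuition, here is a small sketch (hypothetical, not SINGA code) of how strategy 2 slices one mini-batch across workers; feature-dimension partitioning (strategy 3) would slice each layer's feature vector in the same way.

    #include <cstdio>
    #include <utility>
    #include <vector>

    // Hypothetical batch-dimension partitioning (strategy 2): one mini-batch of
    // examples is sliced across workers; each worker would run the same layers
    // on its slice, and the gradients are then aggregated.
    int main() {
      const int batch_size = 256, num_workers = 4;
      std::vector<std::pair<int, int>> slices;           // [begin, end) per worker
      int base = batch_size / num_workers, rem = batch_size % num_workers, begin = 0;
      for (int w = 0; w < num_workers; ++w) {
        int len = base + (w < rem ? 1 : 0);              // spread the remainder
        slices.push_back({begin, begin + len});
        begin += len;
      }
      for (int w = 0; w < num_workers; ++w)
        std::printf("worker %d gets examples [%d, %d)\n",
                    w, slices[w].first, slices[w].second);
      return 0;
    }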
15
Scalability: Training Framework
(Figure: cluster topology with a single worker group and a single server group; the server group maintains the parameters, and each worker holds a partition of the neural net.)
The cluster topology decides the training framework, which in turn affects scalability. With a single worker group and a single server group, SINGA workers run synchronously, each computing over a partition of the neural net. Typically a model cannot be split into very many partitions, so synchronous training cannot scale to a large worker group or a large cluster.
16
Scalability: Training Framework
(Figure: cluster topology with multiple worker groups sharing a single server group.)
We can add more machines and more worker groups, but since all worker groups communicate with the single server group, communication becomes the bottleneck. As always, parallelism breaks bottlenecks: we use multiple server groups to distribute the computation and reduce the communication cost.
17
Scalability: Training Framework
(Figure: cluster topologies for (a) Sandblaster and (b) AllReduce, both synchronous, and (c) Downpour and (d) distributed Hogwild, both asynchronous.)
This design is flexible enough to run different existing training frameworks simply by configuring the cluster topology: the synchronous frameworks Sandblaster from Google and AllReduce from Baidu, and the asynchronous frameworks Downpour and distributed Hogwild. With this, SINGA is extensible, efficient and scalable, and is able to reproduce most known frameworks.
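A hedged sketch of the idea (the struct and field names here are hypothetical, not SINGA's actual configuration keys): the same training code can behave like any of the four frameworks depending only on how many worker/server groups are configured and where the servers live.

    #include <cstdio>

    // Hypothetical topology description; real SINGA expresses this through its
    // cluster configuration, with different field names.
    struct ClusterTopology {
      int worker_groups;       // groups running SGD over (replicas of) the model
      int server_groups;       // groups maintaining parameter replicas
      bool separate_servers;   // servers on dedicated nodes vs co-located with workers
    };

    const char* FrameworkFor(const ClusterTopology& t) {
      if (t.worker_groups == 1)
        return t.separate_servers ? "Sandblaster-style (sync)" : "AllReduce-style (sync)";
      return t.separate_servers ? "Downpour-style (async)" : "distributed-Hogwild-style (async)";
    }

    int main() {
      ClusterTopology t{4, 1, true};
      std::printf("%s\n", FrameworkFor(t));   // prints: Downpour-style (async)
      return 0;
    }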
18
Implementation: SINGA Software Stack
SINGA runs on Ubuntu, CentOS and MacOS, reads training data from disk files or HDFS, and integrates with Mesos, Zookeeper and Docker (optional components) for resource management and deployment. The SINGA components are the Driver, Stub, Workers and Servers, with built-in models such as CNN, RBM and RNN on top.
Driver::Train() runs in the main thread and starts Stub::Run(); worker threads loop over Worker::TrainOneBatch() while the stop condition is not met, and server threads run Server::Update(). Because workers and servers are implemented as threads, distributed training can run within a single node or across a cluster of remote nodes. SINGA is seamlessly integrated with cloud computing software for easy management of computing resources and deployment.
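A minimal sketch of this threading layout (a toy built on std::thread; only the names Driver::Train, Stub::Run, Worker::TrainOneBatch and Server::Update come from the slide, everything else is hypothetical):

    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic<bool> stop{false};

    struct Worker {
      void TrainOneBatch() { /* compute gradients over this worker's net partition */ }
    };
    struct Server {
      void Update() { /* apply received gradients to the parameters it owns */ }
    };
    struct Stub {
      // The stub routes messages between local threads and remote nodes.
      void Run() { while (!stop) std::this_thread::yield(); }
    };

    struct Driver {
      void Train() {                       // main thread
        Stub stub;
        std::thread stub_thread(&Stub::Run, &stub);
        std::vector<std::thread> threads;
        for (int i = 0; i < 2; ++i)        // worker threads
          threads.emplace_back([] { Worker w; while (!stop) w.TrainOneBatch(); });
        for (int i = 0; i < 1; ++i)        // server threads
          threads.emplace_back([] { Server s; while (!stop) s.Update(); });
        stop = true;                       // toy: stop immediately
        for (auto& t : threads) t.join();
        stub_thread.join();
      }
    };

    int main() { Driver().Train(); return 0; }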
19
Deep learning as a Service (DLaaS)
SINGA's Rafiki layer has been built on top of SINGA to support both developers and users. Third-party apps (web, mobile) reach it through an HTTP API, and developers use a browser-based GUI; the Rafiki server handles user, job, model and node management, load-balanced routing, a database, and a file storage system (e.g., HDFS), and dispatches work to Rafiki agents that drive SINGA instances through Timon, a C++ wrapper. The GUI lets users select and launch built-in models and applications. Our aim is simple but ambitious: (1) to improve the usability of SINGA; and (2) to level the playing field for those who want to work on or exploit deep learning, by taking care of the complex system plumbing, reliability, efficiency and scalability. SINGA can also be used for repeatability studies of other models.
20
Comparison: Features of the Systems
Comparison with other open-source systems (SINGA, Caffe, CXXNET, cuda-convnet, H2O; MXNet as of 28/09/15) along these dimensions:
Deep learning models: feed-forward (CNN, MLP), energy models (RBM), recurrent networks (RNN).
Distributed training frameworks: synchronous, asynchronous, hybrid.
Hardware: CPU, GPU (GPU support in SINGA arrives with v0.2.0).
Cloud software: HDFS, resource management, virtualization.
Language bindings: Python (P), Matlab (M), R (SINGA's Python binding is ongoing; the other systems offer combinations such as P+M, P, and P+R).
Caffe is famous for convolutional neural network training (winner of the Open Source Software Competition 2014); CXXNET is another CNN training system written in C++; cuda-convnet was written by the author of the deep CNN paper and is specifically optimized for deep CNNs; H2O is written in Java and integrates well with cloud software. As can be seen, SINGA supports all of the known models and training frameworks, and the GPU version will be released next month.
21
Experiment --- Usability
Used SINGA to train three known models and verified the results.
Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, vol. 313, no. 5786, 28 July 2006. …
To verify the correctness of the system, we used SINGA to train three different models and checked them against published results/benchmarks; we shall not go into details here. Here we ran SINGA to train RBMs and deep auto-encoders.
22
Experiment --- Usability
W. Wang, X. Yang, B. C. Ooi, D. Zhang, Y. Zhuang: Effective Deep Learning Based Multi-Modal Retrieval. VLDB Journal, special issue of VLDB 2014 best papers, 2015.
W. Wang, B. C. Ooi, X. Yang, D. Zhang, Y. Zhuang: Effective Multi-Modal Retrieval based on Stacked Auto-Encoders. Int'l Conference on Very Large Data Bases (VLDB), 2014.
Here we ran SINGA to train feedforward models: a CNN and an MLP used in a deep multi-modal neural network.
23
Experiment --- Usability
Mikolov Tomáš, Karafiát Martin, Burget Lukáš, Černocký Jan, Khudanpur Sanjeev: Recurrent neural network based language model. INTERSPEECH 2010, Makuhari, Chiba, JP.
Here we ran SINGA to train RNN models. The graph shows perplexity versus training iterations; perplexity measures the performance of a language model and is computed from the accuracy of predicting the next word given the current word in a sentence. The authors' code hard-codes many training settings and we did not follow those settings exactly, so the performance curves are not identical, but they reach the same perplexity at the end.
24
Experiment --- Efficiency and Scalability
Train a deep CNN over CIFAR-10 (synchronous training).
Single node: 4 NUMA nodes (Intel Xeon 7540, 2.0 GHz), 6 cores per node with hyper-threading enabled, 500 GB memory.
Cluster: 32 nodes, each with a quad-core Intel Xeon 3.1 GHz CPU and 8 GB memory, connected by a 1 Gbps switch; 4 workers per node.
Since SINGA is designed for scalability and efficiency, we also compared its performance against existing systems for synchronous training. The figure on the left shows that SINGA's distributed training is more scalable than Caffe and CXXNET, because SINGA fully parallelizes the training among multiple workers/cores, whereas Caffe and CXXNET use OpenBLAS, which parallelizes only part of the operations. The figure on the right compares SINGA with Petuum, which runs Caffe as an application; SINGA clearly scales better. Beyond 64 workers the scalability benefit is less obvious, which is a limitation of synchronous training: a model cannot be partitioned into too many pieces. For example, a mini-batch typically has fewer than 256 images; partitioned across 32 workers, each worker gets only 8 images, so the per-worker computation becomes small relative to the communication cost. For reference, Caffe on a single GTX 970 takes about 260 ms per iteration with a batch size (images processed per iteration) of 512.
25
Experiment --- Scalability
Train a deep CNN over CIFAR-10 (asynchronous training), single node and cluster.
We also compared SINGA with Caffe using in-memory asynchronous training. Since both systems run the in-memory Hogwild algorithm, both scale well: the time to reach a given accuracy decreases as more workers are launched, and SINGA runs a bit faster than Caffe to reach the same accuracy. The figure on the right shows training in a cluster, which is not as stable as in-memory asynchronous training, possibly due to conflicting and delayed parameter updates; the same behaviour is reported in the Google Brain paper. The training can still converge by running a single worker in the last training stage, and the final accuracy is similar to that of in-memory training. In this test we ran SINGA with a hybrid training framework in which the number of groups is fixed; larger groups run each SGD iteration faster and therefore finish training sooner. A single GTX 970 takes about 60 minutes (i.e., 3600 s) longer than asynchronous training with 4 workers per group (32 groups in total).
26
Conclusions
Programming model, abstraction, and system architecture: easy to implement different models; flexible and efficient to run different training frameworks.
Experiments: trained models from the different categories; scalability tests for the different training frameworks.
SINGA is usable, extensible, efficient and scalable. Apache SINGA v0.1.0 has been released; v0.2.0 (with GPU-CPU support, DLaaS, and more features) will be out next month. SINGA is being used for healthcare analytics, product search, and more.
In summary, we proposed a programming model based on the layer abstraction to support the implementation of different models and to enable different training frameworks to run. A thorough experimental study was conducted: we trained different models to verify correctness, and ran scalability and efficiency tests. SINGA is being used for various applications, including healthcare, product search and business analytics.
27
Thank You! Acknowledgement: Apache SINGA Team (ASF mentors, contributors, committers, and users) and funding agencies (NRF, MOE, A*STAR).