Lecture 1c: Caffe - Getting Started
boris.ginsburg@gmail.com
Agenda
- Caffe installation
- Caffe internals
  - Import of a dataset
  - Test description
  - Network topology definition; layers
  - Internal data format
- MNIST training; DIGITS
- Implementation details of the convolutional layer
Exercises & Projects
- Build Caffe
  - Change to CPU-only mode
  - Change ATLAS to OpenBLAS
- Play with MNIST topologies & layers
  - How does the net accuracy depend on the topology?
  - What will happen if we replace ReLU with tanh?
  - Add a normalization layer
- Extra: look at the definition of the following layers: Maxout, normalization layer
- Convolutional layer internals
Open-source Deep Learning libraries
- Caffe: http://caffe.berkeleyvision.org/ - C++/CUDA, Python and MATLAB wrappers, easily extendable; NVIDIA DIGITS
- Torch: http://torch.ch/ - excellent tutorial, C++/CUDA, Lua
- cuda-convnet2: https://code.google.com/p/cuda-convnet2/
- pylearn2: http://deeplearning.net/software/pylearn2/ - integrated with Theano, C++/CUDA, Python
- ConvNet: http://torontodeeplearning.github.io/convnet/
Caffe: installation
- Hardware: https://timdettmers.wordpress.com/2015/03/09/deep-learning-hardware-guide/
  - NVIDIA GPU is optional; Caffe can run on CPU only
- OS: Ubuntu 14.04 (12.04), OS X (Windows: unofficial)
- CUDA 7.0 (6.5) + cuDNN
- Install Caffe: http://caffe.berkeleyvision.org/ (works out of the box)
Caffe: make
- Get the Caffe source:
  $ git clone https://github.com/BVLC/caffe
- Copy Makefile.config.example to Makefile.config and run:
  $ make -j
- Tutorials (MNIST, CIFAR-10, ImageNet):
  http://caffe.berkeleyvision.org/gathered/examples/mnist.html
  http://caffe.berkeleyvision.org/gathered/examples/cifar10.html
- To build Caffe for CPU only: in Makefile.config uncomment CPU_ONLY := 1 and set BLAS := open (OpenBLAS)
Homework
- Play with MNIST topologies & layers:
  - How does accuracy depend on the topology?
  - How do speed and accuracy depend on the batch size?
  - What will happen if we replace ReLU with tanh?
  - Add a normalization layer
- Extra: look at the implementation of the following layers:
  - Data layer
  - Convolutional layer
  - IP (inner product) layer
  - Softmax and accuracy
Caffe: example 1 - MNIST
Database: http://yann.lecun.com/exdb/lenet/index.html
Caffe: Step 1 - import datasets
Three examples of data conversion from images to Caffe format:
- MNIST: dataset http://yann.lecun.com/exdb/mnist/
  …/examples/mnist/convert_mnist_data.cpp
- CIFAR-10: dataset http://www.cs.toronto.edu/~kriz/cifar.html
  …/examples/cifar10/convert_cifar_data.cpp
- ImageNet: dataset http://www.image-net.org/challenges/LSVRC/2014/
  …/tools/convert_imageset.cpp
Caffe: database formats
- leveldb: https://code.google.com/p/leveldb/
  - <key, value>: keys and values are arbitrary byte arrays; data is stored sorted by key; callers can provide a custom comparison function to override the sort order
  - basic operations: Put(key, value), Get(key), Delete(key)
- lmdb: http://symas.com/mdb/ (in the dev branch)
  - <key, value>; data is stored sorted by key
  - uses memory-mapped files: the read performance of an in-memory DB while still offering the persistence of a standard disk-based DB; supports concurrent access
- HDF5 (Hierarchical Data Format 5): http://www.hdfgroup.org/HDF5/
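For illustration, a minimal sketch of the leveldb operations listed above, following the leveldb documentation (the path /tmp/testdb is just an example):

#include <cassert>
#include <string>
#include "leveldb/db.h"

int main() {
  // Open (or create) a database; the path is arbitrary.
  leveldb::DB* db;
  leveldb::Options options;
  options.create_if_missing = true;
  leveldb::Status status = leveldb::DB::Open(options, "/tmp/testdb", &db);
  assert(status.ok());

  // Put / Get / Delete with keys and values as arbitrary byte arrays.
  status = db->Put(leveldb::WriteOptions(), "key1", "value1");
  std::string value;
  status = db->Get(leveldb::ReadOptions(), "key1", &value);
  status = db->Delete(leveldb::WriteOptions(), "key1");

  delete db;
  return 0;
}

Caffe's convert_mnist_data tool uses the same Put() call, with the serialized image+label protobuf as the value.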
Caffe: configuration files
You should define the train and test configuration:
- Solver parameters file: …/examples/mnist/lenet_solver.prototxt
- Network descriptor for training and testing: …/examples/mnist/lenet_train_test.prototxt
The message descriptors for these parameters are in …/src/caffe/proto/caffe.proto. The protobuf compiler (Google Protocol Buffers) generates C++ classes from these descriptors: https://developers.google.com/protocol-buffers/docs/overview
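As a small illustration of those generated classes, the sketch below parses a text-format network descriptor with the protobuf TextFormat API. It assumes the 2015-era caffe.proto (with the layers field used in these slides) and that the generated header caffe.pb.h is on the include path:

#include <cstdio>
#include <string>
#include <google/protobuf/text_format.h>
#include "caffe.pb.h"   // generated by protoc from caffe.proto

int main() {
  // A tiny text-format (prototxt) snippet, just for illustration.
  const std::string prototxt =
      "name: \"LeNet\"\n"
      "layers { name: \"relu1\" type: RELU bottom: \"ip1\" top: \"ip1\" }\n";

  // Parse it into the generated C++ class caffe::NetParameter.
  caffe::NetParameter net_param;
  bool ok = google::protobuf::TextFormat::ParseFromString(prototxt, &net_param);

  // The generated accessors follow directly from the fields in caffe.proto.
  if (ok) {
    printf("net: %s, layers: %d\n", net_param.name().c_str(), net_param.layers_size());
  }
  return 0;
}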
LeNet topology (FORWARD: bottom to top; BACKWARD: top to bottom)
  SoftMax
  Inner Product
  ReLU
  Pooling [2x2, stride 2]
  Convolutional layer [5x5]
  Pooling [2x2, stride 2]
  Convolutional layer [5x5]
  Data Layer
Layer::Forward()

class Layer {
  Setup(bottom, top);      // initialize layer
  Forward(bottom, top);    // compute next layer
  Backward(top, bottom);   // compute gradient
}

Layer::Forward() propagates y_l-1 to the next layer:
  y_l = f(w_l, y_l-1)
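A minimal, self-contained sketch of this contract, assuming plain float vectors instead of Caffe's Blob pointers (illustration only, not the real Caffe base class):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the Setup/Forward contract on plain vectors (not Caffe's real API,
// which passes vectors of Blob<Dtype>* for bottom and top).
class Layer {
 public:
  virtual ~Layer() {}
  virtual void Setup(const std::vector<float>& bottom, std::vector<float>* top) {}      // initialize layer
  virtual void Forward(const std::vector<float>& bottom, std::vector<float>* top) = 0;  // y_l = f(w_l, y_l-1)
};

// Toy example: a ReLU-style layer, y_l = max(y_l-1, 0).
class ToyReLULayer : public Layer {
 public:
  void Forward(const std::vector<float>& bottom, std::vector<float>* top) override {
    top->resize(bottom.size());
    for (std::size_t i = 0; i < bottom.size(); ++i)
      (*top)[i] = std::max(bottom[i], 0.0f);
  }
};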
Data Layer

name: "mnist"
type: DATA
data_param {
  source: "mnist-train-leveldb"
  batch_size: 64
  scale: 0.00390625
}
top: "data"
top: "label"

The data layer reads images and labels from the mnist-train-leveldb database, scales the pixels by 0.00390625 (= 1/256), and outputs two blobs: "data" and "label".
Blob
All data is stored as BLOBs - Binary (Basic) Large Objects:

class Blob {
  Blob(int num, int channels, int height, int width);
  const Dtype* cpu_data() const;
  const Dtype* gpu_data() const;
  …
 protected:
  shared_ptr<SyncedMemory> data_;  // container for cpu_ / gpu_ memory
  shared_ptr<SyncedMemory> diff_;  // gradient
  int num_;
  int channels_;
  int height_;
  int width_;
  int count_;
}
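The four dimensions are stored contiguously in (num, channels, height, width) order; the index arithmetic below mirrors what Caffe's Blob::offset(n, c, h, w) computes (the MNIST sizes are just an example):

#include <cstdio>

// Index arithmetic for a 4-D blob stored contiguously in (num, channels, height, width) order.
inline int blob_offset(int n, int c, int h, int w,
                       int channels, int height, int width) {
  return ((n * channels + c) * height + h) * width + w;
}

int main() {
  // Example: a batch of 64 MNIST images, 1 channel, 28x28 pixels.
  const int num = 64, channels = 1, height = 28, width = 28;
  const int count = num * channels * height * width;            // total number of elements
  int idx = blob_offset(3, 0, 10, 7, channels, height, width);  // pixel (10,7) of image 3
  printf("count = %d, offset = %d\n", count, idx);
  return 0;
}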
SyncedMemory

class SyncedMemory {
 public:
  SyncedMemory();
  const void* cpu_data();
  const void* gpu_data();
  void* mutable_cpu_data();
  void* mutable_gpu_data();
  enum SyncedHead { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };
  SyncedHead head() { return head_; }
  size_t size() { return size_; }
SyncedMemory (continued)

class SyncedMemory {
  …
 private:
  void to_cpu();   // bring the CPU copy up to date
  void to_gpu();   // bring the GPU copy up to date
  void* cpu_ptr_;
  void* gpu_ptr_;
  size_t size_;
  SyncedHead head_;
};  // class SyncedMemory
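The head_ field implements lazy synchronization: data is copied between host and device only when the other side is actually accessed. A simplified sketch of that idea (my own illustration on plain pointers, not the actual Caffe source):

#include <cstddef>

// Illustration of lazy CPU/GPU synchronization; the real SyncedMemory also
// allocates memory and uses cudaMemcpy with CUDA error checking.
enum SyncedHead { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };

struct SyncedMemorySketch {
  void* cpu_ptr_ = nullptr;
  void* gpu_ptr_ = nullptr;
  std::size_t size_ = 0;
  SyncedHead head_ = UNINITIALIZED;

  const void* cpu_data() {       // read access: make sure the CPU copy is fresh
    if (head_ == UNINITIALIZED) {
      // allocate cpu_ptr_ here; head_ = HEAD_AT_CPU
    } else if (head_ == HEAD_AT_GPU) {
      // copy gpu_ptr_ -> cpu_ptr_ (cudaMemcpy in the real code)
      head_ = SYNCED;
    }
    return cpu_ptr_;
  }
  void* mutable_cpu_data() {     // write access: the CPU now owns the latest data
    cpu_data();
    head_ = HEAD_AT_CPU;
    return cpu_ptr_;
  }
  // gpu_data() / mutable_gpu_data() mirror this logic with the roles swapped.
};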
Convolutional Layer

name: "conv1"
type: CONVOLUTION
blobs_lr: 1.
convolution_param {
  num_output: 20
  kernelsize: 5
  stride: 1
  weight_filler { type: "xavier" }
  bias_filler { type: "constant" }
}
bottom: "data"
top: "conv1"
Convolutional Layer

// N output feature maps, M input feature maps, Y x X output size, K x K kernel
for (n = 0; n < N; n++)
  for (m = 0; m < M; m++)
    for (y = 0; y < Y; y++)
      for (x = 0; x < X; x++)
        for (p = 0; p < K; p++)
          for (q = 0; q < K; q++)
            yL(n; x, y) += yL-1(m; x+p, y+q) * w(m, n; p, q);
Add bias…
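The same nested loop, written as a self-contained C++ sketch on flat arrays (the dimension names and the bias handling are illustrative assumptions; the real layer also supports padding, stride and groups):

#include <vector>

// Direct (naive) convolution: M input maps of size H x W, N output maps,
// K x K kernels, stride 1, no padding. Output size is (H-K+1) x (W-K+1).
void conv_forward(const std::vector<float>& in,    // M * H * W
                  const std::vector<float>& w,     // N * M * K * K
                  const std::vector<float>& bias,  // N
                  std::vector<float>& out,         // N * Ho * Wo
                  int M, int H, int W, int N, int K) {
  const int Ho = H - K + 1, Wo = W - K + 1;
  out.assign(N * Ho * Wo, 0.0f);
  for (int n = 0; n < N; ++n)
    for (int m = 0; m < M; ++m)
      for (int y = 0; y < Ho; ++y)
        for (int x = 0; x < Wo; ++x)
          for (int p = 0; p < K; ++p)
            for (int q = 0; q < K; ++q)
              out[(n * Ho + y) * Wo + x] +=
                  in[(m * H + y + p) * W + x + q] *
                  w[((n * M + m) * K + p) * K + q];
  // Add bias: one value per output feature map.
  for (int n = 0; n < N; ++n)
    for (int i = 0; i < Ho * Wo; ++i)
      out[n * Ho * Wo + i] += bias[n];
}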
Pooling Layer

name: "pool1"
type: POOLING
pooling_param {
  kernel_size: 2
  stride: 2
  pool: MAX
}
bottom: "conv1"
top: "pool1"

for (p = 0; p < k; p++)
  for (q = 0; q < k; q++)
    yL(x, y) = max( yL(x, y), yL-1(x*s + p, y*s + q) );

Pooling helps to extract features that are increasingly invariant to local transformations of the input image.
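A minimal sketch of max pooling over a single feature map, following the loop above (assuming no padding and that the window fits the input):

#include <algorithm>
#include <vector>

// Max pooling over one feature map: k x k window, stride s, no padding.
// Output size is ((H - k) / s + 1) x ((W - k) / s + 1).
void max_pool(const std::vector<float>& in, int H, int W,
              int k, int s, std::vector<float>& out) {
  const int Ho = (H - k) / s + 1, Wo = (W - k) / s + 1;
  out.assign(Ho * Wo, -1e30f);   // start from a very small value
  for (int y = 0; y < Ho; ++y)
    for (int x = 0; x < Wo; ++x)
      for (int p = 0; p < k; ++p)
        for (int q = 0; q < k; ++q)
          out[y * Wo + x] = std::max(out[y * Wo + x],
                                     in[(y * s + p) * W + (x * s + q)]);
}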
Inner Product (Fully Connected) Layer

name: "ip1"
type: INNER_PRODUCT
blobs_lr: 1.
blobs_lr: 2.
inner_product_param {
  num_output: 500
  weight_filler { type: "xavier" }
  bias_filler { type: "constant" }
}
bottom: "pool2"
top: "ip1"

YL(n) = ∑m WL(n, m) * YL-1(m)
ReLU Layer

layers {
  name: "relu1"
  type: RELU
  bottom: "ip1"
  top: "ip1"
}

YL(n; x, y) = max( YL-1(n; x, y), 0 );

Note that bottom and top are the same blob, so ReLU is computed in place.
SoftMax + Loss Layer

layers {
  name: "loss"
  type: SOFTMAX_LOSS
  bottom: "ip2"
  bottom: "label"
}

Combines the softmax:
  YL[i] = exp( YL-1[i] ) / ∑j exp( YL-1[j] )
with the log-loss:
  E = - log( YL[label(n)] )
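A small sketch of the combined computation for one example, subtracting the maximum score before exponentiating for numerical stability (a standard trick, also used in Caffe's softmax):

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Softmax over the class scores followed by the log-loss for the true label.
float softmax_loss(const std::vector<float>& scores, int label) {
  float max_score = scores[0];
  for (float s : scores) max_score = std::max(max_score, s);  // for numerical stability
  float sum = 0.0f;
  std::vector<float> prob(scores.size());
  for (std::size_t i = 0; i < scores.size(); ++i) {
    prob[i] = std::exp(scores[i] - max_score);
    sum += prob[i];
  }
  for (float& p : prob) p /= sum;   // softmax probabilities
  return -std::log(prob[label]);    // E = -log( Y[label] )
}

int main() {
  // Example: 10 class scores (as produced by ip2 for MNIST) and true label 1.
  std::vector<float> scores = {0.1f, 2.0f, -1.0f, 0.5f, 0.0f, 0.3f, 1.2f, -0.7f, 0.9f, 0.2f};
  printf("loss = %f\n", softmax_loss(scores, 1));
  return 0;
}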
DIGITS
https://developer.nvidia.com/digits
A nice development environment built on top of Caffe:
- Easy installation
- Manage datasets and models
- Visualize training
- Look inside layers
- Open source
Backup: Convolutional layer internals
Conv layer::Forward()

template <typename Dtype>
Dtype ConvolutionLayer<Dtype>::Forward_cpu(
    const vector<Blob<Dtype>*>& bottom, vector<Blob<Dtype>*>* top) {
  const Dtype* bottom_data = bottom[0]->cpu_data();   // input feature maps
  Dtype* top_data = (*top)[0]->mutable_cpu_data();    // output feature maps
  const Dtype* weight = this->blobs_[0]->cpu_data();  // convolution kernels
  Dtype* col_data = col_buffer_.mutable_cpu_data();   // im2col scratch buffer
  ….
Conv layer::Forward() (simplified: no groups)

  // M_ = number of output maps, N_ = output height * width,
  // K_ = channels_ * kernel_size_ * kernel_size_ (length of one unrolled patch)
  for (int n = 0; n < num_; ++n) {
    // unroll the n-th image into columns: each column is one patch
    im2col_cpu(bottom_data + bottom[0]->offset(n),
               channels_, height_, width_, kernel_size_, pad_, stride_, col_data);
    // one matrix-matrix multiply computes all output maps for this image
    caffe_cpu_gemm(CblasNoTrans, CblasNoTrans, M_, N_, K_,
                   1., weight, col_data,
                   0., top_data + (*top)[0]->offset(n));
    if (bias_term_) {
      // add the bias: outer product of the bias vector with a vector of ones
      caffe_cpu_gemm(CblasNoTrans, CblasNoTrans, num_output_, N_, 1,
                     1., this->blobs_[1]->cpu_data(), bias_multiplier_->cpu_data(),
                     1., top_data + (*top)[0]->offset(n));
    }
  }
  return Dtype(0.);
Conv layer::Forward() (with groups)

  for (int n = 0; n < num_; ++n) {
    im2col_cpu(bottom_data + bottom[0]->offset(n),
               channels_, height_, width_, kernel_size_, pad_, stride_, col_data);
    // each group convolves its own subset of input and output channels
    for (int g = 0; g < group_; ++g) {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, N_, K_,
          (Dtype)1., weight + weight_offset * g, col_data + col_offset * g,
          (Dtype)0., top_data + (*top)[0]->offset(n) + top_offset * g);
    }
    if (bias_term_) {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, num_output_, N_, 1,
          (Dtype)1., this->blobs_[1]->cpu_data(),
          reinterpret_cast<const Dtype*>(bias_multiplier_->cpu_data()),
          (Dtype)1., top_data + (*top)[0]->offset(n));
    }
  }
Convolutional Layer: im2col
The implementation of the convolutional layer is based on two tricks:
1. reduction of convolution to a matrix-matrix multiply (im2col)
2. using BLAS gemm() for fast computation of the matrix-matrix multiply
Let's discuss the reduction to a matrix-matrix multiply: im2col(). We will talk about BLAS in detail later.
Convolutional Layer: im2col
Example: a 2x2 kernel W applied to a 3x3 input X gives a 2x2 output Y:

  W = | W(1,1) W(1,2) |    X = | X(1,1) X(1,2) X(1,3) |    Y = | Y(1,1) Y(1,2) |
      | W(2,1) W(2,2) |        | X(2,1) X(2,2) X(2,3) |        | Y(2,1) Y(2,2) |
                               | X(3,1) X(3,2) X(3,3) |

For example, Y(1,1) is the dot product of W with the top-left 2x2 patch X(1,1), X(1,2), X(2,1), X(2,2).

im2col unrolls every 2x2 patch of X into a column of a matrix, so the whole convolution becomes one matrix-matrix product of the flattened kernel with the patch matrix:

  [ W(1,1) W(1,2) W(2,1) W(2,2) ]  x  | X(1,1) X(1,2) X(2,1) X(2,2) |  =  [ Y(1,1) Y(1,2) Y(2,1) Y(2,2) ]
                                      | X(1,2) X(1,3) X(2,2) X(2,3) |
                                      | X(2,1) X(2,2) X(3,1) X(3,2) |
                                      | X(2,2) X(2,3) X(3,2) X(3,3) |
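A sketch of im2col for a single-channel input with stride 1 and no padding, producing exactly the patch matrix above (each output column is one unrolled K x K patch); the multi-channel, stride and padding handling of Caffe's im2col_cpu() is left out:

#include <vector>

// im2col for one channel, stride 1, no padding.
// Input: H x W image. Output: (K*K) x (Ho*Wo) matrix stored row-major,
// where Ho = H-K+1, Wo = W-K+1 and each column is one unrolled K x K patch.
void im2col_single(const std::vector<float>& im, int H, int W, int K,
                   std::vector<float>& col) {
  const int Ho = H - K + 1, Wo = W - K + 1;
  col.assign(K * K * Ho * Wo, 0.0f);
  for (int p = 0; p < K; ++p)
    for (int q = 0; q < K; ++q)        // (p, q): position inside the patch -> row of col
      for (int y = 0; y < Ho; ++y)
        for (int x = 0; x < Wo; ++x)   // (y, x): patch index -> column of col
          col[(p * K + q) * (Ho * Wo) + (y * Wo + x)] = im[(y + p) * W + (x + q)];
}

// The convolution is then a single matrix product:
//   Y[1 x (Ho*Wo)] = W_flat[1 x (K*K)] * col[(K*K) x (Ho*Wo)]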
Convolutional Layer: im2col
See Chellapilla, "High Performance Convolutional Neural Networks for Document Processing" for more details.
im2col: Exercise
Work out these two cases based on the example above:
- 2 input features, 1 output feature
- 2 input features, 3 output features