An Architectural Approach for the New AI
ETTI Colloquia, May 31, 2017

Mihaela Malița, Computer Science Dept., Saint Anselm College, NH, US (www.anselm.edu/mmalita)
Gheorghe M. Ștefan, Electronic Devices, Circuits & Architectures Dept., ETTI, UPB (users.dcae.pub.ro/~gstefan/)
Abstract: The increased complexity faced by computer science leads it to look for solutions in the self-organizing mechanisms paradigm. Large, simple hardware and complex informational structures, tightly interleaved, are beginning to allow us to solve the hard new Artificial Intelligence (AI) problems we face. One of the most investigated methods in Machine Learning is the Convolutional Neural Network (CNN) computational model. Both hardware and software must be radically reshaped, and are already tending that way, because in their current form they cannot deliver the huge computational power demanded by AI applications at a reasonable energy cost. This presentation proposes a high-performance, low-power architectural solution for implementing CNNs for the new face of AI. The applications we consider range from stereo vision in automotive to Big Data.
Outline:
- Function in electronics: circuit & information
- Embedded Artificial Intelligence
- Convolutional Neural Networks (CNN)
- Functional Set Architecture for CNN
- Map-Reduce based Accelerated Processing Unit (APU)
Functional Electronics
- Early stage: microcontroller based
- Mature stage: heterogeneous networks of microcontrollers, specific circuits, and parallel accelerators
- Emerging stage: self-organizing informational structures based on the new embodiment of Artificial Intelligence: Deep (Convolutional) Neural Networks
Embedded Artificial Intelligence
- Functional electronics ~ embedded systems
- The current stage is dominated by the explicitly defined informational structure of programs embedded in large physical structures
- The emerging stage requires large, complex informational structures: "programs" embodied in matrices of weights, extracted from data as self-organized information
"AI winter"
- 1981: the ultimately unsuccessful Japanese Fifth Generation Computer project starts
- ~1987: collapse of the Lisp Machine market (Lambda Machine, Symbolics, … shut down the lights)
- ~1990: expert systems fall out of fashion
- 1990: the coldest year
- ~2000: AI starts to recover under different names: cognitive systems, computational intelligence
- ~2010: industrial applications of Deep Convolutional Neural Networks
Convolutional Neural Network
[Figure: AlexNet architecture]
Convolutional layer
- Input volume: W1 × H1 × D1
- Hyperparameters: K (number of filters), F (receptive field), S (stride), P (padding)
- Output volume: W2 × H2 × D2, where
  W2 = (W1 - F + 2P)/S + 1
  H2 = (H1 - F + 2P)/S + 1
  D2 = K
- Receptive field: vector of F × F × D1 components
- Weights per filter: vector of F × F × D1 components
- Matrix of weights: (F × F × D1) × K components
- Computation: W2 × H2 "multiplications" of the matrix of weights with receptive fields
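As a quick sanity check, a minimal Python sketch of these formulas (variable names follow the slide's notation; the AlexNet first-layer numbers in the example are our illustration):

    def conv_output_shape(W1, H1, D1, K, F, S, P):
        # Output volume of a convolutional layer, per the formulas above
        W2 = (W1 - F + 2 * P) // S + 1
        H2 = (H1 - F + 2 * P) // S + 1
        D2 = K
        weights = F * F * D1 * K       # size of the matrix of weights
        receptive_fields = W2 * H2     # one "multiplication" per output position
        return (W2, H2, D2), weights, receptive_fields

    # Example: AlexNet's first layer (227x227x3 input, 96 filters of 11x11,
    # stride 4, no padding)
    print(conv_output_shape(227, 227, 3, 96, 11, 4, 0))
    # -> ((55, 55, 96), 34848, 3025)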
Is it possible to define a Functional Set Architecture?
Case study: the TensorFlow functions used for the MNIST database of handwritten digits.
For all of them, the acceleration on our Map-Reduce architecture is achieved with a degree of parallelism > 95%.
TensorFlow functions for ML
- tf.matmul: map & reduce operations
- tf.add: map operations
- tf.nn.softmax: map & reduce operations
- tf.argmax: map & reduce operations
- tf.reduce_mean: reduce operations
- tf.equal: map operations
- tf.cast: map operations
- …: map and/or reduce operations
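For reference, a minimal sketch of the classic MNIST softmax-regression model in TensorFlow 1.x style (current at the time of this talk), exercising exactly the functions listed above:

    import tensorflow as tf

    # MNIST images (28x28 = 784 pixels) and one-hot labels (10 classes)
    x  = tf.placeholder(tf.float32, [None, 784])
    y_ = tf.placeholder(tf.float32, [None, 10])

    W = tf.Variable(tf.zeros([784, 10]))
    b = tf.Variable(tf.zeros([10]))

    # Model: matmul + add + softmax
    y = tf.nn.softmax(tf.add(tf.matmul(x, W), b))

    # Accuracy: argmax + equal + cast + reduce_mean
    correct  = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))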
Functional Set Architecture (FSA)
The set of functions of type mr.xxx is used to redefine tf.xxx for running on a p-cell Map-Reduce Accelerator:
- tf.matmul → mr.matmul
- tf.add → mr.add
- tf.nn.softmax → mr.nn.softmax
- tf.argmax → mr.argmax
- tf.reduce_mean → mr.reduce_mean
- tf.equal → mr.equal
- …
FSA(p) = {mr.matmul, mr.add, mr.nn.softmax, …}
Our main target: TensorFlow(FSA(p))
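A minimal sketch of what this redefinition could look like on the host side; the module name mr and its function names are taken from the slide, but the import-and-rebind mechanism is our illustrative assumption, not necessarily the authors' implementation:

    import tensorflow as tf
    import mr  # hypothetical host-side library driving the p-cell accelerator

    # Rebind the TensorFlow entry points to their mr.xxx counterparts,
    # so an unmodified TensorFlow script runs on the accelerator
    tf.matmul      = mr.matmul
    tf.add         = mr.add
    tf.nn.softmax  = mr.nn.softmax
    tf.argmax      = mr.argmax
    tf.reduce_mean = mr.reduce_mean
    tf.equal       = mr.equal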
Map-Reduce based Accelerated Processing Unit
Matrix-Vector Multiplication
Linear algebra on the Map-Reduce Accelerator
- Matrix-Vector Multiplication: for an N×N matrix, the execution time on p execution units is
  T_MVmult(N) = (N + 2 + log2(p)) clock cycles
  which represents supra-linear acceleration
- Matrix-Matrix Multiplication: for N×N matrices, the execution time on p execution units is
  T_MMmult(N) = (2N^2 + (43 + log2(p))N - 1) clock cycles
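To make the map/reduce split concrete, a pure-Python sketch of matrix-vector multiplication in this style: the map phase produces one product per cell, the reduce phase sums partial products in a log2(p)-depth tree. This illustrates the computation pattern, not the accelerator's actual microcode:

    def tree_reduce(values):
        # Pairwise (log-depth) summation, as a reduction network performs it
        while len(values) > 1:
            values = [values[i] + values[i + 1] if i + 1 < len(values)
                      else values[i] for i in range(0, len(values), 2)]
        return values[0]

    def mv_mult(A, x):
        # map: elementwise products for each row; reduce: tree summation
        return [tree_reduce([a * b for a, b in zip(row, x)]) for row in A]

    print(mv_mult([[1, 2], [3, 4]], [5, 6]))  # [17, 39]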
Comparative performances
In the MNIST experiment on a p-cell accelerator, 62% of the computation is accelerated p times and 38% is accelerated p/log2(p) times.
- Solution 1: x86 mono-core, 2 GHz, ~50 W
- Solution 2: our Map-Reduce on FPGA, p = 512, 500 MHz, ~40 W; acceleration > 90×
- Solution 3: our Map-Reduce as 28 nm ASIC, 84 mm², p = 2048, 1 GHz, 12 W at 85°C; acceleration > 650×
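The overall speed-up relative to a single accelerator cell follows from an Amdahl-style combination of the two fractions; the comparison against the x86 baseline additionally depends on clock frequency and per-cycle throughput, which this sketch deliberately leaves out:

    from math import log2

    def speedup(p, f_map=0.62, f_red=0.38):
        # 62% of the work accelerates p times, 38% accelerates p/log2(p) times
        return 1 / (f_map / p + f_red * log2(p) / p)

    print(round(speedup(512)))   # ~127 (FPGA configuration)
    print(round(speedup(2048)))  # ~427 (ASIC configuration)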
Current solutions for APU
- GPU (Graphics Processing Units; e.g., Nvidia): achieves ~1% of its peak performance in matrix-vector multiplication*
- MIC (Many Integrated Core; Intel's Xeon Phi): achieves at most 1.4% of its peak performance in matrix-vector multiplication*
- TPU (Tensor Processing Unit; Google): a very efficient ASIC that beats the GPU and MIC accelerators, but only for a narrow class of applications

*The performance is so low because of architectural incompatibilities: these designs are not meant to accelerate map-reduce operations.
The main drawbacks of the current solutions
- Actual vs. peak performance is very low for GPU and MIC (reason: the map operations and the reduce operations do not work well together)
- Energy consumption is very high for GPU and MIC (reason: cache-oriented rather than buffer-oriented architecture, and too much emphasis on floating-point arithmetic)
- TPU has an application-specific architecture (only matrix operations are supported efficiently; poor support for the other ML-specific functions; it is a systolic circuit, not a programmable system)
Our proposal: Map-Reduce Accelerator (MRA)
- A general-purpose programmable accelerator that can be added to current cloud architectures
- Actual vs. peak performance is very high (30-95%)
- Very low energy consumption:
  Xeon Phi: 2 TFLOP/s at 300 W
  MRA: 1 TFLOP/s at 12 W (12.5× the energy efficiency of the Xeon Phi)
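The 12.5× figure is simply the ratio of the two energy efficiencies:

    # Energy efficiency implied by the figures above, in GFLOP/s per watt
    xeon_phi = 2000 / 300   # ~6.7 GFLOP/s/W
    mra      = 1000 / 12    # ~83.3 GFLOP/s/W
    print(round(mra / xeon_phi, 1))  # 12.5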
Programming the Map-Reduce based APU
The program for the PU is:
- organized using a programming language (C, Python, …)
- readable
- locally modifiable
The "program" for A (the set of weight matrices) is:
- self-organized from data
- unreadable
- only globally modifiable
Concluding remarks:
Deep Convolutional NN computation must be accelerated in order to:
- reduce the training time, because:
  - the network architecture is established only experimentally
  - the training process is restarted for each new token
- reduce the energy consumed at run time, because:
  - the technology is used in mobile applications
  - the technology is used in data centers
MRA is qualified for the ML domain because:
- GFLOP/s/W is very high
- actual_performance/peak_performance is very high
Bibliography
- Mono-core performance (16-50% of float peak performance for matrix-vector multiplication): http://simulationcorner.net/index.php?page=fastmatrixvector
- Nvidia performance (~1% of float peak performance for matrix-vector multiplication): https://stackoverflow.com/questions/26417475/matrix-vector-multiplication-in-cuda-benchmarking-performance
- Xeon Phi performance (16-50% of float peak performance for matrix-vector multiplication): http://www.jcomputers.us/vol9/jcp0907-09.pdf
Thank you
Q & (possible) A