1 Mihaela Malița Gheorghe M. Ștefan
An Architectural Approach for the New AI
Mihaela Malița, Computer Science Dept., Saint Anselm College, NH, US
Gheorghe M. Ștefan, Electronic Devices, Circuits & Architectures Dept., ETTI, UPB, users.dcae.pub.ro/~gstefan/
ETTI Colloquia, May 31, 2017

2 Abstract: The increased complexity faced by computer science leads it to look for solutions in the paradigm of self-organizing mechanisms. Large, simple hardware tightly interleaved with complex informational structures is starting to allow us to solve the hard new Artificial Intelligence (AI) problems we face. One of the most investigated methods in Machine Learning is the Convolutional Neural Network (CNN) computational model. Both hardware and software must be, and tend to be, radically reshaped, as they are unable to provide the huge computational power requested by AI applications at a reasonable level of energy consumption. Our presentation proposes a high-performance, low-power architectural solution for implementing CNN solutions for the new face of AI. The applications we consider range from stereo vision in automotive to Big Data.

3 Outline:
Function in electronics: circuit & information
Embedded Artificial Intelligence
Convolutional Neural Networks (CNN)
Functional Set Architecture for CNN
Map-Reduce based Accelerated Processing Unit (APU)

4 Functional Electronics
Early stage: microcontroller based
Mature stage: heterogeneous networks of microcontrollers, specific circuits, parallel accelerators
Emerging stage: self-organizing informational structures based on the new embodiment of Artificial Intelligence: Deep (Convolutional) Neural Networks

5 Embedded Artificial Intelligence
Functional electronics ~ embedded systems
Current stage: dominated by the explicitly defined informational structure of programs embedded in big physical structures
Emerging stage: requires big and complex informational structures of "programs" embodied in matrices of weights extracted from data as self-organized information

6 "AI winter"
1981: the ultimately unsuccessful Japanese Fifth Generation Computer project starts
~1987: collapse of the Lisp Machine market (Lambda Machine, Symbolics, … shut down the lights)
~1990: expert systems fall out of fashion
1990: the coldest year
~2000: AI starts to recover under different names, such as cognitive systems and computational intelligence
~2010: industrial applications of Deep Convolutional Neural Networks

7 Convolutional Neural Network
AlexNet Architecture

8 Convolutional layer
Input volume: W1 × H1 × D1
Hyperparameters: K: number of filters, F: receptive field size, S: stride, P: padding
Output volume: W2 × H2 × D2, where W2 = (W1 − F + 2P)/S + 1, H2 = (H1 − F + 2P)/S + 1, D2 = K
Receptive field: vector of F × F × D1 components
Weights per filter: vector of F × F × D1 components
Matrix of weights: (F × F × D1) × K components
Computation: W2 × H2 "multiplications" of the matrix of weights with receptive fields
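A minimal sketch in plain Python/NumPy (our own illustration, not the presented implementation) of how the layer reduces to W2 × H2 products of the weight matrix with receptive-field vectors:

```python
import numpy as np

def conv_layer(x, w, S=1, P=0):
    """Convolutional layer as W2*H2 products of the K x (F*F*D1) weight
    matrix with F*F*D1-component receptive-field vectors.
    x: input volume (W1, H1, D1); w: filters (K, F, F, D1)."""
    W1, H1, D1 = x.shape
    K, F, _, _ = w.shape
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    xp = np.pad(x, ((P, P), (P, P), (0, 0)))        # zero padding
    wm = w.reshape(K, F * F * D1)                   # matrix of weights
    out = np.zeros((W2, H2, K))
    for i in range(W2):
        for j in range(H2):
            rf = xp[i*S:i*S+F, j*S:j*S+F, :].reshape(-1)  # receptive field vector
            out[i, j, :] = wm @ rf                  # one "multiplication" per position
    return out
```

For example, a 32×32×3 input with K = 8, F = 5, S = 1, P = 2 gives a 32×32×8 output volume, consistent with the formulas above.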

9 Is it possible to define a Functional Set Architecture?
Case study: the TensorFlow functions used for the MNIST database of handwritten digits. For all of them, the acceleration on our Map-Reduce architecture is achieved with a degree of parallelism > 95%.

10 TensorFlow functions for ML
tf.matmul: map & reduce operations
tf.add: map operations
tf.nn.softmax: map & reduce operations
tf.argmax: map & reduce operations
tf.reduce_mean: reduce operations
tf.equal: map operations
tf.cast: map operations
…: map and/or reduce operations
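For context, a minimal sketch of where these functions occur in the classic TensorFlow 1.x MNIST softmax model (our own illustrative reconstruction, not the code used in the experiments):

```python
import tensorflow as tf  # TensorFlow 1.x API, as current in 2017

x  = tf.placeholder(tf.float32, [None, 784])      # flattened 28x28 MNIST images
W  = tf.Variable(tf.zeros([784, 10]))
b  = tf.Variable(tf.zeros([10]))
y  = tf.nn.softmax(tf.add(tf.matmul(x, W), b))    # tf.matmul, tf.add, tf.nn.softmax
y_ = tf.placeholder(tf.float32, [None, 10])       # one-hot labels

correct  = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))   # tf.argmax, tf.equal
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))  # tf.cast, tf.reduce_mean
```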

11 Functional Set Architecture (FSA)
The set of functions of type mr.xxx is used to redefine tf.xxx for running on a p-cell Map-Reduce Accelerator:
tf.matmul (mr.matmul)
tf.add (mr.add)
tf.nn.softmax (mr.nn.softmax)
tf.argmax (mr.argmax)
tf.reduce_mean (mr.reduce_mean)
tf.equal (mr.equal)
FSA(p) = {mr.matmul, mr.add, mr.nn.softmax, …}
Our main target: TensorFlow(FSA(p))
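One way the FSA could be exposed as a Python namespace; the slide names the mr.xxx functions but not their binding mechanism, so the bodies below are plain NumPy stand-ins used only to illustrate the tf.xxx to mr.xxx redefinition:

```python
import numpy as np

class mr:
    """Hypothetical stand-in for the accelerator's FSA(p) primitives."""
    matmul      = staticmethod(np.matmul)   # map (per-cell products) & reduce (summation)
    add         = staticmethod(np.add)      # map only
    argmax      = staticmethod(np.argmax)   # map & reduce
    reduce_mean = staticmethod(np.mean)     # reduce only
    equal       = staticmethod(np.equal)    # map only

    @staticmethod
    def softmax(z, axis=-1):                # the slide's mr.nn.softmax, flattened here
        e = np.exp(z - np.max(z, axis=axis, keepdims=True))
        return e / np.sum(e, axis=axis, keepdims=True)

# FSA(p) = {mr.matmul, mr.add, mr.softmax, ...}; a framework built on top of it
# (the slide's "TensorFlow(FSA(p))") would call these instead of tf.xxx.
```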

12 Map-Reduce based Accelerated Processing Unit

13 Matrix-Vector Multiplication

14 Linear algebra on the Map-Reduce Accelerator
Matrix-Vector Multiplication: for an N×N matrix, the execution time with p execution units is T_MVmult(N) = (N log2 p) clock cycles, which represents super-linear acceleration
Matrix-Matrix Multiplication: for N×N matrices, the execution time with p execution units is T_MMmult(N) = (2N² + (43 + log2 p)N − 1) clock cycles
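A simplified software model (our own illustration, not the APU microcode or its actual data distribution) of the map/reduce decomposition behind matrix-vector multiplication: each of the p cells produces a partial dot product (map), and the partials are summed in a log2(p)-deep tree (reduce).

```python
def mv_mult_map_reduce(A, x, p):
    """Compute y = A*x for an N x N matrix A, simulating p execution cells."""
    N = len(A)
    chunk = (N + p - 1) // p                  # columns handled by each cell
    y = []
    for row in A:
        # map: every cell computes a partial dot product on its slice of the row
        partials = [sum(row[c] * x[c]
                        for c in range(i * chunk, min((i + 1) * chunk, N)))
                    for i in range(p)]
        # reduce: pairwise summation, log2(p) levels deep
        while len(partials) > 1:
            partials = [partials[i] + partials[i + 1] if i + 1 < len(partials)
                        else partials[i]
                        for i in range(0, len(partials), 2)]
        y.append(partials[0])
    return y

print(mv_mult_map_reduce([[1, 2], [3, 4]], [1, 1], p=2))   # [3, 7]
```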

15 Comparative performances
In the MNIST experiment on a p-cell accelerator: 62% of the computation is accelerated p times, 38% of the computation is accelerated p/log2(p) times
Solution 1: x86 mono-core, 2 GHz, ~50 Watt
Solution 2: our Map-Reduce, FPGA, p = 512, 500 MHz, ~40 Watt; acceleration > 90x
Solution 3: our Map-Reduce, ASIC 28 nm, 84 mm², p = 2048, 1 GHz, 12 Watt at 85 °C; acceleration > 650x
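A hedged sketch of the fraction-weighted (Amdahl-style) speedup implied by the 62%/38% split, counted in clock cycles only; it deliberately ignores clock-frequency, ISA, and memory differences between the machines, so it is not the full model behind the > 90x and > 650x figures above.

```python
from math import log2

def cycle_speedup(p, f_full=0.62, f_reduced=0.38):
    """Fraction-weighted speedup when f_full of the work is accelerated
    p times and f_reduced is accelerated p/log2(p) times."""
    return 1.0 / (f_full / p + f_reduced * log2(p) / p)

print(cycle_speedup(512))    # ~127x in cycle count for the p = 512 configuration
print(cycle_speedup(2048))   # ~427x in cycle count for the p = 2048 configuration
```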

16 Current solutions for APU
GPU (Graphics Processing Unit: Nvidia): uses in matrix-vector multiplication ~1% of its peak performance*
MIC (Many Integrated Core: Intel's Xeon Phi): uses in matrix-vector multiplication at most 1.4% of its peak performance*
TPU (Tensor Processing Unit: Google): a very efficient ASIC which beats the GPU and MIC accelerators, but for a narrow range of applications
*The performance is so low because of architectural incompatibilities: they are not designed to be accelerators for map-reduce operations

17 The main drawbacks of the current solutions
Actual performance vs. peak performance is very low for GPU and MIC (reason: the map operations and the reduce operations do not work easily together)
Energy consumption is very high for GPU and MIC (reason: cache-oriented architecture instead of buffer-oriented, and too much emphasis on float arithmetic)
Application-specific architecture for TPU (only matrix operations are supported efficiently; poor support for other specific ML functions; it is a systolic circuit, not a programmable system)

18 Our proposal: Map-Reduce Accelerator (MRA)
General-purpose programmable accelerator that can be added to current cloud architectures
Actual performance vs. peak performance is very high (30 – 95%)
Very low energy consumption: Xeon Phi: 2 TFLOP/sec at 300 Watt (~6.7 GFLOP/sec/Watt); MRA: 1 TFLOP/sec at 12 Watt (~83 GFLOP/sec/Watt), i.e., 12.5x the energy efficiency of Xeon Phi

19 Programming Map-Reduce based APU
The program for the PU is: organized using a programming language (C, Python, …); readable; locally modifiable
The "program" for A (the set of weight matrices) is: self-organized from data; unreadable; only globally modifiable

20 Concluding remarks:
Deep Convolutional NN computation must be accelerated in order to:
reduce the training time, because the network architecture is established only experimentally and the training process is restarted for each new token
reduce the energy consumption at run time, because the technology is used both in mobile applications and in data centers
MRA is well qualified for the ML domain because GFLOP/sec/Watt is very high and actual_performance/peak_performance is very high

21 Bibliography
Mono-core performance: 16-50% of float peak performance for matrix-vector multiplication, at:
Nvidia performance: ~1% of float peak performance for matrix-vector multiplication, at:
Xeon Phi performance: 16-50% of float peak performance for matrix-vector multiplication, at:

22 Thank you Q&(possible)A

