An Architectural Approach for the New AI
ETTI Colloquia, May 31, 2017

Mihaela Malița, Computer Science Dept., Saint Anselm College, NH, US (www.anselm.edu/mmalita)
Gheorghe M. Ștefan, Electronic Devices, Circuits & Architectures Dept., ETTI, UPB (users.dcae.pub.ro/~gstefan/)

Abstract: The growing complexity faced by computer science pushes it toward the paradigm of self-organizing mechanisms. Large, simple hardware tightly interleaved with complex informational structures is beginning to let us solve the hard, new Artificial Intelligence (AI) problems we face. One of the most investigated methods in Machine Learning is the Convolutional Neural Network (CNN) computational model. Both hardware and software must be, and are being, radically reshaped, because current systems cannot deliver the huge computational power demanded by AI applications at a reasonable energy cost. This presentation proposes a high-performance, low-power architectural solution for implementing CNNs for the new face of AI. The applications we consider range from stereo vision in automotive to Big Data.

Outline:
- Function in electronics: circuit & information
- Embedded Artificial Intelligence
- Convolutional Neural Networks (CNN)
- Functional Set Architecture for CNN
- Map-Reduce based Accelerated Processing Unit (APU)

Functional Electronics
- Early stage: microcontroller based
- Mature stage: heterogeneous networks of microcontrollers, specific circuits, and parallel accelerators
- Emerging stage: self-organizing informational structures based on the new embodiment of Artificial Intelligence: Deep (Convolutional) Neural Networks

Embedded Artificial Intelligence
- Functional electronics ~ embedded systems
- The current stage is dominated by explicitly defined informational structures: programs embedded in large physical structures
- The emerging stage requires large, complex informational structures: "programs" embodied in matrices of weights extracted from data as self-organized information

"AI winter"
- 1981: the ultimately unsuccessful Japanese Fifth Generation Computer project starts
- ~1987: collapse of the Lisp Machine market (Lambda Machine, Symbolics, ... turn off the lights)
- ~1990: expert systems fall out of fashion
- 1990: the coldest year
- ~2000: AI starts to recover under different names such as cognitive systems and computational intelligence
- ~2010: industrial applications of Deep Convolutional Neural Networks

Convolutional Neural Network: AlexNet architecture (figure)

Convolutional layer
- Input volume: W1 × H1 × D1
- Hyperparameters: K (number of filters), F (receptive field), S (stride), P (padding)
- Output volume: W2 × H2 × D2, where
  W2 = (W1 - F + 2P)/S + 1
  H2 = (H1 - F + 2P)/S + 1
  D2 = K
- Receptive field: vector of F × F × D1 components
- Weights per filter: vector of F × F × D1 components
- Matrix of weights: (F × F × D1) × K components
- Computation: W2 × H2 "multiplications" of the matrix of weights with receptive fields
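
As a quick illustration of the output-volume formulas above, here is a minimal Python sketch; the function name and the AlexNet-style example values are ours, not from the slides:

```python
def conv_output_volume(W1, H1, D1, K, F, S, P):
    """Output volume W2 x H2 x D2 of a convolutional layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K
    return W2, H2, D2

# Example (assumed values): first AlexNet-style convolution,
# 227x227x3 input, 96 filters of 11x11, stride 4, no padding.
print(conv_output_volume(227, 227, 3, K=96, F=11, S=4, P=0))  # -> (55, 55, 96)
```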

Is it possible to define a Functional Set Architecture?
- Case study: the TensorFlow functions used for the MNIST database of handwritten digits
- For all of them, acceleration on our Map-Reduce architecture is achieved with a degree of parallelism > 95%.

TensorFlow functions for ML
- tf.matmul: map & reduce operations
- tf.add: map operations
- tf.nn.softmax: map & reduce operations
- tf.argmax: map & reduce operations
- tf.reduce_mean: reduce operations
- tf.equal: map operations
- tf.cast: map operations
- ...: map and/or reduce operations
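
For context, a minimal MNIST softmax-regression graph in the TensorFlow 1.x style of the time exercises exactly these functions; the variable names are ours and this is a sketch, not code from the talk:

```python
import tensorflow as tf  # TensorFlow 1.x API, as used around 2017

x  = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 MNIST images
y_ = tf.placeholder(tf.float32, [None, 10])    # one-hot labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

# tf.matmul (map & reduce), tf.add (map), tf.nn.softmax (map & reduce)
y = tf.nn.softmax(tf.add(tf.matmul(x, W), b))

# tf.argmax (map & reduce), tf.equal (map), tf.cast (map), tf.reduce_mean (reduce)
correct  = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
```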

Functional Set Architecture (FSA)
The set of functions of type mr.xxx is used to redefine tf.xxx for running on a p-cell Map-Reduce Accelerator:
- tf.matmul (mr.matmul)
- tf.add (mr.add)
- tf.nn.softmax (mr.nn.softmax)
- tf.argmax (mr.argmax)
- tf.reduce_mean (mr.reduce_mean)
- tf.equal (mr.equal)
- ...
FSA(p) = {mr.matmul, mr.add, mr.nn.softmax, ...}
Our main target: TensorFlow(FSA(p))
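
One way to read this slide (our interpretation, not the authors' implementation): each mr.* function keeps the signature of the tf.* function it redefines but offloads the work to the p-cell accelerator. A purely hypothetical Python sketch of that idea:

```python
import numpy as np

# Hypothetical sketch of the FSA idea: mr.* functions mirror the tf.* signatures
# but dispatch to a p-cell Map-Reduce Accelerator. The class, the `p` parameter
# and the NumPy fallback are illustrative assumptions, not the authors' code.

class MapReduceAccelerator:
    def __init__(self, p):
        self.p = p                      # number of execution cells

    def matmul(self, a, b):             # stands in for mr.matmul
        return np.matmul(a, b)          # real version would run on the p cells

def make_fsa(p):
    """FSA(p): the set of mr.* functions backed by a p-cell accelerator."""
    acc = MapReduceAccelerator(p)
    return {"mr.matmul": acc.matmul}    # mr.add, mr.nn.softmax, ... analogous

fsa = make_fsa(p=512)
```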

Map-Reduce based Accelerated Processing Unit (figure)

Matrix-Vector Multiplication (figure)

Linear algebra on the Map-Reduce Accelerator
- Matrix-Vector Multiplication: for an N×N matrix, the execution time with p execution units is
  T_MVmult(N) = (N + 2 + log2 p) clock cycles,
  which represents supra-linear acceleration
- Matrix-Matrix Multiplication: for N×N matrices, the execution time with p execution units is
  T_MMmult(N) = (2N² + (43 + log2 p)N - 1) clock cycles
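
Plugging numbers into these timing formulas gives a feel for the scaling; a small sketch (function names and the example sizes are ours):

```python
from math import log2

def t_mv_mult(N, p):
    """Clock cycles for N x N matrix-vector multiplication on p execution units."""
    return N + 2 + log2(p)

def t_mm_mult(N, p):
    """Clock cycles for N x N matrix-matrix multiplication on p execution units."""
    return 2 * N**2 + (43 + log2(p)) * N - 1

# Assumed example: p = 512 cells, N = 512
print(t_mv_mult(512, 512))   # 523 cycles, vs ~262,144 sequential multiply-accumulates
print(t_mm_mult(512, 512))   # ~550,911 cycles
```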

Comparative performance
In the MNIST experiment on a p-cell accelerator, 62% of the computation is accelerated p times and 38% is accelerated p/log2(p) times.
- Solution 1: x86 mono-core, 2 GHz, ~50 W
- Solution 2: our Map-Reduce accelerator, FPGA, p = 512, 500 MHz, ~40 W; acceleration > 90x
- Solution 3: our Map-Reduce accelerator, ASIC 28 nm, 84 mm², p = 2048, 1 GHz, 12 W at 85 °C; acceleration > 650x
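
The 62%/38% split gives an Amdahl-style estimate of the effective parallel speedup relative to a single cell of the same accelerator; the quoted >90x and >650x figures against the x86 baseline additionally fold in clock-rate and per-cycle throughput differences. A quick check:

```python
from math import log2

def effective_speedup(p, frac_full=0.62, frac_log=0.38):
    """Speedup vs one cell when frac_full is sped up p times
    and frac_log is sped up p/log2(p) times (Amdahl-style estimate)."""
    return 1.0 / (frac_full / p + frac_log * log2(p) / p)

print(round(effective_speedup(512)))   # ~127x for the FPGA configuration (p = 512)
print(round(effective_speedup(2048)))  # ~427x for the ASIC configuration (p = 2048)
```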

Current solutions for an APU
- GPU (Graphics Processing Unit: Nvidia): uses ~1% of its peak performance in matrix-vector multiplication*
- MIC (Many Integrated Core: Intel's Xeon Phi): uses at most 1.4% of its peak performance in matrix-vector multiplication*
- TPU (Tensor Processing Unit: Google): a very efficient ASIC that beats the GPU and MIC accelerators, but only for a narrow class of applications
* The performance is so low because of architectural incompatibilities: they are not designed to be accelerators for map-reduce operations.

The main drawbacks of the current solutions
- Actual vs. peak performance is very low for GPU and MIC (reason: the map operations and the reduce operations do not work easily together)
- Energy consumption is very high for GPU and MIC (reason: cache-oriented instead of buffer-oriented architecture, and too much emphasis on floating-point arithmetic)
- The TPU has an application-specific architecture (only matrix operations are supported efficiently; poor support for other ML-specific functions; it is a systolic circuit, not a programmable system)

Our proposal: the Map-Reduce Accelerator (MRA)
- A general-purpose programmable accelerator that can be added to current cloud architectures
- Actual vs. peak performance is very high (30-95%)
- Very low energy consumption:
  Xeon Phi: 2 TFLOP/s at 300 W
  MRA: 1 TFLOP/s at 12 W (12.5x the energy efficiency of the Xeon Phi)
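
The 12.5x figure follows directly from the FLOP/s-per-watt ratio:

```python
xeon_phi = 2e12 / 300   # ~6.7 GFLOP/s per watt
mra      = 1e12 / 12    # ~83.3 GFLOP/s per watt
print(mra / xeon_phi)   # 12.5
```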

Programming the Map-Reduce based APU
The program for the PU is:
- organized using a programming language (C, Python, ...)
- readable
- locally modifiable
The "program" for A (the set of weight matrices) is:
- self-organized from data
- unreadable
- only globally modifiable

Concluding remarks:
Deep Convolutional NN computation must be accelerated in order to:
- reduce the training time, because the network architecture is established only experimentally and the training process is restarted for each new token
- reduce the energy consumption at run time, because the technology is used in mobile applications and in data centers
The MRA is well qualified for the ML domain because:
- its GFLOP/s/W is very high
- its actual_performance/peak_performance ratio is very high

Bibliography
- Mono-core performance: 16-50% of float peak performance for matrix-vector multiplication: http://simulationcorner.net/index.php?page=fastmatrixvector
- Nvidia performance: ~1% of float peak performance for matrix-vector multiplication: https://stackoverflow.com/questions/26417475/matrix-vector-multiplication-in-cuda-benchmarking-performance
- Xeon Phi performance: 16-50% of float peak performance for matrix-vector multiplication: http://www.jcomputers.us/vol9/jcp0907-09.pdf

Thank you. Q & (possible) A