Effects of Limiting Numerical Precision on Neural Networks

Presentation transcript:

Effects of Limiting Numerical Precision on Neural Networks: An Empirical Study on Deep Learning Accelerators. vinu@cs.utah.edu, chandru@cs.utah.edu

Deep Learning Applications: compute-intensive embedded projects like drones, autonomous robotic systems, mobile medical imaging, and Intelligent Video Analytics (IVA). OEMs, independent developers. Neural Network: Training, Inference

Problem Statement: (New-age) neural network architects are having a hard time. Training: time, compute resources, hyperparameter search, … Inference: power budget, accuracy

Lecture: Tools to Explore Accelerators. We saw Minerva in class … Topics: the Minerva tool to explore the design space, prune/quantize, lower voltages; project prep

Overview: Minerva, Keras, Aladdin, Madonna

Motivation

Project Proposal: MADONNA, a tool for Measurement & Assistance in Design Of NN to its Architect. Measurement: NVIDIA Jetson TK1. Assistance: FPTuner, Ristretto

Numeric Precision

Assistance

Measurement

Architecting a good Deep Learning Application: low power, user-defined accuracy, best possible accuracy within the power budget. MAUD: our project, a framework for the NN architect. Principle: Measure As yoU Design

Hardware: Jetson Embedded Platform, Measurement hardware, Topology

Jetson Embedded Platform: NVIDIA Jetson with GPU-accelerated parallel processing is a leading embedded visual computing platform. It features high-performance, low-energy computing for deep learning and computer vision, and is ideal for compute-intensive embedded projects like drones, autonomous robotic systems, mobile medical imaging, and Intelligent Video Analytics (IVA). OEMs, independent developers, makers, and hobbyists can use the NVIDIA Jetson TX1 to explore the future of embedded computing.

Measurement hardware: Yokogawa WT310, from the WT300E series of digital power analyzers. It provides extremely low current measurement capability, down to 50 micro-amps, which makes it ideal for engineers performing stand-by power measurements.

Topology: The head/gateway node is mir.cs.utah.edu, the large tower computer on the floor. mir is connected to the switch on the table, which in turn is connected to the NVIDIA Jetson TK1 boards (mir01, mir02, ..., mir16).

NVIDIA Tegra, Jetson K1 Board

GPU-based Accelerator: Tegra K1 Processor Specifications
GPU: NVIDIA® Kepler™ architecture, 192 NVIDIA CUDA® cores
CPU: NVIDIA 4-Plus-1™ quad-core ARM Cortex-A15 "r3", max clock speed 2.3 GHz
Memory: DDR3L and LPDDR3, max memory size 8 GB (with 40-bit address extension)
Display: LCD 3840x2160, HDMI 4K (UltraHD, 4096x2160)
Package: 23x23 FCBGA, 16x16 S-FCCSP, 15x15 FC PoP
Process: 28 nm
Source: http://www.nvidia.com/object/tegra-k1-processor.html

Workflow Design

Software. Measurement: eServer.exe (backend), eNergy.py (frontend). Assistance: Caffe, FPTuner, Ristretto

Caffe: make -j 4 all: 10 W, 8690 Joules

Caffe: make clean: 5 W, 34 Joules
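
The two build measurements pair a power reading with an energy total. Since energy is power integrated over time, a short Python sketch can show how such totals might be derived from sampled power and how the implied run time can be recovered. This assumes the wattage on the slides is an average over the run, which the slides do not state. The function names are hypothetical and are not part of eServer.exe or eNergy.py.

```python
def implied_duration_s(energy_j, avg_power_w):
    """Run time implied by E = P * t, assuming avg_power_w is an average."""
    if avg_power_w <= 0:
        raise ValueError("average power must be positive")
    return energy_j / avg_power_w


def energy_from_samples(times_s, power_w):
    """Integrate sampled power (W) over time (s) with the trapezoidal rule."""
    if len(times_s) != len(power_w):
        raise ValueError("times and power samples must have the same length")
    total_j = 0.0
    for i in range(1, len(times_s)):
        dt = times_s[i] - times_s[i - 1]
        # Area of one trapezoid between two consecutive samples.
        total_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total_j


if __name__ == "__main__":
    # Figures taken from the two build slides.
    print(implied_duration_s(8690, 10))  # make -j 4 all
    print(implied_duration_s(34, 5))     # make clean
```

Under that assumption, the full build corresponds to roughly 869 seconds at 10 W, and the clean to under 7 seconds at 5 W.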

Performance Metrics (No free lunch)

DOUBLE
I1207 02:12:46.015034 4489 caffe.cpp:275] Batch 49, loss = 0.783467
I1207 02:12:46.015295 4489 caffe.cpp:280] Loss: 0.749668
I1207 02:12:46.015519 4489 caffe.cpp:292] accuracy = 0.7538
I1207 02:12:46.015750 4489 caffe.cpp:292] loss = 0.749668 (* 1 = 0.749668 loss)

SINGLE
I1207 04:02:18.003669 19796 caffe.cpp:275] Batch 49, loss = 0.780145
I1207 04:02:18.003912 19796 caffe.cpp:280] Loss: 0.748495
I1207 04:02:18.004142 19796 caffe.cpp:292] accuracy = 0.7488
I1207 04:02:18.004376 19796 caffe.cpp:292] loss = 0.748495 (* 1 = 0.748495 loss)
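
To compare the double- and single-precision runs, the final accuracy and loss can be pulled out of the Caffe log text. The Python sketch below is one way to do that with a regular expression. It relies only on the line shapes visible in the log excerpts on this slide, and the helper name `final_metrics` is hypothetical.

```python
import re

# Matches the summary lines "...] accuracy = 0.7538" and "...] loss = 0.749668".
# Per-batch lines ("] Batch 49, loss = ...") do not match because the metric
# name must directly follow the closing bracket.
_METRIC = re.compile(r"\]\s+(accuracy|loss) = ([0-9]+\.[0-9]+)")


def final_metrics(log_text):
    """Return the last reported accuracy and loss found in a Caffe test log."""
    metrics = {}
    for name, value in _METRIC.findall(log_text):
        metrics[name] = float(value)
    return metrics


DOUBLE_LOG = """\
I1207 02:12:46.015034 4489 caffe.cpp:275] Batch 49, loss = 0.783467
I1207 02:12:46.015295 4489 caffe.cpp:280] Loss: 0.749668
I1207 02:12:46.015519 4489 caffe.cpp:292] accuracy = 0.7538
I1207 02:12:46.015750 4489 caffe.cpp:292] loss = 0.749668 (* 1 = 0.749668 loss)
"""

SINGLE_LOG = """\
I1207 04:02:18.003669 19796 caffe.cpp:275] Batch 49, loss = 0.780145
I1207 04:02:18.003912 19796 caffe.cpp:280] Loss: 0.748495
I1207 04:02:18.004142 19796 caffe.cpp:292] accuracy = 0.7488
I1207 04:02:18.004376 19796 caffe.cpp:292] loss = 0.748495 (* 1 = 0.748495 loss)
"""

if __name__ == "__main__":
    double = final_metrics(DOUBLE_LOG)
    single = final_metrics(SINGLE_LOG)
    drop = (double["accuracy"] - single["accuracy"]) * 100
    print(f"accuracy drop: {drop:.2f} percentage points")
```

For the two runs shown, the accuracy difference between double and single precision is about half a percentage point.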

Power Metrics: CIFAR 10
6 W, 10 W: 34K Joules, 346 Joules
10 W, 5.5 W: 12K Joules, 244 Joules

LENET - MNIST
8 W, 4.5 W: 11K Joules, 62 Joules
8 W, 4 W: 4K Joules, 45 Joules

Next Steps
Measurement: Complete measurement studies (CaffeNet, ImageNet). Study and report the impact of precision on energy consumption.
Assistance: Attempt implementing fixed-point support in Ristretto / native Caffe. Resume FPTuner addition to the workflow, aiming for automation.
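
As a reference point for the fixed-point work, the Python sketch below shows what rounding a value to a signed fixed-point format with saturation looks like. It is a generic illustration and not Ristretto's or Caffe's implementation. The function name, the default bit widths, and the use of Python's built-in round-half-to-even are all assumptions made for the example.

```python
def quantize_fixed_point(x, total_bits=8, frac_bits=4):
    """Round x to a signed fixed-point value and return it as a float.

    total_bits counts the sign bit; frac_bits is the number of fractional bits.
    Values outside the representable range saturate to the nearest limit.
    """
    if total_bits < 2 or frac_bits < 0:
        raise ValueError("need total_bits >= 2 and frac_bits >= 0")
    scale = 1 << frac_bits                # one unit equals 2**-frac_bits
    q_min = -(1 << (total_bits - 1))      # most negative integer code
    q_max = (1 << (total_bits - 1)) - 1   # most positive integer code
    q = round(x * scale)                  # Python rounds ties to even
    q = max(q_min, min(q_max, q))         # saturate instead of wrapping
    return q / scale


if __name__ == "__main__":
    for value in (0.3, 1.0, 100.0, -100.0):
        print(value, "->", quantize_fixed_point(value))
```

With 8 total bits and 4 fractional bits, the representable range is -8.0 to 7.9375 in steps of 0.0625, so small weights pick up rounding error and large ones saturate.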