Parallel Processing and GPUs


Parallel Processing and GPUs
with Daniel L. Silver, Ph.D., and Christian Frey, BBA
April 11-12, 2017

Benefits of GPUs
- Hundreds or thousands of cores allow data to be processed in parallel, such as computing the forward pass through the network for every example in a batch simultaneously.
- GPU memory is faster and physically closer to the GPU cores, further improving processing speed.
- GPUs are designed specifically for the vector and matrix operations (addition, multiplication) that come up constantly in deep learning.
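The batched forward pass described above can be sketched with NumPy: a single matrix multiply covers the whole batch, which is exactly the kind of operation a GPU spreads across its cores. The layer sizes here are illustrative assumptions, not from the slides.

```python
import numpy as np

def forward(X, W, b):
    """Forward pass of one dense layer for a whole batch at once.
    X: (batch, in_features), W: (in_features, out_features), b: (out_features,)
    One matrix multiply processes every example in parallel."""
    return np.maximum(X @ W + b, 0.0)  # ReLU activation

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 784))   # batch of 128 examples (assumed sizes)
W = rng.standard_normal((784, 256))
b = np.zeros(256)
H = forward(X, W, b)
print(H.shape)  # (128, 256)
```

On a GPU the same expression (via CuPy, PyTorch, or TensorFlow) runs unchanged; the speedup comes from the hardware, not the code.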

Drawbacks of GPUs
- Moving data between the CPU and GPU is very costly, so unnecessary transfers slow down processing.
- It is difficult to add memory to a GPU if the network size exceeds GPU memory: CPUs can use up to 1 TB of RAM, whereas GPUs (as of 2017) max out around 24-32 GB.
- Each GPU core is less powerful than a CPU core, so operations that cannot be parallelized are slower than on a CPU.
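Why unnecessary CPU-GPU movement hurts can be shown with a toy cost model: each transfer pays a fixed latency plus a per-item copy cost, so transferring examples one at a time pays the latency thousands of times, while one batched transfer pays it once. All numbers below are made-up illustrations, not measured figures.

```python
# Toy cost model (assumed numbers) for host<->device data movement.
TRANSFER_LATENCY_US = 10.0   # fixed cost per transfer, microseconds (assumed)
US_PER_ITEM = 0.01           # per-example copy cost, microseconds (assumed)
N = 10_000                   # examples to move

per_item_us = N * (TRANSFER_LATENCY_US + US_PER_ITEM)  # one transfer per example
batched_us = TRANSFER_LATENCY_US + N * US_PER_ITEM     # one transfer for the batch
slowdown = per_item_us / batched_us
print(slowdown)  # 910.0 under these assumed costs
```

The exact ratio depends entirely on the assumed constants; the point is that batching transfers, and keeping data resident on the GPU between operations, amortizes the fixed latency.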

Field Programmable Gate Array (FPGA)
- An integrated circuit that can be programmed after manufacture ("in the field") to implement a specific function.
- Tends to be faster than a GPU for the specific task it is configured for, but is less versatile.
- Can be combined with embedded microprocessors to form a complete system.
- More power efficient than GPUs, because it carries none of the general-purpose hardware it does not need.

Application Specific Integrated Circuit (ASIC)
- An integrated circuit designed for a single purpose, such as multiplying matrices.
- Can get away with lower precision (even 8-bit) when calculating weight updates or node outputs, allowing faster training.
- Google's TPU (Tensor Processing Unit) is an ASIC, and powered AlphaGo in its match against Lee Sedol.
- Faster and less power hungry than traditional GPUs, but the circuitry can cost millions of dollars to design and fabricate, so ASICs are only practical for larger-scale operations (such as Google's).
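The low-precision arithmetic mentioned above can be illustrated with a minimal int8 linear-quantization sketch in NumPy. The per-tensor scale scheme here is a simplified, common convention chosen for illustration; real ASICs such as the TPU use their own numeric formats.

```python
import numpy as np

def quantize_int8(x):
    """Linearly map float values to int8 with one scale for the whole
    tensor (simplified scheme; real hardware formats differ)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.standard_normal(1000).astype(np.float32)  # pretend layer weights
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale              # dequantize
max_err = float(np.abs(w - w_hat).max())
print(max_err <= scale)  # rounding error stays within one quantization step
```

Storing and multiplying 8-bit integers instead of 32-bit floats is what lets an ASIC pack far more arithmetic units into the same silicon and power budget, at the cost of the small rounding error measured above.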