Parallel Processing and GPUs


Parallel Processing and GPUs
with Daniel L. Silver, Ph.D., and Christian Frey, BBA
April 11-12, 2017

Benefits of GPUs
- Hundreds or thousands of cores allow data to be processed in parallel, such as computing the forward pass through the network for every example in a batch simultaneously.
- GPU memory is faster and physically closer to the GPU cores, further improving processing speed.
- GPUs are designed specifically for the vector and matrix operations (addition, multiplication) that come up constantly in deep learning.
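The batched forward pass described above can be sketched with NumPy: a single matrix multiply covers the whole batch, which is exactly the kind of operation a GPU spreads across its cores. The layer sizes here are illustrative assumptions, not from the slides.

```python
import numpy as np

def forward(X, W, b):
    """Forward pass of one dense layer for a whole batch at once.
    X: (batch, in_features), W: (in_features, out_features), b: (out_features,)
    One matrix multiply processes every example in parallel."""
    return np.maximum(X @ W + b, 0.0)  # ReLU activation

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 784))   # batch of 128 examples (assumed sizes)
W = rng.standard_normal((784, 256))
b = np.zeros(256)
H = forward(X, W, b)
print(H.shape)  # (128, 256)
```

On a GPU the same expression (via CuPy, PyTorch, or TensorFlow) runs unchanged; the speedup comes from the hardware, not the code.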

Drawbacks of GPUs
- Moving data between the CPU and GPU is very costly, so unnecessary transfers slow down processing.
- It is difficult to add memory to a GPU if the network size exceeds GPU memory: CPUs can use up to 1 TB of RAM, whereas GPUs (as of 2017) max out around 24-32 GB.
- Each GPU core is less powerful than a CPU core, so operations that cannot be parallelized are slower than on a CPU.
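Why unnecessary CPU-GPU movement hurts can be shown with a toy cost model: each transfer pays a fixed latency plus a per-item copy cost, so transferring examples one at a time pays the latency thousands of times, while one batched transfer pays it once. All numbers below are made-up illustrations, not measured figures.

```python
# Toy cost model (assumed numbers) for host<->device data movement.
TRANSFER_LATENCY_US = 10.0   # fixed cost per transfer, microseconds (assumed)
US_PER_ITEM = 0.01           # per-example copy cost, microseconds (assumed)
N = 10_000                   # examples to move

per_item_us = N * (TRANSFER_LATENCY_US + US_PER_ITEM)  # one transfer per example
batched_us = TRANSFER_LATENCY_US + N * US_PER_ITEM     # one transfer for the batch
slowdown = per_item_us / batched_us
print(slowdown)  # 910.0 under these assumed costs
```

The exact ratio depends entirely on the assumed constants; the point is that batching transfers, and keeping data resident on the GPU between operations, amortizes the fixed latency.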

Field Programmable Gate Array (FPGA)
- An integrated circuit that can be programmed after manufacture ("in the field") to implement a specific function.
- Tends to be faster than a GPU for the specific task it is configured for, but is less versatile.
- Can be combined with embedded microprocessors to form a complete system.
- More power efficient than GPUs, because it carries none of the general-purpose hardware it does not need.

Application Specific Integrated Circuit (ASIC)
- An integrated circuit designed for a single purpose, such as multiplying matrices.
- Can get away with lower precision (even 8-bit) when calculating weight updates or node outputs, allowing faster training.
- Google's TPU (Tensor Processing Unit) is an ASIC, and powered AlphaGo in its match against Lee Sedol.
- Faster and less power hungry than traditional GPUs, but the circuitry can cost millions of dollars to design and fabricate, so ASICs are only practical for larger-scale operations (such as Google's).
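The low-precision arithmetic mentioned above can be illustrated with a minimal int8 linear-quantization sketch in NumPy. The per-tensor scale scheme here is a simplified, common convention chosen for illustration; real ASICs such as the TPU use their own numeric formats.

```python
import numpy as np

def quantize_int8(x):
    """Linearly map float values to int8 with one scale for the whole
    tensor (simplified scheme; real hardware formats differ)."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
w = rng.standard_normal(1000).astype(np.float32)  # pretend layer weights
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale              # dequantize
max_err = float(np.abs(w - w_hat).max())
print(max_err <= scale)  # rounding error stays within one quantization step
```

Storing and multiplying 8-bit integers instead of 32-bit floats is what lets an ASIC pack far more arithmetic units into the same silicon and power budget, at the cost of the small rounding error measured above.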