Tim Madden ODG/XSD

• Graphics Processing Unit: the graphics card in your PC.
• Provides "hardware accelerated graphics."
• The video game industry is the main driver.
• More recently used for non-graphics applications.

• A card on the PCI-Express bus.
• The GPU card contains its own RAM and processor(s).
• What is a core?
• A core is built around an ALU (arithmetic logic unit); in effect it is a single processor that can run a program.
• Modern PCs are "quad core": essentially 4 processors. This refers to the CPU on the motherboard that runs Windows.
• A GPU has hundreds of cores!
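As a side illustration (mine, not from the slides), the CUDA runtime can report what a given card offers; the following is a minimal sketch using cudaGetDeviceProperties:

// query_gpu.cu -- report basic properties of GPU 0 (illustrative sketch)
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA-capable GPU found\n");
        return 1;
    }
    /* Cores per multiprocessor vary by architecture, so only the SM count is printed. */
    printf("Device name          : %s\n", prop.name);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Global memory (MB)   : %lu\n",
           (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    return 0;
}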

• Originally for graphics applications.
• Graphics code is developed with the DirectX SDK (Windows and Xbox) or OpenGL (cross-platform).
• OpenGL/DirectX are precompiled graphics libraries whose work runs on the GPU; they only expose graphics operations.
• CUDA: a general SDK that allows writing C programs that run on the GPU.
• CUDA lets general, non-graphics applications run on the GPU, e.g. scientific programming.
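For orientation (my own minimal sketch, not from the slides), a CUDA source file mixes ordinary host C/C++ with __global__ functions that run on the GPU; here every GPU thread writes its own index into an array:

// hello.cu -- a minimal CUDA kernel launch (illustrative sketch)
#include <stdio.h>
#include <cuda_runtime.h>

// Each GPU thread writes its own global index into the output array.
__global__ void writeIndex_k(int *out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = i;
}

int main(void)
{
    const int n = 16;
    int host[16];
    int *dev = NULL;

    cudaMalloc((void **)&dev, n * sizeof(int));
    writeIndex_k<<<1, n>>>(dev, n);          /* 1 block of n threads */
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    for (int i = 0; i < n; i++)
        printf("%d ", host[i]);
    printf("\n");
    return 0;
}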

• Parallel programming?
• What is a "thread"?
• A sequence of commands in a program that run one after another.

#include <stdio.h>
#include <windows.h>   /* for Sleep() */

void oneThread(int N)
{
    int counter = 0;
    while (1) {
        printf("Thread %d Count %d\n", N, counter++);
        Sleep(1000);   /* sleep for 1 second */
    }
}

• A typical program on a PC has many threads running at once.
• An EPICS IOC has about 20 threads running.
• This PowerPoint program is running 8 threads (at the time of typing this sentence).
(Slide graphic: many copies of the oneThread() code box, one per running thread.)

• The more threads running, the slower each thread runs.
• The solution is to add more processors. A "core" is a processor.
• A "quad core" PC has 4 processors, each running hundreds of threads.
(Slide graphic: four PROCESSOR boxes, each holding several oneThread() code boxes, showing the threads distributed across the cores.)
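To make the picture concrete (my own sketch, not code from the slides), a host program can spawn several such threads with std::thread and let the operating system spread them across the cores:

// threads_demo.cpp -- run oneThread() on several host threads (illustrative sketch)
#include <cstdio>
#include <chrono>
#include <thread>
#include <vector>

// Portable stand-in for the Windows Sleep(ms) call used on the slides.
static void oneThread(int N)
{
    int counter = 0;
    while (true) {
        std::printf("Thread %d Count %d\n", N, counter++);
        std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    }
}

int main()
{
    std::vector<std::thread> workers;
    for (int n = 0; n < 8; n++)      /* one thread per "core" in this example */
        workers.emplace_back(oneThread, n);
    for (auto &w : workers)
        w.join();                    /* never returns; the threads loop forever */
    return 0;
}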

• Make a thread that in turn makes a new thread, and so on...
• void haveChildren():
• Update a global thread counter, and printf it.
• Sleep 500 ms.
• Call haveChildren() on a new thread.
• Display a window.
• If OK is hit on the window, then exit(0).
• Once haveChildren() is called, threads are created without bound, and windows keep appearing without bound.
• The threads show up in Task Manager (see the sketch below).
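The slides do not show the actual code, so the following Windows sketch of that demo is my own (using _beginthreadex and MessageBoxA; structure and names are assumptions):

// have_children.c -- runaway thread-creation demo (illustrative sketch, Windows)
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <process.h>

static volatile LONG g_threadCount = 0;

static unsigned __stdcall haveChildren(void *arg)
{
    (void)arg;
    LONG n = InterlockedIncrement(&g_threadCount);   /* update global counter */
    printf("Thread number %ld\n", n);

    Sleep(500);                                      /* sleep 500 ms */

    /* Spawn the next generation before blocking on the message box. */
    _beginthreadex(NULL, 0, haveChildren, NULL, 0, NULL);

    if (MessageBoxA(NULL, "Stop?", "haveChildren", MB_OKCANCEL) == IDOK)
        exit(0);                                     /* OK on any window kills the program */
    return 0;
}

int main(void)
{
    _beginthreadex(NULL, 0, haveChildren, NULL, 0, NULL);
    Sleep(INFINITE);   /* keep main alive; watch the thread count grow in Task Manager */
    return 0;
}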

• Instead of running hundreds of threads, let us run millions of threads!
• A GPU can have 1024 processors. Each processor can run thousands of threads at once.
• Adding more processors speeds up the program.

Thread

• On the host (not the GPU) we write a single thread to process an image.
• 1 pixel at a time.
• For a 1k x 1k image, this is 1M operations in sequence.

// My image data
short *image = new short[1024*1024];
int k;
for (k = 0; k < 1024*1024; k++) {
    image[k] = image[k] + 1;   // process one pixel
}

• Write code for a single pixel, and call that code in 1M separate threads.
• CUDA will dole out threads to cores for you on the GPU.
• Pixel X runs on thread X.

__global__ void subtractDarkImage_k(
    unsigned short *d_Dst,
    unsigned short *d_Src,
    int dataSize)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;   // global thread index = pixel index
    if (i >= dataSize)
        return;
    d_Dst[i] = d_Src[i] + 1;   // per-pixel operation (here it simply adds 1)
}
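The host-side call is not shown on the slides; the following sketch illustrates how such a kernel is typically launched for a 1k x 1k image (the processOnGpu wrapper name is mine):

// launch_sketch.cu -- host-side launch of subtractDarkImage_k (illustrative sketch)
#include <cuda_runtime.h>

// h_image: 1k x 1k, 16-bit image in host memory; processed in place.
void processOnGpu(unsigned short *h_image)
{
    const int dataSize = 1024 * 1024;
    const size_t bytes = dataSize * sizeof(unsigned short);

    unsigned short *d_Src = NULL, *d_Dst = NULL;
    cudaMalloc((void **)&d_Src, bytes);
    cudaMalloc((void **)&d_Dst, bytes);

    cudaMemcpy(d_Src, h_image, bytes, cudaMemcpyHostToDevice);

    // One thread per pixel: round the grid size up so every pixel is covered.
    const int threadsPerBlock = 256;
    const int blocks = (dataSize + threadsPerBlock - 1) / threadsPerBlock;
    subtractDarkImage_k<<<blocks, threadsPerBlock>>>(d_Dst, d_Src, dataSize);

    cudaMemcpy(h_image, d_Dst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_Src);
    cudaFree(d_Dst);
}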

• Plugin to Area Detector to run calculations on the GPU.
• When a new image comes from the detector:
• Host sends the image to the GPU.
• GPU does the calculations.
• Host retrieves the result from the GPU.
• Host sends the results to EPICS, etc.
• GPU code is compiled as a DLL.
• The EPICS Area Detector plugin loads the DLL and runs it.
• Allows arbitrary calculations on the GPU: just make a new DLL.
• Separates the cross-compile of GPU code from the EPICS build.
• One Area Detector plugin for all GPU calculations.
• EPICS variables can be defined in the DLL. The host queries the DLL for parameters and connects EPICS PVs.
• Debug by attaching to the IOC process.
• Set traps (breakpoints) in the DLL.
• Recompile the DLL.
• Restart the IOC to load the updated DLL. No IOC rebuild. (A minimal loading sketch follows.)
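The plugin/DLL interface itself is not shown on the slides; purely as a hypothetical sketch, loading such a DLL on Windows might look like this, with the DLL name and exported function (processImage) invented for illustration:

// load_gpu_dll.c -- hypothetical sketch of loading a GPU-calculation DLL (Windows)
#include <stdio.h>
#include <windows.h>

// Assumed export: the DLL processes one image in place and returns 0 on success.
typedef int (*ProcessImageFn)(unsigned short *image, int width, int height);

int main(void)
{
    HMODULE dll = LoadLibraryA("gpucalc.dll");       /* hypothetical DLL name */
    if (!dll) {
        printf("Could not load gpucalc.dll\n");
        return 1;
    }

    ProcessImageFn processImage =
        (ProcessImageFn)GetProcAddress(dll, "processImage");
    if (!processImage) {
        printf("processImage not exported by DLL\n");
        FreeLibrary(dll);
        return 1;
    }

    static unsigned short image[1024 * 1024];        /* 1k x 1k, 16-bit image */
    int status = processImage(image, 1024, 1024);    /* GPU work happens inside the DLL */
    printf("processImage returned %d\n", status);

    FreeLibrary(dll);
    return 0;
}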

• Sending an image to the GPU and back.
• Dark subtraction on host versus GPU.
• Fast convolution on GPU versus host.
• Running several programs on the GPU at once.

• Low-end GPU: NVIDIA Quadro NVS 290.
• Data transfer to/from GPU: 4 ms round trip for a 1 MB image.
• Dark subtraction, 1k x 1k image, 16-bit:
• 8 ms on host
• 30 ms on GPU
• Fast convolution, 1k x 1k image, 16-bit:
• 250 ms on host
• 50 ms on GPU
• Overhead spawning threads on the GPU?
• A very simple calculation is better on the host.
• A complex calculation is better on the GPU.
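The slides do not say how these timings were taken; as a hedged illustration, GPU kernel times like these are commonly measured with CUDA events:

// timing_sketch.cu -- measuring kernel time with CUDA events (illustrative sketch)
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void dummyKernel_k(void) { }   /* stand-in for the real calculation */

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    dummyKernel_k<<<1024, 256>>>();        /* the kernel being timed */
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);            /* wait until the kernel has finished */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}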