Multicore and GPU Programming

Multicore and GPU Programming
- Multiple CPUs led to multiple cores, which led to GPUs
- GPUs have hundreds to thousands of cores
- GPUs can be an order of magnitude faster for data-parallel workloads

Flynn's Taxonomy of Parallel Architectures
- SISD: Single Instruction, Single Data
- SIMD: Single Instruction, Multiple Data
- MISD: Multiple Instruction, Single Data
- MIMD: Multiple Instruction, Multiple Data

Graphics Coding
- NVIDIA: CUDA
- AMD: APU (Accelerated Processing Unit)
- Intel and others: OpenCL, an open cross-vendor standard
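As a taste of what GPU code looks like, here is a minimal CUDA sketch of a data-parallel kernel; the kernel name vecAdd and the arrays are illustrative, not from the slides:

```cuda
// Minimal CUDA kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against running past the array
        c[i] = a[i] + b[i];
}
```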

Cell BE Processor (Sony's PS3)
- Master-worker heterogeneous MIMD machine on a chip
- Master: PPE (Power Processing Element)
  - Runs 2 hardware threads
  - Runs the OS and manages the workers
- Workers: 8 SPEs (Synergistic Processing Elements)
  - 128-bit SIMD vector processors
  - 256 KB of local memory each, holding both data and code

Cell BE (continued)
- PPE and SPE instruction sets are incompatible
- Difficult to program: a trade-off of speed versus ease of programming
- Roughly 102.4 Gflops per chip
- IBM Roadrunner supercomputer, the world's fastest in 2008-2009:
  - 12,240 PowerXCell 8i processors plus 6,562 AMD Opteron processors
  - PowerXCell 8i is an enhanced version of the original Cell processor
- No longer built: too complex to program

Nvidia's Kepler
- The third GPU architecture Nvidia designed for compute applications
- Cores are arranged in groups called Streaming Multiprocessors (SMX)
- 192 cores per SMX, executing in SIMD fashion
- Each SMX can run its own program
- Chips in the Kepler family are distinguished by the number of SMX blocks
- GTX Titan has 15 SMXs, 14 of which are usable: 14 * 192 = 2688 cores
- The dual-GPU GTX Titan Z has 5760 cores
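A quick way to see this structure on a real device is a small CUDA sketch that queries the multiprocessor count; the runtime does not report cores per SM directly, so that part stays a comment:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query device 0 and print how many streaming multiprocessors it has.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Device: %s\n", prop.name);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    // Cores per SM depend on the architecture (192 for a Kepler SMX),
    // so total cores = multiProcessorCount * cores-per-SM.
    return 0;
}
```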

The Process
1. Send data to the GPU
2. Launch a kernel
3. Wait and collect the results
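A host-side sketch of these three steps, reusing the hypothetical vecAdd kernel from the earlier example (sizes and launch configuration are illustrative):

```cuda
#include <vector>
#include <cuda_runtime.h>

// Declaration of the hypothetical kernel from the earlier sketch.
__global__ void vecAdd(const float* a, const float* b, float* c, int n);

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));

    // 1. Send data to the GPU
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Launch a kernel (256 threads per block)
    vecAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);

    // 3. Wait for the GPU, then collect the results
    cudaDeviceSynchronize();
    cudaMemcpy(c.data(), dC, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```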

AMD APUs
- CPU and GPU on the same chip
- Share memory, which eliminates host-device memory transfers
- Implement AMD's Heterogeneous System Architecture (HSA)
- Two core types:
  - Latency Compute Unit (LCU): a generalized CPU
    - Supports the native CPU instruction set
    - Also supports the HSA Intermediate Language (HSAIL) instruction set
  - Throughput Compute Unit (TCU): a generalized GPU
    - Supports only HSAIL
    - Targets efficient parallel execution

Multicore to Many-Core: Tilera's Tile-Gx8072
- 2007
- Two-dimensional grid (mesh) interconnect
- Up to 72 cores
- Integrated with the CPU/OS

Intel's Xeon Phi (2012)
- Used in 2 of the top 10 supercomputers; China's Tianhe-2 is number one
- 61 x86 cores, each handling 4 hardware threads at the same time
- 512-bit-wide Vector Processing Unit (VPU): SIMD over 16 single-precision or 8 double-precision floating-point numbers per clock cycle
- Each core has 32 KB data and 32 KB instruction L1 caches and a 512 KB L2 cache
- Easy to use: standard tools such as OpenMP work (see the sketch below)
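A minimal OpenMP sketch of the kind of loop that maps directly onto the Phi's cores and threads; the array and the scaling operation are illustrative:

```cpp
#include <vector>
#include <omp.h>

// Scale a vector in parallel; OpenMP splits the loop across the available
// hardware threads (up to 61 cores x 4 threads on a Xeon Phi).
int main() {
    std::vector<double> x(1000000, 1.0);
    #pragma omp parallel for
    for (long i = 0; i < (long)x.size(); ++i)
        x[i] *= 2.0;
    return 0;
}
```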

PERFORMANCE
- Speedup = time_seq / time_parallel
- Measured as wall-clock time
- Affected by:
  - Programmer skill
  - Compiler and compiler switches
  - OS
  - File system (ext4, NTFS, ...)
  - System load

EFFICIENCY
- Efficiency = speedup / N = time_seq / (N * time_parallel)
- N is the number of CPUs or cores
- If speedup = N we have linear speedup: the ideal case
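A small C++ sketch of how these two numbers are obtained from wall-clock timings; run_sequential, run_parallel, and N are placeholders for the real workloads and worker count:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical workloads; replace with the real sequential and parallel code.
void run_sequential() { /* sequential version of the program */ }
void run_parallel()   { /* parallel version, using N workers  */ }

int main() {
    using clock = std::chrono::steady_clock;
    const int N = 8;                                   // workers used by run_parallel

    auto t0 = clock::now();
    run_sequential();
    double t_seq = std::chrono::duration<double>(clock::now() - t0).count();

    auto t1 = clock::now();
    run_parallel();
    double t_par = std::chrono::duration<double>(clock::now() - t1).count();

    double speedup    = t_seq / t_par;                 // speedup = time_seq / time_parallel
    double efficiency = speedup / N;                   // efficiency = speedup / N
    std::printf("speedup %.2f, efficiency %.2f\n", speedup, efficiency);
    return 0;
}
```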

Hyperthreading
- Runs 2 software threads per core by duplicating part of the CPU's state
- Makes the OS think there are twice as many processors
- Typically around a 30% speedup
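A one-liner to see the doubled count from the OS's point of view (plain C++; it reports logical processors, which include hyperthreads):

```cpp
#include <cstdio>
#include <thread>

// With hyperthreading enabled this is typically twice the physical core count.
int main() {
    std::printf("logical processors: %u\n", std::thread::hardware_concurrency());
    return 0;
}
```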

More Resources = More Speedup?
- Not necessarily so
- The sequential part of the program limits the gain
- Superlinear speedup can appear, but it is only meaningful if the parallel program finds the solution in exactly the same way as the sequential one

Scaling
- Scaling: adding more resources keeps yielding speedup
- If a program does not scale, it is probably a poor design
- Strong scaling efficiency(N) = time_seq / (N * time_parallel)
- Same as the general efficiency formula above

Weak Scaling
- weakScalingEfficiency(N) = t_seq / t'_par
- t'_par is the time to solve a problem N times bigger than the one the single machine solves in time t_seq
- GPUs pose a bigger measurement challenge:
  - A GPU is never used with only 1 core, so t_seq cannot come from the GPU itself
  - Using the CPU for t_seq is not a fair comparison
  - The GPU needs a host CPU anyway: does the host count as a resource?

Building a Parallel Program
- Coordination problems:
  - Access to shared resources (see the sketch below)
  - Load-balancing issues
  - Termination problems: halting all workers in a coordinated fashion
  - Etc.
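A small C++ sketch of the shared-resource problem: several threads increment one counter, and a mutex serializes access; the counter, thread count, and iteration count are illustrative:

```cpp
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Shared resource guarded by a mutex; without the lock the increments would race.
int main() {
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> workers;

    for (int w = 0; w < 4; ++w)
        workers.emplace_back([&] {
            for (int i = 0; i < 100000; ++i) {
                std::lock_guard<std::mutex> lock(m);  // coordinated access
                ++counter;
            }
        });

    for (auto& t : workers) t.join();
    std::printf("counter = %ld\n", counter);          // always 400000 with the lock
    return 0;
}
```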

How to Build a Parallel Program
- First build a sequential version of the desired parallel program; it:
  - Establishes a baseline for efficiency
  - Shows correctness
  - Shows the most time-consuming parts of the problem (use a profiler)
  - Shows how much performance gain can be expected

Guidelines
- Time the duration of the whole execution, not just the parallel part
- Average over several runs and exclude outliers (see the timing sketch below)
- Scalability matters, so run on different data sizes and worker counts
- The number of threads should not exceed the number of processors or cores
- Hyperthreading should be disabled
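A short C++ sketch of the first two guidelines: time the whole run several times, drop the worst run as a crude outlier filter, and average the rest; run_whole_program and the run count are placeholders:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical full program run; replace with the real workload.
void run_whole_program() { /* ... */ }

int main() {
    using clock = std::chrono::steady_clock;
    const int runs = 5;
    std::vector<double> times;

    for (int r = 0; r < runs; ++r) {
        auto t0 = clock::now();
        run_whole_program();        // whole execution, not just the parallel part
        times.push_back(std::chrono::duration<double>(clock::now() - t0).count());
    }

    std::sort(times.begin(), times.end());
    times.pop_back();               // drop the slowest run as an outlier
    double avg = 0;
    for (double t : times) avg += t;
    avg /= times.size();
    std::printf("average time: %.3f s\n", avg);
    return 0;
}
```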

Amdahl's Law
- The metaphor: a bunch of ants versus a herd of elephants (many small cores versus a few powerful ones)
- The sequential fraction of a program limits the speedup obtainable from adding processors
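The standard form of the law (not spelled out on the slide), where f is the fraction of the program that must run sequentially and N is the number of processors:

```latex
% Amdahl's law: speedup with N processors when a fraction f is sequential
S(N) = \frac{1}{f + \dfrac{1 - f}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{f}
```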

Gustafson-Barsis's Rebuttal
- A parallel program does more than just speed up a sequential program: it can handle bigger problem instances
- Rather than evaluating the parallel program relative to the sequential one, evaluate the sequential program relative to the parallel one (scale the problem with the machine)
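The corresponding scaled-speedup formula (again not on the slide), where f is the sequential fraction measured on the parallel run with N processors:

```latex
% Gustafson-Barsis: scaled speedup for a problem grown to fill N processors
S(N) = N - f\,(N - 1) = f + N\,(1 - f)
```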