General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Benchmarks
George C. Caragea, Fuat Keceli, Alexandros Tzannes, Uzi Vishkin

Presentation transcript:

Benchmarks

For each benchmark, a CUDA and an XMTC implementation were compared; the comparison covers the CUDA source, lines of code, dataset, number of parallel sections, and threads per section of both versions.

- Bfs: Breadth-First Search. CUDA source: Harish and Narayanan; Rodinia. Dataset: graph with 6M edges.
- Bprop: Back Propagation. CUDA source: Rodinia.
- Conv: Image Convolution. CUDA source: NVIDIA CUDA SDK.
- Msort: Merge-Sort. CUDA source: Thrust library.
- NW: Needleman-Wunsch. CUDA source: Rodinia. Dataset: 2048x2048 sequences.
- Reduct: Parallel Reduction. CUDA source: NVIDIA CUDA SDK.
- Spmv: Sparse matrix-vector multiply. CUDA source: Bell and Garland. Dataset: 36K x 36K matrix, 4M non-zeros.

Performance Comparison

With the 1024-TCU XMT configuration:
- 6.05x average speedup on the irregular applications
- 2.07x average slowdown on the regular applications

With the 512-TCU XMT configuration:
- 4.57x average speedup on the irregular applications
- 3.06x average slowdown on the regular applications

Case study: BFS on a low-parallelism dataset
- 73.4x speedup over the Rodinia implementation
- 6.89x speedup over the UIUC implementation
- 110.6x speedup when using only 64 TCUs (lower latencies for the smaller design)

Experimental Platform

XMTSim, the cycle-accurate XMT simulator:
- Timing modeled after the 64-TCU FPGA prototype
- Highly configurable, able to simulate any configuration
- Modular design enables architectural exploration
- Part of the XMT software release

SPAA'09: 10x over Intel Core Duo with the same silicon area.

Current work:
- XMT outperforms the GPU on all irregular workloads
- XMT does not fall significantly behind on regular workloads
- There is no need to pay a high performance penalty for ease of programming
- XMT is a promising candidate for the pervasive platform of the future: a highly parallel general-purpose CPU coupled with a parallel GPU

Future work: power/energy comparison of XMT and GPU.

TESLA vs. XMT

Memory Latency Hiding and Reduction
- TESLA: heavy multithreading (requires large register files and a state-aware scheduler); limited local shared scratchpad memory; no coherent private caches at the SM or SP level.
- XMT: large globally shared cache; no coherent private TCU or cluster caches; software prefetching.

Memory and Cache Bandwidth
- TESLA: memory access patterns must be coordinated by the user for efficiency (request coalescing); scratchpad memories are prone to bank conflicts.
- XMT: relaxed need for user-coordinated DRAM access thanks to the caches; address hashing avoids memory-module hotspots; high-bandwidth mesh-of-trees interconnect between clusters and caches.

Functional Unit (FU) Allocation
- TESLA: dedicated FUs for SPs and SFUs; less arbitration logic required; higher theoretical peak performance.
- XMT: heavy FUs (FPU and MDU) are shared through arbitrators; lightweight FUs (ALU, branch) are allocated per TCU; the ALUs do not include multiply-divide functionality.

Control Flow and Synchronization
- TESLA: a single instruction cache and issue unit per SM; warps execute in lock-step, which penalizes diverging branches; efficient local synchronization and communication within blocks, but global communication is expensive; switching between serial and parallel modes (i.e., passing control between the CPU and the GPU) requires off-chip communication.
- XMT: one instruction cache and program counter per TCU enables independent progress of threads; thread coordination is performed via a constant-time prefix-sum; other communication goes through the shared cache; dynamic hardware support for fast switching between serial and parallel modes and for load balancing of virtual threads.
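To make the lock-step point concrete, here is a minimal CUDA sketch, written for illustration only and not taken from the benchmark suite; it assumes a CSR graph representation, and the kernel name and the row_start/adj array names are hypothetical. Each thread handles one vertex of a BFS frontier, so both the branch and the loop trip count are data-dependent.

    // Hypothetical level-synchronous BFS step, one thread per vertex.
    // dist[v] == level selects frontier vertices; the inner loop length
    // depends on the vertex degree, so threads in the same warp do
    // different amounts of work.
    __global__ void relax_frontier(const int *row_start, const int *adj,
                                   int *dist, int level, int n)
    {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n) return;

        if (dist[v] == level) {                       // data-dependent branch
            for (int i = row_start[v]; i < row_start[v + 1]; ++i) {
                int u = adj[i];
                if (dist[u] > level + 1)
                    dist[u] = level + 1;              // benign race: all writers store the same value
            }
        }
    }

On Tesla, the 32 threads of a warp share one instruction stream, so the warp serializes the two sides of the frontier test and waits for its longest adjacency list; on XMT, each TCU has its own program counter, so each virtual thread retires as soon as its own vertex is done.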
Tested Configurations: GTX280 vs. XMT-1024

                                    GTX280              XMT-1024
  Principal computational resources
    Cores                           240 SP, 60 SFU      1024 TCU
    Integer units                   240 ALU+MDU         1024 ALU, 64 MDU
    Floating-point units            240 FPU, 60 SFU     64 FPU
  On-chip memory
    Registers                       1920 KB             128 KB
    Prefetch buffers                --                  32 KB
    Regular caches                  480 KB              4104 KB
    Constant cache                  240 KB              128 KB
    Texture cache                   480 KB              --

- Configurations with equivalent area constraints are needed (576 mm² in 65nm); the number of functional units and the amount of memory cannot simply be set to the same values.
- The area estimate for the envisioned XMT chip is based on the 64-TCU XMT ASIC prototype (designed in 90nm IBM technology).
- In each category, the more area-intensive side is emphasized.

Paraleap: XMT PRAM-on-chip silicon

FPGA prototype, announced at SPAA'07; built from 3 FPGA chips (2 Virtex-4 LX200, 1 Virtex-4 FX100).

    Clock rate             75 MHz
    DRAM size              1 GB
    DRAM channels          1
    Memory data rate       0.6 GB/s
    No. of cores (TCUs)    64
    Clusters               8
    Cache modules          8
    Shared cache           256 KB

XMT: Motivation and Background

- Many-cores are coming, but 40 years of parallel computing have never produced a successful general-purpose parallel computer: easy to program, good speedups, scalable up and down.
- IF you could program it, great speedups follow. XMT: fix the IF.
- XMT was designed from the ground up to address this for on-chip parallelism; hardware and software prototypes have been tested.
- It builds on PRAM algorithmics, the only really successful parallel algorithmic theory; a latent, though not widespread, knowledge base.
- Ease of programming is a necessary condition for the success of a general-purpose platform, already present in von Neumann's 1947 specs.

XMT: An Easy-to-Program Many-Core

Indications that XMT is easy to program:
1. XMT is based on a rich algorithmic theory (PRAM).
2. Ease of teaching as a benchmark:
   a. Parallel programming was successfully taught from middle school and high school upward.
   b. The teaching was evaluated by education experts (SIGCSE 2010).
   c. XMT came out superior to MPI, OpenMP and CUDA.
3. There is a programmer's workflow for deriving efficient programs from PRAM algorithms.
4. DARPA HPCS productivity study: XMT development time was half that of MPI.

XMT Programming Model

- At each step, provide all instructions that can execute concurrently (those not dependent on each other).
- PRAM/XMT abstraction: all such instructions execute immediately ("uniform cost").
- PRAM-like programming using reduced synchrony.
- The main construct is the spawn-join block, which can start any number of virtual threads at once.
- Virtual threads advance at their own speed, not in lock-step.
- Prefix-sum (ps) is similar to an atomic fetch-and-add.

XMTC Programming Language

C with simple SPMD extensions:
- spawn: start any number of virtual threads
- $: unique thread ID
- ps/psm: atomic prefix-sum, with an efficient hardware implementation

XMTC Example: Array Compaction

The non-zero elements of A are copied into B; their order is not necessarily preserved. After ps(inc, base) executes atomically, base = base + inc and inc holds the original value of base, so each element is copied into a unique location in B.

    int A[N], B[N];
    int base = 0;
    spawn(0, N-1) {
        int inc = 1;
        if (A[$] != 0) {
            ps(inc, base);
            B[inc] = A[$];
        }
    }
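For comparison with the XMTC array-compaction code above, here is a minimal CUDA sketch of the same pattern; it is illustrative only (the kernel name compact is hypothetical), with atomicAdd playing the role that the ps instruction plays on XMT. As in the XMTC version, the order of the compacted elements is not preserved.

    // Hypothetical CUDA counterpart of the XMTC array-compaction example.
    // atomicAdd returns the old value of *base, just as ps leaves the old
    // value of base in inc.
    __global__ void compact(const int *A, int *B, int *base, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && A[i] != 0) {
            int inc = atomicAdd(base, 1);
            B[inc] = A[i];
        }
    }

The two codes are functionally equivalent; the architectural difference the comparison above emphasizes is that ps is a constant-time hardware primitive intended for exactly this kind of inter-thread coordination.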