1
GP2: General Purpose Computation using Graphics Processors
Dinesh Manocha & Avneesh Sud. Lecture 2: January 17, 2007. Spring 2007, Department of Computer Science, UNC Chapel Hill
2
Class Schedule Current Time Slot: 2:00 – 3:15pm, Mon/Wed, SN011
Office hours: TBD Class mailing list: (should be up and running)
3
GPGP
The GPU on commodity video cards has evolved into an extremely flexible and powerful processor:
- Programmability
- Precision
- Power
This course will address how to harness that power for general-purpose (non-rasterization) computation:
- Algorithmic issues
- Programming and systems
- Applications
4
Capabilities of Current GPUs
Modern GPUs are deeply programmable:
- Programmable pixel, vertex, and video engines
- Solidifying high-level language support
Modern GPUs support 32-bit floating point precision:
- Great development in the last few years
- 64-bit arithmetic may be coming soon
- Almost IEEE FP compliant
5
The Potential of GPGP
The power and flexibility of GPUs make them an attractive platform for general-purpose computation.
- Example applications range from in-game physics simulation and geometric applications to conventional computational science
- Goal: make the inexpensive power of the GPU available to developers as a sort of computational coprocessor
Check out
6
GPGP: Challenges GPUs designed for and driven by video games
- Programming model is unusual & tied to computer graphics
- Programming environment is tightly constrained
- Underlying architectures are:
  - Inherently parallel
  - Rapidly evolving (even in basic feature set!)
  - Largely secret
- No clear standards (besides DirectX, imposed by MSFT)
- Can’t simply “port” code written for the CPU!
Is there a formal class of problems that can be solved using current GPUs?
7
Importance of Data Parallelism
GPUs are designed for the graphics and gaming industry:
- Highly parallel tasks
- GPUs process independent vertices & fragments
  - Temporary registers are zeroed
  - No shared or static data
  - No read-modify-write buffers
- Data-parallel processing
  - GPU architecture is ALU-heavy: multiple vertex & pixel pipelines, multiple ALUs per pipe
  - Hide memory latency (with more computation)
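The independence property described above is what makes fragment processing data-parallel. A minimal sketch in Python (the kernel function and data are invented for illustration, not part of any real graphics API):

```python
# A sketch of the data-parallel model the slide describes: every
# "fragment" is processed by the same kernel, independently, with no
# shared state and no read-modify-write of the output.

def shade(fragment):
    """Per-fragment kernel: a pure function of its own input only."""
    r, g, b = fragment
    # Brighten, clamping each channel at 255 (saturating arithmetic).
    return (min(r * 2, 255), min(g * 2, 255), min(b * 2, 255))

fragments = [(10, 20, 30), (100, 150, 200), (255, 0, 128)]

# Because each invocation is independent, a runtime may evaluate them
# in any order or fully in parallel -- exactly what lets a GPU keep
# many ALUs busy and hide memory latency behind computation.
output = [shade(f) for f in fragments]
```

The absence of shared state is not incidental: it is the guarantee that lets the hardware schedule thousands of fragments concurrently without synchronization.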
8
Goals of this Course
A detailed introduction to general-purpose computing on graphics hardware. Emphasis includes:
- Core computational building blocks
- Strategies and tools for programming GPUs
- Covering many applications and exploring new applications
- Highlighting major research issues
9
Course Organization Survey lectures
Instructors, other faculty, senior graduate students Breadth and depth coverage Student presentations
10
Course Contents Overview of GPUs: architecture and features
Models of computation for GPU-based algorithms
System issues: cache and data management; languages and compilers
Numerical and scientific computations: linear algebra, optimization, FFT, rigid body simulation, fluid dynamics
Geometric computations: proximity computations; distance fields; motion planning and navigation
Database computations: database queries (predicates, booleans, aggregates); streaming databases and data mining; sorting & searching
GPU clusters: parallel computing environments for GPUs
Rendering: ray tracing, photon mapping, shadows
11
Student Load Stay awake in classes! One class lecture
Read a lot of papers 1-2 small assignments
12
Student Load Stay awake in classes! One class lecture
Read a lot of papers 1-2 small assignments A MAJOR COURSE PROJECT WITH RESEARCH COMPONENT
13
Course Projects Work by yourself or part of a small team
- Develop new algorithms for simulation, geometric problems, database computations
- Formal model for GPU algorithms or GPU hacking
- Issues in developing GPU clusters for scientific computation
- Look into new architecture and parallel programming trends
14
Course Projects: Importance
If you are planning to take this course for credit, start thinking about the course project ASAP.
It is important that your project has some novelty to it:
- Shouldn’t be just a GPU hack
- You need to work on a problem or application for which GPUs are a good candidate (GPUs are not a good solution for many problems)
It is OK to work in groups of 2 or 3 (for a large project).
Periodic milestones to monitor progress:
- Project proposals due by February 10
- Monthly progress reports (will count towards the final grade)
15
Course Projects: Possible Topics
We are also interested in comparing GPU capabilities with other emerging architectures (e.g. Cell, multi-core, other data-parallel processors).
Numerical computations, some of the prime candidates for GPU acceleration:
- Sparse matrix computations
- Numerical linear algebra (SVD, QR computations)
- Applications (like WWW search)
Power efficiency of GPU algorithms
Programming environments of GPUs (talk to Jan Prins)
GPU clusters and high-performance computing using GPUs
Scientific computations (possible collaboration with RENCI)
Data mining algorithms (talk to Wei Wang or Jan Prins)
Physically-based simulation, e.g. fluid simulation (talk to Ming Lin)
Others …
16
Course Topics & Lectures
Focus on breadth; quite a few guest and student lectures:
- Overview of OpenGL and GPU Programming (Wendt, Jan. 22)
- Cell processor (Stephen Olivier, Jan. 24)
- NVIDIA G80 Architecture (Steve Molnar, Jan. 29)
- CUDA Programming Environment (Lars Nyland, Jan. 31)
- Lectures on CTM (ATI)
17
Heterogeneous Computing Systems & GPUs
18
What are Heterogeneous Computing Systems?
Develop computer systems and applications that are scalable from a system with a single homogeneous processor to a high-end computing platform with tens, or even hundreds, of thousands of heterogeneous processors
19
What are Heterogeneous Computing Systems?
Heterogeneous computing systems are those with a range of diverse computing resources that can be local to one another or geographically distributed. The pervasive use of networks and the Internet by all segments of modern society means that the number of connected computing resources is growing tremendously. (From the “International Workshop on Heterogeneous Computing”, early 1990s)
20
Computing using Accelerators
GPU is one type of accelerator (commodity and easily available). Other accelerators: Cell processor, ClearSpeed
21
Organization Use of Accelerators
Current architectures Use of Accelerators Programming environments for accelerators
23
Current Architectures
Multi-core architectures Processors lowering communication costs Heterogeneous processors
25
Multi-Core Processor What is a Multicore processor?
26
Multi-Core Architectures http://gamma.cs.unc.edu/EDGE/SLIDES/agarwal.pdf
What is a multicore processor? Three properties (Agarwal’06):
- Single chip
- Multiple distinct processing engines
- Multiple, independent threads of control (or program counters: MIMD)
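The third property (MIMD) can be illustrated in a few lines: unlike SIMD, each thread of control runs its own code independently. This is a hypothetical sketch; the worker functions are invented for illustration:

```python
import threading

# MIMD sketch: multiple independent threads of control, each with its
# own program counter, executing *different* code concurrently --
# unlike SIMD, where all lanes execute the same instruction.

results = {}

def producer():
    """One thread of control: builds a list."""
    results["produced"] = list(range(5))

def reducer():
    """A different thread of control: computes a sum."""
    results["reduced"] = sum(range(5))

threads = [threading.Thread(target=producer),
           threading.Thread(target=reducer)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both independent control flows to finish
```

On a multicore chip in Agarwal's sense, each such thread can be scheduled onto a distinct processing engine.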
27
Multi-Core: Motivation
28
Multi-Core: Growth Rate
29
Sun’s Niagara Chip: Chip Multi-Threaded Processor (http://gamma.cs.unc.edu/EDGE/SLIDES/agarwal.pdf)
30
Current Architectures
Multi-core architectures Processors lowering communication costs Heterogeneous processors
31
Efficient Processors Reduce communication costs [Dally’03]
- PCA architectures: GPUs
- Streaming processors
- Other data-parallel processors (PPUs, ClearSpeed)
- FPGAs
32
Current Architectures
Multi-core architectures
Processors lowering communication costs
Heterogeneous processors: combining different types of processors in one chip
33
Heterogeneous Processors
Cell BE Processor AMD Fusion Architecture
34
Cell BE Processor Overview
IBM, SCEI/Sony, Toshiba alliance formed in 2000; design center opened in March 2001, based in Austin, Texas; ~$400M investment.
February 7, 2005: first technical disclosures.
Designed for Sony PlayStation 3 (commodity processor).
Cell is an extension to the IBM Power family of processors.
Sets new performance standards for computation & bandwidth.
High affinity to HPC workloads: seismic processing, FFT, BLAS, etc.
35
Cell BE Processor Features
Heterogeneous multi-core system architecture:
- Power Processor Element (PPE) for control tasks
- Synergistic Processor Elements (SPEs) for data-intensive processing
Each SPE consists of:
- Synergistic Processor Unit (SPU)
- Synergistic Memory Flow Control (SMF): data movement and synchronization; interface to the high-performance Element Interconnect Bus (EIB)
[Block diagram: eight SPEs (SPU + LS + SMF) and the PPE (PPU with L1/L2) attached to the EIB at 16B/cycle each; the EIB sustains up to 96B/cycle; MIC to dual XDR memory and BIC to FlexIO; 64-bit Power Architecture with VMX]
36
Cell BE Architecture
Combines multiple high-performance processors in one chip: 9 cores, 10 threads.
- A 64-bit Power Architecture™ core (PPE)
- 8 Synergistic Processor Elements (SPEs) for data-intensive processing
Current implementation: roughly 10 times the performance of a Pentium for computationally intensive tasks. Clock: 3.2 GHz (measured at >4 GHz in the lab).

                      Cell          Pentium D
Peak I/O BW           75 GB/s       ~6.4 GB/s
Peak SP Performance   ~230 GFLOPS   ~30 GFLOPS
Area                  221 mm²       206 mm²
Total Transistors     234M          ~230M
37
Peak GFLOPs (Cell SPEs only)
[Bar chart comparing peak GFLOPS: FreeScale DC 1.5 GHz, PPC 970 2.2 GHz, AMD DC 2.2 GHz, Intel SC 3.6 GHz, Cell 3.0 GHz]
38
Cell BE Processor Can Support Many Systems
Game console systems, blades, HDTV, home media servers, HPC, …
[Diagrams: a single Cell BE processor with IOIF; two Cell BE processors connected via BIF; a four-processor configuration through a BIF switch; each processor with dual XDR memory]
39
Heterogeneous Processors
Cell BE Processor AMD Fusion Architecture
40
AMD’s Fusion Architecture
44
Organization Use of Accelerators
Current architectures Use of Accelerators Programming environments for accelerators
45
Organization Use of Accelerators
Current architectures Use of Accelerators Single workstation (real-world) applications High performance computing Programming environments for accelerators
46
NON-Graphics Pipeline Abstraction (GPGPU)
[Pipeline diagram, courtesy David Kirk, NVIDIA: data → programmable MIMD processing (fp32) → SIMD “rasterization” (setup, lists, rasterizer) → programmable SIMD processing (fp32) → data fetch, fp16 blending → predicated write, fp16 blend, multiple output data → memory]
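The abstraction in the diagram is essentially stream programming: data flows through a chain of kernels, each applied uniformly to every element. A toy sketch, with invented stage functions standing in for the programmable processors:

```python
# Stream-programming sketch of the GPGPU pipeline abstraction: each
# stage is a kernel applied to every element of its input stream.

def run_pipeline(stream, kernels):
    """Apply each kernel stage to every element of the stream, in order."""
    for kernel in kernels:
        stream = [kernel(x) for x in stream]
    return stream

# Two toy "programmable stages" standing in for the MIMD (vertex) and
# SIMD (fragment) processors in the diagram. Purely illustrative.
transform = lambda v: v * 2.0   # per-vertex work
shade     = lambda v: v + 0.5   # per-fragment work

out = run_pipeline([1.0, 2.0, 3.0], [transform, shade])
# out == [2.5, 4.5, 6.5]
```

Viewing the GPU this way, rather than as a rasterizer, is what makes non-graphics workloads expressible on it.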
47
Sorting and Searching “I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!” -Don Knuth
48
Massive Databases Terabyte-data sets are common
Google sorts more than 100 billion terms in its index
> 1 trillion records in the web index (unconfirmed sources)
Database sizes are rapidly increasing: maximum DB sizes increase 3x per year
Processor improvements are not matching the information explosion
49
General Sorting on GPUs
Design sorting algorithms with deterministic memory accesses: “texturing” on GPUs
- 86 GB/s peak memory bandwidth (NVIDIA 8800)
- Can better hide the memory latency!
Require only minimum and maximum computations: “blending functionality” on GPUs
- Low branching overhead
- No data dependencies
Utilize the high parallelism on GPUs
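Sorting networks have exactly the shape this slide calls for: a fixed, data-independent sequence of compare-exchange steps built only from min/max. A simplified illustration (not the paper's actual algorithm) using odd-even transposition sort:

```python
# Branch-free sorting with deterministic memory accesses, the style of
# computation that maps to GPU blending. Simplified sketch only.

def compare_exchange(a, i, j):
    """Compare-and-swap using only min/max, as blending hardware would."""
    lo, hi = min(a[i], a[j]), max(a[i], a[j])
    a[i], a[j] = lo, hi

def odd_even_transposition_sort(a):
    """A simple sorting network: n rounds over fixed, data-independent
    index pairs. All pairs within a round are independent, so a GPU
    could evaluate each round fully in parallel."""
    n = len(a)
    for r in range(n):
        start = r % 2                      # alternate even/odd pairs
        for i in range(start, n - 1, 2):
            compare_exchange(a, i, i + 1)
    return a

print(odd_even_transposition_sort([5, 1, 4, 2, 3]))  # [1, 2, 3, 4, 5]
```

Because the comparison schedule never depends on the data, every pass touches memory in the same predictable pattern, which is what lets texturing hardware stream it efficiently.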
50
Sorting on GPU: Pipelining and Parallelism
Input Vertices Texturing, Caching and 2D Quad Comparisons Sequential Writes
51
Comparison with prior GPU-Based Algorithms
3-6x faster than prior GPU-based algorithms!
52
Sorting: GPU vs. Multi-Core CPUs
2-2.5x faster than Intel high-end processors Single GPU performance comparable to high-end dual core Athlon Hand-optimized CPU code from Intel Corporation!
53
External Memory Sorting
N. Govindaraju, J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006
External memory sorting is performed on terabyte-scale databases with limited main memory, using a two-phase algorithm:
- First phase: partition the input file into large data chunks and write sorted chunks known as “runs”
- Second phase: merge the runs to generate the sorted file
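The two phases can be sketched with plain Python lists standing in for disk files (chunk size and data are made up for illustration):

```python
import heapq

# Two-phase external merge sort, in miniature. In the real setting,
# `records` is a file too large for memory and each run is written to
# disk; here lists stand in for both.

def phase1_make_runs(records, run_size):
    """Phase 1: sort memory-sized chunks into runs."""
    return [sorted(records[i:i + run_size])
            for i in range(0, len(records), run_size)]

def phase2_merge(runs):
    """Phase 2: k-way merge of the sorted runs into one sorted output."""
    return list(heapq.merge(*runs))

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
runs = phase1_make_runs(data, run_size=4)
merged = phase2_merge(runs)   # == sorted(data)
```

In GPUTeraSort, phase 1 is where the GPU helps: each run is sorted on the GPU while the CPU handles disk I/O in parallel.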
54
External memory sorting using GPUs
External memory sorting on CPUs can have low performance due to:
- High memory latency
- Low I/O performance
Our GPU-based algorithm:
- Sorts large data arrays on GPUs
- Performs I/O operations in parallel on CPUs
55
GPUTeraSort Govindaraju et al., SIGMOD 2006
56
Overall Performance Faster and more scalable than Dual Xeon processors (3.6 GHz)!
57
Performance/$ 1.8x faster than current Terabyte sorter
World’s best performance/$ system
58
GPUTeraSort: PennySort Winner 2006
“These results paint a clear picture for progress on processor speeds. When you measure records-sorted-per-cpu-second, the speed plateaued in 1995 at about 200k records/second/cpu. This year saw a breakthrough with GpuTeraSort which uses the GPU interface to drive the memory more efficiently (and uses the 10x more memory bandwidth inside the GPU). GpuTeraSort gave a 3x records/second/cpu improvement. There is a lot of effort on multi-core processors, and comparatively little effort on addressing the “core” problems: (1) the memory architecture, and (2) the way processors access memory. Sort demonstrates those problems very clearly.” Jim Gray (Microsoft) [NY Times, November 2006]
59
Download URL: http://gamma.cs.unc.edu/GPUFFTW
N. Govindaraju, S. Larsen, J. Gray and D. Manocha, SuperComputing 2006
GPUFFTW (1D & 2D FFT): 4x faster than IMKL on high-end quad cores
SlashDot headlines, May 2006
60
Digital Breast Tomosynthesis (DBT)
Pioneering DBT work at Massachusetts General Hospital.
100X reconstruction speed-up with an NVIDIA Quadro FX 4500 GPU: from hours to minutes, facilitating clinical use.
Improved diagnostic value: clearer images, fewer obstructions, earlier detection.
Advanced Imaging Solution of the Year.
[Figure: X-ray tube rotating about an axis above a compression paddle, compressed breast, and digital detector; 11 low-dose X-ray projections feed an extremely computationally intense reconstruction]
“Mercury reduced reconstruction time from 5 hours to 5 minutes, making DBT clinically viable. …among 70 women diagnosed with breast cancer, DBT pinpointed 7 cases not seen with mammography” © 2006 Mercury Computer Systems, Inc.
61
Electromagnetic Simulation
3D finite-difference and finite-element modeling of:
- Cell phone irradiation
- MRI design / modeling
- Printed circuit boards
- Radar cross section (military)
Computationally intensive! Large speedups with Quadro GPUs.
[Chart: pacemaker-with-transmit-antenna benchmark; speedups of 5X, 10X, and 18X with 1, 2, and 4 Quadro FX 4500 GPUs over commercial, optimized, mature software on a single 3.x GHz CPU]
62
Havok FX Physics on NVIDIA GPUs
Physics-based effects on a massive scale 10,000s of objects at high frame rates Rigid bodies Particles Fluids Cloth and more
63
Dedicated Performance For Physics
Performance measurement: 15,000-boulder scene
- CPU physics (Dual Core P4EE, GeForce 7900 GTX SLI, CPU multi-threading enabled): 6.2 fps
- GPU physics (Dual Core P4EE, GeForce 7900 GTX SLI, CPU multi-threading enabled): 64.5 fps
64
GPUs: High Memory Throughput
Peak performance: 50 GB/s on a single GPU (NVIDIA 7900)
Effectively hide memory latency with 15 GOP/s
65
Microsoft Vista & GPUs Windows Vista is the first Windows operating system that directly utilizes the power of a dedicated GPU. High-end GPUs are essential for accelerating the Windows Vista experience by offering an enriched 3D user interface, increased productivity, vibrant photos, smooth, high-definition videos, and realistic games.
66
GPUs as Accelerators GPUs are primarily designed for rasterization
GPUs are programmed using graphics APIs Specialized algorithms for different applications to demonstrate higher performance
67
GPUs as Accelerators GPUs are primarily designed for rasterization
GPUs are programmed using graphics APIs
Specialized algorithms for different applications to demonstrate higher performance
In spite of these limitations, good speedups were demonstrated
68
GPUs as Accelerators GPUs are primarily designed for rasterization GPUs are programmed using graphics APIs Specialized algorithms for different applications to demonstrate higher performance In spite of these limitations, good speedups were demonstrated What if we have the right API and programming environment for GPUs?
69
Accelerators for HPC
A recent trend is to use accelerators to achieve TFLOP performance:
- RoadRunner (LANL): plans to use 16,000 Cell processors (expected PetaFLOP performance)
- Tsubame cluster (Tokyo): 360 ClearSpeed accelerators (47 TFLOP performance)
70
Organization Use of Accelerators
Current architectures Use of Accelerators Programming environments for accelerators
71
Thread parallelism is upon us (Smith’06)
Uniprocessor performance is leveling off Instruction-level parallelism is nearing its limit Power per chip is painfully high for client systems
72
Thread parallelism is upon us (Smith’06)
Uniprocessor performance is leveling off Instruction-level parallelism is nearing its limit Power per chip is painfully high for client systems Meanwhile, logic cost ($ per gate-Hz) continues to fall What are we going to do with all that hardware?
73
Thread parallelism is upon us (Smith’06)
Uniprocessor performance is leveling off Instruction-level parallelism is nearing its limit Power per chip is painfully high for client systems Meanwhile, logic cost ($ per gate-Hz) continues to fall What are we going to do with all that hardware? Newer microprocessors are multi-core, and/or multithreaded So far, it’s just “more of the same” architecturally Now we also have heterogeneous processors
74
Thread parallelism
We expect new “killer apps” will need more performance:
- Semantic analysis and query
- Improved human-computer interfaces (e.g. speech, vision)
- Games
Which and how much thread parallelism can we exploit? This is a good question for both hardware and software.
75
Programming the Accelerators
Data parallel processors Improved APIs and interfaces
76
Possible Approaches
Extend existing high-level languages with new data-parallel array types
- Ease of programming
- Implement as a library so programmers can use it now; eventually fold into base languages
Build implementations with compelling performance
- Target GPUs and multi-core CPUs
Create examples and applications
- Educate programmers, provide sample code
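The "data-parallel array type as a library" idea can be sketched in a few lines. The class name and API below are invented for illustration; a real implementation would dispatch these elementwise operators to a GPU or multi-core backend:

```python
# Hypothetical sketch of a data-parallel array library: operators are
# elementwise, so each output element depends only on the corresponding
# inputs and a backend is free to parallelize every operation.

class ParArray:
    def __init__(self, data):
        self.data = list(data)

    def _zip(self, other, op):
        # Elementwise combine; in a real library this is the hook where
        # the work would be shipped to a GPU or split across cores.
        return ParArray(op(a, b) for a, b in zip(self.data, other.data))

    def __add__(self, other):
        return self._zip(other, lambda a, b: a + b)

    def __mul__(self, other):
        return self._zip(other, lambda a, b: a * b)

x = ParArray([1, 2, 3])
y = ParArray([4, 5, 6])
z = x * y + x   # evaluates elementwise: [5, 12, 21]
```

Shipping this as a library first, as the slide suggests, lets programmers adopt the model immediately while leaving room to fold the array type into the base language later.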
77
Challenges in using GPUs
Need a non-graphics interface:
- More flexibility
- Less execution overhead
Need native GPU support: replace library with language built-ins
Need to learn from users
Retarget for multi-core
78
Research Issues Languages for mainstream parallel computing
Compilation techniques for parallel programs Debugging and performance tuning of parallel programs Operating systems for parallel computing at all scales Computer architecture for mainstream parallel computing