ATI Stream Computing: OpenCL™ Histogram Optimization Illustration
Marc Romankewicz, April 5, 2010


The problem

for( many input values ) {
    histogram[ value ]++;
}

Many scattered read-modify-write (r-m-w) accesses into a small data structure.
On CPU, scattered r-m-w goes to cache by default → fast.
On GPU, it goes to __global by default → worst case.
Solution: use __local memory and parallelize the histogram computation.
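As a baseline for the GPU versions that follow, the loop above can be written out as a serial C function. This is only a minimal sketch, assuming 8-bit input values and 256 bins; the name `histogram_serial` is hypothetical and does not appear in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NBINS 256  /* assumption: one bin per possible 8-bit value */

/* Serial reference: one scattered read-modify-write per input value. */
static void histogram_serial(const uint8_t *values, size_t n,
                             uint32_t hist[NBINS])
{
    assert(values != NULL || n == 0);
    memset(hist, 0, NBINS * sizeof hist[0]);
    for (size_t i = 0; i < n; i++)
        hist[values[i]]++;
}
```

A result computed this way can serve as a reference when checking the output of the two GPU kernels.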

SIMD GPU Algorithm

1. Thread fetches input data from __global to __private (registers).
2. Scatter into __local sub-histograms in group (multiple LDS banks per bin).
3. Reduce __local bins into a single histogram per group, flush to __global.
4. Reduce __global histograms (2nd kernel for global sync point).

[Diagram: input buffer in __global; SIMDs with __local histograms; flush to __global.]

SIMD GPU Algorithm, generic reduction performance

[Chart: input bytes processed, approximate numbers; ATI Radeon™ HD 5870, ATI Stream SDK v2.01. Stage labels: (256 MB to 320 KB), (320 KB to 256 KB), (256 KB to 1 KB). Rates shown: 145 GB/s, 109 GB/s, 107 GB/s, 103 GB/s.]

Configuration: AMD Phenom™ 9950 X GHz, MSI K9A2 Platinum, 8 GB RAM, Windows® 7 32-bit, ATI Radeon™ HD 5870 GPU, ATI Stream SDK v2.01, ATI Catalyst™ 10.2

Kernel launch setup

At least as many threads as needed to optimally fetch input.

[Chart: group size.]

Launch setup: assorted lore

At least 1 group per SIMD.
3-4 wavefronts per SIMD to keep SIMD stages busy (2 ALU, 1 fetch, 1 export).
For memory-bound kernels: >= 7 wavefronts per SIMD for __global latency hiding (> 8k threads on AMD “Cypress” GPU).
Per-thread and per-group costs become noticeable at high thread counts (i.e. 1 thread per DWORD 4-vec).
Good experimental starting point: 64 and/or 128 threads/group, >= 16k threads (on AMD “Cypress” GPU).
On CPU: as few threads as possible, e.g. 1x to 2x the number of compute units.

Launch setup, histogram

Larger group size: better __local sharing between threads.
Smaller group size: __local reduction gets more expensive.
Experimental peak at 256 threads/group, 64k threads.

Launch setup, histogram, cont’d

#define NBINS 256

main() {
    nThreads = 64 * 1024;
    nThreadsPerGroup = 256;
    nGroups = nThreads / nThreadsPerGroup;
    n4Vectors = 4096 * 4096;
    n4VectorsPerThread = n4Vectors / nThreads;
    inputNBytes = n4Vectors * sizeof(cl_uint4);
    outputNBytes = nGroups * NBINS * sizeof(cl_uint);

(Static setup for benchmarking purposes only; a real app will take into account the image size and GPU type: wavefront size, number of compute units.)
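The arithmetic in this setup can be checked on the host without an OpenCL runtime. The sketch below is an illustration only: the `LaunchConfig` struct and `make_launch_config` function are hypothetical names, and the element sizes are hard-coded on the assumption that `cl_uint4` is 16 bytes and `cl_uint` is 4 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define NBINS 256

/* Hypothetical container for the derived launch parameters. */
typedef struct {
    size_t nGroups;
    size_t n4VectorsPerThread;
    size_t inputNBytes;
    size_t outputNBytes;
} LaunchConfig;

/* Derives group count and buffer sizes from the thread and input counts.
   Assumes sizeof(cl_uint4) == 16 and sizeof(cl_uint) == 4. */
static LaunchConfig make_launch_config(size_t nThreads,
                                       size_t nThreadsPerGroup,
                                       size_t n4Vectors)
{
    assert(nThreads > 0 && nThreadsPerGroup > 0);
    assert(nThreads % nThreadsPerGroup == 0);  /* whole groups only */
    assert(n4Vectors % nThreads == 0);         /* equal work per thread */

    LaunchConfig c;
    c.nGroups = nThreads / nThreadsPerGroup;
    c.n4VectorsPerThread = n4Vectors / nThreads;
    c.inputNBytes = n4Vectors * 16;
    c.outputNBytes = c.nGroups * NBINS * 4;
    return c;
}
```

With the slide's values (64 * 1024 threads, 256 threads per group, 4096 * 4096 4-vectors) this gives 256 groups, 256 4-vectors per thread, a 256 MB input buffer and a 256 KB output buffer.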

Kernel

__kernel void histogramKernel( __global uint4 *Image,
                               __global uint *Histogram,
                               uint n4VectorsPerThread )
{
    __local uint subhists[NBANKS * NBINS];
    …

Input buffer is processed as 4-vectors.
Output buffer holds sub- and final histograms (256 bins * 256 groups * cl_uint = 256 KB).
__local buffer holds work-group sub-histograms (256 bins * 16 banks * cl_uint = 16 KB per SIMD).

Kernel, parallel LDS clear

__local uint2 *p = (__local uint2 *) subhists;
if( ltid < lmem_max_threads ) {
    for( ) p[idx] = 0;
}
barrier( CLK_LOCAL_MEM_FENCE );

Significant difference compared to a single-thread clear (4.5x).
Slightly faster as uint2 vs. uint (2x more LDS requests per instruction).

Kernel, coalesced access

uint tid = get_global_id(0);
uint Stride = get_global_size(0);
uint4 temp;
for( i=0, idx = tid; i < n4VectorsPerThread; i++, idx += Stride ) {
    temp = Image[idx];

Each thread starts at its global thread ID.
Stride is the number of threads.
Resulting pattern over all threads is optimally coalesced.

[Diagram: … Loop 0, Loop 1, Loop 2; each loop spans get_global_size(0).]

Kernel, serial access

uint tid = get_global_id(0);
uint4 temp;
for( i=0, idx = tid*n4VectorsPerThread; i<n4VectorsPerThread; i++, idx++ ) {
    temp = Image[idx];

Each thread reads a block with stride 1.
Resulting pattern is bad for uncached __global.
OK on CPU and for cached GPU accesses.

[Diagram: Loop 0, Loop 1, Loop 2; each thread's block spans n4VectorsPerThread.]
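The difference between the two loops is easiest to see by computing which element a given thread reads in a given iteration. The following is a small host-side sketch, not kernel code; `coalesced_index` and `serial_index` are hypothetical helper names that restate the two `idx` expressions from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Coalesced pattern: idx starts at tid and advances by the thread count,
   so in any one iteration neighbouring threads read neighbouring elements. */
static size_t coalesced_index(size_t tid, size_t i, size_t nThreads)
{
    assert(tid < nThreads);
    return tid + i * nThreads;
}

/* Serial pattern: idx starts at tid * n4VectorsPerThread and advances by 1,
   so each thread walks its own contiguous block. */
static size_t serial_index(size_t tid, size_t i, size_t n4VectorsPerThread)
{
    assert(i < n4VectorsPerThread);
    return tid * n4VectorsPerThread + i;
}
```

For example, with 4 threads and 3 elements per thread, iteration 0 touches elements 0, 1, 2, 3 in the coalesced pattern but 0, 3, 6, 9 in the serial one.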

Coalesced vs. serial access

[Chart: group size 64.]

Kernel: 4-vector pixel mask & shift

1. fetch: XYZWXYZWXYZWXYZW
2. mask:  ___W___W___W___W
3. shift: _XYZ_XYZ_XYZ_XYZ
4. mask:  ___Z___Z___Z___Z
5. shift: __XY__XY__XY__XY
6. mask:  ___Y___Y___Y___Y
7. …

Performs better than generic uchar4/uchar16.

#define NBANKS 16
uint offset = (uint) ltid % (uint) (NBANKS);
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
    temp = temp >> shft;
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …
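The mask-and-shift sequence can be tried out on a single 32-bit word in plain C. This is a sketch under stated assumptions: the slides use `msk` and `shft` without giving their values, so `0xFF` and `8` are assumed here for 8-bit pixel components, and `unpack_bytes` is a hypothetical name. The kernel applies the same two operations to all four components of a `uint4` at once.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: the slides do not define msk and shft. */
#define MSK  0xFFu
#define SHFT 8u

/* Extracts the four 8-bit values packed in one 32-bit word, lowest byte
   first, by alternating mask and shift as in the numbered steps above. */
static void unpack_bytes(uint32_t word, uint32_t out[4])
{
    assert(out != 0);
    for (int k = 0; k < 4; k++) {
        out[k] = word & MSK;  /* mask: keep the lowest byte */
        word >>= SHFT;        /* shift: bring the next byte down */
    }
}
```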

Kernel, atomic scatter

#define NBANKS 16
uint offset = (uint) ltid % (uint) (NBANKS);
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    …

Kernel, LDS banks

#define NBANKS 16
uint offset = (uint) ltid % (uint) (NBANKS);
for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    …

[Diagram: bins A B C D E F. With NBANKS = : LDS addr 0. With NBANKS = 16: LDS addr 0, LDS addr 0x10, LDS addr 0x20.]
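The index expression in this kernel gives every bin NBANKS consecutive counters and lets the local thread ID choose among them. The host-side sketch below restates that expression for a single scalar value; `banked_index` is a hypothetical name, and the bound check on `bin` assumes the 256-bin setup from the earlier slides.

```c
#include <assert.h>
#include <stdint.h>

#define NBINS  256
#define NBANKS 16

/* Index into subhists[NBANKS * NBINS] for one bin value and one local
   thread ID: bin * NBANKS selects the bin's block of counters, and
   ltid % NBANKS selects one counter inside that block. */
static uint32_t banked_index(uint32_t bin, uint32_t ltid)
{
    assert(bin < NBINS);
    uint32_t offset = ltid % NBANKS;
    return bin * NBANKS + offset;
}
```

Two threads whose local IDs differ modulo NBANKS therefore increment different counters even when they hit the same bin, which is presumably what reduces contention on the atomic increments.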

LDS banking performance

Effective LDS rate: > 900 GB/sec

Kernel, LDS reduction

barrier( CLK_LOCAL_MEM_FENCE );
if( ltid < NBINS ) {
    uint bin = 0;
    for( i=0; i<NBANKS; i++ )
        bin += subhists[ (ltid * NBANKS) + i ];
    Histogram[ (get_group_id(0) * NBINS) + ltid ] = bin;
}

[Diagram: LDS addr 0, LDS addr 0x; reduced into __global.]
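The per-group reduction can be emulated on the host by looping over the bins that the kernel assigns to individual threads. This is a sketch only, with the hypothetical name `reduce_banks`; it assumes the layout used by the scatter step, where bin b occupies counters b * NBANKS through b * NBANKS + NBANKS - 1.

```c
#include <assert.h>
#include <stdint.h>

#define NBINS  256
#define NBANKS 16

/* Folds the NBANKS counters of every bin into one value. In the kernel the
   outer loop does not exist: thread ltid handles bin ltid. */
static void reduce_banks(const uint32_t subhists[NBANKS * NBINS],
                         uint32_t groupHist[NBINS])
{
    assert(subhists != 0 && groupHist != 0);
    for (uint32_t bin = 0; bin < NBINS; bin++) {
        uint32_t sum = 0;
        for (uint32_t i = 0; i < NBANKS; i++)
            sum += subhists[bin * NBANKS + i];
        groupHist[bin] = sum;
    }
}
```

Because the kernel guards this step with `ltid < NBINS`, it appears to rely on a group size of at least NBINS threads; with fewer threads per group, some bins would not be written by this code as shown.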

Kernel 2, __global reduction

__kernel void reduceKernel( __global uint *Histogram, uint nSubHists )
{
    uint tid = get_global_id(0);
    uint bin = 0;
    for( int i=0; i < nSubHists; i++ )
        bin += Histogram[ (i * NBINS) + tid ];
    Histogram[ tid ] = bin;
}

[Diagram: sub-histograms in __global reduced into one histogram in __global.]
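The second kernel can likewise be emulated on the host, which is handy for checking results against a serial reference. The sketch below uses the hypothetical name `reduce_subhists` and assumes one kernel thread per bin, i.e. a global size of NBINS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBINS 256

/* Histogram holds nSubHists consecutive sub-histograms of NBINS counters.
   The final histogram overwrites the first sub-histogram. */
static void reduce_subhists(uint32_t *Histogram, size_t nSubHists)
{
    assert(Histogram != NULL && nSubHists > 0);
    for (size_t tid = 0; tid < NBINS; tid++) {  /* one kernel thread per bin */
        uint32_t bin = 0;
        for (size_t i = 0; i < nSubHists; i++)
            bin += Histogram[i * NBINS + tid];
        Histogram[tid] = bin;
    }
}
```

Writing the result in place is safe here because each value of `tid` only reads and writes its own column of the buffer.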

Single component vs. 4-vector

4-vectors work best for many cases.
Some corner cases can be faster using single-component access.
For absolute peak performance, it’s worth trying both.

Single component vs. 4-vector, histogram

for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
    temp = Image[idx];
    temp2 = (temp & msk) * (uint4) NBANKS + offset;
    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    temp = temp >> shft;

Single component vs. 4-vector, histogram, cont’d

for( i=0, idx=tid; i<n4VectorsPerThread; i++, idx += Stride ) {
    temp.x = Image[idx].x;
    temp.y = Image[idx].y;
    temp.z = Image[idx].z;
    temp.w = Image[idx].w;
    temp2.x = (temp.x & msk) * (uint) NBANKS + offset;
    temp2.y = (temp.y & msk) * (uint) NBANKS + offset;
    temp2.z = (temp.z & msk) * (uint) NBANKS + offset;
    temp2.w = (temp.w & msk) * (uint) NBANKS + offset;
    (void) atom_inc( subhists + temp2.x );
    (void) atom_inc( subhists + temp2.y );
    (void) atom_inc( subhists + temp2.z );
    (void) atom_inc( subhists + temp2.w );
    temp.x = temp.x >> shft;
    temp.y = temp.y >> shft;
    temp.z = temp.z >> shft;
    temp.w = temp.w >> shft;

10% faster!

Disclaimer & Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, AMD Phenom, Catalyst, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows Vista are trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.