Local Memory optimizations

SEMINAR 3: Local Memory optimizations

Outline of the seminar
- Student presentations
- Local memory optimization
- Results from last year
- Basis for grading the works

Local memory bank conflicts
- A local memory bank is 4 bytes wide and 256 bytes deep (AMD), with 32 banks per CU.
- Bank conflicts are checked within a half-wavefront.
- Local memory performs best if there is one access to each bank by a half-wavefront, or when the whole half-wavefront accesses the same bank (broadcast).
- A bank conflict means that work items within a half-wavefront request values from the same bank in a single request.
- [Original slide diagram: the work items of a row work group (16,1,1) mapped onto the local memory bank indices 0-31.]
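
To make the conflict rules concrete, the following OpenCL C fragment is a minimal sketch (not part of the original slides; the kernel name, buffer names and the 16x16 tile size are assumptions) of the common padding trick: adding one extra word per row turns a stride-16 access into a conflict-free one.

    __kernel void bank_demo(__global const float *in, __global float *out)
    {
        // 16x16 tile padded to 17 floats per row; without the +1 each
        // access in the second loop below would have a stride of 16 words
        // and the 16 work items of a half-wavefront would share only 2 banks.
        __local float tile[16][16 + 1];
        const int lx = get_local_id(0);   // assumes a (16,1,1) work group

        // Cooperative fill: in iteration r the 16 work items write 16
        // consecutive words of row r, i.e. 16 different banks.
        for (int r = 0; r < 16; ++r)
            tile[r][lx] = in[r * 16 + lx];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Each work item now reads its own row; in any single iteration the
        // 16 accesses are 17 words apart, which maps to 16 distinct banks.
        float sum = 0.0f;
        for (int c = 0; c < 16; ++c)
            sum += tile[lx][c];
        out[get_global_id(0)] = sum;
    }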

Explicit copy to local memory

    // local_left and local_right are passed in as kernel arguments:
    //     __local uchar *local_left, __local uchar *local_right
    const int2 gid = (int2)(get_global_id(0), get_global_id(1));
    const int2 lid = (int2)(get_local_id(0), get_local_id(1));
    int Lx, Ly, Lindex, Gindex;

    // fill the left and right local memory buffers
    // (width, local_width, local_height, srcL, srcR and MAX_DISP come from
    // the surrounding kernel)
    for (Ly = 0; Ly < local_height; Ly += get_local_size(1)) {
        for (Lx = 0; Lx < local_width; Lx += get_local_size(0)) {
            Lindex = (lid.y + Ly) * local_width + lid.x + Lx;
            Gindex = (gid.y + Ly) * width + gid.x + Lx;
            local_left[Lindex]  = srcL[Gindex];
            local_right[Lindex] = srcR[Gindex - MAX_DISP];
        }
    }
    // a barrier(CLK_LOCAL_MEM_FENCE) is normally needed here before any
    // work item reads what another work item copied
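
Since the two __local pointers are kernel arguments, the host has to reserve the per-work-group storage when it sets those arguments. Below is a minimal host-side sketch in C; the argument indices and the use of local_width/local_height for the sizes are assumptions, not taken from the slides.

    /* For a __local kernel argument, clSetKernelArg() is given only the size
     * in bytes; the pointer is NULL and the runtime allocates the buffer per
     * work group. Error checking is omitted for brevity. */
    #include <CL/cl.h>

    static void set_local_buffers(cl_kernel kernel,
                                  size_t local_width, size_t local_height)
    {
        size_t bytes = local_width * local_height * sizeof(cl_uchar);

        clSetKernelArg(kernel, 0, bytes, NULL);   /* local_left  (index assumed) */
        clSetKernelArg(kernel, 1, bytes, NULL);   /* local_right (index assumed) */
    }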

Results from 2016

Mali-T624 (Honor 6) results from 2016, OS Android 5.1.1
    C (CPU, single thread)             300.36 s
    OpenCL (GPU)                        25.176 s
    OpenCL with vectorization (GPU)      7.243 s

Odroid instructions
- Username and password for the Odroids are both "odroid".
- There is a Mali_SDK shortcut on the desktop; open it and then open the samples folder.
- Copy the template folder and rename it.
- Copy your files into the folder and edit the Makefile (sketched below):
  - replace template.cpp with your .cpp files on the SOURCES line
  - on the HEADERS line, include the required header files
  - on the EXECUTABLE line, rename the executable if you wish
- Open the MATE terminal and go to the folder you created, e.g.
  cd /home/odroid/Desktop/Mali_OpenCL_SDK_v1.1.0/samples/your_folder
- Build your project: type "make".
- Run your project: ./your_executable
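
As a sketch only, the three Makefile lines mentioned above might end up looking like the fragment below in your copied sample folder; the file and executable names are placeholders, and the rest of the sample Makefile stays as provided by the Mali SDK.

    SOURCES    = my_stereo.cpp
    HEADERS    = my_stereo.h
    EXECUTABLE = my_stereo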

Brief CodeXL instructions
1. In Visual Studio, open the CodeXL tab
2. Switch to profile mode
3. Choose the GPU: Performance Counters profile
4. Start CodeXL GPU Profiling

About the grading
1. The work is returned before the deadline, 12.4. at midnight
2. Everything works + final report and training diary returned
3. Minor optimizations (native functions, fast floating-point math, etc.)
4. Vector optimization (examples of points 3 and 4 are sketched below)
5. Local memory optimization

An extra +1 can be granted if CodeXL profiling is performed and possible further actions to optimize the code, based on the profiling feedback, are given in the final report. CodeXL is available on the workstations in TS135 and TS351. If e.g. Nvidia or Intel have similar tools, they can be used as well.
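
As a rough illustration of what points 3 and 4 can look like in practice (a sketch with assumed kernel and buffer names, not a reference solution), the fragment below combines a native math function with float4 vector loads; the fast floating-point math mentioned in point 3 is typically enabled by passing -cl-fast-relaxed-math to clBuildProgram.

    // Sketch only: assumes 'in' and 'out' hold a multiple of four floats and
    // that each work item handles one float4. native_divide trades accuracy
    // for speed, so the output should be validated against the plain version.
    __kernel void scale(__global const float *in, __global float *out,
                        const float divisor)
    {
        const int i = get_global_id(0);

        float4 v = vload4(i, in);                   // vector optimization (point 4)
        v = native_divide(v, (float4)(divisor));    // native function (point 3)
        vstore4(v, i, out);
    }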