CUDA (Compute Unified Device Architecture): Agent Based Modeling in CUDA. Implementation of basic agent-based modeling on the GPU using the CUDA framework.

Presentation transcript:

CUDA: Compute Unified Device Architecture

Agent Based Modeling in CUDA: Implementation of basic agent-based modeling on the GPU using the CUDA framework.

GPU Architecture (figure; source: NVIDIA)

GPU Architecture (figure)

Programming Model

cudaSetDevice()
– cudaGetDeviceCount()
– cudaGetDeviceProperties()
cudaMalloc() & cudaMemcpy()
– Constant memory cache
– Texture memory cache
kernel<<<grid, block, sharedBytes, stream>>>()
– Optional argument to dynamically allocate shared memory
– Optional stream ID for asynchronous, independent launches
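
A minimal host-side sketch of this launch flow. The kernel name, array size, and grid/block dimensions are illustrative, not from the slides:

    #include <cuda_runtime.h>
    #include <cstdlib>

    __global__ void scale(float *data, int n) {
        extern __shared__ float tile[];       // sized by the launch's third argument
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = data[i];      // stage through dynamic shared memory
            data[i] = tile[threadIdx.x] * 2.0f;
        }
    }

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);           // how many CUDA devices exist
        cudaSetDevice(0);                     // select device 0
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);    // query its properties

        const int n = 1 << 20;
        float *h = (float *)calloc(n, sizeof(float));
        float *d = 0;
        cudaMalloc((void **)&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // <<<grid, block, dynamic shared-memory bytes, stream ID>>>
        scale<<<(n + 255) / 256, 256, 256 * sizeof(float), stream>>>(d, n);

        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
        free(h);
        return 0;
    }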

Impure Parallelism
__syncthreads()
– Synchronizes within a thread block
– Used for SIMD approaches to parallelism
cudaThreadSynchronize()
– Blocks the CPU until all threads on the device finish
– Used to prevent large-scale read-after-write issues
atomicAdd(), atomicExch(), etc.
– CUDA built-in atomic operations
– Used to replace classic locking mechanisms
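
A small kernel sketch showing __syncthreads() and atomicAdd() together. The block-wide sum is an illustrative pattern, assuming a power-of-two block size of 256:

    __global__ void block_sum(const int *in, int *total, int n) {
        __shared__ int partial[256];
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        partial[threadIdx.x] = (i < n) ? in[i] : 0;
        __syncthreads();                      // barrier: whole block reaches this point
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();                  // barrier after each reduction step
        }
        if (threadIdx.x == 0)
            atomicAdd(total, partial[0]);     // atomic update in place of a lock
    }

    // Host side:
    // block_sum<<<(n + 255) / 256, 256>>>(d_in, d_total, n);
    // cudaThreadSynchronize();   // block the CPU until the device is idle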

Sugarscape Model
Data:
– Two N×N single-precision matrices for sugar levels and maximums
– N×N matrix of pointers to agents, to facilitate locating agents
– N*N array of Agent data
– The Agent struct contains location, vision, sugar level, and metabolism
– Vision is an integer chosen uniformly from [1, 10]
– Metabolism is a floating-point value chosen uniformly from [0.1, 1.0)
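
A plausible layout for this data; the field names and types are assumptions, only the listed contents come from the slides:

    struct Agent {
        int   x, y;        // location on the N×N grid
        int   vision;      // uniform integer in [1, 10]
        float sugar;       // current sugar level
        float metabolism;  // uniform float in [0.1, 1.0)
    };

    // Grid data: two N×N single-precision matrices plus an agent-pointer matrix
    float  *d_sugar;       // current sugar level per patch
    float  *d_sugar_max;   // maximum sugar level per patch
    Agent **d_agent_grid;  // N×N pointers, for locating agents by position
    Agent  *d_agents;      // N*N array of agent data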

Sugarscape Model
Each iteration:
grow_sugars<<<grid, block>>>() // updates sugar patches
– Registers: 4
feed_agents<<<grid, block>>>() // agents eat from the sugar patches
– Registers: 10
move_agents<<<grid, block>>>() // agents search for and move to a location
– Registers: 17
– Collisions are prevented with the atomicExch() operation
– Upon colliding, the losing agent reevaluates
memcpy // sugar levels and agent matrices are copied for display
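
A hedged sketch of how atomicExch() might arbitrate move collisions, reusing the Agent struct above. The patch-search helper and the needs_retry flag array are hypothetical, and bookkeeping such as clearing an agent's old cell is omitted:

    // Hypothetical helper: scan patches within `vision` of the agent and
    // return the flattened index of the chosen one. Real search logic omitted.
    __device__ int pick_best_patch(Agent a, const float *sugar, int N) {
        return a.y * N + a.x;                  // placeholder: stay in place
    }

    __global__ void move_agents(int *agent_grid, int *needs_retry,
                                Agent *agents, const float *sugar,
                                int n_agents, int N) {
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if (id >= n_agents) return;

        int target = pick_best_patch(agents[id], sugar, N);

        // Atomically swap this agent's id into the target patch (-1 = empty).
        // The last writer keeps the patch; the previous occupant is returned.
        int prev = atomicExch(&agent_grid[target], id);
        if (prev != -1 && prev != id)
            needs_retry[prev] = 1;             // losing agent reevaluates later
    }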

Potential Optimization Techniques
OpenGL interoperability
– Eliminates unnecessary memory transfers by mapping data to OpenGL on the graphics card
– Ran slower than transferring data to the CPU and back
Texture fetching
– Caches data accesses based on locality
– No significant speedup without optimized locality
Constant memory
– 64 KB of global memory cached on the card
– Too small for this model's purposes
Shared memory
– On-chip, fast access
– Requires SIMD parallelism
Multiple streams
– Launch multiple instruction sets simultaneously
– Instruction sets must be independent of each other
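
A minimal sketch of the multiple-streams idea; the kernels and launch sizes are illustrative:

    __global__ void kernelA(float *x) { if (threadIdx.x == 0) x[0] += 1.0f; }
    __global__ void kernelB(float *y) { if (threadIdx.x == 0) y[0] += 1.0f; }

    void launch_concurrent(float *d_a, float *d_b) {
        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // The two kernels touch disjoint data, so launches on different
        // streams are independent and may overlap on the device
        kernelA<<<128, 256, 0, s0>>>(d_a);
        kernelB<<<128, 256, 0, s1>>>(d_b);

        cudaThreadSynchronize();   // block the CPU until both streams finish
        cudaStreamDestroy(s0);
        cudaStreamDestroy(s1);
    }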

Results

Further Research
Increasing agent complexity
– Internal processing: the register limit is already pushed with minimal processing, and thread divergence carries a high cost on the GPU's scalar processors
– External interactions: operations such as searching around an agent and communication between agents present bottlenecks
– Block approach to processing agents