- DHRUVA TIRUMALA BUKKAPATNAM Geant4 Geometry on a GPU.

Slides:

Advertisements

Similar presentations

Instructor Notes This lecture begins with an example of how a wide- memory bus is utilized in GPUs The impact of memory coalescing and memory bank conflicts.

Advertisements

CS179: GPU Programming Lecture 5: Memory. Today GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling.

Intermediate GPGPU Programming in CUDA

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

Optimization on Kepler Zehuan Wang

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

CPU & GPU Parallelization of Scrabble Word Searching Jonathan Wheeler Yifan Zhou.

OpenCL Peter Holvenstot. OpenCL Designed as an API and language specification Standards maintained by the Khronos group  Currently 1.0, 1.1, and 1.2.

Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:

Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.

Contiki A Lightweight and Flexible Operating System for Tiny Networked Sensors Presented by: Jeremy Schiff.

Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.

The PTX GPU Assembly Simulator and Interpreter N.M. Stiffler Zheming Jin Ibrahim Savran.

CS 732: Advance Machine Learning Usman Roshan Department of Computer Science NJIT.

Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications Sajitha Naduvil-Vadukootu CSC 8530 (Parallel Algorithms)

Graphics Processing Units

Fluid Simulation using CUDA Thomas Wambold CS680: GPU Program Optimization August 31, 2011.

Contemporary Languages in Parallel Computing Raymond Hummel.

GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.

To GPU Synchronize or Not GPU Synchronize? Wu-chun Feng and Shucai Xiao Department of Computer Science, Department of Electrical and Computer Engineering,

GPGPU platforms GP - General Purpose computation using GPU

HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.

Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.

AGENT SIMULATIONS ON GRAPHICS HARDWARE Timothy Johnson - Supervisor: Dr. John Rankin 1.

CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA

Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.

Revisiting Kirchhoff Migration on GPUs Rice Oil & Gas HPC Workshop

Modeling GPU non-Coalesced Memory Access Michael Fruchtman.

Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.

Instructor Notes GPU debugging is still immature, but being improved daily. You should definitely check to see the latest options available before giving.

General Purpose Computing on Graphics Processing Units: Optimization Strategy Henry Au Space and Naval Warfare Center Pacific 09/12/12.

YOU LI SUPERVISOR: DR. CHU XIAOWEN CO-SUPERVISOR: PROF. LIU JIMING THURSDAY, MARCH 11, 2010 Speeding up k-Means by GPUs 1.

On a Few Ray Tracing like Algorithms and Structures. -Ravi Prakash Kammaje -Swansea University.

Applying GPU and POSIX Thread Technologies in Massive Remote Sensing Image Data Processing By: Group 17 King Mongkut's Institute of Technology Ladkrabang.

Computational Biology 2008 Advisor: Dr. Alon Korngreen Eitan Hasid Assaf Ben-Zaken.

GPU Architecture and Programming

Thread-safety and shared geometry data J. Apostolakis.

JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS Ang Li University of Wisconsin-Madison.

Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.

GPUs: Overview of Architecture and Programming Options Lee Barford firstname dot lastname at gmail dot com.

OpenCL Programming James Perry EPCC The University of Edinburgh.

Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.

1)Leverage raw computational power of GPU  Magnitude performance gains possible.

CUDA. Assignment  Subject: DES using CUDA  Deliverables: des.c, des.cu, report  Due: 12/14,

A Thread-Parallel Geant4 with Shared Geometry Gene Cooperman and Xin Dong College of Computer and Information Science Northeastern University 360 Huntington.

Geant4 on GPU prototype Nicholas Henderson (Stanford Univ. / ICME)

Heterogeneous Computing With GPGPUs Matthew Piehl Overview Introduction to CUDA Project Overview Issues faced nvcc Implementation Performance Metrics Conclusions.

3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.

My Coordinates Office EM G.27 contact time:

GPU Computing for GIS James Mower Department of Geography and Planning University at Albany.

S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi.

SixTrack for GPU R. De Maria. SixTrack Status SixTrack: Single Particle Tracking Code [cern.ch/sixtrack]. 70K lines written in Fortran 77/90 (with few.

CS 179: GPU Computing LECTURE 2: MORE BASICS. Recap Can use GPU to solve highly parallelizable problems Straightforward extension to C++ ◦Separate CUDA.

MAUS Status A. Dobbs CM43 29 th October Contents MAUS Overview Infrastructure Geometry and CDB Detector Updates CKOV EMR KL TOF Tracker Global Tracking.

GPGPU Programming with CUDA Leandro Avila - University of Northern Iowa Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa.

1 ”MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs” John A. Stratton, Sam S. Stone and Wen-mei W. Hwu Presentation for class TDT24,

GPU-based iterative CT reconstruction

NVIDIA Profiler’s Guide

Heterogeneous Computing with D

Lecture 2: Intro to the simd lifestyle and GPU internals

Operation System Program 4

Brook GLES Pi: Democratising Accelerator Programming

Faster File matching using GPGPU’s Deephan Mohan Professor: Dr

MASS CUDA Performance Analysis and Improvement

Debugging Tools Tim Purcell NVIDIA.

GPU Scheduling on the NVIDIA TX2:

6- General Purpose GPU Programming

Presentation transcript:

- DHRUVA TIRUMALA BUKKAPATNAM Geant4 Geometry on a GPU

Introduction Outline & goals of project  Geant4 for GPU Overview of GPUs  GPGPU paradigm  Stream processing  Uses kernels Starting point Testing Progress

GPU limitations Avoid Conditionals  As far as possible all threads must execute same code  Threads dispatched in warps  Within a warp, no threads will "get ahead" of any others Memory transfer latency  Global memory access slow  Use shared (cache-like) memory and texture memory.

Platform: OpenCL/CUDA OpenCL  Open source, freely available  From the Khronos group  Designed to work GPU, CPU and DSP CUDA  Introduced by nVidia before OpenCL.  Easy to understand C like syntax Macros  Span OpenCL and CUDA – see work of Otto

Platform : Other Gpuocelot  Emulate CUDA on other platforms  Still immature.  Potential advantages – No rewrite; Debugging; Speed openMP, openACC …

Previous Work: Otto Seiskari (2010) Port of core of Navigation exists  5 types of solids: box, orb, tubs, cons, polycone  Physics volumes: only placements  Has “normal” and “voxel” navigation  Defines clones of Geant4 classes through structs. Without physics Uses Macros to span OpenCL and CUDA

Previous Work: G4VPhysicalVolume typedef struct G4VPhysicalVolume { G4RotationMatrix frot; G4ThreeVector ftrans; GEOMETRYLOC G4LogicalVolume *flogical; // The logicalvolume // representing the // physical and tracking attributes of // the volume GEOMETRYLOC G4LogicalVolume *flmother; // The current mother logical volume } G4VPhysicalVolume;

My Work: Goals Initially to simulate e- and Gamma particle interactions  Two implementations exist  We are in touch with the French team and have requested code of gamma & e- physics New focus: voxel navigation – critical for HEP  to improve the performance of the navigation code.  extend the functionality to additional solids

My Work : First steps Get existing code to run  Compilation errors  System –  ATI Mobility Radeon 5700  AMD APP SDK 2.7 with OpenCL 1.2 on Ubuntu  Error: kernel arguments can't be declared with types bool/half/size_t/ptrdiff_t/intptr_t/uintptr_t/pointer-to-pointer: __global G4VPhysicalVolume *worldVolumeAndGeomBuffer

My Work : First Steps Code ran on another system  Difference was in OpenCL version 1.1 Specifications -:  ATI Mobility Radeon 5xxx  AMD APP SDK 2.5  OpenCL 1.1

My Work : Debugging the code Code compiles with OpenCL 1.1 Problem with pointers on GPU CPU GPU X 0001 Geometry start G4Logical Volume G4VPhysi calVolume X0002X0050 X 0060X0061X0062 X 0001X0002X0050 X 1000 Geometry start G4Logical Volume G4VPhysi calVolume X1002X1050 X 1060X1061X1062 X 0001X0002X0050

My Work : Debugging the code Solved the problem  Relocation on GPU  Move pointer offsets  Calculate new addresses  New way of getting address  (int) starting_buffer  64-bit compatibility doubtful  Both methods confirmed to work  Confirmation by testing

My Work : Creating Tests Testing procedure  Allocate buffer on GPU  Add test integers to struct definition  Assign values to these ints on CPU  Implement/Modify kernel  Move these ints into buffer on GPU  Transfer back  Compare Even with tests, debugging with OpenCL can be hard

My Work : Automating Tests Created a set of tests  Check for Geometry –  Confirm offsets of pointers on GPU  Confirm density matches  PhysVol->LogicalVol->Solid->Material->Density  Check Distance –  Basic check.  Confirms step == distance moved Automated with Macros Tests are Solid basis for future improvements

My Work : Optimisation  Challenge-  Avoid overhead of Switch statement for solids ( for ‘Virtual’ call )  Different threads cannot run different code To get performance all must work on the same type of solid  New algorithm-  Threads execute more common code Calculate steps, one solid type at a time.  Uses fast local (shared) mem  Implemented as a New type of navigation

My Work : Optimisation Code tested and run with OpenCL 1.1 with AMD APP SDK. Code not yet tested for compatibility with CUDA Code still has to be profiled to check for performance gains.

My Work : CUDA v/s OpenCL Recently got access to an Nvidia GTX 680 card. ( courtesy Felice Pantaleo ) Was able to easily configure CUDA code to run. Some discrepancies in results between the two versions. Even the legacy versions of the code have this error. Normal navigation does not compile for both platforms.

Challenges Algorithms may have to be altered OpenCL can be challenging. CUDA is more C-like OpenCL and gpuocelot not easily debugged.

Challenges New tools from AMD should help ease the problem  Good support for Windows.  gDebugger, APP Profiler.. Code not tested for  Compatibility with newer versions of OpenCL

The way forward The next steps for the project-  Support more (all?) of Geant4 geometry definition  More tests  Documentation  If the Physics definition from French team can be used, we might be able to run one complete example on the GPU

Thank You Questions?