Optimizing the trace transform using OpenMP and CUDA. Tim Besard, 2013-06-19.

Optimizing the trace transform using OpenMP and CUDA (Tim Besard)

The trace transform needs to be real-time.
MATLAB:
– Slow
– Difficult to optimize
C++ base implementation:
– Allows for optimizations

Overview:
– How to parallelize?
– OpenMP
– CUDA
– Performance

How to parallelize?
[Slide figure: lines through the image, parameterized by an angle φ and an offset p]
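For context (the standard definition, not spelled out on the slide): the trace transform applies a functional T to the image f restricted to each such line, where the line at angle φ and offset p can be parameterized as

g(\phi, p) = T\bigl(f(\ell(\phi, p; t))\bigr), \qquad
\ell(\phi, p; t) = p\,(\cos\phi, \sin\phi) + t\,(-\sin\phi, \cos\phi)

When T is simply integration over t, this reduces to the Radon transform, which is why the output is called a sinogram.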

Coarse-grained parallelism: for each rotation angle from 0° to 359°, rotate the image and apply the T-functionals; each functional produces one sinogram row, and the rows together form the sinogram. (A serial sketch of this pipeline follows.)
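To make that pipeline concrete, here is a minimal serial sketch in C++. All names (Image, TFunctional, rotate, trace_transform) are illustrative placeholders rather than names from the actual code base, and nearest-neighbour rotation is a deliberate simplification.

#include <cmath>
#include <functional>
#include <vector>

// Illustrative types: a grayscale image, and a T-functional that maps an
// image to one sinogram row (one value per column, i.e. per offset p).
using Image = std::vector<std::vector<float>>;
using TFunctional = std::function<std::vector<float>(const Image &)>;

// Nearest-neighbour rotation about the image centre (illustration only).
Image rotate(const Image &in, int degrees)
{
    const double pi = 3.14159265358979323846;
    int h = (int)in.size(), w = (int)in[0].size();
    double a = degrees * pi / 180.0, c = std::cos(a), s = std::sin(a);
    double cx = w / 2.0, cy = h / 2.0;
    Image out(h, std::vector<float>(w, 0.0f));
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            // Inverse-map each output pixel into the source image.
            int sx = (int)std::lround( c * (x - cx) + s * (y - cy) + cx);
            int sy = (int)std::lround(-s * (x - cx) + c * (y - cy) + cy);
            if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                out[y][x] = in[sy][sx];
        }
    return out;
}

// The serial pipeline from the slide: rotate, apply each T-functional,
// collect one sinogram row per (functional, angle) pair.
std::vector<std::vector<float>> trace_transform(
    const Image &input, const std::vector<TFunctional> &functionals)
{
    std::vector<std::vector<float>> sinogram;
    for (int angle = 0; angle < 360; angle++) {
        Image rotated = rotate(input, angle);
        for (const auto &T : functionals)
            sinogram.push_back(T(rotated));
    }
    return sinogram;
}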

How to parallelize? Fine-grained parallelism:
– Rotation
– Functionals: prefix sum
(A CUDA prefix-sum sketch follows.)
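The slide links the functionals to prefix sums: several T-functionals can be expressed in terms of running sums of pixel values along each line. As a minimal sketch of that building block, here is a Hillis-Steele inclusive scan in CUDA, one thread block per sinogram row. It assumes the row length n fits in a single block (n ≤ 1024) and makes no claim to match the talk's actual implementation.

// One thread block scans one row of length n (n <= blockDim.x).
// Launch with shared memory of n * sizeof(float) bytes, e.g.:
//   scan_rows<<<num_rows, n, n * sizeof(float)>>>(d_in, d_out, n);
__global__ void scan_rows(const float *in, float *out, int n)
{
    extern __shared__ float buf[];
    int tid = threadIdx.x;
    const float *row_in  = in  + blockIdx.x * n;
    float       *row_out = out + blockIdx.x * n;

    if (tid < n) buf[tid] = row_in[tid];
    __syncthreads();

    // Hillis-Steele: after step `offset`, buf[i] holds the sum of the
    // last 2*offset inputs ending at i.
    for (int offset = 1; offset < n; offset <<= 1) {
        float val = 0.0f;
        if (tid >= offset && tid < n) val = buf[tid - offset];
        __syncthreads();               // everyone reads before anyone writes
        if (tid < n) buf[tid] += val;
        __syncthreads();
    }
    if (tid < n) row_out[tid] = buf[tid];
}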

OpenMP: compiler directives
– #pragma omp parallel for
– #pragma omp critical
– #pragma omp barrier
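As a hedged sketch of how the first directive could be applied to the coarse-grained angle loop, reusing the Image and TFunctional placeholders from the serial sketch above (the schedule clause and row layout are illustrative choices, not taken from the talk):

#include <omp.h>

// Each angle is independent, so iterations run on separate cores. Rows are
// written to disjoint, preallocated slots, so no #pragma omp critical is
// needed here; critical sections and barriers only become necessary around
// genuinely shared state (e.g. appending to one shared container).
void trace_transform_omp(const Image &input,
                         const std::vector<TFunctional> &functionals,
                         std::vector<std::vector<float>> &sinogram)
{
    sinogram.resize(360 * functionals.size());
    #pragma omp parallel for schedule(dynamic)
    for (int angle = 0; angle < 360; angle++) {
        Image rotated = rotate(input, angle);   // private to each thread
        for (size_t t = 0; t < functionals.size(); t++)
            sinogram[t * 360 + angle] = functionals[t](rotated);
    }
}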

OpenMP's compiler directives address the coarse-grained parallelism:
– Unobtrusive
– Significant overhead
Result: 5× speed-up
– On an 8-core machine
– Unoptimized

CUDA: parallel computing platform and programming model
– Lightweight threads
– Massively parallel
Addresses the fine-grained parallelism:
– Pixel-centric approach
– Complete re-implementation
(A pixel-centric kernel sketch follows.)
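The pixel-centric approach maps naturally onto CUDA's lightweight threads: one thread per output pixel. Here is a minimal sketch of the rotation step, mirroring the serial version above; the kernel name and nearest-neighbour sampling are again illustrative simplifications, not the talk's implementation.

// One thread per output pixel: inverse-map it into the source image and
// sample. Launch with a 2D grid covering the w x h image, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((w + 15) / 16, (h + 15) / 16);
//   rotate_kernel<<<grid, block>>>(d_in, d_out, w, h, angle_radians);
__global__ void rotate_kernel(const float *in, float *out,
                              int w, int h, float angle)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;

    float cx = w / 2.0f, cy = h / 2.0f;
    float s = sinf(angle), c = cosf(angle);
    int sx = (int)roundf( c * (x - cx) + s * (y - cy) + cx);
    int sy = (int)roundf(-s * (x - cx) + c * (y - cy) + cy);

    out[y * w + x] = (sx >= 0 && sx < w && sy >= 0 && sy < h)
                         ? in[sy * w + sx] : 0.0f;
}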

CUDA: low-level details matter a lot
– Memory access patterns
– Branch divergence
Result: 10× speed-up
– GeForce GTX TITAN (20% usage)
– Unoptimized
(An illustration of both pitfalls follows.)
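To illustrate the two pitfalls named here with generic examples (not code from the talk): consecutive threads should touch consecutive addresses, and threads within a warp should follow the same branch.

// Coalesced: thread i reads element i, so a warp touches one contiguous
// segment of memory per access.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * stride, so a warp scatters its
// accesses and needs many transactions for the same amount of data.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}

// Divergent: threads in one warp take different branches, so the warp
// executes both paths serially. Branching on values that are uniform per
// warp (or per block) avoids this.
__global__ void divergent(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) data[i] *= 2.0f;   // even lanes
    else            data[i] += 1.0f;   // odd lanes: same warp, other path
}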

Performance for 10 signatures
[Chart: execution time in milliseconds for the MEX, C++, OpenMP, and CUDA implementations]

Future work:
– Optimize CUDA: compare against the state of the art
– Julia implementation: algorithmic IR

Optimizing the trace transform using OpenMP and CUDA (Tim Besard)