Implementation of Parallel Processing Techniques on Graphical Processing Units
Brad Baker, Wayne Haney, Dr. Charles Choi
2 Industry Direction
High-performance COTS computing is moving to multi-core and heterogeneous silicon:
- Smaller, heterogeneous cores on the same silicon
- Multi-core CPU with smaller individual cores
- Heterogeneous multi-core (IBM Cell)
- Multi-core CPU with 1-3 GPU co-processors
3 CPU/GPU Integration
GPU used as a math co-processor (stream processing).
Objectives / Methods:
- Investigate challenges associated with supporting signal processing applications on GPUs
- Identify cost-effective alternative approaches
Trade studies performed:
- Hardware: multi-core CPU, Cell, and GPU, compared on GFLOP/Watt and GFLOP/$
- Software: GPU stream processing
Stream processing allows easier parallelization using GPU hardware.
4 Trade Study - Hardware
Platforms compared:
- Multi-core CPU: 2 x AMD Opteron 265HE, 1.8 GHz
- GPU: assumes a single 8800GTX on the defined hardware baseline*
- Cell: single CAB from Mercury Systems on the defined hardware baseline**

8800GTX GFLOP calculation: (MADD (2 FLOPs) + MUL (1 FLOP)) x 1350 MHz x 128 SPs = 518.4 GFLOP/s

Evaluation criteria:
Metric                     Units     GPU*    Cell**   CPU
Price (complete system)    $         7500    14500    6500
Power consumption          Watts     -       -        -
Memory capacity            GB        -       -        -
Cache memory               MB        -       -        -
Memory bandwidth           GB/s      -       -        -
Platform maturity          years     0.5     1        3
Software composite score   subj.     2.5     4        8
Theoretical performance    GFLOP/s   -       -        -

Hardware baseline:
System                       Power (W)   GFLOP/s (theor.)   GFLOP/s (obs.)   Cost     Size (U)
AMD Opteron 265HE, 1.8 GHz   -           -                  -                $6,500   1
5 Trade Study - Hardware
Score weighting system based on theoretical performance, with one subjective score (software).
- The 1-10 performance scale rates each category as a percentage of the maximum performance.
- The ratio scale relates specific performance scores to the highest performance, highlighting the large differences between platforms.
- Scenario titles indicate the weighting of specific metrics in the comparison:
  - Perf
  - Perf, $
  - Perf, Power, $
  - Perf, Power, $, Software
Each scenario is scored for GPU, Cell, and CPU on both the 1-10 and ratio scales.
6 Trade Study for GPU Platform
OS used: CentOS Linux v4.4

PeakStream
- Math support: 1D & 2D arrays, single and double precision, standard C/C++ math library, BLAS, matrix solver, random number generators, 2K complex-to-complex FFTs
- Platform: runtime virtual machine that installs on top of the OS; gcc 3.4.5, gcc 4.0.3, or Intel compiler 9.0; gdb 6.3
RapidMind
- Math support: standard C++ libraries for stream-processing types, matrix support, 2K complex-to-complex FFTs
- Platform: offers a transparency layer on top of the parallel processor platform
Brook
- Math support: standard C library, FFT and matrix support
- Platform: extends C with parallel data constructs; a high-level, platform-independent language over OpenGL, DX, or CTM
CUDA
- Math support: standard FFT and BLAS libraries; 1D, 2D, and 3D transforms of complex- and real-valued data up to 16K
- Platform: C; C++ not fully supported (no class definitions, but function templates are supported); supports thread communication
CTM
- Math support: no math libraries provided; vendor examples for FFTs and matrix multiply
- Platform: program in the native instruction set and memory via the AMD assembler
7 Experiment Description
Sonar passive narrowband (PNB) processing: multiple FFT operations and spectral processing of beamformed data.
Implementation:
- OOP design written in C++
- 4K complex 1D FFT over 100 beams, 1 aperture
- Substituted Nvidia's CUDA FFT library routines in place of Intel's MKL FFT routines
8 Math Benchmark and Signal Processing String Results
Application results:
Architecture                                              Execution time (msec)
VSIPL++ PNB algorithm, Intel Core 2 Quad QX6700 CPU       -
CUDA batch-style PNB algorithm, Nvidia G80 GPU            -

Fundamental math benchmarks:
Software platform (GPU)     1K SGEMM (GFLOP/s)   1K 1D complex FFT
PeakStream (AMD r520)       -                    -
CUDA (Nvidia g80)           -                    -
RapidMind (Nvidia g80)      247.5                -
RapidMind (AMD r520)        264.9                -
Intel Core 2 Quad QX6700    -                    -
AMD Opteron 265HE           -                    -

Nvidia CUDA platform:
- Utilizes the most powerful GPU on the market
- Most extensive pre-built math library
- Approx. 50% performance gain using CUDA's FFT
9 Conclusion
- Rudimentary math results on the GPU show improvements over traditional hardware.
- Application results are impressive with minimal porting effort.
- GPU performance requires chaining multiple math operations into a larger kernel with minimal I/O transfers; the GPU then automatically operates in parallel on large data structures.
- I/O via the PCI-E bus is the main bottleneck limiting the GPU.
- Multi-core CPUs perform well on smaller data sets where I/O speed is critical.
- Tools are needed to alleviate the burden of porting to vendor-specific hardware.