Evaluating Coprocessor Effectiveness for the Data Assimilation Research Testbed
Ye Feng, IMAGe DAReS, SIParCS, University of Wyoming

Introduction
Task: evaluate the feasibility and effectiveness of coprocessors for DART.
Target: get_close_obs (profiling shows it is computationally intensive and is executed many times during a typical DART run).
Coprocessor: NVIDIA GPUs, programmed with CUDA Fortran.
Result: a parallel version of the exhaustive search on the GPU is faster.

Problem
Given a base observation and a set of observation locations Obs(1) through Obs(16), calculate the horizontal distance between the base location and each observation location.
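
The distance computation itself is embarrassingly parallel: one GPU thread per observation. Below is a minimal CUDA Fortran sketch of such a kernel. The kernel name gpu_dist comes from the later "Device Functions" slide, but the body is an assumption: it takes latitude/longitude in radians and uses the standard great-circle formula, which is an illustration rather than the project's actual code.

! Illustrative CUDA Fortran sketch of a per-observation distance kernel
! (assumes lat/lon in radians; not the project's actual gpu_dist implementation).
module gpu_dist_m
  use cudafor
  implicit none
contains
  attributes(global) subroutine gpu_dist(obs_lon, obs_lat, base_lon, base_lat, dist, n)
    real, intent(in)  :: obs_lon(*), obs_lat(*)   ! observation locations (device arrays)
    real, intent(out) :: dist(*)                  ! horizontal (great-circle) distances
    real, value       :: base_lon, base_lat       ! base observation location
    integer, value    :: n                        ! number of observations
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x   ! one thread per observation
    if (i <= n) then
       ! Great-circle angular distance, clamped to the valid acos domain.
       dist(i) = acos(max(-1.0, min(1.0, sin(base_lat)*sin(obs_lat(i)) + &
                 cos(base_lat)*cos(obs_lat(i))*cos(obs_lon(i) - base_lon))))
    end if
  end subroutine gpu_dist
end module gpu_dist_m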

Given these distances and a cutoff radius maxdist, find the close observations: those whose distance from the base observation is less than maxdist.

The results go into two arrays: cclose_ind holds the indices of the close observations and cdist holds their distances (for example d1, d2, d5, d8, d9, d11, d12, d13). EASY! Or is it?

Data Dependency
It is easy on a CPU, but a GPU doesn't work this way! Problems with data dependencies usually don't scale well on a GPU:
cnum_close depends on its own previous value.
cclose_ind and cdist both depend on cnum_close.
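
For reference, a minimal serial sketch of the dependent loop makes the problem explicit (the subroutine name close_obs_serial is illustrative; this is not the actual DART get_close_obs code): each iteration increments cnum_close and then uses it as the write index, so iterations cannot run independently.

! Minimal serial sketch of the data-dependent loop (illustrative only).
subroutine close_obs_serial(dist, maxdist, num_obs, cnum_close, cclose_ind, cdist)
  implicit none
  integer, intent(in)  :: num_obs
  real,    intent(in)  :: dist(num_obs), maxdist
  integer, intent(out) :: cnum_close, cclose_ind(num_obs)
  real,    intent(out) :: cdist(num_obs)
  integer :: i
  cnum_close = 0
  do i = 1, num_obs
     if (dist(i) < maxdist) then
        cnum_close             = cnum_close + 1   ! depends on its own previous value
        cclose_ind(cnum_close) = i                ! write index depends on cnum_close
        cdist(cnum_close)      = dist(i)
     end if
  end do
end subroutine close_obs_serial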

GPU Scan: Prefix Sum
For each distance d1 ... d8, compute dist - maxdist and take the most significant (sign) bit to form a flag array diff: 1 if dist < maxdist (close), 0 if dist > maxdist (not close). A prefix sum over diff gives psum, and the last element of psum is cnum_close, the number of close observations.
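
Below is a sketch of this step, assuming single-precision distances. The sign-bit extraction via transfer and ibits is one way to realize the slide's "take the most significant bit" idea without a comparison branch; the helper name flag_and_scan is hypothetical, and the serial prefix sum is shown only to make the contents of psum concrete (the real code uses the parallel gpu_scan kernel instead).

! Illustrative serial sketch of the flag + prefix-sum step.
subroutine flag_and_scan(dist, maxdist, num_obs, diff, psum, cnum_close)
  implicit none
  integer, intent(in)  :: num_obs
  real,    intent(in)  :: dist(num_obs), maxdist
  integer, intent(out) :: diff(num_obs), psum(num_obs), cnum_close
  integer :: i
  do i = 1, num_obs
     ! Sign bit of (dist - maxdist): 1 when dist < maxdist (close), 0 otherwise.
     diff(i) = ibits(transfer(dist(i) - maxdist, 0), 31, 1)
  end do
  psum(1) = diff(1)
  do i = 2, num_obs
     psum(i) = psum(i-1) + diff(i)          ! inclusive prefix sum
  end do
  cnum_close = psum(num_obs)                ! total number of close observations
end subroutine flag_and_scan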

d1 d2 d3 d4 d5 d6 d7 d diff psum dist GPU Scan: d1 d2 0 0 d5 0 0 d8 Diff_sum cdist diff

Extract: what we have is the scattered cdist (d1, d2, 0, 0, d5, 0, 0, d8), the prefix sums in Diff_sum, and each element's thread ID; what we want is the compacted cdist (d1, d2, d5, d8) and the index list cclose_ind. How can we independently eliminate the zeros and extract the indices?

Solution? One idea: if diff is not 0, write the thread ID into cclose_ind; if diff is 0, throw the element away.

The catch: that is a per-thread branch, and divergent branches within a warp are serialized on the GPU. The goal is NO branching!

Solution! Use the prefix sums in Diff_sum to decide where each element goes: every thread does the same work, the close elements land contiguously at the front of cclose_ind and cdist, and cnum_close comes from the final prefix-sum value, with no per-thread branching.
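
One branch-free way to express this scatter is sketched below; it is an assumption about the exact trick, not a transcription of the project's code. Each thread computes its target index as diff(i) * psum(i): close threads write to slots 1 through cnum_close, while all far threads are redirected to a scratch slot 0 whose contents are never read. The kernel name extract matches the later "Device Functions" slide; everything else is illustrative.

! Illustrative branch-free compaction kernel (one possible formulation).
module extract_m
  use cudafor
  implicit none
contains
  attributes(global) subroutine extract(dist, diff, psum, cclose_ind, cdist, n)
    real,    intent(in)  :: dist(*)
    integer, intent(in)  :: diff(*), psum(*)      ! flags and inclusive prefix sums
    integer, intent(out) :: cclose_ind(0:*)       ! element 0 is a throwaway scratch slot
    real,    intent(out) :: cdist(0:*)
    integer, value       :: n
    integer :: i, k
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
       k = diff(i) * psum(i)       ! 0 for far observations, 1..cnum_close for close ones
       cclose_ind(k) = i           ! close observations land contiguously at 1..cnum_close
       cdist(k)      = dist(i)     ! far threads all write slot 0, which is discarded
    end if
  end subroutine extract
end module extract_m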

Device Functions
gpu_dist: computes the distances.
gpu_scan: performs the prefix sum. si is the number of iterations performed in this kernel; with 8 threads per block and si = 2, each block handles 16 elements of the dist array.
extract: compacts the result from gpu_scan. sn is the number of gpu_scan blocks that each extract block handles (for example, sn = 4).
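
A sketch of how the launch geometry might follow from these tuning parameters; the host-side variable names (elems_per_scan_block, nscan_blocks, nextract_blocks) are hypothetical and only restate the definitions of si and sn given above.

! Hypothetical host-side launch arithmetic for the tuning parameters si and sn.
integer :: num_obs, threads_per_block, si, sn
integer :: elems_per_scan_block, nscan_blocks, nextract_blocks
num_obs              = 16                               ! toy size, as in the slide figure
threads_per_block    = 8
si                   = 2                                ! iterations per gpu_scan block
sn                   = 4                                ! scan blocks covered per extract block
elems_per_scan_block = threads_per_block * si           ! 16 elements per gpu_scan block
nscan_blocks         = (num_obs + elems_per_scan_block - 1) / elems_per_scan_block
nextract_blocks      = (nscan_blocks + sn - 1) / sn     ! grid size for the extract kernel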

Conclusion
CUDA Fortran on the GPU gave a significant speedup over the CPU (10x or more).
Step outside the box: redesign the algorithm.
To get good performance, si and sn need to be tuned.
Be careful when using device memory.
There is still room to improve the performance of this project.

Acknowledgements
UCAR, NCAR, University of Wyoming, DAReS/IMAGe
Helen Kershaw (Mentor), Nancy Collins (Mentor), Jeff Anderson, Tim Hoar, Kevin Raeder, Kristin Mooney, Silvia Gentile, Carolyn Mueller, Richard Loft, Raghu Raj Prasanna Kumar