© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483/ECE498al, University of Illinois, 2007-2012. ECE408 Applied Parallel Programming, Lecture 17: Atomic Operations and Histogramming.


ECE408 Applied Parallel Programming. Lecture 17: Atomic Operations and Histogramming

Objective
To understand atomic operations:
– Read-modify-write in parallel computation
– Use of atomic operations in CUDA
– Why atomic operations reduce memory system throughput
– How to avoid atomic operations in some parallel algorithms
Histogramming as an example application of atomic operations:
– Basic histogram algorithm
– Privatization

A Common Collaboration Pattern
Multiple bank tellers count the total amount of cash in the safe:
– Each grabs a pile and counts it
– A central display shows the running total
– Whenever someone finishes counting a pile, the pile's subtotal is added to the running total
A bad outcome:
– Some of the piles are not accounted for

A Common Parallel Coordination Pattern
Multiple customer service agents serve customers:
– Each customer gets a number
– A central display shows the number of the next customer to be served
– When an agent becomes available, he/she calls the number and adds 1 to the display
Bad outcomes:
– Multiple customers get the same number
– Multiple agents serve the same number

A Common Arbitration Pattern
Multiple customers book air tickets. Each:
– Brings up a flight seat map
– Decides on a seat
– Updates the seat map, marking the seat as taken
A bad outcome:
– Multiple passengers end up booking the same seat

Atomic Operations
Both threads execute the same read-modify-write sequence on a shared location Mem[x]:

thread1:               thread2:
Old ← Mem[x]           Old ← Mem[x]
New ← Old + 1          New ← Old + 1
Mem[x] ← New           Mem[x] ← New

If Mem[x] was initially 0, what would the value of Mem[x] be after threads 1 and 2 have completed? What does each thread get in its Old variable? The answer may vary due to data races. To avoid data races, you should use atomic operations.

Timing Scenario #1

Time   Thread 1             Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3      (1) Mem[x] ← New
4                           (1) Old ← Mem[x]
5                           (2) New ← Old + 1
6                           (2) Mem[x] ← New

Thread 1 Old = 0, Thread 2 Old = 1, Mem[x] = 2 after the sequence

Timing Scenario #2

Time   Thread 1             Thread 2
1                           (0) Old ← Mem[x]
2                           (1) New ← Old + 1
3                           (1) Mem[x] ← New
4      (1) Old ← Mem[x]
5      (2) New ← Old + 1
6      (2) Mem[x] ← New

Thread 1 Old = 1, Thread 2 Old = 0, Mem[x] = 2 after the sequence

Timing Scenario #3

Time   Thread 1             Thread 2
1      (0) Old ← Mem[x]
2      (1) New ← Old + 1
3                           (0) Old ← Mem[x]
4      (1) Mem[x] ← New
5                           (1) New ← Old + 1
6                           (1) Mem[x] ← New

Thread 1 Old = 0, Thread 2 Old = 0, Mem[x] = 1 after the sequence

Timing Scenario #4

Time   Thread 1             Thread 2
1                           (0) Old ← Mem[x]
2                           (1) New ← Old + 1
3      (0) Old ← Mem[x]
4                           (1) Mem[x] ← New
5      (1) New ← Old + 1
6      (1) Mem[x] ← New

Thread 1 Old = 0, Thread 2 Old = 0, Mem[x] = 1 after the sequence

Atomic Operations – To Ensure Good Outcomes
An atomic operation forces one thread's entire read-modify-write sequence to complete before the other thread's sequence begins:

thread1:               thread2:
Old ← Mem[x]
New ← Old + 1
Mem[x] ← New
                       Old ← Mem[x]
                       New ← Old + 1
                       Mem[x] ← New

Or

thread1:               thread2:
                       Old ← Mem[x]
                       New ← Old + 1
                       Mem[x] ← New
Old ← Mem[x]
New ← Old + 1
Mem[x] ← New

Without Atomic Operations
Mem[x] initialized to 0

thread1:               thread2:
Old ← Mem[x]           Old ← Mem[x]
New ← Old + 1          New ← Old + 1
Mem[x] ← New           Mem[x] ← New

If the two sequences interleave, both threads receive 0 in Old and Mem[x] becomes 1.

Atomic Operations in General
Performed by a single ISA instruction on a memory location (address):
– Read the old value, calculate a new value, and write the new value to the location
The hardware ensures that no other thread can access the location until the atomic operation is complete:
– Any other thread that accesses the location is typically held in a queue until its turn
– All threads perform their atomic operations on the location serially

Atomic Operations in CUDA
Function calls that are translated into single instructions (a.k.a. intrinsics):
– Atomic add, sub, inc, dec, min, max, exch (exchange), CAS (compare-and-swap)
– See the CUDA C Programming Guide 4.0 for details
Atomic add:

int atomicAdd(int* address, int val);

reads the 32-bit word old pointed to by address in global or shared memory, computes (old + val), and stores the result back to memory at the same address. The function returns old.

More Atomic Adds in CUDA
Unsigned 32-bit integer atomic add:
unsigned int atomicAdd(unsigned int* address, unsigned int val);
Unsigned 64-bit integer atomic add:
unsigned long long int atomicAdd(unsigned long long int* address, unsigned long long int val);
Single-precision floating-point atomic add (compute capability 2.0 and above):
float atomicAdd(float* address, float val);

Histogramming
A method for extracting notable features and patterns from large data sets:
– Feature extraction for object recognition in images
– Fraud detection in credit card transactions
– Correlating heavenly object movements in astrophysics
– …
Basic histograms: for each element in the data set, use the value to identify a bin to increment

A Histogram Example
From the sentence "Programming Massively Parallel Processors", build a histogram of the frequency of each letter: A(4), C(1), E(3), G(2), …
How do you do this in parallel?

Any more questions?