ECE408/CS483 Applied Parallel Programming Lecture 7b: Performance Considerations and GMAC

Outline
- Reading Data From Files
- Double Buffering
- GMAC

Reading a File in Theory
Two-stage process:
1. Read from the file into host memory
2. Copy from host memory to device memory
(Diagram: CPU with Host Memory, GPU with Device Memory.)

Reading a File in Code (I)

#include <fcntl.h>        /* open, O_RDONLY */
#include <sys/stat.h>     /* fstat, struct stat */
#include <unistd.h>       /* read, close */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

static const int FILE_NAME = 1;
static const int LAST_PARAM = 2;

int main(int argc, char *argv[])
{
    int fd = -1;
    struct stat file_stat;
    cudaError_t cuda_ret;
    void *host_buffer, *device_buffer;
    timestamp_t start_time, end_time;
    float bw;

    if(argc < LAST_PARAM) FATAL("Bad argument count");

    /* Open the file */
    if((fd = open(argv[FILE_NAME], O_RDONLY)) < 0)
        FATAL("Unable to open %s", argv[FILE_NAME]);
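Note: the slides never show the FATAL macro; a minimal sketch of what it might look like (an assumption, not the course's actual macro), using a GNU-style variadic macro:

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical error helper: print a formatted message and exit. */
#define FATAL(fmt, ...) do {                      \
        fprintf(stderr, fmt "\n", ##__VA_ARGS__); \
        exit(EXIT_FAILURE);                       \
    } while(0)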

Reading a File in Code (II)

    if(fstat(fd, &file_stat) < 0)
        FATAL("Unable to read meta data for %s", argv[FILE_NAME]);

    /* Allocate memory */
    host_buffer = malloc(file_stat.st_size);
    if(host_buffer == NULL) FATAL("Unable to allocate host memory");
    cuda_ret = cudaMalloc(&device_buffer, file_stat.st_size);
    if(cuda_ret != cudaSuccess) FATAL("Unable to allocate device memory");

    /* Read the file to host memory and copy it to device memory */
    start_time = get_timestamp();
    if(read(fd, host_buffer, file_stat.st_size) < file_stat.st_size)
        FATAL("Unable to read data from %s", argv[FILE_NAME]);
    cuda_ret = cudaMemcpy(device_buffer, host_buffer, file_stat.st_size,
                          cudaMemcpyHostToDevice);
    if(cuda_ret != cudaSuccess) FATAL("Unable to copy data to device memory");

Reading a File in Code (and III)

    cuda_ret = cudaThreadSynchronize();  /* deprecated; cudaDeviceSynchronize() in current CUDA */
    if(cuda_ret != cudaSuccess) FATAL("Unable to wait for device");
    end_time = get_timestamp();

    /* Calculate the bandwidth */
    bw = 1.0f * file_stat.st_size / (end_time - start_time);
    fprintf(stdout, "%ld bytes in %f msec : %f MBps\n",
            (long)file_stat.st_size, 1e-3f * (end_time - start_time), bw);

    /* Release resources */
    cuda_ret = cudaFree(device_buffer);
    if(cuda_ret != cudaSuccess) FATAL("Unable to free device memory");
    free(host_buffer);
    close(fd);
    return 0;
}
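Note: timestamp_t and get_timestamp() are also assumed by the slides; a minimal sketch, assuming microsecond timestamps (which makes bytes divided by microseconds come out directly in MBps in the bandwidth calculation above):

#include <sys/time.h>

typedef unsigned long long timestamp_t;

/* Hypothetical timer: wall-clock time in microseconds. */
static timestamp_t get_timestamp(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (timestamp_t)tv.tv_sec * 1000000ULL + (timestamp_t)tv.tv_usec;
}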

Performance of File I/O
(Performance chart not preserved in the transcript.)

Outline
- Reading Data From Files
- Double Buffering
- GMAC

Double-Buffering in Theory
Multi-stage process with concurrent transfers:
1. Read a piece of the file into host memory...
2. ...while copying the previous piece to device memory
(Diagram: CPU with Host Memory, GPU with Device Memory.)

Double-Buffering in Code (I)

/* Uses the same headers, FILE_NAME / LAST_PARAM constants, and helpers
   as the previous example, plus <stdint.h> for uint8_t. */
static const size_t host_buffer_size = 512 * 1024;

int main(int argc, char *argv[])
{
    int fd = -1;
    struct stat file_stat;
    cudaError_t cuda_ret;
    cudaStream_t cuda_stream;
    cudaEvent_t active_event, passive_event, tmp_event;
    void *host_buffer, *device_buffer, *active, *passive, *tmp, *current;
    size_t pending, transfer_size;
    timestamp_t start_time, end_time;
    float bw;

    if(argc < LAST_PARAM) FATAL("Bad argument count");

    /* Open the file */
    if((fd = open(argv[FILE_NAME], O_RDONLY)) < 0)
        FATAL("Unable to open %s", argv[FILE_NAME]);

Double-Buffering in Code (II)

    if(fstat(fd, &file_stat) < 0)
        FATAL("Unable to read meta data for %s", argv[FILE_NAME]);

    /* Create CUDA stream for asynchronous copies */
    cuda_ret = cudaStreamCreate(&cuda_stream);
    if(cuda_ret != cudaSuccess) FATAL("Unable to create CUDA stream");

    /* Create CUDA events */
    cuda_ret = cudaEventCreate(&active_event);
    if(cuda_ret != cudaSuccess) FATAL("Unable to create CUDA event");
    cuda_ret = cudaEventCreate(&passive_event);
    if(cuda_ret != cudaSuccess) FATAL("Unable to create CUDA event");

    /* Alloc a big chunk of pinned host memory */
    cuda_ret = cudaHostAlloc(&host_buffer, 2 * host_buffer_size,
                             cudaHostAllocDefault);
    if(cuda_ret != cudaSuccess) FATAL("Unable to allocate host memory");
    cuda_ret = cudaMalloc(&device_buffer, file_stat.st_size);
    if(cuda_ret != cudaSuccess) FATAL("Unable to allocate device memory");

Double-Buffering in Code (III)

    /* Start transferring */
    start_time = get_timestamp();

    /* Queue dummy first event so the first wait below returns immediately */
    cuda_ret = cudaEventRecord(active_event, cuda_stream);
    if(cuda_ret != cudaSuccess) FATAL("Unable to queue CUDA event");

    active = host_buffer;
    passive = (uint8_t *)host_buffer + host_buffer_size;
    current = device_buffer;
    pending = file_stat.st_size;
    while(pending > 0) {
        /* Make sure CUDA is not using the buffer */
        cuda_ret = cudaEventSynchronize(active_event);
        if(cuda_ret != cudaSuccess) FATAL("Unable to wait for event");

        transfer_size = (pending > host_buffer_size) ? host_buffer_size : pending;
        if(read(fd, active, transfer_size) < (ssize_t)transfer_size)
            FATAL("Unable to read data from %s", argv[FILE_NAME]);

Double-Buffering in Code (IV)

        /* Send data to the device asynchronously */
        cuda_ret = cudaMemcpyAsync(current, active, transfer_size,
                                   cudaMemcpyHostToDevice, cuda_stream);
        if(cuda_ret != cudaSuccess) FATAL("Unable to copy data to device memory");

        /* Record event to know when the buffer is idle */
        cuda_ret = cudaEventRecord(active_event, cuda_stream);
        if(cuda_ret != cudaSuccess) FATAL("Unable to queue CUDA event");

        /* Update counters and swap the ping-pong buffers and their events */
        pending -= transfer_size;
        current = (uint8_t *)current + transfer_size;
        tmp = active; active = passive; passive = tmp;
        tmp_event = active_event; active_event = passive_event; passive_event = tmp_event;
    }

Double-Buffering in Code (and V)

    cuda_ret = cudaStreamSynchronize(cuda_stream);
    if(cuda_ret != cudaSuccess) FATAL("Unable to wait for device");
    end_time = get_timestamp();

    bw = 1.0f * file_stat.st_size / (end_time - start_time);
    fprintf(stdout, "%ld bytes in %f msec : %f MBps\n",
            (long)file_stat.st_size, 1e-3f * (end_time - start_time), bw);

    cuda_ret = cudaFree(device_buffer);
    if(cuda_ret != cudaSuccess) FATAL("Unable to free device memory");
    cuda_ret = cudaFreeHost(host_buffer);
    if(cuda_ret != cudaSuccess) FATAL("Unable to free host memory");
    close(fd);
    return 0;
}

Performance of Double-Buffering
(Performance chart not preserved in the transcript.)

Outline
- Reading Data From Files
- Double Buffering
- GMAC

GMAC Memory Model
- Unified CPU / GPU virtual address space
- Asymmetric address space visibility: the CPU sees both its private data and the shared data, while the GPU sees only the shared data
(Diagram: memory holding CPU Data and Shared Data; the CPU reaches both, the GPU only the shared region.)

ADSM Consistency Model
Implicit acquire / release primitives at accelerator call / return boundaries: launching a kernel implicitly releases the shared data to the GPU, and the accelerator's return (synchronization) implicitly acquires it back for the CPU. A sketch follows below.
(Diagram: CPU and GPU exchanging ownership of the shared data.)
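Note: to make the last two slides concrete, here is a minimal sketch of a GMAC program (the vecInc kernel, size, and launch configuration are hypothetical; gmacMalloc, gmacFree, and gmacThreadSynchronize are the GMAC calls shown later in this lecture). A single pointer is valid on both CPU and GPU, and ownership moves implicitly at the kernel launch and at synchronization:

#include <gmac/cuda.h>

__global__ void vecInc(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n) v[i] += 1.0f;
}

int main(void)
{
    float *v = NULL;
    if(gmacMalloc((void **)&v, 1024 * sizeof(float)) != gmacSuccess) return 1;
    for(int i = 0; i < 1024; i++) v[i] = 0.0f;  /* the CPU writes through the shared pointer */
    vecInc<<<4, 256>>>(v, 1024);                /* implicit release: the GPU now owns v */
    gmacThreadSynchronize();                    /* implicit acquire: the CPU owns v again */
    /* v can be read directly here; v[0] == 1.0f */
    gmacFree(v);
    return 0;
}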

GMAC Memory API
Memory allocation:

gmacError_t gmacMalloc(void **ptr, size_t size)
- ptr: allocated memory address, returned by reference
- size: size of the data to be allocated
- Returns an error code, gmacSuccess if no error

Example usage:

#include <gmac/cuda.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
}

GMAC Memory API
Memory release:

gmacError_t gmacFree(void *ptr)
- ptr: memory address to be released
- Returns an error code, gmacSuccess if no error

Example usage:

#include <gmac/cuda.h>

int main(int argc, char *argv[])
{
    float *foo = NULL;
    gmacError_t error;
    if((error = gmacMalloc((void **)&foo, FOO_SIZE)) != gmacSuccess)
        FATAL("Error allocating memory %s", gmacErrorString(error));
    . . .
    gmacFree(foo);
}

ADSM Eager Update
- Asynchronous data transfers while the CPU computes
- Optimized data transfers:
  - Memory block granularity (see the sketch after this slide)
  - Avoid copies on integrated GPU systems
(Diagram: System Memory and Accelerator Memory, between CPU and GPU.)
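Note: a toy sketch of what eager, block-granularity transfers might look like under the hood (an illustration of the idea only, not GMAC's actual implementation; tracked_buffer, mark_dirty, and flush_dirty are hypothetical names):

#include <stddef.h>
#include <cuda_runtime.h>

#define BLOCK_SIZE 4096

typedef struct {
    unsigned char *host;  /* host copy of the shared buffer */
    char *dirty;          /* one flag per BLOCK_SIZE block  */
    size_t nblocks;
} tracked_buffer;

/* Mark the block containing a written offset as dirty. */
static void mark_dirty(tracked_buffer *b, size_t offset)
{
    b->dirty[offset / BLOCK_SIZE] = 1;
}

/* Push only the dirty blocks to the device, asynchronously, so the
   transfers overlap with whatever the CPU does next. */
static void flush_dirty(tracked_buffer *b, void *dev, cudaStream_t s)
{
    for(size_t i = 0; i < b->nblocks; i++) {
        if(!b->dirty[i]) continue;
        cudaMemcpyAsync((unsigned char *)dev + i * BLOCK_SIZE,
                        b->host + i * BLOCK_SIZE,
                        BLOCK_SIZE, cudaMemcpyHostToDevice, s);
        b->dirty[i] = 0;
    }
}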

ADSM Coherence Protocol
Only meaningful for the CPU. Three-state protocol:
- Modified: data is in the CPU
- Shared: data is in both the CPU and the GPU
- Invalid: data is in the GPU
(State diagram: states I, M, and S linked by Read, Write, Invalidate, and Flush transitions; a sketch follows below.)
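Note: a toy encoding of the three states from the CPU's point of view (illustrative only; the exact edges of the slide's state diagram are not fully recoverable from the transcript, so the transitions below follow the state descriptions above):

typedef enum { STATE_INVALID, STATE_SHARED, STATE_MODIFIED } block_state;

/* The CPU reads data that lives on the GPU: fetch it first, then both copies match. */
static block_state on_cpu_read(block_state s)
{
    return (s == STATE_INVALID) ? STATE_SHARED : s;
}

/* The CPU writes: its copy is now the only up-to-date one. */
static block_state on_cpu_write(block_state s)
{
    (void)s;
    return STATE_MODIFIED;
}

/* Flush, e.g. at an accelerator call: modified data is copied to the GPU. */
static block_state on_flush(block_state s)
{
    return (s == STATE_MODIFIED) ? STATE_SHARED : s;
}

/* Invalidate, e.g. when the GPU is about to write: the CPU copy becomes stale. */
static block_state on_invalidate(block_state s)
{
    (void)s;
    return STATE_INVALID;
}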

GMAC Optimizations
- Global Memory for Accelerators: user-level ADSM run-time
- Optimized library calls: memset, memcpy, fread, fwrite, MPI_Send, MPI_Recv
- Double buffering of data transfers: transfer to the GPU while reading from disk

GMAC in Code (I)

int main(int argc, char *argv[])
{
    FILE *fp;
    struct stat file_stat;
    gmacError_t gmac_ret;
    void *buffer;
    timestamp_t start_time, end_time;
    float bw;

    if(argc < LAST_PARAM) FATAL("Bad argument count");

    if((fp = fopen(argv[FILE_NAME], "r")) == NULL)  /* fopen returns NULL on failure */
        FATAL("Unable to open %s", argv[FILE_NAME]);
    if(fstat(fileno(fp), &file_stat) < 0)
        FATAL("Unable to read meta data for %s", argv[FILE_NAME]);

GMAC in Code (II)

    gmac_ret = gmacMalloc(&buffer, file_stat.st_size);
    if(gmac_ret != gmacSuccess) FATAL("Unable to allocate memory");

    start_time = get_timestamp();
    if(fread(buffer, 1, file_stat.st_size, fp) < (size_t)file_stat.st_size)
        FATAL("Unable to read data from %s", argv[FILE_NAME]);
    gmac_ret = gmacThreadSynchronize();
    if(gmac_ret != gmacSuccess) FATAL("Unable to wait for device");
    end_time = get_timestamp();

    bw = 1.0f * file_stat.st_size / (end_time - start_time);
    fprintf(stdout, "%ld bytes in %f msec : %f MBps\n",
            (long)file_stat.st_size, 1e-3f * (end_time - start_time), bw);

    gmac_ret = gmacFree(buffer);
    if(gmac_ret != gmacSuccess) FATAL("Unable to free device memory");
    fclose(fp);
    return 0;
}

Performance of GMAC
(Performance chart not preserved in the transcript.)

Any More Questions?