1
Parallel Computing on Graphics Cards Keith Kelley, CS6260
2
Flynn’s Taxonomy
5
SIMD vs MISD
SIMD: one instruction runs on an array of data (at one time)
MISD: different instructions run against a single piece of data
6
SIMD: Vector Processing
Vector processor, aka array processor
Defined by an instruction set that includes operations on vectors
Vector computer (1) – processor array: a vector computer built from identical, synchronized processors that perform the same operation on different data
Vector computer (2) – pipelined vector processor: streams vectors from memory to the CPU
7
SIMD: Stream Processing
The data is the “stream”
Operations are applied to each element in the stream
Uses many stream processors: 100+ per GPU
8
SIMD: Parallel Stream Pseudocode

// conventional serial loop
for (int i = 0; i < 100 * 4; i++)
    result[i] = source0[i] + source1[i];

// vs. the equivalent stream-processing pseudocode
streamElements 100
streamElementFormat 4 numbers
elementKernel "@arg0 + @arg1"
result = kernel(source0, source1)
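The same element-wise add maps directly onto a GPU. Here is a minimal CUDA sketch (my own illustration, not from the slides; the kernel and variable names are hypothetical, and error checking is omitted):

#include <cuda_runtime.h>

// each thread adds one element of the stream
__global__ void add(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard against the extra threads in the last block
        out[i] = a[i] + b[i];
}

int main(void)
{
    const int n = 100 * 4;           // same element count as the pseudocode above
    const size_t bytes = n * sizeof(float);
    float h_a[400], h_b[400], h_out[400];
    for (int i = 0; i < n; i++) { h_a[i] = (float)i; h_b[i] = 2.0f * i; }

    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_out, n);   // one thread per element
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return 0;
}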
9
SIMD: GPGPU
General-Purpose computing on Graphics Processing Units
10
Advantages of GPGPU
Space (rack units per ? TFlops)
Power consumption (GFlops per watt)
Cost ($ per TFlop)
Speed (speedup)
Availability (of hardware, not uptime)
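To make the power metric concrete: using the figures quoted later in these slides (roughly 1 TFlop per Tesla card at about 150 W), that works out to 1,000 GFlops / 150 W ≈ 6.7 GFlops per watt.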
11
Suitable Problems
Financial analysis
Weather research
Molecular modeling
Computational fluid dynamics
Life sciences
Signal processing
Imaging
Intrusion detection via packet inspection
Using a state machine – anything?
12
General-Purpose GPU Brands
Nvidia Tesla
ATI Firestream
Intel Larrabee (sort of, and not yet)
13
Intel Larrabee
A many-core CPU for graphics processing (and general parallel computing)
Based on x86 instruction pipelines
Mostly scalar rather than vector, though it has some “vector processing units”
Not out yet
14
Nvidia Tesla
A graphics card without a graphics port
Lots of stream processors on one card
Tesla 10 contains 1.4 billion transistors and 240 cores (30 thread processor arrays, each containing 8 thread processors)
The first-generation card was basically an 8800GTX without a graphics port
One C1060 board has one 10-series GPU with 4 GB memory
A machine with 4 boards reaches 4 TFlops
15
Tesla: Thread Processor
Another term for stream processor (NVIDIA’s terminology)
TP (thread processor), TPA (thread processor array)
Each TP is a core, but smaller than (and different from) a CPU core
16
Tesla 10: Performance
Four cards in one chassis maximum
One chassis maximum
8 TPs × 30 TPAs = 240 stream processors and 4 GB RAM per card
960 stream processors maximum
1 TFlop per card
4 TFlops per deskside supercomputer
17
Tesla: Availability
Cost per card: in the $1,500 range
Cost per computer (4 cards): $6k–10k
Compatible with standard PC slots
Around 150 watts per TFlop (per card)
Nvidia has shipped over 100 million CUDA-capable cards
Vendor computers start at ~$8,000, but you can also build your own
In the same amount of rack space, one reviewer built a Xeon-based machine that cost 4 times as much but ran 16 times slower
18
Tesla: Languages
CUDA – Compute Unified Device Architecture
OpenCL – coming in the second quarter of 2009
Both use threads, which many consider difficult to program
19
Tesla: CUDA Sample Code

texture<float, 2> tex;   // texture reference, visible to the kernel below

__global__ void kernel(float *odata, int height, int width)
{
    // each thread computes the coordinates of the one pixel it handles
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    float c = tex2D(tex, x, y);    // read the element through the texture
    odata[y * width + x] = c;      // write it back to global memory
}

// --- host-side code ---
cudaArray *cu_array;

// Allocate array
cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaMallocArray(&cu_array, &desc, width, height);

// Copy image data to array
cudaMemcpyToArray(cu_array, 0, 0, image, width * height * sizeof(float), cudaMemcpyHostToDevice);

// Bind the array to the texture
cudaBindTextureToArray(tex, cu_array);

// Run kernel
dim3 blockDim(16, 16, 1);
dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
kernel<<<gridDim, blockDim>>>(d_odata, height, width);

cudaUnbindTexture(tex);
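A note on the launch (my explanation, not from the slide): the <<<gridDim, blockDim>>> configuration splits the image into 16 × 16 thread blocks, and each thread derives its own (x, y) pixel coordinate from its block and thread indices, which is exactly the stream-processing model from the pseudocode slide earlier.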
20
Tesla: Future Plans
GPU clusters
Fortran
C++
OpenCL support (very soon)
21
AMD/ATI Firestream
The Firestream 9250 (2nd-generation) stream processor occupies one PCIe 2.0 x16 slot
Consumes less than 150 watts
Up to eight gigaflops per watt
The 9170 is the 1st-generation GPU at 500 GFlops with 2 GB RAM
The 9250 is 1 TFlop and the 9270 is 1.2 TFlops
The 9250 has 1 GB memory, the 9270 2 GB
The 9270 retails at $1,499
8 cards in one 4U chassis reach 9.6 TFlops
22
Firestream: OpenCL
Open standard
Started by Apple
Maintained by the Khronos Group (like OpenGL and OpenAL)
Adopted by AMD, soon Nvidia, and others; also backed by Intel
Developed primarily by teams at AMD, Intel, and Nvidia
23
Firestream: OpenCL Example

// create a compute context with a GPU device
context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// create a command queue (device obtained earlier with clGetDeviceIDs)
queue = clCreateCommandQueue(context, device, 0, NULL);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float) * 2 * num_entries, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float) * 2 * num_entries, NULL, NULL);

// create and build the compute program
program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024", NULL);

// set the work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;

// set the kernel arguments (two buffers plus two local-memory scratch areas)
clSetKernelArg(kernel, 0, sizeof(cl_mem), &memobjs[0]);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &memobjs[1]);
clSetKernelArg(kernel, 2, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);
clSetKernelArg(kernel, 3, sizeof(float) * (local_work_size[0] + 1) * 16, NULL);

// enqueue the kernel for execution
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);
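The fft1D_1024 kernel source itself is not shown on the slide; a declaration matching the four arguments set above might look roughly like this (my reconstruction, so treat the exact parameter types as an assumption):

__kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                         __local float *sMemx, __local float *sMemy)
{
    // ... 1024-point FFT computed cooperatively by the work-group ...
}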
24
Different Speed Measurements
Both measurements are floating point
Single-precision (SP) speeds
Double-precision (DP) speeds
SP speeds tend to be around 2.5 times faster than DP
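By that rule of thumb, a card rated at 1.2 TFlops SP (like the Firestream 9270 above) would deliver roughly 1.2 / 2.5 ≈ 0.48 TFlops DP.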
25
Other Supercomputer Speeds
The fastest supercomputer on last year’s Top 500 list used Opterons, lives at Los Alamos National Laboratory, and measures in at about 1,105 TFlops
There is a Tesla-based heterogeneous cluster in the Top 500: 170 Tesla S1070 1U systems, with 170 TFlops of theoretical peak performance
The last announced fastest vector supercomputer was an NEC SX-9 at 839 TFlops (June 2008)
Last year’s Top 500 list contained only one supercomputer using vector processors
The fastest vector supercomputer in 2006 was 144 TFlops
26
Other SIMD Computers: Languages
SVL
PVM
MPI
Paris
CMMD
NESL
CVL
Many others
27
GPU Cluster
A cluster with a GPU on each node
Uses MPI or a similar clustering API, as in the sketch below
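A minimal sketch of how one node in such a cluster might pair MPI with CUDA (my illustration; the rank-to-GPU assignment and all names are assumptions, not from the slides):

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // bind this MPI rank to one of the node's GPUs
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    cudaSetDevice(rank % num_devices);

    // ... launch CUDA kernels on the local GPU, then combine
    // partial results across nodes, e.g. with MPI_Allreduce ...

    MPI_Finalize();
    return 0;
}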
28
My New “Computer”
NVIDIA GeForce 8800GT with 512 MB, $100 on eBay
110 W
504 GFlops
754 million transistors
Core clock speed: 600 MHz
7 × 16 = 112 stream processors
Stream processor clock speed: 1.5 GHz
Similar to the 8800GTX (Tesla 8)
29
Question
Q: What are two advantages of general-purpose computing on graphics processing units?
A: Power consumption, space, cost, speed (in some cases), and availability of hardware.