
1 Parallel Computing on Graphics Cards Keith Kelley, CS6260

2 Flynn’s Taxonomy  Classifies architectures by their instruction streams and data streams: SISD, SIMD, MISD, MIMD

3 (image-only slide)

4 (image-only slide)

5 SIMD vs MISD  SIMD: one instruction operates on an array of data elements at the same time  MISD: different instructions operate on a single piece of data

6 SIMD: Vector Processing  Vector Processor aka Array Processor  Defined by an instruction set including operations on vectors  Vector Computer (1) Processor Array: a vector computer built from identical, synchronized processors that perform the same operation on different data  Vector Computer (2) Pipelined Vector Processor: streams vectors from memory to the CPU

7 SIMD: Stream Processing  The data is the “stream”  Operations are applied to each element in the stream  Uses a large number of stream processors: 100+ per GPU

8 SIMD: Parallel Stream Pseudocode

Conventional loop:

    for (int i = 0; i < 100 * 4; i++)
        result[i] = source0[i] + source1[i];

vs. the stream version:

    streamElements 100
    streamElementFormat 4 numbers
    elementKernel "@arg0 + @arg1"
    result = kernel(source0, source1)
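
The stream version maps directly onto a GPU. Below is a minimal CUDA sketch of the same element-wise add; it is not from the original slides, and the names addKernel and runAdd are illustrative:

    #include <cuda_runtime.h>

    // Each thread handles one element: one instruction stream applied
    // across many data elements (SIMD).
    __global__ void addKernel(const float* source0, const float* source1,
                              float* result, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            result[i] = source0[i] + source1[i];
    }

    // Host-side launch for the 100 * 4 elements used in the pseudocode.
    // The pointers are assumed to be device allocations (cudaMalloc).
    void runAdd(const float* d_source0, const float* d_source1, float* d_result)
    {
        int n = 100 * 4;
        int threadsPerBlock = 128;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        addKernel<<<blocks, threadsPerBlock>>>(d_source0, d_source1, d_result, n);
        cudaDeviceSynchronize();  // wait for the kernel to finish
    }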

9 SIMD: GPGPU  General-Purpose computing on Graphics Processing Units

10 Advantages of GPGPU  Space (rack units per TFlop)  Power Consumption (GFlops per watt)  Cost ($ per TFlop)  Speed (speedup)  Availability (of the hardware, not uptime)

11 Suitable Problems  Financial analysis  Weather research  Molecular modeling  Computational Fluid Dynamics  Life Sciences  Signal Processing  Imaging  Intrusion Detection via Packet Inspection  Anything that can be expressed as a state machine?

12 General Purpose GPU Brands  Nvidia Tesla  ATI Firestream  Intel Larrabee (sort of, and not yet)

13 Intel Larrabee  A many-core CPU for graphics processing (and general parallel computing)  Based on x86 instruction pipelines  Mostly scalar rather than vector  Has some “vector processing units”  Not out yet

14 Nvidia Tesla  A graphics card without a graphics port  Lots of stream processors on one card  Tesla 10 contains 1.4 billion transistors and 240 cores (30 thread processor arrays, each containing 8 thread processors)  The first-generation card was basically an 8800GTX without a graphics port  One C1060 board has one 10-series GPU with 4 GB memory  A machine with 4 boards reaches 4 TFlops

15 Tesla: Thread Processor  NVIDIA’s term for a stream processor  TP (Thread Processor)  TPA (Thread Processor Array)  Each TP is a core, but smaller than (and different from) a CPU core

16 Tesla 10: Performance  Maximum of four cards in one chassis, one chassis per system  8 TPs * 30 TPAs = 240 stream processors and 4 GB RAM per card  960 stream processors maximum  1 TFlop per card  4 TFlops per deskside supercomputer

17 Tesla: Availability  Cost per card: around $1500  Cost per computer (4 cards): $6k-10k  Compatible with standard PC slots  Around 150 watts per TFlop (per card)  Nvidia has shipped over 100 million CUDA-capable cards  Vendor computers start at ~$8000, but you can also build your own  In the same amount of rack space, one reviewer built a Xeon-based machine that cost four times as much but performed 16 times slower

18 Tesla: Language  CUDA – Compute Unified Device Architecture  OpenCL – second quarter 2009  Both use threads, which many consider difficult

19 Tesla: CUDA Sample Code

    cudaArray* cu_array;
    texture<float, 2> tex;

    // Allocate a 2D CUDA array for the image
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &desc, width, height);

    // Copy image data to the array
    cudaMemcpyToArray(cu_array, 0, 0, image,
                      width*height*sizeof(float), cudaMemcpyHostToDevice);

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array, desc);

    // Run kernel: one thread per pixel, in 16x16 blocks
    dim3 blockDim(16, 16, 1);
    dim3 gridDim(width / blockDim.x, height / blockDim.y, 1);
    kernel<<<gridDim, blockDim>>>(d_odata, height, width);
    cudaUnbindTexture(tex);

    __global__ void kernel(float* odata, int height, int width)
    {
        unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
        unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
        float c = tex2D(tex, x, y);   // read the pixel through the texture cache
        odata[y*width+x] = c;
    }
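
The <<<gridDim, blockDim>>> triple angle brackets are CUDA's kernel launch configuration: the kernel body runs once per thread, across a grid of gridDim blocks with blockDim threads each, which is why each thread derives its own (x, y) pixel coordinate from its block and thread indices. (tex2D is the released-CUDA name for the 2D texture read; some very early CUDA material wrote it as texfetch.)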

20 Tesla: Future Plans  GPU Clusters  Fortran  C++  OpenCL support (very soon)

21 AMD/ATI Firestream  The Firestream 9250 (2nd generation) stream processor occupies one PCIe 2.0 x16 slot  Consumes less than 150 watts  Up to eight gigaflops per watt  The 9170 is the 1st-generation GPU at 500 GFlops with 2 GB RAM  The 9250 is 1 TFlop and the 9270 is 1.2 TFlops  The 9250 has 1 GB memory, the 9270 2 GB  The 9270 retails at $1499  8 cards in one 4U chassis reach 9.6 TFlops
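
A quick sanity check on the efficiency claim, using only the numbers on this slide: 1 TFlop at a 150 W ceiling works out to 1000 / 150 ≈ 6.7 GFlops per watt, so the "up to eight" figure presumably assumes a typical draw somewhat below the 150 W maximum.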

22 Firestream: OpenCL  An open standard  Started by Apple  Maintained by the Khronos Group (like OpenGL and OpenAL)  Adopted by AMD, with Nvidia and others expected to follow soon  Also backed by Intel  Developed primarily by teams at AMD, Intel, and Nvidia

23 Firestream: OpenCL Example

    // create a compute context with GPU device
    context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

    // create a work-queue
    queue = clCreateWorkQueue(context, NULL, NULL, 0);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float)*2*num_entries, srcA);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(float)*2*num_entries, NULL);

    // create the compute program
    program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

    // build the compute program executable
    clBuildProgramExecutable(program, false, NULL, NULL);

    // create the compute kernel
    kernel = clCreateKernel(program, "fft1D_1024");

    // create N-D range object with work-item dimensions
    global_work_size[0] = n;
    local_work_size[0] = 64;
    range = clCreateNDRangeContainer(context, 0, 1,
                                     global_work_size, local_work_size);

    // set the args values
    clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
    clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
    clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
    clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

    // execute kernel
    clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);
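
The calls above follow the pre-1.0 draft API that was circulating before OpenCL shipped (consistent with the earlier slide dating OpenCL to the second quarter of 2009). For reference, a rough mapping to the entry points as they appeared in the released OpenCL 1.0 specification; argument lists are indicative, not complete:

    clCreateWorkQueue(ctx, ...)        ->  clCreateCommandQueue(ctx, device, 0, &err)
    clCreateProgramFromSource(...)     ->  clCreateProgramWithSource(ctx, 1, &src, NULL, &err)
    clBuildProgramExecutable(...)      ->  clBuildProgram(prog, 0, NULL, NULL, NULL, NULL)
    clSetKernelArg(k, i, ptr, sz, x)   ->  clSetKernelArg(k, i, sz, ptr)
    clExecuteKernel(q, k, ...)         ->  clEnqueueNDRangeKernel(q, k, 1, NULL, gws, lws, 0, NULL, NULL)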

24 Different Speed Measurements  Both measurements are floating point  Single-precision (SP) speeds  Double-precision (DP) speeds  SP speeds tend to be around 2.5 times faster than DP

25 Other Supercomputer Speeds  The fastest supercomputer on last year’s Top 500 list, at Los Alamos National Laboratory, uses Opterons and measures in at about 1105 TFlops  A Tesla-based heterogeneous cluster is also in the Top 500: 170 Tesla S1070 1U systems, with 170 TFlops of theoretical peak performance  The last announced fastest vector supercomputer was an NEC SX-9 at 839 TFlops (June 2008)  Last year’s Top 500 list contained only one supercomputer using vector processors  The fastest vector supercomputer in 2006 was 144 TFlops

26 Other SIMD Computers: Languages  SVL  PVM  MPI  Paris  CMMD  NESL  CVL  Many others

27 GPU Cluster  A cluster with a GPU on each node  Uses MPI or a similar clustering API (a minimal sketch follows below)
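
A minimal sketch of that pattern, assuming MPI plus CUDA with one process per GPU; the squareKernel name, buffer size, and elided copy/reduce steps are illustrative, not from the slides:

    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    // Illustrative kernel: each thread squares one element of this node's chunk.
    __global__ void squareKernel(float* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= data[i];
    }

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Bind this MPI process to one of the node's GPUs.
        int deviceCount;
        cudaGetDeviceCount(&deviceCount);
        cudaSetDevice(rank % deviceCount);

        // Each rank works on its own chunk of the problem on its own GPU.
        const int n = 1 << 20;
        float* d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        // ... copy this rank's input to d_data with cudaMemcpy ...

        squareKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();

        // Combine per-node partial results across the cluster.
        float local = 0.0f, total = 0.0f;
        // ... reduce d_data into local on the host ...
        MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total = %f\n", total);

        cudaFree(d_data);
        MPI_Finalize();
        return 0;
    }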

28 My New “Computer”  NVIDIA GeForce 8800GT w/512 MB  $100 on eBay  110 W  504 GFlops  754 million transistors  Core clock speed: 600 MHz  7 * 16 = 112 stream processors  Stream processor clock speed: 1.5 GHz  Similar to the 8800GTX (Tesla 8)
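
The quoted 504 GFlops is consistent with the usual peak-rate arithmetic for this GPU generation, assuming 3 floating-point operations per clock per stream processor (a dual-issued multiply-add plus a multiply): 112 SPs * 1.5 GHz * 3 flops/clock = 504 GFlops.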

29 Question Question: What are two advantages of general-purpose computing on graphics processing units? Answer (any two): power consumption, space, cost, speed (in some cases), availability of hardware

