CPU & GPU Parallelization of Scrabble Word Searching
Jonathan Wheeler, Yifan Zhou



Scrabble Word Searching!

Scrabble Word Searching – Problem Overview
Given the current board and the tiles in hand, find the highest-scoring word(s) at the highest-scoring position(s).
  Requires searching 173k-268k words, depending on the lexicon.
Fundamentally a brute-force problem. Why?
  Finding adjacent words requires only one linear search.
  Finding intersecting words greatly increases the search complexity: ~26x more possibilities must be considered per intersecting word!
  Players may also have blank tiles, which act as wildcards: ~26x more valid results per blank tile!
(Continued on next slide…)

Scrabble Word Searching – Problem Overview (cont'd)
We don't know what specific words we're looking for. We only know:
  Letters a word may not have.
  Letters a word may have.
  Letters a word must have.
  Letters may be in any order.
Result: the search is linear and brute force in nature. Small performance tricks help, but this remains fundamentally true. There are no pruning heuristics (unlike, say, chess).

Why this problem?
Abundant parallelization opportunities within:
  Results are independent at the lowest levels.
Difficult to parallelize:
  Low-level results must be aggregated into higher-level results, which requires further synchronization.
  Duplicate results, arising from (1) algorithmic nuances and (2) result pollution, must be discarded.
Linear search problems are abundant in the real world and are usually highly parallelizable; what was once infeasible may now be feasible with multi-processing.

Overview of the Scrabble word searching code
for (each of three searching algorithm sections)     // adjacent words, intersecting words, and multiply intersecting words
    for (each column)
        for (each row)
            find(all potential words)                // core of the algorithm; may be done by the CPU, the GPU, or both simultaneously (coming slide)
            Critical section: aggregate the potential word list.
Important milestone: the set of possibilities is now greatly reduced.
(Continued on next slide…)

Overview of the Scrabble word searching code (cont'd)
for (each potential word found)                      // determine whether it is not just valid, but insertable
    for (each position of the word)
        Try inserting it (performed by the CPU); a word might be insertable in multiple positions.
    for (each insertable word)
        Cross-check the "inadvertent" perpendicular words it creates.
        Important milestone: the word has passed all tests.
        Perform word value computations.
        Critical section: add the word to the aggregate list of valid found words, discarding the previously mentioned duplicates.

Focus of our research
Hypotheses:
  Parallelizing at the high level is simple (OpenMP) and gives near-linear performance improvements.
  Parallelizing at the low level is complex (CUDA) but offers potentially far greater performance improvements.
Question: why not do both?

First step – Restructuring of code
Minimal restructuring at the high level; the code was designed to handle one run per execution.
When an invalid word is played, the word effectively becomes part of the lexicon. Why? Cross-checking of potential words: the played word must be dynamically added to the lexicon, so future queries are contaminated by past queries.

CPU Parallelization
Results on an Intel Core i7-920 (4 cores, 2.66 GHz). [speedup chart not reproduced in the transcript]

Migration from OpenMP to CUDA
Three core searching functions find words containing:
  Core 1: input tiles only [ 1%]
  Core 2: input tiles and a letter on the board [17%]
  Core 3: input tiles and multiple letters on the board [82%]
C++ strings cannot easily be accessed by the GPU:
  typedef struct { char word[32]; } word_simple;
Due to the SIMT architecture, branching is inefficient in a GPU kernel:
  Reduce branching in the core searching functions by counting the occurrences of a-z (and '_') in the tiles in advance.
  Sort the words in a lexicon by ascending number of letters, so that all threads in a warp have similar loads.
Locking is not easily handled in CUDA:
  Use a bit vector (0 = rejected, 1 = accepted); with no contention, no lock is needed.
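As a hedged illustration of the string conversion described above: the word_simple struct is taken from the slide, while convertLexicon() is a hypothetical helper, not the authors' code.

    // Sketch: copy std::string words into fixed-size structs the GPU can index directly.
    #include <cstring>
    #include <string>
    #include <vector>

    typedef struct { char word[32]; } word_simple;

    std::vector<word_simple> convertLexicon(const std::vector<std::string>& lexicon) {
        std::vector<word_simple> out(lexicon.size());
        for (size_t i = 0; i < lexicon.size(); ++i) {
            std::memset(out[i].word, 0, sizeof(out[i].word));
            std::strncpy(out[i].word, lexicon[i].c_str(), sizeof(out[i].word) - 1);
        }
        return out;   // contiguous array; can be copied to the GPU in a single cudaMemcpy
    }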

CUDA Parallelization --- First Attempt
Initialization (once per game):
  Allocate global and constant memory on the GPU.
  Convert C++ strings to static storage.
  Copy the converted words to GPU memory.
Searching (4 stages):
  1. Preparation: copy the tile counters (and wildcard string) into GPU constant memory.
  2. Kernel launch: convert the CPU searching functions into GPU functions
       __global__ void findWords(…)
       __device__ bool hasTiles_GPU(…)
       __device__ bool wildcardMatch_GPU(…)
     and replace the OpenMP loop with a GPU kernel call:
       findWords<<<…>>>(…);
  3. Copy the bit vector back: cudaMemcpy(…, cudaMemcpyDeviceToHost);
  4. Generate the resultant vector.
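A minimal sketch of this first-attempt structure, assuming one thread per lexicon word writing into an accept/reject vector. The kernel body, the checkWord stand-in, and the launch configuration are placeholders rather than the authors' actual hasTiles_GPU / wildcardMatch_GPU code.

    #include <cuda_runtime.h>

    typedef struct { char word[32]; } word_simple;   // as on the earlier slide

    // Stand-in for the real hasTiles_GPU / wildcardMatch_GPU checks.
    __device__ bool checkWord(const word_simple& w) { return w.word[0] != '\0'; }

    // One thread per lexicon word; each thread writes 0 (rejected) or 1 (accepted).
    __global__ void findWords(const word_simple* lexicon, int numWords, unsigned char* accepted) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numWords)
            accepted[i] = checkWord(lexicon[i]) ? 1 : 0;
    }

    // Launch the kernel and copy the bit vector back (synchronous first attempt).
    void search(const word_simple* d_lexicon, int numWords,
                unsigned char* d_accepted, unsigned char* h_accepted) {
        int threads = 256;
        int blocks  = (numWords + threads - 1) / threads;
        findWords<<<blocks, threads>>>(d_lexicon, numWords, d_accepted);
        cudaMemcpy(h_accepted, d_accepted, numWords, cudaMemcpyDeviceToHost);
        // h_accepted is then scanned on the CPU to generate the result vector.
    }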

CUDA Parallelization --- Second Thoughts
Pinned memory allocation:
  Accelerates memory transfer between CPU and GPU.
  Synchronous copies become more efficient, and asynchronous copies are supported.
Heterogeneous GPU/CPU parallelization:
  Assign the major part of the data set to the GPU (asynchronous kernel launch).
  Assign the minor part of the data set to the CPU (the original OpenMP parallelization).
  Reduces the cost of the GPU kernel, the memory transfer, and the result vector generation.
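A hedged sketch of the pinned-memory idea: cudaMallocHost returns page-locked host memory, which speeds up transfers and lets cudaMemcpyAsync actually overlap with other work. The buffer names and helper functions are illustrative, not from the original code.

    #include <cuda_runtime.h>

    // Allocate the accept/reject vector in pinned (page-locked) host memory plus its device mirror.
    void allocateResultBuffers(int numWords, unsigned char** h_accepted, unsigned char** d_accepted) {
        cudaMallocHost((void**)h_accepted, numWords);   // pinned host buffer: faster copies, async-capable
        cudaMalloc((void**)d_accepted, numWords);       // device buffer written by the kernel
    }

    void freeResultBuffers(unsigned char* h_accepted, unsigned char* d_accepted) {
        cudaFreeHost(h_accepted);
        cudaFree(d_accepted);
    }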

Hide Latency
Cost in four stages (Core 2, repeated 10,000 times):
  Preparation ~ 0.1 s    Kernel launch ~ 2.6 s    Memory copy back ~ 0.9 s    Post generation ~ 0.7 s
The latency of cudaMemcpy is comparable to the kernel itself. Hide it with asynchronous parallelization between GPU and CPU:
  findWords<<<…>>>(…);     // returns immediately
  cudaMemcpyAsync(…);      // returns immediately
  … CPU operations (OpenMP loop on the minor part of the data set) …
  cudaThreadSynchronize();
After asynchronous parallelization (about a 30% overall reduction):
  Preparation ~ 0.1 s    Kernel launch + memory copy back ~ 2.4 s    Post generation ~ 0.5 s
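A sketch of the overlap pattern above. The GPU/CPU work split, checkWordCPU, and the buffer names are placeholders, and findWords is the kernel sketched earlier; cudaThreadSynchronize() follows the slide, though newer CUDA releases prefer cudaDeviceSynchronize().

    // The GPU handles the major part of the lexicon, the CPU (OpenMP) the minor part;
    // the kernel and the async copy overlap with the CPU loop.
    void searchOverlapped(const word_simple* d_lexicon, const word_simple* h_lexicon,
                          int numWords, int gpuWords,
                          unsigned char* d_accepted, unsigned char* h_accepted /* pinned */) {
        int threads = 256;
        int blocks  = (gpuWords + threads - 1) / threads;

        findWords<<<blocks, threads>>>(d_lexicon, gpuWords, d_accepted);   // returns immediately
        cudaMemcpyAsync(h_accepted, d_accepted, gpuWords,
                        cudaMemcpyDeviceToHost);                           // returns immediately

        #pragma omp parallel for                                           // CPU works on the minor part meanwhile
        for (int i = gpuWords; i < numWords; ++i)
            h_accepted[i] = checkWordCPU(h_lexicon[i]) ? 1 : 0;

        cudaThreadSynchronize();   // wait for the kernel and the async copy to finish
    }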

Minimize Setup Cost
Core 3 preparation cost (10,000 times) ~ 0.2 s:
  Transfer the tile counters.
  Transfer the wildcard string.
cudaMemcpyToSymbol is the major cost of preparation, so group the constant variables together into a struct:
  typedef struct {
      int count_GPU[7];            // = 28 chars
      char wildCards_GPU[32];
  } grouped;
Declare a single __constant__ grouped variable: one cudaMemcpyToSymbol per preparation, no overhead in combining the variables, much faster.
With the grouped variable, preparation cost ~ 0.1 s (~50% reduction).
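A short sketch of the single-transfer setup: grouped is the struct from the slide, while c_params and uploadSearchParams are illustrative names.

    // One __constant__ instance of the grouped struct, filled with a single
    // cudaMemcpyToSymbol call per preparation step instead of one call per variable.
    __constant__ grouped c_params;

    void uploadSearchParams(const grouped& h_params) {
        cudaMemcpyToSymbol(c_params, &h_params, sizeof(grouped));
    }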

Minimize CUDA Kernel Cost
Kernel + memcpy cost (10,000 times): Core 2 ~ 2.4 s, Core 3 ~ 5.9 s.
Word finding:
  --ptxas-options=-v shows GPU register and memory utilization.
  Per-thread tile counters end up in local GPU memory (off-chip and not cached).
  Use on-chip shared memory (__shared__) for fast access.
  Hardcode the assignment of counters as 7 integers instead of 28 chars.
Wildcard matching:
  Avoid nested loops and multiple conditional statements.
  Use a much simplified algorithm specially designed for patterns of the form *[pattern]*.
After optimization: Core 2 ~ 1.5 s, Core 3 ~ 3.7 s (~40% reduction).
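A hedged sketch of keeping each thread's working tile counters in on-chip shared memory. The kernel shape is a placeholder: c_params is the constant struct from the previous sketch, checkWordShared stands in for the real per-word check, and the block size is illustrative.

    #define THREADS_PER_BLOCK 128   // illustrative block size

    __global__ void findWordsShared(const word_simple* lexicon, int numWords,
                                    unsigned char* accepted) {
        // Each thread keeps its own copy of the tile counters in fast on-chip
        // shared memory instead of off-chip, uncached local memory.
        __shared__ int tileCounts[THREADS_PER_BLOCK][7];

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= numWords) return;

        for (int k = 0; k < 7; ++k)                       // 7 ints = 28 bytes of counters
            tileCounts[threadIdx.x][k] = c_params.count_GPU[k];

        accepted[i] = checkWordShared(lexicon[i], tileCounts[threadIdx.x]) ? 1 : 0;
    }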

Minimize Post-Generation Cost
Post-generation cost (10,000 times): Core 2 ~ 0.5 s, Core 3 ~ 0.58 s.
For the bit vector returned from the GPU, use multiple CPU threads to generate the result vector.
Locking, or a per-thread local vector plus a critical section? It depends on the amount of contention:
  For low contention (Core 3), use locking.
  For high contention, store each thread's results in a local vector and gather them in a critical section.
After proper parallelization: Core 2 ~ 0.36 s, Core 3 ~ 0.38 s (~30% reduction).
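A sketch of the local-vector variant for the high-contention case; gatherResults and the parameter names are illustrative, not the authors' code.

    #include <vector>

    // Each thread collects its accepted words in a private vector, then appends
    // them to the shared result vector once, inside a critical section.
    std::vector<const word_simple*> gatherResults(const word_simple* h_lexicon,
                                                  const unsigned char* h_accepted,
                                                  int numWords) {
        std::vector<const word_simple*> results;
        #pragma omp parallel
        {
            std::vector<const word_simple*> local;        // per-thread, no contention
            #pragma omp for nowait
            for (int i = 0; i < numWords; ++i)
                if (h_accepted[i])                         // bit vector copied back from the GPU
                    local.push_back(&h_lexicon[i]);
            #pragma omp critical                           // one gather per thread
            results.insert(results.end(), local.begin(), local.end());
        }
        return results;
    }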

CPU vs GPU on the Pup cluster
Results on an Intel Core 2 Duo E6850 (2 cores, 3 GHz) with an NVIDIA GTX 260. [performance chart not reproduced in the transcript]

GPU Parallelization on the Pup cluster
Results on an Intel Core 2 Duo E6850 (2 cores, 3 GHz) with an NVIDIA GTX 260. [performance chart not reproduced in the transcript]

Conclusion
The characteristics of Scrabble word searching make efficient GPU parallelization hard:
  Only integer (or char) operations, no floating-point operations.
  A lot of branching and little coalesced memory access.
  High communication-to-computation ratio.
Design of the CUDA parallelization:
  Asynchronous GPU/CPU parallelization can hide memory copy latency.
  One large transfer is more efficient than many small transfers.
  On-chip shared memory is much faster than off-chip local memory.
  Choosing between locking and per-thread local variables depends on the amount of contention.
Future work:
  Further hide latency using multiple streams on a single GPU.
  Multi-GPU parallelization.
  GPU parallelization at other levels.
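For the multi-stream future work item, a minimal sketch of how the lexicon could be split across CUDA streams so one chunk's kernel overlaps another chunk's copy; this is entirely illustrative and not part of the presented implementation (it reuses the hypothetical findWords kernel and assumes h_accepted is pinned).

    #include <algorithm>
    #include <cuda_runtime.h>

    void searchWithStreams(const word_simple* d_lexicon, int numWords,
                           unsigned char* d_accepted, unsigned char* h_accepted /* pinned */) {
        const int numStreams = 2;                        // illustrative
        cudaStream_t streams[numStreams];
        for (int s = 0; s < numStreams; ++s) cudaStreamCreate(&streams[s]);

        int chunk = (numWords + numStreams - 1) / numStreams;
        for (int s = 0; s < numStreams; ++s) {
            int offset = s * chunk;
            int count  = std::min(chunk, numWords - offset);
            if (count <= 0) break;
            int blocks = (count + 255) / 256;
            // Kernel and copy for this chunk run in their own stream.
            findWords<<<blocks, 256, 0, streams[s]>>>(d_lexicon + offset, count, d_accepted + offset);
            cudaMemcpyAsync(h_accepted + offset, d_accepted + offset, count,
                            cudaMemcpyDeviceToHost, streams[s]);
        }
        for (int s = 0; s < numStreams; ++s) cudaStreamSynchronize(streams[s]);
        for (int s = 0; s < numStreams; ++s) cudaStreamDestroy(streams[s]);
    }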