1
CPU & GPU Parallelization of Scrabble Word Searching
Jonathan Wheeler, Yifan Zhou
2
Scrabble Word Searching!
3
Scrabble Word Searching – Problem Overview
Given the current board & tiles at hand, find the highest-scoring word(s) at the highest-scoring position(s).
- Requires searching 173k–268k words, depending on the lexicon.
- Fundamentally a brute-force problem in nature. Why?
    - Finding adjacent words requires only one linear search.
    - Finding intersecting words greatly increases the search complexity: ~26x more possibilities must be considered, per intersecting word!
    - Players may also have blank tiles, effectively a "wildcard": ~26x more valid results, per blank tile!
(Continued on next slide…)
4
Scrabble Word Searching – Problem Overview (cont'd)
We don't know what specific words we're looking for. We only know:
- Letters a word may not have.
- Letters a word may have.
- Letters a word must have.
- Letters may be in any order.
Result: the search is linear and brute force in nature. Small performance tricks help, but this remains fundamentally true; no heuristics apply (unlike, say, chess). The rack-matching part of these constraints reduces to a letter-count test, as sketched below.
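As a concrete illustration, here is a minimal C++ sketch of that letter-count test. The function name and conventions (lowercase input, '_' for a blank tile) are assumptions for illustration, not the authors' actual code:

    #include <string>

    // Can `word` be built from the rack `tiles`? '_' is a blank tile
    // (wildcard) that can stand in for any letter.
    bool canBuildFromTiles(const std::string& word, const std::string& tiles) {
        int counts[26] = {0};
        int blanks = 0;
        for (char t : tiles) {
            if (t == '_') ++blanks;            // blank: matches any letter
            else          ++counts[t - 'a'];
        }
        for (char c : word) {
            if (counts[c - 'a'] > 0) --counts[c - 'a'];
            else if (blanks > 0)     --blanks; // spend a blank
            else                     return false;
        }
        return true;
    }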
5
Why this problem?
- Abundant parallelizing opportunities within: independent results at the lowest levels.
- Difficult to parallelize:
    - Results must be aggregated into higher-level results, requiring further synchronization.
    - Duplicate results must be discarded, due to (1) algorithmic nuances and (2) result pollution.
- Linear search problems are abundant in the real world and usually highly parallelizable: the once infeasible may now be feasible with multi-processing.
6
Overview of Scrabble word searching code
for (three different searching algorithm sections)
    // adjacent words, intersecting words, and multiply intersecting words
    for (each column)
        for (each row)
            find(all potential words)
                // core of the algorithm; may be done by the CPU, or the GPU
                // … or both simultaneously! (Coming slide.)
            Critical section: aggregate the potential word list.
Important milestone: possibilities greatly reduced.
(Continued on next slide…)
7
Overview of Scrabble word searching code (cont'd)
for (each potential word found)
    // we must determine that it is not just valid, but insertable
    for (each position of the word)
        Try inserting it; it might be insertable in multiple positions.
        (Performed by the CPU.)
for (each insertable word)
    Cross-check "inadvertent" perpendicular words.
    Important milestone: the word has passed all tests.
    Perform word-value computations.
    Critical section: add the word to the aggregate list of valid found words, discarding the previously mentioned duplicates.
The overall loop shape is sketched below.
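For orientation, a minimal OpenMP sketch of the high-level loop from the last two slides. Only the loop/critical-section structure comes from the slides; findPotentialWords and boardSize are hypothetical names:

    #include <string>
    #include <vector>

    // Hypothetical: returns the candidate words for one board position.
    std::vector<std::string> findPotentialWords(int col, int row);

    void searchBoard(int boardSize, std::vector<std::string>& potential) {
        #pragma omp parallel for collapse(2)
        for (int col = 0; col < boardSize; ++col) {
            for (int row = 0; row < boardSize; ++row) {
                std::vector<std::string> local = findPotentialWords(col, row);
                #pragma omp critical    // aggregate the potential word list
                potential.insert(potential.end(), local.begin(), local.end());
            }
        }
    }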
8
Focus of our research
Hypotheses:
- Parallelizing at the high level: simple (OpenMP); near-linear performance improvements.
- Parallelizing at the low level: complex (CUDA); potentially significantly greater performance improvements.
Question: why not do both?
9
First step – Restructuring of code
Minimal restructuring needed at the high level; the code was designed to handle one run per execution.
One subtlety: when an invalid word is played, that word effectively becomes part of the lexicon. Why? Cross-checking of potential words! Such words must be dynamically added to the lexicon, so future queries are contaminated by past queries.
10
CPU Parallelization
[Performance chart] Intel i7-920 (4 cores / 8 threads @ 2.66 GHz)
11
Migration from OpenMP to CUDA
Three core searching functions find words containing …
- Core 1: the input tiles [1%]
- Core 2: the input tiles and a letter on the board [17%]
- Core 3: the input tiles and multiple letters on the board [82%]
Adaptations for the GPU:
- C++ strings cannot easily be accessed by the GPU:
    typedef struct { char word[32]; } word_simple;
- Due to the SIMT architecture, branching is inefficient in a GPU kernel. To reduce branching in the core searching functions:
    - Count the occurrences of a–z (and _) in the tiles in advance.
    - Sort the words in the lexicon by ascending number of letters, so all threads in a warp have similar loads.
- Locking is not easily handled in CUDA: use a bit vector (0 = rejected, 1 = accepted). Each thread owns its own slot, so there is no contention and no lock is needed.
A conversion sketch follows.
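A minimal sketch of that conversion step, assuming the word_simple struct above; the function name and use of std::sort are illustrative, not the authors' exact code:

    #include <algorithm>
    #include <cstring>
    #include <string>
    #include <vector>

    typedef struct { char word[32]; } word_simple;  // GPU-friendly, fixed width

    // Sort the lexicon by ascending word length (so threads in a warp see
    // similar work), then convert each std::string into a word_simple.
    std::vector<word_simple> prepareLexicon(std::vector<std::string> lexicon) {
        std::sort(lexicon.begin(), lexicon.end(),
                  [](const std::string& a, const std::string& b) {
                      return a.size() < b.size();
                  });
        std::vector<word_simple> out(lexicon.size());   // zero-initialized
        for (size_t i = 0; i < lexicon.size(); ++i)
            std::strncpy(out[i].word, lexicon[i].c_str(),
                         sizeof(out[i].word) - 1);
        return out;
    }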
12
CUDA Parallelization – First Attempt
Initialization (once per game):
- Allocate global and constant memory on the GPU.
- Convert the C++ strings to static storage.
- Copy the converted words to GPU memory.
Searching (4 stages):
1. Preparation: copy the tile counters (and wildcard string) into GPU constant memory.
2. Kernel launch: convert the CPU searching functions into GPU functions
        __global__ void findWords(…)
        __device__ bool hasTiles_GPU(…)
        __device__ bool wildcardMatch_GPU(…)
   and replace the OpenMP loop with a GPU kernel call:
        findWords<<<blocks, threads>>>(…);
3. Copy the bit vector back: cudaMemcpy(…, cudaMemcpyDeviceToHost);
4. Generate the resultant vector.
A kernel-shape sketch follows.
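A sketch of what that kernel shape might look like. The signatures and the simplification of the bit vector to one byte per word are assumptions reconstructed from the slide, not the authors' exact code:

    typedef struct { char word[32]; } word_simple;

    __constant__ int c_tileCounts[27];   // counts of 'a'..'z' plus '_' (blank)

    __device__ bool hasTiles_GPU(const char* w, const int* counts) {
        int c[27];
        for (int k = 0; k < 27; ++k) c[k] = counts[k];  // private mutable copy
        for (; *w; ++w) {
            int idx = *w - 'a';
            if (c[idx] > 0)     --c[idx];
            else if (c[26] > 0) --c[26];                // spend a blank
            else                return false;
        }
        return true;
    }

    // One thread tests one lexicon word; each thread writes only its own
    // slot of `accepted`, so no locking is needed.
    __global__ void findWords(const word_simple* words, int numWords,
                              unsigned char* accepted) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < numWords)
            accepted[i] = hasTiles_GPU(words[i].word, c_tileCounts) ? 1 : 0;
    }

    // Host side (stages 2 and 3):
    //   int threads = 256, blocks = (numWords + threads - 1) / threads;
    //   findWords<<<blocks, threads>>>(d_words, numWords, d_accepted);
    //   cudaMemcpy(h_accepted, d_accepted, numWords, cudaMemcpyDeviceToHost);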
13
CUDA Parallelization – Second Thoughts
Pinned memory allocation:
- Accelerates memory transfer between CPU and GPU.
- Synchronous copies become more efficient, and asynchronous copies become possible.
Heterogeneous GPU/CPU parallelization:
- Assign the major part of the data set to the GPU (asynchronous kernel launch).
- Assign the minor part of the data set to the CPU (the original OpenMP parallelization).
- Reduces the cost of the GPU kernel, the memory transfer, and the result-vector generation.
A pinned-allocation sketch follows.
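A minimal sketch of the pinned allocation; the buffer name and size are illustrative:

    // Pinned (page-locked) host memory speeds up synchronous copies and is
    // required for cudaMemcpyAsync to be truly asynchronous.
    unsigned char* h_accepted = 0;
    cudaHostAlloc((void**)&h_accepted, numWords, cudaHostAllocDefault);

    // ... all device-to-host result copies use h_accepted ...

    cudaFreeHost(h_accepted);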
14
Hide Latency
Cost in four stages (Core 2, repeated 10,000 times):
- Preparation: ~0.1s
- Kernel launch: ~2.6s
- Memory copy back: ~0.9s
- Post-generation: ~0.7s
The latency of cudaMemcpy is comparable to the kernel itself. Hide it with asynchronous parallelization between GPU and CPU:
    findWords<<<blocks, threads>>>(…);  // returns immediately
    cudaMemcpyAsync(…);                 // returns immediately
    … CPU operations (OpenMP loop on the minor part of the data set) …
    cudaThreadSynchronize();
After asynchronous parallelization (~30% improvement):
- Preparation: ~0.1s
- Kernel launch + memory copy back: ~2.4s
- Post-generation: ~0.5s
A fuller overlap sketch follows.
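Filling in the skeleton above: the split point gpuCount, hasTiles_CPU, and the buffer names are assumptions for illustration. The kernel launch and asynchronous copy return at once; the CPU processes its minor share of the lexicon while they run:

    findWords<<<blocks, threads>>>(d_words, gpuCount, d_accepted);
    cudaMemcpyAsync(h_accepted, d_accepted, gpuCount,
                    cudaMemcpyDeviceToHost);   // requires pinned h_accepted

    #pragma omp parallel for                   // CPU: minor part of the data
    for (int i = gpuCount; i < numWords; ++i)
        h_accepted[i] = hasTiles_CPU(words[i].word) ? 1 : 0;

    cudaThreadSynchronize();                   // wait for kernel + copy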
15
Minimize Setup Cost
Core 3 preparation cost (10,000 times): ~0.2s
- Transfer the tile counters.
- Transfer the wildcard string.
cudaMemcpyToSymbol is the major cost of preparation. Group the constant variables together into a struct:
    typedef struct {
        int  count_GPU[7];        // 7 ints = 28 chars
        char wildCards_GPU[32];
    } grouped;
Declare a single __constant__ grouped variable: one cudaMemcpyToSymbol per preparation. There is no overhead in combining the variables, and it is much faster.
Using the grouped variable, preparation cost: ~0.1s (~50% improvement). See the sketch below.
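A sketch of the grouped transfer, assuming the struct above; the symbol and function names are illustrative:

    typedef struct {
        int  count_GPU[7];        // 7 ints = 28 packed chars
        char wildCards_GPU[32];
    } grouped;

    __constant__ grouped c_grouped;

    void prepare(const grouped* h_grouped) {
        // One transfer replaces two smaller cudaMemcpyToSymbol calls.
        cudaMemcpyToSymbol(c_grouped, h_grouped, sizeof(grouped));
    }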
16
Minimize CUDA Kernel Cost
Kernel + memcpy cost (10,000 times): Core 2 ~2.4s, Core 3 ~5.9s
Word finding:
- --ptxas-options=-v shows GPU register and memory utilization.
- Per-thread tile counters land in local GPU memory (off-chip and not cached); use on-chip shared memory (__shared__) for fast access.
- Hardcode the assignment of the counters as 7 integers instead of 28 chars.
Wildcard matching:
- Avoid nested loops and multiple conditional statements.
- Use a much-simplified algorithm, specially designed for *[pattern]*.
After optimization (~40% improvement): Core 2 ~1.5s, Core 3 ~3.7s
A shared-memory sketch follows.
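A sketch of the shared-memory counters, reusing word_simple and c_grouped from the earlier sketches. The block size and memory layout are assumptions; the point is that each thread gets a private on-chip row of counters, copied as 7 ints rather than 28 separate chars:

    #define THREADS 256

    __global__ void findWords(const word_simple* words, int numWords,
                              unsigned char* accepted) {
        __shared__ int counts[THREADS][7];   // 256 * 28 B = 7 KB on-chip
        int t = threadIdx.x;
        for (int k = 0; k < 7; ++k)          // 7 int stores, not 28 char stores
            counts[t][k] = c_grouped.count_GPU[k];
        char* c = (char*)counts[t];          // view this row as 28 chars

        int i = blockIdx.x * blockDim.x + t;
        if (i >= numWords) return;
        bool ok = true;
        for (const char* w = words[i].word; *w && ok; ++w) {
            int idx = *w - 'a';
            if (c[idx] > 0)     --c[idx];
            else if (c[26] > 0) --c[26];     // spend a blank tile
            else                ok = false;
        }
        accepted[i] = ok ? 1 : 0;
    }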
17
Minimize Post-Generation Cost
Post-generation cost (10,000 times): Core 2 ~0.5s, Core 3 ~0.58s
For the bit vector returned from the GPU, use multiple CPU threads to generate the result vector. Locking, or a local vector plus a critical section? It depends on the amount of contention:
- For low contention (Core 3), use locking.
- For high contention, store each thread's results in a local vector and gather them in a critical section.
After proper parallelization (~30% improvement): Core 2 ~0.36s, Core 3 ~0.38s
See the sketch below.
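A sketch of the high-contention strategy (function and buffer names assumed): each OpenMP thread gathers its matches into a thread-local vector, so the critical section runs once per thread rather than once per word:

    #include <vector>

    void generateResults(const unsigned char* h_accepted,
                         const std::vector<word_simple>& words,
                         std::vector<word_simple>& results) {
        #pragma omp parallel
        {
            std::vector<word_simple> local;
            #pragma omp for nowait
            for (int i = 0; i < (int)words.size(); ++i)
                if (h_accepted[i]) local.push_back(words[i]);
            #pragma omp critical      // one merge per thread, not per word
            results.insert(results.end(), local.begin(), local.end());
        }
    }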
18
CPU vs GPU on the Pup cluster
[Performance chart] Intel E6850 (2 cores @ 3 GHz) / NVIDIA GTX 260 (216 SPs @ 1.24 GHz)
19
GPU Parallelization on the Pup cluster
[Performance chart] Intel E6850 (2 cores @ 3 GHz) / NVIDIA GTX 260 (216 SPs @ 1.24 GHz)
20
Conclusion
The characteristics of Scrabble word searching make efficient GPU parallelization hard:
- Only integer (or char) operations; no floating-point operations.
- A lot of branching and little coalesced memory access.
- A high communication-to-computation ratio.
Design of the CUDA parallelization:
- Asynchronous GPU/CPU parallelization may reduce memory-copy latency.
- One large transfer is more efficient than many small transfers.
- On-chip shared memory is much faster than off-chip local memory.
- Choosing between locking and thread-local vectors depends on the amount of contention.
Future work:
- Further hide latency using multiple streams on a single GPU.
- Multi-GPU parallelization.
- GPU parallelization at other levels.