Efficient Parallel CKY Parsing on GPUs
Youngmin Yi (University of Seoul)
Chao-Yue Lai (UC Berkeley)
Slav Petrov (Google Research)
Kurt Keutzer (UC Berkeley)
Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions
Why Faster Parsers?
- Parsing is the backbone of most NLP applications: machine translation, question answering, information extraction.
- High-accuracy parsing takes time: what if we want to parse the web?
Great Speedups: GPUs
- GPUs are manycore: hundreds of processing cores, massive parallelism.
- They allow general-purpose computing.
- Computer vision (130x speedup): Catanzaro, B. et al. 2009. Efficient, high-quality image contour detection. In ICCV '09.
- Speech recognition (10.5x speedup): Chong, J. et al. 2009. Scalable HMM based inference engine in large vocabulary continuous speech recognition. In ICME '09.
- We want to bring GPUs to the NLP community.
CKY Parsing
[Figure: CKY chart for the sentence "I love you ." with cells (0,0) through (3,3) covering spans such as "love", "love you", "I love you".]
- Constituency parsing with a weighted CFG.
- Dynamic programming iteratively builds parse trees with larger spans from smaller spans.
- Runs in O(|G| * n^3):
  - n: number of words in a sentence, 20 on average.
  - |G|: grammar constant, proportional to the number of rules.
- High-accuracy grammars have about 1,000,000 rules, so |G| has more impact on speed than n.
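To make the cost model concrete, here is a minimal serial sketch of the binary-relaxation loops of CKY (Viterbi scores in log space). The rule struct, score layout, and all names are illustrative assumptions, not the Berkeley Parser's actual data structures; the three span/start/split loops plus the inner loop over the grammar's rules give the O(|G| * n^3) behaviour.

```cuda
// Serial CKY binary relaxation (sketch). Assumes span-1 cells were already
// filled from the lexicon and all other score entries start at -infinity.
// Hypothetical layout: score[cell * numSymbols + symbol].
struct BinaryRule { int parent, left, right; float score; };

void ckyBinaryRelaxation(float* score, int numSymbols, int n,
                         const BinaryRule* rules, int numRules) {
    auto cell = [n](int start, int end) { return start * n + (end - 1); };
    for (int span = 2; span <= n; ++span) {                 // larger spans last
        for (int start = 0; start + span <= n; ++start) {
            int end = start + span;
            for (int split = start + 1; split < end; ++split) {
                for (int r = 0; r < numRules; ++r) {        // grammar constant |G|
                    const BinaryRule& rule = rules[r];
                    float cand = rule.score
                               + score[cell(start, split) * numSymbols + rule.left]
                               + score[cell(split, end)   * numSymbols + rule.right];
                    float& best = score[cell(start, end) * numSymbols + rule.parent];
                    if (cand > best) best = cand;           // keep the Viterbi maximum
                }
            }
        }
    }
}
```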
Outline
- Motivation
- CUDA Programming Model
  - Computational Model
  - Memory Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions
CUDA Computational Model
- Two levels of hierarchy: thread blocks and threads.
- Thread blocks (blocks):
  - Independent execution units.
  - Maximum threads per block: 512 or 1024, depending on the architecture.
- Threads in a block:
  - Not independent; they work best when executing in lockstep, like a vector unit.
  - Communicate via "shared memory".
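A minimal sketch of this two-level hierarchy, independent of the parser itself: a grid of independent blocks is launched, and the threads inside each block get consecutive indices they can use to divide work. Names are illustrative.

```cuda
#include <cuda_runtime.h>

// Each thread records its global index; blocks are independent units of work,
// while threads within a block can cooperate via shared memory and __syncthreads().
__global__ void writeGlobalIds(int* out) {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    out[globalId] = globalId;
}

int main() {
    const int blocks = 4, threadsPerBlock = 256;   // well under the 512/1024 per-block limit
    int* d_out;
    cudaMalloc(&d_out, blocks * threadsPerBlock * sizeof(int));
    writeGlobalIds<<<blocks, threadsPerBlock>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```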
CUDA Memory Model
- Global memory: off-chip, slow but large.
- Shared memory: on-chip, fast but small; shared among the threads in a thread block.
- Texture memory: fast memory written from the CPU; works best with read-only data.
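As a sketch of the shared-memory part of this model (the texture path would be set up analogously from the host for read-only data), the hypothetical kernel below stages a small table from slow global memory into fast shared memory once per block and then reads it repeatedly:

```cuda
#define TABLE_SIZE 256

// Stage a small, frequently reused table into on-chip shared memory,
// then let every thread in the block read it without touching global memory again.
__global__ void useSharedTable(const float* globalTable, float* out, int n) {
    __shared__ float table[TABLE_SIZE];
    // Cooperative load: each thread copies a slice of the table.
    for (int i = threadIdx.x; i < TABLE_SIZE; i += blockDim.x)
        table[i] = globalTable[i];
    __syncthreads();                         // table is now visible to the whole block

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = table[idx % TABLE_SIZE];  // repeated reads hit shared memory
}
```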
CUDA Programming Principles
- Map computations to blocks and threads; load balancing among the threads in a block saves time.
- Use the different types of memory efficiently; reduce global memory accesses.
Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
  - Mapping: Thread-Based vs. Block-Based
  - Sequential Spans vs. Parallel Spans
  - Atomic Operations vs. Parallel Reduction
  - Reducing Global Memory Accesses
- Experimental Results
- Conclusions
Parallelism in CKY Parsing
- The bottleneck is binary relaxation.
- Parallelism is available across spans, symbols, and rules.
[Figure: the CKY chart (spans), the symbols S1 ... S100 within a cell, and the binary rules rewriting each symbol.]
Mapping
- One symbol per thread? Symbols have very different numbers of rules, which causes load imbalance.
Thread-Based Mapping
- One rule per thread: the symbol dimension is flattened out.
- (+) About 850k rules: great parallelism.
- (+) Load balanced.
- (-) A block may handle rules with different parent symbols, which makes it harder to compute the maximum score for each symbol.
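A sketch of what this mapping could look like for a single chart cell and split point; the rule layout and array names are assumptions, not the paper's code. Each thread computes the candidate score of exactly one rule, which is what makes the load balanced; forming the per-parent maximum still has to happen afterwards (via atomics or a reduction), which is the drawback noted above.

```cuda
// Thread-based mapping (sketch): rule r -> thread r, flattened across blocks.
struct BinaryRule { int parent, left, right; float score; };

__global__ void binaryRelaxThreadMapped(const BinaryRule* rules, int numRules,
                                        const float* leftScores,    // scores of the left child cell
                                        const float* rightScores,   // scores of the right child cell
                                        float* ruleCandidates)      // one candidate slot per rule
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRules) return;
    BinaryRule rule = rules[r];
    // Every thread does the same amount of work: one rule application.
    ruleCandidates[r] = rule.score + leftScores[rule.left] + rightScores[rule.right];
}
```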
Block-Based Mapping
- One symbol per block, one rule per thread.
- (+) All the threads in a block share the same parent symbol.
- (-) What if the number of rules for a symbol exceeds the per-block thread limit? Split such symbols into virtual symbols.
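A corresponding sketch for the block-based mapping, again with assumed names and a CSR-style grouping of rules by (virtual) parent symbol. Splitting an oversized symbol into virtual symbols simply means giving it several consecutive entries in `ruleOffsets`, each small enough to fit in one block. Because every thread in a block shares the same parent, the per-block maximum can be combined with either an atomic update or the parallel reduction shown later.

```cuda
// Block-based mapping (sketch): block b handles one (possibly virtual) parent
// symbol; thread t handles one of that symbol's rules.
struct BinaryRule { int parent, left, right; float score; };

__global__ void binaryRelaxBlockMapped(const BinaryRule* rules,
                                       const int* ruleOffsets,    // per (virtual) symbol, length numVirtualSymbols+1
                                       const float* leftScores,
                                       const float* rightScores,
                                       float* candidates)         // one candidate slot per rule
{
    int symbol = blockIdx.x;                        // one (virtual) symbol per block
    int begin  = ruleOffsets[symbol];
    int end    = ruleOffsets[symbol + 1];
    int r = begin + threadIdx.x;                    // one rule per thread
    if (r < end) {
        BinaryRule rule = rules[r];
        candidates[r] = rule.score + leftScores[rule.left] + rightScores[rule.right];
    }
}
```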
Sequential Spans
[Figure: cells of the same span length are processed one after another; within each cell, symbols map to blocks and rules to threads.]
Parallel Spans
[Figure: all cells of the same span length are processed in parallel, adding spans as a third dimension of parallelism alongside symbols and rules.]
Atomic Operations
- Multiple threads update the score of the same parent symbol.
- The updates must be scheduled so that they do not happen simultaneously, to ensure correctness.
- Atomic operations guarantee that a memory location is accessed by only one thread at any time, serializing operations when necessary.
[Figure: several rules with parents S1 and S2 all updating scores[S1] and scores[S2].]
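CUDA of that generation has no native atomic maximum for floats, so a common way to realize this update is an atomicCAS loop. The helper below is a standard sketch of that pattern, not the authors' exact code.

```cuda
// Max-update on a float score via atomicCAS, so many threads can safely push
// candidate scores into the same scores[parent] slot.
__device__ void atomicMaxFloat(float* address, float value) {
    int* addressAsInt = (int*)address;
    int old = *addressAsInt;
    // Retry until our value is no longer larger, or the compare-and-swap succeeds.
    while (value > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addressAsInt, assumed, __float_as_int(value));
        if (old == assumed) break;
    }
}

// Usage inside a relaxation kernel (one candidate per rule/thread):
//   atomicMaxFloat(&scores[rule.parent], candidate);
```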
Parallel Reduction
- Binary tree reduction: runs in O(log N) time.
- All the threads in a block must have the same parent symbol, so this is an option only for block-based mapping.
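A sketch of the shared-memory binary-tree reduction, assuming each thread starts with one candidate score for the block's parent symbol and the kernel is launched with a power-of-two block size; after O(log N) steps, thread 0 holds the block's maximum. Names are illustrative.

```cuda
#define BLOCK_SIZE 256   // launch with exactly BLOCK_SIZE threads per block

__global__ void reduceMaxPerBlock(const float* candidates, int numCandidates,
                                  float* blockMax) {
    __shared__ float sdata[BLOCK_SIZE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Load this thread's candidate (pad with a very negative log score past the end).
    sdata[tid] = (idx < numCandidates) ? candidates[idx] : -1e30f;
    __syncthreads();

    // Halve the number of active threads each step: O(log BLOCK_SIZE) iterations.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        blockMax[blockIdx.x] = sdata[0];   // one maximum per block / parent symbol
}
```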
Reducing Global Memory Accesses
- Shared memory holds frequently accessed data: the scores of the parent symbols.
- Texture memory holds read-only data: grammar information such as rule scores, and the scores of symbols with smaller spans.
- Changing the layout of the scores minimizes the overhead of copying data to texture memory.
[Figure: the CKY chart grouped by span length (span = 1, 2, 3, 4).]
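One way to realize such a layout, shown as a sketch with assumed names: store the scores span-major, so that all cells of one span length form a contiguous slab that can be copied (or bound to a texture) in a single transfer once that level is finished.

```cuda
// Span-major score layout (sketch). Level "span" holds (numWords - span + 1)
// cells; padding every level to numWords cells keeps the stride uniform so the
// offset stays a simple product. Indices and names are illustrative.
__host__ __device__ inline int scoreIndex(int span, int start, int symbol,
                                          int numWords, int numSymbols) {
    return ((span - 1) * numWords + start) * numSymbols + symbol;
}
```

Scores of smaller spans are only read while the current span is being relaxed, which is what makes them good texture candidates, while the current cell's parent scores are reused by many rules and are the natural residents of shared memory.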
Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions
Setup
- Two GPU architectures: NVIDIA GTX285 (Tesla) and NVIDIA GTX480 (Fermi).
- The GTX480 beats the GTX285 in number of cores, cache support, and memory size.
- Benchmark: 1,000 sentences from section 22 of the WSJ portion of the Penn Treebank.
- Speedups are measured against a serial C implementation of the Berkeley Parser.
GTX285 (Tesla)
- No cache memory supported; lower memory bandwidth.
- Speedups over the serial baseline (1.0):
  thread-atomic-PSpan       6.4
  block-atomic-PSpan        8.1
  block-atomic-SSpan       11.1
  block-atomic-SSpan-tex   11.9
  block-reduce-PSpan       10.1
  block-reduce-SSpan       14.2
  block-reduce-SSpan-tex   17.4
- Legend: PSpan = parallel spans, SSpan = sequential spans, reduce = parallel reduction, tex = texture memory.
GTX480 (Fermi)
- Cache memory supported; higher memory bandwidth.
- Speedups over the serial baseline (1.0):
  thread-atomic-PSpan      13.2
  block-atomic-PSpan       14.1
  block-atomic-SSpan       15.2
  block-atomic-SSpan-tex   13.9
  block-reduce-PSpan       25.8
  block-reduce-SSpan       23.4
  block-reduce-SSpan-tex   22.2
- Legend: PSpan = parallel spans, SSpan = sequential spans, reduce = parallel reduction, tex = texture memory.
Conclusions
- We explored the design space for parallelizing CKY parsing on GPUs: different mappings, synchronization methods, and uses of the different types of memory.
- We compared two GPU architectures: up to 26x speedup on the GTX480 and 17x on the GTX285.
- We expect the performance gain to scale as the number of processing cores in future GPUs increases.