Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University of Seoul) Chao-Yue Lai (UC Berkeley) Slav Petrov (Google Research) Kurt Keutzer (UC Berkeley)

CKY Parsing
Find the most likely parse tree for a given sentence.
Parse trees can be used in many NLP applications:
–Machine translation
–Question answering
–Information extraction
Dynamic programming in O(|G|n³) time (see the sketch below):
–n is the number of words in a sentence
–|G| is the size of the grammar
[Figure: CKY chart for the sentence "I love you .", with cells for spans (0,0) through (3,3)]
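To make the dynamic program concrete, here is a minimal serial sketch of the binary-rule relaxation. It assumes log-probability scores (added, max-reduced), a fencepost span convention, and hypothetical grammar arrays parent[], lchild[], rchild[], ruleScore[]; it is not the authors' implementation.

    // Serial CKY binary relaxation: for every span, every binary rule, and every
    // split point, keep the best (max) score for the rule's parent symbol.
    void ckyBinaryRelax(int n, int numRules,
                        const int *parent, const int *lchild, const int *rchild,
                        const float *ruleScore, float ***scores) {
        for (int len = 2; len <= n; ++len) {                      // span length: serial (DP dependency)
            for (int start = 0; start + len <= n; ++start) {      // span start
                int stop = start + len;
                for (int r = 0; r < numRules; ++r) {              // ~1M rules: the work to parallelize
                    for (int split = start + 1; split < stop; ++split) {
                        float s = ruleScore[r]
                                + scores[start][split][lchild[r]]
                                + scores[split][stop][rchild[r]];
                        if (s > scores[start][stop][parent[r]])
                            scores[start][stop][parent[r]] = s;
                    }
                }
            }
        }
    }

The three nested span/split loops and the rule loop together give the O(|G|n³) bound quoted above.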

Why Faster Parsers?
O(|G|n³):
–n is on average about 20
–|G| is much larger: grammars with high accuracy have >1,000,000 rules
We need faster parsers for real-time natural language processing with high accuracy!

GPUs
Manycore era:
–Due to the "Power Wall", CPU clock frequencies are unlikely to keep increasing
–Instead, the number of processing cores will continue to increase
GPU (Graphics Processing Unit):
–A manycore architecture available today
–480 processing cores in the GTX480

Overall Structure
Hierarchical parallel platform:
–Several Streaming Processors (SPs) are grouped into a Streaming Multiprocessor (SM)

Memory Types
Different types of memory (see the sketch below):
–Global memory: large, off-chip, high latency
–Shared memory: small, on-chip, shared within a thread block
–Texture memory: cached, read-only access path to global memory
–Constant memory: small, cached, read-only
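A minimal sketch of how each memory type appears in CUDA C; names and sizes are illustrative, and the kernel assumes 256 threads per block with the input length a multiple of 256.

    // Illustrative declarations of the four memory types.
    __constant__ float constGrammar[256];                     // constant memory: small, cached, read-only
    texture<float, 1, cudaReadModeElementType> texChart;      // texture memory: cached read-only path (legacy API)

    __global__ void memoryTypesDemo(const float *gIn, float *gOut) {
        __shared__ float tile[256];                           // shared memory: on-chip, per thread block
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        tile[tid] = gIn[i];                                   // global memory: large, off-chip, high latency
        __syncthreads();
        gOut[i] = tile[tid] + constGrammar[tid];
    }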

CUDA
CUDA (Compute Unified Device Architecture):
–Parallel programming framework for GPUs
–Programming model, language, compilers, APIs
–Allows general-purpose computing on GPUs

Thread and Thread Block in CUDA
Thread blocks (blocks):
–Independent execution units
Threads:
–Maximum threads per block: 512 or 1024
Warps:
–Groups of 32 threads executed together
Kernel:
–Configured as #blocks, #threads (see the launch sketch below)
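A hedged sketch of configuring a launch with #blocks and #threads; the kernel and sizes are illustrative, not taken from the paper.

    __global__ void scaleKernel(float *data, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;        // global thread index
        if (i < n) data[i] *= alpha;                          // guard against the tail block
    }

    void launchScale(float *d_data, float alpha, int n) {
        int threadsPerBlock = 256;                            // must not exceed the 512/1024 per-block limit
        int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover all n elements
        scaleKernel<<<numBlocks, threadsPerBlock>>>(d_data, alpha, n);
    }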

Programming Model in CUDA
Fork-join programming model, host + device program:
–Serial or modestly parallel parts in host C code
–Highly parallel parts in device kernel C code
Serial code (host) ... parallel code in kernel (device): KernelA<<<...>>>(args);
Serial code (host) ... parallel code in kernel (device): KernelB<<<...>>>(args);
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign

SIMT Model in CUDA
SIMT (Single Instruction Multiple Thread):
–Not SIMD (Single Instruction Multiple Data), because threads can actually execute different locations of the program
–Not SPMD (Single Program Multiple Data), because threads on different execution paths cannot execute in parallel

__global__ void Kernel1(..) {
    if (threadIdx.x < a)
        ...        // some threads take this branch
    else
        ...        // others take this one; the two paths are serialized
}

__global__ void Kernel2(..) {
    int tx = threadIdx.x;
    for (i = 0; i < LoopCount[tx]; i++)
        ...        // per-thread trip counts diverge
}

Parallelisms in CKY Parsing
Dynamic programming:
–Iterations (span lengths) must be executed serially
But, within each iteration:
–About a million rules (over thousands of symbols) need to be evaluated for each span
–Two sources of parallelism: rules and spans, in both unary rule relaxation and binary rule relaxation
[Figure: CKY chart for "I love you ." annotated with the rule and span dimensions (#rules, #spans)]

Thread-Mapping
Map a symbol to a thread?
–Not good for load balancing
–Remember SIMT!
Map a rule to a thread? (see the sketch below)
–850K rules → good concurrency
–Thread blocks are just groups of the same number of threads
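A hedged sketch of the rule-per-thread mapping for a single span (start, stop). Names follow the serial sketch above; scores are assumed here to be scaled integer log-probabilities so the built-in integer atomicMax can be used. This is illustrative, not the authors' code.

    // One thread per grammar rule; all threads of one span run concurrently.
    __global__ void binaryRelaxThreadMap(int start, int stop, int numRules, int numSym, int n,
                                         const int *parent, const int *lchild, const int *rchild,
                                         const int *ruleScore,   // scaled integer log-probs (assumption)
                                         const int *scores,      // chart, flattened [start][stop][symbol]
                                         int *spanOut) {         // scores for (start, stop), one per symbol
        int r = blockIdx.x * blockDim.x + threadIdx.x;           // rule index
        if (r >= numRules) return;
        int best = -(1 << 30);
        for (int split = start + 1; split < stop; ++split) {
            int s = ruleScore[r]
                  + scores[(start * (n + 1) + split) * numSym + lchild[r]]
                  + scores[(split * (n + 1) + stop) * numSym + rchild[r]];
            best = max(best, s);
        }
        atomicMax(&spanOut[parent[r]], best);                    // all rules of a parent reduce into one cell
    }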

Block-Mapping
Map each symbol to a thread block:
–and map the rules to threads in the thread block that corresponds to the parent symbol
–(+) All the threads in the same thread block have the same parent
–(-) What if the number of rules of a symbol exceeds the per-block thread limit?

Block-Mapping (cont'd)
–A symbol whose rules exceed the per-block thread limit is split into multiple virtual symbols (e.g., Symbol i split into Virtual Symbol j and Virtual Symbol j+1), each handled by its own thread block (see the sketch below)
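A hedged sketch of the block mapping: blockIdx.x selects a (virtual) symbol and threadIdx.x selects one of its rules. ruleStart[] and symOfBlock[] are hypothetical precomputed arrays giving each virtual symbol's rule range and its real parent symbol; integer scores as above.

    __global__ void binaryRelaxBlockMap(int start, int stop, int numSym, int n,
                                        const int *ruleStart,   // first rule of each (virtual) symbol
                                        const int *symOfBlock,  // real parent symbol of each virtual symbol
                                        const int *lchild, const int *rchild, const int *ruleScore,
                                        const int *scores, int *spanOut) {
        int vsym = blockIdx.x;                        // one thread block per (virtual) symbol
        int r = ruleStart[vsym] + threadIdx.x;        // one rule of that symbol per thread
        if (r >= ruleStart[vsym + 1]) return;
        int parentSym = symOfBlock[vsym];             // all threads in the block share this parent
        int best = -(1 << 30);
        for (int split = start + 1; split < stop; ++split) {
            int s = ruleScore[r]
                  + scores[(start * (n + 1) + split) * numSym + lchild[r]]
                  + scores[(split * (n + 1) + stop) * numSym + rchild[r]];
            best = max(best, s);
        }
        atomicMax(&spanOut[parentSym], best);         // could instead be a block-level parallel reduction
    }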

Span-Mapping
Another, orthogonal level of parallelism is easy to add:
–Simply add another dimension to the grid of thread blocks (see the sketch below)
[Figure: 2D grid of thread blocks, blockIdx.x = sym0, sym1, ...; blockIdx.y = 0 ... n-len+1 (span index)]
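The span dimension only changes the launch configuration. A hedged sketch for the iteration that processes all spans of length len; binaryRelaxSpanMap is a hypothetical kernel, e.g. the block-mapping kernel above extended to read its span from blockIdx.y.

    __global__ void binaryRelaxSpanMap(int len /*, grammar and chart pointers */) {
        int start = blockIdx.y;                       // span start from the grid's y dimension
        int stop  = start + len;
        // ... same per-rule work as in the block-mapping sketch, for symbol blockIdx.x ...
        (void)stop;
    }

    void launchSpanMap(int len, int n, int numVirtualSymbols, int maxRulesPerSymbol) {
        dim3 grid(numVirtualSymbols, n - len + 1);    // blockIdx.x = (virtual) symbol, blockIdx.y = span
        dim3 block(maxRulesPerSymbol);                // one thread per rule of that symbol
        binaryRelaxSpanMap<<<grid, block>>>(len);
    }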

Synchronization
A massive number of threads sharing the same parent symbol must update that symbol's score correctly, so that the final reduced value is the maximum.

Atomic Operations
atomicMax(&max, value);
–CUDA API
–Much more efficient on shared memory than on global memory (see the sketch below)
[Figure: atomic operation cost, shared memory vs. global memory]
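A hedged sketch of the two-level use of atomicMax: threads first reduce into a per-block cell in fast shared memory, and only one thread per block issues a global-memory atomic (integer scores assumed, as above; names are illustrative).

    __global__ void reduceWithAtomics(const int *candidate, int *globalMax, int n) {
        __shared__ int blockMax;                          // one shared cell per block
        if (threadIdx.x == 0) blockMax = -(1 << 30);
        __syncthreads();

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) atomicMax(&blockMax, candidate[i]);    // cheap: shared-memory atomic
        __syncthreads();

        if (threadIdx.x == 0) atomicMax(globalMax, blockMax);  // one global atomic per block
    }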

Parallel Reduction
After log₂N steps (N is the number of threads in a block), the reduced value is obtained (see the sketch below):
–All the threads work on the same symbol
–An option only for block-mapping
–Requires __syncthreads() between steps
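A hedged sketch of a tree-style max reduction in shared memory; after log₂N steps, thread 0 of the block holds the maximum (blockDim.x assumed to be a power of two, at most 256 here).

    __global__ void blockMaxReduce(const int *candidate, int *result, int n) {
        __shared__ int s[256];                        // one slot per thread
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;
        s[tid] = (i < n) ? candidate[i] : -(1 << 30);
        __syncthreads();

        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {  // log2(N) steps
            if (tid < stride) s[tid] = max(s[tid], s[tid + stride]);
            __syncthreads();                          // every thread must reach this barrier
        }
        if (tid == 0) result[blockIdx.x] = s[0];      // per-block maximum
    }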

Reducing Global Memory Using Texture Memory
Grammar information:
–parent[], lchild[], rchild[]
–Read-only throughout the whole program
Scores updated in the previous iterations of dynamic programming:
–scores[][][]
–Read-only
Place such read-only data in texture memory! (see the sketch below)
But for scores[][][], the scores newly updated in the current iteration must also be placed in texture memory:
–Binding an array to texture memory: cudaBindTexture()
–The execution time of this API is proportional to the array size
–(-) scores[start][stop][S] is a huge array...
Binary rule S_j → S_r S_s reads scores[w_p][w_d][S_r] and scores[w_{d+1}][w_q][S_s]
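A hedged sketch using the legacy texture reference API that was current when this work was done; the array names are illustrative, not the authors' code.

    // File scope: texture references for read-only grammar and chart data.
    texture<int, 1, cudaReadModeElementType> texLChild;
    texture<float, 1, cudaReadModeElementType> texScores;

    // Host side: bind once per iteration, only for data that is read-only from now on.
    void bindIterationData(const int *d_lchild, int numRules,
                           const float *d_scoresSlice, size_t scoresSliceBytes) {
        cudaBindTexture(0, texLChild, d_lchild, numRules * sizeof(int));
        cudaBindTexture(0, texScores, d_scoresSlice, scoresSliceBytes);  // binding time grows with size (per the slide)
    }

    // Device side: reads go through the texture cache instead of plain global loads.
    __global__ void readThroughTextures(float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = tex1Dfetch(texScores, tex1Dfetch(texLChild, i));
    }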

Reducing Global Memory Using Texture Memory (Cont'd)
Change the layout (see the sketch below):
–scores[start][stop][S] → scores[len][start][S]
–We only need to update the part of scores[][][] where len = current iteration
[Figure: CKY chart for "I love you ." grouped by span length, len = 1, 2, 3, 4]
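A hedged sketch of the indexing change: with the scores[len][start][S] layout, all cells written in the current iteration form one contiguous slice, so only that slice needs to be re-bound. The helper name is illustrative.

    // Old layout: cell (start, stop) is scattered across the array.
    //   idx = (start * (n + 1) + stop) * numSym + sym;
    // New layout: all cells of one span length are contiguous (stop = start + len).
    __host__ __device__ inline int scoreIdx(int len, int start, int sym, int numSym, int n) {
        return (len * (n + 1) + start) * numSym + sym;    // scores[len][start][sym]
    }

    // After finishing length `len`, only the slice starting at
    // scoreIdx(len, 0, 0, numSym, n), of (n - len + 1) * numSym entries, is new.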

Experimental Results: GTX285
–No cache memory supported
–Low memory bandwidth
Speedup:
–thread-atom: 6.4x
–block-atom: 8.1x
–block-pr: 10.1x
–block-atom-SS: 11.1x
–block-pr-SS: 14.2x
–block-atom-SS-tex: 11.9x
–block-pr-SS-tex: 17.4x

Experimental Results: GTX480
–Cache memory supported
–Higher memory bandwidth
Speedup:
–thread-atom: 13.2x
–block-atom: (missing)
–block-pr: 25.8x
–block-atom-SS: 15.2x
–block-pr-SS: 23.4x
–block-atom-SS-tex: 13.9x
–block-pr-SS-tex: 22.2x

Conclusions
We explored the design space for parallelizing CKY parsing on a GPU:
–Different mappings and synchronization methods
–Utilizing different types of memory
We compared each version on two GPUs:
–26x on GTX480, 17x on GTX285
We expect scalable performance gains as the number of processing cores increases in future GPUs.