An Introduction to GPGPU and Some Implementations in Model Checking
Zhimin Wu, Nanyang Technological University, Singapore

Outline
- What is GPGPU?
- The architecture and features of the latest NVIDIA Kepler GK110
- Current research on applying GPGPU to model checking
- A brief introduction to my current work

General-purpose computing on graphics processing units (GPGPU): the use of a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
CPU: instruction-level parallelism (ILP) and thread-level parallelism (TLP). GPU: data-level parallelism (DLP).
High parallelism, high memory bandwidth, high computational power.
Compute framework: CUDA.

Kepler GK110 Chip
Designed for performance and power efficiency: 7.1 billion transistors, over 1 TFLOPS of double-precision throughput, 3x the performance per watt of Fermi.
New features in Kepler GK110: Dynamic Parallelism, Hyper-Q, the Grid Management Unit (GMU), and NVIDIA GPUDirect RDMA.

Kepler GK110 full-chip block diagram

Kepler GK110 supports the new CUDA Compute Capability 3.5. For comparison: GTX 470/480 cards use the Fermi GF100 chip, and the C2050s on grid06 and grid07 are compute capability 2.0.

An SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST). A full Kepler GK110 has 15 SMXs; some products may have 13 or 14 SMXs.

Quad Warp Scheduler
The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruction dispatch units, allowing four warps (128 threads) to be issued and executed concurrently. Kepler GK110 also allows double-precision instructions to be paired with other instructions.

One Warp Scheduler Unit

Each thread can access up to 255 registers (4x that of Fermi).
New shuffle instruction, which allows threads within a warp to share data without passing it through shared memory.
Atomic operations: throughput improved by 9x, to one operation per clock; fast enough to use frequently within kernel inner loops.
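As an illustration, the canonical use of the shuffle instruction is a warp-wide reduction. This is a minimal sketch assuming Kepler-era CUDA, where the intrinsic was `__shfl_down`; it is not taken from any slide in this deck:

```cuda
// Minimal sketch: warp-level sum reduction using the Kepler shuffle
// instruction. At each step, every lane adds the register value held by
// the lane `offset` positions above it, with no shared memory traffic
// and no __syncthreads().
// (On CUDA 9+ toolkits, use __shfl_down_sync(0xffffffff, val, offset).)
__device__ int warpReduceSum(int val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);  // read a register from lane i+offset
    return val;  // after the loop, lane 0 holds the sum of all 32 lanes
}
```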

New: 48 KB read-only data cache, which the compiler/programmer can use to advantage.
Shared memory/L1 cache split: each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, 16 KB of shared memory with 48 KB of L1 cache, or (new) a 32 KB/32 KB split between shared memory and L1 cache. Faster than L2.
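The split is chosen through the CUDA runtime API; a minimal sketch, with a hypothetical kernel name (each call simply sets the preference that applies to the next launch of that kernel):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to illustrate per-kernel cache configuration.
__global__ void myKernel(float *d) { /* ... */ }

int main() {
    // Select the 64 KB on-chip split for this kernel:
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared); // 48 KB shared / 16 KB L1
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);     // 16 KB shared / 48 KB L1
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);  // 32 KB / 32 KB (new on Kepler)
    myKernel<<<1, 32>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}
```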

Dynamic Parallelism
Fermi could only launch one kernel at a time on a single device; a kernel had to complete before the host could call for another GPU task. "In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams, events and manage the dependencies needed to process additional work without the need for host CPU interaction." ".. makes it easier for developers to create and optimize recursive and data-dependent execution patterns, and allows more of a program to be run directly on GPU."
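A minimal device-side launch looks like the sketch below; it requires compute capability 3.5 and compilation with `-arch=sm_35 -rdc=true`, and the kernel names are illustrative, not from the deck:

```cuda
// Hypothetical child kernel: doubles each element of an array.
__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent kernel launches the child directly from the GPU: no host round trip.
__global__ void parentKernel(int *data, int n) {
    if (threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();  // wait for the child before continuing
    }
}
```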

"Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image)." Without it, control must be transferred back to the CPU before a new kernel can execute, and the GPU only returns control to the CPU once all its operations are complete. Why is this faster?

Image attribution: Charles Reid. "With Dynamic Parallelism, the grid resolution can be determined dynamically at runtime in a data-dependent manner. Starting with a coarse grid, the simulation can 'zoom in' on areas of interest while avoiding unnecessary calculation in areas with little change."

Hyper-Q
"The Fermi architecture supported 16-way concurrency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue." "Kepler GK110 … Hyper-Q increases the total number of connections (work queues) … by allowing 32 simultaneous, hardware-managed connections." "… allows connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting GPU utilization, can see up to a 32x performance increase without changing any existing code."

Hyper-Q
"Each CUDA stream is managed within its own hardware work queue …"
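On the programming side, exploiting Hyper-Q is just a matter of issuing independent kernels to separate streams; the hardware does the rest. A minimal sketch with an illustrative kernel:

```cuda
#include <cuda_runtime.h>

// Hypothetical small kernel: increments each element of its slice.
__global__ void smallKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int N = 1 << 16, NSTREAMS = 32;
    float *d;
    cudaMalloc(&d, (size_t)NSTREAMS * N * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamCreate(&streams[i]);

    // One kernel per stream: with GK110's 32 hardware work queues these
    // launches can overlap instead of being falsely serialized as on Fermi.
    for (int i = 0; i < NSTREAMS; ++i)
        smallKernel<<<N / 256, 256, 0, streams[i]>>>(d + (size_t)i * N, N);

    cudaDeviceSynchronize();
    for (int i = 0; i < NSTREAMS; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d);
    return 0;
}
```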

"The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids."

“GPUDirect RDMA allows direct access to GPU memory from 3rd‐party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.”

The Implementation of GPGPU in Model Checking
1. LTL Model Checking:
- J. Barnat, P. Bauch, L. Brim, and M. Ceska. Computing strongly connected components in parallel on CUDA. In IPDPS 2011, pages 544–555. IEEE.
- J. Barnat, P. Bauch, L. Brim, and M. Ceska. Designing fast LTL model checking algorithms for many-core GPUs. Journal of Parallel and Distributed Computing, 72(9):1083–1097.
- S. Edelkamp and D. Sulewski. Efficient explicit-state model checking on general purpose graphics processors. In SPIN, pages 106–123. Springer.
2. Probabilistic Model Checking:
- D. Bosnacki, S. Edelkamp, D. Sulewski, and A. Wijs. Parallel probabilistic model checking on general purpose graphics processors. International Journal on Software Tools for Technology Transfer, 13(1):21–35.
- A. J. Wijs and D. Bosnacki. Improving GPU sparse matrix-vector multiplication for probabilistic model checking. In Model Checking Software, pages 98–116. Springer.
3. State Space Exploration:
- S. Edelkamp and D. Sulewski. Parallel state space search on the GPU. In Proceedings of the International Symposium on Combinatorial Search.
- A. Wijs and D. Bosnacki. GPUexplore: Many-core on-the-fly state space exploration using GPUs. In TACAS, pages 233–247. Springer, 2014.

My Current Work
1. Dynamic Counterexample Generation without CPU Involvement (state space: known)
Abstract: Strongly connected component (SCC) based search is one of the popular LTL model checking algorithms. When the SCCs are huge, the counterexample generation process can be time consuming, especially when dealing with fairness assumptions. In this work, we propose a GPU-accelerated counterexample generation algorithm, which improves performance by parallelizing the breadth-first search (BFS) used in counterexample generation. BFS work is irregular, which makes it hard to allocate resources and can cause load imbalance. We make use of the features of the latest CUDA compute architecture, NVIDIA Kepler GK110, namely dynamic parallelism and the memory hierarchy, to handle the irregular search pattern in BFS. We build dynamic queue management, a task scheduler, and path recording such that the counterexample generation process can be performed entirely by the GPU without involving the CPU. We have implemented the proposed approach in the PAT model checker. Our experiments show that our approach is effective and scalable.
Implementation of GPU in BFS-related model checking problems:
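To make the irregular-parallelism point concrete, the sketch below shows one step of a generic level-synchronous (frontier-based) BFS over a CSR graph, the pattern such GPU model checking work parallelizes. All names are illustrative; this is not the PAT implementation:

```cuda
// Hypothetical sketch: one level of frontier-based BFS on a CSR graph.
// dist[v] == -1 means unvisited; the host relaunches this kernel with
// level+1 while *changed is set. Frontiers vary wildly in size, which is
// the irregularity that dynamic parallelism helps absorb.
__global__ void bfsLevel(const int *rowPtr, const int *colIdx,
                         int *dist, int level, int n, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v < n && dist[v] == level) {            // v is in the current frontier
        for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e) {
            int w = colIdx[e];
            if (dist[w] == -1) {                // unvisited successor
                dist[w] = level + 1;            // benign race: any writer wins
                *changed = 1;                   // another level is needed
            }
        }
    }
}
```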

2. On-the-fly Deadlock Verification (state space: unknown)
Key ideas: compact encoding of the state space; collaborative synchronization based on SIMD; a hierarchical hash structure for successor generation.

Thank you