Fast Circuit Simulation on Graphics Processing Units
Kanupriya Gulati†  John F. Croix‡  Sunil P. Khatri†  Rahm Shastry‡
† Texas A&M University, College Station, TX
‡ Nascentric, Inc., Austin, TX

Outline
- Introduction
- CUDA programming model
- Approach
- Experiments
- Conclusions

Introduction
- SPICE is the de facto industry standard for VLSI circuit simulation
- There is significant motivation to accelerate SPICE simulations without losing accuracy:
  - Increasing complexity and size of VLSI circuits
  - Increasing impact of process variations on the electrical behavior of circuits, which requires Monte Carlo based simulations
- We accelerate the computationally expensive portion of SPICE, transistor model evaluation, on Graphics Processing Units (GPUs)
- Our approach is integrated into a commercial SPICE accelerator tool, OmegaSIM (itself already x faster than traditional SPICE implementations)
- With our approach, OmegaSIM achieves a further speedup of 2.36X on average (3.07X maximum)

Introduction
- The GPU is a commodity stream processor:
  - Highly parallel
  - Very fast
  - Single Instruction Multiple Data (SIMD) operation
- Owing to their massively parallel architecture, GPUs have been used to accelerate several scientific computations:
  - Image/stream processing
  - Data compression
  - Numerical algorithms (LU decomposition, FFT, etc.)
- For our implementation we used an NVIDIA GeForce 8800 GTS (128 processors, 16 multiprocessors)
- We used the Compute Unified Device Architecture (CUDA) for programming and interfacing with the GPU

CUDA Programming Model
- The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU (the host)
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
[Figure: the host (CPU) communicates with the device (GPU) over PCIe; kernels execute on the device as many parallel threads (instances of the kernel), with data held in device memory]
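As a concrete illustration of this host/device split, here is a minimal sketch (not from the paper; the kernel and all names are hypothetical): it allocates device DRAM, copies data across PCIe, launches many parallel threads, and copies the result back.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel: each thread squares one array element.
__global__ void square(float *d_x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_x[i] *= d_x[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_x = (float *)malloc(bytes);       // host memory
    for (int i = 0; i < n; i++) h_x[i] = (float)i;

    float *d_x;
    cudaMalloc((void **)&d_x, bytes);          // device DRAM
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // across PCIe

    square<<<(n + 255) / 256, 256>>>(d_x, n);  // many threads in parallel

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    free(h_x);
    return 0;
}
```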

Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks (aka blocks)
- All threads within a block share a portion of data memory
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free common memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of blocks, and Kernel 2 on Grid 2; one block of Grid 2, Block (1, 1), is expanded to show its 5×3 arrangement of threads]
Source: "NVIDIA CUDA Programming Guide," version 1.1
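The kind of intra-block cooperation described above can be sketched as follows (hypothetical example, not from the paper): threads of one block sum their values through per-block shared memory, using __syncthreads() as the barrier that keeps the shared accesses hazard-free. It assumes a block size of 256 and an input length that is a multiple of 256.

```cuda
// Each block reduces 256 elements of d_in to one partial sum in d_blockSums.
__global__ void blockSum(const float *d_in, float *d_blockSums) {
    __shared__ float s[256];                    // per-block shared memory
    int tid = threadIdx.x;
    s[tid] = d_in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                            // all loads visible block-wide
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();                        // avoid read/write hazards
    }
    if (tid == 0) d_blockSums[blockIdx.x] = s[0];  // one result per block
}
// Launched as a grid of blocks: blockSum<<<numBlocks, 256>>>(d_in, d_partial);
// Blocks cannot cooperate, so combining the partial sums needs a second
// kernel launch (or a copy back to the host).
```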

Device Memory Space Overview
- Each thread has:
  - R/W per-thread registers (max. 8192 registers per multiprocessor)
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
    - Main means of communicating data between host and device
    - Contents visible to all threads
    - Not cached, so coalescing of accesses is needed
  - Read-only per-grid constant memory (cached, visible to all threads)
  - Read-only per-grid texture memory (cached, visible to all threads)
- The host can R/W the global, constant, and texture memories
[Figure: memory hierarchy of a grid on the device, with registers and local memory per thread, shared memory per block, and global, constant, and texture memories shared by all blocks and accessible from the host]
Source: "NVIDIA CUDA Programming Guide," version 1.1
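A hypothetical kernel touching each of these spaces (illustrative names, not from the paper; assumes blockDim.x <= 128):

```cuda
__constant__ float c_coeffs[64];   // per-grid constant memory: cached, read-only,
                                   // filled by the host via cudaMemcpyToSymbol

__global__ void demo(float *d_global, int n) {  // d_global: per-grid global memory
    __shared__ float s_tile[128];  // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float r;                       // scalar local variable, held in registers
    if (i < n) {
        s_tile[threadIdx.x] = d_global[i];      // global -> shared
        r = s_tile[threadIdx.x] * c_coeffs[0];  // shared and constant reads
        d_global[i] = r;                        // result back to global memory
    }
}
```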

Approach
- We first profiled SPICE simulations over several benchmarks:
  - 75% of the time is spent in BSIM3 device model evaluations
  - Billions of calls are made to the device model evaluation routines
    - Every device in the circuit is evaluated at every time step, possibly repeatedly, until the Newton-Raphson loop for solving the non-linear equations converges
  - By Amdahl's law, with 75% of the runtime parallelizable, the asymptotic speedup is bounded by 1/(1 − 0.75) = 4X
- These calls are parallelizable:
  - They are independent of each other
  - Each call performs identical computations on different data
  - They conform to the GPU's SIMD operating paradigm (see the sketch below)
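A simplified, hypothetical sketch of this mapping (not the authors' BSIM3 code, whose equations are far more involved): one thread evaluates one device, so all threads execute identical code on different data, exactly the SIMD pattern the GPU wants.

```cuda
// Stand-in for a device's operating point; real BSIM3 state is much larger.
struct DeviceState { float vgs, vds, vbs; };

// Placeholder for the actual model equations.
__device__ float evalModel(const DeviceState &s) {
    return s.vgs * s.vds - 0.5f * s.vbs;
}

// One thread per device: identical computation, different data.
__global__ void evalAllDevices(const DeviceState *d_states,
                               float *d_currents, int numDevices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numDevices)
        d_currents[i] = evalModel(d_states[i]);
}
```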

Approach
- CDFG-guided manual partitioning of the BSIM3 evaluation code
- There are limits on the available hardware resources:
  - Registers (8192 per multiprocessor)
  - Shared memory (16 KB per multiprocessor)
  - Bandwidth to global memory (max. sustainable is ~80 GB/s)
- If the entire BSIM3 model is implemented as a single kernel:
  - Not enough threads can be issued in parallel to hide the global memory access latency
- If the BSIM3 code is partitioned into many (small) kernels:
  - Large amounts of data must be transferred across kernels
  - This is done through global memory (not cached), which negatively impacts performance
- So, in our approach, we:
  - Create the CDFG of the BSIM3 equations
  - Use maximally disconnected components of this graph as the different kernels, considering the above hardware limitations (sketched below)
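The trade-off can be sketched with two hypothetical kernels, standing in for two CDFG components, chained through a global-memory intermediate rather than a round-trip to the host:

```cuda
// First partition: produces an intermediate value per device.
__global__ void kernelA(const float *d_in, float *d_tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_tmp[i] = d_in[i] * d_in[i];   // stand-in for real equations
}

// Second partition: consumes the intermediate from global memory.
__global__ void kernelB(const float *d_tmp, float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = d_tmp[i] + 1.0f;     // stand-in for real equations
}

// Host side:
//   kernelA<<<blocks, threads>>>(d_in, d_tmp, n);
//   kernelB<<<blocks, threads>>>(d_tmp, d_out, n);
// d_tmp never leaves the GPU, but each cross-kernel value still costs an
// uncached global-memory store and load, which is why the number of
// partitions must be kept small.
```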

Approach
- Vectorizing 'if-else' statements:
  - The BSIM3 model evaluation code has nested if-else statements
  - For SIMD computation, they are restructured using masks
  - The CUDA compiler has a built-in ability to restructure these statements

  if( A < B ) x = v1 + v2; else x = v1 * v2;

[Figure: expression tree of the masked form, in which the comparison A < B produces a mask that selects between v1 + v2 and v1 * v2]
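A minimal sketch of that masked form (illustrative, not the compiler's actual output): both sides of the branch are evaluated, and a 0/1 mask selects the result, so every thread in a SIMD group executes the same instructions.

```cuda
// Branch-free version of: if (A < B) x = v1 + v2; else x = v1 * v2;
__device__ float maskedSelect(float A, float B, float v1, float v2) {
    float mask = (A < B) ? 1.0f : 0.0f;   // comparison yields the mask
    return mask * (v1 + v2) + (1.0f - mask) * (v1 * v2);
}
```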

Approach
- Take GPU memory constraints into account
- Global memory:
  - Used to store intermediate data that is generated by one kernel and needed by another, instead of transferring this data to the host
- Texture memory:
  - Used for storing 'runtime parameters': device parameters that remain unchanged throughout the simulation
  - Advantages:
    - It is cached, unlike global memory
    - No coalescing requirements, unlike global memory
    - No bank conflicts, such as are possible in shared memory
  - CUDA's efficient built-in texture fetching routines are used
  - The small overhead of loading the texture memory is easily amortized
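A sketch of the parameter lookup using the texture-reference API of that CUDA generation (tex1Dfetch; this legacy API has since been removed in CUDA 12, and the names below are hypothetical):

```cuda
// Read-only, cached view of the unchanging runtime parameters.
texture<float, 1, cudaReadModeElementType> paramTex;

__global__ void useParams(float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_out[i] = tex1Dfetch(paramTex, i);  // cached fetch, no coalescing rules
}

// Host side, with d_params already filled with the parameters:
//   cudaBindTexture(0, paramTex, d_params, n * sizeof(float));
//   useParams<<<blocks, threads>>>(d_out, n);
//   cudaUnbindTexture(paramTex);
```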

Experiments
- Our device model evaluation is implemented and integrated into a commercial SPICE accelerator tool, OmegaSIM
  - The modified version of OmegaSIM is referred to as AuSIM
- Hardware used:
  - CPU: Intel Core 2 Quad, 2.4 GHz, 4 GB RAM
  - GPU: NVIDIA GeForce 8800 GTS, 128 processors, 675 MHz, 512 MB RAM
- Comparing BSIM3 model evaluation alone:

  # Evals.  GPU runtimes (ms)          CPU runtimes (ms)  Speedup
            Proc.    Tran.    Tot.
  1M        …        …        …         …                  …X
  2M        …        …        …         …                  …X

Experiments - Complete SPICE Sim.

  Ckt. Name     # Trans.  Total # Evals.  OmegaSIM (s)  AuSIM (s)  Speedup
                                          CPU-alone     GPU+CPU
  Industrial_1  …         …               …             …          …X
  Industrial_2  …         …               …             …          …X
  Industrial_3  …         …               …             …          …X
  Buf_1         …         …               …             …          …X
  Buf_2         …         …               …             …          …X
  Buf_3         …         …               …             …          …X
  ClockTree_1   …         …               …             …          …X
  ClockTree_2   …         …               …             …          …X
  Avg.                                                             2.36X

- With an increase in the number of transistors, the speedup obtained is higher: more device evaluation calls are made in parallel, so latencies are better hidden
- High accuracy with the single-precision floating-point implementation: over 1M device evals., avg. (max.) error of 2.88 X (9.0 X) Amp.
- Newer devices with double-precision capability are already on the market

Conclusions
- There is significant interest in accelerating SPICE
- 75% of the SPICE runtime is spent in BSIM3 model evaluation, which allows an asymptotic speedup of 4X by Amdahl's law
- Our approach of accelerating model evaluation using GPUs has been implemented and integrated with a commercial fast SPICE tool
  - We obtained a speedup of 2.36X on average
  - BSIM3 model evaluation alone can be sped up by 30-40X over 1M-2M calls
- With a more complicated model like BSIM4:
  - Model evaluation would likely take a yet larger fraction of the SPICE runtime
  - Our approach would likely provide a higher speedup
- With an increase in the number of transistors, a higher speedup is obtained