Accelerating Statistical Static Timing Analysis Using Graphics Processing Units
Kanupriya Gulati & Sunil P. Khatri
Department of ECE, Texas A&M University, College Station

Outline
- Introduction
- Technical Specifications of the GPU
- CUDA Programming Model
- Approach
- Experimental Setup and Results
- Conclusions

Introduction
- Static timing analysis (STA) is heavily used in VLSI design to estimate circuit delay
- The impact of process variations on circuit delay is increasing
- Statistical STA (SSTA) was therefore proposed: it includes the effect of variations while estimating circuit delay
- Monte Carlo (MC) based SSTA accounts for variations by:
  - Generating N delay samples for each gate (a random variable)
  - Executing STA for each sample
  - Aggregating the results to obtain the full circuit delay under variations
- MC based SSTA has several advantages over block-based SSTA: high accuracy, simplicity, and compatibility with fabrication-line data
- Its main disadvantage is an extremely high runtime cost

Introduction
- We accelerate MC based SSTA using graphics processing units (GPUs)
- A GPU is essentially a commodity stream processor:
  - Highly parallel
  - Very fast
  - SIMD (Single Instruction, Multiple Data) operating paradigm
- Owing to their massively parallel architecture, GPUs have been used to accelerate several scientific computations:
  - Image/stream processing
  - Data compression
  - Numerical algorithms (LU decomposition, FFT, etc.)

Introduction
- We implemented our approach on the NVIDIA GeForce 8800 GTX GPU
- By careful engineering, we maximally harness the GPU's raw computational power and huge memory bandwidth
- We used the Compute Unified Device Architecture (CUDA) framework, NVIDIA's freely available GPU programming and interfacing tool
- With a single 8800 GTX GPU card, a ~260X speedup in MC based SSTA is obtained
  - This figure accounts for CPU processing and data transfer times as well
- The SSTA runtimes are also projected for a quad-GPU system
  - NVIDIA SLI technology allows 4 GPU devices on the same motherboard; a ~788X speedup is then possible

GeForce 8800 GTX Technical Specs
- 367 GFLOPS peak performance, several times that of current high-end microprocessors
- 265 GFLOPS sustained for appropriate applications
- Massively parallel: 128 cores, partitioned into 16 multiprocessors
- Massively threaded: sustains 1000s of threads per application
- 768 MB device memory
- 1.4 GHz clock frequency (vs. a CPU at 3.6 GHz)
- 86.4 GB/sec memory bandwidth (vs. a CPU front-side bus at 8 GB/sec)
- Multi-GPU servers are available
  - SLI: quad high-end NVIDIA GPUs on a single motherboard
  - 8-GPU servers have also been announced recently

CUDA Programming Model
- The GPU is viewed as a compute device that:
  - Is a coprocessor to the CPU (the host)
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
[Figure: the host (CPU) connected over PCIe to the device (GPU) and its memory; threads running on the device are instances of a kernel.]

CUDA Programming Model
- Data-parallel portions of an application are executed on the device, in parallel, on many threads
  - Kernel: a code routine executed on the GPU
  - Thread: an instance of a kernel
- Differences between GPU and CPU threads:
  - GPU threads are lightweight, with very little creation overhead
  - The GPU needs 1000s of threads to achieve full parallelism; this allows memory access latencies to be hidden
  - Multi-core CPUs require fewer threads, but the available parallelism is lower
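
As a concrete illustration of the kernel/thread terminology, a minimal sketch (the kernel, its name, and the launch sizes are our assumptions, not the paper's code):

#include <cuda_runtime.h>

// Kernel: the code routine executed on the GPU. Each thread is one
// instance of it, identified by its block and thread indices.
__global__ void scale(float *d, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) d[i] *= k;
}

int main(void) {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));  // device DRAM
    scale<<<n / 256, 256>>>(d_data, 2.0f, n);  // 4096 blocks of 256 threads
    cudaThreadSynchronize();  // wait for completion (CUDA 1.x-era call)
    cudaFree(d_data);
    return 0;
}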

Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks (a.k.a. blocks)
  - All threads within a block share a portion of data memory
  - Threads and blocks have 1D/2D/3D IDs
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free common memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D array of blocks such as Block (1, 1), and each block is a 2D array of threads. Source: "NVIDIA CUDA Programming Guide", version 1.1]
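
A hypothetical fragment illustrating intra-block cooperation through shared memory and barrier synchronization (assumes each block handles exactly 256 elements of d):

// Each block reverses its own 256-element slice of d. The barrier
// makes it safe for a thread to read an element another thread wrote.
__global__ void reverseWithinBlock(float *d) {
    __shared__ float buf[256];    // low-latency memory shared by the block
    int t    = threadIdx.x;       // 1D thread ID within the block
    int base = blockIdx.x * 256;  // this block's slice of d
    buf[t] = d[base + t];
    __syncthreads();              // synchronize: hazard-free sharing
    d[base + t] = buf[255 - t];   // read an element another thread wrote
}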

Device Memory Space Overview
- Each thread has:
  - R/W per-thread registers (a limited number per multiprocessor)
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
    - The main means of communicating data between host and device
    - Contents visible to all threads; coalescing is recommended
  - Read-only per-grid constant memory (cached, visible to all threads)
  - Read-only per-grid texture memory (cached, visible to all threads)
- The host can R/W the global, constant and texture memories
[Figure: the memory hierarchy of a grid — per-thread registers and local memory, per-block shared memory, and per-grid global, constant and texture memories, the last three accessible from the host. Source: "NVIDIA CUDA Programming Guide", version 1.1]
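
These memory spaces map onto CUDA source qualifiers as in this minimal, illustrative fragment (the names are ours; assumes blocks of at most 128 threads):

__constant__ float coeffs[16];  // per-grid constant memory (cached)
texture<float, 1> lut;          // per-grid texture memory (cached)

__global__ void demo(float *out) {   // out points into per-grid global memory
    __shared__ float tile[128];      // per-block shared memory
    int t = threadIdx.x;
    float r = tex1Dfetch(lut, t) * coeffs[0];  // r lives in a per-thread register
    tile[t] = r;
    __syncthreads();
    out[blockIdx.x * blockDim.x + t] = tile[t];
}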

Approach - STA
- STA at a gate: over all inputs, compute the MAX of the SUM of
  - the arrival time at input i, and
  - the pin-to-output (P-to-O) rising (or falling) delay from pin i to the output
- For example, let AT_i^fall denote the arrival time of a falling signal at node i, and AT_i^rise the arrival time of a rising signal at node i. Then
  AT_c^rise = MAX[(AT_a^fall + MAX(D_{11→00}, D_{11→01})), (AT_b^fall + MAX(D_{11→00}, D_{11→10}))]
  - MAX(D_{11→00}, D_{11→01}) denotes the P-to-O rising delay from a to c
  - MAX(D_{11→00}, D_{11→10}) denotes the P-to-O rising delay from b to c
[Figure: a 2-input gate with inputs a and b and output c.]
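
The same formula in code form (a direct transcription, with hypothetical variable names):

#include <math.h>

// Rising arrival time at output c of the 2-input gate above;
// d11_00, d11_01, d11_10 are the input-transition delay values.
float atRiseC(float atFallA, float atFallB,
              float d11_00, float d11_01, float d11_10) {
    float viaA = atFallA + fmaxf(d11_00, d11_01);  // path through input a
    float viaB = atFallB + fmaxf(d11_00, d11_10);  // path through input b
    return fmaxf(viaA, viaB);
}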

Approach - STA
- To implement STA at a gate on the GPU:
  - The P-to-O rising (or falling) delay from every input to the output is stored in a lookup table (LUT)
  - The LUT is stored in the texture memory of the GPU. Key advantages:
    - Cached memory: the LUT data easily fits into the available cache
    - No memory coalescing requirements
    - Efficient built-in texture fetching routines are available in CUDA
    - Loading texture memory takes non-zero time, but this cost is amortized
- For an n-input gate, do the following:
  - Fetch n pin-to-output rising (or falling) delays from texture memory, using the gate-type offset, the pin number, and the falling/rising delay information
  - Perform n SUM computations (pin-to-output delay plus input arrival time)
  - Perform n-1 MAX computations (since CUDA only supports 2-operand MAX operations)
- The per-thread data structure is shown below, followed by a kernel sketch

Approach - STA typedef struct __align__(8){ int offset; //Gate type’s offset float a, b, c, d; // Arrival times for the inputs } threadData;

Approach - SSTA
- SSTA at a gate:
  - We need (µ, σ) for the n Gaussian distributions of the pin-to-output rising and falling delay values for the n inputs
  - (µ, σ) is therefore stored in the LUT for every input, as opposed to storing the nominal delay
- The Mersenne Twister (MT) pseudo-random number generator is used. Its advantages:
  - Long period
  - Efficient use of memory
  - Good distribution properties
  - Easily parallelizable (in a SIMD paradigm)
- The uniformly distributed random number sequences are then transformed into the normal distribution N(0,1), using the Box-Muller transformation (BM); a sketch follows
- Both algorithms, MT and BM, are implemented as separate kernels
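
A minimal Box-Muller sketch (our illustration, not the paper's kernel code): two independent uniform samples in (0,1], such as those produced by the MT kernel, are mapped to two independent N(0,1) samples:

// Box-Muller: u1, u2 are uniform in (0,1]; u1 > 0 so logf is defined.
__device__ void boxMuller(float u1, float u2, float *z0, float *z1) {
    float r   = sqrtf(-2.0f * logf(u1));  // radius from the first uniform
    float phi = 2.0f * 3.14159265f * u2;  // angle from the second uniform
    *z0 = r * cosf(phi);
    *z1 = r * sinf(phi);
}

A sampled gate delay is then obtained as µ + σ·z using the (µ, σ) stored in the LUT.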

Approach - SSTA
- For a circuit, SSTA is performed topologically from inputs to outputs (a sketch of the levelized host-side driver follows this list)
  - Delays of gates at logic depth i are computed and stored in global memory
  - Gates at higher logic depths may use this data as their input arrival times
- GPU performance is maximized by ensuring that:
  - Data dependency between threads issued in parallel is avoided
  - Threads issued in parallel execute the same instruction, but on different data, conforming to the SIMD architecture of GPUs
- A SIMD-based implementation of the MT pseudo-random number generator, specific to the G80 architecture, is used
- Texture memory is used for storing the LUT of µ and σ values
- Global memory writes for level-i gates (and reads for level-(i+1) gates) are performed in a coalesced fashion
- The approach can easily be extended to statistical timing analysis with spatial correlations, since there are existing approaches that implement principal component analysis (PCA) in a SIMD fashion
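
A hypothetical host-side sketch of this levelized flow (the array names and layout are our assumptions): one kernel launch per logic level, so each level's gates run in parallel and read the arrival times the previous level wrote to global memory:

void runLevelizedSSTA(int numLevels, const int *gatesPerLevel,
                      threadData **d_gates, float **d_arrivals) {
    const int THREADS = 256;
    for (int lvl = 0; lvl < numLevels; ++lvl) {
        int n      = gatesPerLevel[lvl];
        int blocks = (n + THREADS - 1) / THREADS;
        // Reuses the staKernel sketch shown earlier; gates within a
        // level have no data dependency, so they all run in parallel.
        staKernel<<<blocks, THREADS>>>(d_gates[lvl], d_arrivals[lvl], n);
    }
    cudaThreadSynchronize();  // ensure all levels have completed
}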

Experiments
- MC based SSTA runtimes on the 8800 GTX are compared to a CPU-based implementation
- 30 large IWLS and ITC benchmarks are used
- Monte Carlo analysis is performed using 64K samples for all 30 circuits
- CPU runtimes are computed:
  - On a 3.6 GHz Intel processor with 3 GB RAM, running Linux
  - Using getrusage (system + user) time; a sketch of this measurement follows
- GPU (wall clock) time is computed using CUDA on the GeForce 8800 GTX
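
A minimal sketch of the (system + user) time measurement via getrusage, as stated above (the wrapper function is ours):

#include <sys/resource.h>

/* Returns this process's CPU time (user + system) in seconds. */
double cpuSeconds(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (ru.ru_utime.tv_sec  + ru.ru_stime.tv_sec) +
           (ru.ru_utime.tv_usec + ru.ru_stime.tv_usec) * 1e-6;
}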

Experiments
- GPU time includes GPU ↔ CPU data transfer time:
  - CPU → GPU: the arrival time at each primary input, and µ and σ for all pin-to-output delays of all gates
  - GPU → CPU: 64K delay values at each primary output
- GPU times also include the time spent in the MT and BM kernels and in loading texture memory
- Computation results have been verified for correctness
- For the SLI Quad system, the runtimes are obtained by scaling the processing times only; transfer times are still included (but are not scaled)

Results
- Using a single GPU, the speedup over the CPU is ~260X
- Projecting to the SLI Quad system shows a speedup of ~788X
- Recently, an 8-GPU NVIDIA Tesla server has been announced
- Block-based SSTA can achieve a ~500X speedup over MC based SSTA [Le et al., DAC 2004]; however, it reports a 2% error compared to MC based SSTA
[Table: per-circuit runtimes (s) on the GPU, the projected SLI Quad, and the CPU, with the corresponding GPU and SLI Quad speedups, for the 30 benchmarks (including s9234_, b22_ and b15_) and their average; the numeric entries were not preserved in this transcript.]

Conclusions
- STA is used in the VLSI design flow to estimate circuit delay; process variations are growing larger and less systematic
- MC based SSTA accounts for these variations, and has several advantages, such as high accuracy and compatibility with data obtained from the fab line; its main disadvantage is an extremely high runtime cost
- We accelerate MC based SSTA using graphics processing units (GPUs); by careful engineering, we maximally harness the GPU's raw computational power and huge memory bandwidth
- Using a single 8800 GTX GPU, a ~260X speedup in MC based SSTA is obtained
- Projecting the SSTA runtimes onto a quad-GPU system, a ~788X speedup is possible

Thank You!