Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University,

Presentation transcript:

Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University, College Station, TX ASPDAC 2009

Outline Preliminaries Previous works The proposed approach Experimental results Conclusions

Preliminaries
- Static Timing Analysis
- Statistical Static Timing Analysis
- Monte Carlo method
- Some differences between GPU and CPU

Static Timing Analysis (STA)
- At each gate, the output arrival time is the MAX, over all input pins i, of the SUM of the arrival time at pin i and the pin-to-output rising (or falling) delay from pin i to the output.
- Gate delays are either read from a lookup table (LUT) per gate type or computed from delay equations.
- The worst-case delay is used as the representative value.
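The MAX-of-SUM rule can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from the talk: the netlist format, the gate names (`g1`, `g2`), the delay values, and the assumption that primary inputs arrive at time 0 are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical netlist format: gate -> list of (fanin name, pin-to-output delay).
# Names without an entry are primary inputs, assumed to arrive at time 0.
Netlist = Dict[str, List[Tuple[str, float]]]


def arrival_times(netlist: Netlist, order: List[str]) -> Dict[str, float]:
    """Propagate worst-case arrival times through gates in topological order."""
    arrival: Dict[str, float] = {}
    for gate in order:
        # MAX over input pins of (input arrival time + pin-to-output delay).
        arrival[gate] = max(
            arrival.get(src, 0.0) + delay for src, delay in netlist[gate]
        )
    return arrival


netlist = {
    "g1": [("a", 1.0), ("b", 1.5)],
    "g2": [("g1", 2.0), ("c", 0.5)],
}
times = arrival_times(netlist, ["g1", "g2"])
print(times)
```

A lookup table per gate type would replace the literal delay values used here.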

STA example We use a 2-input NAND gate as an example.

Pros and Cons of STA
Pros
- Very fast to compute.
- Easy to interpret.
Cons
- Not very precise.
- Hard to account for process variation.
- Moreover, variations are becoming less systematic.

Statistical Static Timing Analysis (SSTA)
- Applies probability and statistics to signals, gates, etc.
- The basic idea is the same as in STA: MAX and SUM.
- Requires either generating random samples or operating on probability distribution functions (PDFs) directly.

Why SSTA? To deal with variations and to move beyond the limitations of the deterministic nature of traditional STA techniques. The main idea is to include the effect of variations in order to analyze circuit delay more accurately.

Pros and Cons of SSTA
Pros
- Can handle variations.
- High accuracy.
Cons
- High runtime cost for accurate methods.
- Results may differ significantly between methods.

Monte Carlo method
There is no single Monte Carlo method; instead, the term describes a large and widely used class of approaches. However, these approaches tend to follow a particular pattern:
- Define a domain of possible inputs
- Generate inputs randomly from the domain using a certain specified probability distribution
- Perform a deterministic computation using the inputs
- Aggregate the results of the individual computations into the final result

A simple example for Monte Carlo method
- How can we approximate π?
- Draw a square on the ground and inscribe a circle in it.
- Scatter objects of uniform size uniformly over the square.
- Count the objects inside the circle and divide by the total number of objects in the square; the ratio approximates π / 4.

A simple example for Monte Carlo method (cont.)

Generally speaking
- The more objects (samples), the higher the precision.
- The smaller the objects (unit of samples), the higher the precision.
- The distribution of the objects (distribution function of the samples) affects the result.
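The square-and-circle experiment can be reproduced with a short Python sketch. The function name `estimate_pi`, the unit square, and the fixed seed are illustrative assumptions; the objects are idealized as points.

```python
import math
import random


def estimate_pi(num_samples: int, seed: int = 0) -> float:
    """Estimate pi by scattering points uniformly over the unit square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The inscribed circle has centre (0.5, 0.5) and radius 0.5.
        if (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.25:
            inside += 1
    # inside / num_samples approximates pi / 4.
    return 4.0 * inside / num_samples


for n in (100, 10_000, 1_000_000):
    estimate = estimate_pi(n)
    print(n, estimate, abs(estimate - math.pi))
```

The error tends to shrink as the sample count grows, although any single run can deviate from that trend.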

About some differences between GPU and CPU
Item | GPU (NVIDIA GeForce 8800 GTX) | CPU (Intel Pentium 4)
Cores and clock rate | 128 / 575 MHz (core clock), 1.35 GHz (shader clock) | 1 / 3.0 GHz
FLOPS | 345.6 G | ~12 G
Memory bandwidth | 86.4 GB/s (900 MHz memory clock, 384-bit interface, 2 issues) | 6.4 GB/s (800 MHz memory clock, 32-bit interface, 2 issues)
Access time of global memory | Slow (about 500 memory clock cycles) | Fast (about 5 memory clock cycles)
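The headline figures in this comparison can be checked with simple arithmetic. The sketch below assumes that bandwidth equals memory clock times bus width in bytes times issues per clock, and that the GPU peak corresponds to 2 floating-point operations per core per shader clock. Both formulas are assumptions for illustration; the slides list only the inputs and the results.

```python
def bandwidth_gb_per_s(clock_mhz: float, bus_bits: int, issues_per_clock: int) -> float:
    """Peak bandwidth = clock * bus width in bytes * issues per clock (assumed formula)."""
    return clock_mhz * 1e6 * (bus_bits / 8) * issues_per_clock / 1e9


gpu_bandwidth = bandwidth_gb_per_s(900, 384, 2)  # GeForce 8800 GTX figures from the slide
cpu_bandwidth = bandwidth_gb_per_s(800, 32, 2)   # Pentium 4 figures from the slide

# Assumption: 2 floating-point operations per core per shader clock.
gpu_gflops = 128 * 1.35 * 2

print(gpu_bandwidth, cpu_bandwidth, gpu_gflops)
```

Under these assumptions the computed values agree with the 86.4 GB/s, 6.4 GB/s, and 345.6 GFLOPS listed above.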

Abstract comparisons of memory between GPU and CPU (cont.)
CPU (Intel Pentium 4) | GPU (NVIDIA GeForce 8800 GTX)
Register | Register
Cache | Texture cache or Constant cache
Main memory | Shared memory
Hard disk | Global memory, Texture memory, Constant memory

Outline Preliminaries Previous works The proposed approach Experimental results Conclusions

Previous works
Block-based SSTA
- Performs statistical MAX and SUM operations while traversing the circuit level by level, in breadth-first (BFS) order.
- Fast, but not very accurate.
Path-based SSTA
- Calculates the delay PDF of each selected path.
- Can be accurate, but it is hard to decide which paths to select.

Previous works (cont.)
- Block-based SSTA approaches such as [14][15][16] are fast but only approximate.
- Path-based SSTA approaches such as [17], which propagate Gaussian distributions, are also approximate.
- [19][20][21] propose faster algorithms that compute only bounds on the result.
- [22][23][24][25] operate directly on PDFs.

Outline Preliminaries Previous works The proposed approach Experimental results Conclusions

The proposed approach
- Monte Carlo based SSTA on a GPU, using the Mersenne Twister pseudo-random number generator and Box-Muller transformations.
- Computes gate delays as in the path-based SSTA approach.
- Traverses the circuit as in the block-based SSTA approach.

Monte Carlo based SSTA
- Generate gate delay samples according to μ and σ.
- Perform STA for each set of samples.
- Aggregate the results to produce the full circuit delay distribution.
- The spirit of the Monte Carlo method: the more samples, the higher the precision.
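The three steps (sample, run STA, aggregate) can be sketched on the CPU as follows. This is a simplified illustration rather than the GPU kernel from the talk: the three-gate circuit, its μ and σ values, and the use of one delay per gate instead of one per pin are all assumptions.

```python
import random
import statistics

# Hypothetical circuit: gate -> (fanins, mean delay mu, standard deviation sigma).
# Names without an entry are primary inputs, assumed to arrive at time 0.
circuit = {
    "g1": (["a", "b"], 1.0, 0.1),
    "g2": (["b", "c"], 1.2, 0.1),
    "g3": (["g1", "g2"], 0.8, 0.05),
}
order = ["g1", "g2", "g3"]  # topological order; g3 is the circuit output


def sample_circuit_delay(rng: random.Random) -> float:
    """Run one STA pass using one random delay sample per gate."""
    arrival = {}
    for gate in order:
        fanins, mu, sigma = circuit[gate]
        delay = rng.gauss(mu, sigma)  # gate delay sample drawn from N(mu, sigma)
        arrival[gate] = max(arrival.get(f, 0.0) for f in fanins) + delay
    return arrival["g3"]


def monte_carlo_ssta(num_samples: int, seed: int = 0) -> list:
    """Collect the circuit delay for each independent set of samples."""
    rng = random.Random(seed)
    return [sample_circuit_delay(rng) for _ in range(num_samples)]


delays = monte_carlo_ssta(10_000)
print(statistics.mean(delays), statistics.stdev(delays))
```

The list `delays` is the empirical circuit delay distribution; a histogram of it would approximate the delay PDF.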

Why Monte Carlo based SSTA on GPU?
Sample parallelism
- For a single gate, the generation of samples and the corresponding static timing analysis computations can be executed in parallel, with no data dependency.
Data parallelism
- Gates at the same logic level can execute Monte Carlo based SSTA in parallel.
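Data parallelism depends on knowing which gates share a logic level. The sketch below groups gates by level on the CPU; the `fanins` dictionary and gate names are hypothetical, and the code only identifies the independent groups without launching anything in parallel.

```python
from typing import Dict, List


def logic_levels(fanins: Dict[str, List[str]]) -> Dict[str, int]:
    """Assign each gate a level: 1 + the maximum level of its fanins."""
    level: Dict[str, int] = {}

    def visit(node: str) -> int:
        if node not in fanins:  # primary input
            return 0
        if node not in level:
            level[node] = 1 + max(visit(f) for f in fanins[node])
        return level[node]

    for gate in fanins:
        visit(gate)
    return level


def group_by_level(fanins: Dict[str, List[str]]) -> List[List[str]]:
    """Return gates grouped by level; gates within one group are independent."""
    groups: Dict[int, List[str]] = {}
    for gate, lvl in logic_levels(fanins).items():
        groups.setdefault(lvl, []).append(gate)
    return [sorted(groups[lvl]) for lvl in sorted(groups)]


fanins = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
print(group_by_level(fanins))
```

Each group could then be processed concurrently, with every gate in the group also evaluating its samples in parallel.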

Why Monte Carlo based SSTA on GPU? (cont.)
SIMD execution on the GPU
- The Mersenne Twister pseudo-random number generator, followed by Box-Muller transformations, runs in parallel.
Large memory bandwidth of the GPU
- Extremely fast lookups.
Many threads on the GPU
- STA with many samples executes quickly.
- Memory access latency can be hidden well.

Mersenne Twister pseudo-random number algorithm
- Developed in 1997 by Makoto Matsumoto and Takuji Nishimura; based on a matrix linear recurrence over the finite binary field F2.
- For a k-bit word length, the Mersenne Twister generates numbers with an almost uniform distribution in the range [0, 2^k - 1].
- Long period, efficient use of memory, good distribution properties, and high performance.
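Python's standard `random` module uses the Mersenne Twister (MT19937) as its core generator, so it offers a quick way to observe k-bit output on the CPU. This is only an illustration of the generator's interface, not the GPU implementation used in the talk; the seed value and k = 32 are arbitrary choices.

```python
import random

k = 32                     # word length in bits (arbitrary choice for this sketch)
rng = random.Random(2009)  # arbitrary seed; Random is backed by MT19937

# Draw a few k-bit words; each lies in the range [0, 2^k - 1].
words = [rng.getrandbits(k) for _ in range(5)]
print(words)

# Re-seeding with the same value reproduces the same sequence.
replay = random.Random(2009)
same_words = [replay.getrandbits(k) for _ in range(5)]
print(words == same_words)
```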

Box-Muller transformations
- Input: a source of uniformly distributed random numbers.
- Output: pairs of independent, standard normally distributed (zero expectation, unit variance) random numbers, i.e. a transform into N(0,1).
- Developed by George Edward Pelham Box and Mervin Edgar Muller in 1958.
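The basic (trigonometric) form of the Box-Muller transform is short enough to sketch directly. The helper names `box_muller`, `normal_samples`, and `scale` are illustrative assumptions, and the talk may use a different variant of the transform.

```python
import math
import random
import statistics


def box_muller(u1: float, u2: float) -> tuple:
    """Map two uniform samples (u1 in (0, 1], u2 in [0, 1)) to two N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_samples(count: int, seed: int = 0) -> list:
    """Generate standard normal samples from a uniform source."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is defined
        u2 = rng.random()
        samples.extend(box_muller(u1, u2))
    return samples[:count]


def scale(z: float, mu: float, sigma: float) -> float:
    """Turn an N(0, 1) sample into a delay sample with mean mu and deviation sigma."""
    return mu + sigma * z


samples = normal_samples(20_000)
print(statistics.mean(samples), statistics.stdev(samples))
```

Scaling each N(0,1) value by σ and shifting by μ yields gate delay samples with the required mean and standard deviation.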

Monte Carlo based SSTA kernel

Example Suppose a random number sequence:

Outline Preliminaries Previous works The proposed approach Experimental results Conclusions

Experimental results
NVIDIA GeForce 8800 GTX graphics card
- 768MB memory
- Other specifications are listed in the previous slides
Comparison environment
- 3.6GHz CPU with 3GB memory
- Linux
Monte Carlo analysis was performed with 64K samples

Experimental results - Some comparisons
Running 16M threads of the SSTA kernel
- CPU took sec
- GPU took sec
- About 320x faster
Mersenne Twister generator
- CPU generates about 2.24*10^7 numbers/sec
- GPU generates about 2.33*10^9 numbers/sec
- About 100x faster

Experimental results – 30 cases

Outline Preliminaries Previous works The proposed approach Experimental results Conclusions

- Monte Carlo based SSTA on GPU
- Mersenne Twister generator and Box-Muller transformation
- Combination of the path-based SSTA approach and the block-based SSTA approach
- No loss of accuracy and ultra fast