Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Kanupriya Gulati and Sunil P. Khatri Department of ECE, Texas A&M University, College Station, TX ASPDAC 2009

Outline Preliminaries Previous works The proposed approach Experimental results Conclusions

Preliminaries Static Timing Analysis Statistical Static Timing Analysis Monte Carlo method Some differences between GPU and CPU

Static Timing Analysis (STA) At each gate, the output arrival time is the MAX, over all input pins i, of the SUM of the arrival time at pin i and the pin-to-output rising (or falling) delay from pin i to the output. Gate delays are either stored in a look-up table (LUT) per gate type or computed from specific equations. The worst-case delay is used as the representative value.

STA example We use a 2-input NAND as an example.
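The figure for this example is not part of the transcript, so the following is only a minimal sketch of the MAX-of-SUM rule for one gate. The function name `gate_arrival_time` and all numeric values are hypothetical and do not come from the slides.

```python
def gate_arrival_time(input_arrivals, pin_to_output_delays):
    """Output arrival time: MAX over pins of (arrival at pin + pin-to-output delay)."""
    if len(input_arrivals) != len(pin_to_output_delays):
        raise ValueError("one pin-to-output delay is needed per input pin")
    return max(a + d for a, d in zip(input_arrivals, pin_to_output_delays))


# Hypothetical 2-input gate, arbitrary time units (values are illustrative only).
arrivals = [3.0, 5.0]   # arrival times at pins 1 and 2
delays = [2.5, 1.0]     # pin-to-output delays for pins 1 and 2
print(gate_arrival_time(arrivals, delays))  # prints 6.0
```

Pin 1 contributes 3.0 + 2.5 = 5.5 and pin 2 contributes 5.0 + 1.0 = 6.0, so the later pin determines the output arrival time.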

Pros and Cons of STA Pros: can be computed very fast; the result is easy to interpret. Cons: not very precise; hard to account for process variation; moreover, variations are becoming less systematic.

Statistical Static Timing Analysis (SSTA) Apply probability and statistics to signals, gates, etc. The basic idea is the same: MAX and SUM. Requires either generating random samples or operating on probability density functions (PDFs) directly.

Why SSTA? To deal with variations and to move beyond the limitations of the deterministic nature of traditional STA techniques. The main idea is to include the effect of variations in order to analyze circuit delay more accurately.

Pros and Cons of SSTA Pros: can deal with variations; high accuracy. Cons: high runtime cost for accurate methods; results may differ considerably between methods.

Monte Carlo method There is no single Monte Carlo method; instead, the term describes a large and widely used class of approaches. However, these approaches tend to follow a particular pattern: (1) define a domain of possible inputs; (2) generate inputs randomly from the domain using a specified probability distribution; (3) perform a deterministic computation using the inputs; (4) aggregate the results of the individual computations into the final result.

A simple example for Monte Carlo method How can we approximate π? Draw a square on the ground and inscribe a circle within it. Scatter objects of uniform size uniformly over the square. Counting the number of objects in the circle and dividing by the total number of objects in the square yields an approximation of π / 4; multiplying by 4 gives the estimate of π.

A simple example for Monte Carlo method (cont.)

Generally speaking: the more objects (samples), the higher the precision; the smaller the objects (unit of samples), the higher the precision; the distribution of the objects (distribution function of the samples) affects the result.
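The π example can be sketched in a few lines. This is an illustrative sketch only: points stand in for the scattered objects, and the function name `estimate_pi` and the seed are assumptions, not part of the slides.

```python
import random


def estimate_pi(num_samples, seed=None):
    """Estimate pi by scattering points uniformly over the square [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:   # the point landed inside the inscribed circle
            inside += 1
    # inside / num_samples approximates pi / 4, so scale by 4.
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    # More samples should generally give a closer estimate.
    for n in (100, 10_000, 1_000_000):
        print(n, estimate_pi(n, seed=1))
```

The four steps of the pattern map directly onto the code: the square is the input domain, `rng.uniform` draws the inputs, the circle test is the deterministic computation, and the final ratio is the aggregation.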

About some differences between GPU and CPU
Item | GPU (NVIDIA GeForce 8800 GTX) | CPU (Intel Pentium 4)
Cores and clock rate | 128 / 575MHz (core clock), 1.35GHz (shader clock) | 1 / 3.0GHz
FLOPS | 345.6G | ~12G
Memory bandwidth | 86.4GB/s (900MHz memory clock, 384 bit interface, 2 issues) | 6.4GB/s (800MHz memory clock, 32 bit interface, 2 issues)
Access time of global memory | Slow (about 500 memory clock cycles) | Fast (about 5 memory clock cycles)
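The bandwidth and FLOPS figures above can be reproduced from the listed clock rates. The sketch below assumes that "2 issues" means two transfers per memory clock, and that each GPU core completes one multiply-add (counted as 2 floating-point operations) per shader clock; both readings are assumptions that happen to match the quoted numbers. The function names are hypothetical.

```python
def peak_bandwidth_gb_per_s(memory_clock_mhz, bus_width_bits, transfers_per_clock):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * bytes_per_transfer * transfers_per_clock / 1e9


def peak_gflops(cores, shader_clock_ghz, flops_per_core_per_clock=2):
    """Peak GFLOPS, assuming one multiply-add (2 flops) per core per clock."""
    return cores * shader_clock_ghz * flops_per_core_per_clock


if __name__ == "__main__":
    print("GPU bandwidth (GB/s):", peak_bandwidth_gb_per_s(900, 384, 2))
    print("CPU bandwidth (GB/s):", peak_bandwidth_gb_per_s(800, 32, 2))
    print("GPU peak (GFLOPS):", peak_gflops(128, 1.35))
```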

Abstract comparisons of memory between GPU and CPU (cont.)
CPU (Intel Pentium 4) | GPU (NVIDIA GeForce 8800 GTX)
Register | Register
Cache | Texture cache or Constant cache
Main memory | Shared memory
Hard disk | Global memory, Texture memory, Constant memory

Previous works Block-based SSTA: performs statistical MAX and SUM operations while traversing the circuit level by level (BFS); fast but not very accurate. Path-based SSTA: calculates the delay PDF of each selected path; can be accurate, but it is hard to decide which paths to select.

Previous works (cont.) Block-based SSTA approaches such as [14][15][16] are fast but only approximate. Path-based SSTA such as [17], which uses Gaussian distribution propagation, is also an approximation. [19][20][21] propose faster algorithms that compute only bounds on the result. [22][23][24][25] operate directly on PDFs.

The proposed approach Monte Carlo based SSTA on GPU, using the Mersenne Twister pseudo-random number generator and Box-Muller transformations. Gate delays are computed as in the path-based SSTA approach. The circuit is traversed as in the block-based SSTA approach.

Monte Carlo based SSTA Generate gate delay samples according to μ and σ. Perform STA for each set of samples. Aggregate the results to produce the full circuit delay distribution. The spirit of the Monte Carlo method: the more objects (samples), the higher the precision.
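The three steps can be sketched on a CPU as follows. This is a simplified sketch, not the GPU kernel from the talk: the netlist, its μ and σ values, and all names are hypothetical, each gate has a single delay rather than one per pin, and Python's `random.gauss` stands in for the Mersenne Twister plus Box-Muller pipeline.

```python
import random
import statistics

# Hypothetical netlist in topological order: (gate, fanins, mu, sigma).
NETLIST = [
    ("g1", ["a", "b"], 1.0, 0.10),
    ("g2", ["b", "c"], 1.2, 0.10),
    ("g3", ["g1", "g2"], 0.8, 0.05),
]
PRIMARY_INPUTS = ["a", "b", "c"]   # assumed to arrive at time 0
OUTPUT = "g3"


def sta_one_sample(rng):
    """One deterministic STA pass using one random delay sample per gate."""
    arrival = {pi: 0.0 for pi in PRIMARY_INPUTS}
    for gate, fanins, mu, sigma in NETLIST:
        delay = rng.gauss(mu, sigma)                       # sample the gate delay
        arrival[gate] = max(arrival[f] for f in fanins) + delay   # MAX, then SUM
    return arrival[OUTPUT]


def monte_carlo_ssta(num_samples, seed=0):
    """Aggregate per-sample STA results into a mean and standard deviation."""
    rng = random.Random(seed)
    samples = [sta_one_sample(rng) for _ in range(num_samples)]
    return statistics.mean(samples), statistics.stdev(samples)


if __name__ == "__main__":
    mean, std = monte_carlo_ssta(10_000)
    print(f"circuit delay: mean={mean:.3f}, std={std:.3f}")
```

Each call to `sta_one_sample` is independent of every other call, which is what makes the method map well onto many parallel threads.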

Why Monte Carlo based SSTA on GPU? Sample parallelism: the generation of samples and the corresponding static timing analysis for a single gate can be executed in parallel, with no data dependency between samples. Data parallelism: gates at the same logic level can execute Monte Carlo based SSTA in parallel.

Why Monte Carlo based SSTA on GPU? (cont.) SIMD execution on the GPU: the Mersenne Twister pseudo-random number generator followed by Box-Muller transformations can be executed in parallel. Large memory bandwidth of the GPU: extremely fast lookups. Many GPU threads: STA with many samples can be executed quickly, and memory access latency can be hidden well.
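The data parallelism mentioned above depends on grouping gates by logic level, since gates in the same level do not depend on one another. A minimal sketch of such a levelization follows; the function name `levelize` and the example netlist are hypothetical, and the talk does not say how levels were actually computed.

```python
def levelize(fanins, primary_inputs):
    """Group gates by logic level; primary inputs are level 0.

    fanins maps each gate name to the list of nodes driving it.
    Returns a list of levels, each a list of gates that could be
    processed in parallel once all earlier levels are done.
    """
    level = {pi: 0 for pi in primary_inputs}

    def level_of(node):
        # A gate sits one level above its deepest fanin.
        if node not in level:
            level[node] = 1 + max(level_of(f) for f in fanins[node])
        return level[node]

    groups = {}
    for gate in fanins:
        groups.setdefault(level_of(gate), []).append(gate)
    return [groups[lvl] for lvl in sorted(groups)]


if __name__ == "__main__":
    netlist = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
    print(levelize(netlist, ["a", "b", "c"]))  # prints [['g1', 'g2'], ['g3']]
```

In this example `g1` and `g2` can be evaluated together for all samples, and `g3` only after both have finished.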

Mersenne Twister pseudo-random number algorithm Developed in 1997 by Makoto Matsumoto and Takuji Nishimura; based on a matrix linear recurrence over the finite binary field F2. For a k-bit word length, the Mersenne Twister generates numbers with an almost uniform distribution in the range [0, 2^k - 1]. Advantages: long period, efficient use of memory, good distribution properties, and high performance.
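As a small illustration, CPython's standard `random` module uses the Mersenne Twister (MT19937) as its core generator, so it can stand in for the GPU implementation used in the talk. The helper names and the seed below are hypothetical.

```python
import random


def uniform_words(count, k=32, seed=2009):
    """Return `count` pseudo-random k-bit integers in [0, 2^k - 1]."""
    rng = random.Random(seed)          # Mersenne Twister under the hood
    return [rng.getrandbits(k) for _ in range(count)]


def to_unit_interval(word, k=32):
    """Map a k-bit word to (0, 1]; avoiding 0 keeps a later log() well defined."""
    return (word + 1) / (2 ** k)


if __name__ == "__main__":
    words = uniform_words(5)
    print(words)
    print([to_unit_interval(w) for w in words])
```

Mapping the words into (0, 1] rather than [0, 1) is a design choice made here so that the values can be fed to a logarithm without a special case.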

Box-Muller transformations Given a source of uniformly distributed random numbers, the Box-Muller transformation generates pairs of independent, standard normally distributed (zero expectation, unit variance) random numbers, i.e. it transforms uniform samples into N(0,1) samples. Developed by George Edward Pelham Box and Mervin Edgar Muller in 1958.
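A minimal sketch of the basic (trigonometric) form of the transformation, followed by scaling to a gate delay with mean μ and standard deviation σ. The function names, the seed, and the use of Python's `random` module as the uniform source are assumptions for illustration.

```python
import math
import random


def box_muller(u1, u2):
    """Map independent uniforms u1 in (0, 1] and u2 in [0, 1) to two N(0, 1) values."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def gate_delay_samples(mu, sigma, count, seed=0):
    """Draw `count` delay samples from N(mu, sigma^2) via Box-Muller."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        u1 = 1.0 - rng.random()        # random() is in [0, 1), so u1 is in (0, 1]
        u2 = rng.random()
        z0, z1 = box_muller(u1, u2)
        samples.extend([mu + sigma * z0, mu + sigma * z1])
    return samples[:count]


if __name__ == "__main__":
    delays = gate_delay_samples(mu=1.0, sigma=0.1, count=8)
    print(delays)
```

Each pair of uniform inputs yields two normal outputs, so a generator producing N uniform numbers supplies N normally distributed delay samples.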

Monte Carlo based SSTA kernel

Example Suppose a random number sequence: 0.1 -0.2 0.2 -0.2 0.4 0.1 -0.3 0 0.5 0.1 -0.4 0.2 0.3 -0.2 -0.5 0.3 0.1 0

Experimental results GPU: NVIDIA GeForce 8800 GTX graphics card with 768MB memory; other specifications are listed in previous slides. Compared against: 3.6GHz CPU with 3GB memory, running Linux. Monte Carlo analysis was performed with 64K samples.

Experimental results - Some comparisons Running 16M threads of the SSTA kernel: CPU took 37.158 sec; GPU took 0.115 sec; about 320x faster. Mersenne Twister generator: CPU generates about 2.24*10^7 numbers/sec; GPU generates about 2.33*10^9 numbers/sec; about 100x faster.

Experimental results – 30 cases

Conclusions Monte Carlo based SSTA on GPU. Mersenne Twister generator and Box-Muller transformation. Combination of the path-based and block-based SSTA approaches. No loss of accuracy and very fast.
