Transient Analysis CK Cheng UC San Diego Jan. 25, 2007
Outline Research Directions Simulation test case results Overview of Simulation Commercial Package Alternating direction implicit (ADI) Method General Operator Splitting Method Distributed Computing Conclusions and Future Works
Research Directions Simulation: SPICE, STA Network on Chip: topology and wire styles, Power, and Clock Networks Data Path Components: adders, shifters, multipliers, division Packaging: passive distortion compensation
6x6 Bump Simulation Results
The Circuit:
–184K Capacitors, 17K Current Sources, 120K Inductors and 246K Resistors.
–306K Nodes
Accuracy:
–Waveform and measurement results match Fujitsu's with less than 0.002% error.
Runtime / Memory Comparison:
                    CPU_Time   Memory   Computer Used
UCSD                678s       600.2M   Pentium 4 3.2G, Linux
Fujitsu Log File    1845s      771M     unknown
6x6 Bump Simulation Results
Measurement results and waveform
                    Min_pwr_l_est_   Min_     Min_
UCSD
Fujitsu Log File
Error               0.002%           0.004%   0.005%
(Red curve is UCSD result)
703KR Simulation Results
The Circuit:
–514K Capacitors, 76K Current Sources, 370K Inductors and 703K Resistors.
–1.3M Nodes
Accuracy:
–Measurement results match Fujitsu's with less than 0.02% error.
Runtime / Memory Comparison:
                    CPU_Time          Memory   Computer Used
UCSD                2575s (0.7h)      1.7G     Pentium 4 3.2G, Linux
Fujitsu Log File    864561s (240h)    2.28G    unknown
703KR Simulation Results
Measurement results and waveform
                    Min_     Min_    Min_
UCSD
Fujitsu Log File
Error               0.015%   0.02%   0.026%
(UCSD results only. Fujitsu waveform is not available for comparison)
Further Speed-ups Reduce iteration count by 50% for pure linear circuits (like 6x6 bump and 703KR) –2x speed up More effective time step control –DVDT, breakpoint, truncation error – x speed up Use Multigrid solver – x speed up for medium circuits (6x6 bump) –2x – 10x speed up for large circuits (703KR) Parallel simulation –4 or more processors on a Linux cluster –32 to hundreds of processors on a supercomputer Overall speed-up –6x – 60x speed up without parallel simulation –12x – x speed up with parallel simulation
Performance and capacity prediction
Cases 10x-100x larger than 703KR.
Case size                      Preferred Solver       CPU Time        Memory
Small - Medium, 0.3M nodes     LU Decomposition       11 minutes      600M
Medium - Large, 1.3M nodes     Multigrid              43 minutes      1.7G
Huge, 10–100M nodes            Multigrid + Parallel   5 – 100 hours   15G - 200G
Overview of Simulation Our research Fast speed with SPICE accuracy Nonlinear devices Efficient matrix solvers Effective integration methods Time step controls according to different integration methods Distributed computing (Flowchart boxes: Load Circuit, Device Evaluation, LU Decomposition, N-R Converge? [Yes / No], Next Time Point, Time Step Control, Integration Approximation, Linearization)
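The flowchart boxes above can be read as the usual SPICE-style transient loop. The following is a minimal sketch, assuming a hypothetical one-node test circuit (source, resistor, capacitor, diode) that is not from the slides, a fixed time step (time step control is omitted), and backward Euler as the integration approximation. The names `diode`, `transient`, and all component values are illustrative assumptions.

```python
import math

# Hypothetical test circuit (not from the slides): a voltage source VS drives
# node v through resistor R; a capacitor C and a diode connect v to ground.
VS, R, C = 1.0, 1e3, 1e-9
IS, VT = 1e-14, 0.02585  # diode saturation current and thermal voltage


def diode(v):
    """Device evaluation: diode current and its small-signal conductance."""
    e = math.exp(v / VT)
    return IS * (e - 1.0), IS * e / VT


def transient(t_end, h, tol=1e-10, max_iter=50):
    """Backward Euler integration with a Newton-Raphson loop per time point."""
    v, t, wave = 0.0, 0.0, [(0.0, 0.0)]
    n_steps = int(round(t_end / h))
    for _step in range(n_steps):
        v_prev = v
        for _iteration in range(max_iter):
            i_d, g_d = diode(v)
            # KCL residual with dv/dt replaced by (v - v_prev) / h.
            f = C * (v - v_prev) / h - (VS - v) / R + i_d
            # Linearization: Jacobian of the residual (a 1x1 "matrix" here).
            jac = C / h + 1.0 / R + g_d
            dv = -f / jac  # linear solve; LU decomposition for larger systems
            v += dv
            if abs(dv) < tol:  # N-R converged: move to the next time point
                break
        else:
            raise RuntimeError("Newton-Raphson did not converge")
        t += h
        wave.append((t, v))
    return wave
```

Calling `transient(1e-5, 1e-7)` returns a list of `(time, voltage)` pairs; the node voltage settles where the resistor current equals the diode current.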
Overview of Simulation Matrix Solver LU Decomposition Iterative Approach Integration Time Step Control ADI Nonlinear Devices Two Stage Newton Raphson Distributed Computing Commercial Implementation
Overview of Simulation Integration Time Step Control ADI (two-way partitioning) Operator Splitting (multi-way) Distributed Computing MPI Partitioning Three Ph.D. Students
Commercial Package: Fastrack Design Founded in January 2001 Headquartered in San Jose Privately funded, cash-flow positive Two Business Units Design Services Technology Products
Analog Designs
Design               # Elements   Sim. Len   HSpice       mSPICE   Speedup Factor
LVDS                              us         80h          26h      3.1X
Oscillator           2221 ms                 13,706s      2,670s   5.1X
Biasing Circuit                   ns         427s         82s      5.2X
PLL                               us         67d          12d      5.6X
PLL (post-layout)    300K         40us       290d (est)   16d      18.1X
Digital Blocks
Design Name   Devices                    Runtime                          Speedup Factor
              MOS     R        C         mSPICE   Traditional Spice
ALU           10.1k   12.7k    7.5k      6.9m     7m                      1.0X
CONTROL       69k     83.7k    52.5k     1.5h     9.5h                    6.3X
YN_BLK        205K    242.8k   203.9k    3.5h     > 2d                    >13.7X
THP           437k    499.3k   313.5k    5.0h     COULD NOT RUN           ∞
VCON          936k    753k     561k      15.0h    COULD NOT RUN           ∞
Memory Blocks
Design                 #Tr     #R       #C      # Vectors / Sim. Length   mSPICE Run Time
BRAM (pre)             220K                                               hours
SRAM (pre) 8Kx8 SP     410K    0027 hours
eRAM (post) 256x16     72K     28K      427K    48ns                      8 hours
BRAM (post)            220K    1320K    870K    218 hours
100% accurate Spice simulation
mSPICE-Parallel Industry’s first practical parallel Spice simulation solution –Increases capacity further –Dramatically improves throughput Uses Matrix Level Partitioning –No loss of accuracy –Client-Server configuration –Minimal memory requirement for client nodes
Client-Server Configuration Server distributes sub-matrices to clients Clients communicate partial solutions Minimal memory requirements for clients
Experimental Results
Design                    Total Elements   Sim. Length   Runtime
                                                         1-proc   2-proc          4-proc
ASIC                      1.2M             8ns           12.2h    7.0h (1.7X)     5.1h (2.4X)
38IO SSO                  1.4M             30ns          3.0h     2.0h (1.5X)     1.4h (2.2X)
Signal-power              2.1M             1.2us         13d      7d18h (1.7X)    5d12h (2.4X)
4096x8 RAM (extracted)    2.3M             10ns          32h      18.5h (1.7X)    13.4h (2.4X)
120IO SSO                 3.5M             30ns          6.2h     4.1h (1.5X)     3.1h (2.0X)
ADI: Previous Works 1999, Namiki and Ito –the alternating direction implicit (ADI) method is used to simulate a 2D TE wave. 2001, Zheng et al. –extended to 3D problems 2001 & 2003, Lee and Chen –ADI is applied to power grids modeled as transmission lines The alternation is among different geometric directions, so the simulated geometric structure is constrained.
Alternating Direction Implicit (ADI) ADI Integration Method –Two-way partition of the circuit –One partition is used for each backward integration –Unconditionally stable (A-stable: independent of time step size) –Time step size is chosen according to local truncation error.
Alternating Direction Implicit (ADI) ADI method formulation Circuit partition algorithm Local truncation error estimation Stability discussion Experimental results
SPICE Formulation Equations for RLC circuits where C: capacitance matrix L: inductance matrix R: resistance matrix G: conductance matrix E: incidence matrix
ADI Formulation Transient simulation –Split the resistor and inductor branches into two parts G = G1 + G2 E = E1 + E2 R = R1 + R2 –Alternate backward and forward integration on each partition
ADI Formulation (Cont.) Equations of ADI method –the size of left-hand-side matrix remains unchanged –the number of non-zero elements is decreased –direct solving methods can be efficient
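The alternation of backward and forward integration over two partitions can be illustrated on a generic linear system. The sketch below is not the slide's circuit formulation: it assumes the simplified model x' = -(A1 + A2) x with hypothetical 2x2 matrices, and uses the classic two-stage (Peaceman-Rachford style) ADI update, where each half step is implicit in one partition and explicit in the other. The helper names `solve2`, `shifted`, `matvec`, `adi_step`, and `integrate` are illustrative.

```python
def solve2(M, r):
    """Solve the 2x2 linear system M x = r by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def shifted(A, s):
    """Return the matrix I + s * A for a 2x2 matrix A."""
    return [[1.0 + s * A[0][0], s * A[0][1]],
            [s * A[1][0], 1.0 + s * A[1][1]]]


def matvec(M, x):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]


def adi_step(A1, A2, x, h):
    """One ADI time step of size h for x' = -(A1 + A2) x."""
    # First half step: backward (implicit) in A1, forward (explicit) in A2.
    x_half = solve2(shifted(A1, 0.5 * h), matvec(shifted(A2, -0.5 * h), x))
    # Second half step: the roles of the two partitions are swapped.
    return solve2(shifted(A2, 0.5 * h), matvec(shifted(A1, -0.5 * h), x_half))


def integrate(A1, A2, x0, t_end, n_steps):
    """Apply n_steps equal ADI steps from t = 0 to t = t_end."""
    x, h = list(x0), t_end / n_steps
    for _ in range(n_steps):
        x = adi_step(A1, A2, x, h)
    return x
```

Each linear solve involves only I + (h/2) A1 or I + (h/2) A2, never the full sum, which is the property the slide relies on for fewer non-zero elements per solve. Halving the step size should cut the error by roughly a factor of four, consistent with second-order accuracy.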
Experiments of non-zero fill-ins A small ASIC Design Spice matrix : Dimension: 10,286 The number of non-zero elements: 46,655 The number of non-zero fill-ins: 90,960 A large I/O Design Spice matrix : Dimension: 615,436 The number of non-zero elements: 2,126,246 Sub-matrix1Sub-matrix2Total # non-zero fill-ins # non-zero elements # non-zero fill-ins # non-zero elements # non-zero fill-ins Case 138,5722,61842,02010,04012,658 Case 21,176,20812,421,534950,03814,772,06827,193,602
Local Truncation Error (LTE) Time step control using LTE –In circuit transient analysis, the next time step can be estimated from the local truncation error at the present time point –LTE is defined as the difference between the calculated solution and the exact solution –To ensure the consistency, the local truncation error should not exceed the error tolerance, thus the time step can be estimated using
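A common way to turn an LTE estimate into the next step size is sketched below. This is a generic textbook rule, not necessarily the formula on the slide: it assumes the LTE of a method of order p scales as h^(p+1), so the step is rescaled by (tol / LTE)^(1/(p+1)). The function names, the safety factor, and the growth and shrink limits are hypothetical choices.

```python
def next_time_step(h, lte, tol, order=2, safety=0.9,
                   max_growth=2.0, min_shrink=0.1):
    """Propose the next step size from the current step h and its LTE.

    Assumes LTE is proportional to h ** (order + 1); the safety factor and
    the clamping limits are illustrative defaults.
    """
    if lte <= 0.0:
        # No measurable error: grow by the largest allowed factor.
        return h * max_growth
    factor = safety * (tol / lte) ** (1.0 / (order + 1))
    factor = min(max_growth, max(min_shrink, factor))
    return h * factor


def accept_step(lte, tol):
    """A step is accepted only if its LTE does not exceed the tolerance."""
    return lte <= tol
```

For example, with a second-order method and an LTE eight times the tolerance, the rule halves the step before the safety factor is applied; a rejected step would be retried with the smaller size.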
Local Truncation Error (Cont.) LTE of ADI method (1) Equations
Local Truncation Error (Cont.) LTE of ADI method (2) Estimate exact solution: if we characterize the input as a simple ramp over the interval (t_n, t_{n+1}), the exact analytic solution with time step t_n is:
Local Truncation Error (Cont.) LTE of ADI method (3) Estimate ADI solution
Local Truncation Error (Cont.) LTE of ADI method (4) LTE estimation
Local Truncation Error (Cont.) LTE of ADI method (5) Time step control
Stability Discussion Stability is concerned with whether the accumulated error grows or decays as time evolves through a series of time steps. For one-step integration approximations, the error is multiplied by an amplification factor at each time step. If the final steady state error vector is smaller than the initial one, then the integration method is stable. In the ADI integration method: –It can be proved to be unconditionally stable
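The idea of a per-step error amplification factor can be checked numerically on a scalar test problem. The sketch below assumes x' = -(g1 + g2) x with non-negative g1 and g2 standing in for the two partitions, and the same two-stage ADI update (implicit in one part, explicit in the other, then swapped). It is an illustration of the claim, not the proof referred to on the slide; the function names are hypothetical.

```python
def adi_amplification(g1, g2, h):
    """Per-step growth factor of two-stage ADI on x' = -(g1 + g2) x."""
    half = 0.5 * h
    return ((1.0 - half * g1) * (1.0 - half * g2)) / \
           ((1.0 + half * g1) * (1.0 + half * g2))


def forward_euler_amplification(g1, g2, h):
    """Per-step growth factor of forward Euler on the same problem."""
    return 1.0 - h * (g1 + g2)
```

For g1, g2 >= 0 each ratio (1 - h g / 2) / (1 + h g / 2) has magnitude at most one for any h > 0, so the ADI factor never exceeds one in magnitude, whereas the forward Euler factor does once h (g1 + g2) > 2.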
Experimental Results
                Circuit1   Circuit2   Circuit3   1k-cell
#Nodes          10,000     40,000     90,000     10,200
#Transistors    0          0          0          6,500
Period          10ns
SPICE3          CPU time (sec), #steps
ADI             CPU time (sec), #steps
Speedup         2.7x       4.1x       11.1x      -
Voltage drop of Circuit3 (power mesh with sinks)
Signal in 1k_cell (ASIC design)
General Operator Splitting General operator splitting method –Multi-way partitions –Each partition is considered separately in each time step simulation –No geometric constraints –Local truncation error is used to dynamically control time step size
General Operator Splitting Fundamental theory Operator splitting formulation Local truncation error estimation Stability discussion Experimental results
Fundamental theory In circuit transient simulation, the integration approximation is actually the approximation of the exponential operator The exponential operators can be approximated in any order using a general scheme of fractal decomposition The decomposition of exponential operators corresponds to the circuit multi-way partition New integration approximation in transient simulation
Fundamental theory Approximation of exponential operator –General circuit equation and solution –If we characterize the input as a simple ramp over the interval (t_n, t_{n+1}), the exact analytic solution with time step t_n –Exponential operator approximation Forward Euler Backward Euler Trapezoidal
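The three named integration rules can be viewed as rational approximations of the exponential. The sketch below uses the scalar test equation x' = a x, where one exact step multiplies x by exp(z) with z = h a; the three functions are the standard one-step factors of forward Euler, backward Euler, and the trapezoidal rule. The function names are illustrative.

```python
import math


def forward_euler(z):
    """Forward Euler step factor: exp(z) is approximated by 1 + z."""
    return 1.0 + z


def backward_euler(z):
    """Backward Euler step factor: exp(z) is approximated by 1 / (1 - z)."""
    return 1.0 / (1.0 - z)


def trapezoidal(z):
    """Trapezoidal step factor: exp(z) ~ (1 + z/2) / (1 - z/2)."""
    return (1.0 + 0.5 * z) / (1.0 - 0.5 * z)


def approximation_errors(z):
    """Absolute error of each step factor against the exact exp(z)."""
    exact = math.exp(z)
    return {
        "forward_euler": abs(forward_euler(z) - exact),
        "backward_euler": abs(backward_euler(z) - exact),
        "trapezoidal": abs(trapezoidal(z) - exact),
    }
```

For a small negative z such as -0.1, the two Euler factors match exp(z) only to first order, while the trapezoidal factor matches to second order and is visibly closer.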
Fundamental theory Decomposition of exponential operators (Masuo Suzuki, 1991, Physics) –Function –First order: –Second order: –Third order: –(2m-1)th and (2m)th order:
Fundamental theory Decomposition of exponential operators
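The lowest-order decompositions can be tried out numerically. The sketch below assumes the standard product forms, exp(h(A+B)) ~ exp(hA) exp(hB) for first order and exp(hA/2) exp(hB) exp(hA/2) for second order, and a hypothetical pair of non-commuting 2x2 matrices. The Taylor-series `expm` is a simple stand-in that is only suitable for small matrices with small norm.

```python
def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def scale(M, s):
    """Multiply every entry of M by the scalar s."""
    return [[s * v for v in row] for row in M]


def add(X, Y):
    """Entry-wise sum of two matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def expm(M, terms=25):
    """Matrix exponential by truncated Taylor series (small norms only)."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = scale(matmul(term, M), 1.0 / k)  # term is now M**k / k!
        result = add(result, term)
    return result


def split_first_order(A, B, h):
    """First-order product formula: exp(hA) exp(hB)."""
    return matmul(expm(scale(A, h)), expm(scale(B, h)))


def split_second_order(A, B, h):
    """Second-order symmetric formula: exp(hA/2) exp(hB) exp(hA/2)."""
    half = expm(scale(A, 0.5 * h))
    return matmul(matmul(half, expm(scale(B, h))), half)


def local_error(approx, A, B, h):
    """Largest entry-wise deviation from the exact exp(h (A + B))."""
    exact = expm(scale(add(A, B), h))
    S = approx(A, B, h)
    return max(abs(S[i][j] - exact[i][j])
               for i in range(len(A)) for j in range(len(A)))
```

Halving h should reduce the one-step error by about 4 for the first-order formula and about 8 for the second-order one, since the local errors scale as h^2 and h^3 respectively.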
General Operator Splitting Formulation Transient simulation: –Apply the second order approximation –In each time step, every partition is calculated separately and trapezoidal integration is used for every partition –The size of left-hand-side matrix may be changed –The number of non-zero elements is definitely decreased –Can be easily extended to multi-way partitions
General Operator Splitting Formulation Equations
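A multi-way version of the second-order scheme can be sketched by sweeping through the partitions symmetrically and taking a trapezoidal sub-step on each. The code below assumes the simplified model x' = -(A1 + ... + Ak) x with hypothetical 2x2 partitions, not the circuit matrices of the slides; the symmetric ordering (half steps forward, a full step on the last partition, half steps back) is one standard way to obtain second order and is an assumption here.

```python
def solve2(M, r):
    """Solve the 2x2 linear system M x = r by Cramer's rule."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(r[0] * d - b * r[1]) / det, (a * r[1] - c * r[0]) / det]


def shifted(A, s):
    """Return the matrix I + s * A for a 2x2 matrix A."""
    return [[1.0 + s * A[0][0], s * A[0][1]],
            [s * A[1][0], 1.0 + s * A[1][1]]]


def matvec(M, x):
    """Multiply a 2x2 matrix by a length-2 vector."""
    return [M[0][0] * x[0] + M[0][1] * x[1],
            M[1][0] * x[0] + M[1][1] * x[1]]


def trapezoidal_substep(A, x, s):
    """Advance x' = -A x by time s with the trapezoidal rule."""
    return solve2(shifted(A, 0.5 * s), matvec(shifted(A, -0.5 * s), x))


def splitting_step(parts, x, h):
    """One symmetric operator-splitting step over all partitions."""
    for A in parts[:-1]:            # forward sweep, half steps
        x = trapezoidal_substep(A, x, 0.5 * h)
    x = trapezoidal_substep(parts[-1], x, h)  # full step on the last part
    for A in reversed(parts[:-1]):  # backward sweep, half steps
        x = trapezoidal_substep(A, x, 0.5 * h)
    return x


def integrate(parts, x0, t_end, n_steps):
    """Apply n_steps equal splitting steps from t = 0 to t = t_end."""
    x, h = list(x0), t_end / n_steps
    for _ in range(n_steps):
        x = splitting_step(parts, x, h)
    return x
```

Every linear solve touches a single partition, so each left-hand-side matrix is sparser than the full system matrix. With three partitions, halving the step size should still reduce the error by about a factor of four.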
Local Truncation Error (Cont.) LTE of general operator splitting method Estimate solution
Local Truncation Error (Cont.) LTE of general operator splitting method LTE estimation
Stability Discussion The trapezoidal integration method is unconditionally stable for stable systems. In our operator splitting method, the trapezoidal method is used for all the sub-systems, so the method is still unconditionally stable.
Experimental Results
                Circuit1   Circuit2   Circuit3
#Nodes          10,000     40,000     90,000
#Transistors    0          0          0
Period          10ns
SPICE3          CPU time (sec) ,061.1 #steps
GOS             CPU time (sec) #steps 102
Comparison      2.1x       2x         1.1x
Voltage drop of Circuit3 (power mesh with sinks)
Conclusions We investigate alternating direction implicit and general operator splitting integration methods for transistor-level circuit transient simulation. In both methods, the circuit is divided into several sub-circuits, so the direct matrix solver remains efficient because each matrix is simpler. Both methods are second-order accurate and unconditionally stable. Overhead: –Circuit partition –Each time step consists of many sub-steps, and each sub-step is an N-R iteration process Better for circuits with a large linear network
Distributed Computing Distributed Processors –Cluster –Supercomputer –Multi-Core Processors (Intel Dual/Quad-Core, IBM Cell etc.) Standard –MPI –Partitioning –Matrix Solver Capabilities –Speed-up –Memory Capacity
Future Works ADI method –More experiments General operator splitting method –Design and implement multi-way circuit partition algorithm –Implement multi-way general operator splitting program –Derive LTE for general multi-way situation –More experiments Distributed Computing –MPI Standard –Distributed Partitioning, Matrix Solver