
WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines
Wei Dong, Peng Li, Xiaoji Ye
Department of ECE, Texas A&M University
{weidong, pli, yexiaoji} @

DAC 2008 WavePipe: Parallel Transient Simulation on Multicore Shared-Memory Machines

2. Multi-Core Implications
• Multi-core shift is changing the landscape of computing
• New challenges & opportunities for EDA
  – Free ride of single-threaded EDA applications on Moore’s Law is coming to an end
• Question: How to fully exploit increasingly parallel hardware and achieve good runtime scaling?
[Images courtesy of Intel, AMD and IBM]

3. Why Parallel Transient Simulation?
• SPICE-like transient simulation is key to a wide range of ICs
  – Memories, custom digital, analog/RF/mixed-signal
• Long simulation time presents a significant bottleneck in design
  – CPU time > days, weeks (e.g. transistor-level PLL simulation)
  – Can lead to insufficient verification, non-optimal design, chip failure
• Natural target for parallelization!

4. Prior Work
• Fine-grained parallelization
  – Parallel matrix solves, device model evaluations
  – The efficiency of parallel matrix solvers deteriorates quickly
• Parallel waveform relaxation [White et al. ’87, Reichelt et al., ICCAD’03]
  – Limited convergence property
• Domain decomposition [Wever et al., HICSS’96]
  – Can create dense problems
  – Applicability highly application dependent
[Figure: Performance of a public parallel matrix solver on an 8-processor server]

5. Our Strategies
• Exploit coarse-grained & application-level parallelism
  – Lessons learned before [T. Mattson, Intel]
  – >100 parallel languages/environments developed in the 90’s!
  – Only a few with significant domain knowledge were successful
  – Develop simulation algorithms parallelizable by construction
• Goals/Benefits
  – Reduce parallel overhead by applying domain knowledge
  – Create rich parallelism for multi-/many-core platforms (pairing with fine-grained methods)
  – Ease of parallel programming, debugging and code reuse
  – Do not jeopardize accuracy & convergence

6. Proposed Approach
• Time-domain MNA formulation
  [Equation: nonlinear DAEs; legend: vector of unknowns, static nonlinearities, dynamic nonlinearities, inputs]
• How to parallelize along the time axis?
• Data dependency
  [Figure: time points t1 to t5 under one-step integration and under two-step integration]
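A sketch of the formulation for orientation: the equation itself is not in the transcript, so the form below is the standard way of writing a time-domain MNA system, and the symbol names x, f, q and u are assumptions chosen to match the four legend entries.

```latex
\frac{d}{dt}\, q\bigl(x(t)\bigr) + f\bigl(x(t)\bigr) = u(t)
```

Under these assumed names, x is the vector of unknowns, f the static nonlinearities, q the dynamic nonlinearities and u the inputs. A one-step integration method needs only the previous time point to advance, while a two-step method needs the previous two; that is the data dependency that limits parallelism along the time axis.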

7. Waveform Pipelining (WavePipe)
[Figure: WavePipe overview. Along the T axis, backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) extend from the current/base position. Granularity of waveform pipelining: schedule T1, T2, T3, T4, …, solve. Fine-grained parallel assists: parallel matrix solve/device evaluation. Platform: multi-/many-core machine.]

8. Outline
• Motivation
• Overview
• Parallel backward pipelining
• Parallel forward pipelining
• Experimental results
• Summary

9. Parallel Backward Pipelining
• Move backwards in time
• Create additional independent computing tasks along the T axis
• Why useful?
  – Employed under variable-stepsize multi-step numerical integration
  – Contributes to a larger future time step
[Figure: backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) along the T axis around the current position]

10. Variable-Stepsize Multi-Step Gear’s Method
• Gear’s integration formula
• Two-step Gear’s method [Shichman, Trans. Circuit Theory, 1970]
[Equation legend: order of numerical integration; circuit response at time point i; coefficients]
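As a rough illustration of what the coefficients in a variable-stepsize Gear formula look like, the sketch below differentiates the interpolating polynomial through the last k+1 time points, which is the textbook construction of a k-step Gear (BDF) method. The function name `gear_coeffs` is hypothetical and this is not taken from the WavePipe code.

```python
def gear_coeffs(times):
    """Return weights a_i such that x'(times[-1]) ~= sum(a_i * x(times[i])).

    `times` holds k+1 increasing time points; the last one is the new point.
    The weights come from differentiating the Lagrange interpolant at the
    newest point, so the time steps may be non-uniform.
    """
    n = len(times)
    t_new = times[-1]
    coeffs = []
    for j in range(n):
        if j == n - 1:
            # Weight of the new (unknown) point.
            c = sum(1.0 / (t_new - times[m]) for m in range(n - 1))
        else:
            num = 1.0
            for m in range(n - 1):
                if m != j:
                    num *= t_new - times[m]
            den = 1.0
            for m in range(n):
                if m != j:
                    den *= times[j] - times[m]
            c = num / den
        coeffs.append(c)
    return coeffs


if __name__ == "__main__":
    # Two-step Gear (Gear2) with uniform unit steps.
    print(gear_coeffs([0.0, 1.0, 2.0]))
```

With uniform steps of size h the two-step case reduces to the familiar weights 1/(2h), -2/h and 3/(2h) on the oldest, middle and newest points.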

11. Local Truncation Error (LTE)
• Numerical integration error incurred “locally” at each point
  – All the previous solutions are assumed to be accurate
• LTEs in Gear’s methods
  [Equations: two-step, three-step]
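For reference, a sketch of the two-step case under stated assumptions: the variable-step Gear2 formula is normalized so that the coefficient of the new point is 1, h_n = t_n - t_{n-1}, h_{n+1} = t_{n+1} - t_n, and DD3 denotes the third divided difference, which approximates x'''/6. This is the textbook Taylor-expansion result and may differ from the slide's own expression by a normalization constant.

```latex
\bigl|\mathrm{LTE}_{n+1}\bigr| \approx
\frac{h_{n+1}^{2}\,(h_n + h_{n+1})^{2}}{h_n + 2h_{n+1}}\;\bigl|\mathrm{DD3}\bigr|,
\qquad \mathrm{DD3} \approx \frac{x'''(\xi)}{6}
```

For uniform steps h_n = h_{n+1} = h this reduces to (2/9) h^3 |x'''|, the usual error constant of the two-step method.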

12. LTE-based Time Step Control (Gear2)
• Control the time step to meet an LTE tolerance
• LTE’s dependency on h_n & h_{n+1}
• Key observation
  – Smaller h_n → greater h_{n+1}, if DD3 is nonincreasing
  – Exploit for parallel computing
[Figure: T axis with the previous step h_n and the next step h_{n+1} to be determined]
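The key observation can be checked numerically. The sketch below assumes the textbook Gear2 error estimate |LTE| ≈ h_{n+1}^2 (h_n + h_{n+1})^2 / (h_n + 2 h_{n+1}) · |DD3| (the slide's own expression is not in the transcript) and finds the largest next step that meets a tolerance by bisection. The names `lte_factor` and `max_next_step` are hypothetical.

```python
def lte_factor(h_prev, h_next):
    """Step-size dependent factor of the assumed Gear2 LTE estimate."""
    return h_next ** 2 * (h_prev + h_next) ** 2 / (h_prev + 2.0 * h_next)


def max_next_step(h_prev, dd3, tol, iters=200):
    """Largest h_next with lte_factor(h_prev, h_next) * |dd3| <= tol."""
    if dd3 == 0.0:
        raise ValueError("dd3 must be nonzero for a finite step bound")
    lo, hi = 0.0, 1.0
    # Grow the bracket until the upper end violates the tolerance.
    while lte_factor(h_prev, hi) * abs(dd3) < tol:
        hi *= 2.0
    # The factor increases monotonically with h_next, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if lte_factor(h_prev, mid) * abs(dd3) <= tol:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    full = max_next_step(1e-3, dd3=1e3, tol=1e-6)
    half = max_next_step(5e-4, dd3=1e3, tol=1e-6)
    print(full, half)  # the smaller previous step allows the larger next step
```

Halving the previous step while holding DD3 fixed yields a larger admissible next step, which is the effect backward pipelining exploits.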

13. Parallel Backward Pipelining
• Serial Gear2
  [Timeline: t1, t2, t3, t4 with steps h2, h3, h4]
• Double-threaded Gear2
  – Initial conditions @ t1 & t2
  – Thread 1: t3 (h3 ← h2); Thread 2: back to t3’
  – Thread 1: t4 (h4 ← h3’); Thread 2: back to t4’
  [Timeline: t1, t2, t3, t3’, t4, t4’ with steps h2, h3, h3’, h4, h4’; Thread 1 and Thread 2]
• Balance between efficiency and robustness:
• Extensible to multi-step methods (e.g. Gear3)

14. Parallel Forward Pipelining
• Move forwards in time
• Exploit predictive computing along the forward T direction
• Question
  – How to resolve data dependency & ensure accuracy?
[Figure: backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) along the T axis around the current position]

15. Parallel Forward Pipelining
• Ex: double-threaded
  – Initial conditions @ t1 & t2
  – Time point t3 (h3 ← h2); FE estimate sol@t3; time point t4 (h4 ← h3); solve sol@t3 & sol@t4
  – Time point t5 (h5 ← h4); FE estimate sol@t5; time point t6 (h6 ← h5); solve sol@t5 & sol@t6
[Timeline: t1 to t6 with steps h2 to h6; Thread 1 and Thread 2]

16. Complications
• Time steps for forward points may not be estimated accurately
  – Data dependency on initial conditions
  – Apply a damping factor (β<1.0) for time step estimation
  – Revoke forward results in thread scheduling cycle (covered later)
• Forward points based on inaccurate initial conditions
  – Addressed by inter-thread communication
  – Tradeoffs provided by fine/coarse grained communications
[Figures: forward pipelining from the base position along T, annotated “h=?” and “Accuracy?”]

17. Coarse Grained Inter-thread Communication
[Figure: Threads 1, 2, 3, … work on time points 1, 2, 3. Each runs FE estimation, then a Newton loop of one or more iterations, until convergence.]
• Iterate on the converged initial condition
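A minimal two-thread sketch of forward pipelining with coarse-grained communication, under loud assumptions: the "circuit" is a hypothetical scalar ODE x' = -x^3, the integration is textbook variable-step Gear2, and all function names are invented for illustration. Thread 1 solves t3; Thread 2 forms an FE estimate of the solution at t3 and speculatively solves t4 from it; once t3 has converged, the t4 solve is repeated on the converged initial condition, warm-started from the speculative result.

```python
from concurrent.futures import ThreadPoolExecutor


def f(x):
    """Hypothetical scalar stand-in for the circuit equations: x' = -x**3."""
    return -x ** 3


def df(x):
    return -3.0 * x ** 2


def gear2_solve(x_prev, x_cur, h_prev, h_new, guess, tol=1e-12, max_iter=50):
    """Newton solve of one variable-step Gear2 point; returns (x_new, iters)."""
    w = h_new / h_prev
    a1 = (1 + w) ** 2 / (1 + 2 * w)
    a2 = w ** 2 / (1 + 2 * w)
    b = (1 + w) / (1 + 2 * w)
    x = guess
    for it in range(1, max_iter + 1):
        residual = x - a1 * x_cur + a2 * x_prev - b * h_new * f(x)
        dx = residual / (1.0 - b * h_new * df(x))
        x -= dx
        if abs(dx) < tol:
            return x, it
    raise RuntimeError("Newton did not converge")


def forward_euler(x_cur, h):
    """FE estimate of the next point, used as a provisional initial condition."""
    return x_cur + h * f(x_cur)


def forward_cycle(x1, x2, h2, h3, h4):
    def thread1():  # standard point t3
        return gear2_solve(x1, x2, h2, h3, guess=x2)

    def thread2():  # forward point t4, built on the FE estimate of t3
        x3_est = forward_euler(x2, h3)
        return gear2_solve(x2, x3_est, h3, h4, guess=x3_est)

    with ThreadPoolExecutor(max_workers=2) as pool:
        fut3 = pool.submit(thread1)
        fut4 = pool.submit(thread2)
        x3, _ = fut3.result()
        x4_spec, _ = fut4.result()

    # Coarse-grained communication: iterate on the converged t3 solution.
    x4, warm_iters = gear2_solve(x2, x3, h3, h4, guess=x4_spec)
    return x3, x4_spec, x4, warm_iters


if __name__ == "__main__":
    print(forward_cycle(1.0, 1.2 ** -0.5, 0.1, 0.1, 0.1))
```

The final t4 value equals what a serial solve would give, so accuracy is not affected; the speculative result only supplies a closer Newton starting point. Python threads do not run this arithmetic in parallel, so the sketch shows the data flow rather than any speedup.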

18. Fine Grained Inter-thread Communication
• Communicate at the granularity of NR iterations
• Beneficial to large circuits
[Figure: Threads 1, 2, 3, … work on time points 1, 2, 3. Each runs FE estimation, then NR iterations 1, 2, 3, until convergence.]

19. Multi-threaded WavePipe
• Combine backward with forward waveform pipelining
• Ex: 4T (1-backward-2-forward) WavePipe
  – T1: standard; T2: backward; T3: forward; T4: 2nd forward
[Figure: one thread scheduling cycle. Starting from the initial solutions, threads T1 to T4 each run FE then Newton for the base Gear2 point, the backward point, the forward point and the 2nd forward point along the time-step axis.]

20. Thread Scheduling
• The work done over an overestimated step is discarded
[Figure: standard 4-thread WavePipe (1-backward-2-forward scheme) with backward, forward and 2nd forward points over time. Without step size overestimation: cycle starts from the initial conditions, cycle completes. With step size overestimation: cycle starts, partially completes, cycle completes.]
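One way to picture the discard rule is as a prefix commit at the end of each scheduling cycle. The sketch below is an assumption about how such a check could be organized, not the scheduler described in the talk: each candidate point carries the LTE estimated after its solve, and the first point that violates the tolerance invalidates itself and every later point, since those were computed from it. The name `commit_cycle` is hypothetical.

```python
def commit_cycle(candidates, tol):
    """Keep the leading time points whose LTE meets the tolerance.

    `candidates` is a list of (time, lte) pairs in increasing time order,
    one per thread in the scheduling cycle. Points after the first failure
    depend on a rejected solution, so they are discarded as well.
    """
    accepted = []
    for time, lte in candidates:
        if lte > tol:
            break  # step size was overestimated: revoke this and later work
        accepted.append(time)
    return accepted


if __name__ == "__main__":
    cycle = [(1.0, 0.1), (1.5, 0.2), (2.2, 0.9), (3.0, 0.1)]
    print(commit_cycle(cycle, tol=0.5))  # [1.0, 1.5]
```

In this picture a cycle that accepts only part of its points corresponds to the "partially completes" case, and the next cycle restarts from the last accepted point.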

21. Experimental Setup
• An 8-processor Linux server with four dual-core processors
• WavePipe implemented in C/C++ using pThreads (Gear2)
• Compare with
  – Reference serial SPICE-like (Gear2) transient simulation
  – Low-level parallel matrix solve (SuperLU) and device evaluation
• Test circuits

Index  Circuit            Size / Time Points   Serial Run Time (s)
1      VCO                2086,023             37.59
2      Power Amplifier    8113,972             30.12
3      DB mixer           27134,612            48.11
4      Ring Oscillator    61110,037            206.37
5      Frequency Divider  1744,795             18.49
6      Digital Adder      1122,558             8.93
7      RLC mesh 1         13,097 / 664         2,704.08
8      RLC mesh 2         27,670 / 143         2,659.35

22. Experimental Results – Accuracy & Profiling
• 3-T (1 backward + 1 forward) WavePipe vs. serial (DB mixer)
• Real-time threading profiling (mesh ckt)

23. Experimental Results – 2T Speedups
• 2T 1-backward & 2T 1-forward

Circuit            2T 1-backward         2T 1-forward
                   T(s)      Speedup     T(s)      Speedup
VCO                27.3      1.38        23.1      1.63
Power Amplifier    21.7      1.39        18.1      1.66
DB mixer           36.9      1.30        30.8      1.56
Ring Oscillator    149.3     1.38        121.9     1.69
Frequency Divider  15.3      1.21        12.6      1.47
Digital Adder      7.
RLC mesh 1         2245.1    1.20        1814.6    1.49
RLC mesh 2         2159.3    1.23        1742.2    1.53
Average                      1.29X                 1.57X
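The speedup columns are the serial run time from the experimental setup divided by the multi-threaded run time. A short check of that arithmetic for the 2T results, using only rows whose values are complete in the transcript (the Digital Adder row is truncated and is left out); the dictionary and function names are illustrative.

```python
# Serial run times (s) from the test-circuit table.
SERIAL = {
    "VCO": 37.59,
    "Power Amplifier": 30.12,
    "DB mixer": 48.11,
    "Ring Oscillator": 206.37,
    "Frequency Divider": 18.49,
    "RLC mesh 1": 2704.08,
    "RLC mesh 2": 2659.35,
}

# 2T run times (s): (1-backward, 1-forward).
TWO_THREAD = {
    "VCO": (27.3, 23.1),
    "Power Amplifier": (21.7, 18.1),
    "DB mixer": (36.9, 30.8),
    "Ring Oscillator": (149.3, 121.9),
    "Frequency Divider": (15.3, 12.6),
    "RLC mesh 1": (2245.1, 1814.6),
    "RLC mesh 2": (2159.3, 1742.2),
}


def speedups(serial, parallel):
    """Speedup = serial time / parallel time, rounded to two decimals."""
    return {
        name: tuple(round(serial[name] / t, 2) for t in times)
        for name, times in parallel.items()
    }


if __name__ == "__main__":
    for name, (backward, forward) in speedups(SERIAL, TWO_THREAD).items():
        print(f"{name:18s} {backward:.2f} {forward:.2f}")
```

The computed ratios reproduce the listed speedups for these seven circuits.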

24. Experimental Results – 3T Speedups
• 3T 1-backward-1-forward & 3T 2-forward

Circuit            3T 1-back-1-forward   3T 2-forward
                   T(s)      Speedup     T(s)      Speedup
VCO                20.3      1.85        19.6      1.92
Power Amplifier    16.2      1.86        15.4      1.96
DB mixer           27.6      1.74        26.3      1.83
Ring Oscillator    112.4     1.84        107.2     1.93
Frequency Divider  11.2      1.65        10.7      1.73
Digital Adder      5.4       1.65        5.1       1.75
RLC mesh 1         1679.6    1.61        1559.0    1.73
RLC mesh 2         1589.3    1.67        1487.4    1.79
Average                      1.73X                 1.83X

25. Experimental Results – 4T Speedups
• 4T 1-backward-2-forward & 4T 3-forward

Circuit            4T 1-back-2-forward   4T 3-forward
                   T(s)      Speedup     T(s)      Speedup
VCO                16.8      2.24        16.1      2.33
Power Amplifier    13.8      2.18        13.2      2.28
DB mixer           22.7      2.12        21.6      2.23
Ring Oscillator    94.7      2.18        91.0      2.27
Frequency Divider
Digital Adder
RLC mesh 1         1390.2    1.95        1324.6    2.04
RLC mesh 2         1330.8    2.00        1265.4    2.10
Average                      2.09X                 2.19X

26. Experimental Results – Runtime Scaling
• 2-4 threads

27. Experimental Results
• Low-level scheme
  – Parallel matrix solve & device model evaluation
• Proposed scheme
  – 1-4 threads: WavePipe
  – 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation

28. Summary
• Multi-core challenges & opportunities for EDA
• Application-level coarse-grained parallelism for transient simulation
  – Parallelize at a granularity of single time-point circuit solution
  – Inherently low inter-core communication overhead
  – Maintain accuracy & convergence
  – Ease of implementation and code reuse
• Rich sets of parallelism for multi-core or many-core systems
  – New parallel opportunities orthogonal to fine-grained schemes
  – Pair with parallel matrix solve, device evaluation and low-level parallel programming assists

29. Thanks

