WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines
Wei Dong, Peng Li, Xiaoji Ye
Department of ECE, Texas A&M University
{weidong, pli, neo.tamu.edu

DAC 2008 WavePipe: Parallel Transient Simulation on Multicore Shared-Memory Machines
Multi-Core Implications
The multi-core shift is changing the landscape of computing
New challenges & opportunities for EDA
  – The free ride of single-threaded EDA applications on Moore's Law is coming to an end
Question: How can we fully exploit increasingly parallel hardware and achieve good runtime scaling?
[Figures: multi-core processors, courtesy Intel, AMD, and IBM]

Why Parallel Transient Simulation?
SPICE-like transient simulation is key to a wide range of ICs
  – Memories, custom digital, analog/RF/mixed-signal
Long simulation time is a significant bottleneck in design
  – CPU time can exceed days or weeks (e.g. transistor-level PLL simulation)
  – Can lead to insufficient verification, non-optimal design, chip failure
A natural target for parallelization!

Prior Work
Fine-grained parallelization
  – Parallel matrix solves, device model evaluations
  – The efficiency of parallel matrix solvers deteriorates quickly
Parallel waveform relaxation [White et al. '87, Reichelt et al. ICCAD'03]
  – Limited convergence properties
Domain decomposition [Wever et al., HICSS'96]
  – Can create dense problems
  – Applicability is highly application dependent
[Figure: Performance of a public parallel matrix solver on an 8-processor server]

Our Strategies
Exploit coarse-grained & application-level parallelism
  – Lessons learned before [T. Mattson, Intel]
  – >100 parallel languages/environments were developed in the 90's
  – Only a few, those with significant domain knowledge, succeeded
  – Develop simulation algorithms that are parallelizable by construction
Goals/Benefits
  – Reduce parallel overhead by applying domain knowledge
  – Create rich parallelism for multi-/many-core platforms (pairing with fine-grained methods)
  – Ease parallel programming, debugging and code reuse
  – Do not jeopardize accuracy & convergence

Proposed Approach
Time-domain MNA formulation: nonlinear DAEs
  $\frac{d}{dt} q(x(t)) + f(x(t)) = u(t)$
  – $x$: vector of unknowns
  – $f$: static nonlinearities
  – $q$: dynamic nonlinearities
  – $u$: inputs
How to parallelize along the time axis?
[Figure: Data dependency among time points t1 to t5 for one-step integration and for two-step integration]

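To make the data dependency along the time axis concrete, here is a minimal sketch of serial one-step integration. It assumes the DAE has the common scalar form d/dt q(x) + f(x) = u(t) and uses backward Euler with Newton's method; the function names and the RC test circuit (R = 1 ohm, C = 1 F, unit step input) are illustrative assumptions, not taken from the slides.

```python
import math

def backward_euler_step(x_prev, t_next, h, q, dq, f, df, u, tol=1e-12, max_iter=50):
    """One backward-Euler step for d/dt q(x) + f(x) = u(t), solved by Newton's method.

    Residual: (q(x) - q(x_prev)) / h + f(x) - u(t_next) = 0.
    """
    x = x_prev  # initial Newton guess: the previous solution
    for _ in range(max_iter):
        residual = (q(x) - q(x_prev)) / h + f(x) - u(t_next)
        jacobian = dq(x) / h + df(x)
        dx = -residual / jacobian
        x += dx
        if abs(dx) < tol:
            return x
    raise RuntimeError("Newton iteration did not converge")

def simulate(x0, t_end, h, **model):
    """March serially: every point needs the previous one (one-step dependency)."""
    n_steps = round(t_end / h)
    x, t = x0, 0.0
    for _ in range(n_steps):
        t += h
        x = backward_euler_step(x, t, h, **model)
    return x

# Hypothetical RC circuit: C = 1 F, R = 1 ohm, unit step input.
rc_model = dict(
    q=lambda x: 1.0 * x, dq=lambda x: 1.0,   # charge q = C*x
    f=lambda x: x / 1.0, df=lambda x: 1.0,   # resistor current x/R
    u=lambda t: 1.0,
)
x_final = simulate(0.0, 1.0, 1e-3, **rc_model)
```

The loop in `simulate` cannot start step n+1 before step n has finished, which is the serial bottleneck that pipelining along the time axis has to work around.
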
Waveform Pipelining (WavePipe)
[Figure: WavePipe overview along the T axis. Backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) around the current/base position. Granularity of waveform pipelining: threads T1, T2, T3, T4, ... are scheduled per solve. Fine-grained parallel assists (parallel matrix solve, device evaluation) on a multi-/many-core machine.]

Outline
Motivation
Overview
Parallel backward pipelining
Parallel forward pipelining
Experimental results
Summary

Parallel Backward Pipelining
Move backwards in time
Create additional independent computing tasks along the T axis
Why useful?
  – Applies under variable-stepsize multi-step numerical integration
  – Contributes to a larger future time step
[Figure: backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) relative to the current position on the T axis]

Variable-Stepsize Multi-Step Gear's Method
Gear's integration formula [equation missing in source], defined in terms of
  – the order of numerical integration
  – the circuit response at time point i
  – the coefficients
Two-step Gear's method [Shichman, Trans. Circuit Theory, 1970] [equation missing in source]

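Since the formulas on this slide did not survive, the following sketch shows one standard way to write the variable-stepsize two-step Gear (BDF2) derivative approximation. The coefficients come from differentiating the quadratic interpolant through the last three points; the function names and the symbols `h_next`, `h_prev` are assumptions for illustration and may not match the slide's notation.

```python
def gear2_coefficients(h_next, h_prev):
    """Coefficients (a0, a1, a2) such that
    x'(t_{n+1}) ~= (a0*x_{n+1} + a1*x_n + a2*x_{n-1}) / h_next,
    where h_next = t_{n+1} - t_n and h_prev = t_n - t_{n-1}.
    """
    w = h_next / h_prev  # step-size ratio
    a0 = (1.0 + 2.0 * w) / (1.0 + w)
    a1 = -(1.0 + w)
    a2 = w * w / (1.0 + w)
    return a0, a1, a2

def gear2_derivative(x_next, x_cur, x_old, h_next, h_prev):
    """Approximate the time derivative at t_{n+1} from three solution values."""
    a0, a1, a2 = gear2_coefficients(h_next, h_prev)
    return (a0 * x_next + a1 * x_cur + a2 * x_old) / h_next
```

With equal steps the coefficients reduce to the familiar (3/2, -2, 1/2), and the approximation is exact for any quadratic waveform, which is a quick sanity check on the formula.
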
Local Truncation Error (LTE)
Numerical integration error incurred "locally" at each time point
  – All previous solutions are assumed to be accurate
LTEs in Gear's methods: two-step and three-step [equations missing in source]

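The LTE expressions themselves are not recoverable from this transcript. As a hedged stand-in for the two-step case, the leading-order estimate below follows from the interpolation error of the quadratic through the last three points. The notation ($h_{n+1} = t_{n+1} - t_n$, $h_n = t_n - t_{n-1}$) is an assumption, and the slide's own form may differ.

```latex
% Leading-order LTE magnitude of variable-step two-step Gear (BDF2);
% an assumed standard form, not copied from the slide.
\left| \mathrm{LTE}_{n+1} \right| \approx
  \frac{h_{n+1}^{2}\,(h_{n+1} + h_{n})^{2}}{2h_{n+1} + h_{n}}
  \cdot \frac{\left| x'''(\xi) \right|}{6},
\qquad
h_{n+1} = h_{n} = h \;\Rightarrow\;
\left| \mathrm{LTE}_{n+1} \right| \approx \frac{2}{9}\, h^{3} \left| x'''(\xi) \right| .
```

In practice the factor $x'''/6$ is estimated by a third divided difference of recent solution points.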
LTE-Based Time Step Control (Gear2)
Control the time step to meet an LTE tolerance
LTE depends on $h_n$ and $h_{n+1}$
Key observation
  – A smaller $h_n$ allows a greater $h_{n+1}$, if DD3 is nonincreasing
  – Exploit this for parallel computing
[Figure: T axis showing the step $h_n$ followed by the yet-unknown step $h_{n+1}$]

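The key observation can be checked numerically. This sketch assumes the leading-order two-step Gear LTE estimate |DD3| * h_{n+1}^2 (h_{n+1} + h_n)^2 / (2 h_{n+1} + h_n), with DD3 read as a third divided difference. That form, the function names and the tolerance values are illustrative assumptions, not the slide's own formula.

```python
def gear2_lte(h_next, h_prev, dd3):
    """Leading-order LTE magnitude for variable-step two-step Gear (assumed form)."""
    return abs(dd3) * h_next**2 * (h_next + h_prev)**2 / (2.0 * h_next + h_prev)

def max_next_step(h_prev, dd3, tol, h_max=1.0, iters=200):
    """Largest h_next in (0, h_max] with gear2_lte <= tol, found by bisection.

    The estimate grows monotonically with h_next, so bisection is valid.
    """
    if gear2_lte(h_max, h_prev, dd3) <= tol:
        return h_max
    lo, hi = 0.0, h_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gear2_lte(mid, h_prev, dd3) <= tol:
            lo = mid   # mid still meets the tolerance
        else:
            hi = mid   # mid violates the tolerance
    return lo

# Same DD3 and tolerance, two different previous steps (hypothetical values).
h_after_long_step = max_next_step(h_prev=0.10, dd3=1.0, tol=1e-6)
h_after_short_step = max_next_step(h_prev=0.05, dd3=1.0, tol=1e-6)
```

Under this estimate, halving the previous step yields a larger admissible next step. That is the effect backward pipelining exploits when a second thread inserts an extra point just behind the current one.
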
Parallel Backward Pipelining
Serial Gear2
[Figure: time axis with points t1, t2, t3, t4 and steps h2, h3, h4]
Double-threaded Gear2
  Initial: t1 & t2
  Thread 1: t3 (h3 h2)
  Thread 2: back to t3'
  Thread 1: t4 (h4 h3')
  Thread 2: back to t4'
[Figure: time axis with Thread 1 points t1, t2, t3, t4 and Thread 2 points t3', t4'; steps h2, h3, h3', h4, h4']
Balance between efficiency and robustness: [expression missing in source]
Extensible to multi-step methods (e.g. Gear3)

Parallel Forward Pipelining
Move forwards in time
Exploit predictive computing along the forward T direction
Question
  – How to resolve data dependency & ensure accuracy?
[Figure: backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) relative to the current position on the T axis]

Parallel Forward Pipelining
Ex: double-threaded
[Figure: double-threaded schedule for Thread 1 and Thread 2 along the time axis t1 ... t6 with steps h2 ... h6. Init. t1 & t2; time point t3 (h3 h2) and time point t4 (h4 h3): FE estimate, then solve; time point t5 (h5 h4) and time point t6 (h6 h5): FE estimate, then solve.]

Complications
Time steps for forward points may not be estimated accurately
  – Data dependency on initial conditions
  – Apply a damping factor (β < 1.0) to the time step estimate
  – Revoke forward results in the thread scheduling cycle (covered later)
Forward points are based on inaccurate initial conditions
  – Addressed by inter-thread communication
  – Fine- and coarse-grained communication offer different tradeoffs
[Figures: forward pipelining from the base position on the T axis, with an unknown step size (h = ?) and uncertain accuracy]

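The two ingredients of a forward point, a predicted initial condition and a cautious step size, can be sketched as below. This assumes "FE" denotes a forward-Euler style predictor built from the last two solved points; the function names and the value β = 0.8 are hypothetical.

```python
def forward_euler_estimate(x_cur, x_prev, h_cur, h_next):
    """Predict x at t_cur + h_next by extrapolating the last finite-difference slope."""
    slope = (x_cur - x_prev) / h_cur
    return x_cur + h_next * slope

def damped_forward_step(h_estimate, beta=0.8):
    """Shrink the estimated step for a forward point (beta < 1.0, value assumed)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie in (0, 1)")
    return beta * h_estimate

# A forward thread starts from a predicted value instead of waiting for the
# still-running solve of the preceding time point.
x_predicted = forward_euler_estimate(x_cur=0.2, x_prev=0.0, h_cur=0.1, h_next=0.05)
h_forward = damped_forward_step(h_estimate=0.05)
```

A smaller β makes it less likely that the forward result has to be revoked, at the cost of shorter steps and therefore less progress per scheduling cycle.
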
Coarse-Grained Inter-Thread Communication
[Figure: Threads 1, 2, 3, ... handle time points 1, 2, 3, ... along the time axis; each thread runs FE estimation, a Newton loop (one or more iterations), then convergence]
Iterate on the converged initial condition

Fine-Grained Inter-Thread Communication
Communicate at the granularity of NR iterations
Beneficial to large circuits
[Figure: Threads 1, 2, 3, ... handle time points 1, 2, 3, ... along the time axis; each thread runs FE estimation, NR iterations 1, 2, 3, then convergence]

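The fine-grained scheme can be illustrated without real threads. The sketch below is a sequential emulation, under the assumption that after each NR iteration the thread for point 2 re-reads the latest (not yet converged) iterate of point 1 as its initial condition. The test equation x' = -x^3 with backward Euler, and all function names, are illustrative choices.

```python
def newton_update(y, x_prev, h):
    """One NR update for the backward-Euler residual of x' = -x**3:
    g(y) = y - x_prev + h*y**3 = 0."""
    g = y - x_prev + h * y**3
    dg = 1.0 + 3.0 * h * y**2
    return y - g / dg

def solve_point(x_prev, h, tol=1e-13, max_iter=50):
    """Serial reference: iterate a single time point to convergence."""
    y = x_prev
    for _ in range(max_iter):
        y_new = newton_update(y, x_prev, h)
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    raise RuntimeError("Newton iteration did not converge")

def pipelined_two_points(x0, h, tol=1e-13, max_iter=50):
    """Emulate two threads that exchange data after every NR iteration."""
    y1 = x0                       # 'thread 1' guess for point 1
    y2 = x0 - 2.0 * h * x0**3     # 'thread 2' FE-style guess for point 2
    for _ in range(max_iter):
        y1_new = newton_update(y1, x0, h)
        # Thread 2 sees the iterate thread 1 published after its previous iteration.
        y2_new = newton_update(y2, y1, h)
        done = abs(y1_new - y1) < tol and abs(y2_new - y2) < tol
        y1, y2 = y1_new, y2_new
        if done:
            return y1, y2
    raise RuntimeError("pipelined iteration did not converge")

x1_serial = solve_point(1.0, 0.1)
x2_serial = solve_point(x1_serial, 0.1)
x1_pipe, x2_pipe = pipelined_two_points(1.0, 0.1)
```

Both points reach the serial answers, so accuracy is preserved, while point 2 starts useful work before point 1 has converged. In the coarse-grained variant, thread 2 would instead exchange data only once thread 1 reports convergence.
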
Multi-Threaded WavePipe
Combine backward with forward waveform pipelining
Ex: 4T (1-backward-2-forward) WavePipe
  – T1: standard
  – T2: backward
  – T3: forward
  – T4: 2nd forward
[Figure: one thread scheduling cycle starting from the initial solutions, showing the backward, base Gear2, forward and 2nd forward points; each of T1 to T4 performs FE then Newton for its time step]

Thread Scheduling
The work done over an overestimated step is discarded
[Figure: 4-thread WavePipe (1-backward-2-forward scheme) with standard, backward, forward and 2nd forward points along the time axis. Without step size overestimation: the cycle starts from the initial conditions and completes. With step size overestimation: the cycle starts, partially completes, then completes.]

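One way to picture a partially completed cycle is a commit rule applied once all threads finish. The sketch assumes that a point is kept only if the step it used does not exceed what the LTE check allows, and that every point after the first rejected one is discarded as well because it depends on it. The data structure, field names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PointResult:
    label: str        # e.g. "standard", "forward", "2nd forward"
    t: float          # time point solved by the thread
    h_used: float     # step size the thread used (possibly a damped estimate)
    h_allowed: float  # step size the LTE check allows once earlier points are known

def commit_cycle(results):
    """Accept points in time order until the first overestimated step.

    That point and all later ones are discarded (assumed dependency rule).
    """
    accepted, discarded = [], []
    ok = True
    for r in sorted(results, key=lambda r: r.t):
        if ok and r.h_used <= r.h_allowed:
            accepted.append(r.label)
        else:
            ok = False
            discarded.append(r.label)
    return accepted, discarded

cycle = [
    PointResult("standard", 1.0, 0.10, 0.12),
    PointResult("forward", 1.1, 0.10, 0.08),      # step was overestimated
    PointResult("2nd forward", 1.2, 0.10, 0.20),  # depends on the rejected point
]
accepted, discarded = commit_cycle(cycle)
```

The next scheduling cycle would then restart from the last accepted point, so an overestimate costs wasted work but not accuracy.
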
Experimental Setup
An 8-processor Linux server with four dual-core processors
WavePipe implemented in C/C++ using Pthreads (Gear2)
Compared with
  – Reference serial SPICE-like (Gear2) transient simulation
  – Low-level parallel matrix solve (SuperLU) and device evaluation
Test circuits (columns: Index, Circuit, Size, Time Points, Serial Run Time (s); numeric entries are truncated in the source)
  1 VCO 2086,
  Power Amplifier 8113,
  DB mixer 27134,
  Ring Oscillator 61110,
  Frequency Divider 1744,
  Digital Adder 1122,
  RLC mesh 1 13, ,
  RLC mesh 2 27, ,659.35

Experimental Results: Accuracy & Profiling
[Figure: 3T (1-backward + 1-forward) WavePipe vs. serial (DB mixer)]
[Figure: Real-time thread profiling (mesh circuit)]

Experimental Results: 2T Speedups
2T 1-backward & 2T 1-forward
Table columns: Circuit; 2T 1-backward T(s), Speedup; 2T 1-forward T(s), Speedup
Circuits: VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, RLC mesh 1, RLC mesh 2
Surviving entries (end of table): X 1.57X

Experimental Results: 3T Speedups
3T 1-backward-1-forward & 3T 2-forward
Table columns: Circuit; 3T 1-backward-1-forward T(s), Speedup; 3T 2-forward T(s), Speedup
Circuits: VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, RLC mesh 1, RLC mesh 2
Surviving entries (end of table): X 1.83X

Experimental Results: 4T Speedups
4T 1-backward-2-forward & 4T 3-forward
Table columns: Circuit; 4T 1-backward-2-forward T(s), Speedup; 4T 3-forward T(s), Speedup
Circuits: VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, RLC mesh 1, RLC mesh 2
Surviving entries (end of table): X 2.19X

Experimental Results: Runtime Scaling
[Figure: runtime scaling for 2-4 threads]

Experimental Results
Low-level scheme
  – Parallel matrix solve & device model evaluation
Proposed scheme
  – 1-4 threads: WavePipe
  – 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation

Summary
Multi-core challenges & opportunities for EDA
Application-level coarse-grained parallelism for transient simulation
  – Parallelize at the granularity of a single time-point circuit solution
  – Inherently low inter-core communication overhead
  – Maintains accuracy & convergence
  – Ease of implementation and code reuse
Rich sets of parallelism for multi-core or many-core systems
  – New parallel opportunities orthogonal to fine-grained schemes
  – Pair with parallel matrix solve, device evaluation and low-level parallel programming assists

Thanks