WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines Wei Dong, Peng Li, Xiaoji Ye Department of ECE,

WavePipe: Parallel Transient Simulation of Analog and Digital Circuits on Multicore Shared-Memory Machines Wei Dong, Peng Li, Xiaoji Ye Department of ECE, Texas A&M University {weidong, pli, neo.tamu.edu

DAC 2008 WavePipe: Parallel Transient Simulation on Multicore Shared-Memory Machines

Slide 2: Multi-Core Implications
• The multi-core shift is changing the landscape of computing
• New challenges & opportunities for EDA
  – The free ride of single-threaded EDA applications on Moore’s Law is coming to an end
• Question: How to fully exploit increasingly parallel hardware and achieve good runtime scaling?
[Figure: multi-core processors, courtesy Intel, AMD, and IBM]

Slide 3: Why Parallel Transient Simulation?
• SPICE-like transient simulation is key to a wide range of ICs
  – Memories, custom digital, analog/RF/mixed-signal
• Long simulation time presents a significant bottleneck in design
  – CPU time > days, weeks (e.g. transistor-level PLL simulation)
  – Can lead to insufficient verification, non-optimal design, chip failure
• Natural target for parallelization!

Slide 4: Prior Work
• Fine-grained parallelization
  – Parallel matrix solves, device model evaluations
  – The efficiency of parallel matrix solvers deteriorates quickly
• Parallel waveform relaxation [White et al ’87, Reichelt et al ICCAD’03]
  – Limited convergence property
• Domain decomposition [Wever et al, HICSS’96]
  – Can create dense problems
  – Applicability highly application dependent
[Figure: performance of a public parallel matrix solver on an 8-processor server]

Slide 5: Our Strategies
• Exploit coarse-grained & application-level parallelism
  – Lessons learned before [T. Mattson, Intel]
  – >100 parallel languages/environments developed in the 90’s!
  – Only a few, those with significant domain knowledge, were successful
  – Develop simulation algorithms parallelizable by construction
• Goals/Benefits
  – Reduce parallel overhead by applying domain knowledge
  – Create rich parallelism for multi-/many-core platforms (pairing with fine-grained methods)
  – Ease of parallel programming, debugging, and code reuse
  – Do not jeopardize accuracy & convergence

Slide 6: Proposed Approach
• Time-domain MNA formulation: nonlinear DAEs in terms of the vector of unknowns, the static nonlinearities, the dynamic nonlinearities, and the inputs
• How to parallelize along the time axis?
• Data dependency
[Figure: data dependency among time points t1 to t5 under one-step integration and two-step integration]
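The DAE itself is not legible in this transcript. A commonly used form of the time-domain MNA equations is sketched below as an assumption, not as the authors' exact notation. The symbols x (vector of unknowns), f (static nonlinearities), q (dynamic nonlinearities), and u (inputs) are chosen here to match the slide's legend.

```latex
\frac{d}{dt}\, q\bigl(x(t)\bigr) + f\bigl(x(t)\bigr) = u(t)
```

Under this form, a one-step integration method computes the solution at t_{n+1} from the solution at t_n only, while a two-step method also needs the solution at t_{n-1}. That chain of dependencies is what makes parallelizing along the time axis nontrivial.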

Slide 7: Waveform Pipelining (WavePipe)
[Figure: WavePipe overview. Backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) on either side of the current/base position on the T axis. Threads T1 to T4 are scheduled at the granularity of waveform pipelining. Fine-grained parallel assists (parallel matrix solve, device evaluation) run underneath on a multi-/many-core machine.]

Slide 8: Outline
• Motivation
• Overview
• Parallel backward pipelining
• Parallel forward pipelining
• Experimental results
• Summary

Slide 9: Parallel Backward Pipelining
• Move backwards in time
• Create additional independent computing tasks along the T axis
• Why useful?
  – Employed under variable-stepsize multi-step numerical integration
  – Contributes to a larger future time step
[Figure: backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) relative to the current position on the T axis]

Slide 10: Variable-Stepsize Multi-Step Gear’s Method
• Gear’s integration formula
• Two-step Gear’s method [Shichman, Trans. Circuit Theory, 1970]
[Equations: Gear’s integration formula and the two-step Gear’s method, with legend entries for the order of numerical integration, the circuit response at time point i, and the coefficients]
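As an illustration of a variable-stepsize two-step Gear method, the sketch below computes the standard variable-step BDF2 coefficients. This is the textbook formula, not necessarily the exact notation on the slide, and the function name `gear2_coefficients` and its arguments are hypothetical.

```python
def gear2_coefficients(h_new, h_old):
    """Variable-step two-step Gear (BDF2) derivative coefficients.

    Returns (a0, a1, a2) such that
        x'(t_{n+1}) ~= a0 * x_{n+1} + a1 * x_n + a2 * x_{n-1},
    where h_new = t_{n+1} - t_n and h_old = t_n - t_{n-1}.
    Textbook formula; names are illustrative assumptions.
    """
    if h_new <= 0 or h_old <= 0:
        raise ValueError("step sizes must be positive")
    w = h_new / h_old  # step-size ratio
    a0 = (1.0 + 2.0 * w) / ((1.0 + w) * h_new)
    a1 = -(1.0 + w) / h_new
    a2 = (w * w) / ((1.0 + w) * h_new)
    return a0, a1, a2


def gear2_derivative(x_new, x_cur, x_old, h_new, h_old):
    """Approximate x'(t_{n+1}) from three consecutive solution values."""
    a0, a1, a2 = gear2_coefficients(h_new, h_old)
    return a0 * x_new + a1 * x_cur + a2 * x_old
```

With equal steps the coefficients reduce to the familiar 3/2, -2, 1/2 (divided by h). The formula is exact for quadratics on any step pair, which gives a quick sanity check.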

Slide 11: Local Truncation Error (LTE)
• Numerical integration error incurred “locally” at each point
  – All the previous solutions are assumed to be accurate
• LTEs in Gear’s methods
[Equations: LTE expressions for the two-step and three-step Gear’s methods]

Slide 12: LTE based Time Step Control (Gear2)
• Control the time step to meet an LTE tolerance
• LTE’s dependency on h_n & h_{n+1}
• Key observation
  – Smaller h_n → greater h_{n+1}, if DD3 is nonincreasing
  – Exploit for parallel computing
[Figure: T axis showing the step h_n followed by the unknown next step h_{n+1}]
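The key observation can be checked numerically. The sketch below reads DD3 as the third divided difference of the solution and uses the leading-order LTE of variable-step BDF2, |DD3| · h_{n+1}² (h_{n+1} + h_n)² / (2 h_{n+1} + h_n). That expression is a standard derivation assumed here, not copied from the slide, and DD3 is held fixed for simplicity. The function names are hypothetical.

```python
def gear2_lte(h_next, h_prev, dd3):
    """Leading-order LTE magnitude for variable-step Gear2 (assumed model).

    dd3 is taken to be the third divided difference, so x''' ~= 6 * dd3.
    """
    return abs(dd3) * h_next ** 2 * (h_next + h_prev) ** 2 / (2.0 * h_next + h_prev)


def max_next_step(h_prev, dd3, tol, h_max=1.0, iterations=200):
    """Largest h_next <= h_max whose estimated LTE stays within tol.

    The LTE model is monotonically increasing in h_next, so bisection
    on [0, h_max] is sufficient.
    """
    if gear2_lte(h_max, h_prev, dd3) <= tol:
        return h_max
    lo, hi = 0.0, h_max
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gear2_lte(mid, h_prev, dd3) <= tol:
            lo = mid  # mid is acceptable, try a larger step
        else:
            hi = mid  # mid violates the tolerance, shrink
    return lo


# Same tolerance and DD3, two different previous steps.
step_after_large_h = max_next_step(h_prev=0.10, dd3=1.0, tol=1e-4)
step_after_small_h = max_next_step(h_prev=0.05, dd3=1.0, tol=1e-4)
```

Under this model, the smaller previous step allows the larger next step. This is the effect a backward point exploits: it shortens h_n so that the base thread can take a longer h_{n+1}.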

Slide 13: Parallel Backward Pipelining
• Serial Gear2
• Double-threaded Gear2
• Balance between efficiency and robustness
• Extensible to multi-step methods (e.g. Gear3)
[Figure: serial Gear2 timeline with points t1, t2, t3, t4 and steps h2, h3, h4, compared with a double-threaded timeline with points t1, t2, t3’, t3, t4’, t4 and steps h2, h3’, h3, h4’, h4. Schedule labels: initial t1 & t2; Tr1: t3 (h3, h2); Tr2: back to t3’; Tr1: t4 (h4, h3’); Tr2: back to t4’.]

Slide 14: Parallel Forward Pipelining
• Move forwards in time
• Exploit predictive computing along the forward T direction
• Question
  – How to resolve data dependency & ensure accuracy?
[Figure: backward pipelining (multi-step numerical integration) and forward pipelining (predictive computing) relative to the current position on the T axis]

Slide 15: Parallel Forward Pipelining
• Ex: double-threaded
[Figure: two-thread timeline over t1 to t6 with steps h2 to h6. Labels in order: init. t1 & t2; time point t3 (h3, h2); FE estimate; time point t4 (h4, h3); solve; time point t5 (h5, h4); FE estimate; time point t6 (h6, h5); solve.]

Slide 16: Complications
• Time steps for forward points may not be estimated accurately
  – Data dependency on initial conditions
  – Apply a damping factor (β<1.0) for time step estimation
  – Revoke forward results in thread scheduling cycle (covered later)
• Forward points based on inaccurate initial conditions
  – Addressed by inter-thread communication
  – Tradeoffs provided by fine/coarse grained communications
[Figure: forward pipelining from the base position on the T axis, annotated “h=?” and “Accuracy?”]

Slide 17: Coarse Grained Inter-thread Communication
• Iterate on the converged initial condition
[Figure: threads 1, 2, 3 work on time points 1, 2, 3 in parallel. Each thread runs FE estimation, then a Newton loop of one or more iterations, then reaches convergence.]
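The coarse-grained scheme can be emulated serially to see why it preserves accuracy. The sketch below makes several simplifying assumptions that are not from the slides. It uses backward Euler instead of Gear2, a hypothetical scalar test problem x' = -x³, and plain function calls standing in for the two threads. The forward point is first solved from a forward Euler (FE) estimate of its predecessor, then iterated again once the predecessor has converged.

```python
def newton_backward_euler(x_prev, h, f, dfdx, guess, tol=1e-12, max_iter=50):
    """Solve x - x_prev - h*f(x) = 0 for x by Newton's method.

    Returns the converged value and the number of iterations used.
    """
    x = guess
    for iteration in range(1, max_iter + 1):
        residual = x - x_prev - h * f(x)
        step = -residual / (1.0 - h * dfdx(x))
        x += step
        if abs(step) < tol:
            return x, iteration
    raise RuntimeError("Newton iteration did not converge")


def f(x):
    return -x ** 3  # hypothetical scalar test problem x' = -x^3


def dfdx(x):
    return -3.0 * x ** 2


def serial_two_points(x0, h):
    """Reference: solve point 1, then point 2, strictly in order."""
    x1, _ = newton_backward_euler(x0, h, f, dfdx, guess=x0 + h * f(x0))
    x2, _ = newton_backward_euler(x1, h, f, dfdx, guess=x1 + h * f(x1))
    return x1, x2


def pipelined_two_points(x0, h):
    """Serial emulation of two threads with coarse-grained communication."""
    # FE estimate of point 1, available before any nonlinear solve.
    x1_fe = x0 + h * f(x0)
    # "Thread 1": solve point 1 starting from the FE guess.
    x1, _ = newton_backward_euler(x0, h, f, dfdx, guess=x1_fe)
    # "Thread 2": speculative solve of point 2, treating x1_fe as if it were x1.
    x2_spec, _ = newton_backward_euler(
        x1_fe, h, f, dfdx, guess=x1_fe + h * f(x1_fe)
    )
    # After thread 1 converges, thread 2 iterates again on the converged
    # initial condition, warm-started from its speculative result.
    x2, refine_iters = newton_backward_euler(x1, h, f, dfdx, guess=x2_spec)
    return x1, x2, refine_iters
```

Because the final Newton solve for the forward point uses the converged predecessor, the pipelined result agrees with the serial one to solver tolerance. The speculative solve only supplies a better starting guess.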

Slide 18: Fine Grained Inter-thread Communication
• Communicate at the granularity of NR iterations
• Beneficial to large circuits
[Figure: threads 1, 2, 3 work on time points 1, 2, 3 in parallel. Each thread runs FE estimation, then NR iterations 1, 2, 3, then reaches convergence.]

Slide 19: Multi-threaded WavePipe
• Combine backward with forward waveform pipelining
• Ex: 4T (1-backward-2-forward) WavePipe
[Figure: one thread scheduling cycle following the initial solutions. T1: standard (base Gear2 point); T2: backward; T3: forward; T4: 2nd forward. Each thread runs FE followed by Newton over its time step.]

Slide 20: Thread Scheduling
• The work done over an overestimated step is discarded
[Figure: standard 4-thread WavePipe (1-backward-2-forward scheme) over time, with backward, standard, forward, and 2nd forward points. Without step size overestimation: the cycle starts from the initial conditions and completes. With step size overestimation: the cycle starts, partially completes, and then completes.]
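The discard rule can be sketched as a small bookkeeping routine. This is a hypothetical illustration, not the authors' scheduler. It assumes each point in a cycle records the step it actually used (a damped estimate, β < 1.0) and the step that LTE control allows once the earlier points have converged. Points are kept up to the first overestimated step, and everything after that is discarded.

```python
def damped_step(h_estimate, beta=0.8):
    """Scale an estimated forward step by a damping factor beta < 1.0.

    The default beta is an arbitrary illustrative value.
    """
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return beta * h_estimate


def committed_points(used_steps, allowed_steps):
    """Count the leading points of a scheduling cycle that can be kept.

    used_steps[i] is the step a thread actually took for point i;
    allowed_steps[i] is the step LTE control permits after the fact.
    The first overestimated step invalidates that point and all later ones.
    """
    kept = 0
    for used, allowed in zip(used_steps, allowed_steps):
        if used > allowed:
            break
        kept += 1
    return kept
```

For example, `committed_points([1.0, 1.2, 1.5], [1.1, 1.3, 1.4])` keeps the first two points and discards the third, which corresponds to the "partially completes" case in the figure.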

Slide 21: Experimental Setup
• An 8-processor Linux server with four dual-core processors
• WavePipe implemented in C/C++ using Pthreads (Gear2)
• Compare with
  – Reference serial SPICE-like (Gear2) transient simulation
  – Low-level parallel matrix solve (SuperLU) and device evaluation
• Test circuits
[Table: test circuits, with columns Index, Circuit, Size, Time Points, Serial Run Time (s). Circuits: VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, RLC mesh 1, RLC mesh 2.]

Slide 22: Experimental Results – Accuracy & Profiling
• 3-T (1 backward + 1 forward) WavePipe vs. serial (DB mixer)
• Real-time threading profiling (mesh ckt)

Slide 23: Experimental Results – 2T Speedups
• 2T 1-backward & 2T 1-forward
[Table: run time T(s) and speedup under 2T 1-backward and 2T 1-forward for VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, and the two RLC meshes. The only surviving entry is the final speedup, 1.57X.]

Slide 24: Experimental Results – 3T Speedups
• 3T 1-backward-1-forward & 3T 2-forward
[Table: run time T(s) and speedup under 3T 1-back-1-forward and 3T 2-forward for VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, and the two RLC meshes. The only surviving entry is the final speedup, 1.83X.]

Slide 25: Experimental Results – 4T Speedups
• 4T 1-backward-2-forward & 4T 3-forward
[Table: run time T(s) and speedup under 4T 1-back-2-forward and 4T 3-forward for VCO, Power Amplifier, DB mixer, Ring Oscillator, Frequency Divider, Digital Adder, and the two RLC meshes. The only surviving entry is the final speedup, 2.19X.]

Slide 26: Experimental Results – Runtime Scaling
[Figure: runtime scaling for 2-4 threads]

Slide 27: Experimental Results
• Low-level scheme
  – Parallel matrix solve & device model evaluation
• Proposed scheme
  – 1-4 threads: WavePipe
  – 8 threads: 3-forward WavePipe + parallel matrix solve & model evaluation

Slide 28: Summary
• Multi-core challenges & opportunities for EDA
• Application-level coarse-grained parallelism for transient simulation
  – Parallelize at the granularity of a single time-point circuit solution
  – Inherently low inter-core communication overhead
  – Maintain accuracy & convergence
  – Ease of implementation and code reuse
• Rich sets of parallelism for multi-core or many-core systems
  – New parallel opportunities orthogonal to fine-grained schemes
  – Pair with parallel matrix solve, device evaluation, and low-level parallel programming assists

Slide 29: Thanks