1
Structured and Parameterized Macromodeling for High-performance Mixed-Signal Circuits
Yiyu Shi, Electrical Engineering Dept., UCLA. http://eda.ee.ucla.edu. Research Projects Overview
2
New Design Challenges of VLSI CAD
High-performance design to verify integrity against DVT (delay, voltage, thermal) violations; robust design to maximize yield under PVT variation. High-performance designs are: high-frequency (strong electromagnetic coupling); large-scale (large numbers of nets and ports); structured (locally regular but globally irregular layout); parameterized (many design parameters, plus perturbations from variations). Detailed simulation will never finish! A fast circuit simulator is needed.

Current VLSI designs show two trends. One is to design for high performance and verify integrity against signal-delay, supply-voltage, and thermal violations. The other is to design for robustness, maximizing yield under process, voltage, and temperature variations. This thesis focuses on the first aspect, which faces the following challenges. First, high-performance designs target high speed and high frequency, so strong electromagnetic coupling exists. Second, design in deep submicron leads to distributed circuit models with large numbers of nets and ports. Moreover, integration is usually heterogeneous, producing a spatially non-uniform current and power distribution, so the circuit model is locally regular but globally irregular. In addition, there are many physical-design freedoms, such as gate/wire sizing, decap insertion, and clock-tree embedding, all of which need a parameterized description. This challenges circuit-level simulation: a detailed simulation will never finish. A fast simulator that performs accurate yet efficient verification and optimization becomes a necessity.
3
Structured and Parameterized Simulator
Compact macromodeling: capture the essential input/output behavior. Hierarchical simulation (structured): encapsulate the physics of diverse technologies and move automatically between hierarchical levels. Interfacing for synthesis/optimization (parameterized): compute performance sensitivities for fabrication decisions, layout modifications, and architectural changes. (Flow-diagram labels: Parameterization, Extraction, Structuring, Model Order Reduction, Circuit Stamping, Design Automation.) We envision that such a simulator should have these three properties.
4
Structure-preserving Macromodeling Parameterized Macromodeling
Outline: Background; Structure-preserving Macromodeling (TBS); Parameterized Macromodeling (EMPIRE). In the following, I'll first review the background of circuit stamping.
5
Macromodel by Model Reduction
Represent the circuit by state equations, then apply projection to reduce the size: a large system is projected into a small but dense one. Accuracy is preserved by matching moments of the state equations.

As shown in the figure, constructing a macromodel by model order reduction simply reduces the system size by projection with a small-sized matrix. The reduced model preserves accuracy by matching moments, obtained by expanding the system transfer function. To obtain the transfer function, let's first discuss how to describe the system by state equations.
6
State Equation Description by MNA
A network is described by two state variables, nodal voltages and branch currents (modified nodal analysis). [Figure: a resistor branch with node voltages vn+, vn- and branch current Ib.] The stamping is not symmetric but passive, and it is non-singular (full rank).

An electronic system can generally be described by state equations via modified nodal analysis (MNA). In standard MNA, a network is described by two state variables: nodal voltages and branch currents. The resulting state equation has two parts. One is the frequency-independent matrix, composed of the conductance matrix and the incidence matrix describing inductive current flow. The other is the frequency-dependent matrix, composed of the capacitance and inductance matrices. The resulting state-matrix stamping has the following properties. First, it is not symmetric but is passive; passivity means that the sum of the matrix and its transpose is positive semidefinite. Second, the MNA stamping is non-singular, in two senses. The state equation remains definite at dc, because the frequency-dependent matrix vanishes there; this is no surprise, since at dc an inductor is a short and a capacitor is open. And the state matrix is not rank-deficient, which matters especially for the frequency-independent matrix, because it is factorized many times during simulation.
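The stamping properties above are easy to check numerically on a tiny example. The circuit here is an assumption for illustration (a current source into node 1, a resistor between nodes 1 and 2, a capacitor at node 2, an inductor from node 1 to ground), but the stamp pattern G = [[Gc, E], [-E^T, 0]], C = diag(Cc, Lb) is the standard passive MNA convention:

```python
import numpy as np

# MNA stamping for a small assumed network: current source into node 1,
# resistor R between nodes 1-2, capacitor C2 at node 2, inductor Lb from
# node 1 to ground. State x = [v1, v2, iL]; system is (G + sC) x = B u.
R, C2, Lb = 2.0, 1e-3, 1e-6
g = 1.0 / R

G = np.array([[ g,  -g,  1.0],    # KCL at node 1: g(v1-v2) + iL = u
              [-g,   g,  0.0],    # KCL at node 2: g(v2-v1) + C2 dv2/dt = 0
              [-1.0, 0.0, 0.0]])  # branch row: -v1 + Lb diL/dt = 0
C = np.diag([0.0, C2, Lb])
B = np.array([[1.0], [0.0], [0.0]])  # unit current injected at node 1

# Properties claimed on the slide:
not_symmetric = not np.allclose(G, G.T)
passive = np.all(np.linalg.eigvalsh(G + G.T) > -1e-12)  # G + G^T is PSD
dc_definite = np.linalg.matrix_rank(G) == 3             # full rank at dc
print(not_symmetric, passive, dc_definite)
```

The incidence column E and its negated transpose -E^T cancel in G + G^T, which is why the non-symmetric stamp is still passive.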
7
Macromodeling by Moment Matching (I)
The solution of the MNA equations is contained in a block Krylov subspace. (MIMO MOR: the N x N system is reduced to q x q via N x q and q x N projection matrices.)

By defining two moment-generation matrices, it is easy to check that the solution of the MNA equations above is contained in a block Krylov subspace. Grimme's projection theorem says that we can find a lower-dimensional matrix V containing the qth-order block Krylov subspace; using such a V to project the original system, the resulting dimension-reduced system matches the original in a number of block moments.
8
Macromodeling by Moment Matching (II)
To remove linear dependency in the lower-dimensional projection matrix V, block-Arnoldi orthonormalization is applied. To preserve passivity, a flat congruence transformation projects each of the state matrices (G, C, B, L) respectively.

To handle large numbers of inputs, such as in P/G networks, SIMO (single-input, multi-output) reduction is assumed: the input port matrix B is replaced by a common input vector J. All poles are then common to one superposed input, and the matched moments/poles are independent of the number of inputs.
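A PRIMA-style sketch of these two steps, on assumed toy matrices rather than a real netlist: block-Arnoldi orthonormalization removes linear dependency among the Krylov blocks, and the congruence transform applied to each state matrix keeps the reduced stamping passive.

```python
import numpy as np

# PRIMA-style sketch (assumed toy data): orthonormal basis V of the block
# Krylov space K_q(G^{-1}C, G^{-1}B), then congruence reduction.
rng = np.random.default_rng(1)
n, p, q = 40, 2, 3                    # state size, ports, Krylov order
M = rng.standard_normal((n, n))
G = M @ M.T + n * np.eye(n)           # G + G^T positive definite: passive
C = np.diag(rng.uniform(0.5, 1.5, n)) # PSD storage matrix
B = rng.standard_normal((n, p))

A = np.linalg.solve(G, C)
Vblocks = [np.linalg.qr(np.linalg.solve(G, B))[0]]
for _ in range(q - 1):                # block Arnoldi: orthogonalize each
    W = A @ Vblocks[-1]               # new block against all previous ones
    for Vj in Vblocks:
        W -= Vj @ (Vj.T @ W)
    Vblocks.append(np.linalg.qr(W)[0])
V = np.hstack(Vblocks)                # n x (p*q), orthonormal columns

# Congruence transform applied to each state matrix separately.
Gr, Cr, Br = V.T @ G @ V, V.T @ C @ V, V.T @ B

# Passivity survives: x^T (Gr + Gr^T) x = (Vx)^T (G + G^T) (Vx) >= 0.
print(np.all(np.linalg.eigvalsh(Gr + Gr.T) > -1e-9))
```

The key design point is that passivity is a property of the quadratic form, so any congruence V^T(.)V with real V inherits it; a general similarity transform would not.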
9
Limitations of Previous Approaches
Flat projection destroys the block matrix structure, i.e., sparsity, hierarchy, and latency. Global orthonormalization is inefficient for large-scale circuits with inductance. No sensitivity information is provided.
10
Structure-preserving Macromodeling Parameterized Macromodeling
Outline: Background; Structure-preserving Macromodeling (TBS); Parameterized Macromodeling (EMPIRE); Conclusions. Next, I'll present our structure-preserving macromodeling, TBS.
11
Structure-preserving Macromodeling
SoC, SiP, and 3D integration introduce heterogeneous switching-current densities, hence structured layouts and structured models in the forms of sparsity, hierarchy, and latency. Flat projection destroys the structure of the original model: the reduced matrix is dense (neither sparse nor hierarchical); the model shows no latency, contains redundant information, and cannot be analyzed with different time steps; and moment/pole matching is not localized.

Research on simulating inverse-inductance (L^-1) elements includes first-order and second-order methods. First-order methods stamp L^-1 by modified nodal analysis; direct stamping leads to a non-passive model, and a double-inversion-based stamping has been proposed, but it needs the extra cost of inverting the L matrix. Second-order methods stamp L^-1 by nodal analysis; as discussed in our paper, such NA stamping is singular at dc and cannot be stamped back for time-domain simulation. Moreover, none of these methods preserves structure during reduction, so the reduced system is still dense. The primary contributions of our work are twofold. First, we propose vector-potential nodal analysis (VNA) to represent L^-1 with a non-singular and passive stamping. Second, we apply a bordered-block-diagonal structured reduction (BVOR) to preserve not only passivity but also sparsity and hierarchy.
12
From Layout to Structured Model
Build a structured state matrix by partitioning the layout: stamp basic blocks diagonally and interconnection blocks off-diagonally. [Figure: an 8-node mesh partitioned into two 4-node sub-meshes with wire widths w1 and w2 (conductances g1, g2) coupled through gx; combined diagonal entries g3 = 2g1 + gx and g4 = 2g2 + gx.]

We first build a structured state matrix by partitioning the layout. For the 8-node mesh on the left, the structured state matrix is constructed as follows. We partition the mesh into two sub-meshes, each with a different wire width, and stamp the two sub-meshes diagonally into the state matrix; we call these diagonal parts basic blocks. We stamp the interconnections between the two sub-meshes off-diagonally and call those parts interconnection blocks. A number of interconnected basic blocks can represent both homogeneous and heterogeneous circuits.
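A minimal sketch of this stamping, with assumed conductance values and a helper (`chain_laplacian`) invented for illustration: two 4-node sub-meshes are stamped as diagonal basic blocks, and the gx bridges between them become off-diagonal interconnection blocks.

```python
import numpy as np

def chain_laplacian(g, nnodes):
    """Conductance stamp of a chain of resistors, each with conductance g."""
    Gd = np.zeros((nnodes, nnodes))
    for a in range(nnodes - 1):
        Gd[a, a] += g; Gd[a + 1, a + 1] += g
        Gd[a, a + 1] -= g; Gd[a + 1, a] -= g
    return Gd

g1, g2, gx = 2.0, 1.0, 0.5            # assumed wire/bridge conductances
G11 = chain_laplacian(g1, 4)          # basic block 1 (wire width w1)
G22 = chain_laplacian(g2, 4)          # basic block 2 (wire width w2)

# Interconnection: node i of block 1 bridges to node i of block 2 via gx.
X = -gx * np.eye(4)                   # off-diagonal interconnection block
G = np.block([[G11 + gx * np.eye(4), X],
              [X.T, G22 + gx * np.eye(4)]])

# Diagonal blocks carry the local sub-mesh, off-diagonal blocks the coupling.
print(np.allclose(G, G.T), G.shape)
```

The block layout is the point: the same circuit stamped flat would scatter g1, g2, and gx entries with no visible partition, while here each sub-mesh stays a contiguous diagonal block.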
13
Properties of Interconnected Basic Blocks
Structure of latency: the spatial distribution of time constants; each basic block has a time constant. Redundancy: different basic blocks can share the same or a similar time constant, so the basic-block representation is not compact.

Given a structured P/G grid (left), it can be represented by the interconnected basic blocks (right). Such an interconnected-basic-block representation has the following properties. First, because each basic block has a time constant, there is a spatial distribution of time constants, described by the structure of latency. In addition, different basic blocks can share the same or a similar time constant, described as redundancy. Due to redundancy, the basic-block representation is not compact.
14
TBS Flow
Flow: Basic Blocks -> (Dominant-pole Clustering) -> Compact Blocks -> (Triangularization) -> Triangular Blocks -> (Block Diagonal Projection) -> Reduced Blocks -> (Two-level Relaxation Analysis); block integrity is maintained throughout. Our first step uses dominant-pole clustering to find latency and remove redundancy.
15
Clustering Procedure Compress basic blocks into compact blocks
Represent each basic block by its first q-dominant pole set, called its mode, calculated from a qth-order SIMO reduction. Cluster basic blocks based on the mode distance defined below. The cluster number is determined by the nature of the P/G grid structure: a uniform (RC-constant) grid needs no clustering, as it contains no latency information.
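The clustering idea can be sketched as follows. Everything here is an assumed toy setup (1x1 RC blocks, a greedy relative-distance rule standing in for the slide's mode-distance formula): eight basic blocks drawn around two underlying time constants collapse into two compact blocks.

```python
import numpy as np

# Dominant-pole clustering sketch (assumed toy blocks): each basic block's
# mode is the dominant pole of -C_i^{-1} G_i; blocks with close modes merge.
rng = np.random.default_rng(2)

def block_mode(Gi, Ci, q=1):
    """Return the q slowest poles (largest real part) of a block."""
    poles = np.linalg.eigvals(-np.linalg.solve(Ci, Gi))
    return np.sort(poles.real)[-q:]

# 8 basic blocks drawn around two underlying time constants (latency!).
blocks = []
for k in range(8):
    tau = 1.0 if k < 4 else 10.0      # two families of time constants
    g = 1.0 + 0.05 * rng.standard_normal()
    blocks.append((np.array([[g]]), np.array([[g * tau]])))

modes = [block_mode(Gi, Ci)[0] for Gi, Ci in blocks]

# Greedy clustering on relative mode distance |p_i - p_j| / |p_j|.
clusters = []
for i, m in enumerate(modes):
    for c in clusters:
        if abs(m - modes[c[0]]) / abs(modes[c[0]]) < 0.5:
            c.append(i); break
    else:
        clusters.append([i])
print(len(clusters))   # the 8 basic blocks collapse into 2 compact blocks
```

Note that the conductance scale g cancels in -C^{-1}G here, so blocks with the same time constant land on exactly the same mode regardless of wire width, which is the redundancy the clustering removes.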
16
Advantages of Clustering
A complete modal decomposition is achieved: each compact block has a unique pole set (mode), and the resulting system is block-wise stiff (poles are well separated). Redundant poles are removed, similar to a low-rank approximation. Latency is discovered: each compact block can be solved with a different time step.

Clustering has the following advantages. First, a complete modal decomposition is achieved, where each compact block has a unique pole set or mode; the resulting system is block-wise stiff, meaning the poles are well separated. Second, redundant poles are removed, and so are the redundant columns of the projection matrix, avoiding a restarted deflation procedure. Furthermore, latency is discovered, so each compact block may be simulated with its own time step. However, due to the interconnections, the system poles are not determined by the compact diagonal blocks alone.
17
TBS Flow
Flow: Basic Blocks -> (Dominant-pole Clustering) -> Compact Blocks -> (Triangularization) -> Triangular Blocks -> (Block Diagonal Projection) -> Reduced Blocks -> (Two-level Relaxation Analysis); block integrity is maintained throughout. Our second step uses triangularization to localize the system poles at the diagonal blocks. This is the key contribution of this work.
18
Triangularization Procedure
Step 1: stack a replica block diagonally. Step 2: move the original lower-triangular parts into the new upper-triangular parts.

Triangularization first stacks a replica block diagonally, then moves the original lower-triangular parts to the new upper-triangular parts, yielding an upper-triangular system with m+1 blocks. The big C matrix is constructed similarly, and likewise for the state variables and inputs. Note that triangularization can be implemented with a block-matrix data structure without increasing memory usage.
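The structural manipulation can be sketched on a 2x2-block matrix with made-up numbers. This only illustrates the block bookkeeping of the two steps, not the full TBS equivalence argument: the replica of G11 is stacked as a third diagonal block, and the lower coupling G21 is relocated above the diagonal.

```python
import numpy as np

# Triangularization sketch on a 2x2-block conductance matrix (toy numbers).
G11 = np.array([[2.0, -1.0], [-1.0, 2.0]])
G22 = np.array([[3.0, -1.0], [-1.0, 3.0]])
G12 = np.array([[-0.5, 0.0], [0.0, 0.0]])
G21 = G12.T
Z = np.zeros((2, 2))

G = np.block([[G11, G12],
              [G21, G22]])            # original 2-block system

Gt = np.block([[G11, G12, Z],
               [Z,   G22, G21],       # step 2: G21 moved above the diagonal
               [Z,   Z,   G11]])      # step 1: replica block (m+1 = 3 blocks)

# The result is block upper-triangular: factorization only ever touches the
# diagonal blocks G11, G22 and the replica of G11.
lower = np.tril(Gt, -1)
blockdiag_lower = np.block([[np.tril(G11, -1), Z, Z],
                            [Z, np.tril(G22, -1), Z],
                            [Z, Z, np.tril(G11, -1)]])
print(np.allclose(lower, blockdiag_lower))  # no coupling below the diagonal
```

The check confirms the slide's point: everything strictly below the block diagonal now lives inside the diagonal blocks themselves, so block backward substitution suffices.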
19
Advantages of Triangularization
System poles are localized at the compact diagonal blocks, and each compact block is almost decoupled. Block duplication yields an equivalent solution. This simplifies existing permutation-based triangularization procedures such as KLU [Kim-Davis]. A triangular system's factorization cost comes only from the diagonal blocks; there is no need to factorize the entire matrix. However, due to the replica block, the overall factorization cost is the same as for the original system.
20
TBS Flow
Flow: Basic Blocks -> (Dominant-pole Clustering) -> Compact Blocks -> (Triangularization) -> Triangular Blocks -> (Block Diagonal Projection) -> Reduced Blocks -> (Two-level Relaxation Analysis); block integrity is maintained throughout. Our third step uses block diagonal projection to reduce the system size and hence the overall factorization cost.
21
Block Diagonal Projection Procedure
Split a flat projection matrix into a structured (block-diagonal) one, whose rank is larger by a factor of the cluster number, and project the state matrices block by block.

We first split the flat projection matrix into a structured projection matrix; the resulting structured matrix has its rank increased by a factor of the cluster number. Using the structured projection matrix, we reduce the state matrices block by block. It is clear that the reduced system preserves the upper-triangular block structure.
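This step can be sketched directly, using assumed toy sizes: a flat n x q projection matrix is split row-wise into m pieces that are placed on a block diagonal, raising the rank from q to m*q while keeping each block's projection local.

```python
import numpy as np

# Block-diagonal projection sketch (toy data): split a flat n x q projection
# matrix into m block pieces and project each state-matrix block locally.
rng = np.random.default_rng(3)
m, nb, q = 3, 10, 2                   # m blocks of nb states, order q each
n = m * nb
Vflat, _ = np.linalg.qr(rng.standard_normal((n, q)))

# Split row-wise and re-orthonormalize each piece.
pieces = [np.linalg.qr(Vflat[i * nb:(i + 1) * nb, :])[0] for i in range(m)]
Vstruct = np.zeros((n, m * q))
for i, P in enumerate(pieces):
    Vstruct[i * nb:(i + 1) * nb, i * q:(i + 1) * q] = P

G = rng.standard_normal((n, n)) + n * np.eye(n)
Gr = Vstruct.T @ G @ Vstruct          # (m*q) x (m*q), block pattern preserved

print(np.linalg.matrix_rank(Vstruct), np.linalg.matrix_rank(Vflat))
```

Because each reduced block Gr[i,j] depends only on the corresponding G block and the two local pieces, zero blocks of G stay zero after reduction, which is exactly the structure preservation the flat projection destroys.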
22
Advantages of Block Diagonal Projection
System moments and poles are locally preserved: each compact block is reduced locally to match q poles, so mq poles in total are matched for the m unique compact blocks (poles from the replica are duplicates). More matched poles improve accuracy: in other words, a low-order reduction applied locally to each compact block achieves high-order accuracy for the overall system. The reduced model keeps the block-triangular structure, so each reduced block can be factorized independently and can have a different time constant; the system can be solved efficiently by block backward substitution or a two-level analysis with relaxation.
23
TBS Flow
Flow: Basic Blocks -> (Dominant-pole Clustering) -> Compact Blocks -> (Triangularization) -> Triangular Blocks -> (Block Diagonal Projection) -> Reduced Blocks -> (Two-level Relaxation Analysis); block integrity is maintained throughout. The final step is a two-level relaxation analysis, which further reduces simulation cost by exploiting latency.
24
Two-level Relaxation Solver
Two-level representation and analysis: each reduced diagonal block can be factorized independently and solved with a different time step during backward-Euler (BE) integration. In contrast, the previous pole-residue solution in PACT [Kerns-Yang:TCAD'98] eigen-decomposes the entire reduced matrix, which is dense and unstructured, so no latency can be exploited.

The reduced system is decomposed into diagonal and off-diagonal blocks. Only the diagonal blocks need factorization, and each reduced block can be solved with its own time step during backward-Euler integration. In contrast, a pole-residue solution must eigen-decompose the entire reduced matrix and exposes no latency. Note that the time-domain iteration of a triangular system always converges stably.
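One backward-Euler step on a block upper-triangular system can be sketched as follows, on an assumed toy 3-block system (uniform time step here; the per-block time-step refinement is omitted). The point is that only the diagonal blocks are ever solved, last block first.

```python
import numpy as np

# One backward-Euler step (C/h + G) x_new = (C/h) x_old + b solved by block
# backward substitution on an assumed block upper-triangular toy system.
rng = np.random.default_rng(4)
nb, m = 5, 3
G = np.zeros((m * nb, m * nb))
for i in range(m):
    G[i*nb:(i+1)*nb, i*nb:(i+1)*nb] = (rng.standard_normal((nb, nb))
                                       + nb * np.eye(nb))
    if i + 1 < m:                       # upper-triangular coupling only
        G[i*nb:(i+1)*nb, (i+1)*nb:(i+2)*nb] = \
            0.1 * rng.standard_normal((nb, nb))
C = np.diag(rng.uniform(0.5, 1.5, m * nb))
b = rng.standard_normal(m * nb)
h = 0.01
A = C / h + G                           # block upper-triangular BE matrix

x_old = np.zeros(m * nb)
rhs = (C / h) @ x_old + b
x = np.zeros(m * nb)
for i in reversed(range(m)):            # block backward substitution
    r = rhs[i*nb:(i+1)*nb].copy()
    for j in range(i + 1, m):           # subtract already-solved couplings
        r -= A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] @ x[j*nb:(j+1)*nb]
    x[i*nb:(i+1)*nb] = np.linalg.solve(A[i*nb:(i+1)*nb, i*nb:(i+1)*nb], r)

print(np.allclose(x, np.linalg.solve(A, rhs)))  # matches monolithic solve
```

In a real implementation each diagonal block would be LU-factorized once and reused across time steps; the off-diagonal blocks appear only in matrix-vector products.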
25
Triangular Block Structure Preservation
We first show triangular-block structure preservation. The blue entries in the figure show the nonzero (nz) pattern of the conductance matrices: (a) the original system, (b) the system after triangularization, and (c) the reduced system produced by TBS. After block-diagonal projection, the reduced system (c) preserves the upper-triangular block structure of (b).
26
m x q Pole Matching
Next, we show the accuracy improvement from m x q pole matching. A non-uniform RC mesh with 32 basic blocks (m0 = 32) is used; after clustering we obtain m = 4 compact blocks and reduce each by an 8th-order (q = 8) reduction. Pole-matching comparison (left figure): TBS matches 32 poles exactly; BSMOR matches 8 poles exactly and 24 approximately; HiPRIME (a partitioned PRIMA, an existing hierarchical method) matches only 8 poles. The right figure compares time-domain waveforms of TBS against BSMOR, HiPRIME, and the HNE method [Zhao:DAC'00]: accuracy improves with more matched poles.
27
Study Runtime Scalability
Runtime comparison (build time / simulation time per method):
ckt1: TBS 0.09s / 0.08s; BSMOR 0.15s / 0.12s; HiPRIME - / 0.44s; HNE - / -
ckt2: TBS 0.11s / 1.02s; BSMOR 0.63s / 1.18s; HiPRIME 0.54s / 1min:42s; HNE 2.19s / 1.24s
ckt3: TBS 1.62s / 1min:32s; BSMOR 1min:2s / 1min:38s; HiPRIME 5.76s / 2hr:48min:20s; HNE 1min:17s / 1min:51s
ckt4: TBS 20.7s / 11min:23s; BSMOR 4min:54s / 11min:42s; HiPRIME 47.3s / ~1day; HNE 34min:58s / 21min:32s
ckt5: TBS 2min:8s / 1day:18min; BSMOR 1hr:45min / 1day:1hr:36min; HiPRIME 2min:42s / ~5day; HNE 4hr:43min:18s / 1day:5hr:11min
ckt6: TBS 6min:16s / 1day:1hr:29min; BSMOR, HiPRIME, HNE: NA

Runtime includes building and simulation time. All methods generate macromodels with similar accuracy. TBS (and HiPRIME) is >133X faster to build than HNE, as no LP-truncation is needed to preserve sparsity. TBS (and HiPRIME) is >54X faster to build than BSMOR, as the orthonormalization is performed locally. TBS (and BSMOR/HNE) is >109X faster to simulate than HiPRIME, as their macromodels have hierarchy. Finally, we study the runtime scalability of these methods.
28
Structure-preserving Macromodeling Parameterized Macromodeling
Outline: Background; Structure-preserving Macromodeling (TBS); Parameterized Macromodeling (EMPIRE); Conclusions. Next, I'll present our parameterized macromodeling, EMPIRE.
29
Parameterized MOR
Most physical-design and optimization problems involve nonlinear optimization: decap allocation, shield insertion, thermal-via planning, and structured P/G and clock network sizing. Sensitivities are needed to linearize the nonlinear objective function. Parameterized model order reduction generates macromodels with all parameters preserved; the moments with respect to the parameters of the design (PODs) are exactly the sensitivities, so moment matching implies sensitivity matching. Previous work [Daniel:TCAD'04] extends PRIMA to handle parameterized systems, but can only handle a small number of parameters and match moments up to a low order. CORE [Li:ICCAD'05] uses explicit-and-implicit moment matching for parameterized interconnect model reduction, but it still cannot match the moments of a huge number of parameters to a very high order, nor match the moments of different parameters with different accuracy.
30
Major Contribution of EMPIRE
EMPIRE is an efficient yet accurate model-order-reduction method for physical design with multiple parameters. Compared with CORE, with a small reduction size it uses implicit moment matching to match high-order POD moments, which is more accurate than the explicit moment matching used in CORE. It can also match the moments of different PODs with different accuracy, according to their influence on the objective. Experimental results show that, compared with CORE and [Daniel:TCAD'04], EMPIRE achieves 47.8X better accuracy at similar runtime.
31
Framework of EMPIRE
32
Parameter Number Reduction
Canonical form of a general parameterized system: (E0 + s1 E1 + s2 E2 + … + st Et) x = B u, y = L^T x. Define the significance SIG(si) of a parameter si, evaluated over any value in the range of si. Theorem 1: SIG(si) shows the perturbation magnitude of si on the output. Therefore, we can neglect the parameters with relatively small SIG values and thus reduce the total number of parameters.
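The pruning step can be sketched as below. The slide does not reproduce the exact SIG formula, so this example substitutes an assumed norm-based surrogate, SIG(si) ≈ max|si| * ||Ei|| / ||E0||, which likewise bounds the relative perturbation a parameter can contribute:

```python
import numpy as np

# Parameter pruning sketch. SIG here is an assumed norm-based surrogate for
# the slide's significance measure, NOT the paper's exact formula.
rng = np.random.default_rng(6)
n, t = 15, 6
E0 = rng.standard_normal((n, n)) + n * np.eye(n)
# Six parameters: two significant, four with tiny coefficient matrices.
scales = [1.0, 0.8, 1e-5, 1e-6, 1e-5, 1e-6]
E = [s * rng.standard_normal((n, n)) for s in scales]
s_range = np.full(t, 0.1)               # assumed |s_i| <= 0.1 for all i

sig = [s_range[i] * np.linalg.norm(E[i]) / np.linalg.norm(E0)
       for i in range(t)]
keep = [i for i in range(t) if sig[i] > 1e-4 * max(sig)]
print(keep)                              # only significant parameters remain
```

The four near-zero coefficient matrices fall below the relative threshold and are dropped, shrinking the parameterized system from six parameters to two before any reduction is attempted.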
33
Framework of EMPIRE
34
Projection Space Collapse
Find the original projection matrix by the traditional algorithms; then find a new projection matrix V such that the weighted distance between colspan(V) and the original column span is minimized. The optimization problem can be solved directly or iteratively, by three different methods: nonlinear programming (NP), which solves it directly; sequential least squares (sLS); and sequential barycenter allocation (sBA), which uses a quadratic approximation. From NP to sLS to sBA, runtime decreases but so does accuracy.
35
Three different methods
Find a new projection matrix V such that the weighted distance between colspan(V) and the original column span is minimized.
Nonlinear programming (NP): solve the optimization problem directly; expensive but optimal; usable for small-scale problems.
Sequential least squares (sLS): solve the optimization problem incrementally; each step finds the one candidate column with the smallest distance to the vectors in V, orthogonalized against the ones already found.
Sequential barycenter allocation (sBA): use the barycenter to approximate the optimal sLS solution.
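An sLS-style greedy selection can be sketched as follows. The setup is assumed (orthonormal candidate directions, one per parameter, with made-up importance weights rather than the paper's weighting): each step picks the candidate with the largest weighted residual against the span chosen so far and orthogonalizes it in.

```python
import numpy as np

# Sequential greedy column selection sketch (assumed setup): pick k columns
# from candidate directions Vt, favoring heavily weighted parameters.
rng = np.random.default_rng(7)
n, t, k = 30, 8, 3
Vt = np.linalg.qr(rng.standard_normal((n, t)))[0]   # candidate directions
w = np.array([1.0, 0.9, 0.8, 0.1, 0.1, 0.05, 0.05, 0.01])  # importance

chosen, Q = [], np.zeros((n, 0))
for _ in range(k):
    # Weighted residual of each unchosen candidate after projecting onto Q.
    res = [w[j] * np.linalg.norm(Vt[:, j] - Q @ (Q.T @ Vt[:, j]))
           if j not in chosen else -1.0 for j in range(t)]
    j = int(np.argmax(res))
    chosen.append(j)
    qnew = Vt[:, j] - Q @ (Q.T @ Vt[:, j])          # Gram-Schmidt update
    Q = np.hstack([Q, (qnew / np.linalg.norm(qnew)).reshape(-1, 1)])
print(sorted(chosen))   # the heavily weighted directions are selected
```

With orthonormal candidates the residual norms start equal, so the weights alone decide the order; in the collapsed projection space, accuracy is thus spent on the parameters that matter most to the objective.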
36
Framework of EMPIRE
37
Frequency Domain Moment Expansion & Projection
Frequency-domain moments are critical to waveform accuracy, so we match more of them. With the coefficient matrix corresponding to the frequency variable, EMPIRE matches frequency-domain moments up to qth order through projection.
38
Experimental Settings
We use extracted RC meshes of different sizes from industrial applications. All algorithms are implemented in MATLAB on a Linux workstation (P4 2.66GHz CPU, 2GB RAM). We compare the runtime, time- and frequency-domain accuracy, and scalability of our hybrid algorithm against [Daniel:TCAD'04] and CORE.
39
Waveform Comparison
P/G RC meshes with 5000 parameters (pitch widths). EMPIRE is identical to the original in both the time domain (a) and the frequency domain (b), and more accurate than CORE and [Daniel:TCAD'04].
40
Waveform Comparison
Output integral with respect to a randomly selected parameter (pitch width). EMPIRE is identical to the original, and more accurate than CORE and [Daniel:TCAD'04].
41
Scalability Comparison
Time-domain waveform relative error versus reduction size: EMPIRE has smaller error and converges faster than CORE.
42
Runtime Comparison
Runtime comparison among the three methods on RC meshes of different scales. EMPIRE's runtime is similar to CORE's; it is 18.3X faster than [Daniel:TCAD'04] in model-reduction time and 61.2X faster in simulation time. In addition, [Daniel:TCAD'04] cannot finish the large examples.