TBS: Fast Analysis of Structured Power Grid by Triangularization Based Structure Preserving Model Order Reduction
Hao Yu, Yiyu Shi and Lei He
Electrical Engineering Dept., UCLA
Partially supported by NSF and UC-MICRO fund sponsored by Analog Devices, Intel and Mindspeed
http://eda.ee.ucla.edu
2 New Challenges in Integrity Verification
- Integrity verification checks for transient V/T violations in linear power/signal/thermal networks
  - Large-scale
    - millions of nodes and ports
  - Often structured
    - e.g., locally regular and globally irregular P/G network [Singh-Sapatnekar:TCAD’05]
- A fast yet accurate linear simulator is necessary to perform large-scale transient verification
  - Linear-network macromodeling is one effective approach
How can structure information be used to build accurate and efficient macromodels?
3 Existing Structured Macromodeling
- Hierarchical node-elimination (HNE) by [Zhao-Panda-Sapatnekar-Blaauw:DAC’00]
  - Builds the macromodel by internal node elimination with source mapping
  - Analyzes the macromodel in a hierarchical (two-level) fashion
  - Requires sparsification by linear programming (LP) due to the dense fill-in
- SPRIM [Freund:ICCAD’04] and BSMOR [Yu-He-Tan:BMAS’05]
  - Leverage the block structure in the state matrix
  - Build the macromodel by structure-preserved moment matching
- HiPRIME [Cao-Lee-Chen:DAC’02], a hierarchical extension of PRIMA [Odabasioglu-Celik-Pileggi:TCAD’98]
  - Builds the macromodel by hierarchical orthonormalization
  - Loses the hierarchy due to the final flat projection
We propose a new structure-preserved moment matching, with 20x less waveform error and 50x speedup
4 Outline
- Review macromodeling by moment matching
- Our Approach: TBS method
- Experimental Results
- Conclusions
5 Macromodeling by Moment Matching (I)
- Electric systems can be described in MNA (modified nodal analysis)
- The solution (x) of the MNA equations is contained in a block Krylov subspace
- Grimme’s Projection Theorem
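The equations on this slide are not present in the text. As a reminder, and as an assumption about the notation rather than a quote from the slide, the standard frequency-domain MNA formulation from the PRIMA literature (expansion about s = 0, with the state matrices G, C, B, L named on the next slide) reads:

```latex
\begin{aligned}
(G + sC)\,x(s) &= B\,u(s), \qquad y(s) = L^{T} x(s) \\
A &= -G^{-1}C, \qquad R = G^{-1}B \\
x(s) &= (I - sA)^{-1} R\,u(s) = \sum_{k \ge 0} s^{k} A^{k} R\,u(s) \\
\mathcal{K}(A, R, q) &= \operatorname{span}\{R,\; AR,\; \dots,\; A^{q-1}R\}
\end{aligned}
```

In this notation, the projection theorem can be paraphrased as: if the columns of V span the block Krylov subspace K(A, R, q), the model projected by V matches the first q block moments L^T A^k R of the original system.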
6 Macromodeling by Moment Matching (II)
a) To remove linear dependency in the low-dimensional projection matrix V, block-Arnoldi orthonormalization is applied
b) To preserve passivity, a congruence transformation is used to project the state matrices (G, C, B, L) respectively
c) To handle a large number of inputs, as in a P/G network, SIMO (single-input-multi-output) reduction can be assumed [Feldmann-Liu:ICCAD’04]
  - Replace the input port matrix B by a common input vector J
  - All poles are matched w.r.t. one superposed input
  - The number of matched moments/poles (q) is independent of the number of inputs (p)
V is flat and destroys the structure of the state matrices
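Steps a) and b) can be sketched in a few lines of NumPy. This is a minimal dense-matrix illustration, not the authors' implementation: the function names, the expansion point s = 0, and the 20-node RC ladder used as a test circuit are all assumptions made for the example.

```python
import numpy as np

def block_arnoldi(G, C, B, q):
    """Orthonormal basis of span{R, AR, ..., A^(q-1) R}, A = -G^-1 C, R = G^-1 B."""
    Q, _ = np.linalg.qr(np.linalg.solve(G, B))
    blocks = [Q]
    for _ in range(q - 1):
        W = -np.linalg.solve(G, C @ blocks[-1])
        for _pass in range(2):          # two Gram-Schmidt passes for stability
            for Vk in blocks:
                W = W - Vk @ (Vk.T @ W)
        Q, _ = np.linalg.qr(W)
        blocks.append(Q)
    return np.hstack(blocks)

def congruence_reduce(G, C, B, L, V):
    """Project every state matrix with the same V (congruence transformation)."""
    return V.T @ G @ V, V.T @ C @ V, V.T @ B, V.T @ L

def moments(G, C, B, L, count):
    """First `count` moments L^T A^k R of the transfer function about s = 0."""
    X = np.linalg.solve(G, B)
    out = []
    for _ in range(count):
        out.append(L.T @ X)
        X = -np.linalg.solve(G, C @ X)
    return np.array(out)

# Hypothetical test circuit: RC ladder, driven and observed at node 0.
n, q = 20, 4
G = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
C = np.eye(n)
B = np.zeros((n, 1))
B[0, 0] = 1.0
L = B.copy()

V = block_arnoldi(G, C, B, q)
Gr, Cr, Br, Lr = congruence_reduce(G, C, B, L, V)
```

Comparing `moments(Gr, Cr, Br, Lr, q)` with `moments(G, C, B, L, q)` shows the first q moments agreeing. Because the same V is applied on both sides, a symmetric G stays symmetric after reduction, which is the property the passivity argument relies on.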
7 Structure-preserved Moment Matching
- SPRIM [Freund:ICCAD’04] leverages the 2 x 2 block structure in MNA
  - Splits V into a 2 x 2 block-diagonal form
  - Preserves the structure of reciprocity (symmetry between input and output), and hence achieves higher accuracy than PRIMA
- BSMOR [Yu-He-Tan:BMAS’05] partitions the state matrices into more blocks
  - Splits V into an m x m block-diagonal form
  - Preserves the block structure and sparsity, and hence achieves better efficiency than SPRIM
- Limitations of SPRIM and BSMOR
  - Moment/pole matching is not localized
  - Reduction does not preserve the structure of latency
  - Model does not leverage redundancy
  - Inefficient and inaccurate for P/G grid macromodeling
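The splitting of a flat V into block-diagonal form can be illustrated as follows. This is a sketch under assumptions: the row partition `block_sizes`, the function name, and the per-block QR re-orthonormalization are choices made for the example, not details taken from SPRIM or BSMOR.

```python
import numpy as np

def split_block_diagonal(V, block_sizes):
    """Cut V by rows into blocks, orthonormalize each block, and place the
    results on the diagonal of a larger, structured projection matrix."""
    bounds = np.cumsum([0] + list(block_sizes))
    parts = []
    for i in range(len(block_sizes)):
        Q, _ = np.linalg.qr(V[bounds[i]:bounds[i + 1], :])
        parts.append(Q)
    Vs = np.zeros((V.shape[0], sum(Q.shape[1] for Q in parts)))
    col = 0
    for i, Q in enumerate(parts):
        Vs[bounds[i]:bounds[i + 1], col:col + Q.shape[1]] = Q
        col += Q.shape[1]
    return Vs

# Hypothetical flat basis: 10 states, 3 columns, split into blocks of 4 and 6 rows.
rng = np.random.default_rng(0)
V = np.linalg.qr(rng.standard_normal((10, 3)))[0]
Vs = split_block_diagonal(V, [4, 6])
```

Here `Vs` has 6 columns instead of 3, and the column space of the flat V is contained in that of `Vs`, so every moment matched by V is still matched while the block partition of the state matrices survives the projection.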
8 Outline
- Review macromodeling by moment matching
- Our Approach: TBS method
  - Triangular Block Structured moment matching
- Experimental Results
- Conclusions
9 From Layout to Structured Model
- Build a structured state matrix by partitioning the layout
  - Stamp basic blocks diagonally
  - Stamp interconnection blocks off-diagonally
- A number of interconnected basic blocks can be used to represent both homogeneous and heterogeneous circuits
[Figure: two basic blocks (labels w1, w2) with conductances g1 and g2, coupled through gx, and the resulting block-structured conductance matrix, where g3 = 2g1 + gx and g4 = 2g2 + gx]
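The stamping rule can be made concrete with a toy example. The topology below (two 3-node chains joined at their middle nodes) is hypothetical and chosen only so that the diagonal entries 2g1 + gx and 2g2 + gx from the figure appear; the function names are likewise assumptions.

```python
import numpy as np

def stamp(G, a, b, g):
    """Stamp a conductance g between nodes a and b into G."""
    G[a, a] += g
    G[b, b] += g
    G[a, b] -= g
    G[b, a] -= g

def build_two_block_grid(g1, g2, gx, n=3):
    """Two n-node chains (basic blocks) coupled by gx at their middle nodes."""
    G = np.zeros((2 * n, 2 * n))
    for k, g in enumerate((g1, g2)):
        offset = k * n
        for i in range(n - 1):
            stamp(G, offset + i, offset + i + 1, g)   # lands in diagonal block k
    stamp(G, n // 2, n + n // 2, gx)                  # lands in off-diagonal blocks
    return G

G = build_two_block_grid(g1=1.0, g2=2.0, gx=0.5)
```

In the resulting matrix the two 3 x 3 diagonal blocks hold the basic blocks, and the only nonzero entry of each off-diagonal block is the coupling term -gx.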
10 Properties of Interconnected Basic Blocks
- Structure of latency: the spatial distribution of time constants
  - Each basic block has a time constant
- Redundancy: different basic blocks can share the same or a similar time constant
Due to redundancy, the basic-block representation is not compact
11 TBS Flow
Dominant-pole based clustering removes redundancy
[Flow: (Basic Blocks) -> Dominant-pole Clustering -> (Compact Blocks) -> Triangularization -> (Triangular Blocks) -> Block Diagonal Projection -> (Reduced Blocks) -> Two-level Relaxation Analysis -> (Block Integrity)]
12 Clustering Procedure
- Compress basic blocks into compact blocks
  1. Calculate the q-dominant pole-set (mode) for each basic block and [equation missing]
  2. Cluster basic blocks if the mode-distance is small enough
- Cluster number is determined by the nature of the network structure
  - There is no need to cluster a homogeneous circuit, but TBS still applies
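The two steps can be sketched as below. The slide's definition of mode-distance is not recoverable from the text, so the relative Euclidean distance between sorted pole sets used here is an assumption, as are the greedy grouping rule, the threshold `tol`, and the function names.

```python
import numpy as np

def dominant_poles(G, C, q):
    """The q poles of a block closest to the origin.
    Poles of (G + sC) are the eigenvalues of -C^-1 G."""
    poles = np.linalg.eigvals(-np.linalg.solve(C, G))
    nearest = np.argsort(np.abs(poles))[:q]
    return np.sort_complex(poles[nearest])

def cluster_blocks(modes, tol):
    """Greedy clustering: a block joins the first cluster whose representative
    mode lies within relative distance `tol` (assumed distance measure)."""
    clusters = []                      # list of (representative mode, members)
    for i, mode in enumerate(modes):
        for rep, members in clusters:
            if np.linalg.norm(mode - rep) <= tol * np.linalg.norm(rep):
                members.append(i)
                break
        else:
            clusters.append((mode, [i]))
    return [members for _, members in clusters]

# Hypothetical basic blocks: blocks 0 and 1 share the same time constants.
T = 2.0 * np.eye(5) - np.eye(5, k=1) - np.eye(5, k=-1)
blocks = [(1.0 * T, 1.0 * np.eye(5)),
          (2.0 * T, 2.0 * np.eye(5)),
          (10.0 * T, 1.0 * np.eye(5))]
modes = [dominant_poles(G, C, q=3) for G, C in blocks]
clusters = cluster_blocks(modes, tol=0.05)
```

With these inputs, blocks 0 and 1 fall into one cluster and block 2 forms its own, so three basic blocks compress into two compact blocks.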
13 Advantages of Clustering
- Redundant poles are removed
  - Hence redundant columns in the projection matrix are also removed, i.e., the effective rank of the projection matrix is improved
- Structure of latency is leveraged
  - Each compact block can be solved with a different time-step
- A complete modal decomposition is achieved
  - Each compact block has a unique pole-set or mode, and the resulting system is block-wise stiff
System poles are determined by both diagonal and off-diagonal blocks, which is not efficient
14 TBS Flow
Triangularization can localize system poles to diagonal blocks, which is the key contribution of this work
[Flow: (Basic Blocks) -> Dominant-pole Clustering -> (Compact Blocks) -> Triangularization -> (Triangular Blocks) -> Block Diagonal Projection -> (Reduced Blocks) -> Two-level Relaxation Analysis -> (Block Integrity)]
15 Triangularization Procedure
1. Stack a replica-block diagonally
2. Move the original lower-triangular parts to the new upper-triangular parts
- This procedure is implemented by a block matrix data structure without increasing memory usage
16 Advantages of Triangularization
- System poles are determined only by the compact blocks on the diagonal
  - Compact blocks are almost decoupled from each other
- A triangular system has a factorization cost coming only from the diagonal blocks
  - There is no need to factorize the entire matrix
- Block duplication results in an equivalent solution
  - Simpler than the existing permutation-based triangularization procedure [Tim Davis: KLU]
Due to the replica block, the overall cost of factorization is the same as the original
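The first two advantages follow from a general property of block upper-triangular matrices, which can be checked numerically. The sketch below does not reproduce the replica-block construction itself; it takes an already triangular 2 x 2 block matrix with made-up values and shows that only the diagonal blocks need to be solved, and that the spectrum is the union of the diagonal-block spectra. The same determinant argument applies to the pencil G + sC.

```python
import numpy as np

def block_back_substitute(blocks, b_parts):
    """Solve a block upper-triangular system. blocks[i][j] is block (i, j),
    or None for a zero block. Only diagonal blocks are ever solved."""
    m = len(b_parts)
    x = [None] * m
    for i in reversed(range(m)):
        rhs = b_parts[i].copy()
        for j in range(i + 1, m):
            if blocks[i][j] is not None:
                rhs -= blocks[i][j] @ x[j]
        x[i] = np.linalg.solve(blocks[i][i], rhs)
    return x

# Hypothetical triangular system with two diagonal blocks.
A11 = np.array([[4.0, -1.0], [-1.0, 3.0]])
A22 = np.array([[6.0, -2.0], [-2.0, 5.0]])
A12 = -np.eye(2)
blocks = [[A11, A12], [None, A22]]
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

x = block_back_substitute(blocks, b)

# Dense reference, used only to check the result.
T = np.block([[A11, A12], [np.zeros((2, 2)), A22]])
eig_full = np.sort(np.linalg.eigvals(T).real)
eig_diag = np.sort(np.concatenate([np.linalg.eigvalsh(A11),
                                   np.linalg.eigvalsh(A22)]))
```

`eig_full` and `eig_diag` coincide: the off-diagonal block A12 influences the solution but not the eigenvalues.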
17 TBS Flow
Block diagonal projection can reduce the system size and the cost of the factorization
[Flow: (Basic Blocks) -> Dominant-pole Clustering -> (Compact Blocks) -> Triangularization -> (Triangular Blocks) -> Block Diagonal Projection -> (Reduced Blocks) -> Two-level Relaxation Analysis -> (Block Integrity)]
18 Block Diagonal Projection Procedure
1. Split the flat projection matrix V into a structured (block-diagonal) one, with its rank increased by a factor of the cluster number
2. Reduce the state matrices block by block respectively
- The reduced system preserves the upper-triangular structure
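Why the upper-triangular structure survives can be seen from the block-wise form of the projection: with a block-diagonal V, reduced block (i, j) is V_i^T G_ij V_j, so a zero block stays zero. The sketch below uses random matrices and hypothetical helper names purely for illustration.

```python
import numpy as np

def assemble(blocks):
    """Dense matrix from a 2-D list of blocks; None stands for a zero block."""
    m = len(blocks)
    rows = [blocks[i][i].shape[0] for i in range(m)]
    cols = [blocks[j][j].shape[1] for j in range(m)]
    return np.block([[np.zeros((rows[i], cols[j])) if blocks[i][j] is None
                      else blocks[i][j] for j in range(m)] for i in range(m)])

def project_blockwise(blocks, V_list):
    """Reduce block by block: block (i, j) becomes V_i^T G_ij V_j."""
    m = len(V_list)
    return [[None if blocks[i][j] is None
             else V_list[i].T @ blocks[i][j] @ V_list[j]
             for j in range(m)] for i in range(m)]

# Hypothetical block upper-triangular matrix with block sizes 6 and 5.
rng = np.random.default_rng(1)
sizes, q = [6, 5], 2
G = [[rng.standard_normal((6, 6)), rng.standard_normal((6, 5))],
     [None, rng.standard_normal((5, 5))]]
V_list = [np.linalg.qr(rng.standard_normal((n_i, q)))[0] for n_i in sizes]

Gr = project_blockwise(G, V_list)
V = assemble([[V_list[0], None], [None, V_list[1]]])
```

The block-wise result equals the dense product `V.T @ assemble(G) @ V`, and its lower-left q x q block is exactly zero, so the reduced diagonal blocks can still be factorized one at a time.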
19 Advantages of Block Diagonal Projection
- System moments and poles are matched locally
  - Each compact block is reduced locally to match q poles
  - In total, mq poles are matched for m unique compact blocks (poles from the replica are duplicates)
- Reduced model preserves the block triangular structure and the structure of latency
  - Each reduced block can be factorized independently
  - Each reduced block can have a different time-constant
- More matched poles improve accuracy
  - Using a low-order reduction for each compact block locally can achieve high-order accuracy for the overall system
The reduced model can be efficiently solved by a block backward-substitution or a two-level analysis with relaxation
20 TBS Flow
Two-level relaxation can further reduce simulation cost
[Flow: (Basic Blocks) -> Dominant-pole Clustering -> (Compact Blocks) -> Triangularization -> (Triangular Blocks) -> Block Diagonal Projection -> (Reduced Blocks) -> Two-level Relaxation Analysis -> (Block Integrity)]
21 Two-level Relaxation Solver
- The time-domain iteration of a triangular system always converges [White: Book’87]
- Two-level representation and analysis
- Each reduced diagonal block can be factorized independently, and solved with a different time step during backward-Euler (BE) integration
  - In contrast, the previous pole-residue solution
    - eigen-decomposes the entire reduced matrix (dense, with no structure)
    - cannot exploit the structure of latency
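A simplified view of the per-block time stepping is sketched below, assuming the model C x' + G x = B u with block upper-triangular G and C. For brevity it uses one common step size h for all blocks and a direct block back-substitution in place of a relaxation loop (for a triangular system, one sweep from the last block to the first is already exact). The matrices and the function name are hypothetical.

```python
import numpy as np

def be_step_triangular(G, C, x, bu, h, sizes):
    """One backward-Euler step of C x' + G x = bu:
    (G + C/h) x_new = C x / h + bu, solved block by block from the bottom."""
    A = G + C / h
    rhs = C @ x / h + bu
    bounds = np.cumsum([0] + list(sizes))
    x_new = np.zeros_like(x)
    for i in reversed(range(len(sizes))):
        r = slice(bounds[i], bounds[i + 1])
        coupled = A[r, bounds[i + 1]:] @ x_new[bounds[i + 1]:]
        x_new[r] = np.linalg.solve(A[r, r], rhs[r] - coupled)  # diagonal block only
    return x_new

# Hypothetical reduced model with two 2 x 2 diagonal blocks and a DC input.
sizes = [2, 2]
G = np.array([[4.0, -1.0, -1.0, 0.0],
              [-1.0, 3.0, 0.0, -1.0],
              [0.0, 0.0, 6.0, -2.0],
              [0.0, 0.0, -2.0, 5.0]])
C = np.eye(4)
bu = np.array([1.0, 0.0, 0.0, 1.0])
h = 0.1

x_first = be_step_triangular(G, C, np.zeros(4), bu, h, sizes)
x = np.zeros(4)
for _ in range(200):
    x = be_step_triangular(G, C, x, bu, h, sizes)
```

After enough steps with a constant input, `x` settles at the DC solution of G x = bu. Assigning a separate step size to each block, as the slide describes, would additionally require interpolating the coupling terms between blocks, which this sketch leaves out.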
22 Outline
- Review macromodeling by moment matching
- Our Approach: TBS method
- Experimental Results
- Conclusions
23 Experiment Settings
- Large-scale homogeneous and heterogeneous P/G grids (RC-mesh) with millions of nodes
- For the heterogeneous case, each block has a different wire-pitch/width and block-size, and hence a different time-constant
- The reduction algorithm assumes SIMO reduction for a large number of inputs, but also supports general MIMO reduction
- Compare TBS to BSMOR [Yu-He-Tan:BMAS’05], HiPRIME [Cao-Lee-Chen:DAC’02], and HNE [Zhao-Panda-Sapatnekar-Blaauw:DAC’00]
24 Triangular Block Structure Preservation
- Nonzero (nz) pattern of conductance matrices
  - (a) original system
  - (b) triangular system
  - (c) reduced system by TBS
25 m x q Pole Matching
- Setting: m0=32, m=4, q=8
- TBS has 32 poles exactly matched; BSMOR has 8 poles exactly matched and 24 poles approximately matched; HiPRIME (a partitioned PRIMA) has only 8 poles matched
- Waveforms in time domain: improved accuracy with more matched poles
26 Study Waveform-error Scalability
ckt    Node (N)   Port (p)   Order (q)   HNE      HiPRIME    BSMOR      TBS
ckt1   1K         …          …           …e-6     9.09e-6    4.87e-6    5.03e-7
ckt2   10K        …          …           …e-5     2.31e-5    7.93e-6    1.84e-6
ckt3   100K       …          …           …e-2     6.82e-4    1.91e-4    3.02e-5
ckt4   1M         …          …           …e-2     9.67e-3    4.23e-3    1.27e-4
ckt5   7.68M      …          …           …        ….93e-2    5.10e-2    3.01e-3
ckt6   7.68M      6.14M      300         NA       NA         NA         5.04e-3
- HiPRIME, BSMOR and TBS use the same order (moments) to generate the macromodel
- The macromodel obtained by HNE has a similar size and sparsity as TBS
1. TBS reduces waveform error by 38X compared to HNE, as the truncation used in HNE leads to large error
2. TBS reduces waveform error by 33X compared to HiPRIME, as more poles are matched
3. TBS reduces waveform error by 17X compared to BSMOR, as more poles are exactly matched
27 Study Runtime Scalability
      HNE                             HiPRIME                         BSMOR                           TBS
ckt   build           simulation      build           simulation      build           simulation      build           simulation
ckt1  0.44s           0.08s           0.15s           1.02s           0.12s           0.08s           0.09s           0.08s
ckt2  2.19s           1.24s           0.54s           1min:42s        0.63s           1.18s           0.11s           1.02s
ckt3  1min:17s        1min:51s        5.76s           2hr:48min:20s   1min:2s         1min:38s        1.62s           1min:32s
ckt4  34min:58s       21min:32s       47.3s           ~1day           4min:54s        11min:42s       20.7s           11min:23s
ckt5  4hr:43min:18s   1day:5hr:11min  2min:42s        ~5day           1hr:45min       1day:1hr:36min  2min:8s         1day:18min
ckt6  NA              NA              NA              NA              NA              NA              6min:16s        1day:1hr:29min
- Runtime includes macromodel-building/simulation time
- All methods generate macromodels with similar accuracy
1. TBS (and HiPRIME) is 133X faster to build than HNE, as no LP-truncation is needed to preserve sparsity
2. TBS (and HiPRIME) is 54X faster to build than BSMOR, as the orthonormalization is performed locally
3. TBS (and BSMOR/HNE) is 109X faster to simulate than HiPRIME, as their macromodels have hierarchy
28 Conclusions
- TBS enables localized moment matching, and matches more poles than PRIMA
- TBS is stable, and is passive for MIMO reduction
- TBS is applicable to both homogeneous and heterogeneous designs
- TBS achieves over 20x less waveform error and 50x speedup compared to HNE, HiPRIME, and BSMOR (an improved version of SPRIM)
- The TBS approach has been extended to
  - Handle inductance and its inverse element [Yu-Shi-He:ICCAD’06]
  - Optimize simultaneous power and thermal integrity in 3D integration [Yu-Ho-He:ICCAD’06]
More details can be found in the DAC Ph.D. forum 2006