Improving Runtime and Memory Requirements in EDA Applications Alan Mishchenko UC Berkeley.


2 Overview
- Introduction
- Topics
  - Network traversal
  - AIG package
  - SAT solver
  - BDD package
  - Memory management
  - Locality of computation
- Conclusion

3 Network Traversal
- Optimizing node memory for DFS traversal
- Storing fanins/fanouts in the node
- Using traversal IDs
- Using wave-front traversals
- Minimizing memory footprint

4 Memory Allocation in Topological Order
- Optimize node memory for DFS traversal
- Allocate memory from an array in a DFS order
[Figure: example network between primary inputs and primary outputs, with nodes numbered 1-8]
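The DFS-order allocation above can be sketched as follows. This is a minimal illustration, not ABC code: the `Node` type, the use of `-1` for "no fanin", and the function names are assumptions. Nodes reachable from the outputs are copied into a fresh array so that a later DFS touches memory in increasing address order.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical two-input node; fanins are array indices, -1 marks "no fanin". */
typedef struct { int fanin0, fanin1; } Node;

/* Assigns new positions in DFS post-order: fanins first, then the node itself. */
static void dfs_assign(const Node *old, int id, int *newId, int *next)
{
    if (id < 0 || newId[id] >= 0)
        return;                       /* no fanin, or already placed */
    dfs_assign(old, old[id].fanin0, newId, next);
    dfs_assign(old, old[id].fanin1, newId, next);
    newId[id] = (*next)++;
}

/* Copies the nodes reachable from the outputs into a new array laid out in DFS order.
   newId[] (size nOld, provided by the caller) receives the old-to-new index map.
   Returns the new array (the caller frees it) and stores its length in *nNew. */
Node *relayout_dfs(const Node *old, int nOld, const int *outs, int nOuts,
                   int *newId, int *nNew)
{
    int i, next = 0;
    Node *res;
    for (i = 0; i < nOld; i++)
        newId[i] = -1;
    for (i = 0; i < nOuts; i++)
        dfs_assign(old, outs[i], newId, &next);
    res = (Node *)malloc(sizeof(Node) * (size_t)(next > 0 ? next : 1));
    assert(res != NULL);
    for (i = 0; i < nOld; i++) {
        if (newId[i] < 0)
            continue;                 /* unreachable from the outputs */
        res[newId[i]].fanin0 = old[i].fanin0 < 0 ? -1 : newId[old[i].fanin0];
        res[newId[i]].fanin1 = old[i].fanin1 < 0 ? -1 : newId[old[i].fanin1];
    }
    *nNew = next;
    return res;
}
```

The recursion is kept for brevity; a very deep network would need an explicit stack.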

5 Store Fanins/Fanouts in the Node
- Embed the dynamic array into the node
  - The node can then point directly to its fanins/fanouts or store their integer IDs
- In the rare cases when memory reallocation is needed (<0.1% of nodes), use a new piece of memory to store the extended array of fanins/fanouts

    struct Nwk_Obj_t_ {
        …
        int          nFanins;      // the number of fanins
        int          nFanouts;     // the number of fanouts
        int          nFanioAlloc;  // the number of allocated fanins/fanouts
        Nwk_Obj_t ** pFanio;       // fanins/fanouts
    };

    // allocate the node and its fanin/fanout array as one block of memory
    pObj = (Nwk_Obj_t *)Aig_MmFlexEntryFetch( sizeof(Nwk_Obj_t) + sizeof(Nwk_Obj_t *) * (nFanins + nFanouts + p->nFanioPlus) );
    // the array starts right after the node structure
    pObj->pFanio = (Nwk_Obj_t **)((char *)pObj + sizeof(Nwk_Obj_t));

6 Traversal ID
- Use a specialized integer data-member of the node to remember the number of the last traversal that visited this node

    void Nwk_ManDfs_rec( Nwk_Man_t * p, Nwk_Obj_t * pObj, Vec_Ptr_t * vNodes )
    {
        // skip the node if the current traversal has already visited it
        if ( Nwk_ObjIsTravIdCurrent(p, pObj) )
            return;
        Nwk_ObjSetTravIdCurrent(p, pObj);
        Nwk_ManDfs_rec( p, Nwk_ObjFanin0(pObj), vNodes );
        Nwk_ManDfs_rec( p, Nwk_ObjFanin1(pObj), vNodes );
        Vec_PtrPush( vNodes, pObj );
    }

    Vec_Ptr_t * Nwk_ManDfs( Nwk_Man_t * p )
    {
        Vec_Ptr_t * vNodes;
        Nwk_Obj_t * pObj;
        int i;
        // starting a new traversal invalidates all marks without touching the nodes
        Nwk_ManIncrementTravId( p );
        vNodes = Vec_PtrAlloc();
        Nwk_ManForEachPo( p, pObj, i )
            Nwk_ManDfs_rec( p, pObj, vNodes );
        return vNodes;
    }

7 Wave-Front Traversals
- Some applications use additional memory at each node
  - Examples: simulation, cut enumeration, support computation
  - 1 KB per node for 1M nodes = 1 GB of additional memory!
- Case study: Computing input supports of each output of the network
  - Used, for example, to compute (a) output partitioning, (b) register dependency matrix (A. Dasdan et al., “An experimental study of minimum mean cycle algorithms”, 1998)
  - Code: procedure Aig_ManSupports() in file “abc\src\aig\aig\aigPart.c”
- At any time during traversal, the wave-front is the set of nodes such that all fanins are already visited and at least one fanout is not yet visited
- Additional memory is only needed for the nodes on the wave-front
- For most industrial designs, the wave-front is about 1% of all nodes (1 GB → 10 MB)
[Figure: wave-front]
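A minimal sketch of the wave-front idea for support computation. It is not the code of Aig_ManSupports(); the `WfNode` type and all names are hypothetical, and supports are assumed to fit into a single 64-bit mask (at most 64 primary inputs). A node's buffer is allocated when the node is visited and freed as soon as its last fanout has consumed it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node: fanins are indices of earlier nodes, or -1 for a primary input.
   nFanouts counts fanout edges (a fanout using the node twice counts twice). */
typedef struct { int fanin0, fanin1; int nFanouts; } WfNode;

/* Visits the nodes in topological order and computes each support as a bit mask.
   piIndex[i] is the bit position of primary input i (ignored for other nodes).
   outSupp[i] is filled only for nodes without fanouts (the outputs).
   Returns the peak number of buffers that were alive at the same time. */
int supports_wavefront(const WfNode *nodes, int n, const int *piIndex, uint64_t *outSupp)
{
    uint64_t **supp;
    int *refs;
    int i, k, live = 0, peak = 0;
    assert(n > 0);
    supp = (uint64_t **)calloc((size_t)n, sizeof(uint64_t *));
    refs = (int *)malloc(sizeof(int) * (size_t)n);
    assert(supp != NULL && refs != NULL);
    for (i = 0; i < n; i++) {
        int fanins[2];
        fanins[0] = nodes[i].fanin0;
        fanins[1] = nodes[i].fanin1;
        refs[i] = nodes[i].nFanouts;
        supp[i] = (uint64_t *)malloc(sizeof(uint64_t));   /* node enters the wave-front */
        assert(supp[i] != NULL);
        if (++live > peak)
            peak = live;
        *supp[i] = (fanins[0] < 0) ? ((uint64_t)1 << piIndex[i]) : 0;
        for (k = 0; k < 2; k++) {
            int f = fanins[k];
            if (f < 0)
                continue;
            *supp[i] |= *supp[f];
            if (--refs[f] == 0) {     /* last fanout visited: fanin leaves the wave-front */
                free(supp[f]);
                supp[f] = NULL;
                live--;
            }
        }
        if (refs[i] == 0) {           /* output: record the result and release at once */
            outSupp[i] = *supp[i];
            free(supp[i]);
            supp[i] = NULL;
            live--;
        }
    }
    free(refs);
    free(supp);
    return peak;
}
```

In a real application the per-node buffer would be much larger than one word, which is where bounding the number of live buffers pays off.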

8 Minimizing Memory Footprint
- When repeatedly traversing a large network, runtime is determined by the amount of memory pumped through the CPU (pointer chasing)
- Examples when repeated traversal cannot be avoided
  - Sequential simulation of a network for many cycles
  - Computing maximum network flow during retiming, etc.
- In such applications, it is better to develop a specialized, static, low-memory representation of the network
  - Reducing memory 2x may improve runtime 3-5x
  - Example: Most-forward retiming (code in “abc\src\aig\aig\aigRet.c”)
- If repeated topological and reverse topological traversals are performed, it may be better to have two networks, each having memory allocated to facilitate its traversal order

9 Implementation of AIG Package
- Fixed amount of memory for each AIG node
  - Arbitrary fanout also uses a fixed amount of memory per node!
  - Different memory configurations
- Structural hashing
  - The only potentially non-cache-friendly operation
  - Tricks to speed up structural hashing
- AIGER: Compact binary AIG representation format
  - Work of Armin Biere (Johannes Kepler University, Linz, Austria)
  - Available at http://fmv.jku.at/aiger

10 AIG Node

12 bytes (32-bit) / 12 bytes (64-bit)

    struct Gia_Obj_t_ {
        unsigned iDiff0 : 29;  // the diff of the first fanin
        unsigned fCompl0:  1;  // the complemented attribute
        unsigned fMark0 :  1;  // first user-controlled mark
        unsigned fTerm  :  1;  // terminal node (CI/CO)
        unsigned iDiff1 : 29;  // the diff of the second fanin
        unsigned fCompl1:  1;  // the complemented attribute
        unsigned fMark1 :  1;  // second user-controlled mark
        unsigned fPhase :  1;  // value under 000 pattern
        unsigned Value;        // application-specific value
    };

36 bytes (32-bit) / 56 bytes (64-bit)

    struct Aig_Obj_t_ {
        Aig_Obj_t *  pNext;        // strashing table
        Aig_Obj_t *  pFanin0;      // fanin
        Aig_Obj_t *  pFanin1;      // fanin
        Aig_Obj_t *  pHaig;        // pointer to the HAIG node
        unsigned int Type   :  3;  // object type
        unsigned int fPhase :  1;  // value under 00...0 pattern
        unsigned int fMarkA :  1;  // multipurpose mask
        unsigned int fMarkB :  1;  // multipurpose mask
        unsigned int nRefs  : 26;  // reference count
        unsigned     Level  : 24;  // the topological level
        unsigned     nCuts  :  8;  // the number of cuts
        int          Id;           // unique ID
        int          TravId;       // ID of the last traversal
        union {                    // temporary storage
            void *   pData;
            int      iData;
            float    fData;
        };
    };

- ABC has several AIG packages
  - A low-memory package is used for simulation and equivalence checking
  - A more elaborate package is used for general AIG manipulation
- Observation: It is better to store node fanins as integer IDs rather than pointers
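To illustrate the integer-ID observation, here is a small sketch of a node that follows the 12-byte bit layout above. The type `MiniObj` and the helper names are hypothetical. Because the node stores the distance to each fanin rather than a pointer, its size does not depend on the pointer width, and a fanin is found by subtracting the stored difference from the node's own address in the array.

```c
#include <assert.h>

/* Hypothetical compact node mirroring the 12-byte layout shown above. */
typedef struct {
    unsigned iDiff0  : 29;  /* distance (in nodes) back to the first fanin */
    unsigned fCompl0 :  1;
    unsigned fMark0  :  1;
    unsigned fTerm   :  1;
    unsigned iDiff1  : 29;  /* distance (in nodes) back to the second fanin */
    unsigned fCompl1 :  1;
    unsigned fMark1  :  1;
    unsigned fPhase  :  1;
    unsigned Value;
} MiniObj;

/* Fanins sit earlier in the same array, so subtracting the difference finds them. */
MiniObj *mini_fanin0(MiniObj *p) { return p - p->iDiff0; }
MiniObj *mini_fanin1(MiniObj *p) { return p - p->iDiff1; }

/* Writes an AND node at position id with fanins at positions id0 and id1. */
void mini_set_and(MiniObj *nodes, unsigned id, unsigned id0, int c0,
                  unsigned id1, int c1)
{
    assert(id0 < id && id1 < id);   /* fanins must precede the node */
    nodes[id].iDiff0  = id - id0;
    nodes[id].fCompl0 = c0 ? 1u : 0u;
    nodes[id].iDiff1  = id - id1;
    nodes[id].fCompl1 = c1 ? 1u : 0u;
    nodes[id].fTerm   = 0;
}
```

Bit-field layout is implementation-defined in C; with common compilers the two groups of 32 bits plus `Value` occupy 12 bytes.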

11 Fixed-Memory Fanout for AIGs
- Solution (due to Satrajit Chatterjee):
  - Use 5 pointers (integers) for each node
  - One pointer (integer) contains the first fanout of the node
  - The other pointers (integers) are used to create two doubly linked lists
  - Each list stores the fanout representation of the corresponding fanin
  - Doubly linked lists allow for constant-time addition/removal of node fanouts
- Code in file “abc\src\aig\aig\aigFanout.c”
[Figure: a node n with its two fanins, its first-fanout pointer, and the two linked lists holding the fanouts of the first fanin and the fanouts of the second fanin]
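The five-integer scheme can be sketched as below. This is a simplified illustration under stated assumptions, not the layout used in aigFanout.c: the `FoNode` type, the `-1` terminator, and the encoding of a list entry as `2*id + k` ("node `id`, reached through its `k`-th fanin") are all hypothetical choices.

```c
#include <assert.h>

/* Hypothetical node record: two fanin ids plus the five fanout integers. */
typedef struct {
    int fanin[2];  /* ids of the two fanins, or -1 */
    int first;     /* first fanout entry of this node, or -1 */
    int prev[2];   /* prev[k]: previous entry in the fanout list of fanin[k] */
    int next[2];   /* next[k]: next entry in the fanout list of fanin[k] */
} FoNode;

void fo_init(FoNode *nodes, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        nodes[i].fanin[0] = nodes[i].fanin[1] = -1;
        nodes[i].first = -1;
        nodes[i].prev[0] = nodes[i].prev[1] = -1;
        nodes[i].next[0] = nodes[i].next[1] = -1;
    }
}

/* Makes node id a fanout of f through its k-th fanin: push on the front of f's list. */
void fo_add(FoNode *nodes, int id, int k, int f)
{
    int e = 2 * id + k, head = nodes[f].first;
    assert(k == 0 || k == 1);
    nodes[id].fanin[k] = f;
    nodes[id].prev[k] = -1;
    nodes[id].next[k] = head;
    if (head >= 0)
        nodes[head >> 1].prev[head & 1] = e;
    nodes[f].first = e;
}

/* Unlinks node id from the fanout list of its k-th fanin in constant time. */
void fo_remove(FoNode *nodes, int id, int k)
{
    int f = nodes[id].fanin[k], p = nodes[id].prev[k], n = nodes[id].next[k];
    assert(f >= 0);
    if (p >= 0)
        nodes[p >> 1].next[p & 1] = n;
    else
        nodes[f].first = n;
    if (n >= 0)
        nodes[n >> 1].prev[n & 1] = p;
    nodes[id].fanin[k] = nodes[id].prev[k] = nodes[id].next[k] = -1;
}

/* Walks the fanout list of node f and returns its length. */
int fo_count(const FoNode *nodes, int f)
{
    int e, cnt = 0;
    for (e = nodes[f].first; e >= 0; e = nodes[e >> 1].next[e & 1])
        cnt++;
    return cnt;
}
```

Each node pays for its own membership in its fanins' lists, so a node with thousands of fanouts still uses the same fixed record.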

12 Structural Hashing
- The only potentially non-cache-friendly AIG operation
  - Structural hashing is very valuable, but hashing cannot be avoided
- The standard hash table is used, with nodes having the same hash key linked into singly linked lists
  - The pointer to the next node is embedded in the AIG node
  - Tried the linear-probing hash table without improvement
- Trick to sometimes avoid the hash-table look-up
  - When building a new node, do not look it up in the table if at least one of its fanins has reference counter 0
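A sketch of the chained hash table with the look-up shortcut. The trick works because a fanin with reference counter 0 has no fanouts, so no existing node can have it as a fanin. The types, the fixed capacities, and the hash function are illustrative assumptions; constant and trivial cases (such as equal fanins) are left out.

```c
#include <assert.h>
#include <string.h>

#define SH_MAX_NODES  1024   /* illustrative fixed capacity */
#define SH_TABLE_SIZE 1031   /* prime number of hash bins */

typedef struct {
    int lit0, lit1;  /* fanin literals: 2*id + complement bit, lit0 <= lit1 */
    int nRefs;       /* number of fanouts */
    int next;        /* next node in the same hash bin, or -1 (embedded in the node) */
} ShNode;

typedef struct {
    ShNode nodes[SH_MAX_NODES];
    int    bins[SH_TABLE_SIZE];
    int    nNodes;
    int    nLookups; /* counts hash-table look-ups, for illustration only */
} ShMan;

void sh_init(ShMan *p, int nInputs)
{
    int i;
    assert(nInputs >= 0 && nInputs < SH_MAX_NODES);
    memset(p, 0, sizeof(*p));
    for (i = 0; i < SH_TABLE_SIZE; i++)
        p->bins[i] = -1;
    for (i = 0; i < nInputs; i++)
        p->nodes[i].next = -1;
    p->nNodes = nInputs;     /* nodes 0..nInputs-1 are primary inputs */
}

static unsigned sh_hash(int lit0, int lit1)
{
    return ((unsigned)lit0 * 7937u + (unsigned)lit1 * 2971u) % SH_TABLE_SIZE;
}

/* Returns the id of AND(lit0, lit1), reusing an existing node when there is one. */
int sh_and(ShMan *p, int lit0, int lit1)
{
    unsigned h;
    int id;
    if (lit0 > lit1) { int t = lit0; lit0 = lit1; lit1 = t; }
    assert(lit0 >= 0 && (lit1 >> 1) < p->nNodes);
    h = sh_hash(lit0, lit1);
    /* Skip the look-up when a fanin has no fanouts: no existing node can use it. */
    if (p->nodes[lit0 >> 1].nRefs > 0 && p->nodes[lit1 >> 1].nRefs > 0) {
        p->nLookups++;
        for (id = p->bins[h]; id >= 0; id = p->nodes[id].next)
            if (p->nodes[id].lit0 == lit0 && p->nodes[id].lit1 == lit1)
                return id;
    }
    assert(p->nNodes < SH_MAX_NODES);
    id = p->nNodes++;
    p->nodes[id].lit0  = lit0;
    p->nodes[id].lit1  = lit1;
    p->nodes[id].nRefs = 0;
    p->nodes[id].next  = p->bins[h];   /* link into the bin's list */
    p->bins[h] = id;
    p->nodes[lit0 >> 1].nRefs++;
    p->nodes[lit1 >> 1].nRefs++;
    return id;
}
```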

13 AIGER
- Uses ~3 bytes per AIG node, on average
  - A 1M-node AIG can be written into a 3 MB file
  - ~12x more compact than Verilog, BLIF, or BENCH
  - ~5x faster reading/writing for large files
- Key observations used by AIGER
  - To represent a node, two integers (fanin literals) need to be represented
  - The fanin literals are often numerically close
  - Only the difference between them needs to be stored, which typically takes only one byte
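The difference-based encoding can be sketched along the lines of the binary AIGER format: an AND node with output literal `lhs` and fanin literals `rhs0 >= rhs1` is written as the two differences `lhs - rhs0` and `rhs0 - rhs1`, each as a variable-length sequence of 7-bit groups. The function names below are hypothetical, and this is a sketch of the idea rather than a complete reader or writer.

```c
#include <assert.h>
#include <stddef.h>

/* Writes x as 7-bit groups, least significant first; the high bit of a byte
   signals that more bytes follow. Returns the number of bytes written. */
size_t put_delta(unsigned char *buf, unsigned x)
{
    size_t n = 0;
    while (x & ~0x7fu) {
        buf[n++] = (unsigned char)((x & 0x7f) | 0x80);
        x >>= 7;
    }
    buf[n++] = (unsigned char)x;
    return n;
}

/* Reads one value written by put_delta into *x; returns the bytes consumed. */
size_t get_delta(const unsigned char *buf, unsigned *x)
{
    size_t n = 0;
    unsigned shift = 0, val = 0;
    while (buf[n] & 0x80) {
        val |= (unsigned)(buf[n++] & 0x7f) << shift;
        shift += 7;
    }
    val |= (unsigned)buf[n++] << shift;
    *x = val;
    return n;
}

/* Encodes one AND node, storing only the two differences between literals. */
size_t put_and(unsigned char *buf, unsigned lhs, unsigned rhs0, unsigned rhs1)
{
    size_t n;
    assert(lhs > rhs0 && rhs0 >= rhs1);
    n  = put_delta(buf, lhs - rhs0);
    n += put_delta(buf + n, rhs0 - rhs1);
    return n;
}
```

When nodes are numbered consecutively, the output literal is implied by the node's position and only the differences are written, which is consistent with the 2-3 bytes per node quoted above.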

14 SAT Solver
- A modern SAT solver (in particular, MiniSAT) is a treasure trove of tricks for efficient implementation
- To mention just a few
  - Representing clauses as arrays of integers
  - Using signatures to check clause containment
  - Using the two-literal watching scheme
  - etc.
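A sketch of the signature trick for clause containment, with clauses stored as arrays of integer literals. The hashing of a literal to one of 32 bits is an illustrative assumption, not MiniSAT's exact function. If clause A sets a signature bit that clause B lacks, A cannot be a subset of B, so the expensive literal-by-literal comparison is skipped.

```c
#include <assert.h>
#include <stdint.h>

/* Builds a 32-bit signature: each (non-negative) literal sets one bit. */
uint32_t clause_sig(const int *lits, int n)
{
    uint32_t sig = 0;
    int i;
    for (i = 0; i < n; i++) {
        assert(lits[i] >= 0);
        sig |= (uint32_t)1 << ((unsigned)lits[i] & 31);
    }
    return sig;
}

/* Exact check that every literal of a occurs in b (quadratic, for illustration). */
static int contains_slow(const int *a, int na, const int *b, int nb)
{
    int i, k;
    for (i = 0; i < na; i++) {
        for (k = 0; k < nb; k++)
            if (a[i] == b[k])
                break;
        if (k == nb)
            return 0;
    }
    return 1;
}

/* Returns 1 if clause a is a subset of clause b, using signatures as a filter. */
int clause_subset(const int *a, int na, uint32_t sigA,
                  const int *b, int nb, uint32_t sigB)
{
    if (na > nb || (sigA & ~sigB) != 0)
        return 0;   /* a has a signature bit that b lacks: cannot be a subset */
    return contains_slow(a, na, b, nb);
}
```

Signatures can collide (literals 2 and 34 map to the same bit here), so a passing filter still needs the exact check.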

15 SAT Solver (What’s Missing?)
- Most modern SAT solvers are geared to solving hard problems, such as those encountered in SAT competitions (1 problem ~ 15 min)
  - This motivates elaborate data-structures and high memory usage
  - 64 bytes per variable; 16 bytes per clause; 4 bytes per literal
- In ABC, runtime of several applications is dominated by SAT
  - SAT sweeping
  - Sequential SAT sweeping (register/signal correspondence)
  - Accumulation of structural choices
  - Computing don’t-cares in a window
- The SAT problems solved in these applications have much in common
  - Incremental (each problem has +/- 10 AIG nodes, compared to the previous problem solved)
  - Relatively easy (less than 100 conflicts)
  - Numerous (10K-100K problems)
- Based on these observations, a new efficient circuit-based SAT solver was developed (abc\src\aig\gia\giaCSat.c)

16 Experimental Results (SAT)

17 Experimental Results (CEC)
- CEC results for 8 hard industrial instances
- Runtime in minutes on Intel Q9450 @ 2.66 GHz
- Time1 is “cec” in ABC809xx; Time2 is “&cec” in abc90329
- Timeout is 1 hour
- Less than 100 MB of RAM was used in these experiments

18 Why Is MiniSAT Slower?
- Requires multiple intermediate steps
  - Window → AIG → CNF → Solving
  - Instead of Window → Solving
- Uses too much memory
  - Solver + CNF = 140 bytes / AIG node
  - Instead of 8-16 bytes / AIG node
- Decision heuristics
  - Are not aware of the circuit structure
  - Instead of using circuit information

19 BDD Package
- Similar to a SAT solver, a modern BDD package is a well-researched computation engine, which performs
  - Boolean function manipulation
  - Garbage collection
  - Dynamic variable reordering
  - etc.
- The usefulness of BDD packages has been limited since the arrival of AIGs (2000) and efficient SAT solvers (2001)
- However, some applications still rely on BDDs (for example, exact reachability analysis)
  - This motivates building a better BDD package

20 BDD Package (What’s Missing?)
- How can a modern BDD package be improved?
- Make it pointer-independent (!)
  - Leads to reproducible results across different runs / platforms
- Improve CPU cache behavior by using 8 bytes per node
  - Present packages use 16 or 32 bytes (on a 32- or 64-bit computer)
- Improve dynamic variable reordering
  - Currently, it is very slow (~1M BDD nodes takes ~5 min)
- Apply variable reordering more frequently
  - Rather than waiting for the BDD to grow large and then applying slow reordering
- These and other ideas are currently being implemented

21 Minimalistic BDD Data-Structure
- Node representation
  - Node storage (8 bytes per node)
  - Next pointers (4 bytes per node)
  - Unique table (4 bytes per node)
  - Computed table (16 bytes per entry)
- External referencing
  - Two relatively small arrays of integers
- Variable / Level mapping
  - Two relatively small arrays of integers
- Dynamic variable reordering
  - Temporary storage for nodes (8 bytes per node)
  - Temporary reference counters (4 bytes per node)
  - Temporary marks (1 bit per node)

22 BDD Node Representation

    struct Bdd_Node_t  // 64 bits = 8 bytes
    {
        unsigned f0  : 24;  // negative cofactor
        unsigned c0  :  1;  // complemented attribute of negative cofactor
        unsigned f1  : 24;  // positive cofactor
        unsigned lev : 15;  // level
    };

- This node structure is optimized for frequent traversals
- Allows for building BDDs with ~32K variables and ~16M nodes
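The limits follow from the field widths: 15 level bits give 2^15 = 32K levels and 24 index bits give 2^24 = 16M nodes. Since bit-field layout is implementation-defined, and a 24-bit field may not be allowed to straddle a 32-bit boundary, the sketch below packs the same four fields into one 64-bit word by hand. The bit positions and function names are illustrative assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* One node in 64 bits: f0 in bits 0-23, c0 in bit 24, f1 in bits 25-48,
   lev in bits 49-63. */
typedef uint64_t BddWord;

#define BDD_MAX_NODES  (1u << 24)   /* 16M node indices */
#define BDD_MAX_LEVELS (1u << 15)   /* 32K levels */

BddWord bdd_pack(unsigned f0, unsigned c0, unsigned f1, unsigned lev)
{
    assert(f0 < BDD_MAX_NODES && f1 < BDD_MAX_NODES);
    assert(c0 < 2 && lev < BDD_MAX_LEVELS);
    return (BddWord)f0 | ((BddWord)c0 << 24) | ((BddWord)f1 << 25)
         | ((BddWord)lev << 49);
}

unsigned bdd_f0(BddWord w)  { return (unsigned)(w & 0xffffff); }
unsigned bdd_c0(BddWord w)  { return (unsigned)((w >> 24) & 1); }
unsigned bdd_f1(BddWord w)  { return (unsigned)((w >> 25) & 0xffffff); }
unsigned bdd_lev(BddWord w) { return (unsigned)(w >> 49); }
```

Cofactors are stored as indices into the node array rather than pointers, which is what makes the representation pointer-independent.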

23 Custom Memory Management
- Three types of memory managers in ABC
- Fixed-size
  - Allocates/recycles entries of a fixed size
  - Used for AIG nodes
- Flexible-size
  - Allocates (but does not recycle) entries of variable size
  - Used for signal names
- Step-size
  - Steps are powers of 2 (4-8-16-32-etc.) in bytes
  - Used for CNF clauses in the customized version of MiniSat
- Code in package “abc\src\aig\mem”
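A minimal sketch of a fixed-size manager in the spirit of the first type above. It is not the ABC implementation; the `FixedMem` type and the `fm_*` names are hypothetical. Entries are carved out of large chunks and recycled through a free list threaded through the unused entries themselves.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    size_t entrySize;     /* bytes per entry, a multiple of sizeof(void *) */
    size_t chunkEntries;  /* entries allocated per chunk */
    void  *freeList;      /* singly linked list of unused entries */
    void  *chunks;        /* singly linked list of chunks, for cleanup */
} FixedMem;

void fm_init(FixedMem *m, size_t entrySize, size_t chunkEntries)
{
    assert(entrySize > 0 && chunkEntries > 0);
    /* round up so that every entry can hold the free-list link */
    m->entrySize = (entrySize + sizeof(void *) - 1) / sizeof(void *) * sizeof(void *);
    m->chunkEntries = chunkEntries;
    m->freeList = NULL;
    m->chunks = NULL;
}

/* Returns an unused entry, allocating a new chunk when the free list is empty. */
void *fm_fetch(FixedMem *m)
{
    void *e;
    if (m->freeList == NULL) {
        /* one pointer-sized header links the chunks together */
        char *chunk = (char *)malloc(sizeof(void *) + m->entrySize * m->chunkEntries);
        size_t i;
        assert(chunk != NULL);
        *(void **)chunk = m->chunks;
        m->chunks = chunk;
        for (i = 0; i < m->chunkEntries; i++) {
            void *entry = chunk + sizeof(void *) + i * m->entrySize;
            *(void **)entry = m->freeList;
            m->freeList = entry;
        }
    }
    e = m->freeList;
    m->freeList = *(void **)e;
    return e;
}

/* Returns an entry to the manager in constant time. */
void fm_recycle(FixedMem *m, void *e)
{
    *(void **)e = m->freeList;
    m->freeList = e;
}

/* Releases all chunks at once. */
void fm_free(FixedMem *m)
{
    while (m->chunks != NULL) {
        void *next = *(void **)m->chunks;
        free(m->chunks);
        m->chunks = next;
    }
    m->freeList = NULL;
}
```

Entries are aligned to the pointer size only, which is enough for structures made of integers and pointers such as the AIG nodes shown earlier.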

24 Locality of Computation
- To improve speed
  - Use less memory
  - Make transformations local
  - Use contiguous data-structures
- Case study: BDDs vs. truth tables (TTs)
  - In the past: “BDDs are present-day truth tables”
  - These days: “Truth tables are present-day BDDs”
- Advantages of TTs
  - Computation is more local
  - Memory usage is predictable
  - For functions up to 16 vars, TTs lead to faster computation
  - ISOP, DSD, matching, decomposition, etc.
- Limitations of TTs
  - Do not work for more than 16 variables
  - Some operations are faster using BDDs, even for functions with 10 variables
  - E.g. cofactor satisfy counting
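To show why truth-table computation is local, here is a sketch for functions of up to 6 variables stored in one 64-bit word (larger tables would use arrays of such words). The function names are hypothetical. Cofactoring and a support check take a few word operations on contiguous data, with no pointer chasing.

```c
#include <assert.h>
#include <stdint.h>

/* Truth tables of the elementary variables x0..x5 over 64 minterms:
   bit m of tt_var[v] is the value of x_v in minterm m. */
static const uint64_t tt_var[6] = {
    0xAAAAAAAAAAAAAAAAULL, 0xCCCCCCCCCCCCCCCCULL, 0xF0F0F0F0F0F0F0F0ULL,
    0xFF00FF00FF00FF00ULL, 0xFFFF0000FFFF0000ULL, 0xFFFFFFFF00000000ULL
};

/* Positive cofactor: copy the half where x_v = 1 over the half where x_v = 0. */
uint64_t tt_cofactor1(uint64_t t, int v)
{
    assert(v >= 0 && v < 6);
    t &= tt_var[v];
    return t | (t >> (1 << v));
}

/* Negative cofactor: copy the half where x_v = 0 over the half where x_v = 1. */
uint64_t tt_cofactor0(uint64_t t, int v)
{
    assert(v >= 0 && v < 6);
    t &= ~tt_var[v];
    return t | (t << (1 << v));
}

/* A function depends on x_v exactly when its two cofactors differ. */
int tt_depends_on(uint64_t t, int v)
{
    return tt_cofactor0(t, v) != tt_cofactor1(t, v);
}
```

Boolean operations are plain bitwise operators on the words, for example `tt_var[0] & tt_var[1]` for the AND of the first two variables.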

25 Conclusion
- Lessons learned while developing ABC
- Topics considered
  - Network traversal
  - AIG representation
  - SAT solving
  - Memory management
  - Locality of computation
- Locality of computation is important
  - Allows for efficient control of the resources
  - Leads to scalability and parallelism
