Good Programming Practices for Building Less Memory-Intensive EDA Applications Alan Mishchenko University of California, Berkeley
2 Outline
- Introduction
  - What is special about programming for EDA
  - Why much of industrial code is not efficient
  - Why saving memory also saves runtime
  - When to optimize for memory
  - Simplicity wins
- Suggestions for improvement
  - Design custom data-structures
  - Store objects in a topological order
  - Make fanout representation optional
  - Use 4-byte integers instead of 8-byte pointers
  - Never use linked lists
- Conclusions
3 EDA Programming
- Programming for EDA is different from
  - programming for the web
  - programming databases, etc.
- EDA deals with
  - Very complex computations (NP-hard problems)
  - Very large datasets (designs with 100M+ objects)
- Programming for EDA requires knowledge of algorithms/data-structures and careful hand-crafting of efficient solutions
- Finding an efficient solution is often the result of laborious and time-consuming trial and error
4 Why Industrial Code Is Often Bad
- Heritage code
  - Designed long ago by somebody who did not know or did not care or both
- Overdesigned code
  - Designed for the most general case, which rarely or never happens
- Underdesigned code
  - Designed for small netlists, while the size of a typical netlist doubles every few years, making scalability an elusive target
5 Less Memory = Less Runtime
- Although not true in general, in most EDA applications dealing with large datasets, a smaller memory footprint results in faster code
- Because most EDA computations are memory intensive, CPU cache misses largely determine their runtime
- Keep this in mind when designing new data-structures
6 When to Optimize Memory?
- Optimize memory if we store many similar entries (nodes in a graph, timing objects, placement locations, etc.)
- For example, when designing a netlist, which typically stores millions of individual objects, the object data-structure is very important
- However, if only a few instances of a netlist are used at the same time, the netlist data-structure is less important
7 Design Custom Data-Structures
- Figure out what is needed in each application and design a custom data-structure
  - The lowest possible memory usage
  - The fastest possible runtime
  - Simpler and cleaner code
- Often good data-structures can be reused elsewhere
- Translation to and from a custom data-structure rarely takes more than 3% of runtime
- Example: In a typical synthesis/mapping application, it is enough to have ‘node’ and there is no need for ‘net’, ‘edge’, ‘pin’, etc.
8 Store Objects In a Topo Order
- Topological order
  - An order in which the fanins (incoming edges) of a node precede the node itself
- Storing objects in topological order makes it unnecessary to recompute the order when performing local or global changes
  - Saves runtime
- Using topological order reduces CPU cache misses, which occur when computation jumps all over memory
  - Saves runtime
- It is best to have a specialized procedure or command to establish a topo order of the network (graph, etc.)
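As a minimal sketch of why a stored topological order helps, assume nodes live in one array and each fanin is the array index (ID) of an earlier node. The type `TopoNode` and the functions `IsTopoOrder` and `ComputeLevels` are hypothetical names chosen for this illustration. Under that assumption, logic levels can be computed in a single forward sweep over the array, with no DFS and no sorting:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal node: fanins are IDs (array indices) of earlier nodes. */
typedef struct {
    int32_t        nFanins;
    const int32_t *pFanins;
} TopoNode;

/* Returns 1 if every fanin ID is valid and smaller than the ID of its node. */
int IsTopoOrder(const TopoNode *nodes, int32_t nNodes)
{
    for (int32_t i = 0; i < nNodes; i++)
        for (int32_t k = 0; k < nodes[i].nFanins; k++)
            if (nodes[i].pFanins[k] < 0 || nodes[i].pFanins[k] >= i)
                return 0;
    return 1;
}

/* Computes the level of each node (0 for nodes without fanins) in one
   forward pass; memory is read and written in increasing address order. */
void ComputeLevels(const TopoNode *nodes, int32_t nNodes, int32_t *level)
{
    assert(IsTopoOrder(nodes, nNodes));
    for (int32_t i = 0; i < nNodes; i++) {
        int32_t lev = 0;
        for (int32_t k = 0; k < nodes[i].nFanins; k++) {
            int32_t f = nodes[i].pFanins[k];  /* already computed: f < i */
            if (level[f] + 1 > lev)
                lev = level[f] + 1;
        }
        level[i] = lev;
    }
}
```

The check in `IsTopoOrder` is the kind of test a dedicated "establish topo order" procedure could run after structural changes.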
9 Fanout Representation
- Traditionally, each object (node) in a netlist has both fanins (incoming edges) and fanouts (outgoing edges)
- In most applications, fanins alone are enough
  - Reduces memory ~2x
  - Reduces runtime
- Fanouts can be computed on demand
- Exercise: Implement computation of required times of all nodes in a combinational netlist without fanouts
- In many cases, it is enough to have “static fanout”
  - If the netlist is fixed, fanouts are never added/removed
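One possible approach to the exercise, sketched under stated assumptions: nodes are stored in topological order, fanins are integer IDs, each node has a non-negative delay, and any node without fanouts is treated as an output with required time `tReq`. The names `TimNode` and `ComputeRequired` are hypothetical. Traversing the array backwards, each node pushes its requirement onto its fanins, so no fanout lists are needed:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical node with fanin IDs and a non-negative delay. */
typedef struct {
    int32_t        nFanins;
    const int32_t *pFanins;
    int32_t        delay;
} TimNode;

/* Computes required times using fanins only. Assumes topological order
   (fanin IDs are smaller than the node's own ID). */
void ComputeRequired(const TimNode *nodes, int32_t nNodes,
                     int32_t tReq, int32_t *required)
{
    /* Start from the output requirement everywhere; nodes with fanouts
       are tightened below. */
    for (int32_t i = 0; i < nNodes; i++)
        required[i] = tReq;
    /* Reverse topological sweep: when node i is visited, all of its
       fanouts (which have larger IDs) have already been processed. */
    for (int32_t i = nNodes - 1; i >= 0; i--) {
        assert(nodes[i].delay >= 0);
        int32_t t = required[i] - nodes[i].delay;
        for (int32_t k = 0; k < nodes[i].nFanins; k++) {
            int32_t f = nodes[i].pFanins[k];
            if (t < required[f])
                required[f] = t;
        }
    }
}
```

The same backward sweep pattern applies to other computations that are usually written with fanouts, as long as the result at a node only depends on values at its fanouts.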
10 Use Integers Instead of Pointers
- In the old days, an integer (int) and a pointer (void *) used the same amount of memory (4 bytes)
- In recent years, most EDA companies and their customers have switched to 64 bits
  - One pointer now takes 8 bytes!
- However, most of the code uses a lot of pointers
  - This leads to a 2x memory increase for no reason
- Suggestion: Design your code to store attributes of objects as integers, rather than as pointers
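To make the size difference concrete, here is a small sketch comparing two layouts of the same object. The field names (`pCopy`, `pEquiv`, `pNext` and their `i...` counterparts) are invented for illustration, and the ID-based version assumes all objects live in one array so that an ID can be turned back into a pointer when needed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pointer-based attributes: each field costs sizeof(void *) bytes,
   which is 8 on typical 64-bit platforms. */
typedef struct ObjPtr {
    struct ObjPtr *pCopy;
    struct ObjPtr *pEquiv;
    struct ObjPtr *pNext;
} ObjPtr;

/* ID-based attributes: each field costs 4 bytes on any platform.
   An ID of -1 plays the role of a NULL pointer. */
typedef struct {
    int32_t iCopy;
    int32_t iEquiv;
    int32_t iNext;
} ObjInt;

/* Converts an ID back into a pointer, given the array holding all objects. */
static inline ObjInt *ObjFromId(ObjInt *base, int32_t id)
{
    assert(id >= -1);
    return id < 0 ? NULL : base + id;
}
```

On a typical 64-bit build, `sizeof(ObjPtr)` is 24 bytes while `sizeof(ObjInt)` is 12 bytes. The price is one extra addition in `ObjFromId` and a limit of about two billion objects per array.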
11 Avoiding Pointers (example)
- Node points to its fanins
- Fanins can be integer IDs, instead of pointers
- Instead of a linked list of node pointers, use an array of integer IDs
  - A linked list uses at least 6x more memory
  - Iterating through a linked list is slower
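The 6x figure can be reproduced with a short sketch, assuming a doubly linked list whose links hold a node pointer plus previous/next pointers, on a platform with 8-byte pointers. The names `FaninLink`, `LinkedListBytes`, `IdArrayBytes`, and `MaxFaninLevel` are hypothetical, and allocator overhead per link (which makes the list even larger) is ignored:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct Obj;  /* opaque node type, used here only through pointers */

/* One link of a doubly linked fanin list: three pointers per fanin. */
typedef struct FaninLink {
    struct Obj       *pFanin;
    struct FaninLink *pPrev;
    struct FaninLink *pNext;
} FaninLink;

/* Bytes needed to store n fanins as list links (no allocator overhead). */
size_t LinkedListBytes(size_t n) { return n * sizeof(FaninLink); }

/* Bytes needed to store n fanins as an array of 4-byte integer IDs. */
size_t IdArrayBytes(size_t n) { return n * sizeof(int32_t); }

/* Iterating over an ID array touches consecutive memory locations.
   Returns the largest level among the fanins, or -1 if there are none. */
int32_t MaxFaninLevel(const int32_t *pFanins, int32_t nFanins,
                      const int32_t *level)
{
    assert(nFanins >= 0);
    int32_t best = -1;
    for (int32_t k = 0; k < nFanins; k++)
        if (level[pFanins[k]] > best)
            best = level[pFanins[k]];
    return best;
}
```

With 8-byte pointers each link takes 24 bytes against 4 bytes for an ID, which is where the factor of 6 comes from.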
12 Integer IDs for Indexing Attributes
- Each node in the netlist can have an integer ID
- The node structure can be as simple as possible
    struct Node {
        int ID;
        int nFanins;
        int * pFanins;
    };
- Any attribute of the node can be represented as an entry in an array, with the node’s ID used as an index
    Vec Type;
    Vec Level;
    Vec Slack;
- Attributes can be allocated/freed on demand, which helps control memory usage
- Light-weight basic data-structure makes often-used computations (such as traversals) very fast
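A minimal sketch of an on-demand attribute, assuming `Vec` is a plain resizable array of integers (the slide does not define it, so `VecInt` below is a hypothetical stand-in) and that a node's ID equals its position in the node array. The example attribute is a reference (fanout) count, which is allocated when needed and freed afterwards:

```c
#include <assert.h>
#include <stdlib.h>

/* Node structure as on the slide. */
struct Node {
    int  ID;
    int  nFanins;
    int *pFanins;
};

/* Hypothetical stand-in for the slide's Vec: one int per node ID. */
typedef struct {
    int  nSize;
    int *pArray;
} VecInt;

/* Allocates nSize entries set to fill; returns 1 on success, 0 on failure. */
int VecIntAlloc(VecInt *v, int nSize, int fill)
{
    v->pArray = nSize > 0 ? malloc(sizeof(int) * (size_t)nSize) : NULL;
    v->nSize  = v->pArray ? nSize : 0;
    for (int i = 0; i < v->nSize; i++)
        v->pArray[i] = fill;
    return v->nSize == nSize;
}

/* Releases the attribute when it is no longer needed. */
void VecIntFree(VecInt *v)
{
    free(v->pArray);
    v->pArray = NULL;
    v->nSize  = 0;
}

/* Builds a reference-count attribute on demand: refs[ID] is the number
   of times the node with that ID is used as a fanin. */
int ComputeRefs(const struct Node *nodes, int nNodes, VecInt *refs)
{
    if (!VecIntAlloc(refs, nNodes, 0))
        return 0;
    for (int i = 0; i < nNodes; i++) {
        assert(nodes[i].ID == i);
        for (int k = 0; k < nodes[i].nFanins; k++) {
            int f = nodes[i].pFanins[k];
            assert(f >= 0 && f < nNodes);
            refs->pArray[f]++;
        }
    }
    return 1;
}
```

The node structure itself never grows: a pass that needs reference counts calls `ComputeRefs`, uses the vector, and calls `VecIntFree` when done.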
13 Avoid Linked Lists
- Each link, in addition to the user’s data, has previous and next fields
  - Potentially 3x increase in memory usage
- Most linked lists use pointers
  - Potentially 2x increase in memory usage
- Other drawbacks
  - Allocating numerous links leads to memory fragmentation
- Most data-structures can be efficiently implemented without linked lists
14 Simplicity Wins
- Whenever possible, keep data-structures simple and light-weight
- It is better to have on-demand attributes associated with objects, rather than an overly complex object data-structure
15 Case Study: Storage for Many Similar Entries
- Same-size entries (for example, AIG or BDD nodes) are best stored in an array
  - Node’s index is the place in the array where the node is stored
- Different-size entries (for example, nodes in a logic network) are best stored in a custom memory manager
  - Manager allocates memory in pages (e.g. 1MB / page)
  - Each page can store entries of different size
  - Each entry is assigned an integer number (called ID)
  - There is a vector mapping IDs into pointers to memory for each object
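The paged manager described above could look roughly like the following sketch. All names (`MemMan`, `MemManAlloc`, `MemManPtr`, `MemManFree`, `PAGE_BYTES`) are hypothetical, entries are rounded up to 8 bytes for alignment (an assumption, not something the slide specifies), and individual entries are never freed; the whole manager is released at once:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES ((size_t)1 << 20)  /* 1MB per page, as on the slide */

typedef struct {
    char  **pPages;   /* all allocated pages                     */
    int     nPages;
    size_t  nUsed;    /* bytes used in the current (last) page   */
    void  **pId2Ptr;  /* vector mapping entry ID to its memory   */
    int     nIds;
    int     nIdsCap;
} MemMan;             /* start with: MemMan m = {0};             */

/* Reserves nBytes for a new entry and returns its ID, or -1 on failure. */
int MemManAlloc(MemMan *m, size_t nBytes)
{
    assert(m != NULL);
    nBytes = (nBytes + 7) & ~(size_t)7;        /* keep entries 8-byte aligned */
    if (nBytes == 0 || nBytes > PAGE_BYTES)
        return -1;
    if (m->nPages == 0 || m->nUsed + nBytes > PAGE_BYTES) {
        /* Current page is full (or absent): start a new one. */
        char **pp = realloc(m->pPages, sizeof(char *) * (size_t)(m->nPages + 1));
        if (!pp) return -1;
        m->pPages = pp;
        char *page = malloc(PAGE_BYTES);
        if (!page) return -1;
        m->pPages[m->nPages++] = page;
        m->nUsed = 0;
    }
    if (m->nIds == m->nIdsCap) {
        /* Grow the ID-to-pointer vector geometrically. */
        int cap = m->nIdsCap ? 2 * m->nIdsCap : 1024;
        void **pv = realloc(m->pId2Ptr, sizeof(void *) * (size_t)cap);
        if (!pv) return -1;
        m->pId2Ptr = pv;
        m->nIdsCap = cap;
    }
    m->pId2Ptr[m->nIds] = m->pPages[m->nPages - 1] + m->nUsed;
    m->nUsed += nBytes;
    return m->nIds++;
}

/* Returns the memory of the entry with the given ID, or NULL if invalid. */
void *MemManPtr(const MemMan *m, int id)
{
    return (id >= 0 && id < m->nIds) ? m->pId2Ptr[id] : NULL;
}

/* Releases all pages and the ID vector in one go. */
void MemManFree(MemMan *m)
{
    for (int i = 0; i < m->nPages; i++)
        free(m->pPages[i]);
    free(m->pPages);
    free(m->pId2Ptr);
    memset(m, 0, sizeof(*m));
}
```

Other data-structures then refer to entries by their 4-byte ID, and only code that touches an entry's contents goes through `MemManPtr`.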
16 Conclusion
- Reviewed several reasons for inefficient memory usage in industrial code
- Offered several suggestions and good coding practices
- Vowed to think carefully about memory when designing new data-structures