1 Recursive Data Structure Profiling
Easwaran Raman, David I. August
Princeton University
2 Motivation
- Huge processor-memory performance gap
  - Latency > 100 cycles
  - Significant fraction of memory operations in typical programs
- In many applications, Recursive Data Structures (RDS) constitute a large fraction of memory usage
[Figure: plot with "Year" on the horizontal axis]
3 Motivation
- Techniques to minimize the performance impact of this gap
  - Caching, prefetching, out-of-order (OoO) execution
- Not very successful for RDS
  - Many RDS properties are difficult to determine statically
  - Accesses are irregular and usually lie on the critical path of execution
  - Short loop body prevents efficient OoO execution
  - Non-contiguous layout results in irregular access patterns

Traversal code:
    while (valid(node)) {
        // do something with node->data
        node = next(node);
    }

[Figure: an RDS layout example, with nodes at addresses 0x1000, 0x2000, 0x3000, 0x4000]
4 Motivation
- Linearization [Clark76, Luk99]
  - The speculation recovery cost outweighs the benefits if the next-pointer field gets overwritten frequently
- Information on the dynamic behavior of the entire RDS is important

Linearized traversal (pseudocode):
    index = 0;
    head = pos[index];
    while (head) {
        foo(head);
        head = pos[++index];
        check(head);
    }

[Figure: a list reached from head, with array pos; the placement of the nodes in the figure corresponds to their placement in memory]
5 RDS Profile
- RDS profiling gives a ‘logical’ understanding of runtime behavior
  - ‘Application creates 100 trees’ instead of ‘application allocates 2 MB in heap’
  - ‘Linked list traversed 10 times’ instead of ‘Address 0x accessed 200 times’
- Profile for linearization: the next-pointer field in list L is modified n times
6 RDS Discovery
- Assign a unique id to each value returned by malloc and create a node labeled with that id
- Connect two nodes by a directed edge if both the address and the value of a store have valid ids

C function for creating a tree:
    node *tree_create(){
        node *n = (node *)malloc(…);
        …
        n->left = tree_create(…);
        n->right = tree_create(…);
        return n;
    }

Execution trace in (pseudo) assembly:
    call malloc                   ; id = 1
    mov r10 = r8
    …
    call tree_create
    …
    call malloc                   ; id = 2
    …
    mov r11 = r8
    store r10[offset1] = r11      ; create 1->2
    call tree_create
    …
    call malloc                   ; id = 3
    …
    mov r12 = r8
    store r10[offset2] = r12      ; create 1->3

[Figure: Dynamic Shape Graph (DSG) with nodes 1, 2, 3]
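The two discovery rules on this slide can be sketched in C. This is a minimal illustration, not the profiler's actual code: the names `dsg_on_malloc`, `dsg_on_store`, `dsg_lookup`, `dsg_has_edge`, and `dsg_edge_count` are hypothetical, the fixed-size tables and linear lookup are simplifications, and `free` and pointer overwrites are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 64
#define MAX_EDGES 256

/* One heap block observed by the (hypothetical) profiler. */
typedef struct { uintptr_t base; size_t size; } Block;
typedef struct { int from, to; } Edge;

static Block blocks[MAX_NODES + 1]; /* index == dynamic id; ids start at 1 */
static int   num_blocks = 0;
static Edge  edges[MAX_EDGES];
static int   num_edges = 0;

/* Rule 1: every value returned by malloc gets a fresh id (0 if table is full). */
int dsg_on_malloc(uintptr_t base, size_t size) {
    if (num_blocks >= MAX_NODES) return 0;
    num_blocks++;
    blocks[num_blocks].base = base;
    blocks[num_blocks].size = size;
    return num_blocks;
}

/* Id of the block containing addr, or 0 if addr is not in any known block. */
int dsg_lookup(uintptr_t addr) {
    for (int id = 1; id <= num_blocks; id++)
        if (addr >= blocks[id].base && addr - blocks[id].base < blocks[id].size)
            return id;
    return 0;
}

/* 1 if the edge from->to has been recorded, else 0. */
int dsg_has_edge(int from, int to) {
    for (int i = 0; i < num_edges; i++)
        if (edges[i].from == from && edges[i].to == to) return 1;
    return 0;
}

/* Rule 2: a store adds an edge only when both the target address and the
 * stored value map to valid ids. Returns 1 if the edge exists afterwards. */
int dsg_on_store(uintptr_t addr, uintptr_t value) {
    int from = dsg_lookup(addr);
    int to   = dsg_lookup(value);
    if (from == 0 || to == 0) return 0;   /* not a heap-to-heap pointer store */
    if (dsg_has_edge(from, to)) return 1;
    if (num_edges >= MAX_EDGES) return 0;
    edges[num_edges].from = from;
    edges[num_edges].to   = to;
    num_edges++;
    return 1;
}

int dsg_edge_count(void) { return num_edges; }
```

Replaying the slide's trace with made-up block addresses (three `dsg_on_malloc` calls followed by two stores into the first block) would record the edges 1->2 and 1->3.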
7 RDS Discovery
- Multiple RDS instances can be connected together in the DSG!
- To separate them, we use properties of the static code
- Use another graph called the Static Shape Graph (SSG)

Example:
    array = malloc(…);
    for (i=…)
        array[i] = create_tree(…);
    …

[Figure: DSG for the example]
8 RDS Discovery
- For every static call to malloc, create a node with a unique id in the Static Shape Graph (SSG)
- If a store creates an edge, connect the corresponding static nodes
- Check for strongly connected components (SCCs) in the SSG
- Connect two dynamic nodes only if their corresponding static nodes are in the same SCC

Execution trace in (pseudo) assembly:
    call malloc                     ; id = 1
    mov r20 = r8
    …call malloc                    ; id = 2
    …mov r10 = r8
    ……
    …call tree_create
    …… call malloc                  ; id = 3
    …… mov r11 = r8
    …store r10[offset1] = r11       ; create 2->3
    …call tree_create
    …… call malloc                  ; id = 4
    …… mov r12 = r8
    …store r10[offset2] = r12       ; create 2->4
    store r20[0] = r10              ; create 1->2

[Figure: SSG with nodes A and T, and the resulting DSG]
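The SCC test above can be sketched as follows. The sketch makes several assumptions: static malloc sites are numbered 0 to `MAX_SITES - 1`, all static edges are recorded before the check is made, and SCC membership is decided by mutual reachability over a Warshall-style transitive closure rather than by a dedicated SCC algorithm such as Tarjan's. All function names are hypothetical.

```c
#include <string.h>

#define MAX_SITES 16

/* reach[i][j] != 0 means static site j is reachable from static site i. */
static unsigned char reach[MAX_SITES][MAX_SITES];

void ssg_reset(void) { memset(reach, 0, sizeof reach); }

/* Record a static edge: a store linked a block from from_site to one from to_site. */
void ssg_add_edge(int from_site, int to_site) {
    if (from_site < 0 || from_site >= MAX_SITES) return;
    if (to_site < 0 || to_site >= MAX_SITES) return;
    reach[from_site][to_site] = 1;
}

/* Transitive closure (Warshall): call once after all edges are recorded. */
void ssg_close(void) {
    for (int k = 0; k < MAX_SITES; k++)
        for (int i = 0; i < MAX_SITES; i++)
            for (int j = 0; j < MAX_SITES; j++)
                if (reach[i][k] && reach[k][j]) reach[i][j] = 1;
}

/* Two sites are in the same SCC when each reaches the other.
 * A site is trivially in the same SCC as itself. */
int ssg_same_scc(int a, int b) {
    if (a < 0 || a >= MAX_SITES || b < 0 || b >= MAX_SITES) return 0;
    if (a == b) return 1;
    return reach[a][b] && reach[b][a];
}
```

For the slide's example, with the array site as 0 and the tree-node site as 1, the static edges are 0->1 and 1->1. `ssg_same_scc(0, 1)` is then false, so the dynamic edge 1->2 is dropped and each tree stays a separate RDS instance, while tree-internal edges such as 2->3 are kept.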
9 Experimental setup
- Uses Pin, a dynamic instrumentation tool for Itanium
- Mappings between address ranges and dynamic ids are stored in an AVL tree
  - The most recent mapping is cached
- Benchmarks: a mix from SPEC, Olden, and other pointer-intensive applications
- Dynamic instruction count varies from a few million (ks) to over 300 billion (mesa)
- All experiments run on a 900 MHz Itanium 2 with 2 GB RAM running RH 7.1
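The address-to-id lookup with a most-recent-mapping cache might look like the sketch below. To stay short it swaps the AVL tree named on the slide for a sorted array with binary search; only the lookup-plus-cache idea is illustrated. The names `range_insert`, `range_lookup`, and `range_cache_hits` are hypothetical, and ranges are assumed not to overlap.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 128

typedef struct { uintptr_t base; size_t size; int id; } Range;

static Range ranges[MAX_RANGES]; /* sorted by base; assumed non-overlapping */
static int   num_ranges = 0;
static int   cached = -1;        /* index of the most recently hit range */
static long  cache_hits = 0;

static int contains(const Range *r, uintptr_t addr) {
    return addr >= r->base && addr - r->base < r->size;
}

/* Insert a mapping, keeping the array sorted by base. Returns 1 on success. */
int range_insert(uintptr_t base, size_t size, int id) {
    if (num_ranges >= MAX_RANGES || size == 0) return 0;
    int pos = num_ranges;
    while (pos > 0 && ranges[pos - 1].base > base) {
        ranges[pos] = ranges[pos - 1];
        pos--;
    }
    ranges[pos].base = base;
    ranges[pos].size = size;
    ranges[pos].id   = id;
    num_ranges++;
    cached = -1; /* indices may have shifted, so drop the cached entry */
    return 1;
}

/* Id of the range containing addr, or 0 if there is none. */
int range_lookup(uintptr_t addr) {
    /* Fast path: consecutive accesses often fall in the same block. */
    if (cached >= 0 && contains(&ranges[cached], addr)) {
        cache_hits++;
        return ranges[cached].id;
    }
    int lo = 0, hi = num_ranges - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (contains(&ranges[mid], addr)) {
            cached = mid;
            return ranges[mid].id;
        }
        if (addr < ranges[mid].base) hi = mid - 1;
        else lo = mid + 1;
    }
    return 0;
}

long range_cache_hits(void) { return cache_hits; }
```

A balanced tree such as the AVL tree on the slide keeps insertions logarithmic, which matters when blocks are allocated and freed continuously. The sorted array here pays linear time per insert in exchange for brevity.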
10 Profiler Performance
- Profile: RDS size, lifetime, access count
- Memory: < 16 MB for all but 3 applications
- Baseline: execution using Pin (~10 times slower than native)
11 RDS usage statistics
- SCCs in the static shape graph (RDS types)
  - Usually a few (< 5) per benchmark, with a maximum of 31 in parser
- Number of RDS instances (connected components in the DSG)
  - Exhibits a wide range (1 in mcf to around a million in parser)
  - Instances tend to be long-lived if the program creates only a few of them
- Sizes of RDS instances
  - Vary from a single-node self-loop (parser) to a few hundred thousand nodes (mcf, parser)
- Number of pointer-chasing loads
  - Significant in many benchmarks
- Applications show vast diversity in RDS usage
  - A good reason for profiling them!
12 Temporal distribution
13 Cumulative distribution of RDS lifetimes
14 RDS Stability
- Stability of an RDS: a notion of how 'array-like' an RDS is
- Stability index: an attempt to quantify this notion
  - Identify the time instances (alteration points) when changes occur to the RDS structure (by stores that replace existing pointers)
  - Count the traversals between successive alteration points
  - Stability index = number of intervals that account for ‘most’ of the traversals
- Lower index means higher stability
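One possible reading of the stability index, sketched in C. The slide leaves ‘most’ undefined, so this sketch treats it as a caller-supplied percentage (`percent`, an assumption) and counts the fewest intervals, taken in decreasing order of traversal count, needed to reach that share of all traversals. The function name `stability_index` and the greedy selection are illustrative choices, not the authors' stated method.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: larger traversal counts sort first. */
static int cmp_desc(const void *a, const void *b) {
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x < y) - (x > y);
}

/* traversals[i] = number of traversals between alteration points i and i+1.
 * Returns the smallest number of intervals whose traversals add up to at
 * least percent% of all traversals; 0 if there are no traversals; -1 on
 * invalid percent or allocation failure. */
int stability_index(const unsigned long *traversals, size_t n, unsigned percent) {
    if (percent == 0 || percent > 100) return -1;
    if (n == 0) return 0;

    unsigned long long total = 0;
    for (size_t i = 0; i < n; i++) total += traversals[i];
    if (total == 0) return 0;

    unsigned long *sorted = malloc(n * sizeof *sorted);
    if (!sorted) return -1;
    memcpy(sorted, traversals, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_desc);

    unsigned long long covered = 0;
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        covered += sorted[i];
        count++;
        /* Integer form of covered / total >= percent / 100. */
        if (covered * 100 >= total * percent) break;
    }
    free(sorted);
    return count;
}
```

Under this reading, an RDS whose traversals nearly all fall inside one interval gets index 1 (very array-like), while one that is altered between every few traversals needs many intervals to cover the same share and gets a high index.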
15 Cumulative distribution of stability index
16 Conclusion
- Aggressive data-structure-level optimization techniques for RDS need profile information for improved performance
- RDS profiling gives a better understanding of the runtime behavior of RDS
- RDS usage varies widely across benchmarks
17 Extra Slides
18 RDS Profiling: Definitions
- RDS type: the abstract form of the logical data structure that is manipulated by the program
  - Examples: list, binary tree, graph, etc.
  - Can be mutually recursive (nodes point to their incident edges and vice versa to form a graph)
- RDS instance: a concrete realization of the RDS type
  - Examples: the tree created in function foo, the list pointed to by the first entry of the hash table