Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements
Gautam Chakrabarti and Fred Chow
PathScale, LLC.
Open64 Workshop
Outline
  Motivation
  Types of structure layout optimizations
  Criteria for structure layout optimizations
  Implementation details
  Performance results
  Future work
  Conclusion
Motivation
  Poor data locality in many applications
  High data cache miss rates
  Growing gap between processor and memory speeds
Our Approach
  Change layout of data structures
  Requires whole-program optimization
  Use Inter-Procedural Analysis and Optimizations (IPA)
Our Aim
  Make applications more cache-friendly
IPA
  Summarization
  Analysis
  Optimization
Types of Structure Layout Optimizations
  Structure splitting
    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
      struct struct_A * next;
    };
  Structure peeling
    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
    };
Structure Splitting Example
  Original:
    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
      struct struct_A * next;
    };
  After splitting:
    struct new_struct_A {
      double d1;
      int i;
      long long l;
      struct new_struct_A * next;
      struct cold_sub_struct_A * p;
    };
    struct cold_sub_struct_A {
      double d2;
      float f;
      char c;
    };
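A minimal sketch of how code might look against the split layout, assuming d1, i and l are the hot fields as the slide's example suggests. The helpers push_node, sum_hot and free_list are hypothetical names introduced for illustration; they are not part of Open64 or of the slides.

```c
#include <stdlib.h>

/* Cold fields, moved out of the hot node (layout taken from the slide). */
struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

/* Hot fields, the list link, and a pointer to the cold part. */
struct new_struct_A {
    double d1;
    int i;
    long long l;
    struct new_struct_A *next;
    struct cold_sub_struct_A *p;
};

/* Hypothetical helper: allocate a hot node plus its zeroed cold part and
 * push it onto the front of the list. Returns NULL on allocation failure. */
struct new_struct_A *push_node(struct new_struct_A *head,
                               double d1, int i, long long l)
{
    struct new_struct_A *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->p = calloc(1, sizeof *n->p);
    if (n->p == NULL) {
        free(n);
        return NULL;
    }
    n->d1 = d1;
    n->i = i;
    n->l = l;
    n->next = head;
    return n;
}

/* Hot traversal: reads only hot fields and never dereferences p,
 * so the cold fields are not pulled into the cache by this loop. */
double sum_hot(const struct new_struct_A *head)
{
    double s = 0.0;
    for (; head != NULL; head = head->next)
        s += head->d1 + head->i + (double)head->l;
    return s;
}

/* Release every hot node together with its cold part. */
void free_list(struct new_struct_A *head)
{
    while (head != NULL) {
        struct new_struct_A *next = head->next;
        free(head->p);
        free(head);
        head = next;
    }
}
```

A cold field is still reachable as node->p->d2, at the cost of one extra indirection on the rarely executed paths.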
Structure Peeling Example
  Original:
    struct struct_A {
      double d1;
      double d2;
      int i;
      float f;
      long long l;
      char c;
    };
  After peeling:
    struct new_struct_A {
      double d1;
      int i;
      long long l;
    };
    struct cold_sub_struct_A {
      double d2;
      float f;
      char c;
    };
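A minimal sketch of what peeling means for an array of struct_A, assuming element k of the original array becomes element k of two parallel arrays. The functions sum_d1 and set_cold_d2 are hypothetical and only illustrate the access pattern.

```c
#include <stddef.h>

/* Peeled layouts taken from the slide. */
struct new_struct_A {
    double d1;
    int i;
    long long l;
};

struct cold_sub_struct_A {
    double d2;
    float f;
    char c;
};

/* Hot loop: walks only the hot array, so each cache line it loads
 * carries hot fields only. */
double sum_d1(const struct new_struct_A *hot, size_t n)
{
    double s = 0.0;
    for (size_t k = 0; k < n; k++)
        s += hot[k].d1;
    return s;
}

/* Cold access: element k of the original array is found at the same
 * index k in the cold array, so no link pointer is needed. */
void set_cold_d2(struct cold_sub_struct_A *cold, size_t k, double d2)
{
    cold[k].d2 = d2;
}
```

Unlike splitting, no pointer from the hot part to the cold part is stored, which is why the slide's peeled struct has no extra field.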
Criteria for structure layout optimizations
  Legality Analysis
    Type cast
    Address of a field is taken
    Escaped types
    Parameter types
    Full visibility to IPA
    Alignment restrictions
  Profitability Analysis
    Hotness
    Affinity
    Field accesses at loop level
    Size
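The legality items are easier to picture with concrete source patterns. The sketch below shows three patterns that plausibly correspond to "type cast", "address of a field is taken" and "escaped types"; this mapping is an interpretation of the slide, not a statement of the exact checks Open64 performs, and all function names are hypothetical.

```c
#include <stdio.h>

struct struct_A {
    double d1;
    double d2;
    int i;
    float f;
    long long l;
    char c;
};

/* Type cast: the code assumes d1 is the first member. Valid C today,
 * but wrong if the compiler moved d1 elsewhere. */
double read_first_via_cast(struct struct_A *a)
{
    return *(double *)a;
}

/* Address of a field is taken: the returned pointer can be used for
 * arithmetic or stored anywhere, so later accesses cannot be rewritten
 * reliably. */
int *address_of_i(struct struct_A *a)
{
    return &a->i;
}

/* Escaped type: the object's raw bytes are handed to a library routine
 * that the whole-program analysis cannot see into. */
size_t dump_raw(const struct struct_A *a, FILE *out)
{
    return fwrite(a, sizeof *a, 1, out);
}
```

In each case the program depends on the declared layout in a way that is invisible at the level of individual field accesses, which is a reasonable intuition for why such types would be excluded.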
Implementation Details
  Step 1: Type information summarization (IPL)
  Step 2: Symbol table merging (IPA)
  Step 3: Legality and profitability analysis (IPA analysis)
  Step 4: Transforming the program (IPA optimization)
Implementation Details: Type information summarization
  Information summarization in IPL
  Framework for computing static profiles using heuristics
  New TY flag TY_NO_SPLIT
  SUMMARY_TY_INFO
  SUMMARY_LOOP
    For each DO_LOOP, WHILE_DO, DO_WHILE
    Bit-vector to track field accesses of up to N structures for each loop
      Considers field accesses immediately inside loop
      These fields are considered to have affinity with each other
    Execution count of statements immediately inside loop
      From statically estimated profiles or from runtime feedback
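A minimal sketch of a per-loop summary record with one field bit-vector per tracked structure, as the SUMMARY_LOOP bullets describe. The type loop_summary, the constant MAX_TRACKED_STRUCTS (standing in for N, with an arbitrary value) and both functions are hypothetical; the real SUMMARY_LOOP layout in Open64 is not shown on the slides.

```c
#include <stdint.h>

/* Stands in for N on the slide; the value 4 is an arbitrary assumption. */
enum { MAX_TRACKED_STRUCTS = 4 };

/* Hypothetical per-loop summary: bit j of field_bv[s] is set when field j
 * of tracked structure s is accessed immediately inside the loop. */
struct loop_summary {
    uint64_t field_bv[MAX_TRACKED_STRUCTS];
    uint64_t exec_count; /* estimated or measured execution count */
};

/* Record one field access seen while summarizing the loop body.
 * Accesses outside the tracked range are ignored. */
void record_field_access(struct loop_summary *ls,
                         unsigned struct_idx, unsigned field_id)
{
    if (struct_idx < MAX_TRACKED_STRUCTS && field_id < 64)
        ls->field_bv[struct_idx] |= (uint64_t)1 << field_id;
}

/* Two fields of the same structure have affinity in this loop when both
 * bits are set in that structure's bit-vector. */
int fields_have_affinity(const struct loop_summary *ls, unsigned struct_idx,
                         unsigned f1, unsigned f2)
{
    if (struct_idx >= MAX_TRACKED_STRUCTS || f1 >= 64 || f2 >= 64)
        return 0;
    uint64_t need = ((uint64_t)1 << f1) | ((uint64_t)1 << f2);
    return (ls->field_bv[struct_idx] & need) == need;
}
```

A fixed-width bit-vector keeps the summary small and makes merging or comparing loops a matter of bitwise operations, at the price of a cap on the number of fields per structure.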
Implementation Details: IPA Analysis
  Inter-procedurally update statically estimated execution count of PUs
  Update statically estimated loop frequencies in SUMMARY_LOOP
  Consider SUMMARY_LOOP from the hottest P PUs
  Determine candidates for structure-layout transformation
  Determine new layout of structures
Implementation Details: IPA Analysis Example
  [Figure: bit vectors (BV) over fields F1 to F4 for loops L1 to L5, and affinity groups AG1 (40), AG2 (14), AG3 (88) over the same fields]
  Li: Loops
  Fj: Fields in a struct
  AGk: Affinity groups
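One plausible way to derive weighted affinity groups from per-loop bit vectors is to merge loops that access the same set of fields and add up their execution counts. The sketch below does exactly that and nothing more; it is an assumption about the general idea, not the algorithm used in Open64, and the types and the function build_affinity_groups are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical input: one record per loop. */
struct loop_info {
    uint32_t fields; /* bit j set => field Fj accessed in the loop */
    uint64_t count;  /* execution count of the loop body */
};

/* Hypothetical output: a set of fields and its accumulated weight. */
struct affinity_group {
    uint32_t fields;
    uint64_t weight;
};

/* Merge loops with identical field sets into one group and sum their
 * counts. Loops that touch no field are skipped; loops that would need
 * a group beyond max_groups are dropped. Returns the number of groups. */
size_t build_affinity_groups(const struct loop_info *loops, size_t n_loops,
                             struct affinity_group *groups, size_t max_groups)
{
    size_t n_groups = 0;
    for (size_t i = 0; i < n_loops; i++) {
        if (loops[i].fields == 0)
            continue;
        size_t g = 0;
        while (g < n_groups && groups[g].fields != loops[i].fields)
            g++;
        if (g == n_groups) {
            if (n_groups == max_groups)
                continue;
            groups[g].fields = loops[i].fields;
            groups[g].weight = 0;
            n_groups++;
        }
        groups[g].weight += loops[i].count;
    }
    return n_groups;
}
```

With such weights in hand, a layout pass could keep the heaviest groups together in the hot part of the structure; how Open64 actually ranks and places groups is not specified on the slide.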
Implementation Details: Transforming the program
  New type definitions
  Field table update
  Field access statements
  New symbols
  Assignment statements
  Example (peel T):
    Before:
      struct S {
        // N fields
        struct T * p;
        // M fields
      };
      struct T {
        // AG1 fields
        // AG2 fields
      };
    After:
      struct S {
        // N fields
        struct T1 * p1;
        struct T2 * p2;
        // M fields
      };
      struct T1 {
        // AG1 fields
      };
      struct T2 {
        // AG2 fields
      };
Implementation Details: Transforming the program (continued)
  Function calls to memory management routines
  Example:
    p = (T *) malloc (N * sizeof (T));
    if (p == NULL) exit (1);
  Detect memory management routine calls involving transformed type T
  Replicate call, assignment statements
  Update size of memory being allocated
  Handle comparisons involving pointer p
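A minimal sketch of what the slide's malloc example could look like after T is peeled into T1 and T2: the call is replicated, each copy uses the size of its own type, and the NULL comparison covers both pointers. The field names inside T1 and T2 and the function alloc_peeled are hypothetical, and the sketch returns an error code where the slide's example calls exit.

```c
#include <stdlib.h>

/* Stand-ins for the peeled types; the fields are placeholders for the
 * AG1 and AG2 fields mentioned on the slides. */
struct T1 {
    double hot_a;
    int hot_b;
};

struct T2 {
    double cold_a;
    char cold_b;
};

/* Replicated allocation: one malloc per peeled type, each sized with its
 * own sizeof. The single test "p == NULL" becomes a test on both pointers.
 * Returns 0 on success and -1 on failure, leaving both outputs NULL. */
int alloc_peeled(size_t n, struct T1 **p1, struct T2 **p2)
{
    *p1 = malloc(n * sizeof **p1);
    *p2 = malloc(n * sizeof **p2);
    if (*p1 == NULL || *p2 == NULL) {
        free(*p1);
        free(*p2);
        *p1 = NULL;
        *p2 = NULL;
        return -1;
    }
    return 0;
}
```

The matching free and realloc calls would need the same replication, which is presumably why the slide speaks of memory management routines in general.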
Performance Results
  Compilation options: -Ofast at 32-bit ABI
  Speedup due to structure layout optimizations

  Platforms:
    AMD Opteron™ (2.8 GHz, 4 GB, 1 MB)
    AMD Barcelona (2.0 GHz, 8 GB, 512 KB)
    Intel® EM64T (3.4 GHz, 4 GB, 1 MB)
    Intel® Core™ (3.0 GHz, 4 GB, 4 MB)
    SiCortex MIPS® (500 MHz, 4 GB, 256 KB)

  Benchmark        Opteron  Barcelona  EM64T   Core    SiCortex MIPS  Geometric Mean
  179.art          134%     66%        56%     47%     41%            62.5%
  181.mcf          24%      23%                31%     13%            22.0%
  462.libquantum   32%      17%        40%     72%     62%            39.6%
  Geometric Mean   46.9%    29.6%      37.2%   47.2%   32.1%          37.9%
Performance Results (continued)
  Compilation options: -Ofast at 64-bit ABI
  Speedup due to structure layout optimizations

  Platforms:
    AMD Opteron™ (2.8 GHz, 4 GB, 1 MB)
    AMD Barcelona (2.0 GHz, 8 GB, 512 KB)
    Intel® EM64T (3.4 GHz, 4 GB, 1 MB)
    Intel® Core™ (3.0 GHz, 4 GB, 4 MB)
    SiCortex MIPS® (500 MHz, 4 GB, 256 KB)

  Benchmark        Opteron  Barcelona  EM64T   Core    SiCortex MIPS  Geometric Mean
  179.art          169%     66%        53%     60%     45%            69.3%
  181.mcf          25%      35%        12%     30%     7%             18.6%
  462.libquantum   82%      51%        75%     70%     69%            68.6%
  Geometric Mean   70.2%    49.0%      36.3%   50.1%   27.9%          44.6%
Performance Results (continued)
  Compilation options: -Ofast at 64-bit ABI
  Multiple copies of 462.libquantum running on multi-core chip
  Platform: Quad-core AMD Barcelona (2.0 GHz, 8 GB, 512 KB, 2 MB)
    3rd-level cache shared among 4 cores
  Speedup from structure layout optimizations

  Benchmark        1 copy   2 copies   4 copies
  462.libquantum   51%      69%        123%
Future Work
  Tune static profile estimation
  Fewer restrictions
  Integrate with field reordering
Conclusion
  A framework for performing structure layout transformations is now available in the Open64 compiler.
  The superior infrastructure in the Open64 compiler helped us implement the optimizations cleanly and with relatively little effort.
  Substantial speedups are possible on some of the SPEC CPU2000 and CPU2006 benchmarks.
  Structure layout optimization is a required feature for a compiler to remain competitive.