
1 Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap
Paper by: Chris Lattner and Vikram Adve, University of Illinois at Urbana-Champaign
Best Paper Award at PLDI 2005
Presented by: Jeff Da Silva, CARG - Aug 2nd 2005

2 Motivation
Computer architecture and compiler research has primarily focused on analyzing & optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
- e.g. caches, prefetching, loop transformations, etc.
Why?
- Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays.
- The layout of heap-allocated data structures and their traversal patterns can be difficult to predict statically (and generally it's not worth the effort).

3 Intuition
Improving a dynamic data structure's spatial locality through program analysis (like shape analysis) is probably too difficult and might not be the best approach.
What if you can somehow influence the layout dynamically so that the data structure is allocated intelligently and possibly appears like a dense array?
A new approach: develop a new technique that operates at the "macroscopic" level
- i.e. at the level of the entire data structure rather than individual pointers or objects

4 What is the problem?
What the program creates: (figure: List 1 nodes, List 2 nodes, and Tree nodes in the heap)

5 What is the problem?
What the compiler sees: (figure: List 1 nodes, List 2 nodes, and Tree nodes in the heap)

6 What is the problem?
What we want the program to create and the compiler to see: (figure: List 1 nodes, List 2 nodes, and Tree nodes, each segregated in its own region)

7 Their Approach: Segregate the Heap
Step #1: Memory Usage Analysis
- Build context-sensitive points-to graphs for the program
- Uses a fast unification-based algorithm
Step #2: Automatic Pool Allocation
- Segregate memory based on points-to graph nodes
- Find lifetime bounds for memory with escape analysis
- Preserve the points-to-graph-to-pool mapping
Step #3: Follow-on pool-specific optimizations
- Use segregation and points-to graph for later optimizations

8 Why Segregate Data Structures?
Primary Goal: Better compiler information & control
- Compiler knows where each data structure lives in memory
- Compiler knows order of data in memory (in some cases)
- Compiler knows type info for heap objects (from points-to info)
- Compiler knows which pools point to which other pools
Second Goal: Better performance
- Smaller working sets
- Improved spatial locality, especially if allocation order matches traversal order
- Sometimes convert irregular strides to regular strides

9 Contributions of this Paper
1. First "region inference" technique for C/C++:
- Previous work required type-safe programs: ML, Java
- Previous work focused on memory management
2. Region inference driven by pointer analysis:
- Enables handling non-type-safe programs
- Simplifies handling imperative programs
- Simplifies further pool+ptr transformations
3. New pool-based optimizations:
- Exploit per-pool and pool-specific properties
4. Evaluation of impact on memory hierarchy:
- We show that pool allocation reduces working sets

10 Outline
Introduction & Motivation
Automatic Pool Allocation Transformation
Pool Allocation-Based Optimizations
Pool Allocation & Optimization Performance Impact
Conclusion

11 Example

struct list { list *Next; int *Data; };

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}

12 Example

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated from separate pools of memory.

13 Pool Alloc Runtime Library Interface

void poolcreate(Pool *PD, uint Size, uint Align);
  Initializes a pool descriptor (obtains one or more pages of memory using malloc).
void pooldestroy(Pool *PD);
  Releases pool memory and destroys the pool descriptor.
void *poolalloc(Pool *PD, uint numBytes);
void poolfree(Pool *PD, void *ptr);
void *poolrealloc(Pool *PD, void *ptr, uint numBytes);

The interface also includes: poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..).
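The slide gives only signatures; as a rough illustration of the behavior they describe, here is a minimal, hypothetical sketch of such a runtime in C (fixed-size objects, one free list per pool, slabs obtained from malloc). It uses size_t instead of uint and is not the paper's actual implementation:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct FreeNode { struct FreeNode *next; } FreeNode;
typedef struct Slab { struct Slab *next; } Slab;

typedef struct Pool {
    size_t objSize;      /* rounded-up object size */
    FreeNode *freeList;  /* recycled objects */
    Slab *slabs;         /* slabs to release in pooldestroy */
    char *bump, *end;    /* unused tail of the current slab */
} Pool;

#define SLAB_OBJS 256    /* objects per slab (a real pool grows slabs) */

void poolcreate(Pool *pd, size_t size, size_t align) {
    /* round size up to the alignment and to free-list node size */
    if (size < sizeof(FreeNode)) size = sizeof(FreeNode);
    pd->objSize = (size + align - 1) & ~(align - 1);
    pd->freeList = NULL;
    pd->slabs = NULL;
    pd->bump = pd->end = NULL;
}

void *poolalloc(Pool *pd, size_t numBytes) {
    assert(numBytes <= pd->objSize);   /* sketch: fixed-size objects only */
    if (pd->freeList) {                /* reuse a freed object first */
        FreeNode *n = pd->freeList;
        pd->freeList = n->next;
        return n;
    }
    if (pd->bump == pd->end) {         /* current slab exhausted: get another */
        Slab *s = malloc(sizeof(Slab) + SLAB_OBJS * pd->objSize);
        s->next = pd->slabs;
        pd->slabs = s;
        pd->bump = (char *)(s + 1);
        pd->end = pd->bump + SLAB_OBJS * pd->objSize;
    }
    void *p = pd->bump;
    pd->bump += pd->objSize;
    return p;
}

void poolfree(Pool *pd, void *ptr) {   /* push onto the pool's free list */
    FreeNode *n = ptr;
    n->next = pd->freeList;
    pd->freeList = n;
}

void pooldestroy(Pool *pd) {           /* release every slab at once */
    while (pd->slabs) {
        Slab *s = pd->slabs;
        pd->slabs = s->next;
        free(s);
    }
    pd->freeList = NULL;
    pd->bump = pd->end = NULL;
}
```

Note that pooldestroy releases the whole pool in one pass over its slabs, which is why hoisting destruction (slide 24) frees memory early even when individual poolfree calls are removed.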

14 Algorithm Steps
1) Generate a DS graph (points-to graph) for each function.
2) Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function.
3) Add pool descriptor arguments for every DS node that escapes a function.
4) Replace calls to malloc and free with calls to poolalloc and poolfree.
5) Further refinements and optimizations.

15 Points-To Graph – DS Graph
Builds a points-to graph for each function in Bottom-Up (BU) order
Context-sensitive naming of heap objects
- More advanced than the traditional "allocation call-site" naming
A unification-based approach
- Allows for a fast and scalable analysis
- Ensures every pointer points to one unique node
Field sensitive
- Added accuracy
Also used to compute escape info
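The unification property (every pointer points to exactly one node) is what keeps the analysis fast: merging abstract memory nodes is ordinary union-find. A tiny illustrative sketch of that merging step, not the paper's DSA code:

```c
#include <assert.h>

/* Each abstract memory node lives in a union-find structure.
   An assignment "p = q" unifies the nodes p and q point to, so
   every pointer ends up pointing to one representative node. */

#define MAX_NODES 64
static int parent[MAX_NODES];

static int find(int n) {              /* representative, with path halving */
    while (parent[n] != n) {
        parent[n] = parent[parent[n]];
        n = parent[n];
    }
    return n;
}

static void unify(int a, int b) {     /* merge two points-to nodes */
    a = find(a);
    b = find(b);
    if (a != b) parent[b] = a;
}

static void init_nodes(void) {
    for (int i = 0; i < MAX_NODES; i++) parent[i] = i;
}
```

Unification sacrifices some precision (aliased pointers collapse into one node) in exchange for near-linear running time, which is what makes the whole-program, context-sensitive analysis scale.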

16 Example – DS Graph

list *createnode(int *Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

17 Example – DS Graph

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}

(figure: DS graph with nodes P1 and P2)

18 Example – DS Graph

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

(figure: DS graph with nodes P1 and P2)

19 Example – Transformation

list *createnode(Pool *PD, int *Data) {
  list *New = poolalloc(PD, sizeof(list));
  New->Data = Data;
  return New;
}

20 Example – Transformation

void splitclone(Pool *PD1, Pool *PD2, list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(PD1, L->Data);
    splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(PD2, L->Data);
    splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  }
}

(figure: DS graph with nodes P1 and P2)

21 Example – Transformation

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}

(figure: DS graph with nodes P1 and P2)

22 More Algorithm Details
Indirect function call handling:
- Partition functions into equivalence classes: if F1 and F2 share a common call-site, they go in the same class
- Merge points-to graphs for each equivalence class
- Apply the previous transformation unchanged
Global variables pointing to memory nodes:
- Use a global pool variable rather than passing pool descriptors around through function arguments

23 More Algorithm Details
poolcreate / pooldestroy placement:
- Move calls earlier/later by analyzing the pool's lifetime
- Reduces memory usage
- Enables poolfree elimination
poolfree elimination:
- Eliminate unnecessary poolfree calls when no allocations occur between the poolfree and the pooldestroy
- Behaves like static garbage collection

24 Example – poolcreate/pooldestroy placement

Before:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}

After (each pool is destroyed as soon as its list is freed):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}

25 Example – poolfree Elimination

Step 1 (after placement):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}

Step 2 (poolfree calls removed: no allocations occur between them and the pooldestroy):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

Step 3 (the now-empty traversal loops are also removed):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}

26 Outline
Introduction & Motivation
Automatic Pool Allocation Transformation
Pool Allocation-Based Optimizations
Pool Allocation & Optimization Performance Impact
Conclusion

27 PAOpts (1/4) and (2/4)
Selective Pool Allocation
- Don't pool allocate when not profitable
- Avoids creating and destroying a pool descriptor (minor), and avoids significant wasted space when the object is much smaller than the smallest internal page
PoolFree Elimination
- Remove explicit deallocations that are not needed

28 Looking closely: Anatomy of a heap
Fully general malloc-compatible allocator:
- Supports malloc/free/realloc/memalign etc.
- Standard malloc overheads: object header, alignment
- Allocates slabs of memory with exponential growth
- By default, all returned pointers are 8-byte aligned
In memory, a 16-byte allocation looks like:
(figure: one 32-byte cache line holding a 4-byte object header, 4 bytes of padding for user-data alignment, and the 16-byte user data)

29 PAOpts (3/4): Bump Pointer Optzn
If a pool has no poolfree's:
- Eliminate the per-object header
- Eliminate free-list overhead (faster object allocation)
Eliminates 4 bytes of inter-object padding
- Packs objects more densely in the cache
Interacts with poolfree elimination (PAOpt 2/4)!
- If poolfree elimination deletes all frees, the bump pointer can apply
(figure: two 16-byte user objects now fit in one 32-byte cache line)
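Under these conditions allocation degenerates to a pointer increment. A hypothetical sketch of the bump-pointer layout (single slab for simplicity; the names are invented, this is not the paper's poolalloc_bp runtime):

```c
#include <stdlib.h>

/* A pool with no poolfree needs no per-object header and no free
   list: objects are laid out back to back, so 16-byte objects pack
   two per 32-byte cache line instead of one. */

typedef struct BumpPool {
    char *base, *bump, *end;
} BumpPool;

void bp_init(BumpPool *p, size_t slabBytes) {
    p->base = p->bump = malloc(slabBytes);
    p->end = p->base + slabBytes;
}

void *bp_alloc(BumpPool *p, size_t n) {
    if (p->bump + n > p->end) return NULL;  /* sketch: single slab only */
    void *r = p->bump;
    p->bump += n;                           /* allocation = one pointer add */
    return r;
}

void bp_destroy(BumpPool *p) { free(p->base); }
```

Because consecutive allocations are contiguous, allocation order directly becomes memory order, which is what lets traversal in allocation order look like a dense-array scan.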

30 PAOpts (4/4): Alignment Analysis
Malloc must return 8-byte aligned memory:
- It has no idea what types will be used in the memory
- Some machines bus-error on unaligned accesses; others suffer performance problems
Type-safe pools infer a type for the pool:
- Use 4-byte alignment for pools we know don't need 8-byte alignment
- Reduces inter-object padding
(figure: with 4-byte alignment, the 4-byte object header no longer forces padding before the 16-byte user data)
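The padding arithmetic behind these last two slides can be checked with a small helper (hypothetical, for illustration only): with a 4-byte header, 8-byte alignment of the user data costs 4 bytes of padding per object, which either 4-byte alignment or the headerless bump-pointer layout removes.

```c
#include <assert.h>

/* Bytes one object occupies: header, then padding so the user data
   starts at the requested alignment, then the user data itself. */
static unsigned padded_footprint(unsigned userBytes, unsigned header,
                                 unsigned align) {
    unsigned start = header;                        /* data begins after header */
    unsigned aligned = (start + align - 1) & ~(align - 1);
    return aligned + userBytes;                     /* header + pad + data */
}
```

For a 16-byte object this gives 24 bytes under standard malloc (4 header + 4 pad + 16 data), 20 bytes with alignment analysis, and 16 bytes with the bump pointer.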

31 Outline
Introduction & Motivation
Automatic Pool Allocation Transformation
Pool Allocation-Based Optimizations
Pool Allocation & Optimization Performance Impact
Conclusion

32 Implementation & Infrastructure
Link-time transformation using the LLVM Compiler Infrastructure
Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
Evaluated on an AMD Athlon MP 2100+
- 64KB L1, 256KB L2

33 Simple Pool Allocation Statistics
Programs from the SPEC CINT2000, Ptrdist, FreeBench & Olden suites, plus unbundled programs
(Table 1)

34 Simple Pool Allocation Statistics
DSA is able to infer that most static pools are type-homogeneous
(Table 1)

35 Compile Time
Compilation overhead is less than 3%
(Table 3)

36 Pool Allocation Speedup
Several programs unaffected by pool allocation (see paper)
Sizable speedup across many pointer-intensive programs
Some programs (ft, chomp) an order of magnitude faster
Most programs are 0% to 20% faster with pool allocation alone

37 Pool Allocation Speedup
Several programs unaffected by pool allocation (see paper)
Sizable speedup across many pointer-intensive programs
Some programs (ft, chomp) an order of magnitude faster
Two are 10x faster, one is almost 2x faster

38 Pool Optimization Speedup (FullPA)
Baseline 1.0 = run time with pool allocation
Optimizations help all of these programs:
- Despite being very simple, they make a big impact
Most are 5-15% faster with optimizations than with pool allocation alone
(Figure 9, with a different baseline)

39 Pool Optimization Speedup (FullPA)
Baseline 1.0 = run time with pool allocation
Optimizations help all of these programs:
- Despite being very simple, they make a big impact
One is 44% faster, the other is 29% faster
(Figure 9, with a different baseline)

40 Pool Optimization Speedup (FullPA)
Baseline 1.0 = run time with pool allocation
Optimizations help all of these programs:
- Despite being very simple, they make a big impact
Pool optimizations help some programs that pool allocation itself doesn't
(Figure 9, with a different baseline)

41 Pool Optimization Speedup (FullPA)
Baseline 1.0 = run time with pool allocation
Optimizations help all of these programs:
- Despite being very simple, they make a big impact
The pool optimizations' effect can be additive with the pool allocation effect
(Figure 9, with a different baseline)

42 Cache/TLB miss reduction
Sources:
- Defragmented heap
- Reduced inter-object padding
- Segregating the heap!
Miss rate measured with perfctr on an AMD Athlon 2100+
(Figure 10)

43 Pool Optimization Statistics Table 2

44 Optimization Contribution Figure 11

45 Pool Allocation Conclusions
1. Segregate heap based on points-to graph
- Improved memory hierarchy performance
- Gives the compiler some control over layout
- Gives the compiler information about locality
2. Optimize pools based on per-pool properties
- Very simple (but useful) optimizations proposed here
- Optimizations could be applied to other systems

46 The End

47 Backup Slides

48 Table 4

49 Table 5

50 Pool Allocation: Example

list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num-1) : 0;
  New->Data = Num;
  return New;
}

int twoLists() {
  list *X = makeList(10);
  list *Y = makeList(100);
  GL = Y;
  processList(X);
  processList(Y);
  freeList(X);
  freeList(Y);
}

(slide animation: makeList gains a Pool* parameter and malloc becomes poolalloc; twoLists creates pools P1 and P2 with poolinit and destroys them with pooldestroy)

Change calls to free into calls to poolfree
- retain explicit deallocation

51 Different Data Structures Have Different Properties
Pool allocation segregates heap:
- Roughly into logical data structures
- Optimize using pool-specific properties
Examples of properties we look for:
- Pool is type-homogeneous
- Pool contains data that only requires 4-byte alignment
- Opportunities to reduce allocation overhead
(figure: build/traverse/destroy phases of a complex allocation pattern, with pool-specific optimizations applied to list pools)

52 Benchmarks
Pointer-intensive SPECINT 2000, Ptrdist, Olden, and FreeBench suites
povray, espresso, fpgrowth, llu-bench, chomp
Benchmarks with custom allocators are not evaluated, except for parser, which they hand-modified.

