Hoard: A Scalable Memory Allocator for Multithreaded Applications
  Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
  Presented by Dimitris Prountzos
  (Some slides adapted from Emery Berger’s presentation)
Outline
  Motivation
  Problems in allocator design
    – False sharing
    – Fragmentation
  Existing approaches
  Hoard design
  Experimental evaluation
Motivation
  Parallel multithreaded programs are prevalent
    – Web servers, search engines, DB managers etc.
    – Run on CMP/SMP for high performance
    – Some of them embarrassingly parallel
  Memory allocation is a bottleneck
    – Prevents scaling with the number of processors
Desired allocator attributes on a multiprocessor system
  Speed
    – Competitive with uniprocessor allocators on 1 cpu
  Scalability
    – Performance linear with the number of processors
  Fragmentation (= max allocated / max in use)
    – High fragmentation leads to poor data locality and paging
  False sharing avoidance
The problem of false sharing
  Program causes false sharing
    – Allocate a number of objects in a cache line, pass the objects to different threads
  Allocators cause false sharing!
    – Actively: malloc satisfies different threads' requests from the same cache line
    – Passively: free allows a future malloc to produce false sharing
  [Figure: processor 1 and processor 2; x1 = malloc(s); x2 = malloc(s); a cache line thrash…]
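The actively induced case can be illustrated with a little address arithmetic. This is a minimal sketch, not a measurement: the 64-byte line size, the base address, and the helper names `cache_line` and `falsely_shared` are all assumptions made for illustration.

```python
CACHE_LINE = 64  # bytes; an assumed, typical cache line size


def cache_line(addr):
    """Index of the cache line that contains byte address `addr`."""
    return addr // CACHE_LINE


def falsely_shared(addr_a, addr_b):
    """True when two distinct objects start in the same cache line."""
    return addr_a != addr_b and cache_line(addr_a) == cache_line(addr_b)


# A bump allocator handing out consecutive 8-byte blocks (hypothetical addresses).
base = 0x1000
x1, x2 = base, base + 8            # x1 is used by one processor, x2 by another
print(falsely_shared(x1, x2))      # True: both blocks start in the same line

# Giving each processor its own line-aligned region avoids the overlap.
y1, y2 = base, base + CACHE_LINE
print(falsely_shared(y1, y2))      # False
```

Writes to `x1` and `x2` from different processors would invalidate each other's copy of the line even though the objects are logically unrelated.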
The problem of fragmentation
  Blowup:
    – Increase in memory consumption when the allocator reclaims memory freed by the program, but fails to use it for future requests
    – Mainly a problem of concurrent allocators
    – Unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator
  Pure private heaps: one heap per processor
    – malloc gets memory from the processor's heap or the system
    – free puts memory on the processor's heap
  Avoids heap contention
  Examples: STL, Cilk
  [Figure: timelines for processor 1 and processor 2 with the calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), x4 = malloc(s), free(x1), free(x2); legend: allocated by heap 1; free, on heap 2]
How to Break Pure Private Heaps: Fragmentation
  Pure private heaps: memory consumption can grow without bound!
  Producer-consumer:
    – Processor 1 allocates
    – Processor 2 frees
  Freed memory is always unavailable to the producer
  [Figure: processor 1 calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 calls free(x1), free(x2), free(x3)]
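The producer-consumer pattern above can be replayed with simple counters. This is a toy model under stated assumptions: blocks are all one size, heaps are reduced to free-block counts, and the function name `pure_private_heaps` is made up for this sketch.

```python
def pure_private_heaps(rounds):
    """Replay a producer (processor 0) and a consumer (processor 1).

    Returns (blocks obtained from the system, peak blocks in use).
    """
    free_lists = [0, 0]      # free blocks held by each processor's heap
    from_system = 0          # total blocks requested from the OS
    in_use = peak = 0
    for _ in range(rounds):
        # Producer mallocs: take from its own heap, else from the system.
        if free_lists[0] > 0:
            free_lists[0] -= 1
        else:
            from_system += 1
        in_use += 1
        peak = max(peak, in_use)
        # Consumer frees: the block lands on the consumer's heap,
        # where the producer can never reach it.
        free_lists[1] += 1
        in_use -= 1
    return from_system, peak


print(pure_private_heaps(1000))   # (1000, 1)
```

The program never needs more than one live block, yet the memory taken from the system grows with the number of rounds.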
Example II: Private Heaps with Ownership
  free puts memory back on the originating processor's heap
  Avoids unbounded memory consumption
  Examples: ptmalloc, LKmalloc
  [Figure: timelines for processor 1 and processor 2 with the calls x1 = malloc(s), x2 = malloc(s), free(x1), free(x2)]
How to Break Private Heaps with Ownership: Fragmentation
  Memory consumption can blow up by a factor of P
  Round-robin producer-consumer:
    – Processor i allocates
    – Processor i+1 frees
  Program requires 1 (K) blocks, allocator gets 3 (P*K) blocks
  [Figure: processor 1, processor 2, processor 3 with the calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s), free(x1), free(x2), free(x3)]
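The P-fold blowup can be reproduced with the same kind of counter model. This is a sketch under assumptions: one block size, heaps reduced to free-block counts, and a hypothetical function name `ownership_round_robin`.

```python
def ownership_round_robin(P, K):
    """Processor i allocates K blocks; processor (i + 1) % P frees them.

    With ownership, each freed block returns to the heap that allocated it.
    Returns (blocks obtained from the system, peak blocks in use).
    """
    free_lists = [0] * P
    from_system = 0
    in_use = peak = 0
    for i in range(P):
        for _ in range(K):                 # processor i mallocs K blocks
            if free_lists[i] > 0:
                free_lists[i] -= 1
            else:
                from_system += 1
            in_use += 1
        peak = max(peak, in_use)
        free_lists[i] += K                 # the next processor frees them back to heap i
        in_use -= K
    return from_system, peak


print(ownership_round_robin(3, 1))    # (3, 1)
print(ownership_round_robin(3, 10))   # (30, 10)
```

Each processor's freed blocks sit on its own heap, so the next processor in the ring must go to the system: the program needs K blocks at a time while the allocator holds P*K.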
Existing approaches
Uniprocessor Allocators on Multiprocessors
  Fragmentation: Excellent
    – Very low for most programs [Wilson & Johnstone]
  Speed & Scalability: Poor
    – Heap contention: a single lock protects the heap
  Can exacerbate false sharing
    – Different processors can share cache lines
Existing Multiprocessor Allocators
  Speed:
    – One concurrent heap (e.g., concurrent B-tree): O(log (#size-classes)) cost per memory operation, too many locks/atomic updates
    – Fast allocators use multiple heaps
  Scalability:
    – Allocator-induced false sharing
    – Other bottlenecks (e.g. nextHeap global in Ptmalloc)
  Fragmentation:
    – P-fold increase or even unbounded
Hoard as the solution
Hoard Overview
  P per-processor heaps & 1 global heap
  Each thread accesses only its local heap & the global heap
  Manages memory in page-sized superblocks of same-sized objects (LIFO free-list)
    – Avoids false sharing by not carving up cache lines
    – Avoids heap contention: local heaps allocate & free small blocks from their superblocks
  Avoids blowup by
    – Moving superblocks to the global heap when the fraction of free memory exceeds some threshold
Superblock management
  Emptiness threshold: $(u_i \ge (1-f)\,a_i) \lor (u_i \ge a_i - K \cdot S)$
    – $f = 1/4$, $K = 0$
  Multiple heaps: avoid actively induced false sharing
  Block coalescing: avoid passively induced false sharing
  Superblocks transferred are usually empty and transfer is infrequent
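The emptiness threshold is easy to check numerically. In this sketch, `u_i` is taken to be the bytes in use in heap i, `a_i` the bytes held by heap i, and `S` the superblock size. The default `S = 8192` is borrowed from the evaluation setup, and the function name `invariant_holds` is an assumption of this example.

```python
def invariant_holds(u_i, a_i, f=0.25, K=0, S=8192):
    """Emptiness threshold: (u_i >= (1 - f) * a_i) or (u_i >= a_i - K * S)."""
    return u_i >= (1 - f) * a_i or u_i >= a_i - K * S


# A heap holding 4 superblocks (32768 bytes).
a_i = 4 * 8192
print(invariant_holds(24576, a_i))   # True: exactly 3/4 of the memory is in use
print(invariant_holds(24568, a_i))   # False: one 8-byte block below the threshold
```

With K = 0 the second clause only holds for a completely full heap, so the empty fraction f is what decides when a superblock becomes eligible to move to the global heap.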
Hoard pseudo-code

  malloc(sz)
    If sz > S/2, allocate the superblock from the OS and return it.
    i ← hash(current thread)
    Lock heap i
    Scan heap i's list of superblocks from most full to least (for the size class of sz)
    If there is no superblock with free space {
      Check heap 0 (global) for a superblock
      If there is none {
        Allocate S bytes as superblock s & set owner to heap i
      } Else {
        Transfer the superblock s to heap i
        u_0 ← u_0 - s.u; u_i ← u_i + s.u
        a_0 ← a_0 - S; a_i ← a_i + S
      }
    }
    u_i ← u_i + sz; s.u ← s.u + sz
    Unlock heap i
    Return a block from the superblock

  free(ptr)
    If the block is “large”
      Free superblock to OS and return
    Find the superblock s this block comes from
    Lock s
    Lock heap i, the superblock's owner
    Deallocate the block from the superblock
    u_i ← u_i - block size
    s.u ← s.u - block size
    If (i = 0) unlock heap i, superblock s and return
    If (u_i < a_i - K*S) and (u_i < (1-f)*a_i) {
      Transfer a mostly-empty superblock s1 to heap 0 (global)
      u_0 ← u_0 + s1.u; u_i ← u_i - s1.u
      a_0 ← a_0 + S; a_i ← a_i - S
    }
    Unlock heap i and superblock s
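The bookkeeping in the pseudo-code can be followed in a single-threaded toy model. This is a sketch under loud assumptions, not Hoard's implementation: locks and the large-object path are omitted, `sz` is used directly as the size class (so `sz <= S/2` is assumed), the superblock object stands in for the returned pointer, and the names `ToyHoard`, `Heap`, and `Superblock` are invented here.

```python
S = 8192          # superblock size in bytes (value from the evaluation setup)
F, K = 0.25, 0    # empty fraction f and slack K


class Superblock:
    def __init__(self, block_size):
        self.block_size = block_size
        self.free = S // block_size   # free blocks remaining
        self.u = 0                    # bytes in use in this superblock
        self.owner = None


class Heap:
    def __init__(self):
        self.u = 0                    # bytes in use (u_i)
        self.a = 0                    # bytes held (a_i)
        self.superblocks = []


class ToyHoard:
    def __init__(self, nprocs):
        self.heaps = [Heap() for _ in range(nprocs + 1)]   # heap 0 is global

    def _move(self, s, src, dst):
        """Transfer superblock s and update u and a on both heaps."""
        src.superblocks.remove(s)
        dst.superblocks.append(s)
        src.u -= s.u
        dst.u += s.u
        src.a -= S
        dst.a += S
        s.owner = dst

    def malloc(self, thread_id, sz):
        heap = self.heaps[1 + thread_id % (len(self.heaps) - 1)]
        fits = [s for s in heap.superblocks if s.block_size == sz and s.free]
        if fits:
            s = max(fits, key=lambda b: b.u)       # fullest superblock first
        else:
            glob = self.heaps[0]
            reuse = [s for s in glob.superblocks if s.block_size == sz and s.free]
            if reuse:
                s = reuse[0]
                self._move(s, glob, heap)
            else:
                s = Superblock(sz)                 # "allocate S bytes" from the OS
                s.owner = heap
                heap.superblocks.append(s)
                heap.a += S
        s.free -= 1
        s.u += sz
        heap.u += sz
        return s

    def free(self, s):
        heap = s.owner
        s.free += 1
        s.u -= s.block_size
        heap.u -= s.block_size
        if heap is self.heaps[0]:
            return
        if heap.u < heap.a - K * S and heap.u < (1 - F) * heap.a:
            emptiest = min(heap.superblocks, key=lambda b: b.u)
            self._move(emptiest, heap, self.heaps[0])


hoard = ToyHoard(nprocs=2)
for _ in range(10000):                 # producer-consumer: allocate, then free
    block = hoard.malloc(0, 8)
    hoard.free(block)
print(sum(h.a for h in hoard.heaps))   # 8192: one superblock is recycled throughout
```

In this model the producer-consumer pattern that broke pure private heaps keeps reusing a single superblock via the global heap instead of growing.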
Deriving bounds on blowup
  blowup := $O(A(t) / U(t))$
  $A(t) = A'(t)$
  Invariant: $(a_i(t) - K \cdot S \le u_i(t)) \lor ((1-f)\,a_i(t) \le u_i(t))$
  $A(t) = O(U(t) + P)$
  $P \ll U(t) \Rightarrow$ blowup := $O(1)$
  Worst-case consumption is a constant-factor overhead that does not grow with the amount of memory required by the program
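One way to see where the $O(U(t) + P)$ form comes from is to sum the invariant over the per-processor heaps. The derivation below is a simplified sketch: it assumes the invariant holds for every per-processor heap at time $t$, ignores the global heap, and works with current values $a_i$, $u_i$ rather than the maxima $A(t)$, $U(t)$.

```latex
\begin{align*}
u_i \ge a_i - K S \;&\Rightarrow\; a_i \le u_i + K S, \\
u_i \ge (1-f)\,a_i \;&\Rightarrow\; a_i \le \frac{u_i}{1-f}, \\
\text{so in either case}\quad a_i &\le \frac{u_i}{1-f} + K S, \\
\sum_{i=1}^{P} a_i &\le \frac{1}{1-f} \sum_{i=1}^{P} u_i + P\,K\,S .
\end{align*}
```

Since $f$, $K$, and $S$ are constants, the right-hand side is a constant multiple of the memory in use plus a term proportional to $P$. With $f = 1/4$ the multiplicative factor is $4/3$.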
Deriving bounds on contention (1)
  Per-processor heap contention
    – 1 allocator thread / multiple threads free: inherently unscalable
    – Pairs of producer/consumer threads: malloc/free calls serialized, at most 2X slowdown (undesirable but scalable)
    – Empirically only a small fraction of memory is freed by another thread, so contention is expected to be low
Deriving bounds on contention (2)
  Global heap contention
    – Measure the number of global heap (GH) lock acquisitions as an upper bound
    – Growing phase: each thread makes at most k/(f*S/s) acquisitions for k mallocs
    – Shrinking phase: pathological case where the program frees (1-f) of each superblock and then frees every block in a superblock one at a time
    – Empirically: no excessive shrinking and gradual growth of memory usage, so overall contention is low
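The growing-phase bound can be evaluated for concrete numbers. This sketch assumes `s` is the object size in bytes and plugs in the superblock size and empty fraction from the evaluation setup. The function name `growing_phase_bound` is made up for illustration.

```python
def growing_phase_bound(k, s, S=8192, f=0.25):
    """Upper bound k / (f * S / s) on global-heap lock acquisitions for k mallocs."""
    return k / (f * S / s)


# 100,000 mallocs of 8-byte objects by a single thread.
print(growing_phase_bound(100_000, 8))   # 390.625
```

Under these assumptions a thread touches the global heap lock at most once per 256 allocations of 8-byte objects, which is why growth alone is not expected to cause much contention.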
Experimental Evaluation
  Dedicated 14-processor Sun Enterprise
    – 400 MHz UltraSPARC
    – 2 GB RAM, 4 MB L2 cache
    – Solaris 7
    – Superblock size = 8K, f = ¼
  Comparison between
    – Hoard
    – Ptmalloc (GNU libc, multiple heaps & ownership)
    – Mtmalloc (Solaris multithreaded allocator)
    – Solaris (default system allocator)
Benchmarks
Speed
  Size classes need to be handled more cleverly
Scalability - threadtest
  % faster than Ptmalloc on 14 cpus
  t threads allocate/deallocate 100,000/t 8-byte objects
Scalability - Larson
  “Bleeding” typical in server applications
  Mainly stays within empty fraction during execution
  18X faster than next best allocator on 14 cpus
Scalability - BEMengine
  Few times below empty fraction, so low synchronization
False sharing behavior
  Active-false: each thread allocates a small object, writes to it a few times, frees it
  Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false
  Illustrates the effects of contention in the coherence mechanism
Fragmentation results
  A large number of size classes remain live for the duration of the program and are scattered across blocks
  Within 20% of Lea’s allocator
Hoard Conclusions
  Speed: Excellent
    – As fast as a uniprocessor allocator on one processor
    – Amortized O(1) cost
    – 1 lock for malloc, 2 for free
  Scalability: Excellent
    – Scales linearly with the number of processors
    – Avoids false sharing
  Fragmentation: Very good
    – Worst-case is provably close to ideal
    – Actual observed fragmentation is low
Discussion Points
  If we had to re-evaluate Hoard today, which benchmarks would we use?
  Are there any changes needed to make it work with languages like Java?