
Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
Presented by Dimitris Prountzos
(Some slides adapted from Emery Berger's presentation)

Outline
Motivation
Problems in allocator design
– False sharing
– Fragmentation
Existing approaches
Hoard design
Experimental evaluation

Motivation
Parallel multithreaded programs are prevalent
– Web servers, search engines, DB managers, etc.
– Run on CMP/SMP machines for high performance
– Some of them are embarrassingly parallel
Memory allocation is a bottleneck
– Prevents scaling with the number of processors

Desired allocator attributes on a multiprocessor system
Speed
– Competitive with uniprocessor allocators on 1 CPU
Scalability
– Performance linear with the number of processors
Fragmentation (= max allocated / max in use)
– High fragmentation → poor data locality → paging
False sharing avoidance

The problem of false sharing
Programs can cause false sharing
– Allocate a number of objects within one cache line, pass the objects to different threads
Allocators can cause false sharing!
– Actively: malloc satisfies requests from different threads out of the same cache line
– Passively: free allows a future malloc to produce false sharing
Diagram: processor 1 and processor 2 each call malloc(s) (x1 and x2); both objects come from the same cache line, which then thrashes between the two processors' caches
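A minimal C sketch of the pattern behind allocator-induced false sharing (illustrative only, not from the slides; names and constants are invented): two threads each allocate a tiny object and update it in a tight loop. If the allocator hands both objects out of one cache line, every write by one thread invalidates the line in the other thread's cache.

#include <pthread.h>
#include <stdlib.h>

#define ITERATIONS 100000000L

/* Each thread allocates its own small object and updates it repeatedly.
   If malloc returns both objects from the same cache line, the writes
   cause that line to ping-pong between the processors' caches. */
static void *worker(void *arg) {
    (void)arg;
    volatile int *counter = malloc(sizeof(int));   /* 4 bytes: may share a cache line */
    for (long i = 0; i < ITERATIONS; i++)
        (*counter)++;
    free((void *)counter);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}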

The problem of fragmentation
Blowup:
– Increase in memory consumption when the allocator reclaims memory freed by the program but fails to use it for future requests
– Mainly a problem of concurrent allocators
– Can be unbounded (worst case) or bounded, e.g. O(P) in the number of processors

Example: Pure Private Heaps Allocator
Pure private heaps: one heap per processor
– malloc gets memory from the calling processor's heap or from the system
– free puts memory back on the calling processor's heap
Avoids heap contention
Examples: STL, Cilk
Diagram: each processor allocates and frees blocks (x1–x4); the legend marks blocks "allocated by heap 1" and "free, on heap 2", i.e. a block allocated from heap 1 but freed by processor 2 ends up free on heap 2
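A schematic C sketch of the pure-private-heaps policy described above (my own illustration, not the slide's code; it ignores size classes and locking, and the thread-to-heap mapping my_id() is a made-up stand-in): malloc pops from the caller's free list, and free pushes onto the caller's free list regardless of where the block was originally allocated.

#include <stdlib.h>

#define NPROCS 16

/* One free list per "processor" (here: per thread); blocks are threaded
   through their first word. Size classes and locking are omitted. */
typedef struct block { struct block *next; } block_t;
static block_t *private_heap[NPROCS];

/* Hypothetical thread-to-heap mapping: assign heap indices round-robin. */
static _Thread_local int my_heap = -1;
static int next_heap = 0;
static int my_id(void) {
    if (my_heap < 0)
        my_heap = __atomic_fetch_add(&next_heap, 1, __ATOMIC_RELAXED) % NPROCS;
    return my_heap;
}

void *pp_malloc(size_t sz) {
    int id = my_id();
    block_t *b = private_heap[id];
    if (b) {                          /* reuse a block from my own heap */
        private_heap[id] = b->next;
        return b;
    }
    if (sz < sizeof(block_t)) sz = sizeof(block_t);
    return malloc(sz);                /* otherwise get fresh memory from the system */
}

void pp_free(void *p) {
    int id = my_id();                 /* freed memory lands on the *freeing* heap */
    block_t *b = p;
    b->next = private_heap[id];
    private_heap[id] = b;
}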

How to Break Pure Private Heaps: Fragmentation
Pure private heaps: memory consumption can grow without bound!
Producer-consumer:
– processor 1 allocates, processor 2 frees
– the freed memory accumulates on processor 2's heap and is always unavailable to the producer
Diagram: processor 1 executes x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 executes free(x1), free(x2), free(x3)
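A minimal producer/consumer pattern that triggers this behavior (an illustration; the single-slot handoff and all names are my own): the producer thread only calls malloc and the consumer thread only calls free, so under a pure-private-heaps allocator such as the pp_* sketch above, every freed block lands on the consumer's heap while the producer keeps requesting fresh memory from the system.

#include <pthread.h>
#include <stdlib.h>

#define N 1000000

/* Single-slot handoff protected by a mutex: enough to show the allocation
   pattern, not meant to be an efficient queue. */
static void *slot;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t slot_full = PTHREAD_COND_INITIALIZER, slot_empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {          /* runs on "processor 1": only mallocs */
    for (int i = 0; i < N; i++) {
        void *x = malloc(64);
        pthread_mutex_lock(&m);
        while (slot) pthread_cond_wait(&slot_empty, &m);
        slot = x;
        pthread_cond_signal(&slot_full);
        pthread_mutex_unlock(&m);
    }
    return arg;
}

static void *consumer(void *arg) {          /* runs on "processor 2": only frees */
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        while (!slot) pthread_cond_wait(&slot_full, &m);
        void *x = slot;
        slot = NULL;
        pthread_cond_signal(&slot_empty);
        pthread_mutex_unlock(&m);
        free(x);                            /* freed on the consumer's heap */
    }
    return arg;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}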

Example II: Private Heaps with Ownership
free puts memory back on the originating (owner) processor's heap
Avoids unbounded memory consumption
Examples: Ptmalloc, LKmalloc
Diagram: processor 1 allocates x1 and processor 2 allocates x2; when the blocks are freed, each one returns to the heap that originally allocated it
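A sketch of how ownership could be layered onto the previous illustration (again my own code, not Ptmalloc's or LKmalloc's): each block carries a small header recording its owner heap, and free pushes the block back onto that owner heap rather than onto the caller's heap. It assumes the same hypothetical my_id() mapping as the earlier sketch.

#include <stdlib.h>

#define NPROCS 16

/* Ownership sketch: every block carries a header naming the heap it came
   from, and free() returns the block to that owner heap. Locking and size
   classes are again omitted for brevity. */
typedef struct oblock {
    int owner;                 /* heap that allocated this block */
    struct oblock *next;
} oblock_t;

static oblock_t *owned_heap[NPROCS];

int my_id(void);               /* same hypothetical thread-to-heap mapping as before */

void *own_malloc(size_t sz) {
    int id = my_id();
    oblock_t *b = owned_heap[id];
    if (b) {
        owned_heap[id] = b->next;            /* reuse a block owned by my heap */
    } else {
        b = malloc(sizeof(oblock_t) + sz);   /* header + payload from the system */
        if (!b) return NULL;
        b->owner = id;
    }
    return b + 1;                            /* payload starts right after the header */
}

void own_free(void *p) {
    oblock_t *b = (oblock_t *)p - 1;
    int owner = b->owner;      /* return the block to its *owner*, not the caller */
    b->next = owned_heap[owner];
    owned_heap[owner] = b;
}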

How to Break Private Heaps with Ownership: Fragmentation
Memory consumption can blow up by a factor of P
Round-robin producer-consumer:
– processor i allocates, processor i+1 frees
The program only ever requires 1 block (K blocks in general), but the allocator ends up holding 3 blocks (P·K in general)
Diagram: processors 1, 2, 3 allocate x1, x2, x3 respectively; each block is freed by a different processor (free(x1), free(x2), free(x3)), so with ownership every freed block flows back to its owner's heap and each of the three heaps ends up holding an idle block

Existing approaches

Uniprocessor Allocators on Multiprocessors
Fragmentation: excellent
– Very low for most programs [Wilson & Johnstone]
Speed & scalability: poor
– Heap contention: a single lock protects the heap
– Can exacerbate false sharing: different processors can share cache lines

Existing Multiprocessor Allocators
Speed:
– One concurrent heap (e.g., a concurrent B-tree): O(log(#size-classes)) cost per memory operation, too many locks/atomic updates → fast allocators use multiple heaps
Scalability:
– Allocator-induced false sharing
– Other bottlenecks (e.g., the nextHeap global in Ptmalloc)
Fragmentation:
– P-fold increase or even unbounded

Hoard as the solution

Hoard Overview
P per-processor heaps plus 1 global heap
Each thread accesses only its local heap and the global heap
Manages memory in page-sized superblocks of same-sized objects (LIFO free list)
– Avoids false sharing by not carving up cache lines
– Avoids heap contention: local heaps allocate & free small blocks from their own superblocks
Avoids blowup by moving superblocks to the global heap when the fraction of free memory on a heap exceeds a threshold (data-structure sketch below)
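A rough C sketch of the data structures this slide describes (field and constant names are my guesses for illustration, not Hoard's actual source): per-processor heaps plus a global heap 0, each holding superblocks of a single size class with a LIFO free list and the usage statistics u and a used by the emptiness check.

#include <stddef.h>
#include <pthread.h>

#define SUPERBLOCK_SIZE 8192   /* S: superblock size used in the evaluation */
#define EMPTY_FRACTION  4      /* f = 1/4 */
#define K_SLACK         0      /* K */
#define MAX_HEAPS       64     /* arbitrary upper bound for this sketch */

typedef struct superblock {
    struct superblock *next;        /* superblocks of this size class, fullest first */
    struct block      *free_list;   /* LIFO free list of equal-sized blocks */
    size_t             block_size;  /* the single object size this superblock serves */
    size_t             used;        /* s.u: bytes in use within this superblock */
    int                owner;       /* index of the owning heap (0 = global heap) */
} superblock_t;

typedef struct heap {
    pthread_mutex_t    lock;        /* one lock per heap (initialize before use) */
    superblock_t      *bins[32];    /* per-size-class superblock lists (class count assumed) */
    size_t             u;           /* u_i: memory in use on this heap */
    size_t             a;           /* a_i: memory this heap holds (allocated from the OS) */
} heap_t;

/* heaps[0] is the global heap; heaps[1..P] are the per-processor heaps. */
static heap_t heaps[1 + MAX_HEAPS];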

Superblock management
Emptiness invariant maintained for each per-processor heap: (u_i ≥ (1−f)·a_i) ∨ (u_i ≥ a_i − K·S), with f = 1/4 and K = 0
When a free would violate the invariant, a mostly-empty superblock is moved to the global heap
Multiple heaps → avoid actively induced false sharing
Block coalescing → avoid passively induced false sharing
Superblocks that get transferred are usually nearly empty, and transfers are infrequent
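Continuing the sketch above (illustrative code, not Hoard's), this is the emptiness check the free path would perform. As a worked instance: with f = 1/4 and K = 0, a heap holding a_i = 64 KB is considered too empty once less than 48 KB of it is in use.

#include <stdbool.h>

/* True when heap i violates the emptiness invariant, i.e. when
   u_i < (1 - f) * a_i  AND  u_i < a_i - K * S
   (f = 1/4, K = 0, S = SUPERBLOCK_SIZE from the sketch above; assumes
   K * S <= a_i so the subtraction cannot underflow). In that case a
   mostly-empty superblock is moved to the global heap. */
static bool heap_too_empty(const heap_t *h) {
    size_t fraction_threshold = h->a - h->a / EMPTY_FRACTION;             /* (1 - 1/4) * a_i */
    size_t slack_threshold    = h->a - (size_t)K_SLACK * SUPERBLOCK_SIZE; /* a_i - K * S */
    return h->u < fraction_threshold && h->u < slack_threshold;
}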

Hoard pseudo-code

malloc(sz):
  if sz > S/2, allocate the superblock directly from the OS and return it
  i ← hash(current thread)
  lock heap i
  scan heap i's list of superblocks from most full to least full (for the size class of sz)
  if there is no superblock with free space:
    check heap 0 (the global heap) for a superblock
    if there is none:
      allocate S bytes as a new superblock s and set its owner to heap i
    else:
      transfer the superblock s to heap i
      u_0 ← u_0 − s.u;  u_i ← u_i + s.u
      a_0 ← a_0 − S;    a_i ← a_i + S
  u_i ← u_i + sz;  s.u ← s.u + sz
  unlock heap i
  return a block from the superblock

free(ptr):
  if the block is "large":
    free the superblock to the OS and return
  find the superblock s this block comes from
  lock s
  lock heap i, the superblock's owner
  deallocate the block from the superblock
  u_i ← u_i − block size;  s.u ← s.u − block size
  if i = 0, unlock heap i and superblock s and return
  if (u_i < a_i − K·S) and (u_i < (1−f)·a_i):
    transfer a mostly-empty superblock s1 to heap 0 (the global heap)
    u_0 ← u_0 + s1.u;  u_i ← u_i − s1.u
    a_0 ← a_0 + S;     a_i ← a_i − S
  unlock heap i and superblock s

Deriving bounds on blowup
blowup := O(A(t) / U(t))
A(t) = A'(t)
Per-heap invariant: (a_i(t) − K·S ≤ u_i(t)) ∨ ((1−f)·a_i(t) ≤ u_i(t))
Summing over heaps gives A(t) = O(U(t) + P)
Since P << U(t), blowup = O(1)
Worst-case consumption is a constant-factor overhead that does not grow with the amount of memory required by the program
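A small worked instance of the bound (numbers chosen purely for illustration): take f = 1/4 and K = 0, so each per-processor heap satisfies either u_i ≥ (3/4)·a_i or u_i ≥ a_i. In both cases a_i ≤ (4/3)·u_i, so summing over the heaps, A(t) is at most (4/3) times the memory in use plus a bounded number of superblocks of per-heap slack, i.e. A(t) = O(U(t) + P). Once the program's footprint U(t) dwarfs P superblocks, the ratio A(t)/U(t) stays below a small constant.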

Deriving bounds on contention (1)
Per-processor heap contention
– One thread allocates, multiple threads free: inherently unscalable
– Pairs of producer/consumer threads: malloc/free calls are serialized, at most a 2X slowdown (undesirable but scalable)
– Empirically, only a small fraction of memory is freed by a thread other than the one that allocated it → contention expected to be low

Deriving bounds on contention (2)
Global heap contention
– Measure the number of global-heap lock acquisitions as an upper bound
– Growing phase: each thread makes at most k/(f·S/s) acquisitions for k mallocs of objects of size s
– Shrinking phase: the pathological case is a program that frees (1−f) of each superblock and then frees every block in a superblock one at a time
– Empirically: no excessive shrinking and memory usage grows gradually → low overall contention

Experimental Evaluation
Dedicated 14-processor Sun Enterprise
– 400 MHz UltraSPARC
– 2 GB RAM, 4 MB L2 cache
– Solaris 7
– Superblock size S = 8K, f = 1/4
Comparison between
– Hoard
– Ptmalloc (GNU libc allocator: multiple heaps with ownership)
– Mtmalloc (Solaris multithreaded allocator)
– Solaris (default system allocator)

Benchmarks
(table of benchmark programs shown on the slide)

Speed
(graph shown on the slide)
Size classes need to be handled more cleverly

Scalability – threadtest
t threads allocate and deallocate 100,000/t 8-byte objects
Faster than Ptmalloc on 14 CPUs (percentage given on the slide's graph)

Scalability – Larson
"Bleeding" (memory allocated by one thread and freed by another) is typical in server applications
Mainly stays within the empty fraction during execution
18X faster than the next best allocator on 14 CPUs

Scalability – BEMengine
Drops below the empty fraction only a few times → low synchronization overhead

False sharing behavior
Active-false: each thread allocates a small object, writes it a few times, and frees it
Passive-false: allocate objects and hand them to threads that free them, then emulate active-false
These microbenchmarks illustrate the effects of contention in the cache-coherence mechanism
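A sketch of what the active-false microbenchmark might look like (my reconstruction from the slide's description, not the benchmark's actual source; thread count and constants are invented): each thread repeatedly allocates a small object, writes to it a few times, and frees it. An allocator that carves one cache line into objects for different threads will show heavy coherence traffic here.

#include <pthread.h>
#include <stdlib.h>

#define ROUNDS   1000000
#define WRITES   8          /* "writes it a few times" */
#define NTHREADS 4

static void *active_false(void *arg) {
    for (int r = 0; r < ROUNDS; r++) {
        volatile char *obj = malloc(8);        /* small object */
        for (int w = 0; w < WRITES; w++)
            obj[w % 8] = (char)w;              /* a few writes to the object */
        free((void *)obj);
    }
    return arg;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, active_false, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}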

Fragmentation results
A large number of size classes remain live for the duration of the program, scattered across blocks
Within 20% of Lea's allocator

Hoard Conclusions
Speed: excellent
– As fast as a uniprocessor allocator on one processor
– Amortized O(1) cost; 1 lock acquisition for malloc, 2 for free
Scalability: excellent
– Scales linearly with the number of processors
– Avoids false sharing
Fragmentation: very good
– Worst case is provably close to ideal
– Actual observed fragmentation is low

Discussion Points
If we had to re-evaluate Hoard today, which benchmarks would we use?
Are there any changes needed to make it work with languages like Java?