Parallel Memory Allocation Steven Saunders

Introduction
Fallacy: all dynamic memory allocators are either scalable, effective, or efficient...
Truth: no allocator (serial or parallel) handles all situations, allocation patterns, and memory hierarchies best
Research results are difficult to compare
– no standard benchmarks; simulators vs. real tests
Distributed memory, garbage collection, and locality are not considered in this talk

Definitions
Heap – a pool of memory available for allocation and deallocation of arbitrarily-sized blocks, in arbitrary order, with arbitrary lifetimes
Dynamic Memory Allocator – used to request memory blocks from the heap and return them to it
– aware only of the size of a memory block, not its type or value
– tracks which parts of the heap are in use and which are available for allocation

Design of an Allocator
Strategy – consider regularities in program behavior and memory requests to determine a set of acceptable policies
Policy – decide where to allocate a memory block within the heap
Mechanism – implement the policy with a set of data structures and algorithms
Emphasis has been on policies and mechanisms!

Strategy
Ideal Serial Strategy
– “put memory blocks where they won’t cause fragmentation later”
– serial program behavior: ramps, peaks, plateaus
Parallel Strategies
– “minimize unrelated objects on the same page”
– “bound memory blowup and minimize false sharing”
– parallel program behavior: SPMD, producer-consumer

Policy
Common Serial Policies
– best fit, first fit, worst fit, etc.
Common Techniques (see the sketch below)
– splitting: break large blocks into smaller pieces to satisfy smaller requests
– coalescing: merge adjacent free blocks to satisfy bigger requests
   immediate – upon deallocation
   deferred – wait until requests cannot be satisfied
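To make the splitting technique concrete, here is a minimal sketch (not from the talk) of first-fit allocation over a singly linked free list; block_t, free_list, and MIN_SPLIT are hypothetical names, and alignment, heap growth, and coalescing on free are all ignored.

```c
#include <stddef.h>

/* Hypothetical free-list node: size counts the usable bytes after the header. */
typedef struct block {
    size_t        size;
    struct block *next;   /* next free block */
} block_t;

static block_t *free_list;                 /* head of the free list */
#define MIN_SPLIT (sizeof(block_t) + 16)   /* smallest remainder worth keeping */

/* First fit: take the first free block big enough, splitting off the
 * remainder as a new free block when it is worth keeping. */
void *first_fit_alloc(size_t size) {
    block_t **prev = &free_list;
    for (block_t *b = free_list; b != NULL; prev = &b->next, b = b->next) {
        if (b->size < size)
            continue;
        if (b->size - size >= MIN_SPLIT) {        /* splitting */
            block_t *rest = (block_t *)((char *)(b + 1) + size);
            rest->size = b->size - size - sizeof(block_t);
            rest->next = b->next;
            *prev = rest;
            b->size = size;
        } else {
            *prev = b->next;                      /* use the whole block */
        }
        return b + 1;                             /* payload follows the header */
    }
    return NULL;                                  /* no fit: would grow the heap */
}
```

A matching free() would push the block back onto this list and, under an immediate-coalescing policy, merge it with any adjacent free neighbors; boundary tags (next slide) make that adjacency check O(1).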

Mechanism
Each block contains header information
– size, in-use flag, pointer to next free block, etc.
Free List – list of memory blocks available for allocation
– singly/doubly linked list: each free block points to the next free block
– boundary tag: size info at both ends of the block
   positive indicates free, negative indicates in use
– quick lists: multiple linked lists, where each list contains blocks of equal size
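As an illustration of the header and boundary-tag idea (a sketch under assumptions, not any particular allocator's layout), a block could carry a signed size at both ends, positive when free and negative when in use, as the slide describes; free_block_t and the helper names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical block layout with boundary tags: a signed size tag at both
 * ends of the block (positive = free, negative = in use). */
typedef struct free_block {
    long               tag;    /* header tag: +size when free, -size when in use */
    struct free_block *next;   /* valid only while the block is free */
    struct free_block *prev;
    /* ... payload ..., then a trailing long holding the same tag (the footer) */
} free_block_t;

/* The footer lives in the last sizeof(long) bytes of the block. */
static long *footer_of(free_block_t *b, size_t size) {
    return (long *)((char *)b + size - sizeof(long));
}

/* With boundary tags, free() can test whether the block that precedes b in
 * memory is free in O(1): just read the footer sitting immediately before b. */
static int prev_block_is_free(free_block_t *b) {
    long prev_tag = *((long *)b - 1);
    return prev_tag > 0;
}
```

Quick lists replace the single list with an array of lists indexed by size class, which is the mechanism most of the concurrent allocators below build on.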

Performance
Speed – comparable to serial allocators
Scalability – scale linearly with processors
False Sharing – avoid actively causing it
Fragmentation – keep it to a minimum

Performance (False Sharing)
Multiple processors inadvertently share data that happens to reside in the same cache line
– padding solves the problem, but greatly increases fragmentation
[Figure: processors 0 and 1 each cache the same line of memory, so a write by one invalidates the other’s copy]
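A minimal sketch of the padding fix mentioned above, assuming a 64-byte cache line and GCC/Clang alignment attributes; the counter structs are hypothetical examples of per-thread data.

```c
#define CACHE_LINE 64   /* assumed cache-line size */

/* Without padding, counters for different threads share cache lines, and
 * every increment ping-pongs the line between processors (false sharing). */
struct counter_packed { long value; };

/* Padding gives each counter its own line; the cost is exactly the extra
 * fragmentation the slide warns about: 64 bytes held per 8-byte counter. */
struct counter_padded {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
} __attribute__((aligned(CACHE_LINE)));

struct counter_padded counters[8];   /* one per thread */
```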

Performance (Fragmentation)
Inability to use available memory
– external: available blocks cannot satisfy requests (e.g., they are too small)
– internal: the block used to satisfy a request is larger than the requested size

Performance (Blowup)
Out-of-control fragmentation, unique to parallel allocators
– the allocator correctly reclaims freed memory but fails to reuse it for future requests
– available memory is not seen by the allocating processor
Example: each thread runs  x = malloc(s); free(x); ...
– p threads that serialize one after another require O(s) memory
– p threads that execute in parallel require O(ps) memory
– p threads that execute interleaved on 1 processor still require O(ps) memory
Example: blowup from exchanging memory (producer-consumer); see the sketch below
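The producer-consumer exchange behind this blowup might look like the following pthread sketch (not from the talk): one thread allocates, the other frees, so under a pure private-heaps allocator the freed blocks accumulate on the consumer's heap and the producer keeps growing its own. The one-slot mailbox and all names are hypothetical.

```c
#include <pthread.h>
#include <stdlib.h>

static void           *slot;                     /* one-slot mailbox */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  full  = PTHREAD_COND_INITIALIZER,
                       empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        void *x = malloc(64);                    /* allocated by the producer */
        pthread_mutex_lock(&lock);
        while (slot != NULL) pthread_cond_wait(&empty, &lock);
        slot = x;
        pthread_cond_signal(&full);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        while (slot == NULL) pthread_cond_wait(&full, &lock);
        void *x = slot;
        slot = NULL;
        pthread_cond_signal(&empty);
        pthread_mutex_unlock(&lock);
        free(x);   /* with pure private heaps this lands on the consumer's
                      heap; the producer never sees it again: blowup */
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```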

Parallel Allocator Taxonomy
Serial Single Heap – a global lock protects the heap
Concurrent Single Heap – multiple free lists or a concurrent free list
Multiple Heaps – processors allocate from any of several heaps
Private Heaps – processors allocate exclusively from a local heap
Global & Local Heaps – processors allocate from a local and a global heap

Serial Single Heap
Make an existing serial allocator thread-safe
– use a single global lock for every request
Performance
– high speed, assuming a fast lock
– limited scalability
– false sharing not considered
– fragmentation bounded by the serial policy
Typically production allocators
– IRIX, Solaris, Windows
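A minimal sketch of this scheme (not the vendors' code): wrap an existing serial allocator, represented here by the hypothetical serial_malloc/serial_free, in one global mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-ins for an existing serial allocator (hypothetical names). */
void *serial_malloc(size_t size);
void  serial_free(void *p);

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serial single heap: every request takes the same global lock, so the
 * allocator becomes thread-safe but all threads serialize on heap_lock. */
void *ts_malloc(size_t size) {
    pthread_mutex_lock(&heap_lock);
    void *p = serial_malloc(size);
    pthread_mutex_unlock(&heap_lock);
    return p;
}

void ts_free(void *p) {
    pthread_mutex_lock(&heap_lock);
    serial_free(p);
    pthread_mutex_unlock(&heap_lock);
}
```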

Concurrent Single Heap
Apply concurrency to existing serial allocators
– use the quick-list mechanism, with a lock per list
Performance
– moderate speed; could require many locks
– scalability limited by the number of requested sizes
– false sharing not considered
– fragmentation bounded by the serial policy
Typically research allocators
– Buddy System, MFLF (multiple free list fit), NUMAmalloc

Concurrent Single (Buddy System)
Policy/Mechanism
– one free list per memory block size: 1, 2, 4, …, 2^i
– blocks on list i are recursively split into 2 buddies on list i-1 in order to satisfy smaller requests
– only buddies are coalesced to satisfy larger requests
– each free list can be individually locked
– trades speed for reduced fragmentation
   if a free list is empty, a thread’s malloc enters a wait queue
   the malloc can be satisfied by another thread freeing memory or by breaking a higher list’s block into buddies (whichever finishes first)
   reducing the number of splits reduces fragmentation by leaving more large blocks for future requests
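A minimal sketch (not from the talk) of the classic buddy arithmetic: if a block on level k has size 2^k bytes, its buddy is found by flipping bit k of the block's offset from the heap base. heap_base and the helper names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical buddy-system helpers over a heap starting at heap_base.
 * A block on level k has size 1 << k bytes. */
static char *heap_base;

/* The buddy of a block is found by flipping bit k of its offset. */
static inline char *buddy_of(char *block, unsigned k) {
    uintptr_t off = (uintptr_t)(block - heap_base);
    return heap_base + (off ^ ((uintptr_t)1 << k));
}

/* Splitting: a level-k block yields two level-(k-1) buddies. */
static inline void split(char *block, unsigned k, char **left, char **right) {
    *left  = block;
    *right = block + ((uintptr_t)1 << (k - 1));
}
```

Coalescing is the inverse of split(): when a level-k block is freed and buddy_of(block, k) is also free, the two merge into one level-(k+1) block and the check repeats.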

Concurrent Single (Buddy System)
Performance
– moderate speed: locking/queueing is complicated, although the buddy split/coalesce code itself is fast
– scalability limited by the number of requested sizes
– false sharing very likely
– high internal fragmentation
[Example: Thread 1: x = malloc(8); y = malloc(8); free(x);  Thread 2: x = malloc(5)]

Concurrent Single (MFLF)
Policy/Mechanism
– a set of quick lists satisfies small requests exactly
   malloc takes the first block in the appropriate list
   free returns the block to the head of the appropriate list
– a set of misc lists satisfies large requests quickly
   each list is labeled with a range of block sizes, low…high
   malloc takes the first block from a list where request < low
   (trades a linear search for internal fragmentation)
   free returns the block to the list where low < request < high
– each list can be individually locked and searched
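A minimal sketch of the quick-list/misc-list split described above (not Iyengar's code); the size classes, thresholds, and names are all hypothetical.

```c
#include <stddef.h>

/* Hypothetical list layout: one quick list per exact size up to 256 bytes
 * (in 8-byte steps), plus a few misc lists covering ranges of larger sizes. */
#define QUICK_STEP 8
#define QUICK_MAX  256
#define NUM_QUICK  (QUICK_MAX / QUICK_STEP)

typedef struct blk { struct blk *next; size_t size; } blk_t;

static blk_t *quick_lists[NUM_QUICK];                 /* quick_lists[i]: size (i+1)*8 */
static struct { size_t low, high; blk_t *head; } misc_lists[] = {
    { 257, 1024, NULL }, { 1025, 8192, NULL }, { 8193, (size_t)-1, NULL },
};

/* Small request: exact-fit quick list. Large request: take a block from the
 * first misc list whose entire range exceeds the request, so the list never
 * has to be searched (internal fragmentation traded for speed). */
static blk_t **list_for_malloc(size_t size) {
    if (size == 0)
        size = QUICK_STEP;
    if (size <= QUICK_MAX)
        return &quick_lists[(size + QUICK_STEP - 1) / QUICK_STEP - 1];
    for (size_t i = 0; i < sizeof misc_lists / sizeof misc_lists[0]; i++)
        if (size < misc_lists[i].low)
            return &misc_lists[i].head;
    return NULL;                                      /* fall back to the OS */
}
```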

Concurrent Single (MFLF)
Performance
– high speed
– scalability limited by the number of requested sizes
– false sharing very likely
– high internal fragmentation
The approach is similar to current state-of-the-art serial allocators

Concurrent Single (NUMAmalloc)
Strategy – minimize co-location of unrelated objects on the same page
– avoids page-level false sharing (DSM/software DSM)
Policy – place same-sized requests in the same page (a heuristic hypothesis)
– basically MFLF at the page level
Performance
– high speed
– scalability limited by the number of requested sizes
– false sharing: helps at the page level but not the cache level
– high internal fragmentation

Multiple Heaps
A list of multiple heaps
– individually growable, shrinkable, and lockable
– threads scan the list looking for the first available heap (trylock)
– threads may cache the result to reduce the next lookup
Performance
– moderate speed, limited by list scans and locking
– scalability limited by the number of heaps and by traffic
– false sharing unintentionally reduced
– blowup increased (up to O(p))
Typically production allocators
– ptmalloc (Linux), HP-UX
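A minimal sketch of the trylock scan with a cached heap index (not ptmalloc's implementation); heap_t, heap_alloc_from, and NUM_HEAPS are hypothetical, and heaps_init must be called once before use.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_HEAPS 8

/* Hypothetical per-heap state: a lock plus whatever heap metadata exists. */
typedef struct {
    pthread_mutex_t lock;
    /* ... free lists, top chunk, etc. ... */
} heap_t;

static heap_t       heaps[NUM_HEAPS];
static __thread int last_heap;                   /* cached index of the last heap used */

void *heap_alloc_from(heap_t *h, size_t size);   /* per-heap serial allocator */

void heaps_init(void) {                          /* call once at startup */
    for (int i = 0; i < NUM_HEAPS; i++)
        pthread_mutex_init(&heaps[i].lock, NULL);
}

/* Scan the heap list with trylock, starting at the cached index; block on
 * the cached heap only if every heap is currently busy. */
void *multi_heap_malloc(size_t size) {
    for (int i = 0; i < NUM_HEAPS; i++) {
        int idx = (last_heap + i) % NUM_HEAPS;
        if (pthread_mutex_trylock(&heaps[idx].lock) == 0) {
            last_heap = idx;                     /* cache for the next lookup */
            void *p = heap_alloc_from(&heaps[idx], size);
            pthread_mutex_unlock(&heaps[idx].lock);
            return p;
        }
    }
    pthread_mutex_lock(&heaps[last_heap].lock);  /* all busy: wait on one */
    void *p = heap_alloc_from(&heaps[last_heap], size);
    pthread_mutex_unlock(&heaps[last_heap].lock);
    return p;
}
```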

Private Heaps
Processors exclusively use a local private heap for all allocation and deallocation
– eliminates the need for locks
Performance
– extremely high speed
– unbounded scalability
– reduced false sharing
   (can still arise when threads pass memory to another thread)
– blowup unbounded
Both research and production allocators
– CILK, STL
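A minimal sketch of a private heap for one fixed block size (not CILK's or STL's allocator), using a thread-local free list so the fast path needs no lock; all names are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical private heap: each thread keeps its own free list of
 * fixed-size nodes, so no locking is needed on the fast path. */
#define NODE_SIZE 64

typedef struct node { struct node *next; } node_t;

static __thread node_t *private_free_list;   /* one list per thread */

void *private_alloc(void) {
    node_t *n = private_free_list;
    if (n != NULL) {                 /* fast path: reuse a local free block */
        private_free_list = n->next;
        return n;
    }
    return malloc(NODE_SIZE);        /* slow path: grow the private heap */
}

/* The freeing thread keeps the block on *its own* list. If a thread frees
 * blocks it did not allocate, the allocating thread never gets them back:
 * this is exactly the unbounded blowup noted on the slide. */
void private_free(void *p) {
    node_t *n = p;
    n->next = private_free_list;
    private_free_list = n;
}
```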

Global & Local Heaps
Processors generally use a local heap
– removes most lock contention
– private memory is acquired from and returned to the global heap (which is always locked) as necessary
Performance
– high speed, less lock contention
– scalability limited by the number of locks
– low false sharing
– blowup bounded (O(1))
Typically research allocators
– VH, Hoard

Global & Local Heaps (VH)
Strategy – exchange memory overhead for improved scalability
Policy
– memory is broken into stacks of size m/2
– the global heap maintains a LIFO of stack pointers that local heaps can use
   a global push releases one of a local heap’s stacks
   a global pop acquires a stack for a local heap
– local heaps maintain an Active and a Backup stack
   local operations are private, i.e. they don’t require locks
Mechanism – a serial free list within each stack

Global & Local Heaps (VH)
Memory Usage = M + m·p
– M = amount of memory in use by the program
– p = number of processors
– m = private memory per processor (two stacks of size m/2)
   a higher m reduces the number of global heap lock operations
   a lower m reduces the memory usage overhead
[Figure: each of the p processors holds an Active (A) and a Backup (B) stack as its private heap; remaining stacks sit in the global heap]

Global & Local Heaps (Hoard)
Strategy – bound blowup and minimize false sharing
Policy
– memory is broken into superblocks of size S
   all blocks within a superblock are of equal size
– the global heap maintains a set of available superblocks
– local heaps maintain local superblocks
   malloc is satisfied by a local superblock
   free returns memory to the original allocating superblock (lock!)
   superblocks are acquired from the global heap as necessary
   if local usage drops below a threshold, superblocks are returned to the global heap
Mechanism – private superblock quick lists
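A highly simplified sketch of this local/global flow (not Hoard's actual code): malloc is served from the thread's superblocks, pulling a new superblock from the locked global heap when needed, while free returns the block to its owning superblock under that superblock's lock. All types and helpers are hypothetical, and the emptiness-threshold release back to the global heap is omitted.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical superblock: a chunk carved into equal-size blocks. */
typedef struct superblock {
    pthread_mutex_t    lock;
    size_t             block_size;
    void              *free_blocks;       /* intrusive free list */
    struct superblock *next;
    int                in_use;            /* blocks currently allocated */
} superblock_t;

typedef struct { superblock_t *superblocks; } local_heap_t;

static __thread local_heap_t local_heap;            /* this thread's heap */
static superblock_t         *global_heap;           /* available superblocks */
static pthread_mutex_t       global_lock = PTHREAD_MUTEX_INITIALIZER;

superblock_t *superblock_new(size_t block_size);    /* carve a fresh superblock */
void         *superblock_pop(superblock_t *sb);     /* take one block, or NULL */
superblock_t *superblock_of(void *p);                /* owning superblock of p */
void          superblock_push(superblock_t *sb, void *p);

void *hoardish_malloc(size_t size) {
    /* Fast path: a block from one of this thread's superblocks. */
    for (superblock_t *sb = local_heap.superblocks; sb; sb = sb->next) {
        pthread_mutex_lock(&sb->lock);
        void *p = superblock_pop(sb);
        if (p) sb->in_use++;
        pthread_mutex_unlock(&sb->lock);
        if (p) return p;
    }
    /* Slow path: acquire a superblock from the global heap (locked). */
    pthread_mutex_lock(&global_lock);
    superblock_t *sb = global_heap;
    if (sb) global_heap = sb->next;
    pthread_mutex_unlock(&global_lock);
    if (!sb) sb = superblock_new(size);
    sb->next = local_heap.superblocks;
    local_heap.superblocks = sb;
    pthread_mutex_lock(&sb->lock);
    void *p = superblock_pop(sb);
    if (p) sb->in_use++;
    pthread_mutex_unlock(&sb->lock);
    return p;
}

/* free goes back to the block's original superblock, under that
 * superblock's lock, which is what keeps false sharing low. */
void hoardish_free(void *p) {
    superblock_t *sb = superblock_of(p);
    pthread_mutex_lock(&sb->lock);
    superblock_push(sb, p);
    sb->in_use--;
    /* (Returning mostly-empty superblocks to the global heap is omitted.) */
    pthread_mutex_unlock(&sb->lock);
}
```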

Global & Local Heaps (Hoard)
Memory Usage = O(M + p)
– M = amount of memory in use by the program
– p = number of processors
False Sharing
– since malloc is satisfied by a local superblock and free returns memory to the original superblock, false sharing is greatly reduced
– worst case: a non-empty superblock is released to the global heap and another thread acquires it
   set the superblock size and emptiness threshold to minimize this

Global & Local Heaps (Compare)
VH proves a tighter memory bound than Hoard and only requires global locks
Hoard has a more flexible local mechanism and considers false sharing
They’ve never been compared!
– Hoard is production quality and would likely win

Summary
Memory allocation is still an open problem
– strategies addressing program behavior are still uncommon
Performance Tradeoffs
– speed, scalability, false sharing, fragmentation
Current Taxonomy
– serial single heap, concurrent single heap, multiple heaps, private heaps, global & local heaps

References
Serial Allocation
1) Paul Wilson, Mark Johnstone, Michael Neely, David Boles. Dynamic Storage Allocation: A Survey and Critical Review. International Workshop on Memory Management.
Shared Memory Multiprocessor Allocation
Concurrent Single Heap
2) Arun Iyengar. Scalability of Dynamic Storage Allocation Algorithms. Sixth Symposium on the Frontiers of Massively Parallel Computing. October 1996.
3) Theodore Johnson, Tim Davis. Space Efficient Parallel Buddy Memory Management. The Fourth International Conference on Computing and Information (ICCI'92). May 1992.
4) Jong Woo Lee, Yookun Cho. An Effective Shared Memory Allocator for Reducing False Sharing in NUMA Multiprocessors. IEEE Second International Conference on Algorithms & Architectures for Parallel Processing (ICAPP'96).
Multiple Heaps
5) Wolfram Gloger. Dynamic Memory Allocator Implementations in Linux System Libraries.
Global & Local Heaps
6) Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson. Hoard: A Scalable Memory Allocator for Multithreaded Applications. The Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IX). November 2000.
7) Voon-Yee Vee, Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared-Memory Multiprocessors. The International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN'99). June 1999.