Slide 1: Scalable Memory Management for Multithreaded Applications
Emery Berger, University of Massachusetts, Department of Computer Science
CMPSCI 691P, Fall 2002

Slide 2: High-Performance Applications
- Web servers, search engines, scientific codes
- Written in C or C++
- Run on one server box or a cluster of server boxes
- [Diagram: software stack, from the application down through compiler, runtime system, and operating system to hardware]
- Needs support at every level

Slide 3: New Applications, Old Memory Managers
- Applications and hardware have changed:
  - Multiprocessors are now commonplace
  - Programs are object-oriented and multithreaded
  - Increased pressure on the memory manager (malloc, free)
- But memory managers have not changed
- Result: inadequate support for modern applications

Slide 4: Current Memory Managers Limit Scalability
- As we add processors, the program slows down
- Caused by heap contention
- [Graph: Larson server benchmark on a 14-processor Sun]
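To make the contention concrete, here is a minimal stress test in the spirit of such benchmarks (a sketch with assumed thread and iteration counts, not the actual Larson benchmark): every thread hammers malloc and free, and with a serial, lock-protected heap the threads simply queue up on the heap lock.

```c
/* Minimal malloc/free stress test (sketch; not the Larson
 * benchmark itself). Each thread repeatedly allocates and frees
 * small objects; with a single lock-protected heap, throughput
 * degrades as THREADS grows because every call contends for the
 * same heap lock. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define THREADS 8
#define ITERS   1000000

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        void *p = malloc(64);   /* every call may hit the heap lock */
        free(p);
    }
    return NULL;
}

int main(void) {
    pthread_t t[THREADS];
    for (int i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    puts("done");
    return 0;
}
```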

Slide 5: The Problem
- Current memory managers are inadequate for high-performance applications on modern architectures
- They limit scalability, application performance, and robustness

Slide 6: Overview
- Problems with current memory managers:
  - Contention
  - False sharing
  - Space
- Solution: a provably scalable memory manager, Hoard

Slide 7: Problems with General-Purpose Memory Managers
- Previous work for multiprocessors:
  - Concurrent single heap [Bigler et al. 85, Johnson 91, Iyengar 92]: impractical
  - Multiple heaps [Larson 98, Gloger 99]: reduce contention but, as we show, cause other problems:
    - P-fold or even unbounded increase in space
    - Allocator-induced false sharing

Slide 8: Multiple Heap Allocator: Pure Private Heaps
- One heap per processor:
  - malloc gets memory from its local heap
  - free puts memory on its local heap
- Used by STL, Cilk, and ad hoc allocators
- [Diagram: interleaved malloc/free trace on processors 0 and 1; key: in use on processor 0 vs. free on heap 1]
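A minimal sketch of the pure-private-heaps scheme (my_malloc, my_free, and the single free list are illustrative names; real allocators use size classes and treat large objects separately): allocation and deallocation touch only thread-local state, so no lock is needed, but a block freed by thread 1 lands on thread 1's list even if thread 0 allocated it.

```c
/* Sketch of a pure private heap: one free list per thread, no
 * locks. Assumes a single size class for simplicity; __thread is
 * a GCC/Clang extension. free() always pushes onto the CALLER's
 * list, which is the root of the blowup on the next slides. */
#include <stdlib.h>

typedef struct node { struct node *next; } node_t;

static __thread node_t *my_heap = NULL;  /* per-thread free list */

void *my_malloc(size_t sz) {
    if (my_heap != NULL) {               /* reuse a locally freed block */
        node_t *n = my_heap;
        my_heap = n->next;
        return n;
    }
    return malloc(sz < sizeof(node_t) ? sizeof(node_t) : sz);
}

void my_free(void *p) {
    node_t *n = (node_t *)p;
    n->next = my_heap;                   /* goes to whoever calls free */
    my_heap = n;
}
```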

Slide 9: Problem: Unbounded Memory Consumption
- Producer-consumer pattern:
  - Processor 0 allocates
  - Processor 1 frees
- Freed memory piles up on processor 1's heap and is never reused by processor 0
- Result: unbounded memory consumption, and eventually a crash
- [Diagram: processor 0 repeatedly calls malloc(1); processor 1 repeatedly frees the results]
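Under the sketch above, this pattern leaks without bound, since the producer never sees the blocks the consumer frees. A hedged illustration, reusing the hypothetical my_malloc/my_free from the previous sketch and a deliberately toy one-slot hand-off:

```c
/* Producer-consumer over the per-thread heaps sketched above
 * (toy synchronization; illustrative only). Every block the
 * producer allocates is freed onto the CONSUMER's private list,
 * so the producer's my_malloc never finds reusable memory and
 * keeps requesting fresh memory: footprint grows without bound. */
#include <pthread.h>
#include <stddef.h>

extern void *my_malloc(size_t sz);   /* from the previous sketch */
extern void  my_free(void *p);

static void *slot;                   /* one-item hand-off buffer */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  full  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  empty = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {   /* runs on processor 0 */
    (void)arg;
    for (;;) {
        void *p = my_malloc(64);     /* always fresh memory... */
        pthread_mutex_lock(&m);
        while (slot != NULL) pthread_cond_wait(&empty, &m);
        slot = p;
        pthread_cond_signal(&full);
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *consumer(void *arg) {   /* runs on processor 1 */
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&m);
        while (slot == NULL) pthread_cond_wait(&full, &m);
        void *p = slot;
        slot = NULL;
        pthread_cond_signal(&empty);
        pthread_mutex_unlock(&m);
        my_free(p);                  /* ...piles up on MY private list */
    }
    return NULL;
}
```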

Slide 10: Multiple Heap Allocator: Private Heaps with Ownership
- free returns memory to the heap it was originally allocated from
- Bounded memory consumption for producer-consumer: no crash!
- Examples: "Ptmalloc" (Linux), LKmalloc
- [Diagram: processor 0 allocates x1 and x2; frees from either processor return them to heap 0]
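One way ownership can be sketched (the header layout and names are assumptions for illustration; ptmalloc's actual arena design differs): stamp each block with the heap it came from, and have free push the block back to that heap under the owner's lock.

```c
/* Sketch of private heaps with ownership. Each block carries a
 * header recording its owner heap; free() returns the block to
 * that heap no matter which thread calls it. Assumes a single
 * size class; my_heap setup at thread start is not shown. */
#include <pthread.h>
#include <stdlib.h>

typedef struct heap {
    pthread_mutex_t lock;
    struct block *freelist;
} heap_t;

typedef struct block {
    heap_t *owner;                 /* heap this block belongs to */
    struct block *next;
} block_t;

static __thread heap_t *my_heap;   /* initialized per thread (not shown) */

void *owned_malloc(size_t sz) {
    heap_t *h = my_heap;
    pthread_mutex_lock(&h->lock);
    block_t *b = h->freelist;
    if (b) h->freelist = b->next;
    pthread_mutex_unlock(&h->lock);
    if (!b) {
        b = malloc(sizeof(block_t) + sz);
        if (!b) return NULL;
    }
    b->owner = h;
    return b + 1;                  /* user memory follows the header */
}

void owned_free(void *p) {
    block_t *b = (block_t *)p - 1;
    heap_t *h = b->owner;          /* go home, not to the caller's heap */
    pthread_mutex_lock(&h->lock);
    b->next = h->freelist;
    h->freelist = b;
    pthread_mutex_unlock(&h->lock);
}
```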

Slide 11: Problem: P-fold Memory Blowup
- Occurs in practice
- Round-robin producer-consumer:
  - Processor i mod P allocates
  - Processor (i+1) mod P frees
- With P = 3: footprint = 1 working set (2 GB), but space = 3 working sets (6 GB)
- Exceeds the 32-bit address space: crash!
- [Diagram: processors 0, 1, and 2 each allocate a block that a neighbor then frees]
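A hedged sketch of the round-robin pattern (scaled-down working set; barriers stand in for whatever scheduling the real workload has): under private heaps with ownership, the memory freed in round i returns to round i's producer heap, so after P rounds every one of the P heaps caches a full working set.

```c
/* Round-robin producer-consumer driving P-fold blowup under an
 * ownership allocator. In round i, thread (i % P) allocates the
 * working set and thread ((i+1) % P) frees it; freed memory goes
 * back to the PRODUCER's heap, so after P rounds all P heaps
 * each hold one working set: P times the live footprint.
 * NOBJS*64 bytes is a scaled-down stand-in for the slide's 2 GB. */
#include <pthread.h>
#include <stdlib.h>

#define P     3
#define NOBJS (1 << 20)

static void *objs[NOBJS];
static pthread_barrier_t bar;

static void *worker(void *arg) {
    long id = (long)arg;
    for (long round = 0; round < P; round++) {
        if (id == round % P)            /* this round's producer */
            for (long i = 0; i < NOBJS; i++)
                objs[i] = malloc(64);
        pthread_barrier_wait(&bar);
        if (id == (round + 1) % P)      /* this round's consumer */
            for (long i = 0; i < NOBJS; i++)
                free(objs[i]);          /* returns to producer's heap */
        pthread_barrier_wait(&bar);
    }
    return NULL;
}

int main(void) {
    pthread_t t[P];
    pthread_barrier_init(&bar, NULL, P);
    for (long i = 0; i < P; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (long i = 0; i < P; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```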

Slide 12: Problem: Allocator-Induced False Sharing
- False sharing: non-shared objects on the same cache line
- The bane of parallel applications; extensively studied
- All of these allocators cause false sharing!
- [Diagram: x1 = malloc(1) on processor 0 and x2 = malloc(1) on processor 1 land on the same cache line, which thrashes between the two caches]
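A minimal way to provoke the effect (illustrative sizes; cache lines are typically 64 bytes, and whether two small mallocs actually share a line depends on the allocator): two threads increment counters that likely sit on one cache line, so each write invalidates the other core's copy even though no data is logically shared.

```c
/* False-sharing demo (sketch). Two adjacent 8-byte allocations
 * likely share a 64-byte cache line; each thread's writes then
 * invalidate the other core's copy of that line. Padding each
 * object to a full line (e.g., aligned_alloc(64, 64)) removes
 * the thrashing. Compile with -pthread. */
#include <pthread.h>
#include <stdlib.h>

#define ITERS 100000000L

static void *bump(void *arg) {
    volatile long *ctr = arg;
    for (long i = 0; i < ITERS; i++)
        (*ctr)++;                 /* each write invalidates the line */
    return NULL;
}

int main(void) {
    /* Back-to-back allocations: likely the same cache line. */
    long *c0 = malloc(sizeof *c0);
    long *c1 = malloc(sizeof *c1);
    *c0 = *c1 = 0;
    pthread_t t0, t1;
    pthread_create(&t0, NULL, bump, c0);
    pthread_create(&t1, NULL, bump, c1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```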

Slide 13: So What Do We Do Now?
- Where do we put free memory?
  - On a central heap: heap contention
  - On our own heap (pure private heaps): unbounded memory consumption
  - On the original heap (private heaps with ownership): P-fold blowup
- And how do we avoid false sharing?

Slide 14: Overview (revisited)
- Problems with current memory managers: contention, false sharing, space
- Solution: a provably scalable memory manager, Hoard

Slide 15: Hoard: Key Insights
- Bound local memory consumption:
  - Explicitly track utilization
  - Move free memory to a global heap
  - Provably bounds memory consumption
- Manage memory in large chunks:
  - Avoids false sharing
  - Reduces heap contention

Slide 16: Overview of Hoard
- Manage memory in page-sized heap blocks
  - Avoids false sharing
- Allocate from the local heap block
  - Avoids heap contention
- When utilization is low, move the heap block to the global heap
  - Avoids space blowup
- [Diagram: per-processor heaps 0 through P-1 above a shared global heap]
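The migration rule can be sketched roughly as follows (heap_block_t, EMPTY_FRACTION, and the two helper functions are assumptions for illustration; the published Hoard algorithm manages multi-page superblocks under a more precise emptiness invariant): after each free, if a processor heap's overall utilization drops too low, a mostly empty block is handed to the global heap, where other processors can claim it.

```c
/* Sketch of Hoard's emptiness check. Each per-processor heap
 * tracks live bytes vs. bytes held; when utilization falls below
 * 1/EMPTY_FRACTION, a mostly-empty block migrates to the global
 * heap instead of lingering locally. Names and the threshold
 * value are illustrative, not the paper's exact invariant. */
#include <stddef.h>

typedef struct heap_block {
    struct heap_block *next;
    size_t in_use;                 /* live bytes in this block */
    size_t capacity;               /* total bytes in this block */
} heap_block_t;

typedef struct local_heap {
    heap_block_t *blocks;
    size_t in_use, held;           /* totals across all blocks */
} local_heap_t;

#define EMPTY_FRACTION 4           /* keep utilization above 1/4 (assumed) */

extern void global_heap_put(heap_block_t *b);          /* hypothetical */
extern heap_block_t *detach_emptiest(local_heap_t *h); /* hypothetical */

void after_free(local_heap_t *h, size_t freed_bytes) {
    h->in_use -= freed_bytes;
    /* Emptiness invariant: if less than 1/EMPTY_FRACTION of held
     * memory is live, release a mostly-empty block. */
    if (h->in_use * EMPTY_FRACTION < h->held) {
        heap_block_t *b = detach_emptiest(h);
        if (b) {
            h->held -= b->capacity;
            global_heap_put(b);    /* other processors can reuse it */
        }
    }
}
```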

Slide 17: Summary of Analytical Results
- Space consumption: near-optimal worst case
  - Hoard: O(n log(M/m) + P), where P ≪ n
  - Optimal: O(n log(M/m)) [Robson 70] (roughly a bin-packing argument)
  - Private heaps with ownership: O(P n log(M/m))
- Provably low synchronization
- Notation: n = memory required, M = largest object size, m = smallest object size, P = number of processors
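For reference, the bounds above can be typeset side by side (n = live memory required, object sizes between m and M, P processors, P ≪ n):

```latex
% Worst-case space bounds from this slide.
\begin{align*}
\text{Optimal [Robson 70]:}          &\quad O\!\bigl(n \log (M/m)\bigr)\\
\text{Hoard:}                        &\quad O\!\bigl(n \log (M/m) + P\bigr)\\
\text{Private heaps with ownership:} &\quad O\!\bigl(P \cdot n \log (M/m)\bigr)
\end{align*}
```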

Slide 18: Empirical Results
- Measured runtime on a 14-processor Sun
- Allocators compared:
  - Solaris (the system allocator)
  - Ptmalloc (GNU libc)
  - mtmalloc (Sun's "MT-hot" allocator)
- Micro-benchmarks:
  - Threadtest: no sharing
  - Larson: sharing (server-style)
  - Cache-scratch: mostly reads and writes (tests for false sharing)
- Real-application experience is similar

Slide 19: Runtime Performance: threadtest
- Many threads, no sharing
- Hoard achieves linear speedup
- speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(x on P processors)

Slide 20: Runtime Performance: Larson
- Many threads, sharing (server-style)
- Hoard achieves linear speedup

Slide 21: Runtime Performance: false sharing (cache-scratch)
- Many threads, mostly reads and writes of heap data
- Hoard achieves linear speedup

Slide 22: Hoard in the "Real World"
- Open-source code; 13,000 downloads
- Runs on Solaris, Linux, Windows, IRIX, ...
- Widely used in industry: AOL, British Telecom, Novell, Philips
- Reported 2x-10x ("impressive") performance improvements
- Deployed in: a search server, telecom billing systems, scene rendering, real-time messaging middleware, a text-to-speech engine, telephony, a JVM
- Conclusion: a scalable general-purpose memory manager