Faster parallel programs with improved FastMM

Slides:



Advertisements
Similar presentations
Part IV: Memory Management
Advertisements

The Linux Kernel: Memory Management
Parallel Programming with OmniThreadLibrary
OS/2 Warp Chris Ashworth Cameron Davis John Weatherley.
File Systems.
Log-Structured Memory for DRAM-Based Storage Stephen Rumble, Ankita Kejriwal, and John Ousterhout Stanford University.
CS-3013 & CS-502, Summer 2006 Memory Management1 CS-3013 & CS-502 Summer 2006.
Chapter 12: File System Implementation
1 CSE 303 Lecture 11 Heap memory allocation ( malloc, free ) reading: Programming in C Ch. 11, 17 slides created by Marty Stepp
A. Frank - P. Weisberg Operating Systems Introduction to Tasks/Threads.
Pleasures and Pitfalls of Profiling Primož Gabrijelčič.
© 2011 IBM Corporation 11 April 2011 IDS Architecture.
Windows 2000 Memory Management Computing Department, Lancaster University, UK.
DATA STRUCTURES OPTIMISATION FOR MANY-CORE SYSTEMS Matthew Freeman | Supervisor: Maciej Golebiewski CSIRO Vacation Scholar Program
Building Multithreaded Solutions with OmniThreadLibrary
Palm OS Jeremy Etzkorn Paul Rutschky Adam Lee Amit Bhatia Tony Picarazzi.
Chapter 3.5 Memory and I/O Systems. 2 Memory Management Memory problems are one of the leading causes of bugs in programs (60-80%) MUCH worse in languages.
Processes and Threads CS550 Operating Systems. Processes and Threads These exist only at execution time They have fast state changes -> in memory and.
File System Implementation Chapter 12. File system Organization Application programs Application programs Logical file system Logical file system manages.
SIMPLE PARALLEL PROGRAMMING WITH PATTERNS AND OMNITHREADLIBRARY PRIMOŽ GABRIJELČIČ SKYPE: GABR42
FastMM in Depth Primož Gabrijelčič,
2003 Dominic Swayne1 Microsoft Disk Operating System and PC DOS CS-550-1: Operating Systems Fall 2003 Dominic Swayne.
1 Memory Management Basics. 2 Program P Basic Memory Management Concepts Address spaces Physical address space — The address space supported by the hardware.
1 Advanced Memory Management Techniques  static vs. dynamic kernel memory allocation  resource map allocation  power-of-two free list allocation  buddy.
Computer Programming 2 Why do we study Java….. Java is Simple It has none of the following: operator overloading, header files, pre- processor, pointer.
Computer Systems Week 14: Memory Management Amanda Oddie.
CS 241 Section Week #9 (11/05/09). Topics MP6 Overview Memory Management Virtual Memory Page Tables.
Upcoming Presentations ILM Professional Service – Proprietary and Confidential ( DateTimeTopicPresenter March PM Distributed.
Review °Apply Principle of Locality Recursively °Manage memory to disk? Treat as cache Included protection as bonus, now critical Use Page Table of mappings.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Memory Management Overview.
Thread basics. A computer process Every time a program is executed a process is created It is managed via a data structure that keeps all things memory.
Preface 1Performance Tuning Methodology: A Review Course Structure 1-2 Lesson Objective 1-3 Concepts 1-4 Determining the Worst Bottleneck 1-5 Understanding.
® July 21, 2004GC Summer School1 Cycles to Recycle: Copy GC Without Stopping the World The Sapphire Collector Richard L. Hudson J. Eliot B. Moss Originally.
By Anand George SourceLens.org Copyright. All rights reserved. Content Owner - Meera R (meera at sourcelens.org)
Where Testing Fails …. Problem Areas Stack Overflow Race Conditions Deadlock Timing Reentrancy.
CMSC 421 Spring 2004 Section 0202 Part II: Process Management Chapter 5 Threads.
Jonas Johansson Summarizing presentation of Scheduler Activations – A different approach to parallelism.
Log-Structured Memory for DRAM-Based Storage Stephen Rumble and John Ousterhout Stanford University.
6 July 2015 Charles Reiss
Today topics: File System Implementation
EMERALDS Landon Cox March 22, 2017.
CSC 322 Operating Systems Concepts Lecture - 12: by
Chapter 3 Threads and Multithreading
Filesystems.
Main Memory Management
How I Learned to Love FastMM Internals
Memory Management-I 1.
Writing High Performance Delphi Application
Operating Systems.
Outline Module 1 and 2 dealt with processes, scheduling and synchronization Next two modules will deal with memory and storage Processes require data to.
Memory Management Overview
Virtual Memory Hardware
Writing High Performance Delphi Applications
Lecture Topics: 11/1 General Operating System Concepts Processes
Optimizing Dynamic Memory Management
Fast Communication and User Level Parallelism
CPU scheduling decisions may take place when a process:
CSE451 - Section 10.
CSCI1600: Embedded and Real Time Software
Programming with Shared Memory
Programming with Shared Memory
Lecture 7: Flexible Address Translation
Last section! Project 4 + EC due tomorrow Today: Project 4 questions
CSCI1600: Embedded and Real Time Software
CSE451 - Section 10.
Buddy Allocation CS 161: Lecture 5 2/11/19.
3.8 static vs dynamic thread management
Chapter 8 & 9 Main Memory and Virtual Memory
Threads CSE 2431: Introduction to Operating Systems
CS Introduction to Operating Systems
Presentation transcript:

Faster parallel programs with improved FastMM Primož Gabrijelčič @thedelphigeek http://primoz.gabrijelcic.org

Slides and code are available at Info Slides and code are available at http://thedelphigeek.com/p/presentations.html

HISTORY Developed by Pierre LeRiche for the FastCode project https://en.wikipedia.org/wiki/FastCode Version 4, hence FastMM4 Included in RAD Studio since version 2006 http://www.tindex.net/Language/FastMMmemorymanager.html Much improved since Don’t use default FastMM, download the fresh one https://github.com/pleriche/FastMM4

FEATURES Fast Fragmentation resistant Access to > 2GB Simple memory sharing Memory leak reporting Catches some memory-related bugs

PROBLEMS Can be slow in a multithreaded environment

INTERNALS

TOP VIEW Three memory managers in one Small blocks (< 2,5 KB) Most frequently used (99%) Medium blocks, subdivided into small blocks Medium blocks (2,5 – 260 KB) Allocated in chunks (1,25 MB) and subdivided into lists Large blocks (> 260 KB) Allocated directly by the OS

DETAILS One large block allocator One medium block allocator Multiple (54+2) small block allocators SmallBlockTypes Custom, optimized Move routines (FastCode) Each allocator has its own lock If SmallAllocator is locked, SmallAllocator+1 or SmallAllocator+2 is used

Reasons for slowdown Threads are fighting for allocators Solution – Change the program Hard to find out the problematic code

Demo Steve Maughan http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/ Redesigned to use OmniThreadLibrary

Diagnosing FastMM Bottlenecks

FastMM4 Locking if IsMultiThread then begin while LockCmpxchg(0, 1, @MediumBlocksLocked) <> 0 do begin {$ifdef NeverSleepOnThreadContention} {$ifdef UseSwitchToThread} SwitchToThread; //any thread on the same processor {$endif} {$else} Sleep(InitialSleepTime);//0;any thread that is ready to run if LockCmpxchg(0, 1, @MediumBlocksLocked) = 0 then Break; Sleep(AdditionalSleepTime); //1; wait end;

Lock Contention Logging LockMediumBlocks({$ifdef LogLockContention}LDidSleep{$endif}); ACollector := nil; {$ifdef LogLockContention} if LDidSleep then ACollector := @MediumBlockCollector; {$endif} if Assigned(ACollector) then begin GetStackTrace(@LStackTrace, StackTraceDepth, 1); ACollector.Add(@LStackTrace[0], StackTraceDepth); end;

FastMM4DataCollector Opaque data Completely static Can’t use MM inside MM Agreed max data size Most Frequently Used Generational Reduce the problem of local maxima Two generations, sorted Easy to expand to more generations

Output Results for all allocators are merged Top 10 call stacks are written to <programname>_MemoryManager_EventLog.txt

Findings

IT IS HARD TO RELEASE MEMORY GetMem does not represent a problem It can (with small blocks) upgrade to unused allocator One thread doesn‘t block another Time is mostly wasted in FreeMem FreeMem must use allocator that produced the memory One thread blocks another

Solution

Partial solution If allocator is locked, delay the FreeMem Memory block is pushed on a ‘to be released’ stack Each allocator gets its own “release stack” while LockCmpxchg(0, 1, @LPSmallBlockType.BlockTypeLocked) <> 0 do begin {$ifdef UseReleaseStack} LPReleaseStack := @LPSmallBlockType.ReleaseStack; if (not LPReleaseStack^.IsFull) and LPReleaseStack^.Push(APointer) then begin Result := 0; Exit; end; {$endif} When allocator is successfully locked, all memory from its release stack is released.

FASTMM4LOCKFREESTACK Very fast lock-free stack implementation Taken from OmniThreadLibrary Windows only Dynamic memory Allocated at startup Uses HeapAlloc for memory allocation

Problems Release stacks work, but not perfectly FreeMem can still block if multiple threads are releasing similarly sized memory blocks. Solution: Hash all threads into a pool of release stacks. 2. Somebody has to clean after terminated threads. Solution: Low-priority memory release thread. Currently only for medium/large blocks. CreateCleanupThread/DestroyCleanupThread

Full solution? while LockCmpxchg(0, 1, @LPSmallBlockType.BlockTypeLocked) <> 0 do begin {$ifdef UseReleaseStack} LPReleaseStack := @LPSmallBlockType.ReleaseStack[GetStackSlot]; if (not LPReleaseStack^.IsFull) and LPReleaseStack^.Push(APointer) then begin Result := 0; Exit; end; {$endif} GetStackSlot hashes thread ID into [0..NumStacksPerBlock-1] range

Bunch of release stacks! 56 + 1 + 1 allocators, each with 64 release stacks Each release stack is very small 36 static bytes 88 dynamic bytes (16 pointers per stack) In 32-bit world 58 * 64 * (36 + 88) = 460 KB

IMPROVE YOUR CODE

DEPLOYMENT Main FastMM repository https://github.com/pleriche/FastMM4 Define LogLockContention or Define UseReleaseStack Rebuild

Slides and code are available at Q & A Slides and code are available at http://thedelphigeek.com/p/presentations.html