How I Learned to Love FastMM Internals


How I Learned to Love FastMM Internals
Primož Gabrijelčič
http://primoz.gabrijelcic.org

History

Developed by Pierre le Riche for the FastCode project
  https://en.wikipedia.org/wiki/FastCode
Version 4, hence FastMM4
Included in RAD Studio since version 2006
  http://www.tindex.net/Language/FastMMmemorymanager.html
Much improved since then
Don't use the default FastMM, download the fresh one
  https://github.com/pleriche/FastMM4

Features
Fast
Fragmentation resistant
Access to > 2 GB
Simple memory sharing
Memory leak reporting
Catches some memory-related bugs

Problems
Can get slow in a multithreaded environment
Can get VERY slow in a multithreaded environment

FastMM4 Internals

Top View
Three memory managers in one:
Small blocks (< 2.5 KB)
  Most frequently used (99%)
  Allocated from medium blocks, which are subdivided into small blocks
Medium blocks (2.5–260 KB)
  Allocated in chunks (1.25 MB) and subdivided into lists
Large blocks (> 260 KB)
  Allocated directly by the OS
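
The three-tier split above is driven purely by the requested block size. A minimal C++ sketch of that dispatch (thresholds taken from the slide; the enum and function names are invented for illustration, this is not FastMM's actual code):

```cpp
#include <cstddef>

// Hypothetical sketch of FastMM4's three-tier dispatch, using the
// size thresholds from the slide (~2.5 KB and ~260 KB).
enum class Tier { Small, Medium, Large };

Tier tierFor(std::size_t bytes) {
    if (bytes < 2560)        return Tier::Small;   // < 2.5 KB: small-block allocator
    if (bytes <= 260 * 1024) return Tier::Medium;  // up to 260 KB: medium-block allocator
    return Tier::Large;                            // > 260 KB: handed to the OS
}
```

The important consequence is that the small-block path, which handles ~99% of allocations, must be the fastest one.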

Details
One large block allocator
One medium block allocator
Multiple (54+2) small block allocators (SmallBlockTypes)
Custom, optimized Move routines (FastCode)
Each allocator has its own lock
If SmallAllocator is locked, SmallAllocator+1 or SmallAllocator+2 is used
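
The "try the next allocator" fallback can be sketched like this in C++ (names and the lock type are invented; FastMM uses a LockCmpxchg spinlock, approximated here with std::atomic_flag):

```cpp
#include <atomic>

// Sketch of the small-block allocator fallback: if the matching
// allocator is locked, try the next two (slightly larger) block sizes
// instead of waiting. Not FastMM's actual code.
constexpr int NumSmallAllocators = 56;  // 54 + 2 from the slide

struct SmallAllocator {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
};

SmallAllocator allocators[NumSmallAllocators];

// Returns the index of the allocator we managed to lock, or -1 if all
// three candidates are busy (the caller would then spin on idx).
int acquireAllocator(int idx) {
    for (int i = idx; i < idx + 3 && i < NumSmallAllocators; ++i)
        if (!allocators[i].locked.test_and_set())  // test_and_set returns the old value
            return i;
    return -1;
}
```

Taking a slightly larger block wastes a little memory but avoids blocking the allocating thread, which is why GetMem contends far less than FreeMem (see Findings below).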

Problem
Why are multithreaded programs slow? Threads are fighting for allocators.
It is easy to change the program to bypass the problem. Well, sometimes.
It is hard to find the code responsible.

Demo
Steve Maughan: http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/
http://www.thedelphigeek.com/2016/02/finding-memory-allocation-bottlenecks.html

Diagnosing FastMM4 Bottlenecks
$DEFINE LogLockContention

FastMM4 Locking

if IsMultiThread then begin
  while LockCmpxchg(0, 1, @MediumBlocksLocked) <> 0 do begin
{$ifdef NeverSleepOnThreadContention}
  {$ifdef UseSwitchToThread}
    SwitchToThread; // any thread on the same processor
  {$endif}
{$else}
    Sleep(InitialSleepTime); // 0; any thread that is ready to run
    if LockCmpxchg(0, 1, @MediumBlocksLocked) = 0 then
      Break;
    Sleep(AdditionalSleepTime); // 1; wait
{$endif}
  end;
end;

Lock Contention Logging

LockMediumBlocks({$ifdef LogLockContention}LDidSleep{$endif});
{$ifdef LogLockContention}
if LDidSleep then
  ACollector := @MediumBlockCollector;
{$endif}

if Assigned(ACollector) then begin
  GetStackTrace(@LStackTrace, StackTraceDepth, 1);
  ACollector.Add(@LStackTrace[0], StackTraceDepth);
end;

FastMM4DataCollector
Opaque data
Completely static
  Can't use the MM inside the MM
  Agreed max data size
Most Frequently Used
Generational
  Reduces the problem of local maxima
  Two generations, sorted
  1024 slots in Gen1
  256 slots in Gen2
  Easy to expand to more generations
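
The generational most-frequently-used idea can be sketched in C++ as follows (a toy version, not the FastMM4DataCollector code; sizes, names and the promotion policy are simplified assumptions). All storage is fixed-size arrays, since a memory manager cannot allocate memory for its own bookkeeping:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Toy two-generation "most frequently used" collector. Gen1 gathers raw
// samples; when it fills up, only the highest counts are promoted into
// Gen2, so rare one-off entries (local maxima) are discarded.
struct Entry { std::uint64_t key = 0; unsigned count = 0; };

template <std::size_t N1, std::size_t N2>
struct MfuCollector {
    std::array<Entry, N1> gen1{};
    std::array<Entry, N2> gen2{};
    std::size_t used1 = 0;

    void add(std::uint64_t key) {
        for (std::size_t i = 0; i < used1; ++i)
            if (gen1[i].key == key) { ++gen1[i].count; return; }
        if (used1 == N1) promote();
        gen1[used1++] = Entry{key, 1};
    }

    // Keep only the most frequent Gen1 entries in the sorted Gen2.
    void promote() {
        auto byCount = [](const Entry& a, const Entry& b) { return a.count > b.count; };
        std::sort(gen1.begin(), gen1.end(), byCount);
        for (std::size_t i = 0; i < N2 && i < N1; ++i)
            if (gen1[i].count > gen2[N2 - 1].count) {
                gen2[N2 - 1] = gen1[i];            // replace current minimum
                std::sort(gen2.begin(), gen2.end(), byCount);
            }
        used1 = 0;                                 // Gen1 starts over
    }
};
```

The real collector uses 1024 Gen1 slots and 256 Gen2 slots, and keys are hashed stack traces rather than plain integers.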

Output
Results for all allocators are merged:

LargeBlockCollector.GetData(mergedData, mergedCount);
MediumBlockCollector.GetData(data, count);
LargeBlockCollector.Merge(mergedData, mergedCount, data, count);
for i := 0 to High(SmallBlockTypes) do begin
  SmallBlockTypes[i].BlockCollector.GetData(data, count);
  LargeBlockCollector.Merge(mergedData, mergedCount, data, count);
end;

Top 10 "call sites" are written to <programname>_MemoryManager_EventLog.txt

Findings

It is hard to release memory
Time is mostly wasted in FreeMem.
GetMem (with small blocks) can "upgrade" to an unused allocator, so one thread doesn't block another.
FreeMem must work with the allocator that "produced" the memory, so one thread blocks another.

Solution

If the allocator is locked, delay the FreeMem:
  The memory block is pushed onto a "to be released" list.
  Each allocator gets its own "release stack".

while LockCmpxchg(0, 1, @LPSmallBlockType.BlockTypeLocked) <> 0 do begin
{$ifdef UseReleaseStack}
  LPReleaseStack := @LPSmallBlockType.ReleaseStack;
  if (not LPReleaseStack^.IsFull) and LPReleaseStack^.Push(APointer) then begin
    Result := 0;
    Exit;
  end;
{$endif}

When the allocator is successfully locked, all memory from its release stack is released.

FastMM4LockFreeStack
Very fast lock-free stack implementation
Taken from OmniThreadLibrary
Windows only
Dynamic memory: uses HeapAlloc for memory allocation
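
The core idea behind such a stack is a compare-and-swap loop on the head pointer. A minimal Treiber-style C++ sketch (the real FastMM4LockFreeStack is bounded, Windows-only, and manages its node storage with HeapAlloc; this toy version does none of that):

```cpp
#include <atomic>
#include <cstddef>

// Minimal lock-free (Treiber) stack: push and pop both retry a CAS on
// the head pointer until it succeeds. Illustration only.
struct Node { void* item; Node* next; };

class LockFreeStack {
    std::atomic<Node*> head{nullptr};
public:
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak reloads head into n->next.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    Node* pop() {
        Node* n = head.load();
        while (n && !head.compare_exchange_weak(n, n->next)) {}
        return n;  // production code must also guard against the ABA problem
    }
};
```

A production implementation has to solve the ABA problem (e.g. with tagged pointers), which is one reason the OmniThreadLibrary version is platform-specific.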

Problems
Release stacks work, but not perfectly.
1. FreeMem can still block if multiple threads are releasing similarly sized memory blocks.
   Solution: hash all threads into a pool of release stacks.
2. Somebody has to clean up after terminated threads.
   Solution: a low-priority memory release thread. Currently only for medium/large blocks (CreateCleanupThread/DestroyCleanupThread).

Bunch of release stacks

while LockCmpxchg(0, 1, @LPSmallBlockType.BlockTypeLocked) <> 0 do begin
{$ifdef UseReleaseStack}
  LPReleaseStack := @LPSmallBlockType.ReleaseStack[GetStackSlot];
  if (not LPReleaseStack^.IsFull) and LPReleaseStack^.Push(APointer) then begin
    Result := 0;
    Exit;
  end;
{$endif}

GetStackSlot hashes the thread ID into the [0..NumStacksPerBlock-1] range.
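
A possible shape of that hash, sketched in C++ (the constant and the bit-mixing step are assumptions for illustration; FastMM's actual GetStackSlot may differ):

```cpp
#include <cstdint>

// Hypothetical sketch of hashing a thread ID into a release-stack slot.
constexpr unsigned NumStacksPerBlock = 8;

unsigned stackSlot(std::uint32_t threadId) {
    // Windows thread IDs are typically multiples of 4, so mix the low
    // bits before taking the modulus to spread threads across slots.
    threadId ^= threadId >> 4;
    return threadId % NumStacksPerBlock;
}
```

The goal is only to keep threads that free similarly sized blocks from piling onto the same release stack; collisions are harmless, just slower.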

Danger, Will Robinson!
Used in production
Still, use with care
Incompatible with FullDebugMode
$DEFINE UseReleaseStack

NUMA Non-Uniform Memory Access

Non-Uniform Memory Access
[Diagrams comparing SMP and NUMA architectures. Source: Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp/]

NUMA brings problems
Different "cost" for memory access
Measurement from a real system: 80 cores, 20 in each NUMA node
Measured with Coreinfo (Mark Russinovich); not a very accurate measurement
[Slide: cross-node relative access cost matrix for nodes 00-03; values range from 1.0 (local node) to 3.5 (farthest node): 1.0 1.6 1.9 3.4 / 1.8 2.2 3.5 / 2.1 2.6 3.1 / 2.8]

Solution
Node-local memory allocation
FastMM implementation: per-node allocators
  https://github.com/gabr42/FastMM4-MP/tree/numa
VERY experimental!

Solution, part 2
How to use more than 64 cores in your program? OmniThreadLibrary with NUMA extensions:
  https://github.com/gabr42/OmniThreadLibrary/tree/numa
Environment.ProcessorGroups, Environment.NUMANodes
IOmniTaskControl.ProcessorGroup, IOmniTaskControl.NUMANode
IOmniThreadPool.ProcessorGroups, IOmniThreadPool.NUMANodes

“NUMA” for Developers
bcdedit /set groupsize 2
https://msdn.microsoft.com/en-us/library/windows/hardware/ff542298(v=vs.85).aspx

Questions?