How I Learned to Love FastMM Internals


How I Learned to Love FastMM Internals
Primož Gabrijelčič
http://primoz.gabrijelcic.org

History

Developed by Pierre le Riche for the FastCode project
  https://en.wikipedia.org/wiki/FastCode
Version 4, hence FastMM4
Included in RAD Studio since version 2006
  http://www.tindex.net/Language/FastMMmemorymanager.html
Much improved since then
Don't use the default FastMM, download the fresh one
  https://github.com/pleriche/FastMM4

Features
Fast
Fragmentation resistant
Access to > 2 GB
Simple memory sharing
Memory leak reporting
Catches some memory-related bugs

Problems
Can get slow in a multithreaded environment
Can get VERY slow in a multithreaded environment

FastMM4 Internals

Top View
Three memory managers in one:
Small blocks (< 2.5 KB)
  Most frequently used (99%)
  Allocated from medium blocks, which are subdivided into small blocks
Medium blocks (2.5–260 KB)
  Allocated in chunks (1.25 MB) and subdivided into lists
Large blocks (> 260 KB)
  Allocated directly by the OS
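
The three-tier split above is driven purely by the requested block size. A minimal C++ sketch of that dispatch (thresholds taken from the slide; the enum and function names are invented for illustration, this is not FastMM's actual code):

```cpp
#include <cstddef>

// Hypothetical sketch of FastMM4's three-tier dispatch, using the
// size thresholds from the slide (~2.5 KB and ~260 KB).
enum class Tier { Small, Medium, Large };

Tier tierFor(std::size_t bytes) {
    if (bytes < 2560)        return Tier::Small;   // < 2.5 KB: small-block allocator
    if (bytes <= 260 * 1024) return Tier::Medium;  // up to 260 KB: medium-block allocator
    return Tier::Large;                            // > 260 KB: handed to the OS
}
```

The important consequence is that the small-block path, which handles ~99% of allocations, must be the fastest one.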

Details
One large block allocator
One medium block allocator
Multiple (54+2) small block allocators (SmallBlockTypes)
Custom, optimized Move routines (FastCode)
Each allocator has its own lock
If SmallAllocator is locked, SmallAllocator+1 or SmallAllocator+2 is used
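
The "try the next allocator" fallback can be sketched like this in C++ (names and the lock type are invented; FastMM uses a LockCmpxchg spinlock, approximated here with std::atomic_flag):

```cpp
#include <atomic>

// Sketch of the small-block allocator fallback: if the matching
// allocator is locked, try the next two (slightly larger) block sizes
// instead of waiting. Not FastMM's actual code.
constexpr int NumSmallAllocators = 56;  // 54 + 2 from the slide

struct SmallAllocator {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
};

SmallAllocator allocators[NumSmallAllocators];

// Returns the index of the allocator we managed to lock, or -1 if all
// three candidates are busy (the caller would then spin on idx).
int acquireAllocator(int idx) {
    for (int i = idx; i < idx + 3 && i < NumSmallAllocators; ++i)
        if (!allocators[i].locked.test_and_set())  // test_and_set returns the old value
            return i;
    return -1;
}
```

Taking a slightly larger block wastes a little memory but avoids blocking the allocating thread, which is why GetMem contends far less than FreeMem (see Findings below).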

Problem
Why are multithreaded programs slow? Threads are fighting for allocators.
It is easy to change the program to bypass the problem. Well, sometimes.
It is hard to find the code responsible.

Demo
Steve Maughan: http://www.stevemaughan.com/delphi/delphi-parallel-programming-library-memory-managers/
http://www.thedelphigeek.com/2016/02/finding-memory-allocation-bottlenecks.html

Diagnosing FastMM4 Bottlenecks
$DEFINE LogLockContention

FastMM4 Locking

if IsMultiThread then begin
  while LockCmpxchg(0, 1, @MediumBlocksLocked) <> 0 do begin
{$ifdef NeverSleepOnThreadContention}
  {$ifdef UseSwitchToThread}
    SwitchToThread; // any thread on the same processor
  {$endif}
{$else}
    Sleep(InitialSleepTime); // 0; any thread that is ready to run
    if LockCmpxchg(0, 1, @MediumBlocksLocked) = 0 then
      Break;
    Sleep(AdditionalSleepTime); // 1; wait
{$endif}
  end;
end;

Lock Contention Logging

LockMediumBlocks({$ifdef LogLockContention}LDidSleep{$endif});
{$ifdef LogLockContention}
if LDidSleep then
  ACollector := @MediumBlockCollector;
{$endif}

if Assigned(ACollector) then begin
  GetStackTrace(@LStackTrace, StackTraceDepth, 1);
  ACollector.Add(@LStackTrace[0], StackTraceDepth);
end;

FastMM4DataCollector
Opaque data
Completely static
  Can't use the MM inside the MM
  Agreed max data size
Most Frequently Used
Generational
  Reduces the problem of local maxima
  Two generations, sorted
  1024 slots in Gen1
  256 slots in Gen2
  Easy to expand to more generations
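
The generational most-frequently-used idea can be sketched in C++ as follows (a toy version, not the FastMM4DataCollector code; sizes, names and the promotion policy are simplified assumptions). All storage is fixed-size arrays, since a memory manager cannot allocate memory for its own bookkeeping:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Toy two-generation "most frequently used" collector. Gen1 gathers raw
// samples; when it fills up, only the highest counts are promoted into
// Gen2, so rare one-off entries (local maxima) are discarded.
struct Entry { std::uint64_t key = 0; unsigned count = 0; };

template <std::size_t N1, std::size_t N2>
struct MfuCollector {
    std::array<Entry, N1> gen1{};
    std::array<Entry, N2> gen2{};
    std::size_t used1 = 0;

    void add(std::uint64_t key) {
        for (std::size_t i = 0; i < used1; ++i)
            if (gen1[i].key == key) { ++gen1[i].count; return; }
        if (used1 == N1) promote();
        gen1[used1++] = Entry{key, 1};
    }

    // Keep only the most frequent Gen1 entries in the sorted Gen2.
    void promote() {
        auto byCount = [](const Entry& a, const Entry& b) { return a.count > b.count; };
        std::sort(gen1.begin(), gen1.end(), byCount);
        for (std::size_t i = 0; i < N2 && i < N1; ++i)
            if (gen1[i].count > gen2[N2 - 1].count) {
                gen2[N2 - 1] = gen1[i];            // replace current minimum
                std::sort(gen2.begin(), gen2.end(), byCount);
            }
        used1 = 0;                                 // Gen1 starts over
    }
};
```

The real collector uses 1024 Gen1 slots and 256 Gen2 slots, and keys are hashed stack traces rather than plain integers.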

Output
Results for all allocators are merged:

LargeBlockCollector.GetData(mergedData, mergedCount);
MediumBlockCollector.GetData(data, count);
LargeBlockCollector.Merge(mergedData, mergedCount, data, count);
for i := 0 to High(SmallBlockTypes) do begin
  SmallBlockTypes[i].BlockCollector.GetData(data, count);
  LargeBlockCollector.Merge(mergedData, mergedCount, data, count);
end;

Top 10 "call sites" are written to <programname>_MemoryManager_EventLog.txt

Findings

It is hard to release memory
Time is mostly wasted in FreeMem.
GetMem (with small blocks) can "upgrade" to an unused allocator, so one thread doesn't block another.
FreeMem must work with the allocator that "produced" the memory, so one thread blocks another.

Solution

If the allocator is locked, delay the FreeMem:
  The memory block is pushed onto a "to be released" list.
  Each allocator gets its own "release stack".

while LockCmpxchg(0, 1, @LPSmallBlockType.BlockTypeLocked) <> 0 do begin
{$ifdef UseReleaseStack}
  LPReleaseStack := @LPSmallBlockType.ReleaseStack;
  if (not LPReleaseStack^.IsFull) and LPReleaseStack^.Push(APointer) then begin
    Result := 0;
    Exit;
  end;
{$endif}

When the allocator is successfully locked, all memory from its release stack is released.

FastMM4LockFreeStack
Very fast lock-free stack implementation
Taken from OmniThreadLibrary
Windows only
Dynamic memory: uses HeapAlloc for memory allocation
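
The core idea behind such a stack is a compare-and-swap loop on the head pointer. A minimal Treiber-style C++ sketch (the real FastMM4LockFreeStack is bounded, Windows-only, and manages its node storage with HeapAlloc; this toy version does none of that):

```cpp
#include <atomic>
#include <cstddef>

// Minimal lock-free (Treiber) stack: push and pop both retry a CAS on
// the head pointer until it succeeds. Illustration only.
struct Node { void* item; Node* next; };

class LockFreeStack {
    std::atomic<Node*> head{nullptr};
public:
    void push(Node* n) {
        n->next = head.load(std::memory_order_relaxed);
        // On failure, compare_exchange_weak reloads head into n->next.
        while (!head.compare_exchange_weak(n->next, n)) {}
    }
    Node* pop() {
        Node* n = head.load();
        while (n && !head.compare_exchange_weak(n, n->next)) {}
        return n;  // production code must also guard against the ABA problem
    }
};
```

A production implementation has to solve the ABA problem (e.g. with tagged pointers), which is one reason the OmniThreadLibrary version is platform-specific.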

Problems
Release stacks work, but not perfectly.
1. FreeMem can still block if multiple threads are releasing similarly sized memory blocks.
   Solution: hash all threads into a pool of release stacks.
2. Somebody has to clean up after terminated threads.
   Solution: a low-priority memory release thread. Currently only for medium/large blocks (CreateCleanupThread/DestroyCleanupThread).

Bunch of release stacks

while LockCmpxchg(0, 1, @LPSmallBlockType.BlockTypeLocked) <> 0 do begin
{$ifdef UseReleaseStack}
  LPReleaseStack := @LPSmallBlockType.ReleaseStack[GetStackSlot];
  if (not LPReleaseStack^.IsFull) and LPReleaseStack^.Push(APointer) then begin
    Result := 0;
    Exit;
  end;
{$endif}

GetStackSlot hashes the thread ID into the [0..NumStacksPerBlock-1] range.
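
A possible shape of that hash, sketched in C++ (the constant and the bit-mixing step are assumptions for illustration; FastMM's actual GetStackSlot may differ):

```cpp
#include <cstdint>

// Hypothetical sketch of hashing a thread ID into a release-stack slot.
constexpr unsigned NumStacksPerBlock = 8;

unsigned stackSlot(std::uint32_t threadId) {
    // Windows thread IDs are typically multiples of 4, so mix the low
    // bits before taking the modulus to spread threads across slots.
    threadId ^= threadId >> 4;
    return threadId % NumStacksPerBlock;
}
```

The goal is only to keep threads that free similarly sized blocks from piling onto the same release stack; collisions are harmless, just slower.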

Danger, Will Robinson!
Used in production
Still, use with care
Incompatible with FullDebugMode
$DEFINE UseReleaseStack

NUMA Non-Uniform Memory Access

Non-Uniform Memory Access
[Diagrams comparing SMP and NUMA architectures. Source: Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp/]

NUMA brings problems
Different "cost" for memory access
Measurement from a real system: 80 cores, 20 in each NUMA node
Measured with Coreinfo (Mark Russinovich); not a very accurate measurement
[Slide: cross-node relative access cost matrix for nodes 00-03; values range from 1.0 (local node) to 3.5 (farthest node): 1.0 1.6 1.9 3.4 / 1.8 2.2 3.5 / 2.1 2.6 3.1 / 2.8]

Solution
Node-local memory allocation
FastMM implementation: per-node allocators
  https://github.com/gabr42/FastMM4-MP/tree/numa
VERY experimental!

Solution, part 2
How to use more than 64 cores in your program? OmniThreadLibrary with NUMA extensions:
  https://github.com/gabr42/OmniThreadLibrary/tree/numa
Environment.ProcessorGroups, Environment.NUMANodes
IOmniTaskControl.ProcessorGroup, IOmniTaskControl.NUMANode
IOmniThreadPool.ProcessorGroups, IOmniThreadPool.NUMANodes

“NUMA” for Developers
bcdedit /set groupsize 2
https://msdn.microsoft.com/en-us/library/windows/hardware/ff542298(v=vs.85).aspx

Questions?