Ran Liu (Fudan Univ. Shanghai Jiaotong Univ.)

SSMalloc: A Low-latency, Locality-conscious Memory Allocator with Stable Performance Scalability
Ran Liu (Fudan Univ., Shanghai Jiaotong Univ.), Haibo Chen (Shanghai Jiaotong Univ.)

Background
Many-Core Era
  Computers with tens of cores are available
Many-Thread Application
  Server Program
  Scientific Computation Program
  …
Many applications' performance relies heavily on the memory allocator

Allocator performance matters
[Figure: web server throughput with different memory allocators. Taken from the Facebook website.]

Is it a solved problem?
[Figure: scale-up of glibc, jemalloc (BSDCan06), Streamflow (ISMM06), and SFMalloc (PACT11)]

Is it a solved problem?
[Figure: scale-up vs. number of cores. Scalability is unstable; annotations: kernel contention, user-level contention]

The main problems in modern memory allocators
Unstable scalability
  Critical path contention
  Global data structure contention
  Kernel contention
    With 64 threads, SFMalloc spent a great amount of time in mmap calls.
Unstable locality
  Kernel execution
  Allocator data structure operation
  Context switch
Unstable latency
  Algorithm complexity
    jemalloc uses RB trees (O(log N)) internally.
  Hardware details (pipeline, branch prediction, cache)

This paper
[Figure: scale-up vs. number of cores. Stable scale-up]

Design of ssmalloc

Mechanism for objects of different sizes
Small Objects
  Closely related to scalability
  Handled in private heap
Large Objects
  Forwarded to OS via mmap

Small object (<=64KB) management
[Diagram: Thread 1, Thread 2, …, Thread N; Private Heap 1, Private Heap 2, …, Private Heap N; Memory Chunks; Global Pool; OS]

Memory Chunk
Basic unit of memory management
Contains multiple objects of the same size class
  [Diagram: Header | Obj 1 | … | Obj N]
Header parts: Private RW, Shared R, Shared W
  Avoids false sharing on allocator data structures
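The header split into Private RW, Shared R, and Shared W parts can be sketched as a struct whose groups each start on their own cache line. This is only an illustration: the field names, the 64-byte line size, and the choice of which fields go in which group are assumptions, not taken from the SSMalloc source.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Hypothetical chunk header: one cache line per access class, so a
 * write to one group does not invalidate the line holding another. */
typedef struct chunk_header {
    /* Private RW: touched only by the owning thread. */
    alignas(CACHE_LINE) void *free_list;
    uint32_t free_count;

    /* Shared R: written once at setup, then only read by any thread. */
    alignas(CACHE_LINE) uint32_t size_class;
    uint32_t owner_id;

    /* Shared W: written by other threads (for example on a remote free). */
    alignas(CACHE_LINE) void *remote_free_list;
} chunk_header_t;

/* The slides give the header 256 bytes; this layout must fit in that. */
static_assert(sizeof(chunk_header_t) <= 256, "header must fit in 256 bytes");
```

With this layout the three groups begin at offsets 0, 64, and 128, so a remote thread writing `remote_free_list` never dirties the line that holds the owner's `free_list`.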

Memory Chunk (Cont.)
Same size
  Cross size-class reuse
  Easy metadata locating
Unaligned size (65536 + 256 bytes)
  Mitigates cache conflicts on headers
  [Diagram: Header (256 bytes) | Data Area (65536 bytes), mapped onto the cache]
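To see why the unaligned chunk size helps, the sketch below counts how many distinct cache sets the chunk headers fall into when chunks are laid out back to back. The cache geometry (64-byte lines, 512 sets) and the function name are assumptions chosen for illustration; real hardware differs.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes */
#define CACHE_SETS 512  /* assumed number of cache sets */

/* Count the distinct cache sets hit by the first byte of each chunk
 * header, for nchunks chunks placed contiguously with the given stride. */
static unsigned distinct_header_sets(size_t stride, unsigned nchunks)
{
    unsigned char seen[CACHE_SETS] = {0};
    unsigned distinct = 0;

    for (unsigned i = 0; i < nchunks; i++) {
        size_t header_addr = (size_t)i * stride;
        size_t set = (header_addr / CACHE_LINE) % CACHE_SETS;
        if (!seen[set]) {
            seen[set] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

Under these assumptions, `distinct_header_sets(65536, 64)` returns 1 (every header competes for the same set), while `distinct_header_sets(65536 + 256, 64)` returns 64, because each extra 256 bytes shifts the next header by four cache lines.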

Private Heap
Full Chunks
Foreground Chunks
Background Chunks (LIFO Linked List)
Local Free Chunks

Private Heap (Cont.)
[Diagram labels: Hot Chunks, Cold Chunks; Full Chunks, Foreground Chunks, Background Chunks (LIFO Linked List), Local Free Chunks]
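The LIFO linked list named on this slide can be sketched as an intrusive singly linked stack. Because a private heap is used by one thread only, no synchronization is needed here. The type and function names are hypothetical; the slides do not show the actual data structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk with an intrusive link field. */
typedef struct chunk {
    struct chunk *next;
} chunk_t;

/* Hypothetical per-thread list head (for example the background chunks). */
typedef struct chunk_list {
    chunk_t *head;
} chunk_list_t;

/* Push a chunk on top of the list. */
static void chunk_list_push(chunk_list_t *list, chunk_t *c)
{
    c->next = list->head;
    list->head = c;
}

/* Pop the most recently pushed chunk, or NULL if the list is empty. */
static chunk_t *chunk_list_pop(chunk_list_t *list)
{
    chunk_t *c = list->head;
    if (c != NULL)
        list->head = c->next;
    return c;
}
```

LIFO order means the chunk that was used most recently, and is therefore most likely still in cache, is the first one handed out again.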

Global Pool
[Diagram: Private Heap A, Private Heap B, Global Pool, Raw Memory Pool]
Global reuse (lock-free)
Alloc new chunk (lock-free)
The Raw Memory Pool is enlarged exponentially to avoid mmap calls
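A minimal sketch of exponential enlargement follows: each time the raw pool runs dry, it requests twice as many chunks as the previous time, so the number of OS requests grows only logarithmically with the number of chunks handed out. The sketch is single-threaded (the slide describes a lock-free version), `os_map` uses `malloc` as a portable stand-in for `mmap`, and all names and the initial size of 4 chunks are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_SIZE     (65536 + 256)  /* chunk size from the slides */
#define INITIAL_CHUNKS 4              /* assumed first growth step */

/* Hypothetical raw pool: a bump pointer over the current OS region. */
typedef struct raw_pool {
    char    *cur;          /* next unused byte in the current region */
    char    *end;          /* one past the end of the current region */
    size_t   next_chunks;  /* chunks to request at the next growth */
    unsigned os_calls;     /* how many times the OS was asked for memory */
} raw_pool_t;

/* Stand-in for mmap so the sketch stays portable. */
static void *os_map(size_t bytes)
{
    return malloc(bytes);
}

/* Hand out one chunk; grow the pool (doubling each time) when empty. */
static void *raw_pool_alloc_chunk(raw_pool_t *p)
{
    if (p->cur == NULL || (size_t)(p->end - p->cur) < CHUNK_SIZE) {
        size_t n = p->next_chunks ? p->next_chunks : INITIAL_CHUNKS;
        char *mem = os_map(n * CHUNK_SIZE);
        if (mem == NULL)
            return NULL;
        p->cur = mem;
        p->end = mem + n * CHUNK_SIZE;
        p->next_chunks = n * 2;
        p->os_calls++;
    }
    void *chunk = p->cur;
    p->cur += CHUNK_SIZE;
    return chunk;
}
```

Starting from a zero-initialized `raw_pool_t`, the first 28 chunks (4 + 8 + 16) cost three OS requests; a pool that grew by a fixed 4 chunks per request would need seven.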

Global Pool (Cont.)
Interaction with the OS
[Figure: memory amount over time]
SSMalloc (time-directed reclamation)
  Reduces VM management calls
Many other allocators (space-directed reclamation)
  Memory pages ping-pong between user and kernel
  Excessive VM management calls

How to free an object?
Problem: decide the size of the memory object
Textbook solution: per-object header
  Easy to locate, bad locality
Modern allocators: centralized metadata
  Hard to locate (bitmap, hash table, radix tree…), good locality

How to free an object?
Problem: decide the size of the memory object
SSMalloc: unified header for small and large objects
  Every object's header is at the previous chunk boundary
  Easy to locate (align to chunk boundary), good locality
  [Diagram: Small Objects, Large Objects]
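Locating the header at the previous chunk boundary can be sketched as plain address arithmetic. The sketch assumes that chunks are laid out back to back from a known pool base, which the slides do not state; `pool_base`, `header_of`, and the `object_size` field are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 256u
#define DATA_SIZE   65536u
#define CHUNK_SIZE  (HEADER_SIZE + DATA_SIZE)

/* Hypothetical header: only the field that free() would need here. */
typedef struct chunk_header {
    uint32_t object_size;  /* size of every object in this chunk */
} chunk_header_t;

/* Round an object address down to the start of its enclosing chunk.
 * Assumes chunks are contiguous starting at pool_base. */
static chunk_header_t *header_of(void *pool_base, void *obj)
{
    uintptr_t offset = (uintptr_t)obj - (uintptr_t)pool_base;
    uintptr_t chunk_start = (offset / CHUNK_SIZE) * CHUNK_SIZE;
    return (chunk_header_t *)((char *)pool_base + chunk_start);
}

/* What free() needs to know: the size of the object being released. */
static size_t object_size_of(void *pool_base, void *obj)
{
    return header_of(pool_base, obj)->object_size;
}
```

No per-object header and no lookup structure is involved: one subtraction, one division, and one multiplication reach the metadata, and all objects in a chunk share the same header cache lines.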

Design summary
Scalability
  Sync-free critical path
  Local memory reuse
  Lock-free global data structure
  Avoidance of excessive VM management calls (mmap, munmap)
  …
Latency
  Wait-free algorithm within private heap
  Short critical path
  Unified header
Locality
  Locality-conscious memory chunk management
  Allocator false-sharing avoidance

Evaluation

Evaluation
Platform
  8 six-core (2.4 GHz) AMD x64 system (48 cores in total)
  128 GB memory
  Linux
Other memory allocators
  glibc
  TCMalloc from google-perftools 1.7
  jemalloc 2.1.2
  Streamflow
  SFMalloc

Latency
Allocation-intensive serial programs

Scalability
shbench performance

Locality
Wordcount from Phoenix 2.0: cache misses

Map-reduce performance
Wordcount from phoenix 2.0

Conclusion
Analyze the performance problems of memory allocators
Explore the design space of memory allocators for many-thread applications on many-core systems
A prototype: SSMalloc
  Low latency
  Stable scalability
  Good locality
Thanks!

Why not modify the kernel to improve mmap scalability?
Parallelizing the VM management operations involves huge kernel code refactoring
  Memory manager itself
  Device drivers
Applying a new memory allocator is much easier and more practical.
