
1 SSMalloc: A Low-latency, Locality-conscious Memory Allocator with Stable Performance Scalability
Ran Liu (Fudan Univ., Shanghai Jiaotong Univ.), Haibo Chen (Shanghai Jiaotong Univ.)

2 Background
Many-Core Era
  - Computers with tens of cores are available
Many-Thread Applications
  - Server programs
  - Scientific computation programs
The performance of many applications relies heavily on the memory allocator.

3 Allocator performance matters
Web server throughput with different memory allocators *Taken from Facebook website

4 Is it a solved problem?
[Figure: scale-up of glibc, jemalloc (BSDCan06), Streamflow (ISMM06), and SFMalloc (PACT11).]

5 Is it a solved problem?
[Figure: scale-up versus number of cores. Scaling is unstable; annotations mark kernel contention and user-level contention.]

6 The main problems in modern memory allocators
Unstable scalability
  - Critical path contention
  - Global data structure contention
  - Kernel contention: with 64 threads, SFMalloc spent a great amount of time in mmap calls.
Unstable locality
  - Kernel execution
  - Allocator data structure operations
  - Context switches
Unstable latency
  - Algorithm complexity: jemalloc uses RB trees (O(log N)) internally.
  - Hardware details (pipeline, branch prediction, cache)

7 This paper
[Figure: scale-up versus number of cores, showing stable scaling.]

8 Design of SSMalloc

9 Mechanisms for objects of different sizes
Small objects
  - Closely related to scalability
  - Handled in the private heap
Large objects
  - Forwarded to the OS via mmap

10 Small object (<= 64 KB) management
[Diagram, top to bottom: Thread 1, Thread 2, ..., Thread N; Private Heap 1, Private Heap 2, ..., Private Heap N; memory chunks; global pool; OS.]
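
The size-based split on slides 9 and 10 can be sketched as follows. This is a minimal illustration, not SSMalloc's actual code: the names `SMALL_OBJECT_MAX`, `is_small_object`, `large_alloc`, and `large_free` are hypothetical, and only the 64 KB threshold and the use of mmap for large objects come from the slides.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std=c11 on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Threshold from slide 10: objects up to 64 KB count as small. */
#define SMALL_OBJECT_MAX ((size_t)64 * 1024)

/* Returns nonzero if the request would be served from a private heap. */
int is_small_object(size_t size)
{
    return size <= SMALL_OBJECT_MAX;
}

/* Large path (hypothetical helper): forward the request to the OS. */
void *large_alloc(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Large path (hypothetical helper): hand the mapping back to the OS. */
void large_free(void *p, size_t size)
{
    int rc = munmap(p, size);
    assert(rc == 0);
    (void)rc;
}
```

A real allocator also has to record the size of each large mapping so that free can recover it; this sketch simply asks the caller to pass it in.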

11 Memory Chunk
Basic unit of memory management
Contains multiple objects of the same size class
[Diagram: a chunk is a header followed by objects Obj ... Obj N. The header is split into Private RW, Shared R, and Shared W regions.]
Avoids false sharing on allocator data structures
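
One way to picture the three header regions is a struct whose regions each start on their own cache line. This is only a sketch under assumptions: the 64-byte line size, the field names, and the padding to 256 bytes (the header size given on slide 12) are illustrative and are not taken from the SSMalloc source.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size */

/* Hypothetical chunk header: each access class gets its own cache line,
 * so a remote thread writing shared_w never invalidates the line that
 * the owning thread reads and writes in private_rw. */
struct chunk_header {
    alignas(CACHE_LINE) struct {
        void *free_list;      /* owner-only free list (illustrative field) */
        uint32_t free_count;  /* owner-only counter (illustrative field) */
    } private_rw;
    alignas(CACHE_LINE) struct {
        uint32_t object_size; /* set once, then read by any thread */
        uint32_t owner_id;
    } shared_r;
    alignas(CACHE_LINE) struct {
        void *remote_free_list; /* written by non-owner threads */
    } shared_w;
    alignas(CACHE_LINE) char reserved[CACHE_LINE]; /* pad to 256 bytes */
};

static_assert(sizeof(struct chunk_header) == 256,
              "header sketch should match the 256-byte header on slide 12");
```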

12 Memory Chunk (Cont.)
Same size
  - Cross size-class reuse
  - Easy metadata locating
Unaligned size (65536 + 256 bytes)
  - Mitigates cache conflicts on the header
[Diagram: a 256-byte header followed by a 65536-byte data area, mapped onto the cache.]
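
The cache-conflict argument can be checked with a little arithmetic. The sketch below assumes a hypothetical cache with 64-byte lines and 512 sets (neither number is from the slides) and counts how many distinct sets the headers of consecutive chunks fall into for a given chunk stride.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LINE_SIZE 64u /* assumed cache line size in bytes */
#define NUM_SETS 512u /* assumed number of cache sets */

/* Set index that the byte at `offset` maps to in the assumed cache. */
unsigned cache_set_of(size_t offset)
{
    return (unsigned)((offset / LINE_SIZE) % NUM_SETS);
}

/* Number of distinct sets hit by the first header line of `nchunks`
 * back-to-back chunks that are `chunk_stride` bytes apart. */
unsigned distinct_header_sets(size_t chunk_stride, unsigned nchunks)
{
    bool seen[NUM_SETS] = { false };
    unsigned distinct = 0;

    for (unsigned k = 0; k < nchunks; k++) {
        unsigned set = cache_set_of((size_t)k * chunk_stride);
        if (!seen[set]) {
            seen[set] = true;
            distinct++;
        }
    }
    return distinct;
}
```

Under these assumed parameters, 16 chunks with an aligned 65536-byte stride place every header in the same set, while the 65536 + 256 byte stride spreads the 16 headers over 16 different sets.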

13 Private Heap
[Diagram: a private heap holds Full Chunks, Foreground Chunks, Background Chunks (LIFO linked list), and Local Free Chunks.]

14 Private Heap (Cont.)
[Diagram: the same private heap (Full Chunks, Foreground Chunks, Background Chunks (LIFO linked list), Local Free Chunks), with chunks labeled as Hot Chunks or Cold Chunks.]
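
The LIFO linked list named on these slides can be sketched as below. Because a private heap is touched only by its owning thread, no synchronization is needed. The type and function names are hypothetical, and the comment about cache warmth is an interpretation of the hot/cold labels, not a statement from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-chunk link used by a private heap's chunk lists. */
struct heap_chunk {
    struct heap_chunk *next;
    int id; /* illustrative payload so chunks can be told apart */
};

struct chunk_list {
    struct heap_chunk *head;
};

/* Push onto the front: the most recently used chunk becomes the head. */
void chunk_list_push(struct chunk_list *list, struct heap_chunk *c)
{
    c->next = list->head;
    list->head = c;
}

/* Pop from the front: returns the most recently pushed chunk, which is
 * the one most likely to still be warm in cache, or NULL if empty. */
struct heap_chunk *chunk_list_pop(struct chunk_list *list)
{
    struct heap_chunk *c = list->head;
    if (c != NULL) {
        list->head = c->next;
        c->next = NULL;
    }
    return c;
}
```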

15 Global Pool
Global reuse (lock-free)
Alloc new chunk (lock-free)
[Diagram: Private Heap A and Private Heap B connect to the global pool, which is backed by the Raw Memory Pool.]
The Raw Memory Pool is enlarged exponentially to avoid mmap calls.
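
The slides do not say which lock-free structure the global pool uses. As an illustration only, the sketch below uses a Treiber stack built on C11 atomics, plus a small helper that counts how many growth requests a doubling pool needs. All names are hypothetical, and the doubling policy is one assumed reading of "enlarged exponentially".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical chunk link for the global pool. */
struct pool_chunk {
    struct pool_chunk *next;
};

struct global_pool {
    _Atomic(struct pool_chunk *) top;
};

/* Lock-free push: retry the compare-and-swap until the new top is installed. */
void pool_push(struct global_pool *p, struct pool_chunk *c)
{
    struct pool_chunk *old = atomic_load_explicit(&p->top, memory_order_relaxed);
    do {
        c->next = old; /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
        &p->top, &old, c, memory_order_release, memory_order_relaxed));
}

/* Lock-free pop. Caveat: a production version must handle the ABA problem
 * (for example with a version tag next to the pointer); this sketch does not. */
struct pool_chunk *pool_pop(struct global_pool *p)
{
    struct pool_chunk *old = atomic_load_explicit(&p->top, memory_order_acquire);
    while (old != NULL &&
           !atomic_compare_exchange_weak_explicit(
               &p->top, &old, old->next,
               memory_order_acquire, memory_order_acquire)) {
        /* CAS failed: `old` now holds the current top, so try again. */
    }
    return old;
}

/* Number of growth requests needed to reach `target` bytes when each
 * request is twice as large as the previous one, starting at `initial`. */
unsigned doubling_steps(size_t initial, size_t target)
{
    unsigned steps = 0;
    size_t total = 0, next = initial;

    while (total < target) {
        total += next;
        next *= 2;
        steps++;
    }
    return steps;
}
```

With a 1 MiB first request, a doubling pool reaches 1 GiB after 11 requests, whereas fixed 1 MiB requests would need 1024; this is the sense in which exponential growth avoids mmap calls.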

16 Global Pool (Cont.)
Interaction with the OS
SSMalloc (time-directed reclamation)
  - Reduces VM management calls
Many other allocators (space-directed reclamation)
  - Memory pages ping-pong between user space and the kernel
  - Excessive VM management calls
[Figure: memory amount over time under the two reclamation policies.]

17 How to free an object?
Problem: determining the size of a memory object
Textbook solution: per-object header
  - Easy to locate, bad locality
Modern allocators: centralized metadata
  - Hard to locate (bitmap, hash table, radix tree, ...), good locality

18 How to free an object?
Problem: determining the size of a memory object
SSMalloc: unified header for small and large objects
  - Every object's header is at the preceding chunk boundary
  - Easy to locate (align to chunk boundary), good locality
[Diagram: header placement for small objects and large objects.]
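
Locating the header at the preceding chunk boundary can be sketched as an address calculation. Since the chunk size from slide 12 (65536 + 256 bytes) is not a power of two, this sketch assumes chunks are laid out back to back from a known pool base address and uses a division rather than a bit mask. The base pointer and the function name are assumptions, not details from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE ((size_t)256)   /* from slide 12 */
#define DATA_SIZE ((size_t)65536)   /* from slide 12 */
#define CHUNK_SIZE (HEADER_SIZE + DATA_SIZE)

/* Returns the address of the chunk header that precedes `obj`, assuming
 * chunks are packed contiguously starting at `pool_base` (hypothetical). */
void *chunk_header_of(const void *pool_base, const void *obj)
{
    uintptr_t base = (uintptr_t)pool_base;
    uintptr_t addr = (uintptr_t)obj;

    assert(addr >= base);
    uintptr_t index = (addr - base) / CHUNK_SIZE; /* which chunk holds obj */
    return (void *)(base + index * CHUNK_SIZE);   /* start of that chunk */
}
```

Once the header is found, free can read the object size from it without consulting a separate bitmap, hash table, or radix tree.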

19 Design summary
Scalability
  - Sync-free critical path
  - Local memory reuse
  - Lock-free global data structure
  - Avoidance of excessive VM management calls (mmap, munmap)
Latency
  - Wait-free algorithm within the private heap
  - Short critical path
  - Unified header
Locality
  - Locality-conscious memory chunk management
  - Allocator false-sharing avoidance

20 Evaluation

21 Evaluation
Platform
  - 8 six-core (2.4 GHz) AMD x64 system (48 cores in total)
  - 128 GB memory
  - Linux
Other memory allocators
  - glibc
  - TCMalloc from google-perftools 1.7
  - jemalloc 2.1.2
  - Streamflow
  - SFMalloc

22 Latency
[Figure: allocation-intensive serial programs.]

23 Scalability
[Figure: shbench performance.]

24 Locality
[Figure: cache misses for Wordcount from Phoenix 2.0.]

25 Map-reduce performance
[Figure: Wordcount from Phoenix 2.0.]

26 Conclusion
Analyzed the performance problems of memory allocators
Explored the design space of memory allocators for many-thread applications on many-core systems
A prototype: SSMalloc
  - Low latency
  - Stable scalability
  - Good locality
Thanks!

27 Why not modify the kernel to improve mmap scalability?
Parallelizing the VM management operations requires huge kernel code refactoring:
  - The memory manager itself
  - Device drivers
Adopting a new memory allocator is much easier and more practical.

