Efficient Dynamic Heap Allocation of Scratch-Pad Memory Ross McIlroy, Peter Dickman and Joe Sventek Carnegie Trust for the Universities of Scotland.

Scratch-Pad Memory Allocator SMA: A dynamic memory allocator targeting extremely small memories (< 1MB in size) Why target such tiny memories? Why provide dynamic memory allocation for such small memories?

Outline –Rationale for SMA –SMA Approach –Results –Concurrent SMA –Conclusion / Future work

What Tiny Memories? Embedded Systems –Sensor Network Motes –Vehicular Devices Scratch-Pad Memories –Network Processors –Heterogeneous Multi-Core Processors

Scratch-Pad Memories Memory is structured as a hierarchy –Small fast memories, large slow memories The hierarchy is usually hidden by hardware caches Some processor architectures employ scratch-pad memories instead –Similar in size and speed to caches, but explicitly accessible by software Examples –IBM Cell processor –Intel IXP network processors –Intel PXA mobile phone processors

Why Dynamic Management? Developers want as much useful data in the fast Scratch-Pad memory as possible They don’t want to deal with the fragmented memory hierarchy
                              Manual  Static  Dynamic
Developer ease                ✗       ✓       ✓
Make full use of Scratch-Pad  ✓       ✗       ✓

Why SMA? Managing 4kB Scratch-Pad memory on an Intel IXP processor
Resource                     Doug Lea malloc  SMA malloc
State Memory (bytes)         516              40
Code Memory (instructions)   1634             297
Avg. Alloc Time (cycles)     70.7             72.8
Avg. Free Time (cycles)      95.2             52.4

Basic Approach By default represent memory coarsely as a series of fixed size blocks –Can employ a very simple bitmap based allocation / free algorithm When required, split blocks into variable sized regions –Prevents excessive internal fragmentation

Large Block Allocation Each block in memory is represented by a bit in a free-block bitmap
[Figure: free-block bitmap annotated with the two search steps]
rem_blocks = blocks_bm & ~mask; next_pos = ffs(rem_blocks);
in_use = mask & ~blocks_bm; next_pos = fls(in_use) + 1;
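The two fragments on this slide can be read as the steps of a first-fit search over the bitmap. The following is a minimal sketch of that reading, not the authors' implementation. It assumes a single 32-bit bitmap in which a set bit means "free"; the helper names (lowest_set, highest_set, run_mask, alloc_blocks, free_blocks) are hypothetical, and the two helpers are 0-indexed portable stand-ins for ffs and fls.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the lowest set bit, or -1 if x == 0 (0-indexed stand-in for ffs). */
static int lowest_set(uint32_t x) {
    if (x == 0) return -1;
    int i = 0;
    while (!(x & 1u)) { x >>= 1; i++; }
    return i;
}

/* Index of the highest set bit, or -1 if x == 0 (0-indexed stand-in for fls). */
static int highest_set(uint32_t x) {
    int i = -1;
    while (x) { x >>= 1; i++; }
    return i;
}

/* Mask of n consecutive bits starting at bit pos; requires pos + n <= 32. */
static uint32_t run_mask(int pos, int n) {
    uint32_t ones = (n == 32) ? UINT32_MAX : ((1u << n) - 1u);
    return ones << pos;
}

/* Find and claim n contiguous free blocks (assumption: bit set = free).
 * Returns the index of the first block, or -1 if no run is large enough. */
static int alloc_blocks(uint32_t *blocks_bm, int n) {
    assert(n >= 1 && n <= 32);
    int pos = lowest_set(*blocks_bm);
    while (pos >= 0 && pos + n <= 32) {
        uint32_t mask = run_mask(pos, n);
        uint32_t in_use = mask & ~*blocks_bm;   /* in-use blocks inside the window */
        if (in_use == 0) {
            *blocks_bm &= ~mask;                /* whole window is free: claim it */
            return pos;
        }
        int next = highest_set(in_use) + 1;     /* skip past the last in-use block */
        if (next >= 32) break;
        uint32_t rem_blocks = *blocks_bm & ~((1u << next) - 1u);
        pos = lowest_set(rem_blocks);           /* jump to the next free block */
    }
    return -1;
}

/* Return n blocks starting at pos to the free-block bitmap. */
static void free_blocks(uint32_t *blocks_bm, int pos, int n) {
    *blocks_bm |= run_mask(pos, n);
}
```

Each failed window lets the search jump past every block it has already ruled out, so the loop runs at most once per in-use run rather than once per block.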

Small Region Allocation Unused parts of an allocated block can be reused by sub-block sized allocations Blocks are split into power of two sized regions, in a Binary Buddy type approach Free regions are stored in per-size free lists
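As an illustration of the Binary Buddy style splitting mentioned here, the sketch below uses the textbook buddy scheme rather than SMA's exact rules, which the slides do not spell out. BLOCK_SIZE, MIN_REGION and all function names are hypothetical values and names chosen for the example.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 128u  /* hypothetical block size in bytes */
#define MIN_REGION 8u    /* hypothetical smallest region size in bytes */

/* Round a request up to the next power-of-two region size. */
static uint32_t region_size(uint32_t bytes) {
    assert(bytes >= 1 && bytes <= BLOCK_SIZE);
    uint32_t size = MIN_REGION;
    while (size < bytes) size <<= 1;
    return size;
}

/* Index of the per-size free list: 0 for MIN_REGION, 1 for twice that, ... */
static int size_class(uint32_t size) {
    int c = 0;
    while ((MIN_REGION << c) < size) c++;
    return c;
}

/* Offset of a region's buddy inside the block: flip the bit equal to size. */
static uint32_t buddy_offset(uint32_t offset, uint32_t size) {
    return offset ^ size;
}

/* Split a whole block to serve a request. The allocated region sits at
 * offset 0; each upper half produced on the way down is recorded as free.
 * Returns the number of free regions written to free_off / free_size. */
static int split_block(uint32_t bytes, uint32_t free_off[], uint32_t free_size[]) {
    uint32_t want = region_size(bytes);
    uint32_t size = BLOCK_SIZE;
    int n = 0;
    while (size > want) {
        size >>= 1;
        free_off[n] = size;   /* upper half starts at offset == its size */
        free_size[n] = size;
        n++;
    }
    return n;
}
```

With these example constants, a 20-byte request is served by a 32-byte region at offset 0, leaving a 64-byte region and a 32-byte region to be placed on their free lists.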

Coalescing Freed Regions We wanted to avoid boundary tags Instead the orderly way in which regions are split is exploited A word sized coalesce tag stores the coalesce details for all regions in a block

Deferred Coalescing SMA (CAM) –Any size can have coalescing deferred –Content addressable memory is used to associate the size of deferred coalesced regions with the regions themselves SMA (LM) –Sizes for which coalescing can be deferred are chosen at compile time –Deferred regions are stored in an array in local memory

Experimental Setup Intel IXP 2350 –Network processor –4 microengine cores with 4kB local scratch-pad each –Access to another 16kB of shared scratch-pad Compared against Doug Lea’s malloc
Benchmark  Description
a2p        Conversion of a 15kB text file to PostScript
gcc        Compilation of the file “combine.c” in the gcc source, using gcc
gst        Ghostscript extraction of a 682kB PostScript file
cvt        Application of the charcoal filter to a 1024x768 JPEG image using ImageMagick
ogg        Encoding of a 20 second WAV file using the ogg encoder
pyt        Execution of the python example file “md5driver.py”
tar        Archive and gzip compression of 27 files in 4 directories into a 1MB archive

Allocation Performance

Free Performance

Memory Wastage

Lock-Free Block Allocation State for large blocks is stored in the free-block bitmap A simple lock-free update algorithm can be used to protect this bitmap –Uses the test and clear primitive
[Figure: the global bitmap and the views of Thread 1 and Thread 2, updated by a Test & Clear step followed by an Atomic Set step]
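One way the Test & Clear and Atomic Set steps could fit together is sketched below with C11 atomics standing in for the IXP hardware primitives. This is an assumption about the scheme, not the authors' code: atomic_fetch_and plays the role of test and clear, atomic_fetch_or the role of atomic set, a set bit is taken to mean "free", and the names try_claim and release are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to claim every block selected by mask (assumption: bit set = free).
 * Returns true only if all of the requested blocks were free. */
static bool try_claim(_Atomic uint32_t *blocks_bm, uint32_t mask) {
    assert(mask != 0);
    /* Test and clear: clear the wanted bits and observe their old values. */
    uint32_t old = atomic_fetch_and(blocks_bm, ~mask);
    uint32_t got = old & mask;   /* bits this thread actually cleared */
    if (got == mask) {
        return true;             /* every wanted block was free: claimed */
    }
    /* Partial claim: another thread holds some of the blocks. Give back
     * only the bits this thread cleared, using an atomic set. */
    if (got != 0) {
        atomic_fetch_or(blocks_bm, got);
    }
    return false;
}

/* Release previously claimed blocks with an atomic set. */
static void release(_Atomic uint32_t *blocks_bm, uint32_t mask) {
    atomic_fetch_or(blocks_bm, mask);
}
```

A bit cleared by one thread reads as zero to every other thread until it is set again, so the bits in got are owned exclusively by the caller and can be restored without a compare-and-swap loop. A caller that gets false would move on to the next candidate window and retry.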

Protecting Small Region Lists Locks are used to protect the free-lists used for small size allocation –SMA Coarse uses one lock –SMA Fine uses one lock per size class In SMA Fine, when regions are being coalesced, two locks must be held briefly
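The two-lock case in SMA Fine can be illustrated with a toy model. The slides do not say how the locks are ordered, so the sketch assumes they are always taken in ascending size-class order to avoid deadlock. Everything here is hypothetical: the spinlocks built on atomic_flag, the free_count field standing in for a real free list, and the buddy_free parameter standing in for the coalesce-tag check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CLASSES 4  /* hypothetical number of size classes */

typedef struct {
    atomic_flag lock;  /* one lock per size class, as in SMA Fine */
    int free_count;    /* stand-in for the class's free list */
} size_class_t;

static size_class_t classes[NUM_CLASSES];

/* Put every lock in the unlocked state and empty every list. */
static void sma_init(void) {
    for (int i = 0; i < NUM_CLASSES; i++) {
        atomic_flag_clear(&classes[i].lock);
        classes[i].free_count = 0;
    }
}

static void lock_class(int c) {
    while (atomic_flag_test_and_set(&classes[c].lock)) { /* spin */ }
}

static void unlock_class(int c) {
    atomic_flag_clear(&classes[c].lock);
}

/* Free a region of class c. If its buddy is already free, unlink the buddy
 * from class c and insert the merged region into class c + 1, holding both
 * locks for the duration of the move. */
static void free_region(int c, bool buddy_free) {
    assert(c >= 0 && c < NUM_CLASSES);
    bool merge = buddy_free && c + 1 < NUM_CLASSES;
    lock_class(c);
    if (merge) {
        lock_class(c + 1);            /* ascending order: c, then c + 1 */
        classes[c].free_count--;      /* unlink the buddy from class c */
        classes[c + 1].free_count++;  /* insert the merged region */
        unlock_class(c + 1);
    } else {
        classes[c].free_count++;      /* no merge: only one lock needed */
    }
    unlock_class(c);
}
```

The second lock is held only while the merged region moves between lists, which matches the slide's remark that two locks are held briefly. SMA Coarse would replace the array of locks with a single one.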

Concurrency Scaling

Future Work Provide the illusion of a single memory Let runtime worry about data placement Data can be annotated to give hints to the runtime system

Conclusion Tiny memories need to be managed too SMA is a simple and efficient algorithm for dynamic management of small memories –Fixed size block allocation is simple and has low state overheads –Splitting partially used blocks to be reused by small allocations limits fragmentation SMA can be augmented to support concurrent requests from multiple cores

Questions?

16kB Management Allocation

16kB Management Free

16kB Management Waste
