Shoal: smart allocation and replication of memory for parallel programs Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris ATC’15 March 31st, 2016 Cho, Hyojae
CONTENTS Introduction Motivation Array Implementation Evaluation Conclusion
1. Introduction Memory allocation in NUMA (Non-Uniform Memory Access) multi-core machines
1. Introduction Methods: (1) Manual configuration by programmers: they struggle to develop software that applies these techniques, and must repeatedly make manual changes. (2) Relying on automatic online monitoring to decide how to migrate data: may be expensive, and supports only a small number of optimizations.
2. Motivation “memset()” considered harmful on multi-core
2. Motivation Shoal: A system that abstracts memory access and provides a rich programming interface. It automatically tunes data placement and access based on memory access patterns. Programmers need not know where the data is stored.
2. Motivation Shoal: A new interface for memory allocation, including a machine-aware “malloc” call. An abstraction for data access based on arrays. All implementations can be interchanged transparently, without the need to change programs.
3. Array Array types: Single-node allocation: allocates the entire array on the local node. Distribution: allocates data split equally across NUMA nodes. Replication: several copies of the array are allocated. Partitioning: allocates data where work units can be executed locally.
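To make the "split equally across NUMA nodes" idea concrete, here is a minimal sketch of mapping an array index to the node that holds it. It assumes contiguous, equal-sized blocks per node; the function name `owner_node` is hypothetical and is not part of Shoal, and real systems usually place memory at page granularity rather than per element.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, for illustration only (not Shoal's API).
// Returns the node that holds element i when n elements are split into
// contiguous blocks of (at most) ceil(n / num_nodes) elements each.
std::size_t owner_node(std::size_t i, std::size_t n, std::size_t num_nodes) {
    assert(num_nodes > 0 && i < n);
    std::size_t chunk = (n + num_nodes - 1) / num_nodes;  // ceil(n / num_nodes)
    return i / chunk;
}
```

With 10 elements on 4 nodes, each block holds up to 3 elements, so the last node receives the single remaining element.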
3. Array Selection of arrays: Goals: maximize local access to minimize interconnect traffic; load-balance memory on all available controllers. Rules: Partitioning, if the array is only accessed via an index. Replication, if the array is read-only and fits into every NUMA node. Otherwise, use a uniform distribution.
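The selection rules can be written as a small decision function. This is a sketch under assumptions: the rules are applied in the order the slide lists them, "fits into every NUMA node" is modeled as a comparison against the smallest free memory of any node, and all names (`ArrayKind`, `AccessPattern`, `select_array`) are hypothetical rather than taken from Shoal.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not Shoal's actual API.
enum class ArrayKind { Partitioned, Replicated, Distributed };

struct AccessPattern {
    bool indexed_only;  // the array is only accessed via an index
    bool read_only;     // the array is never written after initialization
};

// Applies the three rules in the order given on the slide.
// min_free_bytes_per_node: free memory on the most constrained node (assumed input).
ArrayKind select_array(const AccessPattern& p,
                       std::size_t array_bytes,
                       std::size_t min_free_bytes_per_node) {
    if (p.indexed_only) {
        return ArrayKind::Partitioned;
    }
    if (p.read_only && array_bytes <= min_free_bytes_per_node) {
        return ArrayKind::Replicated;
    }
    return ArrayKind::Distributed;  // fallback: uniform distribution
}
```

A read-only array that is too large for one node falls through to the uniform distribution, which matches the "otherwise" rule.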
3. Array Selection of arrays
4. Implementation The Shoal runtime library: A high-level array representation based on C++ templates. A low-level, OS-specific backend.
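A rough sketch of how a template-based array representation lets implementations be swapped without changing client code. The classes below are hypothetical and only model the access logic: plain `std::vector` storage stands in for NUMA-aware allocation, and the caller passes its node id explicitly, which a real runtime would presumably determine itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classes for illustration; not Shoal's real types.
template <typename T>
class SingleNodeArray {
public:
    explicit SingleNodeArray(std::size_t n) : data_(n) {}
    std::size_t size() const { return data_.size(); }
    // The node argument is ignored: there is only one copy.
    T get(std::size_t i, std::size_t /*node*/) const { return data_[i]; }
    void set(std::size_t i, const T& v) { data_[i] = v; }
private:
    std::vector<T> data_;
};

template <typename T>
class ReplicatedArray {
public:
    // Assumes num_nodes >= 1.
    ReplicatedArray(std::size_t n, std::size_t num_nodes)
        : replicas_(num_nodes, std::vector<T>(n)) {}
    std::size_t size() const { return replicas_.front().size(); }
    // Reads are served from the caller's own replica.
    T get(std::size_t i, std::size_t node) const { return replicas_[node][i]; }
    // Writes update every replica so that all copies stay consistent.
    void set(std::size_t i, const T& v) {
        for (auto& r : replicas_) r[i] = v;
    }
private:
    std::vector<std::vector<T>> replicas_;
};

// Client code is written once against the shared size/get/set interface.
template <typename Array>
long sum_on_node(const Array& a, std::size_t node) {
    long total = 0;
    for (std::size_t i = 0; i < a.size(); ++i) total += a.get(i, node);
    return total;
}
```

Because `sum_on_node` depends only on the interface, either array type can be passed in. The extra cost of `set` on the replicated version is one reason replication suits read-only data.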
4. Implementation An example of a high-level DSL (Domain-Specific Language): Foreach (t: G.Nodes) means the nodes array will be accessed sequentially and by index. Sum(w: t.InNbrs) implies read-only, indexed accesses on the in-neighbors array.
4. Implementation High-level program: written in a high-level parallel language, such as Green-Marl or OptiML. High-level compiler: translates high-level code to low-level code. Low-level code with array abstractions: written in C++; it uses Shoal’s abstractions to allocate and access memory. The concrete choice of array implementation is not made at compile time.
4. Implementation Access patterns: information about load/store patterns, such as the read/write ratio. Shoal library: selects array implementations based on the extracted access patterns. OS-specific backends: Shoal currently runs on Linux and Barrelfish.
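The access patterns implied by the two DSL constructs can be seen in an equivalent low-level loop. The sketch below assumes a CSR-style layout of incoming edges and a per-node value being summed; the `Graph` struct, the field names, and the summed quantity are hypothetical and chosen only to illustrate the pattern.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CSR layout of incoming edges (illustration only).
struct Graph {
    std::vector<std::size_t> in_begin;  // size num_nodes + 1; must be non-empty
    std::vector<std::size_t> in_nbrs;   // concatenated in-neighbor ids
};

// Loosely corresponds to: for each node t, sum val[w] over in-neighbors w.
std::vector<double> sum_in_neighbors(const Graph& g,
                                     const std::vector<double>& val) {
    std::size_t n = g.in_begin.size() - 1;
    std::vector<double> out(n, 0.0);
    for (std::size_t t = 0; t < n; ++t) {          // nodes: sequential, indexed by t
        for (std::size_t e = g.in_begin[t]; e < g.in_begin[t + 1]; ++e) {
            out[t] += val[g.in_nbrs[e]];           // in-neighbors: read-only, indexed
        }
    }
    return out;
}
```

The outer loop touches `out` and `in_begin` by loop index, while `in_nbrs` and `val` are only read, which is the kind of information the selection rules rely on.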
5. Evaluation Goals: Compare Shoal with a regular memory runtime. Compare Shoal’s array implementations. Analyze Shoal’s initialization cost. Investigate the benefits of using a DMA engine for array copy.
5. Evaluation Machines
5. Evaluation Scalability (Green-Marl): Almost 2x faster than the original implementation.
5. Evaluation Scalability (PARSEC - Streamcluster): One of the arrays used is replaced with a Shoal array. 4x faster than the original implementation.
5. Evaluation Use DMA engines
6. Conclusion Shoal: a library that provides an array abstraction and rich memory allocation functions. These allow automatic tuning of data placement and access, depending on workload and machine characteristics. 2x improvement for Green-Marl programs without changing the Green-Marl input program.