NUMA (YEY)
By Jacob Kugler

MOTIVATION
The next generation of EMC VPLEX hardware is NUMA based.
– What is the expected performance benefit?
– How should the code best be adapted to NUMA?
Gain experience with NUMA tools.

VPLEX OVERVIEW
A unique virtual storage technology that enables:
– Data mobility and high availability within and between data centers.
– Mission-critical continuous availability between two synchronous sites.
– Distributed RAID 1 between two sites.

UMA OVERVIEW – CURRENT STATE
Uniform Memory Access
[Diagram: CPU0 through CPU5 all share a single RAM, with the same access path for every CPU.]

NUMA OVERVIEW – NEXT GENERATION
Non-Uniform Memory Access
[Diagram: two nodes, NODE0 and NODE1, each with its own CPUs and local RAM; accessing the other node's RAM is a remote access.]

POLICIES
NUMA policies control:
– Allocation of memory on specific nodes
– Binding threads to specific nodes/CPUs
A policy can be applied to:
– A process
– A memory area

POLICIES CONT.
Name         Description
default      Allocate on the local node (the node the thread is running on)
bind         Allocate on a specific set of nodes
interleave   Interleave memory allocations across a set of nodes
preferred    Try to allocate on a given node first; fall back to other nodes if that fails
* Policies can also be applied to shared memory regions.

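As an illustration of attaching a policy to a memory area, here is a small sketch assuming the libnuma API (this is not VPLEX code; the node numbers and region size are arbitrary):

    /* Sketch only (not VPLEX code): attaching a NUMA policy to a memory
     * area with libnuma.  Node numbers and the region size are arbitrary.
     * Build with: gcc area_policy.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }

        /* A page-aligned anonymous region; its pages are not placed yet. */
        size_t size = 64UL * 1024 * 1024;
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* "bind"-style policy for this area: place its pages on node 0
         * (takes effect as the pages are first touched). */
        numa_tonode_memory(buf, size, 0);

        /* An "interleave" policy for the same area would instead be:
         *     numa_interleave_memory(buf, size, numa_all_nodes_ptr);
         * and a thread-wide "preferred" policy:
         *     numa_set_preferred(1);
         */

        munmap(buf, size);
        return 0;
    }
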
DEFAULT POLICY
[Diagram: a thread running on node 0 allocates from node 0's memory and a thread running on node 1 allocates from node 1's memory; all accesses are local.]

BIND/PREFERRED POLICY
[Diagram: memory is allocated on node 0; the thread running on node 0 accesses it locally, while the thread running on node 1 accesses it remotely.]

INTERLEAVE POLICY
[Diagram: memory is interleaved across the nodes, so a running thread's accesses alternate between local pages and remote pages.]

NUMACTL
Command-line tool for running a program under a specific NUMA policy.
Useful for programs that cannot be modified or recompiled.

NUMACTL EXAMPLES
numactl --cpubind=0 --membind=0,1
– Run the program on node 0 and allocate memory from nodes 0 and 1.
numactl --interleave=all
– Run the program with memory interleaved across all available nodes.

LIBNUMA
A library that offers an API for applying NUMA policies from within a program.
Allows fine-grained tuning of NUMA policies.
– Changing the policy in one thread does not affect other threads.

LIBNUMA EXAMPLES
numa_available() – checks whether NUMA is supported on the system.
numa_run_on_node(int node) – binds the current thread to a specific node.
numa_max_node() – returns the number of the highest node in the system.
numa_alloc_interleaved(size_t size) – allocates size bytes of memory, page-interleaved across all available nodes.
numa_alloc_onnode(size_t size, int node) – allocates size bytes of memory on a specific node.

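A minimal sketch tying these calls together (error handling trimmed; the node choice and allocation size are arbitrary):

    /* Minimal libnuma sketch (error handling trimmed; node numbers and
     * sizes are arbitrary).  Build with: gcc numa_demo.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {                  /* is NUMA supported? */
            fprintf(stderr, "NUMA not supported\n");
            return 1;
        }

        int max_node = numa_max_node();              /* e.g. 1 on a 2-node system */
        printf("nodes: 0..%d\n", max_node);

        numa_run_on_node(0);                         /* bind this thread to node 0 */

        size_t size = 1UL << 20;                     /* 1 MB */
        char *local  = numa_alloc_onnode(size, 0);   /* placed on node 0 */
        char *spread = numa_alloc_interleaved(size); /* interleaved over all nodes */

        local[0]  = 1;                               /* first touch faults the pages in */
        spread[0] = 1;

        numa_free(local, size);
        numa_free(spread, size);
        return 0;
    }
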
HARDWARE OVERVIEW
[Diagram: two nodes, each with 6 cores (2 hyper-threads per core), private L1 and L2 caches per core, a shared L3 cache, and local RAM; the nodes are connected by QPI (Quick Path Interconnect).]

HARDWARE OVERVIEW
Processor:             Intel Xeon Processor E5-2620
# Cores:               6
# Threads:             12
QPI speed:             8.0 GT/s = 64 GB/s
L1 data cache:         32 KB
L1 instruction cache:  32 KB
L2 cache:              256 KB
L3 cache:              15 MB
RAM:                   62.5 GB
(GT/s = gigatransfers per second; GB/s = GT/s * bus bandwidth of 8 B, so 8.0 GT/s * 8 B = 64 GB/s.)

LINUX PERF TOOL
Command-line profiler based on perf_events:
– Hardware events – counted by the CPU
– Software events – counted by the kernel
perf list – lists the pre-defined events (to be used with -e):
– instructions [Hardware event]
– context-switches OR cs [Software event]
– L1-dcache-loads [Hardware cache event]
– rNNN [Raw hardware event descriptor]

PERF STAT
Keeps a running count of selected events during process execution.
perf stat [options] -e [list of events]
Examples:
– perf stat -e page-faults my_exec
  Counts the page faults that occur while my_exec runs.
– perf stat -a -e instructions,r81d0 sleep 5
  System-wide count on all CPUs for 5 seconds; counts #instructions and L1 d-cache loads (raw event r81d0).

CHARACTERIZING OUR SYSTEM
Linux perf tool, CPU performance counters:
– L1-dcache-loads
– L1-dcache-stores
Test: ran IO for 120 seconds.
Result: RD/WR ratio = 2:1

THE SIMULATOR
Measures performance for different memory allocation policies on a 2-node system.
Throughput is measured as the time it takes to complete N iterations.
Threads randomly access a shared memory region.

THE SIMULATOR CONT.
Config file parameters:
– #Threads
– RD/WR ratio – ratio between the number of read and write operations a thread performs
– Policy – local / interleave / remote
– Size – the size of memory to allocate
– #Iterations
– Node0/Node1 – ratio between threads bound to node 0 and threads bound to node 1
– RW_SIZE – size of a read or write operation in each iteration

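The simulator's source is not part of the slides; the following is a speculative sketch of what a worker loop driven by these parameters might look like. Apart from the parameter values, everything here (thread layout, random-access scheme, timing code) is an assumption.

    /* Speculative reconstruction of the simulator's worker loop; not the
     * real code.  Build with: gcc sim_sketch.c -lnuma -lpthread
     */
    #include <numa.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define RW_SIZE    128          /* bytes per read/write operation */
    #define ITERATIONS 100000000UL  /* #Iterations                    */
    #define RD_PER_WR  2            /* RD/WR ratio = 2:1              */
    #define NTHREADS   8            /* #Threads                       */

    static char  *shared_mem;       /* allocated according to the chosen policy */
    static size_t shared_size;

    static void *worker(void *arg) {
        int node = *(int *)arg;
        numa_run_on_node(node);                     /* Node0/Node1 binding */

        unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)node;
        char buf[RW_SIZE];
        memset(buf, 0, sizeof buf);

        for (unsigned long i = 0; i < ITERATIONS; i++) {
            size_t off = (size_t)rand_r(&seed) % (shared_size - RW_SIZE);
            if (i % (RD_PER_WR + 1) == RD_PER_WR)
                memcpy(shared_mem + off, buf, RW_SIZE);   /* write op */
            else
                memcpy(buf, shared_mem + off, RW_SIZE);   /* read op  */
        }
        return NULL;
    }

    int main(void) {
        if (numa_available() < 0)
            return 1;

        shared_size = 2UL * 150 * 1024 * 1024;             /* Size = 2 * 150 MB   */
        shared_mem  = numa_alloc_interleaved(shared_size); /* Policy = interleave */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        pthread_t tid[NTHREADS];
        int node[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {               /* Node0/Node1 = 1:1 */
            node[i] = i % 2;
            pthread_create(&tid[i], NULL, worker, &node[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("completed %lu iterations per thread in %.2f s\n", ITERATIONS, secs);

        numa_free(shared_mem, shared_size);
        return 0;
    }
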
EXPERIMENT #1
Compare the performance of 3 policies:
– Local – threads access memory on the node they run on.
– Remote – threads access memory on a different node from the one they run on.
– Interleave – memory is interleaved across the nodes (threads access both local and remote memory).

EXPERIMENT #1
3 policies – local, interleave, remote.
#Threads varies from 1 to 24 (the maximal number of concurrent threads in the system).
2 setups – balanced and unbalanced workload.
[Diagrams: thread placement in the balanced and unbalanced setups.]

EXPERIMENT #1
Configuration:
– #Iterations = 100,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 2:1
– RW_SIZE = 128 bytes

RESULTS - BALANCED WORKLOAD
Time measured until the last thread finished working.
[Chart: completion time per policy and thread count; annotated deltas of -37%, +69%, -46%, +83%.]

RESULTS - UNBALANCED WORKLOAD
[Chart: completion time per policy and thread count; annotated deltas of -35%, +73%, -45%, +87%.]

RESULTS - COMPARED
[Chart: local, remote, and interleave policies compared for the balanced and unbalanced workloads.]

CONCLUSIONS
The more concurrent threads in the system, the larger the impact of memory locality on performance.
In applications where the number of concurrent threads is at most the number of cores in one node, the best solution is to bind the process and allocate memory on the same node.
In applications where the number of concurrent threads is up to the number of cores in a 2-node system, disabling NUMA (interleaving memory) gives performance similar to binding the process and allocating memory on the same node.

EXPERIMENT #2
Local access is significantly faster than remote access.
Our system uses RW locks to synchronize memory access.
Does maintaining read locality by mirroring the data on both nodes perform better than the current interleave policy?

EXPERIMENT #2
Purpose: find the RD/WR ratio at which maintaining read locality beats memory interleaving.
Setup 1: Interleaving
– A single RW lock
– Data is interleaved across both nodes
Setup 2: Mirroring the data
– An RW lock per node
– Each read operation accesses local memory.
– Each write operation is done to both local and remote memory.

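A rough sketch of the two locking schemes (the data layout, helper names, and lock ordering are illustrative assumptions, not the VPLEX implementation):

    /* Sketch of the two synchronization schemes in Experiment #2; the data
     * layout, lock ordering, and helper names are illustrative assumptions.
     */
    #include <pthread.h>
    #include <string.h>

    /* Setup 1: a single RW lock, one copy of the data interleaved across
     * both nodes (e.g. via numa_alloc_interleaved). */
    static pthread_rwlock_t single_lock = PTHREAD_RWLOCK_INITIALIZER;
    static char *interleaved_data;

    static void interleaved_read(size_t off, char *dst, size_t len) {
        pthread_rwlock_rdlock(&single_lock);
        memcpy(dst, interleaved_data + off, len);
        pthread_rwlock_unlock(&single_lock);
    }

    static void interleaved_write(size_t off, const char *src, size_t len) {
        pthread_rwlock_wrlock(&single_lock);
        memcpy(interleaved_data + off, src, len);
        pthread_rwlock_unlock(&single_lock);
    }

    /* Setup 2: a replica and an RW lock per node (e.g. replica[n]
     * allocated with numa_alloc_onnode(size, n)). */
    static pthread_rwlock_t node_lock[2] = {
        PTHREAD_RWLOCK_INITIALIZER, PTHREAD_RWLOCK_INITIALIZER
    };
    static char *replica[2];

    /* Read: only the local replica and the local lock are touched. */
    static void mirrored_read(int node, size_t off, char *dst, size_t len) {
        pthread_rwlock_rdlock(&node_lock[node]);
        memcpy(dst, replica[node] + off, len);
        pthread_rwlock_unlock(&node_lock[node]);
    }

    /* Write: both replicas are updated, so both locks are taken,
     * always in the same order to avoid deadlock. */
    static void mirrored_write(size_t off, const char *src, size_t len) {
        pthread_rwlock_wrlock(&node_lock[0]);
        pthread_rwlock_wrlock(&node_lock[1]);
        memcpy(replica[0] + off, src, len);
        memcpy(replica[1] + off, src, len);
        pthread_rwlock_unlock(&node_lock[1]);
        pthread_rwlock_unlock(&node_lock[0]);
    }
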
EXPERIMENT #2
Configuration:
– #Iterations = 25,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 12 : i, for 1 <= i <= 12
– #Threads = 8 ; 12
– RW_SIZE = 512 ; 1024 ; 2048 ; 4096 bytes

RW LOCKS – MIRRORING VS. INTERLEAVING
[Chart: mirroring vs. interleaving results with 8 threads.]

RW LOCKS – MIRRORING VS. INTERLEAVING
[Chart: mirroring vs. interleaving results with 12 threads.]

CONCLUSIONS
Both the memory-operation size and the percentage of write operations play a role in deciding which memory allocation policy is better.
In applications with a small mem-op size (512 B) and up to 50% write operations, mirroring is the better option.
In applications with a mem-op size of 4 KB or more, mirroring is worse than interleaving the memory and using a single RW lock.

SUMMARY
Fine-grained memory allocation can lead to performance improvements for certain workloads.
– More investigation is needed to configure a suitable memory policy that exploits NUMA's capabilities.