NUMA (YEY)
By Jacob Kugler

MOTIVATION
The next generation of EMC VPLEX hardware is NUMA based.
– What is the expected performance benefit?
– How should the code best be adapted to NUMA?
Gain experience with NUMA tools.

VPLEX OVERVIEW
A unique virtual storage technology that enables:
– Data mobility and high availability within and between data centers.
– Mission-critical continuous availability between two synchronous sites.
– Distributed RAID 1 between two sites.

UMA OVERVIEW – CURRENT STATE
Uniform Memory Access
[Diagram: CPU0 through CPU5 all share a single RAM, with the same access path for every CPU.]

NUMA OVERVIEW – NEXT GENERATION
Non-Uniform Memory Access
[Diagram: two nodes, NODE0 and NODE1, each with its own CPUs and local RAM; accessing the other node's RAM is a remote access.]

POLICIES
NUMA policies control:
– Allocation of memory on specific nodes
– Binding threads to specific nodes/CPUs
A policy can be applied to:
– A process
– A memory area

POLICIES CONT.
Name         Description
default      Allocate on the local node (the node the thread is running on)
bind         Allocate on a specific set of nodes
interleave   Interleave memory allocations across a set of nodes
preferred    Try to allocate on a given node first; fall back to other nodes if that fails
* Policies can also be applied to shared memory regions.

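As an illustration of attaching a policy to a memory area, here is a small sketch assuming the libnuma API (this is not VPLEX code; the node numbers and region size are arbitrary):

    /* Sketch only (not VPLEX code): attaching a NUMA policy to a memory
     * area with libnuma.  Node numbers and the region size are arbitrary.
     * Build with: gcc area_policy.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }

        /* A page-aligned anonymous region; its pages are not placed yet. */
        size_t size = 64UL * 1024 * 1024;
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* "bind"-style policy for this area: place its pages on node 0
         * (takes effect as the pages are first touched). */
        numa_tonode_memory(buf, size, 0);

        /* An "interleave" policy for the same area would instead be:
         *     numa_interleave_memory(buf, size, numa_all_nodes_ptr);
         * and a thread-wide "preferred" policy:
         *     numa_set_preferred(1);
         */

        munmap(buf, size);
        return 0;
    }
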
DEFAULT POLICY
[Diagram: a thread running on node 0 allocates from node 0's memory and a thread running on node 1 allocates from node 1's memory; all accesses are local.]

BIND/PREFERRED POLICY
[Diagram: memory is allocated on node 0; the thread running on node 0 accesses it locally, while the thread running on node 1 accesses it remotely.]

INTERLEAVE POLICY
[Diagram: memory is interleaved across the nodes, so a running thread's accesses alternate between local pages and remote pages.]

NUMACTL
Command-line tool for running a program under a specific NUMA policy.
Useful for programs that cannot be modified or recompiled.

NUMACTL EXAMPLES
numactl --cpubind=0 --membind=0,1
– Run the program on node 0 and allocate memory from nodes 0 and 1.
numactl --interleave=all
– Run the program with memory interleaved across all available nodes.

LIBNUMA
A library that offers an API for applying NUMA policies from within a program.
Allows fine-grained tuning of NUMA policies.
– Changing the policy in one thread does not affect other threads.

LIBNUMA EXAMPLES
numa_available() – checks whether NUMA is supported on the system.
numa_run_on_node(int node) – binds the current thread to a specific node.
numa_max_node() – returns the number of the highest node in the system.
numa_alloc_interleaved(size_t size) – allocates size bytes of memory, page-interleaved across all available nodes.
numa_alloc_onnode(size_t size, int node) – allocates size bytes of memory on a specific node.

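A minimal sketch tying these calls together (error handling trimmed; the node choice and allocation size are arbitrary):

    /* Minimal libnuma sketch (error handling trimmed; node numbers and
     * sizes are arbitrary).  Build with: gcc numa_demo.c -lnuma
     */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {                  /* is NUMA supported? */
            fprintf(stderr, "NUMA not supported\n");
            return 1;
        }

        int max_node = numa_max_node();              /* e.g. 1 on a 2-node system */
        printf("nodes: 0..%d\n", max_node);

        numa_run_on_node(0);                         /* bind this thread to node 0 */

        size_t size = 1UL << 20;                     /* 1 MB */
        char *local  = numa_alloc_onnode(size, 0);   /* placed on node 0 */
        char *spread = numa_alloc_interleaved(size); /* interleaved over all nodes */

        local[0]  = 1;                               /* first touch faults the pages in */
        spread[0] = 1;

        numa_free(local, size);
        numa_free(spread, size);
        return 0;
    }
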
HARDWARE OVERVIEW
[Diagram: two nodes, each with 6 cores (2 hyper-threads per core), private L1 and L2 caches per core, a shared L3 cache, and local RAM; the nodes are connected by QPI (Quick Path Interconnect).]

HARDWARE OVERVIEW
Processor:             Intel Xeon Processor E5-2620
# Cores:               6
# Threads:             12
QPI speed:             8.0 GT/s = 64 GB/s
L1 data cache:         32 KB
L1 instruction cache:  32 KB
L2 cache:              256 KB
L3 cache:              15 MB
RAM:                   62.5 GB
(GT/s = gigatransfers per second; GB/s = GT/s * bus bandwidth of 8 B, so 8.0 GT/s * 8 B = 64 GB/s.)

LINUX PERF TOOL
Command-line profiler based on perf_events:
– Hardware events – counted by the CPU
– Software events – counted by the kernel
perf list – lists the pre-defined events (to be used with -e):
– instructions [Hardware event]
– context-switches OR cs [Software event]
– L1-dcache-loads [Hardware cache event]
– rNNN [Raw hardware event descriptor]

PERF STAT
Keeps a running count of selected events during process execution.
perf stat [options] -e [list of events]
Examples:
– perf stat -e page-faults my_exec
  Counts the page faults that occur while my_exec runs.
– perf stat -a -e instructions,r81d0 sleep 5
  System-wide count on all CPUs for 5 seconds; counts #instructions and L1 d-cache loads (raw event r81d0).

CHARACTERIZING OUR SYSTEM
Linux perf tool, CPU performance counters:
– L1-dcache-loads
– L1-dcache-stores
Test: ran IO for 120 seconds.
Result: RD/WR ratio = 2:1

THE SIMULATOR
Measures performance for different memory allocation policies on a 2-node system.
Throughput is measured as the time it takes to complete N iterations.
Threads randomly access a shared memory region.

THE SIMULATOR CONT.
Config file parameters:
– #Threads
– RD/WR ratio – ratio between the number of read and write operations a thread performs
– Policy – local / interleave / remote
– Size – the size of memory to allocate
– #Iterations
– Node0/Node1 – ratio between threads bound to node 0 and threads bound to node 1
– RW_SIZE – size of a read or write operation in each iteration

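The simulator's source is not part of the slides; the following is a speculative sketch of what a worker loop driven by these parameters might look like. Apart from the parameter values, everything here (thread layout, random-access scheme, timing code) is an assumption.

    /* Speculative reconstruction of the simulator's worker loop; not the
     * real code.  Build with: gcc sim_sketch.c -lnuma -lpthread
     */
    #include <numa.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define RW_SIZE    128          /* bytes per read/write operation */
    #define ITERATIONS 100000000UL  /* #Iterations                    */
    #define RD_PER_WR  2            /* RD/WR ratio = 2:1              */
    #define NTHREADS   8            /* #Threads                       */

    static char  *shared_mem;       /* allocated according to the chosen policy */
    static size_t shared_size;

    static void *worker(void *arg) {
        int node = *(int *)arg;
        numa_run_on_node(node);                     /* Node0/Node1 binding */

        unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)node;
        char buf[RW_SIZE];
        memset(buf, 0, sizeof buf);

        for (unsigned long i = 0; i < ITERATIONS; i++) {
            size_t off = (size_t)rand_r(&seed) % (shared_size - RW_SIZE);
            if (i % (RD_PER_WR + 1) == RD_PER_WR)
                memcpy(shared_mem + off, buf, RW_SIZE);   /* write op */
            else
                memcpy(buf, shared_mem + off, RW_SIZE);   /* read op  */
        }
        return NULL;
    }

    int main(void) {
        if (numa_available() < 0)
            return 1;

        shared_size = 2UL * 150 * 1024 * 1024;             /* Size = 2 * 150 MB   */
        shared_mem  = numa_alloc_interleaved(shared_size); /* Policy = interleave */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        pthread_t tid[NTHREADS];
        int node[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {               /* Node0/Node1 = 1:1 */
            node[i] = i % 2;
            pthread_create(&tid[i], NULL, worker, &node[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("completed %lu iterations per thread in %.2f s\n", ITERATIONS, secs);

        numa_free(shared_mem, shared_size);
        return 0;
    }
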
EXPERIMENT #1
Compare the performance of 3 policies:
– Local – threads access memory on the node they run on.
– Remote – threads access memory on a different node from the one they run on.
– Interleave – memory is interleaved across the nodes (threads access both local and remote memory).

EXPERIMENT #1
3 policies – local, interleave, remote.
#Threads varies from 1 to 24 (the maximal number of concurrent threads in the system).
2 setups – balanced and unbalanced workload.
[Diagrams: thread placement in the balanced and unbalanced setups.]

EXPERIMENT #1
Configuration:
– #Iterations = 100,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 2:1
– RW_SIZE = 128 bytes

RESULTS - BALANCED WORKLOAD
Time measured until the last thread finished working.
[Chart: completion time per policy and thread count; annotated deltas of -37%, +69%, -46%, +83%.]

RESULTS - UNBALANCED WORKLOAD
[Chart: completion time per policy and thread count; annotated deltas of -35%, +73%, -45%, +87%.]

RESULTS - COMPARED
[Chart: local, remote, and interleave policies compared for the balanced and unbalanced workloads.]

CONCLUSIONS
The more concurrent threads in the system, the larger the impact of memory locality on performance.
In applications where the number of concurrent threads is at most the number of cores in one node, the best solution is to bind the process and allocate memory on the same node.
In applications where the number of concurrent threads is up to the number of cores in a 2-node system, disabling NUMA (interleaving memory) gives performance similar to binding the process and allocating memory on the same node.

EXPERIMENT #2
Local access is significantly faster than remote access.
Our system uses RW locks to synchronize memory access.
Does maintaining read locality by mirroring the data on both nodes perform better than the current interleave policy?

EXPERIMENT #2
Purpose: find the RD/WR ratio at which maintaining read locality beats memory interleaving.
Setup 1: Interleaving
– A single RW lock
– Data is interleaved across both nodes
Setup 2: Mirroring the data
– An RW lock per node
– Each read operation accesses local memory.
– Each write operation is done to both local and remote memory.

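A rough sketch of the two locking schemes (the data layout, helper names, and lock ordering are illustrative assumptions, not the VPLEX implementation):

    /* Sketch of the two synchronization schemes in Experiment #2; the data
     * layout, lock ordering, and helper names are illustrative assumptions.
     */
    #include <pthread.h>
    #include <string.h>

    /* Setup 1: a single RW lock, one copy of the data interleaved across
     * both nodes (e.g. via numa_alloc_interleaved). */
    static pthread_rwlock_t single_lock = PTHREAD_RWLOCK_INITIALIZER;
    static char *interleaved_data;

    static void interleaved_read(size_t off, char *dst, size_t len) {
        pthread_rwlock_rdlock(&single_lock);
        memcpy(dst, interleaved_data + off, len);
        pthread_rwlock_unlock(&single_lock);
    }

    static void interleaved_write(size_t off, const char *src, size_t len) {
        pthread_rwlock_wrlock(&single_lock);
        memcpy(interleaved_data + off, src, len);
        pthread_rwlock_unlock(&single_lock);
    }

    /* Setup 2: a replica and an RW lock per node (e.g. replica[n]
     * allocated with numa_alloc_onnode(size, n)). */
    static pthread_rwlock_t node_lock[2] = {
        PTHREAD_RWLOCK_INITIALIZER, PTHREAD_RWLOCK_INITIALIZER
    };
    static char *replica[2];

    /* Read: only the local replica and the local lock are touched. */
    static void mirrored_read(int node, size_t off, char *dst, size_t len) {
        pthread_rwlock_rdlock(&node_lock[node]);
        memcpy(dst, replica[node] + off, len);
        pthread_rwlock_unlock(&node_lock[node]);
    }

    /* Write: both replicas are updated, so both locks are taken,
     * always in the same order to avoid deadlock. */
    static void mirrored_write(size_t off, const char *src, size_t len) {
        pthread_rwlock_wrlock(&node_lock[0]);
        pthread_rwlock_wrlock(&node_lock[1]);
        memcpy(replica[0] + off, src, len);
        memcpy(replica[1] + off, src, len);
        pthread_rwlock_unlock(&node_lock[1]);
        pthread_rwlock_unlock(&node_lock[0]);
    }
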
EXPERIMENT #2
Configuration:
– #Iterations = 25,000,000
– Data size = 2 * 150 MB
– RD/WR ratio = 12 : i, for 1 <= i <= 12
– #Threads = 8 ; 12
– RW_SIZE = 512 ; 1024 ; 2048 ; 4096 bytes

RW LOCKS – MIRRORING VS. INTERLEAVING
[Chart: mirroring vs. interleaving results with 8 threads.]

RW LOCKS – MIRRORING VS. INTERLEAVING
[Chart: mirroring vs. interleaving results with 12 threads.]

CONCLUSIONS
Both the memory-operation size and the percentage of write operations play a role in deciding which memory allocation policy is better.
In applications with a small mem-op size (512 B) and up to 50% write operations, mirroring is the better option.
In applications with a mem-op size of 4 KB or more, mirroring is worse than interleaving the memory and using a single RW lock.

SUMMARY
Fine-grained memory allocation can lead to performance improvements for certain workloads.
– More investigation is needed to configure a suitable memory policy that exploits NUMA's capabilities.