Good data structure experiments are r.a.r.e.


1 Good data structure experiments are r.a.r.e.
Trevor Brown, Technion

2 Why do we perform experiments?
To answer questions about data structures:
Is one data structure faster than another? Why?
We are asking about algorithmic differences, NOT engineering differences

3 The problem Typical data structure experiment
2x Intel E7-4830, 48 threads, 128GB RAM
Ubuntu 16.04 LTS, G++ with flags -mcx16 -O3
Binary search tree benchmark
Five 3-second trials, tree prefilled to half-full
24 threads do 50% insert, 50% delete, 100k keys

4 Which is the “true” performance?
operations per microsecond

5 Which is the “right” comparison?
[NM14] Lock-free external BST vs. [BCCO10] Optimistic AVL tree
In one configuration, BCCO10 is 168% faster than NM14; in another, NM14 is 84% faster than BCCO10.
If I give you either of these pairs of performance numbers, is this a "good" experiment? Is there any pair of performance numbers here that would qualify as a "good" experiment?
I haven't told you all of the configuration parameters, so you can't reproduce my results. You don't know if I'm making fair comparisons. You don't know if any of these configurations are realistic.
This is exactly what we deal with in the literature.

6 Good data structure Experiments are
Reproducible Apples-to-apples (fair) Realistic Explainable

7 Reproducibility: Crucial configuration parameters
Operating system: memory allocator, huge pages, thread pinning
Processor: prefetching mode, hyperthreading, turbo boost
Data structure: memory reclamation, object pooling

8 Apples-to-apples Comparisons
All data structures should use the same:
Configuration parameters
ADT (watch for differences: set vs dictionary, insert-replace vs insert-if-absent)
Engineering practices (inlining, int vs long)

9 Realistic experiments
Appropriate benchmarks
Realistic system configuration: fast scalable allocator
Realistic data structure implementation: memory reclamation (and free() calls)
Eliminate implementation errors

10 Thread pinning reveals sensitivity to NUMA
Comparison in [NM14]: NM14 33% faster than BCCO10
Apples-to-apples: NM14 33% faster; with thread pinning (which reveals sensitivity to NUMA): BCCO10 106% faster
Realistic configuration: NM14 1-6% faster

11 Explaining results Investigate with systems tools
Use performance counters (PAPI, Linux perf tools): L1/L2/L3 cache misses, stalls, cycles, instructions
Construct experiments to confirm explanations

12 How to perform R.A.R.E. experiments
Watch for: unfair or unrealistic comparisons; bugs; unfair or unrealistic engineering; forgotten or unrealistic parameters
Iterative process: decisions guided by R.A.R.E. principles
Takes months to understand and correct results
Relies on constant and extensive: sanity checks, systems-level analysis, hypothesizing and testing

13 Common Implementation errors
Test harness overhead
Misuse of C/C++ volatile
Memory leaks
False sharing
Bad padding/alignment
Data structure memory layout anomalies

14 [#1] Test harness overhead: impact on different data structures
[Graph: operations per microsecond vs. concurrent threads on an 8-thread Intel i7-4770, comparing the original test harness with the harness after reducing overhead; reducing overhead changed results by 2.2x to 3.1x]

15 Overhead of timing measurements
Data structure operations are no-ops; 64-thread AMD system
[Graph: # operations per get_time() call]
Implies that operation latency measurements may be tricky to get right.

16 [#2] C++ volatile keyword
Informs the compiler that an address may be changed by another thread
Prevents some optimizations that are illegal in a concurrent setting
Value-based validation:
v1 = *addr; [...] v2 = *addr; if (v1 != v2) return FAIL;
Without volatile, the compiler may optimize this to:
v1 = *addr; [...] v2 = v1; if (v1 != v2) return FAIL;
Now v1 != v2 is impossible: the validation has been eliminated!

17 Misuse of C/C++ Volatile
What is "left"?
node_t * left;
volatile node_t * left;
node_t volatile * left;
node_t * volatile left;
volatile node_t * volatile left;

18 Examples in the wild: missing volatiles
The original implementation of the [NM14] BST uses the following node type:
struct node_t { int key; AO_double_t children; };
AO_double_t is defined by the Atomic Ops (AO) library and is NOT volatile by default
Need "volatile AO_double_t children"

19 Examples in the wild: misplaced volatiles
The ASCYLIB implementation of the [NM14] BST uses the following node type:
struct node_t { skey_t key; sval_t value; volatile node_t * left; volatile node_t * right; char padding[32]; };
"left" is a pointer to a volatile node; we want a volatile pointer to a node: node_t * volatile left;

20 [#3] Checking for memory leaks: Using jemalloc
Profiling leaks in ./myprogram with PDF graph output:
env MALLOC_CONF=prof_leak:true,lg_prof_sample:0,prof_final:true LD_PRELOAD=libjemalloc.so ./myprogram
<jemalloc>: Leak approximation summary: bytes [...]
<jemalloc>: Run jeprof on "jeprof f.heap" [...]
jeprof --show_bytes --pdf ./myprogram jeprof f.heap > output.pdf

21 PDF output: tracking down ~8MB of leaked memory
Leak was caused by a serious algorithmic bug!

22 Checking for memory leaks: Using valgrind
$ valgrind --fair-sched=yes --leak-check=full ./myprogram ==28550== 233,072 (3,696 direct, 229,376 indirect) bytes in 154 blocks are definitely lost in loss record 13 of 16 ==28550== by 0x429518: Prepare<...> (snapcollector.h:307) ==28550== by 0x429518: traversal_end (rq_snap.h:313) ...

23 [#4] False sharing
[Diagram: a 64-byte (8-word) cache line w1..w8 cached by both threads. Thread 1 reads w2 and Thread 2 reads w7: both copies are in the Shared (S) state. When Thread 2 writes w7, its copy becomes exclusive/modified (X) and Thread 1's copy is invalidated, even though the two threads never touch the same word.]

24 False sharing in the test harness
Typically revealed by sanity checks
For example: a read-only workload with empty data structures

25 Searches in empty data structures
48 threads on a 2x24-thread Intel E7-4830
[Graph: operations per microsecond for lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree, all using the same search code]

26 Locating the false sharing
Using Linux performance tools: perf
Record performance counter MEM_LOAD_UOPS_RETIRED_HIT_LFB (≅ memory contention):
perf record -e cpu/event=0xd1,umask=0x40/pp ./myprogram
Use high-precision event data (more accurate line numbers)

27-29 Exploring the perf data
[Screenshots: navigating the perf report to locate the offending source lines]
30 What are these variables?
while (!done) {
    ++cnt;
    if (cnt % 50 == 0) {  // check the clock only every 50 operations
        if (get_time() - startTime >= run_time) {
            done = true;
            memory_fence();
            break;
        }
    }
    ... // perform a random operation
}

31 The offending data layout
volatile long rngs[NUM_THREADS * PADDING];
volatile long startTime;
volatile bool done;
Thread t's random # generator = rngs[t * PADDING]
[Diagram: each rng slot is 8 bytes of data followed by 2 cache lines minus 8 bytes of empty padding]

32 Expected vs. actual data layout
Expected layout (no false sharing): ... rngs[...] startTime done
Actual layout: ... done startTime rngs[...]

33 Data could still be reordered!
Brittle solution:
volatile long rngs[NUM_THREADS * PADDING];
volatile char padding[128];
volatile long startTime;
volatile bool done;
Data could still be reordered: ... done startTime padding[...] rngs[...]

34 Better solution
struct {
    volatile char pad0[128];
    volatile long rngs[NUM_THREADS * PADDING];
    volatile long startTime;
    volatile bool done;
    volatile char pad1[128];
} g;
Layout: ... g.pad0[...] g.rngs[...] g.startTime g.done g.pad1[...]

35 Searches in empty data structures
[Graph: operations per microsecond for lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree]

36 Searches in empty data structures
[Graph: operations per microsecond for lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree]

37 Why is the skiplist slow?
Use PAPI measurements to investigate:
              Lock-free BST   Lock-free skiplist
L1 miss / op  0.11            0.14
L2 miss / op
L3 miss / op  0.04            0.05
Cycles / op   347             656
Instr. / op   307             700

38 Digging deeper with perf
perf record -e cpu-cycles:pp ./myprogram ; perf report

39 Confirm with an experiment
Flatten the skiplist (MAX_LEVEL=1)
This works because the workload uses empty data structures

40 Searches in empty data structures
[Graph: operations per microsecond for lock-free skiplist, lock-free list, lazy list, RCU-based BST, lock-free BST, and lock-free (a,b)-tree]

41 [#5] Problematic padding
[Graph: operations per microsecond. Original: 2.05 L3 misses/op; with the problematic padding removed: 0.01 L3 misses/op]

42 [#6] Data structure memory layout anomalies
Memory layout NNNNNNNN (nodes allocated consecutively) vs. NDNDNDND (nodes interleaved with other objects)
48 threads; prefill with 1M insertions, then do 100% searches
[Graph: operations per microsecond, before and after fixing the layout]
The interleaved layout causes more L2 misses and L3 misses, but the external BST contains more nodes!?

43 My top 10 sanity checks
Key checksums
Empty data structures
Valgrind
Extremely high contention
Memory reclamation: efficient vs eager
Variance measurements
Read-only workloads
Inspect object sizes and the first k object allocations per thread (Are the nodes allocated by each thread interleaved with other objects? Are they spread out or allocated consecutively? Why?)
Trial length: 3s vs 60s
10^2 vs 10^5 vs 10^7 keys

44 Conclusion Join me in performing R.A.R.E experiments
Find problems with sanity checks
Find solutions with systems tools
Explain everything (with evidence!)
Question: are Java experiments useful?
Ongoing work: a new test harness with tools to make R.A.R.E. experiments easier

