Remix Online Detection and Repair of Cache Contention for the JVM

Remix Online Detection and Repair of Cache Contention for the JVM
Ariel Eizenberg, Joseph Devietti Shiliang Hu, Gilles Pokam

What’s wrong with this code?
public class Test extends AtomicLong implements Runnable { static void main(final String[] args) { Test test1 = new Test(); Thread t1 = new Thread(test1); Test test2 = new Test(); Thread t2 = new Thread(test2); t1.start();t2.start(); t1.join();t2.join(); } public void run() { for(int i = 0; i < ; ++ i) set(i); }} public class Test extends AtomicLong implements Runnable { static void main(final String[] args) { Test test1 = new Test(); Test test2 = new Test(); Thread t1 = new Thread(test1); Thread t2 = new Thread(test2); t1.start();t2.start(); t1.join();t2.join(); } public void run() { for(int i = 0; i < ; ++ i) set(i); }} 151ms 313ms The original code looks fine. However, a random small change improves performance twofold. What is going on? 1 Summary Background REMIX Detection Repair Performance

What happened? Version A: 313ms test1 test2 t1 t2 Version B: 151ms
Maybe the heap will provide a clue More surprises! test1 t1 test2 t2 GC, Class Loader, … 2 Summary Background REMIX Detection Repair Performance

Managed Languages Runtime has increased control over execution  opportunities for dynamic optimization Less programmer control over memory  more vulnerable to performance bugs 3 Summary Background REMIX Detection Repair Performance

REMIX First system to automatically detect and repair cache contention in managed languages Uses hardware performance counters to detect cache contention Automatically repairs cache contention bugs, significantly simplifying programmers job Implemented as a GC-like pass within the HotSpot JVM 4 Summary Background REMIX Detection Repair Performance

Outline Cache Coherency The Remix system Evaluation 5 Summary
Background REMIX Detection Repair Performance

Cache Coherency Multicore processors implement a cache coherence protocol to keep private caches in sync Operates on whole cache lines (usually 64 bytes) Cache lines have three key states: Read Shared Write Exclusive Invalid Multicore processors implement a cache coherence protocol to keep private caches in sync, for example invalidating a cache entry when the value it caches is modified by another core. The protocol operates on whole cache lines, usually 64 bytes, and this can cause cache contention, which, when severe, can significantly reduce parallelism. There are two types cache of contention: true sharing when a value “bounces” between to caches, and false sharing where two unrelated variables share a single cache line, forcing the entire line to bounce between cores. Reducing the size of the cache line would fix this type of contention, but is unpractical. 6 Summary Background REMIX Detection Repair Performance

True vs False sharing Cache Line Same Bytes True Sharing
X Core Core 1 Multicore processors implement a cache coherence protocol to keep private caches in sync, for example invalidating a cache entry when the value it caches is modified by another core. The protocol operates on whole cache lines, usually 64 bytes, and this can cause cache contention, which, when severe, can significantly reduce parallelism. There are two types cache of contention: true sharing when a value “bounces” between to caches, and false sharing where two unrelated variables share a single cache line, forcing the entire line to bounce between cores. Reducing the size of the cache line would fix this type of contention, but is unpractical. Different Bytes False Sharing X Y Core Core 1 7 Summary Background REMIX Detection Repair Performance

Intel PEBS Events PEBS – Precise Event-Based Sampling
Available in recent Intel multiprocessors Log detailed information about architectural events Recent intel multiprocessors support monitoring of architectural events through the PEBS interface. PEBS supports hundreds of different events, and these are very useful for profiling system behavior. The Linux kernel’s “perf” interface is an easy to use interface for recording PEBS events. 8 Summary Background REMIX Detection Repair Performance

PEBS HitM Events “Hit-Modified” - A cache miss due to a cache line in Modified state on a different core Core L1$ X:M 1 X:I Load X Miss Hit PEBS reports several cache related events. Two of them are triggered on loads with a cache hit on a remote core’s cache in the modified state, indicating cache contention. 9 Summary Background REMIX Detection Repair Performance

Related Work Sheriff [Liu and Berger, OOPSLA 2011] detect & repair
Plastic [Nanavati et al., EuroSys 2013] Laser [Luo et al., HPCA 2016] Cheetah [Liu and Liu, CGO 2016] Predator [Liu et. al., PPoPP 2014] Intel vTune Amplifier XE Oracle detect & repair unmanaged languages detection only Previous work in this area is focused on native code. Sheriff, Plastic and Laser offer schemes for both detection and repair of false sharing, however they rely invasive changes with substantial runtime overhead and are incompatible with Java memory consistency model. Java8 has added a new annotation which, when enabled, hints the class loader to pad the annotated fields. is a repair mechanism only, offers no detection capability, and requires access to the source code. In ReMIX we want a non-invasive, lightweight and completely automatic detection and repair mechanism. manual repair 10 Summary Background REMIX Detection Repair Performance

Remix A modified version of the Oracle HotSpot JVM
Detects & repairs cache contention bugs at runtime Works with all JVM languages, no source code modification required Provides performance matching hand-optimization Now I’ll present REMIX, our modified HotSpot JVM. REMIX detects & repairs cache contention bugs at runtime, completely automatically. REMIX works with all JVM languages, and, operating at the JVM level, requires no source code modification. As we’ll see in the benchmark results in later slides, REMIX provides performance matching hand-optimization, while adding no statistically significant runtime overhead. 11 Summary Background REMIX Detection Repair Performance

Remix System Overview Application Java, Scala, etc’ Runtime system JVM
Linux Kernel OS perf API Lets look at a high level overview of REMIX. The REMIX system uses the Linux kernel’s perf API to receive HITM PEBS events, and uses these events to detect cache contention arising from an application running within the JVM. CPU w/HITM PEBS Hardware HITM events 12 Summary Background REMIX Detection Repair Performance

Detection HITM Events classify >threshold repair classify report
native heap & stack HITM Events classify track complete no managed heap >threshold repair false sharing yes classify map to objects model $ lines Lets dig a little deeper into the detection process. REMIX enables HITM event collection, which the kernel stores into a memory buffer. Once every few seconds, REMIX processes HITM events and, filtering out corrupt data, classifies the data address as pointing to the native heap, the stack or to the managed heap. When a certain rate of hits is exceeded, REMIX models the cache line to determine the type of cache contention, and then maps heap addresses to objects and class types. report true sharing 13 Summary Background REMIX Detection Repair Performance

Cache Line Modelling Cache lines are modelled with 64-bit bitmaps
HITM event  set the address bit, count hit Multiple bits set  potential false sharing Repair is cheaper than more complete modelling! Repair when counter exceeds threshold Cache line modelling is an important step in deciding whether we are dealing with true- or false- sharing. Cache lines are modelled with 64-bit bitmaps, with each bit signifying one byte of a 64-byte aligned memory range. JVM type safety guarantees two overlapping memory accesses must have the same starting address, so we only need to set bit associated with the starting address of the HITM event. Cache lines with only one bit set indicate true sharing, while having multiple bits set can indicate both false sharing or true sharing, with the former being the far more frequent case. We don’t bother distinguishing between these two, as we have observed repair is much cheaper then more complex modelling techniques. 14 Summary Background REMIX Detection Repair Performance

Top Level Flow 15 Summary Background REMIX Detection Repair
At a high-level, once REMIX detects cache contention and determines it might be repaired, it automatically modifies the relevant classes and moves affected objects to the end of the heap, padding them in the process to remove the contention. 15 Summary Background REMIX Detection Repair Performance

Padding Instance data Instance size Field list
OOPMap (reference block map) Constant-pool cache Next lets look into repair. We had two approaches to repair, either just relocating contended objects or permanently padding contended classes. We chose the latter, as it is much more stable and prevents any future contention on the affected class types during the program’s execution. Padding classes requires several modification to JVM data structures. Beyond the obvious change to padded classes’ instances’ data, we need to modify the field list to adjust for new field offsets. We also need to modify a secondary map named OOPMap, a datastructure contains block ranges of references in instances. Finally we need to modify the constant-cache pool for any class whose methods access padded objects. 16 Summary Background REMIX Detection Repair Performance

Padding - Inheritance class A header (16b) byte a1 byte a2 pad 48b
class B extends A header (16b) byte a1 byte a2 byte b class C extends A header (16b) byte a1 byte a2 byte c Inheritance is a bit tricky, as we might need to pad more the just the affected classes. We obviously need to pad classes derived from the padded classes, but we might need to pad other parts of the hierarchy. Looking at the example on the slide, suppose we detect false sharing on field a1 in class B. However the field is actually a field inherited from B’s base class, class A. This means we need to pad all instance of class A, and of any class deriving from A, such as instances of B and C. The padding ensures that the contended field is at least 64bytes away from any other contended field. pad 48b pad 48b pad 62b pad 62b 17 Summary Background REMIX Detection Repair Performance

Repair Trace all strong+weak roots in the system
Traverse heap and find targeted instances Live  Relocate & pad, store forwarding pointer Dead  Fix size mismatch Adjust all pointers to forwarded objects Deoptimize all relevant stack frames The repair process itself is a modified mark-and-sweep GC pass. We first process all strong roots in the system, and mark all live objects in the system. We need to take care to mark weak roots as well so we do not modify the heap structure. Next we traverse heap and find instance of classes to pad, copy and pad marked instances to end of heap, while storing a forwarding pointer in the old location. We only need to pad live objects, there by reducing the number of instances to pad for short lived objects. Dead objects need a small adjustment as well to maintain the consistency of the heap. The 3rd step is to trace strong roots again and adjust all pointers to forwarded objects. Finally we must deoptomize all relevant stack frames to ensure that any code executing after the repair take into account all the classes that have been modified. This process usually happens only once or twice in the runtime of an application and from most benchmarks took well less then 250 milliseconds, and is amortized throughout the run of the program. 18 Summary Background REMIX Detection Repair Performance

Disruptor & Spring Reactor
Both are libraries for high-speed inter-thread message passing 56 bytes 56 bytes class Sequence { protected long a,b,c,d,e,f,g; // pad before protected volatile long value; protected long i,j,k,l,m,n,o; // pad after } class Sequence { protected volatile long value; } We now turn to performance evaluation. First lets take a look at how padding is implemented in many false sharing aware packages, for example the Disruptor high speed inter-thread messaging library used in the LMAX high speed trading framework, as well as in Spring Reactor. On the left side we see the convoluted method required to pad java classes (without using the available in Java8) . The complicated hierarchy is required to ensure the optimizer does not affect the field ordering. We will compare the performance of this manually padded classes with REMIX’s performance on the original unpadded class. a-g value i-o Is this enough? No! – Class loader optimizes aggressively, lays out value to before a. 19 Summary Background REMIX Detection Repair Performance

Disruptor & Spring Reactor
A complex hierarchy forces field order class LhsPadding { protected long a,b,c,d,e,f,g; } class Value extends LhsPadding { protected volatile long value; } class RhsPadding extends Value { protected long i,j,k,l,m,n,o; } class Sequence extends RhsPadding { // actual work } We now turn to performance evaluation. First lets take a look at how padding is implemented in many false sharing aware packages, for example the Disruptor high speed inter-thread messaging library used in the LMAX high speed trading framework, as well as in Spring Reactor. On the left side we see the convoluted method required to pad java classes (without using the available in Java8) . The complicated hierarchy is required to ensure the optimizer does not affect the field ordering. We will compare the performance of this manually padded classes with REMIX’s performance on the original unpadded class. 20 Summary Background REMIX Detection Repair Performance

Disruptor + Spring Reactor
This chart shows the speedup from the unpadded version of the libraries of both REMIX and the original, manually padded code. We can see that in most cases REMIX performs just as well as manual fixes, without requiring any manual padding. In one benchmark, HITM event rate is not high enough to trigger REMIX’s repair each execution, and the speedup fluctuates between no speedup and the same as the original version. Finally, in one case, REMIX outperforms when padding is actually unnecessary. Remix outperforms when manual fix adds unnecessary padding Remix repair threshold is adjustable Remix performs as well as manual fixes in most cases 21 Summary Background REMIX Detection Repair Performance

NAS Parallel Benchmarks
Remix reveals true-sharing running NAS suite HITM from true-sharing on JIT counters JIT-ing fixes contention NAS workloads exceed JIT size limit, not JIT-ed Remix detects this and forces compilation NAS is a parallel benchmark suite from NASA. Running NAS with Remix enabled revealed intense true-sharing HITM records coming from an internal JVM heap. Investigation found an interesting issue with the HotSpot JVM. HotSpot maintains invocation and backedge counters to determine when to compile methods. Counters are shared across threads, and suffer true sharing when updated with high frequency. Normally not a problem, as high update rates will ensure the function is compiled, and no more updates happen, removing the contention. However, the JVM has a limit on the size of functions compiled, and 3 of the NAS workloads never get compiled due to their size, which means the true-sharing persists. We augmented Remix with ability to detect this pathology and force compilation if this combination of very frequent execution together with a very large function size. This fix actually improves NAS runtime performance by 7x to 25x 22 Summary Background REMIX Detection Repair Performance

No Contention == No Impact
Speedup (higher is better) (DaCapo & ScalaBench) We also benchmarked REMIX with Dacapo, ScalaBench and SpecJVM. Notice that the axis on both charts zoom to a very narrow regions around 1. We used several statistical tests to show that there is no statistically significant speedup or slowdown. The serial benchmark in SpecJVM has some contention that is repaired by REMIX. Sunflow in DaCapo experiences high levels of true sharing that is misidentified as false-sharing, curiously this does not happened in the SpecJVM version of sunflow. (SpecJVM) 23 Summary Background REMIX Detection Repair Performance

Padding 24 Summary Background REMIX Detection Repair Performance
Lets examine some other performance indicators. First we’ll look at the heap overhead of padding. As can be seen, in most cases padding is minimal, as the number of live classes that need padding is minimal. 24 Summary Background REMIX Detection Repair Performance

Performance 25 Summary Background REMIX Detection Repair Performance
Now lets look at the time required for each of REMIX’s stages. HITM event collection is usually very fast, especially in cases with high true sharing. Cache line modelling and object classification has almost no impact. Repair is also fast in most cases, except in benchmarks that require modification of a very many instances. 25 Summary Background REMIX Detection Repair Performance

Conclusions Cache contention afflicts managed languages
Dynamic information opens up new avenues for optimization Performance counters are a gold mine REMIX can simplify programmers’ lives by automatically fixing false sharing bugs (GPLv2) Performance counters are very useful tool for improving performance. Using them in runtime opens up new avenues to optimize program behavior at runtime. This combined with managed languages better control of the memory address space. REMIX utilizes all of this to automatically detect and repair false sharing in all JVM languages. Finally REMIX sourcecode is available on github. 26 Summary Background REMIX Detection Repair Performance

Q&A 27 Summary Background REMIX Detection Repair Performance

PEBS HITM Event Accuracy
While the events are supposed to record HITM events on load operations, the are generated on some store operations as well, but with mostly invalid data. Luckily, we can detect these corrupt events fairly easily and filter them out. 28 Summary Background REMIX Detection Repair Performance

Objects Moved 33 Summary Background REMIX Detection Repair Performance

sun.misc.Unsafe Tracks unsafe access, prevents padding
FLD_OFFS = U.objectFieldOffset(k.getDeclaredField("fld")); Unsafe.putLong(o, FLD_OFFS, value); Remix Extended Unsafe : U.registerFieldOffset(k.getDeclaredField("fld"), k.getDeclaredField("FLD_OFFS“)); Just as fast, enables padding Another hurdle is classes using the internal sun.misc.Unsafe class for direct access to memory. A common pattern used by many high speed frameworks is to get the address of a field using the objectFieldOffset() method, and then access the field data directly using this offset. Padding a class invalidates these offsets, so REMIX tracks uses of objectFieldOffset and disables padding for classes which were accessed this way. To support Unsafe, we extend its functionaility by adding a slightly modified version of the function, which allows REMIX to update the stored field offsets whenever padding occurs. This change is required only when using Usafe with the top pattern, and as can be seen from the example, is very minimal, as well as being as fast the original code. 37 Summary Background REMIX Detection Repair Performance

Time-to-fix 34 Summary Background REMIX Detection Repair Performance
Time-to-fix is the percent of the program’s execution time until REMIX kicks in. This value is directly controlled by REMIX’s thresholds. Even though most benchmarks don’t have a very long runtime, in most cases REMIX starts working in less then 5% of the program’s execution time, leaving enough time for the padding to improve performance substantially. 34 Summary Background REMIX Detection Repair Performance

Performance Benchmark(s) Classes Loaded Processing (ms) Heap Size (MB)
DaCapo Sunflow 1879 1.47 854 Disruptor 40-85 SpringReactor WorkQueue 1617 0.748 41 SpecJVM serial 1498 1.21 1603 32 Summary Background REMIX Detection Repair Performance

Remix Speedup vs HITM rate
31 Summary Background REMIX Detection Repair Performance

Size-Preserving-Klass
Instances are stored contiguously in memory Object sizes not stored directly in the heap Calculated through the klass pointer in the object header Padding breaks this assumption Object left behind at old location has wrong size! Solution – create a ghost klass with the original size Modify dead instances to point to this ghost Take care not to traverse padded objects in heap until klass size is updated 30 Summary Background REMIX Detection Repair Performance

Padding - Inheritance 29 Summary Background REMIX Detection Repair
class A header (16b) byte a pad pad class B extends A header (16b) byte a byte b class C extends A header (16b) byte a byte c pad pad pad pad pad class E extends C header (16b) byte a byte c byte d class E extends C header (16b) byte a byte c byte e pad pad pad pad pad pad 29 Summary Background REMIX Detection Repair Performance

Savina Actors Benchmark
Evaluates actor libraries across 30 benchmarks Remix detects false and true sharing bugs in libraries & benchmarks: 36 Summary Background REMIX Detection Repair Performance

Virtualization Although Xen supports perf, it does not yet support virtualization of HW PEBS events 35 Summary Background REMIX Detection Repair Performance

Remix Online Detection and Repair of Cache Contention for the JVM

Similar presentations

Presentation on theme: "Remix Online Detection and Repair of Cache Contention for the JVM"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Remix Online Detection and Repair of Cache Contention for the JVM

Similar presentations

Presentation on theme: "Remix Online Detection and Repair of Cache Contention for the JVM"— Presentation transcript:

Similar presentations

About project

Feedback