NUMA Tuning for Java Server Applications
Mustafa M. Tikir

University of Maryland

Introduction
Cache-coherent SMPs are widely used
  – High performance computing
  – Large-scale applications
  – Client-server computing
cc-NUMA is the dominant architecture
  – Allows construction of large servers
  – Data locality is an important consideration
      Faster access to local memory units

Platform          Local Access   Remote Access   Ratio
Sun Fire 6800     225 ns         300 ns          1:1.33
Sun Fire 15K      225 ns         400 ns          1:1.78
SGI Altix 3000    145 ns         605 ns          1:4.17
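As a rough illustration of why the ratios in the latency table matter, the sketch below recomputes each ratio and estimates an average access latency for a given share of non-local accesses. The linear mixing model and the function names are assumptions made for illustration only; the model ignores caches, contention, and interconnect topology.

```python
# Latencies in nanoseconds, copied from the table above: (local, remote).
PLATFORMS = {
    "Sun Fire 6800": (225, 300),
    "Sun Fire 15K": (225, 400),
    "SGI Altix 3000": (145, 605),
}


def remote_to_local_ratio(local_ns, remote_ns):
    """Return how many times slower a remote access is than a local one."""
    return remote_ns / local_ns


def average_latency_ns(local_ns, remote_ns, non_local_fraction):
    """Estimate mean memory latency under an assumed linear mix.

    non_local_fraction is the share of accesses (0.0 to 1.0) that go to
    a remote memory unit. This is a simplification, not a measurement.
    """
    if not 0.0 <= non_local_fraction <= 1.0:
        raise ValueError("non_local_fraction must be between 0 and 1")
    return (1.0 - non_local_fraction) * local_ns + non_local_fraction * remote_ns


if __name__ == "__main__":
    for name, (local_ns, remote_ns) in PLATFORMS.items():
        ratio = remote_to_local_ratio(local_ns, remote_ns)
        print(f"{name}: 1:{ratio:.2f}")
```

Under this model, the same reduction in the non-local fraction saves more time on a machine with a larger remote-to-local ratio.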

Dynamic Page Migration
Effective for scientific applications
  – Regular memory access patterns
Large static arrays with many pages
  – Divided into segments
  – Distributed to multiple computation nodes
      A few nodes access each data segment most
Our earlier work
  – Moved pages at fixed time intervals
      Profiles gathered from hardware counters
  – Resulted in up to 90% reduction in non-local accesses
      16% improvement in execution times
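The interval-based idea can be sketched as follows: at the end of each interval, the sampled accesses are tallied per page, and a page is moved when some other node accessed it more often than its current home. This is a minimal sketch, not the earlier work's implementation; `plan_page_migrations`, the sample format, and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def plan_page_migrations(page_home, samples):
    """Decide which pages to move after one profiling interval.

    page_home: dict mapping page id -> node id where the page lives now.
    samples:   iterable of (page id, accessing node id) pairs collected
               during the interval (for example from hardware counters).
    Returns a dict mapping page id -> new node id, only for pages whose
    most frequent accessor is strictly ahead of the current home node.
    """
    per_page = {}
    for page, node in samples:
        per_page.setdefault(page, Counter())[node] += 1

    moves = {}
    for page, counts in per_page.items():
        home = page_home[page]
        best_count = max(counts.values())
        # Assumed policy: leave the page in place when its home ties for
        # the maximum, to avoid paying for a migration with no benefit.
        if counts[home] == best_count:
            continue
        moves[page] = min(n for n, c in counts.items() if c == best_count)
    return moves
```

For example, a page homed on node 0 that was sampled twice from node 1 and once from node 0 would be scheduled to move to node 1.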

Java Server Applications
Java programs
  – Make extensive use of heap-allocated memory
  – Typically have significant pointer chasing
Dynamic page migration may not be as beneficial
  – A page may have objects with different access patterns
      Page placement is transparent to the standard allocation routines
      Larger page size increases the likelihood
  – cc-NUMA servers tend to use super pages
Heap objects should be allocated or moved
  – Local to the processor accessing them most
      Migration at the object granularity
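A small worked example may help show why a page holding objects with different access patterns limits page migration. The sketch compares the non-local accesses left after the best possible placement of a whole page with those left when each object can be placed on its own. The function names and the access counts in the usage note are made up for illustration.

```python
def non_local_after_best_placement(access_counts):
    """Accesses that stay non-local when a unit sits on its dominant node.

    access_counts: dict mapping node id -> number of accesses to the unit
    (a page or a single object).
    """
    total = sum(access_counts.values())
    return total - max(access_counts.values(), default=0)


def compare_granularities(objects_on_page):
    """Return (page_level, object_level) remaining non-local accesses.

    objects_on_page: list of dicts (node id -> access count), one dict
    per object that currently shares the same page.
    """
    page_counts = {}
    for counts in objects_on_page:
        for node, count in counts.items():
            page_counts[node] = page_counts.get(node, 0) + count

    page_level = non_local_after_best_placement(page_counts)
    object_level = sum(
        non_local_after_best_placement(counts) for counts in objects_on_page
    )
    return page_level, object_level
```

With two objects on one page, one accessed mostly from node 0 (`{0: 90, 1: 10}`) and one mostly from node 1 (`{1: 80, 0: 20}`), the best page placement still leaves 90 non-local accesses, while per-object placement leaves 30.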

Page Migration for SPECjbb2000
Around 25% reduction in non-local accesses
  – Unlike scientific applications where it is up to 90%
Around 3% reduction in throughput
  – Overhead due to migrations of many pages

% Reduction (non-local accesses)    27.4    24.5    20.9
% Improvement (throughput)          -2.8    -3.4    -3.1

Memory Behavior at Object Granularity
Source code instrumentation of HotSpot VM
  – Object allocations by the Java application
  – Internal heap allocations by the VM
  – Changes in object addresses due to garbage collection
Instrumentation using dyninst
  – Additional helper thread
      For address transaction sampling
      Via Sun Fire Link hardware counters
Execution is divided into distinct intervals
  – Execution interval
      Gathers information on object allocations and accesses
  – Garbage collection interval
      Dumps allocation and transaction buffers

Experiments using SPECjbb2000
Young Generation
  – Objects are initially allocated
  – Objects stay in until old enough to be tenured
      Survivor spaces
Tenured (old) Generation
  – Objects reaching a certain age are promoted
Permanent Generation
  – The reflective data of the VM are allocated
  – Such as class and method objects

Java Heap Region        Memory Accesses (Count)   Memory Accesses (%)   % Non-Local
Young Generation        11,926,231                35.8                  83.7
  Eden Space            11,389,586                34.2                  83.8
  Survivor Space           536,645                 1.6                  82.7
Old Generation          16,477,990                49.5                  84.0
Permanent Generation       236,189                 0.7                  81.0
Internal Structures      3,755,937                11.3                  80.4

Potential Optimizations
Estimation study using finer grained techniques
  – Based on information gathered during measurement
      Heap allocations and accesses
Potential object centric optimizations
  – Static-optimal placement
      Has information on all object accesses
      Places objects at allocation time
  – Prior-knowledge placement
      Has information on object accesses during the next execution interval
      Moves objects at garbage collection time
  – Object-migration placement
      Gathers information since the start of execution
      Moves objects at garbage collection time
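The three placement strategies can be compared on a per-object access trace with a small estimator. This is a sketch under stated assumptions, not the estimation tool from the study: the trace format, the function names, the tie-breaking rule (lowest node id), and the choice to start object-migration placement at the allocating node are all hypothetical.

```python
def dominant_node(counts, default):
    """Node with the most accesses; lowest id wins ties, default if empty."""
    if not counts:
        return default
    best = max(counts.values())
    return min(node for node, count in counts.items() if count == best)


def non_local(interval, home):
    """Accesses in one interval that come from nodes other than home."""
    return sum(count for node, count in interval.items() if node != home)


def estimate_non_local(intervals, alloc_node):
    """Estimate non-local accesses to one object under three strategies.

    intervals: list of dicts (node id -> access count), one per execution
    interval, in order. A garbage collection separates the intervals.
    """
    # Static-optimal: one placement chosen with knowledge of all accesses.
    totals = {}
    for interval in intervals:
        for node, count in interval.items():
            totals[node] = totals.get(node, 0) + count
    static_home = dominant_node(totals, alloc_node)
    static = sum(non_local(iv, static_home) for iv in intervals)

    # Prior-knowledge: before each interval, move to that interval's best node.
    prior = sum(non_local(iv, dominant_node(iv, alloc_node)) for iv in intervals)

    # Object-migration: placement follows the history seen so far.
    history, home, migration = {}, alloc_node, 0
    for interval in intervals:
        migration += non_local(interval, home)
        for node, count in interval.items():
            history[node] = history.get(node, 0) + count
        home = dominant_node(history, alloc_node)  # moved at the next GC

    return {
        "static_optimal": static,
        "prior_knowledge": prior,
        "object_migration": migration,
    }
```

Prior-knowledge placement acts as an upper bound on what migration at garbage collection time could achieve, since object-migration placement only sees the past.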

Estimation Study Results
Migration is effective in old generation
  – Many objects in young generation die fast
One or a few processors access objects in young generation
  – Majority of accesses are from the allocator processor
SPECjbb2000 has some dynamically changing memory behavior in the old generation

NUMA-Aware Java Heaps
NUMA-Aware heap layouts
  – NUMA-Eden
      NUMA-Aware young generation
      Original old generation
      Focus on the objects in the young generation
  – NUMA-Eden+Old
      NUMA-Aware young generation
      NUMA-Aware old generation
  – Combined with dynamic object migration
      Focus on the access locality to all objects

NUMA-Aware Young Generation
We divide eden space into segments
  – Each locality group is assigned a segment
      Pages in each segment are placed local to the group
Object allocation
  – The requestor processor is identified
  – Object is placed in the segment of the processor's group
Garbage collection
  – When a segment does not have enough space
  – Other segments are also collected even if not full
      Potentially eliminates future synchronization
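The allocation and collection rules on this slide can be modeled with a toy segmented eden. This is a simplified sketch, not HotSpot code: the class `NumaEden`, its fields, and the assumption that a minor collection empties every segment (no survivors are tracked) are illustrative only.

```python
class NumaEden:
    """Toy model of an eden space split into one segment per locality group."""

    def __init__(self, cpu_to_group, segment_bytes):
        self.cpu_to_group = dict(cpu_to_group)
        self.segment_bytes = segment_bytes
        # Bytes used so far in each group's segment (bump-pointer style).
        self.used = {group: 0 for group in set(self.cpu_to_group.values())}
        self.minor_collections = 0

    def allocate(self, cpu, size):
        """Allocate size bytes for the requesting cpu.

        Returns (group, offset): the segment chosen and the offset in it.
        """
        if size > self.segment_bytes:
            raise ValueError("object does not fit in an eden segment")
        group = self.cpu_to_group[cpu]  # identify the requestor's group
        if self.used[group] + size > self.segment_bytes:
            self._collect_all_segments()
        offset = self.used[group]
        self.used[group] += size
        return group, offset

    def _collect_all_segments(self):
        # One full segment triggers a collection of every segment, even
        # those that are not full. Simplification: nothing survives.
        for group in self.used:
            self.used[group] = 0
        self.minor_collections += 1
```

With `NumaEden({0: 0, 1: 0, 2: 1, 3: 1}, 100)`, an allocation from CPU 2 always lands in group 1's segment, and a request that overflows group 0's segment resets both segments.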

NUMA-Aware Old Generation
We divide tenured space into segments
  – Each locality group is assigned a segment
      Pages in each segment are placed local to the group
When an object is promoted to old generation
  – Preferred location of the object is identified
      Processor that accesses the object most
  – Object is moved to the segment of the processor's group
Object migrations during full garbage collection
  – Preferred locations of all objects are re-computed
Additional object migrations
  – At every fixed number of minor collections
      To match the dynamically changing behavior
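The preferred-location rule can be sketched as a lookup from the most frequent accessing processor to its locality group, followed by a pass that lists the objects whose current segment no longer matches. The function names, the data layout, and the fallback for objects with no recorded accesses are assumptions for illustration.

```python
def preferred_group(accesses_by_cpu, cpu_to_group, fallback_group):
    """Locality group of the processor that accesses the object most.

    accesses_by_cpu: dict mapping cpu id -> sampled access count.
    Ties go to the lowest cpu id; with no samples, fallback_group is used.
    """
    if not accesses_by_cpu:
        return fallback_group
    best = max(accesses_by_cpu.values())
    cpu = min(c for c, count in accesses_by_cpu.items() if count == best)
    return cpu_to_group[cpu]


def plan_old_gen_moves(objects, cpu_to_group):
    """Recompute preferred locations, as done at a full collection.

    objects: dict mapping object id -> (current group, accesses_by_cpu).
    Returns a dict mapping object id -> target group for objects that
    should migrate to a different segment.
    """
    moves = {}
    for obj_id, (current_group, accesses_by_cpu) in objects.items():
        target = preferred_group(accesses_by_cpu, cpu_to_group, current_group)
        if target != current_group:
            moves[obj_id] = target
    return moves
```

The same planning pass could be run after every fixed number of minor collections to follow access patterns that change over time.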

Experimental Setup
Representative Java workloads for simulation
  – Generated from actual runs
  – Sequence of requests
      To allocate or access objects by processors
      Same order as the actual run
Workload Execution Machine
  – A hybrid execution simulator
      Consumes the generated parallel workload
      Issues memory allocations and accesses to the machine
  – Implements the underlying memory management algorithms
      Original algorithms in the HotSpot VM
      Algorithms for NUMA-Aware heap layouts
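A trace-replay loop of this kind might look like the sketch below, which consumes allocation and access requests in order and counts local and non-local accesses under a pluggable placement policy. It is a purely software model, unlike the hybrid simulator described here, which issues real memory operations; the record format and the `replay` function are hypothetical.

```python
def replay(workload, cpu_to_group, place):
    """Replay a workload and count local versus non-local accesses.

    workload:     iterable of (kind, cpu, object id) records in the order
                  of the original run; kind is "alloc" or "access".
    cpu_to_group: dict mapping cpu id -> locality group id.
    place:        placement policy, called as place(cpu, object id) at
                  allocation time; returns the group that holds the object.
    Returns (local, non_local) access counts.
    """
    home = {}
    local = non_local = 0
    for kind, cpu, obj_id in workload:
        if kind == "alloc":
            home[obj_id] = place(cpu, obj_id)
        elif kind == "access":
            if home[obj_id] == cpu_to_group[cpu]:
                local += 1
            else:
                non_local += 1
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return local, non_local
```

Swapping the policy, for instance `lambda cpu, obj: cpu_to_group[cpu]` for allocator-local placement versus `lambda cpu, obj: 0` for a single shared region, allows the same trace to be compared across heap layouts.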

NUMA-Aware Heap Experiments
Application
  – SPECjbb2000 benchmark on HotSpot VM
      Run with 12 warehouses
Platform
  – 24 processor Sun Fire 6800
  – 24 GB main memory
Sampling at every 1K transactions
Partial workload from the actual run
  – 10M allocation records
  – 28M memory accesses
Generated workloads with higher pressure
  – Scaled 16 and 32 times

Reduction in Non-Local Accesses

Scale Factor   Heap Configuration   Young Gen.   Old Gen.   Young + Old Gen.
16             NUMA-Eden            57.6 %        0.3 %     28.1 %
16             NUMA-Eden+Old        55.3 %       27.5 %     41.0 %
32             NUMA-Eden            50.9 %        1.2 %     27.3 %
32             NUMA-Eden+Old        48.0 %       30.2 %     39.5 %

Execution Times
NUMA-aware heaps are effective
  – 27% improvement for NUMA-Eden configuration
  – 40% improvement for NUMA-Eden+Old configuration
More effective for higher memory pressure

Conclusions
NUMA-Aware heap layouts
  – Up to 41% reduction in non-local accesses
  – Up to 40% improvement in workload execution
Dynamic object migration is beneficial
  – Compared to using only NUMA-aware young generation
NUMA-aware heaps are more effective
  – As the memory pressure increases
More effective on larger servers
  – Sun Fire 15K (latency ratio => 1:1.78)
  – SGI Altix 3000 (latency ratio => 1:4.17)
