NUMA Tuning for Java Server Applications Mustafa M. Tikir.

University of Maryland
Slide 2 / 17: Introduction

Cache-coherent SMPs are widely used
  – High performance computing
  – Large-scale applications
  – Client-server computing
cc-NUMA is the dominant architecture
  – Allows construction of large servers
  – Data locality is an important consideration
      Faster access to local memory units

Platform          Local Access   Remote Access   Ratio
Sun Fire 6800     [...]          300ns           1:1.33
Sun Fire 15K      225ns          400ns           1:1.78
SGI Altix 3000    [...]          605ns           1:4.17
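The Ratio column is the remote latency divided by the local latency. A minimal sketch of that arithmetic, using the Sun Fire 15K row (225ns local, 400ns remote). The function names are hypothetical, and the average-latency formula is a simplifying assumption that ignores caches and contention:

```python
def latency_ratio(local_ns, remote_ns):
    """How many times slower a remote access is than a local one."""
    return remote_ns / local_ns


def average_latency(local_ns, remote_ns, non_local_fraction):
    """Mean memory latency when a given fraction of accesses is non-local.

    Assumption: every access costs either the local or the remote latency.
    """
    return (1 - non_local_fraction) * local_ns + non_local_fraction * remote_ns


# Sun Fire 15K row: 225 ns local, 400 ns remote.
print(f"1:{latency_ratio(225, 400):.2f}")
print(average_latency(225, 400, 0.50))  # half of the accesses are remote
print(average_latency(225, 400, 0.25))  # a quarter of the accesses are remote
```

Under this model, every non-local access that becomes local saves the difference between the two latencies, which is why the platforms with larger ratios have more to gain from locality.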

Slide 3 / 17: Dynamic Page Migration

Effective for scientific applications
  – Regular memory access patterns
Large static arrays with many pages
  – Divided into segments
  – Distributed to multiple computation nodes
      Each data segment is accessed mostly by a few nodes
Our earlier work
  – Moved pages at fixed time intervals
      Profiles gathered from hardware counters
  – Resulted in up to 90% reduction in non-local accesses
      16% improvement in execution times
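A minimal sketch of profile-driven page migration as the slide describes it: at the end of each interval, every sampled page is moved to the node that accessed it most. The function name, the data layout, and the 8 KB page size are assumptions for illustration, not details from the talk:

```python
from collections import Counter


def plan_page_migrations(samples, page_home, page_size=8192):
    """Return {page: target_node} for pages whose dominant accessor is remote.

    samples   -- iterable of (address, node) pairs sampled during one interval
    page_home -- dict mapping page number to the node currently holding it
    page_size -- hypothetical page size in bytes
    """
    per_page = {}
    for address, node in samples:
        per_page.setdefault(address // page_size, Counter())[node] += 1

    moves = {}
    for page, counts in per_page.items():
        target, _ = counts.most_common(1)[0]
        if page_home.get(page) != target:
            moves[page] = target
    return moves


# Page 0 is accessed mostly by node 1 but lives on node 0, so it is moved.
samples = [(0, 1), (100, 1), (200, 0), (8192, 0)]
print(plan_page_migrations(samples, {0: 0, 1: 0}))
```

The decision is made per page, so all data sharing a page moves together.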

Slide 4 / 17: Java Server Applications

Java programs
  – Make extensive use of heap-allocated memory
  – Typically have significant pointer chasing
Dynamic page migration may not be as beneficial
  – A page may have objects with different access patterns
      Page placement is transparent to the standard allocation routines
      Larger page size increases the likelihood
  – cc-NUMA servers tend to use super pages
Heap objects should be allocated or moved
  – Local to the processor accessing them most
      Migration at the object granularity

Slide 5 / 17: Page Migration for SPECjbb2000

Around 25% reduction in non-local accesses
  – Unlike scientific applications where it is up to 90%
Around 3% reduction in throughput
  – Overhead due to migrations of many pages

[Chart: % Reduction and % Improvement]

Slide 6 / 17: Memory Behavior at Object Granularity

Source code instrumentation of HotSpot VM
  – Object allocations by the Java application
  – Internal heap allocations by the VM
  – Changes in object addresses due to garbage collection
Instrumentation using dyninst
  – Additional helper thread
      For address transaction sampling
      Via Sun Fire Link hardware counters
Execution is divided into distinct intervals
  – Execution interval
      Gathers information on object allocations and accesses
  – Garbage collection interval
      Dumps allocation and transaction buffers
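One way to combine the two buffers the slide mentions is to look up each sampled address in the allocation records and charge the access to the enclosing object. The sketch below assumes a hypothetical record layout (start address, size, object id) and is not the instrumentation code from the talk:

```python
import bisect
from collections import Counter, defaultdict


def attribute_samples(allocations, samples):
    """Map sampled addresses to objects and count accesses per processor.

    allocations -- list of (start_address, size, object_id), non-overlapping
    samples     -- iterable of (address, processor)
    """
    allocations = sorted(allocations)
    starts = [start for start, _, _ in allocations]
    counts = defaultdict(Counter)
    for address, processor in samples:
        # Last allocation whose start address is <= the sampled address.
        i = bisect.bisect_right(starts, address) - 1
        if i < 0:
            continue  # address lies below every recorded allocation
        start, size, object_id = allocations[i]
        if address < start + size:
            counts[object_id][processor] += 1
    return counts


allocations = [(1000, 64, "a"), (2000, 32, "b")]
samples = [(1000, 0), (1063, 1), (1064, 1), (2010, 2)]
for obj, per_cpu in attribute_samples(allocations, samples).items():
    print(obj, dict(per_cpu))
```

Because garbage collection changes object addresses, a lookup table like this is only valid within one execution interval and has to be rebuilt from the recorded address changes after each collection.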

Slide 7 / 17: Experiments using SPECjbb2000

Young Generation
  – Objects are initially allocated here
  – Objects stay in the survivor spaces until old enough to be tenured
Tenured (old) Generation
  – Objects reaching a certain age are promoted here
Permanent Generation
  – The reflective data of the VM are allocated here
  – Such as class and method objects

Java Heap Region        Memory Accesses (Count)   Memory Accesses (%)   % Non-Local
Young Generation        11,926,[...]              [...]                 [...]
  Eden Space            11,389,[...]              [...]                 [...]
  Survivor Space        536,[...]                 [...]                 [...]
Old Generation          16,477,[...]              [...]                 [...]
Permanent Generation    236,[...]                 [...]                 [...]
Internal Structures     3,755,[...]               [...]                 [...]

Slide 8 / 17: Potential Optimizations

Estimation study using finer grained techniques
  – Based on information gathered during measurement
      Heap allocations and accesses
Potential object centric optimizations
  – Static-optimal placement
      Has information on all object accesses
      Places objects at allocation time
  – Prior-knowledge placement
      Has information on object accesses during the next execution interval
      Moves objects at garbage collection time
  – Object-migration placement
      Gathers information since the start of execution
      Moves objects at garbage collection time
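The three placement policies can be compared on a per-object access trace. The sketch below estimates non-local accesses for one object, given one counter of accesses per node for each execution interval. It assumes that under the two policies that move objects at garbage collection time, the object spends its first interval on the allocating node. All names are hypothetical:

```python
from collections import Counter


def best_node(counts, default):
    """Node with the most accesses, or `default` when there are none."""
    return counts.most_common(1)[0][0] if counts else default


def non_local(interval, home):
    """Accesses in one interval that do not come from the home node."""
    return sum(n for node, n in interval.items() if node != home)


def static_optimal(intervals, alloc_node):
    """Place once, at allocation, knowing every access of the whole run."""
    home = best_node(sum(intervals, Counter()), alloc_node)
    return sum(non_local(iv, home) for iv in intervals)


def prior_knowledge(intervals, alloc_node):
    """At each collection, move to the best node for the next interval."""
    misses, home = 0, alloc_node
    for i, iv in enumerate(intervals):
        if i > 0:
            home = best_node(iv, home)
        misses += non_local(iv, home)
    return misses


def object_migration(intervals, alloc_node):
    """At each collection, move to the node with most accesses so far."""
    misses, home, seen = 0, alloc_node, Counter()
    for iv in intervals:
        misses += non_local(iv, home)
        seen.update(iv)
        home = best_node(seen, home)
    return misses


# One object allocated on node 0 whose main accessor shifts to node 1.
trace = [Counter({0: 5, 1: 1}), Counter({1: 8}), Counter({1: 6, 0: 2})]
for policy in (static_optimal, prior_knowledge, object_migration):
    print(policy.__name__, policy(trace, alloc_node=0))
```

Only object-migration placement uses information that is available at run time. The other two serve as reference points for how much locality could be gained.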

Slide 9 / 17: Estimation Study Results

Migration is effective in old generation
  – Many objects in young generation die fast
One or a few processors access objects in young generation
  – Majority of accesses are from the allocator processor
SPECjbb2000 has some dynamically changing memory behavior in the old generation

Slide 10 / 17: NUMA-Aware Java Heaps

NUMA-Aware heap layouts
  – NUMA-Eden
      NUMA-Aware young generation
      Original old generation
      Focus on the objects in the young generation
  – NUMA-Eden+Old
      NUMA-Aware young generation
      NUMA-Aware old generation
        – Combined with dynamic object migration
      Focus on the access locality to all objects

Slide 11 / 17: NUMA-Aware Young Generation

We divide eden space into segments
  – Each locality group is assigned a segment
      Pages in each segment are placed local to the group
Object allocation
  – The requestor processor is identified
  – Object is placed in the segment of the processor’s group
Garbage collection
  – When a segment does not have enough space
  – Other segments are also collected even if not full
      Potentially eliminates future synchronization
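A toy model of the segmented eden space, assuming simple bump-pointer allocation inside each segment. The class, its fields, and the processor-to-group mapping are hypothetical, and survivor handling is left out:

```python
class NumaEden:
    """Toy eden space split into one bump-allocated segment per locality group."""

    def __init__(self, groups, segment_bytes, cpu_to_group):
        self.segment_bytes = segment_bytes
        self.cpu_to_group = cpu_to_group      # processor id -> locality group
        self.top = {g: 0 for g in groups}     # bump pointer of each segment
        self.collections = 0

    def allocate(self, cpu, size):
        """Return (group, offset) for a new object requested by `cpu`."""
        if size > self.segment_bytes:
            raise MemoryError("object larger than a segment")
        group = self.cpu_to_group[cpu]
        if self.top[group] + size > self.segment_bytes:
            self.collect()                    # the requester's segment is full
        offset = self.top[group]
        self.top[group] += size
        return group, offset

    def collect(self):
        """Minor collection: empty every segment, not only the full one."""
        self.collections += 1
        for group in self.top:
            self.top[group] = 0


eden = NumaEden(groups=[0, 1], segment_bytes=100,
                cpu_to_group={0: 0, 1: 0, 2: 1, 3: 1})
print(eden.allocate(0, 60))   # goes to the segment of group 0
print(eden.allocate(2, 30))   # goes to the segment of group 1
print(eden.allocate(1, 60))   # group 0 is full, so all segments are collected
print(eden.collections, eden.top)
```

Collecting every segment at once means the other groups start from empty segments as well, which is the synchronization saving the slide refers to.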

Slide 12 / 17: NUMA-Aware Old Generation

We divide tenured space into segments
  – Each locality group is assigned a segment
      Pages in each segment are placed local to the group
When an object is promoted to old generation
  – Preferred location of the object is identified
      Processor that accesses the object most
  – Object is moved to the segment of the processor's group
Object migrations during full garbage collection
  – Preferred locations of all objects are re-computed
Additional object migrations
  – At every fixed number of minor collections
      To match the dynamically changing behavior
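The promotion and migration rules can be sketched as bookkeeping over per-object access counts. The class below is a hypothetical model, not HotSpot code: it tracks which segment each tenured object lives in and re-computes preferred locations after every `migrate_every` minor collections, a parameter name chosen here for illustration:

```python
from collections import Counter


class NumaOldGen:
    """Toy tenured space with one segment per locality group."""

    def __init__(self, cpu_to_group, migrate_every):
        self.cpu_to_group = cpu_to_group     # processor id -> locality group
        self.migrate_every = migrate_every   # minor collections between migrations
        self.segment_of = {}                 # object id -> locality group
        self.accesses = {}                   # object id -> Counter of processors
        self.minor_collections = 0

    def record_access(self, obj, cpu):
        self.accesses.setdefault(obj, Counter())[cpu] += 1

    def preferred_group(self, obj, fallback):
        """Group of the processor that accessed the object most."""
        counts = self.accesses.get(obj)
        if not counts:
            return fallback
        return self.cpu_to_group[counts.most_common(1)[0][0]]

    def promote(self, obj, allocator_cpu):
        """Place a newly tenured object in its preferred segment."""
        fallback = self.cpu_to_group[allocator_cpu]
        self.segment_of[obj] = self.preferred_group(obj, fallback)

    def minor_collection(self):
        """Count a minor collection and migrate when the period is reached."""
        self.minor_collections += 1
        if self.minor_collections % self.migrate_every == 0:
            return self.migrate_all()
        return 0

    def migrate_all(self):
        """Re-compute every preferred location; return the number of moves."""
        moved = 0
        for obj, group in self.segment_of.items():
            target = self.preferred_group(obj, group)
            if target != group:
                self.segment_of[obj] = target
                moved += 1
        return moved
```

A full garbage collection would call `migrate_all()` directly, while `minor_collection()` covers the additional periodic migrations.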

Slide 13 / 17: Experimental Setup

Representative Java workloads for simulation
  – Generated from actual runs
  – Sequence of requests
      To allocate or access objects by processors
      Same order as the actual run
Workload Execution Machine
  – A hybrid execution simulator
      Consumes the generated parallel workload
      Issues memory allocations and accesses to the machine
  – Implements the underlying memory management algorithms
      Original algorithms in the HotSpot VM
      Algorithms for NUMA-Aware heap layouts
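The core loop of such a simulator can be sketched as a replay of the ordered request stream with a pluggable placement rule. This is a simplified assumption: it only counts local and non-local accesses, whereas the simulator described on the slide issues real allocations and accesses to the machine. The request format and all names are hypothetical:

```python
def replay(workload, place, cpu_to_group):
    """Replay an ordered request stream and count local and non-local accesses.

    workload -- iterable of (kind, cpu, obj); kind is "alloc" or "access"
    place    -- callable(cpu, obj) returning the locality group for a new object
    """
    home = {}
    local = remote = 0
    for kind, cpu, obj in workload:
        if kind == "alloc":
            home[obj] = place(cpu, obj)
        elif kind == "access":
            if home[obj] == cpu_to_group[cpu]:
                local += 1
            else:
                remote += 1
        else:
            raise ValueError(f"unknown request kind: {kind!r}")
    return local, remote


cpu_to_group = {0: 0, 1: 1}
workload = [("alloc", 0, "a"), ("access", 0, "a"), ("access", 1, "a"),
            ("alloc", 1, "b"), ("access", 1, "b")]

# Two hypothetical placement rules: local to the allocator, or always group 0.
print(replay(workload, lambda cpu, obj: cpu_to_group[cpu], cpu_to_group))
print(replay(workload, lambda cpu, obj: 0, cpu_to_group))
```

Replaying the same stream under different placement rules keeps the request order fixed, so any difference in the counts comes from the heap layout alone.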

Slide 14 / 17: NUMA-Aware Heap Experiments

Application
  – SPECjbb2000 benchmark on HotSpot VM
      Run with 12 warehouses
Platform
  – 24 processor Sun Fire 6800
  – 24 GB main memory
Sampling at every 1K transactions
Partial workload from the actual run
  – 10M allocation records
  – 28M memory accesses
Generated workloads with higher pressure
  – Scaled 16 and 32 times

Slide 15 / 17: Reduction in Non-Local Accesses

Scale Factor   Heap Configuration   Young Gen.   Old Gen.   Young + Old Gen.
16             NUMA-Eden            57.6 %       0.3 %      28.1 %
16             NUMA-Eden+Old        55.3 %       27.5 %     41.0 %
32             NUMA-Eden            50.9 %       1.2 %      27.3 %
32             NUMA-Eden+Old        48.0 %       30.2 %     39.5 %
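The benefit of making the old generation NUMA-aware can be read off the Young + Old Gen. column as the difference between the two configurations at each scale factor. A small worked calculation, with a hypothetical helper name:

```python
# Young + Old Gen. reductions (%) from the table, keyed by (scale, configuration).
reduction = {
    (16, "NUMA-Eden"): 28.1,
    (16, "NUMA-Eden+Old"): 41.0,
    (32, "NUMA-Eden"): 27.3,
    (32, "NUMA-Eden+Old"): 39.5,
}


def old_gen_gain(scale):
    """Extra reduction, in percentage points, from the NUMA-aware old generation."""
    both = reduction[(scale, "NUMA-Eden+Old")]
    eden_only = reduction[(scale, "NUMA-Eden")]
    return round(both - eden_only, 1)


for scale in (16, 32):
    print(scale, old_gen_gain(scale))
```

At both scale factors the NUMA-aware old generation adds roughly 12 to 13 percentage points, even though the young generation figure is slightly lower in the NUMA-Eden+Old rows.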

Slide 16 / 17: Execution Times

NUMA-aware heaps are effective
  – 27% improvement for NUMA-Eden configuration
  – 40% improvement for NUMA-Eden+Old configuration
More effective for higher memory pressure

Slide 17 / 17: Conclusions

NUMA-Aware heap layouts
  – Up to 41% reduction in non-local accesses
  – Up to 40% improvement in workload execution
Dynamic object migration is beneficial
  – Compared to using only NUMA-aware young generation
NUMA-aware heaps are more effective
  – As the memory pressure increases
More effective on larger servers
  – Sun Fire 15K (latency ratio => 1:1.78)
  – SGI Altix 3000 (latency ratio => 1:4.17)