Enhancing X10 Performance by Auto-tuning the Managed Java Back-end
Vimuth Fernando, Milinda Fernando, Tharindu Rusira, Sanath Jayasena
Department of Computer Science and Engineering, University of Moratuwa
X10 Language
- X10 [1] is an object-oriented, Asynchronous Partitioned Global Address Space (APGAS) language
- Designed to provide productivity for high-performance applications
- X10 code is compiled to either
  - the Native back-end (C++ code), or
  - the Managed back-end (Java code)
- In general, the Java back-end's performance lags behind the native back-end
Performance Tuning the Java Back-end
- Our work focuses on improving the performance of X10 programs that use the Java back-end
- The native back-end is commonly used for performance-critical applications
- Successful tuning can give the Java back-end the best of both worlds:
  - performance
  - Java's portability and the availability of enterprise-level libraries
- We employ auto-tuning techniques that have been widely used for similar high-performance computing workloads [4], [5], [6]
OpenTuner Framework [2]
- A framework for developing auto-tuning applications
- Can be used to:
  - define a set of tunable parameters and the resulting configuration space
  - search the configuration space to identify the best configuration
- A minimal usage sketch is given below
Figure: Structure of the OpenTuner framework [2]
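Below is a minimal sketch of how an OpenTuner measurement interface is typically structured: the search space is declared in manipulator() and each candidate configuration is evaluated in run(). The benchmark command (./my_benchmark) and its block_size parameter are hypothetical placeholders, not part of this work.

```python
import opentuner
from opentuner import ConfigurationManipulator, IntegerParameter
from opentuner import MeasurementInterface, Result


class ExampleTuner(MeasurementInterface):
    """Tunes a single integer parameter of a hypothetical benchmark."""

    def manipulator(self):
        # Declare the configuration space explored by OpenTuner.
        m = ConfigurationManipulator()
        m.add_parameter(IntegerParameter('block_size', 1, 1024))
        return m

    def run(self, desired_result, input, limit):
        # Evaluate one candidate configuration and report its run time.
        cfg = desired_result.configuration.data
        cmd = './my_benchmark --block-size={0}'.format(cfg['block_size'])
        result = self.call_program(cmd)
        return Result(time=result['time'])


if __name__ == '__main__':
    ExampleTuner.main(opentuner.default_argparser().parse_args())
```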
Our Auto-tuning Approach
- X10 Java back-end code runs on a set of Java Virtual Machines (JVMs)
- JVMs expose hundreds of tunable parameters that can be used to change their behavior
- By tuning these parameters we can improve performance; a few representative flags are sketched below
Figure: Some tunable parameters exposed by Java Virtual Machines
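As an illustration, the sketch below renders a small configuration into well-known HotSpot flags. The parameter names (max_heap_mb, new_ratio, gc, aggressive_opts) and value choices are illustrative assumptions, not the full flag set explored in this work.

```python
def jvm_flags(cfg):
    """Render a configuration dict into a list of HotSpot command-line flags."""
    flags = [
        '-Xmx{0}m'.format(cfg['max_heap_mb']),        # maximum heap size
        '-XX:NewRatio={0}'.format(cfg['new_ratio']),  # old-to-young generation size ratio
        '-XX:+{0}'.format(cfg['gc']),                 # e.g. UseParallelGC or UseConcMarkSweepGC
    ]
    if cfg.get('aggressive_opts'):
        flags.append('-XX:+AggressiveOpts')           # bundle of experimental optimizations
    return flags


# Example: jvm_flags({'max_heap_mb': 2048, 'new_ratio': 2, 'gc': 'UseParallelGC'})
# -> ['-Xmx2048m', '-XX:NewRatio=2', '-XX:+UseParallelGC']
```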
Our Auto-tuning Approach (ctd.)
- X10 programs typically run on large distributed systems
- A large number of JVMs are therefore created, each with hundreds of parameters that need to be tuned
- To keep the search tractable we use:
  - JATT [3] – introduces a hierarchical structure over the JVM tunable parameters, making tuning faster
  - Parameter duplication – all JVMs share a single set of parameter values; this works because the workloads at each place are similar
Figure: The process used by our auto-tuner
Our Auto-tuning Approach (ctd.)
- OpenTuner generates test configurations, each a specific combination of parameter values
- Each configuration is distributed to the nodes running the X10 program compiled to the Java back-end
- The resulting performance is measured and used to steer the search algorithm toward an optimal configuration faster
- This cycle continues until a satisfactory performance gain is achieved (see the sketch below)
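The sketch below shows what one iteration of this cycle could look like for a managed-backend benchmark, assuming OpenTuner's standard MeasurementInterface API. The launch command (x10 KMeans), the X10_JVM_ARGS variable used to forward the shared flag set to every JVM, and the parameter choices are hypothetical placeholders; X10_NPLACES is X10's usual variable for the number of places.

```python
import os

import opentuner
from opentuner import ConfigurationManipulator, EnumParameter, IntegerParameter
from opentuner import MeasurementInterface, Result


class ManagedX10Tuner(MeasurementInterface):
    """One tuning cycle: the same JVM flag set for every place (parameter duplication)."""

    def manipulator(self):
        m = ConfigurationManipulator()
        m.add_parameter(IntegerParameter('max_heap_mb', 256, 8192))
        m.add_parameter(EnumParameter('gc', ['UseParallelGC', 'UseConcMarkSweepGC']))
        return m

    def run(self, desired_result, input, limit):
        cfg = desired_result.configuration.data
        # Duplicate a single flag set across all JVMs in the run.
        # X10_JVM_ARGS is a hypothetical hand-off mechanism; the real launcher
        # may forward JVM options differently.
        os.environ['X10_JVM_ARGS'] = '-Xmx{0}m -XX:+{1}'.format(
            cfg['max_heap_mb'], cfg['gc'])
        os.environ['X10_NPLACES'] = '8'            # number of X10 places for the run
        result = self.call_program('x10 KMeans')   # placeholder launch command
        assert result['returncode'] == 0
        return Result(time=result['time'])         # measured time steers the search


if __name__ == '__main__':
    ManagedX10Tuner.main(opentuner.default_argparser().parse_args())
```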
Benchmark Applications
Benchmark applications provided with the X10 source distribution were used in our experiments:
- KMeans – implements the K-means clustering problem; total execution time in seconds is the performance metric
- SSCA1 – implements the sequence alignment problem using the Smith-Waterman algorithm
- SSCA2 – a graph-theory benchmark that stresses integer operations, large memory footprints, and irregular memory access patterns
- Stream – measures the memory bandwidth and related computation rate for vector kernels; data rate in GB/s is the performance metric
- UTS – implements an Unbalanced Tree Search algorithm that measures X10's ability to balance irregular work across many places
- LULESH 2.0 – Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics, a proxy application for shock hydrodynamics
Experimental Setup
Our auto-tuning experiments were run on the following platforms:
Architecture 1
- Intel Xeon E7-4820 v2 CPU @ 2.00 GHz (32 cores)
- 64 GB RAM
- Ubuntu 14.04 LTS
- OpenJDK 7 (HotSpot JVM), update 55
Architecture 2
- Intel Core i7 CPU @ 3.40 GHz (4 cores)
- 16 GB RAM
- Ubuntu 12.04 LTS
- OpenJDK 7 (HotSpot JVM), update 55
Results
- Significant performance gains were achieved through auto-tuning in all benchmarks
- Because the performance metrics differ across the benchmarks, a normalized performance-gain measure is used (see the formula below)
Fig. 2: X10 benchmark performance improvements
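The exact formula is not reproduced here; a plausible form, consistent with mixing time-based (lower is better) and rate-based (higher is better) metrics, is the relative improvement over the default configuration:

```latex
% Assumed normalization (the exact formula is not reproduced on the poster):
% relative improvement of the tuned configuration over the default one.
\[
\text{gain} =
\begin{cases}
\dfrac{t_{\text{default}} - t_{\text{tuned}}}{t_{\text{default}}} \times 100\%, & \text{time-based metrics (lower is better)}\\[2ex]
\dfrac{r_{\text{tuned}} - r_{\text{default}}}{r_{\text{default}}} \times 100\%, & \text{rate-based metrics (higher is better)}
\end{cases}
\]
```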
Both architectures show similar performance gains
Figure: Performance improvement over tuning time
Profiling
- Profiling of the underlying JVM behavior was used to identify the causes of the performance gains
- We profiled:
  - heap usage
  - compilation rate
  - class-loading rate
  - garbage-collection time
- Profiling data from the default configuration and the tuned configuration were compared (a sampling sketch is given below)
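A sketch of how such JVM metrics could be sampled with the standard HotSpot jstat tool is shown below; the process id, option list, and sampling interval are illustrative, and the actual profiling setup of this work may differ.

```python
import subprocess


def sample_jvm(pid, interval_ms=1000, samples=10):
    """Collect raw jstat output for one JVM process (heap/GC, classes, JIT)."""
    reports = {}
    for option in ('-gcutil',     # heap occupancy and accumulated GC time
                   '-class',      # class loading / unloading counts
                   '-compiler'):  # JIT compilation statistics
        out = subprocess.check_output(
            ['jstat', option, str(pid), str(interval_ms), str(samples)],
            universal_newlines=True)
        reports[option] = out
    return reports


# Example: sample_jvm(12345) polls the JVM with pid 12345 ten times, one
# second apart, returning the raw text tables produced by jstat.
```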
Discussion
- No single source of performance gains
  - The profiling data show that the gains are achieved through different means for each benchmark
  - There is therefore no silver-bullet configuration that fits all circumstances
  - We believe this is further motivation for auto-tuners over manual tuning methods
- The Java back-end still performs poorly compared to the native back-end
  - With these performance improvements it becomes more acceptable
  - This, together with the additional advantages offered by the Java back-end, makes it more feasible for HPC applications
Input Sensitivity
- We analyzed the identified configurations' sensitivity to input size
- Good configurations identified by the tuner tend to provide performance gains that scale with larger input sizes
- The figure shows how the configuration identified with the smallest input size performs against the default configuration for the LULESH benchmark
- The other benchmarks displayed similar behavior
- This allows us to tune benchmarks with smaller inputs, reducing tuning times
Conclusions
- Auto-tuning methods can be used to obtain performance gains in distributed X10 programs compiled to the Java back-end
- These performance gains scale well with larger input sizes
- Using profiling data we studied the causes of the performance gains and found that they differ with the characteristics of each benchmark
- This work provides evidence that auto-tuning methods need to be explored further in the search for better performance in X10 applications
Acknowledgements
This project was supported by a Senate Research Grant awarded by the University of Moratuwa. We would also like to acknowledge the support provided by the LK Domain Registry through the V. K. Samaranayake Grant.
Thank you
Main References
[1] Mikio Takeuchi, Yuki Makino, Kiyokuni Kawachiya, Hiroshi Horii, Toyotaro Suzumura, Toshio Suganuma, and Tamiya Onodera. 2011. "Compiling X10 to Java". In Proceedings of the 2011 ACM SIGPLAN X10 Workshop (X10 '11). ACM, New York, NY, USA, Article 3, 10 pages.
[2] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. "OpenTuner: An Extensible Framework for Program Autotuning". In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, New York, NY, USA, 303-316.
[3] Sanath Jayasena, Milinda Fernando, Tharindu Rusira, Chalitha Perera, and Chamara Philips. 2015. "Auto-tuning the Java Virtual Machine". In Proceedings of the 10th IEEE International Workshop on Automatic Performance Tuning (iWAPT '15).
[4] Sara S. Hamouda, Josh Milthorpe, Peter E. Strazdins, and Vijay Saraswat. 2015. "A Resilient Framework for Iterative Linear Algebra Applications in X10". In 16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2015).
[5] Long Cheng, Spyros Kotoulas, Tomas E. Ward, and Georgios Theodoropoulos. 2015. "High Throughput Indexing for Large-scale Semantic Web Data".
[6] J. Bilmes, K. Asanovic, C. W. Chin, and J. Demmel. 1997. "Optimizing Matrix Multiply Using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology". In Proceedings of the 1997 ACM International Conference on Supercomputing.