Enhancing X10 Performance by Auto-tuning the Managed Java Back-end


Enhancing X10 Performance by Auto-tuning the Managed Java Back-end
Vimuth Fernando, Milinda Fernando, Tharindu Rusira, Sanath Jayasena
Department of Computer Science and Engineering, University of Moratuwa

X10 Language
X10 [1] is an object-oriented, Asynchronous Partitioned Global Address Space (APGAS) language designed to deliver productivity for high-performance applications. X10 code is compiled to one of two back-ends:
- Native back-end: C++ code
- Managed back-end: Java code
In general, the Java back-end's performance is low.

Performance Tuning the Java Back-end
Our work focused on improving the performance of X10 programs that use the Java back-end; the native back-end is the common choice for performance-critical applications. Tuning, if successful, can give the Java back-end the best of both worlds:
- performance
- Java's portability and the availability of enterprise-level libraries
We employ auto-tuning techniques that have been widely used on similar high-performance computing workloads [4], [5], [6].

OpenTuner Framework [2]
OpenTuner is a framework for developing auto-tuning applications. It can be used to:
- define a set of tunable parameters and the resulting configuration space
- search through the configuration space to identify the best configuration
(Figure: structure of the OpenTuner framework [2])
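To make this concrete, below is a minimal sketch of an OpenTuner measurement interface for JVM flags, modeled on OpenTuner's standard tutorial structure. The parameter names, ranges, and the benchmark.jar command are illustrative assumptions, not details taken from the slides.

```python
# Minimal OpenTuner sketch. The flags, ranges, and 'benchmark.jar' command
# are illustrative assumptions, not taken from the slides.
import opentuner
from opentuner import ConfigurationManipulator, EnumParameter, IntegerParameter
from opentuner import MeasurementInterface, Result


class JvmFlagsTuner(MeasurementInterface):
    def manipulator(self):
        # Define the configuration space OpenTuner will search over.
        m = ConfigurationManipulator()
        m.add_parameter(IntegerParameter('heap_mb', 256, 8192))
        m.add_parameter(EnumParameter(
            'gc', ['UseSerialGC', 'UseParallelGC', 'UseG1GC']))
        return m

    def run(self, desired_result, input, limit):
        # Turn one candidate configuration into JVM flags, run the
        # benchmark, and report wall-clock time to the search algorithm.
        cfg = desired_result.configuration.data
        flags = '-Xmx{}m -XX:+{}'.format(cfg['heap_mb'], cfg['gc'])
        run_result = self.call_program('java {} -jar benchmark.jar'.format(flags))
        return Result(time=run_result['time'])


if __name__ == '__main__':
    JvmFlagsTuner.main(opentuner.default_argparser().parse_args())
```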

Our Auto-tuning Approach
Code compiled via the X10 Java back-end runs on a set of Java Virtual Machines (JVMs). JVMs expose hundreds of tunable parameters that can be used to change their behavior; by tuning these parameters we can improve performance.
(Figure: some tunable parameters exposed by Java Virtual Machines)
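To get a rough sense of the size of this space, HotSpot can print its full flag list; a small sketch follows (assuming a HotSpot `java` on the PATH, and that each flag line has at least four whitespace-separated fields):

```python
# Sketch: count the tunable flags a HotSpot JVM exposes.
# -XX:+PrintFlagsFinal prints one flag per line: <type> <name> <=|:=> <value> ...
import subprocess

out = subprocess.check_output(
    ['java', '-XX:+PrintFlagsFinal', '-version'],
    stderr=subprocess.DEVNULL).decode()
flag_names = [line.split()[1] for line in out.splitlines()
              if len(line.split()) >= 4]
print('{} tunable flags, e.g. {}'.format(len(flag_names), flag_names[:3]))
```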

Our Auto-tuning Approach (contd.)
X10 programs typically run on large distributed systems, so a large number of JVMs are created, each with hundreds of parameters that need to be tuned. To keep this tractable we use two techniques (sketched after this slide):
- JATT [3]: introduces a hierarchical structure over the JVM tunable parameters, making tuning faster
- Parameter duplication: all JVMs share a single set of parameter values; this works because the workloads are similar
(Figure: the process used by our auto-tuner)
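The following sketch illustrates both ideas. The grouping and flag names are made up for illustration; JATT's actual hierarchy is defined in [3].

```python
# Illustrative sketch only: JATT's real parameter hierarchy is described
# in reference [3]; the flags and ranges below are examples.

# Hierarchical structure: collector-specific knobs are only searched once
# that collector is selected, which shrinks the effective search space.
GC_HIERARCHY = {
    'UseParallelGC': {'ParallelGCThreads': (1, 32)},
    'UseG1GC':       {'MaxGCPauseMillis': (10, 1000)},
    'UseSerialGC':   {},  # no collector-specific knobs in this sketch
}

def active_parameters(chosen_gc):
    # Only the knobs under the chosen collector are tunable this round.
    return GC_HIERARCHY[chosen_gc]

# Parameter duplication: every JVM (one per X10 place) receives the same
# flag string, which works because all places run similar workloads.
def flags_for_all_places(flag_string, n_places):
    return [flag_string] * n_places
```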

Our Auto-tuning Approach (contd.)
OpenTuner is used to generate test configurations, each a specific combination of parameter values. These configurations are distributed to the nodes running the X10 program compiled to the Java back-end, the resulting performance is measured, and the measurement is used to direct the search algorithm toward an optimal configuration faster. This cycle continues until a good performance gain is achieved.
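With a tuner like the sketch above saved as, say, jvm_tuner.py (a hypothetical file name), OpenTuner's default argument parser can bound this cycle by wall-clock time, e.g. `python jvm_tuner.py --stop-after=600` to stop searching after ten minutes.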

Benchmark Applications
The benchmark applications provided with the X10 source distribution were used in our experiments:
- KMeans: implements the K-means clustering problem; total execution time in seconds is the performance metric.
- SSCA1: implements the sequence alignment problem using the Smith-Waterman algorithm.
- SSCA2: a graph-theory benchmark that stresses integer operations, large memory footprints, and irregular memory access patterns.
- Stream: measures memory bandwidth and the related computation rate for vector kernels; data rate in GB/s is the performance metric.
- UTS: implements an unbalanced tree search that measures X10's ability to balance irregular work across many places.
- LULESH 2.0: Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics, a proxy application for shock hydrodynamics.

Experimental Setup
Our auto-tuning experiments were run on the following platforms:
Architecture 1
- Intel Xeon E7-4820 v2 CPU @ 2.00 GHz (32 cores), 64 GB RAM
- Ubuntu 14.04 LTS
- OpenJDK 7 (HotSpot JVM), update 55
Architecture 2
- Intel Core i7 CPU @ 3.40 GHz (4 cores), 16 GB RAM
- Ubuntu 12.04 LTS
- OpenJDK 7 (HotSpot JVM), update 55

Results
Significant performance gains were achieved through auto-tuning in all benchmarks. The performance metrics differ across the benchmarks, so a normalized formula was used to measure the performance gains.
(Fig. 2: X10 benchmark performance improvements)
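The formula itself appears in the slides only as an image. A conventional definition consistent with the surrounding text (our assumption, not recovered from the slide) is

improvement (%) = 100 × (perf_tuned − perf_default) / perf_default

where perf is the measured data rate for Stream and the reciprocal of the execution time for the time-based benchmarks, so that higher is always better.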

Both architectures show similar performance gains.

Performance tuning with time

Profiling
Profiling of the underlying JVM characteristics was used to identify the causes of the performance gains. We profiled (one way to sample these counters is sketched below):
- heap usage
- compilation rate
- class loading rate
- garbage collection time
The profiling data of the default configuration and the tuned configuration were used to compare the two runs.
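The slides do not name a profiling tool; one common way to sample exactly these counters from a running HotSpot JVM is jstat, as in this sketch:

```python
# Sketch: sample the four profiled JVM characteristics with HotSpot's jstat.
# The tool choice is our assumption; the slides do not name one.
import subprocess

def sample_jvm(pid):
    def jstat(option):
        return subprocess.check_output(['jstat', option, str(pid)]).decode()
    return {
        'heap_and_gc': jstat('-gcutil'),    # heap occupancy and GC time
        'compilation': jstat('-compiler'),  # JIT compilations and compile time
        'classes':     jstat('-class'),     # classes loaded/unloaded
    }
```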

Discussion
There is no single source of the performance gains: the profiling data shows that the gains are achieved through different means for each benchmark, so there is no silver-bullet configuration that fits all circumstances. We believe this is a further argument for auto-tuners over manual tuning methods.
The Java back-end still performs worse than the native back-end, but with these performance improvements it becomes more acceptable. Together with the additional advantages the Java back-end offers, this makes it more feasible for HPC applications.

Input Sensitivity
We analyzed the identified configurations' sensitivity to input size, and found that good configurations tend to provide performance gains that scale with larger inputs. The figure shows how the configuration identified at the smallest input size performs against the default configuration for the LULESH benchmark; the other benchmarks displayed similar behavior. This allows us to tune benchmarks with smaller inputs, reducing tuning times.

Conclusions
- Auto-tuning methods can generate performance gains in distributed X10 programs compiled to the Java back-end.
- These performance gains scale well with larger input sizes.
- Using profiling data, we studied the causes of the performance gains and found that they differ with the characteristics of each benchmark.
- This work provides evidence that auto-tuning methods should be further explored in the search for better performance in X10 applications.

Acknowledgements
This project was supported by a Senate Research Grant awarded by the University of Moratuwa. We would also like to acknowledge the support provided by the LK Domain Registry through the V. K. Samaranayake Grant.

Thank you

Main References
[1] Mikio Takeuchi, Yuki Makino, Kiyokuni Kawachiya, Hiroshi Horii, Toyotaro Suzumura, Toshio Suganuma, and Tamiya Onodera. 2011. "Compiling X10 to Java". In Proceedings of the 2011 ACM SIGPLAN X10 Workshop (X10 '11). ACM, New York, NY, USA, Article 3, 10 pages.
[2] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, and Saman Amarasinghe. 2014. "OpenTuner: An Extensible Framework for Program Autotuning". In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT '14). ACM, New York, NY, USA, 303-316.
[3] Sanath Jayasena, Milinda Fernando, Tharindu Rusira, Chalitha Perera, and Chamara Philips. 2015. "Auto-tuning the Java Virtual Machine". In Proceedings of the 10th IEEE International Workshop on Automatic Performance Tuning (iWAPT '15).
[4] Sara S. Hamouda, Josh Milthorpe, Peter E. Strazdins, and Vijay Saraswat. 2015. "A Resilient Framework for Iterative Linear Algebra Applications in X10". In 16th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC 2015).
[5] Long Cheng, Spyros Kotoulas, Tomas E. Ward, and Georgios Theodoropoulos. 2015. "High Throughput Indexing for Large-scale Semantic Web Data".
[6] Jeff Bilmes, Krste Asanovic, Chee-Whye Chin, and Jim Demmel. 1997. "Optimizing Matrix Multiply Using PHiPAC: A Portable, High-Performance, ANSI C Coding Methodology". In Proceedings of the 1997 ACM International Conference on Supercomputing.