CLARA Based Application Vertical Elasticity


V. Gyurjyan, D. Abbott, W. Gu, G. Heyes, E. Jastrzembski, S. Mancilla¹, B. Muffit, R. Oyarzun¹, C. Timmer
¹ Universidad Técnica Federico Santa María, Chile

Abstract

Processor speed and core count are important factors for the vertical scaling of a multi-threaded application. However, the hardware-specific memory subsystem and application core affinity can have an equal, or sometimes even greater, impact on application performance. In this paper we present a performance study of the CLARA-based CLAS12 reconstruction application running on the JLAB computing farm. We show that choosing an application deployment configuration that takes the processor architecture of the computing node into account can substantially improve scaling performance.

Results

The remainder of the study was done on the Xeon Broadwell 2.3 GHz dual 18-core hardware, where a series of scaling measurements of different CLARA configurations was conducted (results are shown in Figure 2). First we tested the scaling of the reconstruction application running in a single CLARA runtime. In this case, the CLAS12 reconstruction services were deployed in a single CLARA data processing environment (DPE). Thus, we scaled a single process up to 72 parallel threads, reconstructing events from simulated physics data with at least 4 tracks in the CLAS12 detector. For these measurements, the data files were staged in shared memory to minimize the IO contribution to the data processing latency. The results showed non-linear scaling that becomes more prominent at higher core counts (see Figure 2, curve 36P36H). A further set of measurements tested different CLARA configurations, varying the number of DPEs running in the same node, with each DPE deploying the complete chain of reconstruction services. We also measured performance with the DPEs sharing a single set of IO services (Single-IO-Services mode) to further minimize the impact of IO latency on performance, even though file IO was already from memory.
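The thread-versus-process scaling measurement described above can be imitated with a small, self-contained sketch. The toy CPU-bound task below stands in for the reconstruction services; all names and workloads here are illustrative and are not part of CLARA or the CLAS12 application:

```python
# Toy vertical-scaling benchmark: measure events/second as workers are added,
# as in the Figure 2 curves. "reconstruct" is a stand-in for per-event work.
import time
from concurrent.futures import ProcessPoolExecutor

def reconstruct(event):
    # Stand-in for per-event track reconstruction: pure CPU work.
    acc = 0
    for i in range(50_000):
        acc += (event * i) % 7
    return acc

def throughput(n_workers, n_events=200):
    """Return events per second for a given worker-process count."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        list(pool.map(reconstruct, range(n_events), chunksize=10))
    return n_events / (time.perf_counter() - start)

if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "workers:", round(throughput(n), 1), "events/s")
```

Plotting throughput against the worker count exposes the same kind of non-linear scaling at higher counts, here dominated by memory and scheduler effects rather than detector reconstruction itself.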
The reconstruction application was then deployed in 2 CLARA DPEs, assigning half of the cores, together with their own non-uniform memory access (NUMA) controllers, to each DPE. With two CLARA runtime processes running in the same node, each bound to a dedicated NUMA node (the 18P18H-Node0 and 18P18H-Node1 configurations, see Table 1), we were able to improve application scalability by more than 45%. The CLARA 2-DPE configuration without core affinity showed only a 23% improvement. The Xeon Broadwell hardware layout, with its 2 LLC memory rings [4], suggested the CLARA 4-DPE configuration with the core assignments shown in Table 1. The 4-DPE configuration, however, improved vertical scaling by only 11.5%.

Introduction

Software systems, including scientific data processing applications, have a distinctly component-based structure. Most data processing and data mining applications are divided into more or less autonomous software units that cooperate to achieve the data processing goals. CLARA is a micro-services framework that presents a software application as a network of "building block" processes (services) that communicate through message passing to achieve the computing objectives of the application [1][2][3]. These services are isolated, decoupled units of development, each addressing a specific concern. By treating the data processing application as an organization of cooperating autonomous services, CLARA improves application agility in terms of deployment, customization, distribution, and scalability. The CLARA environment provides tools for service developers to present their developments (engines) as services. Vertical and horizontal scaling of a CLARA service-based application is the responsibility of the framework, with the only requirement that service engines be thread safe. The CLARA framework is multilingual and offers language bindings for Java, Python, and C++.
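The core-affinity idea behind the 18P18H-Node0 / 18P18H-Node1 configurations, one runtime process pinned to the cores of each NUMA node, can be sketched outside of CLARA using the Linux scheduler API. The helper below is purely illustrative (CLARA's actual deployment mechanism differs, and `os.sched_setaffinity` is Linux-only):

```python
# Illustrative sketch (not CLARA itself): pin one worker process per core set,
# mimicking one DPE bound to each NUMA node.
import multiprocessing as mp
import os

def worker(core_set, out):
    # Bind this process (pid 0 = self) to its core set, then report back the
    # affinity mask actually in effect.
    os.sched_setaffinity(0, core_set)
    out.put(sorted(os.sched_getaffinity(0)))

def run_pinned_workers(core_sets):
    """Launch one worker per core set and return each worker's affinity."""
    out = mp.Queue()
    procs = [mp.Process(target=worker, args=(cs, out)) for cs in core_sets]
    for p in procs:
        p.start()
    results = [out.get() for _ in procs]
    for p in procs:
        p.join()
    return results

if __name__ == "__main__":
    n = os.cpu_count() or 1
    node0 = set(range(n // 2)) or {0}    # stand-in for NUMA node 0 cores
    node1 = set(range(n // 2, n)) or {0} # stand-in for NUMA node 1 cores
    print(run_pinned_workers([node0, node1]))
```

On a real dual-socket node the core-to-NUMA-node mapping would be taken from the hardware topology (e.g. `numactl --hardware`) rather than a naive split of the core range.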
Methods and Materials

The CLAS12 event reconstruction application consists of services responsible for the specific algorithms of the overall data processing, such as hit-based and time-based track reconstruction in the forward and central detectors, electromagnetic calorimeter reconstruction, time-of-flight detector reconstruction, etc. The vertical scaling tests of the event reconstruction application were conducted on 4 different hardware flavors of the JLAB computing farm: AMD, Xeon Haswell 2.3 GHz, Xeon Broadwell 2.3 GHz, and Xeon Phi KNL 1.3 GHz. The results of these measurements are shown in Figure 1.

Table 1. Figure 3.

Discussion

The CLAS12 data processing application vertical scaling results are shown in Figure 3. This study showed that we can substantially improve performance by deploying a 2-DPE CLARA configuration that scales over dedicated (assigned) processor cores. The two-process configuration without core assignments also showed an improvement, which is simply due to reduced contention in the JRE garbage collector caused by excessive object creation in the service engines. IO latency is a possible explanation for the marginal improvement of the 4-DPE configuration. The results for the Single-IO-Services mode are very promising; however, the CLARA orchestrator latency was not taken into account. Even though we do not expect large application orchestration latency, new measurements are required to confirm the Single-IO-Services results.

Figure 1. Figure 2.

Because the CLAS12 reconstruction application is written in Java and does not use vector floating point operations, we were not expecting good vertical scaling on the KNL node. However, we were also not satisfied with the results on Xeon Broadwell, where we expected at least a 50% increase in performance compared to Xeon Haswell.
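The garbage-collector contention noted in the discussion comes from per-event object creation inside the service engines. The contrast below sketches the refactoring idea, reusing a preallocated buffer across events instead of allocating a fresh one per event (shown in Python for brevity; the actual application is Java, where the same pattern relieves JRE garbage-collector pressure):

```python
# Two equivalent event loops: one allocates a scratch buffer per event, the
# other reuses a single preallocated buffer across all events.
def process_events_allocating(events, size=1024):
    out = []
    for e in events:
        buf = [0] * size      # fresh allocation on every event -> GC pressure
        buf[0] = e
        out.append(buf[0])
    return out

def process_events_reusing(events, size=1024):
    buf = [0] * size          # one buffer, reused for every event
    out = []
    for e in events:
        buf[0] = e
        out.append(buf[0])
    return out
```

Both loops produce identical results; only the allocation behavior differs, which is exactly the kind of change that is transparent to the rest of the application.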
Code analysis showed no thread contention; however, we realized that refactoring the code to minimize object creation could potentially improve performance by reducing contention at the JRE garbage collector level. As this was not feasible at this stage of experiment readiness, we decided to study CLARA configurations in the hope of improving reconstruction application scalability.

Conclusions

We conducted performance benchmark measurements of the CLAS12 reconstruction application. We showed that the vertical scalability of the application can be improved by a proper combination of multithreading and multiprocessing. We also demonstrated the importance of thread affinity for achieving maximum performance. All of these improvements were made without user code modifications, thanks to the CLARA agile environment.

Contact

Vardan Gyurjyan
Jefferson Lab
Email: gurjyan@jlab.org
Website: https://claraweb.jlab.org
Phone: 757 753 9165

References

[1] V. Gyurjyan et al., "CLARA: A Contemporary Approach to Physics Data Processing", Journal of Physics: Conf. Ser., Volume 331, 2011.
[2] V. Gyurjyan et al., "Component Based Dataflow Processing Framework", 2015 IEEE International Conference on Big Data, Santa Clara.
[3] V. Gyurjyan et al., "CLARA: CLAS12 Reconstruction and Analysis Framework", 17th International Workshop on Advanced Computing and Analysis Techniques in Physics Research (ACAT 2016).
[4] F. Denneman, "NUMA Deep Dive Part 2: System Architecture", http://frankdenneman.nl/2016/07/08/numa-deep-dive-part-2-system-architecture/