CLARA Based Application Vertical Elasticity

V. Gyurjyan, D. Abbott, W. Gu, G. Heyes, E. Jastrzembski, S. Mancilla¹, B. Muffit, R. Oyarzun¹, C. Timmer
¹ Universidad Técnica Federico Santa María, Chile

Abstract

Processor speed and core count are important factors for vertical scaling of a multi-threaded application. However, the hardware-specific memory subsystem and application core affinity can have an equal or sometimes even greater impact on application performance. In this paper we conduct a performance study of the CLARA based CLAS12 reconstruction application running on the JLAB computing farm. We show that choosing a proper application deployment configuration that takes into account the computing node processor architecture can substantially increase scaling performance.

Results

The rest of the study was done on the Xeon Broadwell 2.3 GHz dual 18 core hardware. A series of scaling measurements of CLARA configurations was conducted (results are shown in Figure 2). First we tested the scaling of the reconstruction application running in a single CLARA runtime. In this case, the CLAS12 reconstruction services were deployed in a single CLARA data processing environment (DPE). Thus, we scaled a single process up to 72 parallel threads, reconstructing events from simulated physics data with at least 4 tracks in the CLAS12 detector. For the measurements, the data files were staged in shared memory to minimize the IO contribution to the data processing latency. The results showed non-linear scaling, which became more prominent at higher core counts (see Figure 2, curve 36P36H).

A set of measurements was then conducted to test different CLARA configurations, varying the number of DPEs running on the same node. Each DPE deployed a complete chain of reconstruction services. We also measured performance when the DPEs shared a single set of IO services (Single-IO-Services mode) to minimize the impact of IO latency on performance, even though file IO was from memory.
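The non-linear scaling described above is commonly quantified as speedup and parallel efficiency relative to a single-thread run. The following is a minimal sketch of that calculation, not the analysis code used for this study; the function name `scaling_metrics` and the input layout (thread count mapped to event rate) are assumptions made for illustration.

```python
from typing import Dict


def scaling_metrics(rates: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Return speedup and parallel efficiency for each thread count.

    `rates` maps a thread count to a measured event rate (events per second).
    The single-thread rate is used as the baseline, so key 1 must be present.
    """
    if 1 not in rates:
        raise ValueError("a single-thread measurement is required as baseline")
    base = rates[1]
    metrics = {}
    for threads, rate in sorted(rates.items()):
        speedup = rate / base  # how many times faster than one thread
        metrics[threads] = {
            "speedup": speedup,
            "efficiency": speedup / threads,  # 1.0 means ideal linear scaling
        }
    return metrics
```

With ideal linear scaling the efficiency stays at 1.0 for every thread count; a value that drops as threads are added corresponds to the flattening of the scaling curve.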
The reconstruction application was then deployed in 2 CLARA DPEs, assigning to each DPE half of the cores together with their own non-uniform memory access (NUMA) memory controller. With two CLARA runtime processes running on the same node, each assigned to a dedicated NUMA node (18P18H-Node0 and 18P18H-Node1 configurations, see Table 1), we were able to improve application scalability by more than 45%. The CLARA 2DPE configuration without core affinity showed only a 23% improvement. The Xeon Broadwell hardware configuration with 2 LLC memory rings [4] suggested the CLARA 4DPE configuration with the core assignments shown in Table 1. The 4DPE configuration improved vertical scaling by only 11.5%.

Introduction

Software systems, including scientific data processing applications, have a distinct component-based structure. Most data processing and data mining applications are divided into more or less autonomous software units that cooperate to achieve data processing goals. CLARA is a micro-services framework that presents software applications as a network of “building block” processes (services) that communicate data through message passing to achieve the computing objectives of an application [1][2][3]. These services are isolated, decoupled units of development, each working on a specific concern. By treating the data processing application as an organization of cooperating autonomous services, CLARA improves application agility in terms of deployment, customization, distribution, and scalability. The CLARA environment provides tools for service developers to present their developments (engines) as services. Vertical and horizontal scaling of a CLARA service based application is the responsibility of the framework, with the only requirement being that service engines are thread safe. The CLARA framework is multilingual and offers language bindings for Java, Python and C++.
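The two-DPE deployment with one process per NUMA node, described in the Results, can be approximated on Linux with the `numactl` utility. The sketch below only builds the command lines; it is not the launcher used in this study, and `./start_dpe.sh` is a hypothetical placeholder for whatever command starts a DPE on a given installation.

```python
import shlex
from typing import List, Sequence


def numa_pinned_commands(base_cmd: Sequence[str],
                         numa_nodes: Sequence[int]) -> List[List[str]]:
    """Prefix `base_cmd` with numactl so that each copy runs on one NUMA node."""
    commands = []
    for node in numa_nodes:
        commands.append([
            "numactl",
            f"--cpunodebind={node}",  # run only on the CPUs of this NUMA node
            f"--membind={node}",      # allocate memory only from this node
            *base_cmd,
        ])
    return commands


if __name__ == "__main__":
    # "./start_dpe.sh" is a hypothetical placeholder, not a CLARA script name.
    for cmd in numa_pinned_commands(["./start_dpe.sh"], [0, 1]):
        print(shlex.join(cmd))
```

Binding memory as well as CPUs keeps each process on its local memory controller. A configuration that assigns explicit core lists, such as the 4DPE case, would need `numactl --physcpubind` with the core numbers from Table 1, which are not reproduced here.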
Methods and Materials

The CLAS12 event reconstruction application consists of services, each responsible for a specific algorithm of the overall data processing, such as hit based and time based track reconstruction in the forward and central detectors, electromagnetic calorimeter reconstruction, time-of-flight detector reconstruction, etc. The vertical scaling tests of the event reconstruction application were conducted on 4 different hardware flavors of the JLAB computing farm: AMD, Xeon Haswell 2.3 GHz, Xeon Broadwell 2.3 GHz and Xeon Phi KNL 1.3 GHz. The results of these measurements are shown in Figure 1.

Table 1.

Figure 3.

Discussion

The CLAS12 data processing application vertical scaling results are shown in Figure 3. This study showed that we can substantially improve performance by deploying a 2DPE CLARA configuration that scales over dedicated (assigned) processor cores. The two-process configuration without core assignments also showed an improvement, which was simply due to reduced contention at the JRE garbage collector caused by excessive object creation in the service engines. IO latency may explain the marginal improvement of the 4DPE configuration. The results for the Single-IO-Services mode are very promising; however, the CLARA orchestrator latency was not taken into account. Even though we do not expect a large application orchestration latency, new measurements are required to confirm the Single-IO-Services mode results.

Figure 1.

Figure 2.

Because the CLAS12 reconstruction application is written in Java and does not support vector floating point operations, we did not expect good vertical scaling on the KNL node. However, we were not satisfied with the result on Xeon Broadwell, where we expected at least a 50% increase in performance compared to Xeon Haswell.
Code analysis showed no thread contention; however, we realized that refactoring the code to minimize object creation could potentially improve performance by reducing contention at the JRE garbage collector level. As this was not feasible at this stage of experiment readiness, we decided to study CLARA configurations in the hope of improving reconstruction application scalability.

Conclusions

We conducted performance benchmark measurements of the CLAS12 reconstruction application. We showed that the vertical scalability of the application can be improved by a proper combination of multithreading and multiprocessing. We also demonstrated the importance of thread affinity for achieving maximum performance. All these improvements were made without user code modifications, thanks to the CLARA agile environment.

Contact

Vardan Gyurjyan
Jefferson Lab
Email: gurjyan@jlab.org
Website: https://claraweb.jlab.org
Phone: 757 753 9165

References

[1] V. Gyurjyan et al., “CLARA: A Contemporary Approach to Physics Data Processing”, Journal of Physics: Conf. Ser., Volume 331, 2011
[2] V. Gyurjyan et al., “Component Based Dataflow Processing Framework”, 2015 IEEE International Conference on Big Data, Santa Clara
[3] V. Gyurjyan et al., “CLARA: CLAS12 Reconstruction and Analysis Framework”, 17th International Workshop On Advanced Computing and Analysis Techniques In Physics Research, ACAT 2016
[4] F. Denneman, “NUMA Deep Dive Part 2: System Architecture”, http://frankdenneman.nl/2016/07/08/numa-deep-dive-part-2-system-architecture/