Experiences parallelising the mixed C-Fortran Sussix BPM post-processor
H. Renshall, BE Dept associate, Jan 2012
Using appendix material from CERN-ATS-Note-2011-052.

Presented at the BE-NAG Meeting, 23/01/2012.

The Problem: SUSSIX is a FORTRAN program for post-processing turn-by-turn Beam Position Monitor (BPM) data. It computes the frequency, amplitude, and phase of tunes and resonant lines to a high degree of precision through the use of an interpolated FFT. Analysis of such data is a vital component of many linear and non-linear dynamics measurements. For analysis of LHC BPM data a specific version, sussix4drive, run through the C steering code Drive God lin, has been implemented in the CCC by the beta-beating team. Analysis of all LHC BPMs, however, represents a major real-time computational bottleneck in the control room, which has prevented truly on-line study of the BPM data. In response to this limitation an effort has been under way to decrease the real computational time of the C and Fortran codes, with a factor of 10 as the target, by parallelising them.

Solutions considered: Since the application is run on dedicated servers in the CCC, the obvious technique is to profit from current multi-core hardware: 24/48 cores are now typical. The first idea was to use a parallelised FFT from the NAG fsl6i2dcl library for SMP and multicore, together with the Intel 64-bit Fortran compiler and the Intel Math Kernel Library recommended by NAG. As a learning exercise, various NAG installation-validation examples of enhanced routines were run, including multi-dimensional FFTs. All took about the same real time, but increasing user CPU time, as the number of cores was increased on a fairly idle 16-core lxplus machine. This is not surprising, since the examples only take milliseconds, comparable to the overhead of launching a new thread.
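
The effect is easy to reproduce with a few lines of OpenMP C. The following sketch (illustrative only, not from the slides) times a millisecond-scale loop serially and inside a parallel region using omp_get_wtime; for work this small the thread start-up cost dominates:

    /* Minimal sketch (not from the original slides): compare the wall time
     * of a millisecond-scale loop run serially and in an OpenMP parallel
     * region. When the work is this small, thread start-up eats the gain. */
    #include <stdio.h>
    #include <omp.h>

    #define N 100000

    int main(void)
    {
        static double a[N];
        double t0, t_serial, t_parallel;
        int i;

        t0 = omp_get_wtime();
        for (i = 0; i < N; i++)
            a[i] = (double)i * 0.5;           /* serial baseline */
        t_serial = omp_get_wtime() - t0;

        t0 = omp_get_wtime();
        #pragma omp parallel for
        for (i = 0; i < N; i++)
            a[i] = (double)i * 0.5;           /* same work, multi-threaded */
        t_parallel = omp_get_wtime() - t0;

        printf("serial %g s, parallel %g s\n", t_serial, t_parallel);
        return 0;
    }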

The Sussix application calls cfft (D704 in the CERN Program Library), which maps onto NAG c06ecf, a routine that has not yet been enhanced. c06ecf was 10% slower than cfft on a simple test case while giving the same numerical results, probably due to extra housekeeping and extra numerical controls. At the same time, profiling the Sussix application (with gprof) showed that only 7.5% of the total CPU time was spent in cfft, at less than 10 msec per individual call, hence one could expect little or no real-time speedup from a parallelised version. The profile showed that 70% of the CPU time was spent in a function calcr, which searches for the maxima of the Fourier spectra with a large number of executions of a compact reverse inner loop over the number of turns of BPM data.

This inverse loop over maxd, the number of LHC turns measured by an individual BPM, could not be improved. In a real case maxd is typically 1000 and this loop is executed 10 million times:

    double complex zp(maxd), zpp, zv
    zpp = zp(maxd)
    do np = maxd-1, 1, -1
      zpp = zpp*zv + zp(np)   ! Horner-style recurrence over the turn data
    enddo

It was decided to try to parallelise using the OpenMP implementation supported by the Intel compiler, as NAG does. Examining the granularity revealed that the highest level of independent code execution was the processing of individual BPM data.
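
For readers more at home in C, an equivalent of this hot loop (an illustrative sketch, not SUSSIX source) using C99 complex arithmetic shows what it computes: a Horner evaluation of the spectrum at the frequency encoded in zv:

    /* Illustrative C99 equivalent of the SUSSIX hot loop (not taken from
     * the SUSSIX sources): Horner evaluation of
     * sum_{n=1..maxd} zp[n-1]*zv^(n-1), i.e. the Fourier spectrum sampled
     * at the frequency encoded in zv. */
    #include <complex.h>

    double complex spectrum_at(const double complex *zp, int maxd,
                               double complex zv)
    {
        double complex zpp = zp[maxd - 1];
        for (int np = maxd - 2; np >= 0; np--)   /* reverse loop over turns */
            zpp = zpp * zv + zp[np];
        return zpp;
    }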

The pure FORTRAN offline version was parallelised first, by adding OpenMP parallelisation directives around the main BPM loop. Each BPM's data is in a separate file:

    !$OMP PARALLEL DO PRIVATE(n,iunit,filename,nturn)
    !$OMP& SHARED(isix,ntot,iana,iconv,nt1,nt2,narm,istune,etune,tunex,
    !$OMP& tuney,tunez,nsus,idam,ntwix,ir,imeth,nrc,eps,nline,lr,mr,kr,
    !$OMP& idamx,ifin,isme,iusme,inv,iinv,icf,iicf)
          do n=1,ntot   ! Parallel loop over all BPMs (typically 500)
            call datspe(iunit,idam,ir,nt1,nt2,nturn,imeth,narm,iana)
            call ordres(eps,narm,nrc,idam,n,nturn)
          enddo
    !$OMP END PARALLEL DO

In addition, !$OMP THREADPRIVATE directives were added for all non-shareable variables in the called subroutine trees.
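
THREADPRIVATE gives each thread its own copy of file-scope data that would otherwise be shared, which in SUSSIX applies to the COMMON blocks of the called subroutines. A minimal C sketch of the same idea (illustrative only, with a hypothetical variable name):

    /* Minimal sketch of threadprivate data (illustrative; in SUSSIX the
     * directive is applied to Fortran COMMON blocks in the subroutine
     * trees). Each thread gets its own copy of 'workspace', so threads
     * processing different BPMs cannot overwrite each other's state. */
    #include <stdio.h>
    #include <omp.h>

    static double workspace[1024];        /* per-thread scratch buffer */
    #pragma omp threadprivate(workspace)

    int main(void)
    {
        #pragma omp parallel
        {
            workspace[0] = omp_get_thread_num();   /* touches private copy */
            printf("thread %d sees %g\n",
                   omp_get_thread_num(), workspace[0]);
        }
        return 0;
    }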

This gave good scaling up to 10 cores on a non-dedicated 16-core lxplus machine (reported at the 24th ICE section meeting of 2011), so it was worth extending to the target mixed C and Fortran version to be run in the control room. The BPM data is read into memory from a single file, then a BPM loop is called from C code with a different but similar OpenMP syntax, giving the same scaling result:

    #pragma omp parallel private(i,ii,ij,kk)
    #pragma omp for
    for (i = pickstart; i <= maxcounthv; i++) {
        sussix4drivenoise_(&doubleToSend[0], &tune[0], &amplitude[0]);
        #pragma omp critical
        {
            /* here I/O C-code in the loop needing sequential execution */
        }
    }

The Fortran datspe and ordres call trees were unchanged.

The OpenMP directives multi-thread the code, and the threads then map onto physical CPUs in a multi-core machine. The run-time environment variable OMP_NUM_THREADS tells OpenMP how many threads, and hence cores, it can use for an execution, and it enables easy measurement of the scaling. Since the order of processing of individual BPMs is arbitrary, the results file is post-processed by a Unix sort as part of the application, to give the same results as a non-parallel execution. A test case of real 1000-turn LHC BPM data, analysed to find 160 lines, was performed on a reserved 24-core machine, cs-ccr-spareb7, in the CCC. A normal run of this test case takes about 50 seconds on this machine. The observed wall-time speedup of C-Fortran Sussix as a function of the number of cores (from E. Maclean) is shown on the final slide.
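
A scaling curve of this kind can be produced by re-running the same binary with OMP_NUM_THREADS set to 1, 2, 4, ... and recording the wall time. A self-contained C sketch of such a timing harness (hypothetical workload standing in for the per-BPM analysis; the real measurement drove the full application):

    /* Sketch of a scaling measurement (hypothetical workload; the real
     * test ran the full C-Fortran SUSSIX). Run repeatedly with
     * OMP_NUM_THREADS=1,2,...,24 and plot wall time against thread count. */
    #include <stdio.h>
    #include <omp.h>

    #define NBPM 500                 /* one unit of work per BPM */

    static double fake_bpm_analysis(int bpm)
    {
        /* stand-in workload for one BPM's frequency analysis */
        double s = 0.0;
        for (int k = 0; k < 2000000; k++)
            s += (double)((bpm + k) % 97);
        return s;
    }

    int main(void)
    {
        double total = 0.0;
        double t0 = omp_get_wtime();
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < NBPM; i++)
            total += fake_bpm_analysis(i);    /* independent per-BPM work */
        double t1 = omp_get_wtime();
        printf("%d threads: %.2f s wall (checksum %g)\n",
               omp_get_max_threads(), t1 - t0, total);
        return 0;
    }

Running it as OMP_NUM_THREADS=1 ./a.out, then with 2, 4, and so on, reproduces the kind of wall-time-versus-cores curve measured for SUSSIX.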

About a factor of 10 improvement in the real computation time has been realised for this test case, saturating at 12 cores, probably due to memory bandwidth limits. For the study of amplitude detuning reported in CERN-ATS-Note, the parallelised C-Fortran SUSSIX was utilised within the beta-beat GUI, and the target tenfold real-time reduction was verified in practice. This technique could be of interest to other applications.