The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology
makebc performance
Ilia Bermous, 21 June 2012

2 Performance issues with the current version of makebc
- A makebc run for a City model on a single core (Wenming's message of 7 March 2012) takes 80-90 min, reading and uncompressing large pi files totalling ~100 GB; each pi file contains 2 time segments at 30-minute frequency for 13 model fields at 1088x746x70 resolution and is ~2.2 GB for R12.
- Performance analysis showed that a significant amount of time, ~50% of the total elapsed time, is spent reading and uncompressing the input files in the following loop:

      DO JJ=1,IY
        ICX = IC(JJ)
        CALL XPND( )
      END DO

- This loop can be parallelised with multithreading, giving a theoretical elapsed time on Solar of 50% + 50%/8 = 56.25% of the original, i.e. a speed-up of 100%/56.25%, just under 1.8 times (worked out below).
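The <1.8 times figure is Amdahl's law with a parallel fraction of 50% run on the 8 cores of a Solar node; a worked form of the estimate (the notation is mine, not from the slide):

      S = \frac{1}{(1 - p) + p/n} = \frac{1}{0.5 + 0.5/8} = \frac{1}{0.5625} \approx 1.78 < 1.8

with p = 0.5 the fraction of elapsed time spent in the loop and n = 8 the number of threads.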

3 Three main approaches for performance improvement
- Approach #1: after the parent model completes, process each pi file with makebc in parallel and merge the resulting sequential LBCs in parallel.
- Approach #2 (suggested by Yi Xiao) includes 2 stages:
  - extracting frames from the pi files during parent model execution
  - makebc processing of the frame files with efficient multithreading on a single node after the parent model completes
  This method can be used efficiently in research, where pi files are retrieved from the archive.
- Approach #3 (Tom Green & Mike Naughton):
  - on-the-fly makebc processing of the sequentially produced pi files during parent model execution (the most efficient scenario is batch job submission using a whole Solar node for each makebc run)
  - sequential on-the-fly merging of the "simple" LBCs into a single accumulated LBC file; the efficient way is an appending operation, if one is available.

4 Concepts of parallel implementation (approach #1)
- makebc processing stage: all makebc tasks are run in batch jobs as background processes running in parallel.
- Merging the LBCs into a single LBC file: with N the number of LBC files to merge at merging stage (j), all merges within each stage (j) are done in parallel (see the driver sketch below).
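A minimal shell sketch of the two stages described above, assuming hypothetical helper scripts run_makebc.sh and merge_lbc.sh and an lbc.* file naming convention (none of these are the actual Solar implementation); for simplicity it submits one job per pi file, whereas slide 6 packs three makebc runs per node:

#!/bin/bash
# Driver sketch of approach #1 (illustration only).

# Stage 1: one batch job per pi file; each job writes one "simple" LBC file.
for pi in pi???; do
    qsub -v PIFILE="$pi" run_makebc.sh
done
# ... monitor here until every lbc.* file has been written (see slide 7) ...

# Stage 2: merge the LBC files pairwise; N files need ceil(log2(N)) stages,
# and all merges within a stage run in parallel.
files=( lbc.* )
stage=0
while (( ${#files[@]} > 1 )); do
    next=()
    for (( i = 0; i < ${#files[@]}; i += 2 )); do
        if (( i + 1 < ${#files[@]} )); then
            out="merge.${stage}.${i}"
            ./merge_lbc.sh "${files[i]}" "${files[i+1]}" "$out" &   # parallel merges
            next+=( "$out" )
        else
            next+=( "${files[i]}" )     # odd file carried to the next stage
        fi
    done
    wait                                # end of merging stage (j)
    files=( "${next[@]}" )
    stage=$(( stage + 1 ))
done
echo "final LBC file: ${files[0]}"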

5 Main performance advantages of the new implementation
Old schema: a single makebc run which includes
1. sequential reading and uncompressing of all relatively large input pi files
2. sequential processing, in terms of time & date segments, of the read information
New schema:
1. parallel makebc processing for each input pi file: reading, uncompressing and generation of boundary condition files for M input pi files are done in parallel using M processes/cores
2. parallel pairwise merging of the sequential LBCs in the merging tree structure

6 Main ideas in parallel processing (approach #1)
- Parallel processing of each pi file:
  - a number of makebc processes are packed into a batch job submitted from the main batch job
  - due to the significant memory requirements, which depend on the size of the pi files, the most efficient way to execute makebc tasks on a Solar node is to run no more than 3-4 processes in parallel in the background (see the packing sketch below)
  - some reasonable HPC resources are required for parallel makebc processing: 48 hours => 48 pi files => 3 LBCs per 8-core node => (1+16) 8-core nodes
(diagram: batch job #1 runs "makebc &" for pi000, pi001, pi002, producing lbc000, lbc001, lbc002)
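A sketch of one such batch job, packing three background makebc runs onto an 8-core node; the PBS directives and the run_makebc.sh wrapper (which builds the namelist and calls makebc as on slide 9) are illustrative placeholders:

#!/bin/bash
#PBS -l walltime=00:30:00
# Illustrative batch job: three makebc runs packed onto one Solar node.
# Only three background processes are started because of the memory
# needed to read and uncompress the ~2.2 GB pi files.
cd "$PBS_O_WORKDIR"

for pi in pi000 pi001 pi002; do
    ./run_makebc.sh "$pi" &      # hypothetical wrapper around makebc
done
wait                             # job ends once all three LBC files exist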

7 Main ideas in parallel processing (approach #1): some additional comments on the implementation
- Execution of the submitted makebc jobs is monitored within the main batch job until all pi files have been processed successfully (a minimal polling sketch is given below).
- Parallel merging procedure for the LBCs: the files are merged in stages.
- If the number of merging processes running in parallel at any stage is greater than the number of cores available per node (8 on the Solar system), then the corresponding merging processes are packed into batch jobs (8 per job) which are executed in parallel.
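One simple way the monitoring step could be written, assuming each makebc job leaves its finished LBC file in the working directory; the expected count, file glob and sleep interval are placeholders:

# Wait in the main batch job until every expected LBC file exists.
expected=48                               # e.g. 48 pi files for a 48-hour run
while true; do
    ndone=$(ls lbc.* 2>/dev/null | wc -l) # count finished LBC files
    (( ndone >= expected )) && break
    sleep 30                              # arbitrary polling interval
done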

8 Parallel merging tree structure (figure: the LBC files are merged pairwise, stage by stage, down to a single file)

9 Some issues in relation to makebc processing of pi files produced by the R12 model
- The current makebc version (vn7.9) does not allow processing of pi files that do not start on a whole hour => a number of source changes have been implemented by Tom Green to resolve this problem.
(diagram: pi000, pi001, pi002 covering half-hourly times 09:00-12:00)
- The form of the makebc command for processing the pi0nn file is

      makebc -n file.nml -i pi000 ... pi0nn -ow lbc

  with some special settings in the input file.nml namelist, such as

      N_DUMPS=NEXT_HOUR
      A_INTF_START_HR=CURRENT_HOUR
      A_INTF_END_HR=NEXT_HOUR

  (a hedged sketch of generating such a namelist is given below).
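To make the earlier sketches concrete, here is one possible shape of the hypothetical run_makebc.sh wrapper. The namelist group name &MAKEBC_NML, the hard-coded hour values and the single-file -i argument are placeholders; only the three namelist variables and the -n/-i/-ow flags come from the slide above:

#!/bin/bash
# Hypothetical wrapper: build a per-file namelist, then run makebc.
pi=$1                              # e.g. pi007
current_hour=7                     # placeholders: in practice derived from
next_hour=8                        # the validity time of the pi file

cat > "file_${pi}.nml" <<EOF
&MAKEBC_NML
  N_DUMPS         = ${next_hour},
  A_INTF_START_HR = ${current_hour},
  A_INTF_END_HR   = ${next_hour},
/
EOF

makebc -n "file_${pi}.nml" -i "$pi" -ow "lbc.${pi}"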

10 Some possible cases of resulting LBC files after makebc processing of pi files
(diagrams: a City model case with half-hourly pi files pi000, pi001, pi002 and a regional model case with hourly pi files pi000, pi002, pi004, showing the makebc input/output mapping to LBC1, LBC2, LBC3)
- City model case, before the merging process: the "first" duplicate segments are removed, and the orography field is removed from all LBCs excluding LBC1.
- Regional model case, before the merging process: the orography field is removed from all LBCs excluding LBC1.

11 Software used for removing duplicate segments and merging LBCs
- Duplicate segments are removed from the beginning of each LBC file, starting from the second LBC file.
  - The subset_um script, recommended by Martin Dix and based on a program developed by Alan Iwi at Reading University, is used for this purpose. At this stage the parameters to run the script are not chosen automatically; their values depend on
    - the number of fields written to the LBC file
    - the number of segments in each LBC file
- Merging the LBC files is done using the VAR VarScr_UMFileUtils script with the corresponding VarProg_UMFileUtils program.
- According to Tom Green, the UM mergeum utility cannot be used as it has not been kept up to date in the latest UM versions; it is also unused at the Met Office.

12 Manual steps for setting up the main batch script
- A separate makebc processing run on the pi files is required in order to understand
  - how the input namelist file for makebc execution should be set up properly for processing this type of pi files
  - what kind of output information is generated in the output LBC files:
    - which (time & date) segments and how many segments are produced
    - how many fields, including the orography field, are written to an LBC
    - whether there is any overlap between the segments written to the LBCs

13 Results
- The performance of the implemented procedure (approach #1) has been tested using 3 model cases.
- In a City model case (Brisbane), processing 41 pi files of ~2.2 GB each gave a best elapsed time of 3 min 47 sec (with the latest improved merging procedure as well as the use of Lustre file striping), running 3 makebc processes per 8-core node, in comparison with ~50 min of elapsed time for the current makebc processing executed on a single core => a speed-up of ~13 times.
  - Pure makebc processing of a single pi file with 2 time & date segments takes 2 min 10 sec - 2 min 30 sec to produce an LBC with 3 time segments.
- In a Sydney model case, using pi files from Wenming's job run yesterday, the best elapsed time (from 3 runs) of 4 min 16 sec for processing 51 pi files was obtained, in comparison with ~83 min taken by Wenming's makebc job (processing 48 pi files) => a speed-up of over 19 times.

14 Results (cont.)
- In a regional model case, processing 39 pi files of ~630 MB each gave a best elapsed time of 3 min 30 sec, in comparison with ~15.5 min in the standard usage case. The achieved performance improvement is not as significant as in the above-mentioned City model case due to
  - a relatively large final LBC file of just over 7 GB; unfortunately, merging relatively large files (over 1 GB) is an expensive operation even when done in parallel
  - a relatively smaller value of the size(pi)/size(LBC0) ratio in the regional model case

15 Factors affecting how fast the new procedure runs in comparison with the standard method
- the number of pi files to process
- the size of each separate LBC file produced after makebc processing
- the size of the resulting LBC file
- the ratio size(pi)/size(LBC0), where LBC0 are the resulting LBC files after makebc processing

16 Some aspects of approach #2
- Advantages
  - in the City model case each frame file should be ~1000 times smaller than the pi file => <2 MB
  - makebc executed with multithreading on 8 cores will be very efficient from the performance point of view, and the whole run, depending on the size of the resulting LBC file, should not take more than 2-3 minutes
  - minimal HPC resources are required: no more than a single 8-core Solar node
- Issues to be addressed
  - at the moment there are problems running the frames utility; maybe the latest version with vn8.2 resolves them, and currently the utility has the limitation of handling hourly datasets only (Tom Green's comment)
  - a special utility is required to identify whether a pi file is complete or not during parent model execution; this is very important with the asynchronous I/O used from UM7.8 onwards. According to Tom Green: "This is something we are also trying to understand how best to handle".

17 Some aspects of approach #3
- Advantages
  - as soon as a pi file sequentially produced by the parent model is ready, it is processed by makebc running in parallel with the parent model
  - after makebc processing, the obtained LBC file can be merged on the fly into the accumulated LBC version
- Some technical issues to address
  - as in approach #2, a special utility is required to identify whether a pi file is complete or not
  - merging simple LBCs into the accumulated version must be done in the right order
- Notes on the performance comparison with approach #1
  - the last pi file is produced near parent model completion, so its makebc processing will not have any significant performance advantage over the parallel makebc processing implemented in approach #1
  - merging the last simple LBC file with the accumulated LBC version will take a significant amount of time, comparable with the time taken by the final merging stage in approach #1
- Summary: I don't expect a meaningful advantage of this approach in comparison with approach #1.

18 Forms of parallel processing
- The current implementation uses a set of batch jobs created in the main batch job and submitted/monitored from that job.
- Other possible ways of parallel processing are (a GNU parallel sketch follows below):
  - use of the mpirun command, with the scripts for parallel processing called from within a Fortran/C MPI program
  - use of the pbsdsh command with PBS
  - use of the GNU parallel utility
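As an illustration of the last option, a one-line GNU parallel sketch that would keep at most three simultaneous makebc runs on a node, reusing the hypothetical run_makebc.sh wrapper from the earlier sketches:

# At most 3 jobs at a time; GNU parallel substitutes {} with each file name.
ls pi??? | parallel -j 3 ./run_makebc.sh {}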

19 Conclusions
- The implemented parallel makebc processing can reduce the corresponding elapsed times of makebc processing by an order of magnitude (speed-ups of ~13 to over 19 times in the tested cases) or even more.
- The numerical results produced by the forecast model are identical to the previous results obtained with the old makebc execution.
- The main job ASCII output has 2 output lines that differ:
  - "Last Validity time" line: during the merging procedure this line is taken from the first LBC file, whereas in the old makebc run the value corresponds to the time taken from the last pi file
  - "Model Data" line: the first number (the starting address of the data in the file after the header) is different, and the other 2 numbers have non-zero values; these differences are harmless and at the moment can be fixed using Martin Dix's Python change_inthead.py utility
- It is still worthwhile to try the frames-based approach #2, which can be more efficient in terms of HPC resource usage while at the same time providing good performance through multithreading; however, any saving in elapsed time achievable with this approach will be minimal, no more than 1-3 min on our Solar system.