CMAQ Runtime Performance as Affected by Number of Processors and NFS Writes

Patricia A. Bresnahan,a* Ahmed Ibrahim,b Jesse Bash,a and David Miller a


a University of Connecticut, Atmospheric Resources Lab, Dept. of Natural Resource Management and Engineering, Storrs, CT
b University of Connecticut, Dept. of Computer Science and Engineering, Storrs, CT
Voice: (860)  FAX: (860)

1. OBJECTIVE
The purpose of this study was to describe how processing time for the parallel version of CMAQ was affected by the number of processors used, local vs. remote data access, and the size of the datasets involved.

2. HARDWARE CONFIGURATION
The Atmospheric Resources Lab (ARL) at the University of Connecticut (UCONN) recently purchased a Linux cluster consisting of one head node, four slave nodes, and a file server, as shown in Figure 1. The components are detailed in Table 1 below.

Figure 1. Configuration of the UCONN ARL Linux cluster.

Table 1. UCONN ARL Linux Cluster Description
1) Head node:
   - Dual AMD MP 2000 CPUs, 1.667 GHz, 266 MHz bus, 2.048 GB PC2100 ECC RAM
   - (2) 120 GB EIDE hard drives, 7200 RPM, mirrored (RAID level 1)
2) 4 slave nodes:
   - Dual AMD MP 2000 CPUs, 1.667 GHz, 266 MHz bus, 2.048 GB PC2100 ECC REG DDR RAM
   - 120 GB EIDE hard drive
3) EIDE RAID 5 file server:
   - AMD XP 2000, 512 MB DDR PC2100 RAM
   - (1) 20 GB hard drive
   - (8) 120 GB hard drives
4) 8-port gigabit network switch

3. SOFTWARE
The cluster operating system is RedHat Linux 7.3, with the parallel libraries provided by MPICH. The May 2003 release of CMAQ (version 4.2.2) and the other Models3 tools (netCDF, IOAPI, MCIP) were compiled, and the resulting executables were used for this test.

4. DATASETS
Two datasets were used in this study: a "small" spatial domain consisting of the tutorial files provided with the Models3 tools (38x38x6, 24 hrs), and a "large" spatial domain created in our lab from an MM5 run generated at the University of Maryland and provided to us through NYDEC. The large dataset is based on the 36 km Unified Grid and covers the eastern portion of the United States (67x78x21, 24 hrs). Table 2 summarizes the two domains, including the sizes of some of the largest input and output files.

Table 2. Description of Datasets
Feature                    "Small" Tutorial   "Large" CT ARL
Rows                       38                 67
Columns                    38                 78
Layers                     6                  21
Timesteps                  24                 24
In: Emissions (bytes)      37,383,992         37,543,856 (area only)
In: MetCRO3D (bytes)       34,322,976         634,358,520
Out: CCTM_*CONC (bytes)    67,624,592         821,822,…

5. METHODS
The experimental design looked at runtime as affected by three factors: number of processors, dataset size, and data IO location. A total of 48 runs were made (see Table 3).

Number of processors: For the single-processor runs, the parallel version of the CMAQ code was not used. For the multi-processor runs, the parallel version was used and the run script was modified as needed for the correct number of processors.

Data IO: For "Local Read" runs, all input files were stored on and read from the head node, while for "Remote Read" runs, these same files were stored on and read from the file server. For "Local Write" runs, all output files were written to the head node, while for "Remote Write" runs, these same files were written to the file server.

Runs were made sequentially: a run was not started until the previous run had finished, and no other work was being done on the cluster at the time of the runs. Timing statistics were collected from the log files, which store the output of the UNIX "time" command. The total amount of time spent by the computer in "user" mode and in "system" mode, and the total elapsed time, were recorded for each run (a sketch of how these timings can be pulled from the logs follows Table 3).

Table 3. Experimental Design: Runtime Configurations
Number of    Local Read /   Local Read /    Remote Read /   Remote Read /
Processors   Local Write    Remote Write    Local Write     Remote Write
 1           Small, Large   Small, Large    Small, Large    Small, Large
 2           Small, Large   Small, Large    Small, Large    Small, Large
 4           Small, Large   Small, Large    Small, Large    Small, Large
 6           Small, Large   Small, Large    Small, Large    Small, Large
 8           Small, Large   Small, Large    Small, Large    Small, Large
10           Small, Large   Small, Large    Small, Large    Small, Large
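The timing figures reported in the Results were taken from these per-run logs. The script below is a minimal sketch of that kind of post-processing, assuming POSIX-style "time" output lines (real/user/sys values in seconds) near the end of each log; the csh built-in "time" used by some CMAQ run scripts prints a different one-line format, in which case the pattern would need to be adapted. The log file names shown are hypothetical.

    # parse_times.py (illustrative sketch, not part of the CMAQ distribution)
    import re
    import sys

    # Matches POSIX-style time(1) summary lines such as "user 1100.23".
    TIME_LINE = re.compile(r"^\s*(real|user|sys)\s+([0-9]+(?:\.[0-9]+)?)\s*$")

    def read_times(log_path):
        """Return the last real/user/sys values (seconds) found in one log file."""
        times = {}
        with open(log_path) as log:
            for line in log:
                match = TIME_LINE.match(line)
                if match:
                    times[match.group(1)] = float(match.group(2))
        return times

    if __name__ == "__main__":
        # e.g.  python parse_times.py CTM_LOG_small_1cpu.txt CTM_LOG_small_2cpu.txt
        for path in sys.argv[1:]:
            t = read_times(path)
            print(path, "user:", t.get("user"), "sys:", t.get("sys"),
                  "elapsed:", t.get("real"))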
6. RESULTS
Results are shown in Figure 2 below.

Figure 2. User time, system time, and elapsed time versus number of processors for the tutorial (small) and eastern US (large) datasets.

Figures 2a and 2d show that for both the large and small datasets, and for all data IO locations, the amount of time spent in user mode decreased as the number of processors increased. The greatest decrease in processing time occurred as the number of processors went from one to four; beyond four processors, times continued to decrease, but less sharply.

Figures 2b and 2e show that when the data IO involved remote writes, increasing the number of processors actually increased the amount of time spent in system mode. Reading remotely, however, did not appear to add to the amount of system time used, regardless of the number of processors.

Figures 2c and 2f show the total elapsed time of each run. When data was written locally, run time decreased as processors were added, especially as the number of processors went from one to four. When data was written remotely, however, elapsed time actually increased once the number of processors exceeded two for the small dataset, and four for the large dataset.

7. CONCLUSIONS
In general, CMAQ ran faster as more processors were used, as long as the data was written locally. When data was written remotely to the file server, adding processors actually increased processing time because of the greater overhead involved in the remote writes. In the Linux cluster environment, these remote writes are accomplished through NFS (Network File System) write calls, which carry a fair amount of system overhead to ensure data integrity. NFS write performance can be tuned somewhat by adjusting performance parameters such as block size and transfer protocol (an illustrative mount configuration is sketched at the end of this document). NFS read performance, by contrast, is good and comparable to local disk reads.

As of this writing we have not tested version 4.3 of CMAQ (released in September 2003), but we noticed that the NFS latency problems in an MPICH cluster appear to have been addressed in that release. We plan to repeat our tests on it.

ACKNOWLEDGMENTS
We would like to thank the NYDEC and the University of Maryland for the use of their MM5 1997/2002 dataset. This research was supported by the Connecticut River Airshed Watershed Consortium, USEPA Cooperative Agreement R
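Note on NFS tuning (illustrative only). The block size and transfer protocol parameters mentioned in the Conclusions are normally set through client-side NFS mount options, with write behavior also controlled on the server side in /etc/exports. The entries below are a hypothetical sketch: the host name, export path, mount point, and suitable rsize/wsize values depend on the cluster and the NFS version in use, and relaxing "sync" to "async" on the server trades data integrity for write speed.

    # /etc/fstab entry on a compute node (hypothetical host and paths):
    fileserver:/export/cmaq   /data/cmaq   nfs   rw,hard,intr,tcp,rsize=8192,wsize=8192   0 0

    # /etc/exports entry on the file server:
    /export/cmaq   192.168.1.0/255.255.255.0(rw,async)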