Operational computing environment at EARS
Jure Jerman, Meteorological Office, Environmental Agency of Slovenia (EARS)

Outline
- Linux cluster at the Environmental Agency of Slovenia: history and present state
- Operational experiences
- Future requirements for limited-area modelling
- Ingredients needed for a future system

History & background
- EARS: small service, limited resources for NWP
- Small NWP group covering both research and operations
- First research Alpha-Linux cluster (1996): 20 nodes
- First operational Linux cluster at EARS (1997): 5 Alpha CPUs
- One of the first operational clusters in Europe in the field of meteorology

Tuba – the current cluster system
- Installed 3 years ago, already outdated
- Important for gaining experience
- Hardware:
  - 13 compute nodes
  - 1 master node, dual Xeon 2.4 GHz
  - 28 GB memory
  - Gigabit Ethernet
  - Storage: 4 TB IDE-to-SCSI disk array, XFS filesystem

Tuba software
- Open source whenever possible
- Cluster management software:
  - OS: Red Hat Linux + SCore 5.8.2
  - Mature parallel environment
  - Low-latency MPI implementation
  - Transparent to the user
  - Gang scheduling
  - Pre-empting
  - Checkpointing
  - Parallel shell
  - Automatic fault recovery (hardware or SCore)
  - FIFO scheduler
  - Can be integrated with OpenPBS and SGE
- Lahey and Intel compilers
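Because SCore provides a drop-in parallel environment, application code is written against plain MPI and needs no SCore-specific calls; the minimal C program below (an illustration added here, not part of the original slides) is the kind of job the gang scheduler dispatches across the nodes.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down cleanly */
    return 0;
}
```

Scheduling, checkpointing and fault recovery are handled by the cluster middleware around such a program, which is what "transparent to the user" means in practice.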

Ganglia – cluster health monitoring

Operational experiences
- In production for almost 3 years
- Runs as an unattended (unmonitored) suite
- Minimal hardware-related problems so far!
- Some problems with SCore, mainly related to MPI buffers
- NFS-related problems
- ECMWF's SMS scheduler solves the majority of problems

Reliability

Operational setup: the ALADIN model
- 290 x 240 x 37 grid-point domain
- 9.3 km horizontal resolution
- 54 h integration
- Target wall-clock time: 1 h
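For orientation (this arithmetic is added here, not on the slide), the domain corresponds to roughly 2.6 million grid points per time level, all of which must be advanced over the 54 h forecast within the 1 h target:

$$290 \times 240 \times 37 = 2\,575\,200 \ \text{grid points}.$$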

Optimizations – not everything is in the hardware
- Code optimizations:
  - B-level parallelization (up to 20 % gain at larger processor counts)
  - Load balancing of the grid-point computations (depending on the number of processors)
- Parameter tuning:
  - NPROMA cache tuning (see the sketch below)
  - MPI message size
- Compiler improvements (Lahey → Intel: about 25 %)
- Still to work on: OpenMP (better efficiency of memory usage)
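NPROMA is the blocking length used in ALADIN/IFS-family codes: grid columns are processed in chunks small enough for the fields touched by the physics kernel to stay cache-resident. The C fragment below is a generic, hypothetical illustration of that idea (the real ALADIN code is Fortran and far more involved); NPROMA is the parameter that gets tuned.

```c
#include <stddef.h>

#define NPROMA 64   /* block length, tuned to the cache size (illustrative value) */

/* Process ngptot grid points in blocks of NPROMA so that the data
 * touched inside the kernel stays in cache between operations. */
void physics_driver(size_t ngptot, const double *in, double *out)
{
    for (size_t jstart = 0; jstart < ngptot; jstart += NPROMA) {
        size_t jend = jstart + NPROMA;
        if (jend > ngptot)
            jend = ngptot;          /* last, possibly shorter, block */

        for (size_t j = jstart; j < jend; ++j)
            out[j] = 0.5 * in[j];   /* stand-in for the real physics */
    }
}
```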

Non-operational use
- Downscaling of the ERA-40 reanalysis with the ALADIN model
- Estimation of the wind-energy potential over Slovenia
- Multiple nesting of the target computational domain into the ERA-40 data
- 10-year period, 8 years / month
- Major question: how to ensure coexistence with the operational suite

Foreseen developments in limited-area modelling
- Currently: ALADIN at 9 km
- AROME at 2.5 km: ALADIN non-hydrostatic solver + Meso-NH physics
  - About 3 times more expensive per grid point
- Target AROME configuration: roughly 200–300 times more expensive (same computational domain, same time range)
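A back-of-the-envelope check of that factor (my own arithmetic, not from the slides): refining the mesh from 9.3 km to 2.5 km multiplies the number of horizontal grid points by about $(9.3/2.5)^2 \approx 14$ and, with the time step shortened in proportion, the number of steps by a further $\approx 3.7$; combined with the roughly threefold cost per grid point this gives

$$3 \times \left(\frac{9.3}{2.5}\right)^2 \times \frac{9.3}{2.5} \approx 150,$$

which is of the same order as the quoted 200–300x (the exact figure also depends on vertical resolution and the non-hydrostatic dynamics).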

How to get there (if at all)? A commodity Linux cluster at EARS?
- First upgrade in mid-2006
- About 5 times the current system (if possible, below 64 processors)
- Tests going on with:
  - New processors: AMD Opteron, Intel Itanium 2
  - Interconnects: InfiniBand, Quadrics? (a latency-test sketch follows below)
  - Compilers: PathScale (AMD Opteron)
- Crucial: a parallel file system (TerraGrid), already installed as a replacement for NFS
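The slides do not say how the candidate interconnects are being compared; a common approach, sketched here purely as an illustration, is a small-message MPI ping-pong between two ranks, whose half round-trip time approximates the fabric's latency.

```c
#include <mpi.h>
#include <stdio.h>

#define NITER   1000
#define MSGSIZE 8       /* bytes; small messages expose latency, not bandwidth */

int main(int argc, char **argv)
{
    int rank;
    char buf[MSGSIZE] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* run with exactly two ranks */

    double t0 = MPI_Wtime();
    for (int i = 0; i < NITER; ++i) {
        if (rank == 0) {
            MPI_Send(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("approximate one-way latency: %.2f us\n",
               (t1 - t0) / (2.0 * NITER) * 1e6);

    MPI_Finalize();
    return 0;
}
```

Running the same binary over each fabric gives directly comparable latency numbers; bandwidth can be probed the same way with larger MSGSIZE values.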

How to stay on the open side of the fence?
- Linux and other open-source projects keep evolving
- A growing number of increasingly complex software projects
- Specific (operational) requirements in meteorology
- There is room for system integrators
- The price/performance gap between commodity and brand-name systems shrinks as system size grows
- The pioneering era of Beowulf clusters seems to be over
- Extensive testing of all cluster components is important

Conclusions
- Positive experience with a small commodity Linux cluster: excellent price/performance ratio
- Our present way of building clusters works for a small cluster, might work for a medium-sized one, and does not work for big systems
- The future probably belongs to Linux clusters, but branded ones