Bureau use of SMS for operational scheduling


Bureau use of SMS for operational scheduling (and plans for evaluation of job scheduling options…)
Jim Fraser, Bureau National Operations Centre, 14 Oct 2014

Role of Bureau National Operations Centre
BNOC serves as the real-time operational hub for Australia’s national weather service by:
● providing numerical forecast guidance to the Regional Forecasting Centres and external clients
● issuing a range of analysis and prediction products relating to weather, oceans and climate
● supporting the operational communications and computing infrastructure
● maintaining an archive of synoptic analyses and observations over the Southern Hemisphere
Various international roles: WMC, RSMC, RAFC, RTWP…

Structure of BNOC
Central Operations: 6-7 staff on duty at all times (24x7)
● Senior Meteorologist (supervisor)
● Aviation Meteorologists (2)
● ITOps Computer Operator
● ITOps Help Desk (IT and Comms)
Operational Support:
● Operations Development (5): NWP model implementation & support
● Oceanographic Systems (6): ocean model implementation and support
● National Weather Graphics (3)
● National Tides Unit
● Tsunami Systems (1)
Approximately 70 staff in total

Responsibilities
Operational Development Section
● Support of the Operational NWP Suites: day-to-day support and maintenance of all the models which form the Operational NWP Suite
● Out of hours troubleshooting support for unusual problems
● Implementation in the operational scheduler & running on operational machines of model upgrades / new systems developed by the Bureau’s research centre (CAWCR)
● Interfacing with downstream systems
● Model forecast verification/assessment
IT Operations
● Monitoring of operational systems
● First-level support; escalation of problems to appropriate support staff

SMS job scheduling in BNOC
● SMS = ECMWF “Supervisor Monitor Scheduler”
● Powerful GUI environment for operational job scheduling and monitoring
● Core component of BoM BNOC Operations for the last two decades (first introduced into NMC in the early 1990s)
● Currently runs 17,200+ jobs per day in 55 separate suites
● Allows for system inter-dependencies
● Easy visual monitoring of system status
● Gives interactive facilities for operational staff

Overview of SMS
Components:
● SMS server – the scheduler, a continuously running daemon process that submits tasks to remote machines once their trigger dependencies are satisfied
● Child commands – talk back to the server to update status and attributes
● cdp – command line interface to SMS
● XCDP – graphical interface to SMS
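The server's core behaviour, submitting a task only once its trigger dependencies are satisfied, can be illustrated with a toy Python loop. This is a minimal sketch rather than SMS code: the `Task` class, the status strings and the task names are all hypothetical, and a real server submits jobs to remote machines and waits for child commands instead of marking tasks complete itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    name: str
    triggers: List[str] = field(default_factory=list)  # tasks that must complete first
    status: str = "queued"                              # queued -> submitted -> complete


def runnable(tasks: Dict[str, Task]) -> List[str]:
    """Return names of queued tasks whose trigger tasks are all complete."""
    return [
        t.name
        for t in tasks.values()
        if t.status == "queued"
        and all(tasks[dep].status == "complete" for dep in t.triggers)
    ]


def run_suite(tasks: Dict[str, Task]) -> List[str]:
    """Toy scheduler loop: submit ready tasks in passes until none are left."""
    order = []
    while True:
        ready = runnable(tasks)
        if not ready:
            break
        for name in ready:
            tasks[name].status = "submitted"  # a real server would submit a remote job here
            order.append(name)
            tasks[name].status = "complete"   # a real job would report back via a child command
    return order


# Hypothetical four-task suite with a simple chain of dependencies.
suite = {
    t.name: t
    for t in [
        Task("products", triggers=["forecast"]),
        Task("forecast", triggers=["analysis"]),
        Task("analysis", triggers=["get_obs"]),
        Task("get_obs"),
    ]
}
submission_order = run_suite(suite)
print(submission_order)
```

The tasks are deliberately declared out of order to show that submission follows the triggers, not the order of definition.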

SMS server functionality
● Server (nmoc-sms) hosts the suites
● Checkpoints (backs up) the suite tree periodically
● Handles user and job requests (jobs submitted under userid "rto")
● Logs activity

XCDP – graphical user interface
● Windowed GUI
● Allows direct interaction with SMS Servers
● Can access many SMS client commands
● Provides easy access to helpful information (script, manual, output, etc.)
● Contains alarm features, runs even when iconized
● User can mask information from being displayed
● User can search for task names, types, status etc

How it works
● Define your suite (grouping of tasks) in a .def file
  ● Structure, dependencies
  ● Locations of scripts/outputs, etc.
● Add “hooks” to communicate to SMS server (using includes)
● When ready, server submits the task (pre-processor adds SMS variables & expands include files)
● Task tells server task has started
  ● Send init client command
● If an error is detected, task tells the server
  ● Use error trapping to communicate errors; send abort client command
● If task completes, tells the server
  ● Send complete client command
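The init / abort / complete handshake described above can be sketched in Python as follows. Everything here is illustrative: `FakeServer`, `run_task` and the task names are hypothetical stand-ins. In SMS itself the hooks are client commands placed in the task script through include files, not Python calls.

```python
class FakeServer:
    """Stand-in for the SMS server: records the status messages it receives."""

    def __init__(self):
        self.log = []

    def notify(self, task, message):
        self.log.append((task, message))


def run_task(server, name, body):
    """Run `body` between the start and end hooks and report the outcome."""
    server.notify(name, "init")        # task tells the server it has started
    try:
        body()
    except Exception:
        server.notify(name, "abort")   # error trapped: tell the server the task failed
        return False
    server.notify(name, "complete")    # normal end: tell the server the task finished
    return True


def failing_step():
    raise RuntimeError("simulated model failure")


server = FakeServer()
run_task(server, "get_obs", lambda: None)   # succeeds
run_task(server, "forecast", failing_step)  # fails and is reported as aborted
print(server.log)
```

The point of the pattern is that the server never has to guess: every task reports exactly one terminal state, so a failure shows up in the monitor as soon as the error is trapped.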

SMS .def file example
● Simple human readable scripting language

Implementing ACCESS in SMS
● SMS "scs_run.sms" task calls a "run-scs-suite" script which then initiates the individual SCS tasks – a scheduler within a scheduler!
  $RUN_SCRIPT/run-scs-suite $cold -d $SUITE_DIR -t $DATETIME
● Hooks added inside SCS steps communicate back to SMS, signalling start, status, completion via function calls
● SCS html output can only be viewed in web-browser (rather than XCDP)

Implementing ACCESS in SMS - II
● Common scripts used wherever possible between all ACCESS suites to minimize maintenance
● ACCESS suites comprise 109 individual SMS scripts
● Tools exist to automatically create multiple (3236) soft-links to these scripts, giving tasks unique job names to make life easier for the operators
● Scheduler syntax expands loops etc in .def file to submit a total of 11,420 individual ACCESS jobs per day
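The soft-link approach, one shared script reached through many uniquely named links, might look roughly like the Python sketch below. The BNOC tools themselves are not shown in the slides, so the function `link_jobs`, the file names and the directory layout are assumptions made purely for illustration.

```python
import os
import tempfile


def link_jobs(shared_script, link_dir, job_names):
    """Create one soft-link per job name, all pointing at the same shared script."""
    links = []
    for job in job_names:
        link_path = os.path.join(link_dir, job + ".sms")
        os.symlink(shared_script, link_path)
        links.append(link_path)
    return links


# Work in a throwaway directory so the sketch does not touch real suite files.
workdir = tempfile.mkdtemp()
shared = os.path.join(workdir, "run_model.sms")  # hypothetical shared task script
with open(shared, "w") as handle:
    handle.write("# shared task script\n")

link_dir = os.path.join(workdir, "suite")
os.mkdir(link_dir)

# Hypothetical unique job names, one per model base time.
jobs = ["run_model_%02d" % hour for hour in (0, 6, 12, 18)]
links = link_jobs(shared, link_dir, jobs)
print([os.path.basename(path) for path in links])
```

Only the shared script needs maintaining, while operators still see a distinct job name for each task.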

Task triggering in SMS
A variety of methods are used to control task initiation within SMS:
● Simple dependencies on completion of prior jobs
● Time triggers (time of day; every x minutes; day of week etc)
● Regular polling of files via SMS (e.g. EER suites)
● Cross suite triggering by setting flags via cdp commands
● External daemons starting a task via cdp commands (e.g. VAWS)
● Operator intervention (manually executing tasks)
Hindcast looping can be achieved by removing any time dependencies and specifying a start date and end date
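Hindcast looping amounts to stepping a base time from the start date to the end date with no wall-clock trigger in between. The Python sketch below shows only that date arithmetic. The function name `hindcast_cycles`, the 12-hour step and the example dates are assumptions, and in SMS the loop would be expressed in the .def file, not in Python.

```python
from datetime import datetime, timedelta


def hindcast_cycles(start, end, step_hours=24):
    """Return every base time from start to end (inclusive) at a fixed step."""
    cycles = []
    current = start
    while current <= end:
        cycles.append(current)
        current += timedelta(hours=step_hours)
    return cycles


# Hypothetical two-day hindcast with a 12-hourly cycle.
cycles = hindcast_cycles(datetime(2014, 10, 1, 0), datetime(2014, 10, 3, 0), step_hours=12)
labels = [cycle.strftime("%Y%m%d%H") for cycle in cycles]
print(labels)
```

With no time dependency, each cycle can start as soon as the previous one completes, so a long hindcast runs as fast as the machine allows.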

Operational job runtimes on Ngamai supercomputer

Things BNOC likes about SMS
● One single GUI for operational system monitoring (can view multiple SMS instances in a single window)
● Intuitive, relatively easy to use
● Robust & stable (can run for months or years without crashing)
● Lightweight, scalable (yet to hit any internal limits)
● Can interactively replay/modify parts of a running suite
● Can manually change SMS variables or perform one-off modified "edit" runs
● Security – user lists control who can see what & do what

Some disadvantages of SMS
● No longer "supported" by ECMWF (replaced by ecFlow recently)
● 32-bit code will eventually require replacement (10 years?)
● Single-threaded – could hang if sshfs fails (workaround proposed)
● 00Z date cycling – can make inter-suite triggering delicate if 18Z model runs are delayed (alert emails now warn us if this happens)
● Awkward to view previous days' output files (stored on disk but not viewable in GUI)
● Rarely used by CAWCR developers → disjointed transition to operations
● Can't handle multi-billion year astrophysical model runs!

Future options?
● Could conceivably continue using SMS for many years…
● Migrate to ecFlow (Python API, improved error handling, migration tools)
● Hybrid approach – use Cylc for ACCESS VAR/UM + SMS for rest
● Convert all operational SMS suites to Cylc? (Tools, resourcing?)

For more info on SMS in BNOC refer to
http://nmoc-svnop:8080/projects/sms/wiki/SMSlinuxsge
http://wiki.bom.gov.au/foswiki/NMOC/OperationsDevelopment/SmsBasics
Thank you…
Jim Fraser, BNOC Ph: (03) 9669 4039 J.Fraser@bom.gov.au