1

2 Introduction
Topics:
SUMS (STAR Unified Meta Scheduler) overview
–Usage
Architecture
Deprecated Configuration
Current Configuration
–Configuration via Information services
Future Configuration

3 Quick Overview of SUMS
The first version was developed in 2002; the STAR physics community has been using it for the past four years.
Benefits:
–Resource management and knowledge of complex resources are taken off the user's hands.
–The administrator has tighter control over jobs.
Used for both user analyses and production (see next slide for usage).
Developers:
Jerome Lauret and Levente Hajdu: architecture, coding, administration of SUMS at BNL
Lidia Didenko: testing for grid readiness
David Alexander, Paul Hamill, Chuang Li (Tech-X corp): private organization developing third-party modules for SUMS in (nuclear physics)
Eric Hjort: file transfer solutions (SRM integration)
Iwona Sakrejda, Doug Olson: administration of SUMS at PDSF
Efstratios Efstathiadis: queue monitoring, research
Valeri Fine: grid testing
Andrey Y. Shevel: administration of SUMS at Stony Brook University and development of a PBS module
Elisabeth Atems: administration of SUMS at Wayne State University
Michael DePhillips: statistics monitoring / database administration
Wayne Betts: test bed administration and deployment

4

5 Holidays

6 Architecture Overview
Dispatchers and Policies
Format of the configuration file
Configuration of the policy
Configuration of the Dispatcher
–Nuances
Configuration of the Queue

7 An overview of the configuration
The configuration continues to evolve over time.
The original format of the configuration is Sun Java object-serialized XML, as read by java.beans.XMLDecoder in the Java JDK.
–For more information see:
–The benefits include:
Automated parsing
Easy to edit by hand
The hierarchical structure is easily understood (XML)
Ability to reference configuration blocks. Example: if five policies use the same queue, it is declared only once and referenced with IDREF=“BNL_LSF_LongQueue”
Ability to make function calls (powerful initialization tool)
No need for a database engine
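The "declare once, reference many times" benefit can be sketched outside of Java. The fragment below is a simplified, hypothetical XML layout (it is not the actual java.beans serialization format, and the element and attribute names are assumptions made for illustration); it only shows how two policies can share a single queue declaration through a reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment; element and attribute names are
# illustrative only and do not reproduce the real SUMS file format.
CONFIG = """
<config>
  <queue id="BNL_LSF_LongQueue" name="long" maxWallHours="72"/>
  <policy name="PolicyA"><queue idref="BNL_LSF_LongQueue"/></policy>
  <policy name="PolicyB"><queue idref="BNL_LSF_LongQueue"/></policy>
</config>
"""

def resolve_queues(xml_text):
    """Map each policy name to the queue definitions it references."""
    root = ET.fromstring(xml_text)
    # Index every element that declares an id so references can be looked up.
    declared = {el.get("id"): el for el in root.iter() if el.get("id")}
    resolved = {}
    for policy in root.findall("policy"):
        refs = [declared[q.get("idref")] for q in policy.findall("queue")]
        resolved[policy.get("name")] = [dict(r.attrib) for r in refs]
    return resolved

policies = resolve_queues(CONFIG)
```

Both policies resolve to the same queue settings, so a change to the queue declaration is picked up everywhere it is referenced.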

8 Configuration of the Policy
What parameters are needed?
A list of Queues (sometimes with weights)
The base algorithm to use
A name which the user can call to invoke the policy
Configuration for monitoring plug-ins (optional)

9 Configuration of the Policy What does it really look like ?

10 Configuration of the Dispatcher
The base class for the given submission method
–LSF, CONDOR, SGE, …
Timing information
–Delay between submissions, timeout time, number of retries
Gatekeeper names (for grid submission)
Script generator
–Program location table
Site-specific nuances
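As a rough illustration of how the timing parameters listed above might be used, here is a minimal sketch. The class and function names are hypothetical, and treating "number of retries" as the total number of submission attempts is an assumption.

```python
import time
from dataclasses import dataclass, field

@dataclass
class DispatcherConfig:
    """Hypothetical container for the dispatcher parameters on this slide."""
    batch_system: str                       # e.g. "LSF", "CONDOR", "SGE"
    delay_between_submissions: float = 1.0  # seconds to wait between attempts
    timeout: float = 60.0                   # seconds allowed per attempt
    retries: int = 3                        # assumed: total number of attempts
    gatekeepers: list = field(default_factory=list)  # only for grid submission

def submit_with_retries(config, submit, sleep=time.sleep):
    """Call submit(timeout) until it returns True or attempts run out.

    Returns the attempt number that succeeded, or None if all failed.
    """
    for attempt in range(1, config.retries + 1):
        if submit(config.timeout):
            return attempt
        if attempt < config.retries:
            sleep(config.delay_between_submissions)
    return None
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.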

11 Configuration of the Dispatcher

12 Configuration of the Dispatcher
Site-specific nuances:
–Some examples:
When submitting via the Condor batch system, some sites require additional keywords such as +Experiment=“star”; otherwise the job is held indefinitely.
At PDSF it is necessary to run the “module load [name]” command before certain software packages such as Java or Globus can be accessed; otherwise the user gets “program not found” errors.
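The two nuances above amount to injecting site-specific lines into generated files. A minimal sketch, assuming a hypothetical script generator (the function names, the executable name, and the module name are illustrative, not taken from SUMS):

```python
def build_condor_submit(executable, extra_keywords=None):
    """Build a minimal Condor submit description with optional site keywords."""
    lines = [f"executable = {executable}"]
    # Site-specific attributes, e.g. {"+Experiment": "star"}.
    for key, value in (extra_keywords or {}).items():
        lines.append(f'{key} = "{value}"')
    lines.append("queue")
    return "\n".join(lines)

def build_job_script(command, modules=()):
    """Build a shell job script that loads the required modules first."""
    lines = ["#!/bin/sh"]
    lines += [f"module load {name}" for name in modules]
    lines.append(command)
    return "\n".join(lines)

submit_text = build_condor_submit("job.csh", {"+Experiment": "star"})
script_text = build_job_script("java -version", modules=["java"])
```

Keeping the nuances as data (a keyword table and a module list) means the generator itself does not need per-site branches.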

13 Configuration of the Queue
The queue objects are virtual entities representing a subset of nodes. Examples: a Condor pool, or a subset of a batch system queue where memory > 256MB.
What parameters are needed?
–Queue weight
Policies use queue weights for decision making. Dynamic (monitoring) policies derive their own weights; static policies have user-configured weights.
–Will the job “fit”?
Time limits (CPU, wall)
Memory
Scratch space
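The "fit" test and the use of weights can be sketched as follows. The field names are hypothetical, and choosing the fitting queue with the highest weight is only one possible decision rule, assumed here for illustration; the slides do not state how policies combine weights.

```python
from dataclasses import dataclass

@dataclass
class Queue:
    name: str
    weight: float          # static (user-configured) or derived by monitoring
    max_cpu_minutes: int
    max_wall_minutes: int
    max_memory_mb: int
    max_scratch_mb: int

@dataclass
class JobRequest:
    cpu_minutes: int
    wall_minutes: int
    memory_mb: int
    scratch_mb: int

def fits(queue, job):
    """Return True when every job requirement is within the queue limits."""
    return (job.cpu_minutes <= queue.max_cpu_minutes
            and job.wall_minutes <= queue.max_wall_minutes
            and job.memory_mb <= queue.max_memory_mb
            and job.scratch_mb <= queue.max_scratch_mb)

def pick_queue(queues, job):
    """Assumed rule: among queues the job fits, take the highest weight."""
    candidates = [q for q in queues if fits(q, job)]
    return max(candidates, key=lambda q: q.weight) if candidates else None
```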

14 Configuration of the Queue Typical Configuration:

15

16 Note that the configuration file for site “A” is different from the one for site “B”.

17 *This approach works for a small number of sites; however, it does not scale well, because every configuration is different and all configurations need to be updated whenever a new site is added.

18 Demand Drives the Need to Evolve
In order to reduce redundancy the following steps were taken:
Merge all files.
–All sites are merged into the same file.
–There is a higher-level “site block” to encapsulate (delimit) each site in the configuration.
Normalize the configuration (removal of duplicates).
–The duplication of queues inside policies was a major source of redundancy; as a result, queues were pulled out and are only referenced in the policy.
–Batch system information was pulled out of the queue blocks to a higher level that encapsulates the queue blocks. This level is referred to as the batch system.
–Gatekeeper information was pulled out of the dispatchers and put into the batch system block.
–The dispatcher blocks were moved into the batch system block. There is only provision for two dispatchers per batch system block:
One for submission to the batch block by local users
One for submission to the batch block by remote users
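The resulting hierarchy (site block, then batch system block holding the gatekeeper, two dispatchers and the queues, with policies referring to queues by name) can be pictured with a small nested structure. All names and values below are hypothetical and only illustrate the nesting, not the real file.

```python
# Hypothetical normalized layout; every name and value is illustrative.
CONFIG = {
    "sites": {
        "siteA.example.org": {
            "batch_systems": {
                "LSF": {
                    "gatekeeper": "gk.siteA.example.org",
                    # Exactly two dispatchers: local users and remote users.
                    "dispatchers": {"local": "LocalLSFDispatcher",
                                    "grid": "GridLSFDispatcher"},
                    "queues": {"siteA_long": {"max_wall_minutes": 4320}},
                },
            },
        },
    },
    # Policies only reference queues by name, so they stay reusable.
    "policies": {"default": ["siteA_long"]},
}

def find_queue(config, queue_name):
    """Return (site, batch_system, queue_settings) for a referenced queue."""
    for site, site_block in config["sites"].items():
        for batch, batch_block in site_block["batch_systems"].items():
            if queue_name in batch_block["queues"]:
                return site, batch, batch_block["queues"][queue_name]
    raise KeyError(queue_name)

site, batch, settings = find_queue(CONFIG, CONFIG["policies"]["default"][0])
```

Because the batch system and site are recovered from where the queue sits in the hierarchy, nothing about them has to be repeated inside the queue or the policy.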

19

20 Benefits
Policies are reusable; one policy can be used by multiple sites.
Changes are easily implemented and distributed.
Redundancy is cut down.

21 How it works.
After a queue is assigned to a job by a policy, it has to be determined whether the local or the grid dispatcher should be used.
This is done by recovering the user's domain name. Multiple methods are used to try to recover the domain name; the most common is “/bin/domainname”.
This is compared with the domain of the site on which the queue resides (from the config file). If they are the same, the local dispatcher is used. If they are different, the grid dispatcher is used.
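The comparison described above can be sketched in a few lines. The function names are hypothetical, only one of the "multiple methods" is shown, and the case-insensitive comparison and the fallback to the grid dispatcher when the domain is unknown are assumptions.

```python
import subprocess

def user_domain():
    """Try to recover the user's domain name; return None if that fails."""
    try:
        result = subprocess.run(["/bin/domainname"], capture_output=True,
                                text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip() or None

def choose_dispatcher(user_dom, site_dom):
    """Return "local" when the user is in the queue's site domain, else "grid"."""
    if user_dom is not None and user_dom.lower() == site_dom.lower():
        return "local"
    return "grid"
```

A caller would pass `user_domain()` and the site domain read from the configuration, for example `choose_dispatcher(user_domain(), "siteA.example.org")`.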

22 How it works.

23 The Job Script Building user sand boxes and data recovery

24 Adding New Sites
1. When SUMS is initialized it tries to find its site in the configuration.
2. If the site is there, SUMS will ask the administrator a minimal number of questions to configure the site.
3. SUMS will write the site information in a special location.
–The administrator can decide whether they want to feed this information back to the master configuration.

25 Getting to the Dynamic Part

26 Fragment of Program Locations Table from nersc.gov
String, String pairs
String, Method call returning a string
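The two kinds of entries (a fixed string, or a method call that returns a string) can be modelled as a table whose values are either strings or callables. The entries below are hypothetical and are not taken from the nersc.gov table.

```python
import shutil

# Hypothetical program locations table: a value is either a fixed path
# (String, String pair) or a callable that computes the path when asked
# (String, method call returning a string).
PROGRAM_LOCATIONS = {
    "java": "/usr/bin/java",
    "python": lambda: shutil.which("python3") or "python3",
}

def locate(table, program):
    """Return the location for a program, calling the entry if it is dynamic."""
    entry = table[program]
    return entry() if callable(entry) else entry
```

Dynamic entries let a site compute a location at run time instead of hard-coding it, which is the first step toward the dynamic configuration discussed in the following slides.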

27 Information service cycle
1. A configuration with some parts obtained via configuration and some parts obtained via an information service.
2. The information service absorbs (makes available) parameters previously statically configured.
3. Site N requires a new configuration parameter, deemed necessary in order to submit to the site. The parameter is statically added to the configuration.
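One way to picture the cycle is as a lookup in which values published by the information service overlay the static configuration. This is a minimal sketch with hypothetical parameter names; giving the information service precedence over static values is an assumption.

```python
def effective_config(static_config, info_service):
    """Merge static parameters with those published by an information service.

    Assumed precedence: a value from the information service wins; static
    values remain for parameters the service does not (yet) provide.
    """
    merged = dict(static_config)
    merged.update({k: v for k, v in info_service.items() if v is not None})
    return merged

# Step 3: a new parameter is first added statically.
static = {"gatekeeper": "gk.siteN.example.org", "max_wall_minutes": 1440}
# Step 2: later the information service starts publishing it.
published = {"max_wall_minutes": 2880, "scratch_mb": None}

config = effective_config(static, published)
```

Once a parameter is reliably published, its static entry can be dropped from the configuration file without changing the merged result.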

28 Improvements and Plans for Future Development
Where do we go from here?

29 Accuracy Counts

30

31 Conclusions
SUMS produces jobs that take best advantage of resources on any given site.
SUMS provides seamless GRID and local integration.
We have provided an easy-to-use method for adding new sites.
We are moving to dynamic recovery of configuration parameters.
We want to be able to provide one install for all sites in many different communities.

32 The End Questions ?