1
2 Introduction Topics: SUMS (STAR Unified Meta Scheduler) overview –Usage Architecture Deprecated Configuration Current Configuration –Configuration via Information services Future Configuration
3 Quick Overview of SUMS
The first version was developed in 2002, and the STAR physics community has been using it for the past four years.
Benefits:
–Resource management and knowledge of complex resources are taken off the user's hands.
–The administrator has tighter control over jobs.
Used for both user analyses and production (see next slide for usage).
Developers:
Jerome Lauret and Levente Hajdu – Architecture, coding, administration of SUMS at BNL
Lidia Didenko – Testing for grid readiness
David Alexander, Paul Hamill, Chuang Li (Tech-X Corp) – Private organization developing third-party modules for SUMS in nuclear physics
Eric Hjort – File transfer solutions (SRM integration)
Iwona Sakrejda, Doug Olson – Administration of SUMS at PDSF
Efstratios Efstathiadis – Queue monitoring, research
Valeri Fine – Grid testing
Andrey Y. Shevel – Administration of SUMS at Stony Brook University and development of a PBS module
Elisabeth Atems – Administration of SUMS at Wayne State University
Michael DePhillips – Statistics monitoring / database administration
Wayne Betts – Test bed administration and deployment
4
5 Holidays
6 Architecture Overview Dispatchers and Policies Format of the configuration file Configuration of the policy Configuration of the Dispatcher –Nuances Configuration of the Queue
7 An overview of the configuration The configuration continues to evolve over time. The original format of the configuration is Sun Java object-serialized XML, as read by java.beans.XMLDecoder in the Java JDK. –For more information see: –The benefits include: Automated parsing. Easy to edit by hand. The hierarchical structure is easily understood (XML). Ability to reference configuration blocks – for example, if five policies use the same queue, it is declared only once and referenced with idref=“BNL_LSF_LongQueue”. Ability to make function calls (powerful initialization tool). No need for a database engine.
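The block-referencing benefit can be illustrated with a minimal sketch. It assumes a made-up fragment in which a queue is modeled as a plain java.util.HashMap and a policy's queue list as a java.util.ArrayList; the real SUMS classes and element layout are not shown in these slides. Only the id/idref mechanism and the `<void method=...>` call syntax come from the java.beans XML format itself.

```java
import java.beans.XMLDecoder;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

class XmlConfigSketch {
    // Hypothetical fragment: a queue declared once (id) and referenced twice (idref).
    static final String CONFIG =
          "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<java version=\"1.8.0\" class=\"java.beans.XMLDecoder\">\n"
        + " <object id=\"BNL_LSF_LongQueue\" class=\"java.util.HashMap\">\n"
        + "  <void method=\"put\"><string>name</string><string>long</string></void>\n"
        + " </object>\n"
        + " <object class=\"java.util.ArrayList\">\n"
        + "  <void method=\"add\"><object idref=\"BNL_LSF_LongQueue\"/></void>\n"
        + "  <void method=\"add\"><object idref=\"BNL_LSF_LongQueue\"/></void>\n"
        + " </object>\n"
        + "</java>\n";

    // Returns the two top-level objects in document order:
    // [the queue map, the list holding references to it].
    static Object[] load(String xml) {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        try (XMLDecoder decoder = new XMLDecoder(new ByteArrayInputStream(bytes))) {
            Object queue = decoder.readObject();
            Object queues = decoder.readObject();
            return new Object[] { queue, queues };
        }
    }

    public static void main(String[] args) {
        Object[] parsed = load(CONFIG);
        Map<?, ?> queue = (Map<?, ?>) parsed[0];
        List<?> queues = (List<?>) parsed[1];
        // Both list entries point at the single declared queue instance.
        System.out.println(queue.get("name") + " shared: " + (queues.get(0) == queues.get(1)));
    }
}
```

Because idref resolves to the same in-memory instance, a change to the one declaration is seen by every block that references it.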
8 Configuration of the Policy What parameters are needed? A list of queues (sometimes with weights) The base algorithm to use A name which the user can call to invoke the policy Configuration for monitoring plug-ins (optional)
9 Configuration of the Policy What does it really look like?
10 Configuration of the Dispatcher The base class for the given submission method –LSF, CONDOR, SGE, … Timing information –delay between submissions, timeout time, number of retries Gatekeeper names (for grid submission) Script generator –Program location table. Site specific nuances.
11 Configuration of the Dispatcher
12 Configuration of the Dispatcher Site-specific nuances: –Some examples: When submitting via the Condor batch system, some sites require additional keywords such as +Experiment=“star”; otherwise the job is held indefinitely. At PDSF it is necessary to run the “module load [name]” command before certain software packages such as Java or Globus can be accessed; otherwise the user gets “program not found” errors.
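As a rough sketch of how such nuances might be applied when job files are generated, the following assumes the site-specific keywords and module names are held as plain string lists in the dispatcher configuration. The class and method names are hypothetical and are not SUMS APIs; the generated submit file is deliberately minimal.

```java
import java.util.Arrays;
import java.util.List;

class SiteNuanceSketch {
    // Builds a minimal Condor submit description; site-specific
    // keyword lines are written before the final "queue" statement.
    static String condorSubmit(String executable, List<String> siteKeywords) {
        StringBuilder sb = new StringBuilder();
        sb.append("executable = ").append(executable).append('\n');
        for (String keyword : siteKeywords) {
            sb.append(keyword).append('\n');
        }
        sb.append("queue\n");
        return sb.toString();
    }

    // Prepends one "module load <name>" line per module to the command
    // that the job script will run.
    static String withModules(List<String> modules, String command) {
        StringBuilder sb = new StringBuilder();
        for (String module : modules) {
            sb.append("module load ").append(module).append('\n');
        }
        return sb.append(command).append('\n').toString();
    }

    public static void main(String[] args) {
        System.out.print(condorSubmit("job.sh", Arrays.asList("+Experiment = \"star\"")));
        System.out.print(withModules(Arrays.asList("java"), "java -version"));
    }
}
```

Keeping the nuances as data rather than code means a new site requirement becomes a configuration change instead of a new dispatcher class.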
13 Configuration of the Queue The queue objects are virtual entities representing a subset of nodes. Examples: a Condor pool, or a subset of a batch system queue where memory > 256MB. What parameters are needed? –Queue weight: policies use queue weights for decision making; dynamic (monitoring) policies derive their own weights, while static policies have user-configured weights. –Will the job “fit”? Time limits (CPU, wall), memory, scratch space.
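The “fit” test and the use of static weights can be sketched as follows. This is an illustration under assumptions: the field names, the units, and the rule "pick the fitting queue with the highest weight" are invented here, since the slides do not spell out the actual policy algorithms.

```java
import java.util.Arrays;
import java.util.List;

class QueueSketch {
    final String name;
    final double weight;          // static, user-configured weight (assumed semantics)
    final long cpuLimitMinutes;
    final long wallLimitMinutes;
    final long memoryLimitMB;
    final long scratchLimitMB;

    QueueSketch(String name, double weight, long cpuLimitMinutes,
                long wallLimitMinutes, long memoryLimitMB, long scratchLimitMB) {
        this.name = name;
        this.weight = weight;
        this.cpuLimitMinutes = cpuLimitMinutes;
        this.wallLimitMinutes = wallLimitMinutes;
        this.memoryLimitMB = memoryLimitMB;
        this.scratchLimitMB = scratchLimitMB;
    }

    // A job fits only if every requested resource is within the queue's limit.
    boolean fits(long cpuMinutes, long wallMinutes, long memoryMB, long scratchMB) {
        return cpuMinutes <= cpuLimitMinutes
            && wallMinutes <= wallLimitMinutes
            && memoryMB <= memoryLimitMB
            && scratchMB <= scratchLimitMB;
    }

    // Returns the fitting queue with the highest weight, or null if none fits.
    static QueueSketch pick(List<QueueSketch> queues, long cpuMinutes,
                            long wallMinutes, long memoryMB, long scratchMB) {
        QueueSketch best = null;
        for (QueueSketch q : queues) {
            if (q.fits(cpuMinutes, wallMinutes, memoryMB, scratchMB)
                    && (best == null || q.weight > best.weight)) {
                best = q;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<QueueSketch> queues = Arrays.asList(
            new QueueSketch("short", 10.0, 60, 90, 512, 1000),
            new QueueSketch("long", 5.0, 2880, 4320, 2048, 10000));
        QueueSketch chosen = pick(queues, 600, 700, 1024, 2000);
        System.out.println(chosen == null ? "no queue fits" : chosen.name);
    }
}
```

A dynamic (monitoring) policy would replace the fixed `weight` field with a value derived at decision time.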
14 Configuration of the Queue Typical Configuration:
15
16 Note that the configuration file for site “A” is different from the one for site “B”.
17 *This approach works for a small number of sites; however, it does not scale well, because every configuration is different and all configurations need to be updated when a new site is added.
18 Demand Drives the Need to Evolve
In order to reduce redundancy, the following steps were taken:
Merge all files.
–All sites are merged into the same file.
–There is a higher-level “site block” to encapsulate (delimit) all sites in the configuration.
Normalize the configuration (removal of duplicates).
–The duplication of queues inside policies was a major source of redundancy; as a result, queues were pulled out and are only referenced in the policy.
–Batch system information was pulled out of the queue blocks to a higher level that encapsulates the queue blocks. This level is referred to as the batch system.
–Gatekeeper information was pulled out of the dispatchers and put into the batch system block.
–The dispatcher blocks were moved into the batch system block. There is provision for only two dispatchers per batch system block: one for submission to the batch block by local users, and one for submission to the batch block by remote users.
19
20 Benefits Policies are reusable: one policy can be used by multiple sites. Changes are easily implemented and distributed. Redundancy is cut down.
21 How it works. After a queue is assigned to a job by a policy, it has to be determined whether the local or the grid dispatcher should be used. This is done by recovering the user's domain name. Multiple methods are used to try to recover the domain name; the most common is “/bin/domainname”. The result is compared with the domain of the site on which the queue resides (from the config file). If they are the same, the local dispatcher is used. If they are different, the grid dispatcher is used.
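The comparison described above reduces to a small decision function. The sketch below uses hypothetical names, compares domains case-insensitively, and falls back to the grid dispatcher when a domain could not be recovered. Those last two choices are assumptions, not behavior documented in the slides.

```java
import java.util.Locale;

class DispatcherChoiceSketch {
    enum Dispatcher { LOCAL, GRID }

    // Trims and lower-cases a domain name; null becomes the empty string.
    static String normalize(String domain) {
        return domain == null ? "" : domain.trim().toLowerCase(Locale.ROOT);
    }

    // LOCAL when the user's domain equals the domain of the queue's site,
    // GRID otherwise (including when either domain is unknown).
    static Dispatcher choose(String userDomain, String siteDomain) {
        String user = normalize(userDomain);
        String site = normalize(siteDomain);
        if (user.isEmpty() || site.isEmpty()) {
            return Dispatcher.GRID;
        }
        return user.equals(site) ? Dispatcher.LOCAL : Dispatcher.GRID;
    }

    public static void main(String[] args) {
        // Example domains are placeholders, not real site names.
        System.out.println(choose("site-a.example", "site-a.example"));
        System.out.println(choose("site-a.example", "site-b.example"));
    }
}
```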
22 How it works.
23 The Job Script Building user sandboxes and data recovery
24 Adding New Sites 1. When SUMS is initialized, it tries to find its site in the configuration. 2. If the site is not there, SUMS will ask the administrator a minimal number of questions to configure the site. 3. SUMS will write the site information in a special location. –The administrator can decide if they want to feed this information back to the master configuration.
25 Getting to the Dynamic Part
26 Fragment of Program Locations Table from nersc.gov –String, String pairs –String, method call returning a string
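The two kinds of table entries, a fixed string or a method call that returns a string, can be modeled uniformly. This is a sketch with hypothetical names and placeholder paths; the actual table is built through the XML configuration rather than Java code like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

class ProgramLocationsSketch {
    // Every entry is stored as a supplier so static and dynamic
    // entries are looked up the same way.
    private final Map<String, Supplier<String>> table = new LinkedHashMap<>();

    // "String, String" pair: the location is fixed at configuration time.
    void putStatic(String program, String path) {
        table.put(program, () -> path);
    }

    // "String, method call" pair: the location is computed at lookup time.
    void putDynamic(String program, Supplier<String> resolver) {
        table.put(program, resolver);
    }

    // Returns the program's location, or null if the program is not listed.
    String locate(String program) {
        Supplier<String> entry = table.get(program);
        return entry == null ? null : entry.get();
    }

    public static void main(String[] args) {
        ProgramLocationsSketch locations = new ProgramLocationsSketch();
        locations.putStatic("java", "/placeholder/bin/java");
        locations.putDynamic("scratch", () -> System.getProperty("java.io.tmpdir"));
        System.out.println(locations.locate("java"));
        System.out.println(locations.locate("scratch"));
    }
}
```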
27 Information service cycle 1. A configuration with some parts obtained from static configuration and some parts obtained via an information service. 2. The information service absorbs (makes available) parameters that were previously statically configured. 3. Site N requires a new configuration parameter, deemed necessary in order to submit to the site. The parameter is statically added to the configuration.
28 Improvements and Plans for Future Development Where do we go from here?
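One way to picture the mixed state in step 1 is a lookup that asks the information service first and falls back to the static configuration. In this sketch the information service is modeled as a plain function, and both that interface and the precedence order are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class ConfigLookupSketch {
    private final Map<String, String> staticConfig;
    // Stand-in for an information service query (hypothetical interface).
    private final Function<String, Optional<String>> informationService;

    ConfigLookupSketch(Map<String, String> staticConfig,
                       Function<String, Optional<String>> informationService) {
        this.staticConfig = staticConfig;
        this.informationService = informationService;
    }

    // Prefers the dynamically published value; otherwise uses the static one.
    Optional<String> get(String parameter) {
        Optional<String> dynamic = informationService.apply(parameter);
        if (dynamic.isPresent()) {
            return dynamic;
        }
        return Optional.ofNullable(staticConfig.get(parameter));
    }

    public static void main(String[] args) {
        Map<String, String> staticConfig = new HashMap<>();
        staticConfig.put("gatekeeper", "gk.site-n.example");   // placeholder values
        staticConfig.put("newParameter", "added-by-hand");
        ConfigLookupSketch config = new ConfigLookupSketch(staticConfig,
            name -> name.equals("gatekeeper") ? Optional.of("gk2.site-n.example") : Optional.empty());
        System.out.println(config.get("gatekeeper").orElse("unset"));
        System.out.println(config.get("newParameter").orElse("unset"));
    }
}
```

When the information service later starts publishing a parameter, the static entry simply stops being consulted and can be removed, which is step 2 of the cycle.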
29 Accuracy Counts
30
31 Conclusions SUMS produces jobs that take best advantage of resources on any given site. SUMS provides seamless GRID and local integration. We have provided an easy to use method for adding new sites. We are moving to dynamic recovery of configuration parameters. We want to be able to provide one install for all sites in many different communities.
32 The End Questions?