Overview
–Why are STAR members encouraged to use SUMS?
–Improvements and additions to SUMS
–Research
–Job scheduling with load monitoring tools
–Request Definition Language (RDL)
–Who contributes to SUMS research and development?

Intro SUMS is a scheduler written mainly by STAR, with contributions from many other organizations. It is used to run programs on datasets (large sets of files) that may be distributed across different nodes, clusters, batch systems, and sites. It is distributed free of charge under the GNU General Public License. SUMS unifies scientific computing requests into one general format, and it is hoped that more experiments will join us and use our scheduler.

Why are STAR members encouraged to use SUMS?
–Resolves the problem of overusing limited resources to the point of failure
–Resolves the problem of accessing files on distributed disks
–Breaks dependence on any one particular batch system
–Helps less experienced users with file catalog resolution and job optimization


[Diagram: the RCF farm, with a queue of jobs dispatched to nodes holding the disks STARDATA24, STARDATA05, and STARDATA02]

[Diagram: virtual resource accounting. Each disk is a virtual resource with a fixed pool of units: SD24 (102 units total), SD02 (800 units total), SD05 (2040 units total). Queued and running jobs each declare the units they need from each resource, e.g. SD05 = 500 and SD24 = 50]

Past reactions show that, without a proper explanation, users believe this kind of strategy slows down their jobs. In reality the strategy (well implemented) greatly increases farm throughput. Jobs that need resources no one else is using run first (and fast). Jobs don't have to compete for the same resource, slowing each other down and running longer than they have to, so slots open up faster. My jobs are not run sequentially; they are run in the optimal order.
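The accounting sketched in the diagram can be illustrated with a short sketch (hypothetical names and capacities taken from the slide; not the actual SUMS implementation): each disk is a virtual resource with a unit pool, and a queued job is dispatched only when every resource it touches can cover the units it declares.

```python
# Unit-based resource throttling, in miniature. Capacities follow the
# slide's figures; the Throttle class and job format are illustrative.

CAPACITY = {"SD24": 102, "SD02": 800, "SD05": 2040}  # units per resource

class Throttle:
    def __init__(self, capacity):
        self.free = dict(capacity)

    def try_dispatch(self, job):
        """job: dict mapping resource name -> units needed."""
        if all(self.free[r] >= u for r, u in job.items()):
            for r, u in job.items():
                self.free[r] -= u
            return True
        return False  # stays queued; a less contended job may run first

    def release(self, job):
        for r, u in job.items():
            self.free[r] += u

t = Throttle(CAPACITY)
print(t.try_dispatch({"SD05": 500, "SD24": 50}))  # True: everything fits
print(t.try_dispatch({"SD24": 60}))               # False: only 52 SD24 units left
print(t.try_dispatch({"SD02": 600}))              # True: SD02 is uncontended
```

Note how the third job overtakes the second: a job that wants an uncontended resource runs immediately, which is exactly the throughput argument above.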


Improvements and additions to SUMS
–Resubmit/status/kill switches
–minMemory, maxMemory, minStorageSpace, maxStorageSpace keywords
–SGE dispatcher
–$FILEBASENAME
–Schema validation
–Extended report file
–uuidgen

Resubmit/status/kill switches
star-resubmit -kr all DD9CFB586F4139E8D14C6.session.xml //Kill and resubmit all jobs
star-resubmit -r 1,2,3,5 DD9CFB586F4139E8D14C6.session.xml //Resubmit jobs 1, 2, 3, and 5
star-resubmit -s 5 DD9CFB586F4139E8D14C6.session.xml //Get the status of job 5

maxMemory and maxStorageSpace
–The "max" limits let SUMS know not to submit your job to a queue where memory or storage space may be an issue.
–Under LSF they also let the queue know to kill the job if it exceeds the max limits (very handy if you suspect potential for "runaway" jobs).

minMemory and minStorageSpace
–The "min" limits let the batch system know how much space should be freed before your job can run.
–Under LSF, if this limit is smaller than the default, your job will most likely find a slot before jobs seeking larger slots.
–Under LSF, if this limit is far too high, your job may never find a slot (may never run).

The SGE dispatcher PDSF is converting from the LSF batch system to the SGE batch system. At the request of PDSF, a new dispatcher plug-in was written for SGE. During the transition period, as LSF nodes go down and SGE nodes come up, SUMS determines which batch system the local files are on and formats the submit accordingly. The new SGE dispatcher plug-in lets SUMS dispatch to SGE just as it previously did with LSF; users at PDSF will notice absolutely no difference in how they submit their jobs with SUMS.
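A minimal sketch of the dispatcher plug-in idea (function names, queue name, and paths are hypothetical): the user-facing request stays the same while each plug-in renders that batch system's own submit command. The flags shown are the standard `bsub`/`qsub` ones; the real SUMS dispatchers of course handle far more.

```python
# Two dispatcher back-ends producing equivalent submits for the same
# request: one for LSF (bsub), one for SGE (qsub).

def lsf_submit(queue, logfile, cmd):
    return ["bsub", "-q", queue, "-o", logfile, cmd]

def sge_submit(queue, logfile, cmd):
    # -b y tells SGE the argument is a binary/command, not a script to parse
    return ["qsub", "-q", queue, "-o", logfile, "-b", "y", cmd]

request = ("star_cas_big", "/star/u/me/job.log", "./runJob.csh")
print(" ".join(lsf_submit(*request)))
print(" ".join(sge_submit(*request)))
```

Because the request tuple never changes, swapping the back-end is invisible to the user, which is the point made above.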

Schema validation / extended report file User requests are validated against a JDL schema. If there is an error, the validation package reports both the line number and a description of the error, similar to what you would expect from the Java compiler.
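A minimal illustration of that style of error reporting, using the Python standard-library XML parser rather than the actual SUMS validation package: a malformed request is parsed, and the line number plus a description are surfaced, compiler-style.

```python
# Parse a deliberately broken request and report line + description.
import xml.etree.ElementTree as ET

bad_request = """<job>
  <command>root4star -q -b myMacro.C
</job>"""  # <command> is never closed

try:
    ET.fromstring(bad_request)
except ET.ParseError as err:
    line, column = err.position
    print(f"line {line}, column {column}: {err}")
```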

uuidgen (Universally Unique Identifier Generator)
–A UUID can reasonably be considered unique among all UUIDs created on the local system, and among UUIDs created on other systems, past and future.
–This prevents jobs from colliding in the local environment and on the grid.
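The same guarantee in miniature: a UUID labels each job and its scratch area, so identically named jobs from different submissions never collide, locally or on the grid. This sketch uses Python's random-based `uuid4` in place of the system `uuidgen` utility; the path is illustrative.

```python
# Tag a job with a UUID and derive a collision-free scratch location.
import uuid

job_id = uuid.uuid4()
scratch_dir = f"/tmp/sums_{job_id}"  # illustrative path
print(scratch_dir)
```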

Research In addition to providing a working scheduler for its users, SUMS is also used as a research tool, mostly (but not exclusively) for grid research. Currently ongoing research projects include:
1.Job scheduling with load monitoring tools
2.Request Definition Language (RDL)

Job scheduling with load monitoring tools The aim of this research is to devise simple and efficient algorithms that take many variables into account to determine the best location to submit a job at the current moment in time.

[Diagram: passive policy vs. monitoring policy]

Job scheduling with load monitoring tools Later tests done during the week showed results that were not as good. Analysis shows that fluctuating queue loads cause problems for this algorithm, as it samples and averages over 3 hours. Observation: it is inefficient to make the scheduling computation as intensive as the actual jobs you are trying to submit.

Job scheduling with load monitoring tools: conclusions A different equation is needed, taking into account the differential (rate of change) of the load and projecting forward in time. More testing will be needed, but this will prove to be a valuable way of determining "where to submit". The future looks good.
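One reading of the proposed refinement, sketched under assumptions (queue names, sample values, and the linear projection are all mine, not the final algorithm): instead of averaging load over a long window, keep two recent samples per queue, estimate the rate of change, and rank queues by load projected a short horizon ahead.

```python
# Rank queues by load projected forward from its rate of change.

def projected_load(samples, horizon):
    """samples: [(t0, load0), (t1, load1)] with t1 > t0; horizon in seconds."""
    (t0, l0), (t1, l1) = samples
    slope = (l1 - l0) / (t1 - t0)  # differential (rate of change) of load
    return l1 + slope * horizon    # cheap linear projection forward in time

queues = {
    "rcas_short": [(0, 0.40), (60, 0.80)],  # lightly loaded but rising fast
    "rcas_long":  [(0, 0.70), (60, 0.65)],  # heavier but slowly draining
}
horizon = 300  # seconds
best = min(queues, key=lambda q: projected_load(queues[q], horizon))
print(best)  # rcas_long: its projected load (0.40) beats rcas_short's (2.80)
```

A plain average over the window would have picked the rising queue; the differential term is what lets the projection see past a momentary lull.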

RDL (Request Description Language) What if there existed a language so powerful that it could define not just a set of jobs but the relationships between them, and serve not just NP or HEP but other fields as well?
–Ex: run these two jobs, take the one that finishes first, merge the output, run another process on it, and draw up some graphs.
RDL is a new XML-based definition language being developed by STAR in conjunction with Tech-X Corp. that will allow this level of abstraction and much more.
–Allows for web-service submission and control.
–RDL incorporates the reality of everyday work from different experiments (ex. JLab).
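The example workflow above ("run two jobs, take the one that finishes first, merge the output, run another process") can be sketched procedurally with standard-library concurrency. The job bodies and function names are stand-ins; RDL would express the same dependency graph declaratively instead.

```python
# Run two jobs concurrently, take the first to finish, merge its
# output, then post-process the merged result.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

def job(name, delay):
    time.sleep(delay)           # stand-in for real work
    return f"output-of-{name}"

def merge(result):
    return f"merged({result})"

def postprocess(merged):
    return f"graphs-from({merged})"

with ThreadPoolExecutor() as pool:
    futures = {pool.submit(job, "A", 0.05), pool.submit(job, "B", 0.2)}
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    first = next(iter(done)).result()  # whichever job finished first
    print(postprocess(merge(first)))
```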

RDL vs. JDL: features compared
–Submitting on a grid landscape
–Submission of multiple jobs
–Submission of multiple requests
–Separation of task and application
–Workflow support
–XML format

Who contributes to SUMS research and development?
–PPDG: funding
–Jérôme Lauret and Levente Hajdu: coding and administration of SUMS at BNL
–Lidia Didenko: testing for grid readiness
–David Alexander and Paul Hamill (Tech-X Corp.): RDL deployment and prototype client and web service
–Eric Hjort, Iwona Sakrejda, Doug Olson: administration of SUMS at PDSF
–Efstratios Efstathiadis: queue monitoring, research
–Valeri Fine: grid testing
–Andrey Y. Shevel: administration of SUMS at Stony Brook University
–and others