Batch Scheduling at LeSC with Sun Grid Engine David McBride Systems Programmer London e-Science Centre Department of Computing, Imperial College

Overview
● End-user requirements
● Brief description of compute hardware
● Sun Grid Engine software deployment
● Tweaks to the default SGE configuration
● Future changes
● References for more information and questions

End-User Requirements
● We have many different users: high-energy physicists, bioinformaticians, chemists, parallel software researchers.
● Jobs are many and varied:
  – Some users run relatively few long-running tasks; others submit large batches of shorter jobs.
  – Some require several cluster nodes to be co-allocated at runtime (16, 32+ MPI hosts); others simply use a single machine.
  – Some require lots of RAM (1, 2, 4, 8GB+ per machine).
● In general, users are fairly happy so long as they get a reasonable response time.

Hardware
● Saturn: 24-way 750MHz UltraSPARC III Sun E6800
  – 36GB RAM, ~20TB online RAID storage
  – 24TB tape library to support long-term offline backups
  – Running Solaris 8
● Viking cluster: 260 nodes, dual P4 Xeon 2GHz+
  – 128 machines with Fast Ethernet; 2x64 machines also with Myrinet
  – 2 front-end nodes & 2 development nodes
  – Running RedHat Linux 7.2 (plus local additions and updates)
● Mars cluster: 204 nodes, dual AMD Opteron 1.8GHz+
  – 128 machines with Gigabit Ethernet; 72 machines also with Infiniband
  – Running RedHat Enterprise Linux 3 (plus local refinements)
  – 4 front-end interactive nodes

Sun Grid Engine Deployment
● Two separate logical SGE installations
  – Saturn acts as the master node for both cells.
  – However, Viking is running SGE 5.3 and Mars is running SGE 6.0 (see the sketch below).
● Mars is still ‘in beta’; Viking is still providing the main production service.
● When Mars’s configuration is finalized, end-users will be migrated to Mars.
  – Viking will then be reinstalled with the new configuration.
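For illustration, each SGE cell provides its own settings script that points the client commands (qsub, qstat, ...) at that cell's qmaster; a minimal sketch follows, in which the install path and cell names are assumptions rather than the actual LeSC layout:

    # Select the cell to submit to by sourcing its settings file
    # (path and cell names are illustrative only):
    . /opt/sge/viking/common/settings.sh   # sets SGE_ROOT, SGE_CELL, PATH, ...
    qsub my_job.sh                         # job goes to that cell's qmaster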

Changes to Default Configuration
● Issue 1:
  – If all the available worker nodes are running long-lived jobs, then a new short-lived job added to the queue will not execute until one of the long-lived jobs has completed. (SGE does not provide a job checkpoint-and-preempt facility.)
  – Resolution: a subset of nodes is configured to run only short-lived jobs (see the sketch below).
  – This trades slightly reduced cluster utilization for a shorter average-case response time for short-lived jobs.
  – End users only benefit if they specify at submission time that the job will finish quickly.
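A dedicated short-jobs queue of this kind could be set up roughly as follows (SGE 6.0 cluster-queue syntax; the queue name, host group and one-hour limit are assumptions, not the actual LeSC settings):

    # Create a queue restricted to a reserved subset of nodes (qconf opens an editor):
    qconf -aq short.q
    #   hostlist   @short_nodes     a host group containing the reserved nodes
    #   h_rt       1:00:00          hard wall-clock limit of one hour

    # Users must declare a short runtime at submission to be eligible, e.g.:
    qsub -l h_rt=0:30:00 my_short_job.sh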

Changes to Default Configuration
● Issue 2:
  – The clusters are internally heterogeneous; e.g. some nodes have more memory, faster processors or bigger local disks than others.
  – Sometimes a low-requirement job is allocated to one of these more capable machines unnecessarily because the submitter has not specified the job’s requirements.
  – This can prevent a job which does have high requirements from being run as quickly.
  – We are experimenting with changing the SGE configuration so that, by default, a job only requests the resources of the least-capable node (see the sketch below).
  – Again, this places the onus on the user to request extra resources if needed.
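One way to apply such a default is SGE's site-wide default request file, sge_request, which adds submission options to every job unless the user overrides them; the values below are purely illustrative:

    # $SGE_ROOT/<cell>/common/sge_request -- default options applied to all submissions.
    # Default every job to the footprint of the least-capable node class
    # (the value here is an example, not the LeSC setting):
    -l mem_free=1G
    # A user with larger requirements must say so explicitly, e.g.:
    #   qsub -l mem_free=8G big_memory_job.sh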

Changes to Default Configuration
● Issue 3:
  – If a job is submitted that requires the co-allocation of several cluster nodes simultaneously (e.g. for a 16-way MPI job), then that job can be starved by a larger number of single-node jobs.
  – Resolution (SGE 5.3): manually intervene to manipulate the queues so that the large 16-way job will be scheduled.
  – Resolution: upgrade to SGE 6, which uses a more advanced scheduling algorithm (resource reservation with backfill; see the sketch below).
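Under SGE 6, reservation is enabled in the scheduler configuration and requested per job; in the sketch below, the limits and the parallel environment name 'mpi' are assumptions:

    # In the scheduler configuration (edited via 'qconf -msconf'), allow pending
    # jobs to hold resource reservations (illustrative values):
    #   max_reservation    32
    #   default_duration   8:00:00

    # A 16-way MPI job then requests a reservation so that single-node jobs
    # backfill around it rather than starving it:
    qsub -pe mpi 16 -R y -l h_rt=12:00:00 big_mpi_job.sh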

Future Changes: LCG
● We are participating in the LHC Computing Grid (LCG) as part of the London Tier-2.
● This has been non-trivial; the standard LCG distribution only supports PBS-based clusters.
  – We’ve developed SGE-specific Globus JobManager and Information Reporter components for use with LCG.
  – We have also been working with the developers to address issues with running on 64-bit Linux distributions.
● We are currently deploying front-end nodes (CE, SE, etc.) to expose Mars as an LCG compute site.
● We are also joining the LCG Certification Testbed to provide an SGE-based test site to help ensure future support.

References
● London e-Science Centre homepage:
  –
● SGE integration tools for Globus Toolkit 2, 3, 4 and LCG:
  –

Q&A