Remote Cluster Connect Factories
David Lesny, University of Illinois
February 19, 2014

Remote Cluster Connect Factories (rccf.usatlas.org)
Central component of ATLAS Connect:
– User (user login)
– Cluster (remote Tier3 flocking)
– Panda (pilots)
Provides Tier3 queue-like access to beyond-pledge and off-grid resources.
Can connect off-grid clusters to Panda.

RCC Factories for direct login service
[Architecture diagram: login.usatlas.org and the connect.usatlas.org portal submit through rccf.usatlas.org, with FAXbox storage, to targets including MWT2, AGLT2, SWT2, CSU Fresno, UC3, UC RCC, TACC Stampede, the Tier1, and the Amazon cloud; annotations mark some targets NEW or SOON and note a 10k-job test on 2/5/14.]

RCC Factories for Tier 3 flocking (implemented)
RCC Factories for Panda queues (up next)

Remote Cluster Connect
Nearly all the pieces have already been created:
– Bosco is the job switch fabric used in the RCC
– ATLAS Connect login service, FAXbox, for user job and Tier 3 flocking submissions
– Parrot/CVMFS (CCTools)
– CVMFS server at MWT2
– Existing Frontier Squids

Remote Cluster Connect
Remote Cluster Connect (RCC) provides a mechanism by which an authenticated remote user, either via the ATLAS Connect login service or an HTCondor-flocked Tier3, can access computing resources at Tier2s (MWT2, AGLT2, SWT2), the Tier1 (BNL), Tier3s, or campus clusters (Midway, Stampede, ICC, etc.).
RCC uses multiple instances of "Bosco", called factories. Each factory is installed on a unique port and runs under a unique user account. The factory accepts jobs from one or more HTCondor sources, such as the ATLAS Connect login host, and injects them into one or more target clusters. Jobs are submitted using the target's batch scheduler, such as HTCondor, PBS, SGE, LSF, or SLURM (with PBS emulation).
Requirements on the target are an account under which the jobs run, SSH access to a submit host with a passwordless key for injecting the jobs, and outgoing firewall access from the compute nodes to the RCC. A sketch of how a target could be registered with a factory follows.
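The slides do not show the registration commands themselves; the following is a minimal sketch of how a Bosco factory could be pointed at target clusters over SSH. The account and host names are hypothetical, and the scheduler type must match what each target actually runs.

  # Assumed example: register a PBS-based campus cluster with this
  # factory's Bosco instance (account and hostname are hypothetical).
  bosco_cluster --add rccuser@login.campuscluster.edu pbs

  # Repeat for other targets, matching each target's batch scheduler.
  bosco_cluster --add rccuser@login.tier3.example.edu condor

  # List the registered targets and run a test submission.
  bosco_cluster --list
  bosco_cluster --test rccuser@login.campuscluster.edu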

RCC – Multiple Single User Bosco Instances
Remote Cluster Connect Factory (RCCF): a single-user Bosco instance running as an RCC user on a unique SHARED_PORT.
– Each RCCF is a separate Condor pool with a SCHEDD/Collector/Negotiator
– The RCCF injects glideins via SSH into a Target Scheduler
– The glidein creates a virtual job slot from the Target Scheduler back to the RCCF
– Any jobs which are in that RCCF then run in that virtual job slot
– Jobs are submitted to the RCCF by flocking from a Source SCHEDD
– The RCCF can inject glideins to multiple Target Scheduler hosts
– The RCCF can accept flocked jobs from multiple Source SCHEDD hosts
– Must have open bidirectional access to at least one port on the Target Scheduler
– Firewalls can create problems – SHARED_PORT makes it easier (single port), as sketched below
– Scale testing to a level of 5k jobs so far
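The slide gives no configuration, but the flocking-plus-shared-port arrangement might look roughly like the HTCondor configuration below. The port number is an assumption, not the production value, and the exact knobs can differ between HTCondor versions.

  # On the Source SCHEDD (e.g. the ATLAS Connect login host): let jobs
  # flock to the factory's collector, which listens behind a single
  # shared port (hypothetical port 11013).
  FLOCK_TO = $(FLOCK_TO), rccf.usatlas.org:11013?sock=collector

  # On the RCCF side: one condor_shared_port daemon multiplexes the
  # collector, negotiator and schedd over that single inbound port,
  # which is the only port the factory host must expose to source
  # SCHEDDs and to glideins on the target's compute nodes.
  USE_SHARED_PORT = True
  SHARED_PORT_ARGS = -p 11013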

Source is always HTCondor
– User submitting jobs always uses HTCondor submit files, regardless of the target
– User does not have to know what scheduler is used at the target

  Universe = Vanilla
  Requirements = ( IS_RCC ) && ((Arch == "X86_64") || (Arch == "INTEL"))
  +ProjectName = "atlas-org-illinois"

  Executable = gensherpa.sh

  Should_Transfer_Files = IF_Needed
  When_To_Transfer_Output = ON_Exit
  Transfer_Output = True
  Transfer_Input_Files = _sherpa_input.tar.gz, MC Sherpa_CT10_llll_ZZ.py
  Transfer_Output_Files = EVNT.pool.root
  Transfer_Output_Remaps = "EVNT.pool.root=output/EVNT_$(Cluster)_$(Process).pool.root"

  Arguments = "$(Process) 750"

  Log = logs/$(Cluster)_$(Process).log
  Output = logs/$(Cluster)_$(Process).out
  Error = logs/$(Cluster)_$(Process).err

  Notification = Never

  Queue 100
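For completeness (not shown on the slide), a submit file like this would be handed to HTCondor on the source host in the usual way; the submit file name here is hypothetical.

  # Submit the 100 Sherpa jobs from the ATLAS Connect login host and watch them.
  condor_submit gensherpa.sub
  condor_q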

ATLAS Tier1/2 vs. Campus Clusters
Tier1/2 targets are known and defined:
– CVMFS is installed and working
– ATLAS repositories are configured and available
– Required ATLAS RPMs are installed on all compute nodes
Campus Clusters are typically not "ATLAS-ready":
– CVMFS most likely not installed
– No ATLAS repositories and thus no ATLAS software
– Very unlikely that every RPM is installed
We could "ask" that these pieces be added, but we prefer to be unobtrusive.

Provide CVMFS via Parrot/CVMFS
Parrot/CVMFS (CCTools) has the ability to supply all of these missing elements:
– CCTools, job wrapper, and environment variables in a single tarball
– Tarball uploaded and unpacked on the target as part of virtual slot creation
– Package only used on sites without CVMFS (Campus Clusters)
– Totally transparent to the end user
– The wrapper executes the user's job in the Parrot/CVMFS environment
– ATLAS CVMFS repositories are then available to the job
– With CVMFS we can also access the MWT2 CVMFS server

CVMFS Wrapper Script
The CVMFS wrapper script is the glue that binds it all together:
– Defines Frontier Squids (site-dependent list) for CVMFS
– Sets up access to the MWT2 CVMFS repository
– Runs the user's job in the Parrot/CVMFS environment (see the sketch below)
One missing piece remains to run ATLAS jobs: Compatibility Libraries.
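The actual wrapper is not reproduced on the slide; the following is a minimal sketch of the idea, assuming CCTools has been unpacked into the virtual slot's working directory. The script name, squid host, and repository URL are placeholders, not the production values.

  #!/bin/sh
  # cvmfs_wrapper.sh (hypothetical name): run the user's job under Parrot
  # so that /cvmfs is visible even though no CVMFS client is installed.

  # Site-dependent Frontier squid (placeholder host).
  export HTTP_PROXY="http://squid.example.org:3128"

  # Make the default ATLAS repositories plus the MWT2 server visible
  # through Parrot (a pubkey= option would normally be needed as well).
  export PARROT_CVMFS_REPO="<default-repositories> \
    osg.mwt2.org:url=http://cvmfs.mwt2.org/cvmfs/osg.mwt2.org"

  # Hand the user's job (passed as arguments) to parrot_run from the
  # unpacked CCTools tarball.
  ./cctools/bin/parrot_run "$@"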

ATLAS Compatibility Libraries
ATLAS requires a very large number of RPMs not normally installed on a campus cluster, including compatibility libraries and the 32-bit libraries.
The list of "required" RPMs are dependencies of the HEP_OSlibs_SL6 RPM.
We could "ask" the campus cluster to install this RPM, but there may be another way: provide all the libraries via a CVMFS repository.

HEP_OSlibs_SL6
– Dumped all dependencies listed in HEP_OSlibs_SL6
– Fetched all RPMs from the Scientific Linux server
– Many of these are not relocatable RPMs, so cpio was used to unpack them:

  rpm2cpio $RPM | cpio --quiet --extract --make-directories --unconditional

– Also added a few other RPMs not currently part of HEP_OSlibs
This creates a structure which looks like:

  drwxr-xr-x  2 ddl mwt Feb 17 22:58 bin
  drwxr-xr-x 15 ddl mwt Feb 17 22:58 etc
  drwxr-xr-x  6 ddl mwt Feb 17 22:58 lib
  drwxr-xr-x  4 ddl mwt Feb 17 22:58 lib64
  drwxr-xr-x  2 ddl mwt Feb 17 22:58 sbin
  drwxr-xr-x  9 ddl mwt Feb 17 22:57 usr
  drwxr-xr-x  4 ddl mwt Feb 17 22:57 var
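Only the unpacking command appears on the slide; one hedged way to produce the full dump on an SL6 host with yum-utils available might look like the sketch below. The directory names are arbitrary, and the extra RPMs beyond HEP_OSlibs mentioned on the slide are not enumerated here.

  # Resolve every dependency of HEP_OSlibs_SL6 and download the RPMs
  # (assumes the HEP_OSlibs and Scientific Linux repos are configured).
  repoquery --requires --resolve HEP_OSlibs_SL6 > deps.txt
  mkdir -p rpms && yumdownloader --destdir=rpms $(cat deps.txt)

  # Unpack everything into one relocatable tree without installing,
  # since many of the RPMs are not relocatable.
  mkdir -p HEP_OSlibs_SL6 && cd HEP_OSlibs_SL6
  for RPM in ../rpms/*.rpm; do
      rpm2cpio "$RPM" | cpio --quiet --extract --make-directories --unconditional
  done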

Deploy via MWT2 CVMFS Server
Put this "dump" on the MWT2 CVMFS server:

  ~]# ll /cvmfs/osg.mwt2.org/atlas/sw/HEP_OSlibs_SL6/
  total 28
  drwxr-xr-x  2 root root 4096 Feb 18 11:38 bin
  drwxr-xr-x 15 root root 4096 Feb 18 11:38 etc
  drwxr-xr-x  6 root root 4096 Feb 18 11:38 lib
  drwxr-xr-x  4 root root 4096 Feb 18 11:38 lib64
  drwxr-xr-x  2 root root 4096 Feb 18 11:38 sbin
  drwxr-xr-x  9 root root 4096 Feb 18 11:38 usr
  drwxr-xr-x  4 root root 4096 Feb 18 11:38 var

Link the user's jobs to these libraries in the Parrot/CVMFS wrapper:

  myPATH=/cvmfs/osg.mwt2.org/atlas/sw/HEP_OSlibs_SL6/
  export LD_LIBRARY_PATH="$myPATH/usr/lib64:$myPATH/lib64:$myPATH/usr/lib:$myPATH/lib:$LD_LIBRARY_PATH"
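As a quick check (a hypothetical invocation, not from the slides), the same tree can be listed from a worker node through Parrot before wiring it into LD_LIBRARY_PATH:

  # Should show bin, etc, lib, lib64, sbin, usr, var if the repository
  # is reachable through the configured squid.
  ./cctools/bin/parrot_run ls /cvmfs/osg.mwt2.org/atlas/sw/HEP_OSlibs_SL6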

Success on Stampede and Midway
Two test cases have successfully run on the Midway and Stampede clusters; Peter Onyisi provided both a Reco_trf and a Sherpa job to test.
Jobs can now run transparently on any available cluster, which are:
– MWT2
– AGLT2
– SWT2
– Fresno State (Tier3)
– Midway
Coming soon:
– BNL (Tier1)
– Stampede (needs a change for 16 jobs in one submission)