Parrot and ATLAS Connect

Presentation transcript:

Parrot and ATLAS Connect
Rob Gardner and Dave Lesny

ATLAS Connect
A Condor- and PanDA-based batch service to easily connect resources:
- Connect to ATLAS-compliant resources such as a Tier2
- Connect to opportunistic resources such as campus clusters:
  - Stampede cluster at the Texas Advanced Computing Center
  - Midway cluster at the University of Chicago
  - Illinois Campus Cluster at UIUC/NCSA
- Each is RHEL6 or equivalent, with either SLURM or PBS as the local scheduler

Accessing Stampede
- Use a simple Condor submit via the BLAHP protocol (ssh login to the Stampede local submit host; factory based on http://bosco.opensciencegrid.org); a submit sketch follows this slide
- Test for prerequisites; APF uses the same mechanism
- PanDA queues, operated from MWT2:
  - APF for pilot submission
  - CONNECT: production queue
  - ANALY_CONNECT: analysis queue
- MWT2 storage for DDM endpoints
- Frontier squid service
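
A minimal sketch in Python of the ssh/BLAHP-style submission described above, assuming a BOSCO-managed "grid" universe route to a SLURM cluster; the login host, user name and test executable are placeholders, not the production factory configuration.

    import subprocess
    import tempfile

    # Condor submit description for a BOSCO/BLAHP route to a remote SLURM
    # cluster (host, user and executable are illustrative).
    SUBMIT_DESCRIPTION = """
    universe      = grid
    grid_resource = batch slurm osguser@login.stampede.example.edu
    executable    = test_prerequisites.sh
    output        = test.$(Cluster).out
    error         = test.$(Cluster).err
    log           = test.$(Cluster).log
    queue 1
    """

    def submit():
        """Write the submit description to a file and hand it to condor_submit."""
        with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
            f.write(SUBMIT_DESCRIPTION)
            path = f.name
        subprocess.check_call(["condor_submit", path])

    if __name__ == "__main__":
        submit()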

Challenges
- Additional system libraries ("ATLAS compatibility libraries") as packaged in HEP_OSlibs_SL6
- Access to CVMFS clients and cache
- Environment variables normally set up by an OSG CE and needed by the pilot: $OSG_APP, $OSG_GRID, $VO_ATLAS_SW_DIR
- The approach was to provide these components via the user job wrapper (a wrapper sketch follows this slide)
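
A minimal sketch of the wrapper idea from this slide, assuming a Python wrapper that injects the CE-style variables before launching the pilot; the paths and the pilot command are illustrative placeholders, not the production values.

    import os
    import subprocess

    def run_pilot(image_root, pilot_cmd):
        """Start the pilot with the variables an OSG CE would normally provide."""
        env = dict(os.environ)
        # Values are illustrative; the real wrapper points these at the
        # locally installed image, WN client and CVMFS software area.
        env["VO_ATLAS_SW_DIR"] = "/cvmfs/atlas.cern.ch/repo/sw"
        env["OSG_APP"] = "/cvmfs/atlas.cern.ch/repo"
        env["OSG_GRID"] = os.path.join(image_root, "osg-wn-client")
        return subprocess.call(pilot_cmd, env=env)

    if __name__ == "__main__":
        run_pilot("/tmp/atlas-local", ["/bin/echo", "pilot would start here"])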

Approaches
- Build a Linux image with all libraries using fake[ch]root
  - Deploy this image locally via tarball or via a CVMFS repo
  - Or use the CernVM 3 image in /cvmfs/cernvm-prod.cern.ch
- Use Parrot to provide access to CVMFS repositories
  - Use Parrot "--mount" to map file references into the image, e.g. /usr/lib64 -> /cvmfs/cernvm-prod.cern.ch/cvm3/usr/lib64 (sketched after this slide)
- Install a Certificate Authority and the OSG worker-node client
- Emulate the CE by defining environment variables
  - Some defined in APF ($VO_ATLAS_SW_DIR, $OSG_SITE_NAME)
  - Others defined in the "wrapper" ($OSG_APP, $OSG_GRID)
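
A minimal sketch of the Parrot "--mount" remapping described above; only two mounts are shown and the payload is a placeholder, whereas the real wrapper maps many more paths into the image.

    import subprocess

    CVM3 = "/cvmfs/cernvm-prod.cern.ch/cvm3"

    # Map system library paths into the CernVM 3 image (illustrative subset).
    MOUNTS = {
        "/usr/lib64": CVM3 + "/usr/lib64",
        "/usr/lib": CVM3 + "/usr/lib",
    }

    def parrot_command(payload):
        """Build a parrot_run command line with the --mount remappings."""
        cmd = ["parrot_run"]
        for target, source in MOUNTS.items():
            cmd.append("--mount=%s=%s" % (target, source))
        return cmd + payload

    if __name__ == "__main__":
        # Example payload: list the libraries the job would see through Parrot.
        subprocess.call(parrot_command(["ls", "/usr/lib64"]))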

Problems (1)
- Symlinks cannot be followed between repositories
  - Not possible with Parrot due to restrictions in libcvmfs
  - Example: /cvmfs/osg.mwt2.org/atlas/sw -> /cvmfs/atlas.cern.ch/repo/sw
- In general, we find cross-referencing CVMFS repos unreliable
  - A Python script located in atlas.cern.ch needs a lib.so; if that lib.so resides in another repo, it might get "File not found"
- Solution: use local disk for the Linux image
  - Download a tarball and install it locally on disk (sketched after this slide)
  - Also install the local OSG worker-node client and CA in the same location
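
A minimal sketch of the local-disk solution, assuming the image and the OSG worker-node client (with its CA bundle) are published as tarballs; the URLs and destination path below are placeholders.

    import os
    import tarfile
    import urllib.request

    # Placeholder URLs; the real tarballs are served from project-managed hosts.
    TARBALLS = {
        "atlas-image": "http://example.org/atlas-image-sl6.tar.gz",
        "osg-wn-client": "http://example.org/osg-wn-client.tar.gz",
    }

    def install(dest="/tmp/atlas-local"):
        """Download and unpack the image and WN client side by side on local disk."""
        os.makedirs(dest, exist_ok=True)
        for name, url in TARBALLS.items():
            local = os.path.join(dest, name + ".tar.gz")
            urllib.request.urlretrieve(url, local)
            with tarfile.open(local) as tar:
                tar.extractall(dest)

    if __name__ == "__main__":
        install()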

Problems (2): Parrot stability
- Parrot is very sensitive to the kernel version
  - When used on 2.x kernels, many ATLAS programs hang
- Parrot uses ptrace and clones the system call
  - A ptrace bug in some kernels causes a timing problem: the traced program is woken with SIGCONT before it should be
  - Result: the program stays in the "T" (stopped) state forever
- Kernels known to have issues with Parrot:
  - ICC: 2.6.32-358.23.2.el6.x86_64
  - Stampede: 2.6.32-358.18.1.el6.x86_64
  - Midway: 2.6.32-431.11.2.el6.x86_64
- A custom kernel at MWT2 which seems to work is "3.2.13-UL3.el6"

Towards a solution: Parrot 4.1.4rc5
- To work around the hangs, the CCTools team provided a feature: --cvmfs-enable-thread-clone-bugfix
  - Stops many (not all) hangs, at a huge performance cost: a simple ATLAS Local Root Base setup with an asetup of a release takes 10x to 100x longer
  - Needed on 2.x kernels to avoid many of the hangs (a kernel-check sketch follows this slide)
- Programs which tend to run on 2.x without the "bugfix":
  - ATLAS Local Root Base setup (and the diagnostics db-readReal and db-fnget)
  - Reconstruction
  - PanDA pilots
  - Validation jobs
- Programs which tend to hang:
  - Sherpa (always)
  - Release 16.x jobs
  - Some HammerCloud tests (16.x always, 17.x sometimes)
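
A minimal sketch of the kernel-dependent choice described above: only pass the costly --cvmfs-enable-thread-clone-bugfix option on 2.x kernels, where the hangs occur. How the real wrapper makes this decision is an assumption, and the payload is a placeholder.

    import platform
    import subprocess

    def parrot_options():
        """Enable the thread-clone bugfix only on 2.x kernels."""
        kernel = platform.release()          # e.g. "2.6.32-358.23.2.el6.x86_64"
        major = int(kernel.split(".")[0])
        if major < 3:
            # Accept the 10x to 100x slowdown to avoid the ptrace-related hangs.
            return ["--cvmfs-enable-thread-clone-bugfix"]
        return []

    if __name__ == "__main__":
        subprocess.call(["parrot_run"] + parrot_options() + ["uname", "-r"])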

Alternatives to Parrot?
- The CCTools team will keep working on Parrot to fix bugs
  - We may need kernel 3.x on the target site for reliability
- Three solutions we are pursuing:
  - Parrot with Chirp (avoids libcvmfs)
  - NFS mounting of local CVMFS (requires admin)
  - Environment Modules, common on HPC facilities (sketched after this slide)
    - Treat the CVMFS client as a user application: jobs run "module load cvmfs-client"
    - The prefix has privileges and can load the needed FUSE modules
    - Cache re-use by multi-core job slots
    - Might be more palatable to HPC admins
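
A minimal sketch of the Environment Modules alternative, assuming the site provides a privileged "cvmfs-client" module; the module name and the payload are illustrative. The job loads the module in a login shell and then runs its payload.

    import subprocess

    def run_with_cvmfs_module(payload):
        """Load the (hypothetical) cvmfs-client module, then run the payload."""
        shell_cmd = "module load cvmfs-client && " + payload
        # A login shell ("-l") is used so the module command is initialised.
        return subprocess.call(["bash", "-lc", shell_cmd])

    if __name__ == "__main__":
        run_with_cvmfs_module("ls /cvmfs/atlas.cern.ch/repo")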

Conclusions
- Good experience accessing opportunistic resources without WLCG or ATLAS services
  - A general problem for campus clusters
- It would greatly help if we relied on only one CVMFS repo plus stock SL6 (like CMS)
- We will continue pursuing the three alternatives
- We hope we can learn from others here!