LBNE/Daya Bay utilization of Panda: project review and status report
PAS Group Meeting, November 12, 2010
Maxim Potekhin, for the BNL Physics Applications Software Group, Brookhaven National Laboratory

Overview
- Intro: Daya Bay and LBNE
- Motivations for PAS to support LBNE/Daya Bay
- LBNE/Daya Bay: pre-PanDA mode of operation
- Pandification
- Current status of the local Daya Bay cluster managed by PanDA
- Expansion to other sites
- Conclusion

Intro
Both Daya Bay and LBNE are experiments studying neutrino oscillations, in different energy domains:
- Daya Bay is a short-baseline experiment in China, currently entering its data-taking period after years of construction. It utilizes two nuclear reactors as its source of neutrinos.
- LBNE is a proposed complex Long-Baseline Neutrino Experiment that will utilize FNAL neutrino beams, with some of the detectors placed at the DUSEL deep underground facility in South Dakota, the deepest of its kind. These include a 300 kiloton water Cherenkov detector.
- Daya Bay personnel at BNL and a few other labs are also involved in the development of LBNE. Elements of the software infrastructure are being inherited from Daya Bay and IceCube and will be utilized in LBNE.
- Maintenance, configuration and utilization of the simulation software is one of the primary responsibilities of the BNL group. The current focus is on Monte Carlo simulation of large-scale Cherenkov detectors.

Motivations
- PAS has a broad mandate to support research conducted by the Physics Department. The standing of the group depends on how successful we are in doing that.
- In addition, BNL (primarily through PAS) is a stakeholder in the Open Science Grid (OSG) and benefits from this collaboration. OSG aims to promote science in a wide range of disciplines by providing researchers with access to high-throughput computing via an open platform.
- PAS is the de facto owner of the PanDA product and must leverage it to accomplish the above goals.
- The LBNE/Daya Bay experiments are high-profile projects and an ideal area of application for PanDA.

Pre-PanDA mode of operation
The BNL Daya Bay/LBNE group is using proven software components, including:
- Geant4
- Gaudi
- SPADE (a data movement mechanism developed by the IceCube collaboration, which includes automatic sinking of data into HPSS)
The main platform for configuring and steering the simulation software is a collection of Python modules; the framework thus developed is called NuWa.
Prior to PAS involvement, there was no workload management system or monitoring facility with which to direct and coordinate the workflow, either locally at BNL or across sites (PDSF, IIT, etc.). Job submission methods depended on the local interfaces to the batch systems at each site: e.g. qsub would be used at PDSF and condor_submit at BNL, as in the sketch below.
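To illustrate the fragmentation this implied, here is a minimal, hypothetical sketch of a pre-PanDA submission helper that has to branch on each site's native batch interface; the site names are from the slides, but the script paths, job names and options are illustrative assumptions.

#!/usr/bin/env python
# Hypothetical pre-PanDA submission helper: every site needs its own branch,
# with no common monitoring or bookkeeping. Paths and options are illustrative.
import subprocess

def submit(site, job_script, n_jobs):
    """Submit n_jobs copies of job_script via the site's native batch system."""
    if site == "PDSF":
        # SGE-style qsub at PDSF: one call per job (array jobs omitted for brevity)
        for i in range(n_jobs):
            subprocess.check_call(["qsub", "-N", "dayabay_%d" % i, job_script])
    elif site == "BNL":
        # Condor at BNL: write a minimal submit description and hand it to condor_submit
        with open("dayabay.sub", "w") as f:
            f.write("universe   = vanilla\n")
            f.write("executable = %s\n" % job_script)
            f.write("queue %d\n" % n_jobs)
        subprocess.check_call(["condor_submit", "dayabay.sub"])
    else:
        raise ValueError("No submission method configured for site %r" % site)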

Pandification
The LBNE/Daya Bay group operates a cluster of approximately 16 cores at BNL, included in a local Condor pool. This serves as a small-scale production and validation platform, and there is an intention to expand to facilities outside BNL. Requirements with regard to resources, while not well defined, have been qualitatively revised upward since the beginning of our collaboration, both in terms of sites and the amount of simulation.
After a series of meetings in 2009, the BNL Daya Bay/LBNE group realized the potential of using PanDA to manage the workflow of their simulation activity across the facilities. The advantages are:
- One point of entry to monitor the workflow across nodes and sites
- A single point of access to log files and other diagnostics
- Easy and reliable versioning of the task definition via hosting of the PanDA transformation in a single location (see the sketch below)
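For context, the "transformation" is essentially a script fetched by the pilot from that single hosted location and run with job-specific parameters. The sketch below is a hypothetical minimal version; the call to the NuWa steering script and its argument handling are assumptions, not the actual Daya Bay/LBNE transformation.

#!/usr/bin/env python
# Hypothetical minimal transformation script of the kind a PanDA pilot fetches
# from a single hosted location and executes with job-specific parameters.
# The invocation of the NuWa steering script is an illustrative assumption.
import subprocess
import sys

def main(argv):
    # Whatever parameters the PanDA job definition supplies (event count, seed,
    # output file name, job-option module, ...) are simply forwarded to NuWa.
    cmd = ["nuwa.py"] + argv[1:]
    rc = subprocess.call(cmd)
    # The exit code propagates back through the pilot to the PanDA server, so
    # success or failure is visible from the single monitoring entry point.
    return rc

if __name__ == "__main__":
    sys.exit(main(sys.argv))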

Pandification, cont'd (1)
Some hurdles to overcome were:
- No machine committed to the role of a Daya Bay/LBNE gatekeeper at BNL, making the standard Condor-G pilot submission from one of our established and managed hosts impossible
- Segregation of disk mounts between the RCF and ATLAS parts of the RACF facility, combined with paths being hardcoded in a few places in the configuration software
- Paths hardcoded in certain PanDA setup scripts
- No turn-key data movement mechanism in PanDA (outside of ATLAS) and no ready staging area for LBNE data at BNL, with a functioning data sink to NERSC (the final storage point) still being established
- The need to introduce a validation step into the workflow before a job is declared a success
On the wish list: near-time access to log files, which is practical with Condor-C but impractical with Condor-G; the odd behavior of some jobs also needs to be debugged.

Pandification, cont'd (2)
Solutions:
- Condor-C pilot submission was set up on one of the Daya Bay machines
- A web server was set up for serving log files:
  ◦ Apache cannot be properly installed on the Daya Bay cluster machines due to a combination of technical and political reasons
  ◦ We first used a lightweight "nullhttp" server, then switched to an "official" OSG development server with Apache on it, which has BNL-approved conduits and the proper disk mounts; this is the current solution
- Data movement is managed in the job wrapper, with data deposited into xrootd at BNL and the final movement decision made at a later stage (see the sketch below)
Additional issues: we need to keep an eye on evicted jobs that cannot be restarted due to increased image size. This has been less of an issue lately, but it can clog the pilot queue.
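A minimal sketch of what such a wrapper-level stage-out step could look like is shown below, assuming the standard xrdcp client; the redirector host, destination path and validation criterion are placeholders, not the actual endpoints or checks.

#!/usr/bin/env python
# Hypothetical wrapper-level stage-out: validate the output, copy it into the
# BNL xrootd store, and only then let the job be declared a success.
# The destination URL below is a placeholder, not the real endpoint.
import os
import subprocess
import sys

XROOTD_DEST = "root://xrootd.example.bnl.gov//dayabay/sim/"   # placeholder

def validate(local_file):
    """Minimal validation before the job is declared a success (assumed check)."""
    return os.path.isfile(local_file) and os.path.getsize(local_file) > 0

def stage_out(local_file):
    """Copy one output file to xrootd; raise if the transfer fails."""
    dest = XROOTD_DEST + os.path.basename(local_file)
    subprocess.check_call(["xrdcp", "-f", local_file, dest])

if __name__ == "__main__":
    output = sys.argv[1]
    if not validate(output):
        sys.exit(1)      # fail the job; nothing is staged out
    stage_out(output)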

Current status of the Daya Bay cluster
- Pilot submission works
- The web server for serving log files works
- The size of the log files produced by Daya Bay jobs, previously large, was reduced on our recommendation
- The pilot code was modified by Jose to purge files already staged out, as well as leftover Python code in the working directory, to conserve disk space (see the sketch below)
- Current "rolling" disk usage is about 1 GB, which our Daya Bay collaborators recognize as acceptable
- We stand ready to commence PanDA-managed production for Daya Bay on the BNL cluster, pending their finalization of the job configuration, as they reported in the meeting on 11/10/2010
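The sketch below gives a rough idea of the kind of cleanup involved; it is not the actual pilot change, and the file patterns and the notion of a staged-out list are assumptions for illustration.

#!/usr/bin/env python
# Hypothetical post-job cleanup of the kind described above: remove output files
# that were already staged out plus leftover Python code in the working
# directory, keeping the rolling disk footprint small. Patterns are illustrative.
import glob
import os

def clean_workdir(workdir, staged_out):
    """Delete files that have already been staged out, plus leftover .py/.pyc files."""
    leftovers = glob.glob(os.path.join(workdir, "*.py")) + \
                glob.glob(os.path.join(workdir, "*.pyc"))
    for path in set(staged_out) | set(leftovers):
        try:
            os.remove(path)
        except OSError:
            pass    # already gone or not removable; just move on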

Expansion to other sites
The Daya Bay/LBNE collaboration has done a preliminary setup at NERSC/PDSF. Points of interest:
- NERSC HPSS will be the final storage point for all data produced everywhere (including at BNL)
- Correct versions of ROOT, Python and other important components are provided by means of a "private" LBNE installation, without reliance on facility-wide installs
- The SPADE data mover is used to scoop up data from "dropboxes" and marshal it into HPSS (good!); see the sketch below
- Management of groups and accounts at NERSC differs from what are now the "mainstream" OSG authentication/authorization mechanisms: Virtual Organizations are not supported; instead, users are assigned to Unix groups and provide their DN in the NERSC Information System (NIM) as a way to authenticate with a plain Grid certificate (no VOMS/GUMS, etc.)
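As a rough illustration of the dropbox handoff, here is a hypothetical sketch; the dropbox path and especially the semaphore/marker convention are assumptions for illustration, since the real SPADE configuration defines both.

#!/usr/bin/env python
# Hypothetical handoff of a finished output file to SPADE: the mover watches a
# "dropbox" directory and ships whatever it finds to NERSC HPSS. The dropbox
# path and the marker-file convention below are assumptions, not SPADE specifics.
import os
import shutil

DROPBOX = "/data/spade/dropbox"      # placeholder path

def hand_to_spade(local_file):
    """Move a file into the SPADE dropbox and mark it complete."""
    dest = os.path.join(DROPBOX, os.path.basename(local_file))
    shutil.move(local_file, dest)
    # Write the data file first and a marker last, so the mover never picks up
    # a partially written file (convention assumed, not taken from SPADE docs).
    open(dest + ".sem", "w").close()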

Expansion to other sites, cont'd
PDSF:
- Access issues (due to this non-standard setup) have been resolved in consultation with NERSC, and Condor-G submission has been tested; we are ready to commence pilot submission (see the sketch below)
- SPADE will be set up shortly
IIT:
- In a few weeks, the Daya Bay/LBNE group expects a new cluster at the Illinois Institute of Technology (IIT) to come online. The SPADE system will be deployed there to transport data to NERSC HPSS (no special code needed in the wrapper or the pilot)
- We are in contact with the personnel involved in the system setup
- We will need to secure installation of at least part of the OSG software stack in order to implement Condor-G submission of pilots to that facility; discussions are under way
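For reference, Condor-G pilot submission to a remote site boils down to a grid-universe submit description pointing at the site's gatekeeper. The sketch below is hypothetical; the gatekeeper contact string, jobmanager type and pilot script path are placeholders rather than the actual PDSF or IIT endpoints.

#!/usr/bin/env python
# Hypothetical Condor-G pilot submission: generate a grid-universe submit
# description aimed at a site's Globus gatekeeper and hand it to condor_submit.
# The contact string and pilot path are placeholders, not real endpoints.
import subprocess

GATEKEEPER = "osg-gk.example.edu/jobmanager-condor"   # placeholder contact string
PILOT = "/path/to/pilot_wrapper.sh"                   # placeholder pilot script

def submit_pilots(n_pilots):
    """Write a Condor-G submit file for n_pilots pilots and submit it."""
    with open("pilots.sub", "w") as f:
        f.write("universe      = grid\n")
        f.write("grid_resource = gt2 %s\n" % GATEKEEPER)
        f.write("executable    = %s\n" % PILOT)
        f.write("output        = pilot_$(Cluster).$(Process).out\n")
        f.write("error         = pilot_$(Cluster).$(Process).err\n")
        f.write("log           = pilots.log\n")
        f.write("queue %d\n" % n_pilots)
    subprocess.check_call(["condor_submit", "pilots.sub"])

if __name__ == "__main__":
    submit_pilots(20)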

Conclusion
- We do not see any outstanding problems with starting production at BNL once the Daya Bay team finalizes the task configuration
- Pilot submission to PDSF and testing will commence in a few days
- We will negotiate with IIT regarding the Condor software stack installation there when their cluster comes online in a few weeks