CHEP2004 27-Sep-2004 1 Andrey PHENIX Job Submission/Monitoring in transition to the Grid Infrastructure Andrey Y. Shevel, Barbara Jacak,

Presentation transcript:

PHENIX Job Submission/Monitoring in transition to the Grid Infrastructure
Andrey Y. Shevel, Barbara Jacak, Roy Lacey, Dave Morrison, Michael Reuter, Irina Sourikova, Timothy Thomas, Alex Withers

Brief info on PHENIX
+ Large, widely spread collaboration (same scale as CDF and D0): more than 450 collaborators, 12 nations, 57 institutions, 11 U.S. universities; currently in its fourth year of data taking.
+ ~250 TB/yr of raw data.
+ ~230 TB/yr of reconstructed output.
+ ~370 TB/yr of microDST + nanoDST.
+ In total, about 850 TB or more of new data per year.
+ Primary event reconstruction occurs at BNL RCF (RHIC Computing Facility).
+ A partial copy of the raw data is at CC-J (Computing Center in Japan), and part of the DST output is at CC-F (France).

PHENIX Grid and Grid related activity
+ In addition to our current focus on job monitoring and submission, we are involved in:
  - testing a dCache prototype for file management at RCF;
  - moving most of the PHENIX databases to Postgres (separate presentation);
  - participating in the Grid3 testbed ( ).

PHENIX Grid
+ We expect about 10 clusters in total in the next few years.
[Diagram labels: job submission; data moving; RIKEN CCJ (Japan); University of New Mexico; Cluster RAM; Brookhaven National Lab; Vanderbilt University; PNPI (Russia); Stony Brook; IN2P3 (France).]

PHENIX multi cluster conditions
+ Computing clusters differ in:
  - computing power;
  - batch job schedulers;
  - details of administrative rules.
+ Computing clusters have in common:
  - Linux OS (although some clusters run different Linux versions);
  - gateways with the Globus Toolkit on most clusters;
  - Grid status board ( ).

Remote cluster environment
+ Max number of computing clusters is about [value missing in transcript].
+ Max number of Grid jobs submitted at the same time is about 10**4 or less.
+ The amount of data to be transferred (between BNL and a remote cluster) for physics analysis varies from about 2 TB/quarter to 5 TB/week.
+ We use PHENIX file catalogs:
  - centralized file catalog ( );
  - cluster file catalogs (for example, SUNYSB uses a slightly redesigned version of MAGDA).
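To put the quoted transfer volumes in network terms, the following is a minimal sketch that converts a volume per period into a sustained bandwidth. It assumes decimal terabytes (10**12 bytes), a continuous transfer with no protocol overhead, and a quarter of 91 days; none of these assumptions come from the slides, and the function name is hypothetical.

```python
def sustained_mbit_per_s(terabytes, days):
    """Average bandwidth needed to move `terabytes` in `days` of continuous transfer."""
    bits = terabytes * 1e12 * 8   # assumption: 1 TB = 10**12 bytes
    seconds = days * 86400        # 86400 seconds per day
    return bits / seconds / 1e6   # result in Mbit/s


if __name__ == "__main__":
    # Upper end quoted on the slide: 5 TB per week.
    print(f"5 TB/week    ~ {sustained_mbit_per_s(5, 7):.1f} Mbit/s")
    # Lower end quoted on the slide: 2 TB per quarter (assumed 91 days).
    print(f"2 TB/quarter ~ {sustained_mbit_per_s(2, 91):.1f} Mbit/s")
```

Under these assumptions the range corresponds to roughly 2 Mbit/s at the low end and roughly 66 Mbit/s at the high end, averaged over the whole period; real transfers are bursty and would need more headroom.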

Exporting the application software to run on remote clusters
+ Porting PHENIX software in binary form is presumably the most common porting method in the PHENIX Grid:
  - copying over AFS to mirror the PHENIX directory structure on the remote cluster (by cron job);
  - preparing PACMAN packages for a specific class of tasks (e.g. a specific simulation).

User job scripts
+ An important issue in a multi cluster environment is making sure that all user scripts run on all available clusters with the same results.
+ This can be done in at least two ways:
  - eliminate all cluster specifics from the existing user job scripts and make them neutral to the cluster environment (to do that we introduced pre-assigned environment variables such as GRIDVM_JOB_SUBMIT, GRIDVM_BATCH_SYSTEM, etc.);
  - change the processing scheme, for example by starting to use the STAR scheduler [SUMS] ( ) with the same environment variables.
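A minimal sketch of what a cluster-neutral script could look like follows. The slide names the variables GRIDVM_JOB_SUBMIT and GRIDVM_BATCH_SYSTEM but does not define their contents; here it is assumed that GRIDVM_JOB_SUBMIT holds the local submit command line and GRIDVM_BATCH_SYSTEM holds the name of the local batch system. The function name and the example values (`my_submit`, `example_batch`) are hypothetical.

```python
import os
import shlex


def build_submit_command(job_script, env=None):
    """Build the submit command for job_script from pre-assigned variables.

    Assumption (not stated on the slide): GRIDVM_JOB_SUBMIT contains the
    cluster's submit command, possibly with options.
    """
    env = os.environ if env is None else env
    submit = env.get("GRIDVM_JOB_SUBMIT")
    if not submit:
        raise RuntimeError("GRIDVM_JOB_SUBMIT is not set on this cluster")
    # The user script never names a concrete batch system itself.
    return shlex.split(submit) + [job_script]


if __name__ == "__main__":
    # Hypothetical values that a cluster administrator would pre-assign.
    fake_env = {
        "GRIDVM_JOB_SUBMIT": "my_submit --queue short",
        "GRIDVM_BATCH_SYSTEM": "example_batch",
    }
    print(build_submit_command("run_analysis.sh", fake_env))
```

The design point is that all cluster-specific detail lives in the environment, so the same user script can be copied unchanged to every cluster.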

The job submission scenario at remote Grid cluster
+ The user needs just a qualified computing cluster, i.e. one with enough available disk space and the specific versions of the compiler and related software.
+ Before job production:
  - copy/replicate the major data sets (physics data) to the remote cluster;
  - copy/replicate the minor data sets (scripts, parameters, etc.) to the remote cluster;
  - verify with a set of test tools (scripts) that the remote cluster with the required environment is functioning properly.
+ Start the master job (script), which submits many sub-jobs to the default batch system of the remote cluster. The master job might also deploy the software components required for the sub-jobs.
+ During production (job run) and after all jobs have finished (after job production):
  - watch the jobs with the monitoring system;
  - copy the result data from the remote cluster to the target destination (desktop or RCF).
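The master job step above can be sketched as follows. This is only an illustration under stated assumptions: the slides do not say how work is divided, so the sketch splits a list of input files into fixed-size chunks, one chunk per sub-job, and it only builds the command lines instead of running them. All function names are hypothetical, and `echo` stands in for the real submit command of the cluster.

```python
import shlex


def split_into_subjobs(input_files, files_per_job):
    """Split the input file list into consecutive chunks, one per sub-job."""
    if files_per_job < 1:
        raise ValueError("files_per_job must be at least 1")
    return [
        input_files[i:i + files_per_job]
        for i in range(0, len(input_files), files_per_job)
    ]


def master_job(input_files, files_per_job, job_script, submit_command):
    """Return one submit command line (as an argument list) per sub-job.

    Dry run only: to actually submit, pass each list to subprocess.run().
    """
    prefix = shlex.split(submit_command)
    return [
        prefix + [job_script] + chunk
        for chunk in split_into_subjobs(input_files, files_per_job)
    ]


if __name__ == "__main__":
    files = ["run_001.dat", "run_002.dat", "run_003.dat"]
    for cmd in master_job(files, 2, "reco.sh", "echo"):
        print(" ".join(cmd))
```

Keeping the submit command as a parameter means the same master script can be pointed at whatever batch system is the default on the remote cluster.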

The requirements for job monitoring in multi cluster environment
+ What is job monitoring?
+ Keeping track of the submitted jobs:
  - whether the jobs have finished;
  - on which cluster the jobs are running;
  - where the jobs ran in the past (one day, one week, one month ago).
+ The information about the jobs must therefore be written to a database and kept there. The same database might be used for job control (cancel jobs, resubmit jobs, and other job control operations in a multi cluster environment).
+ We developed the PHENIX job monitoring tool on the basis of BOSS ( ).
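The three tracking questions above map naturally onto a small job table. The sketch below uses SQLite from the Python standard library purely for illustration; it is not the BOSS schema, and the table layout, status values, and function names are all assumptions.

```python
import sqlite3
import time


def open_job_db(path=":memory:"):
    """Open (or create) a job database with a single hypothetical jobs table."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               job_id       TEXT PRIMARY KEY,
               cluster      TEXT NOT NULL,
               status       TEXT NOT NULL,
               submitted_at REAL NOT NULL,
               finished_at  REAL
           )"""
    )
    return db


def record_submission(db, job_id, cluster, now=None):
    """Register a newly submitted job and the cluster it was sent to."""
    now = time.time() if now is None else now
    db.execute(
        "INSERT INTO jobs VALUES (?, ?, 'submitted', ?, NULL)",
        (job_id, cluster, now),
    )
    db.commit()


def record_finish(db, job_id, now=None):
    """Mark a job as done and store its finish time."""
    now = time.time() if now is None else now
    db.execute(
        "UPDATE jobs SET status = 'done', finished_at = ? WHERE job_id = ?",
        (now, job_id),
    )
    db.commit()


def unfinished_jobs(db):
    """Which jobs have not finished yet?"""
    rows = db.execute(
        "SELECT job_id FROM jobs WHERE status != 'done' ORDER BY job_id"
    )
    return [row[0] for row in rows]


def jobs_on_cluster_since(db, cluster, since):
    """Which jobs were submitted to `cluster` at or after timestamp `since`?"""
    rows = db.execute(
        "SELECT job_id FROM jobs WHERE cluster = ? AND submitted_at >= ? "
        "ORDER BY job_id",
        (cluster, since),
    )
    return [row[0] for row in rows]
```

A "one week ago" query is then `jobs_on_cluster_since(db, cluster, time.time() - 7 * 86400)`. Because rows are kept after a job finishes, the same table could also back job control operations such as resubmission.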

PHENIX Grid job submission/monitoring
[Diagram labels: User Jobs (jdl texts, job specs, etc.); Globus job submission tools; STAR scheduler; Cataloging engine; Job monitoring; GT 2.4.latest (or VDT).]

Basic job flow
[Diagram labels: BOSS DB; BOSS; Local job Scheduler; Computing node n; Computing node m; BODE (Web interface); To wrap the job; STAR scheduler; CLI interface; Cluster X; To other clusters.]

Example for job submission panel
- Parameter input is needed for the event generator, detector response package, reconstruction, etc.
- A GUI was created to provide easy input specification and to hide JDL file creation.
- The STAR scheduler is used for job submission to the Grid, and Pacman for job software deployment.

Continuation of the example (BOSS/BODE view of jobs submitted by STAR scheduler)

Challenges for PHENIX Grid
+ Admin service (where can users complain if something goes wrong with their Grid jobs on some cluster?).
+ More sophisticated job control in a multi cluster environment; job accounting.
+ Completing the implementation of run-time installation technology for remote clusters.
+ More checking tools to make sure that most things in the multi cluster environment are running well, i.e. automating the answer to the question "does account A on cluster N provide a PHENIX qualified environment?" and checking it every hour or so.
+ A portal to integrate all PHENIX Grid tools in one user window.
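As an illustration of the checking tools mentioned above, here is a minimal sketch of an automated qualification check. The job submission scenario describes a qualified cluster as one with enough available disk space and the required compiler and related software, so the sketch tests only those two things for the account it runs under. The function name, tool names, and thresholds are hypothetical; a real check would also verify software versions.

```python
import shutil


def qualification_problems(required_tools, min_free_bytes, workdir="."):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    # Is every required executable reachable on PATH for this account?
    for tool in required_tools:
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    # Is there enough free disk space in the working area?
    free = shutil.disk_usage(workdir).free
    if free < min_free_bytes:
        problems.append(
            f"low disk space: {free} bytes free, {min_free_bytes} required"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical requirements: a C++ compiler and 1 GB of free space.
    issues = qualification_problems(["g++"], 10**9)
    print("qualified" if not issues else "\n".join(issues))
```

Running such a script from cron on each cluster would give the hourly check mentioned on the slide, and its output could be written to the same database that holds the job monitoring records.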

Summary
+ The multi cluster environment is our reality, and we need more user-friendly tools for the typical user to reduce the cost of integrating cluster power.
+ In our conditions, the best way to do that is to use already developed subsystems as bricks to build up a robust PHENIX Grid computing environment. The most effective way to do that is to cooperate as much as possible with other BNL collaborations (STAR is a good example).
+ Serious attention must be paid to automatic installation of the existing physics software.