
Grid job management: CMS and ALICE. Artem Trunov, CMS and ALICE support

CMS and ALICE site roles
Tier0
– Initial reconstruction
– Archive RAW + REC from the first reconstruction
– Analysis, detector studies, etc.
Tier1
– Archive a fraction of RAW (2nd copy)
– Subsequent reconstruction
– "Skimming" (off AOD)
– Archiving of simulated data produced at T2s
– Serve AOD data to other T1s and T2s
– Analysis
Tier2
– Simulation production
– Analysis

CMS Tools
Bookkeeping tools
– Description of data: datasets, provenance, production information, data location
– DBS/DLS (Data Bookkeeping Service / Data Location Service)
Production tools
– Manage scheduled or personal production of MC, reconstruction, skimming
– ProdAgent
Transfer tools
– All scheduled inter-site transfers
– PhEDEx
User tools
– For the end user to submit analysis jobs
– CRAB
(Diagram: DBS/DLS, ProdAgent, PhEDEx with its web interface, and CRAB, built on Oracle, LFC, the UI, FTS and srmcp.)

Bookkeeping Tools
– Centrally set-up database.
– Has different scopes ("local", "global") to differentiate official data from personal/group data.
– Uses an Oracle server at CERN; the Data Location Service uses either Oracle or LFC.
– A very complex system.

Production Agent (ProdAgent)
– Manages all production jobs at all sites.
– A set of daemons running on top of an LCG UI, operated by production operators at their home institutions.
  – Usually a dedicated machine, due to the low submission rate but large CPU consumption per submission.
  – Operators' certificates are used to submit jobs.
  – Operators have the VOMS role cmsprod.
– Instances of it are used by production operators: one for OSG plus a few for EGEE sites.
– Merges production job output to create reasonably sized files.
– Registers new data in PhEDEx for transfer to the final destination (MC data produced at T2s goes to T1s).
– All production information goes to DBS.
– Users make MC production requests via the web interface and can also track their requests.

PhEDEx
– The CMS data transfer tool; the only tool required to run at all sites.
  – Work is in progress to ease this requirement: with SRM, fully remote operation is possible. However, local support is still needed to debug problems.
– A set of site-customizable agents that perform various transfer-related tasks:
  – download files to the site
  – produce SURLs of local files for other sites to download
  – follow the migration of files to the MSS
  – stage files from the MSS
  – remove local files
– Uses a 'pull' model of transfers, i.e. transfers are initiated at the destination site by the PhEDEx instance running there.
– Uses Oracle at CERN to keep its state information.
– Can use FTS or srmcp to perform transfers (or other means, such as direct gridftp), but CMS requires SRM at sites.
– One of the oldest and most stable software components of CMS.
  – Secret of success: development is carried out by CERN and site people who are or were involved in daily operations.
– Runs with someone's personal proxy certificate.

CRAB
– The main CMS user tool to submit analysis jobs.
– The user specifies a dataset to be analyzed, the application and the job configuration.
– CRAB then:
  – locates the data with DLS
  – splits the jobs
  – submits the jobs to the Grid
  – tracks the jobs
  – collects the results
– Can be configured to upload the job output to some Storage Element with srm or gridftp.
– A minimal configuration sketch is shown below.
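
For illustration, a minimal sketch of a CRAB2-era configuration file; the dataset path, parameter-set name and storage element below are hypothetical placeholders, and the exact parameter set varies between CRAB versions:

    [CRAB]
    # job type and Grid scheduler (here the gLite WMS)
    jobtype   = cmssw
    scheduler = glite

    [CMSSW]
    # hypothetical dataset known to DBS/DLS, and the user's CMSSW configuration
    datasetpath            = /SomePrimaryDataset/SomeProcessing/AOD
    pset                   = my_analysis_cfg.py
    total_number_of_events = 100000
    events_per_job         = 10000

    [USER]
    # copy the output to a (hypothetical) storage element instead of returning it in the sandbox
    return_data     = 0
    copy_data       = 1
    storage_element = se.example-site.org

With such a file the typical cycle is roughly crab -create, crab -submit, crab -status and crab -getoutput, with CRAB handling data location, job splitting and Grid submission behind the scenes.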

"Last mile" tools: working with ntuples, custom skims, etc.
– Not well developed! Users are for now on their own.
– CRAB does not scale well down to end-user analysis: too rigid, requires maintaining the bookkeeping infrastructure services, etc.
  – Not convenient for physicists.
– User questionnaire on CRAB (D-CMS group):
  – 6 use it: 3 like it, 2 report problems
  – 3 don't use it: 1 has never used the Grid, 1 avoids the Grid and CRAB because of problems, 1 needs to run on unofficial data
– PROOF is of interest to CMS. There are groups setting up PROOF at their institutions:
  – access to CMS data structures via a loadable library
  – working on a plugin for LFN lookup in DBS/DLS and PFN conversion

Other CMS tools
Software installation service
– In CMS, software is installed in the VO area via grid jobs by software managers who have the VOMS role cmssgm.
– cmssgm also runs the software monitoring service.
– SW installation uses apt, rpm and scram.
– Base releases are required to be installed at sites to run jobs.
– Users bring only their own compiled code, on top of the base release, via the grid sandbox.
Dashboard
– Collects monitoring information (mostly about jobs) and presents it through a web interface.
– There is an effort to make use of the Dashboard for all CMS monitoring.
– It is now being adopted by some other experiments.
– The Dashboard relies on various grid components to get the state of a job, and thus it is usually inaccurate by ~15%.
ASAP
– Automatic submission of analysis jobs using a web interface.
– Seems not to be used anymore. Used to give users the option to use their own software release in the form of a DAR (distribution archive).

ALICE services - AliEn
– "Alice Environment": an independent implementation of Grid middleware.
– Central services:
  – Job Queue
  – File Catalogue
  – Authorization
– Site services include:
  – Computing Element, with LCG/gLite support
  – Storage Element (xrootd protocol)
  – PackMan (software installation)
  – File Transfer Daemon, with FTS support
  – ClusterMonitor
  – MonALISA agent
– Services run at each site on a standard VO box, optimized for low maintenance and unattended operation.
  – Central and site services are managed by a set of experts, including developers and regional responsible experts.

ALICE services at a site (diagram): the AliEn central services; a VO box running the AliEn site services (configuration, monitoring, reports); job agents submitted to the LCG CE and followed up; worker nodes getting jobs; software installed into $VO_ALICE_SW_DIR; an xrootd cluster; operator login via gsissh.

Job management
– Job agent model:
  – The AliEn CE (a site service) submits a job agent ('pilot job') to the site's queue via the local batch interface or the LCG interface.
  – Once such a job agent starts on a WN, it contacts the central database (the Job Queue, a central service) and picks up real tasks in the form of AliEn JDL (a sketch is shown below).
– Software needed by jobs (as specified in the AliEn JDL) is installed via the PackMan site service.
– User input files are downloaded from remote xrootd servers to the local scratch area.
– Disadvantage of the agent model in AliEn: more agents are submitted than there are real jobs, and many job agents exit without a real task after a few minutes.
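
For illustration, a minimal sketch of what an AliEn JDL might look like; the executable, package version and catalogue paths are hypothetical placeholders, and the field set varies with the AliEn version:

    Executable = "mySimulation.sh";
    Packages   = { "VO_ALICE@AliRoot::v4-05-Rev-03" };
    InputFile  = { "LF:/alice/cern.ch/user/a/auser/macros/sim.C" };
    OutputFile = { "result.root" };
    TTL        = 36000;

The Packages field is what PackMan resolves and installs on the worker node, while the LF: entries are logical file names resolved through the AliEn file catalogue.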

User tools
– A full-featured user login shell, the AliEn shell (aliensh), is in fact the only official way to submit jobs in ALICE.
  – The same AliEn tools are used for production and by users.
  – A short session sketch is shown below.
– Data storage:
  – A global, logical AliEn namespace for production data and home directories, accessible from the shell.
  – The backend is xrootd.
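
As an illustration, a short sketch of such a session; the catalogue paths and JDL file name are hypothetical, and the exact prompt and options depend on the aliensh version:

    # inside aliensh, after obtaining a Grid proxy/token
    ls /alice/cern.ch/user/a/auser
    submit sim.jdl        # send the JDL above to the central Job Queue
    ps                    # follow the status of the user's jobs
    cp /alice/cern.ch/user/a/auser/result.root file:/tmp/result.root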

PROOF
– The main end-user analysis tool in ALICE.
– PROOF is part of ROOT and provides a mechanism to parallelize the analysis of ROOT trees and ntuples (a minimal usage sketch is shown below).
– Requires high-speed access to data, which is not often possible in current setups because of WN and server bandwidth limitations:
  – WN: 1 Gb/s per rack -> ~1.5 MB/s per CPU core (4 cores × 24 WNs per rack).
  – Servers: usually have 1 Gb/s, but the disk subsystem may perform better.
– Deployment model endorsed by the PROOF team: large dedicated clusters with locally pre-loaded data.
  – Not preferable at the moment at the large, busy centres.
– Integration with AliEn via its own TFileAdapter: a user gives an LFN, which is automatically converted to a PFN by the adapter.
– So far only at CERN (CAF). At CC-IN2P3, work is in progress (see next slide).
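
For illustration, a minimal sketch of how an end user could drive a PROOF session from a ROOT macro; the master host, file URLs, tree name and selector are hypothetical placeholders:

    // proof_sketch.C -- minimal PROOF usage sketch (ROOT/C++)
    void proof_sketch()
    {
       // Connect to a PROOF master (e.g. a CAF-like cluster); the host name is a placeholder
       TProof::Open("proofmaster.example.org");

       // Build a chain of ROOT files containing the tree to analyse (hypothetical xrootd URLs)
       TChain *chain = new TChain("esdTree");
       chain->Add("root://xrd.example.org//data/run1/AliESDs.root");
       chain->Add("root://xrd.example.org//data/run2/AliESDs.root");

       // Attach the chain to the PROOF session and run a user TSelector in parallel on the workers
       chain->SetProof();
       chain->Process("MySelector.C+");   // compiled on the fly with ACLiC
    }

The same macro runs unchanged in a local ROOT session if the SetProof() call is omitted, which is part of PROOF's appeal for interactive analysis.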

PROOF at CC-IN2P3
(Diagram: local and remote user sessions, GSI authentication via the VO box, PROOF master, PROOF workers on the xrootd analysis pool, HPSS mass storage.)
PROOF agents are run on the xrootd cluster and take advantage of the following:
– free CPU cycles, thanks to the low overhead of xrootd;
– a zero-cost solution: no new hardware involved;
– direct access to data on disk, not using bandwidth; a 1 Gb/s node interconnect when inter-server access is required;
– transparent access to the full data set stored at our T1, for all experiments, via the xrootd-dCache link deployed on this xrootd cluster and dynamic staging;
– management of the infrastructure fits conveniently into existing xrootd practices;
– this setup is closer to a possible 2008 PROOF solution because of the 1 Gb/s node connections and the large "scratch" space;
– this is the kind of setup that T2 sites may also consider deploying.

See also Thursday afternoon session on analysis centers.