AliEn job agents
P. Saiz (IT-ES), Experiment Support, CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it

Pablo Saiz, Workload management TEG workshop, 03 Nov 11

Summary
What does an AliEn Job Agent do?
AliEn framework
TaskQueue
JobAgent
Challenges
Summary

AliEn
All components to create a GRID
File Catalogue
  – UNIX-like file system
  – Mapping to physical files
  – Metadata information
  – SE discovery
Transfer Model
  – With different plugins
TaskQueue
  – Job Agent & pull model
  – Automatic installation of software packages
  – Simulation, reconstruction, analysis...
Developed by ALICE
  – Used also by PANDA and CBM (FAIR)

AliEn TaskQueue
Distribution of jobs among CE
With priorities and quotas
According to job requirements
  – InputData, memory, partition...
Installation of software packages
Multiple backends
  – LSF, PBS, CREAMCE, CONDOR, FORK
Scales up to at least 40k concurrent jobs (150k per day)

Job execution
[Diagram. Labels: Job Manager, TaskQueue, Job Broker, File catalogue (LFN, GUID, metadata), JOB, CE; Sites A, B and C, each with CM, Packman and MonALISA; xrootd]

Job optimizers
Splitting
  – By file, directory, storage, production
Priority
  – Based on user quotas
Automatic resubmission
  – Depending on type of error and thresholds
Merging

Job Broker
Multiple instances
  – And let the single database deal with concurrent processes
Classad matching between site and waiting jobs, ordered by priority
Requirements on:
  – Data location, site/queue name, TTL, disk, memory, partition, user, (available software)...
Extract most common fields
  – Reduces the number of match operations
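The matching step on this slide can be illustrated with a minimal Python sketch. It is not the AliEn broker: real ClassAd matching evaluates expressions on both sides, while this sketch only compares a dictionary of job requirements against a dictionary advertised by the site. The names `Job`, `matches`, `pick_job` and all field names (`memory_mb`, `packages`, ...) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    priority: int
    # Hypothetical requirement fields, e.g. {"memory_mb": 1000, "packages": "AliRoot"}
    requirements: dict = field(default_factory=dict)


def matches(job, site):
    """Return True when the site advert satisfies every job requirement.

    Simplified stand-in for ClassAd matching (an assumption of this sketch):
    numbers mean "site offers at least this much", collections mean
    "site offers this item", anything else must be equal.
    """
    for key, wanted in job.requirements.items():
        offered = site.get(key)
        if offered is None:
            return False
        if isinstance(offered, (set, frozenset, list, tuple)):
            if wanted not in offered:
                return False
        elif isinstance(wanted, (int, float)):
            if offered < wanted:
                return False
        elif offered != wanted:
            return False
    return True


def pick_job(waiting, site):
    """Walk the waiting jobs from highest to lowest priority, return the first match."""
    for job in sorted(waiting, key=lambda j: j.priority, reverse=True):
        if matches(job, site):
            return job
    return None
```

Sorting by priority before matching means the first hit is also the best one, so the loop can stop early; the slide's "extract most common fields" would correspond to filtering on a few indexed fields before running this loop at all.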

AliEn CE
Deployed on the vobox of each site
Checks the number of JobAgents running/queued
Asks Broker for things to do
  – If match, send agents to the batch system: CREAMCE, LSF, PBS, CONDOR...
Can install software packages
  – And the JobAgent can as well
Can submit to several CREAMCE
Possible improvements:
  – Bulk submission
  – Debug

AliEn Job Agent
Pilot running on the worker node
Asks for jobs to execute
Prepares software packages, input files
Executes and monitors payload
Uploads results
And asks for another job
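The pull cycle listed on this slide can be sketched as a simple loop. This is an illustration only, not the AliEn JobAgent code: the function `run_agent`, the callables passed to it and the job dictionary keys (`id`, `packages`, `inputs`, `command`) are all hypothetical names chosen for the example.

```python
def run_agent(ask_for_job, install_packages, fetch_inputs, execute, upload,
              max_jobs=None):
    """Pull jobs until the broker has nothing left (or max_jobs is reached).

    Every argument is a caller-supplied callable standing in for a real
    service; none of these names come from AliEn itself.
    """
    done = []
    while max_jobs is None or len(done) < max_jobs:
        job = ask_for_job()                # pull model: the agent asks, nobody pushes
        if job is None:
            break                          # nothing matched, the pilot exits
        install_packages(job["packages"])  # prepare software packages
        fetch_inputs(job["inputs"])        # prepare input files
        status = execute(job["command"])   # run (and, in reality, monitor) the payload
        upload(job["id"], status)          # upload results
        done.append((job["id"], status))   # then loop and ask for another job
    return done
```

Because the agent only asks for work once it is already running on a worker node, a job is never assigned to a slot that turns out to be broken or missing software.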

Torrent installation
Already deployed in multiple sites:
  – CERN, RAL, CCIN2P3, Aalborg, UIB, UiO, KIAE, SUT...
No need for shared file system
Installing AliEn & all software components (300 MB)
Aria torrent client
Clean up after the job execution
Challenge
  – Deploy independent trackers per site?

Torrent technology
[Diagram. Labels: alitorrent.cern.ch, Site A, Site B]

JobAgent monitoring
Send heartbeat
Monitor TTL, space usage and memory consumption
  – If job misbehaves, stop it and report it
[Diagram. Job states: RUNNING, ZOMBIE, EXPIRED; transition labels: "4 hours", "If heartbeat returns"]
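The heartbeat states on this slide can be read as a small state machine, sketched below in Python. The transcript does not say which transition the "4 hours" label belongs to, so both timeouts are plain parameters here. Treating EXPIRED as final and letting a late heartbeat move ZOMBIE back to RUNNING are assumptions of this sketch, and the class name `HeartbeatTracker` is hypothetical.

```python
class HeartbeatTracker:
    """Track one job's state from the time of its last heartbeat.

    zombie_after / expired_after are seconds of silence; their real values
    in AliEn are not given in the slide and are assumptions of the caller.
    """

    def __init__(self, zombie_after, expired_after, now=0.0):
        self.zombie_after = zombie_after
        self.expired_after = expired_after
        self.last_seen = now
        self.state = "RUNNING"

    def heartbeat(self, now):
        # A returning heartbeat revives a ZOMBIE; EXPIRED is treated as final.
        if self.state != "EXPIRED":
            self.last_seen = now
            self.state = "RUNNING"
        return self.state

    def check(self, now):
        # Called periodically by the central side to age silent jobs.
        silent = now - self.last_seen
        if self.state == "EXPIRED" or silent >= self.expired_after:
            self.state = "EXPIRED"
        elif silent >= self.zombie_after:
            self.state = "ZOMBIE"
        return self.state
```

The intermediate ZOMBIE state gives a job with a temporary network problem a chance to recover before its slot and quota are written off.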

[Diagram, current scheme. Labels: AliEn JobAgent; fork child JobAgent; system process; check child (alive / dead); monitor job: rsz, vsz, disk space (Ok / Not Ok); Kills Job; Finish; report & repeat]

$err=system(user-command);
will become
$err=system(ulimit -S -v FASTKILL_MEMORY -c 0; user-command);

[Diagram, proposed scheme. Labels: AliEn JobAgent; Child; user job; system command; monitors job; Memory check; exceeds allocation: Kills job, report & finish; job runs ok: Register output, Finish]
Jeff Porter, LBNL
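The effect of prefixing the payload with `ulimit -S -v ... -c 0` can be reproduced in Python with the standard `resource` module, as a rough sketch. It assumes a Unix system where `RLIMIT_AS` is enforced (Linux); the function name `run_with_memory_cap` and its parameters are hypothetical, and the cap is given in kilobytes to mirror `ulimit -v`.

```python
import resource
import subprocess
import sys


def run_with_memory_cap(argv, max_kb):
    """Run argv with a soft virtual-memory limit and core dumps disabled.

    Mirrors `ulimit -S -v max_kb -c 0; user-command`: only the soft limits
    are changed, the hard limits are left as they were.
    """
    def set_limits():
        # Executed in the child between fork() and exec().
        soft = max_kb * 1024
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)  # a soft limit may not exceed the hard limit
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
        _, core_hard = resource.getrlimit(resource.RLIMIT_CORE)
        resource.setrlimit(resource.RLIMIT_CORE, (0, core_hard))

    return subprocess.run(argv, preexec_fn=set_limits).returncode
```

With the limit in place the kernel refuses the allocation that crosses the cap, so the payload fails quickly by itself instead of relying on the agent's next monitoring pass to notice and kill it.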

Writing the output
[Diagram. Labels: Client, Authen, File Catalogue, SERank, Optimizer]
Write:
  – Client: "I'm in 'Madrid'. Give me SEs!"
  – Answer: "Try: CCIN2P3, CNAF, Kosice"
Similar process for read (limited to SE having the file)
Can select number of SE, QoS, particular user, avoid SE...
DEFAULT ARGUMENTS SHOULD BE USED WHENEVER POSSIBLE
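The selection options mentioned on this slide (number of SEs, QoS, SEs to avoid) can be illustrated with a short Python sketch. It assumes that a ranked list of storage elements for the client's site already exists, for example as produced by the SERank optimizer; how that ranking is computed is not covered. The function `select_ses`, the `qos` values and the example data are hypothetical.

```python
def select_ses(ranked, count=2, qos=None, avoid=()):
    """Pick the first `count` SEs from a ranked list, honouring the filters.

    ranked: list of {"name": ..., "qos": ...} dicts, best first. The ranking
    itself is assumed to have been computed elsewhere.
    """
    chosen = []
    for se in ranked:
        if se["name"] in avoid:
            continue                      # user asked to skip this SE
        if qos is not None and se["qos"] != qos:
            continue                      # wrong quality of service
        chosen.append(se["name"])
        if len(chosen) == count:
            break
    return chosen


# Hypothetical ranking for a client in Madrid, using the SE names from the slide.
RANKED_FOR_MADRID = [
    {"name": "CCIN2P3", "qos": "disk"},
    {"name": "CNAF", "qos": "disk"},
    {"name": "Kosice", "qos": "disk"},
]
```

Calling `select_ses(RANKED_FOR_MADRID)` with no extra arguments corresponds to the slide's advice to use the defaults: the caller states nothing about sites and simply receives the closest usable SEs.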

Challenges
Remote data access
  – Access data over WAN?
Multicore jobagent
  – One agent per core (overkill) or
  – One agent per machine (needs development)
Interactive jobs
  – PoD
File level brokering
  – Change JDL depending on who picks the job

File level brokering
[Diagram: Files 1 to 5 distributed over Site A, Site B and Site C]
Current schema
Submit 4 jobs:
  – Job 1: files 1, 4, in Site A or B
  – Job 2: file 2, in Site B or C
  – Job 3: file 3, in Site A or C
  – Job 4: file 5, in Site A, B or C
File level brokering
Submit 3 jobs:
  – Job 1: for Site A
  – Job 2: for Site B
  – Job 3: for Site C
Each job analyzes all the files available on its site that have not been processed yet
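The two schemes on this slide can be compared with a small Python sketch that uses the file placement implied by the slide's job list (file 1 and 4 on A and B, file 2 on B and C, file 3 on A and C, file 5 everywhere). The function names are hypothetical, and `run_file_level` simulates the site jobs starting one after another, which is a simplification: in practice they would run concurrently and share the files differently.

```python
from collections import defaultdict

# Replica locations as implied by the slide's "current schema" job list.
REPLICAS = {
    "file1": {"A", "B"},
    "file2": {"B", "C"},
    "file3": {"A", "C"},
    "file4": {"A", "B"},
    "file5": {"A", "B", "C"},
}


def split_current(replicas):
    """Current schema: one job per distinct set of sites holding the files."""
    groups = defaultdict(list)
    for name, sites in sorted(replicas.items()):
        groups[frozenset(sites)].append(name)
    return [(sorted(sites), files) for sites, files in groups.items()]


def run_file_level(replicas, site_order):
    """File-level brokering: one job per site, each taking every local file
    that no earlier job has processed yet (sequential simulation)."""
    processed = set()
    work = {}
    for site in site_order:
        mine = sorted(f for f, sites in replicas.items()
                      if site in sites and f not in processed)
        processed.update(mine)
        work[site] = mine
    return work
```

With this placement `split_current(REPLICAS)` yields the four jobs listed on the slide, while `run_file_level(REPLICAS, ["A", "B", "C"])` needs only one job per site and never leaves a file waiting for one specific site to free a slot.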

Summary
AliEn Job Agents
  – Pull model
  – Can install everything (even AliEn) with bittorrent
  – Sanity checks
  – Monitors payload
      And kills it if needed
  – Automatic SE discovery for read/write