Analysis Trains
Costin Grigoras, Jan Fiete Grosse-Oetringhaus
ALICE Offline Week, 04.10.12

Presentation transcript:


Analysis Trains - Jan Fiete Grosse-Oetringhaus

Slide 2: LEGO Trains
42 trains configured (37 active)
– 5 CF, 4 GA, 1 PP, 8 JE, 5 DQ, 11 HF, 8 LF
Submitted trains this year
– 213 CF, 35 DQ, 24 GA, 124 HF, 173 JE, 114 LF, 3 PP
1-5 train operators / train
Operator mailing list
TWiki page viewauth/ALICE/AnalysisTrains
Table (PWG, Jobs [k], Wall in years): CF ,8 DQ ,1 GA852140,1 HF ,4 JE ,4 LF ,5 PP110,2
since …, on average 2400 jobs at any given time
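As a quick cross-check of the per-group train counts on the LEGO Trains slide, the following minimal Python sketch sums the configured and submitted trains and ranks the groups by their share of submissions. The counts are copied from the slide; the variable names are illustrative and not part of the train system.

```python
# Per-PWG counts copied from the slide; the variable names are illustrative.
configured = {"CF": 5, "GA": 4, "PP": 1, "JE": 8, "DQ": 5, "HF": 11, "LF": 8}
submitted = {"CF": 213, "DQ": 35, "GA": 24, "HF": 124, "JE": 173, "LF": 114, "PP": 3}

total_configured = sum(configured.values())
total_submitted = sum(submitted.values())

# Share of this year's submissions per group, largest first.
shares = sorted(
    ((pwg, count / total_submitted) for pwg, count in submitted.items()),
    key=lambda item: item[1],
    reverse=True,
)

print(f"configured trains: {total_configured}")
print(f"submitted trains:  {total_submitted}")
for pwg, share in shares:
    print(f"{pwg}: {share:.1%}")
```

The configured counts add up to the 42 trains quoted on the slide, and the submitted counts add up to 686 trains for the year.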

Slide 3: Running Statistics
[Plot; series: alidaq, aliprod, alitrain, SUM]

Slide 4: Time until trains finish
Time between train submission and submission of final merging job
Average below 2 days (good!) but with quite some spread
[Plot: Average per month per train]
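The quantity plotted on this slide (time from train submission to submission of the final merging job, averaged per month and per train) could be computed roughly as sketched below. This is only an illustration: the train names and timestamps are hypothetical, and the record layout is an assumption, not the monitoring system's actual data model.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical records: (train name, train submission, final merging job submission).
runs = [
    ("CF_example", datetime(2012, 8, 1, 10, 0), datetime(2012, 8, 2, 16, 0)),
    ("CF_example", datetime(2012, 8, 20, 9, 0), datetime(2012, 8, 23, 9, 0)),
    ("HF_example", datetime(2012, 9, 3, 12, 0), datetime(2012, 9, 4, 0, 0)),
]


def average_days_per_month_and_train(records):
    """Average finishing time in days, keyed by (YYYY-MM of submission, train)."""
    buckets = defaultdict(list)
    for train, submitted, final_merge in records:
        days = (final_merge - submitted).total_seconds() / 86400
        buckets[(submitted.strftime("%Y-%m"), train)].append(days)
    return {key: mean(values) for key, values in buckets.items()}


averages = average_days_per_month_and_train(runs)
for (month, train), days in sorted(averages.items()):
    print(f"{month} {train}: {days:.2f} days")
```

Grouping by both month and train keeps the spread visible: a single slow train shows up in its own bucket instead of being hidden in a global average.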

Slide 5: AliEn Upgrade
Monday's upgrade of parts of AliEn to v2-20 had a few side effects
– General interruption from … to midnight; during this period Costin & Pablo were continuously working on fixing the situation
– Jobs (merging jobs in particular) that were submitted during that time failed and needed to be retried later → Mistake: LPM should have been disabled for the upgrade
– New status FAILED, which is not considered a final state → led to some delay for merging jobs, fixed today (a parallel failure of CERN EOS makes submission very slow)
– Bug in SE selection, some jobs go to FAILED → being fixed by Pablo at present
I propose that planned upgrades are evaluated in particular with respect to the analysis trains, and that a plan is made for how to recover from failures during the upgrade period

Slide 6: Planned Improvements

Slide 7: Improve Merging
Merging
– Dedicated CE/SE for merging (at CERN) → being investigated
– Merging job submission to be sped up (at the moment it depends on the number of waiting analysis jobs)
Job Splitting
– Investigate new AliEn option to select the input files once the job has started → increases number of files per job (less merging, more files for event mixing)
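To illustrate why more files per job means less merging, here is a minimal Python sketch. It assumes, purely for illustration, that job outputs are merged in successive stages with a fixed fan-in; the file count, files-per-job values, and fan-in are hypothetical numbers, and the functions are not part of AliEn or the train system.

```python
import math


def n_jobs(n_files, files_per_job):
    """Number of analysis jobs needed to cover all input files."""
    return math.ceil(n_files / files_per_job)


def merge_stages(n_outputs, fan_in):
    """Merging stages needed to reduce n_outputs to one file (assumed staged merging)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    stages = 0
    while n_outputs > 1:
        n_outputs = math.ceil(n_outputs / fan_in)
        stages += 1
    return stages


N_FILES = 10_000  # hypothetical dataset size
FAN_IN = 10       # hypothetical number of outputs combined per merging job

for files_per_job in (5, 20):
    jobs = n_jobs(N_FILES, files_per_job)
    print(f"{files_per_job} files/job: {jobs} jobs, {merge_stages(jobs, FAN_IN)} merging stages")
```

Under these assumptions, quadrupling the files per job cuts the number of job outputs by a factor of four and saves one merging stage.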

Slide 8: Train Statistics
Add consumed CPU and wall time, in total and per job, to the run view
– 2.2y CPU total
– 3.2y wall total
– 3.2h CPU / job
– 4.2h wall / job
– 4.7 files / job
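The CPU and wall times shown on the Train Statistics slide can be combined into a CPU efficiency (CPU time divided by wall time). The sketch below uses the rounded values from the slide; the efficiency figure itself does not appear in the source and is only as precise as those rounded inputs. Variable names are illustrative.

```python
# Values copied from the slide (rounded as displayed there).
cpu_total_years = 2.2
wall_total_years = 3.2
cpu_per_job_hours = 3.2
wall_per_job_hours = 4.2

# CPU efficiency = CPU time / wall time, for the whole train and per job.
total_efficiency = cpu_total_years / wall_total_years
per_job_efficiency = cpu_per_job_hours / wall_per_job_hours

print(f"CPU efficiency (totals):  {total_efficiency:.0%}")
print(f"CPU efficiency (per job): {per_job_efficiency:.0%}")
```

The two ratios need not agree exactly, since the displayed inputs are rounded and the per-job values are averages.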

Slide 9: Dataset Selection
Allow users to indicate on the interface which datasets they would like to run on
– Operator marks dataset as "active" (similar to wagons)
– User selects the desired datasets among those
Desired datasets: LHC10h_AOD086, LHC11h_AOD095, …
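The proposed selection rule (the operator marks datasets as active, users choose only among active ones) can be sketched as follows. This is a minimal illustration, not the train system's implementation: the class and function names, the wagon name, and the inactive dataset entry are all hypothetical; only the two LHC dataset names come from the slide.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    active: bool = False  # toggled by the train operator


@dataclass
class Wagon:
    name: str
    desired_datasets: set[str] = field(default_factory=set)  # chosen by the user


def selectable(datasets: list[Dataset]) -> list[str]:
    """Names of the datasets the operator has marked as active."""
    return [d.name for d in datasets if d.active]


def select_datasets(wagon: Wagon, datasets: list[Dataset], requested: list[str]) -> None:
    """Record the user's choice, refusing datasets that are not active."""
    allowed = set(selectable(datasets))
    rejected = [name for name in requested if name not in allowed]
    if rejected:
        raise ValueError(f"not active: {', '.join(rejected)}")
    wagon.desired_datasets.update(requested)


datasets = [
    Dataset("LHC10h_AOD086", active=True),
    Dataset("LHC11h_AOD095", active=True),
    Dataset("EXAMPLE_INACTIVE", active=False),  # hypothetical entry
]
wagon = Wagon("example_wagon")  # hypothetical wagon name
select_datasets(wagon, datasets, ["LHC10h_AOD086"])
print(sorted(wagon.desired_datasets))
```

Rejecting the whole request when any dataset is inactive keeps the wagon's selection consistent; a real interface might instead simply not offer inactive datasets.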

Slide 10: Merging Test
Test also the merging per wagon
[Screenshot: merging test status, OK / Failed]

Slide 11: Further Ideas
Number of wagons
Enabling/disabling by lists (of wagon numbers / names?)
Saving / loading of train configurations
Groups of wagons
Ordering of wagons

Slide 12: Demo
…some new features…

Slide 13: Summary
The LEGO train system has become very popular
The average finishing time of a train is below 2 days, but with quite some spread
We have lots of improvement requests and ideas
We lack manpower (there are only Costin and me, both with many other tasks, too), which sometimes leads to long response times