ALICE moves into warp drive


ALICE moves into warp drive
Vasco Barroso, on behalf of the ALICE Collaboration
CHEP 2012, New York, 21-25 May

Content
- Introduction
- ALICE operations
- Data taking efficiency
- In-run recovery procedures
- EOR Reasons bookkeeping
- Reporting
- Future plans
- Conclusion
Vasco Barroso - "ALICE moves into warp drive" - CHEP 2012, 21/05/2012

Introduction

The ALICE experiment
- A Large Ion Collider Experiment
- Focused on heavy-ion collisions to study the quark-gluon plasma (QGP)
- Central barrel + forward muon spectrometer
- 17 installed sub-detectors
- 5 online systems (DAQ, DCS, TRG, HLT, ECS)

ALICE data flow
- Trigger system (TRG, 3 levels) sends triggers and drives detector readout
- Detectors send event fragments over 485 x 2 Gbps optical links
- Local Data Concentrators (LDCs) perform readout and build sub-events
- High Level Trigger (HLT) performs data selection and compression
- Global Data Collectors (GDCs) perform event building into full events
- Transient Data Storage (TDS), 650 TB, holds the resulting files
- Files are migrated to Permanent Data Storage (PDS) at the CERN Computing Centre over 4 x 10 Gbps links
(The original diagram quotes stage throughputs of 3.5, 4, 8 and 3 GB/s.)

ALICE operations

A typical LHC year (January to December)
- Shutdown for maintenance at the start of the year
- proton-proton collisions for most of the year
- Heavy-ion collisions at the end of the year
Speaker notes: the integrated-luminosity unit is inverse, so less is more.
Why are the 2010 and 2011 HI data almost the same? Trigger configuration and HLT compression.

ALICE operations: a typical LHC fill in ALICE (0-30 h)
- Beam injection, stable beams, beam dump
- ALICE safe protects the high voltage of the gaseous detectors (reduced voltage); super safe switches it off
- Prepare trigger configuration
- Detector calibration
- Partial ALICE READY, then full ALICE READY
- Data taking (ideally a single run)
- Detector calibration again at the end
Why calibrate both at the beginning and at the end? Because some calibration runs take too long to do at the beginning.

ALICE operations: a typical ALICE run
- Start-of-Run (SOR): configure detector electronics, start the online systems, store data taking conditions (data taking metadata) in the Logbook
- Data taking: readout, event building, online data monitoring, online calibration data
- End-of-Run (EOR): export data taking conditions and calibration data to Offline, stop the online systems
- DCS SOR example: storage of detector parameters for data analysis; reconfigure electronics
- DCS EOR example: export configuration info to Offline
Poster: "The ALICE DAQ Detector Algorithms framework", Sylvain Chapeland, 24 May, 13:30-18:15

Reality is hard...
- 17 sub-detectors + 5 online systems: 1 failure stops the run
- Analysis of the ALICE Electronic Logbook metadata concluded that better downtime/efficiency diagnosis tools were needed
- The number of runs per fill is high: ~11 runs per fill during 2011 p-p
- Starting/stopping runs is a costly operation: SOR ~3 min, EOR ~80 s
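The cost of those run transitions can be estimated from the figures on this slide alone; a quick back-of-envelope sketch using only the quoted numbers (~11 runs per fill, ~3 min SOR, ~80 s EOR):

```python
# Back-of-envelope cost of run transitions per fill (2011 p-p figures).
RUNS_PER_FILL = 11        # average number of runs per fill (slide)
SOR_SECONDS = 3 * 60      # Start-of-Run takes ~3 minutes
EOR_SECONDS = 80          # End-of-Run takes ~80 seconds

overhead_s = RUNS_PER_FILL * (SOR_SECONDS + EOR_SECONDS)
print(f"~{overhead_s / 60:.0f} min of transition overhead per fill")
```

Roughly 48 minutes of dead time per fill just from starting and stopping runs, which is the motivation for recovering in-run instead of ending the run.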

Data taking efficiency

Data taking efficiency
- Calculated per fill and stored in the ALICE Electronic Logbook
- Rd: run data taking duration
- Rp: run pause duration (trigger disabled)
- Fsb: fill stable beams duration
- Fusb: fill unusable stable beams duration
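The efficiency formula itself appeared as an image on the original slide. From the variable definitions listed here, a plausible reconstruction is the net data-taking time of all runs in a fill divided by the usable stable-beams time; this exact form is an assumption, not stated in the transcript:

```latex
\text{efficiency} \;=\; \frac{\sum_{\text{runs in fill}} \left( R_d - R_p \right)}{F_{sb} - F_{usb}}
```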

Populating the Logbook
- The LHC publishes operational parameters via the Data Interchange Protocol (DIP)
- Dedicated ALICE software (a Logbook Daemon with a DIP client) retrieves the needed values and stores them in the Logbook database
- Values are recorded at SOR and at the start/end of each fill
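The daemon pattern described here (receive parameter updates, persist them for later efficiency reports) can be sketched as follows. Note the real DIP API is a Java/C++ publish-subscribe library; the `on_update` callback, the parameter names, and the SQLite schema below are illustrative placeholders, not the actual interface:

```python
import sqlite3

# Hedged sketch of the Logbook Daemon: a callback fired on each LHC
# parameter update persists the value for later per-fill reporting.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fill_params (fill INTEGER, name TEXT, value REAL)")

def on_update(fill_number, name, value):
    """Invoked when a subscribed LHC parameter changes (hypothetical hook)."""
    db.execute("INSERT INTO fill_params VALUES (?, ?, ?)",
               (fill_number, name, value))
    db.commit()

# Simulated updates, e.g. at start of fill and at SOR:
on_update(2178, "beam_energy_gev", 3500.0)
on_update(2178, "stable_beams", 1.0)

rows = db.execute("SELECT name, value FROM fill_params").fetchall()
print(rows)
```

Keeping the values in a queryable store (rather than free text) is what makes the per-fill efficiency reports later in the talk cheap to produce.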

In-run recovery procedures

In-run recovery procedures
- Goal: avoid stopping a run and losing beam time, thus increasing efficiency
- Recover from sub-detector issues without ending the run
- 2 in-run procedures were introduced:
  - via the Detector Control System (DCS)
  - via DAQ, using the Detector Data Link (DDL)

In-run recovery via DCS
A new state was introduced in the DCS logic: ERROR_RECOVER. Example: TPC high voltage trips.
- Detector goes to ERROR_RECOVER
- ECS stops the trigger and waits
- If the detector becomes READY, the trigger is restarted
- If the detector is not READY, or on timeout, the run is stopped
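The ECS reaction described on this slide is essentially a small wait-and-decide loop. A minimal sketch, where the state names follow the slide but the function names, polling, and timeout value are illustrative assumptions rather than the real ECS code:

```python
import time

def handle_error_recover(get_state, stop_trigger, start_trigger,
                         stop_run, timeout_s=60, poll_s=1):
    """React to a detector entering the ERROR_RECOVER DCS state."""
    stop_trigger()                      # ECS stops the trigger and waits
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = get_state()
        if state == "READY":            # detector recovered: restart trigger
            start_trigger()
            return "resumed"
        if state != "ERROR_RECOVER":    # left recovery without being READY
            break
        time.sleep(poll_s)
    stop_run()                          # not READY in time: stop the run
    return "stopped"
```

The key design point from the slide is that only the trigger is paused; the run itself (and everything configured at SOR) survives the recovery.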

In-run recovery via DDL
The DDL is bi-directional, so it can be used to configure the Front End Electronics (FEE). New procedure: Pause And Configure (PAC). Example: Single Event Upset in detector FEE. Currently triggered manually by the shifter:
- Shifter executes PAC
- ECS stops the trigger
- DAQ releases the DDL
- DAQ executes configuration commands using the DDL
- DAQ re-enables the DDL for data taking
- ECS starts the trigger
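The PAC sequence can be sketched as a fixed ordering of ECS and DAQ actions. The ordering is the point: the trigger must be paused and the DDL released before any configuration traffic, and the link re-enabled before the trigger restarts. `ecs` and `daq` are illustrative stand-ins, not the real ECS/DAQ interfaces:

```python
def pause_and_configure(ecs, daq, fee_commands):
    """Hedged sketch of the Pause And Configure (PAC) step order."""
    ecs.stop_trigger()                 # no new events while reconfiguring
    daq.release_ddl()                  # free the link for control traffic
    for cmd in fee_commands:
        daq.send_over_ddl(cmd)         # e.g. re-initialize FEE after an SEU
    daq.enable_ddl()                   # link back to data-taking mode
    ecs.start_trigger()                # resume the run without an EOR
```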

EOR Reasons bookkeeping

Typical EOR Reasons
Runs can stop for a multitude of reasons:
- Decision by the shift crew (manual operation): change trigger configuration, add/remove a detector
- Problem with online systems: process no longer running, configuration error
- Problem with detectors: high voltage trip, Front End Electronics (FEE) fault, corrupted data

EOR Reasons bookkeeping
Up to mid-2011:
- Text-based entry in the Logbook
- Statistics done manually (time consuming, error prone)
- For abnormal stops, a log search was needed
Since the 2011 HI run:
- Structured data in the Logbook
- Automatic stops: inserted by the Experiment Control System
- Manual stops: the shifter is prompted
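The gain from moving to structured data can be illustrated with a small sketch: once EOR Reasons carry a fixed vocabulary, the statistics that used to be compiled manually become a one-line aggregation. The record layout and the example run numbers below are illustrative, not the real Logbook schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class EorReason:
    run: int
    category: str      # e.g. "Detectors", "Online systems", "Shift crew"
    detail: str        # e.g. "High voltage trip", "Configuration error"
    automatic: bool    # inserted by ECS (True) or entered by shifter (False)

# Illustrative sample of structured EOR entries:
reasons = [
    EorReason(167920, "Detectors", "High voltage trip", True),
    EorReason(167921, "Shift crew", "Change trigger configuration", False),
    EorReason(167925, "Detectors", "Corrupted data", True),
]

stats = Counter(r.category for r in reasons)
print(stats.most_common())  # [('Detectors', 2), ('Shift crew', 1)]
```

With free-text entries, producing the same breakdown meant searching logs by hand, exactly the time-consuming, error-prone process the slide describes.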

ECS End-of-Run panel
- Whenever a shifter stops a run, he/she has to choose from a predefined list of EOR Reasons
- The list evolves and is changed when needed
- Shifter training is important!

Changing EOR Reasons a posteriori
100% accuracy is very difficult:
- Symptoms vs. causes
- Shifter mistakes
A Logbook GUI page allows changing the EOR Reason after the fact.

Reporting

Online reports in Logbook: Fill Statistics

Online reports in Logbook: Fill Details

Online reports in Logbook: EOR Reasons


Fill Summary slides
- A PPT slide with a summary of an LHC fill
- Automatically generated every day and sent via email

Future plans

Future plans
- Integrate EOR Reasons with the JIRA issue tracking system (a Logbook-JIRA interface is being developed)
- Extend the in-run recovery procedure via DDL: automatic detector request via a bit in the event header; a new SYNC event to synchronize data sources
- An expert system for shifter support and automatic failure recovery, to reduce the load on the on-call crew

Conclusion

Conclusion
- 2 years of successful operational experience
- Big effort put into efficiency monitoring and EOR Reasons identification
- The introduction of in-run recovery procedures reduced downtime, thus increasing efficiency
- Reports automation saved time, increased visibility and "stimulated" issue resolution

Related presentations
- Poster: "The ALICE DAQ Detector Algorithms framework", Sylvain Chapeland, 24 May, 13:30-18:15
- Poster: "Orthos, an alarm system for the ALICE DAQ operations", Sylvain Chapeland, 24 May, 13:30-18:15
- Poster: "Preparing the ALICE DAQ upgrade", Pierre Vande Vyvre, 24 May, 13:30-18:15

ALICE presentations
- "ALICE HLT TPC Tracking of Heavy-Ion Events on GPUs", David Rohr, now
- A review of data placement in WLCG
- Data compression in ALICE by on-line track reconstruction and space-point analysis
- The ALICE EMCal High Level Triggers
- Automated Inventory and Monitoring of the ALICE HLT Cluster Resources with the SysMES Framework
- Monitoring the data quality of the real-time event reconstruction in the ALICE High Level Trigger
- Operational Experience with the ALICE High Level Trigger
- Flexible event reconstruction software chains with the ALICE High-Level Trigger
- Dynamic parallel ROOT facility clusters on the Alice Environment
- A new communication framework for the ALICE Grid
- AliEn JobBrokering Extreme
- Combining virtualization tools for a dynamic, distribution agnostic grid environment for ALICE grid jobs in Scandinavia
- ALICE Grid Computing at the GridKa Tier-1 center
- ALICE's detectors safety and efficiency optimization with automatic beam-driven operations
- Managing operational documentation in the ALICE Detector Control System
- An optimization of the ALICE XRootD storage cluster at the Tier-2 site in Czech Republic
- Certified Grid Job Submission in the ALICE Grid Services
- Rethinking particle transport in the many-core era
- Grid Computing at GSI (ALICE/FAIR) - present and future
- Employing peer-to-peer software distribution in ALICE Grid Services to enable opportunistic use of OSG resources

QUESTIONS?

Extra Slides

ALICE timeline
- 1990: first ideas of heavy ions at the LHC
- 1993: ALICE Letter of Intent
- 1995: Technical Proposal
- 1997: ALICE is approved
- 1998: first TDR
- 2000: construction begins
- 2005: last TDR
- 2006: installation of services
- 2007: commissioning begins
- 2008: LHC first beam
- 2010: LHC first heavy-ion run

ALICE operations
- 24/7 on-site shift crew, currently 4 shifters:
  - Shift Leader
  - DAQ + HLT + CTP (Central Trigger Processor)
  - DQM (Data Quality Monitoring) + Offline
  - DCS
- One of them is SLIMOS (Shift Leader in Matters of Safety)
- 24/7 on-call expert support for each subsystem
