Operating the ATLAS Data-Flow System with the First LHC Collisions
Nicoletta Garelli, CERN – Physics Department/ADT, on behalf of the ATLAS TDAQ group
Outline
- Introduction: run efficiency of ATLAS (A Toroidal LHC ApparatuS); the ATLAS Trigger and Data AcQuisition (TDAQ) system
- TDAQ working conditions in 2010: rates & bandwidths; Event Builder & local storage; High Level Trigger (HLT) farm
- TDAQ monitoring system and working-point prediction
- TDAQ working beyond design specifications: ATLAS full/partial event building; exploring the TDAQ potential fast
ATLAS Recorded Luminosity
- LHC: built to find the Higgs boson and new physics beyond the Standard Model
- Nominal working conditions: p-p beams at √s = 14 TeV, L = 10^34 cm^-2 s^-1, bunch crossing every 25 ns; Pb-Pb beams at √s = 5.5 TeV, L = 10^27 cm^-2 s^-1
- 2010: √s = 7 TeV (first collisions on March 30th); commissioning with 150 ns bunch trains, up to 233 colliding bunches in ATLAS (October); peak L ~ 10^32 cm^-2 s^-1 (October); first ion collisions scheduled in November
- Figure: CERN accelerator complex (SPS, PS, LHC with ALICE, ATLAS, CMS, LHCb) and ATLAS integrated luminosity vs. time; on October 14th, 2010: 20.6 pb^-1 delivered in stable beams, 18.94 pb^-1 recorded with the detector ready
ATLAS Run Efficiency
- ATLAS efficiency at stable beams, √s = 7 TeV (not luminosity weighted), over 752.7 h of stable beams (March 30th – October 11th)
- Run Efficiency, 96.5% (green): fraction of time in which ATLAS is recording data while the LHC is delivering stable beams
- Run Efficiency Ready, 93% (grey): fraction of time in which ATLAS is recording physics data with the innermost detectors at nominal voltages (a safety aspect: these detectors are only ramped to nominal voltage after stable beams are declared and ramped down again before the beams are dumped, the "warm start/stop" procedure, which accounts for most of the difference between the two numbers)
- Key functionality for maximizing the efficiency:
  - data taking starts at the beginning of the LHC fill
  - stop-less removal/recovery: automated removal and recovery of channels which stopped the trigger
  - dynamic resynchronization: automated procedure to resynchronize channels which lost synchronization with the LHC clock, without stopping the trigger
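As a quick illustration, the two efficiencies above can be turned back into hours of data taking; this is a minimal sketch using only the numbers quoted on this slide (the breakdown of the lost time is not detailed here):

```python
# Translate the two run efficiencies into hours, taking the definitions
# literally (time recording / time of stable beams).  Only the numbers
# quoted on the slide enter here; the split of the lost time is not.

stable_beam_hours = 752.7        # stable beams at 7 TeV, Mar 30 - Oct 11

run_eff = 0.965                  # recording while LHC delivers stable beams
run_eff_ready = 0.93             # recording with innermost detectors at nominal HV

print(f"recording:         {run_eff * stable_beam_hours:6.1f} h")
print(f"recording 'ready': {run_eff_ready * stable_beam_hours:6.1f} h")
print(f"not recording:     {(1 - run_eff) * stable_beam_hours:6.1f} h "
      "(warm start/stop, recoveries, resynchronizations, ...)")
```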
TDAQ Design
- Figure: ATLAS trigger/DAQ architecture, from the detector front-ends through Level 1, the ReadOut System, Level 2, the Event Builder and the Event Filter to CERN data storage
- ATLAS event size ~1.5 MB at 25 ns bunch spacing (40 MHz crossing rate)
- Level 1 (Calo/Muon custom hardware): accept rate 75 (upgradeable to 100) kHz, front-end latency < 2.5 μs; defines the Regions Of Interest (ROIs); 112 (150) GB/s from the Read-Out Drivers (RODs) into the ReadOut System
- Level 2: requests only the ROI data (~2% of the event) through the Data Collection Network; decision time ~40 ms; accept rate ~3 kHz
- Event Builder: assembles L2-accepted events in the SubFarmInputs at ~4.5 GB/s
- Event Filter: processes fully built events in ~4 s; accept rate ~200 Hz; ~300 MB/s shipped through the SubFarmOutputs to CERN data storage
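The design bandwidths quoted above follow directly from the accept rates and the average event size; a back-of-the-envelope sketch of that arithmetic (numbers are the ones on this slide):

```python
# Design throughput at each trigger level = accept rate x average event size.

EVENT_SIZE_MB = 1.5              # average ATLAS event size

design_rates_hz = {
    "Level 1 accept":      75_000,   # upgradeable to 100 kHz
    "Level 2 accept":       3_000,
    "Event Filter accept":    200,
}

for stage, rate in design_rates_hz.items():
    throughput_gb_s = rate * EVENT_SIZE_MB / 1000.0
    print(f"{stage:20s}: {rate:>7d} Hz -> {throughput_gb_s:6.1f} GB/s")

# -> ~112.5 GB/s into the ReadOut System, ~4.5 GB/s into the Event
#    Builder, ~0.3 GB/s (~300 MB/s) to CERN data storage.
```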
ATLAS TDAQ System
- Figure: layout of the TDAQ system with the number of nodes per component (the control, configuration and monitoring network is not shown)
- Underground (UX15/USA15): ~90M detector channels read out through VME-based Read-Out Drivers (RODs); ~1600 Read-Out Links into the Read-Out Subsystems (ROSes, ~150 nodes); Level 1 trigger, Region Of Interest (ROI) Builder and Timing Trigger Control (TTC)
- Surface (SDX1): Level 2 farm and Level 2 Supervisors, DataFlow Manager, Event Builder (SubFarm Inputs, SFIs), Event Filter (EF) farm, Local Storage (SubFarm Outputs, SFOs), network switches, online/monitoring and file servers; data files shipped to the CERN computer centre
- Event data requests and delete commands flow down to the ROSes; the requested event data flow up through the Data Collection Network
- ~98% of the detector fully operational in 2010
TDAQ Farm Status
- Component – installed – comments:
  - Online & Monitoring – 100% – ~60 nodes
  - ROSes – ~150 nodes
  - ROIB & L2SVs
  - HLT (L2 + EF) – ~50% – ~800 xpu nodes; ~300 EF nodes
  - Event Builder – ~60 nodes (exploiting multi-core)
  - SFO – headroom for high instantaneous throughput
  - Networking – redundancy deployed in critical areas
- 27 xpu racks, ~800 xpu nodes; XPU = L2 or EF Processing Unit: on a run-by-run basis each node can be configured to run either as L2 or as EF
- The possibility to move processing power between L2 and EF gives high flexibility to meet the trigger needs; the functional assignment (L2 or EF) is not automated (see the sketch after this list)
- 2 Gbps output per xpu rack: maximum possible EF bandwidth with the 27 xpu racks ~6 GB/s
- The other farms were installed ahead of the LHC performance ramp-up; only the HLT farm grows with it
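To make the per-run XPU flexibility concrete, here is a hypothetical sketch of what such a rack assignment looks like; the rack names, the helper function and the splitting logic are illustrative only, not the ATLAS configuration system:

```python
# Hypothetical sketch of the run-by-run XPU flexibility: each of the
# 27 xpu racks is booked as either an L2 or an EF rack when the run is
# configured.  Names and the contiguous split are purely illustrative.

XPU_RACKS = [f"rack-xpu-{i:02d}" for i in range(1, 28)]

def assign_xpu_racks(n_l2_racks: int) -> dict:
    """Split the xpu racks between L2 and EF for the next run."""
    assert 0 <= n_l2_racks <= len(XPU_RACKS)
    return {"L2": XPU_RACKS[:n_l2_racks], "EF": XPU_RACKS[n_l2_racks:]}

# Example: the rebalancing of 24 September (9 -> 12 L2 xpu racks).
allocation = assign_xpu_racks(n_l2_racks=12)
print(len(allocation["L2"]), "L2 racks,", len(allocation["EF"]), "EF xpu racks")
```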
Event Builder and SubFarm Output
- The Event Builder (EB) collects the fragments of each L2-accepted event into a single data structure at a single place: 3 kHz (L2 accept rate) × 1.5 MB (event size) ≈ 4.5 GB/s (EB input)
- The EB is able to handle a wide range of event sizes, from O(100 kB) to O(10 MB)
- The EB sends built events to the EF farm; events accepted by the EF are sent to the SFOs
- SFO effective throughput (240 MB/s per node): 1.2 GB/s in total, to be compared with the ~300 MB/s design aim
- Event distribution into data files follows the data-stream assignment (express, physics, calibration, debug); data files are asynchronously transferred to mass storage
- Figures: (1) EB input bandwidth and (2) EF output bandwidth vs. time; dashed lines mark the nominal working points (design spec. ~4.5 GB/s for the EB input, ~300 MB/s for the SFO input); the acceptance was kept high while commissioning L2 and the EF
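An illustrative sketch (not ATLAS code) of what "collecting an event into a single data structure" means: an SFI gathers the fragment of a given L1-accepted event from every ROS and concatenates them. Fragment names, sizes and the header format are assumptions for illustration:

```python
# Illustrative event building: one full event is assembled from the
# per-ROS fragments belonging to the same Level-1 event identifier.

from typing import Dict

def build_event(l1_id: int, ros_fragments: Dict[str, bytes]) -> bytes:
    """Assemble one full event from the per-ROS fragments."""
    header = f"EVENT {l1_id} nfrag={len(ros_fragments)}\n".encode()
    payload = b"".join(ros_fragments[name] for name in sorted(ros_fragments))
    return header + payload

# A full event is ~1.5 MB spread over ~150 ROSes (~10 kB per ROS on average).
fragments = {f"ROS-{i:03d}": bytes(10_000) for i in range(150)}
event = build_event(l1_id=42, ros_fragments=fragments)
print(f"built event of {len(event) / 1e6:.2f} MB")
```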
SubFarm Output Beyond the Design
- Summer 2010 SFO farm: 5+1 nodes, each with 3 HW RAID arrays, total capacity ~50 TB (= 2 days of disk buffer in case of a mass-storage failure)
- Round-robin policy across the arrays to avoid concurrent I/O (see the sketch below)
- 4 Gbps connection to CERN data storage
- LHC van der Meer scan: SFO input rate up to 1.3 GB/s, data transfer to mass storage ~920 MB/s for about 1 hour
- LHC tertiary collimator setup: SFO input rate ~1 GB/s, data transfer to mass storage ~950 MB/s for about 2 hours
- ~1 GB/s input throughput sustained during special runs to allow low event rejection
- Figures: traffic to the SFOs and traffic to CERN data storage vs. time
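A minimal sketch of the round-robin policy mentioned above: successive data files rotate over the node's three RAID arrays, so the file currently being written and the file currently being shipped to mass storage do not compete for the same disks. Mount points and file naming are hypothetical:

```python
# Round-robin selection of the RAID array for each new SFO data file.

import itertools

RAID_MOUNT_POINTS = ["/raid1", "/raid2", "/raid3"]   # 3 HW RAID arrays per node
_next_array = itertools.cycle(RAID_MOUNT_POINTS)

def next_output_path(stream: str, file_index: int) -> str:
    """Pick the array and file name for the next data file of a stream."""
    return f"{next(_next_array)}/{stream}/data_{file_index:06d}.raw"

for i in range(5):
    print(next_output_path("physics_MinBias", i))
```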
Partial Event Building (PEB) & Event Stripping
- Calibration events require only part of the full event information: event size ≤ 500 kB
- Dedicated triggers for calibration, plus events selected for both physics and calibration
- PEB: calibration events are built using only the fragments on a list of detector identifiers (see the sketch below)
- Event Stripping: events selected by L2 for physics and calibration are completely built; the subset of the event useful for calibration is copied out at the EF or at the SFO, depending on the EF result
- Output bandwidth at the EB: calibration bandwidth ~2.5% of the physics bandwidth
- High rate, using few TDAQ resources for assembling, conveying and logging calibration events
- A beyond-design feature, extensively used: calibration and physics data are taken at the same time, so continuous calibration complements the dedicated calibration runs
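An illustrative sketch of the PEB idea: for a calibration trigger, only the fragments whose detector identifier is on the trigger's list are assembled, keeping the calibration event at or below ~500 kB instead of the full ~1.5 MB. Detector names and fragment sizes are illustrative:

```python
# Partial Event Building: keep only the fragments of the requested detectors.

def build_partial_event(fragments: dict, wanted_detectors: set) -> dict:
    """Return the subset of fragments belonging to the requested detectors."""
    return {det: frag for det, frag in fragments.items() if det in wanted_detectors}

full_event = {"PIXEL": bytes(250_000), "SCT": bytes(350_000),
              "LAR": bytes(400_000), "TILE": bytes(200_000), "MUON": bytes(300_000)}

lar_calib = build_partial_event(full_event, wanted_detectors={"LAR"})
print(f"{sum(map(len, lar_calib.values())) / 1e3:.0f} kB partial event "
      f"instead of {sum(map(len, full_event.values())) / 1e6:.1f} MB full event")
```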
Network Monitoring Tool (poster PO-WED-035)
- Scalable, flexible, integrated system for the collection and display, with the same look & feel, of: network statistics (200 network devices and 8500 ports), computer statistics, environmental conditions, data-taking parameters
- Tuned for network monitoring, but giving transparent access to the data collected by any number of monitoring tools: network utilization can be correlated with TDAQ performance
- Constant monitoring of the network performance and anticipation of possible limits
HLT Resources Monitoring Tool
- Dedicated calibration stream for monitoring the HLT performance (CPU consumption, decision time, rejection factor)
- L2: ~300 selection chains, decision time ~40 ms, rejection factor ~5
- EF: ~290 selection chains, decision time ~300 ms, rejection factor ~10
- Note: the rejection factors were still low during the commissioning phase
- Example, September 24th: for L ≥ 5×10^31 cm^-2 s^-1 the tools predicted a need for more CPU power at the HLT (HLT Resources Monitoring Tool) and for more bandwidth at the EB output (Network Monitoring Tool)
- 9 dedicated EF racks with 10 Gbps/rack were enabled (~11 GB/s of additional bandwidth): bandwidth sufficient even with a saturated EB
- HLT resources before / after September 24th:
  - max bandwidth: 4.5 GB/s → 15 GB/s
  - # L2 racks: 9 xpu → 12 xpu
  - # EF racks: 18 xpu → 15 xpu + 9 EF
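A rough sizing rule behind this kind of rebalancing: the number of busy processing slots at each level is approximately the input rate times the mean decision time. The sketch below only uses numbers quoted in these slides (~20 kHz L1 accept, L2 rejection ~5, ~40 ms and ~300 ms decision times); it is an illustration, not the actual monitoring tool:

```python
# Little's-law style estimate of the HLT processing slots kept busy.

def busy_slots(input_rate_hz: float, mean_time_s: float) -> float:
    """Average number of processing slots occupied at a trigger level."""
    return input_rate_hz * mean_time_s

l2_input_rate = 20_000                 # ~20 kHz L1 accept rate in 2010 running
ef_input_rate = l2_input_rate / 5      # L2 rejection factor ~5

print("L2 slots needed:", busy_slots(l2_input_rate, 0.040))   # ~40 ms  -> ~800
print("EF slots needed:", busy_slots(ef_input_rate, 0.300))   # ~300 ms -> ~1200
```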
TDAQ 2010
- Same data-flow view as the design slide, annotated with the 2010 working point: event size ~1.5 MB at 150 ns bunch spacing; L1 accept ~20 kHz (~30 GB/s read-out); L2 decision time ~40 ms, accept ~3.5 kHz (~5.5 GB/s into the EB); EF processing time ~300 ms, accept ~350 Hz (~550 MB/s to storage)
- A high rate test with random triggers reached ~65 kHz at Level 1
BEYOND Design – Trigger Commissioning: Balancing L2 and EF
- While commissioning the trigger, the Event Builder was operated at up to ~4.5 kHz and ~7 GB/s, above the 3 kHz / 4.5 GB/s design values
BEYOND Design – LHC van der Meer Scans
- Recording rates up to ~1.3 kHz and bandwidths up to ~2 GB/s
Outlook: Exploring the Data-Flow Phase Space
- Figure: sustainable working points as a function of the number of L2 xpu racks, of EF-specific racks and of the L2/EF processing times, for an assumed L1 rate of 100 kHz; today's and the design working points are marked
Conclusions
- In 2010 the ATLAS TDAQ system has operated beyond its design requirements to meet the changing working conditions, the trigger commissioning and the understanding of detector, accelerator and physics performance
- Monitoring tools are in place to predict the working conditions and thus establish the HLT resource balancing
- Partial Event Building is used to continuously tune the detectors, commission the trigger, maximize the rate at which calibration events can be collected, and promptly analyse the beam spot
- The data-logging farm is regularly used beyond its design specifications
- The number of EF nodes evolved to meet the EB rate and processing power required in 2010, i.e. L = O(10^32) cm^-2 s^-1; more CPU power will be installed to meet the evolving needs
- High Run Efficiency for physics of 93%
- Ready for steady running in 2011 with a further increase of L
Backup Slides
EB Performance
- Figure: EB throughput vs. number of SFIs. Red and green lines: building without the EF, with 90 SFIs, for two different network protocols. Blue line: throughput with the final configuration (today's working point marked). Violet line: last year's configuration, using as many SFIs as possible.
BEYOND Expectations – High Rate Test with Random Triggers
- Level 1 rate of ~65 kHz, corresponding to ~100 GB/s of read-out throughput
BEYOND Design – Full Scan for B-Physics
- The ReadOut System and the Data Collection Network sustained L2 full-event scans at 21 kHz (maximum ~23 kHz)
Explore the Possibilities of TDAQ
- Evaluate the TDAQ limits considering: the number of L2/EF xpu racks, the L2/EF processing times, the L2/EF rejection powers
- Figure: curves give the maximum EF processing time sustainable with the racks used as EF; the labels (5; 2.8 GB/s), (7; 3.6 GB/s), (8; 4.4 GB/s), (10; 5.4 GB/s), (12; 6.2 GB/s), (14; 7.2 GB/s), (16; 8 GB/s) pair a number of racks with the corresponding EB bandwidth; vertical dashed lines mark the EB bandwidth and the corresponding EF rejection power needed to keep the SFO bandwidth at ~500 MB/s
- We aim at a maximum EB throughput of ~5 GB/s (green dashed line), which implies an EF rejection of ~10 and an L2 rejection power of ~18, an L2 average time of ~40 ms (dashed black line) and ~8 L2 racks
- The L2/EF processing times, the L2 rejection power and the number of available racks are still far from the limits: TDAQ is ready for operation at L = 10^32 cm^-2 s^-1, and the HLT requirements can be adjusted as needed (see the sketch below)
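The curves on this plot follow from a simple relation: the maximum mean EF processing time that can be sustained is the number of EF processing slots divided by the event rate coming out of the EB. A hedged sketch of that relation; the cores-per-rack figure is an assumption for illustration only, not the actual farm specification:

```python
# Maximum sustainable mean EF processing time for a given number of EF
# racks and EB throughput.  CORES_PER_RACK is a hypothetical figure.

CORES_PER_RACK = 8 * 31          # assumed: ~31 nodes per rack, 8 cores each

def max_ef_processing_time(n_ef_racks: int,
                           eb_throughput_gb_s: float,
                           event_size_mb: float = 1.5) -> float:
    """Seconds of mean EF processing time sustainable at this EB throughput."""
    ef_input_rate_hz = eb_throughput_gb_s * 1000.0 / event_size_mb
    return n_ef_racks * CORES_PER_RACK / ef_input_rate_hz

# Example: target EB throughput ~5 GB/s (~3.3 kHz of 1.5 MB events).
for racks in (9, 15, 24):
    t = max_ef_processing_time(racks, eb_throughput_gb_s=5.0)
    print(f"{racks:2d} EF racks -> max mean EF time ~{t:.2f} s")
```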