Large Scale and Performance Tests of the ATLAS Online Software (CERN ATLAS TDAQ Online Software System)

Presentation transcript:

Slide 1: Large Scale and Performance Tests of the ATLAS Online Software
CERN ATLAS TDAQ Online Software System
D. Burckhart-Chromek, I. Alexandrov, A. Amorim, E. Badescu, M. Caprini, M. Dobson, R. Hart, R. Jones, A. Kazarov, S. Kolos, V. Kotov, D. Liko, L. Lucio, L. Mapelli, M. Mineev, L. Moneta, M. Nassiakou, L. Pedro, A. Ribeiro, Y. Ryabov, D. Schweiger, I. Soloviev, H. Wolters
CHEP 2001, Beijing, China

Slide 2: Content
The Online System in ATLAS TDAQ
Testing in the Online System
Aims of the Large Scale and Performance Tests
Approach
Test Series and their Setup
Test Configurations
Results
Experience and Tips for Doing Large Scale Tests
Future Tests and Conclusions

Slide 3: TDAQ System/Context
Diagram of the ATLAS TDAQ context, highlighting the elements that run the online software:
Detector (~200 nodes), Detector Control System, LVL1 Trigger, Dataflow (~800 nodes: Readout System, Data Collection, High Level Trigger), Physics & Event Selection Architecture (PESA) with the Reconstruction Framework (Athena), HLT Strategy and Algorithms, LVL1 Result, Detector Data, LVL1 Input, Selected Events, Event Store, Offline Computing.
Online Software services: Configuration, Run Control, Process Control, Inter Process Communication, Message Reporting, Information Service, Monitoring.

Slide 4: Aims of the Large Scale and Performance Tests
Verify scalability of the online system to a large configuration
Study the interaction between the online components in a large configuration
Measure performance: take timing values for the various setup, run control transition and shutdown phases
Understand the system limits: push the system to a very large size
Perform selected fault tolerance tests

Slide 5: Testing in the Online System
Component testing: formal inspection of components, unit tests of components, nightly builds with a component check
System integration testing: nightly builds with a basic integration check; the last successful nightly build is available to developers
Planned public releases: 3-5 times a year; remote test centres test the pre-release, retrieving the system from a tar file or from CD-ROM
Deployment in test beam operation gives feedback

Slide 6: Approach for the Large Scale and Performance Tests
Test preparation: a test plan prepared beforehand defines aims, scope, configurations and resources, and describes the tests
Testware: existing example programs for controllers and monitoring, the standard setup script, and utility scripts to establish the configuration and to start/stop the process manager daemons (sketched below); the functionality of other systems is emulated where necessary
During the tests: test results and log files are produced automatically; issues found are logged and followed up immediately; fixes and enhancements are verified in the next iteration
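To make the "utility scripts to establish the configuration and to start/stop process manager daemons" concrete, here is a minimal sketch of what such a helper could look like. It is not the actual ATLAS Online testware: the daemon command name (pmg_agent), the host-list file format and the use of ssh are assumptions made purely for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical helper for driving process-manager daemons during a test run.

Minimal sketch only: 'pmg_agent', the host-list format and ssh usage are
assumptions, not the real ATLAS Online tools.
"""
import subprocess
import sys

def read_hosts(path):
    """Read one hostname per line, ignoring blanks and '#' comments."""
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip() and not ln.startswith("#")]

def run_on_host(host, command):
    """Run a shell command on a remote host via ssh and return its exit code."""
    proc = subprocess.run(["ssh", host, command],
                          capture_output=True, text=True, timeout=60)
    if proc.returncode != 0:
        print(f"[{host}] FAILED: {proc.stderr.strip()}", file=sys.stderr)
    return proc.returncode

def start_daemons(hosts):
    return [run_on_host(h, "pmg_agent --daemon") for h in hosts]

def stop_daemons(hosts):
    return [run_on_host(h, "pkill -f pmg_agent") for h in hosts]

if __name__ == "__main__":
    action, hostfile = sys.argv[1], sys.argv[2]   # e.g. "start hosts.txt"
    hosts = read_hosts(hostfile)
    codes = start_daemons(hosts) if action == "start" else stop_daemons(hosts)
    failed = sum(1 for c in codes if c != 0)
    print(f"{action}: {len(hosts) - failed}/{len(hosts)} hosts OK")
```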

Slide 7: DAQ Configuration
The ATLAS detector: each sub-detector has a large number of readout nodes/crates
The online system control tree connects the sub-detectors
The online system is responsible for the configuration database, run control, process management, inter process communication, message reporting, information service, monitoring, and the control of a multi-detector system
The configuration database describes a partition: information on all processes and their relationships, the run control hierarchy in the online system, and startup and shutdown dependencies (see the sketch below)
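As an illustration of the kind of information the slide says the configuration database holds for a partition (processes, their relationships, the control hierarchy, startup dependencies), here is a small sketch in Python. The class and field names are invented for this example and do not reflect the real ATLAS configuration database schema.

```python
"""Illustrative sketch of what a 'partition' description might contain.

Invented data model for illustration; not the real configuration database.
"""
from dataclasses import dataclass, field
from typing import List

@dataclass
class Process:
    name: str            # e.g. "crate_controller_03"
    host: str            # node the process manager starts it on
    binary: str          # executable, e.g. an example controller program
    depends_on: List[str] = field(default_factory=list)  # startup dependencies

@dataclass
class Controller:
    process: Process
    children: List["Controller"] = field(default_factory=list)  # control tree

@dataclass
class Partition:
    name: str
    root: Controller               # top of the run control hierarchy
    processes: List[Process]       # every process in the partition

    def startup_order(self):
        """Processes sorted so dependencies come first.

        Simple depth-first topological sort; assumes all dependencies exist
        and that there are no cycles.
        """
        done, order = set(), []
        def visit(p):
            if p.name in done:
                return
            for dep in p.depends_on:
                visit(next(q for q in self.processes if q.name == dep))
            done.add(p.name)
            order.append(p)
        for p in self.processes:
            visit(p)
        return order
```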

CHEP Large Scale and Perfomance Tests of the ATLAS Online System - D. Burckhart-Chromek 8 Test: run each base partition separately run each base partition separately run base partitions in parallel run base partitions in parallel Detector Controller per crate/node: one run controller one monitoring sampler read out crates are linked to a detector controller Test Set-Up Hardware and Network Hardware and Network 6 test series on 3 Test clusters, 2 days - 1 week: 16, 65, 112 PCs, Linux 6.1, Mhz, MB afs, nfs, local network Base Partition Base Partition 10 independent partitions created 11 PCs per partition one process manager daemon

CHEP Large Scale and Perfomance Tests of the ATLAS Online System - D. Burckhart-Chromek 9 Test: run the 10 test partitions sequentially 10 configurations 10 configurations build from the base partitions up to 10 base partitions + 1 root controller + 1 monitoring factory one monitoring sampler per crate controller up to 112 PCs in a 3-level hierarchy Root Controller Detector controller 10crate controllers Test Configuration-3 Level Partitions Separate Partitions are combined Example for 112 nodes Monitoring factory

Slide 10: Nested Partitions in the Configuration Data File
See the contribution to this conference: ATLAS DAQ Configuration Databases, by Igor Soloviev.

Slide 11: 100-Controller Partition

Slide 12: Timing Tests: Logical View of Transitions
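The timing measurements on the following slides boil down to recording how long each run control transition takes as the configuration grows. A minimal sketch of such a timing harness is shown below; the transition names and the use of simple callables are placeholders, since the real tests drove the actual run controller commands.

```python
"""Sketch of a timing harness for run control transitions (placeholder names)."""
import time

def time_transition(name, action):
    """Run one transition and return (name, elapsed seconds)."""
    t0 = time.perf_counter()
    action()
    return name, time.perf_counter() - t0

def run_cycle(transitions):
    """Execute an ordered list of (name, action) pairs and report each duration."""
    results = [time_transition(name, action) for name, action in transitions]
    for name, dt in results:
        print(f"{name:12s} {dt:8.3f} s")
    return results

if __name__ == "__main__":
    # Dummy actions standing in for real run control commands.
    noop = lambda: time.sleep(0.01)
    run_cycle([("setup", noop), ("boot", noop), ("configure", noop),
               ("start", noop), ("stop", noop), ("shutdown", noop),
               ("close", noop)])
```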

Slide 13: Setup/Boot/Shutdown/Close (IT cluster)
Observations from the timing plot: a slow increase with larger configurations; a roughly constant phase; the expected increase with the number of processes; a dependency problem was discovered.

Slide 14: Scalability and Performance: RC State Transitions (IT cluster)
Curves shown: single state transition; single state transition plus 1 s; 3 state transitions. Annotation: heavy load of communication.
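Comparing phases across the 16-, 65- and 112-PC configurations requires aggregating the per-run timings. The sketch below shows one simple way this could be done, assuming per-run result files with lines of the form "boot 12.3"; this file format is invented for the example and is not what the real testware produced.

```python
"""Sketch: aggregating transition times across test configurations."""
import csv
import statistics
from pathlib import Path

def load_results(directory):
    """Collect {config_size: {transition: [seconds, ...]}} from '<size>_*.txt' files."""
    results = {}
    for path in Path(directory).glob("*_*.txt"):
        size = int(path.name.split("_")[0])          # e.g. '112_run3.txt' -> 112
        per_size = results.setdefault(size, {})
        for line in path.read_text().splitlines():
            name, seconds = line.split()
            per_size.setdefault(name, []).append(float(seconds))
    return results

def write_summary(results, out_csv="summary.csv"):
    """Write the mean transition time per configuration size to a CSV file."""
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["config_size", "transition", "mean_seconds", "runs"])
        for size in sorted(results):
            for name, values in sorted(results[size].items()):
                writer.writerow([size, name,
                                 round(statistics.mean(values), 3), len(values)])

if __name__ == "__main__":
    write_summary(load_results("timing_logs"))
```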

Slide 15: Results in Numbers
For the large test partitions on 112 PCs, ~340 processes were running: 111 controllers, 100 monitoring samplers, 112 pmg daemons, ~10 servers, 1 monitoring factory; ~850 entries in the database data file (250 software, 600 hardware)
First large scale test: 45 issues found (bugs, problems, improvement suggestions); 52 days of effort (in equivalent 8-hour working days) over an elapsed time of 3 weeks of test preparation and 1 week of testing, excluding analysis, for 1-3 testers; tons of log files
Following iterations: re-use the original test plan and add a brief update; preparation time reduced radically to ~2-3 days; test runs mostly done automatically

Slide 16: Experience and Tips
Preparation: require unit tests of components; prepare a detailed test plan beforehand; run large scale tests on a tested and frozen release; foresee an expandable, flexible configuration and test infrastructure; encourage precise information logging for problem tracing
Organization: store the testware in the software repository; run the testware regularly/automatically to verify it is up to date (a minimal driver is sketched below); re-use test items such as the test structure, testware, scripts and checklists
Network: use NFS, not AFS; run on an isolated network and monitor activity
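In the spirit of "run the testware regularly/automatically" and "encourage precise information logging", a minimal driver could look like the sketch below. Script names, paths and the log layout are hypothetical, chosen only to illustrate the idea.

```python
"""Sketch of a minimal driver for running the testware automatically.

Hypothetical script names and log layout, for illustration only.
"""
import datetime
import logging
import subprocess
from pathlib import Path

TESTS = ["run_base_partitions.sh", "run_combined_partition.sh"]  # hypothetical scripts

def run_all(log_dir="test_logs"):
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    out = Path(log_dir) / stamp
    out.mkdir(parents=True, exist_ok=True)
    logging.basicConfig(filename=out / "driver.log", level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    failures = 0
    for script in TESTS:
        logging.info("starting %s", script)
        proc = subprocess.run(["bash", script], capture_output=True, text=True)
        # Keep the full output of each test run for later problem tracing.
        (out / (script + ".out")).write_text(proc.stdout + proc.stderr)
        if proc.returncode != 0:
            failures += 1
            logging.error("%s failed with exit code %d", script, proc.returncode)
        else:
            logging.info("%s succeeded", script)
    logging.info("finished: %d/%d tests failed", failures, len(TESTS))
    return failures

if __name__ == "__main__":
    raise SystemExit(run_all())
```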

Slide 17: Conclusions and Future
The online system can run a partition consisting of more than 100 PCs
The online system can run partitions in parallel
Scalability tests spot problems that cannot be seen in any other way
Shielding from the CERN network has a very positive effect
A 4-level hierarchy behaves very similarly to a 3-level one
Very large scale stress tests help in studying process communication
Future: run the basic integration test at each successful nightly build; repeat the tests on a regular basis for each major release, building on existing material; push the scale further to uncover new effects; automate the tests further; gradually include more software items and components from other systems
Many thanks to CMS and to CERN/IT for giving us access to their PC clusters