TDAQ ATLAS Reimplementation of the ATLAS Online Event Monitoring Subsystem Ingo Scholtes Summer Student University of Trier Supervisor: Serguei Kolos.

Slides:



Advertisements
Similar presentations
Mobile Agents Mouse House Creative Technologies Mike OBrien.
Advertisements

RPC & LVL1 Mu Barrel Online Monitoring during LS1 M. Della Pietra.
Silberschatz and Galvin  Operating System Concepts Module 16: Distributed-System Structures Network-Operating Systems Distributed-Operating.
GENI Experiment Control Using Gush Jeannie Albrecht and Amin Vahdat Williams College and UC San Diego.
March 24-28, 2003Computing for High-Energy Physics Configuration Database for BaBar On-line Rainer Bartoldus, Gregory Dubois-Felsmann, Yury Kolomensky,
GNAM and OHP: Monitoring Tools for the ATLAS Experiment at LHC GNAM and OHP: Monitoring Tools for the ATLAS Experiment at LHC M. Della Pietra, P. Adragna,
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
COMMMONWEALTH OF AUSTRALIA Do not remove this notice.
Network Operating Systems Users are aware of multiplicity of machines. Access to resources of various machines is done explicitly by: –Logging into the.
Extensible Scalable Monitoring for Clusters of Computers Eric Anderson U.C. Berkeley Summer 1997 NOW Retreat.
1 PLuSH – Mesh Tree Fast and Robust Wide-Area Remote Execution Mikhail Afanasyev ‧ Jose Garcia ‧ Brian Lum.
Synchronization in Distributed Systems. Mutual Exclusion To read or update shared data, a process should enter a critical region to ensure mutual exclusion.
16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.
March 2003 CHEP Online Monitoring Software Framework in the ATLAS Experiment Serguei Kolos CERN/PNPI On behalf of the ATLAS Trigger/DAQ Online Software.
Barracuda Networks Confidential1 Barracuda Backup Service Integrated Local & Offsite Data Backup.
New Features of APV-SRS-LabVIEW Data Acquisition Program Eraldo Oliveri on behalf of Riccardo de Asmundis INFN Napoli [Certified LabVIEW Developer] NYC,
Riccardo de Asmundis INFN Napoli [Certified LabVIEW Developer]
1 The Google File System Reporter: You-Wei Zhang.
Use of ROOT in the D0 Online Event Monitoring System Joel Snow, D0 Collaboration, February 2000.
Maintaining a Microsoft SQL Server 2008 Database SQLServer-Training.com.
YOUR LOGO HERE Slide Master View © (reproduced with permission) Incident Management All course material is copyright.
Failure Spread in Redundant UMTS Core Network n Author: Tuomas Erke, Helsinki University of Technology n Supervisor: Timo Korhonen, Professor of Telecommunication.
1 Telematics/Networkengineering Confidential Transmission of Lossless Visual Data: Experimental Modelling and Optimization.
UAB Dynamic Monitoring and Tuning in Multicluster Environment Genaro Costa, Anna Morajko, Paola Caymes Scutari, Tomàs Margalef and Emilio Luque Universitat.
Visual Linker Final presentation.
Resilient Peer-to-Peer Streaming Presented by: Yun Teng.
LECC2003 AmsterdamMatthias Müller A RobIn Prototype for a PCI-Bus based Atlas Readout-System B. Gorini, M. Joos, J. Petersen (CERN, Geneva) A. Kugel, R.
Copyright © 2000 OPNET Technologies, Inc. Title – 1 Distributed Trigger System for the LHC experiments Krzysztof Korcyl ATLAS experiment laboratory H.
A performance evaluation approach openModeller: A Framework for species distribution Modelling.
Control in ATLAS TDAQ Dietrich Liko on behalf of the ATLAS TDAQ Group.
Gnam Monitoring Overview M. Della Pietra, D. della Volpe (Napoli), A. Di Girolamo (Roma1), R. Ferrari, G. Gaudio, W. Vandelli (Pavia) D. Salvatore, P.
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S
Classification and Analysis of Distributed Event Filtering Algorithms Sven Bittner Dr. Annika Hinze University of Waikato New Zealand Presentation at CoopIS.
UAB Dynamic Tuning of Master/Worker Applications Anna Morajko, Paola Caymes Scutari, Tomàs Margalef, Eduardo Cesar, Joan Sorribes and Emilio Luque Universitat.
Parallelization of likelihood functions for data analysis Alfio Lazzaro CERN openlab Forum on Concurrent Programming Models and Frameworks.
2003 Conference for Computing in High Energy and Nuclear Physics La Jolla, California Giovanna Lehmann - CERN EP/ATD The DataFlow of the ATLAS Trigger.
Security in Cloud Computing Zac Douglass Chris Kahn.
CHEP March 2003 Sarah Wheeler 1 Supervision of the ATLAS High Level Triggers Sarah Wheeler on behalf of the ATLAS Trigger/DAQ High Level Trigger.
CS 6401 Overlay Networks Outline Overlay networks overview Routing overlays Resilient Overlay Networks Content Distribution Networks.
Fall 2008CS 334: Computer SecuritySlide #1 Design Principles Thanks to Matt Bishop.
TDAQ Experience in the BNL Liquid Argon Calorimeter Test Facility Denis Oliveira Damazio (BNL), George Redlinger (BNL).
Outline Introduction Bluetooth Low Energy (BLE)
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.
WP1 WP2 WP3 WP4 WP5 COORDINATOR WORK PACKAGE LDR RESEARCHER ACEOLE MID TERM REVIEW CERN 3 RD AUGUST 2010 Magnoni Luca Early Stage Researcher WP5 - ATLAS.
The ATLAS DAQ System Online Configurations Database Service Challenge J. Almeida, M. Dobson, A. Kazarov, G. Lehmann-Miotto, J.E. Sloper, I. Soloviev and.
Source Level Debugging of Parallel Programs Roland Wismüller LRR-TUM, TU München Germany.
G. Anders, G. Avolio, G. Lehmann Miotto, L. Magnoni CERN, Geneva, Switzerland The Run Control System and the Central Hint and Information Processor of.
Event Management. EMU Graham Heyes April Overview Background Requirements Solution Status.
1 DAQ.IHEP Beijing, CAS.CHINA mail to: The Readout In BESIII DAQ Framework The BESIII DAQ system consists of the readout subsystem, the.
Parallel and Distributed Simulation Deadlock Detection & Recovery: Performance Barrier Mechanisms.
14 th IEEE-NPSS Real Time Stockholm - June 9 th 2005 P. F. Zema The GNAM monitoring system and the OHP histogram presenter for ATLAS 14 th IEEE-NPSS Real.
A Validation System for the Complex Event Processing Directives of the ATLAS Shifter Assistant Tool G. Anders (CERN), G. Avolio (CERN), A. Kazarov (PNPI),
The ATLAS Run 2 Expert System Giuseppe Avolio - CERN.
June 1, 2004© Matt Bishop [Changed by Hamid R. Shahriari] Slide #13-1 Chapter 13: Design Principles Overview Principles –Least Privilege –Fail-Safe.
M. Caprini IFIN-HH Bucharest DAQ Control and Monitoring - A Software Component Model.
Online Software November 10, 2009 Infrastructure Overview Luciano Orsini, Roland Moser Invited Talk at SuperB ETD-Online Status Review.
Operating Systems Distributed-System Structures. Topics –Network-Operating Systems –Distributed-Operating Systems –Remote Services –Robustness –Design.
Barthélémy von Haller CERN PH/AID For the ALICE Collaboration The ALICE data quality monitoring system.
Gu Minhao, DAQ group Experimental Center of IHEP February 2011
Use of FPGA for dataflow Filippo Costa ALICE O2 CERN
How OPNFV Should Act Beyond Breaking Points
DAQ for ATLAS SCT macro-assembly
Software Architecture in Practice
Communication and Memory Efficient Parallel Decision Tree Construction
Systems Analysis Overview.
Introduction to locality sensitive approach to distributed systems
The Performance and Scalability of the back-end DAQ sub-system
Parallel Programming in C with MPI and OpenMP
Implementation of DHLT Monitoring Tool for ALICE
Presentation transcript:

TDAQ ATLAS Reimplementation of the ATLAS Online Event Monitoring Subsystem Ingo Scholtes Summer Student University of Trier Supervisor: Serguei Kolos

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier2 Outline Online Monitoring Current implementation and its drawbacks My reimplementation Performance comparison Conclusion

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier3 Getting the context: Atlas Online Software System of the Atlas Trigger DAQ Project Main purpose: configure, control and monitor data acquisition system Provides a GUI, which allows to control the data acquisition system “Glue” of several TDAQ sub-systems Open Source project Configuration Monitoring ControlTutorial Online Software > Information Service > Histogram Distribution > Error Reporting <<subsystem>>Monitoring > Event Sample Monitoring

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier4 Criteria D Criteria A Criteria B Criteria A A brief overview of the ATLAS Online Event Monitoring Subsystem Detector Electronics Read-Out-Driver (ROD) ROD Read-Out-System (ROS) Event Sampler Monitor Task Event Sampler Event Builder ROS ROD USER CODE USER CODE USER CODE USER CODE USER CODE USER CODE A+B DD CD C Physicists DataFlow Event Sampler USER CODE D DD

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier5 The current implementation in detail Monitor Task Distributor Sampler Detector1/Crate1 Buffer Sampler Detector1/Crate2 Monitor Task Machine A Machine B Machine C Machine D Machine E Machine F SELECT Detector1/Crate1 START ITERATOR ITERATOR SELECT Detector1/Crate1 SELECT Detector1/Crate2 START ITERATOR

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier6 Drawbacks… Scalability problem due to bottleneck in machine A Monitors will not notice if sampler crashed They just stop receiving events… Users will have to worry about thread management in sampler Start thread on StartSampling Cleanly exit it on StopSampling Often causes problems

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier7 …and what we learn from them Core of scalability problem: central distributor Bottleneck due to… routing of events through central distributor multiple distribution of identical events to different monitors

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier8 Implementation requirements Platform independent C++ Using Online Monitoring IPC based on CORBA (omniORB 4) minimal and deterministic effect on the data flow system performance High scalability Get rid of all drawbacks… ;-)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier9 What is really crucial? Sampler has to decide about criteria  saves a lot of bandwidth Sampler has to send each event once (per selection criteria) Distributor necessary to protect sampler from inrushing monitors (gatekeeper function)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier10 Basic improvement ideas Get rid of distributor for communication  P2P Moving load to monitors for means of scalability Current: share bandwidth, accumulate load Idea: share load, accumulate bandwidth ;-) Distributor only for connection management and error recovery Keeping only crucial things in sampler Criteria decisions One-time sending of each event (  at least one connection per sampler/criteria) Sampler thread management Start sampling thread with first subscription End sampling thread with loss of last subscription User code not aware of threads

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier11 Getting rid of the distributor bottleneck can be obtained by using P2P paradigm One monitor per sampler Multiple monitors per sampler? EM ES EM ES EM We need a way to prevent bottleneck here! EM ES EM ES EM

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier12 Introducing the monitor tree Monitor Distributor Sampler  Bartering time/bandwidth requirements with latency  Costs of distribution to monitors independent of number of monitors  Configured type (unary=list, binary, …) influences latency/bandwidth tradeoff  But: new problems arise with this structure Example: Binary Monitor tree

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier13 Problem 1: exit of monitors Each monitor acts as a sampler for his children Exits/crashes of monitors critical… We have to distinguish between different types of exits Leaf monitor  trivial Monitor with outdegree > 0  more complicated Root monitor  critical

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier14 Solutions – Leaf monitor exit Event Distributor Event Monitor Event Monitor Event Monitor Event Sampler  Trivial operation, just delete the monitor from the tree!

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier15 Event Monitor Solutions – Monitor with outdegree > 0 exits Event Distributor Event Monitor Event Monitor Event Monitor Event Sampler Event Monitor  More complicated, but the distributor can do it, as he has knowledge of the whole tree, O(C) complexity with C being constant maximum number of children

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier16 Solutions – Root monitor exit Event Distributor Event Monitor Event Monitor Event Monitor Event Sampler  Critical operation, as sampler is involved, but possible to do it transparently for other monitors, again O(1) complexity

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier17 Problem 2: error recovery Crash of sampler Distributor pings all samplers in reasonable intervals  can notify monitors about crash Crash of arbitrary monitors Detected like normal exit!  no problem Crash of distributor No influence on ongoing data exchange Just restart…

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier18 Comparing performance… Current implementation Reimplementation Sampler i Monitor i Distributor

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier19 ROD Profile ( K)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier20 ROD Profile ( K)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier21 ROS profile ( K)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier22 ROS profile ( K)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier23 EB profile (100 2MB)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier24 EB profile (100 2MB)

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier25 Conclusion Reimplementation fulfills all needs Improved speed As seen: Optimal scalability (constant!) Enhanced error recovery Configurable tradeoff between latency and CPU/bandwidth requirements (tree type unary, binary, …) Users do not need to care about thread management

TDAQ ATLAS Ingo Scholtes - Summer Student, University of Trier26 Thanks… …for your attention! Questions? Criticism?