Applications of Advanced Data Analysis and Expert System Technologies in the ATLAS Trigger-DAQ Controls Framework G. Avolio – University of California,

Slides:



Advertisements
Similar presentations
Chapter 19: Network Management Business Data Communications, 5e.
Advertisements

June 2010 At A Glance The Room Alert Adapter software in conjunction with AVTECH Room Alert™ devices assists in monitoring computer room environments as.
GNAM and OHP: Monitoring Tools for the ATLAS Experiment at LHC GNAM and OHP: Monitoring Tools for the ATLAS Experiment at LHC M. Della Pietra, P. Adragna,
T-FLEX DOCs PLM, Document and Workflow Management.
Chapter 19: Network Management Business Data Communications, 4e.
Building Enterprise Applications Using Visual Studio ®.NET Enterprise Architect.
Component-Level Design
1 ITC242 – Introduction to Data Communications Week 12 Topic 18 Chapter 19 Network Management.
The new The new MONARC Simulation Framework Iosif Legrand  California Institute of Technology.
Slide 1 ISTORE: System Support for Introspective Storage Appliances Aaron Brown, David Oppenheimer, and David Patterson Computer Science Division University.
Software Architecture Patterns (2). what is architecture? (recap) o an overall blueprint/model describing the structures and properties of a "system"
Building Knowledge-Driven DSS and Mining Data
Winter Retreat Connecting the Dots: Using Runtime Paths for Macro Analysis Mike Chen, Emre Kıcıman, Anthony Accardi, Armando Fox, Eric Brewer
Course Instructor: Aisha Azeem
Slide 1 of 9 Presenting 24x7 Scheduler The art of computer automation Press PageDown key or click to advance.
First year experience with the ATLAS online monitoring framework Alina Corso-Radu University of California Irvine on behalf of ATLAS TDAQ Collaboration.
Sepandar Sepehr McMaster University November 2008
Microsoft ® Official Course Monitoring and Troubleshooting Custom SharePoint Solutions SharePoint Practice Microsoft SharePoint 2013.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
Włodzimierz Funika, Filip Szura Automation of decision making for monitoring systems.
Chapter 6 – Architectural Design Lecture 2 1Chapter 6 Architectural design.
Overview of SQL Server Alka Arora.
Katanosh Morovat.   This concept is a formal approach for identifying the rules that encapsulate the structure, constraint, and control of the operation.
Designing a HEP Experiment Control System, Lessons to be Learned From 10 Years Evolution and Operation of the DELPHI Experiment. André Augustinus 8 February.
Computer and Automation Research Institute Hungarian Academy of Sciences Presentation and Analysis of Grid Performance Data Norbert Podhorszki and Peter.
WITSML Service Platform - Enterprise Drilling Information
© 2007 Tom Beckman Features:  Are autonomous software entities that act as a user’s assistant to perform discrete tasks, simplifying or completely automating.
Module 7: Fundamentals of Administering Windows Server 2008.
Atomate It! End-user Context- Sensitive Automation using Heterogeneous Information Sources on the Web Max Van Kleek et el. MIT Presented by Sangkeun Lee,
Integrated Development Environment for Policies Anjali B Shah Department of Computer Science and Electrical Engineering University of Maryland Baltimore.
School of Computer Science and Technology, Tianjin University
The Network Performance Advisor J. W. Ferguson NLANR/DAST & NCSA.
CERN IT Department CH-1211 Genève 23 Switzerland t Internet Services Job Monitoring for the LHC experiments Irina Sidorova (CERN, JINR) on.
RELATIONAL FAULT TOLERANT INTERFACE TO HETEROGENEOUS DISTRIBUTED DATABASES Prof. Osama Abulnaja Afraa Khalifah
IT 456 Seminar 5 Dr Jeffrey A Robinson. Overview of Course Week 1 – Introduction Week 2 – Installation of SQL and management Tools Week 3 - Creating and.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Page 1 WWRF Briefing WG2-br2 · Kellerer/Arbanowski · · 03/2005 · WWRF13, Korea Stefan Arbanowski, Olaf Droegehorn, Wolfgang.
Control in ATLAS TDAQ Dietrich Liko on behalf of the ATLAS TDAQ Group.
Jess: A Rule-Based Programming Environment Reporter: Yu Lun Kuo Date: April 10, 2006 Expert System.
Framework for MDO Studies Amitay Isaacs Center for Aerospace System Design and Engineering IIT Bombay.
Automating Context-Aware Application Development Ted McFadden and Karen Henricksen CRC for Enterprise Distributed Systems Technology (DSTC) Jadwiga Indulska.
Network Computing Laboratory A programming framework for Stream Synthesizing Service.
User Profiling using Semantic Web Group members: Ashwin Somaiah Asha Stephen Charlie Sudharshan Reddy.
DAM-Alarming Data Analytics from Monitoring, for alarming Summer Student Project 2015 A. Martin, C. Cristovao, G. Domenico thanks to Luca Magnoni IT-SDC-MI.
© 2006, National Research Council Canada © 2006, IBM Corporation Solving performance issues in OTS-based systems Erik Putrycz Software Engineering Group.
August 2003 At A Glance The IRC is a platform independent, extensible, and adaptive framework that provides robust, interactive, and distributed control.
June 13-15, 2007Policy 2007 Infrastructure-aware Autonomic Manager for Change Management H. Abdel SalamK. Maly R. MukkamalaM. Zubair Department of Computer.
Concepts and Realization of a Diagram Editor Generator Based on Hypergraph Transformation Author: Mark Minas Presenter: Song Gu.
WP1 WP2 WP3 WP4 WP5 COORDINATOR WORK PACKAGE LDR RESEARCHER ACEOLE MID TERM REVIEW CERN 3 RD AUGUST 2010 Magnoni Luca Early Stage Researcher WP5 - ATLAS.
CERN IT Department CH-1211 Genève 23 Switzerland t CERN IT Monitoring and Data Analytics Pedro Andrade (IT-GT) Openlab Workshop on Data Analytics.
Global ADC Job Monitoring Laura Sargsyan (YerPhI).
CS223: Software Engineering
TM & DVS status 29/10/2004. TM requirements (June 98) UR TMGR-1: TM shall have the ability to execute individual tests. UR TMGR-2: TM shall run on all.
G. Anders, G. Avolio, G. Lehmann Miotto, L. Magnoni CERN, Geneva, Switzerland The Run Control System and the Central Hint and Information Processor of.
July 19, 2004Joint Techs – Columbus, OH Network Performance Advisor Tanya M. Brethour NLANR/DAST.
Control-Theoretic Approaches for Dynamic Information Assurance George Vachtsevanos Georgia Tech Working Meeting U. C. Berkeley February 5, 2003.
A Validation System for the Complex Event Processing Directives of the ATLAS Shifter Assistant Tool G. Anders (CERN), G. Avolio (CERN), A. Kazarov (PNPI),
The ALICE data quality monitoring Barthélémy von Haller CERN PH/AID For the ALICE Collaboration.
The ATLAS Run 2 Expert System Giuseppe Avolio - CERN.
M. Caprini IFIN-HH Bucharest DAQ Control and Monitoring - A Software Component Model.
Costin Ionita, CERN for the ALICE DAQ collaboration ALICE Expert System ACAT 2013, Beijing, May 16 th – 21 st, 2013.
Barthélémy von Haller CERN PH/AID For the ALICE Collaboration The ALICE data quality monitoring system.
Fermilab Scientific Computing Division Fermi National Accelerator Laboratory, Batavia, Illinois, USA. Off-the-Shelf Hardware and Software DAQ Performance.
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Controlling a large CPU farm using industrial tools
Scott D. Metzler California Institute of Technology
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 2 Database System Concepts and Architecture.
About Thetus Thetus develops knowledge discovery and modeling infrastructure software for customers who: Have high value data that does not neatly fit.
AIMS Equipment & Automation monitoring solution
Presentation transcript:

Applications of Advanced Data Analysis and Expert System Technologies in the ATLAS Trigger-DAQ Controls Framework G. Avolio – University of California, Irvine

Outline The ATLAS Trigger and Data Acquisition (TDAQ) system – General schema – Computing infrastructure Intelligent systems and automation – Why in TDAQ? Error management in TDAQ – Error Management System Online Recovery Diagnostic and Verification System (DVS) – Complex Event Processing The “DAQ Assistant” Conclusions 21/05/20122Giuseppe Avolio - CHEP NY

From detector to data storage Trigger and Data Acquisition System The Trigger and Data Acquisition (TDAQ) system is responsible for filtering and transferring data from the detector to the mass storage – 40 millions particle interactions per second – More than 1.5 MB of data for each event – Most of the generated events are totally uninteresting A filter mechanism is needed in order to select and collect the more interesting ones 21/05/20123Giuseppe Avolio - CHEP NY

TDAQ Computing Infrastructure 21/05/20124Giuseppe Avolio - CHEP NY More than 12k cores and 20k applications GbE network(s) with more than 10k channels 1600 point-to-point connections from detector to TDAQ

Why Intelligent Systems in TDAQ? The main goal operating the system is to maximize the data taking efficiency – Dealing fast and effectively with errors and failures The system is operated by a non-expert shift crew assisted by experts providing knowledge for specific components – Inefficiency may come from human interventions Automating error detection, diagnosis and recovery is a key feature – Effective analysis and monitoring system 21/05/20125Giuseppe Avolio - CHEP NY

ATLAS Data Taking Efficiency in 2011 In 2011 ATLAS was able to record 93.5% of the total luminosity provided by the accelerator Excellent result! – But about 50% of the inefficiency due to situations involving the human intervention 21/05/2012Giuseppe Avolio - CHEP NY6

Error Management in TDAQ 21/05/2012Giuseppe Avolio - CHEP NY7 Error Management System Meant to detect failures and able to perform automatic recovery procedures Rule-based (IF-THEN knowledge base) CLIPS framework (Rete algorithm) Actions triggered in case of predefined conditions Complex Event Processing (CEP) System Meant to deal effectively with problems requiring the human intervention Based on the ESPER CEP engine Patterns detected among stream of events Powerful Event Processing Language (EPL) Online Recovery, DVS DAQ Assistant

ONLINE RECOVERY & DIAGNOSTIC AND VERIFICATION SYSTEM (DVS) 21/05/2012Giuseppe Avolio - CHEP NY8

Online Recovery and DVS Online Recovery – Analyze and recover from errors during the data taking One global server dealing with system-wide errors and procedures Local units handling errors that can be dealt with at a sub-system level Diagnostic and Verification System (DVS) – Asses the correct functionality of the system – Detect and diagnose eventual problems Test Manager – Framework allowing to develop and configure tests for any component in the system Expert System – Based on the CLIPS toolkit – “if-then” rules 21/05/2012Giuseppe Avolio - CHEP NY9

CLIPS Originally developed by NASA Available as open-source Stand-alone application or embeddable as a library Different programming paradigms – IF-THEN rules and a forward-chaining inference engine – Object oriented constructs “COOL” language – Traditional algorithmic constructs Rete algorithm 21/05/2012Giuseppe Avolio - CHEP NY10

Online Recovery in Action The system is described and configured via the Configuration Service – Object-oriented schema The set of objects for the actual configuration is loaded by the Expert System – Using the COOL object-oriented language provided by CLIPS The Expert System engine uses information coming from errors, messages and tests to match the loaded rules The KB is parsed at run-time – Easy customization of recovery procedures 21/05/2012Giuseppe Avolio - CHEP NY11

DVS The DVS is a framework that allows to – Configure a test for any component in the system – Have a graphical representation of the testable components and of the test results Via an user-friendly GUI – Automate testing of the system – Provide the operator with diagnosis and recovery advices in case of failures – Add knowledge for testing sequences and error diagnostics 21/05/2012Giuseppe Avolio - CHEP NY12

Recovery Scenarios Recoveries cover a wide range of possible scenarios – Simple local actions Restarting a dead application Ignore problems from non-critical applications – System wide actions Disable (and eventually re-enable) a busy read-out channel without stopping the run Re-configure a sub-system during the data taking 21/05/2012Giuseppe Avolio - CHEP NY13 if running system state is running, and absent application App1 status is absent, and S1 application App1 has supervisor S1, and in application App1 membership in then ignore notify S1 ignore App1 out set membership App1 out Simple recovery action for a dead application

THE DAQ ASSISTANT 21/05/2012Giuseppe Avolio - CHEP NY14

The TDAQ Assistant 21/05/2012Giuseppe Avolio - CHEP NY15 What A tool meant at guiding the operator in his daily work Diagnosing problematic situations and suggesting action to take Remind the operator he should (not) do something Aim Reduce and simplify shifter tasks Help shifters with more detailed and pertinent information Be more efficient, avoid repetition Formalize knowledge from experts Objectives Automate checks and controls in real- time Process and analyze heterogeneous streams of information Receive instructions from TDAQ experts on what to do and how to react Promptly notify operators of problems and failures

Complex Event Processing A set of technologies to process events and discover complex patterns among streams of events – Used in financial analysis, wireless sensor networks, business process management A cross between Data Base Management System and Rule Engines 21/05/201216Giuseppe Avolio - CHEP NY Main characteristics – Continuous stream processing – Support for time/size windows, aggregation and grouping events – SQL-like pattern languages Augmented with constructs to express event relationships (time, cause and aggregation) Streams replacing tables in a continuous evaluation model

Challenges in TDAQ 21/05/2012Giuseppe Avolio - CHEP NY17 Information gathering Many information sources Several technologies Heterogeneous data Information processing Building Knowledge Base Discover complex patterns Dynamic system conditions Present results Several outputs: web pages, s, SMS … and all with thousands of information updates per second!

Directives and Alerts Directives – Encode the knowledge from experts – XML structured KB – Can be modified at run time via a web-based admin interface Alerts – Effective and timeliness notification – Intelligent processing Thanks to CEP the number of false- positive situations is drastically reduced – Carry all the information needed for debug and fault diagnosis 21/05/2012Giuseppe Avolio - CHEP NY18 Directives Define what to detect and how the system has to react (produce alerts, statistics,…) Alerts Problem description, reaction, severity, domain, pattern details

Detecting Patterns 21/05/2012Giuseppe Avolio - CHEP NY19

Architecture Data gathered and feed into the engine EPL statements (from directives) are evaluated against data (continuous query) Generating alerts, notifications, statistics as soon as incoming events meet the constraints of the pattern 21/05/2012Giuseppe Avolio - CHEP NY20

Web-Based Visualization Message driven alert distribution – Based on Apache ActiveMQ Web page for interactive visualization of alerts – Alerts grouped per categories/user preferences – User interaction Mark alerts as read when the problem is solved Mask alerts – Alert history Django project with some SQLite and jQuery goodies 21/05/2012Giuseppe Avolio - CHEP NY21

Conclusions Effective monitoring, fault diagnosis and automation of recovery procedures have shown to really help improving the ATLAS data taking efficiency The DAQ Assistant is in production since June 2011 – A message-driven architecture and CEP techniques allowed to build an intelligent and automated monitoring tool – Used to assist the data acquisition operators From simple reminders to the detection of complex error conditions Shift crew reduced by one unit (DAQ shifter) – Integrated with the EMS system in order to trigger automated recovery actions Looking forward to a successful 2012 for ATLAS – And the goal is to always improve the data taking efficiency 21/05/2012Giuseppe Avolio - CHEP NY22