ATLAS experience running ATHENA and TDAQ software. Werner Wiedenmann, University of Wisconsin. Workshop on Virtualization and Multi-Core Technologies for LHC, CERN, April 14-16, 2008.


Slide 1: ATLAS experience running ATHENA and TDAQ software
Werner Wiedenmann, University of Wisconsin
Workshop on Virtualization and Multi-Core Technologies for LHC, CERN, April 14-16, 2008

Slide 2: Introduction
 ATLAS software environment
 Experience with multi-threading
 Experience with multiple process instances
 Questions on further technology evolution
 Summary and plans
Many thanks for contributions to: A. dos Anjos, A. Bogaerts, P. Calafiura, H. von der Schmitt, S. Snyder

Slide 3: From Detector Data to Offline Reconstruction
 Level-1 (hardware trigger): output rate 75 kHz, target processing time 2.5 μs
 High Level Triggers (HLT) = Level-2 + Event Filter (software triggers): 75 kHz → ~2 kHz → ~200 Hz, target processing times ~10 ms and ~1 s
 Offline reconstruction runs downstream on farms with multi-core machines

Slide 4: High Level Triggers and Offline use the Athena Framework
 Level-2 (75 kHz → 2 kHz, 10 ms target processing time)
   Partial event reconstruction in Regions of Interest (RoI)
   RoI data are requested over the network from the readout system
   HLT selection software runs in the Level-2 Processing Unit (L2PU) in worker threads
   Each thread processes one event (event parallelism)
   Executable size ~ 1 Gb
 Event Filter (2 kHz → 200 Hz, 1-2 s target processing time)
   Full event reconstruction (seeded by the Level-2 result) with offline-type algorithms
   Independent Processing Tasks (PT) run the selection software on Event Filter (EF) farm nodes
   Executable size ~ 1.5 Gb
 Offline Reconstruction
   Full event reconstruction with best calibration
   Executable size ~ Gb
 HLT Selection Software and Offline Reconstruction Software
   Both use Athena/Gaudi framework interfaces and services
   Share many services and algorithms, e.g. detector description, track reconstruction tools, etc.

Slide 5: LVL2 and EF Data Flow: Threads and Processes
[Diagram: the online/data collection framework (L2PU/EFPT) connects via a steering controller to the HLT event selection software in the "offline" ATHENA/GAUDI framework. Level-2 processes events in worker threads; the Event Filter processes events in Processing Tasks.]

Slide 6: HLT Event Selection and Offline Software
 HLT selection software is built on the ATHENA/GAUDI framework
 Reuses offline components; common to Level-2 and EF
 Offline algorithms are used in the EF, on top of the Level-2 HLT data flow software

Slide 7: Multi-core machines in ATLAS
 Multi-core machines are used in all big trigger and reconstruction farms in ATLAS (e.g. dual quad-core processors for the trigger farm)
 The large number of CPU cores calls for more parallelism
 Event parallelism is inherent to typical high energy physics selection and reconstruction programs
 Parallelization inside applications may provide huge speed-ups but typically also requires careful (re)design of the code, which may be demanding for a large existing code base
 Exploit parallelism with multi-threading and/or multiple processes
 ATLAS has a large code base mostly written and designed in the "pre-multi-core era"
   HLT reconstruction code and offline reconstruction code are mainly process-based and single-threaded
   However, many multi-threaded applications are available in the TDAQ framework
 Experimented with multi-threading and multiple processes (and mixtures of both)
 The existing code base implies boundary conditions for future developments

Slide 8: Multi-threading
 Advantages: code sharing, small context-switch times ("lightweight processes"), automatic sharing of many hardware resources
 Example: Trigger Level-2 Processing Unit (L2PU)
   Event processing in multiple worker threads
   HLT selection software is controlled by the TDAQ framework
   A special version of the Gaudi/Athena framework creates selection algorithm instances for the worker threads
 Development and MT tests started on dual-processor single-core machines, long before multi-core machines were available

Slide 9: Multi-threading support in the Athena framework
 All services and algorithms which modify data have to be thread-specific (e.g. StoreGate/EventStore). Threads may, however, also use "read-only" services common to all threads (e.g. GeometrySvc, DetectorStore)
 All thread-specific instances of services and algorithms are distinguished by type and (generic name)__(thread ID). E.g. an algorithm of type "TriggerSteering" with generic name "TrigStr" for 2 threads becomes:
   TriggerSteering/TrigStr__0
   TriggerSteering/TrigStr__1
 Assumptions:
   Algorithms (and SubAlgorithms) are always thread-specific, i.e. an algorithm copy is generated automatically for each thread
   Whether a service runs thread-specific or common to all threads is specified in the configuration
 The modified Athena can also be used for normal offline running (i.e. no thread ID is appended, number of threads = 0)

Slide 10: Experiences with Multi-threading (1)
 Created different event selection slices which could run multi-threaded
 Some technical issues are historical now but interesting, e.g.
   implementation of STL elements with different compiler versions: memory allocation model not optimal for "event parallelism"
   thread-safe external libraries

Slide 11: Multi-threading Performance
 Standard Template Library (STL) and multi-threading
   L2PU: independent event processing in each worker thread
   The default STL memory allocation scheme (common memory pool) for containers is inefficient for the L2PU processing model → frequent locking
   The L2PU processing model favors independent memory pools for each thread
     Use the pthread allocator/DF_ALLOCATOR in containers
     Solution for strings: avoid them
 Needs changes in the offline software and its external software:
   insert DF_ALLOCATOR in containers
   compile utility libraries with DF_ALLOCATOR
   design large containers to allocate memory once and reset data during event processing
 Evaluation of the problem with gcc 3.4 and icc 8
   Results with simple test programs (also used to understand the original findings) indicate considerable improvement (also for strings) in the libraries shipped with the new compilers
   Insertion of the special allocator in offline code may be avoided when new compilers are used
[Figure: L2PU with 3 worker threads; worker threads blocked during initialization and event processing]

Slide 12: Multithreading: Compiler Comparison (vector, list, string)
 gcc 2.95 results are not valid: its string is not thread-safe
 Need technology tracking: compilers, debuggers, performance assessment tools

Slide 13: Experiences with Multi-threading (2)
 Software development
   Developers have to be familiar with thread programming: special training and knowledge needed
   Developers have to take the Level-2 multi-threading model (event parallelism) into account in their code
   Created the emulator athenaMT as a development tool/environment for Level-2 code
 Synchronization problems in multi-threaded code are tedious to debug
   Need good tools to assist developers in debugging and optimizing multi-threaded programs
 Selection code typically changes rapidly due to physics needs → constant need for re-optimization
 Problem: preserving thread-safe and optimized code over release cycles and in a large heterogeneous developer community (coupling of different software communities with different goals)
 Presently we run n (= number of cores) instances of the L2 Processing Unit on a multi-core machine, each with one worker thread

Slide 14: Multiple Processes
 Run n process instances on a machine with n cores
 Easy to do with existing code: a priori no code changes required
 Good scaling observed with the number of cores
 Disadvantages: resource sharing and optimization
   Resource requirements are multiplied by the number of process instances: memory size; OS resources (file descriptors, network sockets, ...)
   On trigger farms: number of controlled applications; number of network connections to the readout system
   The same configuration data are transferred n times to the same machine and recalculated n times
 Optimal CPU utilization: use the CPU for event processing while waiting for input data
[Figure: scaling measured on one machine with 8 cores in total]

Slide 15: Reduce Memory Size
Typically in HEP applications all processes use a large amount of constant configuration and detector description data. Two approaches were tried in prototypes.
[Diagram: two quad-core CPUs with memory controllers; memory holds per-process (P/T) data plus constant data that is sharable between processes]

Slide 16: A: Memory sharing via fork (1)
 Idea and prototype by Scott Snyder (BNL); the following is from Scott's slides (Indico, confId=5060)
 Basic ideas
   Run multiple Athena reconstruction jobs, sharing as much memory as possible
   Minimize the number of required code changes; let the OS do most of the work
   Use fork()
 fork()
   fork() clones a process, including its entire address space
   On a modern OS, fork() uses copy-on-write: memory is shared up to the point where a process writes to it; the affected pages are then copied and become unshared
   The fork is done after the first event is processed, but before any output is written
   As much memory as possible is automatically shared between processes; modified memory becomes unshared; static configuration data remain shared

Slide 17: A: Memory sharing via fork (2)
 Advantages
   All memory that can be shared will be
   Code changes are restricted to a few framework packages; the bulk of the code remains untouched
   No need to worry about locking
 Disadvantages
   Memory cannot be re-shared after it has become unshared; this may be a problem, e.g. for conditions data
[Diagram: conditions data (Conditions A, Conditions B) becoming unshared as events are processed]

Slide 18: A: Memory sharing via fork (3)
 First prototype results are encouraging with a standard reconstruction job from a recent release
   Input: Z → ee Monte Carlo data, all detectors used
   Total job size ~ 1.7 Gb
 Data after running a few events (crude estimate from /proc/…/smaps):
   Total heap size: 1149 Mb
   Total heap resident size: 1050 Mb
   Shared memory: 759 Mb
   Private memory: 292 Mb
   ~ 2/3 of the memory remains shared
 With real data, frequent conditions data updates may change the results (see previous slide)

Slide 19: B: Detector Data in Shared Memory (1)
 Idea and prototype by Hans von der Schmitt (MPI for Physics, Munich); the following is from Hans's slides (Indico, confId=5060)
 Make the DetectorStore sharable: fill once and share
 Reduce configuration time by avoiding the same calculations multiple times
 Memory saving ~ O(100 Mb); more possible in the future

Slide 20: B: Detector Data in Shared Memory (2)
 The Athena Storage Manager architecture is very well suited to supporting shm: it administers storable objects on the heap and in shm
 In shm, the store can serve algorithms not only within one Unix process but also between processes
 Need additional logic to fill, lock and attach shared memory write-protected to a process; ideally handled in a state change of the reconstruction program / trigger code
 In a prototype, the Athena Storage Manager "StoreGate" was modified to also use shared memory to store and retrieve objects
   Quite some code changes to the present implementation
   Tested with some simple examples
   Roll-out/in of the complete shm to file seems fine
 Implementation details can be found in the presentation mentioned above

Slide 21: Questions on further Technology Evolution
 What strategy is best for future hardware platforms?
 Could mixed scenarios be required, e.g. running multiple processes, each multithreaded?
 Does one have to worry about CPU affinity?
 Is there a danger of locking in to a certain hardware technology?
 …

Slide 22: Questions: all CPU sockets access/share a common memory system
 Pure multithreading: run one application instance per machine with n worker threads, n >= #CPU sockets * #cores per CPU
 Or: multiple processes and one shared memory segment
[Diagram: two quad-core CPUs with memory controllers sharing one memory; per-process (P/T) data plus constant, sharable data]

Slide 23: Questions: each CPU socket has its own memory subsystem attached
 Multiple processes, each multithreaded: run one application instance per CPU socket with n worker threads, n >= #cores per CPU
 Or: multiple processes and multiple shared memory segments
[Diagram: two quad-core CPUs, each with its own memory controller; per-process (P/T) data plus constant, sharable data per socket]

Slide 24: Summary and Plans
 Due to the large code base written as single-threaded applications, it is probably best in the near future to explore the multi-application approach first on multi-core machines
   Most important: reduce the memory requirements
   Investigate resource optimization strategies: compiler support, OS support (scheduler, resource sharing, ...)
   Explore performance assessment and optimization tools
 Multi-threading may offer big performance gains but is more difficult to realize for a large code base written over a long time; it needs
   good support from compilers, external libraries and programming tools
   developer training
 In both cases, multi-process and multi-threaded, we would like more general support libraries for typical recurring code requirements of HEP applications on multi-core machines, e.g.
   access/manage "logical" file descriptors shared or split among different processes / threads
   high-level "functions" to manage shared memory
   guidelines
   ...