Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany,

Slides:

Advertisements

Similar presentations

DISTRIBUTED COMPUTING PARADIGMS

Advertisements

MicroKernel Pattern Presented by Sahibzada Sami ud din Kashif Khurshid.

Class CS 775/875, Spring 2011 Amit H. Kumar, OCCS Old Dominion University.

CS 443 Advanced OS Fabián E. Bustamante, Spring 2005 Resource Containers: A new Facility for Resource Management in Server Systems G. Banga, P. Druschel,

Introduction CSCI 444/544 Operating Systems Fall 2008.

Distributed Processing, Client/Server, and Clusters

23/04/2008VLVnT08, Toulon, FR, April 2008, M. Stavrianakou, NESTOR-NOA 1 First thoughts for KM3Net on-shore data storage and distribution Facilities VLV.

ALFA: The new ALICE-FAIR software framework

The Publisher-Subscriber Interface Timm Morten Steinbeck, KIP, University Heidelberg Timm Morten Steinbeck Technical Computer Science Kirchhoff Institute.

CHEP03 - UCSD - March 24th-28th 2003 T. M. Steinbeck, V. Lindenstruth, H. Tilsner, for the Alice Collaboration Timm Morten Steinbeck, Computer Science.

Operating Systems CS208. What is Operating System? It is a program. It is the first piece of software to run after the system boots. It coordinates the.

OPERATING SYSTEMS Introduction

Status and roadmap of the AlFa Framework Mohammad Al-Turany GSI-IT/CERN-PH-AIP.

ALFA - a common concurrency framework for ALICE and FAIR experiments

FairRoot framework Mohammad Al-Turany (GSI-Scientific Computing)

GSI Operating Software – Migration OpenVMS to Linux Ralf Huhmann PCaPAC 2008 October 20, 2008.

Computer System Architectures Computer System Software

Yavor Todorov. Introduction How it works OS level checkpointing Application level checkpointing CPR for parallel programing CPR functionality References.

Chapter 9 Message Passing Copyright © Operating Systems, by Dhananjay Dhamdhere Copyright © Operating Systems, by Dhananjay Dhamdhere2 Introduction.

Institute of Computer and Communication Network Engineering OFC/NFOEC, 6-10 March 2011, Los Angeles, CA Lessons Learned From Implementing a Path Computation.

Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?

 What is an operating system? What is an operating system?  Where does the OS fit in? Where does the OS fit in?  Services provided by an OS Services.

High performance I/O with the ZeroMQ (ØMQ) messaging library thematic CERN School of Computing Aram Santogidis › May 2015.

 Introduction to Operating System Introduction to Operating System  Types Of An Operating System Types Of An Operating System  Single User Single User.

STRATEGIC NAMING: MULTI-THREADED ALGORITHM (Ch 27, Cormen et al.) Parallelization Four types of computing: –Instruction (single, multiple) per clock cycle.

A TCP/IP transport layer for the DAQ of the CMS Experiment Miklos Kozlovszky for the CMS TriDAS collaboration CERN European Organization for Nuclear Research.

Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.

Integrating HPC into the ATLAS Distributed Computing environment Doug Benjamin Duke University.

Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,

DISTRIBUTED COMPUTING PARADIGMS. Paradigm? A MODEL 2for notes

ATCA based LLRF system design review DESY Control servers for ATCA based LLRF system Piotr Pucyk - DESY, Warsaw University of Technology Jaroslaw.

ALICE Upgrade for Run3: Computing HL-LHC Trigger, Online and Offline Computing Working Group Topical Workshop Sep 5 th 2014.

Types of Operating Systems

Advanced Computer Networks Topic 2: Characterization of Distributed Systems.

CE Operating Systems Lecture 3 Overview of OS functions and structure.

Server to Server Communication Redis as an enabler Orion Free

Management of the LHCb DAQ Network Guoming Liu * †, Niko Neufeld * * CERN, Switzerland † University of Ferrara, Italy.

Computing Division Requests The following is a list of tasks about to be officially submitted to the Computing Division for requested support. D0 personnel.

Data Acquisition Backbone Core J. Adamczewski-Musch, N. Kurz, S. Linev GSI, Experiment Electronics, Data processing group.

Hwajung Lee.  Interprocess Communication (IPC) is at the heart of distributed computing.  Processes and Threads  Process is the execution of a program.

CS533 - Concepts of Operating Systems 1 The Mach System Presented by Catherine Vilhauer.

Processes CSCI 4534 Chapter 4. Introduction Early computer systems allowed one program to be executed at a time –The program had complete control of the.

11 CLUSTERING AND AVAILABILITY Chapter 11. Chapter 11: CLUSTERING AND AVAILABILITY2 OVERVIEW  Describe the clustering capabilities of Microsoft Windows.

Simulations and Software CBM Collaboration Meeting, GSI, 17 October 2008 Volker Friese Simulations Software Computing.

Types of Operating Systems 1 Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.

1 Software Design Lecture What’s Design It’s a representation of something that is to be built. i.e. design  implementation.

FairRoot Status and plans Mohammad Al-Turany 6/25/13M. Al-Turany, ALICE Offline Meeting1.

Chapter 2 Process Management. 2 Objectives After finish this chapter, you will understand: the concept of a process. the process life cycle. process states.

ALFA - a common concurrency framework for ALICE and FAIR experiments Mohammad Al-Turany GSI-IT/CERN-PH.

Management of the LHCb DAQ Network Guoming Liu *†, Niko Neufeld * * CERN, Switzerland † University of Ferrara, Italy.

Introduction Contain two or more CPU share common memory and peripherals. Provide greater system throughput. Multiple processor executing simultaneous.

SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.

Background Computer System Architectures Computer System Software.

Introduction Goal: connecting multiple computers to get higher performance – Multiprocessors – Scalability, availability, power efficiency Job-level (process-level)

Meeting with University of Malta| CERN, May 18, 2015 | Predrag Buncic ALICE Computing in Run 2+ P. Buncic 1.

Mitglied der Helmholtz-Gemeinschaft FairMQ with FPGAs and GPUs Simone Esch –

ICE-DIP Project: Research on data transport for manycore processors for next generation DAQs Aram Santogidis › 5/12/2014.

Flexible data transport for online reconstruction M. Al-Turany Dennis Klein A. Rybalchenko 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 1.

Flexible data transport for the online reconstruction in FairRoot Mohammad Al-Turany Dennis Klein Anar Manafov Alexey Rybalchenko 6/25/13M. Al-Turany,

ALFA - a common concurrency framework for ALICE and FAIR experiments Mohammad Al-Turany GSI-IT/CERN-PH.

Using ZeroMQ for GEP. 2 About ZeroMQ The “zero” in ZeroMQZeroMQ  Zero Broker  Zero Latency (Low Latency)  Zero Administration  Zero Cost – Cross Platform.

KID - KLOE Integrated Dataflow

Data Transport for Online & Offline Processing

Alternative system models

Controlling a large CPU farm using industrial tools

Multiple Processor Systems

Software models - Software Architecture Design Patterns

Introduction to Operating Systems

Prof. Leonardo Mostarda University of Camerino

Presentation transcript:

Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany, ACAT 2013, Beijing1

Contents Introduction to FAIR o CBM and Panda Experiments Why to use Message Queues Current status Results Summary 5/17/13M. Al-Turany, ACAT 2013, Beijing2

FAIR Darmstadt (Germany) GSI Facility for Antiproton and Ion Research Observer: EU, United States, Hungary, Georgia, Saudi Arabia 5/17/13M. Al-Turany, ACAT 2013, Beijing3

5/17/13M. Al-Turany, ACAT 2013, Beijing4

5/17/135 ~400 Physicists 50 Institutes 15 countries ~400 Physicists 50 Institutes 15 countries Physics program: Equation of state at high baryonic density QCD phase transition Critical end point of QCD phase diagram Restoration of Chiral Symmetry Physics program: Equation of state at high baryonic density QCD phase transition Critical end point of QCD phase diagram Restoration of Chiral Symmetry M. Al-Turany, ACAT 2013, Beijing

Anti-Proton ANnihilation at DArmstadt 5/17/136 Physics Programm: Charmonium Spectroscopy Search for Glueballs & Hebrides Hyper nucleus Drell-Yan Physics Electromagnetic form factors Rare decays Physics Programm: Charmonium Spectroscopy Search for Glueballs & Hebrides Hyper nucleus Drell-Yan Physics Electromagnetic form factors Rare decays Data rate ~10 7 Event/s Data volume ~10 PB/yr ~450 Physicists 52 Institutes 17 Countries ~450 Physicists 52 Institutes 17 Countries M. Al-Turany, ACAT 2013, Beijing

The Online Reconstruction and analysis 5/17/ GB/s 20M Evt/s 300 GB/s 20M Evt/s < 1 GB/s 25K Evt/s < 1 GB/s 25K Evt/s How to distribute the processes? How to manage the data flow? How to recover processes when they crash? How to monitor the whole system? …… How to distribute the processes? How to manage the data flow? How to recover processes when they crash? How to monitor the whole system? …… 1 TB/s 1 GB/s > CPU-core or Equivalent GPU, FPGA, … > CPU-core or Equivalent GPU, FPGA, … M. Al-Turany, ACAT 2013, Beijing7

Trigger-less Data Acquisition 5/17/13M. Al-Turany, ACAT 2013, Beijing8

Design constrains Highly flexible: o different data paths should be modeled. Adaptive: o Sub-systems are continuously under development and improvement Should works for simulated and real data: o developing and debugging the algorithms It should support all possible hardware where the algorithms could run (CPU, GPU, FPGA) It has to scale to any size! With minimum or ideally no effort. 5/17/13M. Al-Turany, ACAT 2013, Beijing9

Before Re-inventing the Wheel What is available on the market and in the community? o A very promising package: ZeroMQ is available since 2 years Do we intend to separate online and offline? NO Multithreaded concept or a message queue based one? o Message based systems allow us to decouple producers from consumers. o We can spread the work to be done over several processes and machines. o We can manage/upgrade/move around programs (processes) independently of each other. 5/17/13M. Al-Turany, ACAT 2013, Beijing10

A socket library that acts as a concurrency framework. Carries messages across inproc, IPC, TCP, and multicast. Connect N-to-N via fanout, pubsub, pipeline, request-reply. Asynch I/O for scalable multicore message-passing apps. 30+ languages including C, C++, Java,.NET, Python. Most OS’s including Linux, Windows, OS X, PPC405/PPC440. Large and active open source community. LGPL free software with full commercial support from iMatix. 5/17/1311M. Al-Turany, ACAT 2013, Beijing

Why ØMQ? A messaging library, which allows you to design a complex communication system without much effort Presents abstraction on higher level than MPI  programming model is easier Is suitable for loosely coupled and more general distributed systems 5/17/13M. Al-Turany, ACAT 2013, Beijing12

Zero in ØMQ Originally the zero in ØMQ was meant as "zero broker" and (as close to) "zero latency" (as possible). In the meantime it has come to cover different goals: zero administration, zero cost, zero waste. More generally, "zero” refers to the culture of minimalism that permeates the project. Adding power by removing complexity rather than exposing new functionality. 5/17/13M. Al-Turany, ACAT 2013, Beijing13

ZeroMQ sockets provide efficient transport options Inter-thread Inter-process Inter-node – which is really just interprocess across nodes communication 5/17/13 PMG : Pragmatic General Multicast (a reliable multicast protocol) Named Pipe: Piece of random access memory (RAM) managed by the operating system and exposed to programs through a file descriptor and a named mount point in the file system. It behaves as a first in first out (FIFO) buffer PMG : Pragmatic General Multicast (a reliable multicast protocol) Named Pipe: Piece of random access memory (RAM) managed by the operating system and exposed to programs through a file descriptor and a named mount point in the file system. It behaves as a first in first out (FIFO) buffer M. Al-Turany, ACAT 2013, Beijing14

The built-in core ØMQ patterns are: Request-reply, which connects a set of clients to a set of services. (remote procedure call and task distribution pattern) Publish-subscribe, which connects a set of publishers to a set of subscribers. (data distribution pattern) Pipeline, which connects nodes in a fan-out / fan-in pattern that can have multiple steps, and loops. (Parallel task distribution and collection pattern) Exclusive pair, which connect two sockets exclusively 5/17/13M. Al-Turany, ACAT 2013, Beijing15

Current Status The Framework deliver some components which can be connected to each other in order to construct a processing pipeline(s). All component share a common base called Device (ZeroMQ Class). Devices are grouped by three categories: o Source: Sampler o Message-based Processor: Sink, BalancedStandaloneSplitter, StandaloneMerger, Buffer o Content-based Processor: Processor 5/17/13M. Al-Turany, ACAT 2013, Beijing16

Design 5/17/13 Experiment/dete ctor specific code Framework classes that can be used directly M. Al-Turany, ACAT 2013, Beijing17

Example for CBM online reconstruction hierarchy 5/17/13 STS data Mvd data Clusterer Tracker Global Event Reconstruction Trigger event selection TRD data TOF data Clusterer Pre-Processing TRD Tracker M. Al-Turany, ACAT 2013, Beijing18

Example for CBM online reconstruction hierarchy Simulation 5/17/13 STS Sampler Mvd Sampler Clusterer Clustrer Tracker Global Event Reconstruction Trigger event selection TRD Sampler TOF Sampler Clustrer Pre-Procceing TRD Tracker M. Al-Turany, ACAT 2013, Beijing19

Message-based Processor All message-based processors inherit from Device and operate on messages without interpreting their content. Four message-based processors have been implemented so far 5/17/13M. Al-Turany, ACAT 2013, Beijing20

5/17/13 TRD data Clusterer TRD Tracker TRD data FairMQBalancedStandaloneSplitter Clustrer FairMQStandaloneMerger Tracker Example for Fan-out/Fan-in the data path for load balancing M. Al-Turany, ACAT 2013, Beijing21

Content-based Processor The Processor device has at least one input and one output socket. A task is meant for accessing and potentially changing the message content. 5/17/13M. Al-Turany, ACAT 2013, Beijing22

Detector Simulation Detector Simulation Example for CBM online reconstruction hierarchy (scenario) STS data Mvd data Clusterer REQ REP Tracker REQ SUB PUB SUB Parameter database PUB SUB PUB SUB REP 5/17/13M. Al-Turany, ACAT 2013, Beijing23

Results Throughput of 940 Mbit/s was measured which is very close to the theoretical limit of the TCP/IPv4/GigabitEthernet The throughput for the named pipe transport between two devices on one node has been measured around 1.7 GB/s 5/17/13M. Al-Turany, ACAT 2013, Beijing24 Each message consists of digits in one panda event for one detector, with size of few kBytes

Payload in Mbyte/s as function of message size 5/17/13M. Al-Turany, ACAT 2013, Beijing25 ZeroMQ works on InfiniBand but using IP over IB

Native InfiniBand/RDMA is faster than IP over IB 5/17/13M. Al-Turany, ACAT 2013, Beijing26 Implementing ZeroMQ over IB verbs will improve the performance.

ZeroMQ Root (Event loop) 5/17/13 FairRootManager FairRunAna FairTasks Init() Re-Init() Exec() Finish() FairTasks Init() Re-Init() Exec() Finish() FairMQProcessorTask Init() Re-Init() Exec() Finish() FairMQProcessorTask Init() Re-Init() Exec() Finish() ROOT Files, Lmd Files, Remote event server, … Integrating the existing software: M. Al-Turany, ACAT 2013, Beijing27

Summary ZeroMQ communication layer is integrated into our offline framework (FairRoot) On the short term we will keep both options ROOT based event loop and concurrent processes communicating with each other via ZeroMQ. On long Term we are moving away from single event loop to distributed processes. Thanks you ! 5/17/13M. Al-Turany, ACAT 2013, Beijing28

Device Each processing stage of a pipeline is occupied by a process which executes an instance of the Device class 5/17/13M. Al-Turany, ACAT 2013, Beijing29

Sampler Devices with no inputs are categorized as sources A sampler loops (optionally: infinitely) over the loaded events and send them through the output socket. A variable event rate limiter has been implemented to control the sending speed 5/17/13M. Al-Turany, ACAT 2013, Beijing30

Message format (Protocol) Potentially any content-based processor or any source can change the application protocol. Therefore, the framework provides a generic Message class that works with any arbitrary and continuous junk of memory (FairMQMessage). One has to pass a pointer to the memory buffer, the size in bytes, and can optionally pass a function pointer to a destructor, which will be called once the message object is discarded. 5/17/13M. Al-Turany, ACAT 2013, Beijing31

New simple classes without ROOT are used in the Sampler (This enable us to use non-ROOT clients) and reduce the messages size. 5/17/13M. Al-Turany, ACAT 2013, Beijing32