Flexible data transport for online reconstruction M. Al-Turany Dennis Klein A. Rybalchenko 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 1.

Slides:



Advertisements
Similar presentations
DISTRIBUTED COMPUTING PARADIGMS
Advertisements

COM vs. CORBA.
Dr. Kalpakis CMSC 621, Advanced Operating Systems. Fall 2003 URL: Distributed System Architectures.
CHEP 2012 – New York City 1.  LHC Delivers bunch crossing at 40MHz  LHCb reduces the rate with a two level trigger system: ◦ First Level (L0) – Hardware.
SDN and Openflow.
ALFA: The new ALICE-FAIR software framework
What's inside a router? We have yet to consider the switching function of a router - the actual transfer of datagrams from a router's incoming links to.
The Publisher-Subscriber Interface Timm Morten Steinbeck, KIP, University Heidelberg Timm Morten Steinbeck Technical Computer Science Kirchhoff Institute.
CHEP03 - UCSD - March 24th-28th 2003 T. M. Steinbeck, V. Lindenstruth, H. Tilsner, for the Alice Collaboration Timm Morten Steinbeck, Computer Science.
Multiple Processor Systems 8.1 Multiprocessors 8.2 Multicomputers 8.3 Distributed systems.
OPERATING SYSTEMS Introduction
Timm M. Steinbeck - Kirchhoff Institute of Physics - University Heidelberg 1 Timm M. Steinbeck HLT Data Transport Framework.
Status and roadmap of the AlFa Framework Mohammad Al-Turany GSI-IT/CERN-PH-AIP.
Introduction to client/server architecture
ALFA - a common concurrency framework for ALICE and FAIR experiments
FairRoot framework Mohammad Al-Turany (GSI-Scientific Computing)
Scalability By Alex Huang. Current Status 10k resources managed per management server node Scales out horizontally (must disable stats collector) Real.
Computer System Architectures Computer System Software
Flexible data transport for the online reconstruction of FAIR experiments Mohammad Al-Turany Dennis Klein Alexey Rybalchenko (GSI-IT) 5/17/13M. Al-Turany,
A TCP/IP transport layer for the DAQ of the CMS Experiment Miklos Kozlovszky for the CMS TriDAS collaboration CERN European Organization for Nuclear Research.
Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.
The Red Storm High Performance Computer March 19, 2008 Sue Kelly Sandia National Laboratories Abstract: Sandia National.
Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,
DISTRIBUTED COMPUTING PARADIGMS. Paradigm? A MODEL 2for notes
LATA: A Latency and Throughput- Aware Packet Processing System Author: Jilong Kuang and Laxmi Bhuyan Publisher: DAC 2010 Presenter: Chun-Sheng Hsueh Date:
CCNA 1 v3.0 Module 11 TCP/IP Transport and Application Layers.
PARALLEL COMPUTING overview What is Parallel Computing? Traditionally, software has been written for serial computation: To be run on a single computer.
7. CBM collaboration meetingXDAQ evaluation - J.Adamczewski1.
Server to Server Communication Redis as an enabler Orion Free
Performance measurement with ZeroMQ and FairMQ
Fast Crash Recovery in RAMCloud. Motivation The role of DRAM has been increasing – Facebook used 150TB of DRAM For 200TB of disk storage However, there.
Processes Introduction to Operating Systems: Module 3.
CE Operating Systems Lecture 13 Linux/Unix interprocess communication.
Hwajung Lee.  Interprocess Communication (IPC) is at the heart of distributed computing.  Processes and Threads  Process is the execution of a program.
4/19/20021 TCPSplitter: A Reconfigurable Hardware Based TCP Flow Monitor David V. Schuehler.
Processes CSCI 4534 Chapter 4. Introduction Early computer systems allowed one program to be executed at a time –The program had complete control of the.
11 CLUSTERING AND AVAILABILITY Chapter 11. Chapter 11: CLUSTERING AND AVAILABILITY2 OVERVIEW  Describe the clustering capabilities of Microsoft Windows.
Types of Operating Systems 1 Computer Engineering Department Distributed Systems Course Assoc. Prof. Dr. Ahmet Sayar Kocaeli University - Fall 2015.
LRPC Firefly RPC, Lightweight RPC, Winsock Direct and VIA.
Measuring the Capacity of a Web Server USENIX Sympo. on Internet Tech. and Sys. ‘ Koo-Min Ahn.
Information-Centric Networks10b-1 Week 10 / Paper 2 Hermes: a distributed event-based middleware architecture –P.R. Pietzuch, J.M. Bacon –ICDCS 2002 Workshops.
AMQP, Message Broker Babu Ram Dawadi. overview Why MOM architecture? Messaging broker like RabbitMQ in brief RabbitMQ AMQP – What is it ?
FairRoot Status and plans Mohammad Al-Turany 6/25/13M. Al-Turany, ALICE Offline Meeting1.
ALFA - a common concurrency framework for ALICE and FAIR experiments Mohammad Al-Turany GSI-IT/CERN-PH.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
Management of the LHCb DAQ Network Guoming Liu *†, Niko Neufeld * * CERN, Switzerland † University of Ferrara, Italy.
SMP Basics KeyStone Training Multicore Applications Literature Number: SPRPxxx 1.
Markus Frank (CERN) & Albert Puig (UB).  An opportunity (Motivation)  Adopted approach  Implementation specifics  Status  Conclusions 2.
Background Computer System Architectures Computer System Software.
Intro to Distributed Systems Hank Levy. 23/20/2016 Distributed Systems Nearly all systems today are distributed in some way, e.g.: –they use –they.
Meeting with University of Malta| CERN, May 18, 2015 | Predrag Buncic ALICE Computing in Run 2+ P. Buncic 1.
Univ. of TehranIntroduction to Computer Network1 An Introduction to Computer Networks University of Tehran Dept. of EE and Computer Engineering By: Dr.
Mitglied der Helmholtz-Gemeinschaft FairMQ with FPGAs and GPUs Simone Esch –
Flexible data transport for the online reconstruction in FairRoot Mohammad Al-Turany Dennis Klein Anar Manafov Alexey Rybalchenko 6/25/13M. Al-Turany,
Using ZeroMQ for GEP. 2 About ZeroMQ The “zero” in ZeroMQZeroMQ  Zero Broker  Zero Latency (Low Latency)  Zero Administration  Zero Cost – Cross Platform.
Data Transport for Online & Offline Processing
Controlling a large CPU farm using industrial tools
RT2003, Montreal Niko Neufeld, CERN-EP & Univ. de Lausanne
#01 Client/Server Computing
Chapter 2: The Linux System Part 1
Multiple Processor Systems
Software models - Software Architecture Design Patterns
PRESENTATION COMPUTER NETWORKS
Multithreaded Programming
Chapter 2: Operating-System Structures
Prof. Leonardo Mostarda University of Camerino
LO2 – Understand Computer Software
Chapter 2: Operating-System Structures
Chapter 6 – Distributed Processing and File Systems
#01 Client/Server Computing
Presentation transcript:

Flexible data transport for online reconstruction M. Al-Turany Dennis Klein A. Rybalchenko 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 1

This talk: Introduction o Design requirement o Zero MQ o Socket Pattern Current Status Results 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 2

The Online Reconstruction and analysis 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa GB/s 20M Evt/s 300 GB/s 20M Evt/s > CPU-core or Equivalent GPU, FPGA, … < 1 GB/s 25K Evt/s < 1 GB/s 25K Evt/s How to manage the data flow on such a huge cluster? How to recover single/multiple processes? How to monitor it? …… How to manage the data flow on such a huge cluster? How to recover single/multiple processes? How to monitor it? ……

Design constrains Highly flexible: different data paths should be modeled. Adaptive: Sub-system are continuously under development and improvement Should work for simulated and real data: developing and debugging the algorithms It should support all possible hardware where the algorithms could run (CPU, GPU, FPGA) It has to scale to any size! With minimum or ideally no effort. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 4

Before Re-inventing the Wheel What is available on the market and in the community? o ALICE, ATLAS, CMS, LHCb, … o Financial and weather application have also huge data to deal with Do we intend to separate online and offline? Multithreaded concept or a message queue based one? o Message based systems allow us to decouple producers from consumers. o We can spread the work to be done over several processes and machines. o We can manage/upgrade/move around programs (processes) independently of each other. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 5

ØMQ (ZeroMQ) Available since 2011 A socket library that acts as a concurrency framework. Faster than TCP, for clustered products and supercomputing. Carries messages across inproc, IPC, TCP, and multicast. Connect N-to-N via fanout, pubsub, pipeline, request-reply. A synch I/O for scalable multicore message-passing apps. 30+ languages including C, C++, Java,.NET, Python. Most OSes including Linux, Windows, OS X, PPC405/PPC440. Large and active open source community. LGPL free software with full commercial support from iMatix. 12/05/126 M. Al-Turany, Panda Collaboration Meeting, Goa

Zero in ØMQ Originally the zero in ØMQ was meant as "zero broker" and (as close to) "zero latency" (as possible). In the meantime it has come to cover different goals: zero administration, zero cost, zero waste. More generally, "zero” refers to the culture of minimalism that permeates the project. Adding power by removing complexity rather than exposing new functionality. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 7

ZeroMQ sockets provide efficient transport options Inter-thread Inter-process Inter-node – which is really just inter- process across nodes communication 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 8 PMG : Pragmatic General Multicast (a reliable multicast protocol) Named Pipe: Piece of random access memory (RAM) managed by the operating system and exposed to programs through a file descriptor and a named mount point in the file system. It behaves as a first in first out (FIFO) buffer PMG : Pragmatic General Multicast (a reliable multicast protocol) Named Pipe: Piece of random access memory (RAM) managed by the operating system and exposed to programs through a file descriptor and a named mount point in the file system. It behaves as a first in first out (FIFO) buffer

The built-in core ØMQ patterns are: Request-reply, which connects a set of clients to a set of services. This is a remote procedure call and task distribution pattern. Publish-subscribe, which connects a set of publishers to a set of subscribers. This is a data distribution pattern. Pipeline, which connects nodes in a fan-out / fan-in pattern that can have multiple steps, and loops. This is a parallel task distribution and collection pattern. Exclusive pair, which connect two sockets exclusively 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 9

Request-Reply Pattern 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 10 Socket typeREQREP Compatible peer socketsREP, ROUTERREQ, DEALER DirectionBidirectional Send/receive patternSend, Receive Outgoing routing strategyRound-robinLast peer Incoming routing strategyLast peerFair-queued Action in mute stateBlockDrop

Publish-Subscribe Pattern 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 11 Socket typePUBSUB Compatible peer socketsSUB, XSUBPUB, XPUB DirectionUnidirectional Send/receive patternSend OnlyReceive only Outgoing routing strategyFan-outN/A Incoming routing strategyN/AFair-queued Action in mute stateDrop

Pipeline Pattern 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 12 Socket typePUSHPULL Compatible peer socketsPULLPUSH DirectionUnidirectional Send/receive patternSend OnlyReceive only Outgoing routing strategyRound-RobinN/A Incoming routing strategyN/AFair-queued Action in mute stateBlock

Example of sending control commands A worker process can manages two sockets (a PULL socket getting tasks, and a SUB socket getting control commands) Could be very useful for calibration and alignment parameter 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 13

Data Transfer Framework as Extension to FairRoot! Why? Modeling the pipeline processing within the online analysis Enable concurrency in FairRoot for offline analysis Reliable and efficient data transport through message queuing technology The long term plan is to have the same framework for online and offline 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 14

Simulation Online Reconstruction and Analysis Data flow example 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 15 Sampler 1 Sampler 2 Sampler 3 Merger 1 Merger 2 Merger 3 Processor 1 Processor 2 Processor 3 Merger Experiment Sub-detector 1 Sub-detector 2 Sub-detector 3

Current Status The Framework deliver some components which can be connected to each other in order to construct a processing pipeline. All component share a common base called Device (ZeroMQ Class). All devices are grouped by three categories: o Source: Sampler o Message-based Processor: Sink, BalancedStandaloneSplitter, StandaloneMerger, Buffer o Content-based Processor: Processor 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 16

Sampler Devices with no inputs are categorized as sources During RUN state the sampler loops infinitely over the loaded events and send them through the output socket. A variable event rate limiter has been implemented to control the sending speed 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 17

Message-based Processor All message-based processors inherit from Device and operate on messages without interpreting their content. Four message-based processors have been implemented so far 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 18

Content-based Processor The Processor device has one input and one output socket. A task is meant for accessing and potentially changing the message content. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 19

Design 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 20 Detector specific code

New simple classes without ROOT are used in the Sampler (This enable us to use non-ROOT clients) and reduce the messages size. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 21

Device Each processing stage of a pipeline is occupied by a process which executes an instance of the Device class 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 22

Message format (Protocol) Potentially any content-based processor or any source can change the application protocol. The framework provides a generic Message class that works with any arbitrary and continuous junk of memory. One has to pass a pointer to the memory buffer and the size in bytes, and can optionally pass a function pointer to a destructor, which will be called once the message object is discarded. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 23

Test setup and results 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 24

CPU: Intel Xeon 2.13 GHz Memory: 24 GiB, 6 4 GiB Network: Intel 82574L Gigabit Network Connection, speed=1Gbit/s Operating system GNU/Linux x86_64, Debian 7.0 ZeroMQ FairRoot PandaRoot oct12 release, Fairsoft development version from /05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 25 Four identical nodes were connected to a GigabitEthernet switch for testing

The last two measured values at 10 MB and 100 MB message size appear to be less efficient due to non-fractional output of the benchmark program. The near to maximum throughput for these last two values has been confirmed by monitoring the throughput with the linux tool iftop 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 26 ZeroMQ reaches the upper TCP throughput for message sizes larger than a hundred bytes.

Results TCP throughput of 117.6MB/s was measured which is very close to the theoretical limit of MB/s for the TCP/IPv4/GigabitEthernet stack. This was achieved using the Linux default values for the Ethernet MTU (1500 B) and TCP buffer size (85.3 KB). The throughput for the named pipe transport between two devices on one node has been measured around 1.7 GB/s 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 27

Thanks! 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 28

Backup slides 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 29

Broker Architecture of most messaging systems is distinctive by the messaging server ("broker") in the middle. Every application is connected to the central broker. No application is speaking directly to the other application. All the communication is passed through the broker. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 30

Advantages Applications don't have to have any idea about location of other applications. The only address they need is the network address of the broker. Message sender and message receiver lifetimes don't have to overlap. Sender application can push messages to the broker and terminate. The messages will be available for the receiver application any time later. Broker model is to some extent resistant to the application failure. So, if the application is buggy and prone to failure, the messages that are already in the broker will be retained even if the application fails. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 31

Drawbacks It requires excessive amount of network communication. The fact that all the messages have to be passed through the broker can result in broker turning out to be the bottleneck of the whole system. 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 32

Broker pattern pipelined fashion service-oriented architectures (SOA) 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 33

Examples of No Broker model in ZeroMQ No BrokerBroker as a Directory Service 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 34

More Models Distributed directory serviceDistributed broker 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 35

12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 36 PUB/SUB (publish/subscribe) In most broker-based systems consumers subscribe for messages with the broker, however, there's no way for broker to subscribe for messages with the publisher. So, even if there is no consumer interested in the message it is still passed from the publisher to the broker, just to be dropped there

12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 37 Messages should travel only through those lattices in the message distribution tree that lead to consumers interested in the message

Subscription Forwarding 12/05/12 M. Al-Turany, Panda Collaboration Meeting, Goa 38 XPUB is similar to a PUB socket, except that you can receive messages from it. The messages you receive are the subscriptions traveling upstream. XSUB is similar to SUB except that you subscribe by sending a subscription message to it. Subscription messages are composed of a single byte, 0 for un- subscription and 1 for subscription