ALICE High Level Trigger Data Transport Framework Software
Timm Morten Steinbeck, Computer Science/Computer Engineering Group, Kirchhoff Institute for Physics, Ruprecht-Karls-University Heidelberg

High Level Trigger Farm
● HLT Linux-PC farm ( nodes) with fast network
● Nodes are arranged hierarchically
● First stage reads data from detector and performs first level processing
● Each stage sends produced data to next stage
● Each further stage performs processing or merging on input data

High Level Trigger
The HLT needs a software framework to transport data through the PC farm.
● Requirements for the framework:
  – Efficiency:
    ● The framework should not use too many CPU cycles (which are needed for data analysis)
    ● The framework should transport the data as fast as possible
  – Flexibility:
    ● The framework should consist of components which can be plugged together in different configurations
    ● The framework should allow reconfiguration at runtime

The Data Flow Framework
● Components are single processes
● Communication is based on the publisher-subscriber principle:
  ● One publisher can serve multiple subscribers
  ● Subscribe once, receive new events
  ● The subscriber informs the publisher when it is done working on an event
● An abstract object-oriented interface hides the communication
[Diagram: one publisher and several subscribers exchanging Subscribe, New Event, and Event Done messages]
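To make the contract concrete, here is a minimal C++ sketch; all class and method names are invented for illustration and do not claim to match the actual framework interface.

```cpp
// Minimal sketch, assuming invented names: NOT the real framework API, it only
// illustrates the publisher-subscriber contract described above.
#include <algorithm>
#include <cstdint>
#include <vector>

struct EventID { uint64_t value; };

// A subscriber is told about new events and later reports that it is done.
class Subscriber {
public:
    virtual ~Subscriber() = default;
    virtual void NewEvent(EventID id) = 0;  // called by the publisher
};

// A publisher serves any number of subscribers: subscribe once, then receive
// an announcement for every new event.
class Publisher {
public:
    virtual ~Publisher() = default;

    void Subscribe(Subscriber& s)   { fSubscribers.push_back(&s); }
    void Unsubscribe(Subscriber& s) {
        fSubscribers.erase(std::remove(fSubscribers.begin(), fSubscribers.end(), &s),
                           fSubscribers.end());
    }

    // Called by a subscriber when it has finished working on an event, so the
    // publisher may release the event's buffers (bookkeeping omitted here).
    virtual void EventDone(Subscriber& /*s*/, EventID /*id*/) {}

    // Announce a new event to every registered subscriber.
    void AnnounceEvent(EventID id) {
        for (Subscriber* s : fSubscribers) s->NewEvent(id);
    }

private:
    std::vector<Subscriber*> fSubscribers;
};
```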

Framework Architecture
The framework consists of mutually dependent packages:
● Utility Classes
● Communication Classes
● Publisher & Subscriber Base Classes
● Data Source, Sink, & Processing Components
● Data Source, Sink, & Processing Templates
● Data Flow Components

Utility Classes
● General helper classes (e.g. timer, thread)
● Rewritten classes:
  – Thread-safe string class
  – Faster vector class

Communication Classes
● Two abstract base classes:
  – For small message transfers
  – For large data blocks
● Derived classes for each network technology
● Currently existing: message and block classes for TCP and SCI (Scalable Coherent Interface)
[Class diagram: TCP and SCI message classes derived from the Message Base Class; TCP and SCI block classes derived from the Block Base Class]

Communication Classes
● Implementations foreseen for the Atoll network (University of Mannheim) and the Scheduled Transfer Protocol
● API partly modelled after the socket API (Bind, Connect)
● Both explicit and implicit connections are possible:

  Explicit connection (user manages the connection):
    User call    →  System action
    Connect      →  connect
    Send         →  transfer data
    Disconnect   →  disconnect

  Implicit connection (connection handled inside Send):
    User call    →  System action
    Send         →  connect, transfer data, disconnect
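The difference between the two modes can be shown with a hedged sketch; the class, method, and address names below are placeholders and only the Bind/Connect/Send pattern is taken from the slide.

```cpp
// Sketch of the two connection styles; MessageConnection and its methods are
// placeholder names, not the framework's real communication API.
#include <cstddef>
#include <string>

class MessageConnection {
public:
    void Bind(const std::string& local)      { (void)local;  /* choose local endpoint */ }
    void Connect(const std::string& remote)  { (void)remote; /* open the connection   */ }
    void Disconnect()                        {               /* close the connection  */ }
    void Send(const void* data, std::size_t n) { (void)data; (void)n; /* transfer message */ }
};

// Explicit connection: the user manages the connection lifetime.
void SendExplicit(MessageConnection& c, const void* msg, std::size_t n) {
    c.Connect("node02:4242");  // -> system connect (hypothetical address)
    c.Send(msg, n);            // -> transfer data
    c.Disconnect();            // -> system disconnect
}

// Implicit connection: a single Send connects, transfers, and disconnects.
void SendImplicit(MessageConnection& c, const void* msg, std::size_t n) {
    c.Send(msg, n);            // -> connect, transfer data, disconnect internally
}
```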

Publisher Subscriber Classes
● Implement the publisher-subscriber interface
● Abstract interface for the communication mechanism between processes/modules
● Currently named pipes or shared memory are available
● Multi-threaded implementation
[Diagram: one publisher and several subscribers exchanging Subscribe, New Event, and Event Done messages, as before]

Publisher Subscriber Classes
● Efficiency: event data is not sent to the subscriber components
● The publisher process places the data into shared memory (ideally already during production)
● Descriptors holding the location of the data in shared memory are sent to the subscribers
● Requires buffer management in the publisher
[Diagram: the publisher places event M's data blocks into shared memory and sends the subscriber a New Event message carrying a descriptor that references those blocks]

Event Descriptor Details
● Event descriptors contain:
  ● The ID of the event
  ● The number of data blocks described
  ● For each data block:
    ● The size of the block
    ● The shared memory ID
    ● The starting offset in the shared memory
    ● The type of data
    ● The ID of the originating node
[Diagram: a descriptor for event M with block count n; each block entry holds Shm ID, offset, size, datatype, and producer ID and points at the event's data blocks in shared memory]
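For illustration, the descriptor contents could be rendered as the following C++ structures; the field and type names are invented here and are not taken from the framework's actual declarations.

```cpp
// Illustrative C++ rendering of the descriptor contents listed above; the real
// framework's type and field names will differ.
#include <cstdint>
#include <vector>

// Describes one data block that resides in a shared memory segment.
struct BlockDescriptor {
    uint32_t shmID;       // ID of the shared memory segment holding the block
    uint64_t offset;      // starting offset of the block inside that segment
    uint64_t size;        // size of the block in bytes
    uint32_t dataType;    // type of the data contained in the block
    uint32_t producerID;  // ID of the node that produced the block
};

// Describes one event as a collection of blocks in shared memory; only this
// small structure is passed to subscribers, never the event data itself.
struct EventDescriptor {
    uint64_t eventID;                     // ID of the event
    std::vector<BlockDescriptor> blocks;  // one entry per data block
};
```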

Persistent and Transient Subscribers
Two kinds of subscribers are supported: persistent and transient.
● Persistent subscribers get all events.
● The publisher frees events only when all persistent subscribers are finished.
● Transient subscribers can specify event selection criteria (event number modulo, trigger words).
● Transient subscribers can have events cancelled by the publisher.
[Diagram: New Event, Event Done, Free Event, and Cancel Event message exchanges between the publisher and the two subscriber kinds]

Data Flow Components
Several components shape the data flow in a chain:
● To merge data streams belonging to one event (EventMerger)
● To split and rejoin a data stream, e.g. for load balancing (EventScatterer / EventGatherer)
[Diagram: an EventMerger combining event N, parts 1-3, into one event N; an EventScatterer distributing events 1-3 to parallel processing and an EventGatherer rejoining them]

Data Flow Components
Several components shape the data flow in a chain:
● To transparently transport data over the network to other computers (Bridge)
● The SubscriberBridgeHead uses a subscriber class for the incoming data; the PublisherBridgeHead uses a publisher class to announce the data
[Diagram: SubscriberBridgeHead on node 1 connected over the network to PublisherBridgeHead on node 2]

Event Scatterer
● One subscriber input.
● Multiple publisher outputs.
● Incoming events are distributed to the output publishers in round-robin fashion.
[Diagram: the scatterer's subscriber input forwarding New Event 1 and New Event 2 to different output publishers]
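In terms of the illustrative classes sketched earlier, the round-robin policy amounts to something like the following; the names are again invented and this is not the actual EventScatterer code.

```cpp
// Illustrative round-robin scatterer built on the sketched Publisher and
// Subscriber classes above; not the framework's real component.
#include <cstddef>
#include <utility>
#include <vector>

class EventScattererSketch : public Subscriber {
public:
    explicit EventScattererSketch(std::vector<Publisher*> outputs)
        : fOutputs(std::move(outputs)) {}

    // Each incoming event goes to exactly one output publisher, cycling
    // through the outputs in round-robin order.
    void NewEvent(EventID id) override {
        fOutputs[fNext]->AnnounceEvent(id);
        fNext = (fNext + 1) % fOutputs.size();
    }

private:
    std::vector<Publisher*> fOutputs;
    std::size_t fNext = 0;
};
```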

Event Gatherer
● Multiple subscribers attached to different publishers.
● One output publisher.
● Each incoming event is published unchanged by the output publisher.
[Diagram: two subscriber inputs receiving New Event 1 and New Event 2, both forwarded to the single output publisher]

Event Merger
● Multiple subscribers attached to different publishers.
● One output publisher.
● Data blocks from incoming events are merged into one outgoing event.
[Diagram: merging code combines event M's blocks 0 and 1, arriving via different subscribers, into a single event M announced by the output publisher; the blocks stay in shared memory]
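A simplified sketch of the bookkeeping such a merger needs, reusing the illustrative EventDescriptor from above; the real merger additionally handles timeouts, event ID matching across nodes, and shared memory details that are omitted here.

```cpp
// Illustrative merge step: descriptor parts for the same event ID arriving
// from several inputs are combined into one outgoing descriptor. The data
// blocks themselves stay in shared memory and are never copied.
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

class EventMergerSketch {
public:
    explicit EventMergerSketch(std::size_t inputCount) : fInputCount(inputCount) {}

    // Called when one input delivers its part of an event. Returns true and
    // fills 'merged' once all inputs have contributed their blocks.
    bool AddPart(const EventDescriptor& part, EventDescriptor& merged) {
        Pending& p = fPending[part.eventID];
        p.blocks.insert(p.blocks.end(), part.blocks.begin(), part.blocks.end());
        if (++p.partsSeen < fInputCount)
            return false;                      // still waiting for other inputs
        merged.eventID = part.eventID;
        merged.blocks  = std::move(p.blocks);  // one event containing all blocks
        fPending.erase(part.eventID);
        return true;
    }

private:
    struct Pending {
        std::size_t partsSeen = 0;
        std::vector<BlockDescriptor> blocks;
    };
    std::size_t fInputCount;
    std::map<uint64_t, Pending> fPending;
};
```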

Bridge
The Bridge consists of two programs:
● SubscriberBridgeHead
  ● Contains the subscriber.
  ● Gets input data from a publisher.
  ● Sends the data over the network.
● PublisherBridgeHead
  ● Reads the data from the network.
  ● Publishes the data again.
[Diagram: data producer and SubscriberBridgeHead on node A, network data transport, PublisherBridgeHead and data consumer on node B]
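Schematically, and again with invented names, the two halves behave roughly as in this sketch; the real components handle buffering, reconnects, and flow control that are left out here.

```cpp
// Rough sketch of the Bridge idea using the illustrative classes from above;
// NetworkLink and its methods are invented for this example.
class NetworkLink {
public:
    void SendEvent(const EventDescriptor& desc)  { (void)desc; /* ship descriptor + blocks */ }
    bool ReceiveEvent(EventDescriptor& desc)     { (void)desc; return false; /* fill from network */ }
};

// SubscriberBridgeHead side: subscribes locally, forwards each event's data
// over the network, then tells the local publisher it is done with the event.
class SubscriberBridgeHeadSketch : public Subscriber {
public:
    SubscriberBridgeHeadSketch(Publisher& local, NetworkLink& net)
        : fLocal(local), fNet(net) {}
    void NewEvent(EventID id) override {
        EventDescriptor desc;          // in reality: looked up for this event ID
        desc.eventID = id.value;
        fNet.SendEvent(desc);          // transfer descriptor and block contents
        fLocal.EventDone(*this, id);   // allow the publisher to free its buffers
    }
private:
    Publisher&   fLocal;
    NetworkLink& fNet;
};

// PublisherBridgeHead side: receives events from the network and publishes
// them again on the remote node.
class PublisherBridgeHeadSketch : public Publisher {
public:
    explicit PublisherBridgeHeadSketch(NetworkLink& net) : fNet(net) {}
    void Pump() {                      // poll the network and republish arrivals
        EventDescriptor desc;
        while (fNet.ReceiveEvent(desc))
            AnnounceEvent(EventID{desc.eventID});
    }
private:
    NetworkLink& fNet;
};
```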

Component Templates
Templates for user components are provided:
● To read out data from a source and insert it into a chain (Data Source Template)
● To accept data from the chain, process it, and reinsert the result (Analysis Template)
● To accept data from the chain and process it, e.g. store it (Data Sink Template)
[Diagram: the Data Source Template covers data source addressing & handling, buffer management, and data announcing; the Analysis Template covers accepting input data, input data addressing, data analysis & processing, output buffer management, and output data announcing; the Data Sink Template covers accepting input data, input data addressing, and data sink addressing & writing]
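Using the Analysis Template as an example, the part a user has to supply is essentially one processing routine. The sketch below uses the invented types from the earlier examples rather than the template's real hooks.

```cpp
// Illustrative analysis component: accept an input event, process its blocks,
// and announce the produced output downstream. All names are invented.
class AnalysisComponentSketch {
public:
    // Called for every input event; 'output' announces the produced data.
    void ProcessEvent(const EventDescriptor& input, Publisher& output) {
        for (const BlockDescriptor& block : input.blocks) {
            (void)block;
            // 1. Address the block's data in shared memory (input data addressing).
            // 2. Run the user's analysis / processing code on it.
            // 3. Write the results into an output buffer managed by the template.
        }
        // 4. Announce the produced data to the next stage in the chain.
        output.AnnounceEvent(EventID{input.eventID});
    }
};
```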

HLT Fault Tolerance
● PCs in the ALICE HLT (or more).
● Round-the-clock operation.
● Cost for one hour of ALICE operation: > CHF.
What to do when one of the PCs fails?

HLT Fault Tolerance
Each level of processing in the (TPC) HLT splits its output and distributes it among multiple nodes of the next processing stage. Optional, normally unused (hot) spare nodes may be available.

HLT Fault Tolerance
If one of the nodes fails, the data is distributed among the remaining nodes, or one of the spare nodes is activated and takes over the role of the faulty node.

HLT Fault Tolerance Test
Test of the HLT fault tolerance capability:
One node distributes its output data to three further worker nodes (plus one spare). The output data of the three worker nodes is collected by a sixth node. For one of the three worker nodes the network cable is unplugged.

HLT Fault Tolerance Test
Test of the HLT fault tolerance capability - results:
After the "failure" of the node, the data is processed by the spare node, which has been activated. No data is lost, and there is only a temporary performance loss until the switch-over is complete. The switch is done completely automatically, without any user intervention.

HLT Fault Tolerance

Benchmarks
● Performance tests of the framework
● Dual Pentium III, 733-800 MHz
● Tyan Thunder 2500 or Thunder HEsl motherboard, ServerWorks HE/HEsl chipset
● 512 MB RAM, > 20 GB disk space (system, swap, tmp)
● Fast Ethernet, switch with 21 Gb/s backplane
● SuSE Linux 7.2 with kernel
● Benchmarks were run with unoptimized debugging code on a kernel with multiple debugging options, so the results can only get better

Benchmarks
Benchmark of the publisher-subscriber interface:
● Publisher process with a 16 MB output buffer, event size 128 B
● The publisher does the buffer management and copies the data into the buffer; the subscriber just replies to each event
● Maximum performance: more than 12.5 kHz

Benchmarks
Benchmark of the TCP message class:
● Client sending messages to a server on another PC
● TCP over Fast Ethernet
● Message size 32 B
● Maximum message rate: more than 45 kHz

Benchmarks
Publisher-subscriber network benchmark:
● Publisher on node A, subscriber on node B
● Connected via a Bridge, TCP over Fast Ethernet
● 31 MB buffer in the publisher and in the receiving bridge
● Data size from 128 B to 1 MB
[Setup: Publisher and SubscriberBridgeHead on node A, PublisherBridgeHead and Subscriber on node B, connected over the network]

Benchmarks
Publisher-subscriber network benchmark (plot notes):
● Dual-CPU nodes: 100% CPU load corresponds to 2 CPUs
● Theoretical rate: 100 Mb/s / event size
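As a worked example of the "theoretical rate" note, using the 128 B event size from the earlier benchmark:

```latex
R_{\text{theor}} = \frac{100\ \text{Mbit/s}}{\text{event size}}
                 = \frac{12.5 \times 10^{6}\ \text{B/s}}{128\ \text{B}}
                 \approx 97.7\ \text{kHz}
```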

Benchmarks
● CPU load and event rate decrease with larger blocks
● The receiver is more heavily loaded than the sender
● Minimum CPU load: 20% sender, 30% receiver
● Maximum CPU load (128 B events): 90% receiver, 80% sender
● Management of > events in the system!

Benchmarks Publisher-Subscriber Network Benchmark

Benchmarks
● From 32 kB event size onwards the network bandwidth becomes the limit
● At 32 kB event size: more than 10 MB/s
● At the maximum event size: 12.3×10⁶ B/s out of the theoretical 12.5×10⁶ B/s
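Reading these numbers as sustained throughput, the framework reaches roughly 98% of the Fast Ethernet limit at the largest event size:

```latex
\frac{12.3 \times 10^{6}\ \text{B/s}}{12.5 \times 10^{6}\ \text{B/s}} \approx 0.98
```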

"Real-World" Test
● 13-node test setup
● Simulation of the read-out and processing of 1/36 (one slice) of the ALICE Time Projection Chamber (TPC)
● Simulated piled-up (overlapping) proton-proton events
● Target processing rate: 200 Hz (the maximum read-out rate of the TPC)

"Real-World" Test
[Setup diagram: per TPC patch, a chain of File Publisher (FP), ADC Unpacker (AU), and Cluster Finder (CF) feeds an Event Scatterer (ES) with two Trackers (T, 2x) and an Event Gatherer (EG); the patch results are combined by Event Mergers (EM) acting as Patch Mergers (PM) and finally by an Event Merger acting as Slice Merger (SM)]

"Real-World" Test
Example of the ADC data handled by the first stage (File Publisher, ADC Unpacker, Cluster Finder): the zero-suppressed, run-length-encoded input
1, 2, 123, 255, 100, 30, 5, 4*0, 1, 4, 3*0, 4, 1, 2, 60, 130, 30, 5, ...
is unpacked to
1, 2, 123, 255, 100, 30, 5, 0, 0, 0, 0, 1, 4, 0, 0, 0, 4, 1, 2, 60, 130, 30, 5, ...

"Real-World" Test
● The first line of nodes unpacks the zero-suppressed, run-length-encoded ADC values and calculates the 3D space-point coordinates of the charge depositions
● The second line of nodes finds the curved particle track segments going through the charge space-points
● The third (line of) node(s) connects the track segments to form complete tracks (track merging)

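A hedged sketch of the unpacking step illustrated by the ADC example above; the token layout below is invented for this sketch and is not the actual TPC raw data format.

```cpp
// Expand run-length-encoded ADC values ("4*0" -> four zeros) into the full
// sequence, as in the example above. Invented encoding, for illustration only.
#include <cstdint>
#include <vector>

struct RlToken {
    uint16_t count;  // how many times 'value' is repeated
    uint16_t value;  // the ADC value itself
};

std::vector<uint16_t> UnpackAdc(const std::vector<RlToken>& encoded) {
    std::vector<uint16_t> adc;
    for (const RlToken& t : encoded)
        adc.insert(adc.end(), t.count, t.value);  // append 'count' copies of 'value'
    return adc;
}

// Example: {1,1},{1,2},{1,123},{1,255},{1,100},{1,30},{1,5},{4,0},{1,1},{1,4},{3,0},...
// yields    1,    2,    123,    255,    100,    30,    5,   0,0,0,0, 1,    4,   0,0,0,...
```
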
Test Results
Rate: 270 Hz
Per processing stage (as annotated in the setup figure):
● CPU load: 2 × 100%, network: 6 × 130 kB/s of event data
● CPU load: 2 × 75%-100%, network: 6 × 1.8 MB/s of event data
● CPU load: 2 × 60%-70%

Conclusion & Outlook
● The framework allows flexible creation of data flow chains while still maintaining efficiency
● No dependencies are created during compilation
● Applicable to a wide range of tasks
● Performance is already good enough for many applications using TCP on Fast Ethernet
Future work: use the dynamic configuration ability for fault tolerance purposes; further performance improvements.
More information: