A Framework for Building Distributed Data Flow Chains in Clusters
Timm Morten Steinbeck, Computer Science/Computer Engineering Group, Kirchhoff Institute for Physics, Ruprecht-Karls-University Heidelberg
Requirements
ALICE: a relativistic heavy-ion physics experiment
Very large multiplicity of particles per event
Full event size > 70 MB
Last trigger stage (High Level Trigger, HLT) is the first stage with complete event data
Data rate into the HLT up to 25 GB/s
High Level Trigger
The HLT has to process a large volume of data: it has to perform the reconstruction of particle tracks from raw ADC data.
(Slide illustration: a sequence of raw ADC values, e.g. 1, 2, 123, 255, 100, 30, 5, ..., and the particle tracks reconstructed from them.)
High Level Trigger
The HLT consists of a Linux PC farm (… nodes) with a fast network
Nodes are arranged hierarchically
The first stage reads data from the detector and performs first-level processing
Each stage sends its produced data to the next stage
Each further stage performs processing or merging on its input data
High Level Trigger
The HLT needs a software framework to transport data through the PC farm.
Requirements for the framework:
Efficiency:
The framework should not use too many CPU cycles (which are needed for data analysis)
The framework should transport the data as fast as possible
Flexibility:
The framework should consist of components which can be plugged together in different configurations
The framework should allow reconfiguration at runtime
The Data Flow Framework
Components are single processes
Communication based on the publisher-subscriber principle: one publisher can serve multiple subscribers
(Diagram: several subscribers send Subscribe requests to one publisher; the publisher then announces each new event ("New Data") to all of its subscribers.)
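The following minimal C++ sketch illustrates the publisher-subscriber principle described on this slide: one publisher announcing new events to several subscribers. All class and method names (Publisher, Subscriber, NewData, ...) are illustrative assumptions, not the framework's actual API, and everything runs in a single process for simplicity.

```cpp
// Minimal in-process sketch of the publisher-subscriber principle:
// one publisher, several subscribers, "new data" announcements.
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <vector>

struct EventDescriptor {        // what a subscriber is told about a new event
    uint64_t eventID;
    size_t   offset;            // location of the data (e.g. in shared memory)
    size_t   size;
};

class Subscriber {
public:
    virtual ~Subscriber() = default;
    virtual void NewData(const EventDescriptor& desc) = 0;  // announcement callback
};

class Publisher {
public:
    void Subscribe(Subscriber* s) { fSubscribers.push_back(s); }
    void AnnounceEvent(const EventDescriptor& desc) {
        for (Subscriber* s : fSubscribers)   // one publisher serves many subscribers
            s->NewData(desc);
    }
private:
    std::vector<Subscriber*> fSubscribers;
};

class PrintingSubscriber : public Subscriber {
public:
    void NewData(const EventDescriptor& d) override {
        std::cout << "Got event " << d.eventID << " (" << d.size << " bytes)\n";
    }
};

int main() {
    Publisher pub;
    PrintingSubscriber s1, s2, s3;
    pub.Subscribe(&s1); pub.Subscribe(&s2); pub.Subscribe(&s3);
    pub.AnnounceEvent({42, 0, 128});   // all three subscribers are notified
}
```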
Framework Architecture
The framework consists of mutually dependent packages:
Utility Classes
Communication Classes
Publisher & Subscriber Base Classes
Data Source, Sink, & Processing Components
Data Source, Sink, & Processing Templates
Data Flow Components
Utility Classes
General helper classes (e.g. timer, thread)
Rewritten classes:
Thread-safe string class
Faster vector class
Communication Classes
Two abstract base classes:
For small message transfers
For large data blocks
Derived classes for each network technology
Currently existing: message and block classes for TCP and SCI (Scalable Coherent Interface)
(Class diagram: Message Base Class with TCP Message Class and SCI Message Class derived from it; Block Base Class with TCP Block Class and SCI Block Class derived from it.)
Communication Classes
Implementations foreseen for the ATOLL network (University of Mannheim) and the Scheduled Transfer Protocol
API partly modelled after the socket API (Bind, Connect)
Both implicit and explicit connection possible:
Explicit connection: user calls Connect, Send, Disconnect; system calls connect, transfer data, disconnect
Implicit connection: user calls Send; system calls connect, transfer data, disconnect
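A rough sketch of what such a communication class hierarchy could look like: an abstract small-message base class modelled after the socket API, and a trivial derived class standing in for a real technology such as TCP or SCI. The names and signatures are assumptions for illustration only, not the framework's real interface.

```cpp
// Abstract small-message transfer class plus a dummy "technology" that only
// logs calls; a real TCP or SCI derived class would issue the corresponding
// network operations instead.
#include <iostream>
#include <string>
#include <vector>

class MsgCommunication {
public:
    virtual ~MsgCommunication() = default;
    virtual bool Bind(const std::string& localAddr) = 0;
    virtual bool Connect(const std::string& remoteAddr) = 0;   // explicit connection
    virtual bool Send(const std::vector<char>& msg) = 0;       // may connect implicitly
    virtual void Disconnect() = 0;
};

class LoggingMsgCommunication : public MsgCommunication {
public:
    bool Bind(const std::string& a) override { std::cout << "bind " << a << "\n"; return true; }
    bool Connect(const std::string& a) override {
        connected = true; std::cout << "connect " << a << "\n"; return true;
    }
    bool Send(const std::vector<char>& m) override {
        bool implicit = !connected;          // implicit connection: Send does it all
        if (implicit) Connect("peer");
        std::cout << "send " << m.size() << " bytes\n";
        if (implicit) Disconnect();
        return true;
    }
    void Disconnect() override { connected = false; std::cout << "disconnect\n"; }
private:
    bool connected = false;
};

int main() {
    LoggingMsgCommunication c;
    std::vector<char> msg(32, 'x');
    c.Connect("nodeB:4711"); c.Send(msg); c.Disconnect();   // explicit connection
    c.Send(msg);                                            // implicit connection
}
```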
Publisher Subscriber Classes
Implement the publisher-subscriber interface
Abstract interface for the communication mechanism between processes/modules
Currently named pipes or shared memory available
Multi-threaded implementation
(Diagram: same publisher-subscriber principle as above: subscribers subscribe to a publisher, which announces new data to all of them.)
Publisher Subscriber Classes
Efficiency: event data is not sent to subscriber components
The publisher process places the data into shared memory (ideally already during production)
Descriptors holding the location of the data in shared memory are sent to the subscribers
Requires buffer management in the publisher
(Diagram: publisher and subscriber attached to a shared memory segment; for event M the publisher announces a "New Event" descriptor referencing data blocks 0..n in the shared memory.)
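The descriptor mechanism can be sketched as follows: the event data stays in a (shared) memory buffer, and only a small descriptor holding the location and size of each data block is passed to the subscribers. The struct layout and names are illustrative assumptions, and a plain byte vector stands in for the real shared memory segment.

```cpp
// Event data stays in "shared memory"; only small descriptors travel.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <vector>

struct BlockDescriptor {
    uint32_t shmKey;     // which shared memory segment holds the block
    uint64_t offset;     // where the block starts inside that segment
    uint64_t size;       // block size in bytes
};

struct EventDescriptor {
    uint64_t eventID;
    std::vector<BlockDescriptor> blocks;   // one event may consist of several blocks
};

int main() {
    std::vector<char> shm(16 * 1024 * 1024);   // stand-in for a 16 MB shared memory buffer

    // "Publisher": place event data directly into the buffer ...
    const char payload[] = "raw ADC data ...";
    std::memcpy(shm.data() + 4096, payload, sizeof(payload));

    // ... and build the small descriptor that is actually sent to subscribers.
    EventDescriptor desc{7, {{/*shmKey=*/1, /*offset=*/4096, /*size=*/sizeof(payload)}}};

    // "Subscriber": locate the data via the descriptor, without copying the payload.
    const char* data = shm.data() + desc.blocks[0].offset;
    std::cout << "Event " << desc.eventID << ": " << data << "\n";
}
```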
Data Flow Components
Several components to shape the data flow in a chain:
EventMerger: merges data streams belonging to one event (e.g. Event N, Part 1, Part 2, Part 3 into Event N)
EventScatterer / EventGatherer: split and rejoin a data stream (e.g. for load balancing), distributing events over several processing paths and collecting them again
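A toy sketch of the EventScatterer / EventGatherer idea: the scatterer distributes consecutive events over several processing paths (simple round-robin is assumed here), and the gatherer collects them back into one stream. The real components' distribution policy and event re-ordering are not shown.

```cpp
// Round-robin scattering of events over two paths, then gathering them back.
#include <cstdint>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    const int nPaths = 2;
    std::vector<std::queue<uint64_t>> paths(nPaths);   // stand-ins for downstream chains

    // EventScatterer: send event i to path i % nPaths.
    for (uint64_t event = 1; event <= 6; ++event)
        paths[event % nPaths].push(event);

    // EventGatherer: collect events from all paths into one stream again
    // (no re-ordering in this simplified sketch).
    std::vector<uint64_t> gathered;
    for (auto& p : paths)
        while (!p.empty()) { gathered.push_back(p.front()); p.pop(); }

    for (uint64_t e : gathered) std::cout << "gathered event " << e << "\n";
}
```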
Data Flow Components
Several components to shape the data flow in a chain:
Bridge: transparently transports data over the network to other computers
The SubscriberBridgeHead has a subscriber class for incoming data; the PublisherBridgeHead uses a publisher class to announce the data on the receiving node
(Diagram: SubscriberBridgeHead on node 1, connected over the network to the PublisherBridgeHead on node 2.)
Component Template
Templates for user components are provided:
Data Source Template: reads out data from a source and inserts it into a chain (data source addressing & handling, buffer management, data announcing)
Analysis Template: accepts data from the chain, processes it, and reinserts the result (accepting input data, input data addressing, data analysis & processing, output buffer management, output data announcing)
Data Sink Template: accepts data from the chain and processes it, e.g. stores it (accepting input data, input data addressing, data sink addressing & writing)
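The division of labour in the Analysis Template can be sketched roughly as follows: the template handles accepting input events and announcing the produced output, while the user component only implements the actual processing step. Class and method names (AnalysisComponent, ProcessEvent, ...) are illustrative assumptions, not the real template interface.

```cpp
// Template handles input handling and output announcing; the user fills in
// the processing step.
#include <cstdint>
#include <iostream>
#include <vector>

class AnalysisComponent {
public:
    virtual ~AnalysisComponent() = default;

    // Framework side (simplified): called for each incoming event.
    void NewEvent(uint64_t eventID, const std::vector<char>& input) {
        std::vector<char> output;
        ProcessEvent(input, output);        // user-supplied analysis code
        Announce(eventID, output);          // template takes care of forwarding
    }

protected:
    // The only part a user component has to provide.
    virtual void ProcessEvent(const std::vector<char>& in, std::vector<char>& out) = 0;

private:
    void Announce(uint64_t id, const std::vector<char>& out) {
        std::cout << "event " << id << ": announcing " << out.size() << " output bytes\n";
    }
};

// Example user component: trivially copies its input.
class CopyAnalysis : public AnalysisComponent {
protected:
    void ProcessEvent(const std::vector<char>& in, std::vector<char>& out) override {
        out = in;
    }
};

int main() {
    CopyAnalysis a;
    a.NewEvent(1, std::vector<char>(512, 0));
}
```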
Benchmarks
Performance tests of the framework
Dual Pentium III, 733 MHz - 800 MHz
Tyan Thunder 2500 or Thunder HEsl motherboard, ServerWorks HE/HEsl chipset
512 MB RAM, >20 GB disk space (system, swap, tmp)
Fast Ethernet, switch with 21 Gb/s backplane
SuSE Linux 7.2 with kernel …
Benchmarks
Benchmark of the publisher-subscriber interface
Publisher process with 16 MB output buffer, event size 128 B
Publisher does buffer management and copies data into the buffer; subscriber just replies to each event
Maximum performance: more than 12.5 kHz
Benchmarks
Benchmark of the TCP message class
Client sending messages to a server on another PC
TCP over Fast Ethernet
Message size 32 B
Maximum message rate: more than 45 kHz
Benchmarks
Publisher-Subscriber Network Benchmark
Publisher on node A, subscriber on node B
Connected via a Bridge, TCP over Fast Ethernet
31 MB buffer in the publisher and in the receiving bridge
Message size from 128 B to 1 MB
(Diagram: Publisher and SubscriberBridgeHead on node A, connected over the network to the PublisherBridgeHead and Subscriber on node B.)
Benchmarks
Publisher-Subscriber Network Benchmark
Notes on the plots: dual-CPU nodes, so 100% CPU load corresponds to 2 fully used CPUs
Theoretical rate: 100 Mb/s divided by the event size
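As a worked example of this scaling (assuming the full 100 Mb/s of Fast Ethernet, i.e. 12.5*10^6 B/s, is usable and ignoring protocol overhead):
Theoretical rate = 100 Mb/s / event size = 12.5*10^6 B/s / event size
For 128 B events: 12.5*10^6 B/s / 128 B ≈ 97.7 kHz
For 1 MB (10^6 B) events: 12.5*10^6 B/s / 10^6 B = 12.5 Hz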
Benchmarks
CPU load and event rate decrease with larger blocks
Receiver is more loaded than the sender
Minimum CPU load: 20% on the sender, 30% on the receiver
Maximum CPU load (128 B events): 90% on the receiver, 80% on the sender
Management of more than … events in the system!
Benchmarks
Publisher-Subscriber Network Benchmark (results plot)
Benchmarks
From 32 kB event size on, the network bandwidth is the limit
At 32 kB event size: more than 10 MB/s
At the maximum event size: 12.3*10^6 B/s out of the theoretical 12.5*10^6 B/s
"Real-World" Test 13 node test setup Simulation of read-out and processing of 1/36 (slice) of Alice Time Projection Chamber (TPC) Simulated piled-up (overlapped) proton-proton events Target processing rate: 200 Hz (Maximum read out rate of TPC)
"Real-World" Test EG ES CFAUFP CF AU FP CF AU FP CF AU FP CF AU FP CF AU FP EG ES T T EG ES EG ES T T EG ES T T T T EGESFPCFAU T T EGES T T ADC Unpacker Event Scatterer Tracker (2x) EM PM EM SM EM PM EM SM Patch Merger Slice Merger Patches TT Event Merger Event Gatherer Cluster Finder File Publisher
"Real-World" Test CF AUFP 1, 2, 123, 255, 100, 30, 5, 4*0, 1, 4, 3*0, 4, 1, 2, 60, 130, 30, 5, , 2, 123, 255, 100, 30, 5, 0, 0, 0, 0, 1, 4, 0, 0, 0, 4, 1, 2, 60, 130, 30, 5, EG ES T T EM PM EM SM
"Real-Word" Test Third (line of) node(s) connects track segments to form complete tracks (Track Merging) Second line of nodes finds curved particle track segments going through charge space-points First line of nodes unpacks zero- suppressed, run-length-encoded ADC values and calculates 3D space-point coordinates of charge depositions CF AUFP EG ES T T EM PM EM SM
Test Results
Overall rate: 270 Hz
Per-stage annotations from the results diagram:
CPU load: 2 * 100%
Network: 6 * 130 kB/s event data
CPU load: 2 * 75% - 100%
Network: 6 * 1.8 MB/s event data
CPU load: 2 * 60% - 70%
Conclusion & Outlook
The framework allows flexible creation of data flow chains while still maintaining efficiency
No dependencies between components are created during compilation
Applicable to a wide range of tasks
Performance is already good enough for many applications using TCP on Fast Ethernet
Future work: use of the dynamic configuration ability for fault tolerance, further performance improvements
More information: …