Data Recording Model at XFEL
CRISP 2nd Annual Meeting, March 18-19, 2013
Djelloul Boukhelef

Outline
- Purpose and scope
- Hardware setup
- Software architecture
- Experiments & results
  – Network
  – Storage
- Summary & outlook

Purpose and present scope
- Build a prototype of a fully featured DAQ/DM/SC system
  – Select and install adequate hardware, develop software, and test all of the system's properties: control, DAQ, DM, and SC systems
- The current prototype focuses on:
  – Data acquisition, pre-processing, formatting, and storage
  – Assessing the performance and stability of the hardware and software
    - Network: bandwidth (10 Gbps), UDP packet loss, TCP behavior, ...
    - Processing: concurrent read, processing, and write operations, ...
    - Storage: disk write performance, concurrent IO operations, ...
- Software development
  – Application architecture: processing pipeline, communication, ...
  – Design for performance, robustness, scalability, flexibility, ...

Hardware setup
Details were presented at the IT&DM meeting in October.
- The 2D detector generates ~10 GB of image data per second
- Data is multiplexed onto 16 channels (10 GbE)
- 1 GB / 1.6 s = 640 MB/s per channel
- Plus lots of other slow data streams
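As a quick consistency check (my arithmetic, not from the slide), the per-channel figure multiplies back to the detector's aggregate rate:

$$16\ \text{channels} \times \frac{1\ \text{GB}}{1.6\ \text{s}} = 16 \times 640\ \text{MB/s} = 10\ \text{GB/s}$$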

SOFTWARE ARCHITECTURE

Overview
- The current prototype consists of three software components:
  – Data feeder
    - Without the TB board: a train builder emulator feeds the PCL with train data
    - With the TB board: feeds the TB with detector data
  – PC layer software
    - Acquires, pre-processes, reduces, monitors, and formats data, then sends it to storage and SC
  – Storage service
- Device/DeviceServer model used to build a distributed control system
  – PSR, flexibility, and configurability
[Diagram: M data feeders (train builder emulator) → UDP → N PCL nodes (PC layer) → TCP → S storage nodes (storage service)]

Architecture overview
- Data-driven model
- Configurations (XML files)
[Diagram: timer server → (TCP) train builder layer (TB emulator master, data feeders 1..M) → (UDP) PC layer (PCL nodes 1..N) → (TCP) storage layer (storage nodes 1..N); configuration covers groups (id, rate, ...), net-timer, PCL nodes, folder naming convention, train metadata, and data files; monitoring covers CPU, queues, and network]

Processing pipeline
- Distributed and parallel processing
- Pipelining and multithreading on multi-core machines
[Diagram: acquisition (train builder: generate, build, send) → processing (PC layer: receive, process, format, send) → storage (receive, write), with branches to online analysis and SC; each stage runs two threads T1/T2]
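The slides do not show the prototype's code; the following is a minimal C++ sketch, under my own naming, of the pipelining idea: stages connected by blocking queues so that receive, process/format, and write overlap on separate cores.

```cpp
// Minimal sketch (illustrative, not the prototype's code): pipeline stages
// hand train ids to each other through blocking queues, so the receive,
// process, and write stages run concurrently on a multi-core machine.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class BlockingQueue {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
};

int main() {
    BlockingQueue<int> processQ, writeQ;   // queues carry train ids

    std::thread receiver([&] {             // stage 1: acquire trains
        for (int trainId = 0; trainId < 5; ++trainId) processQ.push(trainId);
        processQ.push(-1);                 // sentinel: end of run
    });
    std::thread processor([&] {            // stage 2: process and format
        for (int id; (id = processQ.pop()) != -1; ) writeQ.push(id);
        writeQ.push(-1);
    });
    std::thread writer([&] {               // stage 3: write to storage
        for (int id; (id = writeQ.pop()) != -1; )
            std::cout << "wrote train " << id << "\n";
    });

    receiver.join(); processor.join(); writer.join();
}
```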

Data Feeder
[Diagram: the detector emulator (image data generator + detector data generator) either generates and stores images with CImg or loads pre-generated data files from disk (offline); it fills the detector data queue (pointers) and image data queue (pointers); the train builder assembles trains into a raw data buffer and a trains buffer, tracked by a train data queue (tokens); the packetizer serves DAQ requests by sending packets (UDP) to the PCL nodes, paced by clock ticks (TCP) from the timer server]

PC-Layer node
- Simple processing for now: checksum; real algorithms are still needed
[Diagram: the de-packetizer receives packets (UDP) from the train builder into a trains buffer; the processing pipeline (monitoring, reduction, ...) consumes a process queue (train id) and feeds a format queue (train id); the formatter builds in-memory files; the writer consumes a write queue (file name) and streams over TCP to online storage; statistics are collected throughout]

Storage server
- Memory ring buffer
  – Buffer size: 16 GB
  – Slot (chunk) size: 1 GB
- Thread pool
  – Reader: reads the data stream (TCP) and stores it into the memory buffer
  – Writer: writes filled buffer slots to disk (cache)
  – Sync: flushes the cache to disk (physical writing)
- IO modes
  – Cached: issues IO commands via the OS
  – Direct: talks directly to the IO device
  – Splice: data transfer within kernel space
[Diagram: the reader gets a free slot from the ring buffer and fills it from the TCP network; the writer flushes filled slots to the disk array]
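A minimal sketch of the ring-buffer discipline described above, assuming a single reader and a single writer thread; the names and the lock-free counter scheme are illustrative, not the prototype's actual code (shrink the sizes for local testing):

```cpp
// Illustrative single-reader/single-writer ring of fixed slots, mirroring
// the storage server's reader/writer split. Sizes follow the slide:
// 16 slots of 1 GB = a 16 GB buffer.
#include <atomic>
#include <cstddef>
#include <vector>

constexpr std::size_t kSlotSize = 1UL << 30;  // 1 GB slot (chunk)
constexpr std::size_t kSlots    = 16;         // 16 GB buffer in total

class RingBuffer {
public:
    RingBuffer() : slots_(kSlots, std::vector<char>(kSlotSize)) {}

    // Reader thread: a free slot to fill from the TCP stream (nullptr if full).
    char* acquireFree() {
        if (head_ - tail_.load(std::memory_order_acquire) == kSlots) return nullptr;
        return slots_[head_ % kSlots].data();
    }
    void publishFilled() { head_.fetch_add(1, std::memory_order_release); }

    // Writer thread: the oldest filled slot to flush to disk (nullptr if empty).
    char* acquireFilled() {
        if (tail_ == head_.load(std::memory_order_acquire)) return nullptr;
        return slots_[tail_ % kSlots].data();
    }
    void releaseFlushed() { tail_.fetch_add(1, std::memory_order_release); }

private:
    std::vector<std::vector<char>> slots_;
    std::atomic<std::size_t> head_{0};  // slots filled by the reader so far
    std::atomic<std::size_t> tail_{0};  // slots flushed by the writer so far
};
```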

NETWORK PERFORMANCE

Train Transfer Protocol (TTP)
- Train data format
  – Header, images & descriptors, detector data, trailer
- Train Transfer Protocol (TTP)
  – Based on UDP: designed for fast data transfer where TCP is not implemented or not suitable (overhead, delays)
  – Transfers are identified by unique identifiers (frame number)
  – Packetization: bundle the data block (frame) into small packets tagged with increasing packet numbers
  – Flags: SoF, EoF, padding
  – Packet trailer mode
[Diagram: each packet carries 8 KB of data plus a 4-byte frame number, a 3-byte packet number, and 1 byte of SoF/EoF flags; a train is split into packets 0 (SoF) through 8 (EoF)]
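Based only on the field sizes given on this slide, a TTP packet might be modeled as below; the exact wire layout, packing, and byte order are my assumptions:

```cpp
// A model of a TTP packet from the sizes on the slide: 8 KB of payload
// followed by an 8-byte trailer ("packet trailer mode") with a 4-byte
// frame number, a 3-byte packet number, and 1 flag byte. Not the
// documented wire format.
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPayload = 8 * 1024;

enum TtpFlags : std::uint8_t { kSoF = 0x01, kEoF = 0x02, kPadding = 0x04 };

#pragma pack(push, 1)
struct TtpTrailer {
    std::uint32_t frame;      // transfer id, identical for all packets of a train
    std::uint8_t  packet[3];  // 24-bit packet number, increasing within a frame
    std::uint8_t  flags;      // SoF on the first packet, EoF on the last
};
struct TtpPacket {
    char       data[kPayload];
    TtpTrailer trailer;       // metadata after the payload
};
#pragma pack(pop)

static_assert(sizeof(TtpTrailer) == 8, "trailer must be 8 bytes");

// Encode a 24-bit packet number (little-endian here by assumption).
inline void setPacketNumber(TtpTrailer& t, std::uint32_t n) {
    t.packet[0] = n & 0xFF;
    t.packet[1] = (n >> 8) & 0xFF;
    t.packet[2] = (n >> 16) & 0xFF;
}
```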

Previous results (reminder)
- Two types of programs run in parallel on all machines
  – Feeders: generate train data, packetize it, and send the packets using UDP
  – Receivers: reconstruct train data from the packets and store it in memory (overwrite)
- Run length (#packets/total time):
  – Typical: 3.5×10^8 to 2.5×10^9 packets (a few hours)
  – Maximum: 5×10^9 packets (16 h 37 m)
- Time profile
  – XFEL: 10 MHz for 16 channels, i.e. send 1 train (~ packets) within 1.6 s
  – Continuous: no waiting time between train sendings
[Diagram: unidirectional stream and concurrent send/receive between two PCL nodes]

Previous results (reminder)
- Network transfer rate: 1 GB train in 0.87 s ≈ 9.9 Gbps
- CPU usage (i.e. of the receiver core): ≈ 40%
- Packet loss
  – A few packets (tens to hundreds) are sometimes lost per run
  – It happens only at the beginning of some runs (not of trains)
  – Observed sometimes on all machines, sometimes on some machines only; we have also had runs with no packet loss on any machine
  – Ignoring the first lost packets, which affect only the first train:
    - Typical run (3.5×10^8 packets): fewer than 3.7 affected trains per 10^5 trains
    - Long run (5×10^9 packets): fewer than 26 affected trains out of one million

Train switching
- In previous experiments:
  – Each feeder was configured to feed one PC-layer node (one-to-one)
  – Packet loss appears at the start of a run
  – In the TB, trains are sent out through different channels every time: 10 trains over 16 channels per second
- Question:
  – What happens if the feeder sends train data to a different PC-layer node every time?
[Diagram: one-to-one setup (feeders st101/st102 → PC layer st104/st105) vs. switched setup (feeders st101-st103 → PC layer st104/st105), both over a 10 GbE switch, TTP, sub-net 1]

Train switching
- Test configuration
  – 3 feeder nodes
    - Pre-load images from disk
    - Build train data (header, trailer, ...)
    - Calculate checksum (Adler32)
    - Packetize train data (TTP)
  – 2 PC-layer nodes
    - Depacketize (TTP)
    - No processing is performed
    - Format to HDF5 file
    - Stream files through TCP (splice)
  – 2 storage nodes
    - Write files to shared memory (splice)
[Diagram: timer (st401) + train builder; feeders st101-st103 → TTP over sub-net 1 → PC layer st104/st105 → TCP over sub-net 2 → storage st106/st107, via a 10 GbE switch]
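For the Adler32 step, zlib provides the checksum directly; a minimal sketch (the buffer contents are a stand-in for real train data):

```cpp
// Checksumming train data with zlib's adler32(), as the feeders do.
// Compile with -lz.
#include <zlib.h>
#include <cstdio>
#include <vector>

int main() {
    std::vector<unsigned char> train(1024 * 1024, 0x42);  // stand-in for train data
    uLong sum = adler32(0L, Z_NULL, 0);                   // initial seed value
    sum = adler32(sum, train.data(), static_cast<uInt>(train.size()));
    std::printf("adler32 = %08lx\n", sum);
}
```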

Train switching
- 3 feeders feed 2 PC-layer nodes in a round-robin manner
  – Rate: 2 trains every 1.6 seconds
    - A PC-layer node receives 1 train every 1.6 s, each time from a different feeder
    - A feeder sends out 1 train every 2.4 s, each time to a different IP address
  – The packetizer checks the send buffer in order to avoid overwriting previous (not yet sent) packets, e.g. every 100 packets
  – All feeder-to-PC-layer data transfers are done on the same sub-network
  – Train transfer time is 0.88 s, i.e. two consecutive trains overlap for 0.08 s (9% of the time)
[Diagram/table: feeder-to-PCL schedule over time; feeders st101-st103 → PC layer st104/st105 via a 10 GbE switch, TTP, sub-net 1]
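A sketch of the round-robin switching on the feeder side; the host addresses, port, and names are placeholders, and the real packetizer logic (full-train packetization, send-buffer checking) is elided:

```cpp
// Sketch of round-robin train switching: the feeder rotates through the
// PC-layer nodes, directing each train's UDP packets at the next node.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <string>
#include <vector>

int main() {
    // Two PC-layer nodes (st104, st105 on the slide; IPs assumed).
    std::vector<std::string> pclNodes = {"192.168.1.104", "192.168.1.105"};
    int sock = socket(AF_INET, SOCK_DGRAM, 0);  // UDP, as TTP requires

    char packet[8192 + 8] = {};                 // 8 KB payload + 8-byte TTP trailer
    for (unsigned train = 0; train < 10; ++train) {
        sockaddr_in dst{};
        dst.sin_family = AF_INET;
        dst.sin_port   = htons(4600);           // port number is an assumption
        inet_pton(AF_INET, pclNodes[train % pclNodes.size()].c_str(), &dst.sin_addr);

        // A real feeder would send all packets of the 1 GB train here;
        // one sendto() stands in for the whole train.
        sendto(sock, packet, sizeof(packet), 0,
               reinterpret_cast<sockaddr*>(&dst), sizeof(dst));
    }
    close(sock);
}
```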

Experiment
- Total run time
  – Short: less than half an hour
  – Long: 18 hours (81657 trains, ~80 TB)
- Observations:
  – 6 trains were affected at the beginning of the run for each PC-layer node; after that the run continued smoothly with no train loss until 1 am
  – At 8 am, 2 more trains were affected at one PC layer (probably due to the nightly update); no train loss on the other node
  – The run then continued very stably until the end
  – Network send and receive (TTP and TCP) were very stable
  – Formatting time was not stable all the time

Summary
- Results:
  – Trains sent: 81657 (27219 trains per feeder)
  – PCLayer01: received 40820, 8 affected
  – PCLayer02: received 40823, 6 affected
  – Train size: 1 GB (image data) + … MB (header, descriptors, detector data, and trailer)
  – Packets per train: …
  – Transfer time: … s
  – Transfer rate: … GB/s = … Gbps
- Sustainable and stable network bandwidth

STORAGE PERFORMANCE

Direct IO vs Cached IO
- Cached IO
  – Issues read/write commands via the OS kernel, which executes the IO; data is copied to/from the page cache
- Zero-copy operation
  – Splices socket and file descriptors; the data transfer is performed within kernel space (transparent)
  – Problem using the Linux splice function with IBM/Mellanox; we couldn't figure out the reason
- Direct IO
  – Performs reads/writes directly to/from the device, bypassing the page cache
  – DMA: memory alignment (512)
  – RAID: stripe size (64K), number of disks
  – Per file vs. per partition (Linux)
[Diagram: cached IO path through the kernel buffer and page cache vs. direct IO path from the application buffer to the device driver via page remapping]
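A minimal Linux sketch of the direct-IO path under the alignment constraints listed above (512-byte DMA alignment, chunk size a multiple of the 64 KB stripe); the file path and sizes are illustrative:

```cpp
// Direct-IO write on Linux: O_DIRECT bypasses the page cache, so the buffer
// address, transfer length, and file offset must all be aligned (512 bytes
// here, per the slide). Error handling is trimmed.
#include <fcntl.h>     // open, O_DIRECT (Linux-specific)
#include <unistd.h>    // write, close
#include <cstdlib>     // posix_memalign, free
#include <cstring>     // memset

int main() {
    const size_t kAlign = 512;          // DMA alignment constraint
    const size_t kChunk = 64 * 1024;    // a multiple of the 64K RAID stripe

    void* buf = nullptr;
    if (posix_memalign(&buf, kAlign, kChunk) != 0) return 1;
    std::memset(buf, 0xAB, kChunk);     // stand-in for train data

    int fd = open("/data/train.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { std::free(buf); return 1; }
    ssize_t n = write(fd, buf, kChunk); // fails with EINVAL if unaligned
    close(fd);
    std::free(buf);
    return (n == static_cast<ssize_t>(kChunk)) ? 0 : 1;
}
```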

Cached IO
- Run length: 2+ hours; buffer size: 4 GB
- Method: read all file data into one buffer, write it once, then sync
[Plots: write time vs. file number for 1 GB files at a 1.6 s period and 2 GB files at a 3.2 s period, both starting from an empty disk]

Run configuration
- Two types of programs run in parallel on four machines
  – Sender: opens an in-memory data file, streams its content using TCP, closes the file
  – Receiver: reads file data from the socket into a RAM buffer (16 GB), writes it to disk
- Run length
  – Typical: 4.5 TB (2 hours)
  – Maximum: until the disk is full; 9 TB (4 hours) and 28 TB (12 hours)
  – Disks are cleaned before every run
- Time profile
  – 1 GB per 1.6 s per box
[Diagram: PCL nodes 101-104 (Dell machines) → storage nodes 201-204 (IBM machines, internal and external disks); plot of size (GB) vs. time (s)]
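The sender's zero-copy streaming can be sketched with Linux splice(), which moves data between a file and a TCP socket inside the kernel via an intermediate pipe (splice requires one end of each call to be a pipe); error paths are trimmed:

```cpp
// Zero-copy file-to-socket streaming with splice(): data never enters
// user space; it flows file -> pipe -> socket inside the kernel.
#include <fcntl.h>   // splice, SPLICE_F_MOVE, SPLICE_F_MORE
#include <unistd.h>  // pipe, close
#include <cstddef>

// Stream 'count' bytes from 'fileFd' to the connected TCP socket 'sockFd'.
bool spliceFileToSocket(int fileFd, int sockFd, size_t count) {
    int pfd[2];
    if (pipe(pfd) < 0) return false;
    while (count > 0) {
        // File -> pipe (stays in kernel space).
        ssize_t in = splice(fileFd, nullptr, pfd[1], nullptr,
                            count, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0) break;
        // Pipe -> socket, draining everything just spliced in.
        for (ssize_t left = in; left > 0; ) {
            ssize_t out = splice(pfd[0], nullptr, sockFd, nullptr,
                                 static_cast<size_t>(left),
                                 SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0) { close(pfd[0]); close(pfd[1]); return false; }
            left -= out;
        }
        count -= static_cast<size_t>(in);
    }
    close(pfd[0]); close(pfd[1]);
    return count == 0;
}
```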

Direct IO, external storage
- Storage extension box: 3 TB 7.2 Krpm 6 Gbps NL-SAS drives, RAID6
[Table: per file size (GB), average/max network rate (Gbps), storage rate (GB/s), network read time (s), disk write time (s), and overall time (s); the values did not survive the transcript]

Direct IO, internal storage
- Internal disks: 14 × 900 GB 10 Krpm 6 Gbps SAS, RAID6
[Table: same columns as the external-storage table; the values did not survive the transcript]

Long run experiments
- Internal storage: 9 TB, 918 files, 4 hours
- External storage: 28 TB, 2792 files, 12 hours
- Buffer size: 16 GB; slot size: 1 GB; file size: 10 GB; time profile: 16 s
[Plots: network read, disk write, and overall time per file for both runs, with averages]

Statistics from Ganglia
- Long run experiment (5:24 am - 5:29 pm)
- Host: exflst201 (with external storage)
  – Disk write: … MB/s
  – Network bandwidth: 676.8 MB/s
  – CPU usage: system 5.81%, user 0.39%
    - Reader: system 49.54%, user 0.75%
    - Writer: system 7.03%, user 0.010%
[Ganglia graphs: network, disk, and CPU for the reader (core 0) and writer (core 1)]

Result summary
- We need to write a 1 GB data file within 1.6 s per storage box
  – Both the internal and external storage configurations achieve this rate (1.1 GB/s and 0.97 GB/s, respectively)
  – 16 storage boxes are needed to handle the 10 GB/s train data stream
- High network bandwidth and low CPU load (stable)
- Direct IO:
  – Network read and disk write operations overlap 97% of the time, giving a low overall time per file
  – Application buffer: for big files, the bigger the slot size, the better the disk IO performance (as long as DMA allows)
- To do:
  – Concurrent IO operations: write/write, write/read, file merging
  – Storage manager: file indexing, disk space management

SUMMARY & OUTLOOK

Summary
- The first half of the slice-test hardware is configured and running
- Testing and tuning network and IO performance using
  – System/community tools: netperf, iozone, ...
  – PCL software
  – Train builder board
- TB (emulator) → PCL software (Dell)
  – Bandwidth: 9.9 Gbps (99% of the wire speed)
  – Low UDP packet-loss rate: only a few packets lost at the start of runs; over runs of 3.5×10^8 to 5×10^9 packets, at most 3.7 to 0.26 trains per 10^5 can be affected
- PC layer (Dell) → storage boxes (IBM)
  – TCP data streaming: ~9.8 Gbps
  – Writes terabytes of data to disk at 0.97 to 1.1 GB/s

Outlook
- Fully featured DAQ system
  – Data readout, pre-processing, monitoring, storage
  – Feed the system with real data and apply real algorithms (processing, monitoring, scientific computing)
  – Deployment, configuration, and control: upload libraries, initiate devices, start/stop and monitor runs (device composition)
- Soak and stress testing
  – Test performance (CPU, IO, network), behavior (bugs, memory leaks), reliability (error handling, failure), and stability of the system under a significant workload applied over a long period of time
  – Parallel tasks: forwarding data to online analysis or scientific computing, multiple streams into the same storage server, ...
- Cluster file system vs. DDN vs. local storage system
- Data management: structure, access control, metadata, ...

Thanks!