Data Recording Model at XFEL. CRISP 2nd Annual Meeting, March 18-19, 2013. Djelloul Boukhelef, XFEL.


1 Data Recording Model at XFEL. CRISP 2nd Annual Meeting, March 18-19, 2013. Djelloul Boukhelef, XFEL

2 Outline
Purpose and scope
Hardware setup
Software architecture
Experiments & results
  – Network
  – Storage
Summary & outlook

3 Purpose and present scope
Build a prototype of a fully featured DAQ/DM/SC system
  – Select and install adequate hardware, develop software, and test all system properties: control, DAQ, DM, and SC systems
Current prototype focuses on:
  – Data acquisition, pre-processing, formatting, and storage
  – Assessing the performance and stability of the hardware and software
      Network: bandwidth (10 Gbps), UDP packet loss, TCP behavior, …
      Processing: concurrent read, processing, and write operations, …
      Storage: disk write performance, concurrent IO operations, …
Software development
  – Application architecture: processing pipeline, communication, …
  – Design for performance, robustness, scalability, flexibility, …

4 Hardware setup
Details were presented in the IT&DM meeting in October
The 2D detector generates ~10 GB of image data per second
Data is multiplexed on 16 channels (10GbE)
1 GB per 1.6 sec = 640 MB/s per channel (counting 1 GB as 1024 MB)
Many other slow data streams
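The per-channel figure on this slide can be checked with a few lines of arithmetic. The sketch below assumes 10 trains per second spread evenly over 16 channels (the rate quoted on the train-switching slide) and a 1 GiB train; the function name is illustrative only.

```python
GIB = 2**30
MIB = 2**20

def per_channel_rate(train_bytes: int, channels: int, trains_per_second: float):
    """Return (seconds between trains on one channel, bytes per second on that channel)."""
    period = channels / trains_per_second   # each channel carries every 16th train
    return period, train_bytes / period

# Assumed inputs: 1 GiB trains, 16 channels, 10 trains per second.
period, rate = per_channel_rate(GIB, channels=16, trains_per_second=10)
print(f"{period:.1f} s per train, {rate / MIB:.0f} MiB/s per channel")
```

With binary units this gives the slide's 640 MB/s; with decimal units (1 GB = 1000 MB) the same ratio would be 625 MB/s.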

5 SOFTWARE ARCHITECTURE

6 Overview
Current prototype consists of three software components
  – Data feeder
      Without TB board (Train Builder emulator): feeds the PCL with train data
      With TB board: feeds the TB with detector data
  – PC Layer software
      Acquire, pre-process, reduce, monitor, format, and send data to storage and SC
  – Storage service
Device/DeviceServer model to build a distributed control system
  – PSR, flexibility, and configurability
[Diagram: Data Feeders 1..M (Train Builder Emulator) send over UDP to PCL nodes 1..N (PC Layer), which send over TCP to Storage nodes 1..S (Storage service)]

7 Architecture overview
Data-driven model
Configurations (xml files)
[Diagram: Timer server and Train Builder Emulator (Master) with Data Feeders 1..M in the Train Builder Layer; PCL nodes 1..N in the PC Layer; Storage nodes 1..N in the Storage Layer; links labelled TCP, UDP, TCP. Other labels: Groups (Id, rate, ...); Net-timer; Train metadata; Data files; Folder naming convention; CPU, queues, network, timer]

8 Processing pipeline
Distributed and parallel processing
Pipelining and multithreading on multi-core
[Diagram: consecutive trains T1 and T2 pass through overlapping stages. Stage labels: Generate, Build, Send; Send; Receive, Process, Format; Receive, Write. Layer labels: Acquisition, Train builder, Processing (PC Layer), Storage, Online analysis, SC]

9 Data Feeder (Train Builder Emulator)
[Diagram labels: Timer server; clock-ticks (TCP); DAQ request; Detector emulator; Image Data Generator; Detector Data Generator; Generate & Store (CImg); Data files (offline); Load; Raw data buffer; Images; Image data queue (pointers); Detector data queue (pointers); Train Builder; Trains buffer; Train data queue (tokens); Packetizer; packets (UDP) to PCL nodes]

10 PC-Layer node
Simple processing: checksum
Need real algorithms
[Diagram: Train builder -> packets (UDP) -> De-packetizer -> Trains buffer -> Process queue (train id) -> Processing pipeline (monitoring, reduction, …) -> Format queue (train id) -> Formatter -> in-memory files -> Write queue (file name) -> Writer -> TCP stream -> Online storage. Also: statistics, …]
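The queue-per-stage structure of the PC-Layer node can be sketched with standard threads and queues. This is a minimal illustration under stated assumptions, not the PCL software itself: the stage functions, the sentinel, and the file naming are hypothetical, the checksum stands in for "simple processing", and no HDF5 file is actually written.

```python
import queue
import threading
import zlib

STOP = object()  # sentinel that shuts the pipeline down stage by stage

def stage(work, inbox, outbox):
    """Run one pipeline stage: apply work to each item and pass the result on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        outbox.put(work(item))

def process(train):
    """Placeholder processing step: compute an Adler-32 checksum of the train data."""
    train_id, data = train
    return train_id, data, zlib.adler32(data)

def fmt(train):
    """Placeholder formatting step: attach a (hypothetical) output file name."""
    train_id, data, checksum = train
    return f"train_{train_id:06d}.h5", data, checksum

def run_pipeline(trains):
    """Push trains through process -> format and collect what reaches the write queue."""
    q_process, q_format, q_write = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=stage, args=(process, q_process, q_format)),
        threading.Thread(target=stage, args=(fmt, q_format, q_write)),
    ]
    for t in threads:
        t.start()
    for train in trains:
        q_process.put(train)
    q_process.put(STOP)
    written = []
    while (item := q_write.get()) is not STOP:
        written.append(item)
    for t in threads:
        t.join()
    return written
```

Because every stage has exactly one worker and the queues are FIFO, trains leave the pipeline in the order they entered it.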

11 Storage server
Memory ring buffer
  – Buffer size (16G)
  – Slot (chunk) size (1G)
Threads pool
  – Reader: reads the data stream (TCP) and stores it in the memory buffer
  – Writer: writes filled buffer slots to disk (cache)
  – Sync: flushes the cache to disk (physical writing)
IO modes
  – Cached: issues IO commands via the OS
  – Direct: talks directly to the IO device
  – Splice: data transfer within kernel space
[Diagram: TCP network -> Reader (get free slot, fill slot) -> ring buffer -> Writer (flush slot, free slot) -> Sync -> disk array]
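The get-free-slot / fill-slot / flush-slot cycle can be sketched as two queues of slots shared by a reader thread and a writer thread. This is a minimal sketch, assuming in-memory streams in place of the TCP socket and the disk, slot sizes scaled down from the 1G slots on the slide, and no separate Sync thread; all names are illustrative.

```python
import io
import queue
import threading

class SlotRing:
    """Fixed pool of equally sized slots shared by a reader and a writer thread."""

    def __init__(self, slot_count: int, slot_size: int):
        self.free = queue.Queue()    # empty slots waiting for the reader
        self.filled = queue.Queue()  # (slot, valid length) pairs waiting for the writer
        for _ in range(slot_count):
            self.free.put(bytearray(slot_size))

def reader(ring, source):
    """Get a free slot, fill it from the stream, and hand it to the writer."""
    while True:
        slot = ring.free.get()        # blocks when the writer falls behind
        n = source.readinto(slot)
        if not n:                     # end of stream
            ring.free.put(slot)
            ring.filled.put(None)
            return
        ring.filled.put((slot, n))

def writer(ring, sink):
    """Flush each filled slot to the sink, then return the slot to the free pool."""
    while True:
        item = ring.filled.get()
        if item is None:
            return
        slot, n = item
        sink.write(memoryview(slot)[:n])
        ring.free.put(slot)

def copy_stream(source, sink, slot_count=16, slot_size=1024):
    """Copy source to sink through the slot ring using one reader and one writer."""
    ring = SlotRing(slot_count, slot_size)
    threads = [threading.Thread(target=reader, args=(ring, source)),
               threading.Thread(target=writer, args=(ring, sink))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

For example, `copy_stream(io.BytesIO(data), io.BytesIO(), slot_count=4)` moves `data` through four slots; the reader stalls only when all four are waiting to be flushed, which is how the buffer absorbs short disk stalls.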

12 NETWORK PERFORMANCE

13 Train Transfer Protocol (TTP)
Train data format
  – Header, images & descriptors, detector data, trailer
Train Transfer Protocol (TTP)
  – Based on UDP: designed for fast data transfer where TCP is not implemented or not suitable (overhead, delays)
  – Transfers are identified by unique identifiers (frame number)
  – Packetization: split the data block (frame) into small packets that are tagged with increasing packet numbers
  – Flags: SoF, EoF, Padding
  – Packet trailer mode
Packet layout: Data (8 kbytes), # Frame (4b), # Packet (3b), SoF/EoF flags (1b)
Train data layout: header, image descriptors, image data, detector-specific data, trailer
[Diagram: train data split into packets 0 to 8; packet 0 carries SoF, packet 8 carries EoF]
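A packetizer and de-packetizer in the spirit of TTP's trailer mode can be sketched as follows. Several details are assumptions rather than facts from the slides: the "4b/3b/1b" fields are read as bytes, the byte order is big-endian, the flag bit values are invented, the Padding flag is left out, and the payload is taken as 8184 bytes (8192 minus an 8-byte trailer), a value that happens to reproduce the 131202 packets per train reported in the summary results.

```python
import struct

PAYLOAD = 8184          # assumed payload bytes per packet (8192 minus the 8-byte trailer)
SOF, EOF = 0x80, 0x40   # assumed flag bits; the slides do not give the values

def packetize(frame_no: int, data: bytes, payload: int = PAYLOAD):
    """Yield packets: payload followed by a trailer (frame 4 B, packet 3 B, flags 1 B)."""
    count = max(1, -(-len(data) // payload))   # ceiling division
    for i in range(count):
        flags = (SOF if i == 0 else 0) | (EOF if i == count - 1 else 0)
        trailer = struct.pack(">I", frame_no) + i.to_bytes(3, "big") + bytes([flags])
        yield data[i * payload:(i + 1) * payload] + trailer

def depacketize(packets):
    """Rebuild one frame from packets in any order; return (frame_no, data)."""
    chunks, frame_no, last = {}, None, None
    for p in packets:
        body, trailer = p[:-8], p[-8:]
        frame_no = struct.unpack(">I", trailer[:4])[0]
        idx = int.from_bytes(trailer[4:7], "big")
        chunks[idx] = body
        if trailer[7] & EOF:
            last = idx
    if last is None or set(chunks) != set(range(last + 1)):
        raise ValueError("frame incomplete: packets lost")
    return frame_no, b"".join(chunks[i] for i in range(last + 1))
```

Keeping the numbering in a trailer rather than a header lets the receiver place each payload at offset `packet number × payload size` without first stripping a prefix; UDP gives no delivery guarantee, so a missing packet number is the only loss signal.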

14 Previous results (reminder)
Two types of programs run in parallel on all machines
  – Feeders: generate train data, split it into packets, and send them using UDP
  – Receivers: reconstruct train data from packets and store it in memory (overwrite)
Run length (#packets/total time):
  – Typical: 3.5×10^8 ~ 2.5×10^9 packets (few hours)
  – Maximum: 5×10^9 packets (16h37m)
Time profile
  – XFEL: 10 MHz for 16 channels -> send 1 train (~131100 packets) within 1.6 sec
  – Continuous: no waiting time between sending trains
[Diagram: PCL Node 1 and PCL Node 2, tested with a unidirectional stream and with concurrent send/receive]

15 Previous results (reminder)
Network transfer rate: 1 GB train -> 0.87 sec ≈ 9.9 Gbps
CPU usage (i.e. receiver core) -> 40%
Packet loss
  – A few packets (tens to hundreds) are sometimes lost per run
  – It happens only at the beginning of some runs (not of trains)
  – Observed sometimes on all machines, sometimes on some machines only. We have also had runs with no packet loss on any machine
Ignoring the first lost packets, which affect only the first train:
  – Typical run (3.5×10^8 packets) -> less than 3.7 out of 10000 trains
  – Long run (5×10^9 packets) -> less than 26 out of one million trains

16 Train switching
In previous experiments:
  – Each feeder is configured to feed one PC layer node (one-to-one)
  – Packet loss appears at the start of a run
  – In the TB, trains are sent out through a different channel every time: 10 trains -> 16 channels per second
Question:
  – What if the feeder sends train data to a different PC layer node every time?
[Diagrams: (a) Feeders 1-3 (st101, st102, st103) and PCLayer 1-2 (st104, st105) on one 10GbE switch, sub-net 1, TTP; (b) Feeders 1-2 (st101, st102) and PCLayer 1-2 (st104, st105) on the same switch and sub-net]

17 Train switching
Test configuration
  – 3 feeder nodes
      Pre-load images from disk
      Build train data (header, trailer, …)
      Calculate checksum (Adler32)
      Packetize train data (TTP)
  – 2 PC layer nodes
      Depacketize (TTP)
      No processing is performed
      Format to HDF5 file
      Stream files through TCP (splice)
  – 2 storage nodes
      Write files to shared memory (splice)
[Diagram: Timer and Train builder (st401); Feeders 1-3 (st101, st102, st103) send TTP on sub-net 1 to PCLayer 1-2 (st104, st105), which send TCP on sub-net 2 to Storage 1-2 (st106, st107); all through a 10GbE switch]

18 Train switching
3 feeders feed 2 PC layer nodes in a round-robin manner
  – Rate: 2 trains every 1.6 seconds
      A PC layer node receives 1 train every 1.6 sec, each time from a different feeder
      A feeder sends out 1 train every 2.4 sec, each time to a different IP address
  – The packetizer checks the send buffer periodically (e.g. every 100 packets) to avoid overwriting previous packets that have not been sent yet
  – All feeder-to-PCLayer data transfers are done on the same sub-network
  – Train transfer time is 0.88 sec, i.e. two consecutive trains overlap by 0.08 sec (9% of the time)

Feeder  PCL  Time (sec)
1       1    0.0
2       2    0.8
3       1    1.6
1       2    2.4
2       1    3.2
3       2    4.0
1       1    4.8
…

[Diagram: Feeders 1-3 (st101, st102, st103) and PCLayer 1-2 (st104, st105) on one 10GbE switch, sub-net 1, TTP]
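The round-robin timetable on this slide follows from two independent counters, one over feeders and one over PC layer nodes. The sketch below regenerates its first rows; the function name and defaults are illustrative, and the start times assume trains leave at perfectly regular intervals.

```python
from itertools import islice

def schedule(feeders=3, pcl_nodes=2, period=1.6):
    """Yield (feeder, pcl_node, start_time): one train starts every period / pcl_nodes."""
    step = period / pcl_nodes   # 0.8 s between consecutive trains on the network
    n = 0
    while True:
        yield n % feeders + 1, n % pcl_nodes + 1, round(n * step, 3)
        n += 1

rows = list(islice(schedule(), 7))
for feeder, pcl, start in rows:
    print(feeder, pcl, start)
```

Since 3 and 2 are coprime, every feeder/node pairing occurs once per 6 trains, so each feeder alternates between both destination addresses.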

19 Experiment
Total run time
  – Short: less than ½ hour
  – Long: 18 hours (81657 trains ~ 80 TB)
Observations:
  – 6 trains were affected at the beginning of the run on each PC layer node, then the run continued smoothly with no train loss until 1am
  – … 8am, 2 more trains were affected on one PC layer node (probably due to the nightly update). No train loss on the other node
  – … then the run continued very stably until the end
  – Network send and receive (TTP and TCP) were very stable
  – Formatting time was not stable all the time

20 Summary
Results:
  – Trains sent: 81657 (27219 trains per feeder)
  – PCLayer01 (received: 40820, affected: 8)
  – PCLayer02 (received: 40823, affected: 6)
  – Train size: 1073754173 bytes = 1 GB (image data) + 12.06 KB (header, descriptors, detector data, and trailer)
  – # packets per train: 131202
  – Transfer time: 0.877579 sec
  – Transfer rate: 1.1395 GBytes/sec = 9.116 Gbps (with 1 GByte = 2^30 bytes)
Sustainable and stable network bandwidth
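The transfer-rate figures can be reproduced from the train size and transfer time. The short calculation below uses only those two reported numbers; it suggests the quoted rate uses binary gigabytes, and also prints the decimal equivalent for comparison with the 10 Gbit/s line rate.

```python
TRAIN_BYTES = 1_073_754_173   # reported train size in bytes
TRANSFER_S = 0.877579         # reported transfer time in seconds
GIB = 2**30

overhead_bytes = TRAIN_BYTES - GIB             # header, descriptors, detector data, trailer
rate_bytes = TRAIN_BYTES / TRANSFER_S          # bytes per second
rate_gib = rate_bytes / GIB                    # binary gigabytes per second
rate_gbit_decimal = rate_bytes * 8 / 1e9       # decimal gigabits per second

print(f"overhead: {overhead_bytes} bytes ({overhead_bytes / 1024:.2f} KiB)")
print(f"rate: {rate_gib:.4f} GiB/s = {rate_gib * 8:.3f} Gibit/s")
print(f"rate: {rate_gbit_decimal:.2f} Gbit/s (decimal)")
```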

21 STORAGE PERFORMANCE

22 Direct IO vs Cached IO
Cached IO
  – Issues read/write commands via the OS kernel, which executes the IO. Data is copied to/from the page cache
Zero-copy operation
  – Splice socket and file descriptors. The data transfer is performed within kernel space (transparent)
  – Problem using the Linux splice function with IBM/Mellanox; the reason could not be determined
Direct IO
  – Performs reads/writes directly to/from the device, bypassing the page cache
  – DMA: memory alignment (512)
  – RAID: strip size (64K), # of disks
  – Per file vs. per partition (Linux)
[Diagram: data path from network to disk across user land, kernel space, and the hardware layer. Cached IO: application buffer, kernel buffer, page cache, device driver, disk (read, write, flush). Direct IO: application buffer, page remapping, device driver, disk]

23 Cached IO
Run length: 2+ hours
Buffer size: 4 GB
Method: read all file data into one buffer, write it once, then sync
[Plots: time (sec) vs. file number for two runs: (a) file size 1 GB, time period 1.6 s, disk empty; (b) file size 2 GB, time period 3.2 s, disk empty]

24 Run configuration
Two types of programs run in parallel on four machines
  – Sender: opens an in-memory data file, streams its content using TCP, closes the file
  – Receiver: reads file data from the socket into a RAM buffer (16 GB), writes it to disk
Run length
  – Typical: 4.5 TB (2 hours)
  – Maximum: until the disk is full, 9 TB (4 hours) and 28 TB (12 hours)
  – Disks are cleaned before every run
Time profile
  – 1G -> 1.6 sec per box

Size (GB)   1    10  20  40
Time (sec)  1.6  16  32  64

[Diagram: PCL Nodes 101-104 (Dell machines) stream to Storage 201-204 (IBM machines), with external and internal disks]

25 Direct IO – external storage
Storage extension box: 3TB 7.2Krpm 6Gbps NL SAS, RAID6

File  Network  Storage  Net read (sec)  Disk write (sec)  Overall (sec)
(GB)  (Gbps)   (GB/s)   max     avg.    max     avg.      max     avg.
1     9.86     0.95     2.27    0.87    2.95    1.06      3.95    1.93
10    9.85     0.97     9.01    8.72    15.58   10.30     16.45   11.17
20    9.83     0.98     17.79   17.48   30.07   20.45     30.95   21.33
40    9.61     0.97     38.81   35.74   58.06   41.39     59.06   42.36

Storage (GB/s) is an average.

26 Direct IO – internal storage
Internal disks: 14x900GB 10Krpm 6Gbps SAS RAID6

File  Network  Storage  Net read (sec)  Disk write (sec)  Overall (sec)
(GB)  (Gbps)   (GB/s)   max     avg.    max     avg.      max     avg.
1     9.86     1.17     3.42    0.87    2.47    0.86      4.23    1.73
10    9.60     1.09     9.38    8.95    11.28   9.19      12.15   10.07
20    9.36     1.06     19.18   18.36   22.63   18.84     23.55   19.75
40    9.38     1.07     38.11   36.64   45.25   37.54     46.22   38.50

Storage (GB/s) is an average.

27 Long run experiments
Internal storage: 9 TB, 918 files, 4 hours
External storage: 28 TB, 2792 files, 12 hours
Buffer size: 16 GB, slot size: 1 GB, file size: 10 GB, time profile: 16 sec
[Plots: network read, disk write, and overall time per file for each storage type, with averages]

28 Statistics from Ganglia
Long run experiment (5:24am - 5:29pm)
  – Host: exflst201 (with external storage)
      Disk write: 671.14 MB/s
      Network bandwidth: 676.8 MB/s
      CPU usage: syst: 5.81%, user: 0.39%
  – Reader (core 0): syst: 49.54%, user: 0.75%
  – Writer (core 1): syst: 7.03%, user: 0.010%
[Plots: network, disk, CPU, reader (core 0), writer (core 1)]

29 Result summary
We need to write a 1 GB data file within 1.6 s per storage box
  – Both internal and external storage configurations are able to achieve this rate (1.1 GB/s and 0.97 GB/s, respectively)
  – 16 storage boxes are needed to handle the 10 GB/s train data stream
High network bandwidth and low CPU load (stable)
Direct IO:
  – Network read and disk write operations are overlapped 97% -> low overall time per file
  – Application buffer: for big files, the bigger the slot size, the better the disk IO performance (as long as DMA allows)
To do:
  – Concurrent IO operations: write/write, write/read, file merging
  – Storage manager: file indexing, disk space management

30 SUMMARY & OUTLOOK

31 Summary
First half of the slice test hardware is configured and running
Testing and tuning network and IO performance using
  – System/community tools: netperf, iozone, …
  – PCL software
  – Train builder board
TB (Emulator) -> PCL software (Dell)
  – Bandwidth: 9.9 Gbps (99% of the wire speed)
  – Low UDP packet loss rate: only a few packets are lost at the start of runs (3.5×10^8 ~ 5×10^9 packets) -> at most 3.7 ~ 0.26 per 10^4 trains can be affected
PC Layer (Dell) -> Storage boxes (IBM)
  – TCP data streaming: ~9.8 Gbps
  – Write terabytes of data to disk at 0.97 to 1.1 GB/s

32 Outlook
Fully featured DAQ system
  – Data readout, pre-processing, monitoring, storage
  – Feed the system with real data and apply real algorithms (processing, monitoring, scientific computing)
  – Deployment, configuration and control: upload libraries, initiate devices, start/stop and monitor runs -> device composition
Soak and stress testing
  – Test performance (CPU, IO, network), behavior (bugs, memory leaks), reliability (error handling, failure), and stability of the system
      Significant workload applied over a long period of time
  – Parallel tasks: forwarding data to online analysis or scientific computing, multiple streams into the same storage server, …
Cluster file system vs. DDN vs. local storage system
Data management: structure, access control, metadata, …

33 Thanks!

