IO performance of ATLAS data formats. Ilija Vukotic, for the ATLAS collaboration. CHEP 2010, 18-22 October 2010, Taipei.
Overview: data formats; POOL and the T/P split; optimizations; CPU times; local tests; large-scale tests. 2 I. Vukotic ATLAS CHEP2010
ATLAS data. Data taken so far: 14.4 pb⁻¹, more than 1 billion events at 450 MB/s of RAW data, plus 1.25 times as much in derived data formats. Analysis jobs: 70+ sites, ~1000 users, 16k jobs/day. 3 I. Vukotic ATLAS CHEP2010
Formats. [Figure: event sizes of ESDs, AODs and D3PDs for data and MC (egamma, JETMET samples); ball surface ~ event size.] Format sizes: RAW 2 PB, ESD 1.8 PB, AOD 0.1 PB, DESD 0.3 PB, D3PD 0.3 PB. 4 I. Vukotic ATLAS CHEP2010
Transient/persistent split. Transient objects are converted to persistent ones. To store the data efficiently, the data from each sub-detector or algorithm passes through a different (sometimes very complex) set of transformations. Converters for complex objects call the converters of their members. This provides the possibility of schema evolution. Example: a TracksCollection is composed of 20 different objects, which can and do evolve individually. 5 I. Vukotic ATLAS CHEP2010
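As a rough illustration of the converter pattern described above, here is a minimal, self-contained sketch; the class names, members and packing scheme are invented for the example, and the real ATLAS converters in the Athena T/P framework are considerably more involved.

```cpp
// Illustrative sketch of the transient/persistent converter pattern.
// All names and the packing are invented for this example.
#include <vector>

struct TrackTransient {            // rich object used by reconstruction/analysis
  double pt, eta, phi;
  std::vector<double> covariance;  // full covariance matrix
};

struct TrackPersistent_p1 {        // compact on-disk shape, schema version _p1
  float pt, eta, phi;              // reduced precision for storage
  std::vector<float> packedCov;    // packed representation
};

struct TrackConverter_p1 {
  // transToPers: fill the persistent object from the transient one on writing
  void transToPers(const TrackTransient& t, TrackPersistent_p1& p) const {
    p.pt  = static_cast<float>(t.pt);
    p.eta = static_cast<float>(t.eta);
    p.phi = static_cast<float>(t.phi);
    p.packedCov.assign(t.covariance.begin(), t.covariance.end());
  }
  // persToTrans: rebuild the transient object on reading; a later _p2
  // converter can handle a newer schema while this one still reads old files
  void persToTrans(const TrackPersistent_p1& p, TrackTransient& t) const {
    t.pt = p.pt; t.eta = p.eta; t.phi = p.phi;
    t.covariance.assign(p.packedCov.begin(), p.packedCov.end());
  }
};
```

A versioned persistent shape (_p1, _p2, ...) with a matching converter per version is what makes schema evolution possible: newer software can still read older files through the older converter.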
CPU times
Format      Size [MB/ev]   Speed root [MB/s]   Speed total [MB/s]
MC - AOD    0.41           8.38                5.20
MC - ESD    1.35           12.99               3.69
real - AOD  0.12           2.36                1.60
real - ESD  1.29           9.42                3.22
6 I. Vukotic ATLAS CHEP2010
CPU times. The smallest object one can read is a collection. There are more than 300 collections, and most of the stored objects are very small. 7 I. Vukotic ATLAS CHEP2010
CPU times. ROOT is very fast at retrieving large, simple objects (arrays and vectors). Reducing the number of collections and simplifying them should get us above 10 MB/s average read speed. 8 I. Vukotic ATLAS CHEP2010
Root file organization. [Diagram: a file containing events split into baskets of simple types (floats, doubles, std containers).] We chose to fully split (better compression factor). Baskets are written to the file as soon as they get full, which leaves parts of the same event scattered over the file. 9 I. Vukotic ATLAS CHEP2010
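A minimal ROOT writing sketch of the layout choices mentioned here (zip level 6, 2 kB baskets, split level 99), assuming a simple std::vector<float> payload and made-up file, tree and branch names in place of the real ATLAS persistent classes:

```cpp
#include <vector>
#include "TFile.h"
#include "TTree.h"

void write_fully_split() {
  TFile* f = TFile::Open("events.root", "RECREATE");   // placeholder file name
  f->SetCompressionLevel(6);                           // zip level 6
  TTree* tree = new TTree("CollectionTree", "fully split demo tree");
  std::vector<float>* pt = new std::vector<float>;
  // basket (buffer) size 2000 bytes, split level 99 ("full split")
  tree->Branch("el_pt", &pt, 2000, 99);
  for (int i = 0; i < 100000; ++i) {
    pt->assign(5, 10.0f + i);                          // dummy payload
    tree->Fill();                                      // full baskets are written out immediately
  }
  f->Write();
  f->Close();
}
```

Run as a ROOT macro, e.g. root -l -q write_fully_split.C+. With 2 kB buffers each branch flushes its basket independently as soon as it fills, which is what scatters the pieces of one event across the file.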
Tradeoffs. Optimization options: persistent model; split level 99 (full); zip level 6; basket size 2 kB; basket reordering (by event or by branch); TTreeCache; "new ROOT" with AutoFlush matching the TTC size. Constraints: manpower, memory, disk size, read/write time. Read scenarios: full sequential, some events, parts of events, PROOF. 10 I. Vukotic ATLAS CHEP2010
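The "basket reordering by event" option listed above can be sketched in plain ROOT as a fast clone with sorted baskets; the file and tree names below are placeholders:

```cpp
#include "TFile.h"
#include "TTree.h"

void reorder_by_event() {
  TFile* in  = TFile::Open("AOD.unordered.root");           // placeholder input
  TTree* t   = (TTree*)in->Get("CollectionTree");
  TFile* out = TFile::Open("AOD.ordered.root", "RECREATE");
  out->SetCompressionLevel(6);
  // "fast" copies baskets without unzipping them; SortBasketsByEntry writes
  // them back in event order rather than in the order they were flushed
  TTree* sorted = t->CloneTree(-1, "fast SortBasketsByEntry");
  sorted->Write();
  out->Close();
  in->Close();
}
```

SortBasketsByBranch would instead group all baskets of a branch together, which favours reading a few branches over reading whole events.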
Local disk performance - AOD. Large gains when reading a reordered file; TTreeCache helps a lot, but not as much. A ROOT-optimized file is the best option, especially for sparse reading. [Plot labels: "Now", "Near future", "At start of data taking".] 11 I. Vukotic ATLAS CHEP2010
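A reading-side sketch of enabling the TTreeCache (TTC) for a sequential read, using ROOT 5.26-era calls and placeholder file and tree names:

```cpp
#include "TFile.h"
#include "TTree.h"

void read_with_ttc() {
  TFile* f = TFile::Open("AOD.root");        // placeholder file name
  TTree* t = (TTree*)f->Get("CollectionTree");
  t->SetCacheSize(20 * 1024 * 1024);         // 20 MB TTreeCache
  t->AddBranchToCache("*", kTRUE);           // cache all branches
  const Long64_t n = t->GetEntries();
  for (Long64_t i = 0; i < n; ++i) {
    t->GetEntry(i);                          // baskets arrive via large cached reads
  }
  f->Close();
}
```

A cache of this size turns many small per-basket reads into a few large requests, which is where the gains on local disks, and even more on remote storage, come from.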
Local disk performance - ESD. Unlike in the AOD case, ROOT optimization (hadd -f6) does not help the full ESD read time; here it is even worse. This mode needs more investigation. 12 I. Vukotic ATLAS CHEP2010
Local disk performance - D3PD. When reading all events, the real time is dominated by CPU time; not so for sparse reading. A ROOT-optimized file (rewritten using hadd -f6) improves the CPU time but not the HDD time (!). 13 I. Vukotic ATLAS CHEP2010
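For the sparse-reading scenario, a minimal sketch that reads only a subset of branches and roughly 1% of the events; the branch, tree and file names are assumptions:

```cpp
#include "TFile.h"
#include "TTree.h"

void read_sparse() {
  TFile* f = TFile::Open("egamma_D3PD.root");   // placeholder file name
  TTree* t = (TTree*)f->Get("physics");         // assumed D3PD tree name
  t->SetBranchStatus("*", 0);                   // disable all branches ...
  t->SetBranchStatus("el_*", 1);                // ... then enable only a few
  t->SetCacheSize(20 * 1024 * 1024);            // TTreeCache learns the used branches
  const Long64_t n = t->GetEntries();
  for (Long64_t i = 0; i < n; i += 100) {       // roughly the "1%" scenario
    t->GetEntry(i);
  }
  f->Close();
}
```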
Large scale tests. Athena running the D3PD maker on AOD files; the read pattern is the same as for analysis on AODs. 274 files, 834 GB. Tests: original files, reordered files, TTreeCache on. D3PD reading: egamma dataset, 11 files, 90 GB. Tests: 100%, 1%, TTreeCache ON, ROOT-optimized. 14 I. Vukotic ATLAS CHEP2010
DPM/xrd. As tested at LAL Orsay on a dedicated 80-core SLC5 farm, with jobs managed by Torque/Maui. Using PROOF with a DPM backend had never been tested before. Reading D3PDs showed big performance issues; some sources of inefficiency were found and should be addressed shortly. [Plot: ordered vs. original vs. TTC.]
Test         Time   Avg CPU
Ordered      5:35   100%
Original     9:30   85%
TTC (20 MB)  4:34   92%
15 I. Vukotic ATLAS CHEP2010
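A minimal sketch of the kind of PROOF job used for these D3PD tests; the PROOF master URL, the DPM/xrootd path, the tree name and the selector are all placeholders:

```cpp
#include "TChain.h"
#include "TProof.h"

void run_proof() {
  TProof::Open("proof://master.example.org");     // connect to the PROOF farm
  TChain chain("physics");                         // assumed D3PD tree name
  chain.Add("root://dpmserver.example.org//dpm/path/egamma_D3PD_*.root");
  chain.SetProof();                                // process the chain on PROOF
  chain.Process("MySelector.C+");                  // user analysis selector
}
```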
DPM/rfio & Lustre. As tested at Glasgow (DPM) and QMUL (Lustre) on a single file. DPM tuned for copying files to the worker node gives disappointing results when used for direct analysis; with a very small read-ahead, DPM behaves equally well for all the files. Both systems scale well to the full dataset. Single-file reading. 16 I. Vukotic ATLAS CHEP2010
EOS - xroot disk pool. Experimental¹ setup for a large-scale analysis farm: an xroot server with 24 nodes, each with 20 x 2 TB RAID-0 filesystems (for this test only 10 nodes were used, with a maximum theoretical throughput of 1 GB/s). To stress it, 23 x 8 cores were used, with ROOT 5.26.0b (SLC4, gcc 3.4). Only PROOF reading of D3PDs was tested. ¹ Caveat: real-life performance will be significantly worse. 17 I. Vukotic ATLAS CHEP2010
EOS - xroot disk pool (cont.). Only maximal sustained event rates are shown here (real use-case averages will be significantly smaller). With the original files it would be faster to read all the events even when only 1% are needed. Reading the fully optimized data gave a sustained read speed of 550 MB/s. Note the log scale! 18 I. Vukotic ATLAS CHEP2010
dCache vs. Lustre. Tested at Zeuthen and Hamburg with minimum-bias D3PD data; the numbers are HDD read requests. Test 1: single unoptimized file (ROOT 5.22, 1k branches of 2 kB, CF=1). Test 2: single optimized file (ROOT 5.26, hadd -f2).
         TTC   Test 1   Test 2
dCache   No    173394   40547
dCache   Yes   44
Lustre   No    173394   40504
Lustre   Yes   193      97
19 I. Vukotic ATLAS CHEP2010
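Numbers of this kind can be measured with ROOT's own counters, since TFile keeps track of the read calls it issues to the storage layer; a small sketch with a placeholder URL and tree name:

```cpp
#include <cstdio>
#include "TFile.h"
#include "TTree.h"

void count_reads(const char* url = "dcap://server.example.org//pnfs/d3pd.root",
                 Long64_t cacheSize = 20 * 1024 * 1024) {
  TFile* f = TFile::Open(url);
  TTree* t = (TTree*)f->Get("physics");        // assumed D3PD tree name
  if (cacheSize > 0) t->SetCacheSize(cacheSize);
  for (Long64_t i = 0; i < t->GetEntries(); ++i) t->GetEntry(i);
  printf("bytes read: %lld in %d read calls\n",
         f->GetBytesRead(), f->GetReadCalls());
  f->Close();
}
```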
dCache, 100 jobs, TTC on, un-optimized files. The IO rate saturates below 1 GB/s and the CPU efficiency caves in; fewer than 20 simultaneous jobs use up all the bandwidth. 20 I. Vukotic ATLAS CHEP2010
Conclusions. The data volume makes efficient reading of the data extremely important. There are many possible ways and parameters to optimize the data for faster input; different formats and use cases, with sometimes conflicting requirements, make the optimization more difficult. The currently used file reordering significantly decreased job durations and the stress on the disk systems; we will move to ROOT-optimized files. DPM, Lustre and dCache need careful, job-specific tuning to reach optimal performance, and a lot of improvements are needed to efficiently support the large-scale IO required by analysis jobs. 21 I. Vukotic ATLAS CHEP2010
BACKUP SLIDES 22 I. Vukotic ATLAS CHEP2010
Formats. [Diagram: RAW produced at Tier-0; reconstruction gives ESD and AOD; successive reductions give DESD(M)/DAOD(M) and D1PD, D2ESD(M)/D2AOD(M) and D2PD, D3ESD(M)/D3AOD(M) and D3PD.] All the formats are ROOT files. Athena versions until 15.9.0 used ROOT 5.22 (with a lot of features backported); since then, ROOT 5.26. D3PDs have a completely flat structure. 23 I. Vukotic ATLAS CHEP2010
Lustre setup 24 I. Vukotic ATLAS CHEP2010
NAF. Related contributions: PO-MON-031, "ATLAS Operation in the GridKa Tier1/Tier2 Cloud", DUCKECK, Guenter (LMU Munich); PO-WED-029, "ATLAS Setup and Usage of the German National Analysis Facility at DESY", MEHLHASE, Sascha (DESY). The batch system is SGE with O(1000) cores. dCache is connected via 1 or 10 Gbit depending on the pool; Lustre is connected via InfiniBand. 25 I. Vukotic ATLAS CHEP2010
EOS 26 I. Vukotic ATLAS CHEP2010