1 IO performance of ATLAS data formats
Ilija Vukotic, for the ATLAS collaboration
CHEP 2010, 18-22 October 2010, Taipei

2 Overview
- Data
- Formats
- POOL, T/P split
- Optimizations
- CPU times
- Local tests
- Large scale tests

3 ATLAS data
- Data taken until now
  - 14.4 pb⁻¹, more than 1 billion events
  - at 450 MB/s of RAW data
  - 1.25 times more of derived data formats
- Analysis jobs
  - 70+ sites
  - ~1000 users
  - 16k jobs/day

4 Formats
[Figure: bubble chart of ESDs, AODs and D3PDs for data and MC (egamma, JETMET); ball surface ~ event size]

Format  Size [PB]
RAW     2
ESD     1.8
AOD     0.1
DESD    0.3
D3PD    0.3

5 Transient/persistent split
- Transient objects are converted to persistent ones.
- To store the data efficiently, the data from each sub-detector or algorithm passes through a different (sometimes very complex) set of transformations.
- Converters of complex objects call the converters of their members.
- This provides the possibility for schema evolution.
- Example: TracksCollection is composed of 20 different objects which can, and do, evolve individually.
(A minimal sketch of this converter pattern follows below.)
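The following is a minimal, hypothetical sketch of the transient/persistent converter pattern described above. All class and member names are invented for illustration; the real ATLAS converters are generated and maintained inside the Athena/POOL framework and handle far more members and schema versions.

```cpp
#include <vector>

// Persistent shape, version 1: doubles packed into floats for compact storage.
struct TrackParticle_p1 {
    std::vector<float> packedParams;
};

// Transient shape used by reconstruction and analysis algorithms.
struct TrackParticle {
    double d0, z0, phi, theta, qOverP;
};

// Converter between the two representations. A hypothetical _p2 converter
// could read a newer persistent layout: this is how per-object schema
// evolution works, since each constituent of a TracksCollection has its own
// converter and can evolve independently.
class TrackParticleCnv_p1 {
public:
    TrackParticle_p1 toPersistent(const TrackParticle& t) const {
        return { { float(t.d0), float(t.z0), float(t.phi),
                   float(t.theta), float(t.qOverP) } };
    }
    TrackParticle toTransient(const TrackParticle_p1& p) const {
        return { p.packedParams[0], p.packedParams[1], p.packedParams[2],
                 p.packedParams[3], p.packedParams[4] };
    }
};
```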

6 CPU times

Format      Size [MB/ev]  Speed ROOT [MB/s]  Speed total [MB/s]
MC - AOD    0.41          8.38               5.20
MC - ESD    1.35          12.99              3.69
real - AOD  0.12          2.36               1.60
real - ESD  1.29          9.42               3.22

7 CPU times
- The smallest object one can read is a collection
- There are more than 300 collections
- Most of the objects stored are very small

8 CPU times
- ROOT is very fast at retrieving large, simple objects (arrays and vectors)
- Reducing the number of collections and simplifying them should get us above 10 MB/s average read speed

9 ROOT file organization
[Figure: layout of a ROOT file, with the baskets of doubles and floats belonging to the events]
- We chose to fully split (better compression factor)
- Baskets are written to the file as soon as they get full
- That leaves parts of the same event scattered over the file
(A write-side sketch of these settings follows below.)
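As an illustration of the write-side choices above, here is a minimal ROOT sketch that writes a tree with full splitting, zip level 6 and 2 kB baskets. The file, tree and branch names are placeholders, not the actual ATLAS ones.

```cpp
#include "TFile.h"
#include "TTree.h"
#include <vector>

void write_fully_split() {
    TFile f("events.root", "RECREATE");
    f.SetCompressionLevel(6);                 // zip level 6

    TTree tree("CollectionTree", "illustrative event data");
    std::vector<float> pt;

    // Split level 99 asks ROOT to split stored objects into one branch per
    // data member (little effect for this simple vector, but the nested ATLAS
    // persistent classes yield many branches). The 2 kB basket size means
    // baskets fill quickly and are flushed to disk as soon as they are full,
    // which is why pieces of one event end up scattered through the file.
    tree.Branch("ElectronPt", &pt, 2000, 99);

    for (int i = 0; i < 100000; ++i) {
        pt.assign(3, 25.0f + (i % 10));
        tree.Fill();
    }
    tree.Write();
}
```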

10 Tradeoffs and optimizations
Options:
- Persistent model
- Split level: 99 (full)
- Zip level: 6
- Basket size: 2 kB
- Basket reordering: by event or by branch
- TTreeCache
- "New ROOT": AutoFlush matching the TTreeCache size (see the sketch after this slide)
Constraints:
- Manpower
- Memory
- Disk size
- Read/write time
Read scenarios:
- Full sequential
- Some events
- Parts of events
- PROOF
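A minimal sketch of the two ROOT-level knobs named above: TTreeCache on the read side and AutoFlush on the write side. The file name, tree name and 30 MB cache size are illustrative assumptions, not the values used in the ATLAS tests.

```cpp
#include "TFile.h"
#include "TTree.h"

// Read side: attach a TTreeCache so that the many small basket reads are
// grouped into a few large requests.
void read_with_cache(const char* fname) {
    TFile* f = TFile::Open(fname);
    TTree* t = (TTree*)f->Get("CollectionTree");

    t->SetCacheSize(30 * 1024 * 1024);   // 30 MB TTreeCache (illustrative)
    t->AddBranchToCache("*", kTRUE);     // cache all branches

    for (Long64_t i = 0; i < t->GetEntries(); ++i)
        t->GetEntry(i);
    delete f;
}

// Write side ("new ROOT", 5.26+): flush baskets in event-ordered clusters.
// A negative argument means "flush after roughly this many bytes", so the
// cluster size can be matched to the TTreeCache size used for reading.
void tune_autoflush(TTree* tree) {
    tree->SetAutoFlush(-30 * 1024 * 1024);
}
```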

11 Local disk performance - AOD
[Plot: AOD read times at the start of data taking, now, and in the near future]
- Large gains when reading a reordered file
- TTreeCache helps a lot, but not as much
- A ROOT-optimized file is the best option, especially for sparse reading

12 Local disk performance - ESD
[Plot: ESD read times; the ROOT-optimized case is marked "here worse?"]
- Unlike the AOD case, ROOT optimization (hadd -f6) does not help the full ESD read time
- This mode needs more investigation

13 Local disk performance - D3PD
- When reading all events, the real time is dominated by CPU time
- Not so for sparse reading
- A ROOT-optimized file (rewritten using hadd -f6) improves the CPU time but not the HDD time (!)
(A sketch of such a sparse read follows below.)
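For context, a sparse read of a flat D3PD typically means enabling only a handful of branches, so that only their baskets are fetched from disk. The sketch below assumes hypothetical tree and branch names (egamma, el_pt, el_eta); it is not the actual ATLAS test code.

```cpp
#include "TFile.h"
#include "TTree.h"
#include <vector>

void sparse_read(const char* fname) {
    TFile* f = TFile::Open(fname);
    TTree* t = (TTree*)f->Get("egamma");      // flat D3PD tree (assumed name)

    t->SetBranchStatus("*", 0);                // disable everything...
    t->SetBranchStatus("el_pt", 1);            // ...then enable only what is needed
    t->SetBranchStatus("el_eta", 1);

    std::vector<float>* el_pt  = 0;
    std::vector<float>* el_eta = 0;
    t->SetBranchAddress("el_pt",  &el_pt);
    t->SetBranchAddress("el_eta", &el_eta);

    // Only the baskets of the two enabled branches are read from disk.
    for (Long64_t i = 0; i < t->GetEntries(); ++i)
        t->GetEntry(i);
    delete f;
}
```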

14 Large scale tests
Athena running the D3PD maker on AOD files (read pattern the same as when doing analysis on AODs):
- 274 files, 834 GB
- Tests: original files, reordered files, TTreeCache on
D3PD reading, egamma dataset:
- 11 files, 90 GB
- Tests: 100%, 1%, TTreeCache on, ROOT optimized

15 DPM/xrd
As tested at LAL Orsay:
- Dedicated 80-core SLC5 farm, jobs managed by Torque/Maui
- Using PROOF with a DPM backend had never been tested before
- Reading D3PDs showed big performance issues
- Some sources of inefficiency were found and should be addressed shortly

Test         Time  Avg CPU
Ordered      5:35  100%
Original     9:30  85%
TTC (20 MB)  4:34  92%

16 DPM/rfio & Lustre
[Plot: single-file reading]
- As tested at Glasgow (DPM) and QMUL (Lustre) on a single file
- DPM tuned for file copy to the worker node gives disappointing results when used for analysis
- With a very small read-ahead, DPM behaves equally well for all the files
- Both systems scale well to the full dataset

17 EOS - xroot disk pool
- Experimental setup for a large-scale analysis farm (caveat: real-life performance will be significantly worse)
- Xrootd server with 24 nodes, each with 20 x 2 TB RAID0 filesystems (for this test only 10 nodes were used, with a maximum theoretical throughput of 1 GB/s)
- To stress it, 23 x 8 cores were used with ROOT 5.26.0b (SLC4, gcc 3.4)
- Only PROOF reading of D3PDs was tested (a minimal PROOF sketch follows below)
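A minimal sketch of what a PROOF session reading D3PDs over xrootd looks like. The master URL, file path, tree name and selector are placeholders, not the configuration of the EOS test farm.

```cpp
#include "TProof.h"
#include "TChain.h"

void proof_read_d3pd() {
    // Connect to the PROOF master of the analysis farm (placeholder URL).
    TProof::Open("master.example.cern.ch");

    // Chain the flat D3PD files served via xrootd (placeholder path).
    TChain chain("egamma");
    chain.Add("root://eos.example.cern.ch//atlas/d3pd/egamma_*.root");

    // Hand the chain to PROOF and run a user TSelector on the workers.
    chain.SetProof();
    chain.Process("MySelector.C+");
}
```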

18 EOS - xroot disk pool (cont.)
[Plot, log scale: sustained event rates]
- Only maximal sustained event rates are shown (real use case averages will be significantly smaller)
- Original files: it would be faster to read all the events even if only 1% were needed
- Reading fully optimized data gave a sustained read speed of 550 MB/s

19 dCache vs. Lustre
- Tested in Zeuthen and Hamburg with minimum bias D3PD data; the numbers below are HDD read requests
- Test 1: single unoptimized file (ROOT 5.22, 1k branches of 2 kB, CF=1)
- Test 2: single optimized file (ROOT 5.26, hadd -f2)
(A sketch of counting such read requests from ROOT follows below.)

         TTC  Test 1   Test 2
dCache   No   173394   40547
dCache   Yes  44
Lustre   No   173394   40504
Lustre   Yes  19397
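Read-request counts like those in the table can be obtained from ROOT itself, which counts the low-level read calls it issues per file. The sketch below is an illustrative way to do that; the tree name and cache size are assumptions.

```cpp
#include "TFile.h"
#include "TTree.h"
#include <iostream>

void count_read_calls(const char* fname, bool useTreeCache) {
    TFile* f = TFile::Open(fname);
    TTree* t = (TTree*)f->Get("minbias");     // assumed D3PD tree name

    if (useTreeCache)
        t->SetCacheSize(30 * 1024 * 1024);    // groups basket reads into few calls

    for (Long64_t i = 0; i < t->GetEntries(); ++i)
        t->GetEntry(i);

    std::cout << fname << ": " << f->GetReadCalls() << " read calls, "
              << f->GetBytesRead() << " bytes read" << std::endl;
    delete f;
}
```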

20 dCache
[Plot: 100 jobs, TTreeCache on, un-optimized files]
- The IO rate saturates below 1 GB/s and the CPU efficiency caves in
- Fewer than 20 simultaneous jobs use up all the bandwidth

21 Conclusions
- The data volume makes efficient reading of the data extremely important
- There are many possible ways and parameters to optimize the data for faster input
- Different formats and use cases, with sometimes conflicting requirements, make optimization more difficult
- The currently used file reordering significantly decreased job duration and the stress on the disk systems
- We will move to ROOT-optimized files
- DPM, Lustre, dCache:
  - need careful job-specific tuning to reach optimal performance
  - need a lot of improvements in order to efficiently support the large-scale IO required by analysis jobs

22 BACKUP SLIDES

23 Formats
[Diagram: formats produced at Tier-0; RAW is reconstructed into ESD and AOD, which are reduced into DESD(M)/DAOD(M), D2ESD(M)/D2AOD(M), D3ESD(M)/D3AOD(M) and D1PD/D2PD/D3PD]
- All the formats are ROOT files
- Athena versions until 15.9.0 used ROOT 5.22 (with a lot of features backported); since then, ROOT 5.26
- D3PDs have a completely flat structure

24 Lustre setup

25 NAF
- Related posters: PO-MON-031, "ATLAS Operation in the GridKa Tier1/Tier2 Cloud", Guenter Duckeck (LMU Munich); PO-WED-029, "ATLAS Setup and Usage of the German National Analysis Facility at DESY", Sascha Mehlhase (DESY)
- The batch system is an SGE batch system with O(1000) cores
- dCache is connected via 1 or 10 Gbit depending on the pool
- Lustre is connected via InfiniBand

26 EOS

