IO performance of ATLAS data formats. Ilija Vukotic, for the ATLAS collaboration. CHEP 2010, 18-22 October 2010, Taipei.
Overview: data formats; POOL and the T/P split; optimizations; CPU times; local tests; large-scale tests. 2 I. Vukotic ATLAS CHEP2010
ATLAS data. Data taken so far: 14.4 pb⁻¹, more than 1 billion events at 450 MB/s of RAW data, plus 1.25 times as much in derived data formats. Analysis jobs: 70+ sites, ~1000 users, 16k jobs/day. 3 I. Vukotic ATLAS CHEP2010
Formats. [Figure: event sizes of ESDs, AODs and D3PDs for data and MC (egamma, JETMET samples); ball surface ~ event size.] Format sizes: RAW 2 PB, ESD 1.8 PB, AOD 0.1 PB, DESD 0.3 PB, D3PD 0.3 PB. 4 I. Vukotic ATLAS CHEP2010
Transient/persistent split. Transient objects are converted to persistent ones. To store the data efficiently, the data from each sub-detector or algorithm passes through a different (sometimes very complex) set of transformations. Converters for complex objects call the converters of their members. This provides the possibility of schema evolution. Example: a TracksCollection is composed of 20 different objects, which can and do evolve individually. 5 I. Vukotic ATLAS CHEP2010
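As a rough illustration of the converter pattern described above, here is a minimal, self-contained sketch; the class names, members and packing scheme are invented for the example, and the real ATLAS converters in the Athena T/P framework are considerably more involved.

```cpp
// Illustrative sketch of the transient/persistent converter pattern.
// All names and the packing are invented for this example.
#include <vector>

struct TrackTransient {            // rich object used by reconstruction/analysis
  double pt, eta, phi;
  std::vector<double> covariance;  // full covariance matrix
};

struct TrackPersistent_p1 {        // compact on-disk shape, schema version _p1
  float pt, eta, phi;              // reduced precision for storage
  std::vector<float> packedCov;    // packed representation
};

struct TrackConverter_p1 {
  // transToPers: fill the persistent object from the transient one on writing
  void transToPers(const TrackTransient& t, TrackPersistent_p1& p) const {
    p.pt  = static_cast<float>(t.pt);
    p.eta = static_cast<float>(t.eta);
    p.phi = static_cast<float>(t.phi);
    p.packedCov.assign(t.covariance.begin(), t.covariance.end());
  }
  // persToTrans: rebuild the transient object on reading; a later _p2
  // converter can handle a newer schema while this one still reads old files
  void persToTrans(const TrackPersistent_p1& p, TrackTransient& t) const {
    t.pt = p.pt; t.eta = p.eta; t.phi = p.phi;
    t.covariance.assign(p.packedCov.begin(), p.packedCov.end());
  }
};
```

A versioned persistent shape (_p1, _p2, ...) with a matching converter per version is what makes schema evolution possible: newer software can still read older files through the older converter.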
CPU times
Format      Size [MB/ev]   Speed root [MB/s]   Speed total [MB/s]
MC - AOD    0.41           8.38                5.20
MC - ESD    1.35           12.99               3.69
real - AOD  0.12           2.36                1.60
real - ESD  1.29           9.42                3.22
6 I. Vukotic ATLAS CHEP2010
CPU times. The smallest object one can read is a collection. There are more than 300 collections, and most of the stored objects are very small. 7 I. Vukotic ATLAS CHEP2010
CPU times. ROOT is very fast at retrieving large, simple objects (arrays and vectors). Reducing the number of collections and simplifying them should get us above 10 MB/s average read speed. 8 I. Vukotic ATLAS CHEP2010
Root file organization. [Diagram: a file containing events split into baskets of simple types (floats, doubles, std containers).] We chose to fully split (better compression factor). Baskets are written to the file as soon as they get full, which leaves parts of the same event scattered over the file. 9 I. Vukotic ATLAS CHEP2010
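A minimal ROOT writing sketch of the layout choices mentioned here (zip level 6, 2 kB baskets, split level 99), assuming a simple std::vector<float> payload and made-up file, tree and branch names in place of the real ATLAS persistent classes:

```cpp
#include <vector>
#include "TFile.h"
#include "TTree.h"

void write_fully_split() {
  TFile* f = TFile::Open("events.root", "RECREATE");   // placeholder file name
  f->SetCompressionLevel(6);                           // zip level 6
  TTree* tree = new TTree("CollectionTree", "fully split demo tree");
  std::vector<float>* pt = new std::vector<float>;
  // basket (buffer) size 2000 bytes, split level 99 ("full split")
  tree->Branch("el_pt", &pt, 2000, 99);
  for (int i = 0; i < 100000; ++i) {
    pt->assign(5, 10.0f + i);                          // dummy payload
    tree->Fill();                                      // full baskets are written out immediately
  }
  f->Write();
  f->Close();
}
```

Run as a ROOT macro, e.g. root -l -q write_fully_split.C+. With 2 kB buffers each branch flushes its basket independently as soon as it fills, which is what scatters the pieces of one event across the file.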
Tradeoffs. Optimization options: persistent model; split level 99 (full); zip level 6; basket size 2 kB; basket reordering (by event or by branch); TTreeCache; "new ROOT" with AutoFlush matching the TTC size. Constraints: manpower, memory, disk size, read/write time. Read scenarios: full sequential, some events, parts of events, PROOF. 10 I. Vukotic ATLAS CHEP2010
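The "basket reordering by event" option listed above can be sketched in plain ROOT as a fast clone with sorted baskets; the file and tree names below are placeholders:

```cpp
#include "TFile.h"
#include "TTree.h"

void reorder_by_event() {
  TFile* in  = TFile::Open("AOD.unordered.root");           // placeholder input
  TTree* t   = (TTree*)in->Get("CollectionTree");
  TFile* out = TFile::Open("AOD.ordered.root", "RECREATE");
  out->SetCompressionLevel(6);
  // "fast" copies baskets without unzipping them; SortBasketsByEntry writes
  // them back in event order rather than in the order they were flushed
  TTree* sorted = t->CloneTree(-1, "fast SortBasketsByEntry");
  sorted->Write();
  out->Close();
  in->Close();
}
```

SortBasketsByBranch would instead group all baskets of a branch together, which favours reading a few branches over reading whole events.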
Local disk performance - AOD. Large gains when reading a reordered file; TTreeCache helps a lot, but not as much. A ROOT-optimized file is the best option, especially for sparse reading. [Plot labels: "Now", "Near future", "At start of data taking".] 11 I. Vukotic ATLAS CHEP2010
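A reading-side sketch of enabling the TTreeCache (TTC) for a sequential read, using ROOT 5.26-era calls and placeholder file and tree names:

```cpp
#include "TFile.h"
#include "TTree.h"

void read_with_ttc() {
  TFile* f = TFile::Open("AOD.root");        // placeholder file name
  TTree* t = (TTree*)f->Get("CollectionTree");
  t->SetCacheSize(20 * 1024 * 1024);         // 20 MB TTreeCache
  t->AddBranchToCache("*", kTRUE);           // cache all branches
  const Long64_t n = t->GetEntries();
  for (Long64_t i = 0; i < n; ++i) {
    t->GetEntry(i);                          // baskets arrive via large cached reads
  }
  f->Close();
}
```

A cache of this size turns many small per-basket reads into a few large requests, which is where the gains on local disks, and even more on remote storage, come from.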
Local disk performance - ESD. Unlike in the AOD case, ROOT optimization (hadd -f6) does not help the full ESD read time; here it is even worse. This mode needs more investigation. 12 I. Vukotic ATLAS CHEP2010
Local disk performance - D3PD. When reading all events, the real time is dominated by CPU time; not so for sparse reading. A ROOT-optimized file (rewritten using hadd -f6) improves the CPU time but not the HDD time (!). 13 I. Vukotic ATLAS CHEP2010
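For the sparse-reading scenario, a minimal sketch that reads only a subset of branches and roughly 1% of the events; the branch, tree and file names are assumptions:

```cpp
#include "TFile.h"
#include "TTree.h"

void read_sparse() {
  TFile* f = TFile::Open("egamma_D3PD.root");   // placeholder file name
  TTree* t = (TTree*)f->Get("physics");         // assumed D3PD tree name
  t->SetBranchStatus("*", 0);                   // disable all branches ...
  t->SetBranchStatus("el_*", 1);                // ... then enable only a few
  t->SetCacheSize(20 * 1024 * 1024);            // TTreeCache learns the used branches
  const Long64_t n = t->GetEntries();
  for (Long64_t i = 0; i < n; i += 100) {       // roughly the "1%" scenario
    t->GetEntry(i);
  }
  f->Close();
}
```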
Large scale tests. Athena running the D3PD maker on AOD files; the read pattern is the same as for analysis on AODs. 274 files, 834 GB. Tests: original files, reordered files, TTreeCache on. D3PD reading: egamma dataset, 11 files, 90 GB. Tests: 100%, 1%, TTreeCache ON, ROOT-optimized. 14 I. Vukotic ATLAS CHEP2010
DPM/xrd. As tested at LAL Orsay on a dedicated 80-core SLC5 farm, with jobs managed by Torque/Maui. Using PROOF with a DPM backend had never been tested before. Reading D3PDs showed big performance issues; some sources of inefficiency were found and should be addressed shortly. [Plot: ordered vs. original vs. TTC.]
Test         Time   Avg CPU
Ordered      5:35   100%
Original     9:30   85%
TTC (20 MB)  4:34   92%
15 I. Vukotic ATLAS CHEP2010
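A minimal sketch of the kind of PROOF job used for these D3PD tests; the PROOF master URL, the DPM/xrootd path, the tree name and the selector are all placeholders:

```cpp
#include "TChain.h"
#include "TProof.h"

void run_proof() {
  TProof::Open("proof://master.example.org");     // connect to the PROOF farm
  TChain chain("physics");                         // assumed D3PD tree name
  chain.Add("root://dpmserver.example.org//dpm/path/egamma_D3PD_*.root");
  chain.SetProof();                                // process the chain on PROOF
  chain.Process("MySelector.C+");                  // user analysis selector
}
```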
DPM/rfio & Lustre. As tested at Glasgow (DPM) and QMUL (Lustre) on a single file. DPM tuned for copying files to the worker node gives disappointing results when used for direct analysis; with a very small read-ahead, DPM behaves equally well for all the files. Both systems scale well to the full dataset. Single-file reading. 16 I. Vukotic ATLAS CHEP2010
EOS - xroot disk pool. Experimental¹ setup for a large-scale analysis farm: an xroot server with 24 nodes, each with 20 x 2 TB RAID-0 filesystems (for this test only 10 nodes were used, with a maximum theoretical throughput of 1 GB/s). To stress it, 23 x 8 cores were used, with ROOT 5.26.0b (SLC4, gcc 3.4). Only PROOF reading of D3PDs was tested. ¹ Caveat: real-life performance will be significantly worse. 17 I. Vukotic ATLAS CHEP2010
EOS - xroot disk pool (cont.). Only maximal sustained event rates are shown here (real use-case averages will be significantly smaller). With the original files it would be faster to read all the events even when only 1% are needed. Reading the fully optimized data gave a sustained read speed of 550 MB/s. Note the log scale! 18 I. Vukotic ATLAS CHEP2010
dCache vs. Lustre. Tested at Zeuthen and Hamburg with minimum-bias D3PD data; the numbers are HDD read requests. Test 1: single unoptimized file (ROOT 5.22, 1k branches of 2 kB, CF=1). Test 2: single optimized file (ROOT 5.26, hadd -f2).
         TTC   Test 1   Test 2
dCache   No    173394   40547
dCache   Yes   44
Lustre   No    173394   40504
Lustre   Yes   193      97
19 I. Vukotic ATLAS CHEP2010
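Numbers of this kind can be measured with ROOT's own counters, since TFile keeps track of the read calls it issues to the storage layer; a small sketch with a placeholder URL and tree name:

```cpp
#include <cstdio>
#include "TFile.h"
#include "TTree.h"

void count_reads(const char* url = "dcap://server.example.org//pnfs/d3pd.root",
                 Long64_t cacheSize = 20 * 1024 * 1024) {
  TFile* f = TFile::Open(url);
  TTree* t = (TTree*)f->Get("physics");        // assumed D3PD tree name
  if (cacheSize > 0) t->SetCacheSize(cacheSize);
  for (Long64_t i = 0; i < t->GetEntries(); ++i) t->GetEntry(i);
  printf("bytes read: %lld in %d read calls\n",
         f->GetBytesRead(), f->GetReadCalls());
  f->Close();
}
```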
dCache, 100 jobs, TTC on, un-optimized files. The IO rate saturates below 1 GB/s and the CPU efficiency caves in; fewer than 20 simultaneous jobs use up all the bandwidth. 20 I. Vukotic ATLAS CHEP2010
Conclusions. The data volume makes efficient reading of the data extremely important. There are many possible ways and parameters to optimize the data for faster input; different formats and use cases, with sometimes conflicting requirements, make the optimization more difficult. The currently used file reordering significantly decreased job durations and the stress on the disk systems; we will move to ROOT-optimized files. DPM, Lustre and dCache need careful, job-specific tuning to reach optimal performance, and a lot of improvements are needed to efficiently support the large-scale IO required by analysis jobs. 21 I. Vukotic ATLAS CHEP2010
BACKUP SLIDES 22 I. Vukotic ATLAS CHEP2010
Formats. [Diagram: RAW produced at Tier-0; reconstruction gives ESD and AOD; successive reductions give DESD(M)/DAOD(M) and D1PD, D2ESD(M)/D2AOD(M) and D2PD, D3ESD(M)/D3AOD(M) and D3PD.] All the formats are ROOT files. Athena versions until 15.9.0 used ROOT 5.22 (with a lot of features backported); since then, ROOT 5.26. D3PDs have a completely flat structure. 23 I. Vukotic ATLAS CHEP2010
Lustre setup 24 I. Vukotic ATLAS CHEP2010
NAF. Related contributions: PO-MON-031, "ATLAS Operation in the GridKa Tier1/Tier2 Cloud", DUCKECK, Guenter (LMU Munich); PO-WED-029, "ATLAS Setup and Usage of the German National Analysis Facility at DESY", MEHLHASE, Sascha (DESY). The batch system is SGE with O(1000) cores. dCache is connected via 1 or 10 Gbit depending on the pool; Lustre is connected via InfiniBand. 25 I. Vukotic ATLAS CHEP2010
EOS 26 I. Vukotic ATLAS CHEP2010