
1 ROOT I/O: The Fast and Furious. CHEP 2010, Taipei, October 19. Philippe Canal/FNAL, Brian Bockelman/Nebraska, René Brun/CERN

2 Overview
Several enhancements to ROOT I/O performance:
- Prefetching (a.k.a. TTreeCache)
- Clustering the baskets
- I/O challenges in CMS
- Optimizing the streaming engine
CHEP 2010, Philippe Canal, Fermilab. ROOT I/O: The Fast and Furious. October 2010

3 ROOT I/O Landscape
[Diagram: objects in memory <-> unzipped buffer <-> zipped buffer <-> local disk file, or over the network to a remote disk file]

4 ROOT I/O – Branches And Baskets
[Diagram: the streamer writes tree entries N, N+1, ... into per-branch baskets of the tree in the file]

5 Without Prefetching
[Diagram: baskets scattered across the file, each read individually]

6 The default was for all buffers to have the same size, but branch buffers do not fill at the same time. A branch containing one integer per event with a buffer size of 32 KB is written to disk every ~8000 events, while a branch containing a non-split collection may be written at each event.
Without TTreeCache: many small reads.
When reading with prefetching there were still inefficiencies: backward seeks and gaps in the reads.
Hand tuning the baskets, which was feasible with a dozen branches, became completely impracticable for a TTree with more than 10000 branches.

7 Solution: TTreeCache
Prefetches and caches a set of baskets (from several branches).
Designed to reduce the number of file reads (or network messages) when reading a TTree, by several orders of magnitude.
Configuration:
- Size of the reserved memory area
- Set of branches to be read, or range of entries for the learning phase
- Range of entries to read

T->SetCacheSize(cachesize);
if (cachesize != 0) {
   T->SetCacheEntryRange(efirst, elast);
   T->AddBranchToCache(data_branch, kTRUE);  // Request all the sub-branches too
   T->AddBranchToCache(cut_branch, kFALSE);
   T->StopCacheLearningPhase();
}

8 void taodr(Int_t cachesize=10000000) {
   gSystem->Load("aod/aod");  // shared lib generated with TFile::MakeProject
   TFile *f = TFile::Open("AOD.067184.big.pool.root");
   TTree *T = (TTree*)f->Get("CollectionTree");
   Long64_t nentries = T->GetEntries();
   T->SetCacheSize(cachesize);
   T->AddBranchToCache("*", kTRUE);
   TTreePerfStats ps("ioperf", T);
   for (Long64_t i = 0; i < nentries; i++) {
      T->GetEntry(i);
   }
   T->PrintCacheStats();
   ps.SaveAs("aodperf.root");
}

TTreePerfStats:
root > TFile f("aodperf.root")
root > ioperf.Draw()

******TreeCache statistics for file: AOD.067184.big.pool_3.root ******
Number of branches in the cache....: 9705
Cache Efficiency...................: 0.973247
Cache Efficiency Rel...............: 1.000000
Learn entries......................: 100
Reading............................: 1265967732 bytes in 115472 transactions
Readahead..........................: 0 bytes with overhead = 0 bytes
Average transaction................: 10.963417 Kbytes
Number of blocks in current cache..: 3111, total size: 2782963

9 With TTreeCache
Old real time = 722 s; new real time = 111 s: a gain of a factor 6.5. The limitation is now CPU time.

10 Better But...
Sequential reading still requires some backward file seeks, and there are still a few interleaved reads.

11 ROOT I/O – Default Basket Sizes
[Diagram: with a uniform default basket size, the branches of the tree fill their baskets at different rates across the tree entries]

12 OptimizeBaskets
Improve basket sizes. The default basket size was the same for all branches, and tuning the sizes by hand is very time consuming. TTree::OptimizeBaskets resizes the baskets to even out the number of entries in the baskets of each branch and to reduce the total memory use.
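The idea behind OptimizeBaskets can be sketched as follows (a toy model of the goal, not ROOT's actual algorithm): size each branch's basket in proportion to its bytes per entry, so every basket holds the same number of entries within a total memory budget.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Toy sketch: give every branch a basket sized so all baskets hold the
// same number of entries, within totalBudgetBytes of memory overall.
std::vector<long> evenOutBaskets(const std::vector<long>& bytesPerEntry,
                                 long totalBudgetBytes) {
    long bytesPerEventAllBranches =
        std::accumulate(bytesPerEntry.begin(), bytesPerEntry.end(), 0L);
    long entriesPerBasket = totalBudgetBytes / bytesPerEventAllBranches;
    std::vector<long> sizes;
    for (long b : bytesPerEntry)
        sizes.push_back(b * entriesPerBasket);  // same entry count everywhere
    return sizes;
}
```

With two branches writing 4 and 96 bytes per entry and a 100000-byte budget, each basket holds 1000 entries, so the baskets are 4000 and 96000 bytes: every branch now flushes at the same entry boundaries.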

13 ROOT I/O – Split/Cluster
[Diagram: the baskets of all branches are flushed together, so a range of consecutive entries forms one contiguous cluster in the file]

14 Clustering (AutoFlush)
Enforce clustering: once a reasonable amount of data (default is 30 MB) has been written to the file, all baskets are flushed out and the number of entries written so far is recorded in fAutoFlush. From then on, all the baskets are flushed at every multiple of this number of entries. This ensures that the range of entries between two flushes can be read in one single file read.
The first time FlushBaskets is called, we also call OptimizeBaskets.
The TreeCache is always set to read a number of entries that is a multiple of fAutoFlush.
No backward seeks are needed to read the file: a dramatic improvement in the raw disk I/O speed.
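The policy described above can be sketched as a small simulation (assumed behavior for illustration, not the ROOT source): flush once 30 MB have been written, record the entry count at that point, then flush at every multiple of it.

```cpp
#include <cassert>

// Toy simulation of the AutoFlush policy: the first flush fixes the
// cluster size (in entries); later flushes happen at every multiple.
struct AutoFlushSim {
    static constexpr long long kThreshold = 30LL * 1024 * 1024;  // 30 MB
    long long bytesWritten = 0;
    long fAutoFlush = 0;  // entries per cluster, fixed at the first flush
    long entries = 0;

    // Returns true when filling this entry triggers a cluster flush.
    bool fill(long long eventBytes) {
        ++entries;
        bytesWritten += eventBytes;
        if (fAutoFlush == 0) {
            if (bytesWritten >= kThreshold) { fAutoFlush = entries; return true; }
            return false;
        }
        return entries % fAutoFlush == 0;
    }
};
```

With 1 MB events, the first flush lands at entry 30, and subsequent flushes at entries 60, 90, ...: every cluster of 30 entries is contiguous on disk and can be read in one go.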

15 ROOT I/O – Split/Cluster
[Diagram: the same layout after clustering, with the baskets of consecutive entries stored contiguously in the file]

16 OptimizeBaskets, AutoFlush
Solution, enabled by default: automatically tweak basket sizes, and flush baskets at regular intervals.

17 CMS IO Changes
18 months ago (cosmics reconstruction with CMSSW_3_0_0): an average of 3.88 reads and 138 KB of data per event, 35 KB per read.
Since then:
1. Changed split level to zero.
2. Fixed TTreeCache usage in CMSSW. There was one cache per file, and using multiple TTrees was resetting the cache every time we switched between TTrees.
3. Improved read order.
Now (event data reconstruction with a CMSSW_3_9_x pre-release): an average of 0.18 reads and 108 KB of data per event, 600 KB per read. This under-estimates the improvement, as we are comparing ordered cosmics with unordered pp data.

18 CMS IO Challenges
One of the biggest issues tackled was read order. Files come out of CERN with the ROOT TTree ordering not equal to the CMS event-number ordering, so they are read out of order; the worst-performing jobs read 20x too much data from skipping around in the tree.
In CMSSW_3_8_0, we now read each run/lumi in TTree order: a huge performance boost on real data. If the runs are contiguous, the whole file is read in order, but this adds new constraints on how we merge files.
TTreeCache performance: great for ntuple reading and reco-like jobs, but needs to be improved for skims. What if the training period was not representative of the use case? We have little knowledge of when a skim will select a full event.

19 CMS – Collaborating with the ROOT Team
Collaboration is very good because there is mutual interest and experts are available.
Lack of a CMS IO standard candle: makes it difficult for CMS to understand how changes in the file layout or workflow patterns affect IO performance, and hinders CMS's ability to communicate our needs to the ROOT team.
ROOT IO debugging is challenging: it is tedious to trace curious IO patterns back to the code causing them, and it takes a CMSSW and ROOT IO expert to understand what's going on and communicate what needs to be fixed. It took 2 years for CMS to notice, investigate, and solve (with ROOT's help) why TTreeCache didn't work in CMSSW. Statistics tell you when things are bad, but it takes an expert to figure out why or how to fix it.

20 Memberwise Streaming
Used for split collections inside a TTree; now also the default for streaming a collection even when not split. Better compression, faster read time.
Results shown for CMS files, some fully split and some unsplit.
[Diagram: object-wise layout x1 y1 z1 x2 y2 z2 x3 y3 z3 reordered member-wise as x1 x2 x3 y1 y2 y3 z1 z2 z3]
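The reordering in the diagram can be sketched in plain C++ (illustration only, not the ROOT streamer): instead of writing each object's members in turn, write all the x's, then all the y's, then all the z's, so runs of similar values end up adjacent, which typically compresses better.

```cpp
#include <cassert>
#include <vector>

struct Point { float x, y, z; };

// Member-wise flattening of a collection of Points: x1 x2 ... y1 y2 ...
// z1 z2 ... instead of the object-wise x1 y1 z1 x2 y2 z2 ...
std::vector<float> memberWise(const std::vector<Point>& pts) {
    std::vector<float> out;
    out.reserve(pts.size() * 3);
    for (const Point& p : pts) out.push_back(p.x);  // all x's first
    for (const Point& p : pts) out.push_back(p.y);  // then all y's
    for (const Point& p : pts) out.push_back(p.z);  // then all z's
    return out;
}
```

For two points (1,2,3) and (4,5,6) this yields 1 4 2 5 3 6: each member's values sit together in the output stream.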

21 Optimization Of Inner Loops
A 'generic' switch statement was used, via a large template function, for several cases: single object, collection of pointers, collection of objects, the caching mechanism for schema evolution, and split collection of pointers. This improved code localization and reduced code duplication.
Drawbacks:
- Many initializations done 'just in case', both intentionally and unintentionally. In at least one case the compiler's optimizer re-ordered the code, resulting in more memory fetches than necessary.
- Many if statements to 'tweak' the implementation at run time.
- To stay generic, the code could only dereference the address through operator[], relying on function overloading. For the collection proxy, to reduce code duplication, operator[] was very generic, preventing efficient use of 'next' in the implementation of looping.
- Code generality required loops in all cases, even to loop just once.

22 Optimization Of Inner Loops
Possible solution: go further with template functions by customizing the implementation of the switch cases depending on the 'major' case.
Disadvantages:
- Still could not optimize for a specific collection (for example vector) because collections are 'hidden' behind the TVirtualCollectionProxy.
- Cannot go toward a complete template solution, because it would not support the 'emulated' case.
- The large switch statement still prevents some/most compilers from properly optimizing the code.

23 Optimization Of Inner Loops
Solution: replace the switch statement with a customized function call.
Advantages:
- New implementations can be added more easily.
- The action can be customized for each specific case: no inner loop for a single object; a loop with a known increment for a vector of pointers and TClonesArray; a loop with a simple increment for vector and all emulated collections; a loop using the actual iterator for compiled collections.
- Any if statement that can be resolved just by looking at the on-file and in-memory class layouts is removed; some of the function calls can also be stripped out.
- The outer loop is simpler and can now be overloaded in the various TBuffer implementations, removing code needed only in special cases (XML and SQL).
Disadvantage: increased code duplication (but our code has been very stable).
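The dispatch change described above can be sketched with hypothetical names (a minimal sketch of the pattern, not ROOT's classes): pick the specialized routine once, store it as a member-function pointer, and make the hot path a single indirect call instead of a switch.

```cpp
#include <cassert>

// Sketch: a switch-per-call replaced by a member-function pointer that
// is selected once and then invoked directly on every Stream() call.
class Streamer {
public:
    Streamer() : fImpl(&Streamer::StreamDefault) {}

    // Called once, when enough is known to pick the specialization.
    void SelectSingleObject() { fImpl = &Streamer::StreamSingleObject; }

    // Hot path: one indirect call, no switch, no per-call branching.
    int Stream() { return (this->*fImpl)(); }

private:
    int StreamDefault()      { return 0; }  // generic fallback
    int StreamSingleObject() { return 1; }  // specialized: no inner loop
    int (Streamer::*fImpl)();
};
```

Selecting the implementation up front is what lets each case drop the checks and loops the generic switch had to keep for everyone.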

24 Examples

Before:
void TClass::Streamer(void *object, TBuffer &b) const {
   // Stream object of this class to or from buffer.
   switch (fStreamerType) {
      case kExternal:
      case kExternal|kEmulated: { ...; return; }
      case kTObject: { ...; return; }
      // etc.
   }
}

After:
inline void Streamer(void *obj, TBuffer &b) const {
   // Inline for performance, skipping one function call.
   (this->*fStreamerImpl)(obj, b, onfile_class);
}

void TClass::StreamerDefault(void *object, TBuffer &b) const {
   // Default streaming in cases where Property() has not yet been called.
   Property();  // Sets fStreamerImpl
   (this->*fStreamerImpl)(object, b);
}

25 Main Focuses
- Performance: both CPU and I/O (and more improvements to come)
- Consolidation: Coverity, Valgrind, ROOT forum, Savannah
- Support

26 Backup Slides

27 Reading network files
TR = transactions, NT = network time (latency + data transfer)
f = TFile::Open("http://root.cern.ch/files/AOD.067184.big.pool_4.root")
f = TFile::Open("http://root.cern.ch/files/atlasFlushed.root")

