ROOT I/O: The Fast and Furious
CHEP 2010, Taipei, October 19, 2010
Philippe Canal/FNAL, Brian Bockelman/Nebraska, René Brun/CERN

Overview
Several enhancements to ROOT I/O performance:
- Prefetching (a.k.a. TTreeCache)
- Clustering the baskets
- I/O challenges in CMS
- Optimizing the streaming engine

ROOT I/O Landscape
[Diagram: objects in memory are streamed through an unzipped buffer and a zipped buffer to a local disk file, or shipped as a zipped buffer across the network to a remote disk file; reading reverses the path.]

ROOT I/O – Branches And Baskets
[Diagram: a tree in the file, with its streamer and branches; each branch's entries (N, N+1, ...) are grouped into baskets scattered through the file.]

Without Prefetching
[Diagram: baskets from different branches are read one at a time, scattered across the file.]

The Problem
- The default was for all branch buffers to have the same size, but branch buffers do not fill up at the same time. A branch containing one integer per event and with a buffer size of 32 KB will be written to disk every 8000 events, while a branch containing a non-split collection may be written at each event.
- Without the TTreeCache this results in many small reads.
- Even when reading with prefetching there were still inefficiencies: backward seeks and gaps in the reads.
- Hand-tuning the basket sizes, which was feasible with a dozen branches, became completely impracticable for TTrees with a very large number of branches.

Solution: TTreeCache
- Prefetches and caches a set of baskets (from several branches).
- Designed to reduce the number of file reads (or network messages) when reading a TTree, by several orders of magnitude.
- Configuration: the size of the reserved memory area, the set of branches to be read (or a range of entries to learn from), and the range of entries to read.

   T->SetCacheSize(cachesize);
   if (cachesize != 0) {
      T->SetCacheEntryRange(efirst, elast);
      T->AddBranchToCache(data_branch, kTRUE); // request all the sub-branches too
      T->AddBranchToCache(cut_branch, kFALSE);
      T->StopCacheLearningPhase();
   }

TTreePerfStats

   void taodr(Int_t cachesize = 10000000) // default value illustrative; elided in transcript
   {
      gSystem->Load("aod/aod"); // shared lib generated with TFile::MakeProject
      TFile *f = TFile::Open("AOD big.pool.root");
      TTree *T = (TTree*)f->Get("CollectionTree");
      Long64_t nentries = T->GetEntries();
      T->SetCacheSize(cachesize);
      T->AddBranchToCache("*", kTRUE);
      TTreePerfStats ps("ioperf", T);
      for (Long64_t i = 0; i < nentries; i++) {
         T->GetEntry(i);
      }
      T->PrintCacheStats();
      ps.SaveAs("aodperf.root");
   }

   root > TFile f("aodperf.root")
   root > ioperf.Draw()

   ******TreeCache statistics for file: AOD big.pool_3.root ******
   Number of branches in the cache...: 9705
   Cache Efficiency..................: …
   Cache Efficiency Rel..............: …
   Learn entries.....................: 100
   Reading...........................: … bytes in … transactions
   Readahead.........................: 0 bytes with overhead = 0 bytes
   Average transaction...............: … Kbytes
   Number of blocks in current cache.: 3111, total size: …

With TTreeCache
- Old real time = 722 s; new real time = 111 s. A gain of a factor 6.5!
- The limitation is now CPU time.

Better, But…
- Sequential reading still requires some backward file seeks.
- Still a few interleaved reads.

ROOT I/O – Default Basket Sizes
[Diagram: with the default (uniform) basket sizes, the baskets of different branches hold different numbers of entries and end up interleaved in the file.]

OptimizeBaskets: Improved Basket Sizes
- The default basket size was the same for all branches, and tuning the sizes by hand is very time consuming.
- TTree::OptimizeBaskets resizes the baskets to even out the number of entries per basket across branches and to reduce the total memory use.
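As a concrete illustration, TTree::OptimizeBaskets can also be invoked by hand; a minimal sketch, with illustrative file and branch names and a hypothetical 10 MB memory budget:

   // Sketch: write a tree, then let ROOT even out the basket sizes.
   #include "TFile.h"
   #include "TTree.h"

   void write_optimized()
   {
      TFile f("example.root", "RECREATE");
      TTree t("T", "tree with per-branch basket sizes");
      Int_t run;
      Float_t energy;
      t.Branch("run", &run, "run/I");        // small payload: fills a 32 KB basket slowly
      t.Branch("energy", &energy, "energy/F");
      for (Int_t i = 0; i < 100000; ++i) {
         run = i / 1000;
         energy = i * 0.1f;
         t.Fill();
      }
      // Redistribute basket sizes so branches reach their flush point together,
      // within a ~10 MB total memory budget (first argument, in bytes).
      t.OptimizeBaskets(10000000, 1.1, "");
      t.Write();
   }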

ROOT I/O – Split/Cluster
[Diagram: the tree in the file with streamer and branches; the baskets belonging to a range of entries (a cluster) are grouped together.]

Clustering (AutoFlush)
- Enforce "clustering": once a reasonable amount of data (default is 30 MB) has been written to the file, all baskets are flushed out and the number of entries written so far is recorded in fAutoFlush. From then on, the baskets are flushed at every multiple of this number of entries.
- This ensures that the range of entries between two flushes can be read in one single file read.
- The first time FlushBaskets is called, we also call OptimizeBaskets.
- The TTreeCache is always set to read a number of entries that is a multiple of fAutoFlush.
- No backward seeks are needed to read the file: a dramatic improvement in raw disk I/O speed.
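The flush policy can also be steered explicitly through TTree::SetAutoFlush; a short sketch (names and values illustrative; the negative default already matches what the slide describes):

   // Sketch: controlling the cluster size explicitly.
   #include "TFile.h"
   #include "TTree.h"

   void write_clustered()
   {
      TFile f("clustered.root", "RECREATE");
      TTree t("T", "tree with an explicit flush policy");
      Int_t run;
      t.Branch("run", &run, "run/I");

      // Negative value: flush (and, on the first flush, optimize baskets)
      // every time ~30 MB of data have been written -- the default policy.
      t.SetAutoFlush(-30000000);
      // A positive value instead pins the cluster to a fixed entry count:
      // t.SetAutoFlush(10000);   // flush every 10000 entries

      for (Int_t i = 0; i < 1000000; ++i) { run = i; t.Fill(); }
      t.Write();
   }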

ROOT I/O – Split/Cluster (continued)
[Diagram: the same tree after clustering; each cluster's baskets are contiguous, so a cluster can be read with a single file read.]

OptimizeBaskets, AutoFlush
Solution, enabled by default: automatically tweak the basket sizes! Flush baskets at regular intervals!

CMS I/O Changes
- 18 months ago: averaged 3.88 reads and 138 KB of data per event, 35 KB per read (cosmics reconstruction with CMSSW_3_0_0).
- Since then:
  1. Changed the split level to zero.
  2. Fixed TTreeCache usage in CMSSW. There was one cache per file, and using multiple TTrees reset the cache every time we switched between TTrees.
  3. Improved the read order.
- Now: averages 0.18 reads and 108 KB of data per event, 600 KB per read (event data reconstruction with a CMSSW_3_9_x pre-release).
- This under-estimates the improvement, as we are comparing ordered cosmics with unordered pp data.
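The first change, moving to split level zero, maps onto the split-level argument of TTree::Branch. A minimal sketch, assuming a dictionary-enabled Event class; the class, file, and branch names are illustrative, not CMSSW's:

   // Sketch: split level controls whether an object's data members get
   // their own branches.
   #include "TFile.h"
   #include "TTree.h"

   class Event;   // some class with a ROOT dictionary (assumed to exist)

   void write_unsplit(Event *event)
   {
      TFile f("events.root", "RECREATE");
      TTree t("Events", "event tree");
      // splitlevel 99 (default): one branch per data member, read selectively.
      // splitlevel 0: the whole object is streamed into a single branch,
      // trading selective reads for fewer, larger baskets.
      t.Branch("event.", "Event", (void*)&event, 32000, /*splitlevel=*/0);
      t.Fill();
      t.Write();
   }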

CMS I/O Challenges
- One of the biggest issues tackled was read order. Files come out of CERN with the ROOT TTree ordering not equal to the CMS event-number ordering, so they were read out of order; the worst-performing jobs read 20x too much data from skipping around in the tree.
- In CMSSW_3_8_0 we now read each run/lumi in TTree order: a huge performance boost on real data. If the runs are contiguous, the whole file is read in order. But this adds new constraints on how we merge files.
- TTreeCache performance: great for ntuple reading and reco-like jobs, but needs to be improved for skims. What if the training period was not representative of the use case? We have little knowledge of when a skim will select a full event.
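One mitigation for the training problem, sketched below under the assumption that the skim's branch set stabilizes later in the file, is to lengthen the TTreeCache learning phase, or to skip training and cache everything. File and tree names are illustrative:

   // Sketch: tuning the cache training window for skim-like jobs.
   #include "TFile.h"
   #include "TTree.h"
   #include "TTreeCache.h"

   void open_for_skim()
   {
      TFile *f = TFile::Open("skim_input.root");   // illustrative file name
      TTree *t = (TTree*)f->Get("Events");
      t->SetCacheSize(30000000);                   // 30 MB cache
      TTreeCache::SetLearnEntries(1000);           // train on more entries than the default
      // Or bypass learning entirely and cache all branches:
      // t->AddBranchToCache("*", kTRUE);
      // t->StopCacheLearningPhase();
   }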

CMS – Collaborating with the ROOT Team
- Collaboration is very good because there is mutual interest and experts are available.
- Lack of a CMS I/O standard candle: this makes it difficult for CMS to understand how changes in the file layout or workflow patterns affect I/O performance, and hinders CMS's ability to communicate our needs to the ROOT team.
- ROOT I/O debugging is challenging: it is tedious to track a curious I/O pattern back to the code causing it, and it takes a CMSSW and ROOT I/O expert to understand what is going on and communicate what needs to be fixed. It took 2 years for CMS to notice, investigate, and solve (with ROOT's help) why TTreeCache didn't work in CMSSW.
- Statistics tell you when things are bad, but it takes an expert to figure out why or how to fix it.

Memberwise Streaming
- Used for split collections inside a TTree.
- Now the default for streaming collections even when not split.
- Better compression, faster read time.
- Object-wise layout: x1 y1 z1 | x2 y2 z2 | x3 y3 z3. Member-wise layout: x1 x2 x3 | y1 y2 y3 | z1 z2 z3.
- [Chart: results for CMS files, some fully split, some unsplit.]
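For reference, member-wise streaming of collections outside split trees can be toggled through a global switch; a minimal sketch (TVirtualStreamerInfo::SetStreamMemberWise is ROOT's API, the wrapper function name is ours):

   // Sketch: the global switch for member-wise streaming of collections.
   #include "TVirtualStreamerInfo.h"

   void toggle_memberwise(Bool_t on = kTRUE)
   {
      // When enabled, a collection of objects is written one member at a
      // time (x1 x2 x3 y1 y2 y3 ...), which groups similar values together
      // and typically compresses better and reads faster.
      TVirtualStreamerInfo::SetStreamMemberWise(on);
   }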

Optimization Of Inner Loops
- A 'generic' switch statement was used, via a large template function, for several cases: single object, collection of pointers, collection of objects, the caching mechanism for schema evolution, and split collection of pointers. This improved code localization and reduced code duplication.
- Drawbacks:
  - Many initializations were done 'just in case', both intentionally and unintentionally. In at least one case the compiler's optimizer re-ordered the code, resulting in more memory fetches than necessary.
  - Many if statements to 'tweak' the implementation at run time.
  - To stay generic, the code could only dereference the address using operator[], hence relying on function overloading. For the collection proxy, to reduce code duplication, operator[] was very generic. This prevented efficient use of 'next' in the implementation of looping.
  - Code generality required loops in all cases, even to loop just once.

Optimization Of Inner Loops – Possible Solutions
- Go further with template functions by customizing the implementation of the switch cases depending on the 'major' case.
- Disadvantages:
  - Still could not optimize for a specific collection (for example a vector) because collections are 'hidden' behind the TVirtualCollectionProxy.
  - Cannot go toward a complete template solution because it would not support the 'emulated' case.
  - The large switch statement still prevents some/most compilers from properly optimizing the code.

Optimization Of Inner Loops – Solution
- Replace the switch statement by a customized function call.
- Advantages:
  - New implementations can be added more easily.
  - The action can be customized for each specific case: no inner loop for a single object; a loop with known increment for a vector of pointers and TClonesArray; a loop with simple increment for vectors and all emulated collections; a loop using the actual iterator for compiled collections.
  - Removes any if statement that can be resolved just by looking at the on-file and in-memory class layouts, and is able to strip out some of the function calls.
  - The outer loop is simpler and can now be overloaded in the various TBuffer implementations, removing code needed only in special cases (XML and SQL).
- Disadvantage: increased code duplication (but our code has been very stable).
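A minimal, self-contained sketch of the dispatch pattern described above (not ROOT's actual code): the case analysis runs once, when the layout is known, and the hot path becomes a single indirect call through a stored member-function pointer.

   #include <cstdio>

   class Dispatcher {
   public:
      enum EKind { kSingle, kVectorOfPtr };
      explicit Dispatcher(EKind kind)
      {
         // Resolve the variant once, up front ...
         switch (kind) {
            case kSingle:      fImpl = &Dispatcher::StreamSingle; break;
            case kVectorOfPtr: fImpl = &Dispatcher::StreamVecPtr; break;
         }
      }
      // ... so the hot path is one indirect call: no switch, no ifs.
      void Stream(void *obj) { (this->*fImpl)(obj); }

   private:
      void StreamSingle(void *obj) { std::printf("single object %p\n", obj); }
      void StreamVecPtr(void *obj) { std::printf("vector of pointers %p\n", obj); }
      void (Dispatcher::*fImpl)(void *);
   };

   int main()
   {
      int dummy = 0;
      Dispatcher d(Dispatcher::kSingle);
      d.Stream(&dummy);   // dispatches through the stored member pointer
   }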

Examples

Before:

   void TClass::Streamer(void *object, TBuffer &b) const
   {
      // Stream object of this class to or from buffer.
      switch (fStreamerType) {
         case kExternal:
         case kExternal|kEmulated: { ...; return; }
         case kTObject: { ...; return; }
         // etc.
      }
   }

After:

   inline void Streamer(void *obj, TBuffer &b) const
   {
      // Inline for performance, skipping one function call.
      (this->*fStreamerImpl)(obj, b, onfile_class);
   }

   void TClass::StreamerDefault(void *object, TBuffer &b) const
   {
      // Default streaming in cases where Property() has not yet been called.
      Property(); // Sets fStreamerImpl
      (this->*fStreamerImpl)(object, b);
   }

Main Focuses
- Performance: both CPU and I/O (and more improvements to come)
- Consolidation: Coverity, Valgrind, ROOT forum, Savannah
- Support

Backup Slides

Reading Network Files
TR = transactions, NT = network time (latency + data transfer)

   f = TFile::Open("…");   // remote URLs elided in the transcript
   f = TFile::Open("…");
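For context, remote reads in ROOT go through the same TFile::Open entry point, with the protocol prefix selecting the transport; a sketch with placeholder endpoints (not real servers):

   // Sketch: opening remote files through TFile::Open's plugin mechanism.
   #include "TFile.h"
   #include <cstdio>

   void open_remote()
   {
      // Over a WAN, latency and the number of transactions (TR), not
      // bandwidth, usually dominate -- hence the emphasis on TTreeCache.
      TFile *http = TFile::Open("http://server.example.org/data/file.root");
      TFile *xrd  = TFile::Open("root://server.example.org//data/file.root");
      if (http && xrd) std::printf("both remote files opened\n");
   }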