ROOT: Outlook and Developments, WLCG Jamboree, Amsterdam, 16-18 June 2010, René Brun/CERN.


Foreword

Since 1995 the ROOT system has been under continuous development in all areas: interpreters, 2D and 3D graphics, math libraries and statistics, and of course I/O. This talk concentrates on recent developments to speed up the I/O. In particular, these developments open new possibilities in client-server applications. Remote file access in WANs with efficient caching will be the main topic of this talk.

17 June 2010, Rene Brun: ROOT developments

ROOT I/O: a short overview of the main features

Main features
- Designed for write-once, read-many, but with support for object deletion and writing from multiple jobs.
- Simple file format, described in one slide.
- Two types of objects: keys and trees
  - keys for objects like histograms and geometries (Unix-like directory structure)
  - trees for collections of similar objects (HEP events)
- Self-describing, portable and compact files.
- Client-server support.

ROOT file structure

A ROOT file pippa.root with 2 levels of sub-directories. Objects live in directories, e.g. /pippa/DM/CJ/h15 is the object h15 in directory /pippa/DM/CJ.

File types & access
[Diagram: a local file X.root or X.xml can be accessed directly or through servers (http, rootd/xrootd, Castor, dCache, Chirp, hadoop) via the TFile / TKey / TTree / TStreamerInfo classes; relational databases (Oracle, SapDb, PgSQL, MySQL) are reached through TSQLServer / TSQLRow / TSQLResult / TTreeSQL.]

Memory ↔ Tree
[Diagram: each node is a branch in the Tree; T.Fill() moves the data from memory into the tree, T.GetEntry(6) reads entry 6 back into memory.]

Automatic branch creation from the object model:

float a;
int b;
double c[5];
int N;
float* x; //[N]
float* y; //[N]
Class1 c1;
Class2 c2; //!
Class3 *c3;
std::vector<...>;
TClonesArray *tc;

Each data member becomes a branch with its own buffers.

Object-wise / member-wise streaming

Three modes to stream an object:
- object-wise to a buffer: a1 b1 c1 d1 a2 b2 c2 d2 … an bn cn dn
- member-wise to a buffer: a1 a2 … an b1 b2 … bn c1 c2 … cn d1 d2 … dn
- each member to its own buffer: a1 a2 … an | b1 b2 … bn | c1 c2 … cn | d1 d2 … dn

Member-wise streaming gives better compression.

Browsing a TTree with TBrowser
[Screenshot: the 8 branches of T and the 8 leaves of branch Electrons; a double click on a leaf histograms it.]

Data volume & organization
100 MB - 1 GB - 10 GB - 100 GB - 1 TB - 10 TB - 100 TB - 1 PB
A TFile typically contains 1 TTree. A TChain is a collection of TTrees and/or TChains, typically the result of a query to the file catalogue.

Queries to the data base
- via the GUI (TBrowser or TTreeViewer)
- via a CINT command or a script:
  tree.Draw("x", "y 6");
  tree.Process("myscript.C");
- via compiled code:
  chain.Process("myselector.C+");
- in parallel with PROOF

ROOT I/O: current developments
Caches (and more caches), speed-ups, parallel merge

Buffering effects
Branch buffers are not full at the same time. A branch containing one integer per event with a buffer size of 32 KB will be written to disk every ~8000 events, while a branch containing a non-split collection may be written at each event. This can cause serious problems when reading if the file is not read sequentially.

Tree buffers layout
Example of a Tree with 5 branches:
- b1: 400 bytes/event
- b2: 2500 ± 50 bytes/event
- b3: 5000 ± 500 bytes/event
- b4: 7500 ± 2500 bytes/event
- b5: ± 5000 bytes/event
10 rows of 1 MB in this 10 MB file. Typical Trees have several hundred branches; each branch has its own buffer (8000 bytes, < 3000 zipped).

Looking inside a ROOT Tree

TFile f("h1big.root");
f.DrawMap();

[File map of a 280 MB file with 152 branches; 3 branches have been colored.]

I/O performance analysis
Monitor TTree reads with TTreePerfStats (new in v5.25/04):

TFile *f = TFile::Open("xyz.root");
TTree *T = (TTree*)f->Get("MyTree");
TTreePerfStats ps("ioperf", T);
Long64_t n = T->GetEntries();
for (Long64_t i = 0; i < n; ++i) {
   T->GetEntry(i);
   DoSomething();
}
ps.SaveAs("perfstat.root");

Study TTreePerfStats
Visualizes read access (x: tree entry; y: file offset, and y: real time):

TFile f("perfstat.root");
ioperf->Draw();
ioperf->Print();

See Doctor

Overlapping reads
[File map: 100 MB of overlapping read requests.]

After doctor
Old real time = 722 s, new real time = 111 s: a gain of a factor 6.5! The limitation is now CPU time.

Important factors
[Diagram: objects in memory ↔ unzipped buffer ↔ zipped buffer ↔ local disk file, or zipped buffer sent over the network to/from a remote disk file.]

A major problem: network latency
Round Trip Time (RTT) = 2 × Latency + Response Time
With one request per read, three reads between client and server cost:
Total Time = 3 × Client Process Time (CPT) + 3 × RTT
           = 3 × CPT + 3 × Response Time + 3 × (2 × Latency)

Idea (diagram)
Perform one big request (readv) instead of many small requests. This is only possible if the future reads are known!
Total Time = 3 × CPT + 3 × Response Time + 2 × Latency
The latency is now paid only once.

What is the TreeCache?
It groups into one buffer all blocks from the branches in use. The blocks are sorted in ascending order and consecutive blocks are merged, such that the file is read sequentially. It typically reduces by a large factor the number of transactions with the disk, and in particular with the network for servers like httpd, xrootd or dCache. The typical size of the TreeCache is 30 MB, but higher values will always give better results.

readv implementations
- xrootd: TFile f1("root://machine1.xx.yy/file1.root");
- dCache: TFile f2("dcap://machine2.uu.vv/file2.root");
- httpd: TFile f3("http://..."); uses a standard web server (e.g. apache2); the performance winner (but not many people know!)

TTreeCache with LANs and WANs (old slide from 2005)
One query to a 280 MB Tree, I/O = 6.6 MB, with cache sizes 0, 64 KB and 10 MB, from clients at increasing latency:
- A: local, pcbrun.cern.ch
- B: CERN LAN, 100 Mb/s
- C: CERN wireless, 10 Mb/s
- D: Orsay, 100 Mb/s
- E: Amsterdam, 100 Mb/s
- F: ADSL home, 8 Mb/s
- G: Caltech, 10 Gb/s
[Table of real times per client and cache size; the larger the latency, the more the cache wins.]

TreeCache results table
[Table: read calls, real time and CPU time on pcbrun4 and macbrun as a function of cache size, over a 1 ms LAN, for three files:
- original ATLAS file (1266 MB), 9705 branches, split=99
- reclustered with OptimizeBaskets 30 MB (1086 MB), 9705 branches, split=99
- reclustered with OptimizeBaskets 30 MB (1147 MB), 203 branches, split=0]

TreeCache: new interface
Facts: most users did not know whether they were using the TreeCache or not. We decided to implement a simpler interface from TTree itself (no need to know about the TTreeCache class anymore). Because some users claimed to use the TreeCache while the results clearly showed the contrary, we also decided to implement a new I/O monitoring class, TTreePerfStats.

TTreeCache (improved in v5.25/04)
Sends a collection of read requests before the analysis needs the baskets. Must predict the baskets: it learns from previous entries and takes a TEntryList into account. Enabled per TTree:

f = new TFile("xyz.root");
T = (TTree*)f->Get("Events");
T->SetCacheSize(30000000);   // e.g. 30 MB, the typical size
T->AddBranchToCache("*");

What is the readahead cache?
The readahead cache reads all non-consecutive blocks that fall within the range of the cache, minimizing the number of disk accesses. This operation could in principle be done by the OS, but the OS parameters are not tuned for many small reads, in particular when many jobs read concurrently from the same disk. When using large values for the TreeCache, or when the baskets are well sorted by entry, the readahead cache is not necessary. The typical (default) value is 256 KB, although 2 MB seems to give better results on ATLAS files, but not with CMS or ALICE.

Half way
Much more ordered reads, but still lots of jumps because the baskets are spread across the file.

OptimizeBaskets, AutoFlush (new in v5.25/04)
Solution, enabled by default: tweak the basket sizes, and flush baskets at regular intervals!

OptimizeBaskets
Facts: users do not tune the branch buffer sizes. Effect: branches for the same event are scattered in the file. TTree::OptimizeBaskets is a new function in 5.25 that optimizes the buffer sizes, taking into account the population in each branch. One can call this function on an existing read-only Tree file to see the diagnostics.

FlushBaskets
TTree::FlushBaskets was introduced in 5.22, but was called only once, at the end of the filling process, to disconnect the buffers from the tree header. In version 5.25/04 this function is called automatically once a reasonable amount of data (default 30 MB) has been written to the file. The frequency of the TTree::FlushBaskets calls can be changed by calling TTree::SetAutoFlush. The first time FlushBaskets is called, OptimizeBaskets is called as well.

FlushBaskets (2)
The frequency at which FlushBaskets is called is saved in the Tree (new member fAutoFlush). This very important parameter is used when reading to compute the best value for the TreeCache: the TreeCache is set to a multiple of fAutoFlush. Thanks to FlushBaskets there are no backward seeks for files written with 5.25/04. This makes a dramatic improvement in raw disk I/O speed.

Without FlushBaskets
[Read map: ATLAS file written with 5.22, read with a 20 MB cache.]

With FlushBaskets
[Read map: ATLAS file written with 5.26, read with a 20 MB cache.]

Similar pattern with CMS files
CMS: mainly a CPU problem, due to a complex object model.

Use case: reading 33 MB out of 1100 MB
Old ATLAS file: seek time = 3186 × 5 ms = 15.9 s. New ATLAS file: seek time = 265 × 5 ms = 1.3 s.

Use case: reading 20% of the events
Even in this difficult case the cache is better.

Caching a remote file
ROOT can write a local cache of a remote file on demand. This feature is extensively used by the ROOT stress suite, which reads many files from root.cern.ch:

TFile f("http://root.cern.ch/...", "CACHEREAD");

The CACHEREAD option opens an existing file for reading through the file cache. If the download fails, the file will be opened remotely. The file will be downloaded to the directory specified by SetCacheFileDir().

Caching the TreeCache
The TreeCache is mandatory when reading files in a LAN, and of course in a WAN: it reduces the number of network transactions by a large factor. One could think of a further optimization: keeping the TreeCache locally for reuse in a following session. A prototype implementation (by A. Peters) is currently being tested and looks very promising. A generalisation of this prototype to fetch TreeCache buffers from proxy servers would be a huge step forward.

Caching the TreeCache
[Diagram: a remote disk file is read over the network into a 10 MB zipped TreeCache, unzipped to 30 MB, and kept in a local disk or memory file.]

A. Peters cache prototype
[Three screenshot slides.]

Caching the TreeCache: preliminary results
[Table: real time and CPU time for four sessions — local, remote xrootd, with cache (1st time), with cache (2nd time) — on a 1 GB ATLAS AOD file, with the preliminary cache from Andreas Peters. Very encouraging results.]

Other improvements
- Code optimization to reduce the CPU time spent in I/O.
- Use of memory pools to reduce malloc/free calls and in particular memory fragmentation. The use of memory pools could be extended automatically to include user data structures, the main cause of memory fragmentation.
- Working on parallel buffer merges, a very important requirement for multi/many-core systems.

Parallel buffer merge
A parallel job with 8 cores: each core produces a 1 GB file in 100 seconds. Assuming one can read each file at 50 MB/s and write at 50 MB/s, merging serially takes 8 × (20 s + 20 s) = 320 s! One can do the job in < 160 s.
[Diagram: files F1 … F8, 1 GB each, merged into one 8 GB file.]

Parallel buffer merge (2)
[Diagram: files F1 … F8 (1 GB each) stream 10 MB buffers B1 … B8 into the 8 GB merged file.]

Summary
After 15 years of development, we are still making substantial improvements in the I/O system, thanks to the many use cases and a better understanding of chaotic user analysis. We believe that file access in a WAN with local caches and proxies is the way to go. This will have many implications, including a big simplification of the data management. We are also preparing the ground for efficient use of many-core systems.