I/O Strategies for Multicore Processing in ATLAS
P. van Gemmeren¹, S. Binet², P. Calafiura³, W. Lavrijsen³, D. Malon¹ and V. Tsulaia³, on behalf of the ATLAS collaboration
¹ Argonne National Laboratory, Argonne, Illinois 60439, USA
² Laboratoire de l'Accélérateur Linéaire/IN2P3, Orsay Cedex, France
³ Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA

Abstract
A critical component of any multicore/manycore application architecture is the handling of input and output. Even in the simplest of models, design decisions interact both in obvious and in subtle ways with persistence strategies. When multiple workers handle I/O independently using distinct instances of a serial I/O framework, for example, the way data from consecutive events are compressed together can cause serious inefficiencies, with workers redundantly reading the same buffers, or multiple instances thereof. With shared reader strategies, caching and buffer management by the persistence infrastructure and by the control framework may have decisive performance implications for a variety of design choices. Providing the next event may seem straightforward when all event data are stored contiguously in a block, but such strategies can incur performance penalties when only a subset of a given event's data is needed; conversely, when event data are partitioned by type in persistent storage, providing the next event becomes more complicated, requiring marshalling of data from many I/O buffers. Output strategies pose similarly subtle problems, with complications that may lead to significant serialization and the possibility of serial bottlenecks, either during writing or in post-processing, e.g., during data stream merging. In this paper we describe the I/O components of AthenaMP, the multicore implementation of the ATLAS control framework, and the considerations that have led to the current design, with attention to how these I/O components interact with ATLAS persistent data organization and infrastructure.

Outline
- Current AthenaMP I/O infrastructure
- Data storage
  - Raw data
  - Simulated, reconstructed, derived data
- Handicaps of current I/O on multicore platforms
  - Reading data
  - Writing data
- Scatter / gather architecture for multicore I/O
  - Shared reader strategies and alternatives: ByteStream, POOL / ROOT
  - Shared writer
- Conclusions / outlook

Current AthenaMP I/O infrastructure
- Multiple workers handle I/O independently, using distinct instances of a serial I/O framework.
  - Each worker process produces its own output file; these files need to be merged after all workers are done. A minimal sketch of this model follows below.
[Diagram: init, fork, bootstrap, parallel event processing across workers, collect & merge]
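As a rough illustration of this model, the sketch below forks a few workers that each write a private output file, followed by the serial merge step. The helper names (process_event, worker, N_WORKERS) are invented for the example and are not AthenaMP code.

```python
# Illustrative sketch only: fork-and-merge workers in the spirit of the
# first AthenaMP I/O design; all names here are hypothetical.
import multiprocessing as mp

N_WORKERS = 4
N_EVENTS = 100

def process_event(evt):
    # stand-in for the full serial Athena event loop on one event
    return f"record-{evt}\n"

def worker(worker_id, events):
    # each worker runs a distinct serial I/O framework instance and
    # writes its own output file
    with open(f"worker_{worker_id}.out", "w") as f:
        for evt in events:
            f.write(process_event(evt))

if __name__ == "__main__":
    # round-robin event assignment to workers
    procs = [mp.Process(target=worker,
                        args=(w, range(w, N_EVENTS, N_WORKERS)))
             for w in range(N_WORKERS)]
    for p in procs: p.start()
    for p in procs: p.join()

    # serial post-processing step: merge all worker outputs
    with open("merged.out", "w") as out:
        for w in range(N_WORKERS):
            with open(f"worker_{w}.out") as f:
                out.write(f.read())
```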

Data storage
Raw data: ByteStream
- Simple C-style structure.
- Since 2011, ATLAS compresses raw data events:
  - saves about 50% of disk storage;
  - event-wise (not file-wise) compression preserves efficient single-event reading.
Simulated, reconstructed, derived data: (POOL) / ROOT
- Object data, made simpler and schema-evolvable using transient-persistent conversion.
- Collections of objects in ROOT TBranches, read separately on demand.
- Column-wise compressed into baskets; depending on the data product, multiple events share a basket.
[Diagram: compressed buffer, retrieval, and access, indexed by event number]
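Event-wise (rather than file-wise) compression matters because a reader can seek directly to one event and decompress only that event. A minimal sketch of the idea, assuming a simple offset index and zlib compression (both stand-ins, not the actual ATLAS ByteStream format):

```python
# Illustrative sketch, not ATLAS code: why event-wise compression preserves
# efficient single-event reads. Each event is compressed on its own, and an
# offset index lets a reader seek to and decompress exactly one event.
import zlib

def write_events(path, events):
    """Compress each event separately and record (offset, length) per event."""
    index = []
    with open(path, "wb") as f:
        for payload in events:
            blob = zlib.compress(payload)
            index.append((f.tell(), len(blob)))
            f.write(blob)
    return index

def read_event(path, index, n):
    """Read and decompress a single event without touching its neighbours."""
    offset, length = index[n]
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))

events = [f"event {i} raw data".encode() * 100 for i in range(10)]
idx = write_events("bytestream.dat", events)
assert read_event("bytestream.dat", idx, 7) == events[7]
```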

Handicaps of current I/O infrastructure on multicore platforms for ROOT data
- Reading data: a process (during initialization, event execution, ...) reads part of the input file (e.g., to retrieve one event or collection).
  - All workers use the same input file. Multiple accesses may mean poor read performance, especially if events are not read consecutively. For POOL / ROOT, a worker may retrieve a collection after the same collection for a later event has already been processed by a different worker. These 'back reads' hurt I/O performance.
- Decompressing / streaming ROOT baskets: each worker retrieves its own event data, which means reading many ROOT baskets, decompressing them, and streaming them into persistent objects.
  - ROOT baskets contain object members from several events, so multiple workers use the same baskets and each decompresses them independently. This:
    - wastes CPU time (multiple decompressions of the same data);
    - wastes memory (multiple copies of the same basket, not shared).
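The waste is easy to see with a little arithmetic. In the toy calculation below (assumed numbers: 10 events per basket, 4 workers, round-robin event assignment), every worker ends up decompressing every basket:

```python
# Illustrative arithmetic with assumed numbers: when baskets span several
# events and events are dealt round-robin to workers, every worker ends up
# decompressing every basket.
EVENTS_PER_BASKET = 10   # multiple events share one compressed basket
N_WORKERS = 4
N_EVENTS = 100

baskets_per_worker = {w: set() for w in range(N_WORKERS)}
for evt in range(N_EVENTS):
    worker = evt % N_WORKERS            # round-robin event assignment
    basket = evt // EVENTS_PER_BASKET   # basket holding this event
    baskets_per_worker[worker].add(basket)

total_baskets = N_EVENTS // EVENTS_PER_BASKET
for w, baskets in baskets_per_worker.items():
    print(f"worker {w} decompresses {len(baskets)} of {total_baskets} baskets")
# every worker touches all 10 baskets -> each basket is decompressed 4 times
```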

Multicore reading of ROOT data
[Diagram: for events N and N+1, compressed baskets (b) are read from the input file and decompressed into baskets (B) on the ROOT side; on the ATLAS side they are streamed into the persistent state (P) and converted (t/p conversion) into the transient state (T); additional objects are streamed alongside.]

Handicaps of current I/O infrastructure on multicore platforms for ROOT data
- Writing data: each process writes its own output file. The outputs need to be merged serially.
- Compressing / streaming to ROOT baskets: writers compress data separately.
  - Suboptimal compression factor (which costs storage, and CPU time at subsequent reads).
  - Wastes memory (each worker needs its own set of output buffers).
- Merging: the first implementation of AthenaMP used a separate full Athena job to merge the workers' output data.
  - This job would read (decompress, stream, p->t) all data from all files, do nothing, and re-write (t->p, stream, compress) the data into a merged file. Compared to event processing this takes only about 5% of the CPU time, but because it had to be done serially, it immediately became the performance bottleneck (see the worked example below).
  - A fast merge was developed that appends compressed baskets and maintains a navigational redirection layer (to keep externalized references valid). In-file metadata is still propagated using the full Athena framework. Combined event and metadata merging is 5 to 10 times faster.
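The bottleneck follows directly from Amdahl's law. A worked example with the ~5% serial fraction quoted above (the worker counts are assumptions for illustration):

```python
# Worked example (assumed numbers): why a serial merge step that costs only
# ~5% of the total CPU time still caps multicore scaling (Amdahl's law).
def speedup(n_workers, serial_fraction=0.05):
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)

for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} workers -> speedup {speedup(n):5.2f}x")
# 32 workers give only ~12.6x; the limit as n grows is 1/0.05 = 20x.
```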

Fast merge utility
- The ATLAS fast merge utility appends event data without decompressing the baskets and uses the Athena framework to summarize metadata. A toy sketch of the redirection idea follows below.
[Diagram: File 1 and File 2, each containing 5-event clusters plus a final cluster of fewer than 5 events, are fast-merged into a single file; POOL maintains a redirection layer so that existing references stay valid, while Athena summarizes the metadata.]
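The redirection idea can be sketched in a few lines. This toy version (not the actual POOL/ROOT implementation) appends compressed blocks verbatim and keeps a per-file offset table so that references of the form (source file, local entry) remain resolvable:

```python
# Toy sketch of the navigational redirection layer: compressed blocks are
# appended without a decompress/recompress cycle, and an offset table keeps
# externalized references (source file, local entry) valid.
def fast_merge(files):
    merged = []        # compressed blocks, copied verbatim
    redirection = {}   # source file -> entry offset in the merged file
    for name, blocks in files:
        redirection[name] = len(merged)
        merged.extend(blocks)
    return merged, redirection

def resolve(redirection, ref):
    """Map a reference (source file, local entry number) to a merged entry."""
    source, local_entry = ref
    return redirection[source] + local_entry

files = [("file1.root", ["b10", "b11"]), ("file2.root", ["b20"])]
merged, redir = fast_merge(files)
assert merged[resolve(redir, ("file2.root", 0))] == "b20"
```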

Scatter / gather architecture for multicore I/O
[Diagram: an input daemon (reader) decompresses and streams data from the input file and provides persistent objects to the workers; the mother process performs init, metadata retrieval, and fork; bootstrapped workers process event data; an output daemon (writer) receives persistent objects, streams and compresses them, and writes the output file; finalize.]
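A minimal sketch of this scatter/gather layout, assuming a queue-based realization with one reader daemon, several workers, and one writer daemon (all names are illustrative, not AthenaMP code):

```python
# Minimal sketch of the scatter/gather design: the reader scatters
# (persistent) event objects to workers, and the writer gathers their
# results into one output file.
import multiprocessing as mp

N_WORKERS = 4
STOP = "STOP"

def reader(event_q, n_events):
    for evt in range(n_events):
        event_q.put(f"pers-object-{evt}")    # decompress+stream would happen here
    for _ in range(N_WORKERS):
        event_q.put(STOP)

def worker(event_q, result_q):
    while (obj := event_q.get()) != STOP:
        result_q.put(obj.upper())            # stand-in for event processing
    result_q.put(STOP)

def writer(result_q, path):
    done = 0
    with open(path, "w") as f:
        while done < N_WORKERS:
            item = result_q.get()
            if item == STOP:
                done += 1
            else:
                f.write(item + "\n")         # stream+compress would happen here

if __name__ == "__main__":
    event_q, result_q = mp.Queue(), mp.Queue()
    procs = ([mp.Process(target=reader, args=(event_q, 20))] +
             [mp.Process(target=worker, args=(event_q, result_q))
              for _ in range(N_WORKERS)] +
             [mp.Process(target=writer, args=(result_q, "out.txt"))])
    for p in procs: p.start()
    for p in procs: p.join()
```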

Shared reader strategies: ByteStream data
- For raw data, providing the next event is rather straightforward, as all event data are stored contiguously in a single block that is read in its entirety. A sketch of the scheduling idea follows below.
  - At the same time, the benefit of a single reader for ByteStream is small, as raw events can be decompressed individually.
[Diagram: after init, metadata retrieval, and fork, the mother process keeps an event-number queue and calls seek() to schedule next(), so that events 1, 2, 3, 4, ... are processed by the bootstrapped workers; the input daemon (reader) decompresses events, provides event data via shared memory, announces file transitions (for metadata), and handles provenance; outputs are collected and merged.]
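The event-number queue can be sketched as follows, assuming an event-wise compressed file with an offset index as in the ByteStream example above; the scheduling is simplified to a plain shared queue:

```python
# Illustrative sketch of the event-number queue: the mother process schedules
# event numbers, and each worker seeks to its assigned event in the
# event-wise compressed file. Names and layout are hypothetical.
import multiprocessing as mp
import zlib

def make_file(path, n_events):
    """Write an event-wise compressed file and return its offset index."""
    index = []
    with open(path, "wb") as f:
        for n in range(n_events):
            blob = zlib.compress(f"event {n}".encode())
            index.append((f.tell(), len(blob)))
            f.write(blob)
    return index

def worker(event_q, path, index):
    while (n := event_q.get()) is not None:    # None = no more events
        offset, length = index[n]
        with open(path, "rb") as f:
            f.seek(offset)                     # seek() to the scheduled event
            payload = zlib.decompress(f.read(length))
        print(f"processed event {n}: {payload!r}")

if __name__ == "__main__":
    index = make_file("bs.dat", 8)
    q = mp.Queue()
    for n in range(8):
        q.put(n)     # event-number queue filled by the mother process
    procs = [mp.Process(target=worker, args=(q, "bs.dat", index))
             for _ in range(2)]
    for _ in procs: q.put(None)
    for p in procs: p.start()
    for p in procs: p.join()
```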

Challenges for the shared ByteStream reader
- Metadata propagation
  - As for multicore processing in general, metadata poses one of the key challenges for the shared ByteStream reader.
  - In this architecture, the reader is the only component that detects file transitions. It needs to inform the workers so they can fire the incidents that control metadata propagation: each worker fires the EndFile incident after processing its last event from a file and the BeginFile incident before processing the next event (sketched below).
  - Input metadata needs to be provided to all worker processes. For raw data, ATLAS uses in-file metadata, which is very limited in content and structure.
- Provenance
  - When reading raw data, the ByteStream reader creates an artificial provenance record that needs to be communicated to the event workers.
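A hedged sketch of this file-transition protocol: the reader tags each event with its source file, and a worker fires EndFile/BeginFile incidents around a transition. The incident names follow the text above; the code itself is illustrative and is not the Athena incident service.

```python
# Sketch only: firing EndFile after the last event of a file and BeginFile
# before the next file's event, driven by per-event source-file tags.
def fire(incident):
    print(f"firing incident: {incident}")

def worker_loop(tagged_events):
    """tagged_events: iterable of (source_file, event) pairs from the reader."""
    current_file = None
    for source_file, event in tagged_events:
        if source_file != current_file:
            if current_file is not None:
                fire(f"EndFile({current_file})")   # after last event of a file
            fire(f"BeginFile({source_file})")      # before next file's event
            current_file = source_file
        print(f"  processing {event} from {source_file}")
    if current_file is not None:
        fire(f"EndFile({current_file})")

worker_loop([("run1.data", "evt0"), ("run1.data", "evt3"),
             ("run2.data", "evt1")])
```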

Shared reader strategies: POOL / ROOT data
- For simulated, reconstructed, or derived data, there may be performance penalties to event scatter strategies (such as the one shown for ByteStream).
  - POOL / ROOT data is accessed collection by collection, on demand, and not all collections are needed for each job.
  - The notion of a persistent event may not be natural in a flat event store.
- However, scattering individual collections of data objects for each event becomes more complicated, requiring marshalling of data from many I/O buffers and increased communication between the reader process and the workers.
  - ATLAS stores several hundred collections per event. The existing Athena I/O and StoreGate object retrieval mechanism can be used to send requests to the reader. Advanced strategies, such as learning which containers are needed, could be used to improve performance.
  - Persistent objects could be passed through shared memory, as sketched below.
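Passing persistent objects through shared memory might look roughly like this, assuming pickled payloads in named shared-memory segments (a stand-in for the actual ATLAS object streaming; all names are hypothetical):

```python
# Minimal sketch: the reader materializes a requested collection once into a
# named shared-memory block, and workers attach to it by name instead of
# re-reading and re-decompressing the file themselves.
import pickle
from multiprocessing import shared_memory

def reader_publish(name, collection):
    """Reader side: serialize a collection into a named shared-memory block."""
    payload = pickle.dumps(collection)
    shm = shared_memory.SharedMemory(name=name, create=True, size=len(payload))
    shm.buf[:len(payload)] = payload
    return shm, len(payload)

def worker_fetch(name, size):
    """Worker side: attach by name and deserialize; no extra file read."""
    shm = shared_memory.SharedMemory(name=name)
    collection = pickle.loads(bytes(shm.buf[:size]))
    shm.close()
    return collection

shm, size = reader_publish("evt42_tracks",
                           {"pt": [41.2, 17.8], "eta": [0.3, -1.1]})
print(worker_fetch("evt42_tracks", size))
shm.close(); shm.unlink()    # the reader owns the segment's lifetime
```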

Alternative reader strategies for POOL / ROOT data
- Since 2011, ATLAS has used the ROOT auto-flush feature for its main event data TTree to write clusters of 5 or 10 events (depending on the data product).
  - To avoid baskets becoming too small for efficient I/O, splitting was switched off, so all the data of a collection is streamed member-wise into a single TBranch.
- Compressed baskets therefore contain data from only 5 or 10 consecutive events.
- If the entire cluster of events is processed by the same worker, most of the performance disadvantages described above (redundant reads and decompression) disappear, as the sketch below illustrates.
  - Some inefficiency remains due to reading of auxiliary TTrees that are not in sync with event numbers (less than 5% of the data).
- Expected to speed up reading by 20 to 50% in CPU time.
  - Better utilization of memory: no duplicated buffers on different workers.
- Not that simple: fast-merged files destroy the event cluster organization.
  - The last cluster of each input file can have fewer than 5 or 10 events.
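Scheduling whole clusters to a single worker changes the basket-sharing arithmetic shown earlier. In the sketch below (same assumed numbers as before, but dealing out 5-event clusters instead of single events), each basket is decompressed by exactly one worker:

```python
# Illustrative arithmetic (assumed numbers): scheduling whole auto-flush
# clusters to one worker so each compressed basket is decompressed once.
CLUSTER_SIZE = 5      # events per auto-flush cluster / basket
N_WORKERS = 4
N_EVENTS = 100

baskets_per_worker = {w: set() for w in range(N_WORKERS)}
for evt in range(N_EVENTS):
    cluster = evt // CLUSTER_SIZE
    worker = cluster % N_WORKERS      # deal out clusters, not single events
    baskets_per_worker[worker].add(cluster)

for w, baskets in sorted(baskets_per_worker.items()):
    print(f"worker {w}: {len(baskets)} baskets, each read by this worker only")
# contrast with the round-robin-by-event case, where every worker
# decompressed every basket
```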

Writing
- Output strategies pose similarly subtle problems, with complications that may lead to significant serialization and the possibility of serial bottlenecks, either during writing or in post-processing, e.g., during data stream merging.

TMemFile
- ROOT is also preparing for multicore I/O.
  - E.g., TMemFile, ROOT's way of combining/merging several TTrees into the same output TTree.
- With the migration away from POOL and potentially closer to ROOT, ATLAS may decide to leverage these features directly.
- However, there are still open questions:
  - The biggest obstacle for ATLAS: inserting a TTree entry into a TMemFile does not return a valid entry number, so ATLAS cannot easily create an external token. Tokens are the basis for the navigational infrastructure.
  - It is not clear whether type-specific function calls would be sufficient to merge metadata objects.
  - And what about multiple-tree synchronization?

Summary and Outlook
- A multi-process control framework (like AthenaMP) to enable HEP event processing on multicore computing architectures is an important step, but it is only one of many steps that need to be accomplished.
- Optimizing event data storage so that it can be efficiently retrieved at the granularity needed by the multiple worker processes is key to avoiding performance penalties during reading.
  - The 2011 change to member-wise streaming with a small number of entries per basket will help ATLAS tackle inefficiencies in multi-process reading of ROOT data.
  - The data layout must also ensure that data produced by multiple processes can be efficiently combined (as merging is typically done serially) and that the resulting output is as efficient to read and store.
- ATLAS is in the process of developing an I/O architecture and components that can efficiently support even higher numbers of parallel worker processes.
- First prototypes are being tested, but much work remains.