Philippe CANAL root.cern.ch ROOT, I/O and Concurrency Philippe Canal Fermilab.


Philippe CANAL root.cern.ch ROOT, I/O and Concurrency Philippe Canal Fermilab

Overview
ROOT
I/O
Concurrency

ROOT Guiding Principles
User oriented
  – Support and feedback are essential
  – HEP is the main but not the sole user/target
C++: interpreter and reflection
High performance (many dimensions)
Release early and often
Open-ended
  – Include interfaces to other languages
  – Help promote associated projects (RooFit, etc.)

I/O Long Term Goals
Performance
  – Keep up with / pass the competition
      With the StreamerInfo layer and JIT, large opportunities for optimization
Features
  – Maintain and enhance schema evolution support
  – Adapt to the new hardware landscape
      Today: multitasking, vectorization; tomorrow: transactional memory
Interoperability
  – Open other ecosystems to using ROOT [I/O]
  – Open ROOT users to using the other ecosystem(s)

I/O Priorities
Multi-processing / multi-threading
Performance improvements
  – Amdahl, file format, streaming, vectorization
Interface simplification and clarification
  – Leverage C++11 for ease of use/documentation
Interoperability
  – HDF5, R, Python, Blaze, numpy, etc.
Additional statistics and feedback on I/O performance
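The "Amdahl" item above presumably refers to Amdahl's law: if a fraction p of the run time parallelizes perfectly over n workers and the rest stays serial, the speedup is bounded by 1 / ((1 - p) + p / n). A minimal sketch follows; the function name is illustrative and not part of ROOT.

```cpp
#include <cassert>

// Amdahl's law: upper bound on the speedup when a fraction
// `parallel_fraction` of the work scales perfectly over `workers`
// and the remainder stays serial. Illustrative helper, not a ROOT API.
double amdahl_speedup(double parallel_fraction, unsigned workers) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(workers >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

For instance, if 90% of a job parallelizes, 8 workers give at most about 4.7x, and no number of workers gives more than 10x, which is why the serial parts of the I/O path matter.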

root.cern.ch ROOT Planning Day 08 July 2014
Here comes cling
Cling introduces binary-compatible just-in-time compilation of scripts and code snippets.
Will allow:
  – I/O for ‘interpreted’ classes
  – Runtime generation of CollectionProxy
      Dictionary no longer needed for collections! [Summer Student]
  – Run-time compilation of I/O customization rules, including those carried in the ROOT file
  – Derivation of an ‘interpreted’ class from a compiled class
      In particular TObject
  – Faster, smarter TTreeFormula
  – Potential performance enhancement of I/O
      Optimize hotspots by generating/compiling new code on demand
  – Interface simplification thanks to full C++ support
      New, simpler TTree interface (TTreeReader) [Summer Contributor]

Challenges
Two distinct user bases
  – Individual users
      Want everything automatic / just works
  – Framework developers
      Want to control everything (want no surprises)
Two distinct modes of operation
  – Thread based
  – Task based
Must support all 4 combinations
Concurrent access must not cost (too much) for non-threaded use.

User / Thread Example
Simple merge histogram interface.
  – User adds ‘only’ the lines marked (*)

    // Main thread
    TDirectory *merger = new THistoMerger(nthreads); // (*)

    // Each thread init
    merger->cd(); // (*)
    TH1F *h = new TH1F("h", …);

    // Each thread event loop
    h->Fill(value);

    // Tear down or end
    merger->Merge(); // (*)
    outputdir->cd();
    merger->Write();
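The pattern on this slide (one private histogram per thread, merged once at the end) can be sketched in standard C++ without ROOT. Everything below is a stand-in under stated assumptions: `Histo` replaces TH1F, `HistoMerger` imitates the proposed THistoMerger, and `Book()` plays the role of `merger->cd()` followed by `new TH1F`. None of these names is a ROOT API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Minimal fixed-binning histogram; a stand-in for TH1F, not the ROOT class.
struct Histo {
    double lo, hi;
    std::vector<long> bins;
    Histo(std::size_t nbins, double lo_, double hi_)
        : lo(lo_), hi(hi_), bins(nbins, 0) { assert(nbins > 0 && hi_ > lo_); }
    void Fill(double x) {
        if (x < lo || x >= hi) return;  // under/overflow ignored in this sketch
        std::size_t i = static_cast<std::size_t>((x - lo) / (hi - lo) * bins.size());
        if (i >= bins.size()) i = bins.size() - 1;  // guard against rounding
        ++bins[i];
    }
};

// Hypothetical stand-in for the proposed THistoMerger: each thread books a
// private histogram, so Fill takes no lock; only Book and Merge lock.
class HistoMerger {
    std::mutex mtx_;
    std::vector<std::unique_ptr<Histo>> parts_;
    std::size_t nbins_;
    double lo_, hi_;
public:
    HistoMerger(std::size_t nbins, double lo, double hi)
        : nbins_(nbins), lo_(lo), hi_(hi) {}
    Histo* Book() {
        std::lock_guard<std::mutex> guard(mtx_);
        parts_.push_back(std::unique_ptr<Histo>(new Histo(nbins_, lo_, hi_)));
        return parts_.back().get();
    }
    // Call only after all filling threads have finished.
    Histo Merge() {
        std::lock_guard<std::mutex> guard(mtx_);
        Histo sum(nbins_, lo_, hi_);
        for (const auto& part : parts_)
            for (std::size_t i = 0; i < nbins_; ++i) sum.bins[i] += part->bins[i];
        return sum;
    }
};

// Runs `nthreads` workers; each fills `fills_per_thread` entries at x = 0.5.
Histo fill_in_parallel(unsigned nthreads, int fills_per_thread) {
    HistoMerger merger(10, 0.0, 1.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&merger, fills_per_thread] {
            Histo* h = merger.Book();                 // each thread init
            for (int i = 0; i < fills_per_thread; ++i)
                h->Fill(0.5);                         // each thread event loop
        });
    for (auto& w : workers) w.join();
    return merger.Merge();                            // tear down or end
}
```

The cost of this design is one full set of bins per thread, which is the concern the later slide about 100,000 histograms raises.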

User / Task Example
Simple merge histogram interface.
  – User adds ‘only’ the lines marked (*)

    // Main initialization
    TDirectory *merger = new THistoMerger(nthreads); // (*)

    // Each task init
    TH1F *h = new TH1F("h", …);
    h->SetDirectory(merger); // (*) alternatively: merger->Append(h);

    // Each task iteration
    h->Fill(value);

    // Final task
    merger->Merge(); // (*)
    outputdir->cd();
    merger->Write();

Framework example
The framework wants to own the histograms.
But what about the case where there are 100,000 histograms?
  – Especially if filled rapidly (so a lock-less Fill is needed)

    // Each thread/task init
    TH1F *h = new TH1F("h", …);
    fwk_ownlist->push_back(h);

    // Each thread/task event loop
    h->Fill(value);

    // Tear down or end
    // foreach h in ownlist(s) (or equivalent)
    TList *lst = complex_code_to_gather_histo(fwk_threadlist);
    merged = Merge(lst);
    outputdir->WriteObject(merged);

100,000 histograms on several threads
Questions:
  – What is the use case really like?
      What is the required performance (‘where’ can we put a lock)?
  – When is the data merge done?
      On every Fill, or every so many Fill calls?
      By one of the threads, or by a merger thread?
  – Why is it better than one histogram per thread?
  – Are TH* really heavyweight?
      What is the real overhead?
      When/how is allocate-the-bins-only-when-needed used?
  – Related concerns
      Should variable-size and fixed-size bin histograms be more clearly separated?
        Improve performance, reduce overhead (of the fixed-size case)
        Is it making the interface harder to use/explain?

Another interface idea

    // Main initialization
    TH1F *mainh = new TH1F("h", …);

    // Each task init
    HistoTaskHandle h(mainh);
    h->SetBufferSize(…);

    // Each task iteration
    h->Fill(value);

    // Final task
    // foreach handle h:
    mainh->Add(h);
    outputdir->Write(mainh);
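One possible reading of the handle idea is that each task buffers raw values privately and touches the shared histogram only once per full buffer, which bounds how often a lock is taken. The sketch below is an assumption about how such a handle could behave, not the design on the slide: `SharedHisto`, `AddBatch`, `Flush` and `Pending` are hypothetical names, and the handle is used with `.` rather than `->`.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Shared histogram guarded by a mutex; a stand-in for the main TH1F.
struct SharedHisto {
    double lo, hi;
    std::vector<long> bins;
    std::mutex mtx;
    SharedHisto(std::size_t nbins, double lo_, double hi_)
        : lo(lo_), hi(hi_), bins(nbins, 0) { assert(nbins > 0 && hi_ > lo_); }
    // Bin a whole batch of values under a single lock acquisition.
    void AddBatch(const std::vector<double>& values) {
        std::lock_guard<std::mutex> guard(mtx);
        for (double x : values) {
            if (x < lo || x >= hi) continue;  // under/overflow ignored here
            std::size_t i = static_cast<std::size_t>((x - lo) / (hi - lo) * bins.size());
            if (i >= bins.size()) i = bins.size() - 1;  // guard against rounding
            ++bins[i];
        }
    }
};

// Hypothetical per-task handle: Fill only appends to a private buffer, and
// the shared histogram is updated once per `buffer_size_` fills.
class HistoTaskHandle {
    SharedHisto& main_;
    std::vector<double> buffer_;
    std::size_t buffer_size_ = 1024;
public:
    explicit HistoTaskHandle(SharedHisto& main) : main_(main) {}
    ~HistoTaskHandle() { Flush(); }  // final merge when the task ends
    void SetBufferSize(std::size_t n) {
        assert(n > 0);
        buffer_size_ = n;
        buffer_.reserve(n);
    }
    void Fill(double x) {
        buffer_.push_back(x);
        if (buffer_.size() >= buffer_size_) Flush();
    }
    void Flush() {
        if (buffer_.empty()) return;
        main_.AddBatch(buffer_);
        buffer_.clear();
    }
    std::size_t Pending() const { return buffer_.size(); }
};
```

The buffer size trades memory per task against lock traffic: a larger buffer means fewer lock acquisitions but more values held outside the main histogram.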

Another interface idea: spot the difference

    // Main initialization
    TH1F *mainh = new TH1F("h", …);

    // Each task init
    TH1 *h = static_cast<TH1*>(mainh->Clone()); // Clone() returns TObject*
    h->SetBufferSize(…);

    // Each task iteration
    h->Fill(value);

    // Final task
    // foreach task histogram h:
    lst->Add(h);
    mainh->Merge(lst);
    outputdir->Write(mainh);

Framework Requirements
If ROOT uses threads, they should not be computationally intensive
  – Example: prefetching thread
If ROOT wants to run CPU-intensive tasks, they must (be able to be forced to) request CPU time from the framework
  – For example, parallel unzipping would need to be a task pushed on the TBB stack
  – Implies that when adding an improvement that uses parallel execution, we ought to follow a task model

Task vs. Thread
Task model simpler to use
  – Delegate load balancing to the ‘framework’
  – Similar to PROOF
      However, tasks have a more ‘flexible’ control flow
      PROOF is more extensive (covers more hardware configurations)
  – Should we promote the task model (tutorials, etc.)?
Requirements
  – Need a (virtual) interface
      To allow replacing TBB with Apple GCD
  – Need an always-available default implementation
      Which is easily replaceable by the user-controlled one
  – Need to be pluggable/controllable by the user
Other utilities we could offer
  – Similar to boost::thread_specific_ptr
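The three requirements (a virtual interface, an always-available default, user replaceability) could look roughly like the sketch below. All names here (`TaskScheduler`, `SerialScheduler`, `SchedulerRegistry`, `sum_of_squares`) are hypothetical; a TBB- or GCD-backed scheduler would be another class derived from the same interface and is not shown.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical abstract interface that library code programs against.
class TaskScheduler {
public:
    virtual ~TaskScheduler() = default;
    virtual void Submit(std::function<void()> task) = 0;
    virtual void Wait() = 0;  // returns once every submitted task has run
};

// Always-available default: runs each task inline in the calling thread.
class SerialScheduler : public TaskScheduler {
public:
    void Submit(std::function<void()> task) override { task(); }
    void Wait() override {}
};

// Process-wide slot holding the active scheduler. Set() is not synchronized
// in this sketch; it is meant to be called before any work is submitted.
class SchedulerRegistry {
    static std::unique_ptr<TaskScheduler>& Slot() {
        static std::unique_ptr<TaskScheduler> slot(new SerialScheduler);
        return slot;
    }
public:
    static TaskScheduler& Get() { return *Slot(); }
    static void Set(std::unique_ptr<TaskScheduler> scheduler) {
        assert(scheduler != nullptr);
        Slot() = std::move(scheduler);
    }
};

// Example of library code written against the interface only. The atomic
// keeps the result correct even if a plugged-in scheduler runs tasks
// concurrently.
int sum_of_squares(const std::vector<int>& values) {
    std::atomic<int> total{0};
    TaskScheduler& scheduler = SchedulerRegistry::Get();
    for (int x : values)
        scheduler.Submit([&total, x] { total.fetch_add(x * x); });
    scheduler.Wait();
    return total.load();
}
```

A framework would call `SchedulerRegistry::Set` once at start-up with its own implementation, so that CPU-intensive work requested by the library is scheduled under the framework's control.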

PROOF vs. Task Models
What are the differences?
  – i.e. can PROOF (lite-with-thread) be (extended to be) our interface?
If not, how do we provide a smooth experience
  – From a single stream of operations
  – To many streams of operations
  – To many machines with many streams of operations
  – Without changing the code?

Vectorization
Many alternative vectorization techniques
  – Vc, VDT, CUDA, by hand, etc.
GeantV uses template techniques and traits to steer the choice.
Should we adopt the same techniques
  – and share/distribute the common parts?

Backup slides

Priorities
Multi-processing / multi-threading
Performance improvements
  – Amdahl, file format, streaming, vectorization
Interface simplification and clarification
  – Leverage C++11 for ease of use/documentation
Interoperability
  – HDF5, R, Python, Blaze, numpy, etc.
Additional statistics and feedback on I/O performance

Multi-Processing
Import Chris’ changes to v5.34 and port them to v6.02
Extend the ability to disable auto-add
  – Limited to TH* so far
  – Remove use of I/O in TH*::Clone
Resolve parallelism limitations
  – As shown in the CMS condition database example

Multi-Processing
Histograms and multi-threading
  – Need to start prototyping and testing ASAP
  – New interface to incrementally merge histograms from multiple threads
Read/write TTree branches in multiple user threads
  – Need to start prototyping/testing ASAP
  – Do we need a new/simpler interface?
  – Need to design the limits and semantics
  – Extra complexity/cost to conserve basket clustering
  – Requires TFile synchronization

Thread Safety
Cling enables support for robust multi-threaded I/O
  – Cling has a clear separation of database engine and execution engine, allowing them to be locked independently
Chris’ changes allow multi-threaded I/O as long as
  – Each TFile and TTree object is accessed by only one thread (or the user code explicitly locks access to them)
  – The interpreter is *not* the top-level entry point
  – Cling will allow removing the second limitation
More has to be done to optimize
  – Some object layouts lead to poor performance and poor scalability
  – Reduce the number of ‘class/version/checksum’ searches
      To reduce the number of atomic and thread-local uses

Parallel Merge Challenges
Need an official parallelMergeServer daemon/thread
  – Could use ZeroMQ as the underlying transport
Need to deal efficiently with many histograms
  – Each of them still needs to be merged at the end
Lack of ordering of the output of the workers
  – No enforcing of luminosity block boundaries, for example
  – Support for ordering increases worker/server coupling
  – Space reservation is challenging (variable entry)
Need a new concept (an Entry Block)
  – ‘Set of entries that are semantically related’
  – To be used to gather those entries together ‘automatically’
  – Needs a flexible/customizable marker
  – Is it really worth the extra complexity?
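To make the Entry Block idea concrete, here is a minimal sketch of a merger-side buffer that gathers entries by a marker and releases them one whole block at a time. It assumes the marker is a plain integer (for example a luminosity block number) and that something tells the merger when a block is complete; `Entry`, `EntryBlockBuffer` and their methods are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

struct Entry {
    std::uint32_t block;  // marker, e.g. a luminosity block number (assumed)
    double payload;
};

// Hypothetical merger-side buffer: entries arrive in arbitrary order from
// the workers and are written out contiguously, one whole block at a time.
class EntryBlockBuffer {
    std::map<std::uint32_t, std::vector<Entry>> pending_;
    std::set<std::uint32_t> closed_;
    std::vector<Entry> output_;  // stands in for the merged output file
public:
    void Receive(const Entry& e) {
        // Entries must not arrive for a block that was already written out.
        assert(closed_.count(e.block) == 0);
        pending_[e.block].push_back(e);
    }
    // Called once every worker has reported that `block` is complete.
    void CloseBlock(std::uint32_t block) {
        closed_.insert(block);
        auto it = pending_.find(block);
        if (it == pending_.end()) return;
        output_.insert(output_.end(), it->second.begin(), it->second.end());
        pending_.erase(it);
    }
    const std::vector<Entry>& Output() const { return output_; }
    std::size_t PendingBlocks() const { return pending_.size(); }
};
```

The need for a "block is complete" signal is exactly the extra worker/server coupling the slide warns about, and the per-block buffering is where the memory cost of ordering shows up.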

Parallel Merge
A fully tested and performing version requires
  – Parallel merge thread
  – Parallel merge daemon (authorization, auto-start, error handling)
  – Parallel merge for histograms (proper set of benchmarks, performance improvement, etc.)
Benchmarks
  – Still to be designed
  – Based on existing examples (some multi-threaded) and a new example based on the Event test
  – Based on experiment use cases

Other Possible Parallel Processing
Read/write branches using internal threads/tasks
  – Need to partially back out memory optimization
  – Requires TFile synchronization
Offload work (compression) to a separate thread
  – Needs to work well with a task-based scheduler
Thread-safe version of TFile
  – Not quite sure of the semantics
  – Needs to be cost-neutral for traditional uses
Support for ‘multiple’ interpreter states
  – Decide on need / interface / use limitations
  – Shared libraries (their PCMs) shared between interpreters?

Vectorization
In TTree
  – E.g. TTree::Draw executes the formula on more than one element at a time
  – New interface allowing retrieval of multiple entries at once
In streaming
  – Changing endianness would also allow merging and vectorization of even more streaming actions
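The endianness remark can be illustrated with the work a little-endian host has to do when the on-file byte order is big-endian: every value needs a byte swap before it can be used. The sketch below decodes a contiguous run of 32-bit values in one tight loop, which is the kind of loop compilers can often vectorize; whether they do depends on the compiler and flags, and nothing here was measured. `byteswap32`, `host_is_little_endian` and `decode_be32` are illustrative names, not ROOT functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Reverse the byte order of one 32-bit value.
inline std::uint32_t byteswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Runtime check of the host byte order.
inline bool host_is_little_endian() {
    const std::uint16_t one = 1;
    unsigned char first_byte;
    std::memcpy(&first_byte, &one, 1);
    return first_byte == 1;
}

// Decode `n` big-endian 32-bit integers from `src` into host-order values.
// The copy is one bulk memcpy; the swap is a branch-free loop over a
// contiguous array. On a big-endian host the swap loop is skipped entirely.
std::vector<std::uint32_t> decode_be32(const unsigned char* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    std::vector<std::uint32_t> out(n);
    if (n == 0) return out;
    std::memcpy(out.data(), src, n * sizeof(std::uint32_t));
    if (host_is_little_endian())
        for (std::size_t i = 0; i < n; ++i) out[i] = byteswap32(out[i]);
    return out;
}
```

If the on-file order matched the host order, the swap loop would disappear and the decode would reduce to the single memcpy, which is one way to read the slide's point about merging streaming actions.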

Interoperability
HDF5, R, Python, Blaze, numpy, etc.
  – These ecosystems have their strengths and weaknesses, as well as some similarities and significant differences with ROOT
  – What can we learn from them?
  – How can ROOT [I/O] be leveraged to enhance them?
  – How could our workflows benefit from using, directly or indirectly, any part of these ecosystems?
  – Who can help?

Why one thread/schedule per TTree
When reading, a TTree holds:
  – Static state: list of branches, their types, their data location on file
  – Dynamic state: current entry number, TTreeCache buffer (per TTree), user object pointer (one per (top-level) branch), decompressed basket (one per branch)
  – Separating both would decrease efficiency
Advantages
  – Works now!
  – No need for locks or synchronization
  – Decoupling of the access patterns
Disadvantages
  – Duplication of some data and some buffers; however, this is usually small compared to the dynamic state
  – Duplication of work if accesses overlap
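The trade-off on this slide can be sketched with a toy reader that, like a TTree, keeps static and dynamic state in the same object. Each thread opens its own instance, so nothing is shared and nothing is locked, at the price of duplicating the static part. `Tree` and `open_tree` are stand-ins with made-up payload values, not the ROOT classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a TTree opened for reading; holds both kinds of state.
class Tree {
    // Static state: identical in every instance, hence duplicated per thread.
    std::vector<std::string> branch_names_;
    std::size_t entries_;
    // Dynamic state: private to the owning thread.
    std::size_t current_ = 0;
    std::vector<double> basket_;  // one slot per branch
public:
    Tree(std::vector<std::string> branches, std::size_t entries)
        : branch_names_(std::move(branches)), entries_(entries),
          basket_(branch_names_.size(), 0.0) {
        assert(!branch_names_.empty());
    }
    // Load entry `i` into this instance's private buffers.
    bool GetEntry(std::size_t i) {
        if (i >= entries_) return false;
        current_ = i;
        for (std::size_t b = 0; b < basket_.size(); ++b)
            basket_[b] = static_cast<double>(i * 10 + b);  // fake payload
        return true;
    }
    std::size_t Current() const { return current_; }
    double Value(std::size_t branch) const { return basket_.at(branch); }
};

// Hypothetical helper each thread calls to obtain its own private instance.
Tree open_tree() {
    return Tree(std::vector<std::string>{"px", "py"}, 100);
}
```

Two instances obtained from `open_tree()` can sit at different entries at the same time, which is the "decoupling of the access patterns" advantage; the duplicated branch list is the corresponding disadvantage.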