LHCb Core Software Programme of Work 11-12 January, 2012 Pere Mato (CERN)


• Gaudi Parallel
  ◦ What exists today
  ◦ What are the current limitations
• Future opportunities
  ◦ Re-engineering Gaudi for concurrency
• Conclusions

• Multi-core systems
  ◦ Task/WorkManager based: a user-written Python script distributes the work onto a pool of processes and collects the results
  ◦ Event based: a GaudiParallel job, completely transparent to the end user (gaudirun.py --ncpus=N)
• Cluster
  ◦ Task/WorkManager based: the user needs to provide the list of nodes to which he/she has access
  ◦ Event based: a GaudiParallel job (not in use), gaudirun.py --remote port1:port2 (never really tested)

• A user parallelizable task derives from Task
  ◦ initializeLocal() is executed in the parent process
  ◦ initializeRemote() is executed once in each remote process
  ◦ process() is executed for each work item in a remote process
  ◦ finalize() is executed at the end in the parent process

[Class diagram: Task declares initializeLocal(), initializeRemote(), process(item), finalize(); WorkManager provides __init__(ncpus, ppservers) and process(task, items); MyTask derives from Task and implements __init__(parameters) plus the four Task methods]

from ROOT import TH1F, TRandom, TCanvas, gROOT
from GaudiMP.Parallel import Task, WorkManager
from math import sqrt

class HistTask(Task):
    def __init__(self, nHist=4):
        self.nHist = nHist
        self.canvas = None
    def initializeLocal(self):
        self.output = [TH1F('h%d' % i, 'h%d' % i, 100, -3., 3.) for i in range(self.nHist)]
        self.random = TRandom()
    def initializeRemote(self):
        pass
    def process(self, n):
        for h in self.output:
            for i in range(n):
                x = self.random.Gaus(0., 1.)
                h.Fill(x)
    def finalize(self):
        self.canvas = TCanvas('c1', 'Gaudi.Parallel canvas', 200, 10, 700, 700)
        nside = int(sqrt(self.nHist))
        nside = nside * nside < self.nHist and nside + 1 or nside
        self.canvas.Divide(nside, nside, 0, 0)
        for i in range(self.nHist):
            self.canvas.cd(i + 1)
            self.output[i].Draw()

>>> from GaudiMP.Parallel import WorkManager
>>> from HistTask import HistTask
>>> task = HistTask(nHist=9)
>>> wmgr = WorkManager()
>>> wmgr.process(task, [100 for i in range(100)])   # work-item value lost in the transcript; 100 assumed
Job execution statistics:
 job count | % of all jobs | job time sum | time per job | job server
       100 |               |              |              | lxbuild114.cern.ch
Time elapsed since server creation

[Architecture diagram: on the LocalNode, myscript.py forks Worker processes (fork/pipe + pickle); a RemoteNode, reached through ssh (socket + pickle), runs ppserver.py, which forks its own Workers; node allocation and reservation is handled separately; all nodes share a network file system (AFS)]

• It works!
• AFAIK people are writing scripts making use of GaudiMP and they are satisfied
• There are nevertheless a number of oddities:
  ◦ Obscure interactions with 'configuration files': often it is too late to change the configuration
  ◦ Huge choice of what a 'task data item' is, e.g. a file, a collection of files, an event number, etc.
  ◦ Results must be copy-able and add-able; references to (Python) objects are tricky to handle (see the sketch below)
  ◦ …
• Inadequate for a program producing large output data
  ◦ E.g. event processing programs: simulation, reconstruction, etc. (→ Event-based parallelization)
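As an illustration of the 'copy-able and add-able' requirement, here is a minimal sketch of a result object that workers could return and the parent could merge with '+'. The class and its fields are hypothetical, not GaudiMP's actual merging code:

    class RunSummary(object):
        """Hypothetical task result: picklable and mergeable with '+'."""
        def __init__(self):
            self.events = 0
            self.histogram = [0] * 10   # simple stand-in for a real histogram

        def __add__(self, other):
            merged = RunSummary()
            merged.events = self.events + other.events
            merged.histogram = [a + b for a, b in zip(self.histogram, other.histogram)]
            return merged

    # The parent process can then combine per-worker results:
    # total = sum(worker_results, RunSummary())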


• Based on TES serialization
  ◦ ROOT streamers, TBufferFile, Pickle, etc.
• The complete TES content and structure is copied reader → worker and worker → writer (see the sketch below)
• Bandwidth: MB/s (not a problem for Brunel)
• Could be improved with the new ROOT Parallel Merger
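A minimal sketch of the copying mechanism described above, assuming plain pickle over a multiprocessing pipe; the real implementation uses ROOT streamers/TBufferFile and the GaudiMP machinery, and the dict standing in for the TES (and all names here) is purely illustrative:

    import multiprocessing as mp
    import pickle

    def worker(conn):
        # Receive the serialized TES content from the reader, process it,
        # and send the (possibly enlarged) content on towards the writer.
        tes = pickle.loads(conn.recv_bytes())
        tes['/Event/Rec/Summary'] = {'nTracks': len(tes.get('/Event/Raw', []))}
        conn.send_bytes(pickle.dumps(tes))
        conn.close()

    if __name__ == '__main__':
        parent_conn, child_conn = mp.Pipe()
        p = mp.Process(target=worker, args=(child_conn,))
        p.start()
        # "Reader" side: ship the whole event (TES content and structure) to the worker.
        event = {'/Event/Raw': [1, 2, 3]}
        parent_conn.send_bytes(pickle.dumps(event))
        # "Writer" side: get the processed event back and persist it.
        processed = pickle.loads(parent_conn.recv_bytes())
        p.join()
        print(processed)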

• AFAIK it is not in use
• Eoin Smith (Fellow) left 1 year ago
  ◦ His final presentation can be found in Indico
• In April 2011, I managed to run Brunel
  ◦ Minor changes had to be made in the code repository
  ◦ As far as I could see, Event Data/Histograms/File Records are produced at the end of the job
• Full content validation has not been done
  ◦ Histograms were validated (by Eoin) by comparing all histograms produced in both running modes
  ◦ No work was done for the validation of file records

• Exploitation of copy-on-write requires extra startup complexity
  ◦ Is it really required?
• So far, log files are unsorted (and not added)
• ATLAS claims that the processing Python module does not handle improper worker termination
  ◦ You may get into a mess if one of the workers crashes
• Scalability beyond workers has not been proven
  ◦ The main CPU overhead is in copying TES contents between processes
  ◦ Large and stable memory savings cannot be easily achieved
  ◦ Other resources like DB connections can also be a limitation
• All this only makes sense if the computing resources are migrated towards 'whole-node submission' mode


• We need to adapt current applications to the new many-core architectures (~100 cores)
  ◦ No change in the overall throughput is expected with respect to trivial one-job-per-core parallelism
• Reducing the required resources per core
  ◦ I/O bandwidth
  ◦ Memory
  ◦ Connections to DB, open files, etc.
• Reduce the latency of single jobs (e.g. trigger, user analysis)
  ◦ Run a given job in less time by making use of the available cores

(B. Hegner, P. Mato / CERN)

• Concrete algorithms can be parallelized with some effort
  ◦ Making use of threads, OpenMP, MPI, GPUs, etc.
  ◦ But it is difficult to integrate them in a complete application
  ◦ Performance-wise it only makes sense to parallelize the complete application and not only parts of it
• Developing and validating parallel code is very difficult
  ◦ 'Physicists' should be saved from this
  ◦ Concurrency will limit what can and cannot be done in the algorithmic code (policies)
• At the framework level you have the overview and control of the application

(B. Hegner, P. Mato / CERN)

• Ability to schedule modules/algorithms concurrently
  ◦ Full data dependency analysis would be required (no global data or hidden dependencies)
  ◦ Need to resolve the DAGs (Directed Acyclic Graphs) statically and dynamically (see the sketch below)
• Not much to gain with today's 'Tasks' as they are designed
  ◦ But algorithm decomposition would certainly be influenced by the concurrent capabilities of the new framework

[Figure: timeline showing Input, Processing and Output algorithms scheduled concurrently]

(B. Hegner, P. Mato / CERN)
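A minimal sketch of data-dependency driven scheduling, assuming each algorithm simply declares the data it reads and writes; the algorithm names, the dependency table and the use of a Python thread pool are illustrative, not a Gaudi API:

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    # Each 'algorithm' declares the data items it reads and writes (its DAG edges).
    ALGS = {
        'Decode':   {'in': set(),                       'out': {'RawBanks'}},
        'Tracking': {'in': {'RawBanks'},                'out': {'Tracks'}},
        'CaloReco': {'in': {'RawBanks'},                'out': {'CaloClusters'}},
        'PID':      {'in': {'Tracks', 'CaloClusters'},  'out': {'PIDObjects'}},
    }

    def run_alg(name):
        print('running', name)
        return ALGS[name]['out']

    def schedule(algs):
        produced, done, running = set(), set(), {}
        with ThreadPoolExecutor(max_workers=4) as pool:
            while len(done) < len(algs):
                # Launch every algorithm whose inputs are already available.
                for name, spec in algs.items():
                    if name not in done and name not in running and spec['in'] <= produced:
                        running[name] = pool.submit(run_alg, name)
                # Wait for at least one running algorithm to finish, then record its outputs.
                finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
                for name in [n for n, f in running.items() if f in finished]:
                    produced |= running.pop(name).result()
                    done.add(name)

    if __name__ == '__main__':
        schedule(ALGS)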

• DAG of Brunel
  ◦ Obtained from the existing code instrumented with 'Auditors'
  ◦ Probably still missing 'hidden or indirect' dependencies (e.g. Tools)
• Can serve to give an idea of the potential 'concurrency'
  ◦ Assuming no changes in the current reconstruction algorithms

(B. Hegner, P. Mato / CERN)

• Need to deal with the tails of sequential processing
• Introducing pipeline processing (see the sketch below)
  ◦ Never tried before!
  ◦ Exclusive access to resources or non-reentrant algorithms can be pipelined, e.g. file writing
• Need to design or use a powerful and flexible scheduler
• Need to define the concept of an 'event context'
• Nice results from Markus's recent studies

(B. Hegner, P. Mato / CERN)
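A minimal sketch of these two ideas, assuming an 'event context' object that follows each event in flight and a single dedicated thread that pipelines the non-reentrant file-writing step; all class and function names are illustrative:

    import queue
    import threading
    from dataclasses import dataclass

    @dataclass
    class EventContext:
        """Illustrative 'event context': identifies one event in flight."""
        evt_number: int
        slot: int

    write_queue = queue.Queue()

    def writer():
        # The non-reentrant resource (the output file) is driven by exactly one thread,
        # so concurrently processed events are pipelined through it.
        while True:
            ctx = write_queue.get()
            if ctx is None:
                break
            print('writing event', ctx.evt_number, 'from slot', ctx.slot)

    def process(ctx):
        # ... reconstruction of one event would happen here ...
        write_queue.put(ctx)

    if __name__ == '__main__':
        writer_thread = threading.Thread(target=writer)
        writer_thread.start()
        workers = [threading.Thread(target=process, args=(EventContext(i, i % 4),))
                   for i in range(8)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        write_queue.put(None)   # stop the writer
        writer_thread.join()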

• It is not simple, but we are not alone
  ◦ Technologies like Apple's Grand Central Dispatch (GCD) are designed to help write applications without having to fiddle directly with threads and locking (and getting it terribly wrong)
• New paradigms for concurrency programming (see the sketch below)
  ◦ The developer factors the processing into 'chunks' with their dependencies and lets the framework (system) deal with the creation and management of a 'pool' of threads that takes care of executing the 'chunks'
  ◦ This tries to eliminate lock-based code and makes it more efficient

(B. Hegner, P. Mato / CERN)
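A minimal sketch of that programming style in Python, with concurrent.futures standing in for a GCD-like dispatcher; the work is factored into independent chunks and the pool, not the developer, manages the threads:

    from concurrent.futures import ThreadPoolExecutor

    def process_chunk(chunk):
        # Purely local work on one chunk: no shared state, hence no locks needed.
        return sum(x * x for x in chunk)

    if __name__ == '__main__':
        data = list(range(1000))
        chunks = [data[i:i + 100] for i in range(0, len(data), 100)]
        # The 'system' owns the thread pool and schedules the chunks.
        with ThreadPoolExecutor() as pool:
            partial_sums = list(pool.map(process_chunk, chunks))
        print(sum(partial_sums))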

• Rather than a "new" complete and self-contained framework, the LHC experiments would like to see a set of functional components from which to pick and choose what to incorporate into their frameworks
  ◦ Experiments have a huge investment in 'algorithmic' code and configuration based on a specific framework
• A complete solution should be provided for new experiments
  ◦ The previous constraint does not apply to new experiments
  ◦ The timing is less critical for them

(B. Hegner, P. Mato / CERN)

[Component diagram: Algorithms exchange data through the EventStore, forming a Directed Acyclic Graph of const (read) and non-const (write) accesses; supporting Services include Logging, Configuration, Persistency, Data Store, B-Field, Geometry, Material, Random and Scheduling. (*) Any resemblance to Gaudi is pure coincidence.]

(B. Hegner, P. Mato / CERN)

• "Concurrent whiteboard" (multi-event data store)
  ◦ Data declaration (in, out, update)
  ◦ Synchronized data access while being executed
  ◦ API for input, output, update and commit (see the sketch below)
• "Dispatch service" (scheduler)
  ◦ Management of task queues and threads
  ◦ Could, for example, be based on GCD
• "Logging service"
  ◦ Ensuring message integrity
  ◦ Sorting by event

(B. Hegner, P. Mato / CERN)
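A minimal sketch of what such a multi-event whiteboard API could look like; the class name, the per-event 'slot' dictionaries and the locking scheme are assumptions, not an existing Gaudi interface:

    import threading

    class ConcurrentWhiteboard(object):
        """Illustrative multi-event data store: one 'slot' of data per event in flight."""
        def __init__(self, n_slots):
            self._slots = [dict() for _ in range(n_slots)]
            self._locks = [threading.Lock() for _ in range(n_slots)]

        def put(self, slot, key, value):           # 'out' data
            with self._locks[slot]:
                self._slots[slot][key] = value

        def get(self, slot, key):                  # 'in' data
            with self._locks[slot]:
                return self._slots[slot][key]

        def update(self, slot, key, func):         # 'update' data
            with self._locks[slot]:
                self._slots[slot][key] = func(self._slots[slot][key])

        def commit(self, slot):
            # Hand the completed event over to output and free the slot.
            with self._locks[slot]:
                event, self._slots[slot] = self._slots[slot], dict()
            return event

    # Example: two events processed concurrently in slots 0 and 1.
    wb = ConcurrentWhiteboard(n_slots=2)
    wb.put(0, '/Event/Tracks', ['t1', 't2'])
    wb.put(1, '/Event/Tracks', ['t3'])
    print(wb.commit(0))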

• Modeling them as 'servers'
  ◦ Genuinely asynchronous
  ◦ Supporting concurrent clients (caching issues)
  ◦ Possible use of new hardware architectures (e.g. GPU, MIC)
• E.g. Random Service
  ◦ Reproducibility in a concurrent environment
• E.g. Magnetic Field Service (see the sketch below)
  ◦ Given a point, return the best estimate of the B-field
  ◦ It may involve complex interpolations and/or parameterizations
• E.g. Material Service (transport service)
  ◦ Given two points, return the best estimate of the material between them

(B. Hegner, P. Mato / CERN)
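A minimal sketch of the 'service as server' idea for the magnetic field case, assuming an asynchronous request/future interface; the constant-field formula and all names are placeholders for a real field-map interpolation:

    from concurrent.futures import ThreadPoolExecutor

    class AsyncFieldService(object):
        """Illustrative field 'server': clients ask asynchronously and get a future back."""
        def __init__(self):
            # The pool could equally hide a GPU/MIC offload behind the same interface.
            self._pool = ThreadPoolExecutor(max_workers=2)

        def _evaluate(self, point):
            x, y, z = point
            # Placeholder for the real interpolation/parameterization of the field map.
            return (0.0, 0.0, 1.1 if abs(z) < 5000.0 else 0.0)   # Tesla, made-up shape

        def field_at(self, point):
            # Genuinely asynchronous: returns a future, the client decides when to wait.
            return self._pool.submit(self._evaluate, point)

    svc = AsyncFieldService()
    future = svc.field_at((0.0, 0.0, 1000.0))
    # ... the client can do other work here ...
    print(future.result())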

• Investigate current LHCb applications to gather requirements and constraints
  ◦ Dependencies, data access patterns, opportunities for concurrency, etc.
  ◦ Understanding 'non thread-safe' practices, devising possible solutions
• Prototypes of new services can be tested in realistic applications (Brunel, HLT, …)
  ◦ Slot-in replacement of existing services
  ◦ Possibility of GPU/MIC implementations

• The existing Gaudi Parallel solution (both schemes) should be put into production
  ◦ A sound and effective solution for the next few years
  ◦ Full output validation is still missing
  ◦ Validation tools should be developed and be re-used later
• LHCb should be one of the main players: providing specific requirements, participating in the [common] project development and taking advantage of the new framework
  ◦ Clear benefits need to be demonstrated
  ◦ Would imply some re-engineering of parts of the experiment applications
• Participation in an R&D programme to evaluate existing technologies and develop partial prototypes of critical parts