
Pere Mato (CERN) 10 October 2008

• Triggered by the needs from physicists
• Main goal
  ◦ Get quicker responses when having 4-8 core machines at your disposal
  ◦ Minimal changes (and understandable) to the Python scripts driving the program
• Leverage from existing third-party Python modules
  ◦ processing module, parallel python (pp) module
• Optimization of memory usage

# Predefined configuration + customizations
#--- Common configuration
from Gaudi.Configuration import *
importOptions('$STDOPTS/LHCbApplication.opts')
importOptions('DoDC06selBs2Jpsi2MuMu_Phi2KK.opts')

#--- modules to be reused in all processes
from GaudiPython import AppMgr
from ROOT import TH1F, TFile, gROOT

# Preparations (booking histos, selecting files, etc.)
files = [... ]
hbmass = TH1F('hbmass', 'Mass of B cand', 100, 5200., 5500.)
appMgr = AppMgr()
appMgr.evtsel().open(files)
evt = appMgr.evtsvc()

# Looping over events
while True:
    appMgr.run(1)
    if not evt['Rec/Header']: break
    cont = evt['Phys/DC06selBs2Jpsi2MuMu_Phi2KK/Particles']
    if cont:
        for b in cont:
            hbmass.Fill(b.momentum().mass())

# Presenting results
hbmass.Draw()

• Introduced a very simple Model for parallel processing of tasks
• Common model that can be implemented using either the processing or pp modules (or others)
• The result of processing can be any ‘pickle-able’ Python or C++ object (with a dictionary)
• Placeholder for additional functionality
  ◦ Setting up the environment (including servers)
  ◦ Monitoring
  ◦ Merging results (summing what can be summed or appending to lists); see the sketch below
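A minimal sketch of how such result merging could look. This is an illustration only, not the module's actual code; the helper name and the convention that each task exposes a dict called output are assumptions based on the examples later in the talk:

# Illustrative sketch only: sum what can be summed, append to lists.
def merge_outputs(parent_output, worker_output):
    for key, value in worker_output.items():
        if key not in parent_output:
            parent_output[key] = value            # first contribution: keep it
        elif hasattr(value, 'Add'):               # ROOT histograms: sum bin contents
            parent_output[key].Add(value)
        elif isinstance(value, list):             # lists: append the worker items
            parent_output[key].extend(value)
        else:                                     # plain numbers: arithmetic sum
            parent_output[key] += value
    return parent_output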

• User parallelizable task derives from Task
  ◦ initializeLocal() is executed in the parent process
  ◦ initializeRemote() is executed once in each remote process
  ◦ process() is executed for each work item in a remote process
  ◦ finalize() is executed at the end in the parent process

[Class diagram]
Task: initializeLocal(), initializeRemote(), process(item), finalize()
WorkManager: __init__(ncpus, ppservers), process(task, items)
MyTask (derives from Task): __init__(parameters), initializeLocal(), initializeRemote(), process(item), finalize()

from math import sqrt                 # sqrt is used in finalize()
from ROOT import TH1F, TRandom, TCanvas, gROOT
from GaudiPython.Parallel import Task, WorkManager

class HistTask(Task):
    def __init__(self, nHist=4):
        self.nHist = nHist
        self.canvas = None
    def initializeLocal(self):
        self.output = [TH1F('h%d'%i, 'h%d'%i, 100, -3., 3.) for i in range(self.nHist)]
        self.random = TRandom()
    def initializeRemote(self):
        pass
    def process(self, n):
        for h in self.output:
            for i in range(n):
                x = self.random.Gaus(0., 1.)
                h.Fill(x)
    def finalize(self):
        self.canvas = TCanvas('c1', 'Gaudi.Parallel canvas', 200, 10, 700, 700)
        nside = int(sqrt(self.nHist))
        nside = nside*nside < self.nHist and nside + 1 or nside
        self.canvas.Divide(nside, nside, 0, 0)
        for i in range(self.nHist):
            self.canvas.cd(i+1)
            self.output[i].Draw()

>>> from GaudiPython.Parallel import WorkManager
>>> from HistTask import HistTask
>>> task = HistTask(nHist=9)
>>> wmgr = WorkManager()
>>> wmgr.process( task, [ for i in range(100)])
Job execution statistics:
 job count | % of all jobs | job time sum | time per job | job server
       100 |               |              |              | lxbuild114.cern.ch
Time elapsed since server creation

from ROOT import TFile, TCanvas, TH1F, TH2F
from Gaudi.Configuration import *
from GaudiPython.Parallel import Task, WorkManager
from GaudiPython import AppMgr

class MyTask(Task):
    def initializeLocal(self):
        self.output = {'h_bmass': TH1F('h_bmass', 'Mass of B candidate', 100, 5200., 5500.)}
    def initializeRemote(self):
        #--- Common configuration
        importOptions('printoff.opts')
        importOptions('$STDOPTS/LHCbApplication.opts')
        importOptions('$STDOPTS/DstDicts.opts')
        importOptions('$DDDBROOT/options/DC06.opts')
        importOptions('$PANORAMIXROOT/options/Panoramix_DaVinci.opts')
        importOptions('$BS2JPSIPHIROOT/options/DoDC06selBs2Jpsi2MuMu_Phi2KK.opts')
        appConf = ApplicationMgr(OutputLevel=999, AppName='Ex2_multicore')
        appConf.ExtSvc += ['DataOnDemandSvc', 'ParticlePropertySvc']
    def process(self, file):
        appMgr = AppMgr()
        appMgr.evtsel().open([file])
        evt = appMgr.evtsvc()
        while 0 < 1:
            appMgr.run(1)
            # check if there are still valid events
            if evt['Rec/Header'] == None: break
            cont = evt['Phys/DC06selBs2Jpsi2MuMu_Phi2KK/Particles']
            if cont != None:
                for b in cont:
                    success = self.output['h_bmass'].Fill(b.momentum().mass())
    def finalize(self):
        self.output['h_bmass'].Draw()
        print 'Bs found: ', self.output['h_bmass'].GetEntries()

from GaudiPython.Parallel import WorkManager
from MyTask import MyTask
from inputdata import bs2jpsiphimm_files as files

ppservers = ('lxbuild113.cern.ch', 'lxbuild114.cern.ch')
if __name__ == '__main__':
    task = MyTask()
    wmgr = WorkManager(ppservers=ppservers)
    wmgr.process( task, files[:50])

[Architecture figure]
LocalNode: myscript.py forks Worker processes (fork/pipe + pickle).
RemoteNode: ppserver.py forks Worker processes (fork/pipe + pickle); it is started via ssh and exchanges data with the local script over a socket + pickle.
Supporting pieces: Node Allocation and Reservation, Network FS (AFS).
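The local-node path of the figure (fork a worker and ship the result back over a pipe as a pickled object) can be illustrated with standard-library calls only. This is a sketch of the mechanism, not the GaudiPython.Parallel implementation, and it assumes the task fills a picklable task.output in process():

import os, pickle

def run_in_forked_worker(task, item):
    r, w = os.pipe()                      # pipe used to send the pickled result back
    pid = os.fork()
    if pid == 0:                          # child process: do the work
        os.close(r)
        task.process(item)                # fills task.output in the child
        with os.fdopen(w, 'wb') as out:
            pickle.dump(task.output, out) # serialize the result for the parent
        os._exit(0)
    os.close(w)                           # parent process: read back the result
    with os.fdopen(r, 'rb') as inp:
        worker_output = pickle.load(inp)
    os.waitpid(pid, 0)
    return worker_output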

• The [pp]server is created/destroyed by the user script
  ◦ Requires ssh properly set up
    ▪ Ultra-lightweight setup
    ▪ Single user, proper authentication, etc.
• The user environment is cloned to the server
  ◦ Requires a shared file system
• Automatic cleanup on crashes/aborts
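A hedged sketch of what "created/destroyed by the user script" could look like: start ppserver.py on a remote node over ssh before processing and stop it afterwards. It assumes ppserver.py is found in the PATH on the remote side (shared file system) and uses its standard -w option for the number of workers; in practice the remote process may also need an explicit kill:

import subprocess

def start_ppserver(host, nworkers=8):
    # launch the parallel python server on the remote node over ssh
    return subprocess.Popen(['ssh', host, 'ppserver.py', '-w', str(nworkers)])

server = start_ppserver('lxbuild114.cern.ch')
# ... create WorkManager(ppservers=('lxbuild114.cern.ch',)) and process the task ...
server.terminate()      # the user script is also responsible for destroying the server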

Example reading 50 DST files from castor (rfio) or local disk and filling some histograms. With 9 processes a speedup of 7 can be expected.

Same example reading 50 DST files, now using 3 nodes with 8 cores. The castor case is probably protocol bound (~70 % CPU utilization/process).

• Measuring memory usage with smaps
  ◦ Data lives in /proc/$pid/smaps
  ◦ Developed a python script to digest it (see the sketch below)
• Parallelism achieved using the Python processing module
  ◦ Uses fork() internally
• Tested with the LHCb reconstruction program Brunel
  ◦ Latest version (v33r1)

Example smaps entry:
3a5c a5c r-xp : /lib64/ld so
Size:           84 kB
Rss:            76 kB
Pss:             1 kB
Shared_Clean:   76 kB
Shared_Dirty:    0 kB
Private_Clean:   0 kB
Private_Dirty:   0 kB
Referenced:     76 kB
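The digesting script itself is not shown in the slides; a minimal sketch of what it could do is to sum the private and shared page counts over all mappings of a process (field names as in the smaps excerpt above):

import os

def digest_smaps(pid):
    # sum private/shared sizes (in kB) over all mappings of the process
    totals = {'Private_Clean': 0, 'Private_Dirty': 0,
              'Shared_Clean': 0, 'Shared_Dirty': 0}
    with open('/proc/%d/smaps' % pid) as smaps:
        for line in smaps:
            key = line.split(':')[0]
            if key in totals:
                totals[key] += int(line.split()[1])
    return totals

print(digest_smaps(os.getpid()))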

from Gaudi.Configuration import *
from GaudiPython.Parallel import Task, WorkManager
from GaudiPython import AppMgr, PyAlgorithm

class Brunel(Task):
    def initializeLocal(self):
        pass
    def initializeRemote(self):
        # Initialization on each sub-process
        global appMgr          # module-level, so that process() can use it
        importOptions('$BRUNELROOT/options/Brunel-Default.py')
        importOptions('$BRUNELROOT/options/DC06-Files.py')
        ApplicationMgr(OutputLevel=ERROR, AppName='PyBrunel')
        appMgr = AppMgr()
        appMgr.initialize()
    def process(self, file):
        # Process 10 events in each file
        appMgr.evtsel().open([file])
        appMgr.run(10)
    def finalize(self):
        pass

# Main program
from Files import digi_files as files
if __name__ == '__main__':
    task = Brunel()
    wmgr = WorkManager(ncpus=4)
    wmgr.process( task, files[:4])

from Gaudi.Configuration import *
from GaudiPython.Parallel import Task, WorkManager
from GaudiPython import AppMgr, PyAlgorithm

# Common initialization in parent process
importOptions('$BRUNELROOT/options/Brunel-Default.py')
importOptions('$BRUNELROOT/options/DC06-Files.py')
ApplicationMgr(OutputLevel=ERROR, AppName='PyBrunel')
appMgr = AppMgr()
appMgr.initialize()

class Brunel(Task):
    def initializeLocal(self):
        pass
    def initializeRemote(self):
        pass
    def process(self, file):
        # Process 10 events in each file
        appMgr.evtsel().open([file])
        appMgr.run(10)
    def finalize(self):
        pass

# Main program
from Files import digi_files as files
if __name__ == '__main__':
    task = Brunel()
    wmgr = WorkManager(ncpus=4)
    wmgr.process( task, files[:4])

[Table ¶: memory usage of the Brunel test for different configurations; the numeric entries are not recoverable from the transcript]
Columns: #, code (private | shared), data (private | shared), sum, total §, total / N
Rows: Single, Single, Single, Fork a †, Fork b †, Fork b †, Fork c †, Fork c,d †

¶ Physical memory usage in Mbytes, § including parent process
a no initialization in parent process, b all services initialized in parent, c one event run in parent, d only estimated
† part of the code only mapped in parent process

• The shared memory pages after processing the 10th event on different files can be as much as 70% (code+data) and 62% (data only)
  ◦ Probably will be degraded after processing many events under different conditions
• On an 8-core computer, running 8 independent reconstruction instances requires 2.7 GB, while running them with fork() requires 1.4 GB
• In these measurements the output file(s) are not taken into account

• In the case of a program producing large output data the current model is not adequate
  ◦ Data production programs: simulation, reconstruction, etc.
• At least two options:
  ◦ Each sub-process writes a different file and the parent process merges the files (see the sketch below)
    ▪ Merging files is non-trivial when References exist
  ◦ Event data is collected by the parent process and written into a single output stream
    ▪ Some additional communication overhead
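For the first option, plain histogram/ntuple files written by the sub-processes could be merged in the parent with ROOT's TFileMerger, as sketched below; the file names are placeholders, and this does not address the References problem mentioned above:

import ROOT

def merge_worker_files(file_names, merged_name='merged.root'):
    merger = ROOT.TFileMerger(False)      # False: do not make local copies first
    merger.OutputFile(merged_name)
    for name in file_names:
        merger.AddFile(name)
    return merger.Merge()                 # True if the merge succeeded

merge_worker_files(['worker0.root', 'worker1.root'])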

[Figure] Each sub-process has its own Transient Event Store; an Algorithm takes its input from a Work Queue and puts its output on an Output Queue. Events are moved via Event Serialization/Deserialization to the parent-process Transient Event Store, where an OutputStream writes them out.

• Prototype event serialization/deserialization exists
  ◦ The entire transient event store (or parts of it) can be moved from/to sub-processes
  ◦ No performance evaluation yet
• Currently working on improving the parallelization Model and Scheduling (see the sketch below)
  ◦ Adding asynchronous queues
  ◦ New method: localProcess(), called for each work item in the parent process
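An illustration of the work-queue/output-queue scheme from the figure above, using the standard multiprocessing module. This is not the GaudiPython prototype: the serialization of the transient event store is represented here simply by strings being pickled through the queues, and the item names are placeholders:

import multiprocessing as mp

def worker(work_queue, output_queue):
    while True:
        item = work_queue.get()
        if item is None:                       # sentinel: no more work for this worker
            break
        # ... process the event/file and serialize the (partial) event store ...
        output_queue.put('result for %s' % item)

if __name__ == '__main__':
    work_q, out_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(work_q, out_q)) for _ in range(4)]
    for w in workers:
        w.start()
    items = ['file1', 'file2', 'file3']
    for item in items:
        work_q.put(item)
    for _ in workers:
        work_q.put(None)                       # one sentinel per worker
    results = [out_q.get() for _ in items]     # parent collects and writes a single stream
    for w in workers:
        w.join()
    print(results)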

• Python offers a very high-level control of the applications based on GAUDI
• Very simple changes in user scripts may lead to huge gains in latency
• Gains in memory usage are also important
• Several schemas and scheduling strategies can be tested without changes in the C++ code
  ◦ E.g. local/remote initialization, task granularity (file, event, chunk), sync or async mode, queues, etc.