Building a scalable Python distribution for HEP Analysis

Presentation transcript:

Building a scalable Python distribution for HEP Analysis David Lange August 22, 2017

Concept: Develop and distribute a data-science-oriented Python stack for HEP
- Reduce startup burden for analysis: software environment easily available
- Provide an(other) option for a standard software environment: make it easy to use common software within an analysis group
- Ease use of distributed computing systems: simplify distribution of analysis on the Grid

Initial motivations for Python in CMS and CMSSW
- Job configuration and control of CMSSW
- Analysis: FWLite+PyROOT
Job configuration:
    import FWCore.ParameterSet.Config as cms
    from Configuration.StandardSequences.Eras import eras
    process = cms.Process('RECO', eras.Run2_2016)
    # import of standard configurations
    process.load('Configuration.StandardSequences.Services_cff')
    process.load('SimGeneral.HepPDTESSource.pythiapdt_cfi')
    process.maxEvents = cms.untracked.PSet(
        input = cms.untracked.int32(10))
    # Input source
    process.source = cms.Source("PoolSource",
        fileNames = cms.untracked.vstring('file:step2.root'),
        secondaryFileNames = cms.untracked.vstring()
    )
Analysis:
    import ROOT
    from DataFormats.FWLite import Events, Handle
    events = Events('ZmumuPatTuple.root')
    # loop over events
    for event in events:
        ...  # loop body elided on the slide
(Slide note: turn the original use cases into a diagram: job configuration, analysis via FWLite, analysis via PyROOT.)
Timeframe: ~2008

New motivations:
- Python has evolved into a standard data science platform [inside and outside of HEP]
- Massive increase in interest in applications of data analytics and machine learning techniques to HEP
- Easier for everyone in the experiment to have a pre-built software stack available as one option for analysis
- And to be able to modify this software stack to suit individual needs
- Anticipate the transition from development to production use of these tools
Our focus:
- Tool interoperability, e.g., conversion of ROOT data into or out of other data formats for analysis (e.g., for Spark)
- Data science tools developed in the Python ecosystem
- Make sure this includes common platforms for the CMS user community
- E.g., ease the transition from dev to production for ML

Experiment Software Distributions
Conflicting goals of production software distributions and user analysis developments?
[Diagram labels: Experiment Software Distributions, Analysis R&D]

Our approach
- We adopted pip and PyPI distributions of packages
- Simplify, standardize and modernize Python package distribution in CMSSW
- Use source distributions when available (which is almost always) so that tool builds and distributions are consistent with the rest of CMSSW
- Priorities for adding specific packages are driven by suggestions and interests from users (and by what packages we see the community using)

Adopting pip simplifies the implementation of a new Python package in our build system
    ### RPM external py2-numba 0.33.0
    ## INITENV +PATH PYTHONPATH %{i}/${PYTHON_LIB_SITE_PACKAGES}
    Requires: py2-funcsigs py2-enum34 py2-six
    Requires: py2-singledispatch py2-llvmlite py2-numpy
    ## IMPORT build-with-pip
Annotations on the spec file:
- Tool name and version: the "### RPM external" line
- Required dependencies: the "Requires:" lines
- Common code to build: the "## IMPORT build-with-pip" line
Explicitly indicate dependent packages. These dependencies can sometimes be derived from pip (i.e., determined automatically). However, complex packages often require an understanding of their setup options.
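Because the spec file for a pip-built package is so short, it can be thought of as a template with three inputs: name, version, and dependency list. The following is a minimal sketch of that idea, not part of the CMS build tooling; `render_pip_spec` and its `per_line` parameter are hypothetical names introduced here for illustration.

```python
def render_pip_spec(name, version, requires, per_line=3):
    """Return spec-file text shaped like the py2-numba example on the slide.

    Hypothetical helper: the real spec files are written by hand.
    """
    lines = [
        f"### RPM external {name} {version}",
        "## INITENV +PATH PYTHONPATH %{i}/${PYTHON_LIB_SITE_PACKAGES}",
    ]
    # Emit the dependencies a few per line, as on the slide.
    for start in range(0, len(requires), per_line):
        lines.append("Requires: " + " ".join(requires[start:start + per_line]))
    # Pull in the shared build recipe.
    lines.append("## IMPORT build-with-pip")
    return "\n".join(lines) + "\n"


spec_text = render_pip_spec(
    "py2-numba",
    "0.33.0",
    ["py2-funcsigs", "py2-enum34", "py2-six",
     "py2-singledispatch", "py2-llvmlite", "py2-numpy"],
)
print(spec_text)
```

The dependency list is still supplied explicitly, matching the slide's point that complex packages need a human to decide what is required.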

What is inside build-with-pip?
    # File: build-with-pip
    # Define name on PyPI
    %if "%{?pip_name:set}" != "set"
    %define pip_name %(echo %n | cut -f2-5 -d-)
    %endif
    %if "%{?PipDownloadOptions:set}" != "set"
    %define PipDownloadOptions --no-deps%%20--no-binary%%3D:all:
    %endif
    %if "%{?PipBuildOptions:set}" != "set"
    %define PipBuildOptions --no-deps
    %endif
    # Retrieve source (from PyPI)
    Source: pip://%{pip_name}/%{realversion}?pip_options=%{PipDownloadOptions}&output=/source.tar.gz
    Requires: python
    BuildRequires: py2-pip
    %prep
    %build
    mkdir -p %{i}
    # Unpack source
    tar xfz %{_sourcedir}/source.tar.gz
    # Handle special cases
    %{?PipPreBuild:%PipPreBuild}
    # Use pip to install
    export PIPFILE=`cat files.list`
    export PYTHONUSERBASE=%i
    pip install --user -v %{PipBuildOptions} $PIPFILE
    %install
    # Handle special cases
    %{?PipPostBuild:%PipPostBuild}
Main features: no wheels (--no-binary :all:) and no automatic dependency resolution (--no-deps).
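Two small transformations in this recipe are easy to misread: the `cut -f2-5 -d-` call that derives the PyPI name from the RPM package name, and the URL-encoded download options. The sketch below reproduces both in Python for illustration only. The function names are hypothetical, and it assumes `%{realversion}` expands to the plain version string.

```python
from urllib.parse import unquote


def pip_name_from_rpm(rpm_name):
    """Mimic `echo %n | cut -f2-5 -d-`: keep dash-separated fields 2 to 5."""
    if "-" not in rpm_name:
        # cut prints lines without the delimiter unchanged
        return rpm_name
    return "-".join(rpm_name.split("-")[1:5])


def pip_source_url(rpm_name, version,
                   download_options="--no-deps%20--no-binary%3D:all:"):
    """Assemble the pip:// source URL as the spec macros would expand it."""
    return (f"pip://{pip_name_from_rpm(rpm_name)}/{version}"
            f"?pip_options={download_options}&output=/source.tar.gz")


print(pip_name_from_rpm("py2-numba"))          # strips the py2- prefix
print(pip_source_url("py2-numba", "0.33.0"))
# Decoding the options shows the actual pip flags being passed.
print(unquote("--no-deps%20--no-binary%3D:all:"))
```

Keeping fields 2 to 5 rather than just field 2 means names that themselves contain dashes survive the prefix removal.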

Users can now use pip to explore the CMSSW Python stack
    dlange> pip list
    appdirs (1.4.3)
    bleach (2.0.0)
    Bottleneck (1.2.1)
    certifi (2017.4.17)
    chardet (3.0.4)
    click (6.7)
    climate (0.4.6)
    configparser (3.5.0)
    cycler (0.10.0)
    Cython (0.22)
    decorator (4.0.11)
    deepdish (0.3.4)
    docopt (0.6.2)
    downhill (0.4.0)
    ….........
Standardization.
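The same inventory is available from inside Python, which can be handy in a notebook or a job script. This is a minimal sketch using the standard library's `importlib.metadata`. It assumes Python 3.8 or later, whereas the slides show Python 2.7, and `installed_packages` is a name chosen here for illustration.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return sorted (name, version) pairs for distributions visible on sys.path."""
    found = set()
    for dist in distributions():
        meta = dist.metadata
        if meta is None:
            continue  # skip installations with unreadable metadata
        name = meta.get("Name")
        version = meta.get("Version")
        if name and version:
            found.add((name, version))
    return sorted(found, key=lambda pair: pair[0].lower())


# Print in the same "name (version)" layout as the pip listing on the slide.
for name, version in installed_packages():
    print(f"{name} ({version})")
```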

VirtualEnv helps users be more agile than CMSSW releases when they need to be
Example: I want to upgrade Theano. CMS distributes 0.8.2:
    dlange> pip list | grep Theano
    Theano (0.8.2)
    dlange> python
    Python 2.7.11 (default, Apr 28 2017, 13:50:27)
    [GCC 6.3.0] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import theano
    >>> print theano.__version__
    0.8.2
To get started: set up and activate virtualenv
    dlange> virtualenv updateTheano
    New python executable in /build/dlange/CMSSW_9_3_X_2017-08-07-2300/updateTheano/bin/python
    Installing setuptools, pip, wheel...done.
    dlange> source updateTheano/bin/activate.csh
    [updateTheano] dlange> setenv PYTHONPATH $PWD/updateTheano/lib/python2.7/site-packages/:$PYTHONPATH

VirtualEnv helps users be more agile than CMSSW releases when they need to be
Install the version I want:
    [updateTheano] dlange> pip install Theano==0.9.0
    Collecting Theano==0.9.0
    Requirement already satisfied: scipy>=0.14
    Requirement already satisfied: numpy>=1.9.1
    Installing collected packages: Theano
    Found existing installation: Theano 0.8.2
    Not uninstalling theano at /cvmfs/cms-ib.cern.ch/…....... /lib/python2.7/site-packages, outside environment /build/dlange/CMSSW_9_3_X_2017-08-07-2300/updateTheano
    Successfully installed Theano-0.9.0
    [updateTheano] dlange> pip list | grep Theano
    Theano (0.9.0)
    [updateTheano] dlange> python
    Python 2.7.11 (default, Apr 28 2017, 13:50:27)
    >>> import theano
    >>> print theano.__version__
    0.9.0
Done!
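The upgrade works without touching the release area because Python imports the first matching package it finds on its search path, and the `setenv PYTHONPATH` step puts the virtualenv's site-packages ahead of the release's. The sketch below demonstrates that precedence rule in isolation with two throwaway directories. `demo_tool` is a made-up package standing in for Theano, and the directory names are illustrative only.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def write_fake_package(root, name, version):
    """Create a one-file package exposing __version__ under root; return root."""
    pkg_dir = root / name
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(f'__version__ = "{version}"\n')
    return root


def version_seen(name, search_path):
    """Import `name` with search_path prepended to sys.path and report its version."""
    saved_path = list(sys.path)
    sys.modules.pop(name, None)
    sys.path[:0] = search_path
    try:
        importlib.invalidate_caches()
        return importlib.import_module(name).__version__
    finally:
        # Restore the interpreter state so the demo leaves no trace.
        sys.path[:] = saved_path
        sys.modules.pop(name, None)


with tempfile.TemporaryDirectory() as tmp:
    release = write_fake_package(Path(tmp) / "release_site_packages", "demo_tool", "0.8.2")
    venv = write_fake_package(Path(tmp) / "venv_site_packages", "demo_tool", "0.9.0")
    release_only = version_seen("demo_tool", [str(release)])
    venv_first = version_seen("demo_tool", [str(venv), str(release)])

print(release_only, venv_first)  # prints "0.8.2 0.9.0"
```

The release copy is never modified or removed, which mirrors the "Not uninstalling theano ... outside environment" message in the pip output.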

Aside: Python can interact with CMSSW analysis jobs (and data structures)
- Maintain processing control after job configuration is complete (instead of launching an executable)
- Iterative alignment or calibration techniques
    e = CmsRun(process)
    e.run()
- Enables new possibilities, e.g.:
  - Allocate memory in numpy arrays, then fill and/or process the data within experiment framework applications
  - Ease the interface to Spark (e.g.) analysis
https://github.com/diana-hep/c2numpy/.../commonblock-demo.ipynb
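The "allocate in Python, fill in the framework" pattern can be illustrated without CMSSW or numpy. In this minimal sketch, the standard library's `array.array` stands in for a numpy array, and `fill_pt` is a hypothetical stand-in for framework code that writes into a buffer it does not own. Neither is taken from the c2numpy demo linked above.

```python
from array import array


def fill_pt(buffer, values):
    """Hypothetical framework-side routine: write values into a caller-owned buffer.

    Returns the number of entries written, never exceeding the buffer length.
    """
    count = 0
    for index, value in enumerate(values):
        if index >= len(buffer):
            break
        buffer[index] = value
        count += 1
    return count


# Python side: preallocate eight doubles, all zero.
pt = array("d", [0.0]) * 8

# "Framework" side: fill in place through a memoryview, with no copy.
n_filled = fill_pt(memoryview(pt), [41.5, 27.3, 12.8])

# Back on the Python side, the filled entries are immediately usable.
mean_pt = sum(pt[:n_filled]) / n_filled
print(n_filled, mean_pt)
```

A numpy array exposes the same buffer interface, so the ownership model carries over: Python controls the memory lifetime while the producer only writes into it.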

Tools we integrated: HEP developments (incomplete list): rootpy, root_pandas, xrootdpyfs

Tools we integrated: Jupyter stack
Common use case: do Python analysis in the CMSSW software stack from a laptop
https://github.com/cms-sw/cmssw/...pcsfVerify.ipynb

Tools we integrated: Data science (incomplete list)

Tools we integrated: Machine learning (incomplete list): Keras

But my laptop does not run CentOS7. What can I do?
    dlange> pip install pyCMSSW
- We defined a simple wrapper package that provides the versions of the Python tools corresponding to the current CMSSW distribution
- A preliminary version of this is available now. We plan to update it as we deploy and validate significant changes to the Python tools in CMSSW
- Thus, it won’t be the latest and greatest set of tools available on PyPI, but hopefully a self-consistent one
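One way a wrapper package of this kind could stay self-consistent is to pin every dependency to the exact version in the release. The sketch below converts `pip list` output in the older `name (version)` layout into `name==version` pins. It is an illustration of the idea, not the actual pyCMSSW implementation, and `pins_from_pip_list` is a hypothetical helper.

```python
import re

# Matches lines in the legacy pip listing format, e.g. "click (6.7)".
PIP_LIST_LINE = re.compile(r"^(?P<name>\S+) \((?P<version>[^)]+)\)$")


def pins_from_pip_list(text):
    """Turn `name (version)` lines into `name==version` requirement pins."""
    pins = []
    for line in text.splitlines():
        match = PIP_LIST_LINE.match(line.strip())
        if match:  # silently skip blank or unrecognized lines
            pins.append(f"{match['name']}=={match['version']}")
    return pins


listing = """appdirs (1.4.3)
bleach (2.0.0)
Bottleneck (1.2.1)
"""
install_requires = pins_from_pip_list(listing)
print(install_requires)
# prints ['appdirs==1.4.3', 'bleach==2.0.0', 'Bottleneck==1.2.1']
```

A list like `install_requires` could then be handed to a packaging tool so that installing the wrapper pulls in the matching versions.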

Conclusion and Outlook
- We attempt to develop and distribute a Python software stack that matches the wants and needs of data analysis in HEP
- We believe this approach can simultaneously fit the needs of production environments and users' need for agility
- The CMS software stack has been used for this work, but nothing is CMS specific (aside from the SPEC file implementations, and these are nearly trivial)
- Working to develop performance-monitoring unit tests
- Try it out. We would like to hear how to improve what we’ve done

Backup information

Example: Converting ROOT data to the data format you need for ML
https://indico.cern.ch/event/613842/contributions/2585787/attachments/1463230/2260889/pivarski-data-formats.pdf