Software Release Build Process and Components in ATLAS Offline
Emil Obreshkov, for the ATLAS collaboration (CHEP 2010, Taipei)

Slide 2: Introduction
- Software release building process in ATLAS Offline
- ATLAS software components
- Code and configuration management
- Numbered and nightly releases
- Tag Collector and software package approvals
- Parallel and distributed builds
- NICOS (NIghtly COntrol System)
- Test and validation frameworks

Slide 3: ATLAS Experiment Characteristics
- Very complex detector: ATLAS has ~80 M electronic channels
- Very large collaboration: ATLAS involves ~3000 scientists and engineers from 174 institutions in 38 countries
- Large, geographically dispersed developer community: ~600 developers, mainly part time
- Large code base, mostly in C++ and Python: ~5M lines of code
- Several platforms and compilers in use
- Need to support stable production and development activities
  - Rapid bug fixing
  - High-statistics testing before deployment

Slide 4: ATLAS Software Components
- Packages (~2000)
  - Groups of C++ and/or Python classes
  - Primary development/management units
  - Have dependencies on classes in other packages
  - Some packages are externally supplied
  - ~200 packages change per week (across all nightlies)
- Projects (~10)
  - Groups of packages that can be built together, with similar dependencies within a project
  - Domain specific (e.g. reconstruction, analysis)
  - Primary release coordination units
  - Some projects are externally supplied
- Platforms (~6)
  - Combinations of operating system, compiler and compiler flags
  - E.g. Scientific Linux 4 & 5, gcc 3.4 and gcc 4.3, opt/dbg, 32-bit and 64-bit, icc & llvm
- Branches
  - Self-contained versions of the software supporting development and production

Slide 5: Code and Configuration Management
- SVN (Subversion)
  - Used as the source code repository (atlasoff); all projects share the same repository
  - Can also incorporate software from other repositories
  - SVN tags are used to track stable package versions
  - SVN permissions are used to control read and write access
  - 2 additional separate SVN repositories hold user- and group-specific code (atlasusr & atlasgrp)
- CMT (Configuration Management Tool)
  - Used as the primary build and configuration management tool
  - Manages dependencies between packages and projects
  - Every package must specify its configuration in a text file called "requirements", which is easy to read and modify (an illustrative example follows this slide)
  - Establishes common patterns used by all packages and projects (e.g. compiler flags, linker options) and allows specific patterns or actions defined only in selected packages
  - Manages the build process, ensuring the correct build sequence
  - Sets up the correct runtime environment
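
To make the "requirements" file concrete, here is a minimal illustrative sketch for a hypothetical component package; the package name MyAnalysis and its author are invented for this example, and the use lines only show the typical form of a dependency declaration, not an actual release configuration:

    package MyAnalysis
    author Some Developer

    # packages this one depends on; CMT uses these lines to build
    # the dependency graph and determine the build order
    use AtlasPolicy     AtlasPolicy-*
    use GaudiInterface  GaudiInterface-*   External

    # build a component library from the package sources
    library MyAnalysis *.cxx components/*.cxx
    apply_pattern component_library

The common build logic (compiler flags, linker options, installation steps) is pulled in through shared patterns, so individual package files stay short and readable.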

Slide 6: Numbered and Nightly Releases (1/2)
- Nightly releases
  - Daily builds kept for several days (usually a week) before being overwritten
  - Coupled with several validation frameworks (see later slides)
  - Coupled with a package approval mechanism by identified experts to minimize instabilities
- Migration releases
  - Nightly releases for developer communities, allowing "disruptive" migrations
  - Typically standard nightly releases with a few specific package modifications
- Numbered releases
  - Stable snapshots of the software used for production and analysis
  - Deployed after high-statistics validation
- Patch releases
  - Provide overrides to fix small problems discovered after deployment of numbered releases
  - Physics analysis caches (10 of them and growing): dedicated to physics groups to capture specific analysis software on top of a patch release
- Numbered releases and patch releases are distributed separately on the grid

Slide 7: Numbered and Nightly Releases (2/2)
[Diagram: patch releases and physics analysis patch releases are layered on top of a base release]

Slide 8: Tag Collector and Package Approvals (1/2)
- Web-based tool to assign packages to projects and package versions to releases, and to describe project dependencies
- Developer user interface to submit new package versions
- Newly submitted package versions go through a validation and approval procedure before being accepted into a release branch
- A set of release coordinators handles the approvals, using automatically generated e-mails

Slide 9: Tag Collector and Package Approvals (2/2)

Slide 10: Parallel and Distributed Builds (1/2)
- Parallel and distributed techniques are required in order to build in a timely manner (< 1 night)
- Platform-level parallelism
  - Builds for each platform are performed in parallel and the results merged together
  - One build machine per platform on a farm of dedicated build machines, driven by the ATLAS Nightly Control System (NICOS)
- Project-level parallelism
  - Projects having no mutual dependencies are built in parallel, again under NICOS

Slide 11: Parallel and Distributed Builds (2/2)
- Package-level parallelism
  - Packages with no cross-dependencies are built in parallel, taking advantage of multi-core machines
  - Implemented by tbroadcast, defined on top of CMT (see the backup slide and the sketch there)
- File-level parallelism
  - Parallel make (make -j)
  - Distributed compilation using distcc

Slide 12: NICOS (1/2)
[Diagram: the NICOS control system drives the ATLAS nightly builds, combining project tags from Tag Collector, code checkout from ATLAS SVN, builds with CMT, unit/QA/integration testing with the ATN test tool, error analysis, build results and automatic e-mails]

Slide 13: NICOS (2/2)
- Nightly build stability is assured by local disk usage and by automatic discovery of and recovery from failures
  - Nightly releases are checked out, built and tested on the local disks of the build machines
  - Failures to connect to an external tool, such as Tag Collector, are followed by several repeat attempts
- The quality of the releases is tested immediately
  - NICOS has the "ATLAS Testing Nightly" (ATN) system integrated
  - ATN runs more than 300 integration tests in all domains of ATLAS software, from simple standalone tests to full reconstruction of a few events
  - ATN test results are available for all platforms shortly after the release build completes
  - A set of shifters checks the results every day and reports any problems

Slide 14: NICOS Web Pages (1/2)

Slide 15: NICOS Web Pages (2/2)

Slide 16: Experience
- Use of the local disk on the build machines
  - Avoids AFS problems and heavy AFS disk usage
  - Tests are also performed locally
  - One platform is used as master for the copy to AFS; the other platforms merge in only their binaries
- Use of tbroadcast + distcc makes the builds fast
  - Time to build and copy is ~8h; previously it took more than 24h, which was unacceptable for our needs
- NICOS has error detection mechanisms incorporated to ensure a proper build (see the sketch below)
  - Checkout from SVN: wait and retry on failure
  - Retry of build problems (due to AFS or network failures)
  - Retries in case of failures to connect to Tag Collector
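
The wait-and-retry behaviour described above can be sketched in a few lines of Python; this is an illustration of the pattern only, not NICOS code, and the repository URL in the usage line is invented:

    import subprocess
    import time

    def run_with_retries(cmd, attempts=3, wait_seconds=60):
        """Run an external command; on failure, wait and retry a few times
        (e.g. to ride out transient AFS, network or Tag Collector problems)."""
        for attempt in range(1, attempts + 1):
            if subprocess.run(cmd).returncode == 0:
                return True
            if attempt < attempts:
                print("attempt %d/%d failed, retrying in %ds" % (attempt, attempts, wait_seconds))
                time.sleep(wait_seconds)
        return False

    # illustrative usage: retry the checkout of one package tag
    run_with_retries(["svn", "co",
                      "https://svn.example.org/atlasoff/SomePackage/tags/SomePackage-00-01-02"])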

Slide 17: Test and Validation Frameworks
- Several test and validation frameworks exist for nightly and numbered releases
- Nightlies:
  - ATN (ATLAS Testing Nightly): unit tests and functionality tests at O(<10) events
  - RTT (Run Time Tester): developer-defined functionality tests at O(100) events
  - FCT (Full Chain Test): production tests at O(1K) events, run on a dedicated instance of RTT using simulated data
  - TCT (Tier0 Chain Test): production tests at O(10K) events, run on a dedicated instance of RTT using real data
- Numbered releases:
  - BCT (Big Chain Test): production tests at O(1M) events, run on the Grid or the central CERN processing facility (Tier0) using real data
  - SampleA: production tests at O(100K) events, run on the Grid using simulated data
- FCT, TCT, BCT and SampleA check functionality and physics quantities, with semi-automatic comparison against reference histograms (a sketch of such a comparison follows)
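
As a purely illustrative sketch of what a semi-automatic histogram comparison can look like (this is not the ATLAS validation code; bin contents are assumed to be available as plain lists and Poisson errors are assumed):

    import math

    def suspicious_bins(test, reference, n_sigma=3.0):
        """Return the indices of bins where the test histogram deviates from
        the reference by more than n_sigma standard deviations, for a shifter
        to inspect by eye."""
        flagged = []
        for i, (t, r) in enumerate(zip(test, reference)):
            sigma = math.sqrt(t + r) or 1.0   # avoid division by zero for empty bins
            if abs(t - r) / sigma > n_sigma:
                flagged.append(i)
        return flagged

    # illustrative usage with made-up bin contents
    print(suspicious_bins([100, 250, 80, 5], [103, 247, 79, 40]))   # -> [3]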

Slide 18: Run Time Tester
- Framework using RTT packages
- Unified test configuration: an XML file defines each RTT job
- An RTT job is a three-step process
- Manual and automated modes of running
  - Developers create and upload/configure RTT jobs
  - RTT finds the jobs, runs them and publishes the results
- TCT and FCT run inside RTT

Slide 19: RTT Job Results

Slide 20: Build and Test Clusters
- Build cluster
  - Used for full, patch and migration nightlies
  - Numbered releases are created by copying and renaming nightly releases
  - 8 machines with 4 cores and 42 machines with 8 cores, all running SLC5
  - 8 distcc server machines, shared with other experiments
- Test cluster (RTT, FCT, TCT)
  - 112 machines in total: 13 launch nodes and 99 test executors

Slide 21: Final Conclusions
- A robust release building and validation infrastructure is essential for large, complex HEP experiments with a distributed developer base
- Performing the build and test process every night eases software development and increases software quality
- A combination of nightly and numbered releases supports both development and production activities
- Rigorous validation is important, both before and after deployment
  - Including rapid patching of problems discovered after deployment
- Dedicated hardware and manpower resources are required: a build cluster and multiple complementary validation testbeds

Slide 22: Questions?
Backup slides follow.

Slide 23: NICOS
- NICOS (NIghtly COntrol System) manages the multi-platform nightly releases of ATLAS software; it is flexible and easy to use
- At initialization, the information about project tags and project dependencies is retrieved from Tag Collector
- Code is checked out from the ATLAS SVN repository
- Projects are built with the CMT configuration management tool
- Quality checks, unit tests and integration tests are performed
- Build and test results are posted on the NICOS web pages; automatic notifications about problems are sent to developers
- Builds of different projects and platforms are performed in parallel, with all processes thoroughly synchronized (see the sketch below):
  - Builds of independent projects run in parallel
  - A project is built, then its testing is started in parallel with the build of the next project in the chain
  - Builds on different platforms are performed simultaneously on different build machines; upon completion, the results are merged to a single location on the AFS file system
- Parallelism allows the multi-processor build machines to be fully loaded and accelerates the completion of the nightly releases
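
The project-chain scheduling described above can be sketched as follows; the build and test functions are placeholders and the project names are only examples, not the actual NICOS implementation:

    import threading

    def build(project):
        print("building", project)    # placeholder for the real CMT project build

    def test(project):
        print("testing", project)     # placeholder for the ATN test step

    def build_and_test_chain(projects):
        """Build the projects of one branch in dependency order; run each
        project's tests in a background thread while the next project builds."""
        test_threads = []
        for project in projects:
            build(project)                                   # builds are sequential along the chain
            t = threading.Thread(target=test, args=(project,))
            t.start()                                        # tests overlap with the next build
            test_threads.append(t)
        for t in test_threads:
            t.join()                                         # wait for all tests before publishing results

    build_and_test_chain(["AtlasCore", "AtlasConditions", "AtlasReconstruction"])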

Slide 24: Tbroadcast
- Tbroadcast implements parallelism across the packages in a project
- It is a Python script defined on top of CMT
- It parses the output of the "cmt show uses" command to get the dependency graph and other package information
- Gives better compilation times (a sketch of the idea follows)
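
A minimal Python sketch of that idea, assuming the dependency information from "cmt show uses" has already been reduced to a mapping from each package to the set of packages it uses; the build command is a placeholder and this is not the actual tbroadcast code:

    from concurrent.futures import ThreadPoolExecutor

    def build_package(package):
        print("building", package)    # placeholder for running the package build

    def parallel_build(deps, workers=8):
        """deps maps package -> set of packages it depends on.
        Build packages in parallel waves: a package becomes ready as soon
        as all of its dependencies have been built."""
        done = set()
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(done) < len(deps):
                ready = [p for p in deps if p not in done and deps[p] <= done]
                if not ready:
                    raise RuntimeError("circular dependency among packages")
                list(pool.map(build_package, ready))   # build the current wave in parallel
                done.update(ready)

    parallel_build({
        "CoreUtils":  set(),
        "EventModel": {"CoreUtils"},
        "Reco":       {"CoreUtils", "EventModel"},
    })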

Slide 25: Distcc
- Distcc is a program to distribute builds of C, C++, Objective-C and Objective-C++ code across several machines in a network
  - Generates the same results as a local build (if set up correctly)
  - Does not require all machines to share a filesystem
- Distcc only runs compiler and assembler jobs
  - The compiler and assembler take a single input file and produce a single output file; distcc ships these files across the network
  - The preprocessor runs locally, since it needs access to the header files on the local machine
  - The linker runs locally, since it needs to examine libraries and object files
- Building is easy: distcc works with "make -j"; the -j value is normally set to about twice the number of available CPUs (see the sketch below)
- ATLAS uses a dedicated distcc cluster (lxdistcc), kindly provided and supported by CERN/IT
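
For illustration, a small Python driver around "make -j" with distcc could look like the sketch below; the host names, path and the way the compiler is overridden are assumptions for this example, not the actual ATLAS build scripts:

    import os
    import subprocess

    def distributed_make(build_dir, hosts, target="all"):
        """Run make through distcc, with a -j value of roughly twice the
        number of CPUs made available by the distcc hosts (here crudely
        approximated by the number of hosts)."""
        env = dict(os.environ)
        env["DISTCC_HOSTS"] = " ".join(hosts)     # machines that will run compile jobs
        jobs = 2 * max(len(hosts), 1)
        subprocess.run(
            ["make", "-j%d" % jobs, target, "CC=distcc gcc", "CXX=distcc g++"],
            cwd=build_dir, env=env, check=True)

    # hypothetical usage
    distributed_make("/path/to/package/cmt", ["lxdistcc01", "lxdistcc02", "lxdistcc03"])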