Improvements in the CMS Computing System from Run 2
CHEP 2015
Ian Fisk and Maria Girone, for the CMS Collaboration

Ian Fisk and Maria Girone  In Run2 CMS expects to increase the HLT output to ~1kHz  will promptly reconstruct more than twice as many events as the final year of Run2  expected pile-up will require twice as much processing time per event  The budget and evolution of technology permits less than doubling of the computing resources between 2012 and 2015  The main focus of Long Shutdown 1 has been finding ways to do more with less and to look for efficiency improvements in every system

Ian Fisk and Maria Girone  Evolution out of the MONARC model had already started in Run1. The LS1 focused on  Additional Flexibility on the way resources are accessed  Improved Performance on the way resources are used  Optimized Access to data for analysis and production  Instrumental for these changes have been the adoption of  The higher level trigger for production use  Logical separation between the Disk and Tape storage systems  Dynamic Data Placement  Data Federation  Distributed Tier-0  CRAB3

Ian Fisk and Maria Girone  Operational improvements  Reducing the number of reprocessing passes expected and commissioning the use of the HLT farm for the main data processing in the winter  Constraining the simulation budget and developing techniques to allow simulation reconstruction to be run on more resources  Distributing the prompt reconstruction between CERN and the Tier-1 centers  Technical improvements  Reconstruction improvements and the development of a multi- core application  Commissioning of the multi-core queues at Tiero and Tier1s  Multi-core decreases the number of processes that need to be tracked and reduces the overall operational load

Ian Fisk and Maria Girone  An addition for Run II is the use of the High Level Trigger (HLT) farm for offline processing  It is a large computing resource (15k cores) that is similar in size to the Tier-0 in terms of number of cores, but we cannot reach this scale until March  Successfully interfaced using cloud computing tools. It is similar to the Tier-0 AI  In 2014 the network link P5 to the computing center was upgraded from 20 to 60Gb/s  Far larger than needed for data taking but necessary to access the storage in the computing center for simulation reconstruction  Will be upgraded to 120Gb/s before the 2015 run starts  Production workflows have been commissioned including the HI reprocessing, Gen-Sim, and Simulation reconstruction  All access to data is through the data federation and primarily served from CERN 40Gbit/s sustained 5

Ian Fisk and Maria Girone  CMS logically separated disk and tape. Creating two sites: one disk and one tape  Simple change with far reaching consequences  Improved disk management  Ability to treat the archival services independently  Reduces functional differences between sites  Reduced differences between sites  Eliminates the need for storage classes

Ian Fisk and Maria Girone  In addition to work on data federation we have tried to improve our traditional data placement and access  The use of samples is continuously monitored through the data popularity  Number of replicas depends on the popularity of the datasets  Samples will be replicated dynamically if the load is high and replicas removed if they are not accessed for a period of time  Samples are only deleted when there is new data to replicate, disks are kept full  “0 old”, shows the volume data that were last accessed prior to the period covered by the plot; the “0” bin stands for no access in the period selected  The zero bin includes un-accessed replicas and datasets that have only one copy on disk  As of today, the system has triggered the deletion of roughly 1.5 PB of the least popular samples residing at Tier-2 sites, and it is being enabled in a few Tier-1s

Ian Fisk and Maria Girone  Any Data, Anytime, Anywhere (AAA) has been a primary focus area in 2014  CERN, all Tier-1s,most of the Tier-2 sites serve data in the federation  More than 90% of all current CMS data should be accessible  Sufficient IO capacity to provide 20% of the total access (~100TB/day)  Enable to system of hierarchical redirectors to maintain access within geographic regions when possible  Nearly all sites are configured to use the federation to access samples, if they aren’t available locally  Optimization of the IO has been an ongoing activity for several years, which has paid off in high CPU efficiency over the wide area  Big push in 2014 to commission sites to measure IO and file open rates and to understand the sustainable load and to deploy and use advanced monitoring CMS has developed a new user analysis data format that is 5-10x smaller than the Run 1 format – Design and content based on Run 1 experience across analysis groups – Targets most analysis needs (80-90%) Potential for big analysis improvements in Run 2: – Increased analysis agility in limited computing resources: Recreating miniAOD is much faster than rerunning the reconstruction for a vast range of performance improvements

Ian Fisk and Maria Girone  The addition of the production workflow puts additional constraints on the required reliability of the Federation  Validated sites are in the production federation, sites being commissioned are in an independent federation and only when a sample cannot be found in production are they used  This new concept has been a result of a collaborative effort between ATLAS and CMS federation developers Redirector Site Redirector Site Production Transitional

Ian Fisk and Maria Girone  The Tier-0 submission infrastructure was reworked over LS1  The capability of distributing Prompt Reco was added to allow the Tier-1 centers to contribute  The Tier-0 processing predominately come from the CERN Agile Infrastructure with direct resources provisioning through Condor  CMS was an early adopter of CERN AI  The Tier-0 infrastructure was implemented using the components from the rest of the computing reprocessing system  This reduces the software maintenance of the system and reduces the overall operations load

Ian Fisk and Maria Girone  The CMS Remote Analysis Building (CRAB) has been the user interface to the distributed computing resources since the beginning  Completed the development of the next generation of CRAB during LS1 (CRAB3)  Is a client-server model  A light client uploads to a server  Given much better resubmission and allows the experiment more central prioritization  Output files are handed asynchronously

Ian Fisk and Maria Girone  In Run2 CMS computing resources are intended to work more like a coherent system than a collection of sites with specialized functions  Improved networks have been key to this 12  Data Federation will make CMS datasets available transparently across the Grid  One central queue for all resources and all workflows  The HLT farm is now an integrated resource when we are not in data-taking mode HLT Tier-1 Tier-0 Tier-2 GEN-SIM MC RECO DATA RECO ANALYSIS

Ian Fisk and Maria Girone  A lot of work has been done in LS1  We have a functional data federation, which we expect to declare production ready before the start of the run  We have much better flexibility in where workflows can run  We have reworked the Tier-0 infrastructure for distributed prompt-reconstruction  We needed to make big increases in operational efficiency to survive the conditions in Run2 with the resources that could be afforded