External Focus: CMS@Home. Dr Ivan D Reid, Brunel University London, 02/09/2016

What is (was) CMS@Home?
- Developmental distributed computing using volunteer contributions, similar to SETI@Home
- Based on the BOINC framework (Berkeley Open Infrastructure for Network Computing)
- Also uses the VirtualBox virtualiser; CMSSW jobs run in a Red Hat Linux VM on Windows, Linux, and Macintosh PCs
- The VM includes CVMFS to give access to CMS software and data

Why?
- Snapshot of the ATLAS Dashboard on 12/01/2016: 10,000 ATLAS@Home jobs running at once…
- Why me?

…to the Volunteer
- Installs BOINC and VirtualBox if necessary
- Signs up to CMS@Home at http://boincai05.cern.ch/CMS-dev/index.php using an invitation code (available on request); a command-line alternative is sketched below
- Enables task download for the project and watches as the tasks arrive and run
- Can watch progress on a remote console (or the Dashboard) and view logs locally in a browser
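For volunteers who prefer the command line, attaching the local BOINC client to the project can also be scripted. The sketch below is illustrative only: it assumes the standard boinccmd utility is installed, and the project URL and account key shown are placeholders (the key comes from your account page once you have registered with the invitation code).

    import subprocess

    # Placeholder values: substitute the real project URL and the account key
    # shown on your CMS-dev account page.
    PROJECT_URL = "http://boincai05.cern.ch/CMS-dev/"
    ACCOUNT_KEY = "your_account_key_here"

    # boinccmd talks to the locally running BOINC client;
    # --project_attach registers that client with the given project.
    subprocess.check_call(["boinccmd", "--project_attach", PROJECT_URL, ACCOUNT_KEY])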

…to the Manager
- Creates a CMSSW project area and designs a suitable workflow (Monte Carlo only, no data)
- Submits batch jobs using CRAB3; special commands direct the jobs to a dedicated Condor server at RAL (see the configuration sketch below)
- Logs onto the Condor server to adjust job parameters
- Monitors the batch on the Dashboard and via logs on the server
- Monitors the proxy certificate (default lifetime 7 days) and manually installs a new one if necessary; also periodically archives logs to save disk space
- It was intended that results would go to a local SE; this is not yet implemented, so data remains on the CERN Data-Bridge
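As an illustration of the submission step, a minimal CRAB3 configuration for a generation-only (PrivateMC) workflow might look like the sketch below. All of the values are placeholders, and the project-specific settings that actually direct the jobs to the dedicated RAL Condor scheduler are not shown here.

    # crabConfig.py: minimal CRAB3 PrivateMC sketch (placeholder values throughout)
    from CRABClient.UserUtilities import config
    config = config()

    config.General.requestName = 'CMSatHome_example'   # hypothetical request name
    config.General.transferOutputs = True

    config.JobType.pluginName = 'PrivateMC'            # Monte Carlo generation only, no input data
    config.JobType.psetName = 'gen_sim_cfg.py'         # placeholder CMSSW configuration file

    config.Data.outputPrimaryDataset = 'CMSatHome_Example'
    config.Data.splitting = 'EventBased'
    config.Data.unitsPerJob = 200000                   # events attempted per job, as quoted later
    config.Data.totalUnits = 200000 * 10000            # one batch of 10,000 jobs
    config.Data.publication = False

    config.Site.storageSite = 'T2_CH_CERN'             # placeholder; results actually go to the Data-Bridge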

Finer Details
- The BOINC “tasks” just run the VM for 24 hrs
- The VM sends requests to the Condor server and receives a CMSSW job from the batch
- When a job completes, its output files are copied to the Data-Bridge and a new request is made (a schematic sketch of this cycle follows)
- The volunteer receives BOINC credit for the time the VM is running, not for jobs completed
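The cycle described above can be summarised with the schematic sketch below. This is not the actual agent or wrapper code: fetch_job() and run_cmssw() are hypothetical stand-ins and the Data-Bridge URL is a placeholder; only the gfal-copy stage-out step reflects the mechanism named in the implementation diagram later in the talk.

    # Schematic sketch of the job cycle inside the volunteer VM (not the real agent code).
    import subprocess
    import time

    DATABRIDGE_URL = "https://databridge.example.cern.ch/output/"   # placeholder endpoint
    TASK_LIFETIME = 24 * 3600                                       # the BOINC task runs the VM for 24 hours

    def fetch_job():
        """Hypothetical stand-in: ask the Condor server for the next CMSSW job, or None if idle."""
        return None

    def run_cmssw(job):
        """Hypothetical stand-in: run the CMSSW job and return the path of its output file."""
        return "output.root"

    start = time.time()
    while time.time() - start < TASK_LIFETIME:
        job = fetch_job()
        if job is None:
            time.sleep(600)      # nothing queued; ask again later
            continue
        output_file = run_cmssw(job)
        # Copy the result to the Data-Bridge, then loop round and request another job.
        subprocess.check_call(["gfal-copy", output_file, DATABRIDGE_URL + output_file])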

It’s Actually Not That Simple… The Way it Was
[Architecture diagram: the volunteer VM (run via VBoxwrapper) obtains a certificate and proxy through the VCCS (Volunteer Computing Credential Service); a glidein from glideinWMS lets the jobs join the Condor pool fed by CRAB3; output is PUT to the Data-Bridge (DynaFed with S3 storage) and asynchronously staged out to Grid storage via FTS.]

Implementation: The Way it is Now
[Architecture diagram: on the volunteer’s machine, VBoxwrapper runs the VM; the VM obtains a proxy from the VCCS, the Job Manager’s condor_submit creates an instant glidein so the VM joins the Condor pool, and output is PUT to the Data-Bridge (DynaFed over S3) with gfal-copy, from where data flows to the Grid via FTS. See http://svnweb.cern.ch/trac/lcgdm/wiki/Dynafeds]

Sandboxing and Authentication
[Diagram: trusted and untrusted domains; an online CA (under the IGTF CA) issues certificates requested via the VCCS; the sandboxed VM in the untrusted domain obtains a proxy, gets its jobs, and performs data transfers to Grid services in the trusted domain.]

Volunteer Considerations
Because of the special nature of CMS@Home tasks, some particular considerations apply:
- Hosts should preferably run 24/7; stopping during a job risks losing the job (this has improved considerably lately)
- High download bandwidth is needed to fetch the VM, etc.
- Where upload bandwidth is limited (e.g. home ADSL), care must be taken when running multiple hosts to avoid saturation

That was Then, This is Now
- The organisation at CERN has changed somewhat this year, to enable a more common approach
- CMS@Home has become vLHCathome-dev
- It runs development apps for several projects: CMS Simulation, LHCb Simulation, Theory Simulation, ATLAS Simulation, ALICE Simulation, Sixtrack Simulation, Benchmark Application

For “Production”
- The old Test4Theory has become vLHCathome
- It has stable apps for Theory Simulations (Test4Theory), CMS Simulations (now opened up to any user), and LHCb Simulations
- News last week: 3 TRILLION EVENTS REACHED BY TEST4THEORY TODAY!!!

What Does CMS Run?
- We can run practically any Monte Carlo simulation on any version of CMSSW available through CVMFS; data analysis is not possible because of volunteer bandwidth constraints
- Initially ran CMSSW_6_2_0_SLHC26_patch3 jobs for MinimumBias and TTbar
- Aimed for jobs of ~2 hr with results of O(100 MB)

Current Workflow
- To better match volunteers’ bandwidth, we looked for a workflow with reduced output per hour that would also be of some use to CMS
- Found a B-physics GEN-SIM request with low efficiency (0.00003) which had been put on hold, and started submitting jobs for it

Latest Performance
- Currently running jobs of 200,000 events each, in batches of 10,000 jobs
- Median wall-time per job is 2h25m (range ~20m to 20h)
- Average result file is ~16 MB (see the bandwidth comparison below)
- Since the end of June we have processed requests for >22 billion events
- The Dashboard currently reports ~98% success, but we have lost some jobs in stage-out to the Data-Bridge
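A rough back-of-the-envelope comparison with the original target of O(100 MB) every ~2 hr shows why this workflow is friendlier to volunteer uplinks:

\[
\frac{100\ \mathrm{MB}}{2\ \mathrm{h}} \approx \frac{800\ \mathrm{Mbit}}{7200\ \mathrm{s}} \approx 0.11\ \mathrm{Mbit/s}
\qquad\text{versus}\qquad
\frac{16\ \mathrm{MB}}{2\ \mathrm{h}\,25\ \mathrm{m}} \approx \frac{128\ \mathrm{Mbit}}{8700\ \mathrm{s}} \approx 0.015\ \mathrm{Mbit/s},
\]

roughly a factor of seven less sustained upload per running job, easing the ADSL saturation concern noted earlier.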

Process
- The decay of Λb0 to p, µ and ν
- This process is a background to Bs → µµ, due to the proton being misidentified as a muon
- 8,822 result files on the Data-Bridge from one batch (the Dashboard claims 9,784 successes) were analysed and found to contain 58,196 events
- This gives a fraction of 58196/(8822 × 2e5) = 3.3e-5
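Written out, and compared with the nominal efficiency of the request, the check is:

\[
\varepsilon \;=\; \frac{58196}{8822 \times 2\times 10^{5}} \;\approx\; 3.3\times 10^{-5},
\]

consistent with the 0.00003 efficiency quoted for the on-hold request.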

pµ Invariant Mass, p pT and µ pT

Recent Snapshots: vLHCatHome-dev and vLHCatHome

Running Jobs

Dashboard: 9724 files

Current Status
- The project is running relatively smoothly, but is still time-consuming; more automated monitoring would help
- Submission is to move from CRAB3 to WMAgent, which will help in several areas; this has been hindered by the loss of key personnel and, recently, the August holiday period
- Implementation of the final stage, transfer from the Data-Bridge to Grid storage, is needed soon but depends on WMAgent being functional

Developments
- Currently trialling multi-core VMs: these present themselves to BOINC as computers with n CPUs and thus run n BOINC jobs (n is user-selectable)
- This works well for the Theory app, but there are problems still to be ironed out for the CMS and Benchmark apps, on Linux at least
- Have to be careful about memory and bandwidth… (a rough sizing sketch follows)
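As a rough illustration of that sizing concern, the sketch below estimates how many concurrent jobs a host might sustain. The per-job memory figure, host RAM, and ADSL uplink speed are assumptions for illustration only; the per-job upload rate is the estimate from the performance comparison above.

    # Rough sizing sketch for n concurrent BOINC jobs in a multi-core VM (illustrative numbers only).
    MEM_PER_JOB_MB = 2048          # assumed memory per running CMSSW job
    HOST_MEM_MB = 8192             # assumed host RAM
    UPLOAD_PER_JOB_MBIT_S = 0.015  # sustained upload per job, from the earlier estimate
    ADSL_UPLINK_MBIT_S = 1.0       # assumed typical home ADSL uplink

    max_jobs_by_memory = HOST_MEM_MB // MEM_PER_JOB_MB
    max_jobs_by_uplink = int(ADSL_UPLINK_MBIT_S / UPLOAD_PER_JOB_MBIT_S)

    print("Memory would allow about", max_jobs_by_memory, "concurrent jobs")
    print("Uplink would allow about", max_jobs_by_uplink, "concurrent jobs")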

Conclusions
- Distributed volunteer computing can significantly extend the reach of scientific calculations into areas that might be seen as too unproductive for normal operations
- The use of DVC may be seen as “free”, but it carries non-trivial costs in organisation and management