Volunteer Computing
David Cameron, Claire Adam Bourdarios, Andrej Filipcic, Eric Lancon, Wenjing Wu
ATLAS Computing Jamboree, 3 December 2014
What is volunteer computing? Ordinary people voluntarily running scientific tasks on their PCs
Berkeley Open Infrastructure for Network Computing (BOINC)
Volunteer computing at CERN
2004: Sixtrack
2011: Test4Theory
2014: (LHCb)
Why use volunteer computing for ATLAS?
–It’s free! (almost)
–Public outreach
Considerations
–Low priority jobs with high CPU-I/O ratio
  Non-urgent Monte Carlo simulation
–Need virtualisation for ATLAS software environment
  CERNVM image and CVMFS
–No grid credentials or access on volunteer hosts
  ARC middleware for data staging
–The resources should look like a regular Panda queue
  ARC Control Tower
Initial Architecture
[Diagram. Components shown: ARC Control Tower, Panda Server, ARC CE, Session Directory, BOINC LRMS Plugin, BOINC server, Volunteer PC, BOINC Client, VM, Shared Directory, Grid Catalogs and Storage, DB, proxy cert, BOINC PQ]
Current Setup
[Diagram. Components shown: CERN, ARC Control Tower, Panda Server, ARC CE, BOINC server, Volunteer PC, BOINC Client, VM, Shared Directory, Grid Catalogs and Storage, DB on demand, BOINC PQ]
History
A test server with ARC CE and BOINC server with the app ran in Beijing from January
–Volunteers found it somehow…
In July volunteers were moved to a CERN server with ARC CE + BOINC
–(alias atlasathome.cern.ch)
–CERN IT provided 1TB NFS space for job input/output
At the same time it became an official BOINC project
In early October the BOINC server was changed to a server run by CERN IT
–Volunteers + credit moved too
A parallel test setup with separate ARC CE and BOINC server exists for testing
BOINC jobs
Real simulation tasks
–mc12_8TeV PowhegPythia_P2011C_ttbar_nonallhad_mtt_2000p.simul.e2940_s1773
–Full athena jobs
–50 events/job
Runs in CERNVM with pre-cached software
But some data still needs to be downloaded at runtime
–Conditions data from squid/frontier
Image is 1.1GB (500MB compressed) and downloaded only once
Input files (data file + small scripts) are 1-100MB
Output is ~100MB
VM memory is now 2GB (was 1GB initially, but jobs are now more complex)
Jobs take from a few hours up to a few days on a fast (single) core
Validation
–Per work unit: check that the correct output is produced (only that the file exists; the content is not checked)
–Physics validation comparing results to a regular Grid task
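The per-work-unit validation described above only tests that the output file is present. A minimal sketch of such a check follows; the function name, directory layout and file name are hypothetical illustrations and are not taken from the actual validator.

```python
import os


def output_exists(result_dir, expected_name):
    """Return True if the expected output file is present in result_dir.

    Only the existence of the file is tested; its content is not
    inspected, matching the per-work-unit check described on the slide.
    The directory and file name are hypothetical inputs.
    """
    return os.path.isfile(os.path.join(result_dir, expected_name))
```

A check like this is cheap enough to run on every returned work unit, which is why the content check is left to the separate physics validation against a regular Grid task.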
How does it work for volunteers?
Install BOINC client and VirtualBox
–Linux, Mac and Windows supported
–Currently 80% of hosts have Windows
In the BOINC client, choose the project and create an account
That’s it!
Issues with jobs
The majority of volunteers (~80%) never complete a single job
–Not powerful enough resources, entry barrier is too high
  Requires 64-bit, at least 4GB, decent bandwidth, installing VirtualBox
  "The hardest BOINC project to run" (quote from a volunteer)
–Unreliable system/failing jobs also push people away
  The worst thing for volunteers is to use their CPU and not receive credit
–BUT the normal retention rate of a project is 10%
More problems
–Virtualisation/VMwrapper causes a lot of problems (memory, jobs not finishing, unstable)
–Firewall issues accessing conditions data through squids
  We are working on ways to cache this data in the image to avoid network access from the job
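The entry requirements listed above can be expressed as a simple pre-check on a host. This is only an illustrative sketch: the `Host` fields and function name are hypothetical, reading "4GB" as memory is an assumption, and "decent bandwidth" is left out because the slide gives no number for it.

```python
from dataclasses import dataclass

MIN_MEMORY_GB = 4  # "at least 4GB" on the slide, assumed here to mean memory


@dataclass
class Host:
    """Hypothetical description of a volunteer machine."""
    is_64bit: bool
    memory_gb: float
    has_virtualbox: bool


def unmet_requirements(host):
    """List the entry requirements a host fails; an empty list means it qualifies."""
    problems = []
    if not host.is_64bit:
        problems.append("64-bit system required")
    if host.memory_gb < MIN_MEMORY_GB:
        problems.append("at least 4GB of memory required")
    if not host.has_virtualbox:
        problems.append("VirtualBox not installed")
    return problems
```

For example, `unmet_requirements(Host(True, 8, True))` returns an empty list, while a 32-bit machine with 2GB and no VirtualBox fails all three checks.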
Volunteer growth
Currently >12000 volunteers, 1000 active
300 new volunteers/week
300k volunteers, 47k active
5 million volunteers, 150k active
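To compare the three volunteer/active pairs quoted above on the same footing, one can compute the active share for each. This is a rough sketch using the rounded slide figures; the first total is a lower bound (">12000"), so its share is approximate.

```python
def active_fraction(volunteers, active):
    """Share of registered volunteers that are currently active."""
    return active / volunteers


# (total volunteers, active volunteers), rounded figures as quoted on the slide
rows = [(12_000, 1_000), (300_000, 47_000), (5_000_000, 150_000)]

for total, active in rows:
    print(f"{total:>9,d} volunteers: {active_fraction(total, active):.1%} active")
```

With these inputs the shares come out at roughly 8%, 16% and 3% respectively.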
Job statistics
Continuous running jobs
Almost 300k completed jobs
500k CPU hours
14M events
50% CPU efficiency
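A few per-job and per-event averages follow directly from these totals. The sketch below uses the rounded slide numbers ("almost 300k" is taken as 300k), so the results are indicative only.

```python
COMPLETED_JOBS = 300_000   # "almost 300k" on the slide, rounded
CPU_HOURS = 500_000
EVENTS = 14_000_000

events_per_job = EVENTS / COMPLETED_JOBS             # about 46.7
cpu_hours_per_job = CPU_HOURS / COMPLETED_JOBS       # about 1.67
cpu_seconds_per_event = CPU_HOURS * 3600 / EVENTS    # about 128.6

print(f"{events_per_job:.1f} events/job")
print(f"{cpu_hours_per_job:.2f} CPU hours/job")
print(f"{cpu_seconds_per_event:.1f} CPU seconds/event")
```

The average of about 47 events per completed job is consistent with the 50 events/job quoted for the simulation tasks.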
Scale: 28th largest ATLAS simulation site
Very roughly 3 credits/event
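As a back-of-the-envelope illustration of this rate, the sketch below estimates the credit for a job from its event count. The function name is hypothetical, and the rate is the "very roughly" figure from the slide, not an exact credit formula.

```python
CREDITS_PER_EVENT = 3   # "very roughly", as quoted on the slide
EVENTS_PER_JOB = 50     # events per job quoted for the simulation tasks


def estimated_credit(n_events, credits_per_event=CREDITS_PER_EVENT):
    """Rough credit estimate for a given number of simulated events."""
    return n_events * credits_per_event


print(estimated_credit(EVENTS_PER_JOB))  # 150
```

At this rate a standard 50-event job is worth on the order of 150 credits.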
Very active message boards
Standard BOINC webpage
Technical info on how to join
Message boards
Jobs/results
Job statistics
Public outreach page
https://atlasphysathome. cern.ch
Designed by Claire using Drupal
Entry point for the public to find out what they are contributing to
Many links to existing outreach pages
Screensaver
Many BOINC projects run as “screensavers”
Working with Riccardo-Maria Bianchi from ATLAS event display VP1 to make a screensaver
–Show pre-configured event displays as events are produced to show people what they are running
This can help motivate people to look more into the physics details
Lessons Learned and Future
It takes a lot of effort to run
–In the interaction with volunteers
  Some volunteers are extremely competent and knowledgeable and help others
–Maintaining and improving the system workflow
The number of running jobs has reached a plateau
–We are exploring scaling options with CERN IT (Ceph, multiple Apache servers etc)
–Not enough people joining
  But we deliberately haven’t advertised too much, in order to ramp up slowly
The major problems are caused by vboxwrapper
BOINC developers are very enthusiastic to help us
–They give us fixes/new features in days
We have a few more things to fix before we can move out of beta
–New manpower starting now will help greatly
We want to push internally inside ATLAS
–e.g. now available as part of NICE, to put on CERN administrative PCs
Stop press!
Potential
Not every ATLAS job can run on this platform
–See earlier considerations about I/O, unreliability etc
But ~50% of jobs could feasibly run on this platform
The high entry barrier may limit general public participation
Can it replace small Grid sites?
–For example a CPU-only T3 site or small university cluster
–Instead of setting up all the Grid infrastructure just install BOINC on the worker nodes
–Standard Grid accounting in APEL is provided by ARC CE
Thanks
Thanks to our CERN IT colleagues for providing the BOINC infrastructure and storage space... and please join us!