GridPP18 Glasgow Mar 07
DØ – SAMGrid
Where’ve we come from, and where are we going?
Evolution of a ‘long’ established plan
Gavin Davies, Imperial College London
Introduction
Tevatron
– Running experiments (less data than LHC, but still PBs/experiment)
– Growing: great physics & better still to come
– Have 2 fb⁻¹ of data and expect up to 6 fb⁻¹ more by the end of running (as has been discussed)
Computing model: Datagrid (SAM) for all data handling & originally distributed computing, with evolution to automated use of common tools/solutions on the grid (SAMGrid) for all tasks
– Started with production tasks, e.g. MC generation, data processing
  Greatest need & easiest to ‘gridify’; ahead of the wave and a running expt.
– Based on SAMGrid, but with a program of interoperability from v. early on
  Initially LCG and then OSG
– Increased automation, user analysis considered last
  SAM gives remote data analysis
Computing Model
[Diagram labels: Data Handling Services; Central Storage; Central Farms; Remote Farms; Central Analysis Systems; Remote Analysis Systems; User Desktops. Data types: Raw Data, RECO Data, RECO MC, User Data]
Components - Terminology
SAM (Sequential Access via Metadata)
– Well developed metadata & distributed data replication system
– Originally developed by DØ & FNAL-CD, now used by CDF & MINOS
JIM (Job and Information Management)
– Handles job submission and monitoring (all but data handling)
– SAM + JIM → SAMGrid, a computational grid
Runjob
– Handles job workflow management
UK Role
– Project leadership
– Key technology: runjob, integration of SAMGrid dev & production
SAM plots
Over 10 PB (250B evts) last yr
Up to 1.2 PB moved per month (x5 increase over 2 yrs ago)
SAM TV - monitor SAM and SAM stations
Continued success: SAM shifters – often remote
[Plot labels: 1 PB / month; All]
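A rough cross-check of the SAM figures quoted on this slide can be sketched as below. It assumes decimal units (1 PB = 10^15 bytes), that the 10 PB and the 250B events refer to the same data, and that the "x5 increase" applies to the 1.2 PB/month peak. The function names are illustrative only and are not part of SAM.

```python
PB = 10**15  # assumption: decimal petabyte


def mean_event_size_kb(total_bytes, n_events):
    """Average bytes delivered per event, expressed in kB (1 kB = 1000 bytes)."""
    return total_bytes / n_events / 1000


def earlier_monthly_rate_pb(current_pb_per_month, growth_factor):
    """Monthly volume implied for the earlier period by a quoted growth factor."""
    return current_pb_per_month / growth_factor


if __name__ == "__main__":
    # 10 PB over 250 billion events
    print(mean_event_size_kb(10 * PB, 250e9), "kB per event on average")
    # 1.2 PB/month now, quoted as 5x the rate of two years earlier
    print(earlier_monthly_rate_pb(1.2, 5), "PB/month two years earlier")
```

Under these assumptions the average works out to about 40 kB per event delivered, and the rate two years earlier to roughly a quarter of a petabyte per month.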
SAMGrid plots
JIM: > 10 active execution sites
“Moving to forwarding nodes”
“No longer add red dots”
SAMGrid Interoperability
Long programme of interoperability – LCG first and then OSG
Step 1: Co-existence – use shared resources with SAM(Grid) headnode
– Widely done for both MC and 2004/5 data reprocessing
  Nikhef MC v. good example – GridPP10 talk
Step 2: SAMGrid-LCG interface
– SAM does data handling & JIM job submission
– Basically a forwarding mechanism
– Data fixing in early 2006
– MC since
OSG activity – learnt from LCG activity
– P20 data reprocessing now
Replicate as needed
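The "forwarding mechanism" of Step 2 can be illustrated with a toy sketch. This is not SAMGrid or JIM code: the `Job` and `ForwardingNode` classes, the `submit` function and the dictionary request format are all hypothetical, invented here to show the idea that a job is handed to a per-grid forwarding node while the input data is referred to only by dataset name and left to the data-handling layer.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    dataset: str  # dataset name only; data delivery is left to the data-handling layer
    kind: str     # e.g. "mc" or "reprocessing"


@dataclass
class ForwardingNode:
    """Hypothetical stand-in for a node that forwards jobs to one external grid."""
    target_grid: str
    forwarded: list = field(default_factory=list)

    def forward(self, job):
        # Re-express the job as a request for the target grid (here, a plain dict).
        request = {
            "grid": self.target_grid,
            "id": job.job_id,
            "kind": job.kind,
            "dataset": job.dataset,
        }
        self.forwarded.append(request)
        return request


def submit(job, nodes, grid):
    """Route a job to the forwarding node registered for the requested grid."""
    if grid not in nodes:
        raise KeyError(f"no forwarding node for grid {grid!r}")
    return nodes[grid].forward(job)


if __name__ == "__main__":
    nodes = {"LCG": ForwardingNode("LCG"), "OSG": ForwardingNode("OSG")}
    print(submit(Job("job-001", "example-dataset", "reprocessing"), nodes, "LCG"))
```

Keeping one node per target grid mirrors the slide's "replicate as needed": in this sketch, adding another grid means adding another entry to `nodes`.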
Monte Carlo
Massive increase with spread of SAMGrid use & LCG (OSG later)
P17 – 455M events since 09/05
30M events/month
80% in Europe
– Almost a const of nature
UKRAC
– Full details on web – neu/d0_uk_rac/d0_uk_rac.html
LCG gridwide submission reached scaling problem
Data – reprocessing & fixing
P14 Reprocessing: Winter 2003/04
– 100M events remotely, 25M in UK
– Distributed computing rather than Grid
P17 Reprocessing: Spring – Autumn 05
– x10 larger, i.e. 1B events, 250 TB, from raw
– SAMGrid as default (using mc_runjob)
P17 Fixing: Spring 06
– All RunIIa – 1.4B events in 6 weeks
– SAMGrid-LCG ‘burnt-in’
Moving to primary processing and skimming
[Plot label: Site certification]
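To give a feel for the scale of these campaigns, the quoted numbers can be turned into rates with a few lines of arithmetic. The sketch assumes the 6 weeks of P17 fixing were continuous running, that the 250 TB applies to the full 1B events of P17 reprocessing, and decimal units (1 TB = 10^12 bytes). The helper names are illustrative.

```python
TB = 10**12  # assumption: decimal terabyte
SECONDS_PER_WEEK = 7 * 24 * 3600


def events_per_second(n_events, weeks):
    """Sustained event rate if the campaign ran continuously for `weeks`."""
    return n_events / (weeks * SECONDS_PER_WEEK)


def kb_per_event(total_bytes, n_events):
    """Average data volume per event in kB (1 kB = 1000 bytes)."""
    return total_bytes / n_events / 1000


def fraction(part, whole):
    """Share of `whole` accounted for by `part`."""
    return part / whole


if __name__ == "__main__":
    print(f"P17 fixing: {events_per_second(1.4e9, 6):.0f} events/s sustained")
    print(f"P17 reprocessing: {kb_per_event(250 * TB, 1e9):.0f} kB per event")
    print(f"P14 reprocessing: {fraction(25e6, 100e6):.0%} of remote events in the UK")
```

On these assumptions P17 fixing corresponds to a few hundred events per second sustained across all sites, and P17 reprocessing to about 250 kB per event.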
A comment.. if I may
Largest data challenges (I believe) in HEP using the grid
Learnt a lot about the technology, and especially how it scales
Learnt a lot about organisation / operation of such projects
Some of these can be abstracted and of benefit to others… (a different talk…)
A comment - graphically
P20 reprocessing
– I know it's OSG
– (started with LCG)
– SAMGrid-LCG
Will use to catch up
[Plot labels: IN2P3; OSG. Annotations: “A lot of green”; “A lot of red”]
(DØ –) Runjob
Used in all production tasks – UK responsibility
In 04 we froze SAM at v5 & mc_runjob, used by SAMGrid for MC and reprocessing from then until summer 06
DØrunjob - the rewrite
– Joint (CDF,) CMS, DØ, FNAL-CD project
– Base classes from common Runjob package
Things got messy – but triumph
– Sustainable, long term product with SAM v7
For details see: Runjob, CDFRunjob, CMSRunjob, DØRunjob
Next steps / issues - I
Complete endgame development – ability to analyse larger datasets with decreasing manpower
– Additional functionality – skimming, primary processing at multiple sites, MC prod at diff stages, diff output…
– Additional resources - completing the forwarding nodes
  Full data / MC capability
  Scaling issues to access the full LCG and OSG worlds
– Data analysis – how gridified do we go? – an open issue
  My feeling – need to be ‘interoperable’ – Fermigrid, certain large LCG sites
  Will need development, deployment and operations effort
– And operations..
Next steps / issues - II
“Steady” state – goal to reach by end of CY 07 (≥ 2 yrs running)
– Maintenance of existing functionality
– Continued experimental requests
– Continued evolution as grid standards evolve
– Operations
You do still need manpower, and not just to make sure the hardware works
MC and data are not fire and forget
Manpower a real issue (especially with data analysis on the grid)
Summary / plans
DØ and Tevatron performing very well
– Big physics results have come out, better yet on their way
– Much more data to come: increasing needs, with reduced effort
SAM & SAMGrid critical to DØ
– Without the grid DØ would not have worked – THANKS
– GridPP key part of effort (technical / leadership) – THANKS
– Users - demanding, hard to develop and maintain production level services
Baseline: ensure (scaling for) production tasks
– Move to SAM v7 and d0runjob
– Accessing all LCG - establishing UKRAC – forwarding nodes
In parallel, open question of data analysis – will need to go part way
Manpower for development, integration and operation is a real issue
Back-ups
SAMGrid Architecture