Run II Experiments and the Grid
Amber Boehnlein, Fermilab
September 16, 2005
September Run II Computing Review

DØ Status
DØ is running SAMGrid for MC production and reprocessing
– SAMGrid is a first-generation production system
   Typical configuration, installation and robustness issues are being addressed
– The LCG/SAMGrid interoperability prototype is going well
– An OSG resource selector will be developed to provide functionality similar to that available with LCG

CDF Status
CDF has prototype grid job submission based on the CDF Analysis Facility (CAF) that uses Condor Glide-in
– Running well and usefully in “owner/operator” mode at a few sites
– Does not have integrated data handling
– May not be handling tarballs
– Requires installation on a head node, and outbound node connectivity
– Has some legacy security policies to address
   CAF is Kerberos based

Why?
Glide-in technology is attractive in many ways.
– There is always a certain appeal in the next great thing.
Illustrative of a general tension for the Run II experiments
– Competing agendas: it is difficult for CDF to turn down effort.
   Italians support Glide-CAF
   CDF wants to do analysis on the Grid, and they do not want the user interface to change.
– Probably could have achieved that requirement in other ways; however, CDF is also vested in the CAF as a model.
Ultimately probably beneficial to both CDF and DØ
– If Glide-in works in production on a reasonable time scale, might be able to use it
– Support for VO-specific services is a motivation for the Edge Services pre-proposal for OSG. Edge Services will almost certainly benefit DØ

Run II Computing in the LHC Era
Grid is the strategic direction for FNAL CD to meet commitments to Run II, CMS and other stakeholders.
– The '05 Run II computing review complimented DØ and CDF on moving towards grid models
– Run II effort task force acknowledges the strategy
Concerns
– Availability of resources, especially disk
   Urged to make more formal agreements
– “Expenses” involved in operating a production Grid
– Heavyweight and nonstandard interfaces on the production system
– Real-world issues for the prototype
Mitigations
– DØ and FNAL CD are proposing an installation team, supported by the review
– Move towards standard, more robust interfaces
– Guest Scientist positions could be used to leverage knowledge and expertise, particularly in cases where physics potential would also be leveraged.

OSG Pre-Proposals
The OSG pre-proposal call was targeted at core functionality
– SAMGrid was built with the support of PPDG funds.
– Noted that a service without customers is of limited use.
– Some calls to work closely with TeraGrid.
– Still working through details for a full proposal
– Encouraged to make a proposal for an OSG that will thrive!

Summary

Run II Department Roles
Operations: running the systems, standing pager rotations/shifts, researching latest technologies, purchasing and deploying equipment, tracking down and fixing problems, code management
Development: exploring use cases, writing code, introducing new features, testing, documenting, exploring technologies
Integration: testing, more testing, training users, transition from development to operations
Planning: how best to use resources to meet stakeholder needs, facility issues
Interfacing: serving in experiment management roles, bridging the CD and the experiments, CD department to CD department, hosting guest scientists
Participate in physics analysis as collaboration members; 30% of department FTEs hold scientific positions

Risks, expanded
Increased calls on FNAL CD as effort and equipment migrate to the LHC
Declining equipment and operations budgets are already limiting the data collection rate.
– Over time, limits in the equipment and operating budget will create delays
Operational performance of user code
– DØ reconstruction code performance and release turn-around
– CDF user code has caused inefficiencies on the CAF
COTS Computing
– Experiments need best price/performance, which introduces risk.
– Moore’s law
– Have a good process in place for evaluation, purchase and acceptance.
– Each purchase of worker nodes presents challenges
   FNAL CD plays the engineering/integrator role by default
– Commodity fileservers are maintenance intensive

Risks, expanded
Data Handling
– SAM system, dCache and hardware are working well
– User patterns are still evolving; there are sometimes conflicts between wanting to get results out and using standard production.
– Scaling with data sample size might have unanticipated consequences.
– Counting on next-generation tape drives to mitigate tape costs
Longevity of hardware components and software applications
– Starting to use a 4-year replacement cycle for worker nodes, so the equipment is off warranty in the final year.
– 5-year life cycle on major components; replacement will be needed again around 2010, when the budget for Run II will be extremely limited.
– Migrating either experiment from its existing mode of operation or user interfaces would be time intensive and costly.