
Slide 1: FermiGrid – Fermilab Grid Gateway
Keith Chadwick, Bonnie Alcorn, Steve Timm
November 16, 2004

Slide 2: FermiGrid – Strategy and Goals
In order to better serve the entire program of the laboratory, the Computing Division will place all of its production resources in a Grid infrastructure called FermiGrid. This strategy will continue to allow the large experiments that currently have dedicated resources to have first-priority use of the resources purchased on their behalf. It will allow access to these dedicated resources, as well as to other shared Farm and Analysis resources, for opportunistic use by the various Virtual Organizations (VOs) that participate in FermiGrid (i.e. all of our lab programs) and by certain VOs that use the Open Science Grid. (Add something about prioritization and scheduling – lab/CD – new forums.)
The strategy will allow us:
- to optimize use of resources at Fermilab
- to provide a coherent way of putting Fermilab on the Open Science Grid
- to save effort and resources by implementing certain shared services and approaches
- to work together more coherently to move all of our applications and services to run on the Grid
- to better handle a transition from Run II to LHC (and eventually to BTeV) in a time of shrinking budgets and possibly shrinking resources for Run II worldwide
- to fully support the Open Science Grid and the LHC Computing Grid and gain positive benefit from this emerging infrastructure in the US and Europe

Slide 3: FermiGrid – What It Is
- FermiGrid is a meta-facility composed of a number of existing "resources", many of which are currently dedicated to the exclusive use of a particular stakeholder.
- FermiGrid (the facility) provides a way for jobs of one VO to run either on shared facilities (such as the current General Purpose Farm or a new GridFarm?) or on the Farms primarily provided for other VOs. (>>> needs wordsmithing to say what, not how)
- FermiGrid will require some development and test facilities to be put in place in order to make it happen.
- FermiGrid will provide access to storage elements and to storage and data-movement services for jobs running on any of the compute elements of FermiGrid.
- The resources that comprise FermiGrid will continue to be accessible in "local" mode as well as "Grid" mode.

Slide 4: The FermiGrid Project
This is a cooperative project across the Computing Division and its stakeholders to define and execute the steps necessary to achieve the goals of FermiGrid. Effort is expected to come from:
- Providers of shared resources and services – CSS and CCF
- Stakeholders and providers of currently dedicated resources – Run II, CMS, MINOS, SDSS
The total program of work is not fully known at this time – but the WBS is being fleshed out. It will involve at least the following:
- Adding services required by some stakeholders to other stakeholders' dedicated resources
- Work on authorization and accounting
- Providing some common FermiGrid services (e.g. …)
- Providing some head nodes and gateway machines
- Modifying some stakeholders' scripts, codes, etc. to run in the FermiGrid environment
- Working with OSG technical activities to make sure FermiGrid and OSG (and thereby LCG) are well aligned and interoperable
- Working on monitoring, web pages, and whatever else it takes to make this all work and happen
- Evolving and defining forums for prioritizing access to resources and scheduling

Slide 5: FermiGrid – Some Notations
Condor = Condor / Condor-G, as necessary.

Slide 6: FermiGrid – The Situation Today
- Many separate clusters: CDF (x3), CMS, D0 (x3), GP Farms, FNALU Batch, etc.
- When the cluster "landlord" does not fully utilize the cluster cycles, it is very difficult for others to opportunistically utilize the excess computing capacity.
- In the face of flat or declining budgets, we need to make the most effective use of the computing capacity.
- We need some sort of system to capture the unused available computing and put it to use.

Slide 7: FermiGrid – The State of Chaos Today

Slide 8: FermiGrid – The Vision
- The future is Grid-enabled computing.
- Dedicated system resources will be assimilated – slowly... Existing access to resources will be maintained. ("I am Chadwick of Grid – prepare to be assimilated…" Not!)
- Enable Grid-based computing, but do not require all computing to be Grid.
- Preserve existing access to resources for current installations. Let a thousand flowers bloom – well, not quite.
- Implement Grid interfaces to existing resources without perturbing existing access mechanisms.
- Once FermiGrid is in production, deploy new systems as Grid-enabled from the get-go. People will naturally migrate when they need expanded resources. Help people with their migrations?

Slide 9: FermiGrid – The Mission
FermiGrid is the Fermilab Grid Gateway infrastructure that accepts jobs from the Open Science Grid and, following appropriate credential authorization, schedules these jobs for execution on Fermilab Grid resources.
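The mission statement implies a simple two-step flow at the gateway: authorize the presented Grid credential, then hand the job to the local scheduler. The sketch below is purely illustrative; the names (GatewayJob, authorize_credential, submit_to_scheduler) are hypothetical placeholders and are not part of any FermiGrid, VDT, or Globus API.

```python
# Minimal sketch of the gateway job flow described on this slide.
# All names here are hypothetical placeholders, not real FermiGrid APIs.

from dataclasses import dataclass

@dataclass
class GatewayJob:
    subject_dn: str      # Grid certificate distinguished name of the submitter
    vo: str              # Virtual Organization asserted by the credential
    executable: str      # what the job wants to run

def authorize_credential(job: GatewayJob, authorized_vos: set[str]) -> bool:
    """Stand-in for the real authorization step (VOMS/GUMS in this deck)."""
    return job.vo in authorized_vos

def submit_to_scheduler(job: GatewayJob) -> str:
    """Stand-in for handing the job to Condor(-G) or a farm batch system."""
    return f"queued {job.executable} for DN={job.subject_dn}"

def handle_incoming_job(job: GatewayJob, authorized_vos: set[str]) -> str:
    # 1. Accept the job from the Open Science Grid.
    # 2. Authorize the credential.
    if not authorize_credential(job, authorized_vos):
        return f"rejected: VO '{job.vo}' not authorized"
    # 3. Schedule for execution on Fermilab Grid resources.
    return submit_to_scheduler(job)

if __name__ == "__main__":
    job = GatewayJob("/DC=org/DC=doegrids/OU=People/CN=Example User", "cdf", "analysis.sh")
    print(handle_incoming_job(job, authorized_vos={"cdf", "dzero", "cms"}))
```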

Slide 10: FermiGrid – The Rules
- First, do no harm: wherever possible, implement such that existing systems and infrastructure are not compromised.
- Only when absolutely necessary, require changes in existing systems or infrastructure, and work with those affected to minimize and mitigate the impact of the required changes.
- Provide resources and infrastructure to help experiments transition to a Grid-enabled model of operation.

Slide 11: FermiGrid – Players and Roles
- CSS: Hardware & operating system management & support.
- CCF: Grid infrastructure application management & support.
- OSG & "a cast of thousands": Submit jobs & utilize resources.
  - CDF
  - D0
  - CMS
  - Lattice QCD
  - Sloan
  - MINOS
  - MiniBooNE
  - FNAL
  - Others?

Slide 12: FermiGrid – System Evolution
- Start "small", but plan for success.
- Build the FermiGrid gateway system as a cluster of redundant server systems to provide 24x7 service.
  - The initial implementation will not be redundant; that will follow as soon as we learn how to implement the necessary failovers.
  - We're going to have to experiment a bit and learn how to operate these services.
  - We will need the capability of testing upgrades without impacting production services.
- Schedule OSG jobs on "excess/unused" cycles from existing systems and infrastructure. How?
  - Initial thoughts were to utilize the checkpoint capability within Condor. Feedback from D0 and CMS is that this is not an acceptable solution.
  - Alternatives – 24-hour CPU limit? nice? other? Will think about this more – policy? (A sketch of one such limit follows below.)
- Just think of FermiGrid like PACMAN… (munch, munch, munch…)
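As one illustration of the "24-hour CPU limit" alternative floated above, a guest job could simply be evicted once its accumulated CPU time exceeds a threshold, or immediately when the owner VO wants the resource back. The sketch below is a toy policy check, not FermiGrid code; the names (GuestJob, should_evict) and the data model are invented, and real enforcement would live in the batch system's own policy mechanisms.

```python
# Toy sketch of the "24-hour CPU limit" idea for guest-VO jobs.
# Hypothetical names; real enforcement would live in the batch system policy.

from dataclasses import dataclass

MAX_GUEST_CPU_SECONDS = 24 * 60 * 60  # the 24-hour limit floated on this slide

@dataclass
class GuestJob:
    job_id: str
    vo: str
    cpu_seconds_used: float

def should_evict(job: GuestJob, owner_wants_resource: bool) -> bool:
    """Evict a guest job if it exceeds the CPU cap, or immediately if the
    owner VO wants its resource back (the policy question left open here)."""
    if owner_wants_resource:
        return True
    return job.cpu_seconds_used > MAX_GUEST_CPU_SECONDS

# Example: a guest job 25 hours in would be evicted even if the owner is idle.
print(should_evict(GuestJob("1234.0", "sdss", 25 * 3600), owner_wants_resource=False))  # True
```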

Slide 13: FermiGrid – Software Components
Operating System and Tools:
- Scientific Linux 3.0.3
- VDT + Globus Toolkit
- Cluster tools:
  - Keep the cluster "sane".
  - Migrate services as necessary.
- Cluster-aware file system:
  - Google File System? Lustre? other?
Applications and Tools:
- VOMS + VOMRS
- GUMS
- Condor-G + GRIS + GIIS + …

Slide 14: FermiGrid – Overall Architecture
[Architecture diagram showing VO users, the FermiGrid Common Gateway Services (including SAZ), head nodes (HN) for CDF, D0, CMS, SDSS, Lattice QCD, and the GP Farm, and Storage (SRM / dCache).]

Slide 15: FermiGrid – General Purpose Farm Example
"The D0 Wolf stealing food out of the mouth of babies."
[Diagram: VO users reach the GP Farm head node (FBS) via FermiGrid, using Globus / Condor, alongside the existing GP Farm users.]

Slide 16: FermiGrid – D0 Example
"Babies stealing food out of the mouth of the D0 wolf."
[Diagram: VO users reach the D0 clusters (SAMGrid, SamGfarm, FNSF0, FBS) via FermiGrid, using Globus / Condor, alongside the existing D0 job submission path.]

Slide 17: FermiGrid – Future Grid Farms?
[Diagram: VO users reaching future Grid farms via FermiGrid, using Globus / Condor.]

Slide 18: FermiGrid – Gateway Software
See: http://computing.fnal.gov/docs/products/voprivilege/index.html

Slide 19: FermiGrid – Gateway Hardware Architecture
[Diagram: FermiGate1, FermiGate2, and FermiGate3 connected through a switch and a Cyclades console, linking FNAL and FermiGrid.]

Slide 20: FermiGrid – Gateway Hardware Roles
- FermiGate1: Primary for Condor + GRIS + GIIS; backup for FermiGate2; secondary backup for FermiGate3.
- FermiGate2: Primary for VOMS + VOMRS; backup for FermiGate3; secondary backup for FermiGate1.
- FermiGate3: Primary for GUMS + [PRIMA] (eventually); backup for FermiGate1; secondary backup for FermiGate2.
- All FermiGate systems will have VDT + the Globus job manager.
(A toy encoding of this role table follows below.)
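The role assignments above amount to a small failover table: each service has a primary host and an ordered list of backups. The sketch below simply encodes that table and picks the first live host for a service; the dictionary layout and the pick_host function are illustrative assumptions, not how the eventual redundancy was implemented.

```python
# Illustrative encoding of the primary/backup roles listed on this slide.
# The failover logic here is a toy; real failover would be far more involved.

SERVICE_HOSTS = {
    "condor+gris+giis": ["fermigate1", "fermigate3", "fermigate2"],
    "voms+vomrs":       ["fermigate2", "fermigate1", "fermigate3"],
    "gums+prima":       ["fermigate3", "fermigate2", "fermigate1"],
}

def pick_host(service: str, alive: set[str]) -> str | None:
    """Return the highest-priority live host for a service, or None if all are down."""
    for host in SERVICE_HOSTS[service]:
        if host in alive:
            return host
    return None

# Example: if fermigate2 is down, VOMS+VOMRS would fail over to fermigate1.
print(pick_host("voms+vomrs", alive={"fermigate1", "fermigate3"}))  # fermigate1
```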

Slide 21: FermiGrid – Gateway Hardware Specification
3 x PowerEdge 6650:
- Dual-processor 3.0 GHz Xeon MP, 4 MB cache
- Rapid rails for Dell rack
- 4 GB DDR SDRAM (8 x 512 MB)
- PERC3-DC, 128 MB (1 internal, 1 external channel)
- 2 x 36 GB 15k RPM drives
- 2 x 73 GB 10k RPM drives
- Dual on-board 10/100/1000 NICs
- Redundant power supply
- Dell Remote Access Card, Version III, without modem
- 24x IDE CD-ROM
- PowerEdge Basic Setup; 3-year same-day 4-hour response, parts + onsite labor, 24x7
- $14,352.09 each
Plus Cyclades console + dual PM20 + local switch + rack.
Total system cost ~= $50K.
Expandable in place by adding processors or disks within the systems.

Slide 22: FermiGrid – Alternate Hardware Specification
3 x PowerEdge 2850 (2U server):
- Dual-processor 3.6 GHz Xeon, 1 MB cache, 800 MHz FSB
- Rapid rails for Dell rack
- 4 GB DDR2 400 MHz (4 x 1 GB)
- Embedded PERC4ei controller
- 2 x 36 GB 15k RPM drives
- 2 x 73 GB 10k RPM drives
- Dual on-board 10/100/1000 NICs
- Redundant power supply
- Dell Remote Access Card, 4th generation
- 24x IDE CD-ROM
- PowerEdge Basic Setup; 3-year same-day 4-hour response, parts + onsite labor, 24x7
- $6,951.24 each
Plus Cyclades console + dual PM20 + local switch + rack.
Total system cost ~= $25K.
Limited CPU expandability – can only add whole systems or perform a forklift upgrade.

Slide 23: FermiGrid – Condor and Condor-G
- Condor (Condor-G) will be used for batch queue management.
  - Within the FermiGrid gateway systems – definitely.
  - May feed into other head-node batch systems (e.g. FBS) as necessary.
- VOs that "own" a resource will have priority access to that resource.
  - Policy? – "guest" VOs will only be allowed to utilize idle/unused resources.
  - Policy? – how quickly must a "guest" VO free a resource when it is desired by the owner VO?
    - Condor checkpointing would provide this, but D0 and CMS jobs will not function in this environment.
    - Alternatives – 24-hour CPU limit? nice? other?
    - More thought required (perhaps helped by the policy decisions above?). A rough sketch of the owner-priority idea follows this slide.
- For Condor information see: http://www.cs.wisc.edu/condor/
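As a rough illustration of the owner-priority idea above (and not of Condor's actual matchmaking algorithm), the sketch below fills idle slots with owner-VO jobs first and only then with guest-VO jobs. The names (Job, fill_idle_slots) and the data model are hypothetical; in practice such a priority would be expressed in the batch system's configuration rather than in code like this.

```python
# Illustrative only: owner-VO jobs get idle slots before guest-VO jobs.
# Hypothetical names and data model; not Condor's real matchmaking.

from dataclasses import dataclass

@dataclass
class Job:
    job_id: str
    vo: str

def fill_idle_slots(idle_slots: int, owner_vo: str, queue: list[Job]) -> list[Job]:
    """Return the jobs chosen to run: owner-VO jobs first, guests on leftovers."""
    owner_jobs = [j for j in queue if j.vo == owner_vo]
    guest_jobs = [j for j in queue if j.vo != owner_vo]
    return (owner_jobs + guest_jobs)[:idle_slots]

# Example: on a CDF-owned farm with 3 idle slots, CDF jobs are picked before SDSS.
queue = [Job("1.0", "sdss"), Job("2.0", "cdf"), Job("3.0", "cdf"), Job("4.0", "sdss")]
for job in fill_idle_slots(3, owner_vo="cdf", queue=queue):
    print(job.job_id, job.vo)
```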

Slide 24: FermiGrid – VO Management
- Currently, VO management is performed by CMS in a "back pocket" fashion.
  - Not a viable solution for the long term; CMS would probably prefer to direct that effort toward their own work.
- We recommend that the FermiGrid infrastructure take over the VO Management Server/services and migrate them onto the appropriate gateway system (FermiGate2).
- Existing VOs should be migrated to the new VO Management Server (in the FermiGrid gateway) once the FermiGrid gateway is commissioned, with existing VO management roles delegated to appropriate members of the current VOs.
- New VOs for existing infrastructure clients (e.g. FNAL, CDF, D0, CMS, Lattice QCD, SDSS, others) should be created as necessary/authorized.

Slide 25: FermiGrid – VO Creation and Support
- All new VOs created on the new VO Management Server by FermiGrid project personnel or the Helpdesk.
  - Policy? – VO creation authorization mechanism?
- VO management authority delegated to the appropriate members of the VO.
- Policy? – "FNAL" VO membership administered by the Helpdesk? (Like accounts in the FNAL Kerberos domain and the Fermi Windows 2000 domain.)
- Policy? – Small experiments may apply to CD to have their VO managed by the Helpdesk also?
- Need to provide the Helpdesk with the necessary tools for VO membership management.

Slide 26: FermiGrid – GUMS
- Grid User Management System, developed at BNL.
- Translates a Grid identity to a local identity (certificate -> local user).
- Think of it as an automated mechanism to maintain the gridmap file.
- See: http://www.rhic.bnl.gov/hepix/talks/041018pm/carcassi.ppt
[Diagram from the BNL talk: several VOs (ATLAS, STAR, PHENIX, …), the GUMS server with its GUMS DB, and Grid resources each holding a mapfile cache.]
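To make the "automated gridmap file" point concrete: a grid-mapfile is essentially a list of lines, each mapping a quoted certificate DN to a local account name. The sketch below writes such a file from an in-memory mapping; it is a toy stand-in for what GUMS automates, and the mapping data and the function name (write_gridmap) are invented for illustration.

```python
# Toy illustration of what GUMS automates: maintaining a grid-mapfile that
# maps certificate DNs to local accounts. The mapping data here is invented.

from pathlib import Path

def write_gridmap(dn_to_user: dict[str, str], path: str) -> None:
    """Write a grid-mapfile: one '"<DN>" <local user>' line per mapping."""
    lines = [f'"{dn}" {user}' for dn, user in sorted(dn_to_user.items())]
    Path(path).write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    mapping = {
        "/DC=org/DC=doegrids/OU=People/CN=Example User One": "cdfgrid",
        "/DC=org/DC=doegrids/OU=People/CN=Example User Two": "sdssgrid",
    }
    write_gridmap(mapping, "grid-mapfile.example")
    print(Path("grid-mapfile.example").read_text())
```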

Slide 27: FermiGrid – Project Management
- Weekly FermiGrid project management meeting: Fridays from 2:00 PM to 3:00 PM in FCC1.
- We would like to empanel a set of Godparents, with representatives from:
  - CMS
  - Run II
  - Grid developers?
  - Security team?
  - Other?
- The Godparent panel would provide (short-term?) guidance and feedback to the FermiGrid project management team.
- Longer-term guidance and policy would come from CD line management.

Slide 28: FermiGrid – Time Scale for Implementation
- "Today": Decide on and order hardware for the gateway systems. Explore / kick the tires on existing software.
- Jan 2005: Hardware installation. Begin software installation and initial configuration.
- Feb–Mar 2005: Common Grid services available in non-redundant mode (Condor[-G], VOMS, GUMS, etc.).
- Future: Transition to redundant mode as hardware/software matures.

Slide 29: FermiGrid – Open Questions
- Policy issues? Lots of policy issues – need direction from CD management.
- Role of FermiGrid?
  - Direct Grid access to Fermilab Grid resources without FermiGrid?
  - Grid access to Fermilab Grid resources only via FermiGrid?
  - "Guest" VO access to Fermilab Grid resources only via FermiGrid?
- Resource allocation?
  - "Owner" VO vs. "guest" VO? How fast? Under what circumstances?
  - Grid Users Meeting, à la the Farm Users Meeting?
- Accounting?
  - Who, where, what, when, how? Recording vs. access.

Slide 30: FermiGrid – Guest vs. Owner VO Access
[Diagram showing possible access paths between "guest" VO users, "owner" VO users, the FermiGrid gateway, and a resource head node, with paths labeled "Not Allowed?", "Required?", "Allowed", and "Allowed?".]

Slide 31: FermiGrid – Fin
Any questions?

