EU DataGrid WP4. Large-Scale Cluster Computing Workshop, FNAL, May 24 2001. Olof Bärring, CERN. http://cern.ch/hep-proj-grid-fabric

1 EU DataGrid WP4
Large-Scale Cluster Computing Workshop, FNAL, May 24 2001
Olof Bärring, CERN

2 Outline
Background
Architecture
Short term prototypes (September 2001)
GRID issues
Conclusions

3 Background
3-year EU-funded project led by Fabrizio Gagliardi, CERN
Started 1/1/2001
6 principal contractors: CERN, CNRS, ESA, INFN, FOM, PPARC
15 assistant contractors

4 Workpackages
WP1: Workload Management
WP2: Grid Data Management
WP3: Grid Monitoring Services
WP4: Fabric management
WP5: Mass Storage Management
WP6: Integration Testbed – Production quality International Infrastructure
WP7: Network Services
WP8: High-Energy Physics Applications
WP9: Earth Observation Science Applications
WP10: Biology Science Applications
WP11: Information Dissemination and Exploitation
WP12: Project Management

5 WP4: Fabric Management
“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”

6 WP4: Fabric Management
~14 FTEs (6 funded by the EU) for 3 years, split over 6 partners: CERN, FOM/NIKHEF, ZIB, Heidelberg Univ., PPARC, INFN
The work is divided into 6 subtasks
– Configuration management
– Automatic software installation & maintenance
– Monitoring
– Fault tolerance
– Resource management
– “Gridification”

7 Dependencies
[Diagram; labels: GRID, Grid Scheduler, Grid Monitoring & Information Service, Gridification, Cluster Fabric, Configuration Management, Resource Management, Fault Tolerance, Installation Management, Monitoring]

8 Configuration management
[Diagram; labels: GUI, CLI, CDB, HLD, LLD, MLD, Cached LLD, Client machine, Compilation (one-way), Translation, Manipulations (read/write), Fetching only]
HLD = High Level Description
LLD = Low Level Description
MLD = Machine Level Description
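The labels on this slide (one-way compilation from HLD to LLD, a cached LLD on the client, fetching only) can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the dictionary layout of the HLD, the template names, the `inherits` key, and the `NodeClient` class are hypothetical and are not taken from the WP4 design.

```python
from copy import deepcopy

# Hypothetical high-level description (HLD): templates shared by many nodes.
HLD = {
    "templates": {
        "base": {"os": "linux", "packages": ["glibc", "openssh"]},
        "batch_worker": {"inherits": "base", "packages": ["batch-client"], "queue": "A"},
    },
    "nodes": {"node001": "batch_worker", "node002": "base"},
}


def resolve(template_name, templates):
    """Flatten a template and its ancestors into one plain dictionary."""
    template = templates[template_name]
    parent = template.get("inherits")
    result = resolve(parent, templates) if parent else {}
    for key, value in template.items():
        if key == "inherits":
            continue
        if key == "packages":
            # Package lists accumulate down the inheritance chain.
            result["packages"] = result.get("packages", []) + list(value)
        else:
            result[key] = value
    return result


def compile_hld(hld):
    """One-way compilation: HLD in, one low-level description (LLD) per node out."""
    return {node: resolve(tpl, hld["templates"]) for node, tpl in hld["nodes"].items()}


class NodeClient:
    """Client side: read-only access, served from a locally cached copy of the LLD."""

    def __init__(self, node, fetch):
        self.node = node
        self.fetch = fetch          # callable returning this node's LLD
        self.cached_lld = None

    def get(self, key):
        if self.cached_lld is None:
            # Fetch once, then answer every later query from the cache.
            self.cached_lld = deepcopy(self.fetch(self.node))
        return self.cached_lld[key]
```

In this sketch the client has no way to write back, which mirrors the "fetching only" label: changes are made to the HLD and reach nodes only through recompilation.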

9 Installation management
[Diagram; labels: Software Maintainers, Configuration Management, SRS, Local Node, BSS, NMS, Resource Management, Monitoring, Fault Tolerance]
SRS = Software Repository
NMS = Node Management
BSS = Bootstrap Service

10 Scheduling of Actions
Node autonomy approach (chaotic)
– High level configuration change propagated to all affected nodes
– Monitoring senses a change of configuration
– Fault tolerance fires an actuator to bring the node to its configured state (could be “re-install”)
What happens to running jobs?
Who tells scheduler that node is in maintenance?
How are dependent actions handled (e.g. server intervention)?
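A minimal sketch of the node autonomy loop described on this slide: a monitoring step compares the configured state with the actual state, and a fault-tolerance step fires one actuator per difference. The function names, the dictionary representation of node state, and the example values are hypothetical.

```python
def detect_drift(desired, actual):
    """Monitoring step: return the configured keys whose actual value differs."""
    return {key: value for key, value in desired.items() if actual.get(key) != value}


def reconcile(desired, actual, actuators):
    """Fault-tolerance step: fire one actuator per drifted key, return the keys acted on."""
    fired = []
    for key, value in detect_drift(desired, actual).items():
        actuators[key](actual, value)
        fired.append(key)
    return fired


def set_value(key):
    """Build a trivial actuator that forces one key to its configured value."""
    def actuator(state, value):
        state[key] = value
    return actuator


# Hypothetical node state; the version strings are illustrative only.
desired = {"glibc": "2.2.2", "sshd": "running"}
actual = {"glibc": "2.1.3", "sshd": "running"}
actuators = {key: set_value(key) for key in desired}
fired = reconcile(desired, actual, actuators)
```

Note that `reconcile` acts as soon as it sees drift. It never consults the scheduler or checks for running jobs, which is exactly the gap the three questions on the slide point at.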

11 Scheduling of Actions
Decompose complex actions into simple “atomic” actions that can be serialized centrally
– Each configuration change would generate a simple action on the affected nodes
– Scripts bundle the actions together and execute them in a sensible order
Use APIs to the different sub-components
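One way to read "a sensible order" is as a dependency order over the atomic actions. The sketch below serializes actions with Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later). The action names and the `depends_on` mapping are hypothetical, and the slide does not say that WP4 uses a topological sort.

```python
from graphlib import TopologicalSorter


def serialize(actions, depends_on):
    """Return action names in an order where every action follows its prerequisites."""
    graph = {name: depends_on.get(name, ()) for name in actions}
    return list(TopologicalSorter(graph).static_order())


def run_all(actions, depends_on):
    """Central serialization: run the atomic actions one at a time, in dependency order."""
    log = []
    for name in serialize(actions, depends_on):
        actions[name](log)
    return log


def make_action(name):
    """Hypothetical atomic action that only records that it ran."""
    return lambda log: log.append(name)


names = ["enable_node", "upgrade_client", "upgrade_server", "drain_node"]
actions = {name: make_action(name) for name in names}
depends_on = {
    "upgrade_server": ["drain_node"],
    "upgrade_client": ["upgrade_server"],
    "enable_node": ["upgrade_client"],
}
order = run_all(actions, depends_on)
```

Because each action here depends on exactly one predecessor, the resulting order is fully determined by the dependencies rather than by the order in which the actions were declared.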

12 Change glibc on service A
1. Get list of nodes L belonging to service A
2. For all nodes (L1…Ln)
– Disable Li in scheduler queue A
3. Wait for completion of 2
4. For all nodes (L1…Ln)
– Submit admin job to node Li
5. Wait for completion of 4
6. For all nodes (L1…Ln)
– Re-enable node Li in scheduler queue A
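The six steps can be sketched as a short script. `FakeScheduler` and the `admin_job` callable are hypothetical stand-ins for the scheduler and node-management APIs, which the slides do not specify. Each "for all nodes" step runs in a thread pool, and leaving the `with` block provides the "wait for completion" barrier.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeScheduler:
    """Hypothetical stand-in for the batch scheduler's API."""

    def __init__(self, membership):
        self.membership = membership  # service name -> list of node names
        self.enabled = {node for nodes in membership.values() for node in nodes}

    def nodes_of(self, service):
        return list(self.membership[service])

    def disable(self, node):
        self.enabled.discard(node)

    def enable(self, node):
        self.enabled.add(node)


def for_all(nodes, step):
    """Run step on every node and return only when all of them have completed."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(step, nodes))


def change_on_service(scheduler, service, admin_job):
    nodes = scheduler.nodes_of(service)      # step 1: list the nodes of the service
    for_all(nodes, scheduler.disable)        # steps 2-3: disable all, then wait
    results = for_all(nodes, admin_job)      # steps 4-5: run the admin job, then wait
    for_all(nodes, scheduler.enable)         # step 6: re-enable all nodes
    return dict(zip(nodes, results))
```

The sketch treats `disable` as synchronous and does not wait for jobs already running on a node to finish. If an admin job raises, the exception propagates and the nodes stay disabled, which is arguably the safer failure mode but would need explicit handling in a real system.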

13 For September 2001
First prototype of the configuration management system
– Low level (node) query interface
– Caching
“Interim” installation system
– LCFG for upgrades and maintenance
– SystemImager for initial system install and VACM console control for system preparation

14 GRID issues
“Gridification”
Protect the fabric against GRID jobs
– Local farms will still be used by local users
– Firewalls (channeling of job I/O, interactive jobs, MPI over WAN, …)
– Local authorization of grid users
– Job information
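As an illustration of the "local authorization of grid users" item, the sketch below maps a grid identity to a local account and then applies a site-local policy. The mapping table, the subject strings, the account names, and the policy layout are all hypothetical; the slides do not describe how WP4 implements this step.

```python
def authorize(subject, grid_map, local_policy):
    """Return the local account for a grid identity, or None if access is refused."""
    account = grid_map.get(subject)
    if account is None:
        # Unknown grid identity: the site has no mapping for it.
        return None
    if account not in local_policy["allowed_accounts"]:
        # Known identity, but the site currently refuses this account.
        return None
    return account


# Hypothetical identities and accounts, for illustration only.
grid_map = {
    "/O=Grid/CN=Alice Example": "gridusr01",
    "/O=Grid/CN=Bob Example": "gridusr02",
}
local_policy = {"allowed_accounts": {"gridusr01"}}
```

Keeping the final decision in `local_policy` reflects the point of the slide: the site, not the grid, decides who may run on the local farm.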

15 Conclusions
DataGrid WP4 is not so much about the G-word. It is really about automating cluster management
In the process of defining the global architecture. How do we best put the bits and pieces together?
Ambitious delivery plans already for September

