1
C. Loomis – Status EDG – Dec. 12, 2002 – 1
Status of the European DataGrid Project
Charles Loomis (LAL/CNRS)
LAL, December 12, 2002

Outline
- Introduction & Goals
- EDG Architecture
- EDG Deployment & Use
- External Software
- Typical Failure Modes
- Future Developments
2
European DataGrid (EDG)

European DataGrid
- EU-funded, 3-year project (2001-3)
- Goals:
  - develop grid middleware
  - deploy onto working testbed
  - demonstrate grid technology with working applications
- Strong application component: unique!

EDG Organization
WP1    Workload Mgt.
WP2    Data Mgt.
WP3    Info. & Monitoring Sys.
WP4    Fabric Mgt.
WP5    Storage Mgt.
WP6    Testbed
WP7    Networking
WP8    HEP Apps.
WP9    Biomedical Apps.
WP10   Earth Ob. Apps.
WP11   Dissemination
WP12   Project Mgt.

6 Partners; 21 Associates
3
EDG Goals

Actors
- End Users
- Virtual Organization
- Site Administrators

Transparent Access
- Allow users transparent access to authorized resources with single authentication.
- Allow users to delegate authorization to services.
- High-level selection of resources, including datasets.

Virtual Organizations
- Allow groups of people to acquire resources from sites.
- Allow organization to manage resource use among members.

Optimization
- Allow optimal use of resources at site and grid levels.
4
EDG Architecture

Global Batch System: centralized architecture; heavy infrastructure.

[Diagram. Components: User Interface, Resource Broker, Information Systems (MDS, Replica Catalogs), Site X (Computing Element, Storage Element). Arrow labels: submit, query, retrieve, publish state. Annotation: broker chooses optimal site for job.]
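The broker's role in the diagram (pick a site for each job from the state that sites publish) can be illustrated with a toy matchmaker. This is only a sketch under stated assumptions: the `SiteState` fields, the ranking rule, and the site names are hypothetical and are not taken from the EDG Resource Broker.

```python
from dataclasses import dataclass


@dataclass
class SiteState:
    """State a site might publish to the information system (hypothetical fields)."""
    name: str
    free_cpus: int
    queued_jobs: int
    datasets: frozenset


def choose_site(sites, required_dataset=None):
    """Return the best-ranked eligible site, or None if no site qualifies."""
    eligible = [
        s for s in sites
        if s.free_cpus > 0
        and (required_dataset is None or required_dataset in s.datasets)
    ]
    if not eligible:
        return None
    # Toy ranking (an assumption): more free CPUs first, then shorter queues.
    return max(eligible, key=lambda s: (s.free_cpus, -s.queued_jobs))


# Made-up published state for three sites.
published = [
    SiteState("site-a", free_cpus=0, queued_jobs=12, datasets=frozenset({"run1"})),
    SiteState("site-b", free_cpus=8, queued_jobs=3, datasets=frozenset({"run1", "run2"})),
    SiteState("site-c", free_cpus=20, queued_jobs=1, datasets=frozenset({"run2"})),
]

best = choose_site(published, required_dataset="run1")
print(best.name if best else "no eligible site")  # prints "site-b"
```

Even in this toy form, the broker needs a current view of every site before it can rank them, which is why the information system sits on the critical path of each submission.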
5
Comments

Optimization of Resources
- Centralized architecture
- Resource Broker
  - must know state of grid and schedule effectively
  - requires knowledge of site policies and user/job details
- Information System (MDS & RC)
  - must respond quickly to high-volume and high-rate queries

Central Points-of-Failure
- Resource Broker (redundancy at VO-level)
- MDS (unique hierarchy; some redundancy possible)
- With high-rate submissions:
  - RB requires lots of memory, CPU, disk space.
  - MDS requires lots of file descriptors, CPU.
6
Authentication & Authorization

[Diagram. Components: User, Certification Authorities, Virtual Organizations, Site X (Computing Element, Storage Element). Arrow labels: request certificate; accept/reject request; register; proxy sent for authentication; update CRL; retrieve membership lists.]

~10 different VOs: ATLAS, CMS, …
~15 national CAs: France, INFN, …
Example certificate subject: /C=FR/O=CNRS/OU=LAL/CN=Charles Loomis/Email=loomis@lal.in2p3.fr
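The site-side part of this flow (a subject name arrives with the proxy and is checked against locally cached VO membership lists and CRLs) can be sketched roughly as follows. The helper names `parse_subject` and `site_accepts`, the cache contents, and the serial numbers are hypothetical; no cryptographic verification is attempted, and the parser assumes that no attribute value contains a slash.

```python
def parse_subject(dn):
    """Split a slash-separated subject string into (attribute, value) pairs.

    Simplification: assumes no attribute value itself contains "/".
    """
    pairs = []
    for part in dn.strip("/").split("/"):
        key, _, value = part.partition("=")
        pairs.append((key, value))
    return pairs


def site_accepts(subject, serial, vo_members, revoked_serials):
    """Hypothetical site-side check against locally cached lists.

    vo_members: set of subjects retrieved from the VO membership server.
    revoked_serials: set of certificate serial numbers taken from the CA's CRL.
    No signature or expiry checks are performed here.
    """
    return subject in vo_members and serial not in revoked_serials


subject = "/C=FR/O=CNRS/OU=LAL/CN=Charles Loomis/Email=loomis@lal.in2p3.fr"
fields = dict(parse_subject(subject))
print(fields["CN"])  # prints "Charles Loomis"

cached_members = {subject}  # made-up cache contents
cached_crl = {1234}         # made-up revoked serial numbers
print(site_accepts(subject, 42, cached_members, cached_crl))    # prints True
print(site_accepts(subject, 1234, cached_members, cached_crl))  # prints False
```

Because both sets are cached copies, a decision made between refreshes can rest on stale membership or revocation data.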
7
Comments

Infrastructure
- ~15 national CAs as production service
- 10 Virtual Organizations
  - High-Energy Physics: ALICE, BaBar, ATLAS, CMS, DZero, LHCb
  - Earth Observation
  - Biomedical Applications
  - Misc.: WP6, ITeam, Guidelines

Limited Central Points-of-Failure
- VO Membership Server (for VO members)
- Certification Authority (for CA members)
- Caching and infrequent updates minimize problems, but compromise security.
8
Deployment & Use

Development Testbed (1.4.0)
- To facilitate testing and integration of new middleware.
- 3 sites (3 countries)

Production Testbed (1.4.0)
- For applications to use & stress software in “semi-production” environment.
- 8 sites (5 countries)

Site       Location          CPUs
CC-IN2P3   Lyon (F)          4005
CERN       Geneva (CH)       1646
CNAF       Bologna (I)       40
Legnaro    Legnaro (I)       50
NIKHEF     Amsterdam (NL)    22
Padova     Padova (I)        12
RAL        Rutherford (GB)   162

Application Use
- CMS Event Simulation
- ATLAS Event Simulation
- Regular Tutorial Use

Stability
- Filled Grid this week!
9
Globus Experience

GSI Security (OK)
- Some limitations with size of proxies.

GridFTP (OK)
- Recent protocol change because of security fix.

Replica Catalog (OK, limited)
- Unannounced, unnecessary schema change.

GateKeeper/JobManager (Poor)
- Race conditions under load leading to failures.
- High resource use; poor response to errors.

Information System-MDS (Poor)
- Serious problems with stability.
- Query times increase dramatically under load.
10
Globus Experience (cont.)

Interaction
- Generally responsive to identified problems.
- Little advance warning of major changes.
  - Schema changes.
  - Rewrite of JobManager/Batch System interface.

Testing
- Essentially non-existent by Globus.
- Major delays in EDG because of MDS and Gatekeeper.
- Finding/testing/fixing of major problems done outside Globus.

Globus “high-level” services inappropriate for production environment.
11
Condor Experience

CondorG
- Used for reliable job submission from Resource Broker.
- Responsive to problems and provide quick fixes.
- Encountered few problems in our testing.

Condor
- Supported “batch” system for EDG.
- Largely untested, but expect to use with next major release.
12
Typical Failure Modes

Operations
- CRL generation (CA); CRL update (sites)
- Network accessibility (VO LDAP servers)
- Misconfiguration of services (typically SE)

Poor implementation (BUGS)
- Most catastrophic ones eliminated.

Resource Exhaustion
- File descriptors, ports, disk space.

Design Limitations
- Central points-of-failure (RB, MDS).
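Resource exhaustion of the kind listed above can be watched for from inside a service process. The following is a minimal sketch, assuming a Linux host (it reads `/proc/self/fd`); the function names are hypothetical and this is not an EDG monitoring tool.

```python
import os
import resource
import shutil


def usage_fraction(used, limit):
    """Fraction of a limit in use; 0.0 when the limit is unlimited or unknown."""
    if limit is None or limit <= 0:
        return 0.0
    return used / limit


def fd_usage():
    """Share of this process's soft file-descriptor limit currently in use."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return 0.0
    try:
        used = len(os.listdir("/proc/self/fd"))  # Linux-specific
    except OSError:
        return 0.0
    return usage_fraction(used, soft)


def disk_usage(path="/"):
    """Share of the filesystem containing `path` that is in use."""
    total, used, _free = shutil.disk_usage(path)
    return usage_fraction(used, total)


print(f"file descriptors: {fd_usage():.1%} of soft limit")
print(f"disk: {disk_usage():.1%} used")
```

A service could log these fractions periodically and raise an alert well before either one approaches 100%.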
13
Future Developments

EDG Plans
- Advanced data management
  - Real “Storage Element”.
  - Replica Location Service (distributed Replica Catalog)
  - Replica Manager (higher-level user interface)
- Job Management
  - job splitting, checkpointing
  - interactive jobs
- Replace MDS with R-GMA.
- More robust, consistent security model.
- Local resources better tied to grid credentials.

OGSA (Open Grid Services Architecture)
- New services written as web services.
- Probably no complete conversion within EDG lifetime.
14
SlashGrid

Grid File System
- Uses grid credentials for access to local files.
- Frees grid user from local unix account.
  - Simplifies mapping of users to accounts.
  - Allows true account recycling.

More Uses
- Could hide remote access to data.
- Provide compatibility with Globus security model.
- …

Implementation
- User-space daemon on top of CODA kernel module.
- Plug-in interface allows easy extension.
15
Authentication & Authorization (VOMS)

[Diagram. Components: User, Certification Authorities, VOMS, Site X (Computing Element, Storage Element). Arrow labels: request certificate; accept/reject request; request “ticket”; proxy sent for authentication and authorization; update CRL.]

~15 national CAs: France, INFN, …
Local Authorization Decision!
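The idea of a local authorization decision (the proxy carries VO information from the VOMS "ticket", and each site applies its own policy to it) might look roughly like the sketch below. The `Proxy` fields, the group names, and the policy layout are hypothetical illustrations and do not reflect the actual VOMS attribute format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    """Hypothetical view of a proxy carrying VO information from a VOMS ticket."""
    subject: str
    vo: str
    groups: tuple


def local_decision(proxy, site_policy):
    """Decide locally: accept if the site's policy admits any of the user's groups.

    site_policy maps a VO name to the set of groups this site accepts.
    """
    allowed = site_policy.get(proxy.vo)
    if allowed is None:
        return False  # the site does not support this VO at all
    return any(group in allowed for group in proxy.groups)


# Made-up policy: this site accepts only the "production" group of ATLAS.
policy = {"atlas": {"production"}}

alice = Proxy("/C=FR/O=CNRS/CN=Example User", "atlas", ("production", "users"))
bob = Proxy("/C=FR/O=CNRS/CN=Other User", "cms", ("users",))

print(local_decision(alice, policy))  # prints True
print(local_decision(bob, policy))    # prints False
```

In this arrangement the site no longer needs to retrieve membership lists, since the VO information travels with the proxy and only the policy stays local.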
16
Conclusions

Software & Testbed
- Production-quality security infrastructure in place.
- Production and development testbeds:
  - Deployed.
  - Starting to see heavy use by end-users.
  - Reasonable stability for the first time.
- Failure modes:
  - Moving from bugs and operations problems to design and resource limitations.

Unanswered Questions
- Can optimization be achieved? At what level?
- How can resources be limited, reserved, and shared?
- Can efficient scheduling be done with inhomogeneous site policies?