UCL workshop – 4-5 March 2004 – HEP Assessment of EDG

HEP Applications Evaluation of the EDG Testbed and Middleware
Stephen Burke (EDG HEP Applications – WP8)
s.burke@rl.ac.uk

Introduction
- Updated from the CHEP talk ~1 year ago
  - Some things have changed, some not!
- Based on D8.4 report (EDG only here, 2.0/2.1 releases)
- Achievements of WP8
- Updated use case analysis mapping HEPCAL to EDG
- Lessons learnt

Objectives and Achievements
- Objective: Evaluate EDG Application Testbed, and integrate into experiment tests as appropriate.
  - Further successful evaluation of 1.4.n throughout the summer.
  - Evaluation of EDG 2.0 on the EDG Application Testbed since October, and of EDG 2.1 since December.
- Objective: Liaise with LCG regarding EDG/LCG integration and the development of the LCG service.
  - EIPs (Loose Cannons) helped testing of EDG components on the LCG Cert TB prior to LCG-1 start in September.
  - Performed stress tests on LCG-1.
- Objective: Continue work with experiments on data challenges throughout the year.
  - All 6 experiments have conducted data challenges of different scales throughout 2003 on EDG App TB or LCG/Grid.it.

Objectives and Achievements
- Objective: Continued work in Architectural Task Force (ATF)
  - Walkthroughs of HEP use cases helped to clarify interfacing problems.
- Objective: Reactivation of the Application Working Group (AWG)
  - Extension of HEPCAL use cases covering key areas in Biomedicine and Earth Sciences.
  - Basis of first proposal for common application work in EGEE.
- Objective: Work with LCG/GAG (Grid Applications Group) in further refinement of HEP requirements
  - HEPCAL-2 requirements document for the use of grid by thousands of individual users.
  - In addition, further refined the original HEPCAL document.
- Objective: Development of tutorials and documentation for the user community
  - WP8 has played a substantial role in course design, implementation and delivery.

Use Case Analysis
- EDG release 2.0 has been evaluated against the HEPCAL Use Cases
- Of the 43 Use Cases:
  - 13 (was 10) are fully implemented
  - 4 (was 8) are largely satisfied, but with some restrictions or complications
  - 11 (was 8) are partially implemented, but have significant missing features
  - 15 (was 17) are not implemented
- Missing functionality is mainly in:
  - Virtual data (not considered by EDG)
  - Metadata catalogues and file collections (still needs more work)
  - Authorisation, job control and optimisation (partly delivered but not integrated)

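The counts on this slide can be cross-checked with a few lines of Python. This is only a sketch: the category keys are shorthand for the slide's wording, and the numbers are copied from the slide (current count, then the earlier "was" count).

```python
# Use case counts from the slide: (current count, earlier "was" count).
# The keys are shorthand labels, not official HEPCAL terminology.
use_cases = {
    "fully implemented": (13, 10),
    "largely satisfied": (4, 8),
    "partially implemented": (11, 8),
    "not implemented": (15, 17),
}

total_now = sum(now for now, _ in use_cases.values())
total_before = sum(before for _, before in use_cases.values())

for label, (now, before) in use_cases.items():
    share = 100.0 * now / total_now
    print(f"{label:22s} {now:3d} ({share:4.1f}%)  change {now - before:+d}")

print("total now:", total_now, "total before:", total_before)
```

Both columns sum to the 43 use cases quoted on the slide, so the categories are mutually exclusive in both evaluations.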
Lessons Learnt - General
- Having real users on an operating testbed on a fairly large scale is vital: many problems emerged which had not been seen in local testing.
- Problems with configuration are at least as important as bugs: integrating the middleware into a working system takes as long as writing it!
- Grids need different ways of thinking by users and system managers.
  - A job must run anywhere it lands.
  - Sites are not uniform, so jobs should make as few demands as possible.

Job Submission
- Limitations seen in 1.4 are largely gone
  - Efficiency over 90% in stress tests (1600 jobs)
  - Failures are ~1% in normal use (after resubmission)
  - Most failures now at globus/site level, not broker
- Can still be sensitive to poor or incorrect information from Information Providers
  - Info providers have improved, configuration generally better
  - No “black hole” sites lately (but still possible)
- Still hard to diagnose errors (“invalid script response”???)
- Advanced features (checkpointing, DAGMAN, interactivity, accounting, …) largely untested, some not integrated

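As a rough illustration of why resubmission matters, the sketch below computes the residual failure rate after resubmission. It assumes, purely for illustration and not as a claim from the talk, that each attempt fails independently with the same probability, and it reads "efficiency over 90%" as a per-attempt failure rate of at most 10%.

```python
def failure_after_resubmission(p_fail, resubmissions):
    """Probability that a job still fails after the given number of
    resubmissions, assuming independent attempts (illustrative assumption)."""
    return p_fail ** (resubmissions + 1)

# Assumed per-attempt failure rate, read off "efficiency over 90%".
single_attempt_failure = 0.10

residual = failure_after_resubmission(single_attempt_failure, 1)
print(f"residual failure rate after one resubmission: {residual:.2%}")
```

Under that assumption a single resubmission brings a 10% failure rate down to about 1%. Real failures are often correlated (a broken site fails every job sent to it), so this is an optimistic bound rather than a model of the testbed.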
Information Systems
- R-GMA is a big improvement on MDS
  - Tables, SQL queries, much easier to publish, …
  - Largely a personal view, experiments have mostly not used it yet
- Took a very long time to become stable: during the D8.4 evaluation R-GMA availability was O(75%)
- Latest version installed for the EU review looks much better: total end-to-end efficiency now > 95%, R-GMA is ~100% (but testbed is now lightly loaded)
- NO SECURITY!
  - And no Registry/schema replication
- Need to check published information for accuracy (or at least sanity!)
- GLUE schema is not in EDG/LCG control, and has proved very hard to change

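A sanity check on published information could look something like the sketch below. The record layout and the field names (`total_cpus`, `free_cpus`, `estimated_response_time`) are hypothetical stand-ins chosen for this example; they are not the actual GLUE attribute names or an R-GMA API.

```python
def sanity_problems(record):
    """Return a list of problems found in one published site record.

    `record` is a plain dict with hypothetical keys; a real check would
    read the corresponding attributes from the information system.
    """
    required = ("total_cpus", "free_cpus", "estimated_response_time")
    missing = [key for key in required if key not in record]
    if missing:
        return ["missing fields: " + ", ".join(missing)]

    problems = []
    if record["total_cpus"] <= 0:
        problems.append("total_cpus must be positive")
    if not 0 <= record["free_cpus"] <= record["total_cpus"]:
        problems.append("free_cpus outside the range 0..total_cpus")
    if record["estimated_response_time"] < 0:
        problems.append("estimated_response_time is negative")
    return problems


good_site = {"total_cpus": 40, "free_cpus": 12, "estimated_response_time": 300}
bad_site = {"total_cpus": 10, "free_cpus": 500, "estimated_response_time": -1}

print(sanity_problems(good_site))
print(sanity_problems(bad_site))
```

A check like this only catches impossible values. It cannot tell whether a plausible-looking number is actually accurate, which is the harder problem the slide alludes to.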
Replica Management
- Now mostly “just works”
  - Command line tools are fairly intuitive
  - Sometimes processes can hang
  - Orphan processes sometimes left behind when job ends
  - Some inconsistencies found when used with POOL
- Interaction with SE schema is still unclear
  - Works, but gives artificial restrictions on NFS access
- Bulk operations, mirroring and client-server architecture lost with GDMP
- Java command-line tools are very slow (tens of seconds)
- Fault tolerance is important: error conditions should leave things in a consistent state, failures should be re-tried where possible

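The fault-tolerance point (clean up after an error, then retry) can be sketched generically as below. This is not how the EDG replica manager is implemented; `run_with_retry`, `flaky_copy` and `remove_partial` are hypothetical names, and the "copy" is simulated with a dictionary rather than a real transfer.

```python
import time


def run_with_retry(action, cleanup, attempts=3, delay=0.0):
    """Call action(); on failure call cleanup() to undo partial work,
    then retry, up to `attempts` times in total."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_error = exc
            cleanup()  # restore a consistent state before the next attempt
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Simulated replica copy that fails twice before succeeding.
state = {"calls": 0, "partial": False}


def flaky_copy():
    state["calls"] += 1
    state["partial"] = True  # pretend a partial replica now exists
    if state["calls"] < 3:
        raise IOError("transfer interrupted")
    state["partial"] = False  # transfer completed, nothing left half-done
    return "replica registered"


def remove_partial():
    state["partial"] = False  # delete the half-written replica


result = run_with_retry(flaky_copy, remove_partial)
print(result, "after", state["calls"], "attempts")
```

The key design choice is that cleanup runs after every failure, including the last one, so that giving up also leaves no partial replica behind.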
Replica Catalogues
- Oracle/MySQL catalogues are much better than LDAP in 1.4
- Tested up to O(100k) entries, no degradation seen
  - But need to cope with millions
  - At 10 seconds per file it would take ~4 months to register a million files!
- Queries can be very slow due to inefficient transport of data
  - 30 minutes to return 45k entries
  - Java runs out of memory on bigger queries
- Distributed LRC + RLI not deployed
- NO SECURITY! (Integrated but not deployed)
- Still no consistency checking against SE content

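The two timing figures on this slide can be checked with simple arithmetic. The sketch below assumes strictly serial registration and a 30-day month; both are simplifications made for this estimate.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_MONTH = 30  # rough month length, assumed for this estimate


def registration_time_months(n_files, seconds_per_file):
    """Time to register n_files one after another, in (30-day) months."""
    total_seconds = n_files * seconds_per_file
    return total_seconds / SECONDS_PER_DAY / DAYS_PER_MONTH


months = registration_time_months(1_000_000, 10)
entries_per_second = 45_000 / (30 * 60)  # 45k entries returned in 30 minutes

print(f"registering 1M files at 10 s each: {months:.1f} months")
print(f"query throughput: {entries_per_second:.0f} entries/s")
```

Ten million seconds is about 116 days, which matches the "~4 months" on the slide, and the query figure corresponds to only 25 entries per second.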
Mass Storage
- Always the most problematic area, and still not solved
- LCG2 still using “classic SE”, but only a stop-gap
- SRM should be the solution (?), WP5 SE is the EDG version
- Works, but many rough edges, really still a prototype
  - No disk space management
  - Error reporting is poor, not fault-tolerant
  - Too much logging, not helpful for a system manager
  - Configuration is complex and fragile
  - …
- Also dCache, CASTOR SRM, Enstore SRM …
  - But still not production-quality?
- What is the way forward?

VO Management
- Current LDAP-based system works fairly well, but has many limitations
  - VO servers are a single point of failure
- VOMS looks good, but not yet deployed or fully integrated
  - Or documented!
- Middleware groups seem to have a different security model to VOMS designers
  - E.g. they usually assume one and only one VO
  - VO defines service (Replica Catalogue, SE namespace) and not authorisation
- Experiments will need to gain experience about how a VO should be run

User View of the Testbed
- Site configuration is very complex; there is usually one way to get it right and many ways to be wrong
  - LCFG is a big help in ensuring uniform configuration
  - Middleware should be self-configuring (and self-checking) as far as possible
- Need well-defined certification procedures, checked on an ongoing basis (sites decay with a half-life of ~a few weeks)
- Services should fail gracefully when they hit resource limits
  - The grid must be robust against failures and misconfiguration. Large grids will ~always be broken, so errors are not exceptional!
- Many HEP experiments require outbound IP connectivity from worker nodes
  - Still no solution, discussion is needed
- Scalability? Still only ~20 sites, 1 job/minute!

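The "half-life" remark can be turned into a rough estimate of how quickly an uncertified testbed degrades. The sketch below uses simple exponential decay; the 3-week half-life is an assumed value standing in for "a few weeks", and the 20 sites come from the scalability bullet.

```python
def fraction_still_working(weeks_elapsed, half_life_weeks):
    """Fraction of sites still correctly configured after weeks_elapsed,
    under simple exponential decay with the given half-life."""
    return 0.5 ** (weeks_elapsed / half_life_weeks)


HALF_LIFE_WEEKS = 3.0  # assumed value for "a few weeks"
N_SITES = 20           # "~20 sites" from the slide

for weeks in (1, 3, 6, 12):
    frac = fraction_still_working(weeks, HALF_LIFE_WEEKS)
    print(f"after {weeks:2d} weeks: about {frac * N_SITES:.0f} of {N_SITES} sites still correct")
```

With these assumed numbers only a quarter of the sites would still be correctly configured after six weeks, which is the argument for checking certification on an ongoing basis rather than once.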
Gaps
- Disk space management on worker nodes
  - Some discussion, nothing appeared
- Analysis of scheduling algorithms
  - EstimatedResponseTime is not optimal
- Pre-replication by the broker
- Information about networking at the LAN level
  - Where are the network bottlenecks?
- Distribution of experiment software (now being tackled in LCG)
- Enforcement of quotas (whose job is this?)
- Documentation