DataGrid is a project funded by the European Commission under contract IST-2000-25182
GridPP Edinburgh, 4 Feb 2004
WP8 HEP Applications
Final project evaluation of EDG middleware, and summary of workpackage achievements
F HARRIS (OXFORD/CERN)  Email: f.harris@cern.ch
F Harris GridPP Edinburgh 4 Feb 2004 - n° 2
Outline
- Overview of objectives and achievements
- Achievements by experiments and the experiment-independent team
- Lessons learned
- Summary of the exploitation of WP8 work and future HEP applications activity in LCG/EGEE
- Concluding comments
- Questions and discussion
F Harris GridPP Edinburgh 4 Feb 2004 - n° 3
Overview of objectives and achievements

OBJECTIVES
- Continue work in the ATF for internal EDG architecture understanding and development
- Reconstitute the Application Working Group (AWG), with the principal aim of developing use cases and defining a common high-level application layer
- Work with LCG/GAG (Grid Applications Group) on further refinement of HEP requirements
- Development of tutorials and documentation for the user community
- Evaluate Testbed2, and integrate it into experiment tests as appropriate
- Liaise with LCG regarding EDG/LCG integration and the development of LCG-1, for example the access of experiments using EDG software to the LCG infrastructure
- Prepare deliverable D8.4 ready for Dec 2003

ACHIEVEMENTS
- (ATF) Active participation by WP8; walkthroughs of HEP use cases helped to clarify interfacing problems
- (AWG) This has worked very well; extension of use cases covering key areas in Biomedicine and Earth Sciences, forming the basis of the first proposal for common application work in EGEE
- (GAG) This group has produced a new HEPCAL use case document concentrating on 'chaotic' use of the grid by thousands of individual users, and has further refined the original HEPCAL document
- (Tutorials) WP8 has played a substantial role in course design, implementation and delivery
F Harris GridPP Edinburgh 4 Feb 2004 - n° 4

OBJECTIVES
- Evaluate the EDG Application Testbed, and integrate it into experiment tests as appropriate
- Liaise with LCG regarding EDG/LCG integration and the development of the LCG service
- Continue work with experiments on data challenges throughout the year

ACHIEVEMENTS
- Further successful evaluation of 1.4.n throughout the summer; evaluation of 2.0 since October 20
- Very active cooperation: the EIPs helped testing of EDG components on the LCG Certification Testbed prior to the LCG-1 start in late September, and ran stress tests on LCG-1
- All 6 experiments have conducted data challenges of different scales throughout 2003 on EDG- or LCG-flavoured testbeds, in an evolutionary manner
F Harris GridPP Edinburgh 4 Feb 2004 - n° 5
Comments on experiment work
- Experiments are living in an international multi-grid world, using other grids (US Grid3, NorduGrid, ...) in addition to EDG and LCG, and have coped with interfacing problems
  - The DataTAG project is very important for inter-operability (GLUE schema)
- EDG software was used in a number of grids:
  - EDG Application Testbed
  - LCG Service (LCG-1 evolving to LCG-2)
  - Italian Grid.it (identical to the LCG-1 release)
- Having 2 running experiments (in addition to the 4 LHC experiments) involved in the evaluations has proved very positive
  - BaBar and work on Grid.it
  - D0 and work on the EDG App TB
F Harris GridPP Edinburgh 4 Feb 2004 - n° 6
Evolution in the use of the EDG App TB and the LCG service (and Grid.it)
- Mar-June 2003: EDG 1.4 evolved, with production use by ATLAS, CMS and LHCb at efficiencies ranging from 40-90% (higher in self-managed configurations)
  - LHCb: 300K events, Mar/Apr
  - ATLAS: 500K events, May
  - CMS: 2M events in the LCG-0 configuration over the summer
- Sep 29 2003: LCG-1 service open (and Grid.it installed the LCG-1 release)
  - Used EDG 2.0 job and data management and partitioned MDS (+ VDT)
  - All experiments installed software on LCG-1 and accomplished positive tests
  - ATLAS, ALICE, BaBar and CMS performed 'production' runs on LCG-1/Grid.it
- Oct 20 2003: EDG 2.0 released on the App TB
  - R-GMA instead of 'partitioned' MDS
  - D0 and CMS evaluated the 'monitoring' features of R-GMA
  - D0 did 'production' running
- December 2003: move to EDG 2.1 on the EDG App TB
  - Fixing known problems
  - Enhanced data management
- Feb 2004: LCG-2 service release
  - Enhanced data management
  - Will include R-GMA for job and network monitoring
  - All experiments will use it for data challenges
F Harris GridPP Edinburgh 4 Feb 2004 - n° 7
ALICE evaluation on LCG-1 and Grid.it (September-November 2003)
- Summary
  - Significant improvement in stability with respect to the tests in Spring 2003
  - Simulation job (event) characteristics:
    - 80K primary particles generated
    - 12 hr on a 1 GHz CPU
    - Generated 2 GB of data in 8 files
  - Performance was generally a step function for batches (either close to 0% or close to 100%); with long jobs and multiple files, very sensitive to overall system stability (good days and bad days!)
  - Jobs were sensitive to space on worker nodes
- [Plot: performance of job batches, Sep-Nov]
- Projected load on LCG-1 during the ALICE DC (starting Feb 2004), when LCG-2 will be used (a back-of-the-envelope check follows below):
  - 10^4 events (1 event/job)
  - Submit 1 job every 3 minutes (20 jobs/h; 480 jobs/day)
  - Run 240 jobs in parallel
  - Generate 1 TB output/day
  - Test LCG MS (mass storage)
  - Parallel data analysis (AliEN/PROOF) including LCG
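The projected figures above are internally consistent; a short sketch, using only the per-job numbers quoted on this slide (12 h of CPU and 2 GB of output per single-event job), reproduces them:

# Back-of-the-envelope check of the projected ALICE DC load,
# using only the per-job figures quoted on the slide.
SUBMIT_INTERVAL_MIN = 3      # one job submitted every 3 minutes
JOB_DURATION_H = 12          # ~12 h on a 1 GHz CPU per job
OUTPUT_PER_JOB_GB = 2        # ~2 GB of data (in 8 files) per job
TOTAL_EVENTS = 10_000        # 10^4 events, 1 event per job

jobs_per_hour = 60 / SUBMIT_INTERVAL_MIN                     # 20 jobs/h
jobs_per_day = 24 * jobs_per_hour                            # 480 jobs/day
parallel_jobs = jobs_per_hour * JOB_DURATION_H               # ~240 jobs running at once
output_tb_per_day = jobs_per_day * OUTPUT_PER_JOB_GB / 1000  # ~1 TB/day
days_needed = TOTAL_EVENTS / jobs_per_day                    # ~21 days for 10^4 events

print(jobs_per_hour, jobs_per_day, parallel_jobs, output_tb_per_day, days_needed)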
F Harris GridPP Edinburgh 4 Feb 2004 - n° 8
ATLAS: evolution from 1.4 to LCG-1 and the new DC2 production system
- Use of EDG 1.4.11 (modified for RH 7.3)
  - In May, reconstructed 500K events in 250 jobs with 85% first-pass efficiency
  - With a privately managed configuration of 7 sites in Italy, Lyon and Cambridge
- LCG-1 (+ Grid.it) production
  - Have simulated 30,000 events in 150 jobs of 200 events each (the jobs required ~20 hours each, with efficiency ~80%)
- LCG-2 plans
  - Start around April
- Main features of the new DC2 system (a sketch of the supervisor/executor pattern follows below):
  - Common production database for all of ATLAS
  - Common ATLAS supervisor run by all facilities/managers
  - Common data management system à la Magda
  - Executors developed by middleware experts (LCG, NorduGrid, US)
  - Final verification of data done by the supervisor
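To illustrate the supervisor/executor split listed above, here is a minimal sketch of the pattern; every class and method name (Supervisor, Executor, the production-database calls such as fetch_pending and mark_submitted) is an illustrative assumption, not the actual ATLAS DC2 interface.

# Hypothetical sketch of a DC2-style supervisor/executor split: one supervisor
# pulls job definitions from a common production database and hands them to
# grid-specific executors (LCG, NorduGrid, US). Names are illustrative only.
from dataclasses import dataclass

@dataclass
class JobDef:
    job_id: int
    transformation: str   # e.g. a simulation or reconstruction step
    n_events: int

class Executor:
    """Base class: each grid flavour implements its own submission."""
    def submit(self, job: JobDef) -> str:
        raise NotImplementedError

class LCGExecutor(Executor):
    def submit(self, job: JobDef) -> str:
        # would translate the JobDef into a JDL and submit it to an LCG broker
        return f"lcg-{job.job_id}"

class Supervisor:
    def __init__(self, proddb, executors):
        self.proddb = proddb          # common production database, shared by all of ATLAS
        self.executors = executors    # mapping: grid name -> Executor (LCG, NorduGrid, US)

    def run_once(self):
        for job, grid in self.proddb.fetch_pending():     # proddb records the target grid
            grid_id = self.executors[grid].submit(job)
            self.proddb.mark_submitted(job.job_id, grid_id)
        # final verification of the produced data stays with the supervisor
        for job in self.proddb.fetch_finished():
            self.proddb.mark_verified(job.job_id, ok=self.verify_output(job))

    def verify_output(self, job) -> bool:
        return True   # placeholder: check catalogue entries, file sizes, checksums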
F Harris GridPP Edinburgh 4 Feb 2004 - n° 9
CMS work on LCG-0 and LCG-1
- LCG-0 (summer 2003)
  - Components from VDT 1.1.6 and EDG 1.4.11
  - GLUE schemas and info providers (DataTAG)
  - VOMS
  - RLS
  - Monitoring: GridICE by DataTAG
  - R-GMA (for specific monitoring functions)
- 14 sites configured and managed by CMS
- Physics data produced
  - 500K Pythia events: 2000 jobs of 8 hr
  - 1.5M CMSIM events: 6000 jobs of 10 hr
- Comments on performance
  - Substantial improvements in efficiency compared to the first EDG stress test (~80%)
  - Networking and site configuration were problems, as was the 1st version of RLS
- LCG-1
  - Ran for 9 days on LCG-1 (20-23/12/2003 – 7-12/1/2004)
  - In total, 600,000 events (30-40 hr jobs) were produced (ORCA) using UIs in Padova and Bari
  - Sites used were mainly in Italy and Spain
  - Efficiency around 75% over the Christmas holidays (pretty pleased!)
  - Used the GENIUS portal
- LCG-2
  - Will be used within the data challenge starting in March
F Harris GridPP Edinburgh 4 Feb 2004 - n° 10
LHCb results and status on EDG and LCG
- Tests on the EDG 1.4 application testbed (Feb 2003):
  - Standard LHCb production tasks, 300K events produced
  - ~35% success rate (testbed support was running down)
    - Software installation by the running job
- EDG 2.0 tests (November 2003):
  - Simple test jobs as well as standard LHCb production tasks:
    - 200 events per job, ~10 hours of CPU
  - Submission of the jobs (a hypothetical submission sketch follows below):
    - To the EDG RB
    - Directly to a CE, with the CE status information obtained from the CE GRIS server
  - This significantly increased the stability of the system, up to a 90% success rate
- LCG tests (now):
  - Software installation/validation either centralised or in a special LHCb job
  - The basic functionality tests are OK:
    - LHCb production jobs with the output stored on an SE and registered in the RLS catalogue
    - Subsequent analysis jobs with input taken from previously stored files
  - Scalability tests with a large number of jobs are prepared, to be run on the LCG-2 platform
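To make the broker submission mode above concrete, here is a minimal, hypothetical sketch; the JDL attributes shown (Executable, InputSandbox, ...) are standard EDG JDL, but the script name, sandbox contents and the bare edg-job-submit invocation (all options omitted) are illustrative assumptions, not LHCb's actual production setup.

# Hypothetical sketch of submitting an LHCb-style production job through the
# EDG Resource Broker. Script name and sandbox contents are illustrative.
import subprocess, textwrap

jdl = textwrap.dedent("""\
    Executable     = "run_production.sh";
    Arguments      = "200";
    StdOutput      = "std.out";
    StdError       = "std.err";
    InputSandbox   = {"run_production.sh"};
    OutputSandbox  = {"std.out", "std.err"};
""")

with open("job.jdl", "w") as f:
    f.write(jdl)

# Submission via the broker (exact options omitted); direct-to-CE submission
# would instead pick a CE from the status published by its GRIS and name it
# explicitly in the submission.
subprocess.run(["edg-job-submit", "job.jdl"], check=True)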
F Harris GridPP Edinburgh 4 Feb 2004 - n° 11
BaBar work on EDG and Grid.it
- Integrating grid tools in an experiment running since 1999
  - Hope to have access to more resources via the grid paradigm (e.g. Grid.it)
  - Have used grid tools for data distribution (SLAC to IN2P3)
- Strategy for first integration
  - Created a 'simulation' RPM to be installed at sites
  - Data output stored on the closest SE
  - Data copied to a Tier-1 or SLAC using edg-copy
  - Logfiles via the sandbox
- Scheme first tested with EDG 1.4.11 on 5 Italian sites
- Operation on Grid.it with the LCG-1 release
  - RB at CNAF
  - Farms at 8 Italian sites
  - 1-week test with ~500 jobs
  - 95% success at Ferrara (the site with the central DB)
  - 60% success elsewhere
    - 33% of failures due to network saturation from simultaneous requests to the remote Objectivity database (looking at distributed solutions)
- Positive experience with use of the GENIUS portal
  - https://genius.ct.infn.it
- Analysis applications have been successfully tested on the EDG App TB
  - Input through the sandbox
  - Executable from the close SE
  - Read Objectivity or ROOT data from the local server
  - Produce n-tuples and store them on the close SE
  - Send back the output log file through the sandbox
F Harris GridPP Edinburgh 4 Feb 2004 - n° 12
D0 evaluations on the EDG App TB
- Ran part of the D0 Run 2 re-processing on EDG resources
  - Job requirements
    - Input file is 700 MB (2800 events)
    - 1 event costs 25 GHz·sec (Pentium series assumed)
    - A job takes 19.4 hours on a 1 GHz machine (2800 × 25 GHz·s ≈ 70,000 s)
  - Frequent software updates, so RPMs are not used
    - Registered compressed tar archives in RLS as grid files
    - Jobs retrieve and install them before starting execution (see the sketch below)
  - Use R-GMA for monitoring
    - Allows users and programs to publish information for inspection by other users, and for archiving in the production database
    - Found it invaluable for matching output files to jobs in real production, with some jobs failing
- Results
  - Found the EDG s/w generally satisfactory for the task (with caveats)
    - Had to modify some EDG RM commands due to their (Globus) dependence on inbound IP connectivity; should be fixed in 2.1.x
    - Used the 'Classic' SE s/w while waiting for mass storage developments
    - Very sensitive to R-GMA instabilities; recently good progress with R-GMA, and can run at ~90% efficiency when R-GMA is up
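As a loose illustration of the install-before-run pattern D0 describe, here is a hypothetical job prologue; the VO name, logical file name, tarball layout and the exact replica-manager invocation are all assumptions made for illustration, not taken from D0's actual scripts.

# Hypothetical D0-style job prologue: fetch the software as a tar archive
# registered in RLS as a grid file, unpack it, then run the re-processing.
import subprocess, tarfile, os

VO = "dzero"                                   # assumed VO name
SOFTWARE_LFN = "lfn:d0-reco-software.tar.gz"   # illustrative logical file name
LOCAL_TARBALL = "/tmp/d0-reco-software.tar.gz"

# Retrieve the registered tar archive (replica-manager command line assumed).
subprocess.run(
    ["edg-rm", f"--vo={VO}", "cp", SOFTWARE_LFN, f"file:{LOCAL_TARBALL}"],
    check=True,
)

# Install the software into the job's working area before starting execution.
with tarfile.open(LOCAL_TARBALL) as tar:
    tar.extractall(path="software")

# Run the actual re-processing step (placeholder executable name).
subprocess.run(["software/bin/d0reco", "--input", os.environ.get("INPUT_FILE", "")],
               check=True)

# The job would then publish monitoring records (job id, output file names)
# via R-GMA so output files can be matched back to jobs; the R-GMA producer
# call is omitted here rather than guessed at.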
F Harris GridPP Edinburgh 4 Feb 2004 - n° 13
Summary of middleware evaluations
- Workload management
  - Tests have shown that the software is more robust and scalable
    - Stress tests were successful with up to 1600 jobs in multiple streams, with efficiencies over 90%
    - Problems with new sites during tests: VOs not set up properly (though the site accepted the job)
- Data management
  - Has worked well with respect to functionality and scalability (~100K files registered in ongoing tests)
    - Tests so far with only 1 LRC per VO implemented
  - Performance needs enhancement
    - Registrations and simple queries can take up to 10 seconds
  - We have lost (with GDMP) the bulk transfer functions
  - Some functions needed inbound IP connectivity (Globus); D0 had to program around this (problem since fixed)
F Harris GridPP Edinburgh 4 Feb 2004 - n° 14
Summary of middleware evaluations (2)
- Information system
  - Partitioned MDS has worked well for LCG, following on from work accomplished within EDG (BDII work) (? limit ~100 sites)
  - R-GMA work is very promising for 'life after MDS', but needs 'hardening'
    - Ongoing work using D0 requirements for testing; how will this continue in the LCG/EGEE framework?
- Mass storage support (crucial for data challenges)
  - We await an 'accepted' uniform interface to disk and tape systems
    - Solution coming this week (?) with the SRM/GFAL software
    - WP5 have made an important contribution to the development of the SRM interface
  - EDG 2.0 had mass storage access to CERN (CASTOR) and RAL (ADS)
    - The 'Classic SE' has been a useful fallback (a GridFTP server) while waiting for the commissioning of these developments
F Harris GridPP Edinburgh 4 Feb 2004 - n° 15
Other important grid system management issues
- Site certification
  - Must have an official, standard procedure as part of the release
  - Tests are intimately tied to what is published in the information service, so the relevant information must be checked
  - Tests should be run on a regular basis, and sites that fail should be taken out automatically
- Site configuration
  - The configuration of the software is very complex to do manually
  - The only way to check it has been to run jobs, pray, and then look at the errors
  - Could middleware providers not provide fewer parameters and supply defaults? And also check parameters while running?
- Space management and publishing
  - Running out of space on SEs and WNs is still a problem; jobs need to check availability before running (a minimal pre-run check is sketched below)
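As a minimal illustration of such a pre-run check on a worker node (the working directory and the 5 GB threshold are assumptions; a real job would derive the requirement from its expected output size):

# Minimal sketch of a worker-node pre-run space check, as suggested above.
import shutil, sys

REQUIRED_GB = 5            # assumed space needed for this job's output
workdir = "."              # the job's scratch area on the worker node

free_gb = shutil.disk_usage(workdir).free / 1024**3
if free_gb < REQUIRED_GB:
    # Fail fast (and visibly) rather than dying mid-job with a full disk.
    sys.exit(f"Only {free_gb:.1f} GB free in {workdir}, need {REQUIRED_GB} GB")
print(f"{free_gb:.1f} GB free, proceeding")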
F Harris GridPP Edinburgh 4 Feb 2004 - n° 16
The deliverables + 'extra' outputs from WP8
- The formal EU deliverables
  - D8.1: The original HEP requirements document
  - D8.2: 'Evaluation by experiments after 1st year'
  - D8.3: 'Evaluation by experiments after 2nd year'
  - D8.4: 'Evaluation after 3rd year'
- Extra key documents (being used as input to EGEE)
  - HEPCAL use cases, May 2002 (revised Oct 2003)
  - AWG recommendations for middleware (June 2003)
  - AWG enhanced use cases (for Biomed, ESA), Sep 2003
  - HEPCAL2 use cases for analysis (several WP8 people)
- Generic HEP test suite used by EDG/LCG
- Interfacing of 6 experiment systems to the middleware
- Ongoing consultancy from the 'loose cannons' to all applications
F Harris GridPP Edinburgh 4 Feb 2004 - n° 17
Main lessons learned
- Globally, the HEP applications feel it would have been 'better' to start with simpler prototype software and develop incrementally in functionality and complexity, with close working relations with the middleware developers
- Applications should have played a larger role in the architecture and in defining interfaces (so we could all learn together!); 'average' user interfaces are at too low a level, and defining high-level APIs from the beginning would have been very useful
- The formation of Task Forces (applications + middleware) was a very important step midway through the project
- The Loose Cannons (a team of 5) were crucial to all developments and worked across experiments; this team comprised all the funded effort of WP8
- The information system is the nerve centre of the grid; we look to R-GMA developments for the long-term solution to scaling problems (LCG think partitioned MDS could work up to ~100 sites)
- We look to SRM/GFAL as the solution to uniform mass storage interfacing
- Must have flexible application s/w installation, not coupled to grid s/w installation (experiments are working with LCG; see the work of LHCb, ALICE, D0, ...)
- We observe that site configuration, site certification and space management on SEs and WNs are still outstanding problems that account for most of the inefficiencies (down times)
F Harris GridPP Edinburgh 4 Feb 2004 - n° 18
Exploitation of the work of WP8, and future HEP applications work in LCG/EGEE
- All experiments have exploited the EDG middleware using WP8 effort
- The HEPCAL and AWG documents are essential inputs to the future LCG/EGEE work
- Future developments will be in the context of the LCG/EGEE infrastructure
  - There will be a natural carrying over of experience and personnel within the HEP community
- Experiments will use LCG/EGEE for data challenges in 2004
  - ALICE now; CMS Mar 1; ATLAS and LHCb April
- In parallel, within the ARDA project, middleware will be 'hardened' (including EDG components) and evaluated on a development TB
- The NA4 activity in EGEE will include dedicated people for interfacing middleware to the experiments (8 people at CERN + others distributed in the community)
F Harris GridPP Edinburgh 4 Feb 2004 - n° 19
Concluding comments
- Over the past 3 years the HEP community has moved from an R&D approach to the grid paradigm to the use of grid services in physics production systems in world-wide configurations
- Experiments are using several managed grids (LCG/EGEE, US grids, NorduGrid), so inter-operability is crucial
- Existing software can be used in production services, with parallel 'hardening' taking advantage of the lessons learned (the ARDA project in LCG/EGEE)
- THANKS
  - To the members of WP8 for all their efforts over 3 years
  - And to all the other WPs (middleware, testbed, networking, project office) for their full support and cooperation
F Harris GridPP Edinburgh 4 Feb 2004 - n° 20
Questions and discussion