European DataGrid Project: Status and Plans
Peter Kunszt, CERN (DataGrid WP2 Manager)
ACAT, Moscow, 26 June 2002
Outline
EU DataGrid Project: EDG overview, project organisation, objectives
Current status, overall and by work package
Plans for the next releases and testbed 2
Conclusions
The Grid Vision
"Flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions, and resources" (from "The Anatomy of the Grid: Enabling Scalable Virtual Organizations").
Enable communities ("virtual organizations") to share geographically distributed resources as they pursue common goals, assuming the absence of a central location, central control, omniscience, and existing trust relationships.
Grids: Elements of the Problem
Resource sharing: computers, storage, sensors, networks, ...; sharing is always conditional, raising issues of trust, policy, negotiation, payment, ...
Coordinated problem solving: beyond client-server, towards distributed data analysis, computation, collaboration, ...
Dynamic, multi-institutional virtual organisations: community overlays on classic organisational structures; large or small, static or dynamic.
EU DataGrid Project Objectives
DataGrid is a project funded by the European Union whose objective is to build and exploit the next-generation computing infrastructure, providing intensive computation and analysis of shared large-scale databases.
Enable data-intensive sciences by providing world-wide Grid test beds to large distributed scientific organisations ("Virtual Organisations", VOs).
Start (kick-off): 1 January 2001; end: 31 December 2003.
Application/end-user communities: HEP, Earth Observation, Biology.
Specific project objectives:
Middleware for fabric and grid management
Large-scale testbed
Production-quality demonstrations
Collaborate with and complement other European and US projects
Contribute to open standards and international bodies (GGF, Industry & Research Forum)
DataGrid Main Partners
CERN – International (Switzerland/France)
CNRS – France
ESA/ESRIN – International (Italy)
INFN – Italy
NIKHEF – The Netherlands
PPARC – UK
Assistant Partners
Research and Academic Institutes:
CESNET (Czech Republic)
Commissariat à l'énergie atomique (CEA) – France
Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)
Consiglio Nazionale delle Ricerche – Italy
Helsinki Institute of Physics – Finland
Institut de Fisica d'Altes Energies (IFAE) – Spain
Istituto Trentino di Cultura (IRST) – Italy
Konrad-Zuse-Zentrum für Informationstechnik Berlin – Germany
Royal Netherlands Meteorological Institute (KNMI) – The Netherlands
Ruprecht-Karls-Universität Heidelberg – Germany
Stichting Academisch Rekencentrum Amsterdam (SARA) – The Netherlands
Swedish Research Council – Sweden
Industrial Partners:
Datamat (Italy)
IBM-UK (UK)
CS-SI (France)
Project Schedule
Project started on 1 January 2001.
TestBed 0 (early 2001): international test bed 0 infrastructure deployed; Globus 1 only, no EDG middleware.
TestBed 1 (now): first release of EU DataGrid software to defined users within the project: HEP experiments (WP8), Earth Observation (WP9), Biomedical applications (WP10).
Successful project review by the EU: 1 March 2002.
TestBed 2 (October 2002): builds on TestBed 1 to extend the facilities of DataGrid.
TestBed 3 (March 2003) and TestBed 4 (September 2003).
Project ends on 31 December 2003.
EDG Highlights
The project is up and running!
All 21 partners are now contributing at the contractual level: a total of ~60 man-years for the first year.
All 40 EU deliverables (more than 2000 pages) were submitted in time for the review, according to the contract technical annex.
First test bed delivered, with real production demos.
All deliverables (code and documents) are available: requirements, surveys, architecture, design, procedures, testbed analyses, etc.
Working Areas
The DataGrid project is divided into 12 work packages distributed over four working areas: Applications, Middleware, Infrastructure (Testbed), and Management.
Work Packages
WP1: Workload Management System
WP2: Data Management
WP3: Grid Monitoring / Grid Information Systems
WP4: Fabric Management
WP5: Storage Element
WP6: Testbed and Demonstrators
WP7: Network Monitoring
WP8: High Energy Physics Applications
WP9: Earth Observation
WP10: Biology
WP11: Dissemination
WP12: Management
Objectives for the First Year of the Project
Collect requirements for middleware: take into account the requirements from the application groups; survey current technology for all middleware.
Core services testbed (Testbed 0): Globus only, no EDG middleware.
First Grid testbed release (Testbed 1): first release of EDG middleware.
WP1, workload: job resource specification and scheduling.
WP2, data management: data access, migration and replication.
WP3, grid monitoring services: monitoring infrastructure, directories and presentation tools.
WP4, fabric management: framework for fabric configuration management and automatic software installation.
WP5, mass storage management: common interface for mass storage systems.
WP7, network services: network services and monitoring.
DataGrid Architecture
[Figure: the layered EDG architecture]
Local computing: Local Application, Local Database.
Grid application layer: Job Management, Data Management, Metadata Management, Object-to-File Mapping.
Collective services: Grid Scheduler, Replica Manager, Information & Monitoring.
Underlying grid services: Computing Element Services, Storage Element Services, Replica Catalog, SQL Database Services, Authorization, Authentication and Accounting, Service Index.
Fabric services (grid fabric): Resource Management, Configuration Management, Monitoring and Fault Tolerance, Node Installation & Management, Fabric Storage Management.
EDG Interfaces
[Figure: the layered EDG architecture annotated with its external interfaces]
The EDG services interface with: scientists and application developers (through the local applications); system managers; operating systems and file systems; batch systems (PBS, LSF); mass storage systems (HPSS, Castor); computing elements and storage elements; user accounts and certificate authorities.
WP1: Workload Management
Goals: maximise the use of resources by efficient scheduling of user jobs.
Achievements: analysis of workload-management system requirements and survey of existing mature implementations, Globus and Condor (D1.1); definition of the architecture for scheduling and resource management (D1.2); development of a "super-scheduling" component using application data and computing-element requirements.
Issues: integration with software from the other WPs; advanced job-submission facilities.
Components: Job Description Language, Resource Broker, Job Submission Service, Information Index, User Interface, Logging & Bookkeeping Service.
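As a purely illustrative aside (not the actual EDG Resource Broker code), the Python sketch below shows the matchmaking idea behind the Resource Broker: computing elements advertised through the Information Index are filtered by a job's requirements and ranked to pick a destination. All attribute and host names here are invented.

    # Minimal matchmaking sketch; attribute names such as "free_cpus" are invented.
    job = {
        "requirements": lambda ce: ce["os"] == "linux" and ce["max_cpu_time"] >= 3600,
        "rank": lambda ce: ce["free_cpus"],            # prefer the emptiest CE
    }

    computing_elements = [
        {"name": "ce01.example.org", "os": "linux", "max_cpu_time": 7200, "free_cpus": 12},
        {"name": "ce02.example.org", "os": "linux", "max_cpu_time": 1800, "free_cpus": 40},
    ]

    def match(job, ces):
        """Return the matching computing element with the highest rank, or None."""
        candidates = [ce for ce in ces if job["requirements"](ce)]
        return max(candidates, key=job["rank"]) if candidates else None

    best = match(job, computing_elements)
    print("job would go to:", best["name"] if best else "no matching CE")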
WP2: Data Management
Goals: coherently manage and share petabyte-scale information volumes in high-throughput, production-quality grid environments.
Achievements: survey of existing tools and technologies for data access and mass storage systems (D2.1); definition of the architecture for data management (D2.2); deployment of the Grid Data Mirroring Package (GDMP) in testbed 1; close collaboration with Globus, PPDG/GriPhyN and Condor; working with the GGF on standards.
Issues: security, i.e. clear methods for handling authentication and authorization; data replication, i.e. how to maintain consistent, up-to-date catalogues of application data and its replicas.
Components: GDMP, Replica Catalog, Spitfire.
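As a rough sketch of what the replica catalogue provides (a mapping from logical file names to physical replicas, from which the replica manager picks a copy close to the job), consider the Python fragment below; the catalogue contents, site names and selection rule are made up for illustration and do not model GDMP or the real catalogue.

    # Toy replica catalogue: logical file name (LFN) -> physical file names (PFNs).
    replica_catalogue = {
        "lfn:higgs_raw_000123": [
            "gsiftp://se.cern.ch/data/higgs_raw_000123",
            "gsiftp://se.nikhef.nl/data/higgs_raw_000123",
        ],
    }

    def select_replica(lfn, close_to):
        """Prefer a replica hosted at the given site, else return any replica."""
        pfns = replica_catalogue.get(lfn, [])
        for pfn in pfns:
            if close_to in pfn:
                return pfn
        return pfns[0] if pfns else None

    print(select_replica("lfn:higgs_raw_000123", close_to="nikhef.nl"))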
WP3: Grid Monitoring Services
Goals: provide an information system for discovering resources and monitoring their status.
Achievements: survey of current technologies (D3.1); coordination of schemas in testbed 1; development of the Ftree caching backend, based on OpenLDAP (Lightweight Directory Access Protocol), to address shortcomings in MDS v1; design of the Relational Grid Monitoring Architecture, R-GMA (D3.2), to be developed further with the GGF; GRM and PROVE adapted to grid environments to support end-user application monitoring.
Components: MDS/Ftree, R-GMA, GRM/PROVE.
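Because MDS and the Ftree backend are queried over LDAP, a resource lookup can be sketched with the python-ldap module as below. The host name is a placeholder, and port 2135 with base "mds-vo-name=local, o=grid" are assumed here as the usual MDS 2.x defaults rather than taken from the EDG documentation.

    # Sketch of an LDAP query against a GRIS/MDS information server.
    # Requires the python-ldap package; host, port and base DN are assumptions.
    import ldap

    def list_grid_entries(host="gris.example.org", port=2135):
        conn = ldap.initialize("ldap://%s:%d" % (host, port))
        base = "mds-vo-name=local, o=grid"
        # Retrieve every entry under the base and print its distinguished name.
        for dn, attrs in conn.search_s(base, ldap.SCOPE_SUBTREE, "(objectclass=*)"):
            print(dn)

    list_grid_entries()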
WP4: Fabric Management
Goals: manage clusters of (order of thousands of) nodes.
Achievements: survey of existing tools, techniques and protocols (D4.1); definition of an agreed architecture for fabric management (D4.2); initial implementations deployed at several sites in testbed 1.
Issues: how to install the reference platform and EDG software on large numbers of hosts with minimal human intervention per node; how to ensure that node configurations stay consistent and how to handle updates to the software suites.
Components: LCFG, PBS and LSF information providers, image installation, Configuration Cache Manager.
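The kind of automation WP4 is after, keeping every node consistent with a centrally declared configuration, can be illustrated with the small reconciliation sketch below; this is not LCFG or the Configuration Cache Manager, and the package names are invented.

    # Conceptual node-configuration reconciliation (not LCFG itself).
    desired_profile = {"globus": "2.0", "edg-wp1-ui": "1.2", "openldap": "2.0"}
    installed       = {"globus": "1.1.4", "openldap": "2.0"}

    def plan_actions(desired, installed):
        """Return (to_install, to_upgrade) so that the node converges to the profile."""
        to_install = [p for p in desired if p not in installed]
        to_upgrade = [p for p in desired
                      if p in installed and installed[p] != desired[p]]
        return to_install, to_upgrade

    install, upgrade = plan_actions(desired_profile, installed)
    print("install:", install)
    print("upgrade:", upgrade)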
WP5: Mass Storage Management
Goals: provide common user and data export/import interfaces to existing local mass storage systems.
Achievements: review of grid data systems, tape and disk storage systems and local file systems (D5.1); definition of the architecture and design for the DataGrid Storage Element (D5.2); collaboration with Globus on GridFTP/RFIO; collaboration with PPDG on a control API; first attempt at exchanging Hierarchical Storage Manager (HSM) tapes.
Issues: scope and requirements of the Storage Element; inter-working with other grids.
Components: Storage Element information providers, RFIO, MSS staging.
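The Storage Element's basic job, serving a file from a disk cache and recalling it from the mass storage system first when it is not yet staged, can be sketched as below; the cache path and the stage_from_mss helper are hypothetical placeholders, not the WP5 interfaces.

    # Conceptual Storage Element staging sketch; paths and helpers are placeholders.
    import os

    DISK_CACHE = "/tmp/se-cache"
    os.makedirs(DISK_CACHE, exist_ok=True)

    def stage_from_mss(filename):
        """Placeholder for a recall from the mass storage system (Castor, HPSS, ...)."""
        print("staging %s from tape ..." % filename)
        open(os.path.join(DISK_CACHE, filename), "w").close()

    def open_grid_file(filename):
        """Return a local path for the requested file, staging it in if needed."""
        cached = os.path.join(DISK_CACHE, filename)
        if not os.path.exists(cached):
            stage_from_mss(filename)
        return cached

    print(open_grid_file("raw_000123_dat"))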
WP7: Network Services
Goals: review the network service requirements for DataGrid; establish and manage the DataGrid network facilities; monitor the traffic and performance of the network; deal with the distributed security aspects.
Achievements: analysis of the network requirements for testbed 1 and study of the available physical network infrastructure (D7.1); use of the European backbone GEANT since December; initial network-monitoring architecture defined (D7.2) and first tools deployed in testbed 1; collaboration with Dante and DataTAG; working with the GGF (Grid High-Performance Networks) and Globus (monitoring/MDS).
Issues: resources for the study of security issues; end-to-end performance for applications depends on a complex combination of components.
Components: network monitoring tools (PingER, Udpmon, Iperf).
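In the spirit of the deployed monitoring tools (though not a replacement for PingER, Udpmon or Iperf), the minimal probe below times a TCP connection to estimate latency to a remote host; the target host and port are only examples.

    # Minimal latency probe: time a TCP connection attempt to a remote service.
    import socket
    import time

    def tcp_rtt(host, port=80, timeout=5.0):
        """Return the seconds needed to open a TCP connection, or None on failure."""
        start = time.time()
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return time.time() - start
        except OSError:
            return None

    rtt = tcp_rtt("www.cern.ch")
    print("unreachable" if rtt is None else "connect time: %.3f s" % rtt)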
WP6: Testbed Integration
Goals: deploy testbeds for the end-to-end application experiments and demos; integrate successive releases of the software components.
Achievements: integration and deployment of EDG software release 1.0; working implementation of multiple Virtual Organisations (VOs) and of the basic security infrastructure; definition of acceptable-usage contracts and creation of the Certification Authorities group.
Issues: procedures for software integration; test plan for software releases; support for production-style usage of the testbed.
Components: Globus packaging and EDG configuration, build tools, end-user documents (the EDG release consists of Globus plus the WP6 additions).
Software Release Procedure
Coordination meeting: gather feedback on the previous release; review the plan for the next release.
WP meetings: take the basic plan and clarify effort, people and dependencies.
Software development: performed by the WPs in dispersed institutes, which also run unit tests.
Software integration: performed by WP6 on the frozen software; integration tests run.
Acceptance tests: performed by the Loose Cannons et al.
Roll-out: present the software to the application groups and deploy it on the testbed (the testbed 1 roll-out meeting in December drew ~100 participants).
Grid Aspects Covered by EDG Testbed 1
VO servers: LDAP directories for mapping users (with certificates) to the correct VO.
Storage Element: grid-aware storage area, situated close to a CE.
User Interface: submit and monitor jobs, retrieve output.
Replica Manager: replicates data to one or more CEs.
Job Submission Service: manages the submission of jobs to the Resource Broker.
Replica Catalog: keeps track of multiple data files "replicated" on different CEs.
Information Index: provides information about grid resources via the GIIS/GRIS hierarchy.
Information & Monitoring: provides information on resource utilisation and performance.
Resource Broker: uses the Information Index to discover and select resources based on job requirements.
Grid Fabric Management: configures, installs and maintains grid software packages and environments.
Logging and Bookkeeping: collects resource usage and job status.
Network performance, security and monitoring: provides efficient network transport, security and bandwidth monitoring.
Computing Element: gatekeeper to a grid computing resource.
Testbed administration: certificate authorities, user registration, usage policy, etc.
TestBed 1 Sites Status
[Screenshot: web interface showing the status of servers at the testbed 1 sites]
DataGrid Testbed
[Map of the >40 testbed sites, marking HEP sites and ESA sites]
Sites shown include: CERN, RAL, Lyon, Grenoble, Marseille, Paris, IPSL, Lund, Lisboa, Santander, Madrid, Valencia, Barcelona, Berlin, Prague, Brno, Torino, Milano, BO-CNAF, PD-LNL, Pisa, Roma, Catania, ESRIN, Estec, KNMI, Moscow, Dubna.
Initial Testbed Usage
Physicists from the LHC experiments submit jobs with their application software, which uses: the User Interface (job submission language etc.), the Resource Broker and Job Submission Service, the Information Service and Monitoring, and Data Replication.
Generic HEP application flowchart: the job arguments give the data type (raw/dst), run number, number of events, number of words per event, a replica catalogue flag and a mass storage flag. For raw production, the job generates raw events on local disk, writes a logbook (raw_xxxxxx_dat.log) on the client node, adds the lfn/pfn to the replica catalogue and optionally moves the data to the SE and mass storage. For dst production, the job gets the pfn from the replica catalogue, copies the raw data from the SE to local disk if the pfn is not local, reads the raw events, writes the dst events and logbook (dst_xxxxxx_dat.log), adds the new lfn/pfn to the replica catalogue and optionally moves the output to the SE and mass storage.
Example submission session:
  JDL]$ dg-job-submit gridpawCNAF.jdl
  Connecting to host testbed011.cern.ch, port 7771
  Transferring InputSandbox files...done
  Logging to host testbed011.cern.ch, port
  ========= dg-job-submit Success =========
  The job has been successfully submitted to the Resource Broker.
  Use the dg-job-status command to check the current job status.
  Your job identifier (dg_jobId) is:
  =========================================
  JDL]$ dg-job-get-output
  Retrieving OutputSandbox files...done
  ========= dg-get-job-output Success =========
  Output sandbox files for the job have been successfully retrieved and stored in the directory: /sandbox/
[Image: first simulated ALICE event, generated using the DataGrid Job Submission Service]
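The flow of the generic HEP application described above can be condensed into the Python sketch below; every function, path and catalogue entry is an invented placeholder used only to make the control flow explicit, not an EDG API.

    # Illustrative condensation of the generic HEP application flowchart.
    def process_run(catalogue, run_number):
        raw_lfn = "raw_%06d_dat" % run_number
        pfn = catalogue.get(raw_lfn)                   # get pfn from the replica catalogue
        if pfn is None:
            raise RuntimeError("no replica registered for " + raw_lfn)
        if pfn.startswith("file:"):                    # pfn local?
            local_copy = pfn[len("file:"):]
        else:
            local_copy = "/tmp/" + raw_lfn             # placeholder: copy raw data from the SE
        # read raw events from local_copy, write dst events (reconstruction placeholder)
        dst_lfn = "dst_%06d_dat" % run_number
        catalogue[dst_lfn] = "gsiftp://se.example.org/" + dst_lfn   # add lfn/pfn to the catalogue
        return dst_lfn

    catalogue = {"raw_000123_dat": "gsiftp://se.example.org/raw_000123_dat"}
    print(process_run(catalogue, 123))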
Biomedical Applications
Data mining on genomic databases (exponential growth).
Indexing of medical databases (Tb/hospital/year).
Collaborative framework for large-scale experiments (e.g. epidemiological studies).
Parallel processing for database analysis.
Complex 3D modelling.
Earth Observation
ESA missions: about 100 Gbytes of data per day (ERS 1/2); 500 Gbytes per day for the ENVISAT mission (launched on 1 March 2002).
EO requirements for the Grid: enhance the ability to access high-level products; allow reprocessing of large historical archives; improve complex Earth-science applications (data fusion, data mining, modelling, ...).
Development and Production Testbeds
Development: an initial set of 5 sites will keep a small cluster of PCs for development purposes, to test new versions of the software, configurations, etc.
Production: a more stable environment for use by the application groups, with more sites, more nodes per site (growing to a meaningful size at the major centres) and more users per VO. Usage is already foreseen in the Data Challenge schedules of the LHC experiments; release schedules will be harmonised.
Planned Intermediate Release Schedule
TestBed 1: November 2001
Release 1.1: January 2002
Release 1.2: July 2002
Release 1.3: internal release only
Release 1.4: August 2002
TestBed 2: October 2002
A similar schedule will be made for 2003. Each release includes feedback from the use of the previous release by the application groups, planned improvements/extensions by the middleware WPs, more use of the WP6 software infrastructure, and feedback into the architecture group.
Plans for 2002
Extension of the testbed: more users, sites and nodes per site; split the testbed into development and production sites; investigate inter-operability with US grids.
Iterative releases up to testbed 2: incrementally extend the functionality provided by each work package; better integrate the components; improve stability.
Testbed 2 (autumn 2002), extra requirements: interactive jobs, job partitioning for parallel execution, advance reservation, accounting and query optimisation, security design (D7.6), ...
Release Plan Details
Current EDG release: deployed on the testbed under RedHat 6.2; finalising the build of EDG 1.2 (GDMP 3.0, GSI-enabled RFIO client and server).
EDG 1.3 (internal): built using autobuild tools, to ease future porting; support for MPI on a single site.
EDG 1.4 (August): support for RedHat 6.2 and 7.2; basic support for interactive jobs; integration of Condor DAGMan; use of MDS 2.2 with the first GLUE schema.
EDG 2.0 (October): still based on Globus 2.x (pre-OGSA); updated GLUE schema; job partitioning and check-pointing; advance reservation/co-allocation.
See http://edms.cern.ch/document/333297 for further details.
Issues
Support for the production testbed
Effort for testing
Software release procedure: integrated testing
CA explosion, CAS introduction and policy support
Packaging and distribution
Software licensing
Convergence on the architecture
Impact of OGSA
Issues and Actions
Support for the production testbed: support team and a dedicated site.
Effort for testing: test team.
Software release procedure, integrated testing: expand the procedure.
CA explosion, CAS introduction and policy support: the security group's security design.
Packaging and distribution: ongoing.
Software licensing: has been addressed.
Convergence on the architecture: architecture group.
Impact of OGSA: design of OGSA services in WP2 and WP3.
Future Plans
Expand and consolidate testbed operations: improve the distribution, maintenance and support processes; understand and refine grid operations.
Evolve the architecture and software on the basis of testbed usage and feedback from users: GLUE; convergence on common documents with PPDG/GriPhyN; OGSA interfaces and components.
Prepare for the second test bed in autumn 2002, in close collaboration with LCG.
Enhance synergy with the US via DataTAG-iVDGL and InterGrid.
Promote early standards adoption through participation in the GGF and other international bodies.
Explore a possible Integrated Project within FP6.
Learn More About EU DataGrid
For more information, see the EDG web site.
EDG tutorials at ACAT: Tuesday and Wednesday.
EDG tutorials at GGF5 in Edinburgh.
CERN School of Computing, Vico Equense, Italy, September 2002: the programme includes Grid lectures by Ian Foster and Carl Kesselman and a hands-on tutorial on DataGrid.