The Grid approach for the HEP computing problem
Massimo Sgaravatto, INFN Padova
HEP computing characteristics
Large numbers of independent events to process: trivial parallelism (see the sketch below)
Large data sets, mostly read-only
Modest floating point requirement: SPECint performance is what matters
Batch processing for production & selection; interactive for analysis
Commodity components are just fine for HEP
Masses of experience with inexpensive farms
Long experience with mass storage systems
Very large aggregate requirements: computation, data
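To make the "trivial parallelism" point concrete, here is a minimal sketch of how an event sample splits into independent batch jobs; the executable name, option names and event counts are illustrative, not from the talk:

  #!/bin/sh
  # Same executable, disjoint event ranges, no communication between jobs.
  NEVENTS=1000000        # size of the event sample (illustrative)
  NJOBS=100              # number of independent batch jobs
  STEP=$((NEVENTS / NJOBS))
  for i in $(seq 0 $((NJOBS - 1))); do
      FIRST=$((i * STEP))
      # each job reads only its own slice of the (mostly read-only) data set
      echo "./reco.exe --first-event $FIRST --num-events $STEP" > job_$i.sh
  done
  # the resulting job scripts can be dispatched independently to any batch farm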
The LHC Challenge
A jump of orders of magnitude with respect to previous experiments
Geographical dispersion of people and of resources: also a political issue
Scale: petabytes of data per year, thousands of processors, thousands of disks, terabits/second of I/O bandwidth, …
Complexity
Lifetime (20 years)
…
CMS: 1800 physicists, 150 institutes, 32 countries. A world-wide collaboration implies distributed computing & storage capacity.
The scale …
[Chart: LHC capacity requirements (~10K SI processors) compared with the non-LHC technology-price curve (40% annual price improvement), i.e. the capacity that can be purchased for the value of the equipment present in 2000]
Solution? Regional Computing Centres
Better serve the needs of the world-wide distributed community
Data available nearby
Reduce dependence on links to CERN
Exploit established computing expertise & infrastructure in national labs and universities
Address political issues as well
[Diagram: Regional Centres, a multi-tier model. CERN (Tier 0), Tier 1 regional centres (INFN, FNAL, IN2P3), Tier 2 centres (Lab a, Uni b, Lab c, …, Uni n), down to department and desktop level, with link bandwidths ranging from 155 Mbps up to 2.5 Gbps]
Open issues
Various technical issues to address:
Resource discovery
Resource management: distributed scheduling, optimal co-allocation of CPU, data and network resources, uniform interface to different local resource managers, …
Data management: petabyte-scale information volumes, high-speed data movement and replication, replica synchronization, data caching, uniform interface to mass storage management systems, …
Automated system management techniques for large computing fabrics
Monitoring services
Security: authentication, authorization, …
Scalability, robustness, resilience
Are Grids the solution?
What is a Grid?
“Dependable, consistent, pervasive access to resources”
Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, or trust relationships
Make it easy to use diverse, geographically distributed, locally managed and controlled computing facilities as if they formed a coherent local cluster
What does the Grid do for you?
You submit your work, and the Grid:
“Partitions” your work into convenient execution units, based on the available resources, data distribution, …, if there is scope for parallelism
Finds convenient places for it to be run
Organises efficient access to your data: caching, migration, replication
Deals with authentication and authorization to the different sites that you will be using
Interfaces to local site resource allocation mechanisms and policies
Runs your jobs
Monitors progress
Recovers from problems
Tells you when your work is complete
State (HEP-centric view) circa 1.5 years ago
Globus project. The Globus Toolkit provides core services for Grid tools and applications: authentication, information service, resource management, etc. (see the sketch below)
A good basis to build on, but:
No higher-level services
Handling of large volumes of data not addressed
No production-quality implementations
Not yet possible to do real work with Grids …
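For orientation, a hedged sketch of what using those core Globus Toolkit services looked like from the command line; the host name is a placeholder and the exact options should be checked against the toolkit documentation of the time:

  grid-proxy-init                                   # GSI authentication: create a short-lived proxy credential
  globus-job-run grid.example.org /bin/hostname     # GRAM resource management: run a command on a remote gatekeeper
  globus-url-copy file:///tmp/in.dat gsiftp://grid.example.org/tmp/in.dat   # bulk data movement over GridFTP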
DataGrid Project (EDG)
Project started January 2001, duration 3 years
9.8 million € funded by the EU
Goals:
Build a significant prototype of the LHC computing model
Collaborate with and complement other European and US projects
Develop a sustainable computing model applicable to other sciences and industry: biology, Earth observation, etc.
Specific project objectives:
Middleware for fabric & Grid management: evaluation, test and integration of existing middleware, plus research and development of new software as appropriate
Large-scale testbed
Production-quality demonstrations
Open source and technology transfer
Main partners: CERN, CNRS (France), ESA/ESRIN (Italy), INFN (Italy), NIKHEF (The Netherlands), PPARC (UK)
Associated partners
Research and academic institutes: CESNET (Czech Republic); Commissariat à l'énergie atomique (CEA, France); Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI); Consiglio Nazionale delle Ricerche (Italy); Helsinki Institute of Physics (Finland); Institut de Fisica d'Altes Energies (IFAE, Spain); Istituto Trentino di Cultura (IRST, Italy); Konrad-Zuse-Zentrum für Informationstechnik Berlin (Germany); Royal Netherlands Meteorological Institute (KNMI); Ruprecht-Karls-Universität Heidelberg (Germany); Stichting Academisch Rekencentrum Amsterdam (SARA, Netherlands); Swedish Natural Science Research Council (NFR, Sweden)
Industry partners: Datamat (Italy); IBM (UK); Compagnie des Signaux (France)
The Middleware Working Group coordinates the development of the software modules, leveraging existing and long-tested open standard solutions. Five parallel development teams implement the software: job scheduling, data management, grid monitoring, fabric management and mass storage management.
The Infrastructure Working Group focuses on integrating the middleware with systems and networks, providing testbeds that demonstrate the effectiveness of DataGrid in production-quality operation over high-performance networks.
The Applications Working Group exploits the project developments to process the large amounts of data produced by experiments in the fields of High Energy Physics (HEP), Earth Observation (EO) and Biology.
The Management Working Group is in charge of the day-to-day coordination of the entire project and of disseminating the results to industry and research institutes.
DataGrid middleware services: fabric management, mass storage management, data management, workload management, monitoring services, plus other Grid middleware services (information, security)
[Diagram: DataGrid architecture. Local application and local database sit on top of a Grid application layer (job management, data management, metadata management, object-to-file mapping); collective services (information & monitoring, replica manager, grid scheduler); underlying Grid services (computing element services, storage element services, replica catalog, authorization, authentication and accounting, SQL database services, service index); fabric services (configuration management, node installation & management, monitoring and fault tolerance, resource management, fabric storage management) on the local computing fabric]
DataGrid achievements
Testbed 1: first release of the EDG middleware
First workload management system: a “super-scheduling” component using application data and computing element requirements
File replication tools (GDMP), Replica Catalog (see the sketch below), SQL Grid Database Service, …
Tools for farm installation and configuration
…
Used for real production demos
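Since the Testbed 1 Replica Catalog was LDAP-based (see the ReplicaCatalog URL in the job submission example that follows), its content could in principle be browsed with a standard LDAP client; the query below is only a sketch, as the catalogue schema is not shown in the talk:

  ldapsearch -x -h sunlab2g.cnaf.infn.it -p 2010 \
    -b "rc=WP2 INFN Test Replica Catalog,dc=sunlab2g,dc=cnaf,dc=infn,dc=it" \
    "(objectClass=*)"     # dump the catalogue subtree (logical files and their replicas)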
Job submission scenario
dg-job-submit myjob.jdl

myjob.jdl:
Executable         = "$(CMS)/exe/sum.exe";
InputData          = "LF:testbed ";
ReplicaCatalog     = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g,dc=cnaf,dc=infn,dc=it";
DataAccessProtocol = "gridftp";
InputSandbox       = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
OutputSandbox      = {"sim.err", "test.out", "sim.log"};
Requirements       = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
Rank               = other.FreeCPUs;
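Beyond submission, the EDG workload management CLI of that period also let the user follow a job through its lifecycle. Only dg-job-submit appears in the talk; the other command names below are assumptions about the same command family, so treat this as a sketch rather than a reference:

  dg-job-list-match myjob.jdl     # assumed: ask the broker which computing elements satisfy the Requirements expression
  dg-job-submit myjob.jdl         # submit the job; prints a job identifier
  dg-job-status <job-id>          # assumed: follow the job through its states (submitted, scheduled, running, done)
  dg-job-get-output <job-id>      # assumed: retrieve the files listed in OutputSandbox

In this ClassAd-style matchmaking, Requirements must evaluate to true for a computing element to be eligible, while Rank orders the eligible elements, so in this example the broker would prefer the computing element with the most free CPUs.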
Other HEP Grid initiatives
PPDG (US)
GriPhyN (US)
DataTAG & iVDGL: transatlantic testbeds
HENP InterGrid Coordination Board
LHC Computing Grid Project
Grid approach not only for HEP applications …
Mathematicians solve NUG30
Looking for the solution to the NUG30 quadratic assignment problem
An informal collaboration of mathematicians and computer scientists
Condor-G delivered 3.46E8 CPU-seconds in 7 days (peak: 1009 processors) across 8 sites in the U.S. and Italy
The solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
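For comparison with the EDG JDL above, a hedged sketch of what a Condor-G submit description of that era looked like; this is not taken from the NUG30 run itself, and the gatekeeper contact string, file names and exact keywords are illustrative and should be checked against the Condor documentation of the time:

  # nug30.sub -- illustrative Condor-G submit description
  universe        = globus
  globusscheduler = gatekeeper.example.edu/jobmanager-pbs   # remote GRAM gatekeeper and its local batch system
  executable      = solve_subproblem
  arguments       = chunk.$(Process)
  output          = out.$(Process)
  error           = err.$(Process)
  log             = nug30.log
  queue 100                                                 # 100 independent sub-problems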
Network for Earthquake Engineering Simulation
NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, and each other
On-demand access to experiments, data streams, computing, archives, collaboration
NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
Global Grid Forum
Mission: to focus on the promotion and development of Grid technologies and applications via the development and documentation of “best practices,” implementation guidelines and standards, with an emphasis on “rough consensus and running code”
An open process for the development of standards
A forum for information exchange
A regular gathering to encourage shared effort
Summary
Regional Centres and the multi-tier model are the envisaged approach to the LHC computing challenge
Many issues still to be addressed
The Grid approach: many problems still to be solved and R&D still required, but some tools and frameworks are already available and are being used for real applications