The Grid approach for the HEP computing problem Massimo Sgaravatto INFN Padova
What is a Grid?
  “Dependable, consistent, pervasive access to resources”
  Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals, in the absence of central control, omniscience, and trust relationships
  Make it easy to use diverse, geographically distributed, locally managed and controlled computing facilities as if they formed a coherent local cluster
What does the Grid do for you?
  You submit your work, and the Grid:
    “Partitions” your work into convenient execution units based on the available resources, data distribution, … (if there is scope for parallelism)
    Finds convenient places for it to be run
    Organises efficient access to your data (caching, migration, replication)
    Deals with authentication and authorization to the different sites that you will be using
    Interfaces to local site resource allocation mechanisms and policies
    Runs your jobs
    Monitors progress
    Recovers from problems
    Tells you when your work is complete
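The “partitioning” step can be illustrated with a minimal sketch. It assumes the work is a set of independent events identified by index and that the number of execution units is already known; the function name `partition_events` and its parameters are hypothetical and are not part of any Grid middleware.

```python
from typing import List


def partition_events(n_events: int, n_units: int) -> List[range]:
    """Split event indices 0..n_events-1 into contiguous, near-equal ranges.

    Hypothetical helper: a real Grid scheduler would also weigh data
    location and site policies, which this sketch ignores.
    """
    if n_events < 0 or n_units <= 0:
        raise ValueError("n_events must be >= 0 and n_units must be > 0")
    # Never create more units than there are events (but always at least one).
    n_units = max(1, min(n_units, n_events))
    base, extra = divmod(n_events, n_units)
    units, start = [], 0
    for i in range(n_units):
        # The first `extra` units take one additional event each.
        size = base + (1 if i < extra else 0)
        units.append(range(start, start + size))
        start += size
    return units


units = partition_events(10, 3)
for i, unit in enumerate(units):
    print(f"unit {i}: events {unit.start}..{unit.stop - 1}")
```

Contiguous ranges are used here only because they are easy to describe in a job specification; any split that covers every event exactly once would serve equally well.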
Grid approach in many sciences and disciplines …
Mathematicians Solve NUG30
  Looking for the solution to the NUG30 quadratic assignment problem
  An informal collaboration of mathematicians and computer scientists
  Condor-G delivered 3.46E8 CPU seconds in 7 days (peak 1009 processors) in U.S. and Italy (8 sites)
  Solution: 14,5,28,24,1,3,16,15,10,9,21,2,4,29,25,22,13,26,17,30,6,20,19,8,18,7,27,12,11,23
  MetaNEOS: Argonne, Iowa, Northwestern, Wisconsin
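As a rough consistency check on the NUG30 figures, the delivered CPU time can be converted into an average number of busy processors. The sketch below assumes the run lasted exactly 7 days of wall-clock time, which the slide gives only approximately.

```python
CPU_SECONDS_DELIVERED = 3.46e8   # from the slide
WALL_CLOCK_DAYS = 7              # from the slide, taken as exact (assumption)
PEAK_PROCESSORS = 1009           # from the slide

wall_clock_seconds = WALL_CLOCK_DAYS * 24 * 3600
# Average number of processors kept busy over the whole run.
average_processors = CPU_SECONDS_DELIVERED / wall_clock_seconds
# Fraction of the peak that was sustained on average.
average_to_peak = average_processors / PEAK_PROCESSORS

print(f"average processors: {average_processors:.0f}")
print(f"average / peak:     {average_to_peak:.2f}")
```

Under that assumption the average comes out at somewhat more than half the peak, which fits a pool of opportunistically available machines that grows and shrinks during the run.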
Network for Earthquake Engineering Simulation
  NEESgrid: national infrastructure to couple earthquake engineers with experimental facilities, databases, computers, & each other
  On-demand access to experiments, data streams, computing, archives, collaboration
  NEESgrid: Argonne, Michigan, NCSA, UIUC, USC
Grid approach to address the High Energy Physics (HEP) computing problem
HEP computing characteristics
  Large numbers of independent events to process
  Large data sets, mostly read-only
  Modest floating point requirement
  Batch processing for production & selection, interactive for analysis
  Commodity components are just fine for HEP
  Very large aggregate requirements: computation, data
The LHC challenge
  Jump in orders of magnitude with respect to previous experiments
  Geographical dispersion of people and of resources
  Scale
    Petabytes per year of data
    Thousands of processors
    Thousands of disks
    Terabits/second of I/O bandwidth
    …
  Complexity
  Lifetime (20 years)
  …
CMS: 1800 physicists, 150 institutes, 32 countries
  World Wide Collaboration
  Distributed computing & storage capacity
Solution? Regional Computing Centres
  Serve better the needs of the world-wide distributed community
  Data available nearby
  Reduce dependence on links to CERN
  Exploit established computing expertise & infrastructure in national labs, universities
  See
[Figure: hierarchy of computing centres, from the Online System and CERN down to physicist workstations]
  Tier labels in the figure: Tier 0, Tier 1, Tier 2, Tier 4
  Nodes: Online System; Offline Processor Farm (~20 TIPS); CERN Computer Centre; FermiLab (~4 TIPS); France Regional Centre; Italy Regional Centre; Germany Regional Centre; Tier2 Centre (~1 TIPS); Caltech (~1 TIPS); Institute (~0.25 TIPS); Physics data cache; Physicist workstations
  Link rates shown: ~PBytes/sec; ~100 MBytes/sec; ~622 Mbits/sec or Air Freight (deprecated); ~622 Mbits/sec; ~1 MBytes/sec
  There is a “bunch crossing” every 25 nsecs
  There are 100 “triggers” per second
  Each triggered event is ~1 MByte in size
  Physicists work on analysis “channels”. Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server
  1 TIPS is approximately 25,000 SpecInt95 equivalents
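The figures quoted on this slide fit together with simple arithmetic, sketched below. The bunch spacing, trigger rate, event size and TIPS conversion are taken from the slide; the number of seconds of data taking per year (`LIVE_SECONDS_PER_YEAR`) is an assumption introduced for illustration and does not appear in the source.

```python
BUNCH_SPACING_NS = 25            # one bunch crossing every 25 ns (slide)
TRIGGERS_PER_SEC = 100           # triggered events per second (slide)
EVENT_SIZE_MB = 1                # ~1 MByte per triggered event (slide)
SPECINT95_PER_TIPS = 25_000      # 1 TIPS ~ 25,000 SpecInt95 (slide)
OFFLINE_FARM_TIPS = 20           # Offline Processor Farm (slide)
LIVE_SECONDS_PER_YEAR = 10_000_000  # ASSUMPTION: not stated in the source

# Bunch crossings per second, in integer arithmetic to avoid rounding noise.
crossings_per_sec = 1_000_000_000 // BUNCH_SPACING_NS
# How many crossings there are for each event that is kept.
crossings_per_trigger = crossings_per_sec // TRIGGERS_PER_SEC
# Rate of triggered data leaving the online system.
raw_rate_mb_per_sec = TRIGGERS_PER_SEC * EVENT_SIZE_MB
# Yearly volume in decimal petabytes (1 PB = 1e9 MB).
raw_volume_pb_per_year = raw_rate_mb_per_sec * LIVE_SECONDS_PER_YEAR / 1e9
# Offline farm capacity expressed in SpecInt95 equivalents.
offline_farm_specint95 = OFFLINE_FARM_TIPS * SPECINT95_PER_TIPS

print(f"{crossings_per_sec:,} crossings/s, one trigger per {crossings_per_trigger:,} crossings")
print(f"{raw_rate_mb_per_sec} MB/s of triggered data")
print(f"{raw_volume_pb_per_year} PB/year under the assumed live time")
print(f"{offline_farm_specint95:,} SpecInt95 in the offline farm")
```

The 100 MB/s result matches the ~100 MBytes/sec link rate in the figure, and under the assumed live time the yearly volume is of the petabyte order.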
Grid as a possible approach
  Various technical issues to address
    Resource Discovery
    Resource Management: distributed scheduling, optimal co-allocation of CPU, data and network resources, uniform interface to different local resource managers, …
    Data Management: petabyte-scale information volumes, high-speed data movement and replication, replica synchronization, data caching, uniform interface to mass storage management systems, …
    Automated system management techniques for large computing fabrics
    Monitoring Services
    Security: authentication, authorization, …
    Scalability, Robustness, Resilience
  Grid model to address such problems
State (HEP-centric view) circa 2.5 years ago
  Globus project
    Globus toolkit: core services for Grid tools and applications (authentication, information service, resource management, etc.)
  Good basis to build on, but:
    No higher-level services
    Handling of lots of data not addressed
    No production-quality implementations
  Not possible to do real work with Grids yet …
DataGrid Project (EDG)
  Project started Jan 2001, duration 3 years
  Goals
    To build a significant prototype of the LHC computing model
    To collaborate with and complement other European and US projects
    To develop a sustainable computing model applicable to other sciences and industry: biology, earth observation, etc.
  Specific project objectives
    Middleware for fabric & Grid management: evaluation, test, and integration of existing middleware software, and research and development of new software as appropriate
    Large-scale testbed
    Production-quality demonstrations
    Open source and technology transfer
  See
Main Partners
  CERN
  CNRS - France
  ESA/ESRIN - Italy
  INFN - Italy
  NIKHEF - The Netherlands
  PPARC - UK
Associated Partners
  Research and Academic Institutes
    CESNET (Czech Republic)
    Commissariat à l'énergie atomique (CEA) - France
    Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)
    Consiglio Nazionale delle Ricerche (Italy)
    Helsinki Institute of Physics - Finland
    Institut de Fisica d'Altes Energies (IFAE) - Spain
    Istituto Trentino di Cultura (IRST) - Italy
    Konrad-Zuse-Zentrum für Informationstechnik Berlin - Germany
    Royal Netherlands Meteorological Institute (KNMI)
    Ruprecht-Karls-Universität Heidelberg - Germany
    Stichting Academisch Rekencentrum Amsterdam (SARA) - Netherlands
    Swedish Natural Science Research Council (NFR) - Sweden
  Industry Partners
    Datamat (Italy)
    IBM (UK)
    Compagnie des Signaux (France)
The Middleware Working Group coordinates the development of the software modules, leveraging existing, long-tested open standard solutions. Five parallel development teams implement the software: job scheduling, data management, grid monitoring, fabric management and mass storage management.
The Infrastructure Working Group focuses on integrating the middleware with systems and networks, providing testbeds that demonstrate the effectiveness of DataGrid in production-quality operations over high-performance networks.
The Applications Working Group exploits the project developments to process large amounts of data produced by experiments in the fields of High Energy Physics (HEP), Earth Observations (EO) and Biology.
The Management Working Group is in charge of coordinating the entire project on a day-to-day basis and of disseminating the results among industries and research institutes.
[Figure: working groups: Applications, Middleware, Infrastructure, Management, Testbed]
DataGrid Architecture [figure: layered architecture, top to bottom]
  Local Computing
    Local Application, Local Database
  Grid
    Grid Application Layer: Data Management, Job Management, Metadata Management, Object to File Mapping
    Collective Services: Information & Monitoring, Replica Manager, Grid Scheduler
    Underlying Grid Services: Computing Element Services, Authorization Authentication and Accounting, Replica Catalog, Storage Element Services, SQL Database Services, Service Index
  Fabric
    Fabric services: Configuration Management, Node Installation & Management, Monitoring and Fault Tolerance, Resource Management, Fabric Storage Management
DataGrid achievements
  Testbed 1: first release of EDG middleware
    First workload management system: “super scheduling” component using application data and computing element requirements
    File replication tools (GDMP), Replica Catalog, SQL Grid Database Service, …
    Tools for farm installation and configuration
    …
    Used for real productions
  Towards testbed 2: new functionalities and increased reliability
Job submission scenario
  dg-job-submit myjob.jdl
  myjob.jdl:
    Executable = "$(CMS)/exe/sum.exe";
    InputData = "LF:testbed ";
    ReplicaCatalog = "ldap://sunlab2g.cnaf.infn.it:2010/rc=WP2 INFN Test Replica Catalog,dc=sunlab2g, dc=cnaf, dc=infn, dc=it";
    DataAccessProtocol = "gridftp";
    InputSandbox = {"/home/user/WP1testC", "/home/file*", "/home/user/DATA/*"};
    OutputSandbox = {"sim.err", "test.out", "sim.log"};
    Requirements = other.Architecture == "INTEL" && other.OpSys == "LINUX Red Hat 6.2";
    Rank = other.FreeCPUs;
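The roles of `Requirements` and `Rank` in the job description can be illustrated with a toy matchmaker. This is a sketch, not the EDG workload management implementation: the resource list, the resource names and the functions `requirements`, `rank` and `match` are all hypothetical, and it is assumed here that a higher rank is preferred.

```python
# Hypothetical computing elements, described by the attributes the JDL refers to.
resources = [
    {"Name": "ce-a", "Architecture": "INTEL", "OpSys": "LINUX Red Hat 6.2", "FreeCPUs": 4},
    {"Name": "ce-b", "Architecture": "INTEL", "OpSys": "LINUX Red Hat 6.2", "FreeCPUs": 12},
    {"Name": "ce-c", "Architecture": "SPARC", "OpSys": "Solaris", "FreeCPUs": 64},
]


def requirements(other: dict) -> bool:
    """Mirror of the JDL Requirements expression: a resource must satisfy it."""
    return other["Architecture"] == "INTEL" and other["OpSys"] == "LINUX Red Hat 6.2"


def rank(other: dict) -> int:
    """Mirror of the JDL Rank expression: orders the acceptable resources."""
    return other["FreeCPUs"]


def match(candidates: list):
    """Filter by requirements, then pick the highest-ranked resource (or None)."""
    acceptable = [r for r in candidates if requirements(r)]
    return max(acceptable, key=rank) if acceptable else None


best = match(resources)
print(best["Name"] if best else "no matching resource")
```

Note that `ce-c` has the most free CPUs but is never considered, because filtering on the requirements happens before ranking.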
Other HEP Grid initiatives
  PPDG (US)
  GriPhyN (US)
  DataTag & iVDGL: transatlantic testbeds (to address interoperability)
  LCG (LHC Computing Grid Project)
The Grid World: current status
  Dozens of major Grid projects in scientific & technical computing/research & education
  Considerable consensus on key concepts and technologies
    Open source Globus Toolkit™ a de facto standard for major protocols & services
  Industrial interest emerging rapidly
  Opportunity: convergence of eScience and eBusiness requirements & technologies
Problems
  Almost all projects have developed specialized services which have been layered on top of standard services (security, remote job execution, etc.)
  Patchwork of protocols, non-interoperable “standards”, and difficult-to-reuse “implementations”
  Exploit Web Services
Web Services
  Increasingly popular standards-based framework for accessing network applications
    W3C standardization; Microsoft, IBM, Sun, others
  WSDL: Web Services Description Language
    Interface Definition Language for Web services
  SOAP: Simple Object Access Protocol
    XML-based RPC protocol; common WSDL target
  WS-Inspection
    Conventions for locating service descriptions
  UDDI: Universal Description, Discovery, & Integration
    Directory for Web services
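To make “XML-based RPC protocol” concrete, the sketch below builds a SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace `urn:example:job-service`, the operation `getJobStatus` and its `jobId` parameter are hypothetical and do not correspond to any service named in these slides.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:job-service"                  # hypothetical service namespace


def build_request(job_id: str) -> str:
    """Return a SOAP envelope that calls the hypothetical getJobStatus operation."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The RPC call is an element named after the operation, with one child per parameter.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}getJobStatus")
    ET.SubElement(call, f"{{{SERVICE_NS}}}jobId").text = job_id
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("job-42")
print(request_xml)
```

In practice a WSDL document would describe the operation and its parameters, and a client toolkit would generate such messages; building the envelope by hand is only meant to show what travels on the wire.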
Open Grid Services Architecture (OGSA)
  Service orientation
    Computational resources, storage resources, networks, programs, databases, etc. all represented as services
    Allows standard interface definition mechanisms: multiple protocol bindings, multiple implementations, local/remote transparency
  Grid service: web service with semantics for service interactions
    Management of transient instances (& state)
Global Grid Forum
  Mission: to focus on the promotion and development of Grid technologies and applications via the development and documentation of "best practices," implementation guidelines, and standards, with an emphasis on "rough consensus and running code"
  An Open Process for Development of Standards
  A Forum for Information Exchange
  A Regular Gathering to Encourage Shared Effort
  See