Miron Livny, Computer Sciences Department, University of Wisconsin-Madison
Harnessing the Capacity of Computational Grids for High Energy Physics
Jim Basney, Miron Livny, Paolo Mazzanti
Background › This work is the result of an ongoing collaboration between the Condor Team at the University of Wisconsin-Madison and the Bologna section of INFN. › The collaboration started in 1996. › An INFN Condor pool with more than 170 CPUs is serving the INFN community. › New features were developed and tested as a result of this collaboration.
Data Transfer Challenge › To harness for HEP the processing capacity of large collections of commodity computing resources (desktops and clusters), we need effective mechanisms and policies to manage the transfer and placement of checkpoint and data files, and a means to establish affinity between execution sites and data storage sites.
Need to take into account › Network topology and capabilities › Distribution, capabilities and availability of storage resources › Distribution, capacity and availability of computing resources › Impact on interactive users
The Condor HTC System › Condor is a distributed job and resource management system that employs a novel matchmaking approach to allocate resources to jobs. › Symmetric - Requests and Offers › Open - No centralized schema › Dynamic - Easy to change information and semantics › Expressive - Full power of Boolean expressions
ClassAd examples
Resource Offer
[
  OpSys = "Solaris2.6";
  Arch = "Sun4u";
  Memory = 256;
  LoadAvg = 0.25;
  Cluster = "UWCS";
  Requirements = My.LoadAvg < 0.3;
  Rank = (Target.Group == "AI");
]
Resource Request
[
  Group = "AI";
  Requirements = Target.Memory > 80 && Target.OpSys == "Solaris2.6" && Target.Arch == "Sun4u";
  Rank = (Target.Cluster == "UWCS");
]
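The symmetric matching of the two ads above can be sketched in Python. This is only an illustrative model, not Condor's ClassAd implementation: the ads are plain dictionaries, `Requirements` and `Rank` are written as Python lambdas taking `(my, target)`, and the helpers `matches`, `rank`, and `best_offer` are hypothetical names. Missing attributes are simply treated as non-matching, which simplifies real ClassAd evaluation.

```python
# Hypothetical Python model of the two ads on the slide; not Condor's ClassAd API.
offer = {
    "OpSys": "Solaris2.6",
    "Arch": "Sun4u",
    "Memory": 256,
    "LoadAvg": 0.25,
    "Cluster": "UWCS",
    # The resource only accepts work while it is lightly loaded.
    "Requirements": lambda my, target: my["LoadAvg"] < 0.3,
    # The resource prefers jobs from the "AI" group.
    "Rank": lambda my, target: float(target.get("Group") == "AI"),
}

request = {
    "Group": "AI",
    # The job needs enough memory and a specific platform.
    "Requirements": lambda my, target: (
        target.get("Memory", 0) > 80
        and target.get("OpSys") == "Solaris2.6"
        and target.get("Arch") == "Sun4u"
    ),
    # The job prefers machines in the "UWCS" cluster.
    "Rank": lambda my, target: float(target.get("Cluster") == "UWCS"),
}


def matches(a, b):
    """Symmetric match: each ad's Requirements must hold against the other ad."""
    return bool(a["Requirements"](a, b)) and bool(b["Requirements"](b, a))


def rank(my, target):
    """How much `my` prefers `target`, according to my own Rank expression."""
    return my["Rank"](my, target)


def best_offer(req, offers):
    """Return the matching offer the request ranks highest, or None."""
    candidates = [o for o in offers if matches(req, o)]
    return max(candidates, key=lambda o: rank(req, o), default=None)
```

In this sketch, `matches(request, offer)` holds because both `Requirements` expressions are satisfied, and each side's `Rank` expresses a preference among acceptable partners rather than a hard constraint.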
Checkpoint Domains › Every computational resource belongs to a checkpoint domain › Jobs can start on any resource › Checkpoint is saved to the local (domain) checkpoint server › Jobs are restarted only on local (domain) computational resources › Checkpoints can migrate
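The checkpoint-domain rules above can be expressed as a small placement policy. The sketch below is a toy model under stated assumptions, not Condor code: the classes `Resource` and `Job`, the functions `can_run`, `save_checkpoint`, and `migrate_checkpoint`, and the domain names in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    name: str
    ckpt_domain: str  # every resource belongs to exactly one checkpoint domain


@dataclass
class Job:
    job_id: int
    # Domain whose checkpoint server holds the job's checkpoint;
    # None until the job has written its first checkpoint.
    ckpt_domain: Optional[str] = None


def can_run(job: Job, resource: Resource) -> bool:
    """A fresh job may start anywhere; a checkpointed job restarts only in its domain."""
    return job.ckpt_domain is None or job.ckpt_domain == resource.ckpt_domain


def save_checkpoint(job: Job, resource: Resource) -> None:
    """Checkpoint to the server of the domain where the job is currently running."""
    job.ckpt_domain = resource.ckpt_domain


def migrate_checkpoint(job: Job, new_domain: str) -> None:
    """Move the checkpoint to another domain, which changes where the job may restart."""
    job.ckpt_domain = new_domain
```

For example, a job that checkpoints on a resource in a hypothetical "bologna" domain can afterwards restart only on "bologna" resources, until its checkpoint is migrated to another domain.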
I/O Domains › Each resource belongs to an I/O domain. A domain may consist of a single machine. › User stages input data on storage devices and updates the ClassAds of the jobs and/or the resources to reflect the location and availability of the data. › User is responsible for moving output data to the storage system. › Condor monitors and reports I/O activity performed via remote I/O.
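One way to picture how staged data could influence placement is to record the data's I/O domain in the job's ad and prefer resources in that domain. The sketch below assumes this approach for illustration only; the attribute names `IODomain` and `InputIODomain` and the helper functions are hypothetical and do not come from Condor.

```python
def stage_input(job_ad, io_domain):
    """Return a copy of the job ad recording where the user staged the input data."""
    return {**job_ad, "InputIODomain": io_domain}


def affinity_rank(job_ad, resource_ad):
    """1.0 if the resource sits in the I/O domain holding the job's input, else 0.0."""
    job_domain = job_ad.get("InputIODomain")
    if job_domain is None:
        return 0.0  # no staged data recorded, so no preference
    return 1.0 if resource_ad.get("IODomain") == job_domain else 0.0


def order_by_affinity(job_ad, resource_ads):
    """Sort resources so those close to the job's data come first (stable sort)."""
    return sorted(resource_ads, key=lambda r: affinity_rank(job_ad, r), reverse=True)
```

Here affinity is a preference rather than a requirement, so a job can still run outside the domain that holds its data and fall back on remote I/O.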
Ongoing I/O-related work › Improve the performance and mapping capabilities of Condor's Remote I/O. › Provide interfaces to SRB (SDSC), SAM (FERMI) and CORBA (LBL) data storage systems. › Support co-scheduling of processing and network resources. › Develop staging services and interface them with the matchmaking framework. › Extend reporting and monitoring capabilities.
Visit us at