Published by Ferdinand Neil McDonald. Modified over 9 years ago.
1
Condor on WAN
D. Bortolotti - INFN Bologna
T. Ferrari - INFN Cnaf
A. Ghiselli - INFN Cnaf
P. Mazzanti - INFN Bologna
F. Prelz - INFN Milano
F. Semeria - INFN Bologna
M. Sgaravatto - INFN Padova
C. Vistoli - INFN Cnaf
CHEP 2000, Padova, 7-11 February 2000
2
Massimo Sgaravatto INFN Padova
Introduction
High Throughput Computing (HTC) system needed
– to meet the requirements of INFN users
– to exploit the huge CPU capacity distributed across all INFN sites, with distributed ownership
– Candidate: Condor
Condor philosophy and characteristics meet INFN requirements
Condor on WAN project: collaboration with the Condor Team, Univ. of Wisconsin - Madison
3
Condor
Supports High Throughput Computing in large, distributively owned environments
– Harnesses the power of non-dedicated resources
Distinguishing features:
– Checkpointing
– Remote I/O
– ClassAds
4
Test phase
Implementation of an experimental Condor WAN pool
Objectives:
– Verify reliability and robustness on WAN
– Verify suitability for INFN requirements
5
Test phase: results
Very good performance for CPU-intensive jobs
Less efficient CPU usage for I/O-intensive jobs
– Uniform file system
– Caching, dedicated file systems,...
Necessary to:
– be able to guarantee priorities on resource usage for specific applications
– guarantee the overall efficiency of the system (adequate location of checkpoint servers)
6
Implementation phase
Characteristics of the INFN Condor pool:
Single pool
– To optimize CPU usage of all INFN hosts
Sub-pools
– To define policies/priorities on resource usage
Checkpoint domains
– To guarantee the performance and the efficiency of the system
– To reduce network traffic for checkpointing activity
7
Sub-pool
Collaboration machines (i.e. workstations belonging to the same research group) configured to prioritize the jobs of collaboration users
– Local to a single INFN site
– Distributed between different sites
Possibility to define different policies on resource usage
– Example:
  High priority: Condor jobs of a specific research group
  Middle priority: Condor jobs of local (same site) users
  Low priority: Condor jobs of remote users
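The three-tier example can be sketched in a few lines of Python. This is only an illustration of the ordering, not Condor configuration syntax: the `Job` and `Machine` fields, the numeric tier values, and the group and user names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Job:
    owner: str
    group: str  # research group of the submitting user (hypothetical field)
    site: str   # site the job was submitted from (hypothetical field)


@dataclass(frozen=True)
class Machine:
    name: str
    group: str  # research group that owns the machine
    site: str   # site hosting the machine


# Assumed numeric tiers; only their relative order matters.
HIGH, MIDDLE, LOW = 2, 1, 0


def machine_rank(machine: Machine, job: Job) -> int:
    """Rank a job from the machine's point of view, following the slide's tiers."""
    if job.group == machine.group:
        return HIGH    # job of the research group that owns the machine
    if job.site == machine.site:
        return MIDDLE  # job of a local (same site) user
    return LOW         # job of a remote user


def preferred_job(machine: Machine, queue: List[Job]) -> Job:
    """Return the queued job the machine prefers; ties keep queue order."""
    return max(queue, key=lambda job: machine_rank(machine, job))
```

For a machine owned by a hypothetical `groupA` at Padova, a `groupA` job submitted from Milano would be preferred over a Padova job from another group, which in turn would be preferred over a remote job from another group.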
8
WAN Checkpoint needs
Checkpoint accomplished in a short time (even for “huge” checkpoint files)
– to let the owner access the machine without delay
– to increase the probability of a successful checkpoint (without losing the ckpt file because of a network timeout)
Limit and control network traffic
Checkpoint policies must not reduce job computing throughput
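A rough estimate shows why bandwidth to the checkpoint server matters for these needs. The sketch below computes an idealized transfer time (file size divided by bandwidth, ignoring protocol overhead and competing traffic) and compares it with a timeout. The function names, file sizes, bandwidths and timeout are hypothetical values chosen for illustration; none come from the presentation.

```python
def transfer_time_s(size_mbytes: float, bandwidth_mbps: float) -> float:
    """Idealized time in seconds to move a checkpoint file over the network.

    size_mbytes is in megabytes, bandwidth_mbps in megabits per second,
    hence the factor of 8. Overhead and contention are ignored.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    return size_mbytes * 8 / bandwidth_mbps


def completes_before_timeout(size_mbytes: float, bandwidth_mbps: float,
                             timeout_s: float) -> bool:
    """True if the idealized transfer would finish within the timeout."""
    return transfer_time_s(size_mbytes, bandwidth_mbps) <= timeout_s
```

Under these assumptions a 100 MB checkpoint file needs 400 s over a 2 Mbps path but only about 5 s over a 155 Mbps path, so with a hypothetical 300 s timeout only the second transfer would succeed.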
9
Checkpoint domains
Solution: checkpoint domains
– Pool partitioned into checkpoint domains (a dedicated ckpt server for each domain)
– Definition of a checkpoint domain according to:
  Presence of a sufficiently large CPU capacity
  Presence of a set of machines with efficient network connectivity
  Sub-pools
10
INFN Condor Pool on WAN: checkpoint domains
[Figure: map of the GARR-B topology (155 Mbps ATM-based network) showing network access points (PoP) and main transport nodes, with the checkpoint domains and the number of hosts in each. Sites shown: TORINO, PADOVA, BARI, PALERMO, FIRENZE, PAVIA, MILANO, GENOVA, NAPOLI, CAGLIARI, TRIESTE, ROMA, ROMA2, PISA, L’AQUILA, CATANIA, BOLOGNA, UDINE, TRENTO, PERUGIA, LNF, LNGS, SASSARI, LECCE, LNS, LNL, SALERNO, COSENZA, S.Piero, FERRARA, PARMA, CNAF. USA links labelled 155 Mbps and T3 (EsNet). The Central Manager and the default CKPT domain are at Cnaf. Figure annotations: "500-1000 machines", "6 ckpt servers", "25 ckpt servers".]
11
Network as resource
In a distributed environment the network is a resource
– Bandwidth between the executing machine and the checkpoint server is a ClassAd attribute, dynamically updated
The job selects a machine taking into account also its checkpoint characteristics
The job checkpoint policy is defined in the job submit file
12
Job policies/1
Nearest policy
– the job prefers a machine in the same checkpoint domain, always selecting the one with the highest bandwidth to the ckpt server.
rank = (CkptServer =?= LastCkptServer) * 100 + CkptBW
13
Job policies/2
At least N-Mbps policy
– the job prefers a machine in the same checkpoint domain, always selecting the one with the highest bandwidth to the ckpt server; that bandwidth must be greater than N
rank = (CkptServer =?= LastCkptServer) * 100 + CkptBW
requirements = CkptBW > N
14
Job policies/3
Fixed policy
– the job only selects machines in the same checkpoint domain, always selecting the one with the highest bandwidth to the checkpoint server: a job can’t move between checkpoint domains (suitable for very large jobs)
requirements = (CkptServer =?= LastCkptServer || LastCkptServer =?= UNDEFINED)
rank = CkptBW
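To see how the three job policies differ, the rank and requirements expressions can be mimicked in a small Python model. This is a sketch of the job-side selection only, not the Condor matchmaker: the `MachineAd` class, the `select` helper and the machine and server names are assumptions, Python's `None` stands in for UNDEFINED, and `==` stands in for `=?=`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class MachineAd:
    name: str
    ckpt_server: str  # models the CkptServer attribute
    ckpt_bw: float    # models the CkptBW attribute


def select(machines: List[MachineAd],
           requirements: Callable[[MachineAd], bool],
           rank: Callable[[MachineAd], float]) -> Optional[MachineAd]:
    """Keep machines that satisfy requirements, then pick the highest rank."""
    eligible = [m for m in machines if requirements(m)]
    return max(eligible, key=rank, default=None)


def nearest(machines: List[MachineAd],
            last: Optional[str]) -> Optional[MachineAd]:
    # rank = (CkptServer =?= LastCkptServer) * 100 + CkptBW
    return select(machines, lambda m: True,
                  lambda m: (m.ckpt_server == last) * 100 + m.ckpt_bw)


def at_least(machines: List[MachineAd], last: Optional[str],
             n: float) -> Optional[MachineAd]:
    # Same rank as the nearest policy, plus requirements = CkptBW > N
    return select(machines, lambda m: m.ckpt_bw > n,
                  lambda m: (m.ckpt_server == last) * 100 + m.ckpt_bw)


def fixed(machines: List[MachineAd],
          last: Optional[str]) -> Optional[MachineAd]:
    # requirements = (CkptServer =?= LastCkptServer || LastCkptServer =?= UNDEFINED)
    # rank = CkptBW
    return select(machines,
                  lambda m: m.ckpt_server == last or last is None,
                  lambda m: m.ckpt_bw)


pool = [
    MachineAd("bo-1", "ckpt-bo", 10),
    MachineAd("bo-2", "ckpt-bo", 30),
    MachineAd("pd-1", "ckpt-pd", 80),
]
```

With this hypothetical pool and a job that last checkpointed to `ckpt-bo`, the nearest and fixed policies both pick `bo-2`, while the at-least policy with N = 50 moves the job to `pd-1`, the only machine above the threshold. In the nearest rank, the same-domain bonus of 100 only dominates while bandwidth differences stay below 100.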
15
Checkpointing: next step
Distributed dynamic checkpointing
– Pool machines select the “best” checkpoint server (from a network point of view)
– The association between execution machine and checkpoint server is decided dynamically
16
Distributed dynamic checkpointing
– Network Manager
  Creates and keeps up to date the Network ClassAds between pool machines and checkpoint servers
Network ClassAds are used by pool machines to select the “closest” checkpoint server
[Diagram labels: Central Manager (Collector, C; Negotiator, N; Network Manager, NM); Submitting Machine (Customer Agent, CA); Executing Machine (Resource Agent, RA)]
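The selection step can be sketched as follows, assuming that "closest" means the checkpoint server with the highest measured bandwidth to the machine. The slides do not define the metric, so that choice, the `NetworkAd` class and all names below are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NetworkAd:
    """Hypothetical model of one Network ClassAd kept by the Network Manager."""
    machine: str
    ckpt_server: str
    bandwidth: float  # measured bandwidth between machine and server


def closest_ckpt_server(machine: str, ads: List[NetworkAd]) -> Optional[str]:
    """Return the checkpoint server with the highest bandwidth to the machine.

    Returns None when no ad describes this machine.
    """
    candidates = [ad for ad in ads if ad.machine == machine]
    if not candidates:
        return None
    return max(candidates, key=lambda ad: ad.bandwidth).ckpt_server
```

Because the ads are refreshed over time, the same machine can be associated with different checkpoint servers at different moments: calling `closest_ckpt_server` again on an updated list of ads is what makes the association dynamic in this model.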
17
INFN Condor pool usage
Different kinds of applications
Condor users are happy: high computing throughput is achieved
Allocation time for Condor jobs since February 1999: > 36 years
18
19
Example
20
Conclusions
Efficiency and robustness of the Condor pool on WAN have been verified
Single pool
– Efficient usage of all resources
Network as resource
– Optimization of checkpoint operations
Sub-pools
– Policies on resource usage
http://www.infn.it/condor