QoS in the Tier1 batch system (LSF). Alessandro Italiano (INFN-CNAF), Tier1 - Farming Group
QoS definition. From Wikipedia (http://en.wikipedia.org/wiki/QoS): Quality of service is the ability to provide different priorities to different applications and users in resource usage. QoS mechanisms are not required if there is no resource contention.
Tier1 scenario: More than 20 different experiments (applications). Each experiment has several computing activities with different priorities. Each year the Tier1 committee defines the maximum amount of resources that each experiment can use.
FairShare definition. From the LSF documentation: Fairshare scheduling divides the processing power of the LSF cluster among users to provide fair access to resources, so that no user or subgroup of users can monopolize the resources of the cluster.
Hierarchical FairShare, a first level of QoS: Defines dynamic priorities for every group and subgroup. Dynamically grants a resource quota to each group and subgroup. Applied only where there is resource contention. Optimizes resource usage.
Hierarchical Fairshare: Parameters. Share assigned: the percentage of resources assigned to every group and subgroup. Resource usage. Time window: the time slot used to compute the total amount of resources used by every group. Normalization factors (Nf). Dynamic priority (DP) formula: DP = Share / ((ResourceUsage x Nf) + 1). A group's priority therefore falls as its usage within the time window grows.
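The dynamic priority formula, DP = Share / ((ResourceUsage x Nf) + 1), can be illustrated with a minimal sketch. It covers only the simplified formula shown on this slide; the function name, the sample values and the use of a single normalization factor are assumptions made for illustration, not the way LSF itself is configured.

```python
def dynamic_priority(share: float, resource_usage: float, nf: float) -> float:
    """Simplified dynamic priority: DP = share / (resource_usage * nf + 1)."""
    if share < 0 or resource_usage < 0 or nf < 0:
        raise ValueError("share, resource_usage and nf must be non-negative")
    return share / (resource_usage * nf + 1.0)


if __name__ == "__main__":
    # Hypothetical values: two groups with the same share but different usage.
    idle = dynamic_priority(share=1000, resource_usage=0.0, nf=0.001)
    busy = dynamic_priority(share=1000, resource_usage=9000.0, nf=0.001)
    # The group that has consumed more resources gets the lower priority.
    print(idle, busy)
```

With equal shares, the group that used more resources in the time window is scheduled after the idle one, which is the behaviour the formula is meant to produce.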
Hierarchical Fairshare: How it works
Available Resources: CMS share = 30, ALICE share = 27
CMS subgroups: cms rel share = 15, abs share = 4.5; cmsprd rel share = 70, abs share = 21; cmssgm
ALICE subgroups: alice abs share = 4.05; alicesgm rel share = 85, abs share = 22.95

SHARE_INFO_FOR: SLC4_GLOBAL/
USER/GROUP      SHARES  PRIORITY  STARTED  RESERVED    CPU_TIME  RUN_TIME
group_test       10000  3236.948        0         0      3621.6      4588
group_admin       1000   333.333        0         0         0.0         0
group_dteam       1000   333.325        0         0         0.8         4
group_egee        1000   325.245        0         0        16.4      3836
group_ops         1000   142.119        1         0     26107.9     53260
group_magic        174    55.994        0         0      4420.6      5522
group_ams           45    15.000        0         0         0.0         0
group_ingv          31    10.333        0         0         0.0         0
group_theophys      31    10.333        0         0         0.0         0
group_biomed        31    10.333        0         0         0.0         0
group_t1bio         31    10.333        0         0         0.0         0
group_cdfcaf      1616     3.266       34         0   7585199.5  20034766
group_infngrid       6     2.000        0         0         0.3         6
group_pamela        35     1.111        0         0    821776.2   1464160
group_lhcb        1355     0.883      153         0    813004.1  55167538
u_cms             1665     0.809      449         0  14599556.0  36415071
group_babar       1691     0.784      388         0  15616334.0  50845584
u_atlas           1514     0.640      451         0  21058258.0  51814853
group_alice       1041     0.638      439         0   1478367.4  15974277
group_argo         401     0.637      159         0   3155703.0   7686056
group_virgo        348     0.392       33         0     51287.3  40428843
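The share tree on this slide is consistent with each subgroup's absolute share being its parent's share multiplied by the subgroup's relative share in percent (for example 30 x 15% = 4.5 for cms). A minimal sketch of that arithmetic follows; the function name is hypothetical, and only the subgroups whose relative share is stated on the slide are used.

```python
def absolute_shares(parent_share: float, relative_shares: dict) -> dict:
    """Split a parent's share among subgroups given relative shares in percent."""
    return {name: parent_share * rel / 100 for name, rel in relative_shares.items()}


if __name__ == "__main__":
    # Values taken from the slide: CMS share = 30, ALICE share = 27.
    cms = absolute_shares(30, {"cms": 15, "cmsprd": 70})
    alice = absolute_shares(27, {"alicesgm": 85})
    print(cms)
    print(alice)
```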
Hierarchical Fairshare: Constraint. When there is no intra-VO resource contention, a single user can use all the resources available to their experiment. As a result, all the other users, including those belonging to high-priority groups, may wait a long time before their jobs run.
LSF SLA: Second level of QoS. LSF Service Level Agreements (SLAs) are batch system features that provide goal-oriented service levels. Four kinds of goal are available: Deadline: complete a specified number of jobs within a time window. Velocity: keep a specified number of jobs running during a time window; used for short jobs. Throughput: complete a specified number of jobs per hour; used for medium and long jobs. A combination of different goals.
LSF SLA: Constraint. You cannot configure a specific queue or user subgroup to use an SLA, because an SLA can only be selected at submission time. To work around this limitation, the batch manager can provide an automatic hook that grants the specific SLA to each user or subgroup.
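One way such an automatic hook could look is sketched below, assuming it is implemented as an LSF esub-style submission script. The user-to-SLA table and the service class names are hypothetical, and the LSB_SUB_MODIFY_FILE variable and LSB_SUB_SLA option are written from memory of the esub interface, so they should be checked against the documentation of the installed LSF version.

```python
import getpass
import os

# Hypothetical mapping from local account to service class name.
SLA_BY_USER = {
    "cmssgm": "sla_high",
    "cmsprd": "sla_medium",
}


def sla_for_user(user, table=SLA_BY_USER):
    """Return the service class for a user, or None if none is configured."""
    return table.get(user)


def modify_line(sla):
    """Format the line asking LSF to attach the job to a service class.

    Assumption: the option is named LSB_SUB_SLA in the esub interface.
    """
    return f'LSB_SUB_SLA="{sla}"\n'


def main():
    sla = sla_for_user(getpass.getuser())
    # Assumption: LSF passes the path of the modification file in this variable.
    modify_file = os.environ.get("LSB_SUB_MODIFY_FILE")
    # Outside a submission context, or for unmapped users, change nothing.
    if sla is None or modify_file is None:
        return
    with open(modify_file, "a") as handle:
        handle.write(modify_line(sla))


if __name__ == "__main__":
    main()
```

Keeping the lookup in a plain table means the SLA assignment is decided by the batch manager at submission time, and users do not have to request a service class themselves.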
A detail which can improve QoS: one queue for each application, in order to customize the execution environment and make the administration of application requirements easier. Per-queue settings include: run-time resource limits; dedicated computing resources; specific computing architectures; queue administrator; scheduling parameters; pre- and post-execution scripts; ...
How can the GRID match the right service class? GRID role -> LSF QoS: Role: cms -> QoS: Low Priority; Role: cmsprd -> QoS: Medium Priority; Role: cmssgm -> QoS: High Priority.
Matching service class: Statically. GRID role -> LSF QoS: Role: cms -> QoS: Low Priority; Role: cmsprd -> QoS: Medium Priority; Role: cmssgm -> QoS: High Priority.
lcmaps configuration file:
"/VO=cms/GROUP=/cms/ROLE=lcgadmin" cmssgm
"/VO=cms/GROUP=/cms/ROLE=production" .cmsprd
"/VO=cms/GROUP=/cms/HeavyIons/" .cms
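The static matching can be sketched as a two-step lookup: the lcmaps entries map a Grid attribute string to a local account, and the account determines the QoS level listed on the slide. This is an illustrative sketch only: the function names are hypothetical, matching is by exact string (real LCMAPS matching rules are richer), and treating a leading dot as a prefix to strip is an assumption.

```python
import shlex

# The three mapping entries shown on the slide.
MAPFILE = '''
"/VO=cms/GROUP=/cms/ROLE=lcgadmin" cmssgm
"/VO=cms/GROUP=/cms/ROLE=production" .cmsprd
"/VO=cms/GROUP=/cms/HeavyIons/" .cms
'''

# QoS level per role, as listed on the slide.
QOS_BY_ACCOUNT = {
    "cms": "Low Priority",
    "cmsprd": "Medium Priority",
    "cmssgm": "High Priority",
}


def parse_mapfile(text):
    """Return a dict mapping each quoted attribute string to its account name."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fqan, account = shlex.split(line)
        # Assumption: a leading dot is only a prefix and can be stripped here.
        mapping[fqan] = account.lstrip(".")
    return mapping


def qos_for_fqan(fqan, mapping):
    """Look up the QoS level for an attribute string, or None if unmapped."""
    return QOS_BY_ACCOUNT.get(mapping.get(fqan))


if __name__ == "__main__":
    mapping = parse_mapfile(MAPFILE)
    print(qos_for_fqan("/VO=cms/GROUP=/cms/ROLE=production", mapping))
```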
Matching service class: Dynamically, using GPBox. GRID role -> LSF QoS: Role: cms -> QoS: Low Priority; Role: cmsprd -> QoS: Medium Priority; Role: cmssgm -> QoS: High Priority.