CERN - IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
Castor External Operation Face-to-Face Meeting, CNAF, October 29-31, 2007

CASTOR2 Disk Cache Scheduling: LSF, Job Manager and Python Policies
Dennis Waldron, CERN / IT

Outline
LSF limitations, pre 2.1.3
v2.1.3:
  – Resource Monitoring and Shared Memory
  – LSF changes and New Scheduler Plugin
  – Python Policies
v2.1.4:
  – Scheduling Requirements/Problems
  – Job Manager
v2.1.6+
  – Future Developments (v2.1.6 & v2.1.7)

LSF Limitations, pre 2.1.3 releases
What was killing us:
The LSF queue was limited to ~2000 jobs; more than this resulted in instabilities.
LSF jobs remained in PSUSP after a timeout between stager and rmmaster (#17153).
Poor submission rates into LSF: ~10 jobs/second, half of the advertised LSF rate.
RmMaster did not keep node status after a restart (#15832).
Database latency between the LSF plugin (schmod_castor) and the stager DB resulted in poor scheduling performance.
These were just the start!!!
Additional information available at: http://castor.web.cern.ch/castor/presentations/2006/

Resource Monitoring and Shared Memory
In 2.1.3 the LSF plugin and the Resource Monitor (rmMasterDaemon) share a common area of memory for exchanging information between the two processes.
  – Advantage: access to monitoring information from inside the LSF plugin is now a pure memory operation on the scheduler machine (extremely fast!).
  – Disadvantage: the rmMasterDaemon and LSF must operate on the same machine (no possibility for LSF failover).
Changes to daemons in 2.1.3:
  – rmmaster became a pure submission daemon.
  – rmMasterDaemon was introduced for collecting monitoring information.
  – rmnode was replaced by rmNodeDaemon on all diskservers.

Resource Monitoring Cont.
New monitoring information contains:
  – On diskservers: ram (total + free), memory (total + free), swap (total + free), load, status and adminStatus.
  – For each filesystem: space (total + free), nbRead/ReadWrite/WriteStreams, read/writeRate, nbMigrators, nbRecallers, status and adminStatus.
Monitoring intervals:
  – 1 minute for slow-moving info (total*, *status)
  – 10 s for fast-moving info (*Streams, *rate, load)
Status can be Production, Draining or Down.
Admin status can be None, Force or Deleted.
  – Set via rmAdminNode.
  – Force prevents updates from monitoring.
  – Deleted removes the entry from the DB.
  – Release moves the entry back from Force to None.
By default, new diskservers are in status DOWN and admin status FORCE.
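
Sketch (not CASTOR code): the status rules above can be modelled in a few lines of Python. The class and function names are hypothetical, and it is an assumption that Force blocks every monitored field rather than only the status.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PRODUCTION = "Production"
    DRAINING = "Draining"
    DOWN = "Down"


class AdminStatus(Enum):
    NONE = "None"
    FORCE = "Force"
    DELETED = "Deleted"


@dataclass
class DiskServer:
    """Hypothetical in-memory record for one diskserver."""
    name: str
    # Defaults follow the slide: new diskservers start DOWN / FORCE.
    status: Status = Status.DOWN
    admin_status: AdminStatus = AdminStatus.FORCE
    load: float = 0.0


def apply_monitoring_report(server, status, load):
    """Apply a monitoring report; return True if it was accepted."""
    if server.admin_status is AdminStatus.FORCE:
        # Force keeps the administrator's settings (assumed to cover all fields).
        return False
    server.status = status
    server.load = load
    return True


def release(server):
    """Hypothetical stand-in for the Release action: Force -> None."""
    if server.admin_status is AdminStatus.FORCE:
        server.admin_status = AdminStatus.NONE
```

With these defaults a freshly added diskserver ignores monitoring reports until an administrator releases it, which matches the DOWN / FORCE starting state described above.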

LSF changes and New Scheduler Plugin
Added multiple LSF queues, one per svcclass.
  – Not for technical reasons!!!
  – Allows for user restrictions at queue level and better visualization of jobs on a per-svcclass basis via bqueues.
Utilisation of External Scheduler options during job submission.
  – Recommended by LSF experts.
  – Increased the job submission rate from 10 to 14 jobs/second.
  – Calls to LSF (mbatchd) from CASTOR2 components reduced from 6 to 1. As a result, queue limitations are no longer needed (though they have not totally disappeared!).
  – Removed the need for message boxes, i.e. jobs are no longer suspended and resumed at submission time.
  – Requires LSF_ENABLE_EXTSCHEDULER to be enabled in lsf.conf (on both scheduler and rmmaster machines).

LSF Changes Cont.
Filesystem selection is now transferred between LSF and the job (stagerJob) via the SharedLSFResource.
  – The location of the SharedLSFResource can be defined in castor.conf.
  – It can be a shared filesystem (e.g. NFS) or a web server.
Why is it needed?
  – LSF is CPU aware, not filesystem aware.
  – The LSF scheduler plugin has all the logic for filesystem selection based on monitoring information and policies.
  – The final decision needs to be transferred between the plugin and the LSF execution host.
  – It could have been LSF message boxes or the SharedLSFResource. Neither is great, but we selected the lesser of two evils!
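
Sketch (not CASTOR code): one way to picture the handoff through a shared filesystem. The slides do not describe the actual layout of the SharedLSFResource, so the one-file-per-job naming, the `diskserver:filesystem` text format and both function names are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def publish_selection(shared_dir, job_id, diskserver, filesystem):
    """Plugin side: record the filesystem chosen for a job."""
    path = Path(shared_dir) / str(job_id)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(f"{diskserver}:{filesystem}\n")
    # Rename into place so a reader does not pick up a half-written file.
    tmp.replace(path)
    return path


def read_selection(shared_dir, job_id):
    """Job side: fetch the decision made by the plugin."""
    text = (Path(shared_dir) / str(job_id)).read_text().strip()
    diskserver, filesystem = text.split(":", 1)
    return diskserver, filesystem


if __name__ == "__main__":
    # A temporary directory stands in for the shared location (e.g. NFS).
    shared = tempfile.mkdtemp()
    publish_selection(shared, 1234, "diskserver01", "/srv/castor/01/")
    print(read_selection(shared, 1234))
```

The point of the sketch is only that the scheduler decides and the execution host looks the decision up by job identifier, since LSF itself carries no filesystem information.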

LSF Python Policies
Why?
  – Filesystem selection has moved from the Stager DB to the plugin. The plugin must now take over its functionality.
  – Scheduling needs to be sensitive to other, non-scheduled activity and respond accordingly.
Initial implementation was a basic equation with coefficients set in castor.conf.
  – Advantage: simplicity.
  – Disadvantages: simplicity; every new internal release during testing of 2.1.3 required changes to this equation inside the code!!
We couldn't ask the operations team to make these changes at runtime, so another language was needed for defining policies. The winner was Python!

Python Policies Cont.
Examples: /etc/castor/policies.py.example
Policies are defined on a per-svcclass level. Many underestimate their importance!
Real example: 15 diskservers, 6 LSF slots each, all slots occupied transferring 1.2 GB files in both read and write directions. Expected throughput per stream ~20 MB/s (optimal).
Problems:
  – At 20 MB/s, migration and recall streams suffer.
  – Migrations and recalls are unscheduled activities.
Solution:
  – Define a policy which favours migration and recall streams by restricting user activity on the disk server, allowing more resources (bandwidth, disk I/O) to be used by migrations and recalls.
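
Sketch (not the contents of policies.py.example): a policy of the kind described in the solution above. The function names, arguments, the weighting and the rule of reserving one slot per migrator or recaller are all assumptions; it is also assumed that the stream counters cover user streams only.

```python
def user_stream_cap(nb_migrators, nb_recallers, slots=6, reserved_per_tape_stream=1):
    """Maximum number of user streams allowed while tape streams are active."""
    reserved = (nb_migrators + nb_recallers) * reserved_per_tape_stream
    return max(slots - reserved, 0)


def filesystem_weight(nb_read_streams, nb_write_streams, nb_read_write_streams,
                      nb_migrators, nb_recallers, slots=6):
    """Return a weight for a filesystem (lower is better), or None to exclude it."""
    user_streams = nb_read_streams + nb_write_streams + nb_read_write_streams
    if user_streams >= user_stream_cap(nb_migrators, nb_recallers, slots):
        # No room left for another user stream: keep capacity for tape activity.
        return None
    # Busy tape activity makes the filesystem less attractive for user jobs.
    return user_streams + 2 * (nb_migrators + nb_recallers)
```

For example, with 6 slots, one migrator and one recaller, the cap drops to 4 user streams, so a filesystem already serving 4 user streams is excluded from selection.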

LSF Limitations, pre 2.1.3 releases
What was killing us:
The LSF queue was limited to ~2000 jobs; more than this resulted in instabilities.
  – No message boxes; calls to LSF reduced from 6 to 1.
LSF jobs remained in PSUSP after a timeout between stager and rmmaster (#17153).
Poor submission rates into LSF: ~10 jobs/second, half of the advertised LSF rate.
  – Now at 14 jobs/second.
RmMaster did not keep node status after a restart (#15832).
  – States are now stored in the Stager DB for persistence.
Database latency between the LSF plugin (schmod_castor) and the stager DB resulted in poor scheduling performance.
  – Shared memory implementation.
These were just the start!!!
Additional information available at: http://castor.web.cern.ch/castor/presentations/2006/

Scheduling Requirements/Problems
Job submission rates are still not at the advertised LSF rate of 20 jobs per second.
Jobs remain in PEND status indefinitely in LSF if no resources exist to run them (#15841).
Administrative actions such as bkill do not notify the client of a request's termination (#26134).
CASTOR cannot throttle requests if they exceed a certain amount (#18155), the infamous LSF meltdown.
A daemon was needed to manage and monitor jobs while they are in LSF and to take appropriate action where needed.

Job Manager - Improvements
The stager no longer communicates directly with the submission daemon.
  – All communication is done via the DB, making the jobManager stateless.
  – Two new statuses exist in the subrequest table: SUBREQUEST_READYSHCED (13) and SUBREQUEST_BEINGSCHED (14).
  – No more timeouts between stager and rmmaster resulting in duplicate submissions and rmmaster meltdowns.
Utilises a forked process pool for submitting jobs into LSF.
  – The previous rmmaster forked a process for each submission into LSF, which is expensive.
  – The number of LSF-related processes is now restricted to 2 x the number of submission processes.
  – Improved submission rates from 14 to 18.5 jobs/second.
New functionality added to detect when a job has been terminated by an administrator (`bkill`) and to notify the client of the job's termination.
  – New error code: 1719 - 'Job killed by service administrator'
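
Sketch (not CASTOR code): how a stateless daemon can pick up work purely through status changes in the DB. The table layout and the `claim_ready` function are hypothetical, and sqlite3 stands in for the stager DB. The status names and values are copied as they appear above.

```python
import sqlite3

SUBREQUEST_READYSHCED = 13  # spelled as on the slide
SUBREQUEST_BEINGSCHED = 14


def claim_ready(conn, limit):
    """Move up to `limit` subrequests from status 13 to 14 and return their ids."""
    with conn:  # commit on success, roll back on error
        rows = conn.execute(
            "SELECT id FROM subrequest WHERE status = ? ORDER BY id LIMIT ?",
            (SUBREQUEST_READYSHCED, limit),
        ).fetchall()
        ids = [row[0] for row in rows]
        conn.executemany(
            "UPDATE subrequest SET status = ? WHERE id = ?",
            [(SUBREQUEST_BEINGSCHED, i) for i in ids],
        )
    return ids
```

Because the only state is the status column, a restarted daemon simply resumes polling. With several daemons working in parallel against a real database server, the select and update would need row locking so that two daemons cannot claim the same subrequest; this sketch does not attempt that.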

Job Manager - Improvements Cont.
Jobs can now be killed if they remain in LSF for too long in PEND status.
  – The timeout value can be defined on a per-svcclass basis.
  – The user receives error code: 1720 - 'Job timed out while waiting to be scheduled'.
Jobs whose resource requirements can no longer be satisfied can be terminated:
  – Error code: 1718 - 'All copies of this file are unavailable for now. Please retry later'
  – Must be enabled in castor.conf via option JobManager/ResReqKill.
Multiple JobManagers can operate in parallel for a redundant, high-availability solution.
All known rmmaster-related bugs closed!
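
Sketch (not CASTOR code): the per-svcclass pending timeout expressed as a small function. The error codes are the ones listed above; the constant names, the function name, the tuple layout and the rule that a svcclass without a configured timeout never expires are assumptions.

```python
ERR_RESREQ_UNAVAILABLE = 1718  # 'All copies of this file are unavailable for now...'
ERR_KILLED_BY_ADMIN = 1719     # 'Job killed by service administrator'
ERR_PENDING_TIMEOUT = 1720     # 'Job timed out while waiting to be scheduled'


def jobs_to_kill(pending_jobs, timeouts, now):
    """Return (job_id, error_code) for jobs pending longer than their timeout.

    pending_jobs: iterable of (job_id, svcclass, submit_time) tuples.
    timeouts: dict mapping svcclass to a timeout in seconds.
    now: current time, in the same units as submit_time.
    """
    expired = []
    for job_id, svcclass, submit_time in pending_jobs:
        timeout = timeouts.get(svcclass)
        # Assumption: a svcclass without a configured timeout never expires.
        if timeout is not None and now - submit_time > timeout:
            expired.append((job_id, ERR_PENDING_TIMEOUT))
    return expired
```

A caller would then terminate each returned job in LSF and report the error code to the client.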

Future Developments 2.1.6+
Disk-2-Disk copy scheduling.
Support for multiple rmMasterDaemons running in parallel on a single CASTOR 2 instance.

Comments, questions?