Published by Samson McLaughlin. Modified over 9 years ago.

AliEn Extreme Job Broker
L. Betev, A. Grigoras, C. Grigoras, P. Saiz, S. Schreiner
Experiment Support, CERN IT Department (DB, ES), CH-1211 Geneva 23, Switzerland, www.cern.ch/it

AliEn Job Broker
Two tasks:
  – Tells the CE how many JobAgents to submit
  – Gives user jobs to JobAgents, using job priorities
Uses ClassAd matchmaking
Pull mode
  – Waits for requests
Critical service
  – Without it, user jobs would wait in the queue
Multiple instances, on several hosts
  – The database is used to solve concurrency problems
(Pablo Saiz, ALICE offline week, 10 Nov 2011. CERN IT Department, ES, CH-1211 Geneva 23, Switzerland, www.cern.ch/it)
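The two broker tasks and the pull mode can be pictured with a small in-memory model. This is a hypothetical sketch, not AliEn code: `WaitingJob`, `Broker`, `agents_needed` and `request_job` are invented names, job requirements are modelled as Python predicates instead of ClassAd expressions, and the database that AliEn uses to coordinate several broker instances is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical model: a JobAgent is described by a plain dict (e.g. TTL, CloseSE).
AgentAd = dict


@dataclass(eq=False)  # eq=False: list.remove() below removes by identity
class WaitingJob:
    job_id: int
    priority: int
    # Stand-in for the JDL Requirements: a predicate over the JobAgent description.
    requirements: Callable[[AgentAd], bool]


@dataclass
class Broker:
    """Hypothetical pull-mode broker holding the waiting jobs in a list."""
    queue: List[WaitingJob] = field(default_factory=list)

    def submit(self, job: WaitingJob) -> None:
        self.queue.append(job)

    def agents_needed(self, slot: AgentAd) -> int:
        """Task 1: how many waiting jobs a JobAgent with this description
        could run, i.e. how many JobAgents are worth submitting to the CE."""
        return sum(1 for job in self.queue if job.requirements(slot))

    def request_job(self, agent: AgentAd) -> Optional[WaitingJob]:
        """Task 2 (pull mode): a running JobAgent asks for work and gets the
        highest-priority job whose requirements it satisfies, or None."""
        for job in sorted(self.queue, key=lambda j: j.priority, reverse=True):
            if job.requirements(agent):
                self.queue.remove(job)
                return job
        return None


if __name__ == "__main__":
    broker = Broker()
    broker.submit(WaitingJob(1, 45, lambda a: a["TTL"] > 80000))
    broker.submit(WaitingJob(2, 95, lambda a: a["TTL"] > 300
                             and "ALICE::CERN::SE" in a["CloseSE"]))
    agent = {"TTL": 43200, "CloseSE": ["ALICE::CERN::SE"]}
    print(broker.agents_needed(agent))       # 1
    print(broker.request_job(agent).job_id)  # 2
    print(broker.request_job(agent))         # None
```

Scanning the whole queue on every request is acceptable for a toy model; making this lookup cheaper is what the later matchmaking changes are about.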

Job execution
[Diagram: central services (Job Manager, TASKQUEUE, Job Broker, File catalogue with LFN, GUID and metadata) and Sites A, B and C, showing a CE, JobAgents (JA) running JOBs, MonALISA and xrootd]

Current Job Broker concerns
Increased number of jobs
  – More users, more analysis, more resources…
Load on the server(s)
  – We could deploy more instances
  – No problems with database performance
Understanding why jobs stay in WAITING

Let’s go back to splitting…
Splitting per SE
  – Per combination of SEs
  – Users can specify the maximum number of files (although jobs usually have even fewer input files)
Other splitting algorithms are trivial
  – ‘production’: similar JDL in all subjobs
  – ‘file’: one subjob per file
  – ‘directory’: one subjob per directory
  – ‘user defined’: as the user specifies
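Splitting per SE and the trivial ‘file’ and ‘directory’ modes can be sketched in a few lines. In AliEn the splitting mode is chosen in the job's JDL; the function names, the replica-map layout (LFN to set of SE names) and the output format here are hypothetical.

```python
import posixpath
from collections import defaultdict
from typing import Dict, List, Set


def split_per_se(replicas: Dict[str, Set[str]], max_files: int) -> List[dict]:
    """Hypothetical 'SE' splitting: files whose replicas sit on the same
    combination of SEs go together, at most max_files files per subjob."""
    groups: Dict[frozenset, List[str]] = defaultdict(list)
    for lfn in sorted(replicas):
        groups[frozenset(replicas[lfn])].append(lfn)
    subjobs = []
    for ses in sorted(groups, key=sorted):  # deterministic order of SE combinations
        files = groups[ses]
        for start in range(0, len(files), max_files):
            subjobs.append({"se": sorted(ses),
                            "inputdata": files[start:start + max_files]})
    return subjobs


def split_per_file(lfns: List[str]) -> List[List[str]]:
    """Hypothetical 'file' splitting: one subjob per input file."""
    return [[lfn] for lfn in lfns]


def split_per_directory(lfns: List[str]) -> List[List[str]]:
    """Hypothetical 'directory' splitting: one subjob per parent directory."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for lfn in lfns:
        groups[posixpath.dirname(lfn)].append(lfn)
    return [groups[d] for d in sorted(groups)]
```

With three files only at CERN, one file at FZK and Torino, and `max_files=2`, `split_per_se` yields two CERN subjobs (2 files and 1 file) and one FZK/Torino subjob.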

Current situation
Works nicely if there is one replica per file
A bit more complex with 3 SEs and 2 replicas
And a lot more complex with 50 SEs and 3 replicas
[Each case is illustrated by a diagram of the Job Manager and the JOB]
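A quick count shows why the number of SEs and replicas matters for splitting per SE combination. Assuming every file has exactly r replicas on distinct SEs out of n, the number of possible location sets is the binomial coefficient C(n, r), and each set that actually occurs needs at least one subjob of its own. This is a worst-case bound reached only if files are spread over all combinations, not a measurement; `max_location_groups` is a hypothetical helper.

```python
from math import comb


def max_location_groups(n_se: int, n_replicas: int) -> int:
    """Hypothetical helper: upper bound on distinct replica-location sets
    when each file has n_replicas copies on distinct SEs out of n_se."""
    return comb(n_se, n_replicas)


for n_se, n_replicas in [(3, 2), (50, 3)]:
    print(f"{n_se} SEs, {n_replicas} replicas: up to "
          f"{max_location_groups(n_se, n_replicas)} groups")
```

This prints 3 groups for 3 SEs with 2 replicas and 19600 groups for 50 SEs with 3 replicas; with one replica per file the bound is simply the number of SEs.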

What if...?
Split jobs without specifying input data per subjob
Decide which files to analyze at the last moment
  – Based on location, and on the files already analyzed

Example
Three sites (A, B, C) and five files (File 1 to File 5)
Current schema: submit 4 jobs, each with its input files (File 1 … File 5) fixed in advance
Broker per file: submit 3 empty subjobs
  – When a subjob starts, it analyzes as much as possible (in the figure, one subjob takes Files 1, 2, 4 and 5, another takes File 3)
  – If nothing is left, it just exits
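The late-binding idea of the example can be simulated directly: subjobs carry no input list and, when they start, claim every still-unassigned file that has a replica at their site. The class name `FileBroker`, its `claim` method and the replica layout in `REPLICAS` are hypothetical; the layout was made up so that the outcome matches the figure.

```python
from typing import Dict, List, Set


class FileBroker:
    """Hypothetical 'broker per file': subjobs are submitted without input
    data and ask for files only when they start running."""

    def __init__(self, replicas: Dict[str, Set[str]]):
        self.replicas = replicas         # LFN -> sites holding a replica
        self.unassigned = set(replicas)  # files no subjob has taken yet

    def claim(self, site: str) -> List[str]:
        """Hand a starting subjob every unassigned file with a replica at its site."""
        mine = sorted(f for f in self.unassigned if site in self.replicas[f])
        self.unassigned.difference_update(mine)
        return mine


# Made-up (hypothetical) replica layout for the five files of the example.
REPLICAS = {
    "File 1": {"A"},
    "File 2": {"A", "B"},
    "File 3": {"B", "C"},
    "File 4": {"A", "C"},
    "File 5": {"A"},
}

if __name__ == "__main__":
    broker = FileBroker(REPLICAS)
    for site in ["A", "B", "C"]:  # three empty subjobs, started in this order
        files = broker.claim(site)
        print(site, files if files else "nothing left, exit")
```

Files are marked as taken the moment they are handed out; putting them back into the pool when a subjob fails is not modelled here.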

Even more options
‘Each subjob with at least 5 files’
  – And read them remotely
‘Analyze only 70 out of 100 files’
  – And stop once that number has been reached
‘Stop the analysis after 3 hours when 90% of files have been processed’
Ideas from: Jan Fiete Grosse-Oetringhaus
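One possible way to express these three options as broker-side rules is sketched below. `AnalysisPolicy`, `select_files` and `should_stop` are hypothetical names, and the third option is interpreted as "once the deadline has passed, stop as soon as the processed fraction reaches the threshold"; the slides do not specify the exact semantics.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnalysisPolicy:
    """Hypothetical knobs for the three options above."""
    min_files_per_subjob: int = 0           # top up with remote files below this
    max_files_total: Optional[int] = None   # e.g. 70 out of 100
    deadline_hours: Optional[float] = None  # e.g. 3 hours ...
    fraction_at_deadline: float = 1.0       # ... once 90% (0.9) are processed


def select_files(policy: AnalysisPolicy, local: List[str], remote: List[str],
                 already_given: int) -> List[str]:
    """Files for a starting subjob: local replicas first, then remote ones
    until the minimum is reached, never exceeding the global cap."""
    chosen = list(local)
    missing = policy.min_files_per_subjob - len(chosen)
    if missing > 0:
        chosen += remote[:missing]  # these would be read remotely
    if policy.max_files_total is not None:
        chosen = chosen[:max(0, policy.max_files_total - already_given)]
    return chosen


def should_stop(policy: AnalysisPolicy, processed: int, total: int,
                elapsed_hours: float) -> bool:
    """True when the analysis can finish without the remaining files."""
    if policy.max_files_total is not None and processed >= policy.max_files_total:
        return True
    if policy.deadline_hours is not None and elapsed_hours >= policy.deadline_hours:
        return processed >= policy.fraction_at_deadline * total
    return processed >= total
```

For example, with `AnalysisPolicy(min_files_per_subjob=5)` a subjob that finds 2 local files is topped up with 3 remote ones.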

Broker per file
Benefits:
  – Fewer subjobs
  – Easier to achieve the maximum number of entries
  – Can define a minimum number (remote reading)
  – Can reduce the tail of execution
Challenges:
  – Keeping track of the files analyzed
  – Increased complexity of the Broker
  – Still to be implemented

Load on Job Broker
Increased with v2-19:
  – Security: SSL connections

Old Classad Matchmaking
Prio | Jobs | JDL
99 | 300 | [executable="aliroot"; user="alidaq"; inputdata=…; …; Requirements = TTL>9000 and other.SE="ALICE::FZK::SE"]
97 | 5 | [executable="aliroot"; user="psaiz"; inputdata=…; …; Requirements = TTL>300 and (other.SE="ALICE::Prague::SE" or other.SE="ALICE::Torino::SE")]
95 | 20 | [executable="aliroot"; user="blabla"; inputdata=…; …; Requirements = TTL>300 and other.SE="ALICE::CERN::SE"]
…
45 | 300 | [executable="aliroot"; user="aliprod"; Requirements = TTL>80000]
The JobAgent JDL is matched against all entries, starting with the highest priority:
JobAgent JDL
[
  Uname = "2.6.32-33-server";
  CE = "ALICE::CERN::aliendb5";
  CloseSE = { "ALICE::CERN::Castor2", "ALICE::CERN::SE", … };
  Version = "v2-19.81";
  Type = "machine";
  Price = 1.000000000000000E+00;
  Host = "aliendb5d";
  WNHost = "aliendb5.cern.ch";
  LocalDiskSpace = 370226072;
  Platform = "Linux-x86_64";
  TTL = 43200;
  Packages = { "proof@aaf-aliroot::0.4.3", … };
  Requirements = ( other.Type == "Job" )
]
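In the old scheme every task-queue entry is a candidate, so a JobAgent that only matches a low-priority entry (or none) pays for evaluating every entry above it. The sketch below counts those evaluations on the entries from the slide. Assumptions: ClassAd expressions are replaced by Python lambdas, `Type = "Job"` is assumed on the job side (the JobAgent JDL requires `other.Type == "Job"`), `other.SE` is read as membership in the JobAgent's `CloseSE` list, and `symmetric_match`, `old_match` and `close_to` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional, Tuple

# An "ad" is a dict of attributes; "Requirements" is a predicate on the other ad.
Ad = Dict[str, object]


def symmetric_match(job: Ad, agent: Ad) -> bool:
    """ClassAd-style match: each side's Requirements must accept the other side."""
    return job["Requirements"](agent) and agent["Requirements"](job)


def old_match(task_queue: List[Tuple[int, int, Ad]], agent: Ad) -> Tuple[Optional[Ad], int]:
    """Evaluate entries from the highest priority down; return the first match
    and how many entries had to be evaluated to find it."""
    evaluated = 0
    for _prio, _jobs, jdl in sorted(task_queue, key=lambda e: e[0], reverse=True):
        evaluated += 1
        if symmetric_match(jdl, agent):
            return jdl, evaluated
    return None, evaluated


def close_to(se: str) -> Callable[[Ad], bool]:
    # Assumption: other.SE == "X" means "X is one of the JobAgent's CloseSE".
    return lambda other: se in other["CloseSE"]


TASK_QUEUE = [
    (99, 300, {"Type": "Job", "user": "alidaq",
               "Requirements": lambda o: o["TTL"] > 9000 and close_to("ALICE::FZK::SE")(o)}),
    (97, 5, {"Type": "Job", "user": "psaiz",
             "Requirements": lambda o: o["TTL"] > 300 and (close_to("ALICE::Prague::SE")(o)
                                                          or close_to("ALICE::Torino::SE")(o))}),
    (95, 20, {"Type": "Job", "user": "blabla",
              "Requirements": lambda o: o["TTL"] > 300 and close_to("ALICE::CERN::SE")(o)}),
    (45, 300, {"Type": "Job", "user": "aliprod",
               "Requirements": lambda o: o["TTL"] > 80000}),
]

AGENT = {"Type": "machine", "TTL": 43200,
         "CloseSE": ["ALICE::CERN::Castor2", "ALICE::CERN::SE"],
         "Requirements": lambda other: other["Type"] == "Job"}

if __name__ == "__main__":
    jdl, n = old_match(TASK_QUEUE, AGENT)
    print(jdl["user"], n)  # blabla 3
```

The CERN JobAgent only finds work after evaluating the FZK and Prague/Torino entries first; an agent that matches nothing evaluates all four.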

Current Classad Matchmaking
Match all entries with that SE, starting with the highest priority
Prio | Jobs | JDL | SE
99 | 300 | [executable="aliroot"; user="alidaq"; packages=…; …; Requirements = TTL>9000 and other.SE="ALICE::FZK::SE"] | FZK
97 | 5 | [executable="aliroot"; user="psaiz"; packages=…; …; Requirements = TTL>300 and (other.SE="ALICE::Prague::SE" or other.SE="ALICE::Torino::SE")] | Torino, Prague
95 | 20 | [executable="aliroot"; user="blabla"; packages=…; …; Requirements = TTL>300 and other.SE="ALICE::CERN::SE"] | CERN
…
45 | 300 | [executable="aliroot"; user="aliprod"; Requirements = TTL>80000] |
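Keeping the SE list next to each entry lets the broker discard most entries before evaluating any requirement. In this hypothetical sketch (`match_by_se` and the dict layout are invented), the SE part of the requirements is handled by the filter, only the TTL check is left to evaluate, and entries without an SE, like the aliprod row, are assumed to be candidates for every JobAgent.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical task queue: the SE list is kept outside the JDL for cheap filtering.
TASK_QUEUE = [
    {"prio": 99, "user": "alidaq", "SE": ["ALICE::FZK::SE"], "min_ttl": 9000},
    {"prio": 97, "user": "psaiz", "SE": ["ALICE::Prague::SE", "ALICE::Torino::SE"], "min_ttl": 300},
    {"prio": 95, "user": "blabla", "SE": ["ALICE::CERN::SE"], "min_ttl": 300},
    {"prio": 45, "user": "aliprod", "SE": [], "min_ttl": 80000},  # no SE constraint
]


def match_by_se(task_queue: List[Dict], agent_ttl: int,
                close_ses: List[str]) -> Tuple[Optional[str], int]:
    """Only entries with one of the agent's close SEs (or no SE at all) are
    evaluated, highest priority first; returns (user, entries evaluated)."""
    close = set(close_ses)
    candidates = [e for e in task_queue if not e["SE"] or close & set(e["SE"])]
    evaluated = 0
    for entry in sorted(candidates, key=lambda e: e["prio"], reverse=True):
        evaluated += 1
        # Remaining requirement check (here only the TTL) on the few candidates.
        if agent_ttl > entry["min_ttl"]:
            return entry["user"], evaluated
    return None, evaluated
```

For the CERN JobAgent of the previous slide (TTL 43200), `match_by_se` returns `("blabla", 1)`: the FZK and Prague/Torino entries are never evaluated.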

And in v2-20
Let the database do most of the matchmaking
  – Extract the most common fields from the JDL: TTL, CE, SE, user, partition, size, packages…
  – The first match should do it
  – Less flexibility in requirements…
Prio | Jobs | JDL | SE | User | CE | … | Partition
99 | 300 | [executable="aliroot"; …] | FZK | alidaq | FZK | |
97 | 5 | [executable="aliroot"; …] | Torino, Prague | psaiz | Torino, Prague | |
95 | 20 | [executable="aliroot"; …] | CERN | blabla | CERN | |
…
45 | 300 | [executable="aliroot"; …] | | aliprod | | |
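"Let the database do the matchmaking" can be sketched with SQLite: the common JDL fields become columns, and a single query with `ORDER BY prio DESC LIMIT 1` returns the first match. The table and column names, the separate `taskqueue_se` table for multi-valued SE fields and both functions are hypothetical; only TTL and SE are modelled, and CE, user, partition, size or packages would become further columns and `WHERE` conditions.

```python
import sqlite3
from typing import List, Optional


def make_taskqueue() -> sqlite3.Connection:
    """Hypothetical in-memory task queue with common JDL fields as columns."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE taskqueue (
            id INTEGER PRIMARY KEY, prio INTEGER, jobs INTEGER,
            jdl TEXT, username TEXT, min_ttl INTEGER);
        CREATE TABLE taskqueue_se (entry_id INTEGER, se TEXT);  -- one row per allowed SE
        CREATE INDEX taskqueue_se_idx ON taskqueue_se (se, entry_id);
    """)
    db.executemany("INSERT INTO taskqueue VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 99, 300, '[executable="aliroot"; ...]', "alidaq", 9000),
        (2, 97, 5, '[executable="aliroot"; ...]', "psaiz", 300),
        (3, 95, 20, '[executable="aliroot"; ...]', "blabla", 300),
        (4, 45, 300, '[executable="aliroot"; ...]', "aliprod", 80000),
    ])
    db.executemany("INSERT INTO taskqueue_se VALUES (?, ?)", [
        (1, "ALICE::FZK::SE"), (2, "ALICE::Prague::SE"),
        (2, "ALICE::Torino::SE"), (3, "ALICE::CERN::SE"),
    ])
    return db


def first_match(db: sqlite3.Connection, agent_ttl: int,
                close_ses: List[str]) -> Optional[str]:
    """One query: highest-priority entry with waiting jobs, a TTL the agent
    can offer and an SE close to the agent (or no SE constraint at all)."""
    marks = ",".join("?" for _ in close_ses) or "NULL"  # only '?' or NULL is interpolated
    sql = f"""
        SELECT t.username FROM taskqueue AS t
        WHERE t.jobs > 0 AND t.min_ttl < ?
          AND (NOT EXISTS (SELECT 1 FROM taskqueue_se AS s WHERE s.entry_id = t.id)
               OR EXISTS (SELECT 1 FROM taskqueue_se AS s
                          WHERE s.entry_id = t.id AND s.se IN ({marks})))
        ORDER BY t.prio DESC
        LIMIT 1"""
    row = db.execute(sql, [agent_ttl, *close_ses]).fetchone()
    return row[0] if row else None
```

The trade-off is the reduced flexibility noted on the slide: only conditions that map onto these columns can be expressed in the query.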

Other changes in v2-20
‘Resubmit’ uses the same job ID
The CE submits JobAgents even if the packages are not installed
Atomic transaction for job assignment
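Atomic job assignment can be sketched as a conditional `UPDATE` (compare-and-set): the check that the job is still waiting and the change of owner happen in one statement, so when several broker instances pick the same job only one of them succeeds. The slides do not say how AliEn implements the transaction; the `queue` table, the `ASSIGNED` status value and the function names here are hypothetical.

```python
import sqlite3


def make_queue() -> sqlite3.Connection:
    """Hypothetical job table with a single waiting job."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, status TEXT, agent TEXT)")
    db.execute("INSERT INTO queue VALUES (1, 'WAITING', NULL)")
    db.commit()
    return db


def assign_job(db: sqlite3.Connection, job_id: int, agent: str) -> bool:
    """Give job_id to agent only if it is still WAITING; the status check and
    the update are one statement, so two brokers cannot both win."""
    with db:  # commits on success, rolls back on error
        cur = db.execute(
            "UPDATE queue SET status = 'ASSIGNED', agent = ? "
            "WHERE id = ? AND status = 'WAITING'",
            (agent, job_id),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    db = make_queue()
    print(assign_job(db, 1, "jobagent-1"))  # True
    print(assign_job(db, 1, "jobagent-2"))  # False: already assigned
```

A broker whose `assign_job` returns `False` simply looks for the next matching job instead of handing out a job twice.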

Summary
AliEn Job Broker
  – Decides where jobs get executed
  – Pull model, ClassAds and priorities
In the next version of AliEn
  – Broker per file to be analyzed: fewer subjobs, more input data per subjob, more options (min. files, max. waiting time)
  – Simplified ClassAd matching
  – Atomic job assignment transaction
To be deployed in January