Towards an agent-integrated speculative scheduling service
László Csaba Lőrincz, Attila Ulbert, Tamás Kozsik, Zoltán Horváth
ELTE, Department of Programming Languages and Compilers, Budapest, Hungary
Motivation
- Data-intensive, parameter-sweep applications
- Decisions:
  - where to put the data (and when)
  - where to execute the job
- Optimization:
  - Replication
  - Resource utilization
  - Response time (get the results ASAP)
  - …
- Centralized approach vs. decentralized approach
  - More information, better optimization
Overview
- Gather information about the behavior of the job and the resources of the grid
- Schedule the job
- Pipeline: Monitor -> (log) -> Analyze -> (job description) -> Schedule
Monitoring
- The developer does not have to provide the data access pattern descriptions
- Information gathered:
  - Data access: I/O operations
  - CPU & memory consumption
- Possible approaches:
  - altering the source code,
  - compiler,
  - run-time system,
  - operating system.
- Legacy (binary) libraries and applications
Monitoring
- Altering the run-time system:
  - Black-box approach (source code is not needed)
  - Transparent
  - Special shared library -> platform-dependent (sketch below)
- File handling: stdio.h, fcntl.h, unistd.h
- CPU & memory:
  - /proc/cpuinfo (BogoMips)
  - /proc
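The "special shared library" approach can be sketched with LD_PRELOAD interposition on Linux. This is a minimal, illustrative example that wraps read() and emits one log record per operation; the file name, record format, and build line are assumptions, not the deck's actual library:

/* iomon.c -- a hypothetical monitoring library.
   build: gcc -shared -fPIC iomon.c -o libiomon.so -ldl
   use:   LD_PRELOAD=./libiomon.so ./job                */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

static ssize_t (*real_read)(int, void *, size_t);

/* Interposed read(): forwards to libc, then appends one record
   (descriptor, bytes, duration) to the access log on stderr. */
ssize_t read(int fd, void *buf, size_t count) {
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    ssize_t n = real_read(fd, buf, count);
    gettimeofday(&t1, NULL);
    long us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    fprintf(stderr, "read fd=%d bytes=%zd us=%ld\n", fd, n, us);
    return n;
}

This is transparent to the job (black box, no source needed), but it is tied to the platform's C library, which is exactly the platform dependence the slide notes.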
The analyzer
- Processes the log: O(n)
- Strategy detection: O(1) (sketch below)
  - Direction of file access
  - Block size
  - Timing characteristics
- Configurable:
  - Detailed or more abstract (compact) description
  - Behavior variation
  - Progress detection threshold
  - Access & datablock log size
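A sketch of per-record strategy detection, streamed over the log in O(n) with O(1) work per record. The record layout is a hypothetical stand-in for the real log format, and only forward, fixed-step sequential access is detected here; direction, block size, and timing fall out the same way:

#include <stddef.h>

/* One parsed log record: offset and size of an I/O operation
   (hypothetical layout; the real format is produced by the monitor). */
typedef struct { long offset; long size; } AccessRec;

/* Is the trace a forward, fixed-step sequential scan?
   O(1) per record; the step doubles as the detected block size. */
int is_sequential(const AccessRec *log, size_t n, long *step_out) {
    if (n < 2) return 0;
    long step = log[1].offset - log[0].offset;
    if (step <= 0) return 0;                  /* not a forward scan */
    for (size_t i = 2; i < n; ++i)
        if (log[i].offset - log[i - 1].offset != step)
            return 0;                         /* step size varies */
    *step_out = step;
    return 1;
}

A detected step of 5000 with size 5000 is the kind of result the datablock elements on the next slide record.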
Extended job description <datablock min_pos_absolute="0" max_pos_absolute=" " <datablock min_pos_absolute="0" max_pos_absolute=" " min_pos_relative="0" max_pos_relative=" " min_pos_relative="0" max_pos_relative=" " step="5000" size="5000" /> step="5000" size="5000" /> <timing op_time="0" op_mips="0" <timing op_time="0" op_mips="0" avg_op_time=" " avg_op_mips=" " /> avg_op_time=" " avg_op_mips=" " /> </file_out> <datablock min_pos_absolute="0" max_pos_absolute="905001" <datablock min_pos_absolute="0" max_pos_absolute="905001" min_pos_relative="0" max_pos_relative=" " min_pos_relative="0" max_pos_relative=" " step="4999" size="4999" /> step="4999" size="4999" /> <datablock min_pos_absolute="910000" max_pos_absolute=" " <datablock min_pos_absolute="910000" max_pos_absolute=" " min_pos_relative=" " max_pos_relative=" " min_pos_relative=" " max_pos_relative=" " step="4999" size="4999" /> step="4999" size="4999" /> <timing op_time="0" op_mips="0" <timing op_time="0" op_mips="0" avg_op_time=" " avg_op_mips=" " /> avg_op_time=" " avg_op_mips=" " /> </file_in>
Scheduling strategies
- Scheduler:
  - Choose the Computing Element for the job execution
  - Replication commands (for the Replica Manager)
- Assumptions:
  - A job (single thread) utilizes 100% of a CPU
  - Files are opened at the beginning of the execution and closed when the job terminates
  - Preceding jobs are finished -> input files can be transferred
Scheduling
- Based on the job description and the current resource consumption in the grid
- (Figure: the Scheduler consults the GIS and the job description, sends "schedule" commands to CE1/CE2 and "replicate" commands to the Replica Manager; file1–file4 are stored on SE1–SE3; links range from 10 Mbit to 1 Gbit)
Scheduling – static data feeder
- Based on the job description and the performance of the CEs, estimate the execution time (sketch below)
- Output: list of CEs + commands for the Replica Manager
- Job execution:
  1. input files to SEs (download)
  2. run the job
  3. outputs to the destination (upload)
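The static estimate can be sketched as a three-term cost mirroring the download/run/upload phases above. The field names and the linear cost model are illustrative assumptions; the deck only states that execution time is estimated from the job description and the performance of the CEs. The work parameter would plausibly come from the MIPS-based timing attributes of the extended job description:

#include <stddef.h>

typedef struct {
    double mips;        /* processing power of the CE */
    double bw_in_mbps;  /* bandwidth from the closest input replica (Mbit/s) */
    double bw_out_mbps; /* bandwidth to the output destination (Mbit/s) */
} ComputingElement;

/* Estimated completion time in seconds: stage inputs, compute, stage outputs. */
double estimate(const ComputingElement *ce,
                double in_mb, double out_mb, double work_mi) {
    double t_in  = in_mb * 8.0 / ce->bw_in_mbps;    /* 1. download inputs */
    double t_cpu = work_mi / ce->mips;              /* 2. run the job     */
    double t_out = out_mb * 8.0 / ce->bw_out_mbps;  /* 3. upload outputs  */
    return t_in + t_cpu + t_out;
}

/* The scheduler's output list can be the CEs sorted by this estimate;
   here we just pick the best one. */
size_t best_ce(const ComputingElement *ces, size_t n,
               double in_mb, double out_mb, double work_mi) {
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (estimate(&ces[i], in_mb, out_mb, work_mi) <
            estimate(&ces[best], in_mb, out_mb, work_mi))
            best = i;
    return best;
}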
Scheduling – agent-integrated
- FileAccessAgent:
  - transfers files among Storage and Computing Elements
  - source agent: runs on an SE
  - destination agent: collects the necessary files
  - replicating the files of a CE node is possible
  - filtered data transfer (copies only the relevant file parts; sketch below)
  - takes the status of multiple jobs into account
- JobManagementAgent:
  - coordinates the destination FileAccessAgent
  - starts the FAA
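Filtered data transfer can be illustrated with a minimal sketch that copies only the byte range covered by one datablock element of the extended job description instead of the whole file. Plain POSIX stdio is used here as an assumption; the real agents move the data across the grid:

#include <stdio.h>

/* Copy bytes [from, to) of src into dst, e.g. from = min_pos_absolute
   and to = max_pos_absolute of a datablock.
   Returns the number of bytes copied, or -1 on error. */
long copy_range(const char *src, const char *dst, long from, long to) {
    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    long copied = -1;
    if (in && out && fseek(in, from, SEEK_SET) == 0) {
        char buf[8192];
        copied = 0;
        while (copied < to - from) {
            size_t want = sizeof buf;
            if ((long)want > to - from - copied)
                want = (size_t)(to - from - copied);
            size_t n = fread(buf, 1, want, in);
            if (n == 0) break;              /* EOF or read error */
            fwrite(buf, 1, n, out);
            copied += (long)n;
        }
    }
    if (in) fclose(in);
    if (out) fclose(out);
    return copied;
}

For the second input datablock of the example description, this would copy only the bytes from position 910000 on, skipping the gap the job never reads.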
Scheduling – agent-integrated
Static DF + FileAccessAgent:
1. the user submits the job
2. find the target CE using the job description
3. if the CE does not provide enough disk space, select the best SE to which the input files should be mirrored
4. a FileAccessAgent is sent to the source SE and to the destination node
5. the FileAccessAgent of the job copies the input files (to the CE, or to the SE if necessary/possible)
6. the job is executed
7. the FileAccessAgent copies the output files to the destination node; the next job can be started
Agent-integrated – Condor-G
- Java Agent DEvelopment Framework (JADE):
  - Java-based
  - can be distributed across machines
  - FIPA communication model (FIPA is an IEEE Computer Society standards organization that promotes agent-based technology and the interoperability of its standards with other technologies)
  - support for HTTP-based transport
  - agents are submitted as an input file of a shell script
Agent integrated – Condor-G universe = vanilla executable = runjade output = jadesend.out error = jadesend.err log = jadesend.log arguments = -host n01 -container a1:A1 transfer_input_files = agent.jar,jade.jar,jadeTools.jar,iiop.jar,commons- codec-1.3.jar WhenTOTransferOutput = ON_EXIT requirements = (machine == "n02") queue
Simulation
- Extended OptorSim v2.0; CE configurations were extended with MIPS values
- EDG topology
- static data feeder and agent-integrated scheduler
- single-source shortest path searching, 300 MB input file
- 4 simulation groups:
  - two different job descriptions
  - 1/6 and 4/6 of the jobs have a behavior description
  - 100, 500, 1000 jobs in a group
Simulation (results figure)
Conclusions and future work
- Optimization based on the extended job description
- Static and dynamic data feeder strategies; implementation: Hungarian ClusterGrid
- Monitoring and analysis will be unified and implemented in the Grid middleware
- Refined job descriptions
- Communication patterns for parallel applications
- More thorough analysis of the scheduling methods:
  - What happens when things go wrong?
  - How should it be handled?