Software Tools for Dynamic Resource Management

Irina V. Shoshmina, Dmitry Yu. Malashonok, Sergay Yu. Romanov
Institute of High-Performance Computing and Information Systems

State of the art

Resources:
- CONVEX(es)
- Parsytec CC/16
- Parsytec CCid
- Parsytec Power Mouse System
- SPP1600
- SGI OCTANE workstations
- Sun Ultra 450
- Paritet (Intel cluster)

Scientific problems:
- hydroaerodynamics
- plasma
- nuclear physics
- medicine
- biology
- chemistry
- astronomy

Difficulties
- shortage of resources for the scientific problems being solved
- unsatisfactory management of tasks (the majority of tasks are parallel)

Shortage of resources

Approach: integrate the computational resources of several scientific centres.

Advantages of integration:
- broader access to computational resources and more active use of them
- closer integration of the scientific community
- a wider range of scientific and technical problems that can be solved

Management of tasks

Approach: optimise the distribution of tasks across computational nodes.

Tools:
- Codine
- Sun Grid Engine
- PBS
- Condor

Disadvantages of the tools:
- weak support for migration of parallel tasks
- unsatisfactory load balancing
- dependence on particular versions of PVM and MPI

Main goals of the project
- more efficient use of computing resources
- better quality of service for users

Main tasks
- migration of parallel tasks
- optimisation of distributed resource management
- integration of the resources of several scientific centres

Dynamite

Software developed by the University of Amsterdam in the Esprit project.

Dynamite advantages:
- migration and checkpointing of PVM tasks
- automatic workload balancing of PVM tasks (on a cluster of workstations)
- migration of dynamically linked tasks
- migration of communication end points
- reallocation of tasks

Dynamite disadvantages
- dependence on particular PVM versions
- no migration of MPI tasks
- no satisfactory monitoring system
- no advanced scheduling system
- no modules for global distribution

Main steps of the project
- migration of MPI and PVM tasks
- checkpointing of parallel tasks
- monitoring
- resource management
- support for additional architectures

Two-level system
- global level
- local level

Main problems of migration
- migration of PVM tasks
- migration of MPI tasks
- independence from particular versions and implementations of PVM and MPI
- support for additional architectures

Migration of PVM and MPI tasks
- files
- sockets
- kernel-supported threads, etc.

Checkpointing of parallel tasks
- track the progress of parallel tasks
- migrate parallel tasks at two levels:
  - migration of a single process of a parallel task (local level)
  - migration of a parallel task as a whole (global level)
- handle exceptional situations
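
The two migration levels listed above can be illustrated with a toy sketch. It is not the project's mechanism: in-memory state serialised with `pickle` stands in for a real process image, and it ignores files, sockets and threads entirely. The names `ProcessState`, `checkpoint`, `restore`, `migrate_process` and `migrate_task` are hypothetical.

```python
import pickle
from dataclasses import dataclass


@dataclass
class ProcessState:
    """Hypothetical stand-in for the saved image of one process of a parallel task."""
    rank: int
    host: str
    step: int
    data: list


def checkpoint(proc):
    """Serialise the process state to bytes (assumption: state is picklable)."""
    return pickle.dumps(proc)


def restore(blob, new_host):
    """Rebuild the process state from a checkpoint and bind it to a new host."""
    proc = pickle.loads(blob)
    proc.host = new_host
    return proc


def migrate_process(task, rank, new_host):
    """Local level: move one process; the other processes stay where they are."""
    task[rank] = restore(checkpoint(task[rank]), new_host)


def migrate_task(task, host_map):
    """Global level: checkpoint every process first, then restore all elsewhere."""
    blobs = {rank: checkpoint(proc) for rank, proc in task.items()}
    return {rank: restore(blob, host_map[rank]) for rank, blob in blobs.items()}
```

In the global case all checkpoints are taken before any process is restored, so the restored task starts from one consistent set of saved states.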

Checkpointing of parallel tasks
[Diagram labels: global level, local level]

Monitoring

Parameters of:
- computational resources (processor, memory and network load)
- tasks and queues
- users
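
As a rough sketch of how the three groups of monitored parameters might be represented, the following collapses raw samples into one summary per group. The record types, field names and task state names are assumptions made for illustration; the slides do not specify a data model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSample:
    """Hypothetical resource sample for one node; all values are fractions 0.0-1.0."""
    node: str
    cpu_load: float
    mem_used: float
    net_used: float


@dataclass(frozen=True)
class TaskRecord:
    """Hypothetical task record; state is "queued" or "running"."""
    task_id: int
    user: str
    queue: str
    state: str


def summarise(samples, tasks):
    """Return average resource load, queued tasks per queue, and tasks per user."""
    n = len(samples)
    resources = {}
    if n:
        resources = {
            "cpu": sum(s.cpu_load for s in samples) / n,
            "mem": sum(s.mem_used for s in samples) / n,
            "net": sum(s.net_used for s in samples) / n,
        }
    queues = Counter(t.queue for t in tasks if t.state == "queued")
    users = Counter(t.user for t in tasks)
    return {"resources": resources, "queues": dict(queues), "users": dict(users)}
```

A scheduler could read such a summary when deciding where to place or move tasks.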

Resource management
- distribution of tasks and queues at the current moment
- long-term scheduling
- dynamic load balancing at the global and local levels
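
A minimal sketch of immediate task placement and dynamic load balancing follows. It uses a simple greedy heuristic chosen for illustration, not the project's scheduler. `Node`, `place` and `rebalance`, the capacity unit and the `threshold` parameter are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    capacity: float                              # relative speed (hypothetical unit)
    tasks: list = field(default_factory=list)    # costs of tasks placed on this node

    @property
    def load(self):
        """Normalised load: total task cost divided by node capacity."""
        return sum(self.tasks) / self.capacity


def place(nodes, cost):
    """Put a new task on the node whose load would be lowest after placement."""
    target = min(nodes, key=lambda n: (sum(n.tasks) + cost) / n.capacity)
    target.tasks.append(cost)
    return target


def rebalance(nodes, threshold=0.5):
    """Migrate the cheapest task from the busiest node to the idlest one.

    The move happens only if the load gap exceeds `threshold` and the move
    lowers the higher of the two loads. Returns (source, target, cost) or None.
    """
    busiest = max(nodes, key=lambda n: n.load)
    idlest = min(nodes, key=lambda n: n.load)
    if busiest is idlest or not busiest.tasks:
        return None
    if busiest.load - idlest.load <= threshold:
        return None
    cost = min(busiest.tasks)
    new_busy = (sum(busiest.tasks) - cost) / busiest.capacity
    new_idle = (sum(idlest.tasks) + cost) / idlest.capacity
    if max(new_busy, new_idle) >= busiest.load:
        return None                              # the move would not help
    busiest.tasks.remove(cost)
    idlest.tasks.append(cost)
    return (busiest.name, idlest.name, cost)
```

The same two functions could be applied at either level: with nodes of one cluster at the local level, or with whole centres treated as nodes at the global level.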

Integration with Globus
[Diagram labels: global environment, Globus, local level (shown twice)]