Does the implementation give solutions for the requirements?
- Flexibility
  - GridRPC enables dynamic join/leave of QM servers.
  - GridRPC enables dynamic expansion of a QM server.
- Robustness
  - GridRPC detects errors, and the application can implement its own recovery code.
- Efficiency
  - GridRPC can easily handle multiple clusters.
  - Local MPI provides high performance on a cluster through fine-grained parallelism.

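The robustness point (the RPC layer reports the error, the application decides how to recover) can be illustrated with a minimal sketch. This does not call GridRPC or Ninf-G: `RemoteCallError`, `run_with_recovery`, and the `call` parameter are hypothetical stand-ins for whatever the real client code would use.

```python
from typing import Callable, List, Sequence, Tuple


class RemoteCallError(Exception):
    """Hypothetical stand-in for an error reported by the RPC layer."""


def run_with_recovery(
    task: str,
    servers: Sequence[str],
    call: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Try `task` on each server in order and return (server, result).

    The recovery policy lives in the application: a failed call is
    recorded and the task is simply retried on the next server.
    """
    failures: List[Tuple[str, str]] = []
    for server in servers:
        try:
            return server, call(server, task)
        except RemoteCallError as exc:
            # The RPC layer only detected the error; we choose what to do next.
            failures.append((server, str(exc)))
    raise RuntimeError(f"task {task!r} failed on all servers: {failures}")
```

In a real client, `call` would wrap the remote invocation and translate its error codes into `RemoteCallError`; the retry order could come from a reservation table or a selection algorithm.
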
Strategy for long run
- The QM simulation migrates to another cluster either intentionally or unintentionally.
  - Intentional migration
    - The maximum runtime for the cluster is exceeded
    - The reservation period has expired
  - Unintentional migration
    - Any error/fault is detected
- The next cluster is selected either by reservation or by a simple selection algorithm.
  - The selection algorithm considers
    - number of available CPUs
    - number of requested CPUs
    - records of past utilization
- The simulation reads a host information file at every time step.
  - A cluster can join or leave the experiment on the fly.

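The selection step above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm used in the experiment: the host information file format (`name available_cpus past_utilization` per line), the `Cluster` type, and the ranking rule (lowest past utilization first, then most available CPUs) are all hypothetical. The slide only states which three quantities are considered.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Cluster:
    name: str
    available_cpus: int
    past_utilization: float  # assumed: fraction of recent time busy, 0.0 to 1.0


def parse_host_info(text: str) -> List[Cluster]:
    """Parse an assumed host file: 'name available_cpus past_utilization'.

    Blank lines are skipped and '#' starts a comment. Re-reading the file
    at every time step is what lets clusters join or leave on the fly.
    """
    clusters = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, cpus, util = line.split()
        clusters.append(Cluster(name, int(cpus), float(util)))
    return clusters


def select_cluster(
    clusters: List[Cluster],
    requested_cpus: int,
    exclude: FrozenSet[str] = frozenset(),
) -> Optional[Cluster]:
    """Pick a cluster with enough free CPUs, preferring low past utilization."""
    candidates = [
        c for c in clusters
        if c.name not in exclude and c.available_cpus >= requested_cpus
    ]
    if not candidates:
        return None
    # Assumed ranking: least utilized first, then most free CPUs, then name.
    return min(candidates, key=lambda c: (c.past_utilization, -c.available_cpus, c.name))
```

After a fault, the failed cluster's name can be passed in `exclude` so that the next choice is a different cluster.
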
National Institute of Advanced Industrial Science and Technology

Experiments
- Target simulation
- Testbed
- Results and lessons learned

Grid-enabled SIMOX Simulation on Japan-US Grid Testbed at SC2005
- SIMOX (Separation by Implanted Oxygen): a technique for fabricating a microstructure consisting of a Si surface layer on a thin SiO2 insulator
- Enables devices with higher speed and lower power consumption

SIMOX simulation on the Grid
- Simulate SIMOX by implanting five oxygen atoms with initial velocities much smaller than the usual values.
- The incident positions of the oxygen atoms relative to the surface crystalline structure of Si differ.
- 5 QM regions are initially defined
  - The size and number of QM regions change during the simulation
- 0.11 million atoms in total
- The results of the experiments will demonstrate the sensitivity of the process to the incident position of the oxygen atom when its implantation velocity is small.

Testbed for the experiment
- Clusters
  - AIST Super Clusters: P32 (2144 CPUs), M64 (528 CPUs), F32 (536 CPUs)
  - TeraGrid clusters: PSC clusters (3000 CPUs), NCSA clusters (1774 CPUs)
  - USC clusters: USC (7280 CPUs)
  - Japan clusters: U-Tokyo (386 CPUs), TITECH (512 CPUs)
- Clusters assigned to each QM region across Phases 1 to 4, in the order listed
  - QM1: P32, USC, ISTBS
  - QM2: P32, NCSA, USC, Presto
  - QM3: M64
  - QM4: P32, TCS, USC, P32
  - QM5: P32, TCS, USC, P32
  - Reserve: F32, P32, F32

Result of the experiment
[Figure: timeline of Phase 1 to Phase 4]
- Experiment time: days
- Simulation steps: 270 (~54 fs, about 0.2 fs per step)
- Longest calculation time: 4.76 days

Results of the experiment (cont'd)
- Behavior of the oxygen atoms strongly depends on the incident position
  [Figure: v/v0 versus time step for QM 1 to QM 5]
- Expanding/dividing QM regions every 5 time steps (expansion: 47 times, division: 8 times)
  [Figure: number of CPUs and number of QM atoms versus time step, up to step 270]
- Succeeded in a long run by intentional/unintentional resource migration
  - Intentional migration
  - Migration triggered by faults

Summary of the experimental results
- We verified that our strategy for long runs is a practical approach
  - Continue the simulation by migrating from one cluster to another based on reservation
- We verified that programming with GridRPC and MPI can implement a real Grid-enabled application
  - Dynamic resource allocation / migration
  - Recovery from faults
  - Management of hundreds of CPUs on distributed sites

Status and Future Plans
- Ninf-G Version 5 is coming!
- What are the differences from Ninf-G4?
  - Lower prerequisites for installation
    - Ninf-G4 needs the Globus library since it uses Globus IO for client/server communications.
    - Ninf-G5 can be installed without Globus, i.e., Ninf-G5 can be installed according to the underlying software environment
    - Three major components (remote process invocation, information retrieval, and client/server communication) are pluggable, e.g., without Globus, without TCP
  - Works efficiently from a single supercomputer to the Grid
  - Other new features will be supported
    - Connectionless (client/server)
    - Client-side checkpointing
- Ninf-G 5.0.0alpha will be available this March.

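The idea of three pluggable components can be sketched generically as a small registry that builds a stack from a configuration. This is not the Ninf-G5 API: the role names, `make_registry`, `register`, and `build_stack` are hypothetical and only illustrate how one implementation per role could be chosen to match the underlying software environment.

```python
from typing import Callable, Dict

# Assumed role names, mirroring the three components named on the slide.
ROLES = ("invocation", "information", "communication")

Registry = Dict[str, Dict[str, Callable[[], object]]]


def make_registry() -> Registry:
    """Create an empty registry with one slot per pluggable role."""
    return {role: {} for role in ROLES}


def register(registry: Registry, role: str, name: str,
             factory: Callable[[], object]) -> None:
    """Register a factory for one implementation of a role."""
    if role not in registry:
        raise KeyError(f"unknown role: {role}")
    registry[role][name] = factory


def build_stack(registry: Registry, config: Dict[str, str]) -> Dict[str, object]:
    """Instantiate one implementation per role, as named in `config`."""
    missing = [role for role in ROLES if role not in config]
    if missing:
        raise ValueError(f"config lacks roles: {missing}")
    return {role: registry[role][config[role]]() for role in ROLES}
```

A site without Globus would simply register and select non-Globus implementations for all three roles; nothing else in the client would need to change.
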
For more info, related links
- Ninf project ML
- Ninf-G Users ML (subscribed members only)
- Ninf project home page
- Open Grid Forum
- GGF GridRPC WG