NetSolve Henri Casanova and Jack Dongarra University of Tennessee and Oak Ridge National Laboratory

1 NetSolve. Henri Casanova and Jack Dongarra, University of Tennessee and Oak Ridge National Laboratory. http://www.cs.utk.edu/netsolve

2 Objectives: harnessing vast computational resources on the network (hardware and software); convenience for the scientific computing community; reducing installation and programming overhead; masking the complexity of distributed computing.

3 Computation-Sharing Models: Proxy Computing. Computation on the server. [Diagram labels: Client, Server; Data, Code, Data, Code]

4 Computation-Sharing Models: Code Shipping. Computation on the client. [Diagram labels: Client, Server; Code, Data, Code]

5 Computation-Sharing Models: Remote Computation. Computation on the server. [Diagram labels: Client, Server; Data, Code]

6 Design Issues: platform independence to accommodate heterogeneity; user-friendliness; extensibility; load balancing; fault tolerance.

7 NetSolve Architecture [Diagram; surviving labels: “OS”, Resources]

8 NetSolve Organization and Operation

9 NetSolve Client Interface: C, Fortran, Java, Matlab, and Mathematica. Matlab examples:
  Blocking call:
    >> a = rand(100); b = rand(100,1);
    >> x = netsolve('ax = b', a, b);
  Non-blocking call:
    >> a = rand(100); b = rand(100,1);
    >> request = netsolve_nb('send', 'ax = b', a, b);
    >> x = netsolve_nb('probe', request);
    Not ready
    >> x = netsolve_nb('wait', request);
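
The send / probe / wait pattern of the non-blocking interface can be illustrated outside NetSolve. The following is a minimal Python sketch, not NetSolve code: `solve_linear_system` is a hypothetical local stand-in for a remote solver, and a thread pool plays the role of the server. The `release` event only simulates remote latency so that the probe deterministically reports "not ready".

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def solve_linear_system(a, b, release):
    """Hypothetical stand-in for a remote solver (not part of NetSolve)."""
    release.wait()  # simulate remote latency until the caller lets the task proceed
    # Solve the 2x2 system a x = b with Cramer's rule.
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]

def demo():
    release = threading.Event()
    a = [[2.0, 0.0], [0.0, 4.0]]
    b = [2.0, 8.0]
    with ThreadPoolExecutor(max_workers=1) as pool:
        request = pool.submit(solve_linear_system, a, b, release)  # 'send'
        ready_before = request.done()                              # 'probe'
        release.set()                                              # the "server" finishes
        x = request.result()                                       # 'wait'
    return ready_before, x
```

Between the send and the wait the client is free to do other work or to issue further requests, which is how the non-blocking interface exposes parallelism.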

10 NetSolve Wrappers. Problem description file for extensibility:
    @PROBLEM ipars
    @INCLUDE "ipars.h"
    @LIB /home/user/lib/libipars.a
    @DESCRIPTION Parallel Sub-Surface Flow Simulator
    @INPUT 2
    @OBJECT STRING CHAR model
    @OBJECT FILE CHAR infile
  Compiled into wrappers around scientific libraries. XDR for platform-independent data transfer.
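
To show what platform-independent transfer means in practice, here is a small Python sketch of the XDR wire layout for a variable-length array of doubles: a 4-byte big-endian element count followed by each value as an 8-byte big-endian IEEE 754 double. The function names are illustrative assumptions; this is not the NetSolve wrapper code, only the encoding rule it relies on.

```python
import struct

def xdr_encode_double_array(values):
    """Encode a list of floats as an XDR variable-length array of doubles."""
    # 4-byte big-endian unsigned count, then one 8-byte big-endian double per element.
    header = struct.pack(">I", len(values))
    body = b"".join(struct.pack(">d", v) for v in values)
    return header + body

def xdr_decode_double_array(data):
    """Decode bytes produced by xdr_encode_double_array back into a list of floats."""
    (count,) = struct.unpack_from(">I", data, 0)
    return list(struct.unpack_from(">%dd" % count, data, 4))
```

Because the byte order is fixed by the format rather than by the host, a little-endian client and a big-endian server decode the same bytes to the same values.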

11 NetSolve Load Balancing. Assigning a task to the “best” machine: establishing a performance model (network delay, server properties, task properties); measuring and monitoring dynamic system states. Load balancing at a finer granularity: parallelism through the non-blocking interface; task migration.
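
One way to picture a performance model built from network delay, server properties, and task properties is the sketch below. The formula (transfer time plus compute time scaled by the idle fraction of the server) and all field names are assumptions chosen for illustration; the slides do not state the model NetSolve actually uses.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    latency_s: float      # network latency to the server, in seconds
    bandwidth_bps: float  # network bandwidth, in bytes per second
    flops: float          # peak floating-point rate of the server
    load: float           # measured fraction of the CPU in use, 0 <= load < 1

def estimated_time(server, data_bytes, task_flops):
    """Assumed model: time to ship the data plus time to compute on the idle share."""
    transfer = server.latency_s + data_bytes / server.bandwidth_bps
    compute = task_flops / (server.flops * (1.0 - server.load))
    return transfer + compute

def pick_best(servers, data_bytes, task_flops):
    """Return the server with the smallest estimated completion time."""
    return min(servers, key=lambda s: estimated_time(s, data_bytes, task_flops))
```

Under this model a nominally faster machine can lose to a slower one when it is heavily loaded or far away on the network, which is why the dynamic state has to be measured rather than assumed.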

12 NetSolve Fault Tolerance. Inter-server fault tolerance: fault tolerance among NetSolve servers. Intra-server fault tolerance: fault tolerance within a NetSolve server.

13 NetSolve Fault Tolerance: Inter-server Fault Tolerance. Performed by NetSolve agents. Basic approach: failure detection + task reallocation; overload detection + task migration. Introducing NetSolve storage servers: they store checkpoints or any other information related to fault tolerance (which must be platform-independent), so task migration does not rely on the failed or overloaded server.
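
The "failure detection + task reallocation" approach can be sketched as a loop over a ranked list of servers. Everything here is hypothetical: the servers are plain Python functions, a failure is signalled by a `ServerFailure` exception, and the ranking is assumed to come from the agent. It illustrates the control flow only, not the NetSolve agent protocol.

```python
class ServerFailure(Exception):
    """Raised by a (simulated) server that crashed or timed out."""

def failing_server(task):
    # Simulated server that is always unreachable.
    raise ServerFailure("server unreachable")

def healthy_server(task):
    # Simulated server that completes the task (here: summing a list).
    return sum(task)

def run_with_reallocation(task, ranked_servers):
    """Try servers in ranked order; on a detected failure, reallocate to the next one."""
    failed = []
    for server in ranked_servers:
        try:
            return server(task), failed
        except ServerFailure:
            failed.append(server.__name__)  # record the failure and move on
    raise RuntimeError("no server could complete the task")
```

Restarting from scratch on the next server is the simplest form of reallocation; with checkpoints kept on a separate storage server, the retry could resume from the last saved state instead.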

14 NetSolve Fault Tolerance: Intra-server Fault Tolerance. Not a new problem. Could be invisible to NetSolve. Can take advantage of platform-specific features for fault tolerance. Possible integration with inter-server fault tolerance.

15 Diskless Checkpointing: Checksums and Reverse Computation. Diskless checkpointing eliminates the need for stable storage. N servers + a checkpointing server. At any point, consistent checkpoints are taken at the N servers (stored in memory), and a checksum of the checkpoints is stored at the checkpointing server. Rollback using reverse computation. State recovery using the checksum.
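
The recovery step can be sketched with a bitwise XOR parity as the checksum. The choice of XOR is an assumption (the slide does not say which checksum is used), as are the equal-length checkpoints and the single-failure scenario. The sketch covers only state recovery, not rollback by reverse computation.

```python
from functools import reduce

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_checksum(checkpoints):
    """Parity held by the checkpointing server: XOR of all N in-memory checkpoints."""
    return reduce(xor_bytes, checkpoints)

def recover(checkpoints, checksum, failed_index):
    """Rebuild the lost checkpoint from the checksum and the N-1 surviving checkpoints."""
    survivors = [c for i, c in enumerate(checkpoints) if i != failed_index]
    return reduce(xor_bytes, survivors, checksum)
```

Since XOR is its own inverse, combining the checksum with the surviving checkpoints cancels their contribution and leaves exactly the state of the failed server, so no checkpoint ever has to be written to disk.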

16 Applications: MCell with NetSolve (large code, small data); Matlab with NetSolve (tradeoffs between parallelism and overhead); IPARS with NetSolve; ImageVision with NetSolve.

17

18 Integration with ScaLAPACK

19 Integration with Condor

20 Integration with Ninf

21 Conclusion. An interesting infrastructure for sharing computational resources (both software and hardware): convenience, performance, and reliability. A playground for fault tolerance (both general and specific).

