Virtualization in MetaSystems
Vaidy Sunderam
Emory University, Atlanta, USA
Credits and Acknowledgements
- Distributed Computing Laboratory, Emory University: Dawid Kurzyniec, Piotr Wendykier, David DeWolfs, Dirk Gorissen, Maciej Malawski, Vaidy Sunderam
- Collaborators: Oak Ridge Labs (A. Geist, C. Engelmann, J. Kohl); Univ. Tennessee (J. Dongarra, G. Fagg, E. Gabriel)
- Sponsors: U.S. Department of Energy, National Science Foundation, Emory University
Virtualization
- Fundamental and universal concept in CS, but receiving renewed, explicit recognition
- Machine level
  - Single OS image: Virtuozzo, Vservers, Zones
  - Full virtualization: VMware, VirtualPC, QEMU
  - Para-virtualization: UML, Xen (Ian Pratt et al., cl.cam.uk)
  - “Consolidate under-utilized resources, avoid downtime, load balancing, enforce security policy”
- Parallel distributed computing
  - Software systems: PVM, MPICH, grid toolkits and systems
  - Consolidate under-utilized resources, avoid downtime, load balancing, enforce security policy + aggregate resources
Virtualization in PVM
- Historical perspective: PVM 1.0, 1989
Key PVM Abstractions
- Programming model
  - Timeshared, multiprogrammed virtual machine
  - Two-level process space: functional name + ordinal number
  - Flat, open, reliable messaging substrate
  - Heterogeneous messages and data representation
- Multiprocessor emulation
  - Processor/process decoupling
  - Dynamic addition/deletion of processors
  - Raw nodes projected transparently, or with exposure of heterogeneous attributes
Parallel Distributed Computing
- Multiprocessor systems
  - Parallel distributed memory computing
  - Stable and mainstream: SPMD, MPI
  - Issues relatively clear: performance
- Platforms, applications: correspondingly tightly coupled
Parallel Distributed Computing
- Metacomputing and grids
  - Platforms
  - Parallelism: possibly within components, but mostly loose concurrency or pipelining between components (PVM: 2-level model)
- Grids: resource virtualization across multiple administrative domains
  - Moved to explicit focus on service orientation
  - “Wrap applications as services, compose applications into workflows”; deploy on service-oriented infrastructure
- Motivation: service/resource coupling
  - Provider provides resource and service; virtualized access
Virtualization in PDC
- What can/should be virtualized?
  - Raw resource
    - CPU: process/task instantiation => staging, security, etc.
    - Storage: e.g. network file system over GMail
    - Data: value-added or processed
  - Service
    - Define interface and input-output behavior
    - Service provider must operate the service
  - Communication
    - Interaction paradigm with strong/adequate semantics
- Key capability: configurable/reconfigurable resource, service, and communication
The Harness II Project
- Theme: virtualized abstractions for critical aspects of parallel distributed computing, implemented as pluggable modules (including programming systems)
- Major project components
  - Fault-tolerant MPI: specification, libraries
  - Container/component infrastructure: C-kernel, H2O
  - Communication framework: RMIX
  - Programming systems: FT-MPI + H2O, MOCCA (CCA + H2O), PVM
Harness II: Aggregation for Concurrent High Performance Computing
- Hosting layer
  - Collection of H2O kernels
  - Flexible/lightweight middleware
- Equivalent to Distributed Virtual Machine
  - But only on client side
- DVM pluglets responsible for
  - (Co)allocation/brokering
  - Naming/discovery
  - Failures/migration/persistence
- Programming environments: FT-MPI, CCA, paradigm frameworks, distributed numerical libraries
[Figure: layered view. Applications (App 1, App 2) sit on a programming model (FT-MPI, PVM, Comp. Active objects, ...), which sits on a virtual layer of DVM-enabling components over Providers A, B, and C; cooperating users share the system.]
H2O Middleware Abstraction
- Providers own resources
- Independently make them available over the network
- Clients discover, locate, and utilize resources
- Resource sharing occurs between single provider and single client
- Relationships may be tailored as appropriate
  - Including identity formats, resource allocation, compensation agreements
- Clients can themselves be providers
  - Cascading pairwise relationships may be formed
[Figure: providers and clients connected through the network.]
H2O Framework
- Resources provided as services
- Service = active software component exposing functionality of the resource
  - May represent “added value”
  - Runs within a provider’s container (execution context)
  - May be deployed by any authorized party: provider, client, or third-party reseller
- Provider specifies policies
  - Authentication/authorization
  - Actors: kernel/pluglet
- Decoupling
  - Providers/providers/clients
[Figure: traditional model versus H2O model. Traditional: the provider deploys services A and B into the container on the provider host, and the client looks them up and uses them. H2O: the provider, client, or reseller deploys A and B into the container, and the client looks them up and uses them.]
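
The deploy/lookup split described on this slide can be sketched in a few lines of Java. This is only an illustration under assumptions: `ContainerSketch`, `deploy`, and `lookup` are hypothetical names and do not come from the H2O API, and the "policy" is reduced to a set of principals allowed to deploy.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Illustrative only: ContainerSketch and its methods are hypothetical names,
// not the H2O kernel API.
class ContainerSketch {
    // Provider-specified policy: principals allowed to deploy services.
    private final Set<String> authorizedDeployers;
    private final Map<String, Object> services = new HashMap<>();

    ContainerSketch(Set<String> authorizedDeployers) {
        this.authorizedDeployers = authorizedDeployers;
    }

    // Any authorized party (provider, client, or reseller) may deploy.
    void deploy(String principal, String name, Supplier<Object> factory) {
        if (!authorizedDeployers.contains(principal)) {
            throw new SecurityException(principal + " is not authorized to deploy");
        }
        services.put(name, factory.get());
    }

    // Clients look up a deployed service by name; returns null if absent.
    Object lookup(String name) {
        return services.get(name);
    }

    public static void main(String[] args) {
        ContainerSketch container = new ContainerSketch(Set.of("provider", "reseller"));
        container.deploy("reseller", "greeter", () -> "hello from the provider's container");
        System.out.println(container.lookup("greeter"));
    }
}
```

The point of the sketch is that the party who deploys a service and the party who owns the container need not be the same, as long as the provider's policy admits the deployer.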
Example usage scenarios
- Resource = computational service
  - Reseller deploys software component into provider’s container
  - Reseller notifies the client about the offered computational service
  - Client utilizes the service
- Resource = raw CPU power
  - Client gathers application components
  - Client deploys components into providers’ containers
  - Client executes distributed application utilizing providers’ CPU power
- Resource = legacy application
  - Provider deploys the service
  - Provider stores the information about the service in a registry
  - Client discovers the service
  - Client accesses legacy application through the service
Model and Implementation
- H2O nomenclature: container = kernel, component = pluglet
- Object-oriented model, Java and C-based implementations
- Pluglet = remotely accessible object
  - Must implement Pluglet interface, may implement Suspendible interface
  - Used by kernel to signal/trigger pluglet state changes
- Model: implement (or wrap) service as a pluglet to be deployed on kernel(s)

    interface Pluglet {
        void init(ExecutionContext cxt);
        void start();
        void stop();
        void destroy();
    }

    interface Suspendible {
        void suspend();
        void resume();
    }

    interface StockQuote {
        double getStockQuote();
    }

[Figure: clients reach a pluglet inside the kernel through its functional interfaces (e.g. StockQuote); the kernel drives the pluglet through the Pluglet and, optionally, Suspendible interfaces.]
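
A minimal sketch of a service wrapped as a pluglet, using the three interfaces named on the slide. The interface bodies are transcribed from the slide; everything else is an assumption for illustration: `ExecutionContext` is an empty stand-in, `StockQuotePluglet` and `PlugletDemo` are hypothetical names, the quote value is a placeholder, and `main` only imitates the calls a kernel might make.

```java
// Hypothetical demo driver; a real kernel would make these lifecycle calls.
class PlugletDemo {
    public static void main(String[] args) {
        StockQuotePluglet pluglet = new StockQuotePluglet();
        pluglet.init(new ExecutionContext() {});
        pluglet.start();
        System.out.println("quote: " + pluglet.getStockQuote());
        pluglet.suspend();   // calls are rejected while suspended
        pluglet.resume();
        pluglet.stop();
        pluglet.destroy();
    }
}

// Empty stand-in; the slide does not show what the context contains.
interface ExecutionContext {}

// Lifecycle interface, as shown on the slide.
interface Pluglet {
    void init(ExecutionContext cxt);
    void start();
    void stop();
    void destroy();
}

// Optional interface, as shown on the slide.
interface Suspendible {
    void suspend();
    void resume();
}

// Functional interface exposed to clients, as shown on the slide.
interface StockQuote {
    double getStockQuote();
}

class StockQuotePluglet implements Pluglet, Suspendible, StockQuote {
    private boolean running;
    private final double lastQuote = 42.0; // placeholder value, not real data

    public void init(ExecutionContext cxt) { running = false; }
    public void start() { running = true; }
    public void stop() { running = false; }
    public void destroy() { running = false; }
    public void suspend() { running = false; }
    public void resume() { running = true; }

    public double getStockQuote() {
        if (!running) {
            throw new IllegalStateException("pluglet is not running");
        }
        return lastQuote;
    }
}
```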
Accessing Virtualized Services
- Request-response ideally suited, but
  - Stateful service access must be supported
  - Efficiency issues, concurrent access
  - Asynchronous access for compute-intensive services
  - Semantics of cancellation and error handling
- Many approaches focus on performance alone and ignore semantic issues
- Solution
  - Enhanced procedure call/method invocation
  - Well-understood paradigm, extended to be more appropriate for accessing metacomputing services
The RMIX layer
- H2O built on top of RMIX communication substrate
  - Provides flexible p2p communication layer for H2O applications
- Enable various message layer protocols within a single, provider-based framework library
  - Adopting common RMI semantics
  - Enable high performance and interoperability
  - Easy porting between protocols, dynamic protocol negotiation
- Offer flexible communication model, but retain RMI simplicity
  - Extended with: asynchronous and one-way calls
  - Issues: consistency, ordering, exceptions, cancellation
[Figure: RPC clients, Web Services, SOAP clients, and Java clients reach pluglets A to F in an H2O kernel through RMIX networking (RPC, IIOP, JRMP, SOAP, ...).]
RMIX Overview
- Extensible RMI framework
- Client and provider APIs
  - Uniform access to communication capabilities
  - Supplied by pluggable provider implementations
- Multiple protocols supported: JRMPX, ONC-RPC, SOAP
- Configurable and flexible
  - Protocol switching
  - Asynchronous invocation
[Figure: service access through RMIX providers (RMIX XSOAP, RMIX RPCX, RMIX Myri, RMIX JRMPX), serving Web Services and SOAP clients, ONC-RPC, GM, and Java.]
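
The idea of pluggable protocol providers behind one uniform call interface can be sketched as follows. This is not the RMIX provider API: `ProtocolSwitchSketch`, `ProtocolProvider`, `register`, `negotiate`, and `call` are hypothetical names, the providers are stubs that only tag their output, and "negotiation" is simplified to picking the first client-preferred protocol that the server has registered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a registry of protocol providers behind one call method.
class ProtocolSwitchSketch {
    // Hypothetical provider interface; a real provider would marshal and send.
    interface ProtocolProvider {
        String invoke(String method, String arg);
    }

    private final Map<String, ProtocolProvider> providers = new LinkedHashMap<>();

    void register(String protocol, ProtocolProvider provider) {
        providers.put(protocol, provider);
    }

    // Simplified negotiation: first client preference that is registered here.
    String negotiate(List<String> clientPreferences) {
        for (String protocol : clientPreferences) {
            if (providers.containsKey(protocol)) {
                return protocol;
            }
        }
        throw new IllegalArgumentException("no common protocol");
    }

    // The caller's code is the same whichever protocol was selected.
    String call(String protocol, String method, String arg) {
        return providers.get(protocol).invoke(method, arg);
    }

    public static void main(String[] args) {
        ProtocolSwitchSketch rmi = new ProtocolSwitchSketch();
        rmi.register("SOAP", (m, a) -> "SOAP:" + m + "(" + a + ")");
        rmi.register("JRMPX", (m, a) -> "JRMPX:" + m + "(" + a + ")");

        String chosen = rmi.negotiate(List.of("ONC-RPC", "JRMPX", "SOAP"));
        System.out.println(rmi.call(chosen, "echo", "hi"));
    }
}
```

Switching protocols then amounts to passing a different protocol name to `call`, without touching the calling code.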
RMIX Abstractions
- Uniform interface and API
- Protocol switching
- Protocol negotiation
- Various protocol stacks for different situations
  - SOAP: interoperability
  - SSL: security
  - ARPC, custom (Myrinet, Quadrics): efficiency
- Asynchronous access to virtualized remote resources
[Figure: H2O pluglets, each acting as client or server, in Harness kernels communicating across the Internet; links are labelled security, firewall, and efficiency.]
Asynchronous RMIX
- Virtualizing communications: performance/familiarity vs. semantic issues
- Parameter marshalling
  - Data consistency
  - Also in PVM, MPI, etc.
- Exceptions/cancellation
  - Critical for stateful servers
  - Conservative vs. best effort
- Other issues
  - Execution order
  - Security
[Figure: sequence diagram of an asynchronous call involving :stub and :param, with create(), asyncCall(), modify(), and read(); :stub reports “started” and :target reports “completed”.]
[Figure: cancellation at various stages of the call between client and server. Stages: call initiation, parameter marshalling, parameter unmarshalling, method call, result marshalling, result delivery, result unmarshalling. Actions: disregard at client side, interrupt client I/O, disregard at server side, interrupt server thread, interrupt server I/O, ignore result, reset server state.]
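
The difference between cancelling at the client and actually stopping the server can be illustrated with standard Java futures. This sketch does not use RMIX; `AsyncCallSketch` and `asyncCall` are hypothetical names, and the "remote method" is a local task that waits on a latch. It shows the best-effort case in which the client disregards the result while the server-side work is left to run.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;

// Illustrative only: client-side cancellation of an asynchronous call.
class AsyncCallSketch {
    // Stand-in for a remote method: waits until released, then squares x.
    static CompletableFuture<Integer> asyncCall(CountDownLatch release, int x) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return x * x;
        });
    }

    public static void main(String[] args) {
        // Case 1: cancel while the call is still in progress.
        CountDownLatch gate = new CountDownLatch(1);
        CompletableFuture<Integer> pending = asyncCall(gate, 3);
        // CompletableFuture.cancel does not interrupt the worker thread:
        // the client disregards the result, the "server" keeps running.
        boolean cancelled = pending.cancel(true);
        gate.countDown(); // let the abandoned task finish on its own
        System.out.println("cancelled at client side: " + cancelled);

        // Case 2: no cancellation, the result is delivered normally.
        CountDownLatch open = new CountDownLatch(0);
        System.out.println("result: " + asyncCall(open, 4).join());
    }
}
```

A conservative scheme would additionally have to interrupt the server thread and reset any server state the call had touched, which is why the slide singles out stateful servers.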
Programming Models: CCA and H2O
- Common Component Architecture
  - Component standard for HPC
  - Uses and provides ports described in SIDL
  - Support for scientific data types
  - Existing tightly coupled (CCAFFEINE) and loosely coupled, distributed (XCAT) frameworks
- H2O
  - Well matched to CCA model
MOCCA implementation in H2O
- Each component running in separate pluglet
- Thanks to H2O kernel security mechanisms, multiple components may run without interfering
- Two-level builder hierarchy
- ComponentID: pluglet URI
- MOCCA_Light: pure Java implementation (no SIDL)
Performance: Small Data Packets
- Factors
  - SOAP header overhead in XCAT
  - Connection pools in RMIX
Large Data Packets
- Encoding (binary vs. base64)
- CPU saturation on Gigabit LAN (serialization)
- Variance caused by Java garbage collection
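
One reason encoding matters for large payloads is that base64 carries every 3 bytes of binary data as 4 characters. The sketch below measures that expansion with the standard `java.util.Base64` encoder. It does not reproduce the measurements from the slide; `EncodingOverhead` and `base64Size` are hypothetical names, and the payload size is arbitrary.

```java
import java.util.Base64;

// Illustrative only: size of a payload sent as raw binary vs. base64 text.
class EncodingOverhead {
    // Number of bytes produced by base64-encoding rawBytes bytes of data.
    static int base64Size(int rawBytes) {
        return Base64.getEncoder().encode(new byte[rawBytes]).length;
    }

    public static void main(String[] args) {
        int raw = 3_000_000; // arbitrary payload size for illustration
        int encoded = base64Size(raw);
        System.out.printf("raw=%d bytes, base64=%d bytes, ratio=%.3f%n",
                raw, encoded, (double) encoded / raw);
    }
}
```

The expansion is a fixed factor of 4/3 (plus padding), on top of the CPU time spent encoding and decoding.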
Use Case 2: H2O + FT-MPI
- Overall scheme
  - H2O framework installed on computational nodes, or cluster front-ends
  - Pluglet for startup, event notification, node discovery
  - FT-MPI native communication (also MPICH)
- Major value added
  - FT-MPI need not be installed anywhere on computing nodes
    - To be staged just-in-time before program execution
  - Likewise, application binaries and data need not be present on computing nodes
  - The system must be able to stage them in a secure manner
Staging FT-MPI runtime with H2O
- FT-MPI runtime library and daemons
  - Staged from a repository (e.g. Web server) to the computational node upon user’s request
  - Automatic platform type detection; appropriate binary files are downloaded from the repository as needed
- Allows users to run fault-tolerant MPI programs on machines where FT-MPI is not pre-installed
  - No login account is needed: H2O credentials are used instead
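
Platform detection followed by selection of the matching binary could look roughly like this. The slide does not describe the repository layout, so everything specific here is an assumption: the `family-arch` tag format, the `repoBase/tag/artifact` URL scheme, the example host `repo.example.org`, and the names `StagingSketch`, `platformTag`, and `binaryUrl` are all hypothetical. Only the `os.name` and `os.arch` system properties are standard Java.

```java
import java.util.Locale;

// Illustrative only: choose a platform-specific binary from a repository.
class StagingSketch {
    // Reduce JVM-reported OS name and architecture to a tag (hypothetical format).
    static String platformTag(String osName, String osArch) {
        String os = osName.toLowerCase(Locale.ROOT);
        String family = os.contains("linux") ? "linux"
                : os.contains("mac") ? "darwin"
                : os.contains("windows") ? "windows"
                : "unknown";
        return family + "-" + osArch.toLowerCase(Locale.ROOT);
    }

    // Build the download URL for an artifact (hypothetical repository layout).
    static String binaryUrl(String repoBase, String artifact, String tag) {
        return repoBase + "/" + tag + "/" + artifact;
    }

    public static void main(String[] args) {
        String tag = platformTag(System.getProperty("os.name"),
                                 System.getProperty("os.arch"));
        // repo.example.org is a placeholder host, not a real repository.
        System.out.println(binaryUrl("http://repo.example.org/ftmpi", "ftmpi-runtime.tar.gz", tag));
    }
}
```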
Launching FT-MPI applications with H2O
- Staging applications from a network repository
  - Uses URL code base to refer to a remotely stored application
  - Platform-specific binary transparently uploaded to a computational node upon client request
- Separation of roles
  - Application developer bundles the application and puts it into a repository
  - The end-user launches the application, unaware of heterogeneity
Interconnecting heterogeneous clusters
- Private, non-routable networks
- Communication proxies on cluster front-ends route data streams
- Local (intra-cluster) channels not affected
- Nodes use virtual addresses at the IP level; resolved by the proxy
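
The routing rule on this slide, with intra-cluster traffic going direct and inter-cluster traffic passing through a front-end proxy that resolves virtual addresses, can be sketched as a lookup table. This is a simplification under assumptions: `ClusterRoutingSketch`, `register`, and `route` are hypothetical names, the virtual addresses and 10.x private addresses are made up, and the returned strings only describe the decision instead of opening connections.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: resolving virtual addresses across private clusters.
class ClusterRoutingSketch {
    // A node's home cluster and its private (non-routable) address.
    static final class Node {
        final String cluster;
        final String privateAddr;

        Node(String cluster, String privateAddr) {
            this.cluster = cluster;
            this.privateAddr = privateAddr;
        }
    }

    private final Map<String, Node> byVirtualAddr = new HashMap<>();

    void register(String virtualAddr, String cluster, String privateAddr) {
        byVirtualAddr.put(virtualAddr, new Node(cluster, privateAddr));
    }

    // Same cluster: direct channel. Different cluster: via the target's proxy.
    String route(String srcVirtual, String dstVirtual) {
        Node src = byVirtualAddr.get(srcVirtual);
        Node dst = byVirtualAddr.get(dstVirtual);
        if (src == null || dst == null) {
            throw new IllegalArgumentException("unknown virtual address");
        }
        if (src.cluster.equals(dst.cluster)) {
            return "direct to " + dst.privateAddr;
        }
        return "via proxy of " + dst.cluster + " to " + dst.privateAddr;
    }

    public static void main(String[] args) {
        ClusterRoutingSketch table = new ClusterRoutingSketch();
        table.register("v1", "clusterA", "10.0.0.5");
        table.register("v2", "clusterA", "10.0.0.6");
        table.register("v3", "clusterB", "10.0.0.5"); // private ranges may overlap
        System.out.println(table.route("v1", "v2"));
        System.out.println(table.route("v1", "v3"));
    }
}
```

Because the private ranges of two clusters can overlap, a node is identified by its virtual address, and only the proxy needs to know the mapping.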
Initial experimental results
- Proxied connection versus direct connection
- Standard FT-MPI throughput benchmark was used
- Within a Gigabit Ethernet cluster: proxies retain 65% of throughput
Summary
- Virtualization in PDC
  - Devising appropriate abstractions
  - Balance pragmatics and performance vs. model cleanness
- The Harness II Project
  - H2O kernel
    - Reconfigurability by clients/third-party resellers is very valuable
  - RMIX communications framework
    - High-level abstractions for control comms (native data comms)
  - Multiple programming model overlays
    - CCA, FT-MPI, PVM
- Concurrent computing environments on demand