
Workshop on Cloud Services for File Synchronization and Sharing, CERN, November 17-18, 2014


1 Data Management Services for VPH Applications
Marian Bubak, Adam Belloum, Spiros Koulouzis, Piotr Nowakowski, Dmitry Vasyunin
Department of Computer Science and Cyfronet AGH, Krakow, PL; Informatics Institute, University of Amsterdam
WP2 - VPH-Share
dice.cyfronet.pl/projects/VPH-Share
www.vph-share.eu

2 Infostructure for the Virtual Physiological Human
VPH-Share project website: http://www.vph-share.eu/

3 Basic functionality of the cloud platform
- Install/configure each application service (which we call an Atomic Service) once, then use it multiple times in different workflows
- Direct access to raw virtual machines is provided for developers, with many operating systems to choose from (IaaS solution)
- Install whatever you want (root access to cloud Virtual Machines)
- The cloud platform takes over management and instantiation of Atomic Services
- Many instances of Atomic Services can be spawned simultaneously
- Large-scale computations can be delegated from the PC to the cloud/HPC via a dedicated interface
- Smart deployment: computations can be executed close to data (or the other way round)
Roles: Developer - install any scientific application in the cloud; End user - access available applications and data in a secure manner; Administrator - manage cloud computing and storage resources
Nowakowski P, Bartynski T, Gubala T, Harezlak D, Kasztelnik M, Malawski M, Meizner J, Bubak M: Cloud Platform for Medical Applications, eScience 2012; Atmosphere platform: http://dice.cyfronet.pl/products/atmosphere

4 VPH-Share federated cloud
[architecture diagram: the Atmosphere Cloud Platform manages computational cloud resources; LOBCDER manages data cloud resources. Sites and providers: OpenStack @ USFD, OpenStack @ Cyfronet, OpenStack @ Vienna, other commercial clouds: Amazon (EC2; S3), RackSpace CloudFiles]

5 VPH applications
Most VPH workflows consist of three tasks:
- obtain clinical or biomedical data
- analyze the data with models or simulations
- produce clinical output
In all of these tasks, data sharing and access play a vital role.
A. Belloum, M. Inda, D. Vasunin, V. Korkhov, Z. Zhao, H. Rauwerda, T. Breit, M. Bubak, and L. Hertzberger, "Collaborative e-science experiments and scientific workflows," Internet Computing, IEEE, vol. 15, no. 4, pp. 39-47, July-August 2011.

6 Challenges
- The heterogeneity of data access technologies and the distribution of datasets make sharing and unified access difficult
- Ad-hoc client implementations are often developed to consume data from different sources
- It is impractical to force storage providers to install and maintain large software stacks so that different clients can have access
- In VPH-Share, datasets are not located in a single storage infrastructure and are available through different technologies
- Vendor lock-in is an issue when using clouds
- A variety of legacy applications can only access data from a local disk

7 Objectives
We want to build on top of existing infrastructure to provide large-scale advanced storage capabilities. This requires flexible, distributed, decentralized, scalable, fault-tolerant, self-optimizing solutions that address the issues of scientific communities.
Issues: data volume, data availability, data retrievability, data integrity, sharing, privacy, federation of datasets

8 Requirements
The architecture must be: loosely coupled, flexible, distributed, easy to use, standards compliant
- Storage federation: aggregation of a pool of independent resources in a client-centric manner
- Exposure of storage resources via standardized protocols
- File system abstraction
- A common management layer that interfaces independent storage resources

9 LOBCDER
The Large OBject Cloud Data storagE fedeRation (LOBCDER):
- is a lightweight, easily deployable streaming service
- loosely couples a variety of storage technologies
- transparently presents a distributed file system: applications perceive a system which mimics a local FS
- handles locating files and transporting data, with access transparency, location transparency, concurrency transparency, heterogeneity, replication transparency, and migration transparency

10 Structure of LOBCDER [diagram]
Spiros Koulouzis, Dmitry Vasyunin, Reginald Cushing, Adam Belloum and Marian Bubak, Cloud Data Storage Federation for Scientific Applications, Euro-Par 2013 Workshop Proceedings, pp. 13-22.

11 LOBCDER frontend
- The frontend provides access control, authentication and authorization
- It uses WebDAV, which provides interoperability as an RFC standard
- It enables network transparency through clients that are able to mount WebDAV
- It supports versioning, locking, and custom properties
- The authentication service authenticates users according to a security token: it validates the token and returns information about the user
- For clients that want control over infrastructure-dependent properties, we have implemented a RESTful interface
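To illustrate WebDAV-based access to the frontend, the sketch below builds a PROPFIND request (the standard WebDAV operation for listing a collection) carrying a security token. The host URL and token are illustrative placeholders, not real LOBCDER endpoints; only the request structure follows the WebDAV RFC.

```python
import urllib.request

# Hypothetical endpoint and token, for illustration only
LOBCDER_URL = "https://lobcder.example.org/lobcder/dav/"
TOKEN = "example-token"

def propfind_request(path, depth=1):
    """Build a WebDAV PROPFIND request listing a collection's members."""
    body = (b'<?xml version="1.0" encoding="utf-8"?>'
            b'<D:propfind xmlns:D="DAV:"><D:allprop/></D:propfind>')
    req = urllib.request.Request(LOBCDER_URL + path, data=body, method="PROPFIND")
    req.add_header("Depth", str(depth))  # 0 = item only, 1 = direct children
    req.add_header("Content-Type", "application/xml")
    # The security token would be validated by the authentication service
    req.add_header("Authorization", "Bearer " + TOKEN)
    return req

req = propfind_request("datasets/")
print(req.get_method())  # PROPFIND
```

Any DAV-compliant client issues the same kind of request under the hood, which is what makes mounting the federation as a local filesystem possible.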

12 LOBCDER resource layer
- The resource layer creates a logical representation of the physical storage space and manages physical files
- WebDAVResourceFactory and WebDAVResource provide a WebDAV representation of a LogicalResource
- ResourceCatalog connects to the persistence layer and queries LogicalResources
- The Task component manages the physical files; it schedules file replication and deletion
- LogicalResources hold basic metadata
- The PDRI component represents physical data
- The StorageSite component provides a description of the underlying storage resources

13 LOBCDER backend
- The backend layer provides the necessary abstraction to uniformly access physical storage resources; it is a Virtual Resource System API
- VFSClient can perform file system operations on physical data
- Different VFSDriver implementations enable transparent access to storage resources
- The persistence layer is a relational database which holds the logical data represented by the LogicalResource; it provides Atomicity, Consistency, Isolation and Durability (ACID) guarantees
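The driver pattern described above can be sketched as follows. This is a minimal illustration, not the actual LOBCDER Java API: the class and method names are assumptions chosen to mirror the VFSClient/VFSDriver roles, with a toy in-memory driver standing in for a real SWIFT or S3 driver.

```python
from abc import ABC, abstractmethod

class VFSDriver(ABC):
    """Uniform interface to one physical storage backend (names illustrative)."""
    @abstractmethod
    def read(self, path): ...
    @abstractmethod
    def write(self, path, data): ...
    @abstractmethod
    def exists(self, path): ...

class InMemoryDriver(VFSDriver):
    """Toy backend standing in for e.g. a SWIFT or S3 driver."""
    def __init__(self):
        self._blobs = {}
    def read(self, path):
        return self._blobs[path]
    def write(self, path, data):
        self._blobs[path] = data
    def exists(self, path):
        return path in self._blobs

class VFSClient:
    """Performs file-system operations on physical data via a pluggable driver."""
    def __init__(self, driver: VFSDriver):
        self.driver = driver
    def copy(self, src, dst):
        # The client never touches backend specifics: only the driver does
        self.driver.write(dst, self.driver.read(src))

client = VFSClient(InMemoryDriver())
client.driver.write("a.dat", b"payload")
client.copy("a.dat", "b.dat")
```

Swapping the driver changes the physical backend without touching client code, which is the transparency the backend layer provides.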

14 Distributed architecture of LOBCDER
- To balance requests and maintain a high level of accessibility we use an elastic distributed architecture
- Workers handle physical file transfer
- Worker instances can be deployed on arbitrary resources
- Workers are stateless instances that serve GET requests
[diagram: LOB Master with DB, dispatching to LOB Worker1-3, each connected to Storage Sites 1-3]
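Because workers are stateless, the master can spread GET requests across them freely. The sketch below shows the idea with a simple round-robin dispatcher; the source does not specify LOBCDER's actual scheduling policy, so the round-robin choice and all names here are illustrative assumptions.

```python
import itertools

class Worker:
    """Stateless worker: serving a GET needs no session state, so any worker will do."""
    def __init__(self, name):
        self.name = name
        self.served = 0
    def serve_get(self, path):
        self.served += 1
        return (self.name, path)

class Master:
    """Master spreads GET requests across workers (round-robin sketch)."""
    def __init__(self, workers):
        self._cycle = itertools.cycle(workers)
    def dispatch(self, path):
        return next(self._cycle).serve_get(path)

workers = [Worker(f"worker{i}") for i in range(3)]
master = Master(workers)
for n in range(9):
    master.dispatch(f"/file{n}")
print([w.served for w in workers])  # [3, 3, 3]
```

Statelessness is what makes elasticity cheap: adding a worker to the cycle immediately increases capacity, with no state to migrate.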

15 Data access for large binary objects
- In VPH-Share, the federated data storage module (LOBCDER) enables data sharing in the context of VPH-Share applications
- The module is capable of interfacing various types of storage resources and supports SWIFT cloud storage as well as Amazon S3
- LOBCDER exposes a WebDAV interface and can be accessed by any DAV-compliant client; it can also be mounted as a component of the local client filesystem using any DAV-to-FS driver (such as davfs2)
[diagram: LOBCDER host (149.156.10.143) running the LOBCDER service backend (resource catalogue, WebDAV servlet, resource factory, storage drivers including SWIFT) with a REST interface and encryption keys, backed by SWIFT storage; core component host (vph.cyfronet.pl) with the Data Manager Portlet (VPH-Share Master Interface component), ticket validation service and auth service; Atomic Service Instance (10.100.x.x) with a service payload (VPH-Share application component) mounting LOBCDER on the local FS, e.g. via davfs2; external host with a generic WebDAV client for GUI-based access]

16 LOBCDER in VPH-Share
- Online for 2762 hours (3.8 months)
- 800 GB transferred
- 97700 requests served
- At least 30 active users (some accounts are used by at least 5 people)

17 Evaluation of LOBCDER

18 Scalability of distributed LOBCDER
- 1,024 users request a file simultaneously
- The number of users in the queue decreases with the number of active workers

19 Data reliability and integrity (DRI)
Long-term persistence of medical data sets requires reliability and integrity mechanisms to be built on top of cloud storage. DRI is designed to fulfill these requirements by performing the following tasks:
- periodic and request-driven integrity checks on data sets
- facilitating storage of multiple copies of data on various cloud platforms
- tracking the history and origin of binary data sets
Dataset validation covers:
- the availability of each dataset at its (multiple) locations
- the integrity of each dataset's files (checksum-based)
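The checksum-based integrity check above can be sketched in a few lines: hash each stored file and compare every replica's checksum against the registered reference value. DRI uses SHA-256 (see the implementation slide); the function names and the replica-location labels below are illustrative, not part of the DRI API.

```python
import hashlib
import io

def file_checksum(stream, chunk=64 * 1024):
    """SHA-256 of a file-like object, read in chunks so large binaries fit in memory."""
    h = hashlib.sha256()
    for block in iter(lambda: stream.read(chunk), b""):
        h.update(block)
    return h.hexdigest()

def validate_replicas(reference, replicas):
    """Flag each replica location whose checksum no longer matches the reference."""
    return {loc: chk == reference for loc, chk in replicas.items()}

ref = file_checksum(io.BytesIO(b"dataset bytes"))
status = validate_replicas(ref, {"swift@site-a": ref, "s3@site-b": "deadbeef"})
print(status)  # {'swift@site-a': True, 's3@site-b': False}
```

A mismatch on any location would trigger the alerting path described on the DRI architecture slide (notifying dataset owners and administrators).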

20 DRI service
DRI Service characteristics:
- stateless RESTful service
- built on top of the Atmosphere computational cloud, metadata registry (AIR) and LOBCDER storage resources
- autonomous (periodic checks) and controllable via API access

21 DRI architecture
- Provides a mechanism which keeps track of binary data stored in cloud infrastructure
- Monitors data availability
- Advises the cloud platform when instantiating atomic services
DRI Service: a standalone application service, capable of autonomous operation. It periodically verifies access to any datasets submitted for validation and is capable of issuing alerts to dataset owners and system administrators in case of irregularities.
[diagram: the VPH Master Interface data management portlet (with DRI management extensions) offers end-user features (browsing, querying, direct access to data, checksumming); the DRI Service has a configurable, registry-driven validation policy, a runtime layer, an extensible resource client layer, and metadata extensions for DRI; it registers files, gets metadata, migrates LOBs and gets usage stats via the binary data registry and LOBCDER, which stores and marshals data across distributed cloud storage (Amazon S3, OpenStack Swift, Cumulus)]

22 DRI implementation
Features:
- supported requests: dataset validation and adding a dataset under management (computing its checksums)
- integrity checks based on the SHA-256 hash of each file
- asynchronous operations added to the queue of a simple scheduler
- report notifications (validation success/failure or finished adding a dataset under management) sent as e-mail messages
Current REST API:
- compute_dataset_checksums/{datasetID} - asynchronously compute dataset file checksums and dispatch notification by e-mail
- validate_dataset/{datasetID} - asynchronously validate dataset and dispatch notification by e-mail
K. Styrc, P. Nowakowski, M. Bubak: Managing data reliability and integrity in federated cloud storage. In: M. Bubak, M. Turała, K. Wiatr (Eds) CGW'13 Proceedings, ACK CYFRONET AGH, Kraków, ISBN 978-83-61433-08-8, pp. 43-44 (2013)
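A client of this REST API would simply address one of the two endpoints above with a dataset identifier; because both operations are asynchronous, the call returns immediately and the result arrives later by e-mail. The base URL and HTTP method below are assumptions (the slide only gives the endpoint paths); only the path shapes come from the API description.

```python
import urllib.request

# Hypothetical service base URL; only the endpoint paths come from the DRI API
DRI_BASE = "https://dri.example.org/dri"

VALID_OPS = ("compute_dataset_checksums", "validate_dataset")

def dri_request(operation, dataset_id):
    """Build a request for one of DRI's asynchronous operations."""
    if operation not in VALID_OPS:
        raise ValueError(f"unknown DRI operation: {operation}")
    # HTTP method assumed; the slide does not specify it
    return urllib.request.Request(f"{DRI_BASE}/{operation}/{dataset_id}", method="GET")

req = dri_request("validate_dataset", "ds-42")
print(req.full_url)  # https://dri.example.org/dri/validate_dataset/ds-42
```

Since the scheduler queues the work, a caller can fire off checks for many datasets without waiting on any of them.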

23 Summary
- Distributed applications have a global shared view of the entire available storage space
- Applications can be developed locally and deployed on the computation platform without reimplementing their data access logic
- Storage space is used efficiently with the copy-on-write strategy
- Replication of data can be based on efficiency cost measures
- Reduced risk of vendor lock-in in clouds

24 Service summary
Status: Production
Number of users (current, target): around 100 for both
Default and maximum quota: n/a
Linux/Mac/Win user ratio: approximately 50/50 Linux/Win
Desktop clients / mobile clients / web access ratio: mostly desktop clients with some web applications
Technology: OpenStack SWIFT, Amazon S3
Target communities: medical researchers and members of the VPH NoE
Integration in your current environment (examples): WebDAV external interfaces; robust GUIs embedded in VPH-Share MI
Risk factors: n/a
Most important functionality: federated cloud storage
Missing functionality (if any): storage encryption

25 User feedback
- Users interact with the presented data storage services in almost any production-mode VPH-Share application
- Since the VPH-Share data storage components are designed to be transparent to end users, it is difficult to obtain direct feedback regarding their usability; instead, users contact us in case of any data access problems and we operate by the "no news is good news" rule
- Our current concern is sustainability after VPH-Share

26 Example: sensitivity analysis application
Problem: a cardiovascular sensitivity study with 164 input parameters (e.g. vessel diameter and length).
- First analysis: 1,494,000 Monte Carlo runs (expected execution time on a PC: 14,525 hours)
- Second analysis: 5,000 runs per model parameter for each patient dataset; this requires another 830,000 Monte Carlo runs per patient dataset for a total of four additional patient datasets, resulting in 32,280 hours of calculation time on one personal computer
- Total: about 50,000 hours of calculation time on a single PC
Solution: scale the application with cloud resources.
VPH-Share implementation: a scalable workflow deployed entirely using VPH-Share tools and services. It consists of a RabbitMQ server and a number of clients processing computational tasks in parallel, each registered as an Atomic Service. The server and client Atomic Services are launched by a script which communicates directly with the Cloud Facade API. Small-scale runs have completed successfully; a large-scale run is in progress.
[diagram: the scientist's launcher script calls the Cloud Facade secure API; the Atmosphere Management Service launches the DataFluo server Atomic Service and automatically scales RabbitMQ worker Atomic Services]

27 Deployment of OncoSimulator in the cloud
Deployment of the OncoSimulator Tool on VPH-Share resources:
- uses a custom Atomic Service as the computational backend
- features integration of data storage resources
- the OncoSimulator AS is also registered in the VPH-Share metadata store (AIR registry)
[diagram: P-Medicine users access the P-Medicine Portal (OncoSimulator submission form, visualization window served by the VITRALL Visualization Service); the Cloud Facade and Atmosphere Management Service (AMS) launch Atomic Services on the VPH-Share computational cloud platform (cloud HN/WN hosting the OncoSimulator ASI); output is stored via the LOBCDER storage federation, which is mounted to select results for storage in the P-Medicine Data Cloud]

28 More information
dice.cyfronet.pl - documentation, publications, links to manuals, videos, etc.
www.vph-share.eu - your one-stop entry to all VPH-Share functionality. You can log in with your BioMedTown account (available to all members of the VPH NoE).

