Data Management Services for VPH Applications, Marian Bubak. Workshop on Cloud Services for File Synchronization and Sharing, CERN, November 17-18, 2014

Data Management Services for VPH Applications
Marian Bubak, Adam Belloum, Spiros Koulouzis, Piotr Nowakowski, Dmitry Vasyunin
Department of Computer Science and Cyfronet, AGH Krakow, PL; Informatics Institute, University of Amsterdam
WP2 - VPH-Share, dice.cyfronet.pl/projects/VPH-Share

Infostructure for the Virtual Physiological Human: the VPH-Share project.

Basic functionality of the cloud platform:
- Install/configure each application service (which we call an Atomic Service) once, then use it multiple times in different workflows.
- Developers get direct access to raw virtual machines, with a multitude of operating systems to choose from (an IaaS solution).
- Install whatever you want (root access to cloud virtual machines).
- The cloud platform takes over management and instantiation of Atomic Services.
- Many instances of an Atomic Service can be spawned simultaneously.
- Large-scale computations can be delegated from the PC to the cloud/HPC via a dedicated interface.
- Smart deployment: computations can be executed close to data (or the other way round).
Roles: the developer installs any scientific application in the cloud; the end user accesses available applications and data in a secure manner; the administrator manages the cloud computing and storage resources of the e-science infrastructure.
Nowakowski P, Bartynski T, Gubala T, Harezlak D, Kasztelnik M, Malawski M, Meizner J, Bubak M: Cloud Platform for Medical Applications, eScience 2012; Atmosphere platform.

VPH-Share federated cloud
(Diagram: the Atmosphere Cloud Platform manages computational cloud resources at USFD, Cyfronet and Vienna, as well as other commercial clouds such as Amazon (EC2, S3) and RackSpace CloudFiles; LOBCDER manages the data cloud resources.)

VPH applications
Most VPH workflows consist of three tasks:
- obtain clinical or biomedical data,
- analyze the data with models or simulations,
- produce clinical output.
In all of these tasks, data sharing and access play a vital role.
A. Belloum, M. Inda, D. Vasunin, V. Korkhov, Z. Zhao, H. Rauwerda, T. Breit, M. Bubak, and L. Hertzberger, "Collaborative e-science experiments and scientific workflows," IEEE Internet Computing, vol. 15, no. 4, pp. 39-47, July-August 2011.

Challenges
- The heterogeneity of data access technologies and the distribution of datasets make sharing and unified access difficult.
- Ad-hoc client implementations are often developed to consume data from different sources.
- It is impractical to force storage providers to install and maintain large software stacks so that different clients can have access.
- In VPH-Share, datasets are not located in a single storage infrastructure and are available through different technologies.
- Vendor lock-in is an issue when using clouds.
- A variety of legacy applications can only access data from a local disk.

Objectives
We want to build on top of existing infrastructure to provide large-scale, advanced storage capabilities. This requires flexible, distributed, decentralized, scalable, fault-tolerant, self-optimizing solutions that address the issues faced by scientific communities:
- data volume,
- data availability,
- data retrievability,
- data integrity,
- sharing,
- privacy,
- federation of datasets.

Requirements
The architecture must be loosely coupled, flexible, distributed, easy to use, and standards compliant. This implies:
- storage federation: aggregation of a pool of independent resources in a client-centric manner,
- exposure of storage resources via standardized protocols,
- a file system abstraction,
- a common management layer that interfaces independent storage resources.

LOBCDER
The Large OBject Cloud Data storagE fedeRation is a lightweight, easily deployable streaming service. It:
- loosely couples a variety of storage technologies,
- transparently presents a distributed file system: applications perceive a system which mimics a local FS,
- handles locating files and transporting data, providing access transparency, location transparency, concurrency transparency, heterogeneity, replication transparency, and migration transparency.

Structure of LOBCDER
Spiros Koulouzis, Dmitry Vasyunin, Reginald Cushing, Adam Belloum and Marian Bubak: Cloud Data Storage Federation for Scientific Applications, Euro-Par 2013 Workshop Proceedings.

LOBCDER frontend
- The frontend provides access control, authentication and authorization.
- It uses WebDAV, which provides interoperability as an RFC standard and enables network transparency through clients that are able to mount WebDAV; it supports versioning, locking, and custom properties.
- The authentication service validates a security token and returns information about the user.
- For clients that want control over infrastructure-dependent properties, a RESTful interface is provided.
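As an illustration of this WebDAV-level access, the sketch below lists a directory and downloads a file with plain Python. The endpoint URL, the paths and the bearer-token header are assumptions for illustration, not the actual VPH-Share configuration:

```python
# Minimal sketch of talking to a WebDAV frontend such as LOBCDER's.
# The base URL, paths and the Authorization scheme are assumptions.
import requests

BASE = "https://lobcder.example.org/webdav"  # hypothetical endpoint
TOKEN = "<security token from the auth service>"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}  # assumed header scheme

# List a directory with the WebDAV PROPFIND method (Depth: 1 = direct children).
resp = requests.request("PROPFIND", f"{BASE}/datasets/",
                        headers={**HEADERS, "Depth": "1"})
resp.raise_for_status()
print(resp.text)  # multistatus XML describing the resources

# Download a file with a plain HTTP GET, streaming it to disk.
with requests.get(f"{BASE}/datasets/scan001.dat", headers=HEADERS,
                  stream=True) as r:
    r.raise_for_status()
    with open("scan001.dat", "wb") as f:
        for chunk in r.iter_content(chunk_size=1 << 20):
            f.write(chunk)
```

Because WebDAV rides on plain HTTP methods like these, any DAV-compliant client or FS driver works without LOBCDER-specific code.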

LOBCDER resource layer
- The resource layer creates a logical representation of the physical storage space and manages physical files.
- WebDAVResourceFactory and WebDAVResource provide a WebDAV representation of a LogicalResource.
- ResourceCatalog connects to the persistence layer and queries LogicalResources.
- The Task component manages the physical files; it schedules file replication and deletion.
- LogicalResources hold basic metadata.
- The PDRI component represents physical data.
- The StorageSite component provides a description of the underlying storage resources.
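The relationships between these components can be pictured with a toy data model. The class names follow the slide; the fields are assumptions rather than the real (Java) implementation:

```python
# Illustrative data model for the resource layer described above.
# Class names come from the slide; fields are guesses for illustration.
from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageSite:
    """Describes one underlying storage resource."""
    endpoint: str
    technology: str  # e.g. "swift" or "s3"

@dataclass
class PDRI:
    """Physical Data Resource Identifier: one physical copy of a file."""
    site: StorageSite
    physical_path: str

@dataclass
class LogicalResource:
    """Logical file entry holding basic metadata and its physical replicas."""
    logical_path: str
    owner: str
    size: int = 0
    replicas: List[PDRI] = field(default_factory=list)
```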

LOBCDER backend
- The backend layer provides the necessary abstraction to uniformly access physical storage resources; it is a Virtual Resource System (VRS) API.
- VFSClient performs file system operations on physical data; different VFSDriver implementations enable transparent access to storage resources.
- The persistence layer is a relational database which holds the logical data represented by LogicalResources. It provides Atomicity, Consistency, Isolation and Durability (ACID) guarantees.
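The driver abstraction can be sketched as one interface with technology-specific implementations behind it. The method names here are assumptions; the real LOBCDER drivers are Java classes:

```python
# Illustrative sketch of the VFSDriver abstraction: one uniform interface,
# several technology-specific implementations. Method names are assumptions.
from abc import ABC, abstractmethod

class VFSDriver(ABC):
    """Uniform file-system operations over a single storage technology."""

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the contents of the physical file at `path`."""

    @abstractmethod
    def write(self, path: str, data: bytes) -> None:
        """Store `data` under `path` on the backing storage."""

    @abstractmethod
    def delete(self, path: str) -> None:
        """Remove the physical file at `path`."""

class SwiftDriver(VFSDriver):
    """Would translate the operations into OpenStack Swift API calls."""
    def read(self, path: str) -> bytes:
        raise NotImplementedError("Swift access omitted in this sketch")
    def write(self, path: str, data: bytes) -> None:
        raise NotImplementedError("Swift access omitted in this sketch")
    def delete(self, path: str) -> None:
        raise NotImplementedError("Swift access omitted in this sketch")
```

A client coded against VFSDriver never needs to know which storage technology actually holds the bytes, which is what makes the federation transparent.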

Distributed architecture of LOBCDER
To balance requests and maintain a high level of accessibility, LOBCDER uses an elastic distributed architecture:
- workers handle physical file transfer,
- worker instances can be deployed on arbitrary resources,
- workers are stateless instances that serve GET requests.
(Diagram: a LOBCDER master with its database dispatches transfers to LOB Worker1-3, which access storage sites 1-3.)
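A toy sketch of this request flow: the master never streams file content itself, it answers each GET with a redirect to the next stateless worker. The worker URLs, the port and the round-robin policy are assumptions for illustration:

```python
# Toy master that offloads GET traffic to stateless workers via HTTP 302
# redirects. Worker URLs and the round-robin policy are illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer
from itertools import cycle

WORKERS = cycle([
    "https://worker1.example.org",
    "https://worker2.example.org",
    "https://worker3.example.org",
])

class MasterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Pick the next worker and send the client there for the transfer.
        self.send_response(302)
        self.send_header("Location", next(WORKERS) + self.path)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MasterHandler).serve_forever()
```

Because the workers keep no state, adding or removing them only changes the rotation; no data or session migration is needed.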

Data access for large binary objects
- In VPH-Share, the federated data storage module (LOBCDER) enables data sharing in the context of VPH-Share applications.
- The module is capable of interfacing various types of storage resources and supports SWIFT cloud storage as well as Amazon S3.
- LOBCDER exposes a WebDAV interface and can be accessed by any DAV-compliant client. It can also be mounted as a component of the local client filesystem using any DAV-to-FS driver (such as davfs2).
(Diagram: the LOBCDER host runs the service backend with a resource catalogue, a WebDAV servlet, a resource factory and storage drivers, including a SWIFT driver in front of a SWIFT storage backend; the core component host at vph.cyfronet.pl runs the Data Manager Portlet, a VPH-Share Master Interface component, together with the ticket validation and auth services, a REST interface and encryption keys; Atomic Service Instances carry the service payload of VPH-Share application components; external hosts use generic WebDAV clients, GUI-based access, or a mount on the local FS, e.g. via davfs2.)

LOBCDER in VPH-Share
- online for 2,762 hours (3.8 months)
- 800 GB transferred
- requests served
- at least 30 active users (some accounts are used by at least 5 people)

Evaluation of LOBCDER

Scalability of distributed LOBCDER
In a test where 1,024 users request a file simultaneously, the number of users waiting in the queue decreases with the number of active workers.

Data reliability and integrity (DRI)
Long-term persistence of medical data sets requires reliability and integrity mechanisms to be built on top of cloud storage. DRI is designed to fulfill these requirements by performing the following tasks:
- periodic and request-driven integrity checks on data sets,
- facilitating storage of multiple copies of data on various cloud platforms,
- tracking the history and origin of binary data sets.
Dataset validation covers the availability of each dataset at its (multiple) locations and the integrity of each dataset's files (checksum-based), as sketched below.
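The checksum side of such a validation can be pictured in a few lines of Python. Replica access is simplified to local paths here, whereas DRI reads replicas from federated cloud storage:

```python
# Sketch of checksum-based validation: recompute each replica's digest
# and compare it with the registered value. Local paths stand in for
# federated storage access.
import hashlib

def sha256_of(path: str, chunk: int = 1 << 20) -> str:
    """Stream the file so arbitrarily large objects fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def validate_replicas(expected: str, replica_paths: list) -> bool:
    """True only if every replica is readable and matches the checksum."""
    return all(sha256_of(p) == expected for p in replica_paths)
```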

DRI service
DRI service characteristics:
- a stateless RESTful service,
- built on top of the Atmosphere computational cloud, the metadata registry (AIR) and LOBCDER storage resources,
- autonomous (periodic checks) and controllable via API access.
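The autonomous mode can be pictured as a simple periodic loop. The interval and the two helper functions are placeholders; the real service drives its checks from the AIR metadata registry according to its validation policy:

```python
# Toy periodic-validation loop. Interval and helpers are placeholders;
# the real DRI is registry-driven, not a standalone script.
import sched
import time

CHECK_INTERVAL = 6 * 3600  # e.g. every six hours (assumed)
scheduler = sched.scheduler(time.time, time.sleep)

def list_managed_datasets():
    """Placeholder for a query against the metadata registry."""
    return []

def validate_dataset(dataset_id):
    """Placeholder for an availability + checksum check of one dataset."""
    print("validating", dataset_id)

def periodic_validation():
    for dataset_id in list_managed_datasets():
        validate_dataset(dataset_id)
    # Re-arm the timer so checks repeat indefinitely.
    scheduler.enter(CHECK_INTERVAL, 1, periodic_validation)

scheduler.enter(0, 1, periodic_validation)
scheduler.run()
```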

DRI architecture
The DRI Service is a standalone application service, capable of autonomous operation. It periodically verifies access to any datasets submitted for validation and is capable of issuing alerts to dataset owners and system administrators in case of irregularities. It:
- provides a mechanism which keeps track of binary data stored in the cloud infrastructure,
- monitors data availability,
- advises the cloud platform when instantiating atomic services.
(Diagram: the DRI Service, with a configurable, registry-driven validation policy, a runtime layer, an extensible resource client layer and metadata extensions for DRI, sits between the VPH Master Interface data management portlet (with DRI management extensions), the binary data registry and LOBCDER; distributed cloud storage backends such as Amazon S3, OpenStack Swift and Cumulus store and marshal the data. Operations include registering files, getting metadata, migrating LOBs and getting usage stats; end-user features include browsing, querying, direct access to data, and checksumming.)

DRI implementation
Supported requests:
- dataset validation and adding a dataset under management (computing its checksums),
- integrity checks based on the SHA-256 hash of each file,
- asynchronous operations added to the queue of a simple scheduler,
- report notifications (validation success/failure or finished adding a dataset under management) as e-mail messages.
Current REST API:
- compute_dataset_checksums/{datasetID}: asynchronously compute dataset file checksums and dispatch a notification by e-mail,
- validate_dataset/{datasetID}: asynchronously validate the dataset and dispatch a notification by e-mail.
K. Styrc, P. Nowakowski, M. Bubak: Managing data reliability and integrity in federated cloud storage. In: M. Bubak, M. Turała, K. Wiatr (Eds.), CGW'13 Proceedings, ACK CYFRONET AGH, Kraków (2013).
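A hedged sketch of invoking these two operations from a client: only the endpoint paths come from the slide, while the service location, the authentication header and the use of POST are assumptions:

```python
# Illustrative calls to the two DRI endpoints listed above. Base URL,
# auth header and HTTP method are assumptions; both operations only
# enqueue asynchronous work, with results reported by e-mail.
import requests

DRI_BASE = "https://dri.example.org"           # hypothetical service location
HEADERS = {"Authorization": "Bearer <token>"}  # assumed auth scheme

dataset_id = "dataset-42"

# Ask DRI to compute and store checksums for a newly managed dataset.
r = requests.post(f"{DRI_BASE}/compute_dataset_checksums/{dataset_id}",
                  headers=HEADERS)
r.raise_for_status()

# Trigger an asynchronous validation of the same dataset.
r = requests.post(f"{DRI_BASE}/validate_dataset/{dataset_id}",
                  headers=HEADERS)
r.raise_for_status()
```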

Summary
- Distributed applications have a global shared view of the entire available storage space.
- Applications can be developed locally and deployed on the computation platform without reimplementing their data access logic.
- Storage space is used efficiently thanks to the copy-on-write strategy.
- Replication of data can be based on efficiency cost measures.
- Reduced risk of vendor lock-in in clouds.

Service summary
Status: production
Number of users (current, target): around 100 (both current and target)
Default and maximum quota: n/a
Linux/Mac/Win user ratio: approximately 50/50 Linux/Windows
Desktop clients / mobile clients / web access ratio: mostly desktop clients, with some web applications
Technology: OpenStack SWIFT, Amazon S3
Target communities: medical researchers and members of the VPH NoE
Integration in your current environment (examples): WebDAV external interfaces; robust GUIs embedded in the VPH-Share Master Interface
Risk factors: n/a
Most important functionality: federated cloud storage
Missing functionality (if any): storage encryption

User feedback
- Users interact with the presented data storage services in almost any production-mode VPH-Share application.
- Since the VPH-Share data storage components are designed to be transparent to end users, it is difficult to obtain direct feedback regarding their usability; instead, users contact us in case of any data access problems, and we operate by the "no news is good news" rule.
- Our current concern is sustainability after VPH-Share.

Example: sensitivity analysis application
Problem: a cardiovascular sensitivity study with 164 input parameters (e.g. vessel diameter and length). The first analysis requires 1,494,000 Monte Carlo runs (expected execution time on a PC: 14,525 hours). The second analysis requires 5,000 runs per model parameter for each patient dataset, i.e. another 830,000 Monte Carlo runs per patient dataset for a total of four additional patient datasets; this results in 32,280 hours of calculation time on one personal computer. Total: roughly 50,000 hours of calculation time on a single PC.
Solution: scale the application with cloud resources.
VPH-Share implementation: a scalable workflow deployed entirely using VPH-Share tools and services. It consists of a RabbitMQ server and a number of clients processing computational tasks in parallel, each registered as an Atomic Service. The server and client Atomic Services are launched by a script which communicates directly with the Cloud Facade API. Small-scale runs have been successfully completed; a large-scale run is in progress.
(Diagram: the scientist's launcher script calls the secure Cloud Facade API; the Atmosphere Management Service launches the DataFluo server Atomic Service and automatically scales the RabbitMQ worker Atomic Services; a DataFluo listener consumes results via RabbitMQ.)
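For flavor, a minimal worker in this master/worker queue pattern might look as follows, using the pika RabbitMQ client for Python. The queue name, host and task handling are assumptions, and the actual VPH-Share workers are Atomic Services rather than this script:

```python
# Toy RabbitMQ worker: pull one Monte Carlo task at a time from a queue
# and acknowledge it when done. Host and queue name are illustrative.
import pika

def run_task(body: bytes) -> None:
    """Placeholder: execute one Monte Carlo run for the parameters in body."""
    pass

def on_message(channel, method, properties, body):
    run_task(body)
    # Acknowledge only after the run finishes, so a crashed worker's
    # task is redelivered to another worker.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(
    pika.ConnectionParameters("rabbitmq.example.org"))
channel = connection.channel()
channel.queue_declare(queue="mc_tasks", durable=True)
channel.basic_qos(prefetch_count=1)  # one in-flight task per worker
channel.basic_consume(queue="mc_tasks", on_message_callback=on_message)
channel.start_consuming()
```

Scaling the computation then amounts to launching more copies of this worker, which is exactly what the Atmosphere Management Service automates for the Atomic Service instances.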

Deployment of the OncoSimulator in the cloud
Deployment of the OncoSimulator tool on VPH-Share resources:
- uses a custom Atomic Service as the computational backend,
- features integration of data storage resources,
- the OncoSimulator Atomic Service is also registered in the VPH-Share metadata store.
(Diagram: p-medicine users access the OncoSimulator submission form in the P-Medicine Portal; the Cloud Facade and the Atmosphere Management Service (AMS), backed by the AIR registry, launch Atomic Services on the VPH-Share computational cloud platform, with the OncoSimulator ASI running on a cloud head node and worker nodes; output is stored via the LOBCDER storage federation, which is mounted to select results for storage in the P-Medicine Data Cloud; the VITRALL visualization service provides the visualization window.)

More information
- dice.cyfronet.pl: documentation, publications, links to manuals, videos, etc.
- Your one-stop entry to all VPH-Share functionality: you can log in with your BioMedTown account (available to all members of the VPH NoE).