Virtual Research Environments as-a-Service Pasquale Pagano, CNR pasquale.pagano@cnr.it EGI Community Forum 10-13 November 2015 Bari, Italy
Outline
Context: E-Infrastructure, History
D4Science: … as a Service Capabilities, Virtual Research Environment
gCube: Features, Numbers
e-Infrastructure: an operational combination of digital technologies (hardware and software), resources (data and services), communications (protocols, access rights and networks), and the people and organizational structures needed to support research efforts and collaboration in the large.
Genealogy
DILIGENT, 2004-2007. Testbed: Virtual Research Environment
D4Science, 2008-2010. Operational: several use cases (fisheries); gCube became an open source project
D4Science-II, 2010-2012. Operational Ecosystem: marine biodiversity use cases; D4Science born to go beyond project lifetime
iMarine, 2012-2014. Operational HDI: exploit D4Science; iMarine CoP; >1500 active users
D4Science operates VREs for … more than 2,000 scientists in 44 countries, integrating more than 50 heterogeneous data providers, executing more than 20,000 processes/month, and providing access to over a billion quality records in repositories worldwide, with 99.7% service availability. D4Science hosts more than 40 VREs.
Born to serve user needs (Capacities, Applications, Data)
I need to host my applications in a secure and scalable environment
I need to maintain my database
I need to back up my data
I need to securely deliver my data to a set of known people
I want to offer a flexible sharing, storage, reporting, search and retrieval tool
I need to manage and analyze data
I need to manage the full data life-cycle, from import to validation, curation, harmonization and publication
I need to offer my team a powerful tool to manage code-lists
I need to reduce the costs of data maintenance of my dept.
I need to access authoritative data
I need to simplify access to my data
I need to mash up statistical and geospatial data
I need to analyse my big datasets
I need to validate my datasets and provide standard access to them
Distinguishing capabilities of the e-infrastructure D4Science
The D4Science infrastructure: a Hybrid Data Infrastructure combining over 500 software components into a coherent and centrally managed system of hardware, software, and data resources
D4Science enables e-infrastructure by ...
Integrating geographically distributed computing infrastructure
Overcoming administrative boundaries
Exploiting private and commercial providers
Providing service allocations, deployment, monitoring, and operation
Ensuring uniform resource and data access
Operation: built on SLAs; support for monitoring, auditing, reporting, and notification
Trust: privacy, governance, and attribution; security, trusted network
Storage as a Service, to host and maintain data
Database, Cloud Storage, Geographical DB
High-availability, Standard, Ready-to-use, Scalable, Reliable, Secure
Policies: Standard, Privacy and Attribution
Applications as a Service, to curate and manage data
Metadata Generation: Geospatial Data, Biodiversity Data, Statistical Data, Textual Data
Harmonization: Disambiguate, Validate, Integrate and Consistency Check
Data Exchange: OGC protocols, DarwinCore, SDMX, DublinCore
Computing as a Service, to process and extract knowledge
Scalable, Easy to Manage, Across Boundaries, Tailored
Elastic Assignment of Computing, Assignment of Processors, Virtual Research Environment
Heterogeneous, High Throughput, Map-Reduce, Parallel R
Computational Engine
Not another cloud computing platform, but a platform where executions can be repeated, compared, discussed, and logged.
Not another computational engine, but a platform where interdisciplinary tools and services can be easily contributed by the communities.
Two exploitation models
Dispatcher
Tools (R, Java, …) must be uploaded to the storage
The executable is deployed on the worker nodes assigned to the VRE
Data are made accessible to the worker nodes according to the specification provided
Monitoring, accounting, failure management, partial re-execution, sharing, and repeatability are granted
Application Framework
Predefined data splitting models are provided
A large array of models and algorithms can be exploited to define custom workflows
A large array of algorithms to compare results is provided
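The dispatcher model above can be illustrated with a minimal Python sketch. This is not gCube code: the `Dispatcher` class, its `run` method, the worker names, and the round-robin assignment are all assumptions made for illustration, and the "worker nodes" are simulated in-process. It shows only the idea of logging every execution and re-executing just the parts that failed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Dispatcher:
    """Hypothetical dispatcher: applies one uploaded tool to data chunks."""
    workers: List[str]                              # nodes assigned to the VRE
    log: List[str] = field(default_factory=list)    # monitoring/accounting record

    def run(
        self,
        tool: Callable[[list], float],
        chunks: List[list],
        done: Optional[Dict[int, float]] = None,
    ) -> Tuple[Dict[int, float], List[int]]:
        """Run `tool` on every chunk whose index is not already in `done`.

        Returns (results, failed) so that only failed chunks need a re-run.
        """
        results = dict(done or {})
        failed: List[int] = []
        for i, chunk in enumerate(chunks):
            if i in results:
                continue  # partial re-execution: keep earlier successes
            worker = self.workers[i % len(self.workers)]  # round-robin (assumption)
            try:
                results[i] = tool(chunk)
                self.log.append(f"{worker}: chunk {i} ok")
            except Exception as exc:
                failed.append(i)
                self.log.append(f"{worker}: chunk {i} failed ({exc})")
        return results, failed


def mean(values: list) -> float:
    return sum(values) / len(values)  # raises ZeroDivisionError on empty input


dispatcher = Dispatcher(workers=["node-a", "node-b"])
chunks = [[1, 2, 3], [], [10, 20]]

# First pass: the empty chunk fails, the other two succeed.
first_results, first_failed = dispatcher.run(mean, chunks)

# Repair the input and re-execute only what failed.
chunks[1] = [4]
final_results, final_failed = dispatcher.run(mean, chunks, done=first_results)
```

Keeping the log and the per-chunk results separate from the tool itself is what makes an execution repeatable and comparable in this sketch; a real deployment would persist both rather than hold them in memory.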
Virtual Research Environment, to access, share and collaborate
Share: Database Tables, Workflow, Files
Communicate: Post, Favourite, Connection
Organize: Dynamic, Secure, Policy Driven
Virtual Research Environment: a distributed and dynamically created environment where a subset of resources (data, services, computational, and storage resources), regulated by tailored policies (e.g. data encryption with a VRE-specific key, quota on service calls and storage usage, …), is assigned to a subset of users via interfaces for a limited timeframe, at little or no cost for the providers of the participatory data e-infrastructures.
L. Candela, D. Castelli, P. Pagano (2013). Virtual Research Environments: An Overview and a Research Agenda. Data Science Journal, Vol. 12
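The definition above can be read as a small data structure: a set of resources, a set of users, a timeframe, and policies checked on each access. The Python sketch below is only an illustration of that reading. The `VRE` class, its field names, and the single per-user call quota are assumptions, not the gCube or D4Science model, and other policies from the definition (such as encryption or storage quota) are left out.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Set


@dataclass
class VRE:
    """Hypothetical VRE: resources assigned to users for a limited timeframe."""
    name: str
    resources: Set[str]
    users: Set[str]
    valid_from: date
    valid_until: date
    call_quota: int                                   # max service calls per user
    calls: Dict[str, int] = field(default_factory=dict)

    def authorize(self, user: str, resource: str, today: date) -> bool:
        """Return True, and count the call, only if every policy check passes."""
        if user not in self.users or resource not in self.resources:
            return False                              # outside the assigned subsets
        if not (self.valid_from <= today <= self.valid_until):
            return False                              # outside the timeframe
        if self.calls.get(user, 0) >= self.call_quota:
            return False                              # quota on service calls used up
        self.calls[user] = self.calls.get(user, 0) + 1
        return True


# Example values are made up for illustration.
vre = VRE(
    name="demo-vre",
    resources={"species-db", "r-engine"},
    users={"alice"},
    valid_from=date(2015, 11, 1),
    valid_until=date(2015, 11, 30),
    call_quota=2,
)
day = date(2015, 11, 10)
```

With these values, `vre.authorize("alice", "species-db", day)` succeeds twice and is then refused by the quota, while any call by a user outside the VRE or after the end date is refused immediately.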
VRE Definition: simple and effective process to define a new environment (Metadata, Applications, Data, Configuration)
Applications vs Services (diagram). Logical View: Applications, Data. Physical View: Software, Tools, Services, Data. Also shown: Registry, Hardware, Configuration.
Application Bundles (https://www.gcube-system.org/catalogue-of-applications)
AppsCube: to develop applications interfacing gCube facilities
BiolCube: to aid modelling and analysing of distribution data, comparing checklists, and producing maps
ConnectCube: to facilitate data publication with appropriate tools, including semantic technologies
GeosCube: to properly access, consume and produce geospatial information
StatsCube: to assist tabular data validation, data enrichment and efficient analytical tools
IceCube: to support deployment, operation & mgmt of a gCube-based infrastructure
VRE Exploitation
Exploited for Public VREs (used to offer an application environment to a subset of users of a community) and Private VREs (used for experiments, data access and preparation, and data analytics)
Fully operational VRE available in one hour
Software deployment and hardware setup completely hidden
Evolving needs of its users completely supported
Entity as Resource
Entity: Server, Storage, Container, Software, Data
As a resource: Publication/Discovery, Lifecycle management, Failure management, Authorization-accounting
As a service: Access, Orchestrate, Reference
Software as Resource: transforms servlet-based applications/services into e-Infrastructure resources
Container as Resource: transforms a standard servlet-based container into an e-Infrastructure resource
Federated Sources as Resource: transforms external DBs and repositories into e-Infrastructure resources
Algorithm as Resource: for any new algorithm, model, procedure, workflow, … it is possible to manage policies and assign dedicated hardware and storage resources
Dataset and single product as Resource: for any dataset, map, timeseries, code list, … it is possible to manage policies and monitor their exploitation
SmartGears: “a set of Java libraries that turn Servlet-compliant containers and applications into infrastructure resources, transparently.” (gCube Wiki)
Turn software and containers into resources: what does it mean?
SmartGears [cont.]: Software-as-Resource, Container-as-Resource, Actual Solution, Zero constraints
Software and nodes we can:
discover: use without hardcoded knowledge
monitor and control: take actions when not operational
dedicate to user groups: change policies, assign roles
Human solutions: not practical, often impossible
Automated solutions: local enabling software, remotely controlled
Management tasks: compile and publish descriptions; track and change status; enforce policies
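The management tasks listed above (compile and publish descriptions, track and change status, enforce policies) can be illustrated with a short Python sketch. SmartGears itself is a set of Java libraries, and nothing below reproduces its API: `ManagedResource`, `Registry`, `change_status`, the status names, and the example resource are hypothetical and exist only to show how discovery without hardcoded knowledge could follow from published descriptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ManagedResource:
    """Hypothetical piece of software or container turned into a resource."""
    name: str
    kind: str                          # e.g. "software" or "container"
    status: str = "started"            # status names are assumptions
    allowed_groups: Set[str] = field(default_factory=set)

    def describe(self) -> Dict[str, object]:
        """Compile the description that gets published to the registry."""
        return {
            "name": self.name,
            "kind": self.kind,
            "status": self.status,
            "groups": sorted(self.allowed_groups),
        }


class Registry:
    """Hypothetical registry holding the latest published descriptions."""

    def __init__(self) -> None:
        self._profiles: Dict[str, Dict[str, object]] = {}

    def publish(self, resource: ManagedResource) -> None:
        self._profiles[resource.name] = resource.describe()

    def discover(self, kind: str, group: str) -> List[str]:
        """Find active resources of a kind that a user group may use."""
        return sorted(
            name
            for name, profile in self._profiles.items()
            if profile["kind"] == kind
            and profile["status"] == "active"
            and group in profile["groups"]
        )


def change_status(resource: ManagedResource, status: str, registry: Registry) -> None:
    """Track a status change locally and republish the description."""
    resource.status = status
    registry.publish(resource)


registry = Registry()
app = ManagedResource("species-service", "software", allowed_groups={"biodiversity-vre"})
registry.publish(app)  # published as "started", so not yet discoverable
```

Clients in this sketch never hold the address or name of `species-service`; they ask the registry by kind and group, which is one way to read "use without hardcoded knowledge".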
gCube: One stable open-source platform. gCube enables the D4Science HDI. Statistics from openhub.net/p/gCube
Multi-tenant Delivery Model
Infrastructure as a Service: Dynamic deployment, Hosting, Resource Lifecycle, Monitoring, Accounting, Security
Software as a Service: VRE, BiolCube, ConnectCube, GeosCube, StatsCube
Platform as a Service: FeatherWeightStack, SmartGears, ApplicationSupportLayer, SOA3
References / Links
D4Science: http://www.d4science.org
Policies: https://wiki.d4science.org/D4Science_Deployment_and_Operation:_Policies
Procedures: https://wiki.d4science.org/D4Science_Deployment_and_Operation
gCube: http://www.gcube-system.org
Catalogue of Applications: https://www.gcube-system.org/catalogue-of-applications
Software Key Features: https://wiki.gcube-system.org/GCube_Features
Developer Guide: https://wiki.gcube-system.org/Developer%27s_Guide
FeatherWeightStack: https://wiki.gcube-system.org/Featherweight_Stack
SmartGears: https://wiki.gcube-system.org/SmartGears
gCube APIs: https://wiki.gcube-system.org/GCube_Application_Programming_Interface
Administration Guide: https://wiki.gcube-system.org/Administrator%27s_Guide
Thank you for your attention Questions?