D4Science: An e-Infrastructure for Facilitating Data Management, Process, Sharing, and Access Pasquale Pagano National Research Council of Italy


1 D4Science: An e-Infrastructure for Facilitating Data Management, Process, Sharing, and Access Pasquale Pagano National Research Council of Italy pasquale.pagano@isti.cnr.it Digital Repositories – Linked Open Data: the possible Role of D4Science 16-17 December 2010 FAO (Rome) www.d4science.eu

2 Assumptions
Consolidated facts:
- Very rich applications and data collections are currently maintained by a multitude of authoritative providers
- Different problems require different execution paradigms: batch, map-reduce, synchronous call, message queue, …
- Key distributed computation technologies exist: grid (gLite and Globus), distributed resource management (Condor), clusters (Hadoop), …
- Several standards are adopted in the same domain
Societal observations: a rich variety of protocols, models, and formats
- creates barriers to the use of resources
- dramatically delays new exploitation patterns
Technical observations:
- Heterogeneity of protocols, models, and formats increases load
- Load increases failures

3 D4Science Vision
D4Science objectives:
- hide heterogeneity, i.e. abstract over differences in location, protocol, and model
- embrace heterogeneity, i.e. allow for multiple locations, protocols, and models
Technical goals:
- no bottlenecks: scale no less than the interfaced resources
- no outages: keep failures partial and temporary
- autonomicity: the system reacts and recovers

4 From a testbed to a production ecosystem
Diligent (Oct. ’04 to Nov. ’07): Testbed. Empower the grid middleware to manage data and metadata as primary resources and to virtualise the VO environment. Prototype => gCube 0.9
D4Science (Jan. ’08 to Dec. ’09): Production. Stabilize gCube by supporting two large user communities, FARM and EM. Software Framework => gCube 1.6 (stable and open source) => d4science e-Infrastructure
D4Science II (Oct. ’09 to Sept. ’11): Production. Promote interoperability across e-Infrastructures by empowering large user communities. Open Platform => gCube 2.0 (feature rich and interoperable) => d4science ecosystem

5 From a testbed to a production ecosystem
[Figure: functionality of gLite and gCube over the Diligent (Oct. ’04 to Nov. ’07), D4Science (Jan. ’08 to Dec. ’09), and D4Science II (Oct. ’09 to Sept. ’11) phases]

6 Infrastructure Exploitation
Production: more than 500 autonomic Web Services
Nodes:
- 30 nodes: CNR, NKUA, ESA, FAO, UNIBASEL
- 29 nodes: CNR, NKUA, FAO, UNIBASEL
Collections:
- 25 data collections: EEA, MERIS, AATSR
- 69 metadata collections: es, ISO19115, eiDB
- 15 data collections: AquaMaps, Fact Sheets, Country Maps
- 28 metadata collections: FARM_dc, aquamaps
Functionality:
- Integration with gPod
- Geographical and text search
- Search by metadata
- Personal workspace
- Object annotation
- Report generation
- Maps generation
- Time series management

7 gCube as a Digital Library System
A Digital Library System is a possibly distributed system that collects, manages and preserves for the long term rich digital content, and offers to its user communities specialised functionality on that content, of measurable quality and according to codified policies [The Digital Library Reference Model]
The gCube data infrastructure enabling framework provides DL functionality by:
- Federating existing digital content maintained in a variety of tailored repository systems
- Supporting the generation of new digital content by exploiting heterogeneous computational platforms
- Providing discovery and access capabilities on diversely described and modeled digital content

8 gCube as an e-Infrastructure ecosystem enabling framework
By bridging a number of well-established systems and standards from various domains, including high-energy physics, biodiversity, and fishery and aquaculture resources management, gCube realises an e-Infrastructure ecosystem

9 How does it work?
Each community (VO) registers its own resources under its domain, and registers and authorises its users.
Starting from this set of resources (hardware, data and applications), VREs can be dynamically set up and activated.
Each user logs in to the VO’s personalized environment and from there searches, elaborates and stores shared and personal information.
Later on, the community administrators can dynamically add or remove resources and users from their domain.

10 Why is sharing through VREs key?
Through the VRE, groups of users have controlled access to distributed data and services integrated under a personalised interface.

11 Why is sharing through VREs key?
A Virtual Research Environment (VRE) supports cooperative activities:
- Metadata cleaning, enrichment, and transformation by exploiting mapping schemas, controlled vocabularies, thesauri, and ontologies
- Process refinement and showcase implementation (restricted to a set of users)
- Data assessment (required to make data publicly exploitable by VO members)
- Expert-user validation of products generated through data elaboration or simulation

12 Why is sharing through VREs key?
The VRE integrated environment provides a set of functionality to support and perform research activities:
- the ability to integrate heterogeneous data and services
- the ability to process information on demand, ingesting the results
- the ability to share data and processes with other users
- the ability to customize collections of information
- the ability to store user actions and exploit them for further use
- the ability to aggregate relevant information into ad-hoc information sources and keep them updated

13 Building Virtual Research Environments
[Figure: VRE definition steps: Lifetime & Description, Information Space, Metadata, Functionality, QoS]

14 VRE Facilities
[Figure: VRE facilities and the services they build on.
Facilities: Workspace, Species Maps Generation, Time Series Management, Report Management.
Captions: a virtual desktop to organize the working environment; tools supporting specific tasks; a virtual live document to describe research results.
Underlying services: Search, Annotation, Visualisation, Transformation, Storage, …]

15 Workspace
A collaboration-oriented suite providing:
- seamless access and organisation facilities on a rich array of objects (e.g. Information Objects, Queries, Files, Templates)
- mediation between external world objects, systems and infrastructures (import/export/publishing)
- support for common file-manager operations (drag & drop, contextual menu)
- an effective rich-object sharing facility

16 Species Distribution Maps Generation
AquaMaps is an application*
- tailored to predict global distributions of marine species, initially designed for marine mammals and subsequently generalised to marine species
- that generates color-coded species range maps using half-degree latitude and longitude blocks
- by interfacing several databases and repository providers
* Algorithm by Kaschner et al. 2006

17 Species Distribution Maps Generation
AquaMaps execution is based on the gCube Ecological Niche Modelling Suite, which allows the extrapolation of known species occurrences
- to determine environmental envelopes (species tolerances)
- to predict future distributions by matching species tolerances against local environmental conditions (e.g. climate change and sea pollution)
Very large volume of input and output data: HSPEC native range 56,468,301; HSPEC suitable range 114,989,360
Very large number of computations: one multispecies map computed on 6,188 half-degree cells (out of over 170k) and 2,540 species requires 125 million computations
(Eli E. Agbayani, FishBase Project/INCOFISH WP1, WorldFish Center)
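The envelope-matching idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the gCube Ecological Niche Modelling Suite: it assumes a trapezoidal response per environmental parameter (1 inside the preferred range, falling linearly to 0 at the absolute limits) and a product over parameters. The names `Envelope` and `suitability` and all sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Hypothetical tolerance range of one species for one parameter."""
    minimum: float
    preferred_min: float
    preferred_max: float
    maximum: float

    def response(self, value: float) -> float:
        # Outside the absolute limits the cell is unsuitable.
        if value < self.minimum or value > self.maximum:
            return 0.0
        # Linear ramp up between the absolute and the preferred minimum.
        if value < self.preferred_min:
            return (value - self.minimum) / (self.preferred_min - self.minimum)
        # Linear ramp down between the preferred and the absolute maximum.
        if value > self.preferred_max:
            return (self.maximum - value) / (self.maximum - self.preferred_max)
        return 1.0


def suitability(envelopes: dict, cell: dict) -> float:
    """Multiply the per-parameter responses for one half-degree cell."""
    score = 1.0
    for name, envelope in envelopes.items():
        score *= envelope.response(cell[name])
    return score


# Hypothetical species tolerances and cell conditions (assumed values).
species = {
    "temperature": Envelope(2.0, 8.0, 18.0, 24.0),
    "salinity": Envelope(30.0, 33.0, 36.0, 38.0),
}
cells = {
    "cell_a": {"temperature": 12.0, "salinity": 34.0},
    "cell_b": {"temperature": 5.0, "salinity": 34.0},
    "cell_c": {"temperature": 30.0, "salinity": 34.0},
}
scores = {cell_id: suitability(species, cell) for cell_id, cell in cells.items()}

# Number of cell/species pairs for the multispecies map quoted on the slide.
pairs = 6188 * 2540
```

For scale, 6,188 cells times 2,540 species gives 15,717,520 cell/species pairs. The slide's figure of about 125 million computations would correspond to roughly eight evaluations per pair, although the slide does not state how the figure is derived.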

18 Time Series Management
Offers a set of tools to manage capture statistics:
- Supports the complete TS lifecycle
- Supports validation, curation, and analysis
- Provides support for data reallocation
- Produces uniform data sets

19 Time Series
Offers a set of tools to operate on capture statistics:
- Support for multiple key families
- Filtering, grouping, and aggregation
- Union
- Mining
- Automatic production of provenance information
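The filtering, grouping, and aggregation operations listed above can be illustrated with plain Python. This is a minimal sketch and not the gCube Time Series API: the record fields (`country`, `species`, `year`, `tonnes`), the helper `aggregate`, and the sample figures are all hypothetical.

```python
from collections import defaultdict


def aggregate(records, keys, value_field, predicate=lambda record: True):
    """Filter records, group them by the given keys, and sum one field."""
    totals = defaultdict(float)
    for record in records:
        if predicate(record):
            group = tuple(record[key] for key in keys)
            totals[group] += record[value_field]
    return dict(totals)


# Hypothetical capture statistics (made-up values for illustration only).
captures = [
    {"country": "IT", "species": "anchovy", "year": 2008, "tonnes": 10.0},
    {"country": "IT", "species": "anchovy", "year": 2009, "tonnes": 12.0},
    {"country": "IT", "species": "sardine", "year": 2009, "tonnes": 7.0},
    {"country": "GR", "species": "anchovy", "year": 2009, "tonnes": 5.0},
]

# Total catch per country for 2009 only (filter, then group, then sum).
per_country_2009 = aggregate(
    captures,
    keys=("country",),
    value_field="tonnes",
    predicate=lambda record: record["year"] == 2009,
)

# Total catch per (country, species) pair over all years.
per_country_species = aggregate(captures, ("country", "species"), "tonnes")
```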

20 Report Management
A collaboration-oriented suite providing:
- template-oriented, feature-rich and flexible document format definition
- effective and infrastructure-integrated report compilation (drag & drop workspace items)
- collaborative and distributed editing (workspace based)
- standard-based report materialisation (HTML, OpenXML)

21 VREs, Workspaces and Reports in Action

22 BEHIND THE SCENES

23 PE2ng Definition
Process Execution Engine (PE2ng, pronounced ‘peng’) is a system to manage the execution of software elements in a distributed infrastructure under the coordination of a composite plan that defines the data dependencies among its actors.
Close relatives:
- Job Management Systems (Condor)
- Distributed Computing Frameworks (MPI, MapReduce)

24 More Info
PE2ng’s motivation is the instantiation of a liberal computational infrastructure that:
- Builds on existing infrastructures
- Integrates existing technologies
- Supports several software paradigms without performance compromises
- Provides a powerful, flow-oriented processing model
PE2ng’s dual nature:
- Coordinator of external computational infrastructures
- Native computational infrastructure provider and manager

25 PE2ng and the Cloud
Exploits all modern cloud paradigms (PaaS, SaaS, IaaS)
Provides a PaaS:
- Based on streams (gCube Resultset, gRS2)
- Support for dynamic infrastructure reorganisation
- Decision making offloaded to Cloud Management
- Direct interaction with cloud management: under implementation
Supports SaaS via a combination of gCube services
Fits several infrastructures:
- No built-in dependencies for computation or storage

26 Binding together infrastructures
Single infrastructure:
- Utilise capacities to the fullest
- Bound “for better or for worse”
- Bend business logic to fit
- One size fits all?
Infrastructure ecosystem:
- Don’t hide infrastructures
- Not yet another layer
- Choose infrastructure to fit needs
- Turn infrastructure into a utility
- Unrestrictive meta-infrastructure
- Single submission, monitoring, access
- Single language for “Programming in the Large” and “Small” … ?

27 Terms used in PE2ng
Workflow: a high-level plan that binds together conceptual operations for the implementation of a task.
Execution Plan: a plan for the invocation of code components (aka invocables, i.e. services, binary executables, scripts, …) that ensures that prerequisite data are prepared and delivered to their consumers by defining the flow of data and/or control.
Resource: software, data, network, systems, …
Registry: a directory service where resources are listed for discovery.
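The notion of an execution plan, in which each invocable runs only after the data it depends on has been produced, can be sketched with the Python standard library. This is a single-process illustration under stated assumptions, not PE2ng itself, which is distributed: the function `run_plan` and the invocable names `fetch`, `sort`, `count`, and `report` are hypothetical.

```python
from graphlib import TopologicalSorter


def run_plan(invocables, dependencies):
    """Run each invocable once all of its prerequisite outputs exist.

    invocables:   name -> callable taking a dict of upstream outputs
    dependencies: name -> set of names whose output the invocable consumes
    """
    results = {}
    order = []
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        inputs = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = invocables[name](inputs)
        order.append(name)
    return order, results


# Hypothetical plan: fetch data, derive two products, then combine them.
invocables = {
    "fetch": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs["fetch"]),
    "count": lambda inputs: len(inputs["fetch"]),
    "report": lambda inputs: {"sorted": inputs["sort"], "count": inputs["count"]},
}
dependencies = {
    "fetch": set(),
    "sort": {"fetch"},
    "count": {"fetch"},
    "report": {"sort", "count"},
}

order, results = run_plan(invocables, dependencies)
```

In this sketch `sort` and `count` have no dependency on each other, so a distributed engine would be free to run them in parallel on different nodes; the sequential loop here only preserves the ordering constraints.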

28 System Architecture Overview
[Figure: architecture diagram. Labels: Clients, Applications; Domain Specific Language; Domain-specific Business Logic Layers (Search, Maintenance, Administration, …); Pluggable Domain Logic; Adaptors; Workflow Language; Workflow Engine; Execution Plan; Execution Engine (Workflow Plan, Query, Invoke, Store, Transfer, Comp. Process); Software / Callable (SOAP calls, Java calls, HTTP API, Shell Invocations, …); Adaptor Specific Resources; Registry; Storage; Transport; Processing; Workflow; Presentation; System Security (Proxying, Delegation); Resource Model (State, Network State)]

29 A PE2ng infrastructure
Execution “Boundary”: the distributed “node” of PE2ng
[Figure: PE2ng nodes spanning several infrastructures. Labels: Infrastructure A (Grid UI, worker nodes, SE Storage Adapter); Infrastructure B (Hadoop gateway, worker nodes, HDFS Storage Adapter); FTP Server Storage Adapter; Registry; Resource Model Adaptor; Executables; Storage; Other Infrastructures. Slide footer: 30/9/2010, External Advisory Board Meeting]

30 gCube Data Transformation Service (gDTS)
A service that tackles the transformation of data among various manifestations
Features:
- Distributed (PE2ng based)
- Manifestation and transformation agnostic
- “Intelligent”, objective-driven operation
Why so important?
- Plays a vital role in several data staging steps within the infrastructure
- Seems to cover, out of the box, several needs of “interoperability” as conceived by the communities

31 gDTS case
[Figure: a Transformers Registry holding transformers T1 (B, A), T2 (C, A), T3 (D, C), T4 (E, B; Conf A) and T4 (C, B; Conf B), and two alternative transformation chains between Input and Output, one of 3 hops and one of 2 hops. Slide footer: 30/9/2010, External Advisory Board Meeting]
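Choosing between a 3-hop and a 2-hop chain, as in the figure on this slide, amounts to a shortest-path search over a registry of transformers. The sketch below uses breadth-first search in Python. It is not the gDTS implementation: it assumes each transformer converts one source format into one target format, and the registry entries, transformer names, and the function `shortest_chain` are hypothetical (the arrow directions in the original figure did not survive extraction).

```python
from collections import deque


def shortest_chain(transformers, source, target):
    """Return the shortest list of transformer names from source to target.

    transformers: iterable of (name, from_format, to_format) tuples.
    Returns [] when source == target and None when no chain exists.
    """
    outgoing = {}
    for name, src, dst in transformers:
        outgoing.setdefault(src, []).append((name, dst))

    queue = deque([(source, [])])
    seen = {source}
    while queue:
        fmt, chain = queue.popleft()
        if fmt == target:
            return chain
        for name, next_fmt in outgoing.get(fmt, []):
            if next_fmt not in seen:
                seen.add(next_fmt)
                queue.append((next_fmt, chain + [name]))
    return None


# Hypothetical registry: A can reach D in 3 hops (via B and C) or 2 (via C).
registry = [
    ("t_ab", "A", "B"),
    ("t_bc", "B", "C"),
    ("t_cd", "C", "D"),
    ("t_ac", "A", "C"),
]

chain = shortest_chain(registry, "A", "D")
no_chain = shortest_chain(registry, "D", "A")
```

Breadth-first search minimises the number of hops only. A service that also weighs transformer configuration or cost would need a weighted search instead, which this sketch does not attempt.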

32 VRE Summary
D4Science approach:
- Heterogeneous resources are accessible in a common ecosystem of resources regardless of their locations, technologies, and protocols
- Different communities have access to different views according to the conditions under which the sharing can occur
- Each community can define its own virtual research environment to satisfy specific needs for a limited timeframe and at no cost for the providers of the resources
- Several virtual research environments can coexist without interfering with each other, even when competing for the same resources

33 Conclusions
Facts:
- Very rich services and data collections are currently maintained by a multitude of authoritative providers
- Several standards are adopted in the same domain
Interoperability approaches are key to exploiting such richness
D4Science offers a variety of patterns, tools, and solutions
- to interconnect heterogeneous digital content, heterogeneous repository systems, and heterogeneous computation platforms with a rich set of free-to-use tailored services
- to decrease the cost of adoption
- to reduce the time to market of new ideas
- to deal with the plethora of standards

34 Supported Standards
- WS-*
- WSRF
- WS-BPEL
- JDL
- JSDL
- Glue Schema (part)
- X-*
- DC, TEI, ISO etc.
- JSR (several)
- GSI-Security
- XACML
- SAML
- OpenSearch
- OGC related
Comply with:
- OAI-PMH
- OAI-ORE

35 Supported Standards
WS-* standards: SOAP, WSDL, WS-Addressing, …
WSRF specifications: WS-ResourceProperties (WSRF-RP), WS-ResourceLifetime (WSRF-RL), WS-ServiceGroup (WSRF-SG), WS-BaseFaults (WSRF-BF)
WSN specifications: WS-BaseNotification, WS-Topics (WS-BrokeredNotification), …
JSR: 168 (Simple Portlets), 286 (168 update), 160 (JMX)
ISO: ISO3166 (countries), ISO4217 (currencies), ISO19115 (geo-location), …
X-*: XML, XSD, XSL, XSLT, XPath, XQuery
OGC: Web Coverage Processing Service, Web Coverage Service, Web Feature Service, Web Map Context, Web Map Service, Web Map Tile Service, Web Processing Service, Web Service Common
OGF standard: Glue Schema (2), …
Comply with: OAI-PMH, OAI-ORE

36 Find us
www.gcube-system.org
www.d4science.eu
Donatella Castelli, D4Science-II Project Director, donatella.castelli@isti.cnr.it
Pasquale Pagano, D4Science-II Technical Director, pasquale.pagano@isti.cnr.it
Thank You For Your Attention

