1
South African Research Data Infrastructure
Open Data Platform Architecture
Wim Hugo: CDIO, SAEON; Systems Architect, DIRISA; Vice-Chair, ICSU-WDS
Anwar Vahed: Manager, DIRISA; CSIR Meraka Institute
2
Context
“Free and Open Access”; Tax-Funded Data; Reproducibility of Science
3
Governance: Stakeholder Groupings
ICSU World Data System African Networks DST Research Institutions SARIR Programme(s) SARVA BEA DoE (REDIS) DEA (NSIF, SANBI, CC M&E), O&C Shared Platform Stakeholders DRDLR SASDI Custodians NRF, HEIs and National Facilities NEDICC DIRISA ASSAf DST SA-GEO SAEOSS GEO-BON/ GEOSS Communities of Practice
4
Consolidated Roadmaps
5
Emerging E&EO Research Data Infrastructure
Services External Systems ORCID, DataCite, … Data Providers Services/ Composites APIs Linked Open Data Open Data Platform Shared Metadata Data Hosting Services Components Portals Gateways Guidance Physical Infrastructure Brokers and Harvesters Global Registries GEOSS, ICSU WDS, … SARVA Earth and Environmental Sciences Global Infrastructures BioEnergy Atlas SAEOSS SAEON Data Portal Gateways DIRISA Other Disciplines Community or Thematic Portals DEA SASDI SARIR
6
Six DIRISA Architectures
Requirements and Specifications – detailing the scope that needs to be addressed by DIRISA, as well as the typical solutions and specifications or standards that will apply;
The RDI Landscape – summarising international best practice and precedents, and reviewing local infrastructure, initiatives, and status quo;
Structuring a Data Alliance – dealing in more detail with governance, community participation, and capacity building requirements.
Software Specifications and Standards; Hardware Specifications and Standards; Hardware and Networks; Software and Systems; Business and Governance; Guidance, Capacity, and Soft Skills; Accreditation
7
Generalised Scientific Data Infrastructure Use Case
Access/ Download Data/ Services Analyse/ Visualise “Bind” “Publish” Process Metadata Discover “Find” “Predictable Assembly from Reliable Components”
8
Generalised Scientific Data Infrastructure Use Case
Curate Cite Access/ Download Data/ Services Analyse/ Visualise “Bind” “Publish” Process Metadata Discover Assess/ Rate “Find” “Predictable Assembly from Reliable Components”
9
Generalised Scientific Data Infrastructure Use Case
Curate Mediate Cite Access/ Download Data/ Services Analyse/ Visualise “Bind” “Publish” Process Metadata Discover Assess/ Rate “Find” “Predictable Assembly from Reliable Components”
10
Technical Standards Support
Harvesters/ Discovery: CS/W, OAI-PMH, REST Service API
Metadata Standards: ISO 19115/ p2, SANS 1878, EML, FGDC, Dublin Core, DDI, Darwin Core, DataCite
Data Services: OGC WxS, KML, GeoRSS, GeoJSON, GML; NetCDF/ HDF5 (Multidimensional Data); Time Series/ Signal Data; Media (Images, Video, Audio); Document Objects; Tabular Data (CSV, Excel, …)
Linked Open Data: DataCite DOIs, ORCID, Digital Samples, Vocabulary Registries
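As a rough illustration of the harvesting protocols listed on this slide, the sketch below builds an OAI-PMH ListRecords request URL. The verb and argument names (verb, metadataPrefix, from, set) are those of OAI-PMH 2.0; the endpoint address and the helper name list_records_url are hypothetical and do not come from the presentation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only; not a real DIRISA or SAEON address.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access is performed)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec is not None:
        params["set"] = set_spec    # selective harvesting by set
    return base_url + "?" + urlencode(params)


url = list_records_url(BASE_URL, from_date="2016-01-01")
print(url)
```

A harvester would fetch this URL, parse the XML response, and follow any resumptionToken it contains; that part is omitted here.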
11
Options for Participation
Meta-Data Management Discovery Data Hosting, Visualisation and Download Reporting No Infrastructure Portal Portal Portal Portal Shared Infrastructure Embedding* Embedding Embedding Embedding Adapters and Harvesters CS/W Adapter Standard Data Services Mostly Own Infrastructure REST Services REST Services Plug-Ins/ Own Development REST Services
12
Open Data Architecture
13
Hardware Architecture: Research Cloud(s)
RIMS/ NRF HTTP Service-Based Integration Service and Component Interfaces (REST/ JSON, XML, JavaScript) DIRISA (SAEON) Service and Portal Infrastructure DEPOSIT | DISCOVERY | APPLICATION | REPORTING DataFirst SANBI DMP Tools (Multiple) Component-based Integration DOI Registration/ DataCite Distributed Data Cloud Management (iRODS, Resonant, or equivalent) DEA ORCID/ Re3data/ … Research Cloud HTTP/ FTP Basic Portals
----- Meeting Notes (2016/03/09 12:24) ----- Personal cloud - integration with OpenCloud
----- Meeting Notes (2016/03/09 12:40) ----- Add tier 0 clouds
----- Meeting Notes (16/07/27 13:12) ----- iRODS Establish a registry as a RDMS and define services for external systems Decision (iRODS or Hadoop) ZFS and XFS - any Linux compatible application iRODS Machine - CF Plug-In - CF/ MM to set up on SAEON media Repository WebDav SAEON DIRISA Other SASDI/ NSIF DataCite SA SAEOSS T2 T1 T2/3 Full Function Portals Middleware – Archiving, Backup Other Desktop Deposit Tool (OwnCloud) SARVA BEA SAEON DIRISA Accr.
14
Main Use Cases: #0 - Registration
Request data from ORCID Registry of Repositories (including DIRISA) re3data WDS DSA NRF I have an ORCID already Request Data from BI Staging Grant(s) Registered in RIMS Select a Type of Participation Researchers changing institutions Private sector contributions- similar to Vendors in SASDI Deal with consortia (multiple parents for a child) Individual Researcher Supplement known information Select 1 or more Repositories Institutional Participant Register ORCID
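The registration use case relies on researchers supplying an ORCID. A minimal sketch of one check a registration form could run before contacting ORCID is shown below: validating the check character of an ORCID iD, which uses the ISO 7064 MOD 11-2 scheme. The function names are illustrative, and nothing here is taken from the DIRISA implementation.

```python
def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid):
    """Return True if a hyphenated ORCID iD has 16 characters and a correct check character."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    if any(ch not in "0123456789" for ch in compact[:15]):
        return False
    return orcid_check_character(compact[:15]) == compact[15].upper()


print(is_valid_orcid("0000-0002-1825-0097"))
```

This only confirms that the identifier is well formed; confirming that it belongs to the person registering would still need a request to ORCID itself.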
15
Main Use Cases: #1 - Deposit
Optional DOI Registration Optional Data Upload Repositories of Last Resort Online Resource URLs and Pointers Online Capture Manual Meta-Data Provision Social Sciences and Humanities File Upload Earth and Environmental Sciences Health and Bio-Informatics REST Services Push National Aggregate Applied Sciences, Built Env., Engineering Physics, Chemistry, Astronomy Automated Meta-Data Processes Standard Harvesting Protocols Business Science, Law, Economics Web Folders and FTP Semi-Automated Meta-Data Processes DMPs, RIMS Institutional or Domain Repositories
16
Main Use Cases: #2 - Discovery
DOI Resolvers Repositories of Last Resort Social Sciences and Humanities Citations Earth and Environmental Sciences Health and Bio-Informatics Portal Interfaces National Aggregate Indexed Meta-Data Applied Sciences, Built Env., Engineering Physics, Chemistry, Astronomy Standardised Search Interfaces (Machines) Business Science, Law, Economics Search and Discovery Options REST Services Standard Harvesting End Points Institutional or Domain Repositories GEOSS Broker, ICSU WDS, …
17
Main Use Cases: #3 - Application
Citation Repositories of Last Resort Google Analytics Application Request Previews Indexed Meta-Data Application Options Chain into Web Processes Event Logs and User Feedback Download Brokers and Mediators Institutional or Domain Repositories
18
Main Use Cases: #4 - Reporting
RIMS (Grant Administration) Indexed Meta-Data Google Analytics Event Logs and User Feedback CrossRef/ DataCite Reporting Scope Application Statistics Meta-Data Status/ Search History Portal-Based Depositor Summaries Page Views and User Behaviour Citations and Mentions Reporting Options REST-Based Statistics User Rating and Comments, Data Quality Context and Knowledge Network Depositor Summaries Grant Policy Compliance
19
Accreditation: Minimum Scope of Evaluation
Security and ICT Management Access and Licensing Policies External Expertise Quality Assurance Conference and Publication Record Ingest and Publication Networking and Sharing Products and Services ICSU-WDS Communication and Outreach Depositor Authenticity TRAC Data Seal of Approval Interoperability Preservation Practice Hardware Infrastructure Infrastructure Legal Compliance Software Infrastructure Sustainability Business Continuity Planning Host Organisation Funding Mechanisms
20
Accreditation Options
Columns: Nature of Accreditation; Process; Local (NRF) Context; Notes and Comments
ISO 16363:2012: On-Site Audit; Exceeds Requirement; Expensive but a medium-term goal
NESTOR/ DIN: Remote Audit; Not applicable locally, mainly used in Europe/ Germany
ICSU-WDS (World Data System): Remote confirmation with peer review; Meets Requirement; Traditionally Earth and Environmental Science. Allows ‘Network Members’
DSA (Data Seal of Approval): Traditionally Social Science and Humanities
TRAC: Self-Evaluation; Meets Requirement, but needs subsequent formalisation (NRF?); Not Recommended
21
Some Questions Remain …
True scalability: research infrastructure maintenance is human-resource intensive and cannot remain so.
Universally accepted, machine-readable licenses for non-open data: no equivalent of Creative Commons licenses exists for data that is legitimately restricted in one of several generic ways (privacy and ethics, commercial interest, or classified information), yet such licenses are required for large-scale, automated processing.
External dependencies: in an increasingly interconnected systems environment, how do we sustainably fund critical components of globally shared infrastructure (for example, vocabulary services or persistent identifier resolvers)?
22
?
Funded by NRF/ SAEON, Department of Science and Technology, and CSIR/ Meraka Institute
23
Biodiversity Data Management
24
Elements of Interoperability
Syntax: describes service protocols, parameters
Schema: describes structure of content
Semantic: describes the meaning of content
Temporal – easy; Spatial – easy; Topic – difficult
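One way to read the "easy" versus "difficult" labels is sketched below: temporal and spatial matching reduce to simple comparisons, while topic matching only works where someone has already mapped the terms to a shared vocabulary. The function names, the bounding-box layout, and the synonym table are all assumptions made for illustration.

```python
from datetime import date


def periods_overlap(a_start, a_end, b_start, b_end):
    """Temporal matching: do two closed date ranges overlap?"""
    return a_start <= b_end and b_start <= a_end


def boxes_intersect(a, b):
    """Spatial matching for boxes given as (west, south, east, north).

    Boxes that cross the antimeridian are not handled in this sketch.
    """
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Topic matching needs a curated mapping between vocabularies.
# These terms and their mapping are hypothetical examples.
TOPIC_SYNONYMS = {
    "precipitation": "rainfall",
    "rain": "rainfall",
    "rainfall": "rainfall",
}


def topics_match(term_a, term_b, synonyms=TOPIC_SYNONYMS):
    """Semantic matching: succeeds only for terms that have been mapped."""
    a = synonyms.get(term_a.lower())
    b = synonyms.get(term_b.lower())
    return a is not None and a == b


print(periods_overlap(date(2015, 1, 1), date(2015, 12, 31), date(2015, 6, 1), date(2016, 6, 1)))
print(boxes_intersect((16.0, -35.0, 33.0, -22.0), (18.0, -34.5, 19.0, -33.5)))
print(topics_match("Rain", "precipitation"), topics_match("rain", "runoff"))
```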
25
Essential Biodiversity Variables
Genetic composition: Co-ancestry, Allelic diversity, Population genetic differentiation, Breed and variety diversity
Species populations: Species distribution, Population abundance, Population structure by age/size class
Species traits: Phenology, Body mass, Natal dispersion distance, Migratory behavior, Demographic traits, Physiological traits
Community composition: Taxonomic diversity, Species interactions
Ecosystem function: Net primary productivity, Secondary productivity, Nutrient retention, Disturbance regime
Ecosystem structure: Habitat structure, Ecosystem extent and fragmentation, Ecosystem composition by functional type
Not all traditional spatial data! Not all remotely sensed!
26
Simple or Core Information Model
Genes and Alleles Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena
27
Example: Taxon Abundance, Presence and Absence
Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena
28
Example: Phylogenetic Data
Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena
29
Example: Morphology
Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena
30
Example: Biome Definition, Ecosystem Services
Genes and Alleles Relationship Species and Taxons Sampling Event Spatial and Temporal Coverage Life Stages, Traits and Characters Physical Phenomena
31
Generic Dimensions of Data
Spatial Coverage: XYZ
Temporal Coverage: T
Topic or Semantic/ Ontological Coverage:
P: Phenomenon (mostly physical, chemical, or other contextual data)
B: Biological
Tx: Species and Taxonomy (with some extensions)
Al: Allele/ Genome/ Phylogenetic
Ch: Characteristics, Traits, and Life Stages
Each unique combination of these, supported by a vocabulary/ ontology, is a generic data family.
Continuous or Near-Continuous: Uppercase
Discrete or dispersed: Lowercase
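The notation on this slide can be sketched as a small data structure. The class below is a hypothetical illustration, not part of any DIRISA software: it assumes that the uppercase/lowercase convention applies to the spatial and temporal codes only, and that topic codes (P, B, Tx, Al, Ch) are written as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFamily:
    spatial_axes: str          # for example "XYZ" or "XY"
    spatial_continuous: bool   # True: continuous or near-continuous coverage
    temporal_continuous: bool  # True: continuous or near-continuous coverage
    topic: str                 # "P", "B", "Tx", "Al", "Ch", or a mix such as "P/ B"

    def signature(self):
        """Render the family in the slide's notation, e.g. 'XYZ, t, P'."""
        axes = self.spatial_axes.upper() if self.spatial_continuous else self.spatial_axes.lower()
        time_code = "T" if self.temporal_continuous else "t"
        return f"{axes}, {time_code}, {self.topic}"


example = DataFamily("XYZ", spatial_continuous=True, temporal_continuous=False, topic="P")
print(example.signature())
```

Because the class is frozen, instances are hashable, so each unique combination can serve as a dictionary key when registering the vocabularies that support a family.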
32
Some Generic Data Families and Crosswalk Requirements
Columns: Data Family; Typical Dimensions/ Content; Typical Infrastructure; Typical Syntax/ Schema
Cube Data: XYZ, t, P; OPeNDAP; Multi-dimensional
Traditional Spatial: XY, t, P; S-DB; WxS
Signals: XYZ, t, P/ B; SOS; O&M
General Ecosystem: XYZ, t, P/ B; MetaCat; CSV
Occurrence: XYZ, T, Tx; GBIF Index; DwC
Genetic: XYZ, T, Al; GenBank; FTP/ ASN.1
Still Thinking About: ✪ HDF-5 for Everything ✪ Directed Graphs/ RDF for Everything
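A crosswalk for the occurrence family listed above (GBIF Index, DwC) might look like the sketch below, which maps a Darwin Core occurrence record onto the generic X, Y, Z, T and Tx slots. The Darwin Core term names are real; the target layout, the function name, and the sample record are assumptions made up for illustration.

```python
def dwc_to_generic(record):
    """Map a Darwin Core occurrence record (a dict) onto generic dimension slots."""

    def to_float(value):
        # Missing or empty values stay unknown rather than becoming 0.0.
        return float(value) if value not in (None, "") else None

    return {
        "x": to_float(record.get("decimalLongitude")),
        "y": to_float(record.get("decimalLatitude")),
        "z": to_float(record.get("minimumElevationInMeters")),
        "t": record.get("eventDate"),
        "tx": record.get("scientificName"),
    }


# Made-up sample record, not real observation data.
sample = {
    "decimalLatitude": "-33.95",
    "decimalLongitude": "18.46",
    "eventDate": "2015-11-03",
    "scientificName": "Protea cynaroides",
}
generic = dwc_to_generic(sample)
print(generic)
```

Equivalent functions for the other families would let a broker compare records on shared dimensions without each consumer learning every source schema.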
33
Typical Guidance http://bit.ly/1W8YPxx For Each EBV …
Work started within GEO BON WG 8
34
Vocabulary and Name Services
Important to Limit Diversity of Interfaces Mappings of Vocabularies to Schema RDA has started thinking about this …