The Preservation and Sustainability of Research Data in Australia Dr Markus Buchhorn, Director, ICT Environments Australian National University; Also in.

Slides:



Advertisements
Similar presentations
Criteria for the trustworthiness of data centres Jens Klump Helmholtz Centre Potsdam German Research Centre for Geosciences (GFZ) DataCite Summer Meeting.
Advertisements

April 2010 MRC Data Sharing Policy Peter Dukes Policy Lead – Data Sharing & Preservation.
A centre of expertise in data curation and preservation DCC/NeSC eScience Workshop, June 2008 Working in partnership with the eScience community This work.
Breakout 1 Socio-legal etc. Every discipline will be different & each data centre will have different answers to questions. Use a questionnaire and send.
A centre of expertise in digital information management UKOLN is supported by: Digital Futures for MLAs? A snapshot in real time. Dr Liz.
An Introduction to Repositories Thornton Staples Director of Community Strategy and Alliances Director of the Fedora Project.
EResearch – isn’t that just “research”? Derek Whitehead Vice President, ALIA.
EInfrastructures (Internet and Grids) US Resource Centers Perspective: implementation and execution challenges Alan Blatecky Executive Director SDSC.
December 2008 MRC Data Support Services (DSS) Chris Morris 13 th February 2009 Sharing Research Data: Pioneers, Policies and Protocols The seventh cat.
A centre of expertise in data curation and preservation MIS Seminar :: University of Edinburgh :: 2 October 2006 Funded by: This work is licensed under.
A Very Brief Introduction to iRODS
Dr Gordon Russell, Napier University Unit Data Dictionary 1 Data Dictionary Unit 5.3.
E-Research Coordinating Committee report to CAUL April 2006 Cathrine Harboe-Ree.
An Overview of eResearch Activities in Australia Paul Davis, GrangeNet Jane Hunter, Uni of Qld.
Grid Applications and Repositories Head, ANU Internet Futures, Lead, APAC Information Infrastructure Program, APAC Grid Services.
1 TDR: What does partial compliance mean?
Report for Wed. Question 1 In your view/experience what parts of data life cycle, data citation and data integration implementations/applications or frameworks.
Head, ANU Internet Futures Ex-Lead, APAC Information Infrastructure Program APAC Services Architect Grid Services Coordinator,
Depositing and Disseminating Digital Resources Alan Morrison Collections Manager AHDS Subject Centre for Literature, Linguistics and Languages.
1 The Australian Partnership for Sustainable Repositories Margaret Henty Digital Futures Industry Briefing November 8, 2006.
1 CS 502: Computing Methods for Digital Libraries Lecture 27 Preservation.
Data preservation & the Virtual Observatory Bob Mann Wide-Field Astronomy Unit Royal Observatory Edinburgh
THE JOINED UP WORLD OF E-RESEARCH Professor Neil McLean National Technical Standards Adviser to the Department of Education Science and Training (DEST)
Institutional Perspective on Credit Systems for Research Data MacKenzie Smith Research Director, MIT Libraries.
ELearning Frameworks: What, Why and Who, Where & When Daniel Rehak, Learning Systems Architecture Lab, USA Kerry Blinco, Dept. Education Science and Training,
Science as an Open Enterprise: Open Data for Open Science Professor Brian Collins CB, FREng UCL, June 2012 Emerging conclusions from a Royal Society Policy.
Australian Partnership for Sustainable Repositories AUSTRALIAN PARTNERSHIP FOR SUSTAINABLE REPOSITORIES Caul Meeting 2005/2 Brisbane 15.
Database Systems: Design, Implementation, and Management Ninth Edition
Digital Identity Management Strategy, Policies and Architecture Kent Percival A presentation to the Information Services Committee.
Computing in Atmospheric Sciences Workshop: 2003 Challenges of Cyberinfrastructure Alan Blatecky Executive Director San Diego Supercomputer Center.
Scientific Data Infrastructure in CAS Dr. Jianhui Scientific Data Center Computer Network Information Center Chinese Academy of Sciences.
A centre of expertise in digital information management UKOLN is supported by: Benefits of Research360 Catherine Pink Institutional Data.
Managing the Record of Research At the Smithsonian Using SIdora SAA Research Forum August 12, 2014.
Using ISO/IEC to Help with Metadata Management Problems Graeme Oakley Australian Bureau of Statistics.
Enabling E Research ANU Data Commons. What is it ? Building a repository for data sets o data can be deposited o updated o published to Research Data.
Open Data from Reliable Records Anne Thurston. The Open Data movement, a key aspect of Open Government, is now a top development interest across the world.
Managing Research Data – The Organisational Challenge at Oxford James A J Wilson Friday 6 th December,
The repositories Landscape: where are Repositories now and what’s around the corner? UKDA-store Louise Corti UKDA, University of Essex MIMAS OPEN FORUM.
Data Management David Nathan & Peter Austin & Robert Munro.
The management of health and biomedical data in Tanzania: Need for a national scientific data policy Leonard E.G. Mboera Directorate of Information Technology,
The International e-Depot to Guarantee Permanent Access to Scholarly Publications Marcel Ras Tartu, June 2012.
Because good research needs good data Funded by: Digital Curation for Researchers, 28th February 2013 The Shifting Research Data Management Policy Landscape.
BUILDING ON COMMON GROUND: EXPLORING THE INTERSECTION OF ARCHIVES AND DATA CURATION Lizzy Rolando & Wendy Hagenmaier 6/3/2015IASSIST 2015.
Experts in numerical algorithms and High Performance Computing services Challenges of the exponential increase in data Andrew Jones March 2010 SOS14.
Cyberinfrastructure What is it? Russ Hobby Internet2 Joint Techs, 18 July 2007.
Information Infrastructure Evolution ARIIC is working towards – a distributed electronic research environment that allows researchers to share, annotate,
Digital Preservation across the technologies, strategies, open standards & interoperability aspects including the legal issues Pratik Shrivastava Scientist.
26/05/2005 Research Infrastructures - 'eInfrastructure: Grid initiatives‘ FP INFRASTRUCTURES-71 DIMMI Project a DI gital M ulti M edia I nfrastructure.
Dataset citation Clickable link to Dataset in the archive Sarah Callaghan (NCAS-BADC) and the NERC Data Citation and Publication team
Cyberinfrastructure Overview Russ Hobby, Internet2 ECSU CI Days 4 January 2008.
Nov 26, Health-y sharing of human data. 2 Plan ahead.. It can be done in many cases, to great success and benefit!
11 Researcher practice in data management Margaret Henty.
NeDIC – the way forward. Structure Augmenting the draft presented earlier SWOT Policy Issues Roles and Responsibilities –Government, funding agencies,
Working Group 4 Data and metadata lifecycle management  1. Policies and infrastructure for data and metadata changes  2. Supporting file and data formats.
Institutional data curation implementation 1st African Digital Curation Conference 12 February 2008.
MAPS Middleware Action Plan & Strategy Project Middleware Action Plan & Strategy Project (MAPS) Patricia McMillan, Project Manager.
Issues in RDM This work is licensed under a Creative Commons Attribution 4.0 International LicenseCreative Commons Attribution 4.0 International License.
Preserving your research data for future use This work is licensed under a Creative Commons Attribution 3.0 Unported License.Creative Commons Attribution.
Monash.edu Research data ecosystem David Groenewegen Director, Research, University Library.
Developing our Metadata: Technical Considerations & Approach Ray Plante NIST 4/14/16 NMI Registry Workshop BIPM, Paris 1 …don’t worry ;-) or How we concentrate.
Publish your Data on the Tropical Data Hub Seeding the Commons Project Australian National Data Service e-Research Centre James Cook University This work.
Digital Repository Certification Schema A Pathway for Implementing the GEO Data Sharing and Data Management Principles Robert R. Downs, PhD Sr. Digital.
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Jarek Nabrzyski Director, Center for Research Computing
Sharing Anxiety: Do Legal Obligations and Open Standards Help?
GSAF Grid Storage Access Framework
Australia's National Information Infrastructure for Research Markus Buchhorn Director, ICT Environments, The Australian National University (and APAC,
Research Infrastructures: Ensuring trust and quality of data
Brian Matthews STFC EOSCpilot Brian Matthews STFC
GISELA & CHAIN Workshop Digital Cultural Heritage Network
Presentation transcript:

The Preservation and Sustainability of Research Data in Australia Dr Markus Buchhorn, Director, ICT Environments Australian National University; Also in Formerly: Head, ANU Internet Futures Grid Services Architect, APAC Grid Services Coordinator, Grangenet This talk is based in parts on the “AERES” survey and report for APSR with Paul McNamara

Research Data This is not about publications but primary, derived or simulated data, Which (may) lead to publication Scholarly inputs and outputs Why is it different? Data has a very different lifestyle Why is it hard? Data has very different, and more complex, problems Just “in Australia?” It’s a global problem! E-Research infrastructure? Transparent and appropriate access to all resources, to enhance research processes and build greater knowledge

We sort of know this… A Repository Object Store Files, DB, streams, instruments Metadata DB Scientific, Management, Annotation, Preservation, Access,… Access Interface Presentation Interface Disk, Tape, HSM, RAM, … A (good) Repository is the sum of these things, and more… Interfaces and services for management and curation, processes, security, standards, support, etc.

Rep. IRP Repository Federation “Portal” or Federation interface AAA Services Metadata-flows Users Computing Collaboration Visualisation Access protocols Queries, Curation AAA flows …and we can architect things around it… proxy This all applies even with a single repository

…and we can identify the services

Acquire Digitise Create Observe Purchase Pre-process Calibrate Validate Version 5.1 Markus Buchhorn Reposition Cache Mirror Thumbnail Store Ingest Access Present Register Accept Download Virtualise Federate Stream Preview Permit Publish Announce Organise Approve Manage, Curate Data Mgmt Metadata Mgmt Storage Mgmt Access Mgmt Archive Sustain Value Delete Preserve Post-process Compute Analyse Merge Present Use Visualise Collaborate Author Derive Individual Source Policy Institutional Domain Standards Govt We can classify the processes Funders AnonymiseProcess Mgmt

Let’s look at Application Areas Geosciences Minerals, oils and gases, tectonics, Govt, Surveys, Industry Many data sources (spatial and physical) and simulations Bioinformatics Genomics, proteomics, … Public datasets, private queries, private annotations Chemistry Simulation, need data services mainly High Energy Physics Large expensive instruments, projects Massive data, computation and simulation

Application Areas - 2 Earth Systems Sciences Massive remote sensing data sets, large and complex simulations Astronomy Big data, complex reduction process, big simulations, long-term research Financial Many sources, SX, FX, news, … Timeliness and long time scales are important Music, Arts, Sports Performance, formal and practice Education focus

Application Areas - 3 Linguistics, Musicology Archives of digitised cultural material Complex analyses Social Science Data Census, health, surveys, … Complex data structures, qualitative data Archaeology Digitised physical materials, spatial and chronological data

Consider just some Issues - Longevity Sustainability Data formats Descriptions, C ompression, lifetimes Simplex vs Complex (compound) objects Software Algorithms, implementations, OS Versioning Recalculation, interpretation, validation, derivatives Underlying infrastructure, technologies Storage Facilities Mirroring for protection – policy and technical issues Geo, Bio, ESS, Astro, Ling, SS, Arch, Fin, Mus.

Issues- Metadata Varied research schemas 1 is nice, but most have zero or five… Baseline DC is a lmost non-existent.. Scientific description Itself contentious… Provenance and processing Preservation, curation and valuation Subjective metadata, annotations Geo, Bio, ESS, Astro, Ling, SS, Arch, Fin, Mus.

Issues - Rights Needs AAA to be working, to scale Authentication, Authorisation and Accounting Requires identities and roles to be understood Privacy, Security Personal information leakage Anonymised data, needs to stay usable Ownership Not always (almost never!) with the researcher Time-varying Data sourced under old agreements Rights vary by status of source people die, agreements expire, … Geo, Bio, HEP, ESS, Astro, Ling, SS, Arch, Fin, Mus.

Why do this anyway? Create opportunities For re-analysis, re-use; expected or otherwise Solve problems Waste of $$, people and collection effort Loss of irretrievable data Inability to verify research Requirements (have to do it) National good, cultural heritage, input to policy Reference materials Atlas, catalogues, … Value not just in collection but in accessibility

Is it happening already? Data re-use/re-analysis Ever more examples, some very good, some horror stories… Policy conflicts Data must be kept Data must be deleted (anything involving people) But… New culture This data has value outside of my domain, or after my project? New capabilities, provided by the Internet Discovery of who has useful data Accessibility of useful data New (and old) fears by users (see later) New data is easier to cope with than old data Introduce new workflows and processes starting now Recover old data as/when needed

Some of the players: Government and funders Strengths: Control $$, Control Policy Define requirements, enforceability, and encouragement! Set frameworks for ethics Can of worms in its own right (c’tees getting involved in technical elements; too many c’tees at different layers, contradictory rulings) Control some data (ABS, BoM, GA, RTA, AADC. …) And can be data triggers (tobacco, regulators, …)

Government and funders Weaknesses: Policy politely suggests publically-funded data should be well managed and accessible No teeth No infrastructure to back it up No recognition of good effort Funding is project oriented, infrastructure has to be systemic One-off grant for lifetime support?

Government and funders Opportunities: Effective policy, with $$ to back it up Build a coordinated and sustainable infrastructure Build skills, expertise Save money Increase research effectiveness Increase leverage of investment

Government and funders Threats: Loss of irretrievable data Waste of $$ and effort in collecting the same data Insufficient data for policy input Environment, healthcare, education, security, … Loss of research effectiveness Other countries are doing this UK, US, Asia (Taiwan, Korea, …)

Another key player: Organisations, Institutions Not just Universities Employ the staff that collect the data Manage the funds acquired by staff May have obligations, especially long-term (beyond staff tenure) Legal and moral (is research data a ‘record’?) Probably “own” the data Certainly have opportunities Have existing funding models Shuffling between buckets…

And Users, who are human… Fear of missed “nuggets” in their data Milk it for everything, for ever and ever Fear of missed errors Probably varies by domain and career-stage Fear unknown custodians/stewards Can’t do as good a job as my PhD students Fear inappropriate leaks Privacy/ethics, first-to-market, relationship to data providers (drug users, fishermen, …) Fear the cost of effort Takes time (and money) away from what they’re good at Fear lack of recognition I’ve done it for the national good, how about some accolades? Fear of trusting somebody else’s data That person, or their repository may have done something wrong

Recognition “We” require data to be effectively deposited But don’t have anything to back up this requirement Implies an effective place to deposit Recognition (certification) of repositories How good, and how sustainable? What are the metrics? Implies an effective process of deposit Recognition of the deposit effort How well is it deposited? 1 star deposit into a 5 star repository? Recognition of the deposit content Depositor gets recognition, somewhat like a paper Which requires a sufficiently good effort, and a citable repository Interesting question of who “owns” the data, and hence accrues recognition Who carries out recognition, certification? Domain-specific skills, technology-specific skills Curation, preservation skills

Valuation What to keep? Ideal model keeps everything, for ever Pragmatism dictates some data deletion Who has the right to make that decision, and takes on the responsibility Especially if later proven wrong Cost is going down Storage (physical media) is getting cheaper Processes for management are starting to scale Especially for the basic storage/access services Keeping everything is becoming reasonable Keeping it for ever is becoming manageable

Sustainability Follow the $$$ Govt top-slice, or top-up to institution/user Fund fewer people to do more things? Fund the same number to do more with less? Create a whole new funding stream? Institutional top-slice, or top-up Same questions. Leave it to users/communities Where there’s a will, … But we need to support areas where there isn’t a will as well

Implementation Get users out of data management at some level Scale costs on infrastructures, services and skills that are sufficiently common Deal with user fears Some of it needs education, some of it needs trust to be established E.g. Scalable AAA mechanisms are now coming along nicely Users provide domain specific skills and domain policies Coordination role within a domain – required! But need technical backing when it crosses some boundary

Implementation - 2 All repositories don’t need to do everything Some can be more equal than others. By domain, by technology, by fundamental services… As long as the sum of them exceeds the sum of needs Most technical problems can be solved today. Policy is the main hurdle. Achieving the goal What are the carrots and sticks that actually work? Who are best placed to wield them? Sustaining the goal The answer is money, but what is the question?

Is anybody thinking about this? Universities and partnerships APSR and other groups Federal and State Govt DEST, PMSEIC, NCRIS (SII), eResearch-CC, Productivity Commission, … Funders and managers ARC, NHMRC, AVCC Here’s hoping…