Research Data Allience Why and what Peter Wittenburg.

Slides:



Advertisements
Similar presentations
National Institute of Statistics, Geography and Informatics (INEGI) Implementation of SDMX in Mexico.
Advertisements

DRIVER Building a worldwide scientific data repository infrastructure in support of scholarly communication 1 JISC/CNI Conference, Belfast, July.
Interoperability aspects in the The Virtual Language Observatory Dieter Van Uytvanck Max Planck Institute for Psycholinguistics
Computational Paradigms in the Humanities – eHumanities and their role and impact in transdisciplinary research Gerhard Budin University of Vienna.
Advanced Metadata Usage Daan Broeder TLA - MPI for Psycholinguistics / CLARIN Metadata in Context, APA/CLARIN Workshop, September 2010 Nijmegen.
Co-funded by the European Union under FP7-ICT Co-ordinated by aparsen.eu #APARSEN Welcome to the Conference !! Juan Bicarregui Chair, APA Executive.
Information Types and Registries Giridhar Manepalli Corporation for National Research Initiatives Strategies for Discovering Online Data BRDI Symposium.
The current state of Metadata - as far as we understand it - Peter Wittenburg The Language Archive - Max Planck Institute CLARIN Research Infrastructure.
DASISH Common Solutions to Common Problems. DASISH – Data Service Infrastructure for the Social Sciences and Humanities DASISH brings together 5 ESFRI.
RDA Data Foundation and Terminology (DFT) IG: Introduction Prepared for RDA Plenary San Diego, March 9, 2015 Gary Berg-Cross, Raphael Ritz, Co-Chairs DFT.
CLARIN Centers for a Sustainable Infrastructure Daan Broeder, MPI for Psycholinguistics Jan Odijk, Utrecht University.
Computing in Atmospheric Sciences Workshop: 2003 Challenges of Cyberinfrastructure Alan Blatecky Executive Director San Diego Supercomputer Center.
Chinese-European Workshop on Digital Preservation, Beijing July 14 – Network of Expertise in Digital Preservation 1 Trusted Digital Repositories,
CLARIN-NL Second Open Call Jan Odijk CLARIN-NL Call 2 Info-session Amsterdam, 26 Aug 2010.
DATA FOUNDATION TERMINOLOGY WG 4 th Plenary Update THE PLUM GOALS This model together with the derived terminology can be used Across communities and stakeholders.
EGI-Engage EGI-Engage Engaging the EGI Community towards an Open Science Commons Project Overview 9/14/2015 EGI-Engage: a project.
ADC Meeting ICEO Standards Working Group Steven F. Browdy, Co-Chair ADC Workshop Washington, D.C. September, 2007.
The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Increasing the usage of endangered language archives in the.
Data Archiving and Networked Services DANS is an institute of KNAW en NWO Trusted Digital Archives and the Data Seal of Approval Peter Doorn Data Archiving.
Data Fabric IG Use Case Analysis. 2 Data Fabric Analysis how to come to essential components & services? Analyze Data Practices.
RDA Data Foundation and Terminology (DFT) IG: Introduction Prepared for RDA 6 th Plenary Paris, Sept. 25, 2015 Gary Berg-Cross, Raphael Ritz Co-Chairs.
The role of Parthenos for CLARIN ERIC Steven Krauwer CLARIN ERIC Executive Director 1.
CLARIN - a European Research Infrastructure Peter Wittenburg Max-Planck Institut für Psycholinguistik, Nijmegen.
The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Why should we invest in DWF? Peter Wittenburg CLARIN Research.
Wishes from Hum infrastructures Examples: DOBES and CLARIN Peter Wittenburg Max Planck Institute for Psycholinguistics.
Towards a European network for digital preservation Ideas for a proposal Mariella Guercio, University of Urbino.
DASISH Final Conference Common Solutions to Common Problems.
Data Fabric IG Introduction. 2  about 50 interviews & about 75 community interactions  Data Management and Processing is too time consuming and costly.
Summary Data Practices Report Peter Wittenburg Max Planck Data & Compute Center former MPI for Psycholinguistics.
CLARIN work packages. Conference Place yyyy-mm-dd
CLARIN Issues Peter Wittenburg MPI for Psycholinguistics Nijmegen, NL.
A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.
Hydro DWG at the RDA Plenary: BoF and Aligning HDWG work with WMO expectations and timeline Sylvain, Tony, Silvano, Ilya.
Summary of RDA Outputs so far dr. Ir. Herman Stehouwer 22 September 2015.
1 RDA and Metadata Peter Fox (my view) Metadata session
CASE (Computer-Aided Software Engineering) Tools Software that is used to support software process activities. Provides software process support by:- –
Sources of inspiration Discussions in DFT Use Cases Discussions in DF Use Cases „Paris“ Document Comments on „PARIS“ document Urgently need “Basic and.
Introduction to the VO ESAVO ESA/ESAC – Madrid, Spain.
Repository Registries Agenda 11.30Welcome & State of the Discussion Is it all one – is it all different? Peter & Herman and commenters 12.10Actions to.
A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI strategy and Grand Vision Ludek Matyska EGI Council Chair EGI InSPIRE.
Open Science (publishing) as-a-Service Paolo Manghi (OpenAIRE infrastructure) Institute of Information Science and Technologies Italian Research Council.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No EPOS and EUDAT.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No EUDAT Aalto Data.
Fedora Commons Overview and Background Sandy Payette, Executive Director UK Fedora Training London January 22-23, 2009.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No The use of the.
RDA for Data Practitioners Peter Wittenburg / Rainer Stotzka.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No Support to scientific.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No Herbadrop.
Bringing visibility to food security data results: harvests of PRAGMA and RDA Quan (Gabriel) Zhou, Venice Juanillas Ramil Mauleon, Jason Haga, Inna Kouper,
1 This slide indicated the continuous cycle of creating raw data or derived data based on collections of existing data. Identify components that could.
Intentions and Goals Comparison of core documents from DFIG and Publishing Workflow IG show that there is much overlap despite different starting points.
PIDs in EUDAT Webinar, 15 Februari 2013
Overview of WGs, IGs and BoFs
EUDAT’s engagement with the Earth Sciences
Current and Upcoming RDA Recommendations Dr. ir. Herman Stehouwer
RDA US Science workshop Arlington VA, Aug 2014 Cees de Laat with many slides from Ed Seidel/Rob Pennington.
AAI for a Collaborative Data Infrastructure
Research Data Alliance - Research Data Sharing without barriers Terena Networking Conference 22 May 2014.
Data Ingestion in ENES and collaboration with RDA
ACS 2016 Moving research forward with persistent identifiers
Maggie, Carlo, Peter, Rebecca (GEDE discussions)
Agenda Welcome and overview (Peter)
C2CAMP (A Working Title)
Agenda welcome and goals (Peter)
Common Solutions to Common Problems
Three Uses for a Technology Roadmap
Integrating social science data in Europe
Brian Matthews STFC EOSCpilot Brian Matthews STFC
Bird of Feather Session
Presentation transcript:

Research Data Allience Why and what Peter Wittenburg

2 Who am I … MPI Nijmegen NL MPCDF Garching DE MPI for Psycholinguistics -Understand human language faculty -Experimental orientation -“Data intensive” from the start -Use all kind of parameters externally available -Simulations -Large archive online MPCDF -Offer computing & data services to all MPIs -Offer HPC capacity and knowhow -Offer BDA capacity and knowhow -Help in data solutions -RDA, EUDAT, PRACE Leading Methodology and Technology work Senior Advisor Data Systems

3 Content 1.Data Science and Infrastructures 2.Data Practices 3.Principles & Trends 4.RDA 5.RDA Results &Activities 6.Data Fabric IG

4 A few factors  nr. of researchers increases enormously  there is a pressure in the direction of Grand Challenges and those topics relevant for societies  research is increasingly often data intensive  border-crossing research is a fact (countries, disciplines) Research is changing

5 Data is in Focus data is the oil driving research and economy data is key to understanding big challenges observations experiments simulations crowd sourcing store combination analysis visualization conclusions

6 Many Activities at Policy Level  Digital Agenda to unlock the full value of scientific data  Typical report about measures to be taken The Data Harvest, December 2014 © RDA Europe

7 Requirements for Data Science  let’s use the G8 formulations – data should be  searchable-> create useful metadata  accessible -> deposit in trusted repository and use PIDs  interpretable-> create metadata, register schema and semantics  re-usable-> provide contextual metadata  persistent-> provide persistent repositories  Funders request Data Management Plans?  What are the consequences of these principles?  How to design the necessary infrastructure?

8 Infrastructure activities DOBES NoMaD

9  ~70 global, independent teams  One archive with one copy of all data  Agreements (data flow, metadata, formats, etc.)  ~80 TB in online archive  Web-based, open deposit  4 dynamic external copies DOBES infrastructure Complete tool suite supporting all major steps including repository system and metadata tools (all based on standards, standoff, technology independence) Documenting Endangered Languages

10  ~70 global, independent teams  One archive with one copy of all data  Agreements (data flow, metadata, formats, etc.)  ~80 TB in online archive  Web-based, open deposit  4 dynamic external copies Complete tool suite supporting all major steps including repository system and metadata tools (all based on standards, standoff, technology independence) Documenting Endangered Languages Changed culture globally in various dimensions DOBES infrastructure

11  Novel Materials Discovery project  Computational material science  Many labs create enormous amounts of data about materials and compounds  Chemical compounds space is endless  How to quickly find useful compounds in case of specific needs???  NoMaD brings together result data into one repository (incl. metadata etc.)  Finding patterns across measurements to detect hidden classes  Complementary to very large Materials Genome Initiative from Obama which is an infrastructure project to reduce time and effort (50%) to design suitable materials and deploy them NoMaD infrastructure  Structure is similar to DOBES  Group of specialists find agreements  Offering services  Driven by research questions

12  Novel Materials Discovery project  Computational material science  Many labs create enormous amounts of data about materials and compounds  Chemical compounds space is endless  How to quickly find useful compounds in case of specific needs???  NoMaD brings together result data into one repository (incl. metadata etc.)  Finding patterns across measurements to detect hidden classes  Complementary to very large Materials Genome Initiative from Obama which is an infrastructure project to reduce time and effort (50%) to design suitable materials and deploy them  Structure is similar to DOBES  Group of specialists find agreements  Offering services  Driven by research questions No doubt – it will change culture NoMaD infrastructure

13 Infrastructure activities CLARIN

14  Scattered landscape of language resources and tools  Typical problems (not findable, not accessible, not interoperable, lack of services, lack of stability, etc.)  Situation in many LRT centers just chaotic  project orientation  project tweaking CLARIN RI some old slides – still true

15 how to come to a persistent and stable infrastructure? how to come to a federation and how to get access? how to make all of their LRT visible? how to come to interoperable services? how to get it all together for user services? community centres service provider federation CMDI future & short term solution service oriented architecture pan-European demo cases CLARIN Centres Centres Criteria Long-term Preservation REPLIX Replication 25 Centre Candidates all are busy with restructuring plans 2 already give long-term preservation service CLARIN RI

16 how to come to a persistent and stable infrastructure? how to come to a federation and how to get access? how to make all of their LRT visible? how to come to interoperable services? how to get it all together for user services? community centres service provider federation CMDI future & short term solution service oriented architecture pan-European demo cases Trust DomainInitial Federation PID Service setup federation technology build initial federation setup EPIC service central user attribute server CLARIN RI

17 how to come to a persistent and stable infrastructure? how to come to a federation and how to get access? how to make all of their LRT visible? how to come to interoperable services? how to get it all together for user services? community centres service provider federation CMDI future & short term solution service oriented architecture pan-European demo cases Component Metadata Metadata nowVirtual Collection CMDI Infra ISOcat development setup OAI PMH machinery ISOcat RegistryVLO Observatory Category Definition LRT Inventory Virtual Language World ARBIL MD Editor CLARIN RI

18 how to come to a persistent and stable infrastructure? how to come to a federation and how to get access? how to make all of their LRT visible? how to come to interoperable services? how to get it all together for user services? community centres service provider federation CMDI future & short term solution service oriented architecture pan-European demo cases Service Oriented Infrastructure Web Services Interoperability Standards & Best Practices Service Framework Specification Web Service and Processing Chains Standards and Best Practices CLARIN RI

19 how to come to a persistent and stable infrastructure? how to come to a federation and how to get access? how to make all of their LRT visible? how to come to interoperable services? how to get it all together for user services? community centres service provider federation CMDI future & short term solution service oriented architecture pan-European demo cases EU Identity Index Case Multimedia/multim odal Case Folkstory CaseC4/WebLicht Corpus Case It changed culture and will go on Many EU RI do almost the same CLARIN RI

20 Infrastructure activities EUDAT

21 EUDAT infrastructure some old slides

22 Don’t know yet – far away from research EUDAT infrastructure

23 State of Infrastructure Building  Have a huge number of infrastructure initiatives in Europe and globally  Created much awareness, initiated changes and allowed knowledge gathering  Many working in discipline and/or regional/national “silos” believing that their solutions are the best  There is still a lot of dynamics and despite all progress no satisfaction (EU Open Science Cloud)  Outreach is partly still poor (->120 interviews & interactions)  We can certainly say that much SW that has been built cannot be maintained  Built one of the first full-fledged repository systems and other software – not maintainable  How many PID, AAI, MD, etc. solutions do we want to support?  Funding and Sustainability in most cases not clarified  Costs are too high  where can we reduce  where can we extract commons?

24 Content 1.Data Science and Infrastructures 2.Data Practices 3.Principles & Trends 4.RDA 5.RDA Results &Activities 6.Data Fabric IG

25  lack of proper documentation, schemas, semantics, relations, etc.  directory structures, spreadsheets etc. are ad hoc creations and knowledge fades away  etc. Data Practices – Data Entropy

26 Metadata standards Data Practices - Metadata slide von Bill Michener, DataONE

27 Data Practices – Survey  ~120 Interviews/Interactions  2 Workshops with Leading Scientists (EU, US)  too much manual or via ad hoc scripts  too much in Legacy formats (no PID & MD)  there are lighthouse projects etc. but...  DM and DP not efficient and too expensive (Biologist for 75% of his time data manager)  federating data incl. logical information much too expensive  hardly usage of automated workflows and lack of reproducibility

28 Data Practices – Survey  ~120 Interviews/Interactions  2 Workshops with Leading Scientists (EU, US)  too much manual or via ad hoc scripts  too much in Legacy formats (no PID & MD)  there are lighthouse projects etc. but...  DM and DP not efficient and too expensive (Biologist for 75% of his time data manager)  federating data incl. logical information much too expensive  hardly usage of automated workflows and lack of reproducibility DI research only available for Power-Institutes pressure towards DI research is high, but only some departments are fit for the challenges Senior Researchers: can’t continue like this! need to move towards proper data organization and automated workflows is evident but changes now are risky: lack of trained experts, guidelines and support

29 Content 1.Data Science and Infrastructures 2.Data Practices 3.Principles & Trends 4.RDA 5.RDA Results &Activities 6.Data Fabric IG

30  Comparison G8, FORCE11, FAIR & Nairobi principles  Searchable/findable-> create useful metadata  Accessible -> deposit in trusted repository, use PIDs, have proper AAI in place etc.  Interpretable-> use metadata, registered schema and semantics  Re-usable-> provide contextual metadata  Manageable/persistent-> provide persistent repositories Trends - Principles Drawing by Larry Lannom

31 Trends – Volume, Complexity from simple structures towards complex relationships

32 Trends – Anonymous use direct exchange between known colleagues Domain of Repositories new mechanisms of building trust needed

33 Trends – Re-Usage Domain of trusted Repositories Data will be re-used in different contexts

34 Trends – Structuring domains Nores to be assessed to increase trustfulness

35 Trends – large federations domain of registered data various common data services (across countries & disciplines) taken from EUDAT

36 Trends – unified Data Management management of data objects is widely type and discipline independent

37 Trends – world-wide PID system what Internet Domain nodes with IP numbers packages being exchanged standardized protocols Data Domain objects with PID numbers objects being exchanged standardized protocols

38 Trends – split of functions “logical layer” operations are complex due to relations, etc.

39 BIG Questions  How to change inefficient practices?  How to overcome infrastructure barriers?  How to come to fundable infrastructure eco- system?  Ho to turn trends and principles into action?

40 Network Example TCP/IP Specification 1977 TCP/IP Stress-test WWW-Mosaic available worldwide adoption  many different suggestion & protocols  first TCP/IP just one suggestion amongst many  at the beginning discussion about different systems  at the beginning no interest from researchers and also industry (toy of some freaks)  required some smart policy decisions to push unification 20 years!

41 Content 1.Data Science and Infrastructures 2.Data Practices 3.Principles & Trends 4.RDA 5.RDA Results &Activities 6.Data Fabric IG

42 Role of RDA

43 RDA is about changing data practices 43 RDA is about building the social and technical bridges that enable global open sharing of data. Researchers, scientists, data practitioners from around the world are invited to work together to achieve the vision Funders: NSF, EC, AU, Japan, Brazil, DE?, UK?, ZA?, FI?, etc.

44 RDA is about changing data practices 44 RDA Global WG/IG/BoF initiative THE MACHINE RDA Europe Project THE SUPPORT EC Data Practitioners funding owning RDA Results testing adopting Co-funding Workshops/Sessions Training Helping Knowledge Base Leading Scientists WS Policy Activities funding creating commenting

45 RDA Governance 45 Interest Groups domain coordination, idea generation, maintenance, … RDA Membership Working Groups implementable, impactful outcomes Council organisational vision and strategy Technical Advisory Board socio-technical vision and strategy Secretariat administration and operations Organisational Advisory Board needs, adoption, business advice

46 Use Cases are the basis! all indicated nodes are centers of national, regional and even worldwide federations NameInstitutestate 1Language ArchiveMax Planck Institute NLin operation 2Geodata Sharing PlatformAcademy of ChinaIn operation 3Datanet Federation ConcortiumRENCI USIn operation 4ADCIRC Storm ForcastingRENCI USIn operation 5EPOS Plate ObservationINGV/CINECA ItalyIn operation 6ENVRI Environment ObservationU Helsinki, FinlandIn design 7Nanoscopy Repository Cell structuresKIT, GermanyIn design 8Human Brain NeuroinformaticsEPFL Switzerlandin testing 9ENES Climate ModelingDKRZ GermanyIn operation 10LIGO Gravitation PhysicsNCSA USIn operation 11ECRIN Medical Trial InteroperationU Düsseldorf GermanyIn testing 12VPH Physiology SimulationU London UKIn operation 13Species ArchiveNature Museum GermanyIn operation 14International NeuroI FacilityINCF SwedenIn operation 15Molecular GeneticsMPI GermanyIn operation Use Case driven and not “theory driven”

47 RDA Engagement 47 from 103 countries

Plenary 6 and Data Challenge! CNAM, Paris, France September 2015

49 Content 1.Data Science and Infrastructures 2.Data Practices 3.Principles & Trends 4.RDA 5.RDA Results & Activities 6.Data Fabric IG

50 RDA Results I: simple common data model Definition A persistent identifier is a long-lasting ID represented by a string that uniquely identifies a DO and that is intended to be persistently resolved to meaningful state information about the identified DO. Note: We use the term Persistent Resolvable Identifier as a synonym. If all would adhere to simple model much would be gained Could define a simple repository API

51 Impact of DFT Result Federating this cost too much. How to maintain?

52  result: a registry for data types  you get an unknown file, pull it on DTR and content is being visualized  extended MIME Type concept  no free lunch: someone needs to register and define type  code available begin 2015  PIT Demo already working with DTR RDA Results II: Data Type Registry

53  result: a generic API and a set of basic attributes  a PID Record is like a Passport (Number, Photo, Exp-Date, etc.)  if all PID Service-Provider agree on one API and talk the same language (registered terms) SW development will become easy  Test-Installation in operation together with DTR RDA Results III: PID Information Types working with PID and service providers much easier worldwide interoperability

54  due to unforeseen circumstances need until P5  Practical Policies = executable Workflow Statements  result at P5: a set of Best Practice PPs for a number of typical DM/DP tasks (Integrity Check, Replication, etc.)  currently a large collection of PPs, currently being evaluated  you could add your policies RDA Results IV: Practical Policies huge simplification for data stewards finally feasible quality checks and certification huge step in trust improvement one cornerstone towards reproducible data

55 Data Fabric Interest Group Data Fabric IG looks for common components and services to make this work as efficient and reproducible as possible Other WG/IGs looking at data publication workflows and citation

56 RDA – first Working Group results results achieved after ~20 months!

57 DFIG – grouping of WG/IGs CITDD ProvBROK CERT BDA REP REPRO DMP DOM FIM PP

58  Recently paper a number of colleagues engaged in RDA Data Management Trends, Principles and Components – What Needs to be Done?  Co-authors don’t claim to own any ideas – but kick-off a broad discussion  Need to accelerate solution finding and convergence process Doc: Data Fabric Wiki: Position Paper “Paris.doc” 8 Common TrendsPartly stable, some still in debate G8+ PrinciplesWidely agreeed Consequences of PrinciplesNot really thought through 19 ComponentsTo be discussed now Organizational ApproachesTo be discussed now get involved in these discussions

59 DFIG Spinoff – Repository Registry Domain of Trusted Repositories Safe Deposit Scientists Publishers Funders trusted Re-use valid References reproducible Science machine usage Registry (Humans, Machines)

60 Other “Clusters” Community Groups  Agriculture / Wheat Interop  Biodiversity  Structural Biology  Biosharing Registry  ELIXIR  Toxicogenomics  Metabolomics  Geospatial  Materials Data  Photon&Neuron  Marine  History&Ethnogarphy  Urban Life Social Groups  Community Capability  Data Re-use  Data Life Cycle  Engagement  Ethical Aspects  Legal Interoperability  Data Rescue  Data Handling Training  Data for Development  Cloud Worldwide Training

61  Uptake session at P5 in San Diego plenary/programme/adoption-day.html  Calls for Uptake Proposals from EUDAT and RDA Europe  Possibilities in EC’s WP16/17  Proposal to NSF around National Data Service coming, activities in China, Japan, Germany, etc.  Establishment of Testbeds by NDS, EUDAT, etc. Uptake of results get involved in testing/uptaking

62  RDA:  RDA Europe:  Data Management Trends, Principles and Components - What Needs to be Done Next? V6.1: f31aa6f4d448http://hdl.handle.net/11304/992fe6a0-fe34-11e4-8a18- f31aa6f4d448  Principles for Data Sharing and Re-use: are they all the same?  Living with Data Management Plans  RDA Europe: Data Practices Analysis  DFT:  Data Fabric:  Data Fabric Wiki: References

63 Thanks for your attention.

64 1.PID System 2.Actor ID System 3.Registry S for Trusted Repositories 4.Metadata S 5.Schema Registry S 6.Registry S Semantic Categories, Vocabularies 7.Data Types Registry S 8.Registry S for Practical Policies 9.Prefabricated PP Modules 10.Distributed Authentication S 11.Authorisation Record Registry S Components - Position Paper  OAI-PMH, ResourceSync, SRU/CQL  Workflow Engine & Environment  Conversion Tool Registry  Analytics Component Registry  Repository API  Repository System  Certification & Trusted Repositories  Training Modules

65 RDA is about changing data practices 65