Data Practices and RDA Agreements
Peter Wittenburg
Max Planck Data and Compute Center (Munich)
Former: MPI for Psycholinguistics (Nijmegen)

2 What after DOBES & CLARIN?
- retired from MPI Nijmegen about 2 years ago
- in DOBES & CLARIN we created exemplary infrastructures with proper data organization
- this was very well received by others
- thus we were invited
  - to give advice to the EC and member states
  - to participate in the European EUDAT data infrastructure
  - to participate in the worldwide Research Data Alliance

3 Is there a problem?
- data as an issue
  - many publications about the data deluge
  - many publications asking whether we are fit for the challenges of “data science”
  - many discussions about Big Data
  - many stories about inefficiencies and about being excluded (projects, researchers, industry)
- in the last 2 years:
  - about 50 interviews in many departments/institutes (different disciplines, DOBES, CLARIN) in EUDAT, RDA-EU etc.
  - about 80 intensive interactions at scientific meetings
  - about 20 data models in the DFT Working Group
  - workshop with leading scientists 2010/2011

4 Which conclusions from the interviews?
- Data management (DM) and data processing (DP) are too time-consuming and costly because data is organized heterogeneously, in particular with respect to logical-layer information. Researchers see the need to change their habits, but have no agreed suggestions for solutions.
  - many researchers stick to file systems, often at their limits
- Federating data, including the logical-layer information that is relevant for tracing provenance, understanding the creation context, checking identity and integrity, etc., is so costly that in practice it is not done, although most data professionals understand that this practice cannot continue.
- DM and DP are not ready for Big Data because automated procedures that incorporate proper data organization mechanisms are rarely used.
  - too many ad hoc scripts around without proper documentation

5 Which conclusions from the interviews? (continued)
- For lack of software that supports proper data organization, we continue to create legacy data that cannot easily be integrated into our growing domain of accessible data.
  - most disciplines have a massive legacy problem, and we still create legacy data
- If, for example, one of the key biologists in a large research institute spends 75% of his time on manual data management, we can estimate how much money and human capital is wasted by the way we work.
  - huge waste of money and skills

6 Some more conclusions from the interviews
- We need “chains of trust” in our increasingly anonymous data domain
  - one cornerstone is trusted and certified repositories
  - another is PID association incl. state information
  - change the way we publish/cite data so that creators are acknowledged
- We need to start training our young people in how to overcome this situation.
- of course, changing the data sharing culture still takes hard work

7 One concrete example
- a great online database with brilliant data (about 40 TB) created by excellent and engaged researchers
  - some data is in the file system
  - some derived data is in MySQL
  - metadata (descriptions of the data content) is in a CMS
  - data items are identified by URLs (or are they cool URIs?)
  - relational metadata is stored in Python scripts
- how do you replicate such data at low cost?
- how do you federate such data at low cost?
- how do you scale up such a solution?
- how do you certify the repository?
- what happens when the script expert leaves the institution?

8 Are there trends?
- current practice is different: many collections have their own organization, own query system, own management principles, etc.

9 Can we learn from the Internet?
[Timeline: ~1960 to 1986-1989 (TCP/IP)]
What happened in all these years?
- many ideas, much testing, many proprietary solutions, many complete and “closed” solutions
- an abstraction towards a unifying protocol, which freed up a lot of time

10 Is there an equivalence in data?
- Internet domain: nodes with IP numbers; packets being exchanged; standardized protocols
- Data domain: objects with PID numbers; objects being exchanged; standardized protocols
Can we agree on simple concepts...
- such as an API for PID systems and simple protocols
- it would simplify our software
- we could use PID attributes (identity, integrity, rights, MD ref, data ref, etc.)
- we could again focus on other aspects
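
To make the idea of "an API for PID systems" with PID attributes a bit more tangible, here is a minimal sketch. It is an illustration only: `PidRecord`, `PidRegistry`, `verify_integrity` and all field names are hypothetical, the registry is an in-memory stand-in rather than a real PID service, and SHA-256 is assumed as the integrity fingerprint.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PidRecord:
    """Hypothetical PID record; fields mirror the attributes named on the slide."""
    pid: str           # identity
    checksum: str      # integrity: SHA-256 hex digest of the bit stream (assumed)
    rights: str        # rights statement or licence
    metadata_ref: str  # MD ref: where the metadata can be found
    data_ref: str      # data ref: where the bit stream can be found


class PidRegistry:
    """In-memory stand-in for a PID system (not a real resolver)."""

    def __init__(self) -> None:
        self._records: dict[str, PidRecord] = {}

    def register(self, record: PidRecord) -> None:
        # A PID must identify exactly one record, so refuse duplicates.
        if record.pid in self._records:
            raise ValueError(f"PID already registered: {record.pid}")
        self._records[record.pid] = record

    def resolve(self, pid: str) -> PidRecord:
        # Raises KeyError for an unknown PID.
        return self._records[pid]


def verify_integrity(record: PidRecord, content: bytes) -> bool:
    """Compare the content's fingerprint with the one stored in the PID record."""
    return hashlib.sha256(content).hexdigest() == record.checksum
```

With such a record, software only needs `resolve` to find the data and metadata and to check integrity, regardless of how a given repository organizes its storage.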

11 Here is a role for RDA – Research Data Alliance
- How to tackle big challenges if data is highly fragmented by disciplines, domains and countries?
- RDA is building bridges to overcome the hurdles for easy data access, data sharing and interoperability by facilitating collaboration between experts from all over the world belonging to different disciplines and organisations

12 Who supports RDA? (RDA Colloquium)
- Global initiative with the support & funding of the European Commission, the Australian National Data Service and the US National Science Foundation
- other countries to join soon

13 Why do we need RDA?
- We already have so many important initiatives and organizations...
- Many work at the policy level, which was and is so important, such as CODATA and WDS.
- Others work on standards in different but related areas (ISO, IETF, W3C, IEEE, etc.)
- … we wanted an initiative that acts like the early Internet initiatives, in a bottom-up way, to overcome concrete barriers in short time frames.
- real progress in practice is so slow!!!

14 RDA – How does it work?
- Experts and data practitioners come together in RDA Working and Interest Groups to overcome concrete hurdles
- Working together in f2f & virtual meetings and on the collaborative web platform

15 The RDA Engine – Working & Interest Groups
- 12 Working Groups including Community Capability Model, Data Citation, Data Foundation and Terminology, Data Type Registries... https://rd-alliance.org/workinggroup-list.html
- 27 Interest Groups including Agricultural Data Interoperability, Big Data Analytics... https://rd-alliance.org/interestgroup-list.html
- including joint groups with CODATA and WDS

16 Data Foundation & Terminology
Examples from WGs
- What is a suitable data organization? What is the core of it?
- 21 models evaluated to define a common ground.
- at the core is a Digital Object having a PID and metadata
- a PID record is like a passport: it has a number, a fingerprint, etc.
- a world-wide registration system is in place (incl. an API)

17 PID Information Types WG
Examples from WGs
- Persistent identifiers (PIDs) are the core of proper data management and access
- let’s look at passports: a number is not sufficient...
- … first solution for standardized PID info types
- … will design and implement an API for interaction with typed information
- Automated data management across disciplines and repositories can benefit greatly from standardized types
- (A DOI is one specific form of PID, used for published data sets to make data citable)
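
What "typed information" attached to a PID could buy in practice is that software can check a record automatically. The following is a minimal sketch under stated assumptions: the type names `checksum` and `author`, the validators, and the function `validate_record` are hypothetical and do not reproduce the working group's actual type definitions or API.

```python
import re
from typing import Callable

# Hypothetical info types: each name maps to a validator for its value.
INFO_TYPES: dict[str, Callable[[str], bool]] = {
    # Assumed here: a checksum is a lowercase SHA-256 hex digest.
    "checksum": lambda value: re.fullmatch(r"[0-9a-f]{64}", value) is not None,
    # Assumed here: an author is any non-blank string.
    "author": lambda value: bool(value.strip()),
}

# Hypothetical minimal set of types every PID record must carry.
REQUIRED_TYPES = {"checksum", "author"}


def validate_record(record: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    for name in sorted(REQUIRED_TYPES - set(record)):
        problems.append(f"missing required type: {name}")
    for name, value in record.items():
        validator = INFO_TYPES.get(name)
        if validator is None:
            problems.append(f"unknown type: {name}")
        elif not validator(value):
            problems.append(f"invalid value for type: {name}")
    return problems
```

Because the checks depend only on the type definitions, the same function could run unchanged in any repository that agrees on the types.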

18 Data Type Registry WG
Examples from WGs
- There are so many data types in use, and new ones are continuously defined in science; innovation is a must
- The result is that researchers often see interesting data, but don’t know how to open, process or visualize it
- … implementing a type registry for data, which explains how to open, visualize and process the data
- A first implementation of a type registry is expected in 2014
- You could build your own; all must be open
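
The type registry idea can be sketched in a few lines: a type identifier is looked up, and the registry answers with a description and a way to open the data. Everything here is an assumption for illustration (the class `DataTypeRegistry`, the identifiers `example/json` and `example/csv`, and the helper `build_example_registry`); it is not the working group's design.

```python
import csv
import io
import json
from typing import Any, Callable


class DataTypeRegistry:
    """Hypothetical registry: type identifier -> (description, opener function)."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[str, Callable[[bytes], Any]]] = {}

    def register(self, type_id: str, description: str,
                 opener: Callable[[bytes], Any]) -> None:
        self._entries[type_id] = (description, opener)

    def describe(self, type_id: str) -> str:
        # Raises KeyError for an unregistered type.
        return self._entries[type_id][0]

    def open(self, type_id: str, raw: bytes) -> Any:
        # Look up the opener for this type and apply it to the raw bytes.
        _, opener = self._entries[type_id]
        return opener(raw)


def _open_csv(raw: bytes) -> list[list[str]]:
    return list(csv.reader(io.StringIO(raw.decode("utf-8"))))


def build_example_registry() -> DataTypeRegistry:
    """Registry with two illustrative entries; identifiers are made up."""
    registry = DataTypeRegistry()
    registry.register("example/json", "JSON document, UTF-8 encoded", json.loads)
    registry.register("example/csv", "Comma-separated table, UTF-8 encoded", _open_csv)
    return registry
```

A researcher who finds data of an unfamiliar type would then ask the registry how to open it instead of guessing from a file extension.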

19 RDA Interest Groups
- An Interest Group (IG) can be established prior to a Working Group for community discussion of issues and areas that facilitate data-driven research.
- IGs are longer-term groups defining common issues and interests.

20 RDA – classification (check web site)
RDA WGs
- cross-disciplinary, infrastructural (7): DFT, DTR, PIT, PP, MDR, Data Citation, MDR Interoperability
- cross-disciplinary, non-infrastructural (6): Data Categories & Codes, Certification, Data Publication (4)
- discipline-specific (1): Wheat Interoperability
RDA IGs
- cross-disciplinary, infrastructural (10): Semantics, PIDs, MD, Provenance, Context, BD Analytics, Longtail D, Brokering, FIM, Preservation, Data Fabric
- cross-disciplinary, non-infrastructural (6): publishing, legal interoperability, Clouds in Dev. Countries, community capability, engagement
- discipline-specific (7): agriculture, toxicogenomics, structural bio, biodiversity, marine data, urban data, data in history and ethnography

21 RDA Plenary Meetings …
- Plenary 1, 18-20 March 2013, Gothenburg, Sweden
  - 240 participants
  - 3 WG, 9 IG
- Plenary 2, 18-20 September 2013, Washington, DC, USA
  - 380 participants
  - 6 WG, 17 IG, 5 BOF
- Plenary 3, 26-28 March 2014, Dublin, Ireland
  - 490 participants
  - 16 WG, 35 IG and 20 BOF meetings
  - 10 co-located workshops & meetings
- Plenary 4, 22-24 September 2014, Amsterdam, Netherlands: come and join the work
At the plenaries:
- Working & interest groups get together and hold face-to-face discussions
- New group proposals & Birds of a Feather
- RDA member networking
- Co-located events

22 RDA Members – who’s engaged?
~1694 members from 73 countries

Afghanistan, Argentina, Armenia, Australia, Austria, Belgium, Bolivia, Botswana, Brazil, Bulgaria, Canada, China, Congo (Democratic Rep.), Costa Rica, Croatia, Cuba, Cyprus, Czech Republic, Denmark, Estonia, Finland, France, Germany, Ghana, Greece, Hungary, Iceland, India, Ireland (Republic), Israel, Italy, Japan, Kenya, South Korea, Lithuania, Malaysia, Mexico, Mozambique, Nepal, Netherlands, New Zealand, Niger, Nigeria, Norway, Pakistan, Palestine, Philippines, Poland, Portugal, Qatar, Romania, Russian Federation, Senegal, Serbia, Singapore, Slovenia, South Africa, Spain, Sudan, Sweden, Switzerland, Taiwan, Tanzania, Turkey, Ukraine, United Arab Emirates, United Kingdom, United States, Uruguay, Vatican City, Venezuela

Region    March 2014    %
EU        823           49%
AU        62            4%
US        620           37%
Others    189           11%
TOTAL     1694

23 RDA MEMBERS – what type of organisations…?

24 Become a member …
- Member benefits: join and form Working & Interest Groups, participate in RDA elections, contribute to discussions & debates, comment on emerging groups, attend plenaries, news & updates, etc.
- Register with the online community and become a Member of RDA: open & free
- https://www.rd-alliance.org/user/register
- Countries can become organisational members
RDA Guiding Principles: Openness, Consensus, Balance, Harmonization, Community Driven, Non-Profit

25 The European plug-in to RDA …
RDA Europe, the European plug-in to the global RDA, supports RDA global and brings the European voice to the table
- RDA Europe Forum – strategic advice
- RDA Europe Science Workshops – interaction & feedback from target audience
- RDA Europe national & pan-European outreach – to engage new members & disseminate outputs
- RDA Europe policy report – to support European policy-makers & funders

26 All the links ….
- RDA Collaborative Web Platform: rd-alliance.org
- Interaction with RDA: enquiries@rd-alliance.org
- RDA Europe: rda-europe@rd-alliance.org | europe.rd-alliance.org
- Twitter: @resdatall
- Facebook: https://www.facebook.com/pages/Research-Data-Alliance/459608890798924
- LinkedIn: www.linkedin.com/pub/research-data-alliance/77/115/7aa/
- SlideShare: http://www.slideshare.net/ResearchDataAlliance

27 Thank you!
- RDA Collaborative Web Platform: rd-alliance.org
- Interaction with RDA: enquiries@rd-alliance.org

28 Do you need to bother?
- but...
  - you will have to create a Data Management Plan, since funders will no longer accept current attitudes
  - you will need to make your data visible (metadata), accessible (repository) and re-usable (metadata)
  - you will need to demonstrate the trustworthiness of your data (identity, integrity, authenticity, provenance)
  - trusted repositories will be the anchor of our future data domain, and they will establish requirements in order to be auditable
- most results are not reproducible (some surveys: reproducibility of articles published in scientific journals is as low as 10-30%)...
- researchers hate the “overhead” involved.

29 Trends relevant for DFT
- we used to have storage and organization together (FS, Fedora, DB, etc.)
- we now see a split in function to cope with volume and complexity
- on the one side we see Clouds etc. solving the volume problems
- on the other side we see heterogeneity in the area of the logical layer
[Figure labels: logical layer components; physical layer (FS, Cloud)]

30 What is Research Data Alliance about?
- There will be many bridges...
- … building the social and technical bridges that enable global open sharing of data.

31 Can your organisation become a member?
- … include R&D agencies, for-profit companies and non-profit foundations, community organizations, institutions, etc. (annual membership fee based on size of organisation (# persons))
Why should it become a member?
- Affiliation with like-minded organisations to coordinate efforts in mutual areas of interest & to avoid unnecessary duplication...

32 “Knowledge is the engine of our economy. And data is its fuel.” (Neelie Kroes, Vice-President of the European Commission)
- Strong engagement and impact: bottom-up meeting top-down
- Community engagement through researchers' global involvement in the working and interest group activities

33 Community-Driven RDA Groups by Focus
Domain Science - focused
- Toxicogenomics Interoperability IG
- Structural Biology IG
- Biodiversity Data Integration IG
- Agricultural Data Interoperability IG
- Digital History and Ethnography IG
- Defining Urban Data Exchange for Science IG
- Marine Data Harmonization IG
- Materials Data Management IG
Data Stewardship - focused
- Research Data Provenance IG
- Certification of Digital Repositories IG
- Preservation e-infrastructure
- Long-tail of Research Data IG
- Publishing Data IG
- Domain Repositories IG
- Global Registry of Trusted Data Repositories and Services IG
Base Infrastructure - focused
- Data Foundations and Terminology WG
- Metadata Standards WG
- Practical Policy WG
- PID Information Types WG
- Data Type Registries WG
- Metadata IG
- Big Data Analytics IG
- Data Brokering IG
Reference and Sharing - focused
- Data Citation IG
- Data Categories and Codes WG
- Legal Interoperability IG
Community Needs - focused
- Community Capability Model IG
- Engagement IG
- Clouds in Developing Countries IG

34 How can you become a member?
- Register with the online community and become a Member of RDA.
- No fees involved for individual participation.
- Membership is open to any individual who subscribes to the RDA Guiding Principles.
- As a Member one may join and form Working and Interest Groups and participate in RDA elections.
- https://www.rd-alliance.org/user/register

35 RDA Working Groups
- Form the foundation for RDA community impact!
- … envisioned as accelerants to data sharing practice and infrastructure in the short term, with the overarching goal of advancing global data-driven discovery and innovation
- RDA Working Group profile:
  - Short-term: 12-18 months
  - Focused efforts with specific actions adopted by specific communities
  - International participation
  - Open, voluntary, consensus-driven
  - Complementary to effective efforts elsewhere
- Outcomes / deliverables:
  - New data standards or harmonization of existing standards.
  - Greater data sharing, exchange, interoperability, usability and re-usability.
  - Greater discoverability of research data sets.
  - Better management, stewardship, and preservation of research data.

36 High volume vs. complex data
[Figure: volume vs. rank frequency of data type; labels: mostly “regular” data, collection data, orphan data (B. Heidorn)]
“Most of the bytes are at the high end, but most of the datasets are at the low end” – Jim Gray

37 Do you need to bother?
- if you just work in your own domain it may still work
- still a difference between regular and complex data
- everywhere there is a problem when people leave
[Figure: Data Entropy (Michener et al. 1997; Bill Michener, DataOne): information content over time; labels: time of publication, specific details, general details, accident, retirement or career change, death]

38 Are there commonalities?
[Figure labels: data one can work with; organized data; sharable & re-usable data; scientific data fabric; often all in file system; often a lot of copying in file system]

39 RDA Outputs: what’s coming in 2014 (1/2)
- Data Type Registries WG
  - Defining a system of data type registries
  - Defining a formal model for describing types and building a working model of a registry.
  - To be adopted by CNRI, International DOI Foundation, and used by the Deep Carbon Observatory and others
  - (working in conjunction with PID group)
  - Scheduled to complete Summer, 2014
- Persistent Identifier Information Types
  - Defining a minimal set of types that must be associated with a PID (e.g. checksum, author). Specifying an API for interaction with PID types
  - Adopted and used by Data Conservancy and DKRZ
  - (working in conjunction with DTR group)
  - Scheduled to complete Summer, 2014
- Metadata Standards
  - Creating use cases and a prototype directory of current metadata standards, starting from the DCC directory and stakeholder contributions.
  - To be hosted and used by JISC, DataOne and others
  - Scheduled to complete Fall, 2014

40 RDA Outputs: what’s coming in 2014 (2/2)
- Practical Code policies (rules)
  - Survey of policies in production use across data management centers. Test bed of machine-actionable policies (IRODS, DataVerse, dCache) at RENCI, DataNet Federation Consortium, CESNET, Odum Institute.
  - Deployment of 5 policy sets (integrity, access control, replication, provenance / event tracking, publication) on test beds. Publication of standard policies for use as starter kits.
  - Scheduled to complete Summer, 2014
- Language Codes
  - Operationalization of ISO language categories for repositories
  - Adopted and used by the Language Archive, PARADISEC
  - Proposal of data categories associated with the CMDI schema as ISO standards.
  - Scheduled to complete Fall, 2014
- Data Foundations and Terminology
  - Defining a common vocabulary for data terms based on existing models.
  - Creating formal definitions in a structured vocabulary tool which also provides an open registry for data terms.
  - (active input from all RDA WGs)
  - Tested and adopted by EUDAT, DKRZ, Deep Carbon Observatory, CLARIN, EPOS, and others
  - Scheduled to complete Summer, 2014
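
As a rough illustration of what "machine-actionable policies" means, the sketch below expresses two of the five policy sets named above (integrity and replication) as plain functions that an audit can run over stored objects. This is an assumption-laden toy: `StoredObject`, `integrity_policy`, `replication_policy` and `audit` are hypothetical names, SHA-256 is assumed as the checksum, and none of this is the rule language of iRODS or any other system mentioned on the slide.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StoredObject:
    """Hypothetical view of one object as a repository might hold it."""
    pid: str
    content: bytes
    recorded_checksum: str  # SHA-256 hex digest recorded at ingest (assumed)
    replica_sites: list[str] = field(default_factory=list)


# A policy takes an object and returns a list of violations (empty if compliant).
Policy = Callable[[StoredObject], list[str]]


def integrity_policy(obj: StoredObject) -> list[str]:
    """Integrity: the stored bits must still match the recorded checksum."""
    actual = hashlib.sha256(obj.content).hexdigest()
    if actual != obj.recorded_checksum:
        return [f"{obj.pid}: checksum mismatch"]
    return []


def replication_policy(min_replicas: int) -> Policy:
    """Replication: the object must exist at a minimum number of distinct sites."""
    def check(obj: StoredObject) -> list[str]:
        count = len(set(obj.replica_sites))
        if count < min_replicas:
            return [f"{obj.pid}: {count} replica(s), policy requires {min_replicas}"]
        return []
    return check


def audit(objects: list[StoredObject], policies: list[Policy]) -> list[str]:
    """Apply every policy to every object and collect all violations."""
    violations = []
    for obj in objects:
        for policy in policies:
            violations.extend(policy(obj))
    return violations
```

Because each policy is ordinary code with a uniform interface, a starter kit of standard policies could be shared between centers and run automatically instead of being described only in prose.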

41 RDA MEMBERS – HOW ARE THEY GROWING?
