Summary Data Practices Report Peter Wittenburg Max Planck Data & Compute Center former MPI for Psycholinguistics.

Slides:



Advertisements
Similar presentations
Workshop goals Promote learning: –exchange info; stimulate ideas for cooperation; add to collective knowledge base Help NDIIPP/JISC plan the future: –Bring.
Advertisements

Professor Dave Delpy Chief Executive of Engineering and Physical Sciences Research Council Research Councils UK Impact Champion Competition vs. Collaboration:
Repositories, Federations, APIs, Policies - wrap up - Peter Wittenburg these slides are just a personal summary of major points they do not represent per.
Co-funded by the European Union under FP7-ICT Co-ordinated by aparsen.eu #APARSEN Welcome to the Conference !! Juan Bicarregui Chair, APA Executive.
Information Types and Registries Giridhar Manepalli Corporation for National Research Initiatives Strategies for Discovering Online Data BRDI Symposium.
Connecting people, society and the economy to a location UNSC Learning Centre 25 February 2013 Peter Harper Deputy Australian Statistician Australian Bureau.
NSD © 2014 DASISH Digital Services Infrastructure for Social Sciences and Humanities WP4 Data Archiving Claudia Engelhardt (UGOE), Arjan Hogenaar (DANS),
The current state of Metadata - as far as we understand it - Peter Wittenburg The Language Archive - Max Planck Institute CLARIN Research Infrastructure.
DASISH Common Solutions to Common Problems. DASISH – Data Service Infrastructure for the Social Sciences and Humanities DASISH brings together 5 ESFRI.
Bielefeld Conference 2006: Academic Library and Information Services: New Paradigms for the Digital Age Hans Geleijnse Director of Library and IT Services.
Data preservation & the Virtual Observatory Bob Mann Wide-Field Astronomy Unit Royal Observatory Edinburgh
Data-PASS Shared Catalog Micah Altman & Jonathan Crabtree 1 Micah Altman Harvard University Archival Director, Henry A. Murray Research Archive Associate.
Software Sustainability Institute Training in Computational Skills Scientific Meeting 2014 “NGS Data after the Gold Rush” TGAC, Norwich.
CLARIN Centers for a Sustainable Infrastructure Daan Broeder, MPI for Psycholinguistics Jan Odijk, Utrecht University.
© 2014 The Regents of the University of Michigan. This work is licensed under the Creative Commons Attribution 4.0 Unported License. To view a copy of.
Common Ground A Policy Framework for Open Access to Research Data Susan Reilly, LIBER Projects
DATA FOUNDATION TERMINOLOGY WG 4 th Plenary Update THE PLUM GOALS This model together with the derived terminology can be used Across communities and stakeholders.
Challenges & opportunities in the preservation of (digital) information: the case of European research libraries Museo de las Ciencias Teatro de UNIVERSUM.
Data Infrastructures Opportunities for the European Scientific Information Space Carlos Morais Pires European Commission Paris, 5 March 2012 "The views.
Data Fabric IG Use Case Analysis. 2 Data Fabric Analysis how to come to essential components & services? Analyze Data Practices.
The role of Parthenos for CLARIN ERIC Steven Krauwer CLARIN ERIC Executive Director 1.
The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands Why should we invest in DWF? Peter Wittenburg CLARIN Research.
SCIENCE, RESEARCH DATA, AND PUBLISHING Stewart Wills Editorial Director, Web & New Media, Science 26 February 2013.
Wishes from Hum infrastructures Examples: DOBES and CLARIN Peter Wittenburg Max Planck Institute for Psycholinguistics.
Sharing Research Data Globally Alan Blatecky National Science Foundation Board on Research Data and Information.
DASISH Final Conference Common Solutions to Common Problems.
Data Fabric IG Introduction. 2  about 50 interviews & about 75 community interactions  Data Management and Processing is too time consuming and costly.
METADATA WORKSHOP Conclusions Keith Jeffery Peter Wittenburg.
Data Sharing and Archiving: A Professional Society View Clifford S. Duke Ecological Society of America September 9, 2010.
Problems/Disc. Adoption of standards Should there be standards? (not a big problem – responsibility lies with data centre – onus not on scientist) (peer.
CLARIN Issues Peter Wittenburg MPI for Psycholinguistics Nijmegen, NL.
Technology – Broad View Aspects that play a role when integrating archives leave the details of some core topics to the 2. day Bernhard Neumair:Base Technologies.
A Data Category Registry- and Component- based Metadata Framework Daan Broeder et al. Max-Planck Institute for Psycholinguistics LREC 2010.
Summary of RDA Outputs so far dr. Ir. Herman Stehouwer 22 September 2015.
Workshop on health systems research in low and middle income countries: the role of global health funders in the UK The Wellcome Trust, Gibbs Building,
Grids and Clouds - Humanities Research Peter Wittenburg, Daan Broeder The Language Archive – Max Planck Institute for Psycholinguistics Nijmegen, The Netherlands.
Research Data Allience Why and what Peter Wittenburg.
Deepcarbon.net Xiaogang Ma, Patrick West, John Erickson, Stephan Zednik, Yu Chen, Han Wang, Hao Zhong, Peter Fox Tetherless World Constellation Rensselaer.
Task XX-0X Task ID-01 GEO Work Plan Symposium April 2014 Task ID-01 “ Advancing GEOSS Data Sharing Principles” Experiences related to data sharing.
Why RDA? A domain repository perspective George Alter ICPSR University of Michigan.
Infrastructure Breakout What capacities should we build now to manage data and migrate it over the future generations of technologies, standards, formats,
DELOS Network of Excellence on Digital Libraries Yannis Ioannidis University of Athens, Hellas Digital Libraries: Future Research Directions for a European.
EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI EGI strategy and Grand Vision Ludek Matyska EGI Council Chair EGI InSPIRE.
Cloud-based e-science drivers for ESAs Sentinel Collaborative Ground Segment Kostas Koumandaros Greek Research & Technology Network Open Science retreat.
Open Science (publishing) as-a-Service Paolo Manghi (OpenAIRE infrastructure) Institute of Information Science and Technologies Italian Research Council.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No EPOS and EUDAT.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No EUDAT Aalto Data.
Fedora Commons Overview and Background Sandy Payette, Executive Director UK Fedora Training London January 22-23, 2009.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No The use of the.
Store and exchange data with colleagues and team Synchronize multiple versions of data Ensure automatic desktop synchronization of large files B2DROP is.
Sara Bowman Center for Open Science | Promoting, Supporting, and Incentivizing Openness in Scientific Research.
EUDAT receives funding from the European Union's Horizon 2020 programme - DG CONNECT e-Infrastructures. Contract No Support to scientific.
ODIN – ORCID and DATACITE Interoperability Network ODIN: Connecting research and researchers Sergio Ruiz - DataCite Funded by The European Union Seventh.
Sara Bowman Center for Open Science | Promoting, Supporting, and Incentivizing Openness in Scientific Research.
CESSDA SaW Training on Trust, Identifying Demand & Networking
RDA 9th Plenary Breakout 3, 5 April :00-17:30
Approaches and Challenges in Managing Persistent Identifiers
RDA US Science workshop Arlington VA, Aug 2014 Cees de Laat with many slides from Ed Seidel/Rob Pennington.
Data Quality: Practice, Technologies and Implications
Agenda Welcome and overview (Peter)
Agenda welcome and goals (Peter)
An EUDAT-based FAIR Data Approach for Data Interoperability
Common Solutions to Common Problems
Three Uses for a Technology Roadmap
Malte Dreyer – Matthias Razum
Integrating social science data in Europe
Joint DFIG – Broker Meeting The DFIG view Peter Wittenburg
Bird of Feather Session
Director, ICT Centre of Excellence andOpen Data
Interoperability and data for open science
Presentation transcript:

Summary Data Practices Report Peter Wittenburg Max Planck Data & Compute Center former MPI for Psycholinguistics

2 Topics for Data Science  Relevance of Data  Trends in Data Domain  Requirements for Data Domain  Data Practices

3 Why talking about data? data is the oil driving research and economy data is key to understanding big challenges observations experiments simulations crowd sourcing store combination analysis visualization conclusions

4 Why talking about data? data is the oil driving research and economy data is key to understanding big challenges observations experiments simulations crowd sourcing store combination analysis visualization conclusions if this is true for most disciplines can we observe trends?... can we specify requirements?... can we correlate with practices?

5 Trends I – 3 Vs well known:  Volume  Variety  Velocity

6 Trends II – Complexity from simple structures towards complex relationships not so often mentioned:  lots of different relationships and dependencies  relationships at data object level and at content level  reproducibility gap

7 Trends III – Change in Approach  split into Physical Layer to optimize performance (files, clouds, etc.)  split into Logical Layer to optimize finding, tracing relationships, managing access, etc.

8 Trends IV - Anonymity direct exchange between known colleagues Domain of Repositories new mechanisms of building trust needed

9 Trends V – Re-Usage Domain of trusted Repositories Data will be re-used in different contexts Data needs to be findable, accessible, combinable and interpretable for others

10 Trends VI – large federations EUDAT example to domain of registered data exchange data manage and preserve data bring data close to HPC and strong clusters offer improved data services all across disciplines and countries in EU

11 Trends VII – large federations in Humanities one of the languages DOBES Example ~70 international teams collaborating + 1 archive changed culture in community remote language archives large data centers with copies

12 Requirements for Data Science  let’s use the G8 formulations – data should be  searchable-> create useful metadata  accessible -> deposit in trusted repository and use PIDs  interpretable-> create metadata, register schema and semantics  re-usable-> provide contextual metadata  persistent-> provide persistent repositories  let’s use the Knowledge Exchange formulations  promote data sharing norms and standards-> make all explicit  support services enabling effective sharing-> build infrastructure  provide a deposit infrastructure -> build infrastructure  develop flexible access mechanisms-> build infrastructure  allow linking between publications and data -> use PIDs and citation MD

13 Requirements for Funders  Naoyuki Tsunematsu (Senior Advisor, JST) Question 1: what is the value proposition for publically funded research? Publically funded research is about stimulating competitiveness new strand for competitiveness: Knowledge Discovery based on smart data collection role of funders seems to change: build infrastructures to make data visible and accessible invest in human capacity

14 Requirements for Culture  Naoyuki Tsunematsu (Senior Advisor, JST) Question 2: is culture of sharing a relevant issue? need for data exchange (and thus the need for proper data management) are yet difficult to convey in Japanese Science parallel trends observed for Japanese Science not so often included in collaborations anymore not so often represented in the top papers decrease in international ranking serious worries about counterproductive lack of openness this concern seems to be relevant for all of us G8 Open Data Report: UK 90, US/CA 80, FR 65, IT 35, JP 30, DE 25, RU 5

15 Data Practices I – Commons  many obstacles for sharing (mostly well–known)  cultural, sociological -> technological  lacking widely used mechanisms  lack of proper mechanisms and incentives  research is about competition & collaboration  even more obstacles for re-using  often doubted whether useful in particular across disciplines  Big Data claims are controversial  lack of trust, information, quality, etc.  too time consuming frequently

16 Data Practices I – Survey  ~120 Interviews/Interactions  2 Workshops with Leading Scientists (EU, US)  too much manual work or via ad hoc scripts  still creating huge amounts of legacy formats (no PID, MD, lack of explicitness)  there are positive project examples etc. but...  DM and DP not efficient and too expensive (Biologist for 75% of his time data manager)  federating data incl. logical information much too expensive  hardly usage of automated workflows and lack of reproducibility

17 Data Practices I – Survey  ~120 Interviews/Interactions  2 Workshops with Leading Scientists (EU, US)  too much manual work or via ad hoc scripts  still creating huge amounts of legacy formats (no PID, MD, lack of explicitness)  there are positive project examples etc. but...  DM and DP not efficient and too expensive (Biologist for 75% of his time data manager)  federating data incl. logical information much too expensive  hardly usage of automated workflows and lack of reproducibility many researchers can’t participate pressure towards DI research is high, but only some departments are fit for the data challenges Senior Researchers: can’t continue like this! need to move towards proper data organization and automated workflows is evident but changes now are risky: lack of trained experts, guidelines and support

18 Promiss of RDA 18 RDA is about building the social and technical bridges that enable global open sharing and re-use of data. Researchers, scientists, data practitioners from around the world are invited to work together to achieve the vision Funders: NSF, EC, AU, Japan, Brazil, DE?, UK?, ZA?, FI?, etc. how much time to we have? years from TCP/IP to Connectivity RDA to accelerate processes

19 Thanks for your attention. Next RDA Plenary P6: September Paris

20 Data Practices III – DOBES Example  ~ 70 teams, ~ 100 languages  DOBES changed culture  researchers agreed to deposit and share pre-final versions  researchers agreed on basic access mechanisms and layers  researchers started with cross-language analysis work using collections  agreement on data structuring and metadata principles  researchers used broadly modern technology  archivists established archiving principles and technology  technology all built on Open Standards  result is a large online data archive with many open resources  all fine?  NO - could not solve archive sustainability issue yet in a smooth way

21  science is changing and data accessibility will be crucial  trends of working with data requires culture change  currently working with data is inefficient, expensive and mostly non-reproducible  data intensive science is limited to power institutes  essential components must be persistent  connecting nodes cost years  RDA founded to accelerate towards efficient data science  still RDA is a very young initiative it is a chance Summary