Threats to Open Data: Implications for Library Collections and Services
  Michele Hayslett and John N. Vickery
  IASSIST 2007, McGill University, Montréal, Québec, Canada
  May 15-18, 2007
Introduction
  “Open Data is the single most important problem in data-driven science.”
    – Peter Murray-Rust, University of Cambridge
  “scientific datasets are the modern equivalents of medieval manuscripts.”
    – Sayeed Choudhury, Johns Hopkins University
Introduction
  Threats:
    Infrastructure
    Commodification of data
    Metadata
    Research Culture
    Legislation and Intellectual Property
Infrastructure: the challenges
  Version tracking and migration
    What version is most current
    Workable across diverse computing platforms
  Permission controls
    To foil mischief-makers
    To give appropriate parties appropriate rights
  Flexibility
    Perform multiple functions (lab notebook [date stamping], data analysis, paper drafts, data preservation, etc.) within one application
    Store multiple data types (images, numbers, sound, etc.)
Infrastructure: the implications
  Advantages of data repository over home-grown solutions
  Necessary to liaise with IT departments as well as faculty
    Build in-house expertise and act as translators
  Take the lead in cost-sharing
    On the basis of existing arrangements
    New initiatives
Infrastructure: the implications
  Leverage expertise
    Preservation
    Metadata and Organization
    Collections
  Reallocate staff and resources
  Become central players in a domain that includes…
    Industry, supercomputing initiatives, professional societies, government, etc.
Commodification of data: the challenges
  Data as commodity
Commodification of data: the challenges
  Publication requirements
    Gaining prevalence in the sciences
  Publishers are well aware of the value of data
  Publishers and computing companies will add on more sophisticated information services
Commodification of data: the implications
  Value-added products and services
    Will be requested by researchers
    Libraries must be prepared to pay for these products
  Allocating money for datasets
    Not a traditional part of library budgets
    Weigh costs and benefits
Commodification of data: the implications
  Open vs. proprietary products
    Institutional data repositories
    Participation in open data initiatives
  Who will own the data?
Metadata creation: the challenges
  Agreeing on standards… “is like Middle-East peacekeeping – every detail has to be worked out and agreed on.”
    – Martin Elvis, Harvard-Smithsonian Center for Astrophysics
  Standards are even more difficult across multiple disciplines and data formats
  Scale of project is also an issue
Metadata creation: the implications
  Libraries have metadata expertise and need to bring the resources to solve the problem
    Provide proven models in DDI and varied repositories
  Librarians/Libraries as data managers on research projects?
  Could libraries take over the role of metadata creation?
    Requires re-alignment of legacy departments
      Library of Congress (LOC) no longer doing series authority files
      Considering dumping subject headings
Research culture: the challenges
  Lack of priority for metadata creation
    “The task of entering metadata is something that people appear to resist unless it is mandated in their job descriptions or through other enforcement mechanisms.”
      – Schweik, Stepanov and Grove (2005)
  Need for community-wide codes of practice
    National Academies of Sciences UPSIDE: Uniform Principle for Sharing Integral Data and Materials Expeditiously
  Trust necessary discipline-wide (no scooping)
Research culture: the challenges
  Emphasis on confidentiality
    “competitive edge of data secrecy”
      – Mike Carroll, Villanova Law School
  Academic advancement
    Ability to claim credit – track origination with data
    On what basis – award collaboration
  Change from the inside necessary
Research culture: the implications
  Libraries can serve as examples and provide education on these issues
    “…there is an underlying ‘science’ to organizing documents… Many team members may not possess these important skills.”
      – Schweik, Stepanov and Grove (2005)
  Serve as neutral party to build community of trust
  Serve as mediator between university and larger open data projects
Legislation and IP: the challenges
  Legislation to protect datasets/databases
    Data in U.S. vs. data in E.U.
    H.R. 3261: Database and Collections of Information Misappropriation Act (failed)
  To protect or not?
    National Research Council’s Committee on Responsibilities of Authorship in the Biological Sciences said yes
    Open Data advocates say no
  Existing legislation, rules ineffective
    Shelby Amendment; Information Quality Act
    NIH requirement: data publicly available within 6 months
Legislation and IP: the implications
  License vs. copyright
  Licensing
    Time
    Money
  What would change if data were protected?
Summary
  Challenges
    Infrastructure
    Commodification of data
    Metadata
    Research Culture
    Legislation/IP
  Opportunities
    Liaise between stakeholders
    Capture data at the source
    Provide expert help
    Build community of trust
    Manage access
Questions
  Michele Hayslett, michele_hayslett@ncsu.edu
  John N. Vickery, john_vickery@ncsu.edu
Questions for You
  How many of you are planning data repositories?
  How many of your institutions have a repository architect in the library or plan to hire one?