CONTROLLED VOCABULARY STATUS & POTENTIAL IN DATA REPOSITORIES
Chelcie Rowell, Jane Greenberg
Metadata Research Center, UNC-Chapel Hill
Authority Control Interest Group, ALA Annual 2013
OVERVIEW OF NSF-SPONSORED RESEARCH STUDY: Research Impetus, Research Goals, Methodology, Results, Conclusions
1 RESEARCH IMPETUS
CONTROLLED VOCABULARY CHALLENGES FOR DATA REPOSITORIES
COST: Vocabularies are expensive to create and maintain.
INTEROPERABILITY: Vocabularies sometimes use standards (Z39.19, SKOS) but are often developed independently.
USABILITY: Vocabularies are difficult to use for both information professionals and content creators.
INTERDISCIPLINARITY: Collections are increasingly interdisciplinary.
VOCABULARY NEEDS OF DRYAD, ONE DATA REPOSITORY
Dryad is a curated general-purpose repository for data underlying journal articles and a member node of DataONE.
Staff performed a vocabulary analysis, mapping keywords from journals to 10 controlled vocabularies.
A low percentage of keywords had an exact match within each controlled vocabulary.
So what? Dryad requires multiple controlled vocabularies within the subject field alone: LCSH, MeSH, NBII, ERIC, ITIS.
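The kind of exact-match analysis described on this slide can be sketched as follows. This is a minimal illustration, not the procedure Dryad staff used: the function names, the normalization rule (lowercasing and collapsing whitespace), and the sample keywords and vocabularies are all assumptions made for the example.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(term.lower().split())


def exact_match_rates(keywords, vocabularies):
    """Return, per vocabulary, the fraction of keywords with an exact match.

    keywords: iterable of author-supplied keyword strings.
    vocabularies: dict mapping a vocabulary name to an iterable of its terms.
    """
    normalized_keywords = {normalize(k) for k in keywords}
    rates = {}
    for name, terms in vocabularies.items():
        normalized_terms = {normalize(t) for t in terms}
        matched = normalized_keywords & normalized_terms
        if normalized_keywords:
            rates[name] = len(matched) / len(normalized_keywords)
        else:
            rates[name] = 0.0
    return rates


# Hypothetical data for illustration only; not taken from the Dryad analysis.
keywords = ["Gene Flow", "phylogeography", "island biogeography", "pollination"]
vocabularies = {
    "VOCAB_A": ["Gene flow", "Pollination", "Speciation"],
    "VOCAB_B": ["Phylogeography"],
}
rates = exact_match_rates(keywords, vocabularies)
```

With these made-up inputs, no single vocabulary covers all keywords, which is the pattern that would motivate drawing on several vocabularies at once.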
HELPING INTERDISCIPLINARY VOCABULARY ENGINEERING (HIVE) MODEL
HELPING INTERDISCIPLINARY VOCABULARY ENGINEERING (HIVE) CONCEPT BROWSER
HELPING INTERDISCIPLINARY VOCABULARY ENGINEERING (HIVE) INDEXER
2 RESEARCH GOALS
RESEARCH GOALS
1. Identify controlled vocabularies currently in use by different data repositories
2. Examine potential facilitators and inhibitors of controlled vocabulary use by different repository stakeholders
3. Explore infrastructure for using controlled vocabularies in place at different data repositories
4. Develop framework for studying controlled vocabulary use across different roles associated with data repositories
3 METHODOLOGY
RESEARCH PROCESS
1. DRAFT SURVEY INSTRUMENT
2. PERFORM PILOT TESTING
3. REVISE SURVEY INSTRUMENT
4. 1ST DISTRIBUTION
5. PRELIMINARY DATA ANALYSIS
6. 2ND DISTRIBUTION
7. FINAL DATA ANALYSIS
WEB SURVEY DISTRIBUTED TO DATANET & DATA REPOSITORY STAKEHOLDERS: CODATA, DARTG, DC-SAM, EPA, JE, JISC Research Data Mgmt, PAMWG, RDA, RDAP, SIG-CR, SIG-STI, SE, STS-L, USGS
ROLE WITHIN DATA REPOSITORY DETERMINES QUESTION PATH
Data Contributor: Q3
Data Curator: Q13
Developer: Q13
DataNet Administrator: Q22
Other: Q22
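The role-based branching on this slide can be sketched as a simple lookup. This is an illustration only: the pairing of roles to question numbers is read from the slide text, while the function name and the choice to send unrecognized roles down the "Other" path are assumptions.

```python
# Role-to-entry-question pairing as read from the slide.
ROLE_TO_FIRST_QUESTION = {
    "Data Contributor": "Q3",
    "Data Curator": "Q13",
    "Developer": "Q13",
    "DataNet Administrator": "Q22",
    "Other": "Q22",
}


def first_question(role):
    """Return the question at which a respondent with this role enters the survey.

    Assumption: any role not listed is treated like "Other".
    """
    return ROLE_TO_FIRST_QUESTION.get(role, ROLE_TO_FIRST_QUESTION["Other"])


entry_points = {role: first_question(role) for role in ROLE_TO_FIRST_QUESTION}
```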
4 RESULTS
PARTICIPANT POPULATION
CONTROLLED VOCABULARY USE: CHOICES SUPPLIED BY SURVEY
None of the above, LCSH, MeSH, TGN, ITIS, NBII, EnvThes/LTER, GO: 8733
UAT, AGROVOC, ERIC, NALT: 3110
TOTAL = 93 participants
CONTROLLED VOCABULARIES: SUPPLIED BY PARTICIPANTS
OF THE DATA CONTRIBUTORS WHO HAD NOT PERFORMED ONE OF THE ACTIONS BELOW, HOW MANY WOULD MAKE USE OF THAT FUNCTION IN THE NEXT 12 MONTHS?
Columns: Yes, No, Don’t Know, TOTAL
Rows: Select from multiple controlled vocabularies when describing a single dataset; Use software to generate suggested subject terms selected from a controlled vocabulary
OF THE DATA CURATORS WHOSE REPOSITORY DOES NOT SUPPORT ONE OF THE ACTIONS BELOW, HOW MANY WOULD SUPPORT THAT FUNCTION IN THE NEXT 12 MONTHS?
Columns: Yes, No, Don’t Know, TOTAL
Rows: Select from multiple controlled vocabularies when describing a single dataset; Use software to generate suggested subject terms selected from a controlled vocabulary
IF A TOOL WERE BUILT THAT SUPPORTED THE USE OF CONTROLLED VOCABULARIES WITHIN & ACROSS DATA REPOSITORIES, WHAT FEATURES WOULD THIS TOOL NEED?
We would be more likely to use the tool if it was offered in the form of a web services API as opposed to a web site or a desktop application. Web services would make the tool platform-independent and easier to embed within our current suite of software applications.
Ease of use, ease of ‘plugging’ into different services and software.
I would use such a tool to add preferred terms to records while keeping free-text tags in place.
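The behavior this respondent describes, adding preferred terms without disturbing existing free-text tags, might look like the sketch below. The `Record` class, the `add_preferred_terms` function, the lookup table, and the vocabulary name `VOCAB_A` are all hypothetical; no existing tool's data model is implied.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical dataset record with two independent subject fields."""
    title: str
    free_text_tags: list = field(default_factory=list)
    preferred_terms: list = field(default_factory=list)  # (vocabulary, term) pairs


def add_preferred_terms(record, lookup):
    """Attach preferred terms for any tag found in `lookup`.

    lookup maps a lowercase tag to a (vocabulary, preferred term) pair.
    Free-text tags are read but never modified or removed.
    """
    for tag in record.free_text_tags:
        match = lookup.get(tag.lower())
        if match is not None and match not in record.preferred_terms:
            record.preferred_terms.append(match)
    return record


# Illustrative data only.
lookup = {"heart attack": ("VOCAB_A", "Myocardial infarction")}
record = Record(title="Example dataset", free_text_tags=["Heart attack", "field notes"])
add_preferred_terms(record, lookup)
```

Keeping the two fields separate means a tag with no controlled equivalent (here, "field notes") stays searchable as free text.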
[S]cience researchers are not familiar with the jargon of ‘controlled vocabularies’ and ‘ontologies.’ They need a tool that helps them connect the correct subject headings or keywords to their work, regardless of what scheme it is. They mostly don't care if it's LCSH or NBII – they just want the correct terms attached to their dataset.
My ‘wish list’ includes:
selection of specific vocabularies to be used in specific contexts
web services that support identification of candidate terms based on metadata content
tools for addressing shared terms in different vocabularies
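The three wish-list items (choosing vocabularies per context, finding candidate terms in metadata content, and handling terms shared across vocabularies) can be illustrated together in one small sketch. It uses naive whole-word string matching, which is an assumption made for brevity and not how HIVE or any particular indexer generates suggestions; the function name, vocabulary names, and sample text are hypothetical.

```python
import re


def suggest_terms(text, vocabularies, selected):
    """Suggest vocabulary terms that appear in `text`.

    vocabularies: dict mapping a vocabulary name to a list of its terms.
    selected: names of the vocabularies to consult in this context.
    Returns a dict mapping each matched term (lowercased) to the list of
    selected vocabularies that contain it, so shared terms are grouped.
    """
    haystack = " ".join(text.lower().split())
    suggestions = {}
    for name in selected:
        for term in vocabularies.get(name, []):
            key = " ".join(term.lower().split())
            # Whole-word match so that a short term does not hit inside a longer word.
            if re.search(r"\b" + re.escape(key) + r"\b", haystack):
                sources = suggestions.setdefault(key, [])
                if name not in sources:
                    sources.append(name)
    return suggestions


# Illustrative data only.
vocabularies = {
    "VOCAB_A": ["Pollination", "Seed dispersal"],
    "VOCAB_B": ["pollination", "Islands"],
    "VOCAB_C": ["Pollination"],
}
abstract = "Seed dispersal and pollination networks on oceanic islands"
suggestions = suggest_terms(abstract, vocabularies, selected=["VOCAB_A", "VOCAB_B"])
```

A function like this could sit behind a web service endpoint, which would address the platform-independence preference voiced by other respondents.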
LIMITATIONS: FACILITATORS & INHIBITORS OF CONTROLLED VOCABULARY USE
Columns: 1, 2, 3, 4, 5, MEAN
Rows: Availability on WWW; Openness to term suggestions; Generation of suggested terms from selected controlled vocab; Data storage; Inter/national governance; Update frequency; Availability of terms as URIs; In-house governance
TOTAL: 81
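The MEAN column of a table like this is a weighted average of the ratings 1 through 5. A minimal sketch of that arithmetic follows; the function name and the response counts are made up for illustration and are not the survey's results.

```python
def mean_rating(counts):
    """Mean of a five-point rating given response counts.

    counts[i] is the number of respondents who chose rating i + 1.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("no responses to average")
    weighted = sum(rating * n for rating, n in enumerate(counts, start=1))
    return weighted / total


# Hypothetical counts for ratings 1..5; not the study's data.
example_counts = [1, 2, 3, 4, 10]
example_mean = mean_rating(example_counts)
```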
LIMITATIONS: FACILITATORS & INHIBITORS OF CONTROLLED VOCABULARY USE
Survey question: On a five point scale, with 1 being least important and 5 being most important, please rate how the following aspects FACILITATE your use of controlled vocabularies to describe scientific research data.
Scale: 1 Low importance; 2 Slightly important; 3 Neutral; 4 Moderately important; 5 Very important
Aspects: Availability on the WWW; High update frequency
Survey question: On a five point scale, with 1 being least important and 5 being most important, please rate how the following aspects IMPEDE your use of controlled vocabularies to describe scientific research data.
Scale: 1 Low importance; 2 Slightly important; 3 Neutral; 4 Moderately important; 5 Very important
Aspects: Unavailability on the WWW; Low update frequency
LIMITATIONS: WILD WILD WEST
No research designs on which to model ours
Population and sample difficult to define
5 CONCLUSIONS
CONCLUSIONS
Multiple roles associated with data repositories would make use of the following functions: access to multiple vocabularies at the time of indexing, and automatic generation of suggested terms
Diversity of understanding regarding what defines a “controlled vocabulary”
Long tail of controlled vocabularies actively in use
Clear ideas on how to design research about controlled vocabulary use by different data repository stakeholders
PARTICIPATE! The survey will remain open until July 15.
KEEP IN TOUCH!
Chelcie Juliet Rowell, Digital Initiatives Librarian, Z. Smith Reynolds Library, Wake Forest University
Jane Greenberg, Professor, School of Information & Library Science; Director, Metadata Research Center, University of North Carolina at Chapel Hill
ACKNOWLEDGEMENTS This study was supported by the U.S. National Science Foundation Grant No. ACI We would like to express our gratitude to the people who helped and supported us throughout the design and implementation of this research study, especially Rebecca Koskela, Laura Moyers, and Amber Budden of DataONE and Mary Whitton of RENCI, who were instrumental in helping us to disseminate the survey. We would also like to thank pilot testers of the first draft of our survey instrument as well as all survey participants.
REFERENCES
Greenberg, J. (2009). Theoretical Considerations of Lifecycle Modeling: An Analysis of the Dryad Repository Demonstrating Automatic Metadata Propagation, Inheritance, and Value System Adoption. Cataloging & Classification Quarterly, 47(3): 380–402.
Greenberg, J. et al. (2011). HIVE: Helping Interdisciplinary Vocabulary Engineering. Bulletin of the American Society for Information Science and Technology, 37(4).
Helping Interdisciplinary Vocabulary Engineering (HIVE) Demonstration System.
Helping Interdisciplinary Vocabulary Engineering (HIVE) Wiki.
Tenopir, C. et al. (2011). Data Sharing by Scientists: Practices and Perceptions. PLoS ONE, 6(6): 1–21.
Willis, C. et al. (2012). Analysis and Synthesis of Metadata Goals for Scientific Data. Journal of the American Society for Information Science and Technology, 63(8): 1505–1520.