Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José.

Presentation transcript:

Helping Interdisciplinary Vocabulary Engineering (HIVE)
October 31, 2011
Joan Boone, Nico Carver, Jane Greenberg, Lina Huang, Robert Losee, Mady Madhura, José Ramón Pérez Agüera, Lee Richardson, Ryan Scherle, Todd Vision, Hollie White, Craig Willis

Overview
Part 1
- Introduction to HIVE
- Underlying rationale
- A scenario
- Research and challenges
Part 2
- Technical overview and implementation
- Progress and challenges
- Next steps
Part 3
- Let you experiment

HIVE Team
Craig Willis, Bob Losee, Lee Richardson, Hollie White, Jane Greenberg, Madhura Marathe, Lina Huang, José R. P. Agüera, Ryan Scherle

HIVE model
- An approach for integrating discipline controlled vocabularies (CVs)
- A model addressing CV cost, interoperability, and usability constraints (interdisciplinary environment)

Data underlying peer-reviewed articles in the basic and applied biosciences

Vocabulary needs for Dryad
Vocabulary analysis: 600 keywords, Dryad partner journals
- Vocabularies: NBII Thesaurus, LCSH, the Getty's TGN, ERIC Thesaurus, Gene Ontology, ITIS (10 vocabularies)
- Facets: taxon, geographic name, time period, topic, research method, genotype, phenotype…
Results
- 431 topical terms, exact matches: NBII Thesaurus, 25%; MeSH, 18%
- 531 terms (topical terms, research method, and taxon): LCSH, 22% found exact matches, 25% partial
Conclusion: need multiple vocabularies
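The kind of coverage figures reported above (percentage of keywords with exact or partial matches in a vocabulary) can be illustrated with a minimal sketch. The slides do not say how a "partial" match was defined in the Dryad study; the rule used here (the keyword shares at least one whole word with some vocabulary term) is an assumption, and the keywords and vocabulary terms are hypothetical toy data.

```python
from collections import Counter


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so 'Gene  Flow' equals 'gene flow'."""
    return " ".join(term.lower().split())


def classify_match(keyword: str, vocabulary: set[str]) -> str:
    """Return 'exact', 'partial', or 'none' for one keyword.

    ASSUMPTION: 'partial' means the keyword shares at least one whole word
    with some vocabulary term. The study's own definition is not given.
    """
    key = normalize(keyword)
    if key in vocabulary:
        return "exact"
    key_words = set(key.split())
    if any(key_words & set(term.split()) for term in vocabulary):
        return "partial"
    return "none"


def coverage(keywords: list[str], vocabulary: list[str]) -> dict[str, float]:
    """Percentage of keywords in each match class for a single vocabulary."""
    vocab = {normalize(t) for t in vocabulary}
    counts = Counter(classify_match(k, vocab) for k in keywords)
    total = len(keywords) or 1  # avoid division by zero on empty input
    return {kind: 100.0 * counts[kind] / total
            for kind in ("exact", "partial", "none")}


# Hypothetical toy data, not taken from the Dryad study.
keywords = ["gene flow", "Seed dispersal", "island biogeography", "qPCR"]
toy_vocab = ["Gene flow", "Dispersal", "Biogeography", "Phylogeny"]
result = coverage(keywords, toy_vocab)
print(result)
```

Running the same function against each vocabulary in turn would give one coverage row per vocabulary, which is the shape of comparison that leads to the "need multiple vocabularies" conclusion.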

HIVE Goals
1. Provide efficient, affordable, interoperable, and user-friendly access to multiple vocabularies during metadata creation activities
2. Present a model and an approach that can be replicated (not necessarily a service)

HIVE work-plan: 3 phases
1. Building HIVE
   - Vocabulary preparation
   - Server development
2. Sharing HIVE
   - Continuing education (empowering information professionals)
3. Evaluating HIVE
   - Examining HIVE in Dryad

HIVE Partners
Vocabulary partners:
- Library of Congress: LCSH
- The Getty Research Institute (GRI): TGN (Thesaurus of Geographic Names)
- United States Geological Survey (USGS): NBII Thesaurus, Integrated Taxonomic Information System (ITIS)
- National Library of Medicine and the National Agricultural Library
Advisory board:
- Jim Balhoff, NESCent
- Libby Dechman, LCSH
- Mike Frame, USGS
- Alistair Miles, Oxford, UK
- William Moen, University of North Texas
- Eva Méndez Rodríguez, University Carlos III of Madrid
- Joseph Shubitowski, Getty Research Institute
- Ed Summers, LCSH
- Barbara Tillett, Library of Congress
- Kathy Wisser, Simmons
- Lisa Zolly, USGS
Workshop hosts: Columbia Univ.; Univ. of California, San Diego; George Washington University; Univ. of North Texas; Universidad Carlos III de Madrid, Madrid, Spain

HIVE is for…
- Resource creators (w/Dryad): scientists, depositors
- Information professionals: curators, professional librarians, archivists, museum catalogers

Meet Amy Zanne. She is a botanist. Like every good scientist, she publishes, and she deposits data in Dryad. Amy’s data

Usability
Formal usability study: 4 biologists, 5 information professionals
- Tasks, usability ratings, satisfaction ranking
Average time to search a concept:
- Librarians: 6.53 minutes
- Scientists: 3.82 minutes
- Consistent w/research at NIEHS, 2 times as long
Average time for automatic indexing sequence:
- Librarians: 1.91 minutes
- Scientists: 2.1 minutes
(Huang, 2010)

System usability and flow metrics (Huang, 2010)

Challenges
- Building vs. doing/analysis
- Source for HIVE generation, beyond abstracts
- Combining many vocabularies during the indexing/term matching phase is difficult, time-consuming, and inefficient; NLP and machine learning offer promise
- Interoperability = dumbing down ontologies
- Proof of concept: illustrate the differences between HIVE and other vocabulary registries (NCBO and OBO Foundry)
- People wanting a service
- General large-team logistics, and having people from multiple disciplines (also the ++)

HIVE Technical Overview Craig Willis

Credits
- Ryan Scherle (NESCent)
- José Ramón Pérez Agüera (UNC)
- Lina Huang (UNC)
- Duane Costa (LTER)
- Alyona Medelyan & Ian Witten (Univ. of Waikato/NZDL)

HIVE Technical Overview
HIVE combines several open-source technologies to provide a framework for vocabulary services.
- Java-based web services can run in any Java application server
- Demonstration website
- Open-source Google Code project ( mrc/)
- Source code, pre-compiled releases, documentation, mailing lists

Who’s using HIVE?
HIVE is being evaluated by several institutions and organizations:
- Long Term Ecological Research Network (LTER): prototype for keyword suggestion for Ecological Metadata Language (EML) documents.
- Library of Congress Web Archives (Minerva): evaluating HIVE for automatic LCSH subject heading suggestion for web archives.
- Dryad Data Repository: evaluating HIVE for suggestion of controlled terms during the submission and curation process (scientific name, spatial coverage, temporal coverage, keywords).
- Yale University, Smithsonian Institution Archives

HIVE Functions
- System for management of multiple controlled vocabularies in SKOS format
- Single interface for browsing, searching, and indexing using multiple vocabularies
- Natural language and structured (SPARQL) queries
- Rich internet application (RIA) demonstration interface
- Java API and REST interfaces for programmatic access
- Framework for conversion of vocabularies to SKOS
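The idea of a single search interface over several SKOS vocabularies can be sketched in a few lines. This is not the HIVE Core API (which is Java and is not shown in these slides); the class names, method names, vocabulary names, and example.org URIs below are all hypothetical. Only the notions of preferred label, alternate label, and broader concept come from SKOS itself.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A pared-down SKOS concept: one preferred label plus alternates."""
    uri: str
    pref_label: str
    alt_labels: list[str] = field(default_factory=list)
    broader: list[str] = field(default_factory=list)  # URIs of broader concepts


class VocabularyRegistry:
    """Hypothetical in-memory registry holding several vocabularies."""

    def __init__(self) -> None:
        self._vocabs: dict[str, list[Concept]] = {}

    def add(self, vocab_name: str, concept: Concept) -> None:
        self._vocabs.setdefault(vocab_name, []).append(concept)

    def search(self, query: str) -> list[tuple[str, Concept]]:
        """Case-insensitive substring search over preferred and alternate
        labels in every registered vocabulary."""
        q = query.lower()
        hits = []
        for name, concepts in self._vocabs.items():
            for concept in concepts:
                labels = [concept.pref_label, *concept.alt_labels]
                if any(q in label.lower() for label in labels):
                    hits.append((name, concept))
        return hits


# Hypothetical vocabularies and URIs, for illustration only.
registry = VocabularyRegistry()
registry.add("VocabA", Concept("http://example.org/a/1", "Birds", alt_labels=["Aves"]))
registry.add("VocabB", Concept("http://example.org/b/7", "Bird migration"))
registry.add("VocabB", Concept("http://example.org/b/8", "Mammals"))

hits = registry.search("bird")
print([(name, c.pref_label) for name, c in hits])
```

A production system would delegate this lookup to a full-text index rather than scanning lists, which is the role the slides assign to Lucene.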

HIVE Components
- HIVE Core API: Java API for management of HIVE vocabularies.
- HIVE Web Service: Google Web Toolkit (GWT) based interface to demonstrate the HIVE service. Includes Concept Browser and Indexer.
- HIVE REST API: RESTful API developed by Duane Costa of the Long Term Ecological Research Network (LTER).

Supporting Technologies
- Sesame: open-source triple store and framework for storing and querying RDF data. Used for primary storage and structured queries.
- Lucene: Java-based full-text search engine. Used for keyword searching and autocomplete (version 2.0).
- H2: embedded relational database. Stores administrative data, fast concept index, and KEA++ lookup tables.
- KEA++: algorithm and Java API for automatic indexing.

Architecture

Converting Vocabularies to SKOS
“We learned that some thesauri have complex structures for which no SKOS counterparts can be found and that for some features care is required in converting them in such a way that they are still usable for their original purpose.”
Van Assem, Mark. (2010). Converting and Integrating Vocabularies for the Semantic Web. Unpublished dissertation.

Converting Vocabularies to SKOS
SKOS does not fit all vocabularies/thesauri. For example, MeSH:
- Is a MeSH descriptor a SKOS concept? See “A Method to Convert Thesauri to SKOS” (van Assem et al.)
- Or is a MeSH concept a SKOS concept? See “Converting MeSH to SKOS for HIVE”
- Either way, information about the vocabulary is lost

Converting Vocabularies to SKOS
Additional information: each vocabulary has different requirements

Vocabulary   Conversion
AGROVOC      Available in SKOS
ITIS         Convert from RDB (MySQL) to SKOS RDF/XML
LCSH         Available in SKOS
MeSH         Convert from XML to SKOS RDF/XML (SAX)
NBII         Convert from XML to SKOS RDF/XML (SAX)
TGN          Convert from flat-file to SKOS RDF/XML
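The XML-to-SKOS path via SAX listed for MeSH and NBII can be illustrated with a small sketch. The input format below (`term`, `name`, `synonym`, `broader` elements) is invented for illustration and is not the MeSH or NBII schema; the base URI is hypothetical; and for brevity the sketch collects plain (subject, predicate, object) tuples rather than serializing RDF/XML. The SKOS and RDF namespace URIs are the standard W3C ones.

```python
import xml.sax

SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/vocab/"  # hypothetical base URI for concepts


class ThesaurusHandler(xml.sax.ContentHandler):
    """Streams a (hypothetical) thesaurus XML file into SKOS-style triples."""

    def __init__(self):
        super().__init__()
        self.triples = []   # (subject, predicate, object) tuples
        self._uri = None    # URI of the <term> currently being read
        self._tag = None    # label element whose text is being collected
        self._text = []

    def startElement(self, name, attrs):
        if name == "term":
            self._uri = BASE + attrs["id"]
            self.triples.append((self._uri, RDF_TYPE, SKOS + "Concept"))
        elif name == "broader":
            self.triples.append((self._uri, SKOS + "broader", BASE + attrs["ref"]))
        elif name in ("name", "synonym"):
            self._tag, self._text = name, []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._tag:
            self._text.append(content)

    def endElement(self, name):
        if name == self._tag:
            prop = "prefLabel" if name == "name" else "altLabel"
            label = "".join(self._text).strip()
            self.triples.append((self._uri, SKOS + prop, label))
            self._tag = None


def convert(xml_text: str):
    """Parse thesaurus XML and return the list of triples."""
    handler = ThesaurusHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.triples


SAMPLE = """<thesaurus>
  <term id="t2"><name>Birds</name><synonym>Aves</synonym><broader ref="t1"/></term>
</thesaurus>"""

triples = convert(SAMPLE)
for triple in triples:
    print(triple)
```

A streaming parser suits large vocabularies because the whole file never has to be held in memory. Anything in the source that has no SKOS counterpart is simply dropped by a handler like this one, which is the information loss the preceding slides warn about.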

KEA++ for Keyphrase Extraction
- Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies.
- Domain-independent machine learning approach with minimal training set (~50 documents).
- Leverages SKOS relationships and alternate/preferred labels.
- Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA), University of Waikato, New Zealand (expanded implementation in Medelyan’s MAUI).
Medelyan, O. and Witten, I.H. (2008). “Domain-independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7).

KEA++: Feature definition
- Term Frequency/Inverse Document Frequency: frequency of a phrase’s occurrence in a document compared with its frequency in general use.
- Position of first occurrence: distance from the beginning of the document. Candidates with very low or very high values (introduction/conclusion) are more likely to be valid.
- Phrase length: analysis suggests that indexers prefer to assign two-word descriptors.
- Node degree: number of relationships the term has to other terms in the CV.
(MAUI expands the feature set.)
Medelyan, O. and Witten, I.H. (2008). “Domain-independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7).
Medelyan, O. (2010). Human-competitive automatic topic indexing. Unpublished dissertation.
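The four features can be made concrete with a small sketch that computes them for one candidate phrase. This is not the KEA++ implementation (a Java library); the tokenization, the smoothed IDF formula, and the reading of node degree as "related CV terms that also occur in the document" are assumptions chosen for illustration, and the document, frequencies, and relationships are toy data.

```python
import math


def find_positions(phrase_tokens, doc_tokens):
    """Token offsets at which the phrase occurs as a contiguous n-gram."""
    n = len(phrase_tokens)
    return [i for i in range(len(doc_tokens) - n + 1)
            if doc_tokens[i:i + n] == phrase_tokens]


def candidate_features(phrase, document, doc_freq, n_docs, related):
    """Compute the four features for one candidate phrase.

    doc_freq: phrase -> number of corpus documents containing it (toy data)
    related:  phrase -> related CV terms (toy data)
    Returns None if the phrase does not occur in the document.
    """
    doc_tokens = document.lower().split()
    phrase_tokens = phrase.lower().split()
    positions = find_positions(phrase_tokens, doc_tokens)
    if not positions:
        return None

    tf = len(positions) / len(doc_tokens)
    # ASSUMPTION: add-one smoothed IDF; the exact formula is not in the slides.
    idf = math.log(n_docs / (1 + doc_freq.get(phrase.lower(), 0)))

    # ASSUMPTION: node degree counts related CV terms also found in the document.
    degree = sum(1 for term in related.get(phrase.lower(), [])
                 if find_positions(term.lower().split(), doc_tokens))

    return {
        "tf_idf": tf * idf,
        "first_occurrence": positions[0] / len(doc_tokens),  # 0.0 = document start
        "length": len(phrase_tokens),
        "node_degree": degree,
    }


# Toy inputs, invented for illustration.
document = "gene flow shapes island birds and gene flow varies"
doc_freq = {"gene flow": 1}
related = {"gene flow": ["birds", "migration"]}

feats = candidate_features("gene flow", document, doc_freq, n_docs=10, related=related)
print(feats)
```

In a trained system, vectors like this one for each candidate would be passed to a classifier learned from a small set of manually indexed documents, and the highest-scoring candidates would be suggested as index terms.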

HIVE: Upcoming
- Vocabulary synchronization: integration of HIVE with the LCSH Atom Feed
- Integration and evaluation of alternative algorithms, as part of the Dryad/HIVE integration
Questions:
- What is the best algorithm for automatic term suggestion for Dryad vocabularies?
- Do different algorithms perform better for title, abstract, full-text, data?
- Do different algorithms perform better for a particular vocabulary/taxonomy/ontology?