SKOS-2-HIVE GWU workshop
Introductions Ryan Scherle (ryan@scherle.org)
Craig Willis
Afternoon Session Schedule
- Overview
- Using HIVE as a service
- Installing and configuring HIVE
- Using HIVE Core API
- Understanding HIVE internals
- HIVE supporting technologies
- Developing and customizing HIVE
Block 1: Introduction
Workshop Overview
- Schedule: interactive, less structure
- Hands-on (work together)
- Activities:
  - Installing and configuring HIVE
  - Programming examples (HIVE Core API, HIVE REST API)
What is your background?
- Java
- Tomcat/webapps
- REST
- SKOS/RDF
- Sesame
- Lucene
What are you most interested in getting out of this workshop?
HIVE Overview
- HIVE website: http://hive.nescent.org/ (primarily for demonstration purposes)
- HIVE architecture: many technologies combined to provide a framework for vocabulary services
HIVE Vocabularies
Partner vocabularies:
- Library of Congress Subject Headings (LCSH)
- NBII Biocomplexity Thesaurus (NBII)
- Integrated Taxonomic Information System (ITIS)
- Thesaurus of Geographic Names (TGN)
- LTERNet Vocabulary (LTER)
Other:
- AGROVOC
- Medical Subject Headings (MeSH)
Architecture
HIVE Functions
- Conversion of vocabularies to SKOS
- Rich internet application (RIA) for browsing and searching multiple SKOS vocabularies
- Java API and REST application interfaces for programmatic access to multiple SKOS vocabularies
- Support for natural language and SPARQL queries
- Automatic keyphrase indexing using multiple SKOS vocabularies; HIVE supports two indexers:
  - KEA++ indexer
  - Basic Lucene indexer
Block 2: Using HIVE as a service
Using HIVE as a Service
- HIVE web application: developed by Jose Perez-Aguera and Lina Huang; Java servlet, Google Web Toolkit (GWT)
- HIVE REST service: developed by Duane Costa, Long-Term Ecological Research Network
Activity: Calling HIVE-RS
Writing Java code to call the hive-rs web service
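A minimal sketch of what such a client might look like using only the JDK. The resource path (`/searchConcept`) and query parameter name (`keyword`) are assumptions for illustration; check the hive-rs documentation for the actual resource URLs.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class HiveRsClient {

    // Builds a request URL for a hypothetical concept-search resource.
    static String buildSearchUrl(String baseUrl, String keyword) throws Exception {
        return baseUrl + "/searchConcept?keyword="
                + URLEncoder.encode(keyword, "UTF-8");
    }

    // Performs the GET request and returns the raw response body.
    static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = buildSearchUrl("http://localhost:8080/hive-rs", "biodiversity");
        System.out.println(get(url)); // requires a running hive-rs instance
    }
}
```

The URL builder is separated from the request so it can be unit-tested without a running server.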
Block 3: Install and Configure HIVE
Installing and Configuring HIVE
Requirements:
- Java 1.6
- Tomcat (HIVE is currently using 6.x)
Detailed installation instructions:
Installing and Configuring HIVE-web
Detailed installation instructions (hive-web)
Quick start (hive-web):
1. Download and extract Tomcat 6.x
2. Download and extract the latest hive-web war
3. Download and extract a sample vocabulary
4. Configure hive.properties and agrovoc.properties
5. Start Tomcat
Installing and Configuring HIVE-web from Source
Detailed installation instructions (hive-web)
Requirements:
- Eclipse IDE for J2EE Developers
- Subclipse plugin
- Google Eclipse Plugin
- Apache Ant
- Google Web Toolkit 1.7.1
- Tomcat 6.x
Installing and Configuring the HIVE REST Service
Detailed installation instructions (hive-rs)
Quick start (hive-rs):
1. Download and extract the latest webapp
2. Download and extract a sample vocabulary
3. Configure hive.properties
4. Start Tomcat
Importing SKOS Vocabularies
Note the memory requirements for each vocabulary:

java -Xmx1024m -Djava.ext.dirs=/path/to/hive/lib edu.unc.ils.mrc.hive.admin.AdminVocabularies [/path/to/hive/conf/] [vocabulary] [train]
Block 4: Using the HIVE Core Library
HIVE Core Interfaces
HIVE Core Packages
- edu.unc.ils.mrc.hive.api: main interfaces and implementations
- edu.unc.ils.mrc.hive.converter: SKOS converters (MeSH, ITIS, NBII, TGN)
- edu.unc.ils.mrc.hive.lucene: Lucene index creation and searching
- edu.unc.ils.mrc.hive.ir.tagging: KEA++ and "dummy" tagger implementations
edu.unc.ils.mrc.hive.api
- SKOSServer: provides access to one or more vocabularies
- SKOSSearcher: supports searching across multiple vocabularies
- SKOSTagger: supports tagging/keyphrase extraction across multiple vocabularies
- SKOSScheme: represents an individual vocabulary
SKOSServer
SKOSServer is the top-level class used to initialize the vocabulary server. It reads the hive.properties file and initializes the SKOSScheme (vocabulary management), SKOSSearcher (concept searching), and SKOSTagger (indexing) instances based on the vocabulary configurations.
edu.unc.ils.mrc.hive.api.SKOSServer:
- TreeMap<String, SKOSScheme> getSKOSSchemas();
- SKOSSearcher getSKOSSearcher();
- SKOSTagger getSKOSTagger();
- String getOrigin(QName uri);
SKOSSearcher
Supports searching across one or more configured vocabularies: keyword queries using Lucene, SPARQL queries using OpenRDF/Sesame.
edu.unc.ils.mrc.hive.api.SKOSSearcher:
- searchConceptByKeyword(uri, lp)
- searchConceptByURI(uri, lp)
- searchChildrenByURI(uri, lp)
- SPARQLSelect()
SKOSTagger
Keyphrase extraction using multiple vocabularies. The tagger implementation ("dummy" or "KEA") depends on a setting in hive.properties.
edu.unc.ils.mrc.hive.api.SKOSTagger:
- List<SKOSConcept> getTags(String text, List<String> vocabularies, SKOSSearcher searcher);
SKOSScheme
Represents an individual vocabulary, based on settings in <vocabulary>.properties. Supports querying statistics about each vocabulary (number of concepts, number of relationships, etc.).
Activity
- Write a simple Java class that allows the user to query for a given term
- Write a Java class that can read a text file and call the tagger
Block 5: Understanding HIVE Internals
Architecture
Data Directory Layout
/usr/local/hive/hive-data
- vocabulary/
  - vocabulary.rdf: SKOS RDF/XML
  - vocabularyAlphaIndex: serialized map
  - vocabularyH2: H2 database (used by KEA)
  - vocabularyIndex: Lucene index
  - vocabularyKEA: KEA model and training data
  - vocabularyStore: Sesame/OpenRDF store
  - topConceptIndex: serialized map of top concepts
Keyword Search
Indexing
HIVE Internals: Data Models
- Lucene index: index of the SKOS vocabulary (view with Luke)
- Sesame/OpenRDF store: native/Sail RDF repository for the vocabulary
- KEA++ model: serialized KEAFilter object
- H2 database: embedded DB containing the SKOS vocabulary in the format used by KEA (can be queried using the H2 command line)
- Alpha index: serialized map of concepts
- Top concept index: serialized map of top concepts
HIVE Internals: HIVE Web
GWT entry points:
- HomePage
- ConceptBrowser
- Indexer
Servlets:
- VocabularyService: singleton vocabulary server
- FileUpload: handles the file upload for indexing
- ConceptBrowserServiceImpl
- IndexerServiceImpl
HIVE Internals: HIVE-RS
Details of HIVE-rs
Block 6: HIVE Supporting Technologies
HIVE supporting technologies
- Lucene
- Sesame
- KEA
- H2
- GWT
Activity
- Explore the Lucene index with Luke
- Explore the Sesame store with SPARQL queries
Block 7: Customizing HIVE
Obtaining Vocabularies
- Several vocabularies can be freely downloaded
- Some vocabularies require licensing
- HIVE Core includes converters for each of the supported vocabularies
List of HIVE vocabularies: http://code.google.com/p/hive-mrc/wiki/VocabularyConversion
Converting Vocabularies to SKOS
Additional information
Each vocabulary has different requirements:
- LCSH: available in SKOS RDF/XML
- NBII: convert from XML to SKOS RDF/XML (SAX)
- ITIS: convert from RDB (MySQL) to SKOS RDF/XML
- TGN: convert from flat file to SKOS RDF/XML
- LTER
- AGROVOC
- MeSH
Converting Vocabularies to SKOS
- "A Method to Convert Thesauri to SKOS" (van Assem et al.): Prolog implementation; IPSV, GTAA, MeSH
- Converting MeSH to SKOS for HIVE: Java SAX-based parser
LTER Sample Service
Discussion: Pros and Cons
- HIVE Core vs. HIVE Web vs. HIVE-RS
- Brainstorm applications that could benefit from HIVE; discuss implementations
Block 8: KEA++
About KEA++
Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies. Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA), from the Digital Libraries and Machine Learning Lab at the University of Waikato, New Zealand.
Problem: How can we automatically identify the topics of documents?
Automatic Indexing
- Free keyphrase indexing (KEA): significant terms in a document are determined based on intrinsic properties (e.g., frequency and length).
- Keyphrase indexing (KEA++): terms from a controlled vocabulary are assigned based on intrinsic properties.
- Controlled indexing/term assignment: documents are classified based on content that corresponds to a controlled vocabulary, e.g., Pouliquen, Steinberger, and Camelia (2003).
Medelyan, O. and Witten, I.A. (2008). "Domain independent automatic keyphrase indexing with small training sets." Journal of the American Society for Information Science and Technology, 59(7).
KEA++ at a Glance
KEA++ uses a machine learning approach to keyphrase extraction, in two stages:
- Candidate identification: find terms that relate to the document's content
- Keyphrase selection: use a model to identify the most significant terms
KEA++: Candidate Identification
1. Parse tokens based on whitespace and punctuation
2. Create word n-grams up to the length of the longest term in the controlled vocabulary (CV)
3. Stem n-grams to their grammatical roots (Porter)
4. Stem the terms in the vocabulary (Porter)
5. Replace non-descriptors with descriptors using CV relationships
6. Match stemmed n-grams to the vocabulary
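The steps above can be sketched as follows. This is a simplified illustration, not KEA++ itself: the crude suffix-stripping stem() stands in for the Porter stemmer, the sorting of stemmed words mimics KEA's word-order normalization, and the two-term vocabulary is invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CandidateIdentification {

    // Very crude stand-in for Porter stemming: strips a few common suffixes.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"izations", "ization", "ations", "ation", "izing", "ings", "ing", "s"}) {
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Stems each word and sorts the stems, so "organizing information"
    // and "information organization" normalize to the same key.
    static String stemPhrase(String phrase) {
        String[] tokens = phrase.toLowerCase().split("\\s+");
        String[] stems = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            stems[i] = stem(tokens[i]);
        }
        Arrays.sort(stems);
        return String.join(" ", stems);
    }

    // Generates word n-grams up to maxLen (in KEA++, the length of the
    // longest vocabulary term) and keeps those whose stemmed form
    // matches a stemmed vocabulary term.
    static List<String> candidates(String text, List<String> vocabulary, int maxLen) {
        Map<String, String> stemmedVocab = new HashMap<>();
        for (String term : vocabulary) {
            stemmedVocab.put(stemPhrase(term), term);
        }
        String[] tokens = text.toLowerCase().split("[\\s\\p{Punct}]+");
        List<String> matches = new ArrayList<>();
        for (int n = 1; n <= maxLen; n++) {
            for (int i = 0; i + n <= tokens.length; i++) {
                String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
                String descriptor = stemmedVocab.get(stemPhrase(ngram));
                if (descriptor != null && !matches.contains(descriptor)) {
                    matches.add(descriptor);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> vocab = Arrays.asList("information organization", "indexing");
        System.out.println(candidates("Organizing information is central to indexing.", vocab, 2));
    }
}
```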
KEA++: Candidate Identification
Stemming is not perfect...

Original                       Stemmed
"information organization"     "inform organ"
"organizing information"       "inform organ"
"informative organizations"    "inform organ"
"informal organization"        "inform organ"
KEA++: Feature Definition
- TF-IDF: compares the frequency of a phrase's occurrence in a document with its frequency in general use.
- Position of first occurrence: distance from the beginning of the document. Candidates with very high or very low values are more likely to be valid (introduction/conclusion).
- Phrase length: analysis suggests that indexers prefer to assign two-word descriptors.
- Node degree: the number of relationships between the term and other terms in the CV.
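Two of these features can be sketched with simple formulas. This is an illustrative sketch, not KEA++'s exact implementation; the reference-corpus counts passed to tfIdf are invented, whereas KEA++ derives them from its training data.

```java
import java.util.Arrays;
import java.util.List;

public class PhraseFeatures {

    // TF-IDF: phrase frequency in this document, discounted by how
    // common the phrase is across a reference corpus.
    static double tfIdf(int countInDoc, int docLength, int docsContaining, int totalDocs) {
        double tf = (double) countInDoc / docLength;
        double idf = Math.log((double) totalDocs / docsContaining);
        return tf * idf;
    }

    // Position of first occurrence, normalized by document length, so
    // values near 0 mean the phrase appears in the introduction.
    static double firstOccurrence(List<String> tokens, String word) {
        int index = tokens.indexOf(word);
        return index < 0 ? 1.0 : (double) index / tokens.size();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("hive", "provides", "vocabulary", "services",
                "using", "skos", "vocabularies");
        System.out.println(tfIdf(2, tokens.size(), 5, 1000));
        System.out.println(firstOccurrence(tokens, "skos"));
    }
}
```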
DummyTagger
Primarily intended as a baseline for analysis of KEA++.
- Uses LingPipe for part-of-speech identification (limits indexing to certain parts of speech)
- Uses the Lucene vocabulary index
- Simple TF*IDF implementation
- Configurable in hive.properties
Plans
- Automatic updates to vocabularies
- Integration of other concept extraction algorithms (Maui)
- Dryad integration
Other:
- Maven integration
- Spring integration
- Data directory and property file restructuring
- Concept browser updates
Credits
- José Ramón Pérez Agüera
- Lina Huang
- Alyona Medelyan
- Ian Witten
Questions/Comments
Ryan Scherle
Craig Willis