Published by Virginia Melton. Modified over 9 years ago.
1
Complex Data Transformations in Digital Libraries with Spatio-Temporal Information
B. Martins, N. Freire, J. Borbinha
Instituto Superior Técnico, Technical University of Lisbon
2008 International Conference on Asia-Pacific Digital Libraries
2
Introduction and Motivation
The DIGMAP project addressed the development of a digital library for materials related to old maps
– Collecting metadata from different providers (e.g. OAI-PMH servers)
– Processing the metadata and enriching it with inferred spatio-temporal information
Challenges in handling heterogeneous metadata
– Transforming the original sources into the DIGMAP format (i.e. TEL profile)
– Dealing with data inconsistency, non-uniformity, incorrectness and incompleteness
– Handling the spatio-temporal information (e.g. dates and geospatial coordinates)
Challenges in DIGMAP service interoperability
– Using the results from DIGMAP services to enrich the metadata
DIGMAP required appropriate XML processing technology for dealing with the above challenges
3
The Proposed Solution
Use XML processing languages like XSLT and XQuery
Extend the XPath 2.0 function library
– Functions for managing geospatial information
– Functions for managing temporal information
– Functions for text processing
– Other miscellaneous functions
This keeps all the advantages of declarative languages like XSLT and XQuery, and adds powerful methods for handling complex transformations
4
Outline
Introduction
Proposed Extensions to the XPath Function Library
Implementation Issues
Test Cases Within the DIGMAP Project
Conclusions and Future Work
5
The Proposed Extensions
Extensions for geospatial data handling
– Combining spatial elements according to geospatial predicates such as distance or intersection
– Input given in GML, KML or textual strings with geospatial coordinates
Extensions for temporal reasoning
– Combining temporal information according to the predicates of Allen's algebra for temporal intervals
– Input given in GML or string encodings (e.g. the ISO 8601 formats)
Extensions for text mining
– Keyword matching and textual similarity
– Standard text mining operations (e.g. language recognition)
Other miscellaneous extensions
– Handling JDBC calls and calls to external Web services
6
Geospatial Data Handling
Operators for performing geospatial analysis based on the OGC Simple Features and Filter Encoding specifications
– Distance, union, intersection or difference between two geometries
– Validity of a given spatial filter
– Check if two geometries are spatially related (e.g. containment or overlap)
– Check if two geometries fall below a given distance threshold
– Area, length, buffer, centroid, boundary or envelope of a geometry
– Geometric computations (e.g. translation or scaling) over a geometry
– Conversion between GML, KML, C-Square, Geohash or WKT encodings
– Transformations on the coordinate systems used in geometries
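As an illustration of the distance-threshold check listed above, the following is a minimal Java sketch for the simplest case of two points given as WGS84 latitude/longitude pairs. It is not the DIGMAP code: the slides say the actual functions wrap GeoTools and JTS, whereas this sketch swaps in the plain haversine formula on a spherical Earth, and the class and method names (GeoDistance, distanceKm, withinDistance) are hypothetical.

```java
// Hypothetical helper, not part of DIGMAP: point-to-point distance on a sphere.
final class GeoDistance {
    // Mean Earth radius in kilometres (spherical approximation, an assumption).
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in km between two lat/lon points, via the haversine formula. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** True when the two points are at most thresholdKm apart. */
    static boolean withinDistance(double lat1, double lon1,
                                  double lat2, double lon2, double thresholdKm) {
        return distanceKm(lat1, lon1, lat2, lon2) <= thresholdKm;
    }
}
```

Full geometries (polygons, lines) need a topology library; the point case is shown only because it fits in a few lines.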
7
Temporal Data Handling
Operators for temporal analysis based on Allen's interval algebra
– Distance, union, intersection or difference between temporal intervals
– Check if two intervals are related (e.g. containment or overlap)
Other operators for temporal data handling
– Compute lengths for temporal intervals (e.g. return seconds or years)
– Conversion between GML and string encodings
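To make the interval predicates concrete, here is a small Java sketch that classifies two intervals into one of the thirteen relations of Allen's algebra. It assumes intervals are given as ISO 8601 calendar dates with a start strictly before the end; the class name AllenIntervals and its relate method are hypothetical and are not taken from the DIGMAP function library.

```java
import java.time.LocalDate;

// Hypothetical helper, not part of DIGMAP: Allen's thirteen interval relations.
final class AllenIntervals {
    enum Relation {
        BEFORE, MEETS, OVERLAPS, STARTS, DURING, FINISHES, EQUALS,
        FINISHED_BY, CONTAINS, STARTED_BY, OVERLAPPED_BY, MET_BY, AFTER
    }

    /** Relation of interval A = [aStart, aEnd] to interval B = [bStart, bEnd]. */
    static Relation relate(LocalDate aStart, LocalDate aEnd,
                           LocalDate bStart, LocalDate bEnd) {
        if (!aStart.isBefore(aEnd) || !bStart.isBefore(bEnd)) {
            throw new IllegalArgumentException("each interval needs start < end");
        }
        // Disjoint or touching intervals first.
        if (aEnd.isBefore(bStart)) return Relation.BEFORE;
        if (aEnd.equals(bStart)) return Relation.MEETS;
        if (aStart.isAfter(bEnd)) return Relation.AFTER;
        if (aStart.equals(bEnd)) return Relation.MET_BY;

        // From here on the two intervals share some interior.
        int ss = aStart.compareTo(bStart); // sign compares the two starts
        int ee = aEnd.compareTo(bEnd);     // sign compares the two ends
        if (ss == 0 && ee == 0) return Relation.EQUALS;
        if (ss == 0) return ee < 0 ? Relation.STARTS : Relation.STARTED_BY;
        if (ee == 0) return ss > 0 ? Relation.FINISHES : Relation.FINISHED_BY;
        if (ss < 0) return ee < 0 ? Relation.OVERLAPS : Relation.CONTAINS;
        return ee < 0 ? Relation.DURING : Relation.OVERLAPPED_BY;
    }
}
```

For example, relate(LocalDate.parse("1500-01-01"), LocalDate.parse("1550-01-01"), LocalDate.parse("1400-01-01"), LocalDate.parse("1600-01-01")) classifies the first interval as DURING the second.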
8
Textual Data Handling
Keyword matching and textual similarity
– Tokenization and keyword-based search
– Phonetic similarity (Soundex and Double Metaphone)
– String similarity (e.g. Edit Distance, Jaro, Jaro-Winkler, Q-grams, …)
Standard text mining operations
– Language recognition
– Keyword extraction (statistically significant keywords)
– Named entity recognition (regexp, dictionaries or machine learning)
– Text classification (machine learning)
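Of the string similarity measures named above, edit distance is the easiest to show in full. The sketch below is a standard Levenshtein implementation in Java with a length-normalised similarity score. It is a stand-in for illustration only: the slides state that DIGMAP wraps SimPack for textual similarity, and the class name EditDistance and the normalisation choice are assumptions.

```java
// Hypothetical helper, not part of DIGMAP: Levenshtein distance with two rolling rows.
final class EditDistance {
    /** Minimum number of single-character insertions, deletions and substitutions. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus the distance divided by the longer length (an assumed normalisation). */
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }
}
```

A score like this could be used to flag near-duplicate place names such as "Lisboa" and "Lisbon", which differ by a single substitution.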
9
Miscellaneous Functions
Calling external Web services (REST and SOAP)
Conversion from XML to JavaScript Object Notation (JSON)
Handling Java DataBase Connectivity (JDBC) calls
Reading malformed HTML
Converting MARC formats into XML (MarcXml or MarcXchange)
…
10
Implementation Issues
Proposed extensions implemented on top of SAXON
– SAXON is an open source XSLT/XQuery processor
– Extension functions coded in Java (static methods)
– Extension functions called by binding the Java class to a specific namespace
– SAXON takes care of converting the arguments to make the functions fit
Most extensions are wrappers over existing open-source libraries
– GeoTools and Java Topology Suite (JTS) for the geospatial functions
– Lucene and Nux for keyword matching
– SimPack for textual similarity
– NGramJ and LingPipe for text mining
– MARC4J for metadata crosswalks (i.e. handling MARC formats)
– Apache AXIS for external Web service calls
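The binding mechanism described above can be sketched as follows. This is an assumed, minimal example rather than one of the DIGMAP functions: the class YearExtension and its firstYear method are hypothetical, and the XQuery lines in the comment rely on Saxon's convention of mapping a namespace URI of the form java:ClassName to the static methods of that class, which may depend on the Saxon edition and version in use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extension class. For Saxon to call it, the class would generally
// need to be declared public in its own source file and be on the classpath.
//
// Assumed XQuery usage:
//   declare namespace ext = "java:YearExtension";
//   ext:firstYear("[ca. 1750]")
final class YearExtension {
    // A four-digit year between 1000 and 2099 that is not part of a longer number.
    private static final Pattern YEAR =
            Pattern.compile("(?<!\\d)(1[0-9]{3}|20[0-9]{2})(?!\\d)");

    /** Returns the first plausible year found in a free-text date field, or "" if none. */
    public static String firstYear(String text) {
        Matcher m = YEAR.matcher(text);
        return m.find() ? m.group(1) : "";
    }
}
```

Because the method takes and returns a plain String, the processor can convert to and from xs:string without any Saxon-specific types appearing in the Java code.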
11
Test Cases Within DIGMAP
Conversion between different metadata standards
– Converting UNIMARC, MARC21 and other formats into the DIGMAP format
– Geospatial coordinates were often given originally in general textual fields
– DIGMAP currently indexes over 40,000 metadata records from different sources
Wrappers around DIGMAP XML service interfaces
– The DIGMAP Gazetteer uses formats like Alexandria DL Gazetteer Service format, KML, geoRSS, …
– The DIGMAP GeoParser uses formats like SpatialML, geoRSS, OGC GeoParser, …
– Converting between the different formats and calling the services for processing the metadata records
Internal development of several DIGMAP services
– Data integration within the DIGMAP Gazetteer
– Converting different input sources into the Alexandria DL Gazetteer Content Standard
– Handling duplicates and small corrections to the data
The proposed approach was found to be expressive, and computational performance was within acceptable bounds
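The point about coordinates hidden in general textual fields can be illustrated with a short Java sketch that pulls degree/minute/second values out of free text and converts them to signed decimal degrees. The input format (a hemisphere letter followed by degrees, optional minutes and optional seconds) is an assumption chosen for the example, and the class DmsExtractor is hypothetical; the slides do not say which patterns the DIGMAP transformations actually handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DIGMAP: find coordinates such as "W 9°30'" in text.
final class DmsExtractor {
    // Hemisphere letter, degrees, optional minutes ('), optional seconds (").
    // \u00B0 is the degree sign, written as an escape to avoid encoding issues.
    private static final Pattern DMS = Pattern.compile(
            "([NSEW])\\s*(\\d{1,3})\u00B0\\s*(?:(\\d{1,2})'\\s*)?(?:(\\d{1,2})\")?");

    /** Returns every coordinate found, as decimal degrees (S and W are negative). */
    static List<Double> extractAll(String text) {
        List<Double> result = new ArrayList<>();
        Matcher m = DMS.matcher(text);
        while (m.find()) {
            double value = Integer.parseInt(m.group(2));
            if (m.group(3) != null) value += Integer.parseInt(m.group(3)) / 60.0;
            if (m.group(4) != null) value += Integer.parseInt(m.group(4)) / 3600.0;
            boolean negative = m.group(1).equals("S") || m.group(1).equals("W");
            result.add(negative ? -value : value);
        }
        return result;
    }
}
```

A bounding box written as four such values could then be handed to the geospatial functions; the sketch does not validate ranges (for example minutes above 59), which a production function would need to do.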
12
An Example XQuery
An XQuery for reading gazetteer data from an HTML source and converting the data into the Alexandria DL Gazetteer Content format
13
Conclusions
Data transformations in Digital Libraries can be very complex
– Standard XML processing technology is often not enough
– But simple extensions can add the required extra functionality
We propose using extension functions to the XPath 2.0 library
– Declarative syntax of XSLT and XQuery is not affected
– Extension functions add the required extra functionality
Used in DIGMAP collection building and service composition
– Converting between different metadata formats
– Handling the spatio-temporal information included in the metadata
– Calling DIGMAP services to enrich the metadata records
14
Currently Ongoing Work
Implementing a visual interface for encoding the metadata transformations
Visual “pipelines” converted into XQuery instructions
Hide the complexity of the XSLT/XQuery languages from non-expert users
15
Thanks for your attention.
www.digmap.eu
http://transform.digmap.eu