Published by Steven Stevenson. Modified over 9 years ago.
1 panFMP - An XML-based Framework for Metadata Portals
Talk and hands-on seminar at GFZ Potsdam
Uwe Schindler, MARUM – Universität Bremen
PANGAEA® - Publishing Network for Geoscientific & Environmental Data
uschindler@pangaea.de
2 Metadata Portals: Search Technology for Distributed Catalogues
Searching directly on distributed catalogues: In a distributed search infrastructure, every data provider has not only its own metadata catalogue but also a search interface that the portal can call (e.g., a web service). Each search request is sent to all data providers; the portal only collects the providers' results, ranks them, and displays them to the end user. Examples: NSDI Clearinghouse, GeoMIS.BUND.
Harvesting catalogues into a central searchable catalogue: Every data provider has its own metadata catalogue, but the search engine is centralized. The portal periodically harvests all metadata records into a central index and serves search requests from there. Major web search engines like Google and the FGDC-related Geospatial One-Stop are based on this concept. Response times are short because only local components take part in the search.
3 Metadata Portals: Harvesting solutions from PANGAEA®
WDC-MARE, with its information system PANGAEA®, currently provides data portals for several EU and international projects. Not all data are stored centrally, so the datasets offered in a portal must be consolidated from different sources.
Features:
– Data stays at the data providers
– Metadata is harvested by the portal
– Search queries are handled by the centralized catalogue (Google-like search speed)
– The scientist gets a link to the data at the provider
4 Metadata Harvesting Solutions
– Web Accessible Folder (WAF): harvesting by recursively collecting XML files from a web server's directory listing; simple, but inefficient
– Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
5 Open Archives Protocol
– The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative
– Almost all digital libraries support it (best-known examples: Fedora Commons, arXiv, the CERN Document Server, and GeoNetwork Opensource)
– Portals by Scientific Commons, OAIster, and SUB use it during web crawling (if available)
– Very simple to implement (XML over HTTP, REST-style)
– Repository software for databases or file-system metadata providers is widely available (e.g., the DLESE jOAI software on the data provider side)
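To illustrate why the protocol is simple to implement, here is a minimal sketch of the two steps a harvester repeats: building a `ListRecords` request and reading the record identifiers and the `resumptionToken` from the response. The base URL `https://example.org/oai`, the helper names, and the sample response are made up for illustration; only the verb, the argument names, and the XML namespace come from the OAI-PMH 2.0 specification. No network access is performed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for the first or a follow-up page."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing token means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Made-up, heavily shortened response used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE_RESPONSE)
print(list_records_url("https://example.org/oai"))
print(ids, token)
print(list_records_url("https://example.org/oai", resumption_token=token))
```

A real harvester would fetch each URL over HTTP and loop until no token is returned; error handling and incremental harvesting by date are left out of this sketch.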
6 Current OAI-PMH software
1. Limited to Dublin Core metadata (libraries)
2. Limited full-text search functionality because relational databases are used in the background
3. No geographic retrieval (a consequence of the Dublin Core limitation)
4. The end-user interface is part of the software, which limits its use within content management systems
7 Central indexing requirements
1. Open to any XML metadata format
2. All mappings to document fields are defined with XPath/XSLT
3. Incompatible XML schemas can be mapped on the fly by XSLT during harvesting
4. On-the-fly validation of documents (possibly already transformed) during harvesting
5. No relational database, only a full-text search engine that contains everything needed for operation
6. Range queries on specific fields (date/time or numeric)
7. A web service interface / programming API for the end-user interface that is accessible from any language (Java/JSP, PHP, Perl, ...)
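Requirement 2 can be pictured as a small table that maps field names to XPath expressions. The sketch below uses Python's standard library, which supports only a subset of XPath, and is not panFMP's configuration format; the field names, the `extract_fields` helper, and the sample record are assumptions made for illustration. The two namespace URIs are the standard ones for Dublin Core and the `oai_dc` container.

```python
import xml.etree.ElementTree as ET

NAMESPACES = {
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical field configuration: field name -> XPath relative to the record root.
FIELD_XPATHS = {
    "title": "./dc:title",
    "author": "./dc:creator",
    "date": "./dc:date",
}

# Made-up Dublin Core record used only for this example.
SAMPLE_RECORD = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sediment core measurements</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2008-05-01</dc:date>
</oai_dc:dc>"""


def extract_fields(xml_text, field_xpaths, namespaces):
    """Evaluate every configured XPath and collect the text of all matches."""
    root = ET.fromstring(xml_text)
    fields = {}
    for name, xpath in field_xpaths.items():
        fields[name] = [
            el.text.strip() for el in root.findall(xpath, namespaces) if el.text
        ]
    return fields


fields = extract_fields(SAMPLE_RECORD, FIELD_XPATHS, NAMESPACES)
print(fields)
```

Because the mapping is data rather than code, supporting another metadata schema only means supplying a different set of expressions, which is the point of the requirement.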
8
– Ranked searching: best results returned first
– Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries for date/time values and numbers
– Fielded searching: all fields are searchable as a whole, each field separately (e.g. for author, parameter), or mixed
– Any combination of boolean operators between search terms (AND, OR, NOT, exact phrase)
– Sorting by any field
– Multiple-index searching with merged results
– Simultaneous searching and updating, thanks to high-performance indexing
9 Structure of a Lucene Index
10 panFMP – PANGAEA® Framework for Metadata Portals
panFMP is a generic and flexible framework for building geoscientific metadata portals, independent of metadata content standards and protocols. Data providers can be harvested with commonly used protocols (e.g., the Open Archives Initiative Protocol for Metadata Harvesting) and metadata standards like Dublin Core, DIF, or ISO 19115. The new Java-based portal software supports any XML encoding and makes metadata searchable through Apache Lucene. Software administrators are free to define searchable fields, independent of their type, using XPath and/or XSL Templates. In addition, by extending the full-text search engine (FTS) Apache Lucene, we have significantly improved queries for numerical and date/time ranges by supplying a new trie-based algorithm, thus enabling high-performance space/time retrievals in FTS-based geo portals. The harvested metadata are stored in separate indexes, which makes it possible to combine them into different portals. The portal-specific Java API and web service interface are highly flexible: they support custom front-ends for users, automatic query completion (AJAX), and dynamic visualization with conventional mapping tools.
11 panFMP – Components of a metadata portal
12 panFMP - Harvesting
13 panFMP - Search Interface
– Supports all standard Lucene search features
– Additional support for fast range queries to enable bounding boxes, etc.:
  – implemented by redundant storage of “numerical terms” in different precisions
  – recursive reduction of distinct terms (every numerical value is a term) on range query
  – search time no longer dependent on index size
– Accessible via Java API or AXIS web service
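The "redundant storage in different precisions" can be sketched in a few lines. This is a simplified decimal illustration, not panFMP's or Lucene's actual encoding: each value is stored under its full three-digit term and under every shorter prefix. The `trie_terms` helper and the sample values are assumptions made for illustration.

```python
def trie_terms(value, digits=3):
    """Return the terms stored for one value, from full to lowest precision."""
    text = str(value).zfill(digits)
    return [text[:length] for length in range(digits, 0, -1)]


# Hypothetical indexed values; each value stands in for one record.
values = [421, 423, 445, 446, 448, 521, 522]

# Inverted index: term -> set of values stored under that term.
index = {}
for value in values:
    for term in trie_terms(value):
        index.setdefault(term, set()).add(value)

print(trie_terms(445))       # ['445', '44', '4']
print(sorted(index["44"]))   # [445, 446, 448]

# The number of distinct terms shrinks as precision drops.
for length in (3, 2, 1):
    print(length, sum(1 for term in index if len(term) == length))
```

The extra terms cost index space, but a range query can then cover many values with a single low-precision term instead of enumerating every full-precision one.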
14 panFMP – Range Queries
Example of trie-based recursive splitting of a range query with three precisions (simplified for demonstration): The user wants to find all records with terms between "423" and "642". Instead of selecting all terms in the lowermost row, the query is optimized to match on labelled terms of lower precision where applicable. It is enough to select the term "5" to match all records starting with "5" ("521", "522"), or "44" for "445", "446", "448". The query is therefore simplified to match all records containing the terms "423", "44", "5", "63", "641", or "642".
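The splitting in this example can be reproduced with a short sketch. It works on decimal digits, as the slide does, and is a simplified illustration rather than panFMP's implementation. The function names are hypothetical, and the set of indexed values is an assumption chosen to be consistent with the terms named above, since the slide's figure is not part of this transcript. The code assumes 0 <= lo <= hi < 10**digits.

```python
def split_range(lo, hi, digits=3):
    """Split [lo, hi] into sub-ranges (shift, first_prefix, last_prefix).

    shift is the number of low-order decimal digits dropped: shift 0 is
    full precision, shift digits-1 is the coarsest precision.
    """
    ranges = []
    shift = 0
    while True:
        step = 10 ** shift      # width of one term at this precision
        block = step * 10       # width of one term at the next coarser precision
        lo_aligned = -(-lo // block) * block         # round lo up to a block start
        hi_aligned = (hi + 1) // block * block - 1   # round hi down to a block end
        if shift == digits - 1 or lo_aligned > hi_aligned:
            # Coarsest level reached, or no whole coarser block fits inside.
            ranges.append((shift, lo // step, hi // step))
            break
        if lo < lo_aligned:     # ragged lower edge stays at this precision
            ranges.append((shift, lo // step, (lo_aligned - 1) // step))
        if hi > hi_aligned:     # ragged upper edge stays at this precision
            ranges.append((shift, (hi_aligned + 1) // step, hi // step))
        lo, hi = lo_aligned, hi_aligned
        shift += 1
    return ranges


def query_terms(lo, hi, indexed_terms, digits=3):
    """Return the indexed terms that a range query has to match."""
    terms = []
    for shift, first, last in split_range(lo, hi, digits):
        width = digits - shift
        for prefix in range(first, last + 1):
            term = str(prefix).zfill(width)
            if term in indexed_terms:
                terms.append(term)
    return sorted(terms)


# Assumed index content (not taken from the slide's figure).
values = [421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]
indexed_terms = {str(v).zfill(3)[:n] for v in values for n in (1, 2, 3)}

print(split_range(423, 642))
print(query_terms(423, 642, indexed_terms))
# ['423', '44', '5', '63', '641', '642']
```

With these assumed values, six terms cover the eleven matching values. The middle of the range is handled by a single low-precision term, so the number of terms to visit depends mainly on the ragged edges of the range and not on how many values lie inside it.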
15 Examples
http://sedis.iodp.org
http://www.c3grid.de/portal
http://www.world-data-centers.org/
http://dataportal.carboocean.org
http://pages-dataportal.unibe.ch/cgi-bin/WebObjects/dataportal
Currently not available: http://data.planktonnet.eu
16 Thank You!
The software is available as open source on SourceForge.net:
http://www.panFMP.org
http://sourceforge.net/projects/panfmp