Accommodating Diverse Search Requirements over a Fedora Repository Michael Durbin and Jon W. Dunn Fedora User Group – Open Repositories 2008 April 3, 2008.


1 Accommodating Diverse Search Requirements over a Fedora Repository Michael Durbin and Jon W. Dunn Fedora User Group – Open Repositories 2008 April 3, 2008

2 Background
July 16, 2015 Fedora Users Group - Open Repositories 2008
- Indiana University Digital Library Program
  - Started in 1997
- Diversity of formats and collections
  - Text, image, musical scores, audio, video, …
- Diversity of search systems
  - DLXS, XTF, Lucene, DB2 NSE, Oracle Text
- Current project to unify architecture for storage, discovery, and delivery around Fedora

3 Search System Development
- Phase one: create a search architecture and template for an image-based search and discovery application
- Phase two: extend the template and architecture to support more advanced search and discovery applications over different object types

4 PHASE I: CREATING A BASIC IMAGE SEARCH

5 Phase One: Simple Image Search
- Slocum puzzle collection: ideal test case
- Small number of objects
- Simple content model
  - Each object represents a single physical puzzle
  - Basic metadata: METS, MODS, DC
  - RELS-EXT isMemberOf relationship with a collection object
  - Pre-scaled derivative images

6 [No text on this slide]

7 Requirements: Identifier Resolution
- External identifiers rather than Fedora PIDs
  - Seamless migration to Fedora
  - No commitment to any underlying repository architecture
- Requirement: quickly resolve our identifier (PURL) to the Fedora PID

8 Requirements: PURL Identifier Resolution
[Diagram labels: OCLC PURL Resolver; Hypothetical ID Resolution Service]
- PURL: http://purl.dlib.indiana.edu/iudl/lilly/slocum/thumbnail/LL-SLO-004696
- Fedora URL: http://fedora.dlib.indiana.edu:8080/fedora/get/iudl:19794/THUMBNAIL
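
The resolution step on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not the service described in the talk: the lookup table, the function name `resolve_purl`, the assumed PURL path layout, and the rule that the datastream ID is the upper-cased view name are all assumptions made for the example. Only the single identifier pair shown on the slide is used.

```python
from urllib.parse import urlparse

# Hypothetical lookup table: local identifier -> Fedora PID.
# Only this one pair appears on the slide; a real service would query an index.
ID_TO_PID = {"LL-SLO-004696": "iudl:19794"}

FEDORA_BASE = "http://fedora.dlib.indiana.edu:8080/fedora/get"


def resolve_purl(purl: str) -> str:
    """Map a PURL to a Fedora datastream URL; raises KeyError if the ID is unknown."""
    segments = urlparse(purl).path.strip("/").split("/")
    # Assumed layout: .../<view>/<local id>, e.g. .../thumbnail/LL-SLO-004696
    view, local_id = segments[-2], segments[-1]
    pid = ID_TO_PID[local_id]
    # Assumption: the datastream ID is the upper-cased view name.
    return f"{FEDORA_BASE}/{pid}/{view.upper()}"
```

Keeping the mapping behind one function means the public identifier stays stable even if the repository behind it changes, which is the point of the requirement on the previous slide.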

9 Requirements: Keyword and Fielded Search
- Very basic search requirements for any discovery and delivery web application
  - Keyword search should maximize discovery
  - MODS fields should be searchable to maximize accuracy of matches
  - Search results paging
  - Support for simple Boolean operators
  - Wildcard searches are a requirement
  - Full metadata record (MODS) returned

10 Remaining Requirements
- User interface
  - Extensible, reusable, customizable
- Service-oriented approach
  - Centralize core search system
  - Standards-based access for integration with other services and end-user tools

11 Requirements: Search System
[Diagram]
- UI Layer: Slocum Webapp, Generic Search Webapp
- Search Layer: PURL Resolution, Fielded Search, Fedora Integration

12 Solutions: Search Protocol
- Search/Retrieve via URL (SRU)
  - One of very few standard search protocols
  - Extremely powerful and flexible query language (CQL)
  - Can return records of any type
  - Most commonly used with DC, MODS, MARCXML
  - Has mechanisms for extension in case special needs arise
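
To make the protocol concrete, here is a small Python sketch that builds an SRU 1.1 `searchRetrieve` request URL. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`, `recordSchema`) come from the SRU specification; the function name, the default values, and the schema short name `mods` are assumptions for illustration, since schema names are defined by each server.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start_record=1, maximum_records=10,
                   record_schema="mods"):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,                 # CQL, e.g. dc.title = "puzzle*"
        "startRecord": start_record,        # 1-based offset, used for paging
        "maximumRecords": maximum_records,  # page size
        "recordSchema": record_schema,      # assumed short name; server-defined
    }
    return f"{base_url}?{urlencode(params)}"
```

Paging through results is then a matter of advancing `startRecord` by `maximumRecords` on each request, which covers the results-paging requirement without any protocol extension.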

13 Search System Solutions: SRU
[Diagram]
- UI Layer: Slocum Webapp, Generic Search Webapp
- SRU
- Search Layer: PURL Resolution, Fielded Search, Fedora Integration

14 Solutions: Existing Products
- Fedora Search
  - Good for finding items based on basic Fedora metadata, but not for more sophisticated searching
- Fedora Resource Index Search
  - Also limited to searching basic metadata, not the content of datastreams

15 Solutions: Existing Products
- Fedora Generic Search Service (GSearch)
  - Hooks into Fedora
  - Works with Lucene
  - Easy to customize search fields through XSLT transformation of existing metadata
- OCLC SRU/W Implementation
  - Relatively complete implementation in Java, with ongoing development
  - Others have had success using it with Lucene

16 Search System
[Diagram labels: SRU; OCLC SRU Implementation; Lucene Database extension; index; Fedora Generic Search Service; Reads; Updates]

17 Phase 1 Solution: General Applicability
- Pieces of this solution have been used for other image collections
- SRU is used to expose these collections to OneSearch@IU, our federated search service
- The XSLT that assigned metadata to Lucene index fields was a solid base for the indexing needs of other collections

18 Phase 1 Solution: Lingering Problems
- Our XSLT for the Generic Search Service wasn’t perfect
- Some complications prevented full automation
- We punted on getting the perfect Lucene analyzer configuration

19 PHASE II: EXTENDING FOR DIFFERENT COLLECTIONS

20 EVIA Digital Archive

21 Requirement: EVIADA Video Annotation Collection
[Diagram labels: Video Object; Field Collection Object; Custom Annotation Software; Field Collection]

22 Requirement: EVIADA Video Annotation Collection
- Complex data model
  - One Fedora object which is addressable and discoverable in parts
- New features
  - Faceted search and browse
  - Extensive custom fields

23 Requirements: IN Harmony Sheet Music Collection

24 Requirements: IN Harmony Sheet Music Collection
- Complex content model
  - Three types of objects below the collection
    - Sheet music
    - Individual score
    - Page image
[Example shown: Chariot Race March]

25 Requirements: IN Harmony Sheet Music Collection
- New features
  - Faceted search and browse
  - Exact match searches
  - Date range searches
  - Dozens of very specific fields
  - Sorting by date or title

26 Options:
- Extend our existing implementation
  - All too appealing because of familiarity and “sunk costs”
  - Major conflicts between existing model and desired model could result in unmaintainable “hackish” implementations
- Switch to a new infrastructure
  - Would be great, if something existed that met our needs without having to rework everything
- Some combination
  - Best of both worlds?

27 Options: Faceted Search and Browse
- Use Solr
  - Built-in support for facets
  - Is a service layer with an XML response
  - But do we really want to abandon SRU, or maintain two search service protocols?

28 Options: Faceted Search and Browse
- Extend SRU implementation
  - Prevents the need for yet another service layer
  - Has wide reuse potential
  - Could be backed by Solr without substantially more effort

29 Solution: Faceted Search over SRU
[Diagram label: SRU Service (now with facet support)]
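
The slides do not show how facets were added to the SRU service, so the following is only a generic Python sketch of what a facet computation does: for each facet field, count how many records in the result set carry each value. The function name, the record layout (a dict of field name to list of values), and the sample field names in the usage note are assumptions.

```python
from collections import Counter


def facet_counts(records, facet_fields):
    """Count field values across a result set, one Counter per facet field."""
    counts = {field: Counter() for field in facet_fields}
    for record in records:
        for field in facet_fields:
            # Fields are multi-valued; a missing field contributes nothing.
            for value in record.get(field, []):
                counts[field][value] += 1
    return counts
```

Given two hypothetical records, one with `genre` values `["march"]` and one with `["march", "waltz"]`, `facet_counts(records, ["genre"])["genre"]` holds a count of 2 for `march` and 1 for `waltz`. A service would return these counts alongside the ordinary result records.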

30 Solution: Other SRU Improvements
- More complete CQL support
  - Easy improvements
    - Operators (and, or, not, any, all)
    - Application-specific fields

31 Solutions: Other SRU Improvements
- More complete CQL support
  - Difficult improvements
    - “cql.exact” relation
    - Facet implementation
    - Sort support
- Example query: dc.subject exact “United Kingdom”
[Diagram labels: index dc.subject; fields dc.subject, dc.subject.exact, dc.subject.sort]
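
One way to read the field names on this slide is that a single CQL index is backed by several underlying fields, chosen by relation or by purpose. The Python sketch below illustrates that reading. The suffix convention is inferred from the names `dc.subject.exact` and `dc.subject.sort`; the function names and the Lucene-style query string it renders are assumptions, not the authors' code.

```python
# Assumed naming convention, inferred from the field names on the slide.
SUFFIX_BY_RELATION = {"exact": ".exact"}


def lucene_field(cql_index, relation="="):
    """Pick the underlying field for a CQL index and relation."""
    return cql_index + SUFFIX_BY_RELATION.get(relation, "")


def sort_field(cql_index):
    """Field assumed to hold the untokenized value used for sorting."""
    return cql_index + ".sort"


def term_query(cql_index, relation, term):
    """Render a Lucene-style field:"term" string for one CQL search clause."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return f'{lucene_field(cql_index, relation)}:"{escaped}"'
```

Under these assumptions, the example query `dc.subject exact "United Kingdom"` would be routed to the field `dc.subject.exact`, while an ordinary `dc.subject = ...` clause would keep using the tokenized `dc.subject` field.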

32 Options: Index Generation
- Fedora Generic Search Service
- Homegrown solution

33 Reconsideration: GSearch
- Limited by the one-to-one relationship between Lucene documents and Fedora objects
- Storing valid XML as CDATA in Lucene is messy and prone to error as the metadata becomes more diverse
- We really only use it to generate a Lucene index

34 Consideration: Solr
- Robust wrapper for Lucene
  - Exposes service to update index
  - Exposes search features as a service
  - Abstracts away much of the complexity of Lucene
- Migrating existing search indexes would be prohibitively time-consuming, but it might be the best tool to bring up new collections

35 Solution: Custom index service
- A service whose initial functionality is simply to create and maintain Lucene index directories that are served by SRU
  - Can easily be extended/configured to use different search engines or to delegate the process entirely (perhaps to Solr)
- Support for existing GSearch-style XSLT
- Simple Java interface to allow for easy index implementations
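
The talk describes a Java interface but does not show it. The Python sketch below only illustrates the shape such a pluggable design could take: a small writer interface, a one-document-per-object writer, and a compound writer that emits one document per part, which a strict one-to-one mapping cannot express. Every class, method, and field name here is hypothetical, and the in-memory lists stand in for Lucene index directories.

```python
from abc import ABC, abstractmethod


class IndexWriter(ABC):
    """Hypothetical interface: turn one repository object into index documents."""

    @abstractmethod
    def documents_for(self, pid, metadata):
        """Return a list of field dicts to index for the object."""


class BasicIndexWriter(IndexWriter):
    def documents_for(self, pid, metadata):
        return [{"pid": pid, **metadata}]  # one document per object


class CompoundIndexWriter(IndexWriter):
    def documents_for(self, pid, metadata):
        # One document per addressable part of the object.
        return [{"pid": pid, "part": number, **part}
                for number, part in enumerate(metadata.get("parts", []), start=1)]


class IndexService:
    def __init__(self):
        self._writers = {}
        self.index = {}  # index name -> documents; stand-in for a Lucene directory

    def register(self, index_name, writer):
        self._writers[index_name] = writer
        self.index[index_name] = []

    def update(self, pid, metadata):
        for name, writer in self._writers.items():
            docs = self.index[name]
            docs[:] = [d for d in docs if d["pid"] != pid]  # drop stale documents
            docs.extend(writer.documents_for(pid, metadata))
```

Because the service only depends on `documents_for`, a writer that delegates to another search engine could be registered the same way without changing the update path.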

36 Search Service
[Diagram]
- OCLC SRU Implementation
- Lucene databases (indexes)
  - Lucene Database – configured for quick id resolution
  - Lucene Database – configured for basic search
  - Lucene Database – configured for advanced search
  - Lucene Database – configured for compound model searches
- Custom Index Service
  - Basic Index Writer
  - GSearch Style XSLT Index Writer
  - New Style XSLT Index Writer
  - Compound Model Java Index Writer

37 Search Service
[Diagram]
- OCLC SRU Implementation
- Databases (indexes)
  - Lucene Database – configured for quick id resolution
  - Lucene Database – configured for basic search
  - Lucene Database – configured for advanced search
  - Lucene Database – configured for compound model searches
  - Solr Database – configured to interface with Solr
- Custom Index Service
  - Basic Index Writer
  - GSearch Style XSLT Index Writer
  - New Style XSLT Index Writer
  - Compound Model Java Index Writer
  - Solr Wrapping Index
- Solr

38 Future Plans
- Full-text searching
  - Search text of entire books or journals
  - Determine where in the hierarchy the match occurred
  - Provide snippets with highlighted matches in context for the search results listing
- Solutions
  - XTF, Solr through our custom index service
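
As a rough picture of the snippet requirement, the Python sketch below finds the first case-insensitive occurrence of a term and returns it with some surrounding context, marking the match. It is a toy illustration only: tools such as XTF and Solr work from analyzed index terms rather than raw string matching, and the function name, context width, and `**` marker are assumptions.

```python
import re


def snippet(text, term, context=30, marker="**"):
    """Return the first match of term with surrounding context, or None."""
    match = re.search(re.escape(term), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(0, match.start() - context)
    end = min(len(text), match.end() + context)
    prefix = "..." if start > 0 else ""      # text was cut on the left
    suffix = "..." if end < len(text) else ""  # text was cut on the right
    return (prefix + text[start:match.start()]
            + marker + match.group(0) + marker
            + text[match.end():end] + suffix)
```

For hierarchical texts, the same idea would be applied per chapter or page so that the result listing can also say where in the hierarchy the match occurred.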

39 Conclusion
- Most of the work is configuring the index, which is a requirement that cannot be avoided
- Migration doesn’t have to be difficult or disruptive
- Always be willing and able to consider new products and technologies

40 Thanks! Any Questions?
- www.dlib.indiana.edu
- wiki.dlib.indiana.edu/confluence/x/AQI
- midurbin@indiana.edu
- jwd@indiana.edu

