Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC

Presentation
• Open-source search engine library
• Written in C++ (we use the Perl bindings)
• Uses the BM25 ranking function, which provides the relevance ranking
• “Scales well”: 100+ million documents
• Oh… code that we don’t need to maintain!
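
To make the BM25 bullet concrete, here is a minimal pure-Python sketch of the textbook BM25 formula. It is not Xapian's implementation: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the sample documents are all assumptions chosen for illustration, and Xapian's own weighting class exposes further tuning parameters that are not modelled here.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical sample documents, already split into terms.
docs = [
    "open access repository software".split(),
    "search engine library".split(),
    "search search relevance ranking".split(),
]
scores = bm25_scores(["search"], docs)
```

The first document does not contain the query term and scores zero; the third mentions it twice and ranks above the second.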

Core Concepts
• Database
• Document
  ◦ data
  ◦ terms
  ◦ values
• (Xapian) Metadata management
• Searching
• Are you ready for it?

Core Concepts: Database
• Collection of files storing indexes, positions, term frequencies, …
• One write-lock, multiple read-locks
• Stored in archives/ /var/xapian/
• Supports multiple DBs (unused in EPrints)
• Can store arbitrary metadata

Core Concepts: Document
• A Document is an item returned by a search
• So it’s also the meaty bit of indexing
• Maps to a single data-obj in EPrints
• Has three main components:
  ◦ data
  ◦ terms
  ◦ values

Core Concepts: Document Data
• Arbitrary blob of data
• Not processed by Xapian
• Used to store the information needed to display the results
• In EPrints, stores the data-obj identifier so that EPrints::List objects can be built quickly
• Could be used to store more complex data: cached citations, JSON/Perl representation of the data-obj
• Limit: ~100MB per Document
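
As a rough model of the three Document components, and of how the data blob lets a result list be rebuilt from identifiers alone, consider the sketch below. The `Doc` class and `ids_from_matches` helper are hypothetical names for this illustration; they are not Xapian or EPrints classes.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Toy stand-in for an indexed document (not the Xapian class)."""
    data: str = ""                               # opaque blob, here a data-obj id
    terms: set = field(default_factory=set)      # what searches are matched against
    values: dict = field(default_factory=dict)   # slot number -> stored value

def ids_from_matches(matches):
    """Read the stored identifiers back, preserving the result order."""
    return [int(doc.data) for doc in matches]

# Hypothetical search results, in ranked order.
matches = [Doc(data="42", terms={"search"}), Doc(data="7", terms={"search"})]
ids = ids_from_matches(matches)
```

Because only the identifier is stored, building the list needs no further processing of the matched documents.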

Core Concepts: Document Terms
• Basis of relevance search: a search compares the terms specified by a Query against the terms in the DB
• Three main types of terms:
  ◦ Un-prefixed terms: can be seen as a general pool of indexed terms
  ◦ Prefixed terms: allow searching a subset of the information (title, authors…)
  ◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)

Core Concepts: Document Terms (2)
• Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …)
• No text processing involved; a value appears 0 or 1 time in a Document
• Textual data, via the TermGenerator class:
  ◦ Provides the Stemmer and Stopper classes (note: language-dependent)
  ◦ Spelling correction
  ◦ Exact matching (“hello world”) and the termpos joys
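
The "termpos joys" come from the fact that exact matching needs the position of every term, not just its presence. A minimal sketch of the idea, assuming whitespace tokenisation and with hypothetical helper names (`index_positions`, `phrase_match`) that are not part of Xapian:

```python
def index_positions(text):
    """Map each lower-cased word to the list of positions where it occurs."""
    positions = {}
    for pos, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(pos)
    return positions

def phrase_match(positions, phrase):
    """True if the words of `phrase` occur at consecutive positions."""
    words = phrase.lower().split()
    if not words:
        return False
    # Try every occurrence of the first word as a possible phrase start.
    for start in positions.get(words[0], []):
        if all(start + i in positions.get(w, []) for i, w in enumerate(words)):
            return True
    return False

positions = index_positions("say hello world again")
```

Here `phrase_match(positions, "hello world")` succeeds while `phrase_match(positions, "world hello")` does not, even though both words are indexed.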

Core Concepts: Document Terms (3)
• Un-prefixed terms are used for the simple search
• Prefixed terms are used for field-based searches (such as the advanced search)
• Boolean terms are used for any identifier-type field; this includes facets (when searching)
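
The three kinds of terms can be pictured with a small indexing sketch. The prefixes (`S`, `A`, `XTYPE:`, `XSUBJ:`), the stop list, and the sample record are assumptions made up for this example, and no stemming is attempted; the actual EPrints prefixes are not given in these slides.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "and"}   # tiny illustrative stop list

# Hypothetical prefixes chosen for this sketch only.
TEXT_PREFIXES = {"title": "S", "creators_name": "A"}
BOOLEAN_PREFIXES = {"type": "XTYPE:", "subjects": "XSUBJ:"}

def tokenize(text):
    """Lower-case, split on non-word characters, drop stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOPWORDS]

def index_record(record):
    """Return the set of terms generated for one record."""
    terms = set()
    for field_name, prefix in TEXT_PREFIXES.items():
        for word in tokenize(record.get(field_name, "")):
            terms.add(word)            # un-prefixed: feeds the simple search
            terms.add(prefix + word)   # prefixed: feeds field-based search
    for field_name, prefix in BOOLEAN_PREFIXES.items():
        for value in record.get(field_name, []):
            terms.add(prefix + value)  # boolean: exact value, no text processing
    return terms

record = {
    "title": "The Anatomy of a Search Engine",
    "creators_name": "Brin",
    "type": ["article"],
    "subjects": ["PM"],
}
terms = index_record(record)
```

A simple search for "search" would hit the un-prefixed term, a title search would look for `Ssearch`, and a facet filter would test for `XTYPE:article` exactly.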

Core Concepts: Document Values
• “Search helpers”: we use them for ordering and faceting (occurrences & available facets)
• Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)
• Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata
• eprint.creators_name.en maps to the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

Core Concepts: Document Values (2)
• eprint.facet.type.0 maps to the 1st slot for a facet “type” on the dataset “eprint”
• Used by the MultiValueSorter class to order data (when not ordered by relevance)
• Used to find out the available facets (after a search) and the occurrences of their values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’
• The Xapian documentation advises keeping the number of values low (they slow down searching)
• We usually limit the number of slots for a facet to 5
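
Ordering by stored values, rather than by relevance, amounts to sorting on a compound key built from several slots. The sketch below models that in plain Python; the slot numbers (0 for a creator order-value, 1 for a date order-value), the sample data, and the `sort_by_slots` helper are assumptions for illustration and do not reproduce the MultiValueSorter API.

```python
def sort_by_slots(docs, slots):
    """Order documents by the values held in `slots`, in priority order.

    Each document is modelled as a dict of slot number -> value; a missing
    slot sorts as the empty string.
    """
    return sorted(docs, key=lambda values: tuple(values.get(s, "") for s in slots))

# Hypothetical slots: 0 = creator order-value, 1 = date order-value.
docs = [
    {0: "smith", 1: "2013"},
    {0: "jones", 1: "2014"},
    {0: "jones", 1: "2012"},
]
ordered = sort_by_slots(docs, [0, 1])
```

Documents are ordered by creator first, with the date breaking ties between the two "jones" entries.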

Core Concepts: Metadata management
• We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us)
• EPrints reserves slots per dataset:
  ◦ for order-values (1 per orderable field)
  ◦ for facet slots (1 per facetable value)
• EPrints also stores the current slot offsets to know:
  ◦ where the range for the next dataset starts
  ◦ where the next slot for order-values is
• EPrints also stores some other useful information as Metadata
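
The bookkeeping described above can be sketched as a small registry that maps meaningful keys to slot numbers and persists both the mappings and the next free offset as string metadata. This is a simplification under stated assumptions: it uses a single running counter instead of per-dataset ranges, the `SlotRegistry` class and the `next_slot` key are hypothetical, and a plain dict stands in for the database metadata store.

```python
class SlotRegistry:
    """Toy slot allocator backed by a dict that stands in for DB metadata."""

    def __init__(self, metadata=None):
        self.metadata = metadata if metadata is not None else {}

    def slot_for(self, key):
        """Return the slot mapped to `key`, allocating the next free one if new."""
        if key not in self.metadata:
            next_slot = int(self.metadata.get("next_slot", "0"))
            self.metadata[key] = str(next_slot)            # remember the mapping
            self.metadata["next_slot"] = str(next_slot + 1)  # advance the offset
        return int(self.metadata[key])

    def facet_slots(self, dataset, field_name, count=5):
        """Reserve `count` slots for one facet (the slides mention 5 as usual)."""
        return [
            self.slot_for(f"{dataset}.facet.{field_name}.{i}") for i in range(count)
        ]

registry = SlotRegistry()
order_slot = registry.slot_for("eprint.creators_name.en")
type_slots = registry.facet_slots("eprint", "type")
```

Asking for the same key again returns the same slot, which is the property that keeps indexing and searching in agreement.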
Core Concepts: Metadata management (2)

Core Concepts: Searching
• Reverse process of indexing
• Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators
    $query = new Query( "hello" )
    $query = new Query( AND, $query, "world" )
• Can be stringified to see how the query is interpreted (easier to read than SQL!)
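
The query tree and its stringification can be modelled in a few lines. The `Query` class below is a toy written for this illustration, mirroring the shape of the slide's two statements; it supports only AND and OR and is not the Xapian class, whose string output also looks different.

```python
class Query:
    """Toy query tree: a leaf holds a term, a node holds an operator and children."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.term, self.children = None, args[0], []
        else:
            self.op, self.term = args[0], None
            # Bare strings are wrapped so that every child is a Query.
            self.children = [a if isinstance(a, Query) else Query(a) for a in args[1:]]

    def __str__(self):
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"

    def matches(self, terms):
        """Evaluate the tree against the set of terms of one document."""
        if self.op is None:
            return self.term in terms
        results = [child.matches(terms) for child in self.children]
        if self.op == "AND":
            return all(results)
        if self.op == "OR":
            return any(results)
        raise ValueError(f"unsupported operator: {self.op}")

query = Query("hello")
query = Query("AND", query, "world")
description = str(query)
```

Printing the tree gives a compact view of how the query was built, which is the debugging aid the slide refers to.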

Core Concepts: Searching - QueryParser
• Parses user queries
• Supports:
  ◦ wildcards: wild* will match wildcat
  ◦ boolean ops: pear AND (red OR green NOT blue)
  ◦ love/hate ops: crab +nebula -crustacean
  ◦ exact match: “lorem ipsum”
  ◦ synonyms: colour/color, realise/realize
  ◦ stemming: happiness/happy -> happi
  ◦ suggestions: may provide a corrected query
• Features can be turned on/off (all are enabled in EPrints)
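
As an illustration of one of these features, the love/hate operators can be sketched as a split into required, optional, and excluded terms. The helper names and the matching rule are assumptions for this sketch: here an excluded term rejects a document, required terms must all be present, and optional terms only decide the match when nothing is required. The real QueryParser is far richer, and this does not reproduce its grammar.

```python
def parse_love_hate(text):
    """Split a query string into (required, optional, excluded) term lists."""
    required, optional, excluded = [], [], []
    for token in text.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, optional, excluded

def matches(doc_terms, parsed):
    """Apply the parsed query to the set of terms of one document."""
    required, optional, excluded = parsed
    if any(term in doc_terms for term in excluded):
        return False
    if required:
        return all(term in doc_terms for term in required)
    return any(term in doc_terms for term in optional)

parsed = parse_love_hate("crab +nebula -crustacean")
```

With this query, a document about the nebula matches whether or not it mentions "crab", while any document mentioning "crustacean" is rejected.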

Core Concepts: Search - Enquire
• The object which runs the query
• Alternative ordering methods can be applied
• A MatchDecider may be provided to filter out results (in fact, we use that to compute facets)
• Returns an MSet (Match Set), which contains the actual matching Documents
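
The trick of computing facets through the decider can be sketched as follows: the callback sees every candidate match, so it can tally facet values as a side effect while accepting everything. The `run_enquire` function, the dict-based documents, and the sample data are hypothetical and only model the flow; they are not the Xapian Enquire API.

```python
from collections import Counter

def run_enquire(docs, matches_query, decider=None, sort_key=None):
    """Return the list of matching documents (a stand-in for an MSet)."""
    mset = []
    for doc in docs:
        if not matches_query(doc):
            continue
        # The decider may veto a match; it is also a hook that sees each one.
        if decider is not None and not decider(doc):
            continue
        mset.append(doc)
    if sort_key is not None:
        mset.sort(key=sort_key)   # alternative ordering instead of relevance
    return mset

facet_counts = Counter()

def counting_decider(doc):
    """Accept every match, counting the 'type' facet along the way."""
    facet_counts[doc["type"]] += 1
    return True

# Hypothetical indexed documents.
docs = [
    {"id": 3, "type": "article", "terms": {"search", "engine"}},
    {"id": 1, "type": "book", "terms": {"search"}},
    {"id": 2, "type": "article", "terms": {"search", "ranking"}},
    {"id": 4, "type": "article", "terms": {"repository"}},
]
mset = run_enquire(
    docs,
    matches_query=lambda doc: "search" in doc["terms"],
    decider=counting_decider,
    sort_key=lambda doc: doc["id"],
)
```

After the run, `facet_counts` holds the occurrences per type for the matched documents only, which is the information a facet panel needs.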

Final words
• EPrints implementation…
  ◦ architecture overview
  ◦ documentation
  ◦ advice for implementation
• Questions?