Sébastien François, EPrints Lead Developer EPrints Developer Powwow, ULCC.


1 Sébastien François, EPrints Lead Developer EPrints Developer Powwow, ULCC

2 Presentation
• Open-source search engine library
• Written in C++ (we use the Perl bindings)
• Uses the BM25 ranking function for relevance matching
• “Scales well”: 100+ million documents
• Oh… code that we don’t need to maintain!
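
To give a feel for what a BM25 ranking function computes, here is a minimal sketch of the textbook formula in plain Python. It does not use the Xapian bindings and is not Xapian's exact weighting scheme (which exposes additional parameters); the function name `bm25_scores`, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by document length.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "open source search engine library".split(),
    "search search ranking function".split(),
    "repository software".split(),
]
scores = bm25_scores(["search"], docs)
```

The second document mentions "search" twice in fewer words, so it ranks above the first; the third does not contain the term and scores zero.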

3 Core Concepts
• Database
• Document
  ◦ data
  ◦ terms
  ◦ values
• (Xapian) Metadata management
• Searching
• Are you ready for it?

4 Core Concepts: Database
• Collection of files storing indexes, positions, term frequencies, …
• One write-lock, multiple read-locks
• Stored in archives/ /var/xapian/
• Supports multiple databases (unused in EPrints)
• Can store arbitrary metadata

5 Core Concepts: Document
• A Document is an item returned by a search
• So it’s also the meaty bit of indexing
• Maps to a single data-obj in EPrints
• Has three main components:
  ◦ data
  ◦ terms
  ◦ values

6 Core Concepts: Document Data
• Arbitrary blob of data
• Unprocessed by Xapian
• Used to store information needed to display the results
• Used to store the data-obj identifier in EPrints in order to quickly build EPrints::List objects
• Could be used to store more complex data: cached citations, JSON/Perl representation of the data-obj
• Limit ~100MB per Document

7 Core Concepts: Document Terms
• Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB
• Three main types of terms:
  ◦ Un-prefixed terms: can be seen as a general pool of indexed terms
  ◦ Prefixed terms: allow searching a subset of the information (title, authors…)
  ◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)

8 Core Concepts: Document Terms (2)
• Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; a value appears 0 or 1 times in a Document.
• Textual data - TermGenerator class:
  ◦ Provides the Stemmer and Stopper classes (note: language-dependent)
  ◦ Spelling correction
  ◦ Exact matching (“hello world”) and the termpos joys

9 Core Concepts: Document Terms (3)
• Unprefixed terms are used for the simple search
• Prefixed terms are used for a field-based search (such as the advanced search)
• Boolean terms are used for any identifier-type field; this includes facets (when searching)
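
The three kinds of terms can be sketched as follows. This is plain Python rather than the Xapian or EPrints API, and the prefixes (`S`, `A`, `XTYPE`, `XSUBJ`), the stopword list, and the `terms_for` helper are all hypothetical choices made for the illustration; stemming is left out to keep it short.

```python
STOPWORDS = {"a", "an", "and", "of", "the"}

# Hypothetical prefixes, chosen for this sketch only.
FIELD_PREFIXES = {"title": "S", "creators_name": "A"}
BOOLEAN_PREFIXES = {"type": "XTYPE", "subjects": "XSUBJ"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords (no stemming here)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def terms_for(record):
    """Return (free_text_terms, boolean_terms) for one record."""
    free_text, boolean = [], []
    for field, prefix in FIELD_PREFIXES.items():
        for word in tokenize(record.get(field, "")):
            free_text.append(word)            # un-prefixed: simple search
            free_text.append(prefix + word)   # prefixed: field-based search
    for field, prefix in BOOLEAN_PREFIXES.items():
        for value in record.get(field, []):
            # Exact value, no text processing at all.
            boolean.append(f"{prefix}:{value}")
    return free_text, boolean

record = {
    "title": "The Crab Nebula",
    "creators_name": "Smith",
    "type": ["article"],
    "subjects": ["PM"],
}
free_text, boolean = terms_for(record)
```

Each word of a text field ends up in the index twice, once in the general pool and once under its field prefix, while identifier-like fields only produce exact boolean terms.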

10 Core Concepts: Document Values
• “Search helpers”: we use them for ordering and faceting (occurrences & available facets)
• Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)
• Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata
• eprint.creators_name.en (1000000) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

11 Core Concepts: Document Values (2)
• eprint.facet.type.0 (1500300) is the 1st slot for a facet “type” on the dataset eprint
• Used by the MultiValueSorter class to order data (when not ordered by relevance)
• Used to find out the available facets (after a search) and the occurrences of the values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’
• The Xapian documentation advises keeping the number of values low (they slow down searching)
• We usually limit the number of slots for a facet to 5

12 Core Concepts: Metadata management
• We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us)
• EPrints reserves 1 000 000 slots per dataset:
  ◦ 500 000 for order-values (1 per orderable field)
  ◦ 500 000 for facet slots (1 per facetable value)
• EPrints also stores the current slot offsets to know:
  ◦ where the range for the next dataset starts
  ◦ where the next order-value slot is
• EPrints also stores some other useful information as Metadata
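
The bookkeeping described above can be sketched as a small allocator. This is an illustration in plain Python, not the EPrints implementation: the `SlotAllocator` class, the first base of 1 000 000, and the order in which slots are handed out are assumptions, so the numbers it produces need not match those stored in a real database. In EPrints the mapping lives in the Xapian metadata; here it is just a dictionary.

```python
ORDER_SLOTS = 500_000                          # first half: order-values
FACET_SLOTS = 500_000                          # second half: facet slots
SLOTS_PER_DATASET = ORDER_SLOTS + FACET_SLOTS  # 1 000 000 per dataset

class SlotAllocator:
    """Hands out slot numbers and remembers the name-to-slot mapping."""

    def __init__(self, first_base=1_000_000):  # assumed starting offset
        self.next_base = first_base   # where the next dataset's range starts
        self.ranges = {}              # dataset -> base and next free slots
        self.mapping = {}             # e.g. "eprint.creators_name.en" -> slot

    def _range(self, dataset):
        if dataset not in self.ranges:
            base = self.next_base
            self.ranges[dataset] = {
                "base": base,
                "order": base,                 # next free order-value slot
                "facet": base + ORDER_SLOTS,   # next free facet slot
            }
            self.next_base = base + SLOTS_PER_DATASET
        return self.ranges[dataset]

    def order_slot(self, dataset, field, lang):
        key = f"{dataset}.{field}.{lang}"
        if key not in self.mapping:
            r = self._range(dataset)
            if r["order"] >= r["base"] + ORDER_SLOTS:
                raise RuntimeError("order-value range exhausted")
            self.mapping[key] = r["order"]
            r["order"] += 1
        return self.mapping[key]

    def facet_slots(self, dataset, field, count=5):
        keys = [f"{dataset}.facet.{field}.{i}" for i in range(count)]
        if keys[0] not in self.mapping:
            r = self._range(dataset)
            if r["facet"] + count > r["base"] + SLOTS_PER_DATASET:
                raise RuntimeError("facet range exhausted")
            for key in keys:
                self.mapping[key] = r["facet"]
                r["facet"] += 1
        return [self.mapping[k] for k in keys]

alloc = SlotAllocator()
creators = alloc.order_slot("eprint", "creators_name", "en")
type_slots = alloc.facet_slots("eprint", "type")
user_name = alloc.order_slot("user", "name", "en")
```

Asking for the same name twice returns the same slot, which is the property that matters: indexing and searching must agree on where each value lives.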

13 Core Concepts: Metadata management (2)

14 Core Concepts: Searching
• Reverse process of indexing
• Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators
• $query = new Query( "hello" )
  $query = new Query( AND, $query, "world" )
• Can be stringified to see how the query is interpreted (easier to read than SQL!)
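
The idea of a query tree that can be stringified can be sketched with a toy class. The `Query` class below is hypothetical and only mimics the shape of the pseudocode on the slide; it is not the Xapian class, and the string format it prints is not Xapian's.

```python
AND, OR = "AND", "OR"

class Query:
    """A leaf holding one term, or an operator node with child queries."""

    def __init__(self, *args):
        if len(args) == 1:
            # Leaf: a single term.
            self.op, self.term, self.children = None, args[0], []
        else:
            # Operator node: wrap bare strings into leaf queries.
            self.op, self.term = args[0], None
            self.children = [
                a if isinstance(a, Query) else Query(a) for a in args[1:]
            ]

    def __str__(self):
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"

query = Query("hello")
query = Query(AND, query, "world")
print(query)  # (hello AND world)
```

Because each node only knows its operator and children, larger queries are built by wrapping existing ones, and printing the root shows how the whole tree will be interpreted.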

15 Core Concepts: Searching - QueryParser
• Parses user queries
• Supports:
  ◦ wildcards: wild* will match wildcat
  ◦ boolean operators: pear AND (red OR green NOT blue)
  ◦ love/hate operators: crab +nebula -crustacean
  ◦ exact match: “lorem ipsum”
  ◦ synonyms: colour/color, realise/realize
  ◦ stemming: happiness/happy -> happi
  ◦ suggestions: may provide a corrected query
• Features can be turned on/off (all are enabled on EPrints)
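
As a rough illustration of the love/hate operators, the sketch below splits a query into required, excluded, and optional terms and checks a document against them. It is plain Python with hypothetical helper names (`parse_love_hate`, `matches`), and it only approximates the behaviour of a real query parser: no stemming, no ranking, and optional terms only matter for matching when nothing is required.

```python
def parse_love_hate(query):
    """Split a query into (required, excluded, optional) term lists."""
    required, excluded, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])    # "love": must be present
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])    # "hate": must be absent
        else:
            optional.append(token)        # may be present
    return required, excluded, optional

def matches(doc_terms, query):
    """True if the document's terms satisfy the love/hate constraints."""
    required, excluded, optional = parse_love_hate(query)
    terms = set(doc_terms)
    if any(t in terms for t in excluded):
        return False
    if required:
        return all(t in terms for t in required)
    return any(t in terms for t in optional)

query = "crab +nebula -crustacean"
parsed = parse_love_hate(query)
```

With this query, a document about the Crab Nebula matches, a document that also mentions crustaceans is rejected, and a document that only mentions crabs does not match because "nebula" is required.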

16 Core Concepts: Search - Enquire
• The object which runs the query
• Alternative ordering methods can be applied
• A MatchDecider method may be provided to filter out results (in fact, we use that to compute facets)
• Returns an MSet (Match Set) which contains the actual matching Documents
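
The trick of using a match decider to compute facets can be sketched like this: the decider is called for every candidate document, always accepts it, and tallies the facet values as a side effect. The `FacetCounter` class and the `run_query` stand-in are hypothetical and written in plain Python; they only imitate the control flow, not the Xapian API.

```python
from collections import Counter

class FacetCounter:
    """Callable that accepts every document and tallies its facet values."""

    def __init__(self, facets):
        self.counts = {facet: Counter() for facet in facets}

    def __call__(self, doc):
        for facet, counter in self.counts.items():
            for value in doc.get(facet, []):
                counter[value] += 1
        return True  # never filters anything out

def run_query(docs, is_match, decider):
    """Stand-in for the search: keep documents that match and pass the decider."""
    return [d for d in docs if is_match(d) and decider(d)]

docs = [
    {"id": 1, "type": ["article"], "date": ["2013"]},
    {"id": 2, "type": ["article"], "date": ["2012"]},
    {"id": 3, "type": ["book"], "date": ["2013"]},
    {"id": 4, "type": ["article"], "date": ["2013"]},
]
counter = FacetCounter(["type", "date"])
results = run_query(docs, lambda d: True, counter)
```

After the search, `counter.counts` holds the occurrences per facet value (here three items of type "article" and three dated "2013") without a second pass over the results.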

17 Final words
• http://xapian.org
  ◦ architecture overview
  ◦ documentation
  ◦ advice for implementation
• Questions?
• EPrints implementation…

