Sébastien François, EPrints Lead Developer. EPrints Developer Powwow, ULCC.

Empowering EPrints Search with Xapian
Presentation transcript:


Presentation
• Open Source Search Engine Library
• Written in C++ (we use the Perl bindings)
• Uses the BM25 ranking function, which provides the relevance matching
• “Scales well”: 100+ million documents
• Oh… code that we don’t need to maintain!
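To make the BM25 bullet concrete, here is a minimal sketch of the textbook BM25 formula in plain Python. It is not Xapian's implementation: the parameter values `k1=1.2` and `b=0.75`, the function name `bm25_scores` and the sample documents are assumptions made for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Long documents are penalised through the length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents, already split into terms.
docs = [
    "open access repository software".split(),
    "search engine library for repository search".split(),
    "cooking with crab".split(),
]
scores = bm25_scores(["repository", "search"], docs)
```

The second document matches both query terms and should rank first; the third matches neither and scores zero.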

Core Concepts
• Database
• Document
  ◦ data
  ◦ terms
  ◦ values
• (Xapian) Metadata management
• Searching
• Are you ready for it?

Core Concepts: Database
• Collection of files storing indexes, positions, term frequencies, …
• One write-lock, multiple read-locks
• Stored in archives/ /var/xapian/
• Supports multiple databases (unused in EPrints)
• Can store arbitrary metadata

Core Concepts: Document
• A Document is an item returned by a search
• So it’s also the meaty bit of indexing
• Maps to a single data-obj in EPrints
• Has three main components:
  ◦ data
  ◦ terms
  ◦ values

Core Concepts: Document Data
• Arbitrary blob of data
• Unprocessed by Xapian
• Used to store the information needed to display the results
• In EPrints, used to store the data-obj identifier in order to quickly build EPrints::List objects
• Could be used to store more complex data: cached citations, a JSON/Perl representation of the data-obj
• Limit ~100MB per Document

Core Concepts: Document Terms
• Basis of relevance search: a search is a process of comparing the terms specified by a Query against the terms in the DB
• Three main types of terms:
  ◦ Un-prefixed terms: can be seen as a general pool of indexed terms
  ◦ Prefixed terms: allow searching a subset of the information (title, authors…)
  ◦ Boolean terms: used to index identifiers (which don’t add any useful information to the probabilistic indexes)
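The three kinds of terms can be sketched as follows. This is a toy model in plain Python rather than the Xapian bindings, and the prefixes (`S`, `A`, `XTYPE`, `XSUBJ`), the field names and the `terms_for` helper are hypothetical, chosen only for this example.

```python
import re

# Hypothetical prefixes; the ones EPrints actually uses are not given in the slides.
TEXT_PREFIXES = {"title": "S", "creators_name": "A"}
BOOLEAN_PREFIXES = {"type": "XTYPE", "subjects": "XSUBJ"}


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def terms_for(record):
    """Return the set of index terms for one record (a plain dict)."""
    terms = set()
    for field, prefix in TEXT_PREFIXES.items():
        for word in tokenize(record.get(field, "")):
            terms.add(word)           # un-prefixed: general pool
            terms.add(prefix + word)  # prefixed: field-restricted search
    for field, prefix in BOOLEAN_PREFIXES.items():
        for value in record.get(field, []):
            # Boolean: exact identifier, no text processing at all.
            terms.add(prefix + ":" + value)
    return terms


record = {
    "title": "Crab Nebula",
    "creators_name": "Smith",
    "type": ["article"],
    "subjects": ["PM"],
}
terms = terms_for(record)
```

In this model a search for `crab` hits the general pool, `Scrab` only matches titles, and `XTYPE:article` acts as an exact filter that never enters the general pool.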

Core Concepts: Document Terms (2)
• Boolean terms are useful for filtering on exact values (e.g. subjects:PM, type:article, …). No text processing is involved; a value appears 0 or 1 times in a Document.
• Textual data - TermGenerator class:
  ◦ Provides the Stemmer and Stopper classes (note: language-dependent)
  ◦ Spelling correction
  ◦ Exact matching (“hello world”) and the termpos joys
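Exact matching depends on term positions being recorded at indexing time. The sketch below shows the idea with a toy positional index in plain Python; `index_positions` and `matches_phrase` are illustrative names, not part of any Xapian API.

```python
def index_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = {}
    for pos, term in enumerate(tokens):
        positions.setdefault(term, []).append(pos)
    return positions


def matches_phrase(positions, phrase):
    """True if the phrase terms occur at consecutive positions."""
    if not phrase:
        return False
    # Try every occurrence of the first term as the start of the phrase.
    return any(
        all(start + offset in positions.get(term, ())
            for offset, term in enumerate(phrase))
        for start in positions.get(phrase[0], ())
    )


doc = index_positions("say hello world to the world".split())
```

Without the stored positions, `hello world` and `world hello` would be indistinguishable, since both documents contain the same two terms.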

Core Concepts: Document Terms (3)
• Unprefixed terms are used for the simple search
• Prefixed terms are used for a field-based search (such as the advanced search)
• Boolean terms are used for any identifier-type fields; this includes facets (when searching)

Core Concepts: Document Values
• “Search helpers”: we use them for ordering and faceting (occurrences & available facets)
• Each value (e.g. an order-value, a facet value) is stored in a numbered slot (32-bit integer)
• Mappings between a meaningful string and a slot are stored in the Xapian DB as metadata
• eprint.creators_name.en ( ) is the slot for the order-value for the field “creators_name” on the dataset “eprint” for English

Core Concepts: Document Values (2)
• eprint.facet.type.0 ( ) is the 1st slot for a facet “type” on the dataset eprint
• Used by the MultiValueSorter class to order data (when not ordered by relevance)
• Used to find out the available facets (after a search) and the occurrences of the values, e.g. there are 3 items of type ‘article’, 14 items of date ‘2013’
• Xapian documentation advises keeping the number of values low (they slow down searching)
• We usually limit the number of slots for a facet to 5
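Ordering by several value slots can be pictured as a multi-key sort. The following is a plain-Python sketch of that idea, not the MultiValueSorter class itself; the slot numbers, the document layout and `sort_by_slots` are assumptions for illustration.

```python
def sort_by_slots(docs, keys):
    """Order docs by several value slots; keys is a list of (slot, reverse)."""
    ordered = list(docs)
    # Python's sort is stable, so apply the least significant key first.
    for slot, reverse in reversed(keys):
        ordered.sort(key=lambda d: d["values"].get(slot, ""), reverse=reverse)
    return ordered


# Made-up slot numbers for a date order-value and a name order-value.
DATE_SLOT, NAME_SLOT = 0, 1

docs = [
    {"id": 1, "values": {DATE_SLOT: "2012", NAME_SLOT: "smith"}},
    {"id": 2, "values": {DATE_SLOT: "2013", NAME_SLOT: "jones"}},
    {"id": 3, "values": {DATE_SLOT: "2013", NAME_SLOT: "adams"}},
]

# Newest first, then by name.
ordered = sort_by_slots(docs, [(DATE_SLOT, True), (NAME_SLOT, False)])
ordered_ids = [d["id"] for d in ordered]
```

Because the values are compared as stored strings, order-values have to be written in a form that sorts correctly as text.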

Core Concepts: Metadata management
• We need to keep track of our slot mappings in the Xapian Database (not done by Xapian for us)
• EPrints reserves slots per dataset:
  ◦ for order-values (1 per orderable field)
  ◦ for facet slots (1 per facetable value)
• EPrints also stores the current slot offsets to know:
  ◦ where the range for the next dataset starts
  ◦ where the next order-value slot is
• EPrints also stores some other useful information as Metadata
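The bookkeeping described above might look roughly like the sketch below. It is a guess at the general shape only: the `SlotRegistry` class, the sequential numbering and the reservation order are assumptions, and a Python dict stands in for the metadata stored in the Xapian database. The key names follow the pattern shown on the Document Values slides.

```python
class SlotRegistry:
    """Toy name -> slot bookkeeping; a dict stands in for database metadata."""

    def __init__(self):
        self.metadata = {}
        self.next_slot = 0  # offset where the next reserved range starts

    def _assign(self, key):
        # Reuse an existing mapping so that a slot never moves once assigned.
        if key not in self.metadata:
            self.metadata[key] = self.next_slot
            self.next_slot += 1
        return self.metadata[key]

    def reserve(self, dataset, order_fields, facets, lang="en", slots_per_facet=5):
        """Reserve one slot per orderable field and a few slots per facet."""
        for field in order_fields:
            self._assign(f"{dataset}.{field}.{lang}")
        for facet in facets:
            for i in range(slots_per_facet):
                self._assign(f"{dataset}.facet.{facet}.{i}")


registry = SlotRegistry()
registry.reserve("eprint", ["creators_name", "title"], ["type"])
registry.reserve("user", ["name"], [])
```

Storing the running offset next to the mappings is what lets a later dataset start its range without colliding with slots that are already in use.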

Core Concepts: Metadata management (2)

Core Concepts: Searching
• Reverse process of indexing
• Composed of a tree of Query objects (and sometimes a QueryParser object) linked by boolean operators
• $query = new Query( "hello" )
  $query = new Query( AND, $query, "world" )
• Can be stringified to see how the query is interpreted (easier to read than SQL!)
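The query tree and its stringified form can be modelled in a few lines. This toy `Query` class is written in plain Python for illustration; it mirrors the shape of the two calls on the slide but is not the Xapian class, and it only supports the `AND` and `OR` operators.

```python
class Query:
    """Toy query tree: a leaf holds a term, a branch holds an operator."""

    def __init__(self, *args):
        if len(args) == 1:
            self.op, self.term, self.children = None, args[0], []
        else:
            op, *children = args
            if op not in ("AND", "OR"):
                raise ValueError(f"unsupported operator: {op}")
            self.op, self.term = op, None
            # Bare strings are wrapped into leaf queries.
            self.children = [c if isinstance(c, Query) else Query(c)
                             for c in children]

    def __str__(self):
        if self.op is None:
            return self.term
        return "(" + f" {self.op} ".join(str(c) for c in self.children) + ")"

    def matches(self, doc_terms):
        """Evaluate the tree against the set of terms of one document."""
        if self.op is None:
            return self.term in doc_terms
        results = [c.matches(doc_terms) for c in self.children]
        return all(results) if self.op == "AND" else any(results)


query = Query("hello")
query = Query("AND", query, "world")
description = str(query)
```

Printing the tree is a quick way to check how a query was interpreted before running it.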

Core Concepts: Searching - QueryParser
• Parses user queries
• Supports:
  ◦ wildcards: wild* will match wildcat
  ◦ boolean operators: pear AND (red OR green NOT blue)
  ◦ love/hate operators: crab +nebula -crustacean
  ◦ exact match: “lorem ipsum”
  ◦ synonyms: colour/color, realise/realize
  ◦ stemming: happiness/happy -> happi
  ◦ suggestions: may provide a corrected query
• Features can be turned on/off (all are enabled in EPrints)
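As an illustration of the love/hate operators, here is a toy parser in plain Python. The semantics are an assumption made for this sketch: hated terms exclude a document, loved terms are all required, and plain terms are optional unless no loved term is present, in which case at least one must match. The real QueryParser handles many more cases.

```python
def split_love_hate(query):
    """Split a query string into (required, optional, excluded) terms."""
    required, optional, excluded = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, optional, excluded


def matches(query, doc_terms):
    """Apply the assumed love/hate semantics to one document's term set."""
    required, optional, excluded = split_love_hate(query)
    if any(term in doc_terms for term in excluded):
        return False
    if required:
        # Optional terms would only influence ranking, which is not modelled here.
        return all(term in doc_terms for term in required)
    return any(term in doc_terms for term in optional)


parts = split_love_hate("crab +nebula -crustacean")
```

With the slide's example, a document about the Crab Nebula matches, while one that mentions crustaceans is dropped even if it contains both other words.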

Core Concepts: Search - Enquire
• The object which runs the query
• Alternative ordering methods can be applied
• A MatchDecider method may be provided to filter out results (in fact, we use that to compute facets)
• Returns an MSet (Match Set) which contains the actual matching Documents
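The remark about computing facets with a match decider can be sketched as follows: a callback that sees every candidate match can tally values as a side effect while accepting everything. This is a plain-Python model under that assumption; `enquire`, `FacetCounter`, the slot number and the sample documents are hypothetical and do not reproduce the Xapian classes.

```python
from collections import Counter


def enquire(docs, query_terms, decider=None, offset=0, limit=10):
    """Return one page of ids of docs containing all query terms."""
    matched = []
    for doc in docs:
        if not query_terms <= doc["terms"]:
            continue
        # The decider sees every candidate, not just the requested page.
        if decider is not None and not decider(doc):
            continue
        matched.append(doc["id"])
    return matched[offset:offset + limit]


class FacetCounter:
    """Decider that never rejects a document but counts one value slot."""

    def __init__(self, slot):
        self.slot = slot
        self.counts = Counter()

    def __call__(self, doc):
        value = doc["values"].get(self.slot)
        if value is not None:
            self.counts[value] += 1
        return True


TYPE_SLOT = 0  # made-up slot number for the facet "type"
docs = [
    {"id": 1, "terms": {"xapian", "search"}, "values": {TYPE_SLOT: "article"}},
    {"id": 2, "terms": {"xapian"}, "values": {TYPE_SLOT: "article"}},
    {"id": 3, "terms": {"xapian", "perl"}, "values": {TYPE_SLOT: "book"}},
    {"id": 4, "terms": {"lucene"}, "values": {TYPE_SLOT: "article"}},
]
facets = FacetCounter(TYPE_SLOT)
page = enquire(docs, {"xapian"}, decider=facets, limit=2)
```

Here the page holds only two results, yet the facet counts cover all three matching documents.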

Final words
  ◦ architecture overview
  ◦ documentation
  ◦ advice for implementation
• Questions?
• EPrints implementation…