Empowering EPrints Search with Xapian

Presentation transcript:

Empowering EPrints Search with Xapian
Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC
EPrints for Administrators Training @ University of Southampton 28th September 2011
EPrints Services, Web & Internet Science (WAIS) Research Group, Electronics & Computer Science, University of Southampton 2011.

Summary
- Review of EPrints Internal Search
- Indexing
- Searching
- Extras
- TO-DO’s
- Using & contributing
- Demo(s)

EPrints “Internal” Search – Overview
[Class diagram; labels: Search, DataSet, List, MetaField, Field, Condition; associations marked with multiplicities 1 and 1..n]

EPrints “Internal” Search – Overview (2)
- match = “EX” queries the main & auxiliary dataset tables
- match = “IN” queries the __rindex dataset table
- ordering is done via the __ordervalues_$langid dataset table

EPrints “Internal” Search – Downsides
- Simple search is not scalable
- Lots of derived data in the DB (backup?)
- No relevance matching -> good matches do not surface
- No advanced features: suggestions, facets, boolean op’s etc.
- Home-brewed: hard to maintain the code, hard to extend
- Difficult to debug…

EPrints Xapian Search
- Introduced in 3.3
- Only integrated with the simple search
- Little flexibility in controlling what is indexed
- Advanced features “not really” enabled
- Searches every field (“text_index” not respected)
- But the idea is good & worth building upon

Indexing
- Attempts to re-use EPrints’ default configuration:
  - datasets’ field definition (+ “text_index”)
  - fields defined in the simple search (un-prefixed terms)
- But needs its own bits to define:
  - default indexing methods (by MetaField type)
  - facet-able indexes
  - order-able indexes
- May be used to declare derived indexes – examples:
  - “open_access”: to filter references from open full-text documents
  - “year”: to filter by year of publication (rather than by date)
  - “image_orientation”: if you had an archive of images, you could extract the orientation via EXIF
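The derived indexes listed above can be pictured as small functions that compute an extra value from a record at indexing time. The following is a minimal Python sketch for illustration only, not EPrints code (EPrints itself is written in Perl). The record layout (`date`, `documents`, `security`), the `DERIVED_INDEXES` registry and the function names are all assumptions.

```python
def derive_year(record):
    """Return the publication year from a 'YYYY-MM-DD' style date, or None."""
    date = record.get("date")
    if not date:
        return None
    return date.split("-")[0]


def derive_open_access(record):
    """Return True if at least one attached document is publicly readable."""
    return any(doc.get("security") == "public"
               for doc in record.get("documents", []))


# Hypothetical registry: derived index name -> function computing its value.
DERIVED_INDEXES = {
    "year": derive_year,
    "open_access": derive_open_access,
}


def derived_values(record):
    """Compute every derived index value for one record, skipping missing ones."""
    values = {}
    for name, func in DERIVED_INDEXES.items():
        value = func(record)
        if value is not None:
            values[name] = value
    return values


record = {
    "date": "2011-09-28",
    "documents": [{"security": "public"}, {"security": "staffonly"}],
}
print(derived_values(record))
```

Keeping each derived index as its own function means a new one (such as an EXIF-based “image_orientation”) could be added to the registry without touching the indexing loop.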

Indexing – Classes
[Class diagram; labels in slide order: Xapian::Index, Config, Xapian DB, IndexMethod, OrderMethod, Fulltext, Name, etc., Alpha., Name, etc.]

Indexing – Extra information
- Indexes are prefixed by “_”, e.g. “_title”, so we can sanitise the user query; otherwise users could do prefixed searches (and search fields they are not necessarily allowed to search)
- Z notation: indicates a stemmed value or index: Z_title, Zhappi (internal Xapian convention)
- Script available to re-process the Xapian indexes (similar to “epadmin reindex” but doesn’t re-index EPrints’ internal indexes)
- Reserved indexes:
  - _id: keeps the internal id of the data-obj (/id/eprint/123)
  - _dataset: which dataset the record belongs to (‘eprint’, ‘user’…)
  - _configuration_md5: keeps an MD5 of the conf. the item was indexed against (useful?)
  - _index_timestamp: when the item was last indexed
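One way to read the point about prefixes: because every internal index name starts with “_”, a user-typed prefix can be checked against a whitelist and rewritten, and anything else loses its prefix. The sketch below is an illustrative Python interpretation, not the plugin's actual logic; `ALLOWED_FIELDS` and `sanitise_query` are hypothetical names.

```python
# Hypothetical whitelist of fields that users may search with a prefix.
ALLOWED_FIELDS = {"title", "creators"}


def sanitise_query(query, allowed=ALLOWED_FIELDS):
    """Rewrite 'field:text' to '_field:text' for allowed fields only.

    Any other prefix (including reserved ones such as '_id' or '_dataset')
    is dropped and only the trailing text is kept as an un-prefixed term.
    """
    out = []
    for token in query.split():
        parts = token.split(":")
        text = parts[-1]  # the text after the last colon, or the whole token
        if len(parts) > 1 and parts[0] in allowed and text:
            out.append("_" + parts[0] + ":" + text)
        elif text:
            out.append(text)
    return " ".join(out)


print(sanitise_query("title:xapian _dataset:user search"))
```

Taking only the text after the last colon means a token such as `x:_id:123` cannot smuggle a reserved prefix through.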

Searching
- Again, attempts to re-use EPrints’ configuration:
  - simple search (mostly for ordering methods)
  - advanced/staff search: which fields to use (prefixed terms)
- Extra bits can be configured such as which facets can be used on each search (simple, advanced, …)
- Only indexed stuff can be searched:
  - you cannot use a facet which has not been generated
  - you need to re-index your data if you change the simple search def.
  - same if you add new order-able fields
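The constraint that only indexed data can be searched suggests two simple checks: filter requested facets against those actually generated, and detect a stale index by comparing a configuration fingerprint (in the spirit of the reserved `_configuration_md5` index). The Python sketch below is illustrative only; the configuration layout and all function names are assumptions, not EPrints APIs.

```python
import hashlib
import json


def configuration_md5(config):
    """Return an MD5 hex digest of a canonical JSON form of the configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_reindex(stored_md5, current_config):
    """True if the item was indexed against a different configuration."""
    return stored_md5 != configuration_md5(current_config)


def usable_facets(requested, indexed):
    """Keep only the requested facets that were generated at indexing time."""
    return [name for name in requested if name in indexed]


# Hypothetical configuration: simple search fields plus facet-able indexes.
config = {"simple_fields": ["title", "abstract"], "facets": ["year", "type"]}
stored = configuration_md5(config)

config["simple_fields"].append("creators")  # the simple search definition changes
print(needs_reindex(stored, config))
print(usable_facets(["year", "subjects"], set(config["facets"])))
```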

Searching (2)
- Abstracted by Plugin::Search (original implementation)
- Tricky to make it work with EPrints’ UI because it expects an EPrints::Search object
- Plugin::Search::Internal is a wrapped EPrints::Search object (hack), so Plugin::Search::Xapian must emulate this behaviour

Searching – Classes & Op. Stack
[Diagram; labels in slide order: /cgi/xapian, Search::XapianSearch, Paginate::Facets, Xapian::Facets, Plugin::Search::Xapian, Xapian DB]

Searching – Extra information
- May be used in a script
- Exports & feeds work
- Can be serialised/de-serialised (including facets) so should work for Saved Searches (to test)
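Serialising a search, facets included, amounts to a lossless round trip between the search state and a string that can be stored with a saved search. The Python sketch below shows that idea with JSON; the `SearchState` class and its fields are hypothetical and do not reflect the format the plugin actually uses.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class SearchState:
    """Hypothetical search state: query text, ordering and selected facets."""
    query: str
    order: str = ""
    facets: dict = field(default_factory=dict)  # facet name -> selected values

    def serialise(self):
        """Return a stable JSON string describing this search."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def deserialise(cls, text):
        """Rebuild a SearchState from a string produced by serialise()."""
        return cls(**json.loads(text))


state = SearchState("xapian", order="-date", facets={"year": ["2011"]})
restored = SearchState.deserialise(state.serialise())
print(restored == state)
```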

Extras
- “Related Items”
- Jiadi has developed a Bootstrap-based Pagination module:
  - more sexy
  - supports alternative “views” of the search results

TO-DO’s
- Range searching: possible in Xapian but not yet implemented (e.g. 1..10)
- Some refactoring:
  - Xapian::Index -> Xapian::Indexer
  - Plugin::Search::Xapianv2 => Plugin::Search::Xapian (and replace the default EPrints’ Xapian implementation)
- Test with real life data (done to a certain extent...)
- Load & scalability testing (+ number of slots etc.)
- Multi-lang considerations (and related IndexMethod)
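To make the range-searching item concrete, the sketch below parses the `1..10` notation into numeric bounds and tests a value against them. It is plain Python written for illustration and does not call Xapian; `parse_range` and `in_range` are hypothetical helpers, and support for open-ended ranges (`2005..`, `..10`) is an assumption.

```python
def parse_range(term):
    """Parse 'a..b', 'a..' or '..b' into (low, high); return None otherwise."""
    if ".." not in term:
        return None
    low, _, high = term.partition("..")
    try:
        low_val = int(low) if low else None
        high_val = int(high) if high else None
    except ValueError:
        return None  # non-numeric bounds are not treated as a range here
    if low_val is None and high_val is None:
        return None
    if low_val is not None and high_val is not None and low_val > high_val:
        return None
    return (low_val, high_val)


def in_range(value, bounds):
    """True if value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return (low is None or value >= low) and (high is None or value <= high)


bounds = parse_range("1..10")
print(bounds, in_range(5, bounds), in_range(11, bounds))
```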

TO-DO’s – Would be nice
- Page displaying how a data-obj has been indexed:
  - prefixes
  - terms
  - facets & order-able fields
- Status page (cf. “Admin > Status”):
  - DB size
  - number of Documents
  - indexed datasets (and how)
- Weighting: supported (via conf.) but un-tested in real life

Internal Search vs Xapian Search
- Xapian is more of a user search
- The internal search is still required to:
  - get records from the Database ($dataset->search())
  - this affects screens such as “Manage Deposits”, the “Review” etc. which cannot wait for items to be indexed (direct DB calls)
  - may be needed to apply ACL’s (if some items cannot be searched): safer to use the (MySQL) DB as authority

Debugging Xapian
- Plugin::Search::Xapian may be set to debug mode: shows processing and query building
- Xapian comes with an analysis tool, “delve”, to:
  - view the content of the Xapian DB or some selected Documents
  - see if a term exists in the DB (and in which Documents)
  - other info (term frequency etc.)
- Knowing what Xapian is searching and how a data-obj is indexed is key to debugging most search-related issues

Using & Contributing
- Not quite at release stage, but it is currently isolated so it shouldn’t break your IR
- All the code is on GitHub: https://github.com/eprints/xapianv2

Demos
- http://puffin.ecs.soton.ac.uk/cgi/xapian
  - Simple search / facets / export / order
  - Simple search with boolean op’s, suggestion
  - Advanced search / facets / export / order
  - Related items
- http://vmdev1.eprints.org/cgi/xapian (more data + cached citations)
- http://vmdev1.eprints.org/cgi/xapian_status

Q&A & what’s next
- Let’s have a play?
- Code overview?
- Doc?