Apache Solr Yonik Seeley 29 June 2006 Dublin, Ireland.

Slides:



Advertisements
Similar presentations
Introduction to NHibernate By Andrew Smith. The Basics Object Relation Mapper Maps POCOs to database tables Based on Java Hibernate. V stable Generates.
Advertisements

Lucene Near Realtime Search Jason Rutherglen & Jake Mannix LinkedIn 6/3/2009 SOLR/Lucene Users Group San Francisco.
Site Collection, Sites and Sub-sites
Tomás Fernández Löbbe Plugtree LLC October 7th
Lucene/Solr Architecture
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Advanced Indexing Techniques with
EasySearch Technical Overview. Ever seen a website without a full text search? BUT – Search is expensive Financially Computationally – Search is complicated.
SQL Server Accelerator for Business Intelligence (SSABI)
AskMe A Web-Based FAQ Management Tool Alex Albu. Background Fast responses to customer inquiries – key factor in customer satisfaction Costs for customer.
June 22-23, 2005 Technology Infusion Team Committee1 High Performance Parallel Lucene search (for an OAI federation) K. Maly, and M. Zubair Department.
Solr has a lot of extensive features Solr Integration and Enhancements Todd Hatcher.
G O B E Y O N D C O N V E N T I O N WORF: Developing DB2 UDB based Web Services on a Websphere Application Server Kris Van Thillo, ABIS Training & Consulting.
High Performance Faceted Interfaces Using S2S Eric Rozell, Tetherless World Constellation.
1 Open-Source Search Engines and Lucene/Solr UCSB 290N Tao Yang Slides are based on Y. Seeley, S. Das, C. Hostetter.
Overview of Search Engines
Simple Web SQLite Manager/Form/Report
Implementing search with free software An introduction to Solr By Mick England.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.
ECPRD seminar on the net IX”, Brussels, 2011 Faceted Search Some examples of applied faceted search on websites developed by the EP Jerry.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands slides:
Powerful Full-Text Search with Solr Yonik Seeley Web 2.0 Expo, Berlin 8 November 2007 download at
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
© NYC Apache Lucene/Solr Meetup. Lucid Imagination, Inc. Agenda Welcome "Faster. Better. Solr! What to look for in Solr 1.4“ Yonik Seeley,
“This presentation is for informational purposes only and may not be incorporated into a contract or agreement.”
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć – sematext.com.
4-1 INTERNET DATABASE CONNECTOR Colorado Technical University IT420 Tim Peterson.
OCLC Online Computer Library Center CONTENTdm ® Digital Collection Management Software Ron Gardner, OCLC Digital Services Consultant ICOLC Meeting April.
1. 2 introductions Nicholas Fischio Development Manager Kelvin Smith Library of Case Western Reserve University Benjamin Bykowski Tech Lead and Senior.
Building Search Portals With SP2013 Search. 2 SharePoint 2013 Search  Introduction  Changes in the Architecture  Result Sources  Query Rules/Result.
Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Overview of IU Digital Collections Search Hui Zhang Jon Dunn Indiana University Digital Library Program IU Digital Library Brown Bag October 19, 2011.
SharePoint 2010 Search Architecture The Connector Framework Enhancing the Search User Interface Creating Custom Ranking Models.
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Module 10 Administering and Configuring SharePoint Search.
ICDL 2004 Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer Science Old Dominion University.
Search Dr Ian Boston University of Cambridge Image © University of Cambridge December :30 INTL 6.
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
“ Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and.NET ”
Uwe SchindlerGES 2007 – May 2-4, 2007 Data Information Service based on Open Archives Initiative Protocols and Apache Lucene Uwe Schindler 1, Benny Bräuer.
Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor: Dr. Edward Fox 10/11/2010.
Copyright © 2006 Pilothouse Consulting Inc. All rights reserved. Search Overview Search Features: WSS and Office Search Architecture Content Sources and.
Design a full-text search engine for a website based on Lucene
807 - TEXT ANALYTICS Massimo Poesio Lab 2: (Quick intro to) SOLR Document clustering with MAHOUT.
Clusterpoint Margarita Sudņika ms RDBMS & NoSQL Databases & tables → Document stores Columns, rows → Schemaless documents Scales UP → Scales UP.
Feb 24-27, 2004ICDL 2004, New Dehli Improving Federated Service for Non-cooperating Digital Libraries R. Shi, K. Maly, M. Zubair Department of Computer.
Lucene Jianguo Lu.
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Breeda Herlihy, IR Manager, UCC Library. UCC selected DSpace in 2008 Software selection group Staff from Library IT, Computer Centre, Special Collections,
Searching and Indexing
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Open Source distributed document DB for an enterprise
Safe by default, optimized for efficiency
Detailed search stats from DSpace Solr
Building Search Systems for Digital Library Collections
The Re3gistry software and the INSPIRE Registry
CS6604 Digital Libraries IDEAL Webpages Presented by
Lucene in action Information Retrieval A.A
– JukeBox – transparency, flexibility, speed and comfort!
Introduction of Week 11 Return assignment 9-1 Collect assignment 10-1
Lucene/Solr Architecture
Getting Started With Solr
Rafał Kuć – Sematext sematext.com
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20
Bryan Soltis – Kentico Technical Evangelist
Intro to Azure Search Julie Smith 2019.
Intro to Azure Search Julie Smith 2019.
Presentation transcript:

Apache Solr Yonik Seeley 29 June 2006 Dublin, Ireland

1 History Search for a replacement search platform commercial: high license fees open-source: no full solutions CNET grants code to Apache, Solr enters Incubator 17 Jan 2006 Solr is a Lucene sub-project Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de

2 Lucene Refresher Lucene is a full-text search library Add documents to an index via IndexWriter A document is a a collection of fields No config files, dynamic field typing Flexible text analysis – tokenizers, filters Search for documents via IndexSearcher Hits = search(Query,Filter,Sort,topN) Scoring: tf * idf * lengthNorm

3 What Is Solr A full text search server based on Lucene XML/HTTP Interfaces Loose Schema to define types and fields Web Administration Interface Extensive Caching Index Replication Extensible Open Architecture Written in Java5, deployable as a WAR

4 Solr Core Architecture Lucene Admin Interface Standard Request Handler Disjunction Max Request Handler Custom Request Handler Update Handler Caching XML Update Interface Config Analysis HTTP Request Servlet Concurrency Update Servlet XML Response Writer Replication Schema

5 Adding Documents HTTP POST to /update Apache Solr An intro... search lucene Solr is a full...

6 Deleting Documents Delete by Id Delete by Query (multiple documents) manufacturer:microsoft

7 Commit makes changes visible closes IndexWriter removes duplicates opens new IndexSearcher newSearcher/firstSearcher events cache warming “register” the new IndexSearcher same as commit, merges all index segments.

8 Default Query Syntax Lucene Query SyntaxLucene Query Syntax [; sort specification] 1.mission impossible; releaseDate desc 2.+mission +impossible –actor:cruise 3.“mission impossible” –actor:cruise 4.title:spiderman^10 description:spiderman 5.description:“spiderman movie”~10 6.+HDTV +weight:[0 TO 100] 7.Wildcard queries: te?t, te*t, test*

9 Default Parameters Query Arguments for HTTP GET/POST to /select paramdefaultdescription qThe query start0Offset into the list of matches rows10Number of documents to return fl*Stored fields to return qtstandardQuery type; maps to query handler df(schema)Default field to search

10 Search Results Apple 60 GB iPod with Video ASUS Extreme N7800GTX/2DHTV

11 Caching IndexSearcher’s view of an index is fixed Aggressive caching possible Consistency for multi-query requests filterCache – unordered set of document ids matching a query resultCache – ordered subset of document ids matching a query documentCache – the stored fields of documents userCaches – application specific, custom query handlers

12 Warming for Speed Lucene IndexReader warming field norms, FieldCache, tii – the term index Static Cache warming Configurable static requests to warm new Searchers Smart Cache Warming (autowarming) Using MRU items in the current cache to pre- populate the new cache Warming in parallel with live requests

13 Smart Cache Warming Field Cache Field Norms Warming Requests Request Handler Live Requests On-Deck Solr IndexSearcher Filter Cache User Cache Result Cache Doc Cache Registered Solr IndexSearcher Filter Cache User Cache Result Cache Doc Cache Regenerator Autowarming – warm n MRU cache keys w/ new Searcher Autowarming Regenerator

14 Schema Lucene has no notion of a schema Sorting - string vs. numeric Ranges - val:42 included in val:[1 TO 5] ? Lucene QueryParser has date-range support, but must guess. Defines fields, their types, properties Defines unique key field, default search field, Similarity implementation

15 Field Definitions Field Attributes: name, type, indexed, stored, multiValued, omitNorms Dynamic Fields, in the spirit of Lucene!

16 Search Relevancy PowerShot SD 500 PowerShotSD500 SD500PowerShot PowerShot sd500powershot powershot WhitespaceTokenizer WordDelimiterFilter catenateWords=1 LowercaseFilter power-shot sd500 power-shotsd500 sd500powershot sd500powershot WhitespaceTokenizer WordDelimiterFilter catenateWords=0 LowercaseFilter Query Analysis A Match! Document Analysis

17 Configuring Relevancy <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt“/> <filter class="solr.StopFilterFactory“ words=“stopwords.txt”/> <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

18 copyField Copies one field to another at index time Usecase: Analyze same field different ways copy into a field with a different analyzer boost exact-case, exact-punctuation matches language translations, thesaurus, soundex Usecase: Index multiple fields into single searchable field

19 High Availability Load Balancer Appservers Solr Searchers Solr Master DB Updater updates admin queries Index Replication admin terminal HTTP search requests Dynamic HTML Generation

20 Replication solr/data/index Master solr/data/index Searcher new segment solr/data/snapshot hard links solr/data/snapshot WIP 2. hard links 3. rsync 4. mv dir Lucene index segments after mv after rsync

21 Faceted Browsing Example

22 Faceted Browsing DocList Search(Query,Filter[],Sort,offset,n) computer_type:PC memory:[1GB TO *] computer price asc proc_manu:Intel proc_manu:AMD section of ordered results DocSet Unordered set of all results price:[0 TO 500] price:[500 TO 1000] manu:Dell manu:HP manu:Lenovo intersection Size() = 594 = 382 = 247 = 689 = 104 = 92 = 75 Query Response

23 Web Admin Interface Show Config, Schema, Distribution info Query Interface Statistics Caches: lookups, hits, hitratio, inserts, evictions, size RequestHandlers: requests, errors UpdateHandler: adds, deletes, commits, optimizes IndexReader, open-time, index-version, numDocs, maxDocs, Analysis Debugger Shows tokens after each Analyzer stage Shows token matches for query vs index

24

25 Selling Points Fast Powerful & Configurable High Relevancy Mature Product Same features as software costing $$$ Leverage Community Lucene committers, IR experts Free consulting: shared problems & solutions

26 Where are we going? OOTB Simple Faceted Browsing Automatic Database Indexing Federated Search HA with failover Alternate output formats (JSON, Ruby) Highlighter integration Spellchecker Alternate APIs (Google Data, OpenSearch)

27 Resources WWW Mailing Lists