Lucene/Solr Architecture

Slides:



Advertisements
Similar presentations
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Advertisements

Apache Solr Out Of The Box (OOTB) Chris Hostetter
Lucene Near Realtime Search Jason Rutherglen & Jake Mannix LinkedIn 6/3/2009 SOLR/Lucene Users Group San Francisco.
Practical Solr Guide for Developers. First…some questions. How many of you in the room know what Solr is? How many have worked with Solr? How many will.
Tomás Fernández Löbbe Plugtree LLC October 7th
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
6 SQL Server Integration Same manageability, administration & development experience Integrated queries & transactions Integrated HA and backup/restore.
Computing & Information Sciences Kansas State University Kansas State University Olathe Workshop on Big Data – August, 2014 KSU Laboratory for Knowledge.
Apache Solr Yonik Seeley 29 June 2006 Dublin, Ireland.
How to Use LucidWorks Search
 Apache Solr Apache Solr – Introduction David Shemer.
Lucene in action Information Retrieval A.A – P. Ferragina, U. Scaiella – – Dipartimento di Informatica – Università di Pisa –
Search at Scale Hadoop, Katta & Solr the smores of Search.
Richa Arora.  Tool Identified and Overview  Schema.xml  Tokenization, Stop words, and Synonym Handling  Indexing  Data Import Handler  Query format.
Information Retrieval in Practice
Architecture of a Search Engine
Introduction to Open Source Search with Apache Lucene and Solr Grant Ingersoll.
Overview of Search Engines
Passage Three Introduction to Microsoft SQL Server 2000.
Implementing search with free software An introduction to Solr By Mick England.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands.
ECPRD seminar on the net IX”, Brussels, 2011 Faceted Search Some examples of applied faceted search on websites developed by the EP Jerry.
Full-Text Search with Lucene Yonik Seeley 02 May 2007 Amsterdam, Netherlands slides:
Powerful Full-Text Search with Solr Yonik Seeley Web 2.0 Expo, Berlin 8 November 2007 download at
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
Introduction to Apache Lucene/Solr CSCI 572: Information Retrieval and Search Engines Summer 2010.
Search Search Drupal with Apache Solr with CERN Web Communications Group – Copyright 2013.
© NYC Apache Lucene/Solr Meetup. Lucid Imagination, Inc. Agenda Welcome "Faster. Better. Solr! What to look for in Solr 1.4“ Yonik Seeley,
Nutch Search Engine Tool. Nutch overview A full-fledged web search engine Functionalities of Nutch  Internet and Intranet crawling  Parsing different.
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20 Rafał Kuć – sematext.com.
Lucene Boot Camp I Grant Ingersoll Lucid Imagination Nov. 3, 2008 New Orleans, LA.
Vyhľadávanie informácií Softvérové knižnice a systémy Vyhľadávanie informácií Michal Laclavík.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
Revolutionizing enterprise web development Searching with Solr.
SharePoint 2010 Search Architecture The Connector Framework Enhancing the Search User Interface Creating Custom Ranking Models.
Search Admin Content UX Crawl Content Processing Index Query ProcessingWFE API Analytics Processing Crawl Search Admin Link Analytics Reporting FAST.
Open Search Office Web Services Database Doc Mgt Sys Pipeline Index Geospatial Analysis Text Search Faceting Caching Query parsing Clustering Synonyms.
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Search Dr Ian Boston University of Cambridge Image © University of Cambridge December :30 INTL 6.
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
“ Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and.NET ”
Iccha Sethi Serdar Aslan Team 1 Virginia Tech Information Storage and Retrieval CS 5604 Instructor: Dr. Edward Fox 10/11/2010.
Mike Jackson EPCC OGSA-DAI Architecture + Extensibility OGSA-DAI Tutorial GGF17, Tokyo.
Faceted browsing for ACL Anthology Praveen Bysani.
Design a full-text search engine for a website based on Lucene
XML and Object Serialization. Structure of an XML Document Header Root Element Start Tags / End Tags Element Contents – Child Elements – Text – Both (mixed.
1 Aspire Document Processing 1. 2 Document Processing – “Aspire” Very High Performance Structured Document Processing Architecture Dynamic configuration.
Windows 7 WampServer 2.1 MySQL PHP 5.3 Script Apache Server User Record or Select Media Upload to Internet Return URL Forward URL Create.
Reviews Crawler (Detection, Extraction & Analysis) FOSS Practicum By: Syed Ahmed & Rakhi Gupta April 28, 2010.
SQL Query Analyzer. Graphical tool that allows you to:  Create queries and other SQL scripts and execute them against SQL Server databases. (Query window)
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
DSpace Statistics Graham Triggs Head of Repository Systems, Symplectic.
INFORMATION RETRIEVAL Pabitra Mitra Computer Science and Engineering IIT Kharagpur
Apache Solr Beyond The Box Chris Hostetter
Apache Solr Out Of The Box (OOTB) Chris Hostetter hossman - apache - org
Information Retrieval in Practice
Implementing Autocomplete with Solr and jQuery
Search Engine Architecture
Searching and Indexing
Custom search forms with Apache Solr David Hernández
Detailed search stats from DSpace Solr
Building Search Systems for Digital Library Collections
Lucene in action Information Retrieval A.A
Introduction to Nutch Zhao Dongsheng
Query Processing CSD305 Advanced Databases.
Lucene/Solr Architecture
Introduction to Elasticsearch with basics of Lucene May 2014 Meetup
Rafał Kuć – Sematext sematext.com
Battle of the Giants Apache Solr 4.0 vs ElasticSearch 0.20
Presentation transcript:

Lucene/Solr Architecture Request Handlers Response Writers Update Handlers /admin /select /spell XML Binary JSON XML CSV binary Extracting Request Handler (PDF/WORD) Search Components Schema Update Processors Query Highlighting Signature Spelling Statistics Logging Faceting Debug Indexing Apache Tika More like this Clustering Query Parsing Config Distributed Search Data Import Handler (SQL/RSS) Analysis Faceting Filtering Search Caching High-lighting Index Replication Apache Lucene Core Search IndexReader/Searcher Indexing IndexWriter Text Analysis

Lucene/Solr plugins RequestHandlers – handle a request at a URL like /select SearchComponents – part of a SearchHandler, a componentized request handler Includes, Query, Facet, Highlight, Debug, Stats Distributed Search capable UpdateHandlers – handle an indexing request Update Processor Chains – per-handler componentized chain that handle updates Query Parser plugins Mix and match query types in a single request Function plugins for Function Query Text Analysis plugins: Analyzers, Tokenizers, TokenFilters ResponseWriters serialize & stream response to client

Lucene/Solr Query Plugin Architecture Declarative Analysis per-field - Tokenizer to split text - TokenFilter to transform tokens - Analyzer for completely custom - Separate query / index analyzer QParser plugins - Support different query syntaxes - Support different query execution - Function Query supports pluggable custom functions - Excellent support for nesting/mixing different query types in the same request. schema.xml Whitespace Tokenizer Analyzer for “title” CustomFilter SynonymFilter Porter Stemmer // declaratively defines types // and analyzers for fields <fieldType name=“text1”> <filter=“whitespace”> <filter=“customFilter” …> <filter=“synonyms” file=..> <filter=“porter” except=..> <field name=“title” type=“text1” <field name=“cust1” class=… Analyzer for “cust1” (potentially completely custom architecture not using tokenizer/filters) solrconfig.xml < index configuration /> < caching configuration /> < request handler config /> < search component config /> < update processor config /> < misc – HTTP cache, JMX > <parser name=“mycustom” … <func name=“custom” class=… MyCustom QParser Lucene QParser Function QParser sqrt sum pow custom max log XML QParser DisMax QParser Function Range Q

Lucene/Solr Request Plugins {“response”={ “docs”={ http://.../select?q=cheese&wt=json /select /admin/luke /mypath RequestHandler Request Handler (non-component based) Request Handler (custom) XML response writer Query Component Distributed Search Facet Component XSLT response writer Highlight Component Binary response writer Debug Component Each request handler can be mapped to a different URL SearchHandler is a componentized RequestHandler that allows search components to be chained together and also enables the framework for distributed search operations. Each Searchhandler can have it’s own custom set of search components, along with default or invariant parameters All of the configuration is declarative – including adding new request handlers or search components. The QueryResponse object is very generic and can handle returning any type of data JSON response writer Query Response Custom response writer Additional plug-n-play search components TermVector QueryElevation Spellcheck Terms MoreLikeThis Statistics My Custom Clustering

Lucene/Solr Indexing PDF <doc> <title> /update /update/csv HTTP POST HTTP POST /update /update/csv /update/xml /update/extract XML Update Handler CSV Update Handler XML Update with custom processor chain Extracting RequestHandler (PDF, Word, …) Update Processor Chain (per handler) Text Index Analyzers Data Import Handler Database pull RSS pull Simple transforms Remove Duplicates processor Logging Index Custom Transform RSS feed Just like all request handlers, update handlers can be mapped to a specific URL and have their own set of default or invariant parameters. Each update handler can have it’s own Update Processor Chain that can do Document-level operations prior to indexing, or even redirect indexing to a different server or create multiple documents (or zero) from a single one. All of the configuration is declarative, including the specification of update processor chains. pull Lucene SQL DB pull Lucene Index