Powerful Full-Text Search with Solr
Yonik Seeley
Web 2.0 Expo, Berlin
8 November 2007



What is Lucene
- High performance, scalable, full-text search library
- Focus: Indexing + Searching Documents
  - "Document" is just a list of name+value pairs
- No crawlers or document parsing
- Flexible Text Analysis (tokenizers + token filters)
- 100% Java, no dependencies, no config files

What is Solr
- A full text search server based on Lucene
- XML/HTTP, JSON Interfaces
- Faceted Search (category counting)
- Flexible data schema to define types and fields
- Hit Highlighting
- Configurable Advanced Caching
- Index Replication
- Extensible Open Architecture, Plugins
- Web Administration Interface
- Written in Java5, deployable as a WAR

[Diagrams: Lucene Basic App; Solr architecture]
Basic App: Indexer, Webapp (HTML), Lucene
  Document:
    super_name: Mr. Fantastic
    name: Reed Richards
    category: superhero
    powers: elasticity
  Query (powers:agility)
  Query Response (matching docs)
Solr (in a Servlet Container, on top of Lucene):
  Entry points: admin, update, select
  Request handlers: Standard request handler, Custom request handler
  Response writers: XML response writer, JSON response writer
  Update handlers: XML Update Handler, CSV Update Handler

Indexing Data
HTTP POST of a document to the update handler:
  name: Peter Parker
  supername: Spider-Man
  category: superhero
  powers: agility
  powers: spider-sense

Indexing CSV data
  Iron Man, Tony Stark, superhero, powered armor | flight
  Sandman, William Baker|Flint Marko, supervillain, sand transform
  Wolverine,James Howlett|Logan, superhero, healing|adamantium
  Magneto, Erik Lehnsherr, supervillain, magnetism|electricity

Request parameters:
  fieldnames=supername,name,category,powers
  &separator=,
  &f.name.split=true&f.name.separator=|
  &f.powers.split=true&f.powers.separator=|
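
The effect of the split parameters can be sketched in plain Java: each line is cut on the main separator, and the fields marked for splitting are cut again on their own separator to produce multiple values. This is only an illustration, not Solr's CSV loader; the class and method names are hypothetical, and trimming whitespace around values is an assumption made here for readability.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper: mimics fieldnames / separator / f.<field>.split on one CSV line.
final class CsvSplitSketch {
    static Map<String, List<String>> parseLine(String line, String[] fieldNames,
                                               String separator, Set<String> splitFields,
                                               String subSeparator) {
        // Cut the line into columns on the main separator (quoted so "|" or "." are literal).
        String[] columns = line.split(Pattern.quote(separator), -1);
        Map<String, List<String>> doc = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length && i < columns.length; i++) {
            List<String> values = new ArrayList<>();
            if (splitFields.contains(fieldNames[i])) {
                // Multi-valued field: cut again on the per-field separator.
                for (String part : columns[i].split(Pattern.quote(subSeparator))) {
                    values.add(part.trim()); // trimming is an assumption of this sketch
                }
            } else {
                values.add(columns[i].trim());
            }
            doc.put(fieldNames[i], values);
        }
        return doc;
    }

    public static void main(String[] args) {
        String[] fields = {"supername", "name", "category", "powers"};
        Map<String, List<String>> doc = parseLine(
            "Wolverine,James Howlett|Logan, superhero, healing|adamantium",
            fields, ",", Set.of("name", "powers"), "|");
        System.out.println(doc);
    }
}
```

With the Wolverine line, name and powers each end up with two values while supername and category keep one.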

Data upload methods
URL=
- HTTP POST body (curl, HttpClient, etc)
    curl $URL -H 'Content-type:text/plain; charset=utf-8'
- Multi-part file upload (browsers)
- Request parameter
    ?stream.body='Cyclops, Scott Summers,…'
- Streaming from URL (must enable)
    ?stream.url=file://data/info.csv

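Indexing with SolrJ
// Solr's Java Client API… remote or embedded/local!
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// solrUrl: a String holding the base URL of the Solr server
SolrServer server = new CommonsHttpSolrServer(solrUrl);
SolrInputDocument doc = new SolrInputDocument();
doc.addField("supername", "Daredevil");
doc.addField("name", "Matt Murdock");
doc.addField("category", "superhero");
server.add(doc);     // send the document to Solr
server.commit();     // make it visible to searches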

Deleting Documents
- Delete by Id, most efficient
- Delete by Query, e.g. category:supervillain
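
A minimal sketch of the two delete messages, assuming Solr's XML update format in which a delete wraps either an id element or a query element. The class name and the id value "doc-1" are hypothetical; the resulting string would be POSTed to the update handler.

```java
// Hypothetical helper that builds XML delete messages for the update handler.
final class DeleteMessages {
    // Escape the characters that would otherwise break the XML message.
    static String escapeXml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Delete a single document by its unique key.
    static String byId(String id) {
        return "<delete><id>" + escapeXml(id) + "</id></delete>";
    }

    // Delete every document matching a query.
    static String byQuery(String query) {
        return "<delete><query>" + escapeXml(query) + "</query></delete>";
    }

    public static void main(String[] args) {
        System.out.println(byId("doc-1")); // "doc-1" is a made-up id
        System.out.println(byQuery("category:supervillain"));
    }
}
```

As with additions, deletes only become visible to searchers after a commit.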

- Commit makes changes visible
  - Triggers static cache warming in solrconfig.xml
  - Triggers autowarming from existing caches
- Optimize: same as commit, but also merges all index segments for faster searching

Lucene Index Segments:
  _0.fnm _0.fdt _0.fdx _0.frq _0.tis _0.tii _0.prx _0.nrm _0_1.del
  _1.fnm _1.fdt _1.fdx […]

Searching
&start=0&rows=2&fl=supername,category

Matching docs:
  supername: Spider-Man, category: superhero
  supername: Mystique, category: supervillain
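
A small sketch of assembling such a request's query string with URL encoding, using only the JDK. The class name and the query value category:superhero are assumptions for illustration; start, rows and fl are the parameters from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: turns ordered parameters into an encoded query string.
final class SelectQuerySketch {
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> p : params.entrySet()) {
            joined.add(URLEncoder.encode(p.getKey(), StandardCharsets.UTF_8) + "="
                     + URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "category:superhero"); // assumed query, not taken from the slide
        params.put("start", "0");              // offset of the first result
        params.put("rows", "2");               // page size
        params.put("fl", "supername,category"); // stored fields to return
        System.out.println(toQueryString(params));
    }
}
```

Paging through results then only requires changing start while keeping rows fixed.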

Response Format
Add &wt=json for JSON formatted response

{"response": {"numFound":427, "start":0,
  "docs": [
    {"supername":"Spider-Man", "category":"superhero"},
    {"supername":"Mystique", "category":"supervillain"}
  ]
}}

Also Python, Ruby, PHP, SerializedPHP, XSLT

Scoring
- Query results are sorted by score descending
- VSM: Vector Space Model
- tf: term frequency, number of matching terms in field
- lengthNorm: number of tokens in field
- idf: inverse document frequency
- coord: coordination factor, number of matching terms
- document boost
- query clause boost
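
The factors above can be sketched numerically. The formulas below follow what Lucene's classic default similarity used in that era, as far as I know (square-root tf, logarithmic idf, inverse square-root length norm); treat them as an assumption rather than a specification. Real Lucene also stores norms in a lossy one-byte encoding, which this sketch ignores. The class name is hypothetical.

```java
// Hypothetical sketch of classic TF-IDF style scoring factors.
final class ScoringSketch {
    // tf: grows with the number of times the term occurs in the field.
    static double tf(int termFreq) {
        return Math.sqrt(termFreq);
    }

    // idf: rarer terms (low docFreq) weigh more.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // lengthNorm: matches in shorter fields weigh more.
    static double lengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    // coord: rewards documents matching more of the query terms.
    static double coord(int matchingTerms, int queryTerms) {
        return matchingTerms / (double) queryTerms;
    }

    // Field weight of one term in one document, before boosts and query normalization.
    static double fieldWeight(int termFreq, int docFreq, int numDocs, int numTokens) {
        return tf(termFreq) * idf(docFreq, numDocs) * lengthNorm(numTokens);
    }

    public static void main(String[] args) {
        System.out.println("fieldWeight = " + fieldWeight(2, 5, 100, 16));
        System.out.println("coord       = " + coord(1, 2));
    }
}
```

Document and query clause boosts would multiply into these values.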

Explain
fast&indent=on&debugQuery=on

= (MATCH) product of:
  = (MATCH) sum of:
    = (MATCH) weight(text:fast in 6), product of:
      = queryWeight(text:fast), product of:
        = idf(docFreq=5)
        = queryNorm
      = (MATCH) fieldWeight(text:fast in 6), product of:
        = tf(termFreq(text:fast)=2)
        = idf(docFreq=5)
        = fieldNorm(field=fast, doc=6)
  0.5 = coord(1/2)
= (MATCH) product of:

Lucene Query Syntax
1. justice league
   Equiv: justice OR league
   QueryParser default operator is "OR"/optional
2. +justice +league -name:aquaman
   Equiv: justice AND league NOT name:aquaman
3. "justice league" -name:aquaman
4. title:spiderman^10 description:spiderman
5. description:"spiderman movie"~100

Lucene Query Examples 2
1. releaseDate:[2000 TO 2007]
2. Wildcard searches: sup?r, su*r, super*
3. spider~
   Fuzzy search: Levenshtein distance
   Optional minimum similarity: spider~0.7
4. *:*
5. (Superman AND "Lex Luthor") OR (+Batman +Joker)
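
Fuzzy search rests on Levenshtein (edit) distance, which can be sketched with the standard dynamic-programming recurrence. The similarity function below is a simple normalization chosen for illustration (one minus distance over the shorter length); Lucene's exact formula may differ, and the class name is hypothetical.

```java
// Hypothetical sketch of the edit distance behind fuzzy queries such as spider~0.7.
final class LevenshteinSketch {
    // Minimum number of single-character insertions, deletions and substitutions.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Illustrative similarity in [.., 1]; 1.0 means identical strings.
    static double similarity(String a, String b) {
        int shorter = Math.min(a.length(), b.length());
        if (shorter == 0) {
            return a.equals(b) ? 1.0 : 0.0;
        }
        return 1.0 - distance(a, b) / (double) shorter;
    }

    public static void main(String[] args) {
        System.out.println(distance("spider", "spiber"));
        System.out.println(similarity("spider", "spiber") >= 0.7);
    }
}
```

A term would be accepted by spider~0.7 in this sketch when its similarity to "spider" is at least 0.7.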

DisMax Query Syntax
- Good for handling raw user queries
  - Balanced quotes for phrase query
  - '+' for required, '-' for prohibited
  - Separates query terms from query structure

&q=super man                 // the user query
&qf=title^3 subject^2 body   // fields to query
&pf=title^2,body             // fields to do phrase queries
&ps=100                      // slop for those phrase q's
&tie=.1                      // multi-field match reward
&mm=2                        // # of terms that should match
&bf=popularity               // boost function
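
How qf combines with each user term can be sketched as follows: the field list is parsed into field/boost pairs, and every term is expanded into one clause per field. This is a simplified illustration, not Solr's parser; the class and method names are hypothetical and fields without an explicit boost are assumed to have boost 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch: parse "title^3 subject^2 body" and expand one term across the fields.
final class QfExpansionSketch {
    static Map<String, Double> parseFieldBoosts(String qf) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        for (String spec : qf.trim().split("\\s+")) {
            if (spec.isEmpty()) {
                continue;
            }
            int caret = spec.indexOf('^');
            if (caret < 0) {
                boosts.put(spec, 1.0); // no explicit boost
            } else {
                boosts.put(spec.substring(0, caret),
                           Double.parseDouble(spec.substring(caret + 1)));
            }
        }
        return boosts;
    }

    // One alternative per field, joined with "|" as in a disjunction-max clause.
    static String expandTerm(String term, Map<String, Double> boosts) {
        StringJoiner clause = new StringJoiner(" | ");
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            double boost = e.getValue();
            String suffix = "";
            if (boost != 1.0) {
                suffix = "^" + (boost == Math.rint(boost)
                        ? String.valueOf((long) boost) : String.valueOf(boost));
            }
            clause.add(e.getKey() + ":" + term + suffix);
        }
        return clause.toString();
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = parseFieldBoosts("title^3 subject^2 body");
        for (String term : "super man".split("\\s+")) {
            System.out.println(expandTerm(term, boosts));
        }
    }
}
```

Keeping the field list in qf means the user's text never needs to contain field names.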

DisMax Query Form
The expanded Lucene Query:
+( DisjunctionMaxQuery( title:super^3 | subject:super^2 | body:super)
   DisjunctionMaxQuery( title:man^3 | subject:man^2 | body:man) )
DisjunctionMaxQuery(title:"super man"~100^2 body:"super man"~100)
FunctionQuery(popularity)

Tip: set up your own request handler with default parameters to avoid clients having to specify them
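
The role of tie in a DisjunctionMaxQuery can be sketched with a few lines of arithmetic: the clause scores as its best field match plus tie times the sum of the other field matches. The per-field scores below are made-up numbers and the class name is hypothetical.

```java
// Hypothetical sketch of disjunction-max scoring with a tie breaker.
final class DisMaxScoreSketch {
    // score = max(fieldScores) + tie * (sum of the remaining field scores)
    static double score(double[] fieldScores, double tie) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        double[] perField = {3.0, 1.0, 0.5}; // made-up scores for title, subject, body
        System.out.println(score(perField, 0.0)); // pure max: only the best field counts
        System.out.println(score(perField, 0.1)); // small reward for matching in more fields
        System.out.println(score(perField, 1.0)); // behaves like a plain sum
    }
}
```

With tie=0 a document matching in one field ties with one matching in several; a small tie such as .1 breaks that tie without letting weak fields dominate.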

Function Query
- Allows adding function of field value to score
  - Boost recently added or popular documents
- Current parser only supports function notation
- Example: log(sum(popularity,1))
- sum, product, div, log, sqrt, abs, pow
- scale(x, target_min, target_max)
  - calculates min & max of x across all docs
- map(x, min, max, target)
  - useful for dealing with defaults
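
The three functions named above can be sketched as plain arithmetic. Assumptions made here: log is base 10 (as in Solr's function query documentation, to my knowledge), map replaces values inside [min, max] with target and leaves others unchanged, and scale maps linearly from the observed range of x to the target range. The class name and sample values are hypothetical.

```java
// Hypothetical sketch of function query arithmetic on a single field value.
final class FunctionQuerySketch {
    // log(sum(popularity,1)): the +1 keeps a popularity of 0 from producing log(0).
    static double logSum(double popularity) {
        return Math.log10(popularity + 1.0);
    }

    // map(x, min, max, target): values inside [min, max] become target.
    static double map(double x, double min, double max, double target) {
        return (x >= min && x <= max) ? target : x;
    }

    // scale(x, targetMin, targetMax), given the min and max of x across all docs.
    static double scale(double x, double minX, double maxX, double targetMin, double targetMax) {
        if (maxX == minX) {
            return targetMin; // degenerate range: every doc has the same value
        }
        return targetMin + (x - minX) / (maxX - minX) * (targetMax - targetMin);
    }

    public static void main(String[] args) {
        System.out.println(logSum(9));            // popularity 9
        System.out.println(map(0, 0, 0, 1));      // treat a default of 0 as 1
        System.out.println(scale(5, 0, 10, 1, 2)); // mid-range value
    }
}
```

Using map before a division or logarithm is one way to neutralize documents that carry a default value.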

Boosted Query
- Score is multiplied instead of added
  - New local params syntax added
    &q={!boost b=sqrt(popularity)}super man
- Parameter dereferencing in local params
    &q={!boost b=$boost v=$userq}
    &boost=sqrt(popularity)
    &userq=super man

Analysis & Search Relevancy
[Diagram: index-time and query-time analysis chains producing matching tokens]

Document Indexing Analysis: "LexCorp BFG-9000"
  WhitespaceTokenizer:                  LexCorp | BFG-9000
  WordDelimiterFilter catenateWords=1:  Lex, Corp, LexCorp | BFG, 9000
  LowercaseFilter:                      lex, corp, lexcorp | bfg, 9000

Query Analysis: "Lex corp bfg9000"
  WhitespaceTokenizer:                  Lex | corp | bfg9000
  WordDelimiterFilter catenateWords=0:  Lex | corp | bfg, 9000
  LowercaseFilter:                      lex | corp | bfg, 9000

A Match!
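
The chain in this example can be approximated in a few dozen lines. The sketch below is a heavily simplified stand-in for the real filters: it splits on whitespace, then on non-alphanumeric characters, lower-to-upper case changes and letter/digit changes, optionally adds the concatenation of adjacent word parts, and lowercases. Class and method names are hypothetical, and token positions are ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified analysis chain: whitespace split, word delimiter, lowercase.
final class AnalysisSketch {
    // Split one token into parts; with catenateWords also emit joined runs of word parts.
    static List<String> wordDelimiter(String token, boolean catenateWords) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : token.toCharArray()) {
            if (!Character.isLetterOrDigit(c)) { // delimiter such as '-'
                flush(current, parts);
                continue;
            }
            if (current.length() > 0) {
                char prev = current.charAt(current.length() - 1);
                boolean caseChange = Character.isLowerCase(prev) && Character.isUpperCase(c);
                boolean typeChange = Character.isLetter(prev) != Character.isLetter(c);
                if (caseChange || typeChange) {
                    flush(current, parts);
                }
            }
            current.append(c);
        }
        flush(current, parts);

        List<String> out = new ArrayList<>(parts);
        if (catenateWords) {
            StringBuilder run = new StringBuilder();
            int runLength = 0;
            for (String part : parts) {
                if (Character.isLetter(part.charAt(0))) {
                    run.append(part);
                    runLength++;
                } else {
                    if (runLength > 1) out.add(run.toString());
                    run.setLength(0);
                    runLength = 0;
                }
            }
            if (runLength > 1) out.add(run.toString());
        }
        return out;
    }

    private static void flush(StringBuilder current, List<String> parts) {
        if (current.length() > 0) {
            parts.add(current.toString());
            current.setLength(0);
        }
    }

    // Full chain: whitespace tokenizer, word delimiter, lowercase.
    static List<String> analyze(String text, boolean catenateWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            for (String part : wordDelimiter(raw, catenateWords)) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> indexed = analyze("LexCorp BFG-9000", true);
        List<String> query = analyze("Lex corp bfg9000", false);
        System.out.println(indexed);
        System.out.println(query);
        System.out.println("match: " + indexed.containsAll(query));
    }
}
```

The point of the asymmetric settings is that the index holds both the split and the joined forms, so either spelling of the query finds the document.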

Configuring Relevancy
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>

Field Definitions
- Field Attributes: name, type, indexed, stored, multiValued, omitNorms, termVectors
- Dynamic Fields

copyField
- Copies one field to another at index time
- Use case #1: Analyze same field different ways
  - copy into a field with a different analyzer
  - boost exact-case, exact-punctuation matches
  - language translations, thesaurus, soundex
- Use case #2: Index multiple fields into single searchable field

Facet Query
&facet=true&facet.field=cat
&facet.query=price:[0 TO 100]
&facet.query=manu:IBM

{"response":{"numFound":26,"start":0,"docs":[…]},
 "facet_counts":{
   "facet_queries":{
     "price:[0 TO 100]":6,
     "manu:IBM":2},
   "facet_fields":{
     "cat":[ "electronics",14, "memory",3, "card",2, "connector",2] }}}
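
What facet.field computes can be sketched as a count of field values over the documents matching the query, sorted by descending count. This is a toy in-memory version; Solr works on its index structures instead. The class name and the sample documents are hypothetical, and each document is assumed to have a single value per field.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of field faceting: count each value among the matching documents.
final class FacetCountSketch {
    static Map<String, Integer> facetCounts(List<Map<String, String>> matchingDocs, String field) {
        Map<String, Integer> counts = new TreeMap<>(); // alphabetical while counting
        for (Map<String, String> doc : matchingDocs) {
            String value = doc.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        // Sort by descending count; the stable sort keeps alphabetical order among ties.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            sorted.put(e.getKey(), e.getValue());
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("cat", "memory"),
            Map.of("cat", "electronics"),
            Map.of("cat", "electronics"),
            Map.of("cat", "card"));
        System.out.println(facetCounts(docs, "cat"));
    }
}
```

Facet queries such as price:[0 TO 100] work the same way, except that each count is the number of matching documents that also satisfy that query.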

Filters
- Filters are restrictions in addition to the query
- Use in faceting to narrow the results
- Filters are cached separately for speed

1. User queries for memory, query sent to solr is
   &q=memory&fq=inStock:true&facet=true&…
2. User selects 1GB memory size
   &q=memory&fq=inStock:true&fq=size:1GB&…
3. User selects DDR2 memory type
   &q=memory&fq=inStock:true&fq=size:1GB&fq=type:DDR2&…
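
Why separate caching helps can be sketched with bit sets: each fq is evaluated once into a set of document ids, cached under its query string, and later requests just intersect the cached sets with the main query's matches. This is an illustration of the idea, not Solr's filterCache; the class name, the evaluator function and the toy document ids are all hypothetical.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: filters cached per fq string and intersected with query matches.
final class FilterCacheSketch {
    private final Map<String, BitSet> cache = new HashMap<>();
    private final Function<String, BitSet> evaluator; // computes the doc set for one fq
    private int filtersComputed = 0;

    FilterCacheSketch(Function<String, BitSet> evaluator) {
        this.evaluator = evaluator;
    }

    private BitSet cachedFilter(String fq) {
        BitSet docs = cache.get(fq);
        if (docs == null) {          // cache miss: evaluate once, then reuse
            docs = evaluator.apply(fq);
            cache.put(fq, docs);
            filtersComputed++;
        }
        return docs;
    }

    // Intersect the main query's matches with every filter; cached sets are never modified.
    BitSet search(BitSet queryMatches, List<String> filterQueries) {
        BitSet result = (BitSet) queryMatches.clone();
        for (String fq : filterQueries) {
            result.and(cachedFilter(fq));
        }
        return result;
    }

    int filtersComputed() {
        return filtersComputed;
    }

    public static void main(String[] args) {
        BitSet inStock = new BitSet();
        inStock.set(0); inStock.set(2); inStock.set(4);
        BitSet oneGb = new BitSet();
        oneGb.set(2); oneGb.set(3); oneGb.set(4);
        Map<String, BitSet> toyIndex = Map.of("inStock:true", inStock, "size:1GB", oneGb);

        FilterCacheSketch searcher = new FilterCacheSketch(toyIndex::get);
        BitSet memoryMatches = new BitSet();
        memoryMatches.set(0, 6); // pretend docs 0..5 match q=memory

        System.out.println(searcher.search(memoryMatches, List.of("inStock:true")));
        System.out.println(searcher.search(memoryMatches, List.of("inStock:true", "size:1GB")));
        System.out.println("filters computed: " + searcher.filtersComputed());
    }
}
```

In the drill-down sequence above, inStock:true is evaluated for the first request only; each later step adds just one new filter to compute.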

Highlighting
&hl=true&hl.fl=features

{"response":{"numFound":5,"start":0,"docs":[
   {"id":"3007WFP", "price":899.95}, …]},
 "highlighting":{
   "3007WFP":{
     "features":["30\" TFT active matrix LCD, 2560 x 1600"]},
   "VA902B":{
     "features":["19\" TFT active matrix LCD, 8ms response time, 1280 x 1024 native resolution"]}}}

MoreLikeThis
Selects documents that are "similar" to the documents matching the main query.
&q=id:6H500F0
&mlt=true&mlt.fl=name,cat,features

"moreLikeThis":{
  "6H500F0":{"numFound":5,"start":0,
    "docs": [
      {"name":"Apple 60 GB iPod with Video Playback Black",
       "price":399.0, "inStock":true, "popularity":10, […] },
      […] ]
  […]

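High Availability
[Diagram: replicated Solr deployment]
- HTTP search requests: Load Balancer, then Appservers (Dynamic HTML Generation), then Solr Searchers
- updates: DB, then Updater, then Solr Master
- Index Replication: Solr Master to Solr Searchers
- admin queries: admin terminal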

Resources
- WWW
- Mailing Lists