
Our mindset: 80% technology | 20% business
Apache Lucene v4.0

Anders Lybecker
Consultant / Solution Architect, KRING Development A/S
Expertise: .Net, SQL Server, free-text search
aly@kringdevelopment.dk | +45 53 72 73 40 | www.lybecker.com/blog

Agenda
- Lucene intro
- Indexing
- Searching
- Analysis: options, patterns, multilingual, what not to do!
- "Did you mean..." functionality
- Performance factors for indexing and searching

What is Lucene?
- An information retrieval (IR) software library, also known as a search engine
- Free / open source, an Apache Software Foundation project
- A document database: schema-free, built on an inverted index
- Large and active community
- Extensible and scalable (6 billion+ documents)
- Available for Java, .Net, C, Python, etc.; Lucene.Net is a 1:1 port of the Java version
- Relevance ranking / sorting
- Integrates different data sources (email, web pages, files, databases, ...)
- Has existed since 2000; created by Douglas Cutting
- Compare document databases such as CouchDB and Terrastore
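The inverted index mentioned above is the core data structure: each term maps to the list of documents that contain it. A toy Python sketch (a conceptual illustration, not the Lucene API):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of doc id -> text. Returns term -> sorted doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "Lucene is a search library",
        2: "Solr is built on Lucene"}
index = build_inverted_index(docs)
print(index["lucene"])  # -> [1, 2]
print(index["solr"])    # -> [2]
```

A query then becomes a lookup (and set intersection) over these postings lists instead of a scan over the documents themselves.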

Who uses Lucene? MySpace, LinkedIn, Technorati, Wikipedia, Monster.com, SourceForge, CIA, CNET Reviews, E.ON, Expert-Exchange, The Guardian, Akamai, Eclipse, JIRA, Statsbiblioteket (the State and University Library in Aarhus, Denmark), AOL, Disney, Furl, IBM OmniFind Yahoo! Edition, Hi5, TheServerSide, Nutch, Solr

Basic Application
- Indexing: a Document (Name: Anders; Company: Kring Development; Skills: .Net, SQL, Lucene) passes through Analysis into the IndexWriter, which writes to the Index (Directory)
- Searching: a Query (Skills: Lucene) goes to the IndexSearcher, which returns Hits (matching docs) from the Index

Querying
- Construct Query: e.g. via QueryParser
- Filter: limits the result (e.g. security filters); does not calculate score (relevance); caching via CachingWrapperFilter
- Sort: set the sort order; the default is relevance
Demo

Types of Queries
- TermQuery: query by a single term (word)
- PrefixQuery: wildcard query, like Dog*
- RangeQuery: ranges like AA-ZZ, 22-44 or 01DEC2010-24DEC2010
- BooleanQuery: container with Boolean-like semantics (Should, Must or Must Not)
- PhraseQuery: terms within a distance of one another (slop)
- WildcardQuery: e.g. A?de* matches Anders
- FuzzyQuery: fuzzy search via the Levenshtein distance algorithm
- Others in contrib
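The Levenshtein distance behind FuzzyQuery counts the insertions, deletions and substitutions needed to turn one term into another. A minimal Python sketch of the standard dynamic-programming algorithm (not Lucene's optimized automaton-based implementation):

```python
def levenshtein(a, b):
    """Edit distance between two strings, e.g. schmidt ~ schmitt = 1."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(levenshtein("schmidt", "schmitt"))  # -> 1
print(levenshtein("schmidt", "schmit"))   # -> 1
```

FuzzyQuery matches terms whose distance to the query term is below a threshold, which is how schmidt~ finds schmit and schmitt.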

Query Parser
Default query parser syntax:
- conference
- conference AND lucene <=> +conference +lucene
- Oracle OR MySQL
- C# NOT php <=> C# -php
- conference AND (Lucene OR .Net)
- "KRING Development"
- title:"Lucene in Action"
- L?becker
- Mad*
- schmidt~ (matches schmidt, schmit, schmitt)
- price:[12 TO 14]
Custom query parsers: use Irony, ANTLR, ...

Analysis
- Converting your text into terms
- Lucene does NOT search your text; it searches the set of terms created by analysis
- Actions: break on whitespace, punctuation, case changes, numb3rs; stemming (shoes -> shoe); removing/replacing stop words ("The quick brown fox jumps" -> "quick brown fox jumps"); combining words; adding new words (synonyms)
- Everything is a string, so sort order is lexicographical
Demo
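The chain of actions above can be sketched as a toy analyzer. This is a crude Python illustration (the plural-stripping "stemmer" and stop-word list are stand-ins, not Lucene's analyzers):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of"}

def analyze(text):
    """Toy analysis chain: tokenize, lowercase, drop stop words,
    crude plural stemming (shoes -> shoe)."""
    tokens = re.findall(r"\w+", text.lower())          # break on punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    # naive stemming: strip a trailing "s" from longer words
    tokens = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return tokens

print(analyze("The quick brown fox jumps"))
# -> ['quick', 'brown', 'fox', 'jump']
```

The point of the slide holds here too: a search for "The" or "jumps" finds nothing, because those exact terms never made it into the index.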

Field Options
- Index: Analyzed, Not Analyzed, Analyzed No Norms, Not Analyzed No Norms
- Store: Yes, No, Compress
- TermVector: records the unique terms that occur, their counts, and optionally positions and offsets
Example usage:
- Not Analyzed No Norms*, stored: identifiers (primary keys, file names), SSNs, phone numbers, URLs, names, dates and textual fields for sorting
- Analyzed, TermVector with positions + offsets: title, abstract, main content body
- Not Analyzed: document type, primary keys (if not used for searching), hidden keywords
* Norms are used for relevance ranking

Field Options: Norms and Term Vectors
- Norms: boosts and field-length normalization, used for ranking; by default, shorter fields rank higher
- Term vectors: a miniature inverted index with term-frequency pairs and positional information for each term occurrence (position and offset); used by PhraseQuery, the Highlighter and "More Like This"
- Field.SetOmitTermFreqAndPositions(true): no relevance score, no PhraseQuery or SpanQuery; use for filtering fields

Copy Fields
- It's common to want to index data more than one way
- You might store an analyzed version of a field for searching and an unanalyzed version for faceting
- You might store a stemmed and a non-stemmed version of a field to boost precise matches

Multilingual
- Generally, keep different languages in their own fields or indexes
- This lets you have an analyzer for each language (stemming, stop words, etc.)

Wildcard Querying
- Scenario: search for *soft; leading wildcards require traversing the entire index
- Reversing token filter: reverse the token order, and leading wildcards become trailing: *soft -> tfos*
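The reversal trick is a one-liner; the sketch below (plain Python, standing in for a reversing token filter plus a reversed field) shows how a costly leading wildcard becomes a cheap prefix query:

```python
def reverse_for_leading_wildcard(query):
    """Rewrite a leading-wildcard query against a field whose terms were
    indexed reversed: *soft -> tfos*. A prefix query can then walk only
    the terms starting with "tfos" instead of the entire term dictionary."""
    assert query.startswith("*"), "only leading-wildcard queries apply"
    return query[1:][::-1] + "*"

print(reverse_for_leading_wildcard("*soft"))  # -> tfos*
```

Both "microsoft" and "kringsoft" reversed start with "tfos", so they land next to each other in the sorted term dictionary and are found by one prefix scan.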

What can go wrong? Lots of things:
- You can't find things
- You find too much
- Poor query or indexing performance
Problems happen when the terms are not what you think they are.

Case: Slow Searches
- They index 500,000 books, with multiple languages in one field, so they can't use stemming or stop words
- Their worst-case query was "The lives and literature of the beat generation"; it took 2 minutes to run
- The query requires checking every doc containing "the" and "and", plus the position info for each occurrence

Bi-grams
- Bi-grams combine adjacent terms: "The lives and literature" becomes "The lives", "lives and", "and literature"
- Only have to check documents that contain the pair adjacent to each other, and only have to look at position information for the pair
- But can triple the size of the index: each word is indexed by itself, and both with the preceding term and with the following term
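Producing word bi-grams from a token stream is straightforward; a minimal Python sketch (conceptual, not Lucene's shingle filter):

```python
def bigrams(tokens):
    """Combine each pair of adjacent terms into a single bi-gram term."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(bigrams(["the", "lives", "and", "literature"]))
# -> ['the lives', 'lives and', 'and literature']
```

Indexing these pair terms alongside the single words is what lets the phrase query skip the huge postings lists for "the" and "and".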

Common Bi-grams
- Form bi-grams only for common terms: "The" occurs 2 billion times; "The lives" occurs 360k times
- Using only the 32 most common terms, average response went from 460 ms to 68 ms

Auto Suggest
- N-grams of "cash": unigrams "c", "a", "s", "h"; bigrams "ca", "as", "sh"; trigrams "cas", "ash"; 4-grams "cash"
- Edge n-grams: "c", "ca", "cas", "cash"
- Alternative: PrefixQuery
- Two other approaches are the TermsComponent (new in Solr 1.4) or faceting
Demo
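Edge n-grams are simply every prefix of a term; indexing them turns auto-suggest into exact term lookups. A Python sketch of the idea (not the EdgeNGram filter itself):

```python
def edge_ngrams(term, min_len=1):
    """All prefixes of the term, from min_len up to the full term."""
    return [term[:i] for i in range(min_len, len(term) + 1)]

print(edge_ngrams("cash"))  # -> ['c', 'ca', 'cas', 'cash']
```

With "c", "ca", "cas" and "cash" all in the index, a user who has typed "ca" matches the stored gram "ca" directly, with no wildcard expansion at query time (the trade-off is a larger index).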

Spell Checking: "Did you mean..."
The spell checker starts by analyzing the source terms into n-grams.
Index structure, for example for the word "kings":
- gram3: kin, ing, ngs
- gram4: king, ings
- start3: kin
- start4: king
- end3: ngs
- end4: ings
http://wiki.apache.org/lucene-java/SpellChecker
Demo
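The gram/start/end fields above can be generated with a few lines of Python; this sketch mirrors the slide's example for one gram size (the field names are taken from the slide, the function itself is illustrative):

```python
def spell_grams(word, n):
    """Break a word into n-grams plus marked start/end grams, as in the
    spell checker's index structure (gram3/start3/end3 style fields)."""
    grams = [word[i:i + n] for i in range(len(word) - n + 1)]
    return {f"gram{n}": grams,
            f"start{n}": grams[0],
            f"end{n}": grams[-1]}

print(spell_grams("kings", 3))
# -> {'gram3': ['kin', 'ing', 'ngs'], 'start3': 'kin', 'end3': 'ngs'}
```

A misspelling like "kigns" shares the gram "kin"... actually it shares start3 "kig"? No: it shares grams such as "kig" with nothing, but "gns" overlaps partially; in practice candidates are ranked by how many grams they share with the misspelled input, with start/end grams weighted higher.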

Trie Fields: Numeric Ranges
- Added in v2.9
- 175 is indexed as hundreds:1, tens:17, ones:175
- TrieRangeQuery:[154 TO 183] is executed as tens:[16 TO 17] OR ones:[154 TO 159] OR ones:[180 TO 183]
- Configurable precisionStep per field
- 40x speedup for range queries
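The slide's decomposition can be checked directly: coarse "tens" terms cover the middle of the range, fine "ones" terms cover the edges. A Python sketch of exactly that example (illustrative, not the TrieRangeQuery implementation):

```python
def trie_terms(value):
    """175 -> {'hundreds': 1, 'tens': 17, 'ones': 175}: the same number
    indexed at several precisions, as in the slide."""
    return {"hundreds": value // 100, "tens": value // 10, "ones": value}

def matches_154_to_183(value):
    """The rewritten query from the slide:
    tens:[16 TO 17] OR ones:[154 TO 159] OR ones:[180 TO 183]."""
    t = trie_terms(value)
    return (16 <= t["tens"] <= 17
            or 154 <= t["ones"] <= 159
            or 180 <= t["ones"] <= 183)

print([v for v in (153, 154, 160, 179, 183, 184) if matches_154_to_183(v)])
# -> [154, 160, 179, 183]
```

Three small term ranges replace a 30-term range on the finest precision, which is where the speedup for wide ranges comes from.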

Synonyms
- The synonym filter lets you include alternate words the user can use when searching, for example theater / theatre
- Useful for movie titles, where words are deliberately mis-spelled
- Don't over-use synonyms: it helps recall, but lowers precision
- Synonyms are produced as tokens at the same token position, so in "local theater company", "theatre" sits at the same position as "theater"
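The same-position behavior can be sketched by representing the token stream as a list of alternatives per position (the SYNONYMS map is a made-up example; this is not Solr's SynonymFilter):

```python
# Hypothetical synonym table for illustration.
SYNONYMS = {"theater": ["theatre"]}

def with_synonyms(tokens):
    """Emit each token's synonyms at the same position, so phrase
    queries like "local theater company" still line up."""
    return [[t] + SYNONYMS.get(t, []) for t in tokens]

print(with_synonyms(["local", "theater", "company"]))
# -> [['local'], ['theater', 'theatre'], ['company']]
```

Because "theatre" occupies position 2 rather than shifting later tokens, a phrase query for "local theatre company" matches the same document.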

Other Features
- Find similar documents: selects documents similar to a given document, based on the document's significant terms
- Result Highlighter
- Tika: rich-document text extraction
- Spatial search
- ...
Demo

General Performance Factors
- Use a local file system
- Index size: stop-word removal, use of stemming
- Type of analyzer: more complicated analysis means slower indexing
- Turn off features you are not using (norms, term vectors, etc.)
- Index type (RAMDirectory, other)
- Occurrences of query terms
- Optimized index
- Disable the compound file format
- Just add more RAM :-)

Indexing Performance Factors
- Re-use the IndexWriter
- IndexWriter.SetRAMBufferSizeMB: minimum number of MBs before a merge occurs and a new segment is created; usually larger == faster, but more RAM
- IndexWriter.SetMergeFactor: how often segments are merged; smaller == less RAM, better for incremental updates; larger == faster, better for batch indexing
- IndexWriter.SetMaxFieldLength: limits the number of terms in a document
- Reuse Document and Field instances
Larger merge factors defer merging of segments until later, thus speeding up indexing, because merging is a large part of indexing. However, this slows down searching, and you will run out of file descriptors if you make it too large. Values that are too large may even slow down indexing, since merging more segments at once means much more seeking for the hard drives.

Search Performance Factors
- Use a read-only IndexReader
- Share a single instance of IndexSearcher; reopen only when necessary, and pre-warm it up
- Query size: stop-word removal, bi-grams, ...
- Query type(s): WildcardQuery rewrites to a BooleanQuery containing all matching terms
- Use FieldSelector to load only the stored fields needed
- Use Filters with a cache
- Search an "all" field instead of many fields with the same query terms
- Use RangeFilter instead of RangeQuery: RangeQuery expands every term in the range into a boolean expression and easily blows past the built-in BooleanQuery.maxClauseCount limit (Lucene 2.0 defaults to about 1000); RangeFilter doesn't suffer from this limitation
Demo

Alternatives
- MS FullText / FAST
- Oracle Text
- MySQL FullText
- dtSearch (commercial)
- Xapian (open source)
- Sphinx (used by Craigslist)

Solr

What is Solr?
- Enterprise search engine
- Free / open source
- Started by CNET
- Built on Lucene
- Web-based application (HTTP); runs in a Java servlet container
18-09-2018

Features
- Solr core: virtual instances
- Lucene best practices
- Sharding
- Replication: v1.3 used rsync on Linux; v1.4 has Java-based replication via a RequestHandler
- DataImportHandler
- Faceting
Demo

Questions?

Resources
- Anders Lybecker's blog: http://www.lybecker.com/blog/
- Lucene: http://lucene.apache.org/java/docs/
- Lucene.Net: http://lucene.apache.org/lucene.net/
- Lucene wiki: http://wiki.apache.org/lucene-java/
- Book: Lucene in Action
- Luke, the Lucene index exploration tool: http://www.getopt.org/luke/

Relevance Scoring
- tf(t in d): term frequency factor for the term (t) in the document (d), i.e. how many times the term t occurs in the document
- idf(t): inverse document frequency of the term, a measure of how "unique" the term is; very common terms have a low idf, very rare terms a high idf
- boost(t.field in d): field and document boost, as set during indexing; you may use this to statically boost certain fields and certain documents over others
- lengthNorm(t.field in d): normalization value of a field, given the number of terms within the field; computed during indexing and stored in the index norms; shorter fields (fewer tokens) get a bigger boost from this factor
- coord(q, d): coordination factor, based on the number of query terms the document contains; gives an AND-like boost to documents that contain more of the search terms than other documents
- queryNorm(q): normalization value for a query, given the sum of the squared weights of each of the query terms
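How these factors combine can be sketched numerically. The toy Python below keeps tf, idf, lengthNorm and coord and drops boosts and queryNorm for brevity; it is an approximation of the classic formula, not Lucene's exact implementation:

```python
import math

def score(query_terms, doc_terms, n_docs, doc_freq):
    """Toy relevance score: coord * lengthNorm * sum(tf * idf^2)."""
    matched = [t for t in query_terms if t in doc_terms]
    if not matched:
        return 0.0
    coord = len(matched) / len(query_terms)       # coord(q, d)
    length_norm = 1 / math.sqrt(len(doc_terms))   # lengthNorm
    total = 0.0
    for t in matched:
        tf = math.sqrt(doc_terms.count(t))                # tf(t in d)
        idf = 1 + math.log(n_docs / (doc_freq[t] + 1))    # idf(t)
        total += tf * idf * idf
    return coord * length_norm * total

df = {"lucene": 2, "search": 1}
d1 = ["lucene", "search", "library"]   # matches both query terms
d2 = ["lucene", "tutorial"]            # matches one query term
print(score(["lucene", "search"], d1, 2, df))
print(score(["lucene", "search"], d2, 2, df))
```

The document matching both terms scores higher: it gets the full coord factor, and "search" (rarer, higher idf) contributes more than the common term "lucene".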

Index Structure
- Document: a grouping of content
- Field: a property of the document
- Term: the unit of indexing, often a word
- Index segment: a file that is an index by itself; Lucene writes segments incrementally
(An index consists of segments; a document consists of fields.)

Phonetic Analysis
- Creates a phonetic representation of the text, for "sounds like" matching
- PhoneticFilterFactory uses one of: Metaphone, Double Metaphone, Soundex, Refined Soundex, NYSIIS

Components of an Analyzer: CharFilters, Tokenizers, TokenFilters

CharFilters
- Used to clean up/regularize characters before they reach the Tokenizer
- Remove accents, etc.: MappingCharFilter
- They can also do complex things; we'll look at HTMLStripCharFilter later

Tokenizers
- Convert text to tokens (terms)
- Only one per analyzer
- Many options: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, more...

TokenFilters Process the tokens produced by the Tokenizer Can be many of them per field