The Lucene Search Engine Kira Radinsky Modified by Amit Gross to Lucene 4 Based on the material from: Thomas Paul and Steven J. Owens.



What is Lucene?
– Named after Doug Cutting's wife's middle name (also her maternal grandmother's first name)
– An open-source set of Java classes
  – Search engine / document classifier / indexer
  – Developed by Doug Cutting (1996): Xerox, Apple, Excite, Nutch, Yahoo, Cloudera; Hadoop founder; member of the board of directors of the Apache Software Foundation
– Jakarta Apache product, with strong open-source community support
– High-performance, full-featured text search engine library
– Easy-to-use yet powerful API

Use the Source, Luke
Document – The unit of indexing and search; a collection of Fields.
Field – Represents a section of a Document: a name for the section plus the actual data.
Analyzer – Abstract class (provides the interface) that turns a Document's text into tokens for later indexing; StandardAnalyzer is one implementation.
IndexWriter – Creates and maintains indexes.
IndexSearcher – Searches through an index.
QueryParser – Parses a query string into a Query that can be run against an index.
Query – Abstract class that contains the search criteria created by the QueryParser.
TopDocs – Contains the top K documents found in a search by an IndexSearcher, and their scores.

Indexing a Document

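Document from an article

    import java.util.Date;
    import org.apache.lucene.document.*;
    import org.apache.lucene.document.Field.Store;

    private Document createDocument(String article, String author, String title,
                                    String topic, String url, Date dateWritten) {
        Document document = new Document();
        document.add(new TextField("author", author, Store.YES));
        document.add(new TextField("title", title, Store.YES));
        document.add(new TextField("topic", topic, Store.YES));
        // Indexed for search, but the full text is not stored in the index
        document.add(new TextField("article", article, Store.NO));
        // Stored only; cannot be searched
        document.add(new StoredField("URL", url));
        // StringField takes a String, so the Date is encoded first (day resolution)
        document.add(new StringField("Date",
                DateTools.dateToString(dateWritten, DateTools.Resolution.DAY), Store.NO));
        return document;
    }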

The Field Object

Subclass     Tokenized  Indexed  Stored  Use for
TextField    Yes        Yes      Can be  contents you want indexed and tokenized
StoredField  No         No       Yes     contents you don't want to index, but want to store (a URL, for example)
StringField  No         Yes      Can be  values you want indexed but not tokenized (dates, keywords, ...)

The Field Object (deprecated, old API)

Factory method                              Tokenized  Indexed  Stored  Use for
Field.Text(String name, String value)       Yes        Yes      Yes     contents you want stored
Field.Text(String name, Reader value)       Yes        Yes      No      contents you don't want stored
Field.Keyword(String name, String value)    No         Yes      Yes     values you don't want broken down
Field.UnIndexed(String name, String value)  No         No       Yes     values you don't want indexed
Field.UnStored(String name, String value)   Yes        Yes      No      values you don't want stored

Store a Document in the index

    // Field of the enclosing class; FSDirectory.open can throw IOException
    Directory dir = FSDirectory.open(new File("lucene-index"));

    private void indexDocument(Document document) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
        IndexWriter writer = new IndexWriter(dir, iwc);
        writer.addDocument(document);
        // Closing commits the change and releases the write lock
        writer.close();
    }

Analyzers and Tokenizers
SimpleAnalyzer – Splits the input at non-letter characters and converts all of it to lower case.
StopAnalyzer – Includes the lower-case filter, and also has a filter that drops "stop words": words such as articles (a, an, the, etc.) that occur so commonly in English that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own set of stop words.
StandardAnalyzer – Does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (e.g. "T.L.A." becomes "TLA").
Lucene Sandbox – Here you can find analyzers for other languages.
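The filtering steps described above can be sketched in plain Java. This is a toy illustration only: `ToyAnalyzer`, its stop-word list and its splitting rule are assumptions made for this example, and it is neither Lucene's Analyzer API nor its actual tokenization logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: not Lucene's Analyzer API or its real tokenization rules.
final class ToyAnalyzer {
    // Hypothetical stop-word list chosen for this example.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the", "and", "of");

    // Lower-cases each whitespace-separated word, cleans it up, and drops stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            String token = raw.toLowerCase(Locale.ROOT)
                    .replace("'", "")              // take out apostrophes
                    .replace(".", "")              // "t.l.a." becomes "tla"
                    .replaceAll("[^a-z0-9]", "");  // drop any remaining punctuation
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The T.L.A. of Newton's laws"));
    }
}
```

The same analysis has to be applied to documents at indexing time and to queries at search time, otherwise the terms in the query will not match the terms in the index.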

Adding to an Index

    public void indexArticle(String article, String author, String title,
                             String topic, String url, Date dateWritten) throws Exception {
        Document document = createDocument(article, author, title, topic, url, dateWritten);
        indexDocument(document);
    }

Searching the Index

Searching

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_45);
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    // "article" is the default field for query terms without a field prefix
    QueryParser qp = new QueryParser(Version.LUCENE_45, "article", analyzer);
    Query q = qp.parse(searchString);
    // Keep only the numResults best-scoring hits
    TopDocs top = searcher.search(q, numResults);

Extracting Document objects

    for (ScoreDoc sd : top.scoreDocs) {
        // Only stored fields (author, title, topic, URL) are available on doc
        Document doc = searcher.doc(sd.doc);
        // display the articles that were found to the user
    }
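To make the index/search/top-K flow above more concrete, here is a minimal plain-Java sketch of an inverted index. Everything in it (`ToyIndex`, its scoring by summed term frequency, its tokenizer) is a hypothetical simplification for illustration and does not reflect how Lucene stores or scores documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy inverted index: a hypothetical simplification, not Lucene's data structures or scoring.
final class ToyIndex {
    // term -> (document id -> number of occurrences of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private final List<String> documents = new ArrayList<>();

    // Stores the text and records every term it contains; returns the new document id.
    int addDocument(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    // Returns up to k document ids, best score first (score = summed term frequency).
    List<Integer> search(String query, int k) {
        Map<Integer, Integer> scores = new HashMap<>();
        for (String term : tokenize(query)) {
            postings.getOrDefault(term, Map.of())
                    .forEach((doc, tf) -> scores.merge(doc, tf, Integer::sum));
        }
        List<Integer> hits = new ArrayList<>(scores.keySet());
        hits.sort((a, b) -> {
            int byScore = Integer.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b); // ties: lower id first
        });
        return hits.subList(0, Math.min(k, hits.size()));
    }

    // Looks up the original text for a hit, similar in spirit to fetching a stored document.
    String doc(int docId) {
        return documents.get(docId);
    }

    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("Quantum physics and relativity");
        index.addDocument("String theory");
        index.addDocument("Relativity, relativity, relativity");
        for (int docId : index.search("relativity", 2)) {
            System.out.println(docId + ": " + index.doc(docId));
        }
    }
}
```

The search only touches the postings of the query terms, which is why looking things up in an index is much cheaper than scanning every document.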

Search Criteria
Supports several kinds of search: Boolean operators (AND, OR, NOT), fuzzy searches, proximity searches, wildcard searches, and range searches.
– author:Henry relativity AND "quantum physics"
– "string theory" NOT Einstein
– "Galileo Kepler"~5
– author:Johnson date:[01/01/2004 TO 01/31/2004]
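One point worth illustrating about range searches: when dates are indexed as plain string terms, a range compares them as strings, not as calendar dates. The sketch below assumes that string comparison (the class `SortableDates` and both date patterns are choices made for this example) and shows why a year-first format such as yyyyMMdd is safer than MM/dd/yyyy.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Sketch under the assumption that date terms are compared as plain strings.
final class SortableDates {
    static final DateTimeFormatter SORTABLE = DateTimeFormatter.ofPattern("yyyyMMdd");
    static final DateTimeFormatter US_STYLE = DateTimeFormatter.ofPattern("MM/dd/yyyy");

    // Inclusive range check using string order only.
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        LocalDate from = LocalDate.of(2004, 1, 1);
        LocalDate to = LocalDate.of(2004, 1, 31);
        LocalDate otherYear = LocalDate.of(1999, 1, 15); // clearly outside January 2004

        // Year-first strings sort chronologically, so the 1999 date is rejected.
        System.out.println(inRange(otherYear.format(SORTABLE),
                from.format(SORTABLE), to.format(SORTABLE)));

        // Month-first strings do not, so the 1999 date is wrongly accepted.
        System.out.println(inRange(otherYear.format(US_STYLE),
                from.format(US_STYLE), to.format(US_STYLE)));
    }
}
```

Whatever format is chosen, the range bounds in the query have to use the same encoding as the values written at indexing time.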

Thread Safety
Indexing and searching are not only thread safe, but also process safe. This means that:
– Multiple index searchers can read the Lucene index files at the same time.
– An index writer or reader can edit the Lucene index files while searches are ongoing.
– Multiple index writers or readers can try to edit the Lucene index files at the same time, but only one of them holds the write lock at any moment (so it is important to close the index writer/reader so that it releases the file lock).
The query parser is not thread safe. The index writer, however, is thread safe.
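A common way to act on the last two statements is to share one writer between threads while giving each thread its own parser. The sketch below shows that pattern with hypothetical stand-ins (`ToyWriter` and `ToyParser` are invented for this example and are not Lucene classes).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Pattern sketch with hypothetical stand-ins; ToyWriter and ToyParser are not Lucene classes.
final class SharedWriterSketch {
    // Stand-in for a thread-safe writer: one instance is shared by all threads.
    static final class ToyWriter {
        private final List<String> docs = new ArrayList<>();
        synchronized void addDocument(String doc) { docs.add(doc); }
        synchronized int numDocs() { return docs.size(); }
    }

    // Stand-in for a parser with mutable internal state: never shared between threads.
    static final class ToyParser {
        private final StringBuilder buffer = new StringBuilder();
        String parse(String raw) {
            buffer.setLength(0);
            buffer.append(raw.trim().toLowerCase(Locale.ROOT));
            return buffer.toString();
        }
    }

    // Runs the given number of tasks against one shared writer; returns the document count.
    static int indexConcurrently(int threads, int docsPerThread) {
        ToyWriter writer = new ToyWriter();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int id = t;
            pool.submit(() -> {
                ToyParser parser = new ToyParser(); // one parser per task
                for (int i = 0; i < docsPerThread; i++) {
                    writer.addDocument(parser.parse(" Doc " + id + "-" + i + " "));
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("indexing tasks timed out");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writer.numDocs();
    }

    public static void main(String[] args) {
        System.out.println(indexConcurrently(4, 250));
    }
}
```

Sharing the writer avoids contention for the index lock, and keeping the parser local to each task sidesteps its lack of thread safety without any extra synchronization.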

Luke
Luke is a handy development tool that lets you browse and inspect an existing Lucene index.