Lucene Part2. Lucene Jarkarta Lucene (http://jakarta.apache.org/lucene/) is a high- performance, full-featured, java, open-source, text search engine.

Slides:



Advertisements
Similar presentations
Name: Tatiana “Tania” Harrison
Advertisements

CS0007: Introduction to Computer Programming Introduction to Classes and Objects.
The Lucene Search Engine Kira Radinsky Modified by Amit Gross to Lucene 4 Based on the material from: Thomas Paul and Steven J. Owens.
Lucene Part3‏. Lucene High Level Infrastructure When you look at building your search solution, you often find that the process is split into two main.
Excel What’s so great about PivotTable reports?. Course contents Overview: More data than you can handle? Lesson 1: Make your data work for you Lesson.
Information Retrieval in Practice
The Lucene Search Engine Kira Radinsky Based on the material from: Thomas Paul and Steven J. Owens.
Microsoft ® Office Excel ® 2007 Training Get started with PivotTable ® reports [Your company name] presents:
XP New Perspectives on Microsoft Office Excel 2003, Second Edition- Tutorial 11 1 Microsoft Office Excel 2003 Tutorial 11 – Importing Data Into Excel.
1 Introduction to OBIEE: Learning to Access, Navigate, and Find Data in the SWIFT Data Warehouse Lesson 5: Navigation in OBIEE – Touring the Catalog Page.
Overview of Search Engines
Access Tutorial 3 Maintaining and Querying a Database
Proxy Design Pattern Source: Design Patterns – Elements of Reusable Object- Oriented Software; Gamma, et. al.
1 Introduction to Lucene Rong Jin. What is Lucene ?  Lucene is a high performance, scalable Information Retrieval (IR) library Free, open-source project.
Apache Lucene in LexGrid. Lucene Overview High-performance, full-featured text search engine library. Written entirely in Java. An open source project.
Week 4-5 Java Programming. Loops What is a loop? Loop is code that repeats itself a certain number of times There are two types of loops: For loop Used.
INTRO TO PROGRAMMING Chapter 2. M-files While commands can be entered directly to the command window, MATLAB also allows you to put commands in text files.
XP New Perspectives on Microsoft Office Access 2003 Tutorial 12 1 Microsoft Office Access 2003 Tutorial 12 – Managing and Securing a Database.
IAGAP Access Database A Tutorial. Databases There are several databases available from the IAGAP Project. There are several databases available from the.
Nachos Phase 1 Code -Hints and Comments
Moodle (Course Management Systems). Assignments 1 Assignments are a refreshingly simple method for collecting student work. They are a simple and flexible.
Lucene Performance Grant Ingersoll November 16, 2007 Atlanta, GA.
CMSC 202 Exceptions. Aug 7, Error Handling In the ideal world, all errors would occur when your code is compiled. That won’t happen. Errors which.
© The McGraw-Hill Companies, 2006 Chapter 4 Implementing methods.
Introduction to Microsoft Access 2003 Mr. A. Craig Dixon CIS 100: Introduction to Computers Spring 2006.
Chapter 5 Being a Web App. Very few servlet or JSP stands alone Many times in our application, different servlets or JSPs need to share information 
PHP meets MySQL.
Chapter 8 Cookies And Security JavaScript, Third Edition.
JAVA SERVER PAGES. 2 SERVLETS The purpose of a servlet is to create a Web page in response to a client request Servlets are written in Java, with a little.
CSCI-383 Object-Oriented Programming & Design Lecture 13.
Lucene Part1 ‏. Lucene Use Case Store data in a 2 dimensional way How do we do this. Spreadsheet Relational Database X/Y.
Copyrighted material John Tullis 10/17/2015 page 1 04/15/00 XML Part 3 John Tullis DePaul Instructor
Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.
Collecting Things Together - Lists 1. We’ve seen that Python can store things in memory and retrieve, using names. Sometime we want to store a bunch of.
XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.
The Internet 8th Edition Tutorial 4 Searching the Web.
Database Systems Microsoft Access Practical #3 Queries Nos 215.
Operating Systems COMP 4850/CISG 5550 File Systems Files Dr. James Money.
Intermediate 2 Software Development Process. Software You should already know that any computer system is made up of hardware and software. The term hardware.
Lucene-Demo Brian Nisonger. Intro No details about Implementation/Theory No details about Implementation/Theory See Treehouse Wiki- Lucene for additional.
ITGS Databases.
Microsoft Access. Microsoft access is a database programs that allows you to store retrieve, analyze and print information. Companies use databases for.
Introduction to Access. Access 2010 is a database creation and management program.
“ Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and.NET ”
Connect. Communicate. Collaborate PerfsonarUI plug-in tutorial Nina Jeliazkova ISTF, Bulgaria.
David Lawrence 7/8/091Intro. to PHP -- David Lawrence.
Design a full-text search engine for a website based on Lucene
Ch- 8. Class Diagrams Class diagrams are the most common diagram found in modeling object- oriented systems. Class diagrams are important not only for.
Excel 2007 Part (3) Dr. Susan Al Naqshbandi
M1G Introduction to Programming 2 5. Completing the program.
Files Tutor: You will need ….
Lesson 4.  After a table has been created, you may need to modify it. You can make many changes to a table—or other database object—using its property.
COMP3241 E-Commerce Technologies Richard Henson University of Worcester November 2014.
Visual Basic for Application - Microsoft Access 2003 Finishing the application.
TABLE OF CONTENTS 2014 BasmahAlQadheeb. What is a report? A report is a clearly structured document that presents information as clearly as possible.
Photoshop Actions Lights, Camera, Actions in Photoshop.
Lucene Jianguo Lu.
1 CSC160 Chapter 1: Introduction to JavaScript Chapter 2: Placing JavaScript in an HTML File.
Access Queries and Forms. Adding a New Field  To insert a field after you have saved your table, open Access, and open the table  It is easier to add.
Chapter 4 Crystal Report Presenter: PEN PHIROM (MscIT) Phone:
Apache Solr Dima Ionut Daniel. Contents What is Apache Solr? Architecture Features Core Solr Concepts Configuration Conclusions Bibliography.
1 UNIT 13 The World Wide Web. Introduction 2 Agenda The World Wide Web Search Engines Video Streaming 3.
1 UNIT 13 The World Wide Web. Introduction 2 The World Wide Web: ▫ Commonly referred to as WWW or the Web. ▫ Is a service on the Internet. It consists.
FILES AND EXCEPTIONS Topics Introduction to File Input and Output Using Loops to Process Files Processing Records Exceptions.
Information Retrieval in Practice
Weebly Elements, Continued
Safe by default, optimized for efficiency
CMSC 202 Exceptions.
Exceptions and networking
Presentation transcript:

Lucene Part2

Lucene Jarkarta Lucene ( is a high- performance, full-featured, java, open-source, text search engine API written by Doug Cutting.

Lucene API Note that Lucene is specifically an API, not an application. This means that all the hard parts have been done, but the easy programming has been left to you. The payoff for you is that, unlike normal search engine applications, you spend less time wading through tons of options and build a search application that is specifically suited to what you're doing. You can easily develop a custom search application, perfectly suited to your needs. Lucene is startlingly easy to develop with and use.

Lucene Use the Source, Luke This tutorial is a brief overview; the Lucene distribution comes with four example classes: FileDocument IndexFiles SearchFiles DeleteFiles These classes are really a good introduction to how to use Lucene..

Lucene Indexes Here's a simple attempt to diagram how the Lucene classes go together: Index Document 1 Field A (name/value)‏ Field B (name/value)‏ Document 2 Field A (name/value)‏ Field B (name/value)‏

Lucene Index and Searching At the heart of Lucene is an Index. This class usually gets its data from a filesystem directory that contains a certain set of files that follow a certain structure, but it doesn't absolutely have to be a directory. You pump data into the Index, then do searches on the Index to get results out. To build the Index, you use an IndexWriter object. To run a search on the Index you use an IndexSearcher object.

Lucene Search The search itself is a Query object, which you pass into IndexSearcher.search(). IndexSearcher.search() returns a Hits object, which contains a Vector of Document objects.

Lucene Documents Document objects are stored in the Index, but they have to be put into the Index at some point, and that's your job. You have to select what data to enter in, and convert them into Documents. You read in each data file (or database entry, or whatever), instantiate a Document for it, break down the data into chunks and store the chunks in the Document as Field objects (a name/value pair). When you're done building a Document, you write it to the Index using the IndexWriter.

Lucene Query Parser Queries can be quite complicated, so Lucene includes a tool to help generate Query objects, called a QueryParser. The QueryParser takes a query string, much like what you'd put into an Internet search engine, and generates a Query object.

Lucene Analyzer The Analyzer. Lucene indexes text, and part of the first step is cleaning up the text. You use an Analyzer to do this - it drops out punctuation and commonly occurring but meaningless words (the, a, an, etc). Lucene provides a couple different Analyzers, and you can make but your own, but the BIG GOTCHA people keep running into is that you must make sure you use the same sort of analyzer for both indexing and searching. You must feed the same sort of Analyzer to the QueryParser that you originally fed to the IndexWriter.

Lucene What you have to do Lucene handles the indexing, searching and retrieving, but it doesn't handle: managing the process (instantiating the objects and hooking them together, both for indexing and for searching) selecting the data files parsing the data files getting the search string from the user displaying the search results to the use

Lucene Indexing In Depth You index by creating Documents full of Fields (which contain name/value pairs) and pumping them into an IndexWriter, which parses the contents of the Field values into tokens and creates an index.

Lucene Document Objects Lucene doesn't index files, it indexes Document objects. To index and then search files, you first need write code that converts your files into Document objects. A Document object is a collection of Field objects (name/value pairs). So, for each file, instantiate a Document, then populate it with Fields.

Lucene Lucene Key Feature‏ Lucene just handles name/value pairs. , for example, is mostly name/value oriented: to: fred from: barney subject: dinner? body: Let's get together for dinner tonight! For more complex files, you have to "flatten" that structure out into a set of name/value fields.

Lucene A minimum, as in the standard Lucene examples, would be: the path to the original document actually show the user the original document after the search a modification date compare against the original Document's modification date, to see if it needs to be reindexed the contents of the file run the search against Standard Lucene

Lucene You also ought to really think about glomming all of the Field data together and storing it as some sort of "all" Field. This is the easiest way to set it up so your users can search all Fields at once, if they want. Yes, you could come up with a complex scheme to rewrite your users' query so it searches across all of the known fields, but remember, keep it simple. See tutorial. The All Field

Lucene A Field object contains a name (a String) and a value (a String or a Reader), and three booleans that control whether or not the value will be indexed for searches, tokenized prior to indexing, and stored in the index so it can be returned with the search. Field Objects

Lucene Indexed for searches - sometimes you'll want to have fields available in your Documents that don't really have anything to do with searching. Two examples I can think of off the top of my head are creation dates and file names, so you can compare when the Document was created against the file modification date, and decide if the document needs to be reindexed. Since these fields won't ever make sense to use in an actual search, you can decrease the amount of work Lucene does by marking them as not indexed for searches. Indexed for Searches

Lucene Tokenized prior to indexing - tokenizing refers to taking a piece of text and cleaning it up, and breaking it down into individual pieces (tokens) for the indexer. This is done by the Analyzer. Some fields you may not want to be tokenized, for example a serial number field.. Tokenized prior to indexing

Lucene Stored in the index Stored in the index - even if a field is entirely indexed, it doesn't necessarily mean that it'll be easy for Lucene to reconstruct it. Although Lucene is a search index, and not a database, if your fields are reasonably small, you can ask Lucene to store them in the index. With the fields stored in the index, instead of using the Document to locate the original file or data and load it, you can actually pull the data out of the Document. This works best with fairly small fields and documents that you'd need to parse for display anyway.

Lucene Field Factory Factory Method Field.Text(String name, String value)‏ Field.Text(String name, Reader value)‏ Field.Keyword(String name, String value)‏ Field.UnIndexed(String name, String value)‏ Field.UnStored(String name, String value)‏

Lucene IndexWriter The IndexWriter's job is to take the input (a Document), feed it through the Analyzer you instantiate it with, and create an index. Using the IndexWriter itself is fairly simple. You instantiate it with parameters for where to put the index files and the Analyzer you want it to use for cleaning up the tokens. Then feed Documents into IndexWriter.addDocument(). The actual index is a set of data files that the IndexWriter creates in a location defined (depending on how you instantiate the IndexWriter) by a lucene Directory object, a File, or a path string.

Lucene Directory Objects You can also store the index in a Lucene Directory object. A Lucene Directory is an abstraction around the java filesystem classes. Using a Directory lets the Lucene classes hide what exactly is going on. This in turn lets you do clever behind-the- scenes things like keeping the file cached in memory for really high performance by using the RAM-based Directory class (Lucene comes with two Directory classes, one for file-based and one for RAM-based).

Lucene Analyzers and Tokenizers The analyzer's job is to take apart a string of text and give you back a stream of tokens. The tokens are presumably usually words from the text content of the string, and that's what gets stored (along with the location and other details) in the index. Each analyzer includes one or more tokenizers and may include filters. The tokenizers take care of the actual rules for where to break the text up into words (typically whitespace). The filters do any post-tokenizing work on the tokens (typically dropping out punctuation and commonly occurring words like "the", "an", "a", etc).

Lucene Analyzers SimpleAnalyzer seems to just use a Tokenizer that converts all of the input to lower case. StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words", words like articles (a, an, the, etc) that occur so commonly in english that they might as well be noise for searching purposes. StopAnalyzer comes with a set of stop words, but you can instantiate it with your own array of stop words. StandardAnalyzer does both lower-case and stop-word filtering, and in addition tries to do some basic clean-up of words, for example taking out apostrophes ( ' ) and removing periods from acronyms (i.e. "T.L.A." becomes "TLA").

Lucene Searching In Depth To actually do the search, you need an IndexSearcher, but we'll get to that in a moment; before you can even think about feeding the IndexSearcher a query, you have to have a Query object. The IndexSearcher does the actual munging through the index, but it only understands Query objects.

Lucene Query and QueryParser Objects You produce the Query object by feeding the user's argument string into QueryParser.parse(), along with a string for the default field to search (if the user doesn't specify which field to search) and an Analyzer. The Analyzer is what QueryParser uses to tokenize the argument string. (Gotcha Warning: remember, again, you have to make sure that you use the same flavor Analyzer for tokenizing the argument string as you used for tokenizing the Index. StopAnalyzer is probably a safe choice for this, since that's the one used in the example code.) QueryParser.parse() returns a Query.

Lucene Thread Safety Multiple index searchers can read the lucene index files at the same time. An index writer or reader can edit the lucene index files while searches are ongoing Multiple index writers or readers can try to edit the lucene index files at the same time (it's important for the index writer/reader to be closed so it will release the file lock).

Lucene More on Thread Safety However, the query parser is not thread safe, so each thread using the index should have its own query parser. The index writer however, is thread safe, so you can update the index while people are searching it. However, you then have to make sure that the threads with open index searchers close them and open new ones, to get the newly updated data.

Lucene IndexSearchers To get an IndexSearcher you simply instantiate an IndexSearcher with a single argument that tells Lucene where to find an existing index. The argument is either of these two: a string containing a path to the file, a Lucene Directory object (see the section about Directory objects under "Indexing In Depth", above)

Lucene IndexReaders There's actually a third option for instantiating an IndexSearcher; you can instantiate it with any class that is a concrete subclass of the abstract class IndexReader This makes more sense if you take a peek at the code for IndexSearcher. The other two constructors just turn your file path or Directory object into an IndexReader by calling the static method IndexReader.open().

Lucene Multiple Indexes If you're searching a single index, you use an IndexSearcher with a single index. If you need to search across multiple indexes, you instantiate one IndexSearcher per index, create an array, stick the IndexSearcher instances in the array, and instantiate a MultiSearcher with the array as an argument.

Lucene Doing The Search To actually do the search, you take the argument string the user enters, pass it to a QueryParser and get back a parsed Query object (and remember (third time's the charm) to use the right kind of Analyzer when you instantiate the QueryParser; use the same sort of Analyzer that you used when you built the index; the QueryParser'll use the Analyzer to tokenize the argument string). Then you feed the parsed Query to the IndexSearcher.search(). The return is a Hits object, which is a collection of Document objects for documents that matched the search parameters. The Hits object also includes a score for each Document, indicating how well it matched.

Lucene Hits IndexSearcher.search(Query) returns a "Hits" object, which is sort of like a Vector, containing a ranked list of Lucene Document objects. These are the same Document objects you fed into the IndexWriter, but specifically the ones that matched your search. Now you need to format the hits for a display, or manufacture HREFs pointing to the original documents, or whatever you were basically planning to do with the search results.

Lucene (cont.) Summary Lucene For key value pair fast retrieval. For read oriented data.