Lucene Part 1


Use Case
Store data in a two-dimensional way. How do we do this?
- Spreadsheet
- Relational database
- X/Y

Couple of Problems
Relational Database (1971, Dr. E.F. Codd)
- Excellent way to store 2 dimensions
- Null
- Inapplicable
- Science of Data ... The Study of Null

Person      HockeyTeam
Name        Person_Id
Address     Team Id

Example

Person          HockeyTeam
Stephen         5
Philadelphia    Flyers
                5

Inverted Indexes

Person          HockeyTeam
Stephen         5
Philadelphia    Flyers
                5

We can easily answer the question "Where does Stephen live?", but what about "Give me all the people living in Philadelphia"? We can do this.

Inverted Indexes
Storage contains:
Philadelphia    Stephen, Chris, Tara
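
The lookup on this slide can be sketched in plain Java. This is only an illustration of the idea, not Lucene code: the class name, the method name, and the extra "Alex" row are hypothetical. It turns a person-to-city mapping into a city-to-people mapping, which is the inverted view.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class InvertedIndexSketch {
    /** Inverts person -> city into city -> sorted set of people. */
    public static Map<String, SortedSet<String>> invert(Map<String, String> personToCity) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> row : personToCity.entrySet()) {
            index.computeIfAbsent(row.getValue(), city -> new TreeSet<>()).add(row.getKey());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> people = new HashMap<>();
        people.put("Stephen", "Philadelphia");
        people.put("Chris", "Philadelphia");
        people.put("Tara", "Philadelphia");
        people.put("Alex", "Boston"); // hypothetical extra row
        // Prints [Chris, Stephen, Tara] because the TreeSet keeps names sorted.
        System.out.println(invert(people).get("Philadelphia"));
    }
}
```

With the forward mapping alone, finding everyone in a city means scanning every row; with the inverted mapping it is a single key lookup.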

Document Centric
- Create Documents on the fly
- Add a field called Category (the table name)
- Lucene indexes everything

One Developer Says
"Every other open source search engine I evaluated, including Swish-E, Glimpse, iSearch, and libibex, was poorly suited to Eyebrowse's requirements in some way. This would have made integration problematic and/or time-consuming. With Lucene, I added indexing and searching to Eyebrowse in little more than half a day, from initial download to fully working code! This was less than one-tenth of the development time I had budgeted, and yielded a more tightly integrated and feature-rich result than any other search tool I considered."

How Search Engines Work
Creating and maintaining an inverted index is the central problem when building an efficient keyword search engine. To index a document, you must first scan it to produce a list of postings. Postings describe occurrences of a word in a document; they generally include the word, a document ID, and possibly the location(s) or frequency of the word within the document.
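
A minimal sketch of the scanning step, assuming a very simple tokenizer (lower-case, split on anything that is not a letter or digit). The Posting class and the scan method are hypothetical names for illustration; Lucene's own data structures differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PostingsSketch {
    /** One posting: a word, the document it occurs in, and where it occurs. */
    public static final class Posting {
        public final String word;
        public final int docId;
        public final List<Integer> positions = new ArrayList<>();

        Posting(String word, int docId) {
            this.word = word;
            this.docId = docId;
        }

        public int frequency() {
            return positions.size();
        }
    }

    /** Scans one document and returns its postings in order of first occurrence. */
    public static List<Posting> scan(int docId, String text) {
        Map<String, Posting> byWord = new LinkedHashMap<>();
        int position = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with punctuation
            }
            byWord.computeIfAbsent(token, w -> new Posting(w, docId)).positions.add(position);
            position++;
        }
        return new ArrayList<>(byWord.values());
    }

    public static void main(String[] args) {
        for (Posting p : scan(7, "To be, or not to be")) {
            System.out.println(p.word + " doc=" + p.docId
                    + " freq=" + p.frequency() + " positions=" + p.positions);
        }
    }
}
```

Each posting here carries all three optional pieces the slide mentions: the document ID, the word positions, and (derived from them) the frequency.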

Building a Search Index
If you think of the postings as tuples of the form (word, document ID), a set of documents will yield a list of postings sorted by document ID. But in order to efficiently find documents that contain specific words, you should instead sort the postings by word (or by both word and document, which will make multiword searches faster). In this sense, building a search index is basically a sorting problem. The search index is a list of postings sorted by word.
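
The sorting view can be shown in a few lines of Java. This is a sketch with hypothetical names (Posting, sortByWord), assuming postings are simple (word, document ID) pairs that arrive in document order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PostingSortSketch {
    public static final class Posting {
        public final String word;
        public final int docId;

        public Posting(String word, int docId) {
            this.word = word;
            this.docId = docId;
        }

        @Override
        public String toString() {
            return word + ":" + docId;
        }
    }

    /** Returns a copy of the postings sorted by word, then by document ID. */
    public static List<Posting> sortByWord(List<Posting> postings) {
        List<Posting> sorted = new ArrayList<>(postings);
        sorted.sort(Comparator.comparing((Posting p) -> p.word)
                .thenComparingInt(p -> p.docId));
        return sorted;
    }

    public static void main(String[] args) {
        // Postings as produced by scanning doc 0 ("free text") and doc 1 ("text search").
        List<Posting> byDocument = Arrays.asList(
                new Posting("free", 0), new Posting("text", 0),
                new Posting("text", 1), new Posting("search", 1));
        // Prints [free:0, search:1, text:0, text:1]
        System.out.println(sortByWord(byDocument));
    }
}
```

After sorting, all postings for one word sit next to each other, so finding the documents for a word is a range lookup rather than a full scan.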

An Innovative Implementation
Most search engines use B-trees to maintain the index; they are relatively stable with respect to insertion and have well-behaved I/O characteristics (lookups and insertions are O(log n) operations). Lucene takes a slightly different approach: rather than maintaining a single index, it builds multiple index segments and merges them periodically. For each new document indexed, Lucene creates a new index segment, but it quickly merges small segments with larger ones -- this keeps the total number of segments small so searches remain fast. To optimize the index for fast searching, Lucene can merge all the segments into one, which is useful for infrequently updated indexes.
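
A toy model of the segment approach, to make the idea concrete. Everything here is an assumption for illustration: the class and method names are hypothetical, and the merge rule (merge the two newest segments whenever they hold the same number of documents) is a simplification, not Lucene's actual merge policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class SegmentMergeSketch {
    /** A segment is never changed after construction in this sketch. */
    static final class Segment {
        final int docCount;
        final SortedMap<String, List<Integer>> postings; // word -> ascending doc IDs

        Segment(int docCount, SortedMap<String, List<Integer>> postings) {
            this.docCount = docCount;
            this.postings = postings;
        }
    }

    private final List<Segment> segments = new ArrayList<>(); // oldest first
    private int nextDocId = 0;

    /** Writes a one-document segment, then merges equally sized neighbours. */
    public void addDocument(String text) {
        int docId = nextDocId++;
        SortedMap<String, List<Integer>> postings = new TreeMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.put(word, Collections.singletonList(docId));
            }
        }
        segments.add(new Segment(1, postings));
        while (segments.size() >= 2) {
            Segment older = segments.get(segments.size() - 2);
            Segment newer = segments.get(segments.size() - 1);
            if (older.docCount != newer.docCount) {
                break;
            }
            segments.subList(segments.size() - 2, segments.size()).clear();
            segments.add(merge(older, newer)); // a new segment replaces the two old ones
        }
    }

    /** Builds a new segment from two existing ones without touching either. */
    static Segment merge(Segment older, Segment newer) {
        SortedMap<String, List<Integer>> merged = new TreeMap<>();
        for (Segment s : Arrays.asList(older, newer)) {
            for (Map.Entry<String, List<Integer>> e : s.postings.entrySet()) {
                merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).addAll(e.getValue());
            }
        }
        return new Segment(older.docCount + newer.docCount, merged);
    }

    /** A search has to consult every segment, so fewer segments means faster searches. */
    public List<Integer> search(String word) {
        List<Integer> hits = new ArrayList<>();
        for (Segment s : segments) {
            hits.addAll(s.postings.getOrDefault(word, Collections.<Integer>emptyList()));
        }
        return hits;
    }

    public int segmentCount() {
        return segments.size();
    }
}
```

Under this rule, five added documents leave two segments (one with four documents, one with one), so the segment count grows much more slowly than the document count.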

Preventing Conflicts
To prevent conflicts (or locking overhead) between index readers and writers, Lucene never modifies segments in place; it only creates new ones. When merging segments, Lucene writes a new segment and deletes the old ones -- after any active readers have closed them. This approach scales well, offers the developer a high degree of flexibility in trading off indexing speed for searching speed, and has desirable I/O characteristics for both merging and searching.
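
The reader/writer isolation can be sketched with an immutable snapshot that the writer swaps atomically. This is an illustration under stated assumptions, not Lucene's implementation: segments are represented only by hypothetical names such as "seg_1", and a single writer thread is assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class SnapshotSketch {
    // The current list of segment names; the list itself is never modified.
    private final AtomicReference<List<String>> current =
            new AtomicReference<>(Collections.<String>emptyList());

    /** A reader keeps whatever snapshot was current when it opened. */
    public List<String> openReader() {
        return current.get();
    }

    /** Adding a segment publishes a new list instead of changing the old one. */
    public void addSegment(String name) {
        List<String> next = new ArrayList<>(current.get());
        next.add(name);
        current.set(Collections.unmodifiableList(next));
    }

    /** A merge publishes one new segment in place of all the old ones. */
    public void mergeAll(String mergedName) {
        current.set(Collections.singletonList(mergedName));
    }
}
```

A reader opened before a merge still sees the old segments, while readers opened afterwards see only the merged one. In a real index the old segment files could be deleted only once no reader refers to them.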

Index Segment
A Lucene index segment consists of several files:
- A dictionary index containing one entry for each 100 entries in the dictionary
- A dictionary containing one entry for each unique word
- A postings file containing an entry for each posting

Flat Files
Since Lucene never updates segments in place, they can be stored in flat files instead of complicated B-trees. For quick retrieval, the dictionary index contains offsets into the dictionary file, and the dictionary holds offsets into the postings file. Lucene also implements a variety of tricks to compress the dictionary and posting files -- thereby reducing disk I/O -- without incurring substantial CPU overhead.
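
A minimal in-memory sketch of the two-level lookup, assuming sorted arrays stand in for the flat files. The class name, the interval parameter, and the lookup method are hypothetical; the slides give an interval of 100, and the usage note below uses 3 to keep the example small.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.TreeSet;

public class DictionaryIndexSketch {
    private final String[] dictionary; // sorted unique words (stands in for the dictionary file)
    private final String[] indexTerms; // every interval-th word (stands in for the dictionary index)
    private final int interval;

    public DictionaryIndexSketch(Collection<String> words, int interval) {
        this.dictionary = new TreeSet<>(words).toArray(new String[0]);
        this.interval = interval;
        int entries = (dictionary.length + interval - 1) / interval;
        this.indexTerms = new String[entries];
        for (int i = 0; i < entries; i++) {
            indexTerms[i] = dictionary[i * interval]; // entry i points at offset i * interval
        }
    }

    /** Returns the offset of the word in the dictionary, or -1 if it is absent. */
    public int lookup(String word) {
        int slot = Arrays.binarySearch(indexTerms, word);
        if (slot < 0) {
            slot = -slot - 2; // last index entry that sorts before the word
        }
        if (slot < 0) {
            return -1; // the word sorts before the first dictionary entry
        }
        int start = slot * interval;
        int end = Math.min(start + interval, dictionary.length);
        for (int i = start; i < end; i++) { // scan at most one block of the dictionary
            int cmp = dictionary[i].compareTo(word);
            if (cmp == 0) {
                return i;
            }
            if (cmp > 0) {
                break;
            }
        }
        return -1;
    }
}
```

With the words apple, banana, cherry, date, elder, fig, grape and an interval of 3, the small index holds only apple, date, and grape; looking up "elder" binary-searches those three entries and then scans the block that starts at "date".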

Using Lucene - Create an Index
The simple program CreateIndex.java creates an empty index by generating an IndexWriter object and instructing it to build an empty index. In this example, the name of the directory that will store the index is specified on the command line.

The Code

    // Package name is an assumption (Apache Lucene 1.x); the release
    // described in these slides may use com.lucene.* instead.
    import org.apache.lucene.index.IndexWriter;

    public class CreateIndex {
        // usage: CreateIndex index-directory
        public static void main(String[] args) throws Exception {
            String indexPath = args[0];
            // An index is created by opening an IndexWriter with the
            // create argument set to true.
            IndexWriter writer = new IndexWriter(indexPath, null, true);
            writer.close();
        }
    }

Index Text Documents
IndexFiles.java shows how to add documents -- the files named on the command line -- to an index. For each file, IndexFiles creates a Document object, then calls IndexWriter.addDocument to add it to the index.

From Lucene's point of view, a Document is a collection of fields that are name-value pairs. A Field can obtain its value from a String, for short fields, or an InputStream, for long fields. Using fields allows you to partition a document into separately searchable and indexable sections, and to associate metadata -- such as name, author, or modification date -- with a document. For example, when storing mail messages, you could put a message's subject, author, date, and body in separate fields, then build semantically richer queries like "subject contains Java AND author contains Gosling."

Indexing In Depth
In the code, we store two fields in each Document: path, to identify the original file path so it can be retrieved later, and body, for the file's contents.

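Code Example

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    // Package names are an assumption (Apache Lucene 1.x); the release
    // described in these slides may use com.lucene.* instead.
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexFiles {
        // usage: IndexFiles index-path file...
        public static void main(String[] args) throws Exception {
            String indexPath = args[0];
            // create = false: add to an index that already exists.
            IndexWriter writer = new IndexWriter(indexPath, new SimpleAnalyzer(), false);
            for (int i = 1; i < args.length; i++) {
                System.out.println("Indexing file " + args[i]);
                InputStream is = new FileInputStream(args[i]);
                // We create a Document with two Fields, one which contains
                // the file path, and one the file's contents.
                Document doc = new Document();
                doc.add(Field.UnIndexed("path", args[i]));
                doc.add(Field.Text("body", new InputStreamReader(is)));
                writer.addDocument(doc);
                is.close();
            }
            writer.close();
        }
    }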

Search
Search.java provides an example of how to search the index. While the com.lucene.Query package contains many classes for building sophisticated queries, here we use the built-in query parser, which handles the most common queries and is less complicated to use. We create a Searcher object, use the QueryParser to create a Query object, and call Searcher.search on the query. The search operation returns a Hits object -- a collection of Document objects, one for each document matched by the query -- and an associated relevance score for each document, sorted by score.

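Code Example

    // Package names are an assumption (Apache Lucene 1.x); the release
    // described in these slides may use com.lucene.* instead.
    import org.apache.lucene.analysis.SimpleAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;

    public class Search {
        // usage: Search index-path query-string
        public static void main(String[] args) throws Exception {
            String indexPath = args[0], queryString = args[1];
            Searcher searcher = new IndexSearcher(indexPath);
            // Parse the query against the "body" field by default.
            Query query = QueryParser.parse(queryString, "body", new SimpleAnalyzer());
            Hits hits = searcher.search(query);
            // Hits are ordered by relevance score.
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
            }
            searcher.close();
        }
    }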

Query Parsing
The built-in query parser supports most queries, but if it is insufficient, you can always fall back on the rich set of query-building constructs provided. The query parser can parse queries like these:
- free AND "text search"
  Search for documents containing "free" and the phrase "text search"
- +text search
  Search for documents containing "text" and preferentially containing "search"
- giants -football
  Search for "giants" but omit documents containing "football"
- author:gosling java
  Search for documents containing "gosling" in the author field and "java" in the body
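
The meaning of the + and - prefixes in the examples above can be sketched as follows. This is not Lucene's query parser: the class and method are hypothetical, documents are reduced to a set of lower-case words, and phrases, AND, and field prefixes such as author: are deliberately left out.

```java
import java.util.Locale;
import java.util.Set;

public class QuerySemanticsSketch {
    /**
     * Returns -1 if the document does not match the query; otherwise the
     * number of required and optional query words the document contains.
     */
    public static int score(String query, Set<String> docWords) {
        int matched = 0;
        for (String clause : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (clause.isEmpty()) {
                continue;
            }
            if (clause.startsWith("-")) {
                // Prohibited word: its presence rules the document out.
                if (docWords.contains(clause.substring(1))) {
                    return -1;
                }
            } else if (clause.startsWith("+")) {
                // Required word: its absence rules the document out.
                if (!docWords.contains(clause.substring(1))) {
                    return -1;
                }
                matched++;
            } else if (docWords.contains(clause)) {
                // Optional word: it only raises the score.
                matched++;
            }
        }
        return matched > 0 ? matched : -1;
    }
}
```

For the query "+text search", a document containing both words scores 2, a document containing only "text" scores 1, and a document containing only "search" does not match. That is the "preferentially containing" behaviour described above.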

Beyond Basic Text Documents
Lucene uses three major abstractions to support building text indexes: Document, Analyzer, and Directory. The Document object represents a single document, modeled as a collection of Field objects (name-value pairs). For each document to be indexed, the application creates a Document object and adds it to the index store. The Analyzer converts the contents of each Field into a sequence of tokens.

Token
Token, the basic unit of indexing in Lucene, represents a single word to be indexed after any document domain transformation -- such as stop-word elimination, stemming, filtering, term normalization, or language translation -- has been applied. The application filters undesired tokens, like stop words or portions of the input that do not need to be indexed, through the Analyzer class. It also modifies tokens as they are encountered in the input, to perform stemming or other term normalization.

Conveniently, Lucene comes with a set of standard Analyzer objects for handling common transformations like word identification and stop-word elimination, so indexing simple text documents requires no additional work. If these aren't enough, the developer can provide more sophisticated analyzers.
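
The kind of transformation an analyzer performs can be sketched in plain Java. This is a stand-in, not Lucene's Analyzer API: the class name, the stop-word list, and the crude rule that strips a trailing "s" are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerSketch {
    // A tiny, hypothetical stop-word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is"));

    /** Splits text into words, lower-cases them, drops stop words, strips a plural "s". */
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String token = raw.toLowerCase(Locale.ROOT); // term normalization
            if (STOP_WORDS.contains(token)) {
                continue; // stop-word elimination
            }
            if (token.length() > 3 && token.endsWith("s")) {
                token = token.substring(0, token.length() - 1); // very naive stemming
            }
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [anatomy, search, engine]
        System.out.println(analyze("The Anatomy of Search Engines"));
    }
}
```

The same analysis has to be applied to queries as to documents; otherwise a search for "Engines" would not find the indexed token "engine".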

Analyzer
The application provides the document data in the form of a String or InputStream, which the Analyzer converts to a stream of tokens. Because of this, Lucene can index data from any data source, not just files. If the documents are stored in files, use FileInputStream to retrieve them, as illustrated in IndexFiles.java. If they are stored in an Oracle database, provide an InputStream class to retrieve them.

If a document is not a text file but an HTML or XML file, for example, you can extract content by eliminating markup like HTML tags, document headers, or formatting instructions. This can be done with a FilterInputStream, which would convert a document stream into a stream containing only the document's content text, and connect it to the InputStream that retrieves the document. So, if we wanted to index a collection of XML documents stored in an Oracle database, the resulting code would be very similar to IndexFiles.java. But it would use an application-provided InputStream class to retrieve the document from the database (instead of FileInputStream), as well as an application-provided FilterInputStream to parse the XML and extract the desired content.
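
A minimal sketch of such a filter, assuming that dropping everything between '<' and '>' is good enough. TagStrippingInputStream and extractText are hypothetical names; real HTML or XML would need a proper parser to handle comments, entities, CDATA, and scripts.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class TagStrippingDemo {
    /** Passes through only the bytes that lie outside <...> tags. */
    static class TagStrippingInputStream extends FilterInputStream {
        private boolean inTag = false;

        TagStrippingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '<') {
                    inTag = true;
                } else if (b == '>' && inTag) {
                    inTag = false;
                } else if (!inTag) {
                    return b;
                }
            }
            return -1;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) {
                return 0;
            }
            int n = 0;
            while (n < len) {
                int b = read();
                if (b == -1) {
                    break;
                }
                buf[off + n++] = (byte) b;
            }
            return n == 0 ? -1 : n;
        }
    }

    /** Runs a markup string through the filter and returns the remaining text. */
    public static String extractText(String markup) {
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new TagStrippingInputStream(new ByteArrayInputStream(bytes))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[256];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints Hello Lucene
        System.out.println(extractText("<p>Hello <b>Lucene</b></p>"));
    }
}
```

Because the filter wraps any InputStream, the same class works whether the underlying stream reads from a file, a database column, or memory, which is the point the slide makes about data sources.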

Summary
Lucene is the most flexible and convenient open source search toolkit I've ever used. Doug Cutting, Lucene's author, describes his primary goal for Lucene as "simplicity without loss of power or performance," and this shines through clearly in the result. The design seems so simple, you might suspect it is just the obvious way to design a search toolkit. We should all be so lucky as to craft such obvious designs for our own software.