Lucene. Jianguo Lu, School of Computer Science, University of Windsor.

What is Lucene
- Lucene is an API: an information retrieval library.
- Lucene is not an application ready for use. It is not a web server and not a search engine.
- It can be used to index files and to search the index.
- It is open source and written in Java.
- It works in two stages: index and search.
(Picture from Lucene in Action)

Indexing
A search index works just like the index at the end of a book. Without an index, you have to search for a keyword by scanning all the pages of the book.
Process of indexing:
- Acquire content.
- Build document: transform other formats, such as PDF and MS Word, to text. Lucene does not provide this kind of filter, but there are tools to do this.
- Analyze document: tokenize the document, apply stemming, remove stop words. Lucene provides a number of analyzers, and users can also customize the analyzer.
- Index document.
Key classes in Lucene indexing: Document, Analyzer, IndexWriter.
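The book-index analogy above can be sketched as a toy inverted index. This is only an illustration in plain Java, not how Lucene stores its index; the class name ToyInvertedIndex, its methods, and the sample sentences are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted set of ids of the
// documents that contain it. Hypothetical illustration, not Lucene's format.
class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();
    private int nextId = 0;

    // Indexes one document and returns its id (ids are assigned 0, 1, 2, ...).
    int add(String text) {
        int id = nextId++;
        // Very crude analysis: lowercase, then split on non-word characters.
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    // Looks up a term directly in the index, without scanning document text.
    List<Integer> lookup(String term) {
        TreeSet<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is an information retrieval library");
        index.add("An index speeds up search");
        index.add("Lucene builds an index before search");
        System.out.println(index.lookup("lucene")); // ids of documents containing "lucene"
        System.out.println(index.lookup("index"));  // ids of documents containing "index"
    }
}
```

The lookup touches only the map entry for the term, which is the point of the analogy: the cost of finding a keyword no longer depends on reading every document.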

Lucene analyzers
- StandardAnalyzer: a sophisticated general-purpose analyzer.
- WhitespaceAnalyzer: a very simple analyzer that just separates tokens using white space.
- StopAnalyzer: removes common English words that are not usually useful for indexing.
- SnowballAnalyzer: a stemming analyzer that works on word roots.
- Analyzers for languages other than English are also available.
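To make the difference between analyzers concrete, here is a minimal sketch in plain Java of two analysis styles: splitting on white space only, and lowercasing plus stop-word removal. It only mimics the behavior described above and is not Lucene's implementation; the class ToyAnalyzers, its methods, and the tiny stop list are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Two toy "analyzers" that turn text into tokens. Hypothetical illustration only.
class ToyAnalyzers {
    // A tiny made-up stop list; real stop lists are longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "is", "of", "to"));

    // Whitespace style: split on runs of white space and change nothing else,
    // so case and punctuation stay attached to the tokens.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stop-word style: lowercase, split on anything that is not a letter a-z,
    // then drop the words in STOP_WORDS.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The index of a Book.";
        System.out.println(whitespace(text)); // keeps "The", "of", "a" and "Book."
        System.out.println(stop(text));       // keeps only "index" and "book"
    }
}
```

The same analysis must be applied at index time and at query time; otherwise a query token such as "Book." would never match the indexed token "book".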

Code snippets (Lucene 3.0 API)

    // The index is written in this directory
    Directory dir = FSDirectory.open(new File(indexDir));
    // StandardAnalyzer is applied when a doc is added
    writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30),
        true,
        IndexWriter.MaxFieldLength.UNLIMITED);
    … …
    // Create a document instance from a file
    Document doc = new Document();
    doc.add(new Field("contents", new FileReader(f)));
    doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    // Add the doc to the writer
    writer.addDocument(doc);

Document and Field
Construct a Field:

    doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));

- The first two parameters are the field name and value.
- Third parameter: whether to store the value. If NO, the content is discarded after it is indexed. Storing the value is useful if you need it later, for example to display it in the search results.
- Fourth parameter: whether and how the field should be indexed.

    doc.add(new Field("contents", new FileReader(f)));

This creates a tokenized and indexed field that is not stored.
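The difference between storing and indexing a field can be illustrated with a toy model in plain Java. It is a sketch under simplifying assumptions, not Lucene's Field class: ToyFieldStore, its Index enum, and the sample path /docs/a.txt are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "stored" versus "indexed" fields. Hypothetical illustration only.
class ToyFieldStore {
    // How a field value is indexed: not at all, split into tokens, or as one term.
    enum Index { NO, ANALYZED, NOT_ANALYZED }

    // "field:term" -> ids of documents containing that term in that field.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Per document: field name -> original value, kept only for stored fields.
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int newDocument() {
        storedValues.add(new HashMap<>());
        return storedValues.size() - 1;
    }

    void addField(int docId, String name, String value, boolean store, Index index) {
        if (store) {
            storedValues.get(docId).put(name, value); // retrievable later
        }
        if (index == Index.NOT_ANALYZED) {
            post(name, value, docId);                 // whole value is one term
        } else if (index == Index.ANALYZED) {
            for (String token : value.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (!token.isEmpty()) {
                    post(name, token, docId);         // one term per token
                }
            }
        }
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    // Searchable only if the field was indexed.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new TreeSet<>());
    }

    // Returns the original value, or null if the field was not stored.
    String get(int docId, String field) {
        return storedValues.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyFieldStore store = new ToyFieldStore();
        int doc = store.newDocument();
        store.addField(doc, "contents", "Lucene in Action", false, Index.ANALYZED);
        store.addField(doc, "fullpath", "/docs/a.txt", true, Index.NOT_ANALYZED);
        System.out.println(store.search("contents", "lucene")); // found via the index
        System.out.println(store.get(doc, "contents"));         // not stored
        System.out.println(store.get(doc, "fullpath"));         // stored, so displayable
    }
}
```

In this model the contents field can be searched but not displayed, while fullpath can be displayed and matched only as a whole value, which mirrors the two Field constructions on the slide.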

Search
Three models of search:
- Boolean
- Vector space
- Probabilistic
Lucene supports a combination of the Boolean and vector space models.
Steps to carry out a search:
- Build a query.
- Issue the query to the index.
- Render the results, with pages ranked according to their relevance to the query.
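One way to picture a combination of the Boolean and vector space models: a Boolean test decides which documents match at all, and a vector space score ranks the matches. The sketch below does this in plain Java with a Boolean AND filter and TF-IDF cosine similarity. It is not Lucene's scoring formula; the class ToyRanker, the idf definition log(1 + N/df), and the sample documents are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy search: Boolean AND to select documents, TF-IDF cosine to rank them.
// Hypothetical illustration only, not Lucene's scoring.
class ToyRanker {
    private final List<Map<String, Integer>> docs = new ArrayList<>(); // term counts per doc
    private final Map<String, Integer> docFreq = new HashMap<>();      // docs containing term

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    int add(String text) {
        Map<String, Integer> tf = termFrequencies(text);
        for (String term : tf.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
        docs.add(tf);
        return docs.size() - 1;
    }

    // Assumed idf variant; only called for terms that occur in some document.
    private double idf(String term) {
        return Math.log(1.0 + (double) docs.size() / docFreq.get(term));
    }

    // Euclidean length of the TF-IDF vector.
    private double norm(Map<String, Integer> tf) {
        double sum = 0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double w = e.getValue() * idf(e.getKey());
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Returns ids of documents containing every query term, best match first.
    List<Integer> search(String queryText) {
        Map<String, Integer> query = termFrequencies(queryText);
        List<Integer> ids = new ArrayList<>();
        if (query.isEmpty() || !docFreq.keySet().containsAll(query.keySet())) {
            return ids; // some query term occurs in no document
        }
        double[] scores = new double[docs.size()];
        double queryNorm = norm(query);
        for (int id = 0; id < docs.size(); id++) {
            Map<String, Integer> doc = docs.get(id);
            if (!doc.keySet().containsAll(query.keySet())) {
                continue; // Boolean model: AND over all query terms
            }
            double dot = 0;
            for (Map.Entry<String, Integer> e : query.entrySet()) {
                double idf = idf(e.getKey());
                dot += (e.getValue() * idf) * (doc.get(e.getKey()) * idf);
            }
            scores[id] = dot / (queryNorm * norm(doc)); // vector space model: cosine
            ids.add(id);
        }
        ids.sort((a, b) -> Double.compare(scores[b], scores[a]));
        return ids;
    }
}
```

For example, after adding "lucene search library", "lucene lucene search", and "cooking recipes" (ids 0, 1, 2), the query "lucene" excludes document 2 through the Boolean filter and ranks document 1 above document 0, because "lucene" carries more of the weight in document 1's vector.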

Search code (Lucene 3.0 API)

    // Indicate which index to search
    Directory dir = FSDirectory.open(new File(indexDir));
    IndexSearcher is = new IndexSearcher(dir);
    // Parse the query
    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
        new StandardAnalyzer(Version.LUCENE_30));
    Query query = parser.parse(q);
    // Search the query
    TopDocs hits = is.search(query, 10);
    // Process the results one by one
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = is.doc(scoreDoc.doc);
        System.out.println(doc.get("fullpath"));
    }

Note that "fullpath" is a field added while indexing.