

Lucene-Demo Brian Nisonger

Intro
- No details about implementation or theory
- See Treehouse Wiki (Lucene) for additional info
- Set of Java classes
- Not an end-to-end solution
- Designed to allow rapid development of IR tools

Index
- The first step is to take a set of text documents and build an index
- Demo: IndexFiles on Pongo
- Two major classes
  - Analyzer
    - Used to tokenize data
    - More on this later
  - IndexWriter
    - IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

Index Writer
- Index Writer creates an index of documents
- First argument is the directory in which to build or find the index
- Second argument supplies the Analyzer used to tokenize documents
- Third argument determines whether a new index should be created

Analyzer
- Standard Analyzer
- Porter Stemming w/ Stop Words
- Krovetz Stemmer example:

    package org.apache.lucene.analysis;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.*;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.LowerCaseTokenizer;
    import org.apache.lucene.analysis.KStemFilter;
    import java.io.Reader;

    public class KStemAnalyzer extends Analyzer
    {
      public final TokenStream tokenStream(String fieldName, Reader reader)
      {
        return new KStemFilter(new LowerCaseTokenizer(reader));
      }
    }

Analyzer-II
- Snowball Stemmer
  - A stemmer language created by Porter, used to build stemmers
  - Multilingual analyzers/stemmers
  - Porter2
  - Fully integrated with Lucene
- MyAnalyzer (home built)
  - Demo
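As a rough illustration of what an analyzer chain does (lowercase, tokenize, drop stop words, stem), here is a minimal plain-Java sketch that does not use Lucene at all. The class name ToyAnalyzer, the stop list, and the suffix-stripping rule are all assumptions made up for this example; the "stemmer" is a crude stand-in and is neither Porter nor Krovetz.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzer for illustration; not a Lucene class.
class ToyAnalyzer {
    // Made-up stop list, far shorter than any real one.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Lowercase, then split on anything that is not a letter.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }

    // Crude suffix stripping, standing in for a real stemmer.
    static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ing")) {
            return token.substring(0, token.length() - 3);
        }
        if (token.length() > 3 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The indexing of documents")); // prints [index, document]
    }
}
```

Swapping the stem method for a different rule set changes which words collapse to the same term, which is the practical difference between the analyzers listed above.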

Adding Documents
- The next step after creating an index is to add documents
  - writer.addDocument(FileDocument.Document(file));
- Remember, we already determined how the document will be tokenized
- Fields
  - Can split a document into parts such as document title, body, date created, paragraphs

Adding Documents-II
- Assigns token/doc ID
  - For why this is important, see the Lucene page on the TreeHouse Wiki
- Create some type of loop to add all the documents
- This is the actual creation of the index; before this, we merely set the index parameters
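The add-documents loop and the doc ID assignment can be pictured with a small plain-Java sketch. This is not Lucene's index format or API: ToyIndex, its fields, and the file names are hypothetical, and the tokenization is a bare lowercase split. It only shows the idea that each added document receives the next ID and that each term maps to the IDs of the documents containing it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy inverted index; not Lucene's on-disk structure.
class ToyIndex {
    // Position in this list is the doc ID assigned at add time.
    final List<String> docNames = new ArrayList<>();
    // term -> ascending list of doc IDs that contain it.
    final Map<String, List<Integer>> postings = new TreeMap<>();

    int addDocument(String name, String text) {
        int docId = docNames.size();
        docNames.add(name);
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // Record each document once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    public static void main(String[] args) {
        // Made-up file names and contents.
        String[][] files = {
            {"a.txt", "Lucene builds an index"},
            {"b.txt", "An index maps terms to documents"},
        };
        ToyIndex index = new ToyIndex();
        for (String[] file : files) {
            index.addDocument(file[0], file[1]);
        }
        System.out.println(index.postings.get("index")); // prints [0, 1]
    }
}
```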

Finalizing Index Creation
- After that, the index is optimized with writer.optimize();
  - Merges etc.
- The index is closed with writer.close();

Searching an Index
- Open index
  - IndexReader reader = IndexReader.open(index);
- Create searcher
  - Searcher searcher = new IndexSearcher(reader);
- Assign analyzer
  - Use the same Analyzer used to create the index (Why?)
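One way to see why the query side should use the same analyzer as the index side: terms are stored in their analyzed form, so a query term that is normalized differently never lines up with them. The sketch below is plain Java with hypothetical names (AnalyzerMismatch, LOWERCASE, AS_IS) and uses lowercasing as the only "analysis" step; real analyzers differ in more ways, but the failure mode is the same.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Hypothetical demo of an index/query analyzer mismatch; not Lucene code.
class AnalyzerMismatch {
    static final UnaryOperator<String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);
    static final UnaryOperator<String> AS_IS = s -> s;

    // "Index" a text by storing each whitespace-separated token in normalized form.
    static Set<String> indexTerms(String text, UnaryOperator<String> normalize) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(normalize.apply(token));
            }
        }
        return terms;
    }

    // A query term matches only if its normalized form is among the stored terms.
    static boolean matches(Set<String> indexedTerms, String queryTerm, UnaryOperator<String> normalize) {
        return indexedTerms.contains(normalize.apply(queryTerm));
    }

    public static void main(String[] args) {
        Set<String> index = indexTerms("Lucene Demo", LOWERCASE);
        System.out.println(matches(index, "Lucene", LOWERCASE)); // prints true
        System.out.println(matches(index, "Lucene", AS_IS));     // prints false
    }
}
```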

Searching an Index-II
- Parse/create query
  - Query query = QueryParser.parse(line, field, analyzer);
  - Takes a query line and the default field to search, and runs the line through an analyzer to create the query
- Determine which documents are matches
  - Hits hits = searcher.search(query);

Retrieving Documents
- Hits holds the collection of matching documents
- Using a loop we can reference each doc
  - Document doc = hits.doc(i);
- This allows us to get info about the document
  - Name of document, date it was created, words in document
  - Relevancy score (TF/IDF)
- Demo
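For readers who want a feel for the TF/IDF idea behind the relevancy score, here is a textbook-style sketch in plain Java. It is an assumption-laden simplification: Lucene's actual scoring includes further factors that are not modeled here, and the class name ToyTfIdf and the tiny corpus are invented for the example. The score is the term's count in the document times the natural log of (number of documents / number of documents containing the term).

```java
import java.util.Collections;
import java.util.List;

// Hypothetical textbook tf-idf; Lucene's real scoring has more factors.
class ToyTfIdf {
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        // tf: how often the term occurs in this document.
        int tf = Collections.frequency(doc, term);
        // df: how many documents in the corpus contain the term.
        int df = 0;
        for (List<String> other : corpus) {
            if (other.contains(term)) {
                df++;
            }
        }
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        // Made-up, already tokenized documents.
        List<List<String>> corpus = List.of(
            List.of("lucene", "index", "lucene"),
            List.of("index", "search"),
            List.of("index", "query"));
        for (List<String> doc : corpus) {
            System.out.println(doc + " -> " + score("lucene", doc, corpus));
        }
    }
}
```

A term such as "index" that appears in every document gets a weight of zero under this formula, which is why rare terms dominate the ranking.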

Finishing Searching
- Return list of documents
- Close reader

Other Functions
- Spans (example from dex.html)
  - Useful for phrasal matching
  - Allows for passage retrieval
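The idea behind span-style matching is that the index keeps token positions, so a query can require two terms to occur near each other and can report where the match starts. The following plain-Java sketch illustrates that idea only; it is not Lucene's span query API, and ToySpans, near, and the slop parameter as defined here are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical position-based matcher; not Lucene's span classes.
class ToySpans {
    // Returns the positions of `first` that are followed by `second`
    // with at most `slop` other tokens in between.
    static List<Integer> near(List<String> tokens, String first, String second, int slop) {
        List<Integer> starts = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            int limit = Math.min(tokens.size() - 1, i + 1 + slop);
            for (int j = i + 1; j <= limit; j++) {
                if (tokens.get(j).equals(second)) {
                    starts.add(i);
                    break;
                }
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("the", "quick", "brown", "fox");
        System.out.println(near(tokens, "quick", "brown", 0)); // prints [1]
        System.out.println(near(tokens, "quick", "fox", 1));   // prints [1]
        System.out.println(near(tokens, "quick", "fox", 0));   // prints []
    }
}
```

With a slop of 0 this behaves like an exact two-word phrase match, and the returned start positions are what a passage retrieval step could use to cut out the surrounding text.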

Questions?
- Any questions, comments, jokes, opinions?

I said “Good Day”
- The END