Published by Bathsheba Ramsey. Modified over 9 years ago.
1
Lucene-Demo Brian Nisonger
2
Intro
No details about Implementation/Theory
  See Treehouse Wiki - Lucene for additional info
Set of Java classes
Not an end-to-end solution
Designed to allow rapid development of IR tools
3
Index
The first step is to take a set of text documents and build an Index
Demo: IndexFiles on Pongo
Two major classes
  Analyzer
    Used to tokenize data
    More on this later
  IndexWriter
    IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);
4
Index Writer
IndexWriter creates an index of documents
First argument is the directory in which to build or find the index
Second argument is the Analyzer used to tokenize the documents
Third argument determines whether a new index should be created (true creates a new index, false opens an existing one)
5
Analyzer
Standard Analyzer
Porter Stemming w/ Stop Words
Krovetz Stemmer - Example

  package org.apache.lucene.analysis;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.*;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.StopFilter;
  import org.apache.lucene.analysis.LowerCaseTokenizer;
  import org.apache.lucene.analysis.KStemFilter;
  import java.io.Reader;

  public class KStemAnalyzer extends Analyzer
  {
    // Lowercase and split the input, then pass each token through the Krovetz stemmer
    public final TokenStream tokenStream(String fieldName, Reader reader)
    {
      return new KStemFilter(new LowerCaseTokenizer(reader));
    }
  }
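To make the idea of an analysis chain concrete, here is a minimal sketch in plain Java that does not use Lucene at all. It assumes a tokenizer that splits on non-letters and lowercases, followed by a stop-word filter. The class name ToyAnalyzer and the stop list are hypothetical and chosen only for illustration; no stemming is performed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain (not Lucene code): lowercase letter tokenizer, then a stop-word filter. */
final class ToyAnalyzer {
    // Hypothetical stop list, chosen only for this example.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "is"));

    /** Splits the text on non-letters, lowercases each token, and drops stop words. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split() yields an empty first element when the text starts with a non-letter
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Index of a Lucene demo")); // prints [index, lucene, demo]
    }
}
```

A stemmer would slot in as one more step after the stop-word filter, in the same way KStemFilter wraps LowerCaseTokenizer in the slide's example.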
6
Analyzer-II
Snowball Stemmer
  A stemmer language created by Porter, used to build Stemmers
  Multilingual analyzers/Stemmers
  Porter2
  Fully integrated with Lucene 1.9.1
MyAnalyzer (Home Built)
  Demo
7
Adding Documents
The next step after creating an index is to add documents
  writer.addDocument(FileDocument.Document(file));
  Remember, we already determined how the document will be tokenized
Fields
  Can split a document into parts such as document title, body, date created, paragraphs
8
Adding Documents-II
Assigns Token/doc ID
  For why this is important, see Lucene - TreeHouse Wiki
Create some type of loop to add all the documents
This is the actual creation of the Index; before this we merely set the Index parameters
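The loop that adds documents and assigns doc IDs can be pictured with a small in-memory inverted index. This is a sketch in plain Java under simplifying assumptions, not how Lucene stores its index: ToyInvertedIndex and its methods are hypothetical names, tokenization is just lowercase letter runs, and only doc IDs (no positions or frequencies) are recorded.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index (not Lucene code): maps each term to the IDs of the documents containing it. */
final class ToyInvertedIndex {
    private final List<String> docNames = new ArrayList<>();
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Adds one document and returns the doc ID assigned to it (0, 1, 2, ... in insertion order). */
    int addDocument(String name, String text) {
        int docId = docNames.size();
        docNames.add(name);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Doc IDs only grow, so checking the last entry is enough to avoid duplicates.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    /** Returns the IDs of all documents containing the (already lowercased) term. */
    List<Integer> docsContaining(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptyList());
    }

    public static void main(String[] args) {
        String[][] files = {
            {"a.txt", "Lucene builds an index"},
            {"b.txt", "Searching an index with Lucene"}
        };
        ToyInvertedIndex index = new ToyInvertedIndex();
        for (String[] file : files) { // the "loop to add all the documents"
            index.addDocument(file[0], file[1]);
        }
        System.out.println(index.docsContaining("lucene")); // prints [0, 1]
    }
}
```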
9
Finalizing Index Creation
After that, the Index is optimized with writer.optimize();
  Merges etc.
The Index is closed with writer.close();
10
Searching an Index
Open Index
  IndexReader reader = IndexReader.open(index);
Create Searcher
  Searcher searcher = new IndexSearcher(reader);
Assign Analyzer
  Use the same Analyzer used to create the Index (Why?)
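One way to answer the "Why?" is to watch a query fail when the analyzers differ. The sketch below is plain Java with hypothetical names (AnalyzerMismatchDemo, analyze, matches), not Lucene code. It assumes the index was built with a lowercasing analyzer and compares a query analyzed the same way against one that keeps its original case.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy demonstration (not Lucene code) of why index-time and query-time analysis must agree. */
final class AnalyzerMismatchDemo {
    /** Splits on non-letters; lowercases tokens only when asked to. */
    static List<String> analyze(String text, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            tokens.add(lowercase ? raw.toLowerCase(Locale.ROOT) : raw);
        }
        return tokens;
    }

    /** True when every analyzed query term is present among the indexed terms. */
    static boolean matches(Set<String> indexedTerms, String query, boolean lowercaseQuery) {
        List<String> queryTerms = analyze(query, lowercaseQuery);
        return !queryTerms.isEmpty() && indexedTerms.containsAll(queryTerms);
    }

    public static void main(String[] args) {
        // Index time: terms are stored lowercased.
        Set<String> indexed = new HashSet<>(analyze("Lucene Demo", true));
        System.out.println(matches(indexed, "Lucene", true));  // same analysis: prints true
        System.out.println(matches(indexed, "Lucene", false)); // different analysis: prints false
    }
}
```

The same mismatch arises with stemming or stop words: the index only contains the terms the index-time analyzer produced, so the query must be reduced to terms of the same form.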
11
Searching an Index-II
Parse/Create query
  Query query = QueryParser.parse(line, field, analyzer);
  Takes a query line, the default field to search, and an analyzer, and runs the line through the analyzer to create a query
Determine which documents are matches
  Hits hits = searcher.search(query);
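As a rough picture of what parsing a line against a field means, the following plain-Java sketch turns a query line into field:term clauses. It is an assumption-laden toy, not Lucene's QueryParser: ToyQueryParser is a hypothetical name, only bare terms and field:term are handled (no phrases, boolean operators, or wildcards), and lowercasing stands in for the analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy query parser (not Lucene's QueryParser): handles bare terms and field:term only. */
final class ToyQueryParser {
    /** Returns one "field:term" clause per whitespace-separated part of the line. */
    static List<String> parse(String line, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String part : line.trim().split("\\s+")) {
            if (part.isEmpty()) {
                continue;
            }
            int colon = part.indexOf(':');
            // A term without an explicit "field:" prefix is searched in the default field.
            String field = colon > 0 ? part.substring(0, colon) : defaultField;
            String text = colon > 0 ? part.substring(colon + 1) : part;
            String term = text.toLowerCase(Locale.ROOT); // stand-in for running the analyzer
            if (!term.isEmpty()) {
                clauses.add(field + ":" + term);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(parse("title:Lucene Demo", "contents")); // prints [title:lucene, contents:demo]
    }
}
```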
12
Retrieving Documents
Hits is a collection of the matching documents
Using a loop we can reference each doc
  Document doc = hits.doc(i);
This allows us to get info about the document
  Name of document, date it was created, words in document
  Relevancy Score (TF/IDF)
Demo
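For intuition about the TF/IDF relevancy score, here is a textbook-style calculation in plain Java. It is only a sketch: Lucene's own scoring combines more factors than this, and ToyTfIdf, its score method, and the tiny corpus are hypothetical. The assumed formula is tf * ln(N / df), where tf is the term count in the document, N the number of documents, and df the number of documents containing the term.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy TF-IDF scorer (a textbook formula, not Lucene's scoring). */
final class ToyTfIdf {
    /** Returns tf * ln(N / df), or 0 when the term does not occur in the document. */
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        int tf = Collections.frequency(doc, term);
        int df = 0;
        for (List<String> other : corpus) {
            if (other.contains(term)) {
                df++;
            }
        }
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        // A term that appears in every document gets ln(1) = 0 and so contributes nothing.
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("lucene", "search"),
                Arrays.asList("text", "documents"));
        // Loop over the documents, as one would loop over hits, and print each score.
        for (int i = 0; i < corpus.size(); i++) {
            System.out.println("doc " + i + " score " + score("lucene", corpus.get(i), corpus));
        }
    }
}
```

With this corpus the first document scores highest for "lucene" because the term occurs twice there, and the third scores zero because the term is absent.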
13
Finishing Searching
Return list of documents
Close Reader
14
Other Functions
Spans (Example from http://lucene.apache.org/java/docs/api/index.html)
  Useful for Phrasal matching
  Allows for Passage Retrieval
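The idea behind span-style matching is that token positions are kept, so a query can ask for terms that occur close together and get back where they occur. The plain-Java sketch below illustrates that idea only. It is not Lucene's span API, and its distance rule (difference in token positions at most maxGap, in either order) is an assumption of this example, as are the names ToySpanNear and findSpan.

```java
import java.util.Arrays;
import java.util.List;

/** Toy proximity matcher (not Lucene's span queries): finds two terms close together. */
final class ToySpanNear {
    /**
     * Returns {start, end} token positions of the tightest window containing both terms,
     * provided end - start is at most maxGap; returns null when no such window exists.
     */
    static int[] findSpan(List<String> tokens, String first, String second, int maxGap) {
        int[] best = null;
        int lastFirst = -1;
        int lastSecond = -1;
        for (int i = 0; i < tokens.size(); i++) {
            String token = tokens.get(i);
            if (token.equals(first)) {
                lastFirst = i;
            }
            if (token.equals(second)) {
                lastSecond = i;
            }
            if (lastFirst >= 0 && lastSecond >= 0) {
                int start = Math.min(lastFirst, lastSecond);
                int end = Math.max(lastFirst, lastSecond);
                boolean tighter = best == null || end - start < best[1] - best[0];
                if (end - start <= maxGap && tighter) {
                    best = new int[] {start, end};
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "lucene", "allows", "fast", "passage", "retrieval", "over", "text");
        int[] span = findSpan(tokens, "passage", "retrieval", 1);
        // The returned positions delimit the passage that could be shown to the user.
        System.out.println(tokens.subList(span[0], span[1] + 1)); // prints [passage, retrieval]
    }
}
```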
15
Questions?
Any questions, comments, jokes, opinions?
16
I said “Good Day”
The END