Published by Opal Poole. Modified over 8 years ago.
1 Using the Lucene Search Engine
2 Team
Phil Corcoran, Project Leader
- 10 years in software
- Telecoms, Finance, Manufacturing
- Requirements, Design, Test
Derek O’Keeffe, Senior Developer
- 10 years in software
- Java, ERP
- Joomla, Knowledge Tree
- Design, Code
3 Concepts
4 Lucene
- Full Text Search
- Cross Platform
- Lucene Document
- Inverted Index
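To make the "Inverted Index" concept concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's actual data structure or API: the class name `InvertedIndexSketch`, its methods, and the naive tokenizer are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the ids of the documents containing it.
// Illustration only; this is not how Lucene stores its index.
class InvertedIndexSketch {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();
    private final List<String> documents = new ArrayList<>();

    // Adds a document and returns its id (its position in the document list).
    int add(String text) {
        int docId = documents.size();
        documents.add(text);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        return docId;
    }

    // Returns the ids of documents containing the term; empty if there are none.
    SortedSet<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new TreeSet<>() : hits;
    }

    // Naive analysis step: lowercase, then split on anything that is not a-z or 0-9.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Lucene is a full text search library");
        index.add("Luke inspects a Lucene index");
        System.out.println(index.search("lucene"));
    }
}
```

A lookup only touches the postings for the query term, which is why searching stays fast even when the document collection is large.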
5 Lucene
- Inserts index records
- Searches index
- Gets Lucene documents
You
- Manage the indexing
  - Select data files
  - Parse files
- Manage querying
  - Accept user’s query
  - Display results to user
  - Retrieve original documents
6 iViewXT
7 Search Improvements
8 Test Document Collections 1. UAT 2.
9 Super Mario
10 Implementation Derek
11 Performance
12 Lucene Implementation
13 Lucene Implementation: Indexing
14 Lucene Implementation: Indexing
15 Lucene Implementation: Indexing
16 Lucene Indexing
17 Lucene Indexing Step 1 of 5
18 Lucene Indexing Step 2 of 5
19 Lucene Indexing Step 3 of 5
20 Lucene Indexing Step 4 of 5
21 Lucene Indexing Step 5 of 5
22 Lucene Indexing
23 Text Extraction
- Lucene is not a complete application: it indexes text, so you must extract the text from your files yourself.
- PDF files: text extraction
- Microsoft files: text extraction
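One way to organise text extraction is a small dispatch layer that picks an extractor by file extension before anything reaches the indexer. The sketch below is an assumption about how that could look, not the presenters' code: `ExtractorRegistrySketch` and `TextExtractor` are hypothetical names, and only a plain-text extractor is implemented. PDF and Microsoft extractors would wrap a third-party library and be registered the same way.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical dispatch layer: choose a text extractor by file extension
// so the indexer only ever sees plain text.
class ExtractorRegistrySketch {
    // An extractor turns raw file bytes into plain text.
    interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    // Registers an extractor for an extension such as "txt" (case-insensitive).
    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the extracted text, or empty if no extractor handles this file type.
    Optional<String> extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(extension);
        if (extractor == null) {
            return Optional.empty();
        }
        return Optional.of(extractor.extract(content));
    }

    // Plain-text files need no parsing; PDF and Office extractors are
    // deliberately left out because they depend on external libraries.
    static ExtractorRegistrySketch withPlainText() {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch();
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        return registry;
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = withPlainText();
        byte[] data = "hello index".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("notes.txt", data));
        System.out.println(registry.extract("report.pdf", data));
    }
}
```

Returning an empty `Optional` for unknown types lets the indexing job skip or log unsupported files instead of failing on them.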
24 Lucene Implementation
25 Lucene Implementation
26 Searching:
27 Searching: Step 1 of 6
28 Searching: Step 2 of 6
29 Searching: Step 3 of 6
30 Searching: Steps 4 & 5 of 6
31 Searching: Step 6 of 6
32 Searching:
33 Luke - Lucene Index Toolbox
- Client application that opens your index directly.
- Java Web Start app: http://www.getopt.org/luke/
- Handy for testing searches and performance.
34 Some problems encountered
- Max clause count exception: take care when automatically adding wildcards!
- Performance: do the work while indexing, not while searching.
- Pagination: get one page at a time from the Hits.
- Our security model: stored the collection of allowed containers in the UserSession.
- Visibility of the indexing job: added logging, e.g. “Indexing document 426 of 204,532”.
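The max clause count problem is easier to see with a toy model. In Lucene versions of this era, a wildcard or prefix query is rewritten into one clause per matching indexed term, and a boolean query is limited to 1024 clauses by default. The sketch below imitates that behaviour in plain Java; it is not Lucene code, and `WildcardExpansionSketch`, `expandPrefix`, and `sampleDictionary` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of wildcard expansion: a prefix query becomes one clause per
// matching term in the sorted term dictionary.
class WildcardExpansionSketch {
    // Mirrors the default clause limit of Lucene's boolean queries.
    static final int MAX_CLAUSES = 1024;

    // Returns every indexed term starting with the prefix, and fails as soon
    // as the expansion would exceed MAX_CLAUSES.
    static List<String> expandPrefix(SortedSet<String> termDictionary, String prefix) {
        List<String> clauses = new ArrayList<>();
        for (String term : termDictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // the set is sorted, so no later term can match
            }
            if (clauses.size() == MAX_CLAUSES) {
                throw new IllegalStateException("too many clauses for prefix '" + prefix + "'");
            }
            clauses.add(term);
        }
        return clauses;
    }

    // Builds a dictionary of n distinct terms: term00000, term00001, ...
    static SortedSet<String> sampleDictionary(int n) {
        SortedSet<String> terms = new TreeSet<>();
        for (int i = 0; i < n; i++) {
            terms.add(String.format(Locale.ROOT, "term%05d", i));
        }
        return terms;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary = sampleDictionary(2000);
        System.out.println(expandPrefix(dictionary, "term000").size());
        try {
            expandPrefix(dictionary, "term");
        } catch (IllegalStateException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

A short prefix typed by a user, or a wildcard appended automatically to every query word, can match thousands of terms in a large index, which is how the exception shows up in practice.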
35 Resources (general)
- An open source document management system in PHP with a Java Lucene search engine.
- http://lucene.apache.org/
- http://www.ibm.com/developerworks/web/library/wa-lucene2/
- http://www.ibm.com/developerworks/library/wa-lucene/
- Handy Ajax autocomplete component.
36 Resources (text extraction)
- Text extractor for PDF files: http://pdfbox.org
- JXL, text extractor for Excel files: http://jexcelapi.sourceforge.net/
- Text extractor for Word documents.
- API to access Microsoft format files (xls/doc/ppt). I would recommend this one over JXL or text-mining above.
37 Summary
- Lucene querying is fast (take care what you do with the results).
- Indexing is slow (make the indexing job visible).
- Use Luke.
- Add lots to the index (do the work while indexing).
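"Take care what you do with the results" mostly means not loading every matching document when the user sees only one page. The sketch below shows that idea in plain Java under stated assumptions: `ResultPagerSketch` and `page` are hypothetical names, and the `IntFunction` stands in for whatever per-hit document fetch your search layer provides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical pager: fetches only the hits that belong to the requested page.
class ResultPagerSketch {
    // pageNumber is zero-based; loadHit is called once per hit on the page
    // and never for hits outside it.
    static <T> List<T> page(int totalHits, int pageNumber, int pageSize, IntFunction<T> loadHit) {
        if (pageNumber < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 0 and pageSize > 0");
        }
        List<T> page = new ArrayList<>();
        long start = (long) pageNumber * pageSize; // long arithmetic avoids int overflow
        if (start >= totalHits) {
            return page; // past the last page
        }
        int end = (int) Math.min((long) totalHits, start + pageSize);
        for (int i = (int) start; i < end; i++) {
            page.add(loadHit.apply(i));
        }
        return page;
    }

    public static void main(String[] args) {
        // 25 hits in total, third page of 10: only hits 20 to 24 are loaded.
        List<String> lastPage = page(25, 2, 10, i -> "doc" + i);
        System.out.println(lastPage);
    }
}
```

The query itself can still report the total hit count for the page navigation; only the expensive document retrieval is limited to the visible page.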
38 END