1
June 22-23, 2005 Technology Infusion Team Committee
High Performance Parallel Lucene Search (for an OAI Federation)
K. Maly and M. Zubair
Department of Computer Science, Old Dominion University
2
Outline
  Motivation
  Approach
  Architecture
  Preliminary Results
  Current Status and Demo
  Conclusion
3
Motivation
With the growing number of OAI-PMH compliant collections, building a scalable federated digital library that can handle millions of records and work with thousands of digital libraries is a big challenge.
Problem: indexing and searching a large collection is time consuming. For example, ARC, one of the existing federated digital libraries, running on a single processor takes:
  about two days for indexing
  on the order of 15 minutes for a search that returns a large sorted result set
4
Approach
Use many cheap PCs in a cluster to improve the performance and reliability of searching and indexing: a Google-like approach for an OAI federation.
Why not adopt the Google solution?
  Expensive
  Proprietary
  Does not work with an OAI federation
5
Approach
Use open source indexing and searching software such as Lucene, which is optimized for text searching.
Why not use a relational database such as Oracle, MySQL, SQL, ...?
  Relational databases are not built for text-based searching and have problems with scaling.
  Additionally, relational databases do not have parallelism support.
6
Architecture - Modular
7
High Performance Parallel Lucene ARC: Basic / Advanced Search Scenario
[Architecture diagram: a user initiates a search, or an OAI request is received (DataProvider capability); the HTTP request passes through three layers.]
  Presentation Layer: JSP, Controller Servlet, Action Servlets (Browse, Search, OAIData)
  Service Layer: Services (Browse, Search, OAIData), Configuration Data Access, Config Files
  Storage / Indexing Layer (Lucene-based implementation): Lucene Metadata Search / Retrieval, Scheduler, Harvester, Metadata Distribution System, Node 1 to Node N with Disk 1 to Disk N
  Flow: the request is forwarded with the metadata query; the search layer reads the list of search machines and then initiates a parallel multiple search over RMI.
8
High Performance Parallel Lucene ARC: Basic / Advanced Search Scenario (search results returned to user)
[Architecture diagram: return path of a search through the three layers.]
  Presentation Layer: JSP, Controller Servlet, Action Servlets (SearchResults Return)
  Service Layer: Service (SearchResults Return), Configuration Data Access, Config Files
  Storage / Indexing Layer (Lucene-based implementation): Lucene Metadata Search / Retrieval, Scheduler, Harvester, Metadata Distribution System, Node 1 to Node N with Disk 1 to Disk N
  Flow: merged search results come back from the nodes over RMI and are returned to the user as an HTTP response (redirect).
9
High Performance Parallel Lucene ARC: Harvesting Flow Using the Console
[Architecture diagram: harvesting started from the console.]
  Presentation Layer: JSP, Controller Servlet, Action Servlets (Browse, Search, OAIData, Harvest, etc.)
  Service Layer: Services (Browse, Search, OAIData, Harvest, etc.), Configuration Data Access, Config Files
  Storage / Indexing Layer (Lucene-based implementation): Lucene Metadata Search / Retrieval, Scheduler, Harvester, Metadata Distribution System, Node 1 to Node N with Disk 1 to Disk N (reached via Forward / RMI)
The console initiates the Scheduler / Harvester, which fetches the list of DataProviders from the config files and harvests each DataProvider. Harvested chunks are sent to the Metadata Distribution Service, which distributes them to the indexing nodes.
10
High Performance Parallel Lucene ARC: Harvesting Flow Initiated by the Administrator
[Architecture diagram: harvesting started by an admin request.]
  Presentation Layer: JSP, Controller Servlet, Action Servlets (Harvest)
  Service Layer: Service (Harvest), Configuration Data Access, Config Files
  Storage / Indexing Layer (Lucene-based implementation): Lucene Metadata Search / Retrieval, Scheduler, Harvester, Metadata Distribution System, Node 1 to Node N with Disk 1 to Disk N (reached via Forward / RMI)
The administrator initiates the Scheduler / Harvester, which fetches the list of DataProviders from the config files and harvests each DataProvider. Harvested chunks are sent to the Metadata Distribution Service, which distributes them to the indexing nodes.
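The slides do not say how the Metadata Distribution Service chooses a node for each harvested chunk. As a minimal sketch, assuming a simple round-robin policy (an assumption, not the authors' stated method), the distribution step could look like this; `distribute_chunks`, the chunk names, and the node names are all hypothetical.

```python
from itertools import cycle


def distribute_chunks(chunks, nodes):
    """Assign harvested chunks to indexing nodes in round-robin order.

    Round-robin is an assumed policy; the slides only state that chunks
    are distributed to the indexing nodes.
    """
    assignment = {node: [] for node in nodes}
    # cycle() repeats the node list so chunk i goes to node i mod len(nodes).
    for chunk, node in zip(chunks, cycle(nodes)):
        assignment[node].append(chunk)
    return assignment


if __name__ == "__main__":
    plan = distribute_chunks(["c1", "c2", "c3", "c4", "c5"], ["node1", "node2"])
    for node, assigned in plan.items():
        print(node, assigned)
```

Round-robin keeps the per-node index sizes roughly equal, which matters because a parallel search is only as fast as its slowest node.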
11
Architecture - Index and Search Clusters
[Diagram: the search service node and the cluster nodes.]
  User Interface (JSP/Servlet): sends the query and receives the results
  Search Service Node: Lucene Parallel Search Service with Sorter, built on the Lucene API; reads Searchers.xml
  Cluster Nodes: Remote Searchable (128.82.7.251), Remote Searchable (128.82.7.242), Remote Searchable (128.82.7.244), Remote Searchable (128.82.7.243), ...
  The search service node talks to the cluster nodes over RMI.
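The layout of Searchers.xml is not shown in the slides. The sketch below assumes a hypothetical schema (a `searchers` root with one `searcher` element per node and a `host` attribute) purely to illustrate how the search service node might obtain its list of remote searchers; only the IP addresses come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents: the element and attribute names are assumptions.
SEARCHERS_XML = """
<searchers>
  <searcher host="128.82.7.251"/>
  <searcher host="128.82.7.242"/>
  <searcher host="128.82.7.244"/>
  <searcher host="128.82.7.243"/>
</searchers>
"""


def read_search_nodes(xml_text):
    """Return the host of every <searcher> entry, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [element.get("host") for element in root.iter("searcher")]


if __name__ == "__main__":
    for host in read_search_nodes(SEARCHERS_XML):
        print(host)
```

Keeping the node list in a file means nodes can be added to or removed from the cluster without changing the search service code.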
12
Architecture - Search Event-flow
[Diagram: Presentation Layer, Search Service Layer, and Index and Search Cluster Nodes; the cluster nodes index incrementally as file chunks arrive.]
  1) Get new query parameters: group field, sort field, filter fields, keyword
  2) Read the file "searcher.xml"
  3) Create ParallelMultiSearcher
  4) Divide the query into subqueries according to the group field
  5) Do a parallel multiple search for each archive
  6) Merge and sort results
  7) Output results
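Steps 4 to 6 of the event flow can be sketched as a fan-out and merge. The example below is not the Lucene API and not the authors' code: it replaces the remote Lucene searchers with small in-memory lists and uses a thread pool where the slides use RMI. `NODE_INDEXES`, `search_node`, and `parallel_search` are hypothetical names, and the records are made-up illustration data.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the per-node indexes (remote searchers in the slides).
NODE_INDEXES = {
    "node1": [
        {"archive": "A", "title": "parallel search", "date": "2004-03-01"},
        {"archive": "B", "title": "lucene search", "date": "2003-07-15"},
    ],
    "node2": [
        {"archive": "A", "title": "federated search", "date": "2002-11-30"},
        {"archive": "A", "title": "metadata harvesting", "date": "2005-01-10"},
    ],
}


def search_node(node, keyword, archive):
    """Search one node for records of one archive whose title contains keyword."""
    return [
        record
        for record in NODE_INDEXES[node]
        if record["archive"] == archive and keyword in record["title"]
    ]


def parallel_search(keyword, archives, sort_field):
    # Step 4: one subquery per value of the group field (here, per archive),
    # sent to every node.
    tasks = [(node, archive) for archive in archives for node in NODE_INDEXES]
    # Step 5: run all subqueries concurrently.
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        partials = list(pool.map(lambda t: search_node(t[0], keyword, t[1]), tasks))
    # Step 6: merge the partial hit lists and sort on the requested field.
    merged = [hit for partial in partials for hit in partial]
    return sorted(merged, key=lambda record: record[sort_field])


if __name__ == "__main__":
    for hit in parallel_search("search", ["A", "B"], "date"):
        print(hit["date"], hit["archive"], hit["title"])
```

The final sort runs on the search service node over the merged hits, which corresponds to the Sorter component of the search service.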
13
Preliminary Results
15
Current Status and Demo
16
Future Work
  Handling of duplicates (more complicated than in the single-processor case)
  Handling of deleted records (same issue as duplicates)
  The current prototype supports simple searches; it needs to be extended for advanced searches and post-processing
  Reliability support using redundant PCs
  Administration support
17
Conclusion
Current searching solutions for an OAI federation do not scale with the growing number of OAI data providers.
Using a cluster of cheap PCs together with the open source search software Lucene addresses the scalability issues of an OAI federation.