
1 High Performance Parallel Lucene search (for an OAI federation)
K. Maly and M. Zubair
Department of Computer Science, Old Dominion University
Technology Infusion Team Committee, June 22-23, 2005

2 Outline
- Motivation
- Approach
- Architecture
- Preliminary Results
- Current Status and Demo
- Conclusion

3 Motivation
With the growing number of OAI-PMH compliant collections, building a scalable federated digital library that can handle millions of records and work with thousands of digital libraries is a big challenge.
Problem: indexing and searching a large collection is time consuming. For example, ARC, an existing federated digital library, running on a single processor:
- takes about two days for indexing
- takes on the order of 15 minutes for a search that produces a large sorted result set

4 Approach
Use many cheap PCs in a cluster to improve the performance and reliability of searching and indexing: a Google-like approach for an OAI federation.
Why not adopt the Google solution?
- Expensive
- Proprietary
- Does not work with an OAI federation

5 Approach
Use open source indexing and searching software such as Lucene, which is optimized for text searching.
Why not use a relational database like Oracle, MySQL, SQL, ...?
- Relational databases are not built for text-based searching and have problems with scaling.
- Additionally, relational databases do not have parallelism support.

6 Architecture - Modular

7 High Performance Parallel Lucene ARC: Basic Advanced Search Scenario
[Diagram: user initiating search / OAI request received (DataProvider capability)]
- Presentation Layer: JSP, Controller Servlet, Action Servlets (Browse, Search, OAIData)
- Service Layer: Service (Browse, Search, OAIData), Configuration, Data Access
- Config Files
- Storage/Indexing Layer (Lucene-based implementation): Lucene Metadata Search/Retrieval, Scheduler, Harvester, Metadata Distribution System
- Cluster: Node 1, Node 2, ..., Node N with Disk 1, Disk 2, ..., Disk N
- Flow labels: Http Request, Forward, Search Metadata Query, RMI
- Note on the diagram: read the search machines list, then initiate the parallel multiple search

8 High Performance Parallel Lucene ARC: Basic Advanced Search Scenario
[Diagram: search results returned to user]
- Presentation Layer: JSP, Controller Servlet, Action Servlets (SearchResults Return)
- Service Layer: Service (SearchResults Return), Configuration, Data Access
- Config Files
- Storage/Indexing Layer (Lucene-based implementation): Lucene Metadata Search/Retrieval, Scheduler, Harvester, Metadata Distribution System
- Cluster: Node 1, Node 2, ..., Node N with Disk 1, Disk 2, ..., Disk N
- Flow labels: RMI, Merged Search Results, Redirect, Http Response

9 High Performance Parallel Lucene ARC: Harvesting Flow using Console
[Diagram]
- Presentation Layer: JSP, Controller Servlet, Action Servlets (Browse, Search, OAIData, Harvest, etc.)
- Service Layer: Service (Browse, Search, OAIData, Harvest, etc.), Configuration, Data Access
- Config Files
- Storage/Indexing Layer (Lucene-based implementation): Lucene Metadata Search/Retrieval, Scheduler, Harvester, Metadata Distribution System
- Cluster: Node 1, Node 2, ..., Node N with Disk 1, Disk 2, ..., Disk N
- External components: Console, DataProvider
- Flow labels: Forward, RMI
The Console initiates the Scheduler/Harvester, which fetches the DataProvider list from the Config Files and harvests each DataProvider. Harvested chunks are sent to the Metadata Distribution Service, which distributes them to the indexing nodes.

10 High Performance Parallel Lucene ARC: Harvesting Flow initiated by Administrator
[Diagram]
- Presentation Layer: JSP, Controller Servlet, Action Servlets (Harvest)
- Service Layer: Service (Harvest), Configuration, Data Access
- Config Files
- Storage/Indexing Layer (Lucene-based implementation): Lucene Metadata Search/Retrieval, Scheduler, Harvester, Metadata Distribution System
- Cluster: Node 1, Node 2, ..., Node N with Disk 1, Disk 2, ..., Disk N
- External component: DataProvider
- Flow labels: Admin Request, Forward, Harvest, RMI
The Administrator initiates the Scheduler/Harvester, which fetches the DataProvider list from the Config Files and harvests each DataProvider. Harvested chunks are sent to the Metadata Distribution Service, which distributes them to the indexing nodes.
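
The last step of the harvesting flow, handing harvested chunks to the indexing nodes, might look roughly like the sketch below. The slides do not say how chunks are assigned to nodes, so the round-robin policy, the function name `distribute_chunks`, and the node names are all assumptions made for illustration, not the ARC implementation.

```python
from itertools import cycle


def distribute_chunks(chunks, node_names):
    """Assign each harvested chunk to an indexing node in round-robin order.

    Round-robin is an assumed policy; the slides do not specify one.
    Returns a dict mapping node name -> list of chunks for that node.
    """
    if not node_names:
        raise ValueError("at least one indexing node is required")
    assignment = {name: [] for name in node_names}
    # cycle() repeats the node list for as long as there are chunks left.
    for chunk, name in zip(chunks, cycle(node_names)):
        assignment[name].append(chunk)
    return assignment


if __name__ == "__main__":
    # Hypothetical chunk and node names, for demonstration only.
    harvested = [f"chunk-{i}" for i in range(5)]
    print(distribute_chunks(harvested, ["node1", "node2", "node3"]))
```

A round-robin split keeps the per-node index sizes close to equal, which matters because a parallel search is only as fast as its slowest node.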

11 Architecture - Index and Search Clusters
[Diagram]
- User Interface (JSP/Servlet): sends the Query, receives the Results
- Search Service Node: Lucene Parallel Search Service, Sorter, Searchers.xml, Lucene API
- RMI links between the search service node and the cluster nodes
- Cluster Nodes: Remote Searchable (128.82.7.251), Remote Searchable (128.82.7.242), Remote Searchable (128.82.7.243), Remote Searchable (128.82.7.244), ...
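
The slide shows a Searchers.xml file on the search service node but not its contents. As a rough illustration of how such a file could list the remote search nodes, the sketch below parses a hypothetical layout; the element name `searcher`, the attribute `host`, and the function `read_search_hosts` are assumptions, and the real file format may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical file layout; only the IP addresses come from the slide.
SAMPLE_SEARCHERS_XML = """<searchers>
  <searcher host="128.82.7.251"/>
  <searcher host="128.82.7.242"/>
  <searcher host="128.82.7.243"/>
  <searcher host="128.82.7.244"/>
</searchers>"""


def read_search_hosts(xml_text):
    """Return the host of every <searcher> element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.attrib["host"] for element in root.iter("searcher")]


if __name__ == "__main__":
    for host in read_search_hosts(SAMPLE_SEARCHERS_XML):
        print(host)
```

Keeping the node list in a file rather than in code would let an administrator add or remove cluster nodes without redeploying the search service.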

12 Architecture - Search Event-flow
[Diagram spanning the Presentation Layer, the Search Service Layer, and the Index and Search Cluster Nodes]
1) Get the new query parameters: group field, sort field, filter fields, keyword
2) Read the file "searcher.xml"
3) Create a ParallelMultiSearcher
4) Divide the query into subqueries according to the group field
5) Do a parallel multiple search for each archive
6) Merge and sort the results
7) Output the results
Incremental indexing: file chunks arriving
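
The fan-out and merge part of this event flow (steps 4 to 6) can be sketched without Lucene or RMI. In the sketch below, in-memory record lists stand in for the cluster nodes and a thread pool stands in for the remote calls; the record fields (`archive`, `title`, `date`) and the function names are assumptions chosen for illustration, and the group field is assumed to be the archive.

```python
from concurrent.futures import ThreadPoolExecutor


def search_node(records, keyword, archive):
    """Stand-in for one remote searcher: match keyword in title within one archive."""
    needle = keyword.lower()
    return [
        record for record in records
        if record["archive"] == archive and needle in record["title"].lower()
    ]


def grouped_parallel_search(nodes, keyword, archives, sort_field):
    """Run one subquery per archive on all nodes in parallel, then merge and sort."""
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        for archive in archives:
            # Step 5: the same subquery is sent to every node concurrently.
            partials = pool.map(
                lambda records, a=archive: search_node(records, keyword, a),
                nodes,
            )
            # Step 6: merge the per-node hits and sort on the requested field.
            merged = [hit for partial in partials for hit in partial]
            results[archive] = sorted(merged, key=lambda record: record[sort_field])
    return results


if __name__ == "__main__":
    # Two hypothetical nodes, each holding part of the federation's metadata.
    node1 = [
        {"archive": "A", "title": "Parallel search", "date": "2005-02"},
        {"archive": "B", "title": "Indexing at scale", "date": "2004-01"},
    ]
    node2 = [{"archive": "A", "title": "Search engines", "date": "2003-07"}]
    print(grouped_parallel_search([node1, node2], "search", ["A", "B"], "date"))
```

Sorting after the merge, on the search service node, is what lets each cluster node return its hits in any order.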

13 Preliminary Results

15 Current Status and Demo

16 Future Work
- Handling of duplicates (more complicated than in the single-processor case)
- Handling of deleted records (same as duplicates)
- The current prototype supports simple searches; it needs to be extended for advanced searches and post-processing
- Reliability support using redundant PCs
- Administration support

17 Conclusion
- Current searching solutions for an OAI federation do not scale with the growing number of OAI data providers.
- Using a cluster of cheap PCs together with the open source search software Lucene addresses the scalability issues of an OAI federation.

