Slide 1
Using SRB and iRODS with the Cheshire3 Information Framework
Building Data Grids with iRODS
May 2008, National e-Science Centre, Edinburgh
Dr Robert Sanderson
Dept. of Computer Science
University of Liverpool
iRODS Workshop, May 27th 2008
Slide 2: Overview
- Cheshire3
  - Introduction
  - Architecture
- SRB Integration
  - Architecture
  - Grid
  - Usage
- iRODS Integration
  - Possible Architectures
Slide 3: Introduction
Cheshire3: Information Analysis Framework
- Digital Library/Information Retrieval engine with...
  - Data Mining/Machine Learning
  - Text Mining/Natural Language Processing
  - Computational Grid
  - Data Grid
- Standards based: Unicode, XML/XPath, MPI, Z39.50/SRU, ...
- Object-oriented architecture
- Easy to develop and extend in Python, ...
  - but heavy lifting possible in imported C libraries
- Developed at University of Liverpool, plus UC Berkeley
- Version: mostly stable, needs thorough testing/documentation
Slide 4: Context
Slide 5: Architecture
[Diagram: Cheshire3 object model and ingest process]
Labels: Server, ConfigStore, UserStore, User, Object, Database, ProtocolHandler, Query, ResultSet, DocumentFactory, Document, Documents, DocumentStore, PreParser, Parser, Record, Records, RecordStore, Transformer, Index, IndexStore, Extractor, Tokenizer, TokenMerger, Normalizer, Terms, Ingest Process
Slide 6: Architecture 2
[Diagram: index processing chains from Record to IndexStore]
Labels (several repeated, once per chain): Record, XPathObject, Extractor, Tokenizer, TokenMerger, Normalizer, Index, IndexStore
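One way to read the chain on this slide is as a small pipeline applied to each record. The following is a minimal sketch in plain Python, assuming the order Extractor, Tokenizer, TokenMerger, Normalizer. All function names and the dict-based record are hypothetical stand-ins for illustration, not the Cheshire3 API.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the diagram's components; none of these are
# Cheshire3 classes or method names.

def extract(record, field):
    """Extractor: pull the selected text out of a record.

    The record is a plain dict here, standing in for an XPath selection
    on an XML record.
    """
    return record.get(field, "")

def tokenize(text):
    """Tokenizer: split text into word tokens."""
    return re.findall(r"\w+", text)

def merge_tokens(tokens):
    """TokenMerger: collapse the token list into term -> occurrence count."""
    return Counter(tokens)

def normalize(terms):
    """Normalizer: case-fold terms, summing the counts of terms that collide."""
    out = Counter()
    for term, count in terms.items():
        out[term.lower()] += count
    return out

def index_record(record, field):
    """Run one Extractor -> Tokenizer -> TokenMerger -> Normalizer chain."""
    return normalize(merge_tokens(tokenize(extract(record, field))))

terms = index_record({"title": "Data Grids and data mining"}, "title")
```

Each stage takes the previous stage's output, so a different Normalizer or Tokenizer can be swapped in without touching the rest of the chain.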
Slide 7: SRB Integration
[Diagram: RecordStore / DocumentStore writing record and document data to a storage back end]
Back ends:
- Filesystem
- Berkeley DB
- SQL RDBMS (PostgreSQL)
- SRB
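The point of this slide is that the store is independent of where the bytes end up. Below is a minimal sketch of that idea, assuming a two-method store interface. The class and method names are hypothetical and are not the Cheshire3 classes; an SRB-backed store would be one more subclass and is not shown because it needs an SRB client.

```python
import os
import tempfile
from abc import ABC, abstractmethod

class Store(ABC):
    """Hypothetical minimal store interface (not the Cheshire3 API)."""

    @abstractmethod
    def store(self, identifier, data):
        """Save bytes under an identifier."""

    @abstractmethod
    def fetch(self, identifier):
        """Return the bytes saved under an identifier."""

class MemoryStore(Store):
    """Keeps data in a dict; stands in for a database-backed store."""

    def __init__(self):
        self._items = {}

    def store(self, identifier, data):
        self._items[identifier] = data

    def fetch(self, identifier):
        return self._items[identifier]

class FilesystemStore(Store):
    """Keeps each item as one file in a directory."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, identifier):
        return os.path.join(self.directory, identifier)

    def store(self, identifier, data):
        with open(self._path(identifier), "wb") as handle:
            handle.write(data)

    def fetch(self, identifier):
        with open(self._path(identifier), "rb") as handle:
            return handle.read()

def round_trip(store):
    """Calling code is the same whichever back end is configured."""
    store.store("doc1", b"<record>hello</record>")
    return store.fetch("doc1")

memory_result = round_trip(MemoryStore())
with tempfile.TemporaryDirectory() as tmp:
    file_result = round_trip(FilesystemStore(tmp))
```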
Slide 8: SRB Integration
[Diagram: IndexStore with index dbs held in SRB]
- Index terms split across index dbs by range: a-b, c-d, e-f, g-h, ...
- Only the db with the query term is retrieved from SRB
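Splitting the index by term range means a query only needs the one db that covers its term. The sketch below shows one possible mapping from a term to a range name, assuming ranges of two letters keyed on the term's first character. The function names and the "other" bucket for non-alphabetic terms are assumptions for illustration, not Cheshire3 behaviour.

```python
import string

def build_shard_map(width=2):
    """Map each letter a-z to a range name such as 'a-b' or 'c-d'."""
    letters = string.ascii_lowercase
    shard_map = {}
    for start in range(0, len(letters), width):
        group = letters[start:start + width]
        name = f"{group[0]}-{group[-1]}"
        for letter in group:
            shard_map[letter] = name
    return shard_map

def shard_for(term, shard_map):
    """Return the range name of the index db that would hold this term.

    Terms that do not start with a-z (digits, empty strings) fall into a
    hypothetical 'other' db.
    """
    return shard_map.get(term[:1].lower(), "other")

shard_map = build_shard_map()
grid_shard = shard_for("grid", shard_map)
cheshire_shard = shard_for("Cheshire3", shard_map)
digit_shard = shard_for("2008", shard_map)
```

With this mapping a search for "grid" would fetch only the g-h db from remote storage and leave the others where they are.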
Slide 9: Grid Implementation
- Focus on ingest, not discovery (yet)
- Instantiate architecture on every node
- Assign one node as master, rest as slaves. Master then divides the processing as appropriate.
- Calls between slaves possible
- Calls as small and simple as possible: (objectIdentifier, functionName, *arguments)
- Typically: (workflow_id, 'process', document_id)
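Because every node instantiates the same architecture, a call only has to name an object, a function and its arguments. A minimal sketch of that message shape follows, run in a single process rather than over MPI. The Workflow and Node classes and the round-robin assignment are assumptions for illustration, not the Cheshire3 implementation.

```python
class Workflow:
    """Hypothetical workflow object, present on every node."""

    def __init__(self, identifier):
        self.identifier = identifier

    def process(self, document_id):
        return f"{self.identifier} processed {document_id}"

class Node:
    """A node holds its own copy of the configured objects, keyed by identifier."""

    def __init__(self, objects):
        self.objects = {obj.identifier: obj for obj in objects}

    def call(self, message):
        # Message shape from the slide: (objectIdentifier, functionName, *arguments)
        object_id, function_name, *arguments = message
        target = self.objects[object_id]
        return getattr(target, function_name)(*arguments)

def master_dispatch(slaves, workflow_id, document_ids):
    """Master: hand documents to slaves in turn (round robin is an assumption)."""
    results = []
    for position, document_id in enumerate(document_ids):
        slave = slaves[position % len(slaves)]
        results.append(slave.call((workflow_id, "process", document_id)))
    return results

slaves = [Node([Workflow("ingestWorkflow")]) for _ in range(2)]
results = master_dispatch(slaves, "ingestWorkflow", ["doc1", "doc2", "doc3"])
```

Only identifiers cross the wire in this scheme; the document itself is fetched by whichever node receives the call.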
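Slide 10: Grid Architecture
[Diagram: Master Task, Slave Task 1 ... Slave Task N, Data Grid, GPFS Temporary Storage]
- Master Task -> Slave Tasks: (workflow, process, document)
- Slave Task -> Data Grid: fetch document
- Data Grid -> Slave Task: document
- Slave Task -> GPFS Temporary Storage: extracted data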
Slide 11: Grid Architecture 2
[Diagram: Master Task, Slave Task 1 ... Slave Task N, Data Grid, GPFS Temporary Storage]
- Master Task -> Slave Tasks: (index, load)
- Slave Task -> GPFS Temporary Storage: fetch extracted data
- Slave Task -> Data Grid: store index
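The two Grid Architecture slides describe a two-phase ingest: slaves first write extracted data to temporary storage, then a load step builds the index and stores it in the data grid. The sketch below walks through both phases in a single process, with plain dicts standing in for the Data Grid and GPFS temporary storage. The function names, whitespace tokenising and sample documents are assumptions for illustration.

```python
from collections import defaultdict

# Plain dicts stand in for the two storage systems on the slides.
data_grid = {"doc1": "data grid storage", "doc2": "grid computing"}
temp_storage = {}  # stands in for GPFS temporary storage

def process(document_id):
    """Phase 1 (slave): fetch a document, write extracted data to temp storage."""
    text = data_grid[document_id]
    temp_storage[document_id] = text.lower().split()

def load(index_id, document_ids):
    """Phase 2 (slave): fetch extracted data, build the index, store it in the grid."""
    postings = defaultdict(list)
    for document_id in document_ids:
        for term in temp_storage[document_id]:
            if document_id not in postings[term]:
                postings[term].append(document_id)
    data_grid[index_id] = dict(postings)

# Master: issue the phase 1 calls for every document, then one phase 2 call.
documents = ["doc1", "doc2"]
for document_id in documents:
    process(document_id)
load("index1", documents)
index = data_grid["index1"]
```

Keeping the intermediate data in temporary storage means the data grid sees only the original documents and the finished index.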
Slide 12: Usage
- NARA ERA Demonstrator
  - 20 GB of web-crawled data in SRB, indexes stored in SRB
  - Interface generated by easily deployable Python layer
- Medline Dataset Experiments
  - 16.5 million abstracts plus associated metadata
  - Parsed data stored in SRB
  - Indexes in filesystem
- NSDL Grade Level Analysis
  - NSDL web crawl data (3 TB+)
  - Data already in SRB, analysis stored to SRB
Slide 13: iRODS Integration
Simple integration (à la SRB) possible:
- Store data in iRODS for Storage classes
- Requires Python interface to iRODS
- Doesn't really benefit from rule capabilities
Other (more interesting) options:
- Cheshire3 as External Microservice Platform
- Cheshire3 as Internal Microservice Platform
- Cheshire3 as Rules Platform(?)
Slide 14: External Microservice Platform
[Diagram: iRODS (Microservice, C3 Interface Microservice) sends data to Cheshire3 (C3 Microservice) and receives processed data]
Possible interfaces:
- MPI/PVM
- RPC
- SOAP
- XML over HTTP
- Arbitrary transport protocol
- etc.
Loose coupling via client interface
Slide 15: Internal Microservice Platform
[Diagram: iRODS containing a C3 Microservice that passes data to Cheshire3]
- Requires iRODS to have a Python interpreter as an alternative Microservice platform, rather than a Python client API.
- Much tighter integration: Cheshire3 would have access to iRODS internal information rather than just what was passed over the interface.
- Microservice definition problem becomes Cheshire3 Workflow definition (XML description)
- No bandwidth problems from transferring large amounts of data back and forth
Tight coupling via Python integration
Slide 16: Rules Platform?
[Diagram: iRODS, data, Cheshire3 Rules, C3 Microservice, Microservices]
- Requires a Python interpreter at the Rules execution level, instead of (or as well as) at the Microservice level.
- More flexible in terms of rule design
- Easier to write rules than in the current rule language
- Event system rather than rules execution?
- Integration of Computational Grid for rule/microservice execution?
Slide 17
Website:
Me:
Acknowledgements:
- SHAMAN: EU 7th Framework Programme
- Cheshire3: JISC, NSF
Questions?
Thank You!