REPLIX Max Planck Institute for Psycholinguistics, TLA.

1 REPLIX www.mpi.nl/replix Willem.Elbers(@mpi.nl) Max Planck Institute for Psycholinguistics, TLA

2 Agenda
- Goals
- Motivation
- Infrastructure
- Language Archive specific
- Results
- Discussion
REPLIX – Repository / Workspace Workshop – September 2010

3 REPLIX Goals
- Data replication / synchronization between repositories at a logical level.
- What is this logical level?
- More than just moving files.
- What about access rights?
- What about structure defined on top of the data?
- What about persistent identifiers (PIDs)?
- What about things we didn’t think about?
- Workflow based and easy to configure and adapt for different scenarios.
- Workflow is a chain of small tasks (a.k.a. blocks).
- Easy to develop and integrate new blocks.
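
The idea of a workflow as a chain of small blocks can be sketched as follows. This is only an illustration in Python, not the REPLIX implementation (the slides describe workflows built on iRODS rules); the block names and the shared context dictionary are assumptions made for the example.

```python
from typing import Callable, Dict, List

# A block takes the shared workflow context and returns the updated context.
Block = Callable[[Dict], Dict]


def run_workflow(blocks: List[Block], context: Dict) -> Dict:
    """Run each block in order, passing the context along the chain."""
    for block in blocks:
        context = block(context)
    return context


# Hypothetical blocks; real ones would call into iRODS or the repository.
def transfer_files(ctx: Dict) -> Dict:
    return {**ctx, "log": ctx.get("log", []) + ["files transferred"]}


def sync_permissions(ctx: Dict) -> Dict:
    return {**ctx, "log": ctx.get("log", []) + ["permissions synchronized"]}


result = run_workflow([transfer_files, sync_permissions], {"node": "root"})
print(result["log"])
```

Adding a new block then only means writing one more function with the same signature and placing it in the list.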

4 REPLIX Goals
- Independent of repository implementation.
- Repositories use different software solutions and we do not expect them to change.
- How do we synchronize between different software solutions?
- Inter-connection layer.
- Originating repository is and should remain owner of the data.
- Repository should never depend on REPLIX for anything other than the synchronization.
- REPLIX has access to the file system but does not control it.
- Repository controls its data with policies.

5 REPLIX Goals
[Diagram: two repositories, each shown with Repository Software / Tools, a Repository File System, and REPLIX; the two sides are joined by the REPLIX inter-connect.]

6 Motivation – REPLIX TLA
- Open up the MPI LAT backup sites (B1, B2) as read-only archives.
- Improve the replication process in general.
- Speed.
- Validation.
- Parts of the archive.
- Update PID information.
- Keep in mind to try to generalize to provide an out-of-the-box solution for other repositories.
[Diagram: the LAT archive and its backup sites B1 and B2; locations shown are Nijmegen, Garching, and Göttingen.]

7 Motivation - Software
- Implementation of the REPLIX communication system.
- iRODS looks like a promising candidate (federated zones).
- Implementation of the interface to the repository file system.
- iRODS looks like a promising candidate.
- Implementation of the inter-connection layer.
- REPLIX side.
- iRODS looks like a promising candidate by using a custom module.
- Repository side.
- Will require custom programming.

8 Motivation
- Perform iRODS performance tests.
- See if iRODS lives up to our expectations.
- How does iRODS compare to the current rsync process?
- Develop a concrete test-case, based on iRODS, to test our ideas.
- Main archive located at the MPI in Nijmegen, the Netherlands.
- Backup archive located at the RZG in Garching, Germany.
- Approximately 25-30 TB.
- How much do we have to change in the existing software?

9 Infrastructure
- iRODS zones provide archive to archive connection.
- Single archive data exists inside iRODS zone.
- Use federated zones ensuring each archive remains autonomous.
- Loose connection to the file system.
- iRODS mounted collections.
- iRODS regular collections are too strict since iRODS controls all files and their metadata.
- XML-RPC interface to existing software (inter-connection layer).
- Develop an iRODS micro-service to facilitate XML-RPC communication.
- Use some reserved disk space for caching purposes.
- How to handle different method signatures?
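
One possible way to approach the question of differing method signatures is a per-repository lookup table that maps a generic action onto that repository's XML-RPC method name and argument order. The sketch below uses Python's standard `xmlrpc.client` module only to marshal and unmarshal a request, without any network call. The method names (`ams.exportPermissions`, `ams.importPermissions`), the argument names, the PID, and the cache path are all made up for the example and are not taken from the slides.

```python
import xmlrpc.client

# Hypothetical mapping from a generic REPLIX action to the method name and
# argument order that one particular repository exposes over XML-RPC.
SIGNATURES = {
    "export_permissions": ("ams.exportPermissions", ["node_pid", "target_file"]),
    "import_permissions": ("ams.importPermissions", ["source_file"]),
}


def build_request(action: str, **kwargs) -> str:
    """Marshal a generic action into a repository-specific XML-RPC request."""
    method, arg_names = SIGNATURES[action]
    params = tuple(kwargs[name] for name in arg_names)
    return xmlrpc.client.dumps(params, methodname=method)


# Placeholder PID and path, for illustration only.
request = build_request(
    "export_permissions",
    node_pid="hdl:0000/EXAMPLE",
    target_file="/cache/perm.xml",
)

# Decode the request again to show what the repository side would receive.
params, method = xmlrpc.client.loads(request)
print(method, params)
```

With this layout, supporting another repository would mean supplying a different table rather than changing the workflow itself.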

10 REPLIX Infrastructure
[Diagram labels: iRODS (icommands, jargon, core + rule engine, rule-base, virtual file system, scripts, micro services, mounted collection(s)); REPLIX parts (msiXmlRPC, replix scripts, replix rule-base); WM; Existing software stack; Repository (local) file system.]

11 Infrastructure
- Two ways of interacting with the REPLIX system:
- Use the workflow manager (WM).
- Invoke the workflow rules directly through the icommands.
- Workflow manager is preferred.
- Exposed through a REST-service interface.

12 Language Archive Specific
- How does this fit into the existing LAT infrastructure?
[Diagram: SOURCE archive (LAT 1) and DESTINATION archive (LAT 2), each with AMS, IMDI Browser, LAMUS, ams, corpus-structure, and crawler; the source side also shows PID / pid; REPLIX sits between the two.]

13 Language Archive Specific
- LAT synchronization workflow:
- (1) File synchronization.
- iRODS sync.
- (2) Start crawler (index all files).
- msiExecCmd.
- (3) Permission synchronization.
- msiXmlRpc.
- (4) Update PID information.
- msiXmlRpc.
- Each step implemented as an iRODS action.

14 Language Archive Specific
- (1) Synchronize based on nodes in the archive tree.
- If the node is the root node, synchronize all files.
- If the node is not the root node, create a list of files that need to be synchronized and synchronize them.
- File list export functionality needs to be available.
- Do not touch file content.
- (2) Start the crawler.
- Use the iRODS “msiExecCmd” micro-service to start the crawler at the destination archive through a script.
- The time this could take might be a problem.
- PIDs should remain untouched and can be used as a reference to the parent archive.
- Do not touch file content.
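
The root / non-root distinction in step (1) can be sketched as follows. The tree layout, node names, and file names are invented for the example; in the real setup the file list would come from the archive's own export functionality.

```python
from typing import Dict, List, Optional

# Hypothetical archive tree: each node id maps to its child nodes and own files.
TREE: Dict[str, Dict[str, List[str]]] = {
    "root": {"children": ["corpusA", "corpusB"], "files": ["root.imdi"]},
    "corpusA": {"children": ["session1"], "files": ["corpusA.imdi"]},
    "corpusB": {"children": [], "files": ["corpusB.imdi"]},
    "session1": {"children": [], "files": ["session1.imdi", "session1.wav"]},
}


def files_under(node: str) -> List[str]:
    """Collect the files of a node and all of its descendants, depth first."""
    collected = list(TREE[node]["files"])
    for child in TREE[node]["children"]:
        collected.extend(files_under(child))
    return collected


def plan_sync(node: str, root: str = "root") -> Optional[List[str]]:
    """Return None for a full sync of the root, else an explicit file list."""
    if node == root:
        return None  # root node: synchronize everything, no list needed
    return files_under(node)


print(plan_sync("root"))
print(plan_sync("corpusA"))
```

Only file names are listed here; file content is never opened, in line with the "do not touch file content" requirement.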

15 Language Archive Specific
- (3) Replicate Archive permissions.
- AMS is in charge of the permissions in the archive.
- (node id, user id, permission) triples.
- Create an export, based on the selected node, at the source archive.
- Transfer the export to the destination archive.
- Import the data into AMS at the destination archive.
- Export based on PIDs.
- Constant between source and destination archive.
- Translate between PID and node id, since AMS internally uses node IDs.
- How to synchronize users?
- Discard triples for non-existing users.
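
The export / import round trip for permission triples might look like the sketch below. It assumes simple in-memory lookup tables between node ids and PIDs on each side; the ids, PIDs, user names, and permission strings are invented, and the real AMS interfaces are not shown in the slides.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (node id or PID, user id, permission)


def export_triples(triples: List[Triple], node_to_pid: Dict[str, str]) -> List[Triple]:
    """Source side: replace internal node ids with PIDs before export."""
    return [(node_to_pid[node], user, perm) for node, user, perm in triples]


def import_triples(
    exported: List[Triple], pid_to_node: Dict[str, str], known_users: Set[str]
) -> List[Triple]:
    """Destination side: map PIDs back to local node ids, drop unknown users."""
    return [
        (pid_to_node[pid], user, perm)
        for pid, user, perm in exported
        if user in known_users
    ]


# Hypothetical data: node ids differ per archive, PIDs are the same on both.
source_triples = [("N1", "alice", "read"), ("N2", "bob", "read")]
exported = export_triples(source_triples, {"N1": "pid-1", "N2": "pid-2"})
imported = import_triples(exported, {"pid-1": "D7", "pid-2": "D9"}, {"alice"})
print(imported)
```

Because "bob" does not exist at the destination in this example, the triple for him is discarded on import.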

16 Language Archive Specific
- (4) Update PID information.
- After replicating a file from the source archive to another archive, the file's PID record has to be updated.
- Create an export at the destination archive.
- (pid, url) pairs.
- Transfer to the parent archive.
- Import into PID system.
- How to administer these changes to the PID record?
- New domains are always allowed to be added.
- Only allowed to update ‘own’ url → assume domain is constant.
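
One reading of the update rule above is sketched below: a (pid, url) pair either replaces the existing URL that has the same domain (the archive's "own" URL) or, if that domain is not yet in the record, is added as a new location. This interpretation, the list-of-URLs record layout, and the example domains are assumptions; the slides do not describe the PID system's actual interface.

```python
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlparse


def apply_update(record: List[str], new_url: str) -> List[str]:
    """Return the PID record's URL list after applying one update."""
    domain = urlparse(new_url).netloc
    # Replace only the URL whose domain matches; other archives' URLs stay.
    updated = [new_url if urlparse(u).netloc == domain else u for u in record]
    if new_url not in updated:
        updated.append(new_url)  # domain not seen before: add a new location
    return updated


def import_pairs(
    pid_system: Dict[str, List[str]], pairs: Iterable[Tuple[str, str]]
) -> None:
    """Import (pid, url) pairs exported by a destination archive."""
    for pid, url in pairs:
        pid_system[pid] = apply_update(pid_system.get(pid, []), url)


# Hypothetical PID system content and export from a backup archive.
pid_system = {"pid-1": ["https://source.example/a.wav"]}
import_pairs(pid_system, [("pid-1", "https://backup.example/a.wav")])
import_pairs(pid_system, [("pid-1", "https://backup.example/moved/a.wav")])
print(pid_system["pid-1"])
```

The source archive's URL is left alone in both imports, since neither update comes from its domain.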

17 Results
- Performance test executed.
- Transfer files from one zone (MPI) to another federated zone (RZG).
- Gigabit connection.
- Two sets of tests:
- Increasing number of small files (100 KB).
- Decreasing number of files of increasing size (1 MB → 1 GB).

18 Results
- (local) Pilot to test initial workflow.
- Transfer files.
- Trigger crawler.
- Invoke script at destination.
- Export permissions.
- Invoke xmlRPC at source to create export.
- Transfer export file.
- Invoke xmlRPC at destination to import.
- Initial results look promising.

19 Results
- What to do:
- Implement local pilot project in Nijmegen-Garching environment.
- Support sub-tree synchronization.
- Support updating of handle records.
- The interconnection layer requires changes in existing software.
- The repository is required to provide the interconnection functionality for the used synchronization workflow actions.

20 Questions / Discussion
Any questions?

