REPLIX Max Planck Institute for Psycholinguistics, TLA.


Agenda
- Goals
- Motivation
- Infrastructure
- Language Archive specific
- Results
- Discussion

REPLIX – Repository / Workspace Workshop – September 2010

REPLIX Goals
- Data replication / synchronization between repositories at a logical level.
  - What is this logical level?
  - More than just moving files.
  - What about access rights?
  - What about structure defined on top of the data?
  - What about persistent identifiers (PIDs)?
  - What about things we didn’t think about?
- Workflow based and easy to configure and adapt for different scenarios.
  - A workflow is a chain of small tasks (a.k.a. blocks).
  - Easy to develop and integrate new blocks.

REPLIX Goals
- Independent of repository implementation.
  - Repositories use different software solutions and we do not expect them to change.
  - How do we synchronize between different software solutions?
  - Inter-connection layer.
- The originating repository is and should remain owner of the data.
  - The repository should never depend on REPLIX for anything other than the synchronization.
  - REPLIX has access to the file system but does not control it.
  - The repository controls its data with policies.

REPLIX Goals
[Diagram: two repositories, each shown as Repository Software / Tools, Repository File System and REPLIX; label: REPLIX inter-connect]

Motivation – REPLIX TLA
- Open up the MPI LAT backup sites (B1, B2) as read-only archives.
- Improve the replication process in general.
  - Speed.
  - Validation.
  - Parts of the archive.
  - Update PID information.
- Keep in mind to generalize, so as to provide an out-of-the-box solution for other repositories.
[Diagram: LAT with backup sites B1 and B2; locations Nijmegen, Garching, Göttingen]

Motivation - Software
- Implementation of the REPLIX communication system.
  - iRODS looks like a promising candidate (federated zones).
- Implementation of the interface to the repository file system.
  - iRODS looks like a promising candidate.
- Implementation of the inter-connection layer.
  - REPLIX side.
    - iRODS looks like a promising candidate by using a custom module.
  - Repository side.
    - Will require custom programming.

Motivation
- Perform iRODS performance tests.
  - See if iRODS lives up to our expectations.
  - How does iRODS compare to the current rsync process?
- Develop a concrete test case, based on iRODS, to test our ideas.
  - Main archive located at the MPI in Nijmegen, the Netherlands.
  - Backup archive located at the RZG in Garching, Germany.
  - Approximately TB.
  - How much do we have to change in the existing software?

Infrastructure
- iRODS zones provide the archive-to-archive connection.
  - The data of a single archive exists inside one iRODS zone.
  - Use federated zones, ensuring each archive remains autonomous.
- Loose connection to the file system.
  - iRODS mounted collections.
  - iRODS regular collections are too strict, since iRODS controls all files and their metadata.
- XML-RPC interface to existing software (inter-connection layer).
  - Develop an iRODS micro-service to facilitate XML-RPC communication.
  - Use some reserved disk space for caching purposes.
  - How to handle different method signatures?
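As an illustration of the XML-RPC inter-connection layer, the following is a minimal sketch using Python's standard `xmlrpc.client` module to encode a call, dispatch it by method name, and encode the reply. It is not the iRODS micro-service itself: the method name `ams.exportPermissions`, the handler table and the returned data are hypothetical, and no network transport is involved. Dispatching through a table of handlers is one possible way to cope with repositories that expose different method signatures.

```python
import xmlrpc.client


def export_permissions(pid):
    # Hypothetical repository-side handler; returns made-up example data.
    return [[pid, "alice", "read"]]


# Hypothetical dispatch table: XML-RPC method name -> local handler.
HANDLERS = {"ams.exportPermissions": export_permissions}


def handle_request(xml_request, handlers):
    """Decode an XML-RPC request, call the matching handler, encode the reply."""
    params, method = xmlrpc.client.loads(xml_request)
    result = handlers[method](*params)
    return xmlrpc.client.dumps((result,), methodresponse=True)


# Caller side: encode the call as an XML-RPC request document.
request = xmlrpc.client.dumps(("pid:A",), methodname="ams.exportPermissions")

# Repository side: handle it and produce an XML-RPC response document.
response = handle_request(request, HANDLERS)

# Caller side again: decode the response.
(result,), _ = xmlrpc.client.loads(response)
print(result)
```

In a real deployment the request and response documents would travel over HTTP between the micro-service and the repository software.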

REPLIX Infrastructure
[Diagram labels: iRODS; icommands; rule-base; virtual file system; scripts; micro services; mounted collection(s); msiXmlRPC; replix scripts; replix rule-base; jargon; core + rule engine; WM; existing software stack; repository (local) file system]

Infrastructure
- Two ways of interacting with the REPLIX system:
  - Use the workflow manager (WM).
  - Invoke the workflow rules directly through the icommands.
- The workflow manager is preferred.
  - Exposed through a REST-service interface.

Language Archive Specific
- How does this fit into the existing LAT infrastructure?
[Diagram: SOURCE archive LAT 1 (labels: AMS, IMDI Browser, LAMUS, PID, pid, ams, corpus-structure, crawler) and DESTINATION archive LAT 2 (labels: AMS, IMDI Browser, LAMUS, ams, corpus-structure, crawler), with REPLIX between them]

Language Archive Specific
- LAT synchronization workflow:
  - (1) File synchronization.
    - iRODS sync.
  - (2) Start crawler (index all files).
    - msiExecCmd.
  - (3) Permission synchronization.
    - msiXmlRpc.
  - (4) Update PID information.
    - msiXmlRpc.
- Each step is implemented as an iRODS action.
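The four-step workflow can be pictured as a chain of small blocks that run in order. The following is a minimal sketch in plain Python, not in the iRODS rule language. The `Context` class and the four step functions are hypothetical stand-ins for the iRODS sync, msiExecCmd and msiXmlRpc actions named on the slide, and they only record that they ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    """Hypothetical shared state passed from block to block."""
    node: str
    log: List[str] = field(default_factory=list)


Block = Callable[[Context], Context]


def sync_files(ctx: Context) -> Context:
    ctx.log.append("sync_files")        # stand-in for step (1), iRODS sync
    return ctx


def start_crawler(ctx: Context) -> Context:
    ctx.log.append("start_crawler")     # stand-in for step (2), msiExecCmd
    return ctx


def sync_permissions(ctx: Context) -> Context:
    ctx.log.append("sync_permissions")  # stand-in for step (3), msiXmlRpc
    return ctx


def update_pids(ctx: Context) -> Context:
    ctx.log.append("update_pids")       # stand-in for step (4), msiXmlRpc
    return ctx


def run_workflow(blocks: List[Block], ctx: Context) -> Context:
    """Run each block in order, handing the context along the chain."""
    for block in blocks:
        ctx = block(ctx)
    return ctx


LAT_WORKFLOW: List[Block] = [sync_files, start_crawler, sync_permissions, update_pids]

result = run_workflow(LAT_WORKFLOW, Context(node="root"))
print(result.log)
```

Adapting the workflow to another repository would then amount to changing the list of blocks, which matches the goal of making new blocks easy to integrate.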

Language Archive Specific
- (1) Synchronize based on nodes in the archive tree.
  - If the node is the root node, synchronize all files.
  - If the node is not the root node, create a list of files that need to be synchronized and synchronize them.
  - File list export functionality needs to be available.
  - Do not touch file content.
- (2) Start the crawler.
  - Use the iRODS “msiExecCmd” micro-service to start the crawler at the destination archive through a script.
  - The time this could take might be a problem.
  - PIDs should remain untouched and can be used as a reference to the parent archive.
  - Do not touch file content.
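The node-based selection in step (1) can be sketched as follows. This is an illustration under assumptions: the archive tree, the file names and the convention of returning `None` for "synchronize everything" are all hypothetical, and the real file list would come from the repository's file list export.

```python
from typing import Dict, List, Optional

# Hypothetical archive tree: node id -> child node ids.
CHILDREN: Dict[str, List[str]] = {"root": ["a", "b"], "a": ["a1"], "a1": [], "b": []}

# Hypothetical files attached to each node.
FILES: Dict[str, List[str]] = {
    "root": ["index.imdi"],
    "a": ["a.imdi"],
    "a1": ["a1.wav"],
    "b": ["b.imdi"],
}


def collect_files(node: str) -> List[str]:
    """Depth-first list of the files at and below the given node."""
    found = list(FILES.get(node, []))
    for child in CHILDREN.get(node, []):
        found.extend(collect_files(child))
    return found


def plan_sync(node: str, root: str = "root") -> Optional[List[str]]:
    """Return None for 'synchronize all files', else an explicit file list."""
    if node == root:
        return None
    return collect_files(node)


print(plan_sync("root"))
print(plan_sync("a"))
```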

Language Archive Specific
- (3) Replicate archive permissions.
  - AMS is in charge of the permissions in the archive.
    - (node id, user id, permission) triples.
  - Create an export, based on the selected node, at the source archive.
  - Transfer the export to the destination archive.
  - Import the data into AMS at the destination archive.
  - Export based on PIDs.
    - Constant between source and destination archive.
    - Translate between PID and node id, since AMS internally uses node ids.
  - How to synchronize users?
    - Discard triples for non-existing users.
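The export, translation and import of permission triples might look roughly like this. It is a sketch only: the function names, node ids, PIDs and users are hypothetical, and it assumes every exported PID is already known at the destination because the files were replicated and crawled first.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (node id or PID, user id, permission)


def export_permissions(triples: List[Triple], node_to_pid: Dict[str, str]) -> List[Triple]:
    """Source side: replace local node ids with PIDs, which both archives share."""
    return [(node_to_pid[node], user, perm) for node, user, perm in triples]


def import_permissions(
    pid_triples: List[Triple], pid_to_node: Dict[str, str], known_users: Set[str]
) -> List[Triple]:
    """Destination side: map PIDs to local node ids and discard unknown users."""
    return [
        (pid_to_node[pid], user, perm)
        for pid, user, perm in pid_triples
        if user in known_users
    ]


# Hypothetical data: node ids differ per archive, PIDs do not.
source_triples = [("N1", "alice", "read"), ("N2", "bob", "write")]
source_node_to_pid = {"N1": "pid:A", "N2": "pid:B"}
dest_pid_to_node = {"pid:A": "M7", "pid:B": "M9"}
dest_users = {"alice"}  # bob does not exist at the destination

exported = export_permissions(source_triples, source_node_to_pid)
imported = import_permissions(exported, dest_pid_to_node, dest_users)
print(imported)
```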

Language Archive Specific
- (4) Update PID information.
  - After replicating a file from the source archive to another archive, the file’s PID record has to be updated.
  - Create an export at the destination archive.
    - (pid, url) pairs.
  - Transfer to the parent archive.
  - Import into the PID system.
  - How to administer these changes to the PID record?
    - New domains are always allowed to be added.
    - An archive is only allowed to update its ‘own’ URL (assume the domain is constant).
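One possible reading of the PID update rules is sketched below. The record layout (a PID mapped to a list of URLs), the `sender_domain` check and all names and URLs are assumptions made for illustration, not the actual PID system interface. A pair is accepted only if its URL lies in the sender's own domain and the PID already exists; a URL from a new domain is added, and a URL from a domain already present replaces the old one.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse

Pair = Tuple[str, str]  # (pid, url)


def apply_pid_updates(
    records: Dict[str, List[str]], pairs: List[Pair], sender_domain: str
) -> List[Pair]:
    """Apply (pid, url) pairs in place and return the pairs that were rejected."""
    rejected: List[Pair] = []
    for pid, url in pairs:
        domain = urlparse(url).netloc
        # An archive may only touch URLs in its own domain, for known PIDs.
        if domain != sender_domain or pid not in records:
            rejected.append((pid, url))
            continue
        urls = records[pid]
        for i, existing in enumerate(urls):
            if urlparse(existing).netloc == domain:
                urls[i] = url  # same domain: update the archive's own URL
                break
        else:
            urls.append(url)   # new domain: always allowed to be added
    return rejected


# Hypothetical PID record at the parent archive.
records = {"pid:A": ["https://source.example/a.wav"]}
pairs = [
    ("pid:A", "https://backup.example/a.wav"),       # new domain, added
    ("pid:A", "https://source.example/hijack.wav"),  # not the sender's domain
    ("pid:Z", "https://backup.example/z.wav"),       # unknown PID
]
rejected = apply_pid_updates(records, pairs, sender_domain="backup.example")
print(records, rejected)
```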

Results
- Performance test executed.
  - Transfer files from one zone (MPI) to another federated zone (RZG).
  - Gigabit connection.
  - Two sets of tests:
    - An increasing number of small files (100 KB).
    - A decreasing number of files of increasing size (1 MB to 1 GB).

Results
- (Local) pilot to test the initial workflow.
  - Transfer files.
  - Trigger crawler.
    - Invoke script at destination.
  - Export permissions.
    - Invoke XML-RPC at source to create export.
    - Transfer export file.
    - Invoke XML-RPC at destination to import.
- Initial results look promising.

Results
- What to do:
  - Implement the local pilot project in the Nijmegen-Garching environment.
  - Support sub-tree synchronization.
  - Support updating of handle records.
- The interconnection layer requires changes in existing software.
  - The repository is required to provide the interconnection functionality for the synchronization workflow actions it uses.

Questions / Discussion
- Any questions?