integrated Rule Oriented Data System Tutorial: iRODS Capabilities

Presentation transcript:

1 integrated Rule Oriented Data System Tutorial: iRODS Capabilities

2 Outline
- Introduction to iRODS capabilities
- Data-driven science and full Data Life Cycle
- Policy-based Management of Distributed Data
- Scaling: petabytes, 100s of millions of files
- Enabling unified sharable "virtual" collections
- Enabling data grids (sharing), digital libraries (publishing), persistent archives (preservation)
- Unified Data Space: Interoperate via Federation

3 Introduction to iRODS Capabilities

4 Data Driven Science
- Enable new science through collaborative research on shared data collections
- Management of entire scientific data life cycle, from data analysis pipelines to long-term sustainability of reference collections
- Implement national scale data cyber-infrastructure
- Federation of exemplar data management technologies in exemplar research initiatives
- Creation of production data management systems
- Proven technology implemented in extant data grids
- Integrate “live” research data collections into education initiatives
- Policy-based data management across distributed data
Diagram labels (Data Life Cycle): Project; Shared Collection; Processing Pipeline; Digital Library; Reference Collection; Federation

5 Data are Inherently Distributed
- Distributed sources: projects span multiple institutions, nations
- Distributed analysis platforms: grid computing
- Distributed data storage: minimize risk of data loss, optimize access
- Distributed users: caching of data near user
- Multiple stages of data life cycle: data repurposing for use in broader context

Diagram labels: Cloud Storage; Institutional Repositories; Federal Repositories; Carolina Digital Repository; Texas Digital Library; National Climatic Data Center; National Optical Astronomy Observatory

Diagram labels: Data Processing Pipelines; Preservation Environment; Ocean Observatories Initiative; NARA Transcontinental Persistent Archive Prototype; Carolina Digital Repository; Large Synoptic Survey Telescope; Digital Library; Texas Digital Library; French National Library; Data Grid; Teragrid; Temporal Dynamics of Learning Center; Australian Research Collaboration Service; Taiwan National Archive

8 Data Life Cycle
- Project Collection: Private, Local Policy
- Data Grid: Shared, Distribution Policy
- Digital Library: Published, Description Policy
- Data Processing Pipeline: Analyzed, Service Policy
- Reference Collection: Preserved, Representation Policy
- Federation: Sustained, Re-purposing Policy
Each stage adds new policies for a broader community
Virtualize the stages of data life cycle through evolution of policies
Interoperability across data life cycle representations
Each stage of the data life cycle re-purposes the original collection
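The stage, state, and policy triples on this slide can be written down as a small data structure. The sketch below is only an illustration: the class and function names are hypothetical, and the idea that policies accumulate from one stage to the next is an assumption read from "each stage adds new policies", not something the slides spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str    # life cycle stage, as named on the slide
    state: str   # state of the collection at that stage
    policy: str  # policy type introduced at that stage


# Order and wording follow the slide.
LIFE_CYCLE: List[Stage] = [
    Stage("Project Collection", "Private", "Local Policy"),
    Stage("Data Grid", "Shared", "Distribution Policy"),
    Stage("Digital Library", "Published", "Description Policy"),
    Stage("Data Processing Pipeline", "Analyzed", "Service Policy"),
    Stage("Reference Collection", "Preserved", "Representation Policy"),
    Stage("Federation", "Sustained", "Re-purposing Policy"),
]


def policies_in_force(stage_name: str) -> List[str]:
    """Policies accumulated up to and including the named stage (assumed cumulative)."""
    names = [s.name for s in LIFE_CYCLE]
    idx = names.index(stage_name)  # raises ValueError for an unknown stage
    return [s.policy for s in LIFE_CYCLE[: idx + 1]]


if __name__ == "__main__":
    print(policies_in_force("Digital Library"))
```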

9 Tracing the Data Life Cycle
- Collection Creation using a Data Grid: Data manipulation / Data ingestion
- Processing Pipelines: Pipeline processing / Environment administration
- Data Grid: Policy display / Micro-service display / State information display / Replication
- Digital Library: Access / Containers / Metadata browsing / Visualization
- Preservation Environment: Validation / Audit / Federation / Deep Archive / SHAMAN

10 Goal - Generic Infrastructure
- Manage all stages of the data life cycle
  - Data organization
  - Data processing pipelines
  - Collection creation
  - Data sharing
  - Data publication
  - Data preservation
- Create reference collection against which future information and knowledge is compared
- Each stage uses similar storage, arrangement, description, and access mechanisms

11 Concept Roadmap
- Purpose: reason a collection is assembled
- Properties: attributes needed to ensure the purpose
- Policies: enforce and maintain required properties
- Procedures: computer functions to implement Policies
- State information: results of applying procedures (iCAT)
- Assessment criteria: validate that state information conforms to desired purpose
- Federation: interoperate with shared logical name spaces
These are the required elements for data life cycle virtualization

12 Policy-based Management
Each data life cycle stage is driven by extensions of management policies to address broader user communities
- Data arrangement: Project policies
- Data analysis: Processing pipeline standards
- Data sharing: Research collaborations
- Data publication: Discipline standards
- Data preservation: Reference collection
Reference collections need to be preserved and interpretable by future generations, most stringent standard
Data grids - integrated Rule Oriented Data System

13 iRODS - Policy-based Management
- Turn Policies into computer-actionable Rules
- Compose Rules by chaining Micro-services
- Manage state information (in iCAT metadata catalog) as attributes on namespaces: Files / collections / users / resources / rules
- Validate assessment criteria: queries on state information, parsing audit trails
- Automate administrative functions
- Enable scaling to today's massive collections
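The chain from policy to rule to micro-service to state information can be sketched in a few lines. This is a toy model, not the iRODS rule engine or rule language: every function name, the `on_put` rule name, and the in-memory `catalog` dictionary standing in for iCAT are assumptions made for illustration. Only the resource name `demoResc` is borrowed from the trace shown in this tutorial.

```python
import hashlib
from typing import Callable, Dict, List

# A micro-service here is just a function that reads and updates a shared context.
MicroService = Callable[[dict], None]


def set_default_resource(ctx: dict) -> None:
    ctx["resource"] = "demoResc"


def compute_checksum(ctx: dict) -> None:
    ctx["checksum"] = hashlib.sha256(ctx["data"]).hexdigest()


def record_state(ctx: dict) -> None:
    # Persist state information as attributes on the file's logical name.
    ctx["catalog"][ctx["path"]] = {
        "resource": ctx["resource"],
        "checksum": ctx["checksum"],
    }


# A rule is a named chain of micro-services (hypothetical rule name).
RULES: Dict[str, List[MicroService]] = {
    "on_put": [set_default_resource, compute_checksum, record_state],
}


def apply_rule(name: str, ctx: dict) -> dict:
    for micro_service in RULES[name]:
        micro_service(ctx)
    return ctx


def validate(catalog: dict, path: str, data: bytes) -> bool:
    """Assessment criterion: stored checksum still matches the data."""
    return catalog[path]["checksum"] == hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    ctx = apply_rule("on_put", {"path": "/zone/home/a.txt", "data": b"abc", "catalog": {}})
    print(ctx["catalog"])
    print(validate(ctx["catalog"], "/zone/home/a.txt", b"abc"))
```

The point of the sketch is the separation of concerns: the rule names the policy, the micro-services do the work, and the assessment step only ever looks at recorded state.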

14 Overview of iRODS Architecture
- User w/ Client: can Search, Access, Add and Manage Data & Metadata
- Access distributed data with Web-based Browser or iRODS GUI or Command Line clients.
- iRODS Data System
  - iRODS Data Server: Disk, Tape, etc.
  - iRODS Metadata Catalog: Track information
  - iRODS Rule Engine: Tracks Policies

iput ../src/irm.c - Checks 10 Policy hooks when file put into iRODS
brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
brick14:10900:GotRule#119:: acSetPublicUserPolicy
brick14:10900:ApplyRule#120:: acAclPolicy
brick14:10900:GotRule#121:: acAclPolicy
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#125:: acRescQuotaPolicy
brick14:10900:GotRule#126:: acRescQuotaPolicy
brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
brick14:10900:GotRule#129:: acSetVaultPathPolicy
brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
brick14:10900:ApplyRule#135:: acPostProcForCreate
brick14:10900:GotRule#136:: acPostProcForCreate
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
brick14:10900:GotRule#139:: acPostProcForPut
brick14:10900:GotRule#140:: acPostProcForPut
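The trace above can be checked mechanically. The sketch below parses it with a regular expression and lists the distinct policy hooks and the micro-services that ran. The field names (`host`, `pid`, `event`, `seq`, `target`) are assumptions about what the colon-separated parts mean; the slide does not label them.

```python
import re
from typing import Dict, List

TRACE = """\
brick14:10900:ApplyRule#116:: acChkHostAccessControl
brick14:10900:GotRule#117:: acChkHostAccessControl
brick14:10900:ApplyRule#118:: acSetPublicUserPolicy
brick14:10900:GotRule#119:: acSetPublicUserPolicy
brick14:10900:ApplyRule#120:: acAclPolicy
brick14:10900:GotRule#121:: acAclPolicy
brick14:10900:ApplyRule#122:: acSetRescSchemeForCreate
brick14:10900:GotRule#123:: acSetRescSchemeForCreate
brick14:10900:execMicroSrvc#124:: msiSetDefaultResc(demoResc,null)
brick14:10900:ApplyRule#125:: acRescQuotaPolicy
brick14:10900:GotRule#126:: acRescQuotaPolicy
brick14:10900:execMicroSrvc#127:: msiSetRescQuotaPolicy(off)
brick14:10900:ApplyRule#128:: acSetVaultPathPolicy
brick14:10900:GotRule#129:: acSetVaultPathPolicy
brick14:10900:execMicroSrvc#130:: msiSetGraftPathScheme(no,1)
brick14:10900:ApplyRule#131:: acPreProcForModifyDataObjMeta
brick14:10900:GotRule#132:: acPreProcForModifyDataObjMeta
brick14:10900:ApplyRule#133:: acPostProcForModifyDataObjMeta
brick14:10900:GotRule#134:: acPostProcForModifyDataObjMeta
brick14:10900:ApplyRule#135:: acPostProcForCreate
brick14:10900:GotRule#136:: acPostProcForCreate
brick14:10900:ApplyRule#137:: acPostProcForPut
brick14:10900:GotRule#138:: acPostProcForPut
brick14:10900:GotRule#139:: acPostProcForPut
brick14:10900:GotRule#140:: acPostProcForPut
"""

# Assumed layout: host:pid:event#sequence:: target
LINE_RE = re.compile(
    r"^(?P<host>[^:]+):(?P<pid>\d+):(?P<event>\w+)#(?P<seq>\d+)::\s*(?P<target>.+)$"
)


def parse_trace(text: str) -> List[Dict[str, str]]:
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def policy_hooks(entries: List[Dict[str, str]]) -> List[str]:
    """Distinct rule names from ApplyRule events, in first-seen order."""
    seen: List[str] = []
    for entry in entries:
        if entry["event"] == "ApplyRule" and entry["target"] not in seen:
            seen.append(entry["target"])
    return seen


def micro_services(entries: List[Dict[str, str]]) -> List[str]:
    return [e["target"] for e in entries if e["event"] == "execMicroSrvc"]


if __name__ == "__main__":
    parsed = parse_trace(TRACE)
    print(len(policy_hooks(parsed)), "policy hooks")
    print(micro_services(parsed))
```

Counting the `ApplyRule` events gives the 10 policy hooks named in the slide title, three of which go on to execute a micro-service.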

16 Scale of iRODS Data Grid
- Number of files: Desktop to 10s to 100s of millions of files
- Size of data: Desktop to 100s of terabytes to petabytes of data
- Number of policy enforcement points: 64 actions define when policies are checked
- System state information: 112 metadata attributes of system information per file
- Number of functions: 185 composable iRODS Micro-services
- Number of storage systems that are linked: Desktop to 10s to 100 storage resources
- Number of data grids that can interoperate: Federation of 10s of data grids

17 iRODS Shows Unified “Virtual Collection”
User With Client Views & Manages Data; User Sees Single “Virtual Collection”
The iRODS Data System can install in a “layer” over existing or new data, letting you view, manage, and share part or all of diverse data in a unified Collection.
- My Data: Disk, Tape, Database, Filesystem, etc.
- Project Data: Disk, Tape, Database, Filesystem, etc.
- Reference Data: Remote Disk, Tape, Filesystem, etc.
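One way to picture the "layer" is as a catalog that maps logical names to wherever the bytes physically live. The following is a minimal sketch under that assumption. The class names, paths, and resource names are all hypothetical, and a real data grid keeps this mapping in a metadata catalog rather than in memory.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Replica:
    resource: str       # hypothetical storage resource name
    physical_path: str  # where the bytes actually live


@dataclass
class VirtualCollection:
    # logical path -> replicas on one or more storage systems
    entries: Dict[str, List[Replica]] = field(default_factory=dict)

    def register(self, logical_path: str, resource: str, physical_path: str) -> None:
        self.entries.setdefault(logical_path, []).append(Replica(resource, physical_path))

    def listing(self, prefix: str = "/") -> List[str]:
        """What the user sees: one sorted namespace, regardless of storage."""
        return sorted(p for p in self.entries if p.startswith(prefix))

    def locate(self, logical_path: str) -> List[Replica]:
        return list(self.entries.get(logical_path, []))


vc = VirtualCollection()
vc.register("/grid/my/notes.txt", "local_disk", "/home/me/notes.txt")
vc.register("/grid/project/run1.dat", "project_tape", "/tape/vol7/run1.dat")
vc.register("/grid/reference/catalog.db", "remote_disk", "/srv/ref/catalog.db")
vc.register("/grid/project/run1.dat", "remote_disk", "/srv/replica/run1.dat")

if __name__ == "__main__":
    print(vc.listing())
    print(vc.locate("/grid/project/run1.dat"))
```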

18 Organize Distributed Data into a Sharable "Virtual" Collection
- Project repository: MotifNet - manage collection of analysis products
- Institutional repository: Carolina Digital Repository for UNC collections
- Regional collaboration: RENCI Data Grid linking resources across North Carolina
- National collaboration: NSF Temporal Dynamics of Learning Center; Australian Research Collaboration Service
- National Library: French National Library
- National Archive: NARA Transcontinental Persistent Archive Prototype, Taiwan
- International collaboration: BaBar High Energy Physics (SLAC-IN2P3); National Optical Astronomy Observatory (Chile-US)

19 Infrastructure Independence
- Manage properties of the collection independently of the choice of technology
  - Access, authentication, authorization, description, location, distribution, replication, integrity, retention
- Enforce policies globally at all storage locations
  - Rule Engine resident at each storage site
- Apply procedures at each remote storage site
  - Chain encapsulated operations into workflows
- Infrastructure independence enables evolution to new technology without interruption
  - Integrate new access methods, new storage systems, new network protocols, new authentication systems

20 Data Virtualization
Diagram layers (Data Grid): Access Interface; Standard Micro-services; Standard Operations; Storage Protocol; Storage System
- Map from actions requested by access method to standard set of iRODS Micro-services.
- Map standard Micro-services to standard operations.
- Map the operations to protocol supported by operating system.
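The three mappings on this slide compose into a single lookup chain. The sketch below shows that composition with plain dictionaries. All names in the tables (the `ms_*` micro-services, the operation names, and the protocol strings) are made up for illustration and are not iRODS identifiers.

```python
from typing import Dict, List

# Layer 1: client action -> standard micro-services (hypothetical names)
ACTION_TO_MICROSERVICES: Dict[str, List[str]] = {
    "get": ["ms_open", "ms_read", "ms_close"],
    "put": ["ms_create", "ms_write", "ms_close"],
}

# Layer 2: micro-service -> standard operation
MICROSERVICE_TO_OPERATION: Dict[str, str] = {
    "ms_open": "open",
    "ms_create": "create",
    "ms_read": "read",
    "ms_write": "write",
    "ms_close": "close",
}

# Layer 3: standard operation -> protocol call, per storage type
OPERATION_TO_PROTOCOL: Dict[str, Dict[str, str]] = {
    "posix": {
        "open": "open()", "create": "creat()", "read": "read()",
        "write": "write()", "close": "close()",
    },
    "object_store": {
        "open": "HEAD object", "create": "initiate upload", "read": "GET object",
        "write": "PUT object", "close": "no-op",
    },
}


def resolve(action: str, storage: str) -> List[str]:
    """Walk action -> micro-services -> operations -> protocol calls."""
    calls = []
    for micro_service in ACTION_TO_MICROSERVICES[action]:
        operation = MICROSERVICE_TO_OPERATION[micro_service]
        calls.append(OPERATION_TO_PROTOCOL[storage][operation])
    return calls


if __name__ == "__main__":
    print(resolve("get", "posix"))
    print(resolve("put", "object_store"))
```

Adding a new storage system in this model means adding one entry to the third table. The first two tables, and therefore every client, stay unchanged.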

21 Data Grid Security
- Manage global name spaces for: {users, files, storage}
- Assign access controls as constraints imposed between two logical name spaces
  - Access controls remain invariant as files are moved within the data grid
  - Controls on: Files / Storage systems / Metadata
- Authenticate each user access
  - PKI, Kerberos, challenge-response, Shibboleth
  - Use internal or external identity management system
- Authorize all operations
  - ACLs (Access Control Lists) on users and groups
  - Separate condition for execution of each Rule
  - Internal approval flags (e.g. IRB) within a Rule
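Because access controls are expressed between logical names (user name and file name) rather than physical locations, moving a file does not touch them. The following sketch illustrates just that invariant. The class, method, and resource names are hypothetical, and authentication and group handling are left out entirely.

```python
from typing import Dict, Tuple


class ToyDataGrid:
    def __init__(self) -> None:
        # logical file name -> storage resource currently holding it
        self.location: Dict[str, str] = {}
        # (logical user name, logical file name) -> permission
        self.acl: Dict[Tuple[str, str], str] = {}

    def put(self, owner: str, path: str, resource: str) -> None:
        self.location[path] = resource
        self.acl[(owner, path)] = "own"

    def grant(self, user: str, path: str, permission: str) -> None:
        self.acl[(user, path)] = permission

    def move(self, path: str, new_resource: str) -> None:
        # Only the physical location changes; the ACL table is keyed by
        # logical names and is not touched.
        self.location[path] = new_resource

    def can_read(self, user: str, path: str) -> bool:
        return self.acl.get((user, path)) in {"read", "write", "own"}


grid = ToyDataGrid()
grid.put("alice", "/zone/home/alice/survey.csv", "disk_a")
grid.grant("bob", "/zone/home/alice/survey.csv", "read")
grid.move("/zone/home/alice/survey.csv", "tape_b")

if __name__ == "__main__":
    print(grid.location["/zone/home/alice/survey.csv"])
    print(grid.can_read("bob", "/zone/home/alice/survey.csv"))
```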

NOAO Zone Architecture
Diagram labels: Archive; Telescope

Ocean Observatories Initiative
Diagram labels: Sensors; Cloud Computing; External Repositories; Cloud Storage; Cache; Message Bus; Aggregate sensor data in cache; SuperComputer; Event Detection; Remote locations; Simulations; Digital Library; Archive; Clients; Remote Users; iRODS Data Grid; Multiple Protocols
Large-scale workflows from real-time data to steerable instruments, digital library.

Access: Data Grid Clients

25 iRODS Distributed Data Management

26 Towards a Unified Data Space
- Sharing data across Space
  - Organize data as a shared "virtual" Collection
  - Define unifying properties for the Collection
- Sharing data across Time
  - Preservation is communication with the future
  - Preservation validates communication from the past
- Managing full Data Life Cycle
  - Evolution of the Policies that govern a data Collection at each stage of the life cycle
  - From data creation, to collection, to publication, to reference collection, to analysis pipeline

27 Intellectual Property
Given generic infrastructure, intellectual property resides in the Policies and procedures that manage the Collection
- Consistency of the Policies
- Capabilities of the procedures
- Automation of internal Policy assessment
- Validation of desired Collection properties
- Automation of administrative tasks
Interacting with DataDirect Networks, HP, IBM, Microsoft on commercial application of open source technology.

28 Societal Impact
Many communities are assembling digital holdings that represent an emerging consensus:
- Common meaning associated with the data
- Common interpretation of the data
- Common data manipulation mechanisms
The development of a consensus is described as Socialization of Collections
An example is Trans-border Urban Planning

29 Federation of Collections
- Social consensus for sharing data, policies, methods, practice
- Each community controls their own collection Policies
  - Policies enforced at each storage location
  - Explicit computer-actionable rules control type of federation interactions, e.g. peer-to-peer, central archive, master-slave data distribution, chained data grids, deep archives
- Interoperability mechanisms support technology integration
  - Community specific clients
  - Bulk data export / import
  - Cross registration of data
  - Structured information resource drivers

30 Data Grid Federation
- Motivation: Improve performance, scalability, and independence
- To initiate Federation, each Data Grid administrator establishes trust and creates a remote user
    iadmin mkzone B remote Host:Port
    iadmin mkuser rods#B rodsuser
- Use cases
  - Chained data grids - National Optical Astronomy Observatory
  - Master-slave data grids - NIH BIRN
  - Central archive - UK e-Science
  - Deep archive - NARA TPAP
  - Replication - NSF Teragrid
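Since each administrator runs the same two commands with the other zone's details, the pair can be generated from a template. The sketch below only builds the command strings in the form shown on the slide and does not execute anything. The host names are placeholders, the zone names are borrowed from the federation example elsewhere in this tutorial, and port 1247 is assumed as the usual iRODS default. Exact `iadmin` syntax may differ between iRODS releases.

```python
from typing import List


def federation_commands(
    remote_zone: str,
    remote_host: str,
    remote_port: int,
    remote_user: str = "rods",
) -> List[str]:
    """Commands one zone's administrator would run to trust a remote zone.

    Mirrors the two lines on the slide:
        iadmin mkzone B remote Host:Port
        iadmin mkuser rods#B rodsuser
    """
    return [
        f"iadmin mkzone {remote_zone} remote {remote_host}:{remote_port}",
        f"iadmin mkuser {remote_user}#{remote_zone} rodsuser",
    ]


if __name__ == "__main__":
    # Hypothetical hosts; each side registers the other.
    print("# run in zone renci")
    for cmd in federation_commands("tacc", "icat.tacc.example.org", 1247):
        print(cmd)
    print("# run in zone tacc")
    for cmd in federation_commands("renci", "icat.renci.example.org", 1247):
        print(cmd)
```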

31 Accessing Data in Federated iRODS
Two federated iRODS data grids
- iRODS/ICAT system at University of North Carolina at Chapel Hill (renci zone)
- iRODS/ICAT system at University of Texas at Austin (tacc zone)
Federated irodsUser (use iRODS clients): Federated irodsUsers can upload, download, replicate, share, manage & track access to some or all data (depending on access permissions) in either zone.
Diagram callouts: “Gets data to user”; “With access permissions”; “Finds the data”

32 Development Team
DICE team
- Arcot Rajasekar - iRODS Development Lead
- Mike Wan - iRODS Chief Architect
- Wayne Schroeder - iRODS Product Mgr., Sr. Developer
- Bing Zhu - Fedora, Windows
- Mike Conway - Java (Jargon)
- Paul Tooby - Documentation, Foundation
- Sheau-Yen Chen - Data Grid Administration
- Reagan Moore - PI
Preservation
- Richard Marciano - Preservation Development Lead
- Chien-Yi Hou - Preservation Micro-services
- Antoine de Torcy - Preservation Micro-services

33 Foundation
Data Intensive Cyber Environments Foundation
- Nonprofit open source software development
- Promotes use of iRODS technology
- Supports standards efforts, intellectual prop.
- Coordinates international development efforts
  - IN2P3 - quota and monitoring system
  - King’s College London - Shibboleth
  - Australian Research Collaboration Services - WebDAV
  - Academia Sinica - SRM interface
More information:

34 iRODS Wiki
More information…
- Descriptions, tutorials, documentation
- Publications / presentations
- Download of iRODS open source s.w.
- Performance tests
- irods-chat page