1 San Diego Supercomputer Center, National Partnership for Advanced Computational Infrastructure
Data Grids, Digital Libraries, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
moore@sdsc.edu

2 Archive Definition
Computer science - archive is the hardware and software infrastructure used to manage data
Preservation community - archives is the material that is being preserved

3 Persistent Archive
Software system that manages evolution of the hardware and software infrastructure
  – A persistent archive preserves the authenticity and integrity of digital entities while the underlying technology evolves
Combination of the material that is being preserved and the infrastructure used to preserve the material

4 Data Grid
Grid community definition
  – The infrastructure used to manage distributed data as a collection
Digital library and preservation community definition
  – The distributed data that is being organized and managed as a collection
A data grid is a mechanism to support sharing of data and the collection that is being shared

5 Data Sharing
Management of access controls on local resources to share data
  – Put controls on resources
Creation of a collection that is being shared across distributed resources
  – Put controls on collection
The SRB data grid does both: it enforces controls on resources and on collections (data and metadata)

6 Topics
Data Grids - managing distributed data
  – Distributed data management for a project
Digital Libraries - publication of data
  – Management of collection hierarchies
Persistent Archives - preservation of data
  – Management of technology evolution
Storage Resource Broker example
  – Currently supporting all three (seven) data management environments

7 Data Management Systems (Supported by Storage Resource Broker)
Data collecting
  – Sensor systems, object ring buffers and portals
Data organization
  – Collections, manage data context
Data sharing
  – Data grids, manage heterogeneity of resources
Data publication
  – Digital libraries, support discovery
Data preservation
  – Persistent archives, manage technology evolution
Data analysis
  – Processing pipelines, manage knowledge extraction

8 Data Management Systems
Data grid for managing distributed data
  – Latency management for bulk analyses of collections
  – Infrastructure independent name spaces for describing data, resources, users, and state information
Digital library for managing data context
  – Curation services for managing collections
  – Descriptive metadata for discovery
Persistent archive to manage technology evolution
  – Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

9 Provide Context for Data
Properties of files
  – Provenance - source
  – Descriptive attributes
  – Structure
Organize properties as metadata in a collection hierarchy
  – Define operations on file properties
  – Manage state information - location, replicas, containers
Separate context management from content management
  – Maintain consistency of context as operations are done on content

10 Data Grids
Software systems that manage distributed data
Control global name spaces for
  – Resources
  – Users
  – Files
  – Metadata context
Provide standard operations on each name space
Provide single sign-on authentication, collection management, latency management, replication, and federation
Generic distributed data management technology

11 Managing Distributed Data
Storage Repository
  – Storage location
  – User name
  – File name
  – File context (creation date,…)
  – Access constraints
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Naming conventions provided by storage systems

12 Data Grids Provide a Level of Indirection for Each Naming Convention
Storage Repository
  – Storage location
  – User name
  – File name
  – File context (creation date,…)
  – Access constraints
Data Grid
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Data Collection
Data Access Methods (C library, Unix, Web Browser)
Data is organized as a collection

13 Logical Name Spaces
Storage resources
  – Logical names for managing collections of resources
User names (user-name / domain / data grid)
  – Distinguished names for users to manage access controls
Digital Entities (files, blobs, structured data, …)
  – Logical name space for global identifiers for files
Context - Metadata attributes
  – Standard metadata attributes, Dublin Core
  – State information resulting from data grid operations
  – User-defined metadata
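
As a rough illustration of the logical file name space, the sketch below keeps a catalog that maps a location-independent logical name to its physical replicas. The class `LogicalNameCatalog`, its methods, the resource names, and the example paths are all hypothetical and chosen for illustration; this is not the SRB or MCAT API.

```python
from collections import defaultdict


class LogicalNameCatalog:
    """Hypothetical catalog: logical file name -> list of (resource, physical path)."""

    def __init__(self):
        self._replicas = defaultdict(list)

    def register(self, logical_name, resource, physical_path):
        """Record one physical copy under a logical name."""
        self._replicas[logical_name].append((resource, physical_path))

    def migrate(self, logical_name, old_resource, new_resource, new_path):
        """Repoint a replica at new storage; the logical name does not change."""
        self._replicas[logical_name] = [
            (new_resource, new_path) if resource == old_resource else (resource, path)
            for resource, path in self._replicas[logical_name]
        ]

    def resolve(self, logical_name):
        """Return every known physical location for a logical name."""
        return list(self._replicas.get(logical_name, []))


catalog = LogicalNameCatalog()
catalog.register("/demo/collection/report.txt", "unix-disk", "/data/a1/report.txt")
catalog.register("/demo/collection/report.txt", "tape-archive", "/hpss/x9/report.txt")
# Storage technology changes, but users keep the same logical name.
catalog.migrate("/demo/collection/report.txt", "unix-disk", "new-disk", "/vol2/report.txt")
print(catalog.resolve("/demo/collection/report.txt"))
```

The point of the indirection is that applications hold only the logical name, so a migration touches the catalog and nothing else.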

14 Logical Resource Name
Represents a list of physical resources
Operations on the logical resource name result in operations on the list of physical resources
  – Load leveling - write to the next physical resource in the list
  – Fault tolerance - write to “k” of “n” physical resources
  – Replication - write to each physical resource
  – Compound resource - write to the disk cache in front of the tape archive
  – Federated resource - write to the controlled resource in another data grid
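
A minimal sketch of three of these policies (load leveling, fault tolerance, replication), assuming a logical resource is simply an ordered list of physical resource names and that a caller-supplied `write(resource)` function performs the actual transfer. `LogicalResource` and its method names are hypothetical, not SRB calls; compound and federated resources are not covered.

```python
from itertools import cycle


class LogicalResource:
    """Hypothetical logical resource: one name standing for several physical resources."""

    def __init__(self, name, physical_resources):
        self.name = name
        self.physical = list(physical_resources)
        self._round_robin = cycle(self.physical)

    def write_load_leveled(self, write):
        """Load leveling: write to the next physical resource in the list."""
        target = next(self._round_robin)
        write(target)
        return [target]

    def write_replicated(self, write):
        """Replication: write to each physical resource."""
        for resource in self.physical:
            write(resource)
        return list(self.physical)

    def write_fault_tolerant(self, write, k):
        """Fault tolerance: succeed once k of the n physical resources accept the write."""
        succeeded = []
        for resource in self.physical:
            try:
                write(resource)
            except OSError:
                continue  # skip a failed resource and try the next one
            succeeded.append(resource)
            if len(succeeded) == k:
                return succeeded
        raise OSError(f"only {len(succeeded)} of the required {k} writes succeeded")


log = []
pool = LogicalResource("demo-pool", ["disk-a", "disk-b", "disk-c"])
pool.write_load_leveled(log.append)  # first resource in the list
pool.write_load_leveled(log.append)  # second resource in the list
pool.write_replicated(log.append)    # every resource in the list
```

The caller names only `demo-pool`; which physical resources are touched is decided by the policy.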

15 Storage Repository Virtualization
[Diagram: a User Application and three kinds of storage repository: Archive, Database, File System]
How does one access data stored on multiple systems?

16 Storage Repository Virtualization (Standard Operations on Logical Resource Names)
[Diagram: a User Application and three kinds of storage repository: Archive, Database, File System]
Common set of operations for interacting with every type of storage repository
Remote operations
  – Unix file system
  – Latency management
  – Procedures
  – Transformations
  – Third party transfer
  – Filtering
  – Queries
Collective operations
  – Load leveling
  – Fault tolerance
  – Replication

17 Logical File Name Abstraction
[Diagram: a User Application and three storage systems: Archive at SDSC, Database at U Md, File System at NARA]
How does one identify files stored on multiple systems?

18 Context Abstraction
[Diagram: a User Application and three storage systems: Archive at SDSC, Database at U Md, File System at U Texas]
Common naming convention and set of attributes for describing digital entities
  – Logical name space
  – Location independent identifier
  – Persistent identifier
  – Collection owned data
  – Access controls
  – Audit trails
  – Checksums
  – Descriptive metadata
  – Inter-realm authentication
  – Single sign-on system

19 Federated Server Architecture
[Diagram: a Read Application, two SRB servers, each with an SRB agent, the MCAT, and storage resources R1 and R2, connected by numbered steps 1 to 6]
Request: Logical Name Or Attribute Condition
1. Logical-to-Physical mapping
2. Identification of Replicas
3. Access & Audit Control
Other labels: Peer-to-peer Brokering; Server(s) Spawning; Data Access; Parallel Data Access (steps 5/6)
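
The three numbered catalog steps on this slide (logical-to-physical mapping, identification of replicas, access and audit control) can be sketched as a single lookup function. This is only an illustration under simple assumptions: the catalog and access lists are plain in-memory dictionaries, and `broker_read`, the user names, and the paths are hypothetical rather than SRB or MCAT interfaces.

```python
def broker_read(catalog, access, audit_log, user, logical_name):
    """Resolve a read request against a hypothetical in-memory catalog."""
    # Steps 1 and 2: map the logical name to its physical replicas.
    replicas = catalog.get(logical_name, [])
    # Step 3: access and audit control; every attempt is recorded, allowed or not.
    allowed = user in access.get(logical_name, set())
    audit_log.append((user, "read", logical_name, allowed))
    if not allowed:
        raise PermissionError(f"{user} may not read {logical_name}")
    if not replicas:
        raise FileNotFoundError(logical_name)
    return replicas


catalog = {"/demo/scan.dat": ["R1:/store/scan.dat", "R2:/mirror/scan.dat"]}
access = {"/demo/scan.dat": {"alice"}}
audit_log = []
replicas = broker_read(catalog, access, audit_log, "alice", "/demo/scan.dat")
```

Returning every replica, rather than a single one, leaves the caller free to read from several resources in parallel.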

20 SRB Latency Management
[Diagram: data paths from Source across the Network to Destination]
Mechanisms shown: Replication; Server-initiated I/O; Streaming; Parallel I/O; Caching; Client-initiated I/O; Remote Proxies, Staging; Data Aggregation; Containers; Prefetch

21 Latency Management - Bulk Operations
Bulk register
  – Create a logical name for a file
Bulk load
  – Create a copy of the file on a data grid storage repository
Bulk unload
  – Provide containers to hold small files and pointers to each file location
Bulk delete
  – Mark as deleted in metadata catalog
  – After specified interval, delete file
Bulk metadata load
Requests for bulk operations for access control setting, …
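
The two-phase bulk delete described above (mark in the catalog first, remove after an interval) might look like the following sketch. The catalog is assumed to be an in-memory dictionary, times are plain seconds, and `CatalogEntry`, `bulk_delete`, and `purge_expired` are hypothetical names, not SRB commands.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogEntry:
    logical_name: str
    deleted_at: Optional[float] = None  # None means the entry is live


def bulk_delete(catalog, logical_names, now):
    """Mark many entries as deleted in one pass; no file is removed yet."""
    for name in logical_names:
        catalog[name].deleted_at = now


def purge_expired(catalog, now, interval):
    """Drop entries whose deletion mark is at least `interval` seconds old."""
    expired = [
        name for name, entry in catalog.items()
        if entry.deleted_at is not None and now - entry.deleted_at >= interval
    ]
    for name in expired:
        del catalog[name]  # a real system would also remove the stored file here
    return expired


catalog = {name: CatalogEntry(name) for name in ("a.dat", "b.dat", "c.dat")}
bulk_delete(catalog, ["a.dat", "b.dat"], now=1000.0)
purged_early = purge_expired(catalog, now=1500.0, interval=3600.0)  # too soon
purged_late = purge_expired(catalog, now=5000.0, interval=3600.0)   # interval elapsed
```

Marking first keeps the bulk request cheap, since it is a catalog update only, and leaves a window in which a mistaken delete can still be reversed.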

22 Data Grid Federation
Link multiple independent data grids
  – Coordinate metadata between independent metadata catalogs
Provide consistency and access constraints for each of the four logical name spaces (resources, users, files, metadata)
  – Peer-to-peer federations, data access
  – Replication federations, shared resources
  – Hierarchical federations, consistency constraints
Tune data grid federation by implementing different consistency and access constraints

23 Federation
Data Grid (Data Collection A)
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Data Grid (Data Collection B)
  – Logical resource name space
  – Logical user name space
  – Logical file name space
  – Logical context (metadata)
  – Control/consistency constraints
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Access controls and consistency constraints on cross registration of digital entities

24 [Diagram: federation environments]
Grid classes: Peer-to-Peer Data Grids; Replication Data Grids; Hierarchical Data Grids
Axes: Replication Constraints; Consistency Constraints; Access Constraints
Federation types named: Occasional Interchange; Free Floating; Resource Interaction; User and Data Replica; Nomadic; Snow Flake; Master Slave; Replicated Data; Replicated Catalog; Deep Archive
Constraint labels: Partial User-ID Sharing; Partial Resource Sharing; No Metadata Synch; Hierarchical Zone Organization; One Shared User-ID; System Managed Replication; Connection From Any Zone; Complete Resource Sharing; System Set Access Controls; System Controlled Complete Synch; Complete User-ID Sharing; System Controlled Partial Synch; No Resource Sharing; Super Administrator Zone Control; No User-ID Sharing

25 Generic Infrastructure
SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  – Effort started in 1996 as a DARPA-funded project
  – Now supports over 30 national/international projects
Development team of 12 staff is led by
  – Michael Wan, data management systems
  – Arcot Rajasekar, information management systems

26 Data Grid Capabilities
Data manipulation
  – Containers
  – Parallel I/O
  – Firewall interactions
Resource interactions
  – Fault tolerance
  – Load leveling
  – Replication
HIPAA security requirements
  – Authentication of all users
  – Access controls on data and metadata
  – Audit trails
  – Data encryption
  – Centralized control
Application interfaces
  – C library, Shell commands, Java, Perl, Python, WSDL, workflow

27 Digital Library
Collection hierarchy for organizing data
  – User-defined metadata
  – Collection level metadata
Metadata manipulation
  – Schema extension
  – Bulk metadata processing
  – Queries on metadata
  – Access controls on metadata
  – Views on collections
Digital library APIs
  – DSpace, Fedora, OAI-PMH, web browsers
  – METS metadata XML schema

29 Persistent Archives
Authenticity metadata
  – Provenance
  – User logical name space
Integrity metadata
  – Audit trails, checksums
  – Access controls
Consistency
  – Context update on all content operations
Persistency
  – Infrastructure independence
      Storage repository abstraction
      Information repository abstraction
      Access abstraction (standard operations)
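
For the checksum part of the integrity metadata, a minimal sketch is to record a digest when a digital entity is registered and recompute it later to detect corruption. The slide does not say which checksum algorithm is used; SHA-256 from Python's standard `hashlib` module is an assumption here, and the function names `checksum` and `verify` are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Compute a hex digest to store as integrity metadata (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, recorded: str) -> bool:
    """Recompute the digest and compare it with the value recorded at registration."""
    return checksum(data) == recorded


original = b"digital entity contents"
recorded = checksum(original)                             # stored with the catalog entry
intact = verify(original, recorded)                       # unchanged copy
corrupted = verify(b"digital entity c0ntents", recorded)  # one altered byte
```

Because the digest lives in the catalog rather than on the storage system, it can be checked again after the data has been migrated to new storage technology.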

30 National Archives Persistent Archive
[Diagram: NARA, U Md, SDSC; MCAT]
Principal copy stored at NARA with complete metadata catalog
Replicated copy at U Md for improved access, load balancing and disaster recovery
Deep Archive at SDSC, no user access, but complete copy

31 [Diagram: SRB architecture components]
Application
Access interfaces: C, C++, Java Libraries; Unix Shell; Linux I/O; DLL / Python, Perl; Java, NT Browser; Kepler Actors; OAI, WSDL, WSRF; HTTP; DSpace; OpenDAP
Federation Management: Data Grid Federation - zoneSRB
Consistency & Metadata Management / Authorization, Authentication, Audit
Logical Name Space; Latency Management; Data Transport; Metadata Transport
Catalog Abstraction
  – Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
Storage Repository Virtualization
  – Archives: Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  – Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  – File Systems: Unix, NT, Mac OSX
  – ORB

32 Examples of Extensibility
Storage Repository Driver evolution
  – Initially supported Unix file system
  – Added archival access - UniTree, HPSS
  – Added FTP/HTTP
  – Added database blob access
  – Added database table interface
  – Added Windows file system
  – Added project archives - Dcache, Castor, ADS
  – Added Object Ring Buffer, Datascope
  – Adding GridFTP version 3.3
Database management evolution
  – Postgres
  – DB2
  – Oracle
  – Informix
  – Sybase
  – mySQL (most difficult port - no locks, no views, limited SQL)

33 Examples of Extensibility
The 3 fundamental APIs are C library, shell commands, Java
  – Other access mechanisms are ported on top of these interfaces
API evolution
  – Initial access through C library, Unix shell command
  – Added iNQ Windows browser (C++ library)
  – Added mySRB Web browser (C library and shell commands)
  – Added Java (Jargon)
  – Added Perl/Python load libraries (shell command)
  – Added WSDL (Java)
  – Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  – Added Kepler actors for dataflow access (Java)
  – Adding GridFTP version 3.3 (C library)

34 Sites Using the SRB

36 Grid Interfaces
GSI, support versions 1, 2, 3, Java
GridFTP version 3.3 interface to SRB collection
  – Use GSI certificate to identify the user to the SRB
  – Reference file by an SRB logical name space
  – Use SRB access controls for allowed operations
  – Initially support serial transport
  – SRB supports 4 different firewall interaction protocols (client-driven parallel I/O, server-driven parallel I/O, bulk file registration, federated data grid access)
GridFTP version 3.3 driver for SRB collection
  – Store data at a remote site under the SRB ID
      Data will be shareable through SRB access controls
  – Store data at a remote site under user GSI certificate
      Data will not be shareable through SRB access controls

37 Grid Interfaces
Replica Location Service Interface
  – Simon Metson
  – GMCat mimics the LRC interface, enabling the files registered in an MCat to appear on the giggle framework (RLS).
  – Available from http://tuber1.phy.bris.ac.uk:8080/GMCatWS3
  – (also linked from the third party software on the SRB page)
Storage Resource Manager
  – SRM Version 1, SRB driver created to store data in SRM
  – SRM Version 2, development effort to put SRM interface on top of SRB (Alasdair Earl)
  – SRM Version 3, development effort to put SRM interface on top of SRB (Peter Kunszt)

38 Conclusion
Distributed data management systems can be built on generic data grid infrastructure
  – Data grids to support bulk access across remote sites
  – Integration of data grid and digital library capabilities to manage massive data collections
  – Federation of data grids to build international discipline-wide collections

39 SDSC SRB Team (left to right)
  – Arun Jagatheesan
  – George Kremenek
  – Sheau-Yen Chen
  – Arcot Rajasekar (SRB development lead)
  – Reagan Moore (SRB PI)
  – Michael Wan (SRB architect)
  – Roman Olschanowsky (BIRN)
  – Bing Zhu
  – Charlie Cowart
  – Lucas Gilbert
  – Tim Warnock
  – Wayne Schroeder (SRB product)
  – Adam Birnbaum (SRB production)
  – Antoine De Torcy
  – Vicky Rowley (BIRN)
  – Marcio Faerman (SCEC)
Students & emeritus
  – Erik Vandekieft
  – Reena Mathew
  – Xi (Cynthia) Sheng
  – Allen Ding
  – Grace Lin
  – Qiao Xin
  – Daniel Moore
  – Ethan Chen
  – Jon Weinburg
Supported by about 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)

40 For More Information
Reagan W. Moore
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html

