Data Grids, Digital Libraries and Persistent Archives: An Integrated Approach to Publishing, Sharing and Archiving Data
Written by: R. Moore, A. Rajasekar, M. Wan
Presenter: Yedugani Pawan Kumar
Overview
Introduction
Grid Evolution
Integrating Digital Libraries and Data Grids: Spanning the Information Divide
Storage Resource Broker – Integration of Digital Libraries and Data Grids
Data Management Concepts
Grid Implementations
Grids and Digital Libraries
Introduction
Data grids support massive data collections that are distributed across multiple institutions. For example:
The Worldwide Universities Network supports the sharing of data between academic institutions in the United States and the United Kingdom.
The SIOExplorer project manages an archive of ship logs from oceanographic research vessels.
These projects use collections to provide a context for the interpretation of their digital entities. They are built on a generic data management infrastructure, the San Diego Supercomputer Center Storage Resource Broker (SDSC SRB).
Storage Resource Broker
The Storage Resource Broker manages distributed data. Data grid technology provides the fundamental mechanisms for managing distributed data. Digital libraries can be implemented on top of data grids by adding mechanisms that support collection creation, browsing and discovery. Persistent archives can be implemented on data grids by adding the integrity metadata needed to assert the invariance of the deposited material.
Definitions of Data and Information
The data grid community defines “data” as the strings of bits that comprise a digital entity. “Information” is the set of semantic labels that are assigned to strings of bits. The combination of a semantic label and its associated data is termed metadata.
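The three definitions above can be sketched as a small data structure. This is a minimal illustration, not part of SRB: the class names, the `tag` helper, and the example label are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemanticLabel:
    """Information: a semantic label that can be assigned to a string of bits."""
    name: str
    value: str


@dataclass(frozen=True)
class Metadata:
    """Metadata: a semantic label combined with its associated data."""
    label: SemanticLabel
    data: bytes  # Data: the string of bits that makes up the digital entity


def tag(data: bytes, name: str, value: str) -> Metadata:
    """Assign one semantic label to a string of bits (hypothetical helper)."""
    return Metadata(SemanticLabel(name, value), data)


# Hypothetical example: four bytes labelled with the variable they represent.
record = tag(b"\x00\x01\x02\x03", "physical_variable", "sea_surface_temperature")
```

The bits stay unchanged by the labelling step; only the pairing of label and bits is new.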
Grid Evolution
Each evolutionary step required new naming conventions for resources, users, files, etc. The naming conventions made it possible to create uniform labels.
Virtual Organization
The assignment of naming conventions depends on the following: cultural considerations; organizational considerations; choice of infrastructure.
Grid Evolution
INTEGRATING DIGITAL LIBRARIES AND DATA GRIDS: SPANNING THE INFORMATION DIVIDE
Knowledge management is needed to support constraints in the federation of data grids and semantic crosswalks between digital libraries. Integrating constraint-based knowledge management technology with data grids, digital libraries and persistent archives requires relationship-based constraints for all three environments. The assignment of a semantic label to a digital entity requires a processing step. For scientific data, one might attach the following semantic labels to a file:
Name of the physical variable represented by the file.
Units associated with the physical variable.
Data model by which the bits are organized.
Structural mapping implied by the data model.
Procedural mapping imposed on the data model.
Knowledge is the expression of relationships between semantic labels. Relationships are typed as:
logical (“is a”, “has”)
structural (existence of a structure within the string of bits)
spatial (mapping of a string of bits to a coordinate system)
temporal (mapping to a point in time)
procedural (mapping to process results)
functional (mapping of features to an evaluation algorithm)
systemic (properties that cover all members of a collection)
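The relationship types listed above can be written down as typed links between semantic labels. The following is a sketch under stated assumptions: the `Relationship` tuple layout, the `relations_of_type` helper, and the example labels are hypothetical and are not taken from any SRB schema.

```python
from enum import Enum
from typing import List, NamedTuple


class RelationType(Enum):
    """The seven relationship types named on the slide."""
    LOGICAL = "logical"
    STRUCTURAL = "structural"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"
    PROCEDURAL = "procedural"
    FUNCTIONAL = "functional"
    SYSTEMIC = "systemic"


class Relationship(NamedTuple):
    """Knowledge: a typed relationship between two semantic labels."""
    subject: str
    kind: RelationType
    predicate: str
    target: str


# Hypothetical examples of knowledge expressed as relationships.
knowledge = [
    Relationship("sea_surface_temperature", RelationType.LOGICAL,
                 "is a", "physical_variable"),
    Relationship("ship_log_entry", RelationType.TEMPORAL,
                 "maps to", "observation_time"),
]


def relations_of_type(items: List[Relationship],
                      kind: RelationType) -> List[Relationship]:
    """Return only the relationships of the requested type."""
    return [r for r in items if r.kind is kind]
```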
Each semantic label is the result of the application of a process. Information is created by applying the constraints appropriate to a given community. Each type of knowledge constraint can be given a name and associated with a digital entity as a semantic label. The digital library community encapsulates the knowledge constraints in the curation processes that are applied when a collection is assembled. The preservation community encapsulates them in the archival processes that are applied when the archival collection is created. The data grid community characterizes knowledge constraints as applied processes or functions that transform digital entities into derived data products. A major change in perspective is needed when dealing with the sociological imperatives that arise from interaction between independent groups of researchers.
Requirements for the management of knowledge constraints are pervasive. Constraints are needed to enforce controls on interactions between federated data management systems, both for access and for consistency. Constraints constitute relationships or rules that must be evaluated each time an item is shared or accessed. For the digital library community, access constraints control how semantic labels used within one community can be mapped to semantic labels used by another community. The preservation community associates authenticity metadata with each digital entity, asserting the archival processes that were applied to it.
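The idea that constraints are rules evaluated on every access can be sketched as follows. This is an illustrative assumption about how such checks could be wired up, not a description of how SRB enforces them: the `GuardedEntity` class, the request fields, and both constraint functions are hypothetical.

```python
from typing import Callable, Dict, List

Request = Dict[str, str]
Constraint = Callable[[Request], bool]


def from_federated_zone(request: Request) -> bool:
    """Hypothetical access rule: the request must come from a federated zone."""
    return request.get("zone") in {"zone_a", "zone_b"}


def has_read_role(request: Request) -> bool:
    """Hypothetical access rule: the requester must hold a reading role."""
    return request.get("role") in {"curator", "reader"}


class GuardedEntity:
    """A digital entity whose constraints are re-evaluated on every access."""

    def __init__(self, payload: bytes, constraints: List[Constraint]) -> None:
        self._payload = payload
        self._constraints = constraints

    def access(self, request: Request) -> bytes:
        # Evaluate every constraint each time; nothing is cached between calls.
        failed = [c.__name__ for c in self._constraints if not c(request)]
        if failed:
            raise PermissionError("constraints not satisfied: " + ", ".join(failed))
        return self._payload


entity = GuardedEntity(b"ship log", [from_federated_zone, has_read_role])
```

Keeping each rule as a named function makes it possible to report which constraint blocked a request.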
STORAGE RESOURCE BROKER – INTEGRATION OF DIGITAL LIBRARIES AND DATA GRIDS
The SRB is a generic data management system used to build digital libraries for publishing data, data grids for sharing data, and persistent archives for preserving data. The SRB is used extensively within the NPACI project, with over 350 terabytes of data stored under SRB management at SDSC, comprising over 50 million files. Implementing the SRB technology for use within the NPACI grid required the development of four fundamental virtualizations:
Storage repository virtualization.
Data virtualization mechanisms.
Information repository virtualization.
Service virtualization.
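Storage repository virtualization, the first item in the list above, can be illustrated with one interface placed in front of different kinds of storage. This is a sketch only, assuming a simple put/get interface; the class and function names are hypothetical and do not reflect the SRB driver API.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class StorageDriver(ABC):
    """Uniform operations that hide the details of each storage system."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None:
        """Store the bits under the given key."""

    @abstractmethod
    def get(self, key: str) -> bytes:
        """Return the bits stored under the given key."""


class MemoryDriver(StorageDriver):
    """Keeps data in a dictionary; stands in for a cache-like repository."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class DirectoryDriver(StorageDriver):
    """Keeps data as files under a root directory."""

    def __init__(self, root: Path) -> None:
        self._root = root

    def put(self, key: str, data: bytes) -> None:
        (self._root / key).write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self._root / key).read_bytes()


def round_trip(driver: StorageDriver, key: str, data: bytes) -> bytes:
    """Client code uses only the interface, never a concrete repository type."""
    driver.put(key, data)
    return driver.get(key)


with tempfile.TemporaryDirectory() as scratch:
    from_disk = round_trip(DirectoryDriver(Path(scratch)), "log.dat", b"abc")
from_memory = round_trip(MemoryDriver(), "log.dat", b"abc")
```

The caller of `round_trip` does not change when a new kind of repository is added, which is the point of the virtualization.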
Projects Using SRB Technology
SRB
The SRB is the underlying data management technology in each of the Table 2 projects. The resulting architectures have components similar to those used in the National Virtual Observatory (NVO), which has the following components:
Portals that provide a user interface to the NVO services
Registry for publishing the existence of NVO services
Web-based services that implement interactive data manipulation or analysis tasks
Workflow environments for support of processing pipelines
SRB data grid for access to the storage repositories
Grid software for distributed computation
Catalogs and image archives of sky surveys
Storage systems and archives
NVO Architecture
SRB
The concepts implemented in the SRB are now being used by all other data grid implementations. The concepts include:
Use of a federated client-server architecture.
Use of a logical name space.
Mapping of attributes onto the logical name space.
Use of access controls on digital entities.
Explicit services developed within the SRB are now being implemented in other data grids: replication, aggregation of data into containers, support for user-defined metadata, role-based access controls, and ticket-based authentication.
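Several of the concepts listed above (a logical name space, attributes mapped onto logical names, replication, and access controls) can be combined in one small sketch. Everything here is an assumption made for illustration: the `LogicalNamespace` class, its method names, and the example locations are hypothetical and are not the SRB catalog interface.

```python
from typing import Dict, List, Set


class LogicalNamespace:
    """Maps logical names to replicas, attributes, and permitted readers."""

    def __init__(self) -> None:
        self._replicas: Dict[str, List[str]] = {}
        self._attributes: Dict[str, Dict[str, str]] = {}
        self._readers: Dict[str, Set[str]] = {}

    def register(self, logical_name: str, location: str, owner: str) -> None:
        """Create a logical name that points at its first physical copy."""
        self._replicas[logical_name] = [location]
        self._attributes[logical_name] = {}
        self._readers[logical_name] = {owner}

    def replicate(self, logical_name: str, location: str) -> None:
        """Record an additional physical copy under the same logical name."""
        self._replicas[logical_name].append(location)

    def annotate(self, logical_name: str, attribute: str, value: str) -> None:
        """Map a user-defined attribute onto the logical name."""
        self._attributes[logical_name][attribute] = value

    def grant(self, logical_name: str, user: str) -> None:
        """Allow another user to resolve the logical name."""
        self._readers[logical_name].add(user)

    def resolve(self, logical_name: str, user: str) -> List[str]:
        """Return all physical locations if the user is allowed to read."""
        if user not in self._readers.get(logical_name, set()):
            raise PermissionError(f"{user} may not read {logical_name}")
        return list(self._replicas[logical_name])

    def find(self, attribute: str, value: str) -> List[str]:
        """Discover logical names by attribute value."""
        return sorted(name for name, attrs in self._attributes.items()
                      if attrs.get(attribute) == value)


# Hypothetical usage: one logical name, two physical replicas.
ns = LogicalNamespace()
ns.register("/collection/cruise/log1", "disk-site-a:/vol1/0001", owner="alice")
ns.replicate("/collection/cruise/log1", "tape-site-b:/vault/7781")
ns.annotate("/collection/cruise/log1", "vessel", "example-vessel")
ns.grant("/collection/cruise/log1", "bob")
```

Users refer only to the logical name; where the replicas physically live can change without affecting them.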
Data Management Concepts
Creation of a logical name space.
Mapping of state information onto logical names as attributes.
Consistency is maintained by imposing multiple levels of constraints on the logical name space.
Middleware
Data grids manage and manipulate the consistency of distributed state information.
Digital libraries add mappings to manage user-defined metadata that supports discovery and browsing.
Persistent archives add mappings to manage the authenticity of the deposited digital entities.
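The mapping that persistent archives add can be illustrated with a checksum recorded at deposit time and re-checked later. This is a sketch that assumes a SHA-256 digest as the integrity attribute; the slides do not say which integrity metadata SRB records, and the `deposit` and `verify` functions are hypothetical.

```python
import hashlib
from typing import Dict


def deposit(archive: Dict[str, dict], name: str, data: bytes) -> None:
    """Store the bits together with a digest computed at deposit time."""
    archive[name] = {
        "data": data,
        "sha256": hashlib.sha256(data).hexdigest(),  # integrity attribute
    }


def verify(archive: Dict[str, dict], name: str) -> bool:
    """Recompute the digest and compare it with the one recorded at deposit."""
    entry = archive[name]
    return hashlib.sha256(entry["data"]).hexdigest() == entry["sha256"]


archive: Dict[str, dict] = {}
deposit(archive, "log1", b"ship log contents")
unchanged = verify(archive, "log1")

# Simulate corruption of the stored bits after deposit.
archive["log1"]["data"] = b"ship log contents (altered)"
still_valid = verify(archive, "log1")
```

A matching digest shows only that the bits are unchanged since deposit; asserting authenticity also requires the record of which archival processes were applied.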
Data Management Concepts
Grid Implementation
Middleware is the infrastructure that manages the information flow between processes and distributed collections. Organizing computational results makes it possible to associate a context with them. A digital entity becomes useless without a context. Grids focus on the execution of access services. Digital libraries focus on the management of the results.
Grid Implementation
Grids and Digital Libraries
Federating Name Spaces
Replica Location Service
Community Authorization
Metadata Catalog Services
Processing Pipeline
Dataflow Environment
Workflow Environment
Consistency Management
Information Flow
Conclusion
By integrating data grids, digital libraries, and persistent archives, we will be able to maintain the consistency of federated data collections while flowing information and data from digital libraries through grid services into preservation environments.
Thank You