Published by Blaze McCormick. Modified over 8 years ago.
1
Building Preservation Environments with Data Grid Technology
Reagan W. Moore
Presenter: Praveen Namburi
2
Topics covered
- How to preserve data successfully
  - Some concepts
  - Terminology used
- Storing digital components
  - Storage attributes
  - Usage of data grids
- Successful digital preservation environment
  - How is it achieved?
  - Capabilities provided by data grids
  - Mapping storage attributes to data grid namespaces
3
Topics covered (cont.)
- Capabilities of data grids
- Data grid architecture
- Importing records into a data grid
- Preservation environments based on data grids
4
How to preserve data successfully?
- Separation of digital records from the infrastructure in which they are created
- Some of the requirements issued by the InterPARES project for maintaining the authenticity of records and preserving them:
  - Preservation is measured by the ability to reproduce the record
  - Preservation of authentic records is a continuous process
- How can we say that an electronic record is authentic?
  - If the identity and integrity metadata cannot be separated from it
- What do the identity and integrity metadata of a record consist of?
5
Source: Moore’s paper
6
Storing Digital Components
- Storage attributes are provided by the environment in which the digital component is created:
  - Storage system name (i.e. network address)
  - File name (i.e. location within the storage system)
  - Names of the file management properties (e.g. size, file creation date)
  - Names of users (e.g. owner of the file and others having access)
  - Access privileges (e.g. operations a user is allowed to perform on a file)
- File management properties also include the identity and integrity metadata
- What happens if the document is moved between sites or storage systems?
  - All or some of the metadata attributes may change and need to be updated
- Data grids automate the management of the storage attributes
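The point about attributes changing when a document moves can be illustrated with a minimal Python sketch. The class name, field names, host names, paths and user names below are all hypothetical and chosen only for illustration; they are not taken from Moore's paper or from any data grid software.

```python
from dataclasses import dataclass, fields, replace
from datetime import datetime, timezone


@dataclass
class StorageAttributes:
    storage_system: str   # network address of the storage system
    file_name: str        # location within the storage system
    size_bytes: int       # file management property
    created: datetime     # file management property
    owner: str            # user who owns the file
    access: dict          # user name -> set of allowed operations


def changed_attributes(old, new):
    """Return the names of the attributes that differ between two copies."""
    return [f.name for f in fields(old)
            if getattr(old, f.name) != getattr(new, f.name)]


original = StorageAttributes(
    storage_system="archive-a.example.org",
    file_name="/records/2005/report.pdf",
    size_bytes=48213,
    created=datetime(2005, 1, 15, tzinfo=timezone.utc),
    owner="alice",
    access={"alice": {"read", "write"}, "bob": {"read"}},
)

# Moving the file to another site changes the storage system, the path and
# the local account names, even though the record itself is unchanged.
moved = replace(
    original,
    storage_system="archive-b.example.org",
    file_name="D:\\vault\\report.pdf",
    owner="asmith",
    access={"asmith": {"read", "write"}, "rjones": {"read"}},
)

print(changed_attributes(original, moved))
```

Every attribute in the printed list would have to be updated by hand after the move; this is the bookkeeping that a data grid is meant to automate.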
7
Successful Digital Preservation Environment
- How is the separation achieved while still maintaining authenticity?
  - We need a data management software infrastructure (data grid) between the storage system where the digital component is stored and the access applications (Unix, web browser, etc.) used to retrieve the records
  - Mapping of the storage system attributes onto the data grid namespaces
- Quote from the paper: “A successful digital preservation environment is one in which each digital record is separated from the software and hardware technology used to create its original instantiation. This means that the digital record can be preserved in a storage system different from the original one, and can be accessed and displayed through applications different from those originally used, while remaining intact with its own metadata.”
8
- A logical name space is associated with each storage system attribute
- Logical namespaces produce GUIDs which identify the attributes of the creator’s record keeping system
9
- Logical storage name: creates permanent storage location identifiers (IP addresses corresponding to the actual physical location)
  - ex: doi:10.1045/january2005-bollen*
- Logical file name: mapped to the physical file name
  - from: http://www.dlib.org/dlib/january05/bollen/01bollen.html
  - In the above URL, “/dlib/january05/bollen/01bollen.html” is the logical file name.*
  - It could be stored in a Unix system as: /var/www/htdocs/dlib/january05/bollen/01bollen.html, which is the physical file name
  - or it could be stored in Windows as: C:\htdocs\dlib\january05\bollen\01bollen.html
- Can be used to manage copies of the digital component
  - Location of each copy is stored as metadata along with the time it was produced
* Source: provided by Dr Nelson
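The mapping from one logical file name to several physical file names can be sketched in Python as follows. This is only an illustration of the idea on this slide: the `Replica` class, the `physical_path` helper, the host names and the timestamps are assumptions, not part of any data grid API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import PurePosixPath, PureWindowsPath


@dataclass
class Replica:
    storage_system: str   # hypothetical host name
    physical_path: str    # path as the local file system sees it
    produced: datetime    # time at which this copy was made


def physical_path(logical_name, root, windows=False):
    """Map a logical file name onto a storage-specific physical path."""
    parts = PurePosixPath(logical_name).relative_to("/").parts
    path_type = PureWindowsPath if windows else PurePosixPath
    return str(path_type(root, *parts))


logical_name = "/dlib/january05/bollen/01bollen.html"

# The logical file name is the key; each copy records where and when it was made.
catalog = {
    logical_name: [
        Replica("unix-host.example.org",
                physical_path(logical_name, "/var/www/htdocs"),
                datetime(2005, 1, 10, tzinfo=timezone.utc)),
        Replica("win-host.example.org",
                physical_path(logical_name, "C:\\htdocs", windows=True),
                datetime(2005, 1, 12, tzinfo=timezone.utc)),
    ]
}

for copy in catalog[logical_name]:
    print(copy.storage_system, copy.physical_path, copy.produced.date())
```

A reader only ever refers to the logical name; which physical copy is served is a lookup in the catalog.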
10
- Logical records metadata elements
  - Associated with each digital component that is registered to the logical file namespace
  - Storage attributes, preservation attributes and records metadata are all linked to the logical file name, which is used as the key to the remaining information
  - Just as a cardboard box can hold paper documents, a digital file can hold several digital components
  - The storage attributes may therefore comprise the location of each digital component in the container, and the name and location of the container
- Logical user name
  - The data grid maintains a unique name for each person authorized to perform preservation procedures
- Control/consistency constraints
  - Access privileges for each logical file name and records metadata element
  - Logical records metadata elements are updated after a preservation procedure
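A rough Python sketch of these ideas: the logical file name is the key, each entry records where its component sits inside a container, and access privileges are checked against logical user names. The registry layout, the container name, the user names and the `read_component` helper are invented for illustration only.

```python
letter_a = b"first record"
letter_b = b"second record"

# One physical container (a single file) holding two digital components.
containers = {"/containers/box-01.dat": letter_a + letter_b}

# Registry keyed by logical file name; all other information hangs off that key.
registry = {
    "/collection/box-01/letter-a.txt": {
        "container": "/containers/box-01.dat",
        "offset": 0,
        "length": len(letter_a),
        "records_metadata": {"title": "Letter A"},
        "acl": {"archivist1": {"read", "migrate"}, "public": {"read"}},
    },
    "/collection/box-01/letter-b.txt": {
        "container": "/containers/box-01.dat",
        "offset": len(letter_a),
        "length": len(letter_b),
        "records_metadata": {"title": "Letter B"},
        "acl": {"archivist1": {"read", "migrate"}, "public": {"read"}},
    },
}


def read_component(user, logical_name):
    """Return one component's bytes if the logical user may read it."""
    entry = registry[logical_name]
    if "read" not in entry["acl"].get(user, set()):
        raise PermissionError(f"{user} may not read {logical_name}")
    start = entry["offset"]
    return containers[entry["container"]][start:start + entry["length"]]


print(read_component("public", "/collection/box-01/letter-b.txt"))
```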
11
Capabilities of data grids
- Data collection management technology
- Present between the user access mechanism and the storage system
- Issue queries on metadata and retrieve records without knowing the file name or the storage location
- Quote from the paper: “The capabilities provided by data grids are essential for automating preservation processes, mitigating risk of data loss through reproduction of digital components, assuring the permanent association of identity and integrity metadata with records, and supporting retrieval and access. At the same time, data grids are designed to manage digital entities stored in any type of storage system, while providing access through a very wide variety of access mechanisms. This ability to interact with multiple types of storage systems and access systems forms the core of data grid support for technology evolution.”
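Retrieval by metadata, without knowing the file name or storage location, can be illustrated with a small in-memory SQLite database from the Python standard library. The table layout, the attribute names and the sample rows are assumptions made for this sketch; a real data grid registry has its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE replicas (logical_name TEXT, storage_system TEXT, physical_path TEXT);
    CREATE TABLE metadata (logical_name TEXT, attribute TEXT, value TEXT);
""")

conn.executemany("INSERT INTO replicas VALUES (?, ?, ?)", [
    ("/dlib/january05/bollen/01bollen.html", "unix-host.example.org",
     "/var/www/htdocs/dlib/january05/bollen/01bollen.html"),
    ("/dlib/january05/bollen/01bollen.html", "win-host.example.org",
     r"C:\htdocs\dlib\january05\bollen\01bollen.html"),
    ("/dlib/january05/other/01other.html", "unix-host.example.org",
     "/var/www/htdocs/dlib/january05/other/01other.html"),
])
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", [
    ("/dlib/january05/bollen/01bollen.html", "creator", "Bollen"),
    ("/dlib/january05/other/01other.html", "creator", "Smith"),
])


def find_records(attribute, value):
    """Return every stored copy of the records whose metadata matches."""
    return conn.execute(
        "SELECT r.logical_name, r.storage_system, r.physical_path "
        "FROM metadata m JOIN replicas r ON r.logical_name = m.logical_name "
        "WHERE m.attribute = ? AND m.value = ? "
        "ORDER BY r.logical_name, r.storage_system",
        (attribute, value),
    ).fetchall()


for row in find_records("creator", "Bollen"):
    print(row)
```

The caller supplies only a metadata attribute and value; the names and locations come back from the query.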
12
Data grid Architecture
13
- Three lower layers manage the digital components and the two upper layers manage access to the preservation environment
- Bottom layer is the storage system where the digital components actually reside
- Storage system driver for each new type of storage system
- Data grids store the logical attributes in a database called the ‘data grid registry’
- The link between the attributes in the database and their respective records is maintained in a metadata catalog (MCAT)
- Data grids support ownership. They have their own user IDs. They also support multiple access roles
  - Access permissions are set separately for each digital component
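The "one driver per type of storage system" idea can be sketched in Python as an abstract interface with interchangeable implementations. The class and function names are hypothetical, and the two drivers (an in-memory store and a local directory) merely stand in for real storage systems.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class StorageDriver(ABC):
    """Uniform operations that the upper layers rely on."""

    @abstractmethod
    def write(self, physical_name: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, physical_name: str) -> bytes: ...


class MemoryDriver(StorageDriver):
    def __init__(self):
        self._blobs = {}

    def write(self, physical_name, data):
        self._blobs[physical_name] = data

    def read(self, physical_name):
        return self._blobs[physical_name]


class DirectoryDriver(StorageDriver):
    def __init__(self, root):
        self._root = Path(root)

    def write(self, physical_name, data):
        target = self._root / physical_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)

    def read(self, physical_name):
        return (self._root / physical_name).read_bytes()


def copy_between(source, destination, physical_name):
    """Move a component between storage systems using only the driver interface."""
    destination.write(physical_name, source.read(physical_name))


def demo():
    memory = MemoryDriver()
    memory.write("records/report.txt", b"annual report")
    with tempfile.TemporaryDirectory() as tmp:
        disk = DirectoryDriver(tmp)
        copy_between(memory, disk, "records/report.txt")
        return disk.read("records/report.txt")


print(demo())
```

Supporting a new kind of storage system then means writing one more driver class; the code that copies or reads components does not change.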
14
- Standard set of operations on the digital components
  - Read and write a file in the storage system, move files from one storage system to another, load files into the data grid
  - Implemented using standard access interfaces such as C library calls, Unix shell commands and Java classes
  - ex: Web Services Resource Framework implemented using Java classes
- The above interfaces are used to call a particular access mechanism to the data grid, which stores all the logical data pertaining to the various sites and storage systems that hold copies of the digital component
- Failure of a data grid registry (which stores all logical attributes) might affect the data grid’s performance
  - This can be overcome by using a federation of data grid registries
  - Selected records and related metadata are reproduced in a second data grid
  - A third security data grid (public access is restricted) can be implemented that reproduces all the logical attributes and digital components
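How a federation of registries guards against the failure of one registry can be sketched in Python. The `Registry` class, the `federated_lookup` function, the grid names and the sample entries are all hypothetical; the sketch only mirrors the arrangement described on this slide (a primary grid, a second grid with selected records, and a restricted third grid with everything).

```python
class RegistryUnavailable(Exception):
    """Raised when a registry cannot be reached."""


class Registry:
    def __init__(self, name, entries):
        self.name = name
        self.entries = dict(entries)
        self.online = True

    def lookup(self, logical_name):
        if not self.online:
            raise RegistryUnavailable(self.name)
        return self.entries[logical_name]


def federated_lookup(registries, logical_name):
    """Try each registry in turn and return the first answer found."""
    for registry in registries:
        try:
            return registry.name, registry.lookup(logical_name)
        except (RegistryUnavailable, KeyError):
            continue
    raise LookupError(f"no registry could resolve {logical_name}")


all_entries = {
    "/records/a.txt": "site-1:/data/a.txt",
    "/records/b.txt": "site-1:/data/b.txt",
}
primary = Registry("primary", all_entries)
second = Registry("second", {"/records/a.txt": "site-2:/copy/a.txt"})  # selected records only
dark = Registry("security", {k: v.replace("site-1", "site-3") for k, v in all_entries.items()})

primary.online = False  # simulate a registry failure
print(federated_lookup([primary, second, dark], "/records/a.txt"))
print(federated_lookup([primary, second, dark], "/records/b.txt"))
```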
15
Importing records into a data grid
- Data grid technology is installed in the storage system used by the custodian
  - A ‘data grid command’ is issued to register the records in the data grid registry
  - The records metadata can be extracted from the record keeping system and linked to the electronic records
- Using the staging area
  - Transferred records are kept in a ‘staging area’ in the storage system
  - Before adding the records to the registry, the custodian can check them for authenticity and make modifications if necessary
  - The records and their metadata are checked for errors using a checksum
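The checksum step in the staging area might look like the following Python sketch. The slide does not say which checksum algorithm is used, so SHA-256 from the standard `hashlib` module is an assumption here, as are the function names, the file names and the manifest format.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of a record (the algorithm is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def verify_staged(staged, manifest):
    """Split staged records into those matching the sender's manifest and those that do not."""
    accepted, rejected = [], []
    for name in sorted(staged):
        if manifest.get(name) == checksum(staged[name]):
            accepted.append(name)
        else:
            rejected.append(name)
    return accepted, rejected


# Checksums computed by the sender before transfer.
manifest = {
    "minutes.txt": checksum(b"Board minutes, 12 May"),
    "budget.csv": checksum(b"item,amount\nrent,1200\n"),
}

# Contents as they arrived in the staging area; budget.csv was altered in transit.
staged = {
    "minutes.txt": b"Board minutes, 12 May",
    "budget.csv": b"item,amount\nrent,1300\n",
}

accepted, rejected = verify_staged(staged, manifest)
print("register:", accepted)
print("hold back:", rejected)
```

Only the accepted records would go on to be registered; the rejected ones stay in the staging area for the custodian to examine.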
16
- The digital records are aggregated and grouped according to their archival aggregation
  - Each group is stored as a physical container (digital file) which holds the related records
  - A logical container is made up of physical containers
- If the record formats are obsolete, transformative migrations are done to match the current formats
  - Both versions are stored for future reference
- Digital transfer process of the National Archives of Australia (3-minute animation): http://www.naa.gov.au/recordkeeping/preservation/digital/animation/index.html
  Source: Australian Government, NAA
17
Preservation Environments based on data grids
- Data grid technology is currently being used to
  - support research projects in various communities such as the US National Archives and Records Administration (NARA) and US State Archives
  - manage scientific data collections that have millions of files and hundreds of terabytes of data
- What does the SRB data grid use?
  - Oracle database for the data grid registry
  - Sun F15k server to support the Oracle database and servers
  - IBM High Performance Storage System (HPSS)
  - Sun Sam-QFS file system to manage files written to tape
  - Grid brick technology to provide online disk caches for interactive access to stored records
19
Conclusion
- Data grids provide the software that makes digital records independent of the infrastructure in which they are created, thus providing a successful preservation environment for digital records.
- Data grids provide the reproduction of records at multiple sites, thereby reducing the risk of record loss.
- Federation of data grids makes it easy to implement high-security environments to retain the authenticity of the records.
20
Thank You