Presentation is loading. Please wait.

Presentation is loading. Please wait.

Building Preservation Environments Reagan W. Moore San Diego Supercomputer Center Storage Resource Broker.

Similar presentations


Presentation on theme: "Building Preservation Environments Reagan W. Moore San Diego Supercomputer Center Storage Resource Broker."— Presentation transcript:

1 Building Preservation Environments Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.sdsc.edu/srb/ Storage Resource Broker

2 Topics Preservation environments Digital library technology Data grid technology Fundamental concepts / future research Data / information / knowledge Persistent objects Knowledge management

3 Preservation Archival processes through which a digital entity is extracted from its creation environment, and then supported in a preservation environment, while maintaining authenticity and integrity information. Extraction process requires insertion of support infrastructure underneath the digital material Goal is infrastructure independence, the ability to use any commercial storage system, database, or access mechanism

4 Preservation Communities InterPARES - diplomatics Preservation of records NARA Preservation of records from federal agencies State archives Preservation of submitted “collections” Continuum model Preservation of active data and records

5 Preservation What differentiates a preservation environment from a digital library?

6 Digital Libraries Support the community vocabulary Discovery and browse using community relevant terms Support the community data format Maintain information on the data format of each item Support the community access services Provide services that manipulate and display the community data format

7 Preservation Mandates Diplomatics Authenticity Integrity NARA Infrastructure independence Scalability State archives Automation of archival processes

8 InterPARES - Diplomatics Authenticity - maintain links to metadata for: Date record is made Date record is transmitted Date record is received Date record is set aside [i.e. filed] Name of author (person or organization issuing the record) Name of addressee (person or organization for whom the record is intended) Name of writer (entity responsible for the articulation of the record’s content) Name of originator (electronic address from which record is sent) Name of recipient(s) (person or organization to whom the record is sent) Name of creator (entity in whose archival fonds the record exists) Name of action or matter (the activity for which the record is created) Name of documentary form (e.g. E-mail, report, memo) Identification of digital components Identification of attachments (e.g. digital signature) Archival bond (e.g. classification code)

9 InterPARES - Diplomatics Integrity - maintain links to metadata for Name(s) of the handling office / officer Name of office of primary responsibility for keeping the record Annotations or comments Actions carried out on the record Technical modifications due to transformative migration Validation

10 Preservation Approach Provide mechanisms to: Create archival context for the content Context is preservation metadata (provenance, administrative, descriptive, structural, behavioral) Content is the submitted digital entity Assert integrity - the consistency between the context and the content Track operations done on material and update context Assert authenticity - that the material represents the original site Track the chain of custody Manage technology evolution (encoding standard, storage repository, information repository, access methods)

11 Data Grids What is the difference between a preservation environment and a data grid?

12 Data Grids Manage shared collections that are distributed in space Location of item, access controls, checksums Implement infrastructure independence Standard operations for interacting with storage repositories Implement presentation independence Standard APIs to support porting of user interfaces

13 Preservation Environment Digital library infrastructure that supports Preservation metadata Arrangement and description of items Access mechanisms Data grid infrastructure that supports Shared collections that are migrated forward in time Management of technology evolution Administrative metadata providing status of records

14 Infrastructure Independence Storage Repository Storage location User name File name File context (creation date,…) Access constraints Data Access Methods (Web Browser, DSpace, OAI-PMH) Naming conventions provided by storage systems

15 Data Grids Provide a Level of Indirection for Each Naming Convention Storage Repository Storage location User name File name File context (creation date,…) Access constraints Data Grid Logical resource name space Logical user name space Logical file name space Logical context (metadata) Control/consistency constraints Data Collection Data Access Methods (C library, Unix, Web Browser) Data is organized as a shared collection

16 Demonstration Logical file name space Distinguished user name space Shared collection Distributed data storage Replication as a file property Digital entities

17 Data Grids Provide two levels of indirection: Low level API used to interact with storage repositories Standard operations for manipulating files in a storage system Standard operations for manipulating a catalog stored in a database High level API used to support user interfaces Three basic APIs - “C” library call, Unix shell commands, Java class library Other are interfaces ported on top of the basic APIs.

18 Unix Shell NT Browser, Kepler Actors OAI, WSDL, (WSRF) HTTP, DSpace, OpenDAP, GridFTP Archives - Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix File Systems Unix, NT, Mac OSX Application ORB Storage Repository Abstraction Database Abstraction Databases - DB2, Oracle, Sybase, Postgres, mySQL, Informix C Library, Java Logical Name Space Latency Management Data Transport Metadata Transport Consistency & Metadata Management / Authorization, Authentication, Audit Linux I/O C++ DLL / Python, Perl, Windows Federation Management Storage Resource Broker 3.3

19 Accessing Multiple Types of Storage Systems User Application Archive at SDSC Archive at NARA Archive at U Md

20 Standard Data Access Operations Common set of operations for interacting with every type of storage repository User Application Remote operations Unix file system Latency management Procedures Transformations Third party transfer Filtering Queries Collective operations Replication Fault tolerance Load leveling Archive at SDSC Archive at NARA Archive at U Md

21 Accessing Data at Multiple Sites User Application Each site has their own naming convention for files A data grid provides a uniform way to name and access the files across the sites Archive at SDSC Archive at NARA Archive at U Md

22 Building a Distributed Collection Archive at SDSC Data Grid Common naming convention and set of attributes for describing digital entities User Application Logical name space Location independent identifier Persistent identifier Collection owned data Authenticity metadata Access controls Audit trails Checksums Descriptive metadata Inter-realm authentication Single sign-on system Archive at NARA Archive at U Md

23 SRB server SRB agent SRB server Federated Server Architecture MCAT Read Application SRB agent 1 2 3 4 6 5 Logical Name Or Attribute Condition 1.Logical-to-Physical mapping 2.Identification of Replicas 3.Access & Audit Control Peer-to-peer Brokering Server(s) Spawning Data Access Parallel Data Access R1 R2 5/6

24 Managing Access Authenticate users independently of storage systems Preservation environment owns the data Authorize data access independently of storage system ACLs on both data and metadata Maintain audit trails of all accesses Both read and write

25 Collection-owned Data Store data at remote storage system under data-grid ID Access data through data grid servers Track all operations on data and update state information User authenticates to a data grid server Access controls are checked for permissions Data grid servers authenticate messages from other servers Remote server authenticates to remote storage system Multiple authentication mechanisms GSI / challenge-response / tickets

26 Provide Context for Data Properties of files Provenance - source Descriptive attributes Structure Organize properties as metadata in a collection hierarchy Define operations on file properties Manage state information - location, replicas, containers Separate context management from content management Maintain consistency of context as operations are done on content

27 Database Operations Standard interface to support Schema extension - user defined attributes Snowflake table creation SQL generation Import and export of XML files Bulk metadata load and unload Operations required to manage a catalog that resides in a database

28 National Archives and Records Administration - Research Prototype Persistent Archive NARAU MdSDSC MCAT Principle copy stored at NARA with complete metadata catalog Replicated copy at U Md for improved access, load balancing and disaster recovery Deep Archive at SDSC, no user access, but complete copy Demonstrate preservation environment Authenticity Integrity Management of technology evolution Mitigation of risk of data loss Replication of data Federation of catalogs Management of preservation metadata Scalability EAP collection 350,000 files 1.2 TBs in size Federation of Three Independent Data Grids

29 Preservation Requirements Maintain authenticity and integrity of electronic records Authenticity - assertion of provenance of data Integrity - assertion of invariance of bits Manage risk of data loss Media corruption / System failures / Operational errors / Natural disaster / Malicious users Manage technology obsolescence Support migration of collection to new systems Bulk data operations

30 Replication How many replicas are enough?

31 Federation Data Grid Logical resource name space Logical user name space Logical file name space Logical context (metadata) Control/consistency constraints Data Collection B Data Access Methods (Web Browser, DSpace, OAI-PMH) Data Grid Logical resource name space Logical user name space Logical file name space Logical context (metadata) Control/consistency constraints Data Collection A Access controls and consistency constraints on cross registration of digital entities

32 Data Grid Zones Choose how name spaces will be shared Cross register storage resources May the other data grid write to my storage? Cross register user names Users are authenticated by their home zone Cross register files Can replicate files into another data grid Cross register metadata Can build a copy of the metadata catalog

33 Replicated Catalog Deep Archive Partial User-ID Sharing Partial Resource Sharing No Metadata Synch Hierarchical Zone Organization One Shared User-ID System Managed Replication Connection From Any Zone Complete Resource Sharing System Set Access Controls System Controlled Complete Synch Complete User-ID Sharing System Managed Replication System Set Access Controls System Controlled Partial Synch No Resource Sharing Super Administrator Zone Control System Controlled Complete Synch No User-ID Sharing Peer-to-Peer Data Grids Replication Data Grids Hierarchical Data Grids Occasional Interchange Free Floating Resource Interaction User and Data Replica Nomadic Snow Flake Master Slave Replicated Data Federation Environments Replication Constraints Consistency Constraints Access Constraints

34 Examples of Extensibility Storage Repository Driver evolution Initially supported Unix file system Added archival access - UniTree, HPSS Added FTP/HTTP Added database blob access Added database table interface Added Windows file system Added project archives - Dcache, Castor, ADS Added Object Ring Buffer, Datascope Adding GridFTP version 3.3 Database management evolution Postgres DB2 Oracle Informix Sybase mySQL (most difficult port - no locks, no views, limited SQL)

35 Examples of Extensibility The 3 fundamental APIs are C library, shell commands, Java Other access mechanisms are ported on top of these interfaces API evolution Initial access through C library, Unix shell command Added inQ Windows browser (C++ library) Added mySRB Web browser (C library and shell commands) Added Java (Jargon) Added Perl/Python load libraries (shell command) Added WSDL (Java) Added OAI-PMH, OpenDAP, DSpace digital library (Java) Added Kepler actors for dataflow access (Java) Adding GridFTP version 3.3 (C library )

36

37 Sites Using the SRB

38 Research Areas Characterization of data / information / knowledge Preservation architecture Knowledge management - dynamic application of preservation policies Persistent object - characterization of digital entities

39 Characterizing Knowledge Data - bits that comprise a digital entity Information - a semantic label that is applied to data Knowledge - relationships between semantic labels Metadata - the combination of the semantic label and the data The creation of a semantic label is driven by the application of a process / relationship Information is the result of applying knowledge relationships Information is the reification of knowledge

40 Knowledge Management Reify relationships to improve access performance Easier to query on metadata than to apply the original relationships Manage state information about the reification process - support for relationship changes Support levels of granularity for application of relationships - collective properties versus procedural properties Goal is to build a scalable knowledge management system

41 Preservation Strategies Emulation Migrate the display application onto new operating systems Equivalent to forcing use of candlelight to look at 16th century documents Transformative migration Migrate the encoding format to the new standard Migration period is expected to be 5-10 years Persistent object Characterize the encoding format Migrate the characterization forward in time

42 Persistent Objects 1980 1990 200020102020 Display Applications 1980 19902000 20102020 Digital Entities Characterize standard manipulation operations Characterize encoding format - data structure

43 Preservation Standards OAIS - Open Archival Information System Submission Information Package (SIP) Archival Information Package (AIP) Dissemination Information Package (DIP) Producer Archive Interface Abstract Methodology Standard (CCSDS Document 651.0-R-1)

44 Containers SRB provides support for aggregation of files into a container AIP is the aggregation of both preservation context and the records into a container What is the appropriate form for a self- describing container?

45 Self-instantiating Archive Preservation of Digital Data with Self- Validating, Self-Instantiating Knowledge- Based Archives, B. Ludäscher, R. Marciano, R. Moore, SIGMOD Record, ACM, 30(3), pp. 54-63, 2001. An archives consists of the application of archival processes to create the collection managed in the preservation environment Instantiation corresponds to the application of the archival processes to the original data

46

47 Example Web Crawl National Science Digital Library maintains registry of URLs for education material at Cornell Crawl sites Recursion to a depth of 10 redirections Restriction to pages within initial site plus one level outside site Store material on processing platform 70,000 URLs - 2 million digital entities, 200 GB On average 30 files per URL, Each file with average size 100 kBytes

48 Collection Requirements Provide containers for managing small files 26 million files, average size 100 kB Aggregate data in containers before storage Support web-based access to archived data Redirect web page internal HTTP links to data grid handles Support integrity Manage checksums on files

49 Accessioning Web Sites Use OAI harvesting to extract URLs from the NSDL repository Crawl each URL and process each digital entity Replace internal URLs with data grid logical names Aggregate digital entities into containers (files) for storage. Archives store files that are 40 MBytes in size. Generate archival context Register digital entity into a data grid Use collection hierarchy to associate web crawl properties with each file (date, site, initial URL, …) Write processed files into a storage system managed by a data grid Replicate data on Grid Bricks and archival storage system Provide OAI interface for reporting validation results

50 Persistent Archive Collections Build collections based on date crawled For each collection, use separate folder to hold digital entities associated with the original URL Typically 30 digital entities per URL Aggregate digital entities into containers before storage Preservation metadata maintained for each digital entity Administrative, descriptive, structural, behavioral

51 A Few Statistics on NSDL Content SDSC Crawl (April 03, 4 Links Deep) received correctly no data received see other forbidden file not found internal server error application error service temp. overloaded WIMS User Error Gone unused redirection w/out location total digital entities error percentage —1,530,206 —51 —5 —311 —38,386 —946 —15 —8 —1 —1,569,932 —2.53%

52 Encoding Formats Present in Archive CSS - Cascading Style Sheet ASP - Microsoft Active Server Page

53 Automated Processes: Categorizing the “Space” of all Descriptive Patterns Data-driven validation of descriptive metadata from NARA Archival Information Locator records Exhaustive examination of every metadata occurrence Automatic creation of an open-source relational database implementation Accumulation of all descriptive patterns Based on deriving “Descriptive Signatures” relying on regular expressions Creation of a Perl-based Validation Regular Expression Tool Refined regular expression to identify anomalies in the legacy metadata Annotated artifacts introduced by archival processes

54 A String Analysis Approach Accumulate all occurrence strings at each level of description in the hierarchy, and derive a regular expression that characterizes all instances: Record Group OR Collection (total of ~550) Series File Unit Item OR ItemAV (audio-visual) ___________________________________________ Physical Occurrence Media Occurrence Object

55 A String Analysis Approach Example - structural characterization: At the Series level, possible patterns are (S=Series, I=Item, O=Object, F=FileUnit): SIOSIOOOOOO SIO SFFFF SIIIIII SIOIOOOOO SFIIII An inferred regular expression is: S( F*(I+O+)* | I+ )* Relational tables are derived from these regular expressions for each of the 9 levels

56 Metadata Validation Analyze each regular expression to identify the classes of anomalies Cases in which a subset of the objects have a unique characterization different from the majority of the objects Identify cases with incorrect metadata tags Identify cases with missing metadata or missing objects Identify changes in metadata definitions

57 Regular Expressions COLLECTION (2 characterizations): *********** ="TiMtldColid(XcXs)*(Date)?(Ab)?(Tcsd(Tcsdq)?Tced(Tced q)?)?(Tisd(Tisdq)?Tied(Tiedq)?)?" = "(Odonor)?(Pdonor)*(Daut(Ndad)?)?(FatFan)?Dcgsd" RECORD GROUP (2 characterizations): ************* ="TiMtldGrno(Date)?(Tcsd)?(Tcsdq)?(Tced)?(Tcedq)?Tisd(T isdq)?Tied(Tiedq)?" = "(FatFan)*Dcgsd"

58 Regular Expressions SERIES (5 characterizations): ******* = "Ti(Altti)*MtldS(Grno)?(Formerrg)*(Colid)?" ="(Acnum)*(Arra)?(Chn)?(Date)?(Funcu)?(Gen)*(Numb)?(Sc ale)?(Ab)?(Tran)?(Itn)?(Staff)?(Rctno)*(Dano)*(XcXs)+" ="((Tcsd)?(Tcsdq)?Tced(Tcedq)?)?(Tisd(Tisdq)?Tied(Tiedq) ?)?" ="(Grt)*(Srt)*(OcontrOcontrtp)*(Orefer)*(Tgn)*(Lan)*(PcontrP contrtp)*(Prefer)*(Subj)*((Ars)?(Sar)*(Arsn)?)" ="(Urrs)?(Surr)?(Urrn)?((Daut)?(Ndad)?)*(Fat(Fan)?)*(MpiMp t(Mpn)?)*(Taed)?(Tst)?(CrorgCrorgtp)?(CrindCrindtp)*(Dcgs d)" --> SarSar: Decision to combine them

59 Regular Expressions ITEM (4 characterizations): ***** ="Ti(Altti)*Mtld(Grno)?(Formerrg)?(Colid)?(Acnum )?(Arra)?(Date)?(Gen)*(Ab)?(Staff)?(XcXs)+" = "((Tcsd)*(Tcsdq)?(Tced)*(Tcedq)?)?" ="(Tpd(Tpdq)?)*(Grt)+(Srt)*(OcontrOcontrtp)*(Oref er)*(Tgn)*(PcontrPcontrtp)*(Prefer)*(Subj)*(Ars)?(S ar)*(Arsn)?" ="Urrs(Surr)?(Urrn)?((Daut)?Ndad)*(MpiMpt(Mpn) ?)*(CrorgCrorgtp)?Dcgsd" --> SarSar: Decision to combine them --> Tcsd/Tced:

60 Regular Expression FILE UNIT (4 characterizations): ********** ="Ti(Altti)?Mtld(Grno)?(Formerrg)?(Colid)?(Acnum)?(A rra)?(Gen)*(Ab)?(XcXs)+" ="((Tcsd)*(Tcsdq)?(Tced)*(Tcedq)?)?(Tisd(Tisdq)?Tied( Tiedq)?)?" ="(Grt)+(Srt)*(OcontrOcontrtp)?(Orefer)*(Tgn)*(PcontrP contrtp)?(Prefer)*(Subj)*(Ars)?(Sar)*(Arsn)?" ="Urrs(Surr)?(Urrn)?((Daut)?Ndad)?(MpiMpt(Mpn)?)?(C rorgCrorgtp)?Dcgsd" --> SarSar: Decision to combine them --> Tcsd/Tced:

61 Lessons Learned Data-driven analysis of actual preservation metadata can be used to implement a new catalog on new technology Variant of self-instantiating archive, in which the preservation structure and catalogs are re-created

62 Preservation Archival processes through which a digital entity is extracted from its creation environment and migrated to a preservation environment, while maintaining authenticity and integrity information. Extraction process requires insertion of support infrastructure underneath the digital material, characterization of the authenticity and integrity, characterization of the digital encoding format, and characterization of the display operations Goal is infrastructure independence, the ability to use any commercial storage system, database, or access mechanism

63 For More Information Reagan W. Moore San Diego Supercomputer Center moore@sdsc.edu http://www.sdsc.edu/srb/


Download ppt "Building Preservation Environments Reagan W. Moore San Diego Supercomputer Center Storage Resource Broker."

Similar presentations


Ads by Google