Planning Digital Projects Smithsonian Archivists Oct 29, 2001 Howard Besser UCLA School of Education & Information http://www.gseis.ucla.edu/~howard
Planning Digital Projects- Interoperability Importance of Standards Best Practices for Managing Digital Projects Implications of Digital Projects Longevity From Digital Collections to Digital Libraries & Museums
Key problems we’re facing Discovery Interoperability- Longevity-
Traditional Digital Collection Model DL DL DL DL search & presentation search & presentation search & presentation search & presentation user user
Ideal Digital Collection Model DL DL DL DL search & presentation user user
For Interoperability, Digital Collections Need Standards Descriptive Metadata for consistent description Discovery Metadata for finding Administrative Metadata for viewing and maintaining Structural Metadata for navigation ... Terms & Conditions Metadata for controlling access...
Metadata is not just indexing terms CBIR attributes used for retrieval on color, shape, texture, etc. Structural attributes used for page-turning Administrative attributes used for managing a digital work over time IPR attributes to limit unauthorized use Identification attributes to determine what application software is needed to view a particular digital work Can be located anywhere
Why are Standards and Metadata consensus important? Managing digital files over time Longevity Interoperability Veracity Recording in a consistent manner Will give vendors incentive to create applications that support this
Why Standards? Why do we need standards? To make information universally available to users To facilitate sharing and interchange of information To preserve information (make it safe from changes in hardware and software) Standards only work if communities widely accept them, but they’re necessary for communities to work together
Metadata for Digital Asset Management- Importance of Metadata Standards Types and Uses of Metadata Discovery Metadata: The Dublin Core Administrative and Structural Metadata: The Making of America II Project Longevity Metadata Identification/Provenance
What is Metadata Structured data describing other data used to find or help manage information resources Aids in interoperability Titles, dates, captions, cataloging and indexing data, file headers, rights info, provenance, code books, transaction logs, ... One person’s metadata is another’s data
Sorting through the Standards Morass Data Structures (DC, CDWA, MARC, VRA Core, TEI, EAD, MESL data dict) Data Interchange (Z39.50) Data Values/vocabularies (LCSH, AAT, ULAN, TGN) Data Content/syntax (AACR2)
Semantics/Syntax/Structure Semantics: meaning, as defined by a community to meet their particular needs (DC) Syntax: a systematic arrangement of data elements for machine processing; facilitates the exchange and use of metadata among various applications (HTML, XML, RDF) Structure: a formal arrangement of the syntax with the goal of consistent representation of the semantics (rules defining field contents like 1/11/99)
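The "1/11/99" field value on this slide can be made concrete. The following is a minimal sketch, assuming two hypothetical community rules (month-first and day-first) that are not named in the slides; it shows the same string yielding two different dates unless a content rule is agreed on. The ISO 8601 output is only one possible unambiguous form.

```python
from datetime import datetime

RAW_VALUE = "1/11/99"  # field content exactly as it appears on the slide

# Two hypothetical community rules for reading the same string.
CONVENTIONS = {
    "month/day/year": "%m/%d/%y",
    "day/month/year": "%d/%m/%y",
}


def interpret(raw: str, rule: str) -> str:
    """Parse raw under the named rule and return an ISO 8601 date string."""
    return datetime.strptime(raw, CONVENTIONS[rule]).date().isoformat()


if __name__ == "__main__":
    for rule in CONVENTIONS:
        print(rule, "->", interpret(RAW_VALUE, rule))
```

Without a shared rule for field contents, two institutions exchanging this record would disagree about the date by almost ten months.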
What is Metadata Types & Uses lots of different ways of dividing the clusters
Uses of Metadata Discovery & Retrieval Identification/Provenance Rights Management Viewing Integrity Longevity Content rating
Some different schemes where Metadata is kept embedded within the object (HTML tags) in a separate related DB maintained by same organization (OPAC, MOA II) in a separate DB maintained by a separate organization (Books in Print, ratings systems) derived on-the-fly from a different scheme (MARC-to-DC)
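The first scheme, metadata embedded within the object as HTML tags, can be illustrated with a short sketch. It assumes a page that carries name/content pairs in meta tags; the sample page and its values are invented for illustration, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class MetaTagCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            # An element may repeat, so keep a list of values per name.
            self.metadata.setdefault(name, []).append(content)


def embedded_metadata(html_text):
    """Return the metadata embedded in html_text as {name: [values]}."""
    collector = MetaTagCollector()
    collector.feed(html_text)
    return collector.metadata


# Invented sample page; the element names are illustrative only.
SAMPLE_PAGE = """<html><head>
<meta name="DC.title" content="Sample finding aid">
<meta name="DC.creator" content="Example Archive">
</head><body>...</body></html>"""

if __name__ == "__main__":
    print(embedded_metadata(SAMPLE_PAGE))
```

Embedded metadata travels with the object, which is its main advantage over the separate-database schemes listed on the slide.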
Dublin Core Discovery Metadata- Containers & Packages Qualifiers Crosswalks
Containers and Packages of Metadata Warwick, not MARC modular overlapping extensible community-based designed for a networked world to aid commonality btwn communities while still providing full functionality within each community
DC Qualifiers allow one community to express important nuances and qualifications, while still making the basic meaning available to communities with simple needs our community can reflect alternate title, transliterated title, and main title, yet they will all be found under a simple Web search under “title”
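The idea that qualified titles are still found by a simple search on "title" can be sketched as follows. The qualifier names (title.main, title.alternative, title.transliterated), the dotted notation, and the sample values are assumptions made for illustration, not an official qualifier list.

```python
from collections import defaultdict


def dumb_down(qualified_record):
    """Collapse qualified elements such as 'title.alternative' to 'title'."""
    simple = defaultdict(list)
    for element, values in qualified_record.items():
        base = element.split(".", 1)[0]  # drop the qualifier, keep the element
        simple[base].extend(values)
    return dict(simple)


def matches(qualified_record, element, term):
    """True if term occurs in any value of the unqualified element."""
    values = dumb_down(qualified_record).get(element, [])
    return any(term.lower() in value.lower() for value in values)


# Hypothetical record with three qualified forms of the title.
RECORD = {
    "title.main": ["War and Peace"],
    "title.transliterated": ["Voina i mir"],
    "title.alternative": ["War & Peace"],
    "creator": ["Tolstoy, Leo"],
}

if __name__ == "__main__":
    print(dumb_down(RECORD))
    print(matches(RECORD, "title", "voina"))
```

A community that needs the distinctions keeps the qualified record; a simple search service only ever sees the collapsed form.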
Crosswalks mapping btwn differing metadata structures eliminate the need for monolithic, universally adopted standards focus on flexibility and interoperability RDF-based metadata registries
Crosswalk Example
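A crosswalk can be sketched as a lookup table between two metadata structures. The mapping below is a small, simplified subset of a MARC-to-Dublin-Core crosswalk; flattening field and subfield into keys such as "260b" is a convenience assumed for this sketch, and the sample record is invented.

```python
# Simplified subset of a MARC-to-DC mapping (field or field+subfield -> element).
MARC_TO_DC = {
    "245": "title",
    "100": "creator",
    "650": "subject",
    "260b": "publisher",
    "260c": "date",
    "520": "description",
}


def crosswalk(marc_record):
    """Map {marc_field: [values]} to {dc_element: [values]}."""
    dc_record = {}
    for field, values in marc_record.items():
        target = MARC_TO_DC.get(field)
        if target is None:
            continue  # unmapped fields are dropped, so the mapping is lossy
        dc_record.setdefault(target, []).extend(values)
    return dc_record


# Invented sample record; field 300 has no mapping in the table above.
SAMPLE_MARC = {
    "245": ["Sample title"],
    "100": ["Doe, Jane"],
    "650": ["Photography"],
    "300": ["12 p."],
}

if __name__ == "__main__":
    print(crosswalk(SAMPLE_MARC))
```

The dropped field illustrates the usual cost of a crosswalk: going from a rich structure to a simple one loses detail, so the source record should be kept.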
Other types of Metadata Administrative Structural Longevity Identification
Important Planning Considerations File Formats Choosing Interoperable Systems Adhere to standards Vendors with large installed base Refreshing and/or Migration
Key Considerations for Imaging Projects Image Quality Archival Current online delivery Intellectual Property Standards Modular and Layered Architecture Terminology Technical imaging information
Best Practices for Managing Digital Projects- Who will your users be? Best Practices Guidelines Workflow and Management Issues
Why are you Managing this Information? Organizational mission & type Users Uses
Scanning Best Practices
-Think about users (and potential users), uses, and type of material/collection
-Scan at the highest quality that the likely users, uses, and material can actually benefit from, and no higher
-Do not let today’s delivery limitations influence your scanning file sizes; understand the difference between digital masters and derivative files used for delivery
-Many documents which appear to be bitonal are actually better represented with greyscale scans
-Include color bar and ruler in the scan
-Use objective measurements to determine scanner settings (do NOT attempt to make the image look good on your particular monitor or use image processing to color correct)
-Don’t use lossy compression
-Store in a common (standardized) file format
-Capture as much metadata as is reasonably possible (including metadata about the scanning process itself)
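The difference between a digital master and a delivery derivative becomes clearer with some arithmetic. This sketch computes uncompressed image sizes; the 8 x 10 inch original, the 600 and 72 pixels-per-inch settings, and the bit depths are illustrative assumptions, not recommendations from these slides.

```python
def uncompressed_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Approximate uncompressed size of a scan, ignoring headers and row padding."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumed 8 x 10 inch original, scanned three ways and derived once.
    cases = {
        "master, 24-bit color, 600 ppi": (8, 10, 600, 24),
        "master, 8-bit greyscale, 600 ppi": (8, 10, 600, 8),
        "master, bitonal, 600 ppi": (8, 10, 600, 1),
        "delivery derivative, 24-bit, 72 ppi": (8, 10, 72, 24),
    }
    for label, args in cases.items():
        print(f"{label}: {uncompressed_bytes(*args) / 1_000_000:.1f} MB")
```

Under these assumptions the color master is roughly seventy times the size of the screen derivative, which is why delivery limits should shape the derivative and not the master.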
Why Scale is important
Digital Object Behaviors Book example
Metadata Standards (from MOA2, now METS) Administrative Metadata for enhancing resource management Structural Metadata for reflecting internal hierarchies and relationships btwn parts Raw/Seared/Cooked
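Structural metadata that reflects internal hierarchies is what makes a behavior like page-turning possible. The sketch below uses a plain nested dictionary as a stand-in; it is not the MOA2 or METS schema, and the labels and file names are hypothetical.

```python
# Hypothetical structural description of a digitized book.
BOOK = {
    "label": "Sample book",
    "children": [
        {"label": "Front matter", "pages": ["0001.tif", "0002.tif"]},
        {"label": "Chapter 1", "pages": ["0003.tif", "0004.tif", "0005.tif"]},
    ],
}


def reading_order(node):
    """Yield (section label, page file) pairs in page-turning order."""
    for page in node.get("pages", []):
        yield node["label"], page
    for child in node.get("children", []):
        yield from reading_order(child)


def next_page(node, current):
    """Return the page file after current, or None at the end of the book."""
    pages = [page for _, page in reading_order(node)]
    position = pages.index(current)
    return pages[position + 1] if position + 1 < len(pages) else None


if __name__ == "__main__":
    for section, page in reading_order(BOOK):
        print(section, page)
    print("after 0002.tif comes", next_page(BOOK, "0002.tif"))
```

Without the hierarchy, the five image files are just five files; with it, a viewer can turn pages and jump to a chapter.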
More general issues of Digital Projects- Workflow and Management Issues Implications for the Collection Implications for the Institution Implications for Scholarship & Interoperability Digital libraries Metadata Longevity Issues
Workflow and Management Issues- Managing multiple image files Persistent Identification Making your works accessible throughout the Net
Workflow and Tracking Procedures Need careful planning Procedures for managing many different files at many different stages Linking of file versions
The number of variant forms of a work can be enormous different views of the same object different scans of the same photo different resolutions different compression schemes different compression ratios different file storage formats different details of the same image ...
Image Families
Identification/Provenance how to deal with different versions (browse, hi-res, medium res) derived from the same scan or different encoding schemes (TIFF, PICT, JFIF) Vocabulary Standards to express this VRA Surrogate Categories CIMI's “Image Elements”
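One way to keep track of versions derived from the same scan is to record, for each file, which file it was derived from. This is a minimal sketch with hypothetical field names and identifiers; it does not implement the VRA or CIMI vocabularies named on the slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageFile:
    file_id: str
    role: str                            # e.g. "master", "medium", "browse"
    encoding: str                        # e.g. "TIFF", "JFIF"
    derived_from: Optional[str] = None   # parent file_id; None for a master scan


def lineage(files, file_id):
    """Return the chain of file_ids from file_id back to its master scan."""
    by_id = {f.file_id: f for f in files}
    chain = []
    current = by_id[file_id]
    while current is not None:
        chain.append(current.file_id)
        current = by_id[current.derived_from] if current.derived_from else None
    return chain


# Hypothetical image family for one photograph.
FAMILY = [
    ImageFile("photo42-master", "master", "TIFF"),
    ImageFile("photo42-medium", "medium", "JFIF", derived_from="photo42-master"),
    ImageFile("photo42-browse", "browse", "JFIF", derived_from="photo42-medium"),
]

if __name__ == "__main__":
    print(lineage(FAMILY, "photo42-browse"))
```

Recording the parent of every derivative answers the provenance question directly: any browse image can be traced back to the scan it came from.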
Persistent IDs--the Problem Need to separate work ID from work location URNs probably won’t be ready until 2003 Becomes a business process issue when one organization maintains the resource and another organization references it (i.e., licensed from vendors or managed by separate administrative structures)
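Separating a work's ID from its location amounts to adding a layer of indirection. The following sketch uses a hypothetical in-memory resolver with invented identifiers and URLs; a real service would need the business processes mentioned on the slide to keep the table current.

```python
class Resolver:
    """Map stable work IDs to whatever the current location happens to be."""

    def __init__(self):
        self._locations = {}

    def register(self, work_id, url):
        """Record or update the current location of a work."""
        self._locations[work_id] = url

    def resolve(self, work_id):
        """Return the current location; raises KeyError for unknown IDs."""
        return self._locations[work_id]


if __name__ == "__main__":
    resolver = Resolver()
    work_id = "urn:example:photo:42"  # hypothetical identifier
    resolver.register(work_id, "http://old-server.example.org/img/42.jpg")
    # The file moves; citations that use the work ID do not change.
    resolver.register(work_id, "http://new-server.example.org/photos/42.jpg")
    print(resolver.resolve(work_id))
```

Anyone who cited the work ID still reaches the file after the move, while anyone who cited the old URL gets a broken link.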
Making your works accessible throughout the Net Open Archives and Metadata Harvesting (DLF/Mellon) An administrative and political issue as much as a technical one
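On the technical side, metadata harvesting comes down to a harvester sending simple HTTP requests to a repository. This sketch only builds such a request URL and performs no network access. The parameter names follow the Open Archives Initiative Protocol for Metadata Harvesting (version 2.0) as commonly documented; the base URL and set name are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build a ListRecords request URL for a repository's harvesting endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional: restrict the harvest to one set
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical repository endpoint.
    print(list_records_url("http://archive.example.org/oai"))
    print(list_records_url("http://archive.example.org/oai", set_spec="photos"))
```

The request itself is trivial; deciding which records to expose, and to whom, is the administrative and political part.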
Implications for the Collection We’re already familiar with Reformatting Advantages & Disadvantages of Digitization- Protection- Unauthorized Use-
Broad Advantages & Disadvantages of Digitization Advantages good PR show off collection let people see items without having to needlessly pull them Disadvantages Can look like Edutainment Can commodify the works and make the repository look like it sold prestige to the highest bidder Authenticity called into question Decontextualization Representational problems-
Problems with How Works are Represented once a digital work is on the WWW, anyone can physically copy it and use it as they see fit often items are seen outside their context for images: using the normal method of mounting images on the WWW, the credit line often becomes separated from the image
Don’t advocate strong copyright or protection for the wrong reasons the people who find your content valuable are mostly your traditional audiences barriers before use will inhibit positive uses of your material threat of pursuit after misuse can effectively deter commercial misuse you can prevent commercial misuse without strong protection or copyright
Instead of fearing lost income, worry about Unauthorized Use Elimination of credit or attribution line (particularly for images) Someone else implying ownership Maintaining the Integrity of the Work
Protection Methods Encrypting or encapsulating the digital file Marking the digital image with ownership (visible or not) http://is.gseis.ucla.edu/impact/f95/Papers-projects/Projects/Trowbridge/labels.html Image quality Onscreen quality is far lower than printed quality http://is.gseis.ucla.edu/impact/f95/Papers-projects/Projects/Trowbridge/resolution.html
Effect on the Institution Creating/Maintaining a WWW site Wear and tear on the original Handling external requests for Special Collection material Increase or decrease in requests to see originals? New Audiences Implications on the Institution’s Public Image
Scholarship and Interoperability Why Digital Libraries need standards for interoperability Metadata concerns Digitization means New Audiences-
Digitization means New Audiences more access for more people outreach to new groups but new groups have different usability requirements different user interfaces different vocabulary new methods of navigation we already have enough differences btwn different institution types (& even within the same type) MESL results Organization & indexing reflects the biases of the original intent when records were formed
Serious Longevity Problems What we know from prior widespread digital file formats Images separating from their metadata Inaccessibility of software needed to view an image Inability to even decode the file format of an image
The Short Life of Digital Info: Digital Longevity Problems- Disappearing Information The Viewing Problem The Scrambling Problem The Inter-relation Problem The Custodial Problem The Translation Problem
The Viewing Problem Digital Info requires a whole infrastructure to view it Each piece of that infrastructure is changing at an incredibly rapid rate How can we ever hope to deal with all the permutations and combinations
The Scrambling Problem Dangers from: Compression to ease storage & delivery Container Architecture to enhance digital commerce
The Inter-relation Problem -Info is increasingly inter-related to other info -How do we make our own Info persist when it points to and integrates with Info owned by others? -What is the boundary of a set of information (or even of a digital object)?
The Custodial Problem How do we decide what to save? Who should save it? How should they save it? -methods for later access: emulation, migration, etc. -issues of authenticity and evidence
The Translation Problem Content translated into new delivery devices changes meaning -A photo vs. a painting -If Info is produced originally in digital form in one encoded format, will it be the same when translated into another format? Behaviors
Pieces of the Solution (1/2) -We need to insist upon clearly readable standardized ways for digital objects to self-identify their formats -We should discourage scrambling -We need to better understand how information inter-relates to other Info, and what constitutes “boundaries” of Info objects
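Some formats already self-identify through a fixed byte signature at the start of the file. The sketch below checks a handful of well-known image signatures; the table is deliberately small and illustrative, not a complete format registry.

```python
# Leading byte signatures of a few common image formats.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def identify(header: bytes) -> str:
    """Name the format whose signature matches the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path) -> str:
    """Read only the first few bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify(handle.read(16))


if __name__ == "__main__":
    print(identify(b"II*\x00\x08\x00\x00\x00"))
    print(identify(b"no signature here"))
```

A file that falls through to "unknown" is exactly the case the slide warns about: the bits survive, but nothing says how to read them.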
Pieces of the Solution (2/2) -People and organizations wishing to make information persist need guidelines of how to go about doing it -We need to better understand how translating from one storage or display format to another affects the meaning of a work -We need to save the “behaviors” of a digital object, not just its “contents”
Conceptual Approaches to Digital Preservation Refreshing always necessary due to volatility of physical strata Impact on evidential value Migration -- advantages & disadvantages Emulation -- advantages & disadvantages
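Refreshing copies the same bits onto new media, so it can be checked mechanically. This is a minimal sketch, assuming a SHA-256 checksum as the test of an unchanged copy; the choice of hash and the file names are assumptions, not something the slides prescribe.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source, destination):
    """Copy source to destination and confirm that the bits are unchanged."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh of {source} failed verification")
    return after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        old_copy = Path(workdir) / "old_medium.tif"
        new_copy = Path(workdir) / "new_medium.tif"
        old_copy.write_bytes(b"stand-in for image data")
        print("verified digest:", refresh(old_copy, new_copy))
```

Migration cannot be verified this way, because the bits change by design; that is one reason its impact on evidential value is harder to assess.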
Migration/Refreshing Impact on evidential value
Pragmatic Issues Save Metadata!!! (Descriptive, Administrative, Structural, …) Separate master from delivery Consistent file-naming conventions Good work-flow Develop cooperative long-range plans
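A consistent file-naming convention is easiest to keep when it is enforced by code. The pattern below is a hypothetical convention (collection code, five-digit item number, role, extension) invented for this sketch; it also keeps masters and delivery files distinguishable by name.

```python
import re

# Hypothetical convention: <collection>_<5-digit item>_<role>.<ext>
NAME_PATTERN = re.compile(
    r"^(?P<collection>[a-z0-9]+)_(?P<item>\d{5})_"
    r"(?P<role>master|access|thumb)\.(?P<ext>tif|jpg)$"
)


def build_name(collection, item_number, role, ext):
    """Build a file name and reject anything outside the convention."""
    name = f"{collection}_{item_number:05d}_{role}.{ext}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def parse_name(name):
    """Split a conforming file name back into its parts."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"name does not follow the convention: {name}")
    return match.groupdict()


if __name__ == "__main__":
    print(build_name("sia", 42, "master", "tif"))
    print(parse_name("sia_00042_access.jpg"))
```

Because every name can be parsed back into its parts, a master and its delivery derivatives can be matched up even if the database that linked them is lost.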
Metadata can be the first line of defense Can tell you where the file is (if you can’t find the file) where more info about the file is (if you have the file but most other metadata has become separated) what the file format is what the compression scheme is what application program and version is needed for the file
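The items on this slide map naturally onto a small record stored alongside each file. This sketch assumes a JSON "sidecar" as the storage form; the field names and sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FileRecord:
    file_location: str        # where the file is
    more_info_location: str   # where more information about the file is
    file_format: str          # what the file format is
    compression: str          # what the compression scheme is
    viewer_application: str   # what application program is needed
    viewer_version: str       # which version of that program


def to_sidecar(record):
    """Serialize a record to JSON text that can be stored next to the file."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


def from_sidecar(text):
    """Rebuild a record from its JSON sidecar text."""
    return FileRecord(**json.loads(text))


# Invented sample values.
SAMPLE = FileRecord(
    file_location="masters/sia_00042_master.tif",
    more_info_location="catalog/records/42",
    file_format="TIFF 6.0",
    compression="LZW",
    viewer_application="ExampleViewer",
    viewer_version="1.0",
)

if __name__ == "__main__":
    print(to_sidecar(SAMPLE))
```

A plain-text record like this stays readable even when the application that created the image file no longer runs.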
Groups Working on the Big Longevity Problem http://sunsite. Berkeley CPA Task Force Getty “Time & Bits” Conference & Follow-ups Emulation experiments in US and Europe NEDLIB, CURL, Michigan LC- Mellon-funded E-Journal Archive experiments- Internet Archive Long Now
Library to Lead National Effort to Develop Digital Information Infrastructure and Preservation Program (1 of 2) U.S. Congress Provides $100 Million Special Appropriation in Support of Project http://www.loc.gov/today/pr/2001/01-006.html Response to NAS report LC 21: A Digital Strategy for the Library of Congress The Library of Congress has been empowered by the U. S. Congress to develop a national program to preserve the burgeoning amounts of digital information, especially materials that are created only in digital formats, to ensure their accessibility for current and future generations. "This collaborative strategy will permit the long-term acquisition, storage and preservation of digital materials, that will assure access to the growing electronic historical and cultural record of our nation," said Dr. Billington. "Just as the Congress enabled the Library of Congress to begin the last century by making its printed catalog cards widely available, the Congress has enabled its Library to begin this century by building a digital record and making it available in the information age.”
Library to Lead National Effort to Develop Digital Information Infrastructure and Preservation Program (2 of 2) U.S. Congress Provides $100 Million Special Appropriation in Support of Project http://www.loc.gov/today/pr/2001/01-006.html In December 2000, the 106th Congress appropriated $100 million for this effort, which instructs the Library to spend an initial $25 million to develop and execute a congressionally approved strategic plan for a National Digital Information Infrastructure and Preservation Program. Congress specified that, of this amount, $5 million may be spent during the initial phase for planning as well as the acquisition and preservation of digital information that may otherwise vanish. The legislation authorizes as much as $75 million of federal funding to be made available as this amount is matched by nonfederal donations, including in-kind contributions, through March 31, 2003. The effect of a government-wide rescission of .22 percent in late December was to reduce this special appropriation to $99.8 million. The Library will consult with federal partners to assess joint planning considerations for shared responsibilities. The Library will also seek participation from the nonfederal sector and will execute its overall strategy in cooperation with the library, creative, publishing, technology and copyright communities in this country and abroad.
Moving from Digital Collections to Digital Libraries/Museums/Archives What’s the difference? not experiments real users service longevity Recent history of Library Automation-
Ideal Digital Collection Model DL DL DL DL search & presentation user user
Developmental Stages Experiment with methods Build real operational systems Build interoperable operational systems Make the system useful for users For DL Initiatives For OPACs For I & A Services For Image Retrieval
One Final Question: Who will collect the digital works of today that should become the Special Collections of tomorrow? web sites zines electronic journals listserv and email discussions drafts of works that later become famous
Planning Digital Projects http://www.getty.edu/research/institute/standards/introimages/ http://www.getty.edu/gri/standard/intrometadata/ http://www.cdlib.org/libstaff/technology/tas/Standards/ http://sunsite.berkeley.edu/Imaging/Databases/ http://sunsite.berkeley.edu/moa2/ http://sunsite.berkeley.edu/Longevity/ http://www.oac.cdlib.org/ http://lcweb.loc.gov/ead/ http://purl.oclc.org/metadata/dublin_core/ http://www.gseis.ucla.edu/~howard/image-meta.html http://www.gseis.ucla.edu/~howard/Metadata/UC-May00/ http://sunsite.berkeley.edu/Metadata/sp2000.html http://www.gseis.ucla.edu/~howard/ http://is.gseis.ucla.edu/impact/f95/special-collectns.html http://is.gseis.ucla.edu/impact/f95/Papers-projects/Projects/Trowbridge/