Project Prism
Virtual Remote Control: Preservation Risk Management for Web Resources
Nancy Y. McGovern, ECURE 2002
The Project
  Part of a 4-year NSF-funded project
    – supported by the Digital Libraries Initiative, Phase 2 (Grant No. IIS , the Prism Project)
  An umbrella project that includes
    – Digital Libraries research team (Computer Science)
    – Human Computer Interface (HCI)
    – Cornell University Library (CUL)
  For updates:
    – Project Prism
The Team
  Anne R. Kenney
  Nancy Y. McGovern
  Peter Botticelli
  Richard Entlich
  William R. Kehoe
  Carl Lagoze
  Sandra Payette
Preservation Risk Management
  Increased reliance by research libraries on Web resources not owned or controlled
  Need to monitor and evaluate resources
  Identify risks to resources and appropriate responses
  Technology introduces new threats, enables new solutions
The Research Agenda
  See "Preservation Risk Management for Web Resources: Virtual Remote Control in Cornell's Project Prism," by Anne R. Kenney, Nancy Y. McGovern, Peter Botticelli, Richard Entlich, Carl Lagoze, and Sandra Payette, in D-Lib Magazine, January
The Approach
  1. Process
  2. Identification
  3. Analysis
  4. Appraisal
  5. Strategy
  6. Detection
  7. Response
Process
  Adapt the Risk Management Model stages:
Identification
  Establish boundary; characterize content
  Example: parse the URL
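As one illustration of the "parse the URL" example, the sketch below splits a URL into parts that could hint at a resource's boundary and provenance. The function name `characterize_url`, the chosen fields, and the sample URL are assumptions made for illustration; they are not taken from the Prism project.

```python
from urllib.parse import urlsplit


def characterize_url(url: str) -> dict:
    """Break a URL into parts that may hint at boundary and provenance.

    The selection of fields is a hypothetical example, not a Prism specification.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".") if host else []
    # Non-empty path segments, e.g. ["~jdoe", "papers", "index.html"]
    segments = [s for s in parts.path.split("/") if s]
    return {
        "scheme": parts.scheme,
        "host": host,
        "top_level_domain": labels[-1] if labels else "",
        "path_depth": len(segments),
        # A "~name" segment conventionally marks a personal directory.
        "personal_directory": any(s.startswith("~") for s in segments),
        "has_query": bool(parts.query),
    }


if __name__ == "__main__":
    # Hypothetical URL used only to exercise the function.
    print(characterize_url("http://www.example.edu/~jdoe/papers/index.html?lang=en"))
```

A deep path, a personal directory, or a query string does not prove anything about a resource, but each is a cheap signal that can be collected before any content is fetched.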
Analysis
  Define risks associated with:
  A Web page:
    – as a stand-alone object, ignoring its hyperlinks
    – in local context, considering the internal and external links
  A Web site:
    – as a semantically coherent set of linked Web pages
    – as an entity in a broader technical and organizational context
Contextual Layers
Page-level Monitoring
  Formatting: TIDY
  Standards compliance
  Document structure
  Metadata:
    – HTTP headers
    – HTML headers
  Changes
    – Content
    – Location
  Links
    – Out-link structure
    – In-link structure
    – Intra-site
    – Hub
    – Volatility
  Page provenance
    – URL parsing
  Log analysis
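The out-link and intra-site items above can be sketched with the Python standard library alone. The example below collects anchor targets from a page and sorts them into intra-site and external links by comparing host names. The names `LinkCollector` and `classify_out_links`, and the rule that "same host" means "same site", are simplifying assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_out_links(page_url: str, html: str) -> dict:
    """Split a page's out-links into intra-site and external lists.

    Assumption: two URLs belong to the same site when their hosts match.
    """
    collector = LinkCollector()
    collector.feed(html)
    page_host = urlsplit(page_url).hostname
    result = {"intra_site": [], "external": []}
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlsplit(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar targets
        key = "intra_site" if parts.hostname == page_host else "external"
        result[key].append(absolute)
    return result
```

Running the same classification on successive captures of a page and comparing the lists gives a rough measure of link volatility.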
Site-level Monitoring
  Graph analysis
  Static site analysis and longitudinal study
  Aggregate page analyses
  Site maintenance indicators
    – Backup and archiving policies and procedures
    – Hardware and software environment
    – Network configuration and maintenance
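A minimal sketch of what graph analysis over a site might compute, assuming the site's links have already been gathered into a dictionary that maps each page to the pages it links to. The function name `link_graph_summary`, the choice of measures (in-link counts, pages with no in-links, the page with the most out-links), and the sample data are illustrative assumptions rather than the project's method.

```python
from collections import Counter


def link_graph_summary(site_links: dict) -> dict:
    """Summarize a site's internal link graph.

    site_links maps a page path to the list of page paths it links to.
    """
    in_degree = Counter()
    for source, targets in site_links.items():
        for target in set(targets):  # count each source page once per target
            if target != source:     # ignore self-links
                in_degree[target] += 1
    out_degree = {
        page: len(set(targets) - {page}) for page, targets in site_links.items()
    }
    pages = set(site_links) | set(in_degree)
    return {
        "in_degree": dict(in_degree),
        # Pages nothing else links to; candidates for closer inspection.
        "no_in_links": sorted(p for p in pages if in_degree[p] == 0),
        # Page with the most distinct out-links, a simple stand-in for a hub.
        "top_hub": max(out_degree, key=out_degree.get) if out_degree else None,
    }


if __name__ == "__main__":
    # Hypothetical five-page site.
    site = {
        "/": ["/a", "/b", "/c"],
        "/a": ["/"],
        "/b": ["/a"],
        "/c": [],
        "/old": ["/"],
    }
    print(link_graph_summary(site))
```

Repeating the summary over time would support the longitudinal side of the analysis, for example by noting when a page loses all of its in-links.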
Appraisal
  Enable portfolio management
  Hypothetical appraisal of a Web resource:
    – Scope: highly relevant
    – Value: high value, not essential; numerous links to page
    – Relationship: secondary archives; informal agreement
    – Maintenance: key indicators of good management
    – Redundancy: captured by more than one archive
    – Risk response: very responsive to risk notifications
    – Capture: complex structure; cyclical updates; formats
    – Size: medium-sized; 3-level crawl
Portfolio Management
Strategy
  Develop an organization-specific program:
Detection
  Monitor change; initiate response
  Track indicators of management practices:
    – markup language: version, formatting, compliance
    – HTTP: status codes, header content
    – changes: content, location
    – links: internal, external, volatility
    – server: security, version, upgrades, responsiveness
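Several of the indicators listed above (HTTP status, header content, location, content change) can be tracked by comparing two captures of the same resource. The sketch below assumes the captures were already fetched by some other tool; the `Snapshot` fields, the `detect_changes` function, and the use of a SHA-256 digest to detect content change are illustrative choices, not a description of Prism's tools.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One capture of a Web resource (hypothetical record layout)."""
    url: str      # final URL after any redirects
    status: int   # HTTP status code
    server: str   # value of the Server response header
    body: bytes   # raw response body

    @property
    def digest(self) -> str:
        return hashlib.sha256(self.body).hexdigest()


def detect_changes(previous: Snapshot, current: Snapshot) -> list:
    """List the indicators that differ between two captures."""
    findings = []
    if current.status != previous.status:
        findings.append(f"status {previous.status} -> {current.status}")
    if current.url != previous.url:
        findings.append("location changed")
    if current.server != previous.server:
        findings.append("server header changed")
    if current.digest != previous.digest:
        findings.append("content changed")
    return findings
```

A byte-level digest flags any difference, including trivial ones such as a rotating banner, so a real monitor would likely normalize the content before comparing it.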
Detection (cont.)
  Monitor change; initiate response
  Identify potential risks:
    – probable occurrence
    – frequency of occurrence
    – degree of impact
  Correlate to program-defined response levels
  Identify appropriate risk/response scenario(s)
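One way the three risk factors above could be correlated to response levels is sketched below. The slides do not say how the factors are combined, so the 1-to-3 rating scale, the multiplication of the ratings, the thresholds, and the level names "watch", "notify", and "act" are all hypothetical; an organization would substitute the levels its own program defines.

```python
def risk_score(probability: int, frequency: int, impact: int) -> int:
    """Combine three ratings (each 1 = low, 2 = medium, 3 = high) into one score.

    Multiplying the ratings is an assumption made for illustration.
    """
    for name, value in (
        ("probability", probability),
        ("frequency", frequency),
        ("impact", impact),
    ):
        if value not in (1, 2, 3):
            raise ValueError(f"{name} must be 1, 2, or 3, got {value!r}")
    return probability * frequency * impact


def response_level(score: int) -> str:
    """Map a score to a response level using hypothetical thresholds."""
    if score >= 18:
        return "act"      # e.g. capture the resource or contact its maintainer
    if score >= 6:
        return "notify"   # e.g. alert the responsible staff member
    return "watch"        # keep monitoring at the normal interval


if __name__ == "__main__":
    score = risk_score(probability=2, frequency=3, impact=1)
    print(score, response_level(score))
```

Keeping the scoring and the threshold mapping in separate functions lets a program change its response levels without touching how risks are rated.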
Response
  Develop a toolkit:
    – Inventory and evaluate existing tools
    – Assess functionality for Prism stages
    – Adopt/adapt existing tools
    – Develop new tools
    – Apply to appropriate contextual layers
    – Integrate tools into customizable toolkit
Types of Tools
  link analyzers
  log analyzers
  Web crawlers
  Web visualization programs
  Web site management utilities
Future Directions
  Preservation Risk Management Program:
    – Develop program using Prism framework
    – Provide organizational scenarios
  Toolkit:
    – Complete inventory of tools
    – Build toolkit demonstrator
  Applications:
    – Develop presentation techniques for stored resources
    – Enable risk/response scenario development