Digital Preservation and the Open Web: A Curatorial Perspective Terence K. Huwe Institute of Industrial Relations University of California, Berkeley Computers.

Presentation transcript:

Digital Preservation and the Open Web: A Curatorial Perspective
Terence K. Huwe
Institute of Industrial Relations
University of California, Berkeley
Computers In Libraries, March 2006

Overview
A brief description of “The Web at Risk” project
  – How it’s organized, who’s involved
Objectives of the project
  – Preservation of the open Web
  – Development of an open source “Tool Kit”
How it works and where it’s going, from a “special collections” perspective

The Web at Risk Project
Three-year, $2.4 million grant from the Library of Congress National Digital Information Infrastructure and Preservation Program (NDIIPP)
Coordinating agency: the California Digital Library
Primary focus on developing open access archiving tools that can be applied to any discipline with Web content worth keeping
Tools are to be extensible, modular, and easily configured to work with technologies that are already in place

Project Stages
Content Identification and Selection
  – Key issues for analysis, framework for sample crawls, working with collection partners, exploring extensibility
Content Acquisition
  – Content harvest and acquisition; configuring the Web Crawler, Analyzer, Content User Interface (CUI), and Export/Import Handler
Content Retention
  – Testing and modifying the data model for Web Archive Digital Objects (WADO); assessing the CDL Digital Preservation Repository for ingest and retention
Partnership Building
  – Model agreements for content retention; evaluating future steps; assessing the costs of sustaining a distributed approach to Web archiving

Partners in this NDIIPP Grant
Main Partners:
  – New York University
  – University of North Texas, The Libraries
  – Texas Center for Digital Knowledge
Technical Partners:
  – UC San Diego Supercomputer Center
  – Stanford University Computer Science Department
  – Sun Microsystems, Inc.

National Curatorial Partners
Arizona State University Library and Archive
New York University Tamiment Library
University of North Texas, The Libraries
Stanford University Library’s Social Sciences Research Center

University of California Curatorial Partners
UCLA Online Campaign Literature Archive
UC Berkeley Institute of Governmental Studies Library
UC Berkeley Institute of Industrial Relations Library
Eight UC Libraries in the Federal Depository Library Program:
  – Berkeley, Davis, Irvine, UCLA, Riverside, San Diego, Santa Barbara, Santa Cruz

The Institute of Industrial Relations: Capturing Labor History in Action
News, data and links are being generated by unions at both the international and local level
Union priorities are necessarily “just in time,” and unions operate in a state of high triage
Preserving these data is a high priority for IIR and the NYU Tamiment Library
A non-academic host is unlikely to preserve them, which makes the challenge more urgent

Where Things Stand Now
We’ve got a Wiki and curators are in touch
IIR and NYU/Tamiment are coordinating on labor issues
Technical issues have moved to the fore
  – Figuring out the configuration of the crawler and what to crawl
The first crawl report has come back
The results are provocative and interesting

First Crawl Highlights
30 sites crawled, with the maximum crawl size set to 1 gigabyte
  – 18 hit the 1 gigabyte limit
Average files on host: 6,359
Average with linked hosts included: 17,247
Most files on a single server: 46,197
Median duration of crawl (host): 7 hr 33 min
The crawler, Heritrix 1.5.1, returned different data than other crawlers (HTTrack, Wget)
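The effect of the size limit on these figures can be illustrated with a minimal sketch. It is not Heritrix's actual logic: the function name `capped_harvest`, the decimal definition of a gigabyte, and the rule of stopping before the first file that would exceed the cap are all assumptions made for illustration. Only the counts of 30 sites crawled and 18 sites at the limit come from the slide.

```python
# Hypothetical sketch of a size-capped harvest; not Heritrix's real behavior.
GIGABYTE = 10**9  # assumption: decimal gigabyte; the slides do not specify


def capped_harvest(file_sizes, cap_bytes=GIGABYTE):
    """Keep files in order until the next one would push the total past the cap.

    Returns a tuple (files_kept, bytes_kept, hit_cap).
    """
    kept = 0
    total = 0
    for size in file_sizes:
        if total + size > cap_bytes:
            # The crawl stops here, so everything after this point is unseen.
            return kept, total, True
        total += size
        kept += 1
    return kept, total, False


# Figures reported on the slide.
sites_crawled = 30
sites_at_cap = 18
share_at_cap = sites_at_cap / sites_crawled  # fraction of sites truncated by the cap

# Toy example: three 400 MB files against a 1 GB cap.
toy_result = capped_harvest([400_000_000, 400_000_000, 400_000_000])
```

Because 18 of the 30 crawls stopped at the cap, the per-host file counts for those sites are lower bounds on what the sites actually hold, not complete inventories.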

Rights and Permissions Vary According to Host
A three-level scheme for future rights management:
Consent Implied: crawl without permission
  – 14 sites in this category
Consent Sought: crawl, but also identify and notify the data owner
  – 13 sites in this category
Consent Required: advance permission needed
  – 3 sites in this category
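The three-level scheme can be written down as a small data structure. This is a sketch under stated assumptions, not part of the project's toolkit: the names `Consent`, `may_crawl`, and `must_notify` are hypothetical, and only the three levels and their site counts come from the slide.

```python
from enum import Enum


class Consent(Enum):
    """The three rights levels named on the slide (class name is hypothetical)."""
    IMPLIED = "crawl without permission"
    SOUGHT = "crawl, but identify and notify the data owner"
    REQUIRED = "advance permission needed"


def may_crawl(level, permission_granted=False):
    """Only the REQUIRED level blocks a crawl until permission is on file."""
    if level is Consent.REQUIRED:
        return permission_granted
    return True


def must_notify(level):
    """Only the SOUGHT level obliges the curator to contact the data owner."""
    return level is Consent.SOUGHT


# Site counts reported for the first crawl.
reported = {Consent.IMPLIED: 14, Consent.SOUGHT: 13, Consent.REQUIRED: 3}
total_sites = sum(reported.values())  # should match the 30 sites crawled
```

Under this reading, 27 of the 30 sites could be crawled without waiting for a reply from the host, and 13 of those carry a follow-up duty to notify the owner.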

WEb aRchive Access (WERA)
An open source tool for viewing crawl results
Very new, and still very much in development
Relies upon a search query to display the crawled resources
Does not really show how an average user would use a finished collection

The Fine Print Matters
Heritrix doesn’t capture the directory tree of servers; it follows links
Many domains involve multiple servers, and crucial files (such as CSS libraries) need to be captured
The value of capturing linked files varies from site to site, from irrelevant to vitally important
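The difference between following links and copying a directory tree can be shown with a toy model. This is a minimal sketch, not Heritrix code: the function `crawl_by_links`, the in-memory link table, and every URL below are invented for illustration.

```python
from collections import deque


def crawl_by_links(seed, links, max_hops=None):
    """Breadth-first traversal that discovers pages only through links.

    `links` maps a URL to the URLs it references. Nothing is fetched;
    the dictionary stands in for the live Web.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, hops = queue.popleft()
        if max_hops is not None and hops >= max_hops:
            continue  # do not expand pages at the hop limit
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen


# Hypothetical site: the stylesheet lives on a second server, and one old
# page sits in the server's directory tree with no link pointing at it.
links = {
    "http://union.example/": [
        "http://union.example/news.html",
        "http://static.example/site.css",
    ],
    "http://union.example/news.html": ["http://union.example/report.pdf"],
}
files_on_server = {
    "http://union.example/",
    "http://union.example/news.html",
    "http://union.example/report.pdf",
    "http://union.example/old/orphan.html",  # present on disk, never linked
}

captured = crawl_by_links("http://union.example/", links)
missed = files_on_server - captured
```

In this toy case the link-following crawl reaches the stylesheet on the second host but never sees the unlinked page, which is the trade-off the slide describes.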

Curator Perspectives
Need to capture “new publications” as they appear
By a slight majority, monthly intervals are favored for crawl frequency
How much multimedia should be captured? The 1 gigabyte limit obscured the answer
About 70 percent of curators rated the crawl as “mostly effective”
Curators approached the process collaboratively from the very beginning, communicating proactively. This implies that collaborative collection development is viable

What’s Needed
Curators want to see some sort of user interface so they can evaluate the experience of viewing archived Web resources
The relationship between a particular host and whatever it links to is stimulating debate; probably, both are needed
Long-term sustainability of this project will depend on attracting interest from government and industry

Looking Ahead
The Open Access toolkit will be rigorously tested (and will not appear for at least 2 years)
This approach places most responsibility with curators, just as special collection development activity would mandate
This is a new stream of work for information professionals, but the standardization of the toolkit could be an important innovation

Conclusions
The profession-wide culture of collaborative collection development is alive and well, and digesting new digital collection strategies
The combination of a toolkit “deliverable” and the pooled experience of the cohort will be enormously useful for all digital librarians
Hands-on collection experts are in an excellent position to advise technologists in the creation of new digital archiving tools, at the ground level

URLs Referenced
The Web at Risk
Heritrix Web Site
Web aRchive Access
UCLA Campaign Literature Archive
The AFL-CIO
Service Employees International Union
Change to Win
The Institute of Industrial Relations Library
