2008 DOT GOV HARVEST PRESERVING ACCESS UNIVERSITY OF NORTH TEXAS LIBRARIES Cathy N. Hartman Mark E. Phillips FDLC Oct 21, 2008.

Slides:



Advertisements
Similar presentations
An Introduction To Heritrix
Advertisements

Digital Initiatives at the University of North Texas Libraries Cathy Nelson Hartman University of North Texas Libraries Texas Conference on Digital Libraries.
1 What is the Internet Archive We are a Digital Library Mission Statement: Universal access to human knowledge Founded in 1996 by Brewster Kahle in San.
Digital & Preservation Resources Managing the digital collection life cycle.
Texas Newspaper PDF Preservation: A Low-Cost Solution with Tremendous Value Ana Krahmer, Digital Newspaper Program Coordinator Mark.
The Library of Congress Cooperative Web Archiving Project Abbie Grotke, Library of Congress Grant Harris, Library of Congress Jennifer Long, Georgetown.
National Digital Information Infrastructure and Preservation Program (NDIIPP) Data-PASS/NDIIPP: A new effort to harvest our history A funder view May 25,
Looking Ahead Archive-It Partner Meeting November 12, 2013.
Transformations at GPO: An Update on the Government Printing Office's Future Digital System George Barnum Coalition for Networked Information December.
1 Archiving and Preserving the Web Kristine Hanna Internet Archive July 2008.
The FDLP Web Archive Dory Bower Archive-It Partner Meeting November 18, 2014.
1 Archiving and Preserving the Web Kristine Hanna Internet Archive April 2006.
1 Archive-It Training University of Maryland July 12, 2007.
©2011 Quest Software, Inc. All rights reserved. Steve Walch, Senior Product Manager Blog: November, 2011 Partner Training Webcast.
Annick Le Follic Bibliothèque nationale de France Tallinn,
QuestionPoint and the Library of Congress FEDLINK Fall OCLC Users Group Meeting Linda J. White Public Service Collections Library of Congress FEDLINK Fall.
Bibliography in the Digital Age - IFLA Satellite Meeting Warsaw, 9 August Online materials published in Austria collecting, archiving and metadata.
Joanne Archer University of Maryland Kate Odell Archive-It Abbie Grotke Library of Congress Tessa Fallon Columbia University Creating and Maintaining Web.
1 Archiving and Preserving the Web Dan Avery Kristine Hanna Merrilee Proffitt Internet Archive RLG April 2006.
Leadership for Student Success through After School Programs Presented by: California County Superintendents Educational Services Association with the.
Web The Internet Archive. Agenda Brief Introduction to IA Web Archiving Collection Policies and Strategies Key Challenges (opportunities for.
A Partnership Born of Urgency and Civic Responsibility Preserving Access to Government Websites Through the CyberCemetery Starr Hoffman Librarian for Digital.
Web Capture team Office of strategic initiatives February 27, 2006 Selecting Content from the Web: Challenges and Experiences of the Library of Congress.
The Web Archiving Service Tracy Seneca California Digital Library California Digital LibraryNew York UniversityUniversity of North Texas National Digital.
Digitization of the Federal Depository Library Program Judith C. Russell Superintendent of Documents & Managing Director, Information Dissemination “Electronic.
Annick Le Follic Bibliothèque nationale de France Tallinn,
Web Archiving at the National Library of Australia National Library of Indonesia Staff 5 October 2010 Paul Koerbin Manager, Web Archiving National Library.
Google Classroom.
CNI Fall Task Force, December 2007 International Internet Preservation Consortium Abbie Grotke IIPC Communications Officer Library of Congress & George.
The Metadata Object Description Schema (MODS) NISO Metadata Workshop May 20, 2004 Rebecca Guenther Network Development and MARC Standards Office Library.
ECHO DEPository Project: Highlight on tools & emerging issues The ECHO DEPository Project is a 3-year digital preservation research and development project.
WAS to Archive-It Metadata Migration March 11, 2015.
ERIC and the WorldCat Registry Lawrence Henry ERIC Program Manager Joanna White WorldCat Registry Product Manager.
The Legislative Library of Ontario’s Ontario Documents Repository Road to Partnership.
ERIKA Eesti Ressursid Internetis Kataloogimine ja Arhiveerimine Estonian Resources in Internet, Indexing and Archiving.
University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.
1 Archive-It: Archiving and Preserving Born Digital Content NDIIPP June 2009 Molly Bragg Partner Specialist Internet Archive.
11 October 2015 MAVIS v “Sneak Preview”. 11 October 2015 Enhancements in the Release  Reference Material  Brief Accessioning View  Template.
Can we be doing more? Beth Tillinghast University of Hawaii at Manoa October 19, 2011 Archive-It Partner Meeting ACCESS TO OUR ARCHIVED WEBSITE COLLECTIONS.
Preserving Digital Culture: Tools & Strategies for Building Web Archives : Tools and Strategies for Building Web Archives Internet Librarian 2009 Tracy.
GPO’s Federal Digital System August 17, 2010 U.S. Government Printing Office.
The Portal to Texas History: Harnessing Technology to Enable Collaboration with Small Museums and Libraries CNI, December 6, 2005 Cathy Nelson Hartman.
0 GPO’s Federal Digital System October 21, 2008 Selene Dalecky – Lisa LaPlant – Blake Edwards U.S. Government Printing Office.
Web Archiving at the National Library of Australia Russell Latham Senior Web Archivist, National Library of Australia.
The Library of Congress Martha Anderson Program Officer, NDIIPP Office of Strategic Initiatives Library of Congress April 2005 LC Perspective : Preservation.
CyberCemetery Preserving At-Risk Government Web Content.
ALA Institutional Repository Update ALA Archives at the University of Illinois Urbana-Champaign Chris Prom Cara Bertram Denise Rayman.
EThOS: Where have we got to and where do we go next? FIL Conference 29 June 2015 Sara Gould.
Metadata Extraction & Web Archives: Automating the Record Creation Process Abbie Grotke / Gina Jones /
Encouraging An Informed Citizenry: Locating and Using Congressional Research Service Reports Starr Hoffman Librarian for Digital Collections University.
Preservation Program Digital Preservation Program Digital Preservation Services: Extending tools to meet campus needs Patricia Cruse, Director, Digital.
National Archives and Records Administration Status of the ERA Project RACO Chicago Meg Phillips August 24, 2010.
A Project of the University Libraries Ball State University Libraries A destination for research, learning, and friends.
Microsoft Office 2008 for Mac – Illustrated Unit D: Getting Started with Safari.
Challenges in Web Archiving UNT Perspective NDIIPP – July 21, 2010.
Web Archiving Workshop Mark Phillips Texas Conference on Digital Libraries June 4, 2008.
Digitization Workflows From the Digital Projects Unit University of North Texas Libraries Mark E. Phillips Jeremy D. Moore February 12, 2009.
Archiving & Preserving Digital Content
Workshop on Web Archiving
LMEvents SharePoint Portal How-to Guide
Joanne Archer University of Maryland Libraries
OER Commons Hubs A Primer
Building Search Systems for Digital Library Collections
The Digital Library for Earth System Science
MSC photo:  It was taken some time in the late 1930s, but we don’t have an exact date.  The college was known as MSC from 1925 until 1955 when we became.
Wisconsin County and Municipal Government Collections in Archive-It
Metadata to fit your needs... How much is too much?
Preserving Our Collective Digital History
MSC photo:  It was taken some time in the late 1930s, but we don’t have an exact date.  The college was known as MSC from 1925 until 1955 when we became.
Márton Németh – László Drótos How to catalogue a web archive?
Presentation transcript:

2008 DOT GOV HARVEST PRESERVING ACCESS UNIVERSITY OF NORTH TEXAS LIBRARIES Cathy N. Hartman Mark E. Phillips FDLC Oct 21, 2008

Outline  Project History  Tool Building  Partner Activities  Future Work

Project History  Collaborating Institutions :  Library of Congress  Internet Archive  California Digital Library  University of North Texas  US Government Printing Office

Project History  First Meeting – Canberra, Australia  Early April 2008, at the National Library of Australia, International Internet Preservation Consortium (IIPC) – formed the partnership and discussed implications and possible roles for each institution.  Agreed from the beginning to share all content with any partner who wished a copy.

Project History  Monthly meetings since that time – conference calls and one face-to-face meeting.  Defined roles  Released an announcement  Sought help from specialists to nominate URLs for harvesting  Shared technology planning  Developed URL Nomination Tool

Tool Building  URL Nomination Tool Allows for combining multiple seed lists Allows for collaboration with subject experts Helps create future seed lists Helps to define overall scope of project

Pause for Vocabulary  Seed List – List of URLs fed to the crawler for harvesting.  Crawler– Software which downloads file, parses text to extract URLs, adds URLs to list and repeats  Scope – Whether a URL should be included or not  Crawl – Running a crawler on a given seed list  SURT – Sort-friendly URI Reordering Transform

List of URLs 1890scholars.program.usda.gov 2001.cancer.gov 2001.nci.nih.gov acc.nos.noaa.gov access.usgs.gov access.wa.gov accessamerica.gov accesstospace.gsfc.nasa.gov acrim.jpl.nasa.gov acs.oes.ca.gov acweb.fsl.noaa.gov adc.gsfc.nasa.gov

List of SURTs gov.accessamerica gov.ca.oes.acs gov.cancer.2001 gov.nasa.gsfc.accesstospace gov.nasa.gsfc.adc gov.nasa.jpl.acrim gov.nih.nci.2001 gov.noaa.fls.acweb gov.noaa.nos.acc gov.usda.program.1890scholars gov.usgs.access gov.wa.access

Back to the tool…  Tool Requirements  Ingest seed lists from different sources  Keep track of who nominated seed  Record known metadata for seed  Allow people to help with nomination Search Browse “easy to use”  Create seed lists for crawls

Tool Concepts  URL – Single instance of metadata in system  URL  Attribute – (metadata element)  Value – (metadata value)  Nominator ID  Project ID  Timestamp  Nominator  Nominator Address  Nominator Name  Nominator Institution  Project  Project Metadata

Batch Ingest  Administrator can import CSV files with URLs and associated metadata with batch importer  An ingest needs to be associated with a Nominator and a Project.  Arbitrary metadata is recognized and added to the system.

Nomination – In Scope/Out of Scope  On batch import a URL is given a positive nomination +1  A user of the Nomination Tool has the ability to nominate a URL as in scope (+1) or out of scope (-1)  Nominations are calculated to give a possible measure of importance for a project.

EOT 2008 Project Metadata  Metadata fields defined for EOT 2008 Harvest Branch Department/Agency Name Title Comment  Nominators don’t need to register but Name, and Institution are required.

Nomination Tool Demo  URL Nomination Tool URL Nomination Tool  URL Nomination Tool - Admin URL Nomination Tool - Admin

You can give us a hand…  For more information about the project and partners, visit: html html  To sign up to participate, send an to for more

Partner Roles  Internet Archive – Broad Crawls  Library of Congress – In-depth Legislative branch crawls  University of North Texas – Sites/Agencies that meet current UNT interests and collections, as well as several “deep web” sites.  California Digital Library – Executive/Judicial branch emphasis with one site per crawl scope  Government Printing Office - Support

Project Schedule  August 14: Project Announced via Press Release  End of August: Call for volunteer nominators went out. Nominators began prioritization of URLs.  September 15: Internet Archive began first broad crawl of everything that was in the nomination tool at the project start.  October 17: UNT began crawl of prioritized URLs of topic areas interest to their collection development.

Project Schedule 2008 – cont.  October 29: LC expanded monthly Congressional Crawl  November 1 – January 19: CDL begins weekly crawl of all prioritized URLs. Will continue weekly until the Inauguration.  November 4: Election Day - UNT begins second crawl of prioritized URLs of topic areas interest to their collection development.  November 26: LC expanded monthly Congressional Crawl  December 31: LC expanded monthly Congressional Crawl

Project Schedule 2009  January 20: Inauguration Day  January 21: Internet Archive begins second broad crawl - de-duplication will be on.  UNT begins third crawl of prioritized URLs of topic areas interest to their collection development.  January 28: LC expanded monthly Congressional Crawl begins  Late February/March: Public access via Internet Archive’s website

Near-Future Work  Centralizing web data into a single collection at the Internet Archive  Providing WayBack access to content  Providing search access to content  Distributing collection among partners (25-35 TB projected)  Investigation of browse by Agency/Branch

Other Future Work  Extracting topical collections from crawl data  Providing programmatic access for data- mining  Research in calculating “size” of collection in relation to real world measures  Number of pages of text collected  Number of 8x10 in equivalent images collected  Hours of Audio  Hours of Video  Physical library space requirements to hold collection if in physical format.

Questions?