Archiving the Web: why bother? LA Times (March 2000)

Archiving the Web: why bother?
“Web sites are an increasingly important part of [an] institution’s digital assets and of [a] country’s information and cultural heritage.” (JISC – April 2002)
“A lot of history is born digital. This should not be like early television where there is no record.” (Brewster Kahle – May 2002)

Archiving the Web: who bothers?
Australia
USA
Nordic countries: Denmark, Finland, Sweden
Other countries: UK, France, Japan
Internet Archive
–“Wayback Machine”

Three conferences:
What’s next for Digital Deposit Libraries? Darmstadt, September 2001
International Symposium on Web Archiving. Tokyo, January 2002
DPC Forum: Web-archiving. London, March 2002

The Web Sites
http:// ECDL2001.htm
http:// eng.html
http:// ion/webforum.html

Issues and Questions
Legal Deposit of Digital Information?
–European Union Copyright Directive
Copyright?
Open or closed access?
Selective or comprehensive?
When in the life cycle?
How often?
Capturing the experience
–Dynamic web sites

Technical challenges
Embedded external links and executable programs
Persistent naming and date stamping
Duplicate control
Change in content over time
Surface web vs Deep web

Australia (PANDORA Archive) – NLA
As yet no legal deposit.
Mandate for collecting Commonwealth Government publications
Selective
–(Australian e-journals, organisational sites, government publications, ephemera)
Accessible by public
–Catalogued in the NBD

Australia (PANDORA Archive)
~1700 titles in the Archive (Nov. 2001)
–Growth rate: 40 sites/month
–Regathering: 35 sites/month
ADRI (Australian Digital Resource Identifier)
–Unique identification scheme
–In-house resolving system

USA (Minerva) – Library of Congress
(Mapping the Internet Electronic Resources Virtual Archive)
Open access materials from the Web
Changes in copyright law under discussion
Selective inclusion
Public access

LC/IA Pilot Project – “Election 2000”
Joint pilot project of the Library of Congress and the Internet Archive
Objectives:
–Library pilot: selection, collection and cataloguing of web sites; build prototype access system
–Internet Archive pilot: gain experience in harvesting and archiving sites
Over 800 websites (150+ selected sites and major sites hyperlinked to/from those sites)
2-3 terabytes of data
Archived daily, August 2000 to January 2001

Denmark
Royal Library, Copenhagen.
Limited legal deposit of electronic publications
–Static, not dynamic publications – finite units
Access only from workstations at Royal Library and State and University Library
Archiving static websites (monographs, periodicals)
Server mirrored nightly to State and University Library, Aarhus

Denmark (Statistics)
June: archived 9000 net publications
–31% monographs, 69% periodicals
–67.5% public sector/university, 32.5% private sector publications
Staff resources: 0.5 technical; 0.8 librarian

Sweden (Royal Library)
Take snapshots of Swedish Web several times/year
–No selection - take everything
–All www pages in Sweden, all articles in e-journals, all Swedish newspapers
–Definition of Sweden: .se, plus .com, .org, .net with Swedish address or telephone number
–Archive only - no public access as yet.
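The scope rule on this slide can be sketched as a small predicate. This is only an illustration, not the Royal Library's implementation: the function name and the `registrant_country` argument are hypothetical, and the argument simply stands in for whatever address or telephone lookup decides that a .com, .org or .net site is Swedish.

```python
from urllib.parse import urlsplit

# Generic top-level domains named on the slide.
GENERIC_TLDS = {"com", "org", "net"}


def in_swedish_scope(url, registrant_country=None):
    """Return True if the URL falls inside the slide's "definition of Sweden".

    registrant_country is a hypothetical input: a two-letter country code
    obtained elsewhere (for example from a registry lookup).
    """
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    tld = host.rsplit(".", 1)[-1] if host else ""
    if tld == "se":
        return True
    if tld in GENERIC_TLDS:
        # Generic domains count only when the registrant is in Sweden.
        return registrant_country == "SE"
    return False
```

For example, `in_swedish_scope("http://example.com/", "SE")` is accepted, while the same URL without a Swedish registrant is not.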

Sweden (Software)
Uses Whois to identify Swedish sites in non-.se domains
Harvesting with COMBINE Robot software (Univ. of Lund)
–Collects papers by automatically following hypertext links
–Also collects pictures and sound
–Fully automatic - no human intervention
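Harvesting by following hypertext links can be sketched as a breadth-first traversal. The sketch below is not COMBINE; it is a minimal illustration using Python's standard library. The `fetch` callable and the `SAMPLE_PAGES` dictionary are assumptions that replace real network access, so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects link targets from <a href> and embedded <img src>."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted:
            for name, value in attrs:
                if name == wanted and value:
                    self.links.append(value)


def harvest(seed, fetch, limit=1000):
    """Visit pages breadth-first from seed; return URLs in visit order.

    fetch(url) is a hypothetical callable returning the document text,
    or None when the URL cannot be retrieved.
    """
    seen = {seed}
    queue = deque([seed])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        body = fetch(url)
        if body is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Made-up pages standing in for a live web site.
SAMPLE_PAGES = {
    "http://example.se/": '<a href="/a.html">A</a><img src="logo.gif">',
    "http://example.se/a.html": '<a href="/">home</a><a href="b.html#top">B</a>',
    "http://example.se/b.html": "no links here",
    "http://example.se/logo.gif": "",
}
```

Calling `harvest("http://example.se/", SAMPLE_PAGES.get)` visits each of the four sample URLs once, with no human selection step in between.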

Swedish Archive (Kulturarw³)
Everything associated with an object and metadata stored in one file as a multipart MIME object
Name of the file: 33 character string with time stamp
Sept 2001: 110 million files
Gbytes of data from 97,000 web servers
Stored on disk and magnetic tapes using Hierarchical Storage Management (HSM)
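Storing an object together with its metadata in a single multipart MIME file might look like the sketch below, built on Python's standard `email` package. The metadata fields and the file naming scheme are assumptions for illustration only: the slide says the name is a 33 character string containing a time stamp, but does not give its layout, so `archive_name` is a hypothetical scheme that merely has the same length.

```python
import hashlib
from datetime import datetime
from email import message_from_bytes
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def pack(url, fetched_at, content_type, payload):
    """Bundle metadata and the harvested bytes into one multipart MIME blob."""
    container = MIMEMultipart("mixed")
    meta = (
        f"url: {url}\n"
        f"fetched: {fetched_at.isoformat()}\n"
        f"content-type: {content_type}\n"
    )
    container.attach(MIMEText(meta, "plain", "utf-8"))
    # The harvested object itself, base64-encoded by MIMEApplication.
    container.attach(MIMEApplication(payload))
    return container.as_bytes()


def unpack(blob):
    """Return (metadata text, original payload bytes) from a packed blob."""
    message = message_from_bytes(blob)
    meta_part, body_part = message.get_payload()
    meta = meta_part.get_payload(decode=True).decode("utf-8")
    return meta, body_part.get_payload(decode=True)


def archive_name(url, fetched_at):
    """Hypothetical 33-character name: 14-digit time stamp, '-', 18 hex digits."""
    stamp = fetched_at.strftime("%Y%m%d%H%M%S")
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:18]
    return f"{stamp}-{digest}"
```

Keeping the metadata inside the same file means an object never becomes separated from the record of where and when it was collected.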

Swedish Archive (Kulturarw³) (2)
Prior to July 2002: Limited legal deposit (fixed form e-documents)
December 2001: Data Inspection Board team confirms project is illegal. Project suspended
July: Amendments to Swedish copyright law give the Royal Library the right to collect the Swedish web and to make the archive publicly available.

Finland – National Library
Follows Swedish approach - only .fi domain initially
Finnish Copyright Act under revision to permit harvesting web resources
Uses harvesting software developed in Finland from NEDLIB specification
Archive Metadata
–Uses MD5 checksum for duplicate control, authentication and to create a unique access key
–Time stamped upon retrieval
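The three uses of the MD5 checksum named above (duplicate control, authentication, access key) can be illustrated with a small in-memory store. This is a sketch under stated assumptions, not the NEDLIB harvester: the `Archive` class and its methods are hypothetical names. MD5 is used because the slide names it; it is no longer considered collision-resistant, so a new system would likely choose a stronger hash.

```python
import hashlib
from datetime import datetime, timezone


class Archive:
    """Toy content-addressed store keyed by the MD5 of each object."""

    def __init__(self):
        self.objects = {}   # MD5 hex digest -> stored bytes
        self.captures = []  # (url, retrieval time, MD5 hex digest)

    def store(self, url, payload, retrieved_at=None):
        """Record a capture; keep the bytes only if they are not a duplicate."""
        key = hashlib.md5(payload).hexdigest()
        is_new = key not in self.objects
        if is_new:
            self.objects[key] = payload
        # Every capture is time stamped upon retrieval, duplicate or not.
        stamp = retrieved_at or datetime.now(timezone.utc)
        self.captures.append((url, stamp, key))
        return key, is_new

    def verify(self, key):
        """Authentication check: stored bytes must still hash to their key."""
        return hashlib.md5(self.objects[key]).hexdigest() == key
```

Two URLs that serve identical bytes produce two capture records but only one stored object, which is the duplicate control the slide describes.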

Finland – Results of current Harvesting Round (1)
Harvesting round
–Commenced August, completed in April 2002
–9.4 million files from 29 million locations (URLs)
–Compressed data occupied 340 Gbytes of storage
–Stored on a tape robot in national supercomputing centre
–Hardware used: Sun E450 server

Finland – Results of current Harvesting Round (2)
Finnish experience: “the NEDLIB harvester can deal with any national Web space (except perhaps the USA) with reasonably modest hardware, provided that there is sufficient storage space available somewhere”. (Juha Hakala, leader of the Finnish team)

Nordic Web Archive
Joint project of Nordic national libraries
Not dependent on what harvester is used
–NEDLIB (Finland, Norway, Denmark), COMBINE (Sweden)
Selected Norwegian search engine (FAST)
Software
–Convert documents from 100 different MIME types to HTML
–Recognises most European languages
Budget: 260,000 Euros (AUS $475,000)

“The homogeneous (surface) Web” (figures labelled Denmark and Finland)
Share by format: 59.3% Text/HTML; 37.9% Image (GIF, JPEG, PNG); 1.7% PDF; 1.1% other formats
File counts: 1.5 million HTML; 1 million GIF; 550,000 JPEG; 36,500 PDF; 11,800 plain text; 6,000 Word; 5,300 Java etc.

United Kingdom (1)
British Library
–“Domain.uk” experiment (commenced 2002)
Select and capture 100 UK websites (2001 election, GM crops)
Selected sites for approval
Revisit every three weeks
Uses Bluesquirrel Web Whacker software
Audit change, loss and links over time
–Intention to scale up (2004 funding bid)

United Kingdom (2)
UKOLN Research Project
–Estimates of size of .uk domain: 3 million sites, 24 million pages
Wellcome Library/JISC Archiving Study to find a solution to web archiving
–The “medical web”
–Consultancy awarded March
–Completion date October
–Draft report August
Final report to be disseminated to the community

Germany (Deutsche Bibliothek)
–Experiments with targeted harvesting
–Two incomplete snapshots 12/2000 and 02/2001

France (Bibliothèque nationale de France)
–In 2001: two experiments with small numbers of sites (16,100), including music, video and multimedia.
–Unsatisfactory results:
Unexpected features
Exceptionally large sites
–Planning new feasibility study with 2 different robot providers
–Change in legal deposit law proposed in June. Not yet adopted by Parliament.

Japan
National Diet Library
WARP (Web Archiving Program)
Initially selective
Major changes in Japanese copyright law expected to permit more comprehensive collecting.

Internet Archive (1)
Founded by Brewster Kahle; $15 million from sale of WAIS
Non-profit organisation.
–Sponsors include AT&T Research, Compaq, Xerox PARC, Quantum DLT, National Science Foundation.
Archived web pages from 1996+, movies from 1903 to 1973
Site has archived over 10 billion pages (Oct. 2001) = more than 100 terabytes
Growth rate: 10 terabytes/month

Internet Archive (2)
Complete sweep of the Web every two months
“Robot exclusions” - many newspapers, individuals, photographers
Complete copy of Archive at Bibliotheca Alexandrina (April 2002)
Duplicates in other continents proposed. “Best method of preservation is replication”.
Copyright? “May be a massive violation of copyright law”. (Lawrence Lessig, Stanford University expert on IP law in Cyberspace)
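A “robot exclusion” is normally expressed in a site's robots.txt file, which a well-behaved harvester checks before collecting a page. The sketch below shows such a check with Python's standard `urllib.robotparser`. The robots.txt content and the user-agent name `example-archiver` are made up for illustration and do not describe the Internet Archive's actual crawler.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one named archiver is excluded from the whole site,
# every other robot is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_archive(robots_txt, user_agent, url):
    """Return True if robots_txt allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With this file, `may_archive(ROBOTS_TXT, "example-archiver", "http://news.example/story.html")` is refused, which is how a newspaper or photographer could opt out of the archive.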

“Wayback machine”
Front end to the Internet Archive collection of public web pages
Includes most image files in the collection
Launched October 2001
Fully available to public
20,000 users/day; up to 200 queries per second
Not yet text searchable (URL search only)
Financial sustainability? (No advertising)

Conclusion
“We’re not here to test laws. We’re trying to build a world we want to live in. The world without a library is a world without memory, and that would be tragic.” B. Kahle, October
“On the Web, anyone can be a publisher; now there is a library for their work.” B. Kahle, May 2002