The Web is a Mess: or How I Learned to Stop Worrying and Love Web Archiving
Lori Donovan, Internet Archive
About Internet Archive
We are a Digital Library
Mission Statement: Universal access to all knowledge
o Founded by Brewster Kahle in San Francisco, California in 1996
o Largest publicly available web archive in existence
o Officially designated a Library by the State of California in 2007
What is Web Archiving?
The goal of web archiving is to document changes to web resources over time, archive them, and make them accessible.
What is a Web Archive?
A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address. A web archive contains as much as possible from the original resources. It is a priority to recreate the same experience a user would have had if they had visited the live site.
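As a rough illustration of these two definitions, the sketch below models a collection as dated captures grouped by URL, and reports when a page's content changed between captures. The class names, the timestamp format, and the use of a SHA-1 digest to compare captures are all assumptions made for this example; they are not a description of how Internet Archive or Archive-It store data.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Capture:
    """One archived copy of a URL (hypothetical structure for illustration)."""
    url: str
    timestamp: str  # assumed format YYYYMMDDhhmmss, which sorts chronologically as text
    digest: str     # SHA-1 hex digest of the captured bytes


@dataclass
class Collection:
    """A themed group of archived URLs, each with its captures over time."""
    theme: str
    captures: dict = field(default_factory=dict)  # url -> list of Capture

    def add(self, url: str, timestamp: str, body: bytes) -> None:
        """Record a capture of `url` taken at `timestamp`."""
        digest = hashlib.sha1(body).hexdigest()
        self.captures.setdefault(url, []).append(Capture(url, timestamp, digest))

    def changes(self, url: str) -> list:
        """Timestamps where content differs from the previous capture.

        The first capture of a URL always counts as a change.
        """
        changed, previous = [], None
        for cap in sorted(self.captures.get(url, []), key=lambda c: c.timestamp):
            if cap.digest != previous:
                changed.append(cap.timestamp)
            previous = cap.digest
        return changed


# Usage: three captures of one page, the last with different content.
collection = Collection(theme="Example collection")
collection.add("http://example.org/", "20080101000000", b"<html>old</html>")
collection.add("http://example.org/", "20090101000000", b"<html>old</html>")
collection.add("http://example.org/", "20100101000000", b"<html>new</html>")
print(collection.changes("http://example.org/"))
```

Comparing digests only detects that the bytes changed, not what changed; a real archive would also keep the captured content itself so the page can be replayed.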
Why Web Archiving?
o Billions of people around the world have grown accustomed to using the web as their primary resource to acquire information.
o The web is a crucial part of our culture and our social fabric, and we don’t want to lose any of it, so it is essential that we collect and preserve these digital resources and make them accessible in creative ways.
o The availability of this digital information is taken for granted, and it is a fallacy that if something is on the web it will be there forever.
Limited lifespan of a webpage
It is a fairly common misconception that content that exists on the web will remain there forever.
o A report in Scientific American puts the lifespan of a webpage at 44 days.
o A subsequent academic study in IEEE suggests 75 days.
o A Washington Post article indicates the number is 100 days.
Over 95% of government information today is born-digital, but less than 50% is being maintained with an active preservation plan. (State of the Federal Web Report)
Historically important events for researchers and scholars Much of the record of any historic event in today’s world is “born digital.” And many items born in print are also available in digital form, or soon will be. To understand major world events—not only disasters but political upheavals—and to keep a record and a memory of them for survivors, for scholars, for policy-makers, and for a wider public, it is simply essential that we collect and preserve these digital resources and make them accessible in creative ways. Andrew Gordon, Harvard University.
It’s a requirement.
o Records Retention policy: several state and federal laws or policies require universities to maintain various statistics and reports.
o Responsibility: preserve things like course information, course roster information, and policies, documents now showing up only as digital content.
The Role of Libraries
o Libraries and archives have long collected information that serves scholars and the general public in understanding history, culture, and society.
o So much of today's information is easily (and only) found on the World Wide Web: web pages have replaced hard copy records and documents, blogs are today's diaries, and newspapers and socio-political commentary exist solely online.
o As part of an effort to appropriately document and capture today's information for tomorrow's use, institutions must adopt a web archiving strategy.
o However, for many institutions, the prospect of capturing and storing web pages, websites, or entire web domains is daunting.
About Archive-It
o First deployed in February 2006
o Web-based application allowing users to create, manage, and preserve collections of digital content
o Includes tools for selection and scoping, harvesting, cataloging with metadata, full-text search, and QA
o Ability to capture content using 10 different crawl frequencies
o Archived content includes: HTML, videos, audio, PDF, images, social networking sites, online newspapers
o View archived content within 24 hours after a capture is complete
o Annual subscription service; includes hosting, access, and storage (primary and back-up)
Who Uses Archive-It?
205 partners around the world in 43 U.S. states and 15 countries
How Partners Use Archive-It
Archive-It Use Cases
o Essential part of a mandate to capture and preserve institutional memory and history. Construct an historical record of an institution’s web presence over time.
o Capture state/local agency publications that aren’t being deposited in print form. Collect and aggregate state/local government websites and presence.
o Capture websites that relate to historical/traditional collections and link them with existing collections around the same thematic focus.
o Create a thematic/topical web archive on a specific subject or event, including different perspectives and social commentary (tweets, blogs, comments). Gather thematically related resources of value to researchers and scholars.
o Support an electronic records system to meet records retention requirements.
o Closure crawls
Stanford University/New York University Islamic & Middle Eastern Collection
Purpose: harvest and preserve Iranian blogs
o Archiving 300+ blogs written by and for Iran and the Iranian people
o Includes coverage of 2009 Iranian elections and the current Middle East unrest
University of Texas at Austin: LAGDA
Purpose: Archive documents from 18 different countries and 300 government ministries/presidencies.
Content includes:
o Full-text versions of official documents
o Original video and audio recordings of key regional leaders
o Thousands of annual and "state of the nation" reports
o Specific collections for Latin American elections and political parties
University of Texas at Austin: LANIC Honduras Presidential site 2008 (before the Coup)
University of Texas at Austin: LANIC Honduras Presidential site 2009 (during the Coup)
University of Texas at Austin: LANIC Honduras Presidential site (after the Coup)
Electronic Literature Organization
Purpose: archive born-digital literature, works created explicitly for the computer.
o ELO seeks to foster and promote the reading, writing, teaching, and understanding of literature as it develops in a digital environment
o Content includes: individual works, collections and journals, poems and stories
Indiana University
Purpose: archive all university records to maintain strong electronic records systems
o Main university website, 8 different campus websites, and other organizations on campus
o University culture, teacher blogs, student groups, and online publications
Indiana University Main University website
Columbia University
Purposes:
o Archive copies of its university web presence in order to meet required mandates
o Archive websites on thematic/topical subjects
Columbia University Human Rights Collection
Columbia University Avery Architectural & Fine Arts Library
Columbia University Archives Collection
North Carolina State Archives & State Library of North Carolina
Purpose: archive state agency websites and publications
o Includes pages in a variety of formats: text, images, audio, video, and social networking sites
Access to Collections
Partners:
o Can view through private web application with login/password
General Public:
o Can view from Archive-It website
o Can view from organization’s website from a landing page that links back to Archive-It hosted data
o Host from organization’s own servers
Restricted and private access options are available.
What’s next for Archive-It
Collaboration and Partnerships
Web application development
o Continue to develop features and functionalities requested by partners
o Enhance our preservation policy/access model
o Integrate our data with partners’ external services, systems, and catalogs
Thank you!
Lori Donovan, Partner Specialist
Questions?