Web and Twitter Archiving at the Library of Congress Nicholas Taylor Web Archiving Team Library of Congress Web Archive Globalization.

Presentation transcript:

Web and Twitter Archiving at the Library of Congress
Nicholas Taylor, Web Archiving Team, Library of Congress
Web Archive Globalization Workshop, June 16, 2011

LIBRARY OF CONGRESS
why archive the web?
preserve our nation’s history and culture
identify and preserve at-risk digital content
develop tools, models, and methods for digital preservation
“World Wide Web 1997: 2 Terabytes in 63 Inches”

various collection strategies
entire web domain: Internet Archive
national domain: Sweden, Denmark, others
selective (individual URLs) and thematic: Australia
thematic or event-based: Library of Congress

web archiving at LC
began in 2000 with MINERVA pilot
identify policy issues, establish best practices, build tools (internally and w/ partners)
broaden expertise and understanding of Web Archiving within LC
collect, manage and sustain at-risk digital content
LC Prints & Photographs: design for Minerva Congressional Library

Curators/Recommending Officers
In Library Services, Congressional Research Service, and the Law Library. They pick the collections and the URLs to archive, and research whom to contact for permission.
Bibliographic Access
MODS records are created in Library Services: staff of the Network Development and MARC Standards Office (NetDev) and of Acquisitions and Bibliographic Access (ABA) do the cataloging.
Web Archiving Team and Web Preservation Engineering Team
In the Office of Strategic Initiatives (OSI). We are project managers and technical staff focused on capture, tools, and permissions.
Information Technology Office and Technical Architecture Team
Also in OSI. They support the Wayback Machine, Heritrix, repository and tools development, and data transfers. Contractors are also used in this area.

LC collections: over 245 TB + 5 TB/month
–ongoing collections, including: Congress/Legislative websites; Legal Blawgs; Public Policy Topics
–event-based collections, including: U.S. National Elections (2000, 2002, 2004, 2006, 2008, 2010); Iraq War; September and September 11 Remembrance 2002; Civil War Sesquicentennial; Olympics 2002; Supreme Court Nominations; Papal Transition; Case Studies: health care, terrorism, visual image content, organizational Web sites, Crisis in Darfur, “single site”
–Overseas Operations collections, including: Egypt 2008; Brazilian, Indian, Indonesian, Philippine, and Thai Elections; Afghanistan Government; Pakistan Nationalisms

web archives access: loc.gov/lcwa

essential tools
capture: Heritrix (contract crawling w/ IA and in-house)
replay: Wayback
permissions/seed management; capture quality review; reporting; transfer tracking: custom apps built on LAMP stack
transfer: BagIt Library (based on BagIt spec); *nix ingest/staging/storage/access servers; Internet2 connection
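The transfer step relies on the BagIt spec, whose core idea is a payload directory (`data/`) plus a manifest of checksums that the receiving side can re-verify. The following is only a minimal sketch of that idea in Python, not the BagIt Library named on the slide; the function names, the choice of SHA-256, and the bag layout on disk are assumptions made for illustration, and a complete bag would also need the `bagit.txt` declaration file.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir: Path) -> Path:
    """Write manifest-sha256.txt listing every payload file under data/."""
    payload = sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())
    lines = [
        f"{sha256_of(p)}  {p.relative_to(bag_dir).as_posix()}" for p in payload
    ]
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest


def verify_manifest(bag_dir: Path) -> List[str]:
    """Return payload paths that are missing or whose checksum changed."""
    problems = []
    text = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in text.splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

A sender would call `write_manifest` before transfer and the receiver would call `verify_manifest` afterwards; an empty list means every payload file arrived intact.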

other useful tools
web archiving workflow management: NetArchiveSuite, Web Curator Tool
small-scale web archiving: HTTrack
Firefox add-ons: Firebug, Web Developer

cataloging for access
collection-level metadata
site-level bibliographic metadata
–nominators provide subject heading
–HTML metadata extraction via cURL
–cataloger assigns keywords
cataloging metadata stored in MODS
assisted keyword assignment: HIVE
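The HTML metadata extraction step can be pictured as pulling the `<title>` and `<meta name=... content=...>` values out of a fetched page so that a cataloger starts from a pre-filled record. The sketch below is an illustration only: it assumes the page has already been retrieved (the slide mentions cURL for that) and uses Python's standard `html.parser`; the class and function names are hypothetical and do not describe the Library's actual tooling.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect the page title and named <meta> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                # Normalise names such as "Keywords" and "keywords".
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text):
    """Return {'title': str, 'meta': dict} for an HTML string."""
    parser = MetadataExtractor()
    parser.feed(html_text)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}
```

The resulting dictionary could then seed candidate title, description, and keyword fields for a cataloger to review.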

collection-level record example

MODS bibliographic record example

Wayback Machine Resource Page

example of an archived site

search and discovery
bibliographic metadata search
(not yet) Memento-enabled
full-text search based on NutchWAX unfeasible
Lucene/Solr looks promising

Challenges for WEB ARCHIVES

challenges for web archives
technical
–large, deep, dynamic, interlinked
–continuous transformation, simultaneously growing and disappearing
intellectual property laws and regulations
–legal deposit laws, mandates for preservation, laws that do not address web content
economic environment
–few good business models for sustaining web collections
social environment
–who is responsible and how is responsibility shared?

capture, replay, and preservation
capturing websites: Heritrix
–“form-fronted” databases (i.e., “deep web”)
–URLs the crawler can’t see that we want
–…and URLs the crawler can see that we don’t
–web 2.0 and other “new” web technologies
replaying archived versions: Wayback
–non-rigorous website coding
–live site “leakage”
–significant interactivity may be lost
preserving access to our archives
–billions of files
–thousands of file types
–how do we ensure content is accessible in 10, 25, 50 or more years?

scaling capacity
budgetary pressures
limited access server disk space
–competing w/ other big data projects
new infrastructure for new capabilities
photo by Henrik Bennetsen under CC BY-SA 2.0

when the only tool you have is a library…
LC Prints & Photographs: exterior view of the LC Jefferson Building

…many things look like collections
archive behaves more like discrete records than web
–archived sites not contiguously navigable
–data doesn’t readily allow for downloading
Twitter archive may prompt re-thinking web archive data access

web archiving and U.S. copyright law
legal deposit requirement only applies to “published works” (§ 407)
§ 108 of the Copyright Act provides library exceptions
–doesn’t address digital preservation and web archiving
photo by Gabriel de Urioste under CC BY 2.0

why not rely on robots.txt?
unreliable proxy for copyright permissions
archival crawler ≠ search crawler
LC disregards robots.txt but leaves contact info
last.fm: robots.txt
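To see why robots.txt is a poor stand-in for permission, it helps to look at what the file actually encodes: per-crawler path rules, typically written with search engines in mind. The sketch below uses Python's standard `urllib.robotparser` on a made-up robots.txt; the user-agent names and URLs are hypothetical, and the code shows what a robots.txt-honoring crawler would conclude, not how the Library's crawls are configured.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is kept out of /private/,
# except a named search crawler that is allowed everywhere.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: example-search-bot
Disallow:
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under these rules an archival crawler identifying itself as, say, `example-archive-bot` would be refused `/private/` pages that the search crawler may fetch, even though the file says nothing about the site owner's view on preservation or copyright.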

capture and access permissions
permissions-based approach began in 2002
permission plans for each collection developed w/ counsel
permission requirements depend on site type
more liberal about capture than about offsite access
capture / access offsite, by site type:
government: no notice
advocacy/policy: notice, permission
news: permission

implications of opt-in permissions
no response treated as denial
–very few denials
–many non-responsive
case study: September 11, 2001
–2300 cataloged, uncataloged URLs
–many news sites (“high risk” permissions category)
–no permissions sought
–very few takedown requests

the future of permissions
risk of more liberal approach appears low
hope to move to more notice-based, opt-out policy
may affect previously-captured sites as well
photo by RJ Sangosti, Denver Post under © (fair use)

Challenges for the TWITTER ARCHIVE

why archive Twitter?
historical record of communication, news reporting, and social trends
complements collections and mission

Twitter Archive FAQs
currently receiving Tweets through Gnip
includes only the public archive
–deletions will propagate to archive
access limitations
–6 month embargo on new Tweets
–no bulk distribution
downstream users
–no commercial use
–no substantial re-distribution

a (literally) growing challenge
~3 years: time it took from 1st Tweet to billionth
1 week: time it now takes users to send a billion Tweets
average Tweets/day in 3/10: 50 million
average Tweets/day in 3/11: 149 million
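The figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above; the constant and function names are illustrative, and treating "3 years" as 3 × 365 days is an assumption.

```python
TWEETS_PER_DAY_MARCH_2010 = 50_000_000
TWEETS_PER_DAY_MARCH_2011 = 149_000_000
ONE_BILLION = 1_000_000_000


def growth_factor(earlier, later):
    """Ratio of the later daily volume to the earlier one."""
    return later / earlier


def average_per_day(total, days):
    """Average daily volume for a total spread evenly over a number of days."""
    return total / days


# Year-over-year growth in average daily volume: 149 / 50 = 2.98x.
yearly_growth = growth_factor(TWEETS_PER_DAY_MARCH_2010, TWEETS_PER_DAY_MARCH_2011)

# A billion Tweets per week is roughly 143 million per day, in line with
# the 149 million/day figure for March 2011.
per_day_now = average_per_day(ONE_BILLION, 7)

# A billion Tweets over about three years averages under a million per day.
per_day_first_three_years = average_per_day(ONE_BILLION, 3 * 365)
```

In other words, daily volume roughly tripled in a year, and the current weekly volume matches what the service took about three years to accumulate at first.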

tools we’re exploring
Hadoop
ElasticSearch
Elephant-bird
HBase
Hive
Pig
photo by The.Rohit under CC BY-NC 2.0

questions to consider
how does archive fit in w/ existing collections?
how are agreement guidelines interpreted and implemented technically?
what kind(s) of access can we provide?
what context do we provide for content?

additional goals
justify value to Congress and public
understand and respond to researcher needs
push the institution beyond existing curatorial models

for more information
Library of Congress Web Archiving Program:
Library of Congress Web Archives:
National Digital Information Infrastructure and Preservation Program:
Library of Congress – Twitter FAQ: twitter-an-faq/
Section 108 Study Group:

for more information
“Legal Issues in Building Social Media Collections:”
“How the Library of Congress is building the Twitter archive:” http://radar.oreilly.com/2011/06/library-of-congress-twitter-archive.html
“Web Archives: The Future(s):”

questions? Nicholas