Published by Jocelyn Stone. Modified over 9 years ago.
1
Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress, Office of Strategic Initiatives, Web Capture Team
2
Library of Congress Web Archives
Since 2000, 20+ thematic, event-based collections
100 TB+ of data collected
12,500+ URLs
http://www.loc.gov/lcwa
3
Web Archiving Tools
Crawling:
–Heritrix
–WARC
Access:
–Wayback Machine
–NutchWAX
International Internet Preservation Consortium: netpreserve.org
4
LC’s Web Archive Workflow
Identify & select URLs (LS or LAW)
Determine crawl strategy, create a seed list for crawling (OSI)
Sites harvested by Internet Archive or in-house crawlers (OSI)
Quality Review (OSI & curators)
Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction
5
Describing the Archives
Collection-level MARC record in OPAC
Item-level MODS records in LCWA
–One record per recommended URL for each distinct collection
With so many thousands of URLs to process, how do we streamline the process?
6
XML MODS Template
7
Metadata Extraction
For each URL that will be cataloged:
–Get archived web site metadata
–Combine with URL Nominations Database metadata
–For election/campaign web sites, also pull metadata from our candidate Access database (used to create subject terms)
Using the XML template, we add collection- and record-level metadata
Create a single file for delivery
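The steps on this slide can be sketched in Python as follows. This is only an illustrative sketch, not the Web Capture Team's actual tooling: the function names, the dict layouts for each data source, and the choice of MODS elements (`titleInfo`, `abstract`, `subject`, `location/url`, `relatedItem`) are assumptions made for the example. The real template and field mapping are not shown in the slides.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def merge_sources(url, archived, nominations, candidates=None):
    """Combine the per-URL metadata from each source into one dict."""
    record = {"url": url}
    record.update(archived.get(url, {}))      # e.g. title, abstract
    record.update(nominations.get(url, {}))   # e.g. subjects, language
    if candidates and url in candidates:      # election collections only
        record["candidate"] = candidates[url]
    return record


def _el(parent, tag, text=None, **attrs):
    """Append a namespaced MODS child element and return it."""
    node = ET.SubElement(parent, f"{{{MODS_NS}}}{tag}", attrs)
    if text is not None:
        node.text = text
    return node


def build_mods(record, collection_title):
    """Fill a small MODS record from a merged metadata dict (hypothetical mapping)."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    _el(_el(mods, "titleInfo"), "title", record.get("title", record["url"]))
    if record.get("abstract"):
        _el(mods, "abstract", record["abstract"])
    for term in record.get("subjects", []):
        _el(_el(mods, "subject"), "topic", term)
    if "candidate" in record:
        # Assumption: the candidate's name becomes a name subject.
        subject_name = _el(_el(mods, "subject"), "name")
        _el(subject_name, "namePart", record["candidate"]["name"])
    _el(_el(mods, "location"), "url", record["url"])
    # Collection-level metadata is identical for every record in a collection.
    related = _el(mods, "relatedItem", type="host")
    _el(_el(related, "titleInfo"), "title", collection_title)
    return mods


def build_delivery_file(urls, archived, nominations, candidates, collection_title):
    """Wrap every record in one modsCollection element and return it as a string."""
    collection = ET.Element(f"{{{MODS_NS}}}modsCollection")
    for url in urls:
        record = merge_sources(url, archived, nominations, candidates)
        collection.append(build_mods(record, collection_title))
    return ET.tostring(collection, encoding="unicode")
```

Calling `build_delivery_file` with a list of URLs and one dict per data source (each keyed by URL) returns a single XML string, which corresponds to the "single file for delivery" step.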
8
Data Sources for Metadata Extraction
9
URL Nominations Database
URL
Access Rights
Language(s)
Category
Subject Terms
10
Election Candidate Metadata
Name
URL
Party Affiliation
State
Race
District (House)
11
Archived Web Site Metadata
From 1st capture:
–Document Title
–Keywords
–Abstract
–MIME Types
From Wayback index:
–Capture Dates (First & Last)
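As a rough illustration of the fields listed on this slide, the sketch below pulls a title, keywords, and abstract out of a captured HTML page and derives first and last capture dates from a list of timestamps. It makes several assumptions the slides do not confirm: that keywords and the abstract come from `<meta name="keywords">` and `<meta name="description">` tags, and that capture timestamps are available as the 14-digit `YYYYMMDDhhmmss` strings used by the Wayback Machine. The class and function names are hypothetical, and MIME type collection is left out.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            # Lowercase the name so "Keywords" and "keywords" are treated alike.
            self.meta[attrs["name"].lower()] = attrs["content"].strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract found in the page markup."""
    parser = HeadMetadataParser()
    parser.feed(html)
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "keywords": keywords,
        # Assumption: the meta description stands in for the abstract.
        "abstract": parser.meta.get("description", ""),
    }


def capture_date_range(timestamps):
    """Return (first, last) capture dates as ISO strings from 14-digit timestamps."""
    parsed = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return parsed[0].date().isoformat(), parsed[-1].date().isoformat()
```

Pages that lack these tags simply yield empty values, so a cataloger would still need to review the resulting records.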
12
Combined Data in Template