Metadata Extraction & Web Archives: Automating the Record Creation Process Abbie Grotke / Gina Jones

Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress, Office of Strategic Initiatives, Web Capture Team

Library of Congress Web Archives
Since 2000, 20+ thematic, event-based collections
100 TB+ of data collected
12,500+ URLs
http://www.loc.gov/lcwa

Web Archiving Tools
Crawling:
–Heritrix
–WARC
Access:
–Wayback Machine
–NutchWAX
International Internet Preservation Consortium (netpreserve.org)

LC’s Web Archive Workflow
Identify & select URLs (LS or LAW)
Determine crawl strategy, create a seed list for crawling (OSI)
Sites harvested by Internet Archive or in-house crawlers (OSI)
Quality Review (OSI & curators)
Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

Describing the Archives
Collection-level MARC record in OPAC
Item-level MODS records in LCWA
–One record per recommended URL for each distinct collection
With so many thousands of URLs to process, how do we streamline the process?

XML MODS Template

Metadata Extraction
For each URL that will be cataloged:
–Get archived web site metadata
–Combine with URL Nominations Database metadata
–If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
Using XML template, we add collection and record level metadata
Create a single file for delivery
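
The per-URL combination step on this slide could be sketched roughly as follows. This is only an illustration, not the Library's actual code: the function name, the dictionary layout keyed by URL, the field names, and the way a subject term is composed from candidate data are all assumptions made for the example.

```python
def build_record(url, archive_meta, nominations, candidates=None):
    """Merge per-URL metadata from the three sources into one dict.

    All three inputs are hypothetical dicts keyed by URL; the field
    names used here are assumptions, not the real database schema.
    """
    record = {"url": url}
    # Metadata taken from the archived web site itself.
    record.update(archive_meta.get(url, {}))
    # Metadata from the URL nominations database.
    record.update(nominations.get(url, {}))
    # Election/campaign sites only: derive an extra subject term
    # from the candidate data (the term format is an assumption).
    if candidates and url in candidates:
        c = candidates[url]
        term = f"{c['name']} ({c['party']}, {c['state']})"
        record["subjects"] = list(record.get("subjects", [])) + [term]
    return record


if __name__ == "__main__":
    url = "http://example.org/"
    archive = {url: {"title": "Example for Senate"}}
    noms = {url: {"language": "eng", "subjects": ["Elections"]}}
    cands = {url: {"name": "Doe, Jane", "party": "Independent", "state": "Ohio"}}
    print(build_record(url, archive, noms, cands))
```

Keying every source by URL keeps the merge a simple lookup; a real workflow would also need to decide which source wins when two of them supply the same field.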

Data Sources for Metadata Extraction

URL Nominations Database
URL
Access Rights
Language(s)
Category
Subject Terms

Election Candidate Metadata
Name
URL
Party Affiliation
State
Race
District (House)

Archived Web Site Metadata
From 1st capture:
–Document Title
–Keywords
–Abstract
–Mime Types
From Wayback index:
–Capture Dates (First & Last)
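
As a rough illustration of pulling these fields out of a captured page, the sketch below reads the title and the keywords and description meta tags from HTML, and picks the first and last capture dates from a list of timestamps. It is not the presenters' tool. Treating the description meta tag as the abstract is an assumption, as are all function and class names, and the timestamps are assumed to be 14-digit Wayback-style strings (YYYYMMDDhhmmss), which sort chronologically as plain text.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect <title> text and selected <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_head_metadata(html):
    """Return title, keywords, and abstract found in an HTML string."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        # Assumption: the description meta tag stands in for the abstract.
        "abstract": parser.meta.get("description", ""),
    }


def capture_range(timestamps):
    """Return (first, last) from 14-digit capture timestamps."""
    ordered = sorted(timestamps)
    return ordered[0], ordered[-1]
```

Mime types are left out of the sketch because they come from the capture's response records rather than from the page markup.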

Combined Data in Template
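
The slide's template itself is not reproduced in this transcript, so the following is only a guess at the general shape of the step: a small MODS skeleton is parsed and its empty elements are filled from one combined record. The skeleton, the record field names, and the function name are hypothetical. The element names and namespace are standard MODS, but the choice of elements is an assumption about what such a template might contain.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Hypothetical stand-in for the XML MODS template; not the LC original.
TEMPLATE = """<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo><mods:title/></mods:titleInfo>
  <mods:abstract/>
  <mods:originInfo>
    <mods:dateCaptured point="start"/>
    <mods:dateCaptured point="end"/>
  </mods:originInfo>
  <mods:location><mods:url/></mods:location>
</mods:mods>"""


def fill_template(record):
    """Fill the skeleton from one combined record and return XML text."""
    ns = {"mods": MODS_NS}
    root = ET.fromstring(TEMPLATE)
    root.find("mods:titleInfo/mods:title", ns).text = record["title"]
    root.find("mods:abstract", ns).text = record.get("abstract", "")
    origin = root.find("mods:originInfo", ns)
    origin.find("mods:dateCaptured[@point='start']", ns).text = record["first_capture"]
    origin.find("mods:dateCaptured[@point='end']", ns).text = record["last_capture"]
    root.find("mods:location/mods:url", ns).text = record["url"]
    # One <subject><topic> pair per subject term.
    for term in record.get("subjects", []):
        subject = ET.SubElement(root, f"{{{MODS_NS}}}subject")
        ET.SubElement(subject, f"{{{MODS_NS}}}topic").text = term
    return ET.tostring(root, encoding="unicode")
```

Running this once per cataloged URL and writing the results into one wrapper document would give the single delivery file mentioned on the Metadata Extraction slide.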
