Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / Gina Jones


1 Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress, Office of Strategic Initiatives, Web Capture Team

2 Library of Congress Web Archives
Since 2000, 20+ thematic, event-based collections
100 TB+ of data collected
12,500+ URLs
http://www.loc.gov/lcwa

3 Web Archiving Tools
Crawling:
  – Heritrix
  – WARC
Access:
  – Wayback Machine
  – NutchWAX
International Internet Preservation Consortium: netpreserve.org

4 LC’s Web Archive Workflow
Identify & select URLs (LS or LAW)
Determine crawl strategy, create a seed list for crawling (OSI)
Sites harvested by Internet Archive or in-house crawlers (OSI)
Quality Review (OSI & curators)
Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

5 Describing the Archives
Collection-level MARC record in OPAC
Item-level MODS records in LCWA
  – One record per recommended URL for each distinct collection
With so many thousands of URLs to process, how do we streamline the process?

6 XML MODS Template

7 Metadata Extraction
For each URL that will be cataloged:
  – Get archived web site metadata
  – Combine with URL Nominations Database metadata
  – If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
Using XML template, we add collection and record level metadata
Create a single file for delivery
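
The per-URL combination step on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: the slides do not show the actual scripts, so the function name `build_record`, the dictionary keys (`subjects`, `name`, `party`, `state`), and the subject-term pattern for candidates are all hypothetical.

```python
def build_record(url, archived, nominations, candidates=None):
    """Merge per-URL metadata from the three sources into one flat record.

    Each argument after `url` is a dict keyed by URL (an assumed layout).
    """
    record = {"url": url}
    # Archived web site metadata: title, keywords, abstract, capture dates.
    record.update(archived.get(url, {}))
    # URL Nominations Database: access rights, languages, category, subjects.
    record.update(nominations.get(url, {}))

    subjects = list(record.get("subjects", []))
    candidate = (candidates or {}).get(url)
    if candidate:
        # Hypothetical subject-term pattern; the real rules are not in the slides.
        subjects.append(
            f"{candidate['name']} ({candidate['party']}, {candidate['state']})"
        )
    record["subjects"] = subjects
    return record
```

For a non-election collection, `candidates` would simply be omitted and the record would carry only the subject terms from the nominations data.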

8 Data Sources for Metadata Extraction

9 URL Nominations Database
URL
Access Rights
Language(s)
Category
Subject Terms

10 Election Candidate Metadata
Name
URL
Party Affiliation
State
Race
District (House)

11 Archived Web Site Metadata
From 1st capture:
  – Document Title
  – Keywords
  – Abstract
  – Mime Types
From Wayback index:
  – Capture Dates (First & Last)
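
A minimal sketch of how these fields might be pulled from a captured page, using only the Python standard library. The slides do not say how the extraction is implemented, so several things here are assumptions: that the abstract comes from the HTML `description` meta tag, that capture timestamps are 14-digit `YYYYMMDDhhmmss` strings, and the names `HeadMetadataParser`, `extract_page_metadata`, and `capture_date_range`. MIME types are left out.

```python
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract for one captured page."""
    parser = HeadMetadataParser()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "keywords": parser.meta.get("keywords", ""),
        # Assumption: the abstract is taken from the "description" meta tag.
        "abstract": parser.meta.get("description", ""),
    }


def capture_date_range(timestamps):
    """Return (first, last) capture timestamps, or (None, None) if empty.

    Assumes fixed-width YYYYMMDDhhmmss strings, which sort chronologically.
    """
    if not timestamps:
        return None, None
    ordered = sorted(timestamps)
    return ordered[0], ordered[-1]
```

Pages with no usable title or meta tags yield empty strings, which a cataloger would presumably fill in by hand during review.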

12 Combined Data in Template
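
The template shown on the original slide is not reproduced in this transcript. As a rough illustration of what a combined item-level record could look like, the sketch below builds a small MODS fragment with `xml.etree.ElementTree`. The element names (`titleInfo`, `abstract`, `originInfo`/`dateCaptured`, `subject`/`topic`, `location`/`url`) are standard MODS, but the choice of fields, the record keys, and the function `record_to_mods` are assumptions, and the code builds the record from scratch rather than filling LC's actual template.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Qualify a tag name with the MODS namespace."""
    return f"{{{MODS_NS}}}{tag}"


def record_to_mods(record):
    """Build a MODS element from a flat metadata dict (hypothetical keys)."""
    mods = ET.Element(q("mods"))

    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = record["title"]

    if record.get("abstract"):
        ET.SubElement(mods, q("abstract")).text = record["abstract"]

    # First and last capture dates become the start and end of a date range.
    origin = ET.SubElement(mods, q("originInfo"))
    for point, key in (("start", "first_capture"), ("end", "last_capture")):
        if record.get(key):
            date = ET.SubElement(origin, q("dateCaptured"), point=point)
            date.text = record[key]

    for term in record.get("subjects", []):
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term

    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = record["url"]
    return mods
```

Calling `ET.tostring(record_to_mods(record), encoding="unicode")` would serialize one record; many such records could then be wrapped in a single file for delivery.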


