Published by Avice Barnett. Modified over 9 years ago.
1
Welcome/Agenda
9:00am – Welcome/Setting the Agenda for the Day
9:10am - 10:30am – Challenges of the Web Now & in the Future; Response to these Challenges
10:30am – BREAK
11:00am - 12:30pm – Intro to Metadata Extraction, Data Mining & the Web Archiving Lifecycle
12:30pm – LUNCH
1:30pm - 3pm – Data Mining Breakout Sessions/Deep Dives
3pm – BREAK
3:15pm - 4:30pm – Data Mining Breakout Sessions/Deep Dives
4:30pm – Wrap-up & Next Steps
IIPC GA Meeting, Ljubljana, Slovenia, April 26, 2013
2
Data Mining & Web Archiving ‘Lifecycles’
Kris Carpenter Negulescu
Internet Archive
3
Use Cases
Election 2012 Collaborative
NLNZ 2013 Domain and GOV Collections
Wide00002/00005 Crawls
  http://home.us.archive.org/~vinay/wide/wide-00002.html
  http://home.us.archive.org/~vinay/wide/wide-00005.html
  https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl
IIPC General Assembly, The Hague, May 9, 2011
4
Traditional “Crawl” Lifecycles
Crawl Seeded & Scoped
Data Harvested and De-duplicated
WARCs Written and Ingested
WARCs Processed and Indexed
Data & Services Audited and Monitored
Collection QA’d, Reports Generated, Access Services Deployed/Updated
Diagram labels: CDXs/WATs, WARCs, Lucene Shards
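The "De-duplicated" stage can be illustrated with a minimal sketch. It assumes duplicates are detected by hashing the payload bytes of each capture; the slides do not say how de-duplication is actually performed, and `dedupe_captures` and the sample data are hypothetical.

```python
import hashlib

def dedupe_captures(captures):
    """Split (url, payload_bytes) pairs into first-seen and duplicate captures.

    Assumption: two captures are duplicates when their payloads share the
    same SHA-1 digest. The slides do not specify the real method.
    """
    seen = {}                  # digest -> URL of the first capture with that payload
    unique, duplicates = [], []
    for url, payload in captures:
        digest = hashlib.sha1(payload).hexdigest()
        if digest in seen:
            # Record the duplicate together with the capture it repeats.
            duplicates.append((url, seen[digest]))
        else:
            seen[digest] = url
            unique.append(url)
    return unique, duplicates

# Hypothetical sample data: the third capture repeats the first payload.
sample = [
    ("http://example.org/a", b"<html>same</html>"),
    ("http://example.org/b", b"<html>other</html>"),
    ("http://example.org/a?x=1", b"<html>same</html>"),
]
unique, duplicates = dedupe_captures(sample)
print(unique)
print(duplicates)
```

Keeping a pointer from each duplicate back to the first capture means the repeated payload need not be stored twice, which is the general idea behind de-duplicating harvested data.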
5
Analyzing Scope & Quality
Target Resource “Analysis”
Live Snapshot Generation
WAT Generation
“Browser” Log Analysis
“Crawl” Log Analysis
Filtering APIs/Feeds
Embed & Out-link Analyses
Web Graph Generation
In-link Analyses/Ranking
Anchor, Description, Full Text Indexing & Mining
Seeding a Crawler Frontier & Alternate Capture Mechanisms
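Web graph generation and in-link ranking can be sketched as follows. This is only an illustration: it assumes links are available as (source URL, target URL) pairs, collapses them to host level, and ranks hosts by the number of distinct hosts linking to them. The function names and sample links are hypothetical, and the slides do not describe the ranking actually used.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def host_graph(links):
    """Build a host-level graph from (source_url, target_url) pairs."""
    graph = defaultdict(set)   # source host -> set of target hosts
    for src, dst in links:
        s, d = urlsplit(src).hostname, urlsplit(dst).hostname
        if s and d and s != d:  # skip unparseable URLs and same-host links
            graph[s].add(d)
    return graph

def rank_by_inlinks(graph):
    """Rank hosts by how many distinct hosts link to them (highest first)."""
    counts = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    # Sort by descending count, then by host name for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical sample links.
links = [
    ("http://a.example/1", "http://b.example/"),
    ("http://a.example/2", "http://c.example/"),
    ("http://b.example/", "http://c.example/x"),
    ("http://c.example/", "http://c.example/y"),   # same host, ignored
    ("http://d.example/", "http://c.example/"),
]
ranking = rank_by_inlinks(host_graph(links))
print(ranking)
```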
6
Preparing to Collect/Scoping/Framing a Crawl/Collection
Pre “Crawl” Workflows
Target identification (beyond curatorial selection…)
Automated filtering of data sources by topic, Geo IP, file format, robots policy or other criteria
Out-link analyses and ranking from selected sources; in-link analyses
Mining anchor text/page descriptions/title tags (if not full text)
“Test” capture analyses (…routing to proper capture mechanisms)
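Automated filtering by file format and robots policy might look like the sketch below. It uses Python's standard `urllib.robotparser` and assumes, for simplicity, that all candidate URLs live on one host whose robots.txt text is already in hand. `filter_candidates`, the user-agent string, and the sample data are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def filter_candidates(urls, robots_txt, user_agent, skip_extensions):
    """Keep URLs that pass a file-format filter and a robots.txt check.

    Assumption: every URL belongs to the host that served robots_txt.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    kept = []
    for url in urls:
        # File-format filter based on the extension of the URL path.
        ext = posixpath.splitext(urlsplit(url).path)[1].lower()
        if ext in skip_extensions:
            continue
        # Robots-policy filter.
        if not robots.can_fetch(user_agent, url):
            continue
        kept.append(url)
    return kept

# Hypothetical robots.txt and candidate list.
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.org/index.html",
    "http://example.org/private/report.html",
    "http://example.org/data/archive.zip",
    "http://example.org/docs/paper.pdf",
]
kept = filter_candidates(candidates, ROBOTS_TXT, "test-crawler", {".zip"})
print(kept)
```

Other criteria named on the slide (topic, Geo IP) would slot in as further checks inside the same loop.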
7
Your Browser: Behind the Scenes
9
Extracted Metadata & Links (WAT)
WAT is WARC ☺
WAT records are WARC metadata records
WARC-Refers-To header identifies original WARC record
WAT payload is JSON
Can be combined with curator-generated metadata
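The record layout described above (WARC-style headers, a `WARC-Refers-To` pointer, a JSON payload) can be illustrated with a small parser. This is a sketch, not a WARC library: the sample record is simplified (it omits `Content-Length` and other headers a real record carries), and the payload fields `title` and `links` are hypothetical placeholders rather than the actual WAT schema.

```python
import json

# Simplified, hypothetical metadata record: a version line, named headers,
# a blank line, then a JSON payload.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: metadata\r\n"
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"title": "Example page", "links": ["http://example.org/about"]}'
)

def parse_record(text):
    """Split one record into (version, headers dict, decoded JSON payload)."""
    head, _, body = text.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; header values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers, json.loads(body)

version, headers, payload = parse_record(SAMPLE_RECORD)
print(headers["WARC-Type"])        # record type
print(headers["WARC-Refers-To"])   # ID of the original WARC record
print(payload["links"])            # extracted links from the JSON payload
```

Because the payload is plain JSON, curator-generated metadata could be merged in as additional keys before the record is indexed.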
10
Monitoring/Enhancing/Confirming Capture
Comparing Live Resources to Files Written
Evaluating Completeness (at all levels)
Generating Snapshots of Live and Archived Resources
Eliminating Spam/Detecting Scoping Mistakes & Issues
Mining Crawl Logs (HIVE)
Mining Browser Logs
Mining/Analyzing Links
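Comparing live resources to files written can be reduced, in its simplest form, to a set comparison. The sketch below assumes two URL lists are available: the resources a live page requested (for example from a browser log) and the URLs actually captured. The `completeness` function and the sample URLs are hypothetical, and real completeness evaluation "at all levels" would involve more than URL matching.

```python
def completeness(expected_urls, captured_urls):
    """Return (fraction of expected URLs captured, sorted list of missing URLs)."""
    expected = set(expected_urls)
    captured = set(captured_urls)
    missing = sorted(expected - captured)
    # An empty expectation list is treated as fully complete.
    ratio = len(expected & captured) / len(expected) if expected else 1.0
    return ratio, missing

# Hypothetical data: a page plus three embeds, one of which was not captured.
expected = [
    "http://example.org/",
    "http://example.org/style.css",
    "http://example.org/app.js",
    "http://example.org/logo.png",
]
captured = [
    "http://example.org/",
    "http://example.org/style.css",
    "http://example.org/logo.png",
]
ratio, missing = completeness(expected, captured)
print(ratio, missing)
```

The list of missing URLs is the useful output in practice, since it can be fed back to the crawler or routed to an alternate capture mechanism.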
11
Characterizing/Documenting/Preserving Captures & Collections
12
Enabling Access & Research
Host profiles
Link Graphs, Tag Clouds, & Visualizations
  Collection based:
    http://home.us.archive.org/~vinay/eot08-explore-data.html
  Archive wide:
    http://home.us.archive.org/~vinay/global/1995-2011/stats.html
    http://home.us.archive.org/~vinay/tld.html
Site/Page Evolution
  http://archive.org/details/TheNewYorkTimesTimelapse1996-2010
Portal Browse/Search
  http://eotarchive.cdlib.org/
Research Use/Access
  History Tracker (Weber/Lazer)
  ARCLink (AlSum/Nelson)
14
HistoryTracker Tool
Beta Version!
PIG Scripts in Hadoop Environment
RU High-Speed Computing Cluster
Link Lists
Curated Data Sets