Welcome/Agenda
  9:00am – Welcome/Setting the Agenda for the Day
  9:10am - 10:30am – Challenges of the Web Now & in the Future; Response to these Challenges
  10:30am – BREAK
  11:00am - 12:30pm – Intro to Metadata Extraction, Data Mining & the Web Archiving Lifecycle
  12:30pm – LUNCH
  1:30pm - 3pm – Data Mining Breakout Sessions/Deep Dives
  3pm – BREAK
  3:15pm - 4:30pm – Data Mining Breakout Sessions/Deep Dives
  4:30pm – Wrap-up & Next Steps
IIPC GA Meeting, Ljubljana, Slovenia, April 26, 2013
Data Mining & Web Archiving ‘Lifecycles’
Kris Carpenter Negulescu, Internet Archive
Use Cases
  Election 2012 Collaborative
  NLNZ 2013 Domain and GOV Collections
  Wide00002/00005 Crawls
  http//home.us.archive.org/~vinay/wide/wide html
IIPC General Assembly, The Hague, May 9,
Traditional “Crawl” Lifecycles
  Crawl seeded & scoped
  Data harvested and de-duplicated
  WARCs written and ingested
  WARCs processed and indexed
  Data & services audited and monitored
  Collection QA’d, reports generated, access services deployed/updated
  Artifacts: WARCs, CDXs/WATs, Lucene shards
Analyzing Scope & Quality
  Target Resource “Analysis”
  Live Snapshot Generation
  WAT Generation
  “Browser” Log Analysis
  “Crawl” Log Analysis
  Filtering APIs/Feeds
  Embed & Out-link Analyses
  Web Graph Generation
  In-link Analyses/Ranking
  Anchor, Description, Full Text Indexing & Mining
  Seeding a Crawler Frontier & Alternate Capture Mechanisms
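As a rough illustration of in-link analysis and ranking over a web graph, the sketch below counts how many distinct hosts link to each target host. The edge list, the `inlink_counts` name, and the choice of host-level counting are all assumptions made for the example, not the pipeline used for these collections.

```python
from collections import Counter
from urllib.parse import urlsplit


def inlink_counts(edges):
    """Count distinct source hosts linking to each target host.

    `edges` is an iterable of (source_url, target_url) pairs, for example
    pairs pulled from extracted link metadata. Self-links are ignored.
    """
    seen = set()
    counts = Counter()
    for src, dst in edges:
        src_host = urlsplit(src).hostname
        dst_host = urlsplit(dst).hostname
        if not src_host or not dst_host or src_host == dst_host:
            continue
        if (src_host, dst_host) not in seen:
            seen.add((src_host, dst_host))
            counts[dst_host] += 1
    return counts


# Hypothetical edge list for illustration only.
edges = [
    ("http://a.example/1", "http://c.example/"),
    ("http://a.example/2", "http://c.example/x"),
    ("http://b.example/", "http://c.example/"),
    ("http://b.example/", "http://a.example/"),
    ("http://c.example/", "http://c.example/about"),
]
ranked = inlink_counts(edges).most_common()
```

Counting distinct linking hosts rather than raw links keeps a single site with many pages from dominating the ranking; other weightings are equally plausible.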
Preparing to Collect/Scoping/Framing a Crawl/Collection
Pre “Crawl” Workflows
  Target identification (beyond curatorial selection…)
  Automated filtering of data sources by topic, Geo IP, file format, robots policy or other criteria
  Out-link analyses and ranking from selected sources, in-link analyses
  Mining anchor text/page descriptions/title tags (if not full text)
  “Test” capture analyses (…routing to proper capture mechanisms)
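One of the filtering criteria above, robots policy, can be sketched with Python's standard `urllib.robotparser`. The `allowed_by_robots` helper, the user-agent string, and the sample robots.txt text are hypothetical; a real workflow would fetch each host's robots.txt and combine this check with the other criteria.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines,
    # so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt and candidate targets, for illustration only.
robots = """User-agent: *
Disallow: /private/
"""
candidates = [
    "http://example.org/index.html",
    "http://example.org/private/report.pdf",
]
kept = allowed_by_robots(robots, "example-crawler", candidates)
```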
Your Browser: Behind the Scenes
Extracted Metadata & Links (WAT)
  WAT is WARC ☺
  WAT records are WARC metadata records
  WARC-Refers-To header identifies the original WARC record
  WAT payload is JSON
  Can be combined with curator-generated metadata
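To make the record shape concrete, here is a minimal sketch that splits one WAT-style record into its WARC headers and its JSON payload. The record text is hand-built for the example, and the payload keys (`url`, `links`, `href`, `text`) are simplified placeholders rather than the actual WAT schema; `parse_wat_record` is a hypothetical helper, not part of any archive tooling.

```python
import json


def parse_wat_record(raw):
    """Split a single uncompressed record into (version, headers, payload)."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, since values such as
        # <urn:uuid:...> contain colons themselves.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, headers, json.loads(body)


# Illustrative payload; the key names are assumptions, not the WAT schema.
payload = json.dumps({
    "url": "http://example.org/",
    "links": [{"href": "http://example.org/about", "text": "About"}],
})
raw = "\r\n".join([
    "WARC/1.0",
    "WARC-Type: metadata",
    "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>",
    "Content-Type: application/json",
    "Content-Length: " + str(len(payload.encode("utf-8"))),
]) + "\r\n\r\n" + payload

version, headers, data = parse_wat_record(raw)
outlinks = [link["href"] for link in data["links"]]
```

The `WARC-Refers-To` value is what lets extracted metadata, or curator-generated metadata keyed the same way, be joined back to the original capture.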
Monitoring/Enhancing/Confirming Capture
  Comparing live resources to files written
  Evaluating completeness (at all levels)
  Generating snapshots of live and archived resources
  Eliminating spam/detecting scoping mistakes & issues
  Mining crawl logs (Hive)
  Mining browser logs
  Mining/analyzing links
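Crawl log mining is listed above as a Hive job; as a small stand-in, the sketch below does the same kind of aggregation in plain Python by tallying fetch status codes per host. It assumes a whitespace-separated log in which the second field is the status code and the fourth is the URL; the sample lines are invented, and real logs carry more fields.

```python
from collections import Counter
from urllib.parse import urlsplit


def status_by_host(log_lines):
    """Tally (host, status code) pairs from whitespace-separated log lines."""
    tally = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        status, url = fields[1], fields[3]
        host = urlsplit(url).hostname or "unknown"
        tally[(host, status)] += 1
    return tally


# Invented sample lines; the field layout is an assumption for this sketch.
sample = [
    "2013-04-26T09:00:00.000Z 200 5120 http://example.org/ - - text/html",
    "2013-04-26T09:00:01.000Z 404 312 http://example.org/missing L http://example.org/ text/html",
    "2013-04-26T09:00:02.000Z 200 2048 http://example.net/a.css E http://example.org/ text/css",
]
tally = status_by_host(sample)
```

A host with an unusually high share of error statuses, or far more URLs than expected, is a candidate for a closer look at scoping or spam.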
Characterizing/Documenting/Preserving Captures & Collections
Enabling Access & Research
  Host profiles
  Link Graphs, Tag Clouds, & Visualizations
  Collection Based: data.html data.html
  Archive wide:
  Site/Page Evolution
  Portal Browse/Search
  Research Use/Access
  History Tracker (Weber/Lazer)
  ARCLink (AlSum/Nelson)
HistoryTracker Tool
  Beta Version!
  PIG Scripts in Hadoop Environment
  RU High-Speed Computing Cluster
  Link Lists
  Curated Data Sets