
9:00am – Welcome/Setting the Agenda for the Day 9:10am - 10:30am – Challenges of the Web Now & in the Future Response to these Challenges 10:30am – BREAK.


1 Welcome/Agenda
9:00am – Welcome/Setting the Agenda for the Day
9:10am - 10:30am – Challenges of the Web Now & in the Future; Response to these Challenges
10:30am – BREAK
11:00am - 12:30pm – Intro to Metadata extraction, Data Mining & the Web Archiving Lifecycle
12:30pm – LUNCH
1:30pm - 3pm – Data Mining Breakout sessions/Deep Dives
3pm – BREAK
3:15pm - 4:30pm – Data Mining Breakout sessions/Deep Dives
4:30pm – Wrap-up & Next Steps
IIPC GA Meeting, Ljubljana, Slovenia, April 26, 2013

2 Data Mining & Web Archiving ‘Lifecycles’
Kris Carpenter Negulescu, Internet Archive

3 Use Cases
- Election 2012 Collaborative
- NLNZ 2013 Domain and GOV Collections
- Wide00002/00005 Crawls
  http://home.us.archive.org/~vinay/wide/wide-00002.html
  http://home.us.archive.org/~vinay/wide/wide-00005.html
  https://webarchive.jira.com/wiki/display/~vinay/Embed+Analysis+for+the+Wide00005+Crawl
IIPC General Assembly, The Hague, May 9, 2011

4 Traditional “Crawl” Lifecycles
- Crawl seeded & scoped
- Data Harvested and De-duplicated
- WARCs Written and Ingested
- WARCs Processed and Indexed
- Data & Services Audited and Monitored
- Collection QA’d, Reports Generated, Access Services Deployed/Updated
Diagram labels: CDXs/WATs, WARCs, Lucene Shards

5 Analyzing Scope & Quality
- Target Resource “Analysis”
- Live Snapshot Generation
- WAT Generation
- “Browser” Log Analysis
- “Crawl” Log Analysis
- Filtering APIs/Feeds
- Embed & Out-link Analyses
- Web Graph Generation
- In-link Analyses/Ranking
- Anchor, Description, Full Text Indexing & Mining
- Seeding a Crawler Frontier & Alternate Capture Mechanisms

6 Preparing to Collect/Scoping/Framing a Crawl/Collection
- Pre “Crawl” Workflows
  - Target identification (beyond curatorial selection…)
  - Automated Filtering of Data Sources by Topic, Geo IP, file format, robots policy or other criteria
  - Out-link analyses and ranking from selected sources, In-link analyses
  - Mining Anchor text/Page Descriptions/Title tags (if not full text)
- “Test” Capture Analyses (…routing to proper capture mechanisms)
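
The out-link and in-link analyses listed on this slide can be illustrated with a small Python sketch. It assumes out-links have already been extracted as (source URL, target URL) pairs; the function name `rank_hosts_by_inlinks` and the sample URLs are hypothetical and do not come from the talk. Ranking by the number of distinct linking hosts is only one possible scoring choice.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def rank_hosts_by_inlinks(links):
    """Rank target hosts by the number of distinct source hosts linking to them."""
    sources_by_target = defaultdict(set)
    for source_url, target_url in links:
        source_host = urlsplit(source_url).hostname
        target_host = urlsplit(target_url).hostname
        # Skip unparseable URLs and links that stay within a single host.
        if not source_host or not target_host or source_host == target_host:
            continue
        sources_by_target[target_host].add(source_host)
    # Sort by descending in-link count, then by host name for stable output.
    return sorted(
        ((host, len(sources)) for host, sources in sources_by_target.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical sample data standing in for extracted out-links.
sample_links = [
    ("http://a.example/page1", "http://c.example/"),
    ("http://a.example/page2", "http://c.example/about"),
    ("http://b.example/", "http://c.example/"),
    ("http://b.example/", "http://d.example/"),
    ("http://c.example/", "http://c.example/contact"),
]
ranking = rank_hosts_by_inlinks(sample_links)
print(ranking)
```

Counting distinct source hosts rather than raw links keeps one heavily self-referencing site from dominating the ranking; a candidate seed list could then be drawn from the top of the result.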

7 Your Browser: Behind the Scenes

8

9 Extracted Metadata & Links (WAT)
- WAT is WARC ☺
- WAT records are WARC metadata records
- WARC-Refers-To header identifies original WARC record
- WAT payload is JSON
- Can be combined with Curator generated metadata
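
As a rough illustration of the points on this slide, the Python sketch below reads one uncompressed WARC metadata record, pulls out its WARC-Refers-To header, and decodes the JSON payload. The helper names `parse_wat_record` and `build_sample_record`, the sample UUID, and the payload keys are assumptions made up for the example; real WAT payloads have their own structure, and real files are usually gzip-compressed and better handled with a dedicated WARC library.

```python
import json


def parse_wat_record(raw):
    """Split one uncompressed WARC record into (headers, decoded JSON payload)."""
    # The WARC header block ends at the first blank line.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version, header_lines = lines[0], lines[1:]
    if not version.startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length gives the payload size in bytes.
    length = int(headers["Content-Length"])
    payload = json.loads(rest[:length].decode("utf-8"))
    return headers, payload


def build_sample_record():
    """Build a tiny, hypothetical WAT-style record for demonstration."""
    body = json.dumps({"links": [{"url": "http://example.org/"}]}).encode("utf-8")
    header = (
        "WARC/1.0\r\n"
        "WARC-Type: metadata\r\n"
        "WARC-Refers-To: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
        "Content-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Records are terminated by two CRLF pairs.
    return header + body + b"\r\n\r\n"


headers, payload = parse_wat_record(build_sample_record())
print(headers["WARC-Refers-To"])
print(payload["links"])
```

The WARC-Refers-To value is what would let extracted metadata be joined back to the original capture, or merged with curator-generated metadata keyed on the same record ID.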

10 Monitoring/Enhancing/Confirming Capture
- Comparing Live Resources to Files Written
- Evaluating Completeness (at all levels)
- Generating Snapshots of Live and Archived resources
- Eliminating Spam/Detecting Scoping Mistakes & Issues
- Mining Crawl Logs (HIVE)
- Mining Browser Logs
- Mining/Analyzing Links
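
The slide names Hive for crawl-log mining; the Python sketch below only illustrates the kind of aggregation involved, not the speaker's actual queries. It assumes a whitespace-delimited log in which the second field is the fetch status and the fourth field is the URI (a layout similar to Heritrix-style crawl logs); the function name `summarize_crawl_log` and the sample lines are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def summarize_crawl_log(lines):
    """Count fetches per (host, status) pair from whitespace-delimited log lines."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: timestamp, status, size, URI, then further fields.
        if len(fields) < 4:
            continue  # skip malformed lines
        status, uri = fields[1], fields[3]
        host = urlsplit(uri).hostname or "unknown"
        counts[(host, status)] += 1
    return counts


# Hypothetical sample lines in the assumed layout.
sample_log = [
    "2013-04-26T09:00:00.000Z 200 5120 http://example.org/ - - text/html",
    "2013-04-26T09:00:01.000Z 404 312 http://example.org/missing L http://example.org/ text/html",
    "2013-04-26T09:00:02.000Z 200 2048 http://example.net/a.css E http://example.org/ text/css",
    "malformed line",
]
summary = summarize_crawl_log(sample_log)
for (host, status), count in sorted(summary.items()):
    print(host, status, count)
```

A per-host breakdown like this is one way to spot scoping mistakes (unexpected hosts) or completeness problems (hosts dominated by error statuses).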

11 Characterizing/Documenting/Preserving Captures & Collections

12 Enabling Access & Research
- Host profiles
- Link Graphs, Tag Clouds, & Visualizations
- Collection Based:
  http://home.us.archive.org/~vinay/eot08-explore-data.html
- Archive wide:
  http://home.us.archive.org/~vinay/global/1995-2011/stats.html
  http://home.us.archive.org/~vinay/tld.html
- Site/Page Evolution
  http://archive.org/details/TheNewYorkTimesTimelapse1996-2010
- Portal Browse/Search
  http://eotarchive.cdlib.org/
- Research Use/Access
- History Tracker (Weber/Lazer)
- ARCLink (AlSum/Nelson)

13

14 HistoryTracker Tool
Beta Version!
- PIG Scripts in Hadoop Environment
- RU High-Speed Computing Cluster
- Link Lists
- Curated Data Sets

