Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.

Presentation transcript:

Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators
Archive-It Partner Meeting, Montgomery, Alabama
Mohamed Farag & Prashant Chandrasekar
DLRL, Virginia Tech
Nov. 18, 2014

Acknowledgments
Related Funding:
  – NSF IIS, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
  – NSF IIS, Crisis, Tragedy, and Recovery network (CTRnet)
  – NSF IIS, Integrated Digital Event Archive & Library (IDEAL)
  – Villanova University (NSF DUE): Computing in Context
  – 2014: Mellon/Columbia, Archiving Transactions Towards Uninterruptible Web Service (UPS – building on Memento and SiteStory)
  – Prashant Chandrasekar
The Internet Archive (Kristine Hanna, co-PI):
  – Heritrix crawler and other tools and support
  – Hosting the crawls and resulting archives
IDEAL team also includes Fox, Kavanaugh, Sheetz, Shoemaker, Lee

Outline
Web Archiving
Events archiving (disasters, community, and government)
Automatic Seed URLs Generation
  – Social media
  – EFC
Extending Web Archives
Event Focused Crawler

Web Archiving
Crawling approach
Spontaneous disaster events archiving
High quality seed URLs (curation)
Crawling webpages
  – Frequency, Scope
  – Usually small number of seed URLs crawled frequently with many levels of scope
Human intervention (time, effort, expertise, and management)

Events archiving
Several types of webpages published online
  – Social media, news webpages, and formal (organization) webpages
Different characteristics of event-related published content
  – huge number of relevant webpages (low to high quality)
  – most seed URLs are of one-level scope (not hub pages)

Automatic Seed URLs Generation (1/3)
Tweet collections
Tweet URLs
  – Extraction & Expansion (unshortening)
URLs filtering (classification)
URLs archiving
  – One time crawl (or according to quality of URLs)
  – One level crawl (only crawl the given URL)
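The extraction step on this slide can be illustrated with a minimal Python sketch. It assumes tweets are available as plain text strings; the regular expression, the trailing-punctuation list, and the function name `extract_urls` are illustrative assumptions, not the IDEAL project's actual code.

```python
import re

# Assumed pattern: an http/https URL runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")
# Punctuation that often trails a URL in running tweet text (assumption).
TRAILING = ".,;:!?)]}\"'"


def extract_urls(tweets):
    """Return the unique URLs found in an iterable of tweet texts, in first-seen order."""
    seen = set()
    urls = []
    for text in tweets:
        for match in URL_PATTERN.findall(text):
            url = match.rstrip(TRAILING)
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Deduplicating before expansion matters because the same shortened link is typically retweeted many times, and each expansion costs at least one network request.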

Automatic Seed URLs Generation (2/3)
Huge number of URLs extracted from tweets
Unshortening takes a lot of time
  – Following redirection
  – Infinite redirection loop
URL filtering using classification
  – Preparing training data
Evaluating quality of resulting archive (curation)
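One way to handle the two unshortening problems named above (redirect chains and infinite redirection loops) is sketched below in Python. The names `expand_url`, `expand_all`, and the `fetch_location` callable are hypothetical: `fetch_location(url)` is assumed to return the redirect target of a URL, or `None` when the URL does not redirect. In practice it would wrap an HTTP request; here it is left as a parameter so the logic can be read and tested without network access.

```python
def expand_url(url, fetch_location, max_hops=10):
    """Follow redirects from `url`; return the final URL, or None on a loop or too many hops.

    `fetch_location` is a hypothetical helper: it maps a URL to its redirect
    target, or to None when the URL does not redirect.
    """
    visited = {url}
    current = url
    for _ in range(max_hops):
        target = fetch_location(current)
        if target is None:
            return current      # no further redirect: this is the original URL
        if target in visited:
            return None         # redirection loop detected
        visited.add(target)
        current = target
    return None                 # hop limit reached; give up on this URL


def expand_all(urls, fetch_location, max_hops=10):
    """Expand each distinct URL once, caching results to avoid repeated lookups."""
    cache = {}
    for url in urls:
        if url not in cache:
            cache[url] = expand_url(url, fetch_location, max_hops)
    return cache
```

The visited set catches loops, the hop limit bounds the time spent on any single URL, and the cache addresses the cost of expanding a very large number of tweet URLs, many of which repeat.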

Automatic Seed URLs Generation (3/3)
[Workflow diagram, with phases Collect, Archive/Organize/Analyze, and Access: Event → Collect Tweets (Keyword/Hashtag) → Tweet Collection → Extract URLs → Shortened URLs → Expand → Original URLs → Fetch Webpages → Archive (WARC) → Index (SOLR) → Browse (Wayback) / Search]

Crawling Approach (1/2)
Curator selects high quality seed URLs
Use Event Focused Crawler (EFC) to retrieve webpages that are highly similar to those with the seed URLs
Curator can configure EFC to adjust the number of webpages retrieved and the quality of retrieved webpages (similarity threshold)
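The role of the similarity threshold can be illustrated with a short Python sketch. It assumes a plain bag-of-words cosine similarity between each candidate page and the combined text of the seed pages; the slides do not specify how EFC scores pages, so this scoring, the function names, and the `max_pages` parameter are assumptions for illustration only.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_pages(seed_texts, candidates, threshold=0.5, max_pages=None):
    """Keep candidate pages whose similarity to the seed pages meets the threshold.

    `candidates` maps URL -> page text. Returns (url, score) pairs, best first.
    """
    seed_vector = Counter()
    for text in seed_texts:
        seed_vector.update(term_vector(text))
    scored = [(url, cosine_similarity(seed_vector, term_vector(text)))
              for url, text in candidates.items()]
    kept = [(url, score) for url, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if max_pages is None else kept[:max_pages]
```

Raising `threshold` trades quantity for quality: fewer pages are kept, but those kept resemble the seed pages more closely. `max_pages` caps the number retrieved regardless of score.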

Crawling Approach (2/2)
[Workflow diagram, with phases Collect, Archive/Organize/Analyze, and Access: Event → Seed URLs → Event Focused Crawler → URLs → Fetch Webpages → Archive (WARC) → Index (SOLR) → Browse (Wayback) / Search]

Extending Web Archives (1/2)
Similar to previous scenario
Archivists can use EFC to read WARC archives’ content and retrieve more webpages that are similar.

Extending Web Archives (2/2)
[Workflow diagram, with phases Extend, Archive/Organize/Analyze, and Access: WARC files → Event Focused Crawler → URLs → Fetch Webpages → Archive (WARC) → Index (SOLR) → Browse (Wayback) / Search]

Event Focused Crawler (1/2)
Modeling events
  – What happened, where, and when
Information retrieval
  – Helps find What part (VSM, LDA)
Natural language processing
  – Helps find Where and When parts (POS, NER)
Archive textual and linguistic analysis
  – Event model can help provide linguistic characteristics of archive content
  – Frequent and important words
  – Frequent entities
  – Important sentences
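As a small illustration of the "frequent and important words" item, the Python sketch below counts non-stopword terms across a set of archived page texts. It is a simple term-frequency count, not the VSM or LDA modeling named on the slide, and the stopword list and the function name `frequent_words` are assumptions made for the example.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, assumed for illustration only.
STOPWORDS = frozenset({
    "the", "and", "for", "with", "that", "this", "from", "was", "were",
    "are", "has", "have", "had", "not", "but", "its", "into",
})


def frequent_words(documents, top_n=5):
    """Return the top_n most frequent non-stopword terms across the documents."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z]+", text.lower()):
            # Skip stopwords and very short tokens, which carry little topical signal.
            if token not in STOPWORDS and len(token) > 2:
                counts[token] += 1
    return counts.most_common(top_n)
```

A frequency profile like this gives a rough view of the "what" part of an event; the "where" and "when" parts would need the entity recognition mentioned on the slide.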

Event Focused Crawler (2/2)

Thank You
Questions?
Mohamed Farag & Prashant Chandrasekar