From CTRnet to IDEAL (and Qatar, VT, SiteStory, UPS, …) NSF IA WIRE Workshop Harvard -- June 16, 2014 Edward A. Fox,

Presentation transcript:

From CTRnet to IDEAL (and Qatar, VT, SiteStory, UPS, …) NSF IA WIRE Workshop Harvard -- June 16, 2014 Edward A. Fox, Professor, Dept. of Computer Science, Virginia Tech Director, Digital Library Research Laboratory Director, Networked Digital Library of Theses and Dissertations

Acknowledgments - 1
Related Funding:
– : NSF IIS , DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
– : NSF IIS , Crisis, Tragedy, and Recovery network (CTRnet)
– : NSF IIS , Integrated Digital Event Archive & Library (IDEAL)
– : Villanova University (NSF DUE ): Computing in Context
– : Qatar NPRP , Establishing a Qatari Arabic-English Library Institute
– 2014: Mellon/Columbia, Archiving Transactions Towards Uninterruptible Web Service (UPS – building on Memento and SiteStory)
The Internet Archive (Kristine Hanna, co-PI):
– Heritrix crawler and other tools and support
– Hosting the crawls and resulting archives
LucidWorks (software and support – open jobs, internships)

Acknowledgments - 2
IDEAL: VT: PI: Fox, co-PIs: Andrea Kavanaugh, Steve Sheetz, Don Shoemaker; GRAs: Mohamed Magdy, Sunshin Lee; Egypt: Riham Mansour
CTRnet: also Naren Ramakrishnan (co-PI); GRAs Seungwon Yang and Venkat Srinivasan
DL-VT416: also Christopher North and Weiguo Fan
Computing in Context: Villanova PI Robert Beck; Students: Xuan Zhang, Tarek Kanan: class to learn Computational Linguistics by 5-way better summarizing Web archive collections (extract words/sentences, find topics, use event templates)
Qatar: Lead PI Fox, Co-PIs Mohammed Samaka (Qatar U.), Somaya Al-Maadeed (QU), Krishna RoyChowdhury (Qatar National Library), C. Lee Giles (Penn State), Rick Furuta (Texas A&M); consultant John Impagliazzo (Hofstra), VT GRA Tarek Kanan
Mellon: PI Zhiwu Xie, co-PI Fox, GRA Prashant Chandrasekar
Other students: Kiran Chitturi, Rachel Coston, Ishita Ganotra, S.M. Shamimul Hasan, Christopher Jones, Rohan Kaul, Jun Kim, Lin Tzy Li, Ying Ni, Braeden Sebastian, and teams in CS4624, 5604, 6604
Collaborators in: Egypt, Tunisia, Mexico, Philippines
WE WELCOME OTHER COLLABORATORS!

Memento – Time Travel for the Web: Across-Archive Method for Linking the Current & Past Web RFC 7089 (Martin Klein)
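As a rough illustration of the RFC 7089 mechanism named on this slide: a client asks a TimeGate for the version of a resource closest to a desired time by sending an Accept-Datetime request header. The sketch below only builds that header and sends nothing over the network; the function name accept_datetime_header is hypothetical and not part of any Memento library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> dict:
    """Build the Accept-Datetime request header described in RFC 7089.

    Hypothetical helper for illustration only.
    """
    # The header value is an HTTP-date expressed in GMT.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


headers = accept_datetime_header(
    datetime(2014, 6, 16, 12, 0, tzinfo=timezone.utc)
)
print(headers)
```

A client would attach these headers to a request sent to a TimeGate URL and follow the redirect to the selected memento.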

Related Projects
Mellon/Columbia: enhance SiteStory by devising a webserver that also archives; use the archive automatically when server is down; capture the VT Web and bring up UPS on multiple campus sites
Qatar: at Qatar U., Qatar National Library
– Build a digital library community (consulting center)
  4 DL books with M&C + l_Libraries
– Build digital library infrastructure:
  SiteSeer (CiteSeerX, ChemXSeer, TableSeer, …) with Arabic and CLIR support
  Heritrix, Wayback Machine, Solr, …

Web Archives
13 TB of IA Collections, e.g., 2013: Boko Haram attack, Boston Marathon blast, Global Emergency Overview, Texas fertilizer plant explosion

Category                                                     No. of Archives
Accidents (plane crash, building collapse, ferry sinking)    11
Bombings                                                     4
Earthquakes (Japan)                                          12
Fires                                                        2
Floods                                                       4
Hurricanes (Sandy), Tsunami, Cyclones, Typhoons              8
Shootings                                                    17

Tweet Collections
> 120 event-specific and general collections
Total of 600 million tweets, from streaming API, using hashtags and keywords

Category                          No. of collections
Accident (transportation)         33
Bombing                           8
Community                         10
Earthquake                        18
Fire                              6
Flood                             11
General (including health)        67
Hurricane, Tsunami                39
Political (Middle East, Iran)     40
Shooting                          29

CTRnet Collect, analyze, and visualize disaster information with a DL

Social Media Use in Political Crisis (1/2) (2/7 - 2/14, 2011)
Total 514,782 tweets
[Chart; axis label: No. Tweets]

Social Media Use in Political Crisis (2/2)
Opinion Leadership in Egypt Uprising 2011
– 514,782 tweets (one week around Mubarak’s resignation)
– Total 79,000 unique users
  Presumably posting from Egypt → 4,710
  Individuals excluding organizations → 3,675
– Opinion leaders ,000 followers in top 10% (365) individuals
  Bios: blogger/activist, writer/reporter, lawyer/executive director, social media consultant, … → ‘elite’ type actors

Visualizing Emergency Phases in Tweets (ISCRAM 2013) (1/2) Four phases of emergency management model

Visualizing Emergency Phases in Tweets (2/2)

Topic Tagging of Webpages: Xpantrac - 1
Seungwon Yang dissertation
➔ Input: text file
➔ Build query
  ◆ Every 5 words, 1 word overlap
➔ Send query to search API
➔ Web search (Seungwon)
➔ Wikipedia, our collection(s): CS4624 Spring 2014: Sloane Neidig, Samantha Johnson, David Cabrera, Erika Hoffman
➔ Find topics in retrieved documents
  ◆ Frequency of words
➔ Select most frequent as “topics”
➔ Output: topics
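The Xpantrac steps on this slide can be sketched roughly as follows. This is a minimal illustration and not the Xpantrac code itself: the function names, the whitespace tokenization, and the `search` callable (a stand-in for whichever search API is used) are all assumptions.

```python
from collections import Counter


def build_queries(words, size=5, overlap=1):
    """Split words into windows of `size` words sharing `overlap` words."""
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def extract_topics(text, search, k=3, stopwords=frozenset()):
    """Return the k most frequent words in documents retrieved for the text.

    `search` is a hypothetical callable: query string -> list of document
    strings. It stands in for a Web, Wikipedia, or local-collection search.
    """
    counts = Counter()
    for query in build_queries(text.split()):
        for doc in search(query):
            counts.update(
                w for w in doc.lower().split() if w not in stopwords
            )
    return [word for word, _ in counts.most_common(k)]


def fake_search(query):
    # Stub used only to make the sketch runnable without a network call.
    return ["flood water flood damage"]


queries = build_queries("a b c d e f g h i".split())
topics = extract_topics("heavy rain hit the town", fake_search, k=1)
```

With a window of 5 and an overlap of 1, each query starts on the last word of the previous one, so nine words yield two queries.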

Topic Tagging of Webpages: Xpantrac - 2 Seungwon Yang (GMU postdoc now)

Water Main Break Visualization
Sunshin Lee
– Tweets collected with keywords
– Selected tweets with location information (lat/long, geonames)
– Event locations displayed with details

Integrated Digital Event Archive and Library (IDEAL) Project
Extension of CTRnet with broadened scope:
– Event detection
– Event data archiving & processing
  Multimedia (images, videos) shared in social media
Digital government research
– Community issue detection
– Public opinion mining, mood perception, information flow
Technologies:
– Focused crawling, analysis/visualization services, integration of archive and DL capabilities

Event Ontology
Event model: Who, What, When, Where, How
Who – Organizations/entities participating in the event
What – Topics of the event
Where – Event location
When – Event time frame (and later times of interest, e.g., anniversaries)
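One way to picture the event model is as a small record type. The sketch below is an assumption made for illustration: the class, field names, and sample values are hypothetical and are not taken from the IDEAL ontology.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Event:
    """One event record following the Who/What/When/Where/How model."""
    who: list[str]      # organizations/entities participating in the event
    what: list[str]     # topics of the event
    where: str          # event location
    start: date         # beginning of the event time frame
    end: date           # end of the event time frame
    how: str = ""       # the slide lists "How" without further detail
    later_dates: list[date] = field(default_factory=list)  # e.g., anniversaries

    def is_of_interest(self, day: date) -> bool:
        """True if the day is in the time frame or is a later date of interest."""
        return self.start <= day <= self.end or day in self.later_dates


# Illustrative values only.
example = Event(
    who=["example organization"],
    what=["bombing", "marathon"],
    where="Boston, MA, USA",
    start=date(2013, 4, 15),
    end=date(2013, 4, 19),
    later_dates=[date(2014, 4, 15)],
)
```

Keeping later dates of interest separate from the main time frame lets a collector resume around anniversaries without treating the whole intervening year as part of the event.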

IDEAL Proposal Architecture

IDEAL System Architecture Sunshin Lee (built low-cost cluster)

IDEAL Data Architecture Sunshin Lee

Event Focused Crawler Mohamed Magdy Focus of research

Baseline vs. Event Focused Crawler
Mohamed Magdy
Harvest ratio: number of relevant crawled webpages divided by the cumulative number of crawled webpages
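A harvest-ratio curve of the kind used to compare the two crawlers can be computed as sketched here. The function name and the input format (one relevance flag per crawled page, in crawl order) are assumptions; how relevance is judged is outside this sketch.

```python
def harvest_ratio_curve(relevance_flags):
    """Return the harvest ratio after each crawled page.

    relevance_flags: iterable of booleans in crawl order, True when the
    page was judged relevant to the event.
    """
    curve = []
    relevant = 0
    for crawled, is_relevant in enumerate(relevance_flags, start=1):
        relevant += bool(is_relevant)
        # Ratio of relevant pages to all pages crawled so far.
        curve.append(relevant / crawled)
    return curve


curve = harvest_ratio_curve([True, False, True, True])
```

A focused crawler that stays on topic keeps this curve high as the crawl grows, whereas a baseline crawler's curve tends to drop as it drifts to unrelated pages.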

Extracted News Events on a Time Line
CS6604 Spring 2014: Tianyu Geng, Wei Huang, Ji Wang, Xuan Zhang
Time line dates: 02/28, 03/01, 03/08, 03/09, 03/12, 03/14, 03/16, 03/20, 03/23, 03/26, 04/12, 04/16
Topic words of the extracted events:
– ukraine, crimea, crisis, putin, russia, minister
– russia, bank, sanctions, ukraine, crisis, crimea
– ukraine, tensions, data, rise, shares, china, stocks
– ukraine, house, imf, u.s, bill, white, aid
– ukraine, russia, talks, aid, crisis, sanctions, deal
– ukraine, aid, support, government, talks, house, russian
– ukraine, yanukovich, crisis, minister, sign, russian
– crimea, ukraine, russia, minister, referendum, vote
– crimea, ukraine, russian, troops, border
– gas, ukraine, russian, russia, europe, talks, energy
History: 3/7 referendum annulled; 3/14: UN draft resolution

News-Tweet Architecture
CS6604 Spring 2014: Tianyu Geng, Wei Huang, Ji Wang, Xuan Zhang
[Diagram: two Event Extraction Systems, each with a Pre-processor, LDA, and NER; each outputs Event 1, Event 2, Event 3, described by Who, When, Where, Topic; a Correlation component connects the two sets of events]
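The slide does not say how the Correlation component scores pairs of events. Purely as an assumption for illustration, the sketch below compares the topic words of events from two extraction systems with Jaccard similarity; the function names, the dictionary layout, and the threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of words."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def correlate(news_events, tweet_events, threshold=0.3):
    """Pair events whose topic words overlap enough (assumed criterion).

    Each event is a dict with a "topic" list of words. Returns a list of
    (news_index, tweet_index, score), highest score first.
    """
    pairs = []
    for i, news in enumerate(news_events):
        for j, tweet in enumerate(tweet_events):
            score = jaccard(news["topic"], tweet["topic"])
            if score >= threshold:
                pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


news = [{"topic": ["crimea", "ukraine", "referendum", "vote"]}]
tweets = [
    {"topic": ["crimea", "referendum", "vote", "putin"]},  # made-up sample
    {"topic": ["gas", "energy"]},                          # made-up sample
]
matches = correlate(news, tweets)
```

A fuller comparison could also weigh the Who, When, and Where fields; this sketch uses topic words only.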

IDEAL Spreadsheet CS4624 Spring 2014: Tony Ardura, Austin Burnett, Rex Lacy, Shawn Neumann (based on ArcSpread by Andreas Paepcke et al.)

Recommended Collection-Level Metadata
CS6604 Spring 2014: Michael Shuffett
Dublin Core
– Title, Description
PROV-O
– Starting Point Classes
– Collection process, organization, hadMember, atLocation
ISO for locations
W3/XMLSchema#dateTime
PLUS: TweetID tool for tweet collections
– Extracts tweet and collection level metadata
– Compares / combines tweet collections
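A collection-level record along these lines, together with the compare/combine step, might look like the sketch below. It is not the TweetID tool: the function names, the dictionary layout, and the collectedAt key are assumptions, while the dcterms and prov prefixes refer to Dublin Core terms and PROV-O properties.

```python
from datetime import datetime, timezone


def collection_metadata(title, description, tweet_ids, location, collected_at):
    """Build a collection-level metadata record (hypothetical layout)."""
    return {
        "dcterms:title": title,
        "dcterms:description": description,
        "prov:hadMember": sorted(set(tweet_ids)),  # members by tweet ID
        "prov:atLocation": location,
        # ISO 8601 string, compatible with XML Schema dateTime
        "collectedAt": collected_at.isoformat(),
    }


def compare_collections(ids_a, ids_b):
    """Compare two tweet collections by tweet ID and combine them."""
    a, b = set(ids_a), set(ids_b)
    return {
        "shared": a & b,
        "only_a": a - b,
        "only_b": b - a,
        "combined": a | b,
    }


meta = collection_metadata(
    "Sample collection", "Illustrative record", [2, 1, 2],
    "Blacksburg, VA", datetime(2014, 5, 6, tzinfo=timezone.utc),
)
result = compare_collections([1, 2, 3], [3, 4])
```

Working on tweet IDs alone keeps the comparison cheap, since two collections can be checked for overlap without loading full tweet text.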

Thank you! Questions/Comments? Office: 2160G Torgersen Hall Campus Mail: 114 McBryde Hall, M/C 0106, Dept. of CS, Virginia Tech, Blacksburg, VA