CS6604 Digital Libraries Global Events Team Final Presentation

Presentation transcript:

CS6604 Digital Libraries
Global Events Team Final Presentation
  Presenters: Liuqing Li, Islam Harb, Andrej Galad
  {liuqing, iharb, agalad}@vt.edu
  Instructor: Dr. Edward A. Fox
  Virginia Polytechnic Institute and State University
  Blacksburg, VA 24061
  April 27, 2017

Outline
  Background
  Implementation
    Data Collection
    Data Processing
    Data Visualization
  Future Work
  Acknowledgement

Background
  GETAR*: Global Event and Trend Archive Research
  Architecture
  * Edward A. Fox, Donald Shoemaker, Chandan Reddy, Andrea Kavanaugh. III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR), NSF grant IIS-1619028, 2017-2019. http://eventsarchive.org

Implementation – Architecture
  Stages: Data Collection, Data Processing, Data Visualization
  Components: Event Focused Crawler (EFC); WARC Files; CDX Files; CDX Writer; ArchiveSpark; Apache Spark; Stanford NER; Regular Expression; Score Function; Entity-based Results; Standalone HBase; Web Application

School Shooting Events
  Events of Interest

  School Shooting Events                       Year
  Virginia Tech Shooting                       2007
  Northern Illinois University Shooting        2008
  Dunbar High School Shooting                  2009
  University of Alabama Shooting               2010
  Worthing High School Shooting                2011
  Sandy Hook Elementary School Shooting        2012
  Sparks Middle School Shooting                2013
  Reynolds High School Shooting                2014
  Umpqua Community College Shooting            2015
  Townville Elementary School Shooting         2016

Focused Crawler – Collecting / Archiving
  [Flowchart]
  START -> Manually Curate Seeds -> URLs Queue
  All URLs? Yes -> END; No -> Download Page -> Calculate Relevancy
  Relevant? Yes -> Process Page & Convert into WARC Format -> Append Result (warc.gz Event File); No -> Discard
  Extract URLs -> URLs Queue
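The crawl loop on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the team's Event Focused Crawler: the keyword-overlap relevancy rule, the threshold, the injected fetch function, and all names below are assumptions, and an in-memory list stands in for the warc.gz event file.

```python
import re
from collections import deque


def relevancy(text, keywords):
    """Fraction of keywords found in the page text (hypothetical scoring rule)."""
    if not keywords:
        return 0.0
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for k in keywords if k.lower() in words) / len(keywords)


def extract_urls(html):
    """Pull absolute http(s) links out of href attributes."""
    return re.findall(r'href="(https?://[^"]+)"', html)


def focused_crawl(seeds, fetch, keywords, threshold=0.5, max_pages=100):
    """Crawl from curated seeds, keeping only pages judged relevant.

    fetch is any callable that maps a URL to page text (or None on failure),
    so the loop can be exercised without network access.
    """
    queue = deque(seeds)      # URLs queue, initialised with the curated seeds
    seen = set(seeds)
    archived = []             # stands in for records appended to the warc.gz file
    downloaded = 0
    while queue and downloaded < max_pages:
        url = queue.popleft()
        html = fetch(url)     # download page
        downloaded += 1
        if html is None:
            continue
        if relevancy(html, keywords) < threshold:
            continue          # not relevant: discard
        archived.append((url, html))
        # Assumption: only links found on relevant pages are queued.
        for link in extract_urls(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived
```

Passing a dictionary's get method as fetch is enough to try the loop on a handful of hand-written pages before pointing it at a real downloader.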

WARC Libraries Wget (Version 1.14 or later)

WARC Libraries Wpull

WARC Libraries
  WARCIO: WARC (and ARC) Streaming Library
  Python 2.7+ and 3.3+
  Post-Processing: Read / Write WARC format

Ten Event Collections
  Naming Convention: [location]_[year].warc.gz

Tools for Data Processing
  ArchiveSpark
    Apache Spark framework for Web Archives
    Easy data extraction
    Input: WARC and CDX files
  CDX Writer
    Python script to create CDX files of WARC files
    Format: CDX N b a m s k r M S V g
    e.g., edu,vt,cnre)/ 20170422005601 http://cnre.vt.edu text/html 200 BT3ILJXROIILHBKQPNYDUCUVZRDKG3OA - - 9478 20104749 data/Virginia-Tech-Shooting_20070416.warc.gz
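To make the 11-field CDX layout concrete, the sketch below splits one CDX line into named fields. The field names are descriptive labels chosen here for the letters N b a m s k r M S V g (as commonly documented in the CDX legend), not identifiers from CDX Writer, and the parser assumes space-separated fields with "-" for missing values.

```python
# Labels for the CDX legend letters "N b a m s k r M S V g" (names are this sketch's own).
CDX_FIELDS = (
    "urlkey",           # N: massaged (canonicalised) URL
    "timestamp",        # b: capture date
    "original_url",     # a: original URL
    "mime_type",        # m: MIME type
    "status",           # s: HTTP response code
    "digest",           # k: checksum
    "redirect",         # r: redirect
    "meta_tags",        # M: meta tags
    "compressed_size",  # S: compressed record size
    "offset",           # V: compressed file offset
    "filename",         # g: WARC file name
)

NUMERIC_FIELDS = ("status", "compressed_size", "offset")


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict keyed by CDX_FIELDS."""
    parts = line.split()
    if len(parts) != len(CDX_FIELDS):
        raise ValueError(f"expected {len(CDX_FIELDS)} fields, got {len(parts)}")
    record = dict(zip(CDX_FIELDS, parts))
    for key in NUMERIC_FIELDS:
        # "-" marks a missing value, so only convert purely numeric strings.
        record[key] = int(record[key]) if record[key].isdigit() else None
    return record


example = (
    "edu,vt,cnre)/ 20170422005601 http://cnre.vt.edu text/html 200 "
    "BT3ILJXROIILHBKQPNYDUCUVZRDKG3OA - - 9478 20104749 "
    "data/Virginia-Tech-Shooting_20070416.warc.gz"
)
record = parse_cdx_line(example)
```

The offset and compressed size together locate a single record inside the warc.gz file, which is what lets a tool read one capture without scanning the whole archive.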

Data Preprocessing
  Webpage Cleaning
    Extract Raw Text: payload.string.html.body.text
    Remove jQuery & JavaScript: { WPGroHo.syncProfileData( hash, id ); }, …
    Remove tags: <br>, <p>, …
    Remove markers: *, |, +, …
    Remove stopwords: a, about, the, …
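The cleaning steps listed on this slide could look roughly like the following. It is a sketch under assumptions: the regular expressions, the three-word stopword set, and the function name are illustrative, and inline JavaScript is assumed to sit inside script tags (the slide's example suggests some fragments may survive outside them, which this does not handle).

```python
import re

# Tiny illustrative stopword set taken from the slide's examples.
STOPWORDS = {"a", "about", "the"}


def clean_text(html, stopwords=STOPWORDS):
    """Strip scripts, tags, marker characters and stopwords from page HTML."""
    text = re.sub(r"(?is)<script.*?>.*?</script>", " ", html)  # remove JavaScript
    text = re.sub(r"(?s)<[^>]+>", " ", text)                   # remove tags
    text = re.sub(r"[*|+]", " ", text)                         # remove markers
    tokens = [t for t in text.split() if t.lower() not in stopwords]
    return " ".join(tokens)


sample = (
    "<p>The shooter * was | about 23</p>"
    "<script>WPGroHo.syncProfileData( hash, id );</script><br>"
)
cleaned = clean_text(sample)
```

Scripts are removed before tags so that their bodies are dropped along with the surrounding element rather than left behind as plain text.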

Data Processing
  Entity Extraction
    Basic Parsing: event name and date
    Stanford NER (Integrated model): entities, shooter name
    Regular Expression: event date, shooter name and age, number of victims, weapon list
  Score Function
    tf * df
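A small sketch of the regular-expression and scoring steps follows. The slide does not define the patterns or the exact meaning of tf and df, so everything here is an assumption: the patterns only match phrasings like "23-year-old" and "32 victims", tf is taken as a term's total count across the collection, and df as the number of pages that contain it.

```python
import re
from collections import Counter

# Hypothetical patterns for two of the features named on the slide.
AGE_RE = re.compile(r"\b(\d{1,3})-year-old\b")
VICTIMS_RE = re.compile(r"\b(\d+)\s+victims\b", re.IGNORECASE)


def extract_features(text):
    """Return shooter age and victim count if the simple patterns match."""
    age = AGE_RE.search(text)
    victims = VICTIMS_RE.search(text)
    return {
        "shooter_age": int(age.group(1)) if age else None,
        "shooting_victims": int(victims.group(1)) if victims else None,
    }


def score_terms(documents):
    """Score each term as tf * df over a list of tokenised pages.

    Assumption: tf = total occurrences in the collection,
    df = number of pages containing the term.
    """
    tf = Counter()
    df = Counter()
    for tokens in documents:
        tf.update(tokens)
        df.update(set(tokens))
    return {term: tf[term] * df[term] for term in tf}


features = extract_features("The 23-year-old gunman left 32 victims.")
scores = score_terms([["virginia", "tech", "virginia"], ["tech", "campus"]])
```

Under this reading, multiplying by df rewards terms that recur across many pages of an event collection, the opposite emphasis from tf-idf, which fits the goal of surfacing terms representative of the whole event.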

HBase
  Built-in ImportTsv Utility: Import Data into HBase

  Table Name       globalevents
  Row_Key          Event_Date + Event Hash Value (e.g., 20070416217787922)
  Column Family    event

  Column                      Example Value
  event: name                 Virginia Tech Shooting
  event: date                 20070416
  event: shooter_age          23-year-old
  event: shooting_victims     32 victims
  event: entities             Virginia;Tech;VA;University;…
  event: entities_count       146900;62415;13940;7732;…
  event: entities_url         url1,url2,url3,url4,url6;url2,url3,url4,url5;url1,url3,url4,url5,url6;…
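The row layout above can be illustrated by building one tab-separated line of the kind ImportTsv consumes. This is a sketch, not the team's export code: the function names and column order are assumptions, the hash value 217787922 is inferred from the slide's example key by removing the date prefix, and how the hash is actually computed is not stated in the source.

```python
def make_row_key(event_date, event_hash):
    """Row key = event date followed by the event hash value."""
    return f"{event_date}{event_hash}"


def to_tsv_row(row_key, name, date, shooter_age, victims, entities):
    """Build one TSV line; entities is a list of (term, count, [urls]).

    Parallel entity lists are joined with ';' and each entity's URLs
    with ',', mirroring the example values on the slide.
    """
    terms = ";".join(term for term, _, _ in entities)
    counts = ";".join(str(count) for _, count, _ in entities)
    urls = ";".join(",".join(url_list) for _, _, url_list in entities)
    return "\t".join([row_key, name, date, shooter_age, victims, terms, counts, urls])


row_key = make_row_key("20070416", "217787922")
line = to_tsv_row(
    row_key,
    "Virginia Tech Shooting",
    "20070416",
    "23-year-old",
    "32 victims",
    [("Virginia", 146900, ["url1", "url2"]), ("Tech", 62415, ["url2", "url3"])],
)
```

ImportTsv maps TSV positions to columns through its importtsv.columns option, with HBASE_ROW_KEY marking which position holds the row key, so the order of fields in each line has to match that mapping.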

Data Processing – Demo
  Key Stages
    Initialization
      Create Spark Session
      Create NLP Core
      Create Storage
    Processing
      Extract Event Name/Date/URL
      Extract Named Entities
      Extract Other Event Features
    Export and Import
      Generate TSV file
      Import TSV file into HBase

Global Events Viewer
  Efficient visualization of long-term global events
    Show representative terms -> link to corresponding URLs
    Visualize events’ trends over time (time series)
  Java 7 Spring Boot Web application
    Build system - Gradle
    Embedded Tomcat Web server
    Backend - HBase, in-memory
    Frontend - D3.js, Bootstrap
  https://github.com/dedocibula/global-events-viewer

Global Events Viewer – Demo
  Key Components: WordCloud, Range Selection, URL List, Trends

Problems Faced
  Data Collection
    Encoding problems (UTF-8, ASCII and others)
    Getting more relevant seeds for old events
  Data Processing
    Lack of documentation (ArchiveSpark)
    Version conflicts (CDX Writer, Kernel in Jupyter)
    JVM issue (Spark)
  Data Visualization
    Spring Boot IntelliJ setup
    jQuery UI

Lessons Learned
  Data Collection
    WARCIO
    Focused Crawler
  Data Processing
    ArchiveSpark
    Spark & Scala (Map/Reduce Process)
  Data Visualization
    D3 WordCloud
    D3 Dynamic Line Charts

Future Work
  Data Collection
    Wayback Machine
    Automatic Routine for Focused Crawler
    Event Extension (Sources, Time, Space)
  Data Processing
    Standalone Mode -> Cluster Mode
    Named Entity Recognizer
    Automatic Processing (CDX Writer and HBase)
  Data Visualization
    Localization – Datamaps
    Weapons

Acknowledgement
  Projects
    NSF IIS-1319578 III: Small: Integrated Digital Event Archiving and Library (IDEAL)
    NSF IIS-1619028 III: Small: Collaborative Research: Global Event and Trend Archive Research (GETAR)
  Organizations
    Internet Archive
    L3S Research Center
  Persons
    Instructor: Dr. Edward A. Fox
    Alumnus: Dr. Mohamed Magdy Farag
    Labmates: Prashant Chandrasekar, Xuan Zhang

Thank you! Questions?