GFURR seminar Can Collecting, Archiving, Analyzing, and Accessing Webpages and Tweets Enhance Resilience Research and Education? Edward A. Fox, Andrea.

Presentation transcript:

GFURR seminar
Can Collecting, Archiving, Analyzing, and Accessing Webpages and Tweets Enhance Resilience Research and Education?
Edward A. Fox, Andrea Kavanaugh, Donald Shoemaker, Steven Sheetz, Mohamed Magdy, and Sunshin Lee
IDEAL project, DLRL, CS, Virginia Tech
Feb. 11, 2016
Acknowledgments: CHCI, CCSR # and NSF grants IIS , IIS , IIS , DUE

Topics
– CCSR project with Arlington and IBM
– IDEAL: project, example collections
– Big data collection, processing, tools
– Case study, demo: water main breaks
– Discussion: connecting IDEAL & GFURR

Center for Community Security & Resilience (CCSR) Social Media for Cities, Counties and Communities Funded by CCSR # with Arlington County, VA

Number of Followers for 34 Civic Orgs. (Crisis, Tragedy, and Recovery Network)
Unique Followers: 22,325

Orgs Followers’ Followers Count

Arlington Tweet Analysis
ArlingtonUW (ArlingtonUnwired.com)
Org bio: Active. Mobile. Community. Your source for everything Arlington
Followers’ bio
Followers’ recent 20 tweets

Facebook Analysis (Arlington Facebook Analysis)
Posts by Arlington County
o 112 posts over August and September 2010
o 824 responses to those posts
Posts highly consistent with Social Media Policy
Evaluated county posts to identify the topics being communicated
Identified the number and overall nature (positive or negative) of responses for each post

Facebook Analysis: Topic Frequency

Facebook Analysis: Responses
824 responses
– 18% of the 4500 fans on Facebook responded in the last 2 months (assuming 1 post per person)
Mostly positive responses
– Many “LIKES” (button on Facebook)
Top 21 (19%) posts received 50% of responses
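The percentages on this slide can be re-derived from the counts given in the Facebook slides (112 posts, 824 responses, 4500 fans, top 21 posts). The sketch below is only an arithmetic check; the variable names are illustrative, and the per-post averages are derived figures that do not appear in the slides.

```python
# Counts taken from the Facebook analysis slides.
posts = 112
responses = 824
fans = 4500
top_posts = 21
top_share_of_responses = 0.50  # "Top 21 (19%) posts received 50% of responses"

# Share of fans who responded, under the slide's own assumption
# of one response per person.
responder_share = responses / fans

# Share of all posts represented by the top 21 posts.
top_post_share = top_posts / posts

# Derived illustration (not in the slides): average responses per post
# inside and outside the top 21.
top_responses = responses * top_share_of_responses
avg_top = top_responses / top_posts
avg_rest = (responses - top_responses) / (posts - top_posts)

print(f"Responder share: {responder_share:.3f}")
print(f"Top-post share:  {top_post_share:.4f}")
print(f"Avg responses per top post:   {avg_top:.1f}")
print(f"Avg responses per other post: {avg_rest:.1f}")
```

Rounded to whole percentages, the two shares match the 18% and 19% reported on the slide.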

Facebook Analysis: Top 21 Post Responses by Topic

Facebook Analysis: Overall Response to Post

Arlington YouTube Tag Analysis: Tag Clouds for Arlington County
Produced from 1,800 YouTube videos
Search for videos containing the phrase “Arlington County”
o Search performed using a Perl script
o Generated from all videos that met these criteria
2 types of tag clouds generated:
1) Using video titles
2) Using video tags (presented in next slide)
What can we learn from these representations of social media use?
o Size of words represents the frequency with which each term appeared in the search
o Provides some indication of the importance of certain civic issues to members of the community
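The idea that word size tracks term frequency can be sketched in a few lines. The original search used a Perl script that is not shown in the slides, so what follows is a stand-in written in Python, not the authors' code. The sample titles, the stopword list, the function names, and the linear scaling between 10 pt and 40 pt are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "for", "to", "at"}


def term_frequencies(texts, stopwords=STOPWORDS):
    """Count how often each non-stopword term appears across all texts."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts


def font_sizes(counts, min_pt=10, max_pt=40):
    """Map each term's count to a font size by linear scaling (assumed scheme)."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    return {
        word: (max_pt if span == 0 else min_pt + (n - lo) * (max_pt - min_pt) / span)
        for word, n in counts.items()
    }


# Hypothetical video titles, standing in for the 1,800 search results.
titles = [
    "Arlington County Board meeting",
    "Arlington County fair",
    "Snow removal in Arlington",
]
counts = term_frequencies(titles)
sizes = font_sizes(counts)
for word, n in counts.most_common():
    print(f"{word}: count={n}, size={sizes[word]:.0f}pt")
```

The same two functions would apply unchanged to video tags instead of titles, which is how the second type of tag cloud could be produced.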

Prior History, Studies, Connections
Prior grants related to:
– 4/16 archiving
– Collection and infrastructure for events related to crises, tragedies, and community recovery
Ontologies, emergency management, civil unrest
Education connections
– Problem/project based learning (PBL)
– Computational linguistics (NLP): CS4984
– Information retrieval (search engines): CS5604

Integrated Digital Events Archiving and Library (IDEAL) Project
Collections
– 66 webpage collections hosted by the Internet Archive through Archive-It, curated by Virginia Tech (11TB in size)
– 1.1 billion tweets (across about 1000 collections): many related to important local, national, and global events/concerns
Services
– Collecting, archiving, analyzing, searching, browsing, and visualizing, utilizing our Hadoop cluster to aid researchers and other interested parties

Collecting Webpages
Started 2007
Used Internet Archive (IA)
– 66 collections
– 11TB
Shootings, earthquakes, bombings, hurricanes, …

Collecting Tweets
Collections for multiple projects
– Tweets from YourTwapperKeeper, DMI-TCAT

Collection Example 1: School Shooting
Collection
– Over 1 million tweets concerning school shootings
– A map of worldwide school shootings and a timeline of international school shootings
Users
– First responders
– Urban and emergency planners
– Treatment and counseling therapists
– Social science researchers studying tragic events and their aftermaths (including personal and community resilience and recovery)

Collection Example 2: GETAR project
Global Event and Trend Archive Research
– Tackle key global challenges, e.g., climate change (as well as opportunities), innovation and resilience
Collection
– Started 10/8/2015
– 306 collections
– 30,961,650 tweets (as of 2/10/2016)
– Including global warming, Internet of things, population, and environment

What Are Big Data and Hadoop?
Definition
– Big data: a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. (1)
– Apache Hadoop: a framework for distributed processing of large data sets across clusters of computers using simple programming models. (2)
1) Big data definition: wikipedia.org
2) Hadoop definition: hadoop.apache.org

Hadoop solutions
Hadoop
– Cloudera Academic Partnership, software
– MapReduce (YARN: MapReduce V2): a programming model for processing large data sets with a parallel, distributed algorithm on a cluster
– HDFS: a distributed, scalable, reliable, and portable file system written in Java for the Hadoop framework
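The MapReduce programming model named above can be illustrated with a word count, the usual introductory example. The sketch below runs in a single Python process and only mimics the three phases (map, shuffle, reduce); it does not use Hadoop or any of its APIs, and the function names and sample records are made up for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical tweet texts standing in for a large collection.
records = [
    "water main break downtown",
    "main street closed",
    "water service restored",
]
word_counts = run_job(records)
print(word_counts)
```

On a real cluster the map and reduce calls would run in parallel on different nodes, with the input split across HDFS blocks; the per-record and per-key independence of the two functions is what makes that distribution possible.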

Archiving and Analyzing Using Big Data Hadoop Cluster
Hadoop (using desktop PCs)
– # of nodes: 20
– CPU: Intel i5 Haswell quad core 3.3GHz
– RAM: 640 GB (20 * 32GB RAM)
– HDD: 60 TB (20 * 3TB HDD)
– Backup: 12TB, 8.3TB NAS
Servers
– Tweet collecting
– Web crawling
– Geocoding
– Search (Solr)
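The cluster totals follow directly from the per-node figures on this slide. The short sketch below reproduces that arithmetic and adds one rough estimate that is not in the slides: usable HDFS capacity under a replication factor of 3. That factor is the stock HDFS default and is only an assumption here, since the slides do not state the cluster's actual setting.

```python
# Per-node figures from the slide.
NODES = 20
RAM_PER_NODE_GB = 32
DISK_PER_NODE_TB = 3

# Assumption: HDFS default replication factor; the slides do not give
# the value configured on this cluster.
REPLICATION_FACTOR = 3

total_ram_gb = NODES * RAM_PER_NODE_GB      # matches "640 GB" on the slide
raw_disk_tb = NODES * DISK_PER_NODE_TB      # matches "60 TB" on the slide

# Rough upper bound on distinct data HDFS could hold, ignoring space
# reserved for the OS, logs, and intermediate job output.
usable_disk_tb = raw_disk_tb / REPLICATION_FACTOR

print(f"Total RAM: {total_ram_gb} GB")
print(f"Raw disk:  {raw_disk_tb} TB")
print(f"Usable disk at replication {REPLICATION_FACTOR}: about {usable_disk_tb:.0f} TB")
```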

DLRL cluster - Services

Archiving and Analyzing using Bigdata Hadoop cluster

Tools for research
Spark or Mahout for machine learning:
– Classification, clustering
– Topic analysis (LDA), Frequent Patterns Mining
Solr/Lucene: Search/(Faceted) Browse
Natural Language Processing and Named Entity Recognition: NLTK (Python), SNER
Information visualization (social networks)
Connections with GIS, other data/info systems

Demo: Analyze a tweet collection for water main breaks (WMBs)

Processing (also for CS5604)

What Causes Water Main Breaks? MassLive.com AccuWeather.com

What Causes Water Main Breaks? Earthquakes (USGS) Mar. 1 – Apr. 5, 2012

Who is involved in a WMB?
Fix water pipe
– Water utility
– City/town utility
Traffic
– Police
Affected
– Citizen
Others …
Lakewood, NJ, June
West Philadelphia, PA, June 2015

Discussion
Questions?
How can IDEAL help GFURR?
How can GFURR help IDEAL?
Collaborations, proposals, partners, …
(Possible supplement related to smart and connected communities)