Computational Linguistic Analysis of Earthquake Collections

Slides:



Advertisements
Similar presentations
Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.
Advertisements

Shootings CS-4984 Computational Linguistics Arjun Chandrasekaran Saurav Sharma Peter Sulucz Jonathan Tran December 2014 Virginia TechBlacksburg, VA.
Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A.
Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2.
Arabic Natural Language Processing: P-Stemmer, Browsing Taxonomy, Text Classification, RenA, ALDA, and Template Summaries — for Arabic News Articles Tarek.
Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech Feb. 18, 2015 presentation for.
Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech 21 May 2015 presentation for.
Paragraph One: TOKYO - A powerful earthquake shook northern Japan early Tuesday, and small tsunami waves struck coastal towns about 200 miles from the.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
Deriving Paraphrases for Highly Inflected Languages from Comparable Documents Kfir Bar, Nachum Dershowitz Tel Aviv University, Israel.
Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen College of Computer Science, Zhejiang University Hangzhou, China Reporter: 洪紹祥 Adviser: 鄭淑真.
Reducing Noise CS5604: Final Presentation Xiangwen Wang, Prashant Chandrasekar.
A Case Study Example. Mark Scheme You will be awarded grades based on your investigation skills within Geography. Merits will also be available for the.
Project Final Presentation – Dec. 6, 2012 CS 5604 : Information Storage and Retrieval Instructor: Prof. Edward Fox GTA : Tarek Kanan ProjArabic Team Ahmed.
Earthquakes and Tsunamis. Learning Intentions To understand the frequency of earthquakes around the world. To explore how earthquakes may lead to tsunamis.
Problem Based Learning To Build And Search Tweet And Web Archives Richard Gruss Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science.
Teaching Big Data Through Problem-Based Learning Richard Gruss, Business Information Technology, Virginia Tech Tarek Kanan Software Engineering Department.
CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging II Transformation Based Tagging Brill (1995)
The American Revolution Kristen Byrne EDU Prof. R. Moroney Summer 2010.
GFURR seminar Can Collecting, Archiving, Analyzing, and Accessing Webpages and Tweets Enhance Resilience Research and Education? Edward A. Fox, Andrea.
Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,
Earthquakes in Rich Countries
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Big Data Processing of School Shooting Archives
NLTK Natural Language Processing with Python, Steven Bird, Ewan Klein, and Edward Loper, O'REILLY, 2009.
CS6604 Digital Libraries Global Events Team Final Presentation
Collection Management Webpages
Rdoc2vec Jake Clark, Austin Cooke, Steven Rolph, Stephen Sherrard
Common Crawl Mining Team: Brian Clarke, Tommy Dean, Ali Pasha, Casey Butenhoff Manager: Don Sanderson (Eastman Chemical Company) Client: Ken Denmark.
Memory Standardization
Floods Joe Acanfora, Myron Su, David Keimig and Marc Evangelista
“Earthquake centered in Delaware shakes Philadelphia region”
Team C Stanislaw Antol, Souleiman Ayoub, Carlos Folgar, Steve Smith
Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:
VR4GETAR CS4624: Multimedia, Hypertext and Information Access
Trail Study Kevin Cianfarini, Shane Davies, Marshall Hansen, Andrew Eason … CS4624: Multimedia, Hypertext, and Information Access Instructor: Dr. Edward.
Clustering tweets and webpages
Pathways Web CS4624 Multimedia, Hypertext, and Information Access
Text Analytics Giuseppe Attardi Università di Pisa
CS 5604 Information Storage and Retrieval
Earthquakes 7.1 Earthquakes occur along faults. 7.2
TWIN EARTHQUAKES HIT WESTERN CHINA ON JULY 22, Deaths Despite Being Moderate-Magnitude Events Walter Hays, Global Alliance for Disaster Reduction,
Social Knowledge Mining
The Team Ernesto Cortes Kipp Dunn Sar Gregorczyk Alex Schmidt
Phil Durrant Debra Myhill Mark Brenchley
Event Focused URL Extraction from Tweets
Literacy Content Specialist, CDE
NRV Tweets Final Presentation VT CS4624, Blacksburg, VA
Collection Management Webpages Final Presentation
Earthquakes 7.1 Earthquakes occur along faults. 7.2
Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.
Twitter Equity Firm Value
Earthquakes 7.1 Earthquakes occur along faults. 7.2
LucidWorks: Vectorize Workflow Module
Arabic News Summarization
Information Storage and Retrieval
News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris
Katrina Database SearchKat
Through the Fire and Flames
CS5984:Big Data Text Summarization
CS 5984: Team 5 Final Presentation
Big Data Text Summarization Westminster Attack
Statistical n-gram David ling.
Introduction to Text Analysis
CS4984/CS598: Big Data Text Summarization
Team 7 → Final Presentation
PPT and video are due no later than March 1, 2019
Python4ML An open-source course for everyone
Presentation transcript:

Computational Linguistic Analysis of Earthquake Collections Team F: Kevin Kokal, Kwamina Orleans-Pobee, Chris Wakeley, Reed Bialousz CS4984 - 12/08/2014 Virginia Tech, Blacksburg, VA 24061 Kevin

Outline Problem Statement Acclimation (Units 1-3) Exploration (Units 4-7) Culmination (Units 8-9) Results Lessons Learned Kevin

Problem Statement Objective: Construct a meaningful summary of an earthquake event using a combination of computational linguistic techniques. Collection Location Year # of Files Small Lushan, China 2013 454 Big Virginia, U.S.A. 2011 4599 Kevin

Acclimation: Overview Familiarize with NLTK/Python Examine Value of Word Frequency Refine Lemmatization Word Stemming Familiarize with Hadoop/MapReduce Examine Part of Speech Tagging Utilize Cluster Kevin

Acclimation Unit 1: Word Frequencies with ClassEvent With restrictions, decent results Collocations improved results Best small set of word pairs: Long Island; New York; Weather Service; National Weather; Bay Shore; heavy rains; Suffolk County; rain fell; Associated Press Kevin

Acclimation (cont.) Unit 2: Improve on Unit 1 Solution Utilize WordNet Lemmatize Explore Hadoop/MapReduce Noise in Small Collection After Simple Cleaning, Results Improved Reed

Acclimation (cont.) Unit 3: Incorporate Part of Speech Tagging Constructed a Trigram Part of Speech Tagger Back-Off Model Lemmatized Top Nouns and a Few Top Verbs Mixed Results More Cleaning Required (See Below) Small: ['earthquake', '2717'], ['quake', '1965'], ['sichuan', '1473'] Reed Big: [' earthquake', ' 36399'], [' comment', ' 31146'], [' video', ' 31131']

Exploration: Overview Clean Collections Incorporate Named Entity Recognition Summarize by Topics Using Mahout/LDA Explore Clustering to Summarize by Indicative Sentences Reed

Exploration Unit 4: Focus on Cleaning Collections Using Bigram Features Internet Stop Words List Decision Tree Classifier (> 90% Accuracy) Rerunning Past Units Produced Improved Results The powerful earthquake that rocked China has claimed 165 lives so far and injured more than 6,700, according to state media reports. The quake that southwest China on Saturday wrecked homes and triggered landslides in an area devastated by a major tremor in 2008. initialComments:true! pubdate:08/25/2011 06:54 EDT! commentPeriod:3! commentEndDate:8/28/11 6:54 EDT! currentDate:9/5/11 8:51 EDT! allowComments:false! Reed

Exploration (cont.) Unit 5: Examining Named Entities Locations = Most Important Stanford NER > NLTK.ne_chunk Unhelpful results for “People” for both technologies Solr Queries: More Locations = Better Results Small: [('China’, 450), ('Sikkim', 244), ('Sichuan', 171)] Big: [('Virginia', 46), ('East', 15), ('Washington', 14), ('Coast', 12)] Chris

Exploration (cont.) Unit 6: Summarizing by Topic Explored LDA using Mahout on Hadoop Cluster A Few Meaningful Results: Solr Results for Top 5 Topics: Great for Small, Mediocre for Big Big Collection Needed Cleaning {sichuan:0.034,china:0.033,strikes:0.025,deadly:0.021,saturday:0.014,cnn:0.011,quake:0.010,april:0.009,chinese:0.009,sunday:0.008} Chris

Exploration (cont.) Unit 7: Sentence Clustering Cleaned Big Collection Experimented with Number of Clusters Generally not meaningful, with Exceptions 180 people have died, at least another 24 are missing. Chris

Culmination: Overview Construct a Template of Relevant Information for Earthquakes Fill in Template for Small and Big Collections Construct English Report Using Filled in Template Chris

Culmination Unit 8: Construct and Fill “Earthquake Template” Researched Earthquakes to Construct Template Used Regular Expressions to Fill Corpus Language Influences Results (i.e. Magnitude) Mode = Most Accurate Great Results Kwamina

Culmination (cont.) Big Corpus Earthquake Analysis: Date: August 23, 2011 Location: Virginia Magnitude: 5.8 Epicenter: Louisa Deaths: 140 Time: 1:51 p.m. Small Corpus Earthquake Analysis: Date: April 20, 2013 Location: Lushan Magnitude: 6.6 Epicenter: Lushan Deaths: 186 Time: 8:02 a.m. Kwamina

Culmination (cont.) Unit 9: Using Filled Templates to Construct English Reports Improved Template: Added booleans for aftershocks, tsunamis, etc. Created a report structure for any earthquake Kwamina

Results YourBig: On 23 August, 2011 at 1:51 p.m., a 5.8 magnitude earthquake struck Virginia. The epicenter of the quake was located at Louisa. There were aftershocks that followed the earthquake and no tsunami was caused by the earthquake. There were no reports of landslides due to this earthquake. A total of 140 deaths occurred. YourSmall: On 20 April, 2013 at 8:02 a.m., a 6.6 magnitude earthquake struck Lushan. The epicenter of the quake was located at Lushan. There were aftershocks that followed the earthquake and no tsunami was caused by the earthquake. There were reports of landslides due to this earthquake. A total of 186 deaths occurred. Kwamina

Lessons Learned Cleanliness is Key Familiarize -> Explore -> Practice -> Utilize Parallelize Document and Reuse Implement Using Multiple Strategies Kevin

Acknowledgements CS4984 Instructors CS4984 Classmates Sponsors Dr. Edward Alan Fox (fox@vt.edu) Tarek Kanan (tarekk@vt.edu) Xuan Zhang (xuancs@vt.edu) Mohamed Magdy (mmagdy@vt.edu) CS4984 Classmates Sponsors NSF DUE-1141209 IIS-1319578 Kevin

References Bird, Steven, and Ewan Klein. Natural Language Processing with Python. Beijing: O'Reilly, 2009. Print. See “References” Page of CS4984 Wiki http://www.nltk.org/ https://docs.python.org/2/

Questions Kevin http://www.cardiomyopathy.org/assets/images/news/question.jpg