Floods Joe Acanfora, Myron Su, David Keimig and Marc Evangelista

Slides:



Advertisements
Similar presentations
SS7G10 The student will discuss environmental issues across South & East Asia. a) Describe the causes and effects of pollution on the Yangtze and Ganges.
Advertisements

Shootings CS-4984 Computational Linguistics Arjun Chandrasekaran Saurav Sharma Peter Sulucz Jonathan Tran December 2014 Virginia TechBlacksburg, VA.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech Feb. 18, 2015 presentation for.
Big Data and Hadoop and DLRL Introduction to the DLRL Hadoop Cluster Sunshin Lee and Edward A. Fox DLRL, CS, Virginia Tech 21 May 2015 presentation for.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
CS525: Big Data Analytics Machine Learning on Hadoop Fall 2013 Elke A. Rundensteiner 1.
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
Reducing Noise CS5604: Final Presentation Xiangwen Wang, Prashant Chandrasekar.
Introduction Use machine learning and various classifying techniques to be able to create an algorithm that can decipher between spam and ham s. .
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
Project Final Presentation – Dec. 6, 2012 CS 5604 : Information Storage and Retrieval Instructor: Prof. Edward Fox GTA : Tarek Kanan ProjArabic Team Ahmed.
L JSTOR Tools for Linguists 22nd June 2009 Michael Krot Clare Llewellyn Matt O’Donnell.
What environmental issues are illustrated in the pictures?
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
By Jacob. Where floods typically occur. Floods usually occur on rivers, creeks and bays. Floods also occur after tsunamis and hurricanes.
8 December 1997Industry Day Applications of SuperTagging Raman Chandrasekar.
Problem Based Learning To Build And Search Tweet And Web Archives Richard Gruss Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science.
Teaching Big Data Through Problem-Based Learning Richard Gruss, Business Information Technology, Virginia Tech Tarek Kanan Software Engineering Department.
Pakistan Floods Causes  On 29 July 2010 flash floods and landslides caused by unusually heavy monsoon rains caused widespread flooding in North.
A flash flood that shocked the country.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Big Data Processing of School Shooting Archives
Which flood? Boscastle 2004 Bangladesh 2004 Slow onset Flood
A Simple Approach for Author Profiling in MapReduce
Introduction Most samples in Household Travel Surveys (HTS) complete via web Geocoding is an important element in HTS collection Online geocoding services.
Measuring Monolinguality
CS6604 Digital Libraries Global Events Team Final Presentation
Tools for Natural Language Processing Applications
Collection Management Webpages
Text Based Information Retrieval
INAGO Project Automatic Knowledge Base Generation from Text for Interactive Question Answering.
What environmental issues are illustrated in the pictures?
Daniel Bevis William King Villanova University Spring 2006 CS9010
Natural Language Processing (NLP)
WARM UP Name all 7 continents..
Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:
Clustering and Topic Analysis
Clustering tweets and webpages
Text Analytics Giuseppe Attardi Università di Pisa
What environmental issues are illustrated in the pictures?
NEW UNIT: COASTAL ENVIRONMENTS Factors Affecting the Coastline
Hurricane Camille.
Collection Management Webpages Final Presentation
Extracting Semantic Concept Relations
Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.
Final Presentation: Neural Network Doc Summarization
Information Storage and Retrieval
News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris
Text Transformation May 5th, 2015 CS Multimedia/Hypertext
Computational Linguistic Analysis of Earthquake Collections
Tweet URL Analysis Guoxin Sun, Kehan Lyu, Liyan Li
Katrina Database SearchKat
PROJECTS SUMMARY PRESNETED BY HARISH KUMAR JANUARY 10,2018.
Through the Fire and Flames
CS5984:Big Data Text Summarization
Big Data Text Summarization Westminster Attack
Automatic Summarization of News Articles about Hurricane Florence
CS4984/CS598: Big Data Text Summarization
Natural Language Processing (NLP)
What environmental issues are illustrated in the pictures?
Environmental Issues:
Introduction to Sentiment Analysis
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Natural Language Processing (NLP)
Python4ML An open-source course for everyone
Presentation transcript:

Floods Joe Acanfora, Myron Su, David Keimig and Marc Evangelista CS4984: Computation Linguistics Virginia Tech, Blacksburg December 10, 2014

Introduction Objective Discussion of corpora Final results Tools we used for cleaning the data Tools we used for language processing Tools we did not use What we learned Conclusion

Objective Generate summaries of flooding events based on collections of news articles.

Flood Data ClassEvent - Islip_Flood YourSmall - China_Flood 11 Files YourSmall - China_Flood 537 files YourBig - Pakistan_Flood 20,416 files Unclean data

U9 Results In June 2011 a flood spanning 9.94 miles caused by heavy rain affected the yangtze river in China. The total rainfall was 170.0 millimeters and the total cost of damages was 760 million dollars. The flood killed 255 people, left 87 injured, and approximately 4 million people were affected. In addition 168 people are still missing. The cities of Wuhan Beijing and Lancing were affected most by flooding, in the provinces of Zhejiang Hubei and Hunan. Finally nearly all of the flood damage occurred in the state of China.

U9 Results In August 2010 a flood spanning 600 miles caused by heavy monsoon affected the indus river in Pakistan. The total rainfall was 200.0 millimeters and the total cost of damages was 250 million dollars. The flood killed 3000 people, left 809 injured, and approximately 15 million people were affected. In addition 1300 people are still missing. The cities of Nasirabad Badheen and Irvine were affected most by flooding, in the provinces of Sindh Mandalay and Punjab. Finally nearly all of the flood damage occurred in the state of Pakistan.

Tools We Used...

Cleaning the data Removed files less than 5KiB Machine Learning DecisionTreeClassifier = 90% NaiveBayesClassifier = 80% MaxEntropyClassifier= 73% SklearnClassifier = 92% Picked top paragraphs from corpus Used WordNet on 20 words Tokenized by paragraph Picked paragraphs with at least 2 WordNet results Sklearn couldn’t be on hadoop

Cleaned Data Merged remaining documents to one for parsing Collection Pre-clean size Post-clean size % bytes reduced YourSmall 2.0 MiB 288 KiB 86% YourBig 136.7 MiB 3.7 MiB 98% Merged remaining documents to one for parsing easier to work with relevant paragraphs in one doc

Classifier Machine learning through decision tree classifier Accurate Inaccurate Percentage YourSmall 90 10 90% YourBig 83 17 83%

Frequency Analysis Purposes Cleaning data Generating summary Building YourWord list

POS Tagging Used the POS tagger for our regular expression “cause” string Checked to see if the cause string returned by the regular expression contained some subject (noun) In June 2011 a flood spanning 9.94 miles caused by heavy rain affected the yangtze river in China.

Regex Best used on cleaned data Patterns prevalent in news reports Same methods of describing flooding event

Regex examples "affected by ____", "result of ____", "caused by _____", "by ____" day/month/year ____ people killed/missing/injured ____ (b|m|tr|etc...)illions dollars ____ miles/km/etc... marc reads this, pass off to Joe

NER Tagger Rather than using the NER tagger for tagging locations we decided to use a Google Maps API... Joe does NER

Contextualizing Locations Google Geocoder API pygeocoder Python package

Tools We Did Not Use...

Bigrams & N-grams Not used extensively Bigrams were good, but already in YourWords Operations we used were based on single words Did help with regex

flash flooding heavy rains inches rain rain fell Useful bigrams YourWords flash flooding heavy rains inches rain rain fell flood rain overflow dam storm severe water damage submerge washed collapsed river discharge downpour flash sweep torrential runoff

flash flooding heavy rains inches rain rain fell Useful bigrams Some regexes flash flooding heavy rains inches rain rain fell (\d+.\d+\smillimeters)|(\d+.\d+\smm))|(\d+.\d+\s(inches|inch) due\sto(\s[A-Za-z]{3,}){1,3}|result\sof(\s[A-Za-z]{3,}){1,3}|caused\sby(\s[A-Za-z]{3,}){1,3}|by\s([A-Za-z]{4,}){1,2})|heavy\s([A-Z a-z]{3,} we could see the causes of flooding

Clustering & Mahout Documents similar enough that clusters would be indistinguishable Wanted data from all good sources Clean data was good enough

Chunking Finds multitoken sequences Knowledge of existing data brainstormed our own chunks, which was good enough would be helpful if we didn’t know patterns Regular expressions alone did the job well on clean data chunking finds phrases like “a result of” and stuff

Conclusion

Wrap Up - Challenges New Technologies Group Logistics Hadoop - Map/Reduce NLTK Library Group Logistics Times Work Distribution

Wrap Up - Strengths Technical Strengths Team Strengths Python LaTeX Willing to learn Team synergy

Conclusion - Improvements Underestimates Deaths Damages Build statistical model to improve accuracy Spatial locations Mean distances Generate map using Google API

Citations https://pypi.python.org/pypi/geocoder/0.9.1 http://www.nltk.org/book_1ed

Many Thanks Dr. Edward Fox GTA Tarek Kanan GTA Xuan Zhang GRA Mohamed Magdy Gharib Farag National Science Foundation, Computing in Context, NSF DUE-1141209 Villanova

Questions