Through the Fire and Flames


Through the Fire and Flames
Michael Zamani, Hayden Lee, Michael Trujillo, Jordan Plahn
CS 4984: Computational Linguistics
Virginia Tech, Blacksburg, VA
December 8, 2014
Speaker: Hayden

Outline
Goals
Driving Question
Corpus Details
Summary of Results
Lessons Learned
Deliverables
Acknowledgements
Speaker: Hayden

Team Goals
Natural Language Processing
Hadoop
Solving open-ended problems
Speaker: Hayden

Driving Question
What is the best summary that can be automatically generated for a document collection about a fire?
Speaker: Hayden

Corpus Details
Small corpus: Texas Wildfire, ~19,500 files
Large corpus: Brazil Nightclub Fire, ~690,000 files
Corpus   Avg. # lines per file pre-cleanup   Avg. # lines per file post-cleanup   % Duplicate files
Small    1001                                14                                   78
Large    3846                                56                                   85
Speaker notes (Jordan, 1 minute): Duplicates; file composition

Cleaning the Collections
Many documents were duplicates because the web crawler matched “forest park contact” sites.
Each document contained sentences that had been scraped from irrelevant sections such as navigation menus and advertisements.
These duplicate documents and irrelevant sections were deleted.
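The cleanup step can be illustrated with a minimal sketch. It assumes exact-match deduplication by content hash and a small, hypothetical list of boilerplate markers; the slides do not say how the team actually detected duplicates or irrelevant sections.

```python
import hashlib

# Hypothetical markers of navigation/advertisement lines (an assumption;
# the team's real filtering rules are not given in the slides).
BOILERPLATE_MARKERS = ("advertisement", "sign in", "contact us", "home |")


def clean_text(text):
    """Drop blank lines and lines that contain a boilerplate marker."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(marker in stripped.lower() for marker in BOILERPLATE_MARKERS):
            continue
        kept.append(stripped)
    return "\n".join(kept)


def deduplicate(documents):
    """Keep the first document for each distinct cleaned text (exact match)."""
    seen = set()
    unique = []
    for doc in documents:
        cleaned = clean_text(doc)
        digest = hashlib.sha1(cleaned.encode("utf-8")).hexdigest()
        # Skip documents that are empty after cleaning or already seen.
        if cleaned and digest not in seen:
            seen.add(digest)
            unique.append(cleaned)
    return unique


docs = [
    "Home | News\nFire burned 33,000 acres near Bastrop.\nAdvertisement",
    "Fire burned 33,000 acres near Bastrop.",
    "Contact us\n",
]
unique_docs = deduplicate(docs)
```

Cleaning before hashing matters here: the first two sample documents differ only in boilerplate, so they collapse into one entry once the menu and advertisement lines are gone.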

Summary of Results
Speaker: Jordan

Feature Set Extraction
Frequency with experimental filters: stopwords, word length
Synonyms via NLTK WordNet
Part of speech: nouns and verbs
N-grams
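A rough sketch of the frequency filters and n-grams listed above, using only the Python standard library. The stopword list, the minimum word length, and the sample sentence are illustrative assumptions; the team used NLTK, which ships a fuller stopword list and WordNet synonym lookup.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not NLTK's list).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "by", "for"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def filtered_frequencies(tokens, min_length=4):
    """Count tokens that are not stopwords and have at least min_length letters."""
    return Counter(
        t for t in tokens if t not in STOPWORDS and len(t) >= min_length
    )


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "The fire burned acres of forest and the fire forced residents to evacuate."
tokens = tokenize(text)
freqs = filtered_frequencies(tokens)
bigrams = Counter(ngrams(tokens, 2))
```

Part-of-speech filtering and synonym expansion are left out of this sketch because they need a tagger and a lexical database.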

NLTK Most Informative Features
Feature        Odds (True : False)
flames         21.3 : 1.0
fires          20.1 : 1.0
burned         20.0 : 1.0
firefighting   16.4 : 1.0
acres          15.3 : 1.0
drought        15.2 : 1.0
evacuate       12.8 : 1.0
evacuated      12.6 : 1.0
burn           12.3 : 1.0
wildfires      12.0 : 1.0
Speaker notes (Hayden, 2 minutes): We began the class using a lot of frequency-dependent algorithms (words must be x chars long, collocations, n-grams). One technique we found particularly useful was NLTK’s most informative features method.
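The odds in this table compare how likely a word is to appear in relevant (True) documents versus irrelevant (False) ones. The sketch below computes such a ratio by hand, assuming word-presence features and add-0.5 smoothing. It is an approximation of the idea, not NLTK's implementation, and it ranks only words that favor the True label. The sample documents are made up.

```python
def presence_probability(word, docs, smoothing=0.5):
    """Smoothed probability that a document in docs contains word."""
    count = sum(1 for doc in docs if word in doc)
    # Two outcomes (present / absent), so the denominator grows by 2 * smoothing.
    return (count + smoothing) / (len(docs) + 2 * smoothing)


def most_informative(positive_docs, negative_docs, top_n=3):
    """Rank words by P(word | True) / P(word | False), highest first."""
    vocabulary = set()
    for doc in positive_docs + negative_docs:
        vocabulary.update(doc)
    odds = {
        word: presence_probability(word, positive_docs)
        / presence_probability(word, negative_docs)
        for word in vocabulary
    }
    # Sort by descending odds, breaking ties alphabetically for stable output.
    return sorted(odds.items(), key=lambda item: (-item[1], item[0]))[:top_n]


# Made-up documents, each represented as a set of words.
positive_docs = [
    {"flames", "acres", "news"},
    {"flames", "drought", "news"},
    {"flames", "evacuate"},
]
negative_docs = [
    {"news", "sports"},
    {"news", "weather"},
    {"news", "music"},
]
ranked = most_informative(positive_docs, negative_docs)
```

A word such as "news", which appears in both classes, gets odds near 1 and drops out of the top of the ranking, which is why this measure is useful for picking classifier features.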

Classification
5-Fold Validation Results
Small: 17.1% of ~7,200 classified positive.
Large: 9.87% of ~30,000 classified positive.
We chose the Maximum Entropy (ME) classifier over the Decision Tree (DT) classifier because the training set was small relative to the overall corpus, and the DT classifier was more likely to over-fit the training data.
Fold #    Maximum Entropy   Decision Tree   Naive Bayes
1         0.78              0.80            0.68
2         0.98              0.98            0.95
3         0.87              0.88            0.87
4         0.97              0.93            0.90
5         0.77              0.77            0.77
Overall   ~0.874            ~0.872          ~0.834
Speaker: Michael Z (2 minutes)
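The mechanics of 5-fold validation can be sketched as follows. The "classifier" here is a stand-in that always predicts the majority label of its training fold, and the labels are made up. The team trained NLTK classifiers, which are not reproduced here.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def majority_label(labels):
    """Return the most common label (stand-in for training a real classifier)."""
    return max(sorted(set(labels)), key=labels.count)


def cross_validate(labels, k=5):
    """Return per-fold accuracies and their mean."""
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        prediction = majority_label([labels[i] for i in train])
        correct = sum(1 for i in test if labels[i] == prediction)
        accuracies.append(correct / len(test))
    return accuracies, sum(accuracies) / len(accuracies)


# Made-up relevance labels for ten documents.
labels = [True, False, False, True, False, False, False, True, False, False]
fold_accuracies, overall = cross_validate(labels)
```

The overall score is the plain mean of the five fold accuracies. In practice the items are usually shuffled before splitting so that each fold is representative.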

Topic Summarization
Gensim Latent Dirichlet Allocation (LDA)
Discover semantic structure of document
Small corpus topics:
Topic 1: 0.006*texas + 0.005*news + 0.005*fire + 0.005*2011 + 0.003*ago + 0.003*us + 0.003*new + 0.003*people + 0.002*1 + 0.002*said
Topic 2: 0.017*fire + 0.006*2011 + 0.005*september + 0.004*texas + 0.004*news + 0.003*us + 0.003*2010 + 0.003*firefighter + 0.003*firefighters + 0.003*new
Large corpus topics:
Topic 1: 0.018*fire + 0.016*nightclub + 0.010*brazil + 0.007*people + 0.006*santa + 0.005*club + 0.005*said + 0.004*sign + 0.004*news + 0.004*maria
Topic 2: 0.010*fire + 0.006*brazil + 0.006*sign + 0.006*people + 0.006*nightclub + 0.005*news + 0.004*santa + 0.004*youtube + 0.004*club + 0.003*ago
Speaker: Jordan (2 minutes)
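Each topic is printed as a weighted sum of words, where a larger weight means the word contributes more to the topic. As a small illustration of how to read that output, the sketch below parses one topic string into (word, weight) pairs. The parser and its regular expression are assumptions written for this format, not part of Gensim.

```python
import re

# Matches "weight*word", tolerating optional double quotes around the word.
TERM_PATTERN = re.compile(r'(\d*\.?\d+)\*"?([^\s"+,]+)"?')


def parse_topic(topic_string):
    """Turn '0.018*fire + 0.016*nightclub' into [('fire', 0.018), ...]."""
    return [
        (word, float(weight))
        for weight, word in TERM_PATTERN.findall(topic_string)
    ]


topic = "0.018*fire + 0.016*nightclub + 0.010*brazil + 0.007*people"
terms = parse_topic(topic)
top_words = [word for word, _ in terms[:3]]
```

Parsed this way, the large-corpus topics are dominated by "fire", "nightclub", and "brazil", while the small-corpus topics spread their weight over more generic words such as "news" and "new".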

k-means Clustering
Apache Mahout
Groups objects that are similar to each other based on a feature
Results:
Word        k-means distance
state       3.629
counties    4.476
bastrop     2.630
erupted
wildfires
largest     3.965
Speaker: Michael T.
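For readers unfamiliar with k-means, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm) on made-up 2-D points with fixed starting centroids. It is not Mahout's implementation, which runs as distributed jobs on Hadoop over document vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, assignments


# Made-up points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = k_means(points, [(1.0, 1.0), (8.0, 8.0)])
```

The distance from each item to its cluster centroid is the kind of value reported in the results table: smaller distances indicate items closer to the center of their cluster.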

Extracting and Refining Results
Used regular expressions to extract data for each attribute.
Narrowed data to the top 10 most frequent results for each attribute.
Used part-of-speech tagging to ignore inappropriate results (such as a verb when an adjective was expected).
Built a basic grammatical model to adjust our template based on the best result (e.g., inserting ‘ended up’ if a verb ending in ‘ing’ was present, or ‘there were’ if a number followed by a noun was present).
Conjugated present tense verbs to past tense in the selected result (using the Python Pattern library).
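The first two steps plus template filling can be sketched with the standard library alone. The patterns, attribute names, template wording, and sample sentences are all hypothetical; the team's actual regular expressions and template are not shown in the slides. Part-of-speech filtering and verb conjugation are omitted.

```python
import re
from collections import Counter

# Hypothetical extraction patterns, one per template attribute.
PATTERNS = {
    "acres": re.compile(r"([\d,]+) acres"),
    "location": re.compile(r"\bin ([A-Z][a-z]+)"),
}

# Hypothetical summary template with one slot per attribute.
TEMPLATE = "There was a fire in {location} that grew to encompass {acres} acres."


def most_frequent(sentences, pattern, top_n=10):
    """Count every match of pattern and return the top_n most common values."""
    counts = Counter()
    for sentence in sentences:
        counts.update(pattern.findall(sentence))
    return counts.most_common(top_n)


def fill_template(sentences):
    """Fill each slot with the single most frequent extracted value."""
    values = {}
    for attribute, pattern in PATTERNS.items():
        ranked = most_frequent(sentences, pattern)
        values[attribute] = ranked[0][0] if ranked else "an unknown value"
    return TEMPLATE.format(**values)


sentences = [
    "The wildfire in Bastrop burned 33,000 acres.",
    "Officials in Bastrop said 33,000 acres were lost.",
    "A smaller fire in Austin burned 100 acres.",
]
summary = fill_template(sentences)
```

Choosing the most frequent value relies on the redundancy of a news collection: the correct figure tends to be repeated across many articles, while stray matches appear only once or twice.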

Results
Small: In September 2011, there was a fire started by a historic drought in Bastrop. This fire, fueled by hot temperatures, strong winds, grew to encompass 33,000 acres, burned for several days, and ended up killing four. 400 firefighters responded to the wildfire. 700 homes were affected as a result of the fire.
Large: In January 2013 there was a fire started by indoor fireworks in Santa Maria. This fire, fueled by ignited foam, grew to the size of the building, engulfed the club and ended up killing 309. Firefighters worked to douse a fire at the Kiss Club. One exit was made unavailable for a period of time. Compared to previous fires in the city it was fast-moving.
Speaker: Michael T

Lessons Learned
Spend the time learning the theory BEFORE attempting the problem.
Designate a domain ‘expert’ for each topic covered.
Test algorithms (e.g., MapReduce) locally on small data sets before scaling to full collections.
Garbage in = Garbage out.
Speaker: Michael Z

Acknowledgements
We would like to thank: Dr. Fox, Tarek Kanan, Xuan Zhang, Mohamed Magdy, and the National Science Foundation (for providing the grant for this class).
Speaker: Jordan

References
Rehurek, Radim. "Introduction." Gensim: Topic Modelling for Humans. Gensim, 17 Nov. 2014. Web. 08 Dec. 2014.
"Clustering - K-means." A Tutorial on Clustering Algorithms. Polytechnic University of Milan. Web. 08 Dec. 2014.
Speaker: Jordan

Questions?
Speaker: Hayden

Contact
Michael Zamani - mzamani1@vt.edu
Hayden Lee - hjl33@vt.edu
Michael Trujillo - mtruj@vt.edu
Jordan Plahn - jplahn@vt.edu