Shootings CS-4984 Computational Linguistics Arjun Chandrasekaran Saurav Sharma Peter Sulucz Jonathan Tran December 2014 Virginia TechBlacksburg, VA.

Slides:



Advertisements
Similar presentations
University of Sheffield NLP Exercise I Objective: Implement a ML component based on SVM to identify the following concepts in company profiles: company.
Advertisements

Today Is Sunday By Dr Jean.
Presenters: Arni, Sanjana.  Subtask of Information Extraction  Identify known entity names – person, places, organization etc  Identify the boundaries.
Integrated Digital Event Web Archive and Library (IDEAL) and Aid for Curators Archive-It Partner Meeting Montgomery, Alabama Mohamed Farag & Prashant Chandrasekar.
Event Extraction Using Distant Supervision Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher D. Manning, Daniel Jurafsky 30 May 2014 Language.
Ain't It A Shame 1-4 Aint it a shame to work on Sunday, Aint it a shame, (a working shame,) Aint it a shame to work on Sunday, Aint it a shame, (a working.
Literary Style Classification with Deep Linguistic Features Hyung Jin Kim Minjong Chung Wonhong Lee.
Shallow Processing: Summary Shallow Processing Techniques for NLP Ling570 December 7, 2011.
2012 CALENDAR. JANUARY 2012 Sunday 日 Monday 月 Tuesday 火 Wednesday 水 Thursday 木 Friday 金 Saturday 土
Sentence Classifier for Helpdesk s Anthony 6 June 2006 Supervisors: Dr. Yuval Marom Dr. David Albrecht.
Saturday May 02 PST 4 PM. Saturday May 02 PST 10:00 PM.
Arabic Natural Language Processing: P-Stemmer, Browsing Taxonomy, Text Classification, RenA, ALDA, and Template Summaries — for Arabic News Articles Tarek.
Jan 4 th 2013 Event Extraction Using Distant Supervision Kevin Reschke.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
CS 5604 Spring 2015 Classification Xuewen Cui Rongrong Tao Ruide Zhang May 5th, 2015.
ARISE Summer Institute UC Davis June 17-June 22, 2011.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
What day is today? What`s the date?. Sunday Monday Tuesday Wednesday Thursday Friday Saturday What day is today?
CS 6604 Middle Term Report Computational Linguistics PJ -Explore Correlation between Newswires and Twitter by Tianyu Geng, Wei Huang, Ji Wang, and Xuan.
Days of the week instructions. Children can; Order the days of the week. Use sentence strips and activity cards to write sentences about what they do on.
A1 Visual C++.NET Intro Programming in C++ Computer Science Dept Va Tech August, 2002 © Barnette ND & McQuain WD 1 Quick Introduction The following.
4A : Unit 7 Busy Lizzie She has piano lessons on Sundays.
Lecture 6 Hidden Markov Models Topics Smoothing again: Readings: Chapters January 16, 2013 CSCE 771 Natural Language Processing.
Reducing Noise CS5604: Final Presentation Xiangwen Wang, Prashant Chandrasekar.
1 Automating Slot Filling Validation to Assist Human Assessment Suzanne Tamang and Heng Ji Computer Science Department and Linguistics Department, Queens.
WELCOME! Mrs. Krsolovic Grade 3 Room 201. Teacher’s Bio Starting my 13th year here at Saint Gabriel School Goal – to work together in building confident.
Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Project Final Presentation – Dec. 6, 2012 CS 5604 : Information Storage and Retrieval Instructor: Prof. Edward Fox GTA : Tarek Kanan ProjArabic Team Ahmed.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
PRIS at Slot Filling in KBP 2012: An Enhanced Adaboost Pattern-Matching System Yan Li Beijing University of Posts and Telecommunications
DATE# OF OCCURRENCES ADDITIONAL COMMENTS Frequency Data Collection Form Name: ____________________________________________ Target Behavior: ____________________________________.
TEXT ANALYTICS - LABS Maha Althobaiti Udo Kruschwitz Massimo Poesio.
UNIT 12 A SOCCER FAN’S WEBSITE. VOCABULARY, GRAMMAR SPEAKING  Write the time expressions in the correct columns. Last Friday afternoonthe evening Thursday.
Medical Information Retrieval: eEvidence System By Zhao Jin Mar
ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4, 2012 Mohamed M. G. Farag Mohammed Saquib Khan Prasad Krishnamurthi.
Week of: October 13 th – October 16 th News to Know CONFERENCES: Next Tuesday, Wednesday and Thursday are our first round of parent/teacher conferences!
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
Observe and Report Sandy Graf April 12, Keeping Schools Safe What is safety? Merriam-Webster Dictionary defines safety as the condition of being.
Teaching Big Data Through Problem-Based Learning Richard Gruss, Business Information Technology, Virginia Tech Tarek Kanan Software Engineering Department.
Class Time Modules for the Semester Calendar Additional Senate Discussion
Expense Report Total Miles MONDAY 5 5 TUESDAY WEDNESDAY 6 6 THURSDAY FRIDAY SATURDAY 0 SUNDAY0 TOTAL 23.
Happy Days by Charles Fox and Norman Gimbel PowerPoint by Camille Page.
Twitter as a Corpus for Sentiment Analysis and Opinion Mining
Information Storage and Retrieval(CS 5604) Collaborative Filtering 4/28/2016 Tianyi Li, Pranav Nakate, Ziqian Song Department of Computer Science Blacksburg,
PART OF SPEECH DEFINITION SYNONYM SENTENCE NAME TEACHER AND DATE1 WORDS OF THE WEEK.
PART OF SPEECH DEFINITION SYNONYM SENTENCE NAME TEACHER AND DATE1 WORDS OF THE WEEK.
Big Data Processing of School Shooting Archives
CS6604 Digital Libraries Global Events Team Final Presentation
Collection Management Webpages
Collection Management
CRF &SVM in Medication Extraction
Floods Joe Acanfora, Myron Su, David Keimig and Marc Evangelista
CLA Team Final Presentation CS 5604 Information Storage and Retrieval
Text Classification CS5604 Information Retrieval and Storage – Spring 2016 Virginia Polytechnic Institute and State University Blacksburg, VA Professor:
Visualizations of School Shootings
Clustering and Topic Analysis
CS 5604 Information Storage and Retrieval
Event Focused URL Extraction from Tweets
Collection Management Webpages Final Presentation
Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.
News Event Detection Website Joe Acanfora, Briana Crabb, Jeff Morris
Computational Linguistic Analysis of Earthquake Collections
Tweet URL Analysis Guoxin Sun, Kehan Lyu, Liyan Li
Katrina Database SearchKat
Through the Fire and Flames
CS5984:Big Data Text Summarization
Team 7 → Final Presentation
Introduction to Sentiment Analysis
Python4ML An open-source course for everyone
Presentation transcript:

shootings CS-4984 Computational Linguistics Arjun Chandrasekaran Saurav Sharma Peter Sulucz Jonathan Tran December 2014 Virginia TechBlacksburg, VA

Outline Problem Statement Introduction Approach Sanitizing Data Filtering Data Generating Paragraph Summaries Results Improvements Interesting Findings Lessons Learned Acknowledgements

Problem Statement Generate an easily readable summary of an event given a large collection of webpages

Introduction Analyze Results Conclude and Communicate Explore Problem Space Develop a Solution Strategy Implement Iterate

Data Set Team J – Shootings Small Collection Adam Lanza fatally shoots 20 children and 6 adults at Sandy Hook Elementary School in Newtown Connecticut Big Collection Jared Lee Loughner shoots Congresswoman Gabriel Giffords in the head, in Tucson Arizona

Sanitizing Data Stop-words List Stop-words Using Default NLTK stop-words Website specific words added to a comprehensive stop- words list News Points Comment Password Login

Sanitizing Data Removing Files Lower case - consistency Positively classified documents – Provided strong training set Removed suspicious sentences – Sentences with words over 16 characters. Removed tiny files – primarily files with hyperlinks Small CollectionLarge Collection Initial Size Size after Sanitization Percent reduction41.9%85.6%

Sanitizing Data Classifying Small Event: 342 training files 160 positive files ClassifierAccuracy Time taken to train (s) Naïve Bayes Decision Tree MaxEntropy SVM ClassifierAccuracy Time taken to train (s) Naïve Bayes Decision Tree MaxEntropy SVM16.02  Training the Small Collection  Training the Big Collection Big Event: 77 training files 20 positive files

Named Entity Recognition Named EntitiesTags lanzaPerson connecticutLocation newtownLocation sandyPerson hookPerson

Process Extract Data Filter Data Apply Grammar

Generating Paragraph Summaries Regular Expressions Regular expressions to match our regular grammar Examples: ( shooter|murderer|killer|gunman)[\s,]+([a-z]+\s[a-z]+) (([0-9]+\s(injured|wounded|hurt|damaged|lived))) Date (Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday) Time of day (Morning | Afternoon | Evening | Night)

Names POS tagging (backoff tagger) Stop-words Numbers Extract and Verify value Verify if 1 < = date <= 31 Stop-words Generating Paragraph Summaries Filtering

Generating Paragraph Summaries Regular Grammar Non-TerminalsTerminals Summary introOn the of, opened fire in weapon descriptionThe fired rounds out of his. killed (children | people) lost their lives. wounded of the victims were hurt, and are being treated for their injuries. age rangeThe victims were between the ages of and weapon time(morning | afternoon | night | evening) date, day of week(monday | tuesday | wednesday | thursday | friday | saturday | sunday) shooter location number[0-9]+ word[a-z]+

Hadoop Job Run Timings JobTiming Run 174 seconds Run 270 seconds Run 3100 seconds Average:81.3 seconds JobTiming Run 1231 seconds Run 2213 seconds Run 3226 seconds Average:223.3 seconds Big Collection Small Collection

Results Newton School Shooting: On the morning of saturday, december 15, adam lanza opened fire in connecticut. The gunman fired 100 rounds out of his rifle. 27 children lost their lives. 2 of the children were hurt, and are being treated for their injuries. The victims were between the ages of 6 and 7. Tucson Gabrielle Giffords Shooting: On the night of sunday, january 9, jared lee opened fire in tucson. The suspect fired 5 rounds out of his rifle. 6 people lost their lives. 32 of the people were hurt, and are being treated for their injuries. The victims were between the ages of 40 and 50.

Improvements Common Sense Verification Utilizing Dependencies and Context Better Candidacy Selection Iterative Slot Filling Better Cleaning up Big Collection Capitalized NEs More informative summary

Interesting Findings Recall vs. Precision Selecting the best candidate Killer's name vs victims' names 100 vs. Hundreds POS vs. NER

Lessons Learned Allocate time for iterations Cleaning a collection is vital to getting good results Small files can be considered noise Simplicity is key Reuse Code File Structure for Organization

Acknowledgements Dr. Edward A. Fox Xuan Zhang Tarek Kanan Mohamed Magdy Gharib Farag With support from NSF DUE and IIS

Questions ? Arjun Chandrasekaran Saurav Sharma Peter Sulucz Jonathan Tran