Big Data Processing of School Shooting Archives

Presentation transcript:

Big Data Processing of School Shooting Archives
Pranav Nakate, Dr. Edward Fox
Independent Study (CS 5974)
Computer Science, Virginia Tech, Blacksburg, Virginia 24060
This material is based upon work supported by the National Science Foundation under Grant No. IIS-1319578: SMALL: IDEAL.

Goals
- Help Dr. Shoemaker (Professor of Sociology and co-PI on IDEAL) with his research on school shootings
- Collect, clean, and organize existing webpage and tweet collections to make them searchable for researchers
- Remove unnecessary and unrelated content from each collection
- Remove stop words and profane words from the content of each collection

Collections
- News articles about past school shootings
- Tweets about school shootings
- The collections contain:
  - Noise in the webpages / tweets
  - Stop words
  - Profane words
  - Broken pages
  - Duplicate pages
  - Useful webpages and tweets
- WARC file format

Task Pipeline
- Input: WARC
- Stages: Extract Webpage Locations, Clean Page Content, Create Sample Sets (Positive Samples, Negative Samples), Train Classifier, Classify Collection
- Processing steps: Remove Duplicates; Remove Stop Words, Profanity; Word Lemmatization
- Classifiers: SVM, Naïve Bayes
- Output: Solr
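The Remove Duplicates step can be illustrated with exact-duplicate detection by hashing normalized page text. The slides do not say how duplicates were detected, so the hashing approach, the helper name `remove_duplicates`, and the case and whitespace normalization are all assumptions made for this sketch.

```python
import hashlib


def remove_duplicates(pages):
    """Keep the first page seen for each distinct normalized text.

    `pages` is an iterable of (url, text) pairs. Hypothetical helper: the
    original slides do not describe the deduplication method.
    """
    seen = set()
    unique = []
    for url, text in pages:
        # Collapse whitespace and case so trivial formatting differences
        # do not hide an otherwise identical page.
        normalized = " ".join(text.lower().split())
        digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((url, text))
    return unique
```

Hashing only catches pages whose text is identical after normalization; near-duplicates with small edits would need a different technique.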

Collection Statistics
Collection                    HTML   non-HTML  non-English  Duplicates  non-English %  Duplicates %
Northern Illinois University  73307  33175     766          31619       1.04%          43.13%
Alabama University            30970  4807      76           11659       0.25%          37.65%
Youngstown Shooting           11697  13609     210          4549        1.80%          38.89%
Brazilian School Shooting     3995   12298     209          1813        5.23%          45.38%
Norway Shooting               10321  36093     --           3724        --             36.08%
Connecticut School Shooting   11710  32315     698          5657        5.96%          48.31%
Percentages are relative to the HTML page count.
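The two percentage figures in each row appear to be the non-English and duplicate counts divided by the HTML page count. A short check, using three rows copied from the statistics above, reproduces the listed values; the helper name `percent_of` is an assumption for illustration.

```python
def percent_of(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# (HTML pages, non-English pages, duplicate pages) copied from the slide.
collections = {
    "Northern Illinois University": (73307, 766, 31619),
    "Alabama University": (30970, 76, 11659),
    "Connecticut School Shooting": (11710, 698, 5657),
}

for name, (html, non_english, duplicates) in collections.items():
    print(name, percent_of(non_english, html), percent_of(duplicates, html))
```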

Webpage Cleaning
- Tools: Readability, Beautiful Soup, NLTK LancasterStemmer, Regular Expressions (Python)
- Word lists: Stop Words, Profane Words
- Flow: Raw Page -> Extract HTML -> Extract Main Body -> Word Lemmatization -> Regular Expressions -> Cleaned Content
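A minimal sketch of the cleaning flow is shown here using only the Python standard library. It is not the project's code: `html.parser` stands in for Readability and Beautiful Soup, the stop-word and profanity sets are tiny placeholders, and the stemming step (NLTK LancasterStemmer on the slide) is left out. The names `clean_page`, `STOP_WORDS`, and `PROFANE_WORDS` are assumptions.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Placeholder word lists; a real run would load full lists from files.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "at", "was", "is", "to"}
PROFANE_WORDS = {"darn"}


def clean_page(raw_html, stop_words=STOP_WORDS, profane_words=PROFANE_WORDS):
    """Return a list of lowercase tokens with noise words removed."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Keep alphabetic tokens only; digits and punctuation are dropped.
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in stop_words and t not in profane_words]
```

Unlike Readability, this stand-in keeps all visible text (navigation, footers) instead of isolating the main body, so it only illustrates the order of the steps.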

Create Training Data
- Automated script
  - Input: collection of pages
  - Output: positive sample file and negative sample file
- Take a sample of pages from the collection
- Display the content of the sample to the user
- Label each sample as positive or negative (manually, by the user)
- Store in positive and negative sample files
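The labeling loop described on this slide could look roughly like the following. The function name `label_samples`, the file names, the one-page-per-line storage format, and the `p`/`n` answer convention are assumptions; the slides do not show the actual script.

```python
import random


def label_samples(pages, sample_size, ask=input, seed=0,
                  positive_path="positive.txt", negative_path="negative.txt"):
    """Show a random sample of pages and store each under the user's label.

    `pages` is a list of cleaned page strings. `ask` is called with a prompt
    and should return "p" (positive) or "n" (negative); any other answer
    skips the page. All names and file paths are illustrative assumptions.
    """
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    counts = {"p": 0, "n": 0}
    with open(positive_path, "a", encoding="utf-8") as pos, \
            open(negative_path, "a", encoding="utf-8") as neg:
        for page in sample:
            print(page[:500])  # show the start of the page to the user
            answer = ask("Relevant? [p/n] ").strip().lower()
            if answer == "p":
                pos.write(page.replace("\n", " ") + "\n")
                counts["p"] += 1
            elif answer == "n":
                neg.write(page.replace("\n", " ") + "\n")
                counts["n"] += 1
    return counts
```

Passing `ask` as a parameter keeps the manual step replaceable, which makes the loop easy to exercise without a person at the keyboard.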

Sample Set Size
What should the size of the positive and negative sample sets be?
- Number of unique documents in the collection
- Average length of relevant and non-relevant pages
- Classifier training and accuracy with the existing size of the sample sets
- If accuracy is low (below 70%) for all values of the parameter K, add more positive and negative samples
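The 70% rule can be written as a small check. This sketch assumes that K is the number of features kept during feature selection and that accuracies are collected in a dictionary keyed by K; neither detail is stated on the slide, and `needs_more_samples` is a hypothetical name.

```python
def needs_more_samples(accuracy_by_k, threshold=0.70):
    """Return True when test accuracy is below `threshold` for every K.

    `accuracy_by_k` maps each tried K to the test accuracy obtained with
    it. The dictionary layout is an assumption made for illustration.
    """
    return all(acc < threshold for acc in accuracy_by_k.values())


print(needs_more_samples({10: 0.55, 100: 0.64, 1000: 0.68}))
print(needs_more_samples({10: 0.55, 100: 0.81}))
```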

Classifier Training
- Input: positive and negative sample files
- 75% of positive samples + 75% of negative samples for training; the remaining 25% as test data
- Feature selection: List of Documents -> CountVectorizer -> TfidfTransformer -> SelectKBest
- Classifiers: SVM, Naïve Bayes
- Calculate accuracy on training data and test data
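The training setup maps naturally onto a scikit-learn pipeline. The sketch below is one plausible arrangement, not the project's code: the slide names CountVectorizer, TfidfTransformer, and SelectKBest, but the chi-squared scoring function, the use of `LinearSVC` for the SVM and `MultinomialNB` for Naïve Bayes, the value of `k`, and the toy documents are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def train_and_score(positive_docs, negative_docs, classifier, k=5, seed=0):
    """Train on 75% of each class and report train and test accuracy."""
    positive_docs = list(positive_docs)
    negative_docs = list(negative_docs)
    docs = positive_docs + negative_docs
    labels = [1] * len(positive_docs) + [0] * len(negative_docs)
    # stratify applies the 75/25 split within each class.
    x_train, x_test, y_train, y_test = train_test_split(
        docs, labels, test_size=0.25, stratify=labels, random_state=seed)
    model = Pipeline([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("select", SelectKBest(chi2, k=k)),  # chi2 scoring is an assumption
        ("clf", classifier),
    ])
    model.fit(x_train, y_train)
    return {
        "train_accuracy": model.score(x_train, y_train),
        "test_accuracy": model.score(x_test, y_test),
    }


# Toy documents invented for this example only.
positive = [
    "school shooting suspect arrested on campus",
    "police respond to shooting at university",
    "students mourn victims of campus shooting",
    "gunman opens fire at high school",
    "shooting victims remembered at school vigil",
    "university lockdown after shooting reports",
    "campus police identify shooting suspect",
    "school shooting survivors speak to reporters",
]
negative = [
    "recipe for quick weeknight pasta dinner",
    "local team wins basketball championship game",
    "new smartphone released with larger screen",
    "weather forecast predicts rain this weekend",
    "stock market closes higher on earnings",
    "tips for growing tomatoes in your garden",
    "movie review of the summer blockbuster",
    "city council approves new parking rules",
]

for name, clf in [("SVM", LinearSVC()), ("Naive Bayes", MultinomialNB())]:
    print(name, train_and_score(positive, negative, clf))
```

Wrapping the steps in a `Pipeline` fits the vectorizer and the feature selector on the training split only, so the test accuracy is not inflated by vocabulary seen in the test pages.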

Results - 1

Results - 2

Results - 3

Results - 4

Results - 5

Results - 6

F1 Measure: SVM Classifier
Average precision, recall, and F1-score with total support values. Rows listing fewer than three scores are incomplete in the source.
Collection                    Precision / Recall / F1-score  Support
Northern Illinois University  0.98                           81
Alabama University            0.96                           56
Youngstown Shooting           0.88 / 0.87                    55
Brazilian School Shooting     0.84 / 0.81                    --
Norway Shooting               0.79 / 0.69 / 0.66             113
Connecticut School Shooting   0.92 / 0.91                    100

F1 Measure: Naïve Bayes Classifier
Average precision, recall, and F1-score with total support values. Rows listing fewer than three scores are incomplete in the source.
Collection                    Precision / Recall / F1-score  Support
Northern Illinois University  0.83 / 0.74 / 0.72             81
Alabama University            0.82 / 0.75 / 0.7              56
Youngstown Shooting           0.26 / 0.51 / 0.34             55
Brazilian School Shooting     0.53 / 0.73 / 0.61             --
Norway Shooting               0.65 / 0.62                    113
Connecticut School Shooting   --                             --
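The averaged scores reported above read like support-weighted averages over the two classes, in the style of a classification report. That interpretation is an assumption. The sketch below computes such averages from true and predicted labels; `weighted_report` is a hypothetical helper.

```python
from collections import Counter


def weighted_report(y_true, y_pred):
    """Return support-weighted precision, recall, F1 and the total support."""
    labels = sorted(set(y_true))
    support = Counter(y_true)
    total = len(y_true)
    avg_p = avg_r = avg_f = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support[label]
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # Weight each class by its share of the true labels.
        weight = support[label] / total
        avg_p += weight * precision
        avg_r += weight * recall
        avg_f += weight * f1
    return round(avg_p, 2), round(avg_r, 2), round(avg_f, 2), total


# Hypothetical test set: 28 pages of one class, 27 of the other, and a
# classifier that assigns every page to the larger class.
y_true = [1] * 28 + [0] * 27
y_pred = [1] * 55
print(weighted_report(y_true, y_pred))  # (0.26, 0.51, 0.34, 55)
```

The hypothetical case happens to match the Youngstown Shooting row for Naïve Bayes (0.26, 0.51, 0.34, support 55). A classifier that puts every test page in a single class is therefore one explanation consistent with that row, though the slides do not confirm it.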

In the pipeline…
- Upload the final classified pages to Solr
- Histogram of word count vs. page count in each collection
- Stop word and profane word statistics

Future Work
- New classification features
  - Page title
  - Word count of the page
- K-fold cross-validation
- Display top K features
- Process other file types such as PDF and TXT
- Paragraph extraction and classification
- Move deduplication to the start of the processing pipeline
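For the planned K-fold cross-validation, the index bookkeeping can be sketched without any library. The generator name `k_fold_indices` and the contiguous, unshuffled folds are assumptions for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for train, test in k_fold_indices(10, 3):
    print(len(train), test)
```

Because positive and negative samples are stored in separate files, the combined sample list would need to be shuffled before folding this way; otherwise some folds would contain only one class.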

Thank you! Questions?