Team 7 → Final Presentation

Slides:



Advertisements
Similar presentations
ThemeInformation Extraction for World Wide Web PaperUnsupervised Learning of Soft Patterns for Generating Definitions from Online News Author Cui, H.,
Advertisements

Entity-Centric Topic-Oriented Opinion Summarization in Twitter Date : 2013/09/03 Author : Xinfan Meng, Furu Wei, Xiaohua, Liu, Ming Zhou, Sujian Li and.
1 Multi-topic based Query-oriented Summarization Jie Tang *, Limin Yao #, and Dewei Chen * * Dept. of Computer Science and Technology Tsinghua University.
Topic Extraction From Turkish News Articles Anıl Armağan Fuat Basık Fatih Çalışır Arif Usta.
Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.
Presented by Zeehasham Rasheed
Chapter 5: Information Retrieval and Web Search
From Words to Meaning to Insight Julia Cretchley & Mike Neal.
Utilising software to enhance your research Eamonn Hynes 5 th November, 2012.
The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation SEASR Overview Loretta Auvil and Bernie Acs National.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Virtual Mechanics Fall Semester 2009
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.
Topic Modelling: Beyond Bag of Words By Hanna M. Wallach ICML 2006 Presented by Eric Wang, April 25 th 2008.
Chapter 6: Information Retrieval and Web Search
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
Text Analytics in Action: Using Text Analytics as a Toolset TBC 4:15 p.m. - 5:00 p.m. Marjorie Hlava Semantic enrichment / Semantic Fingerprinting.
What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.
 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.
CSC 594 Topics in AI – Text Mining and Analytics
ITCS 6265 Details on Project & Paper Presentation.
Nuhi BESIMI, Adrian BESIMI, Visar SHEHU
TRANS: T ransportation R esearch A nalysis using N LP Technique S Hyoungtae Cho, Melissa Egan, Ferhan Ture Final Presentation December 9, 2009.
Understanding unstructured texts via Latent Dirichlet Allocation Raphael Cohen DSaaS, EMC IT June 2015.
Making Sense of Large Volumes of Unstructured Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo.
Topic Modeling for Short Texts with Auxiliary Word Embeddings
VisIt Project Overview
Big data classification using neural network
AP CSP: Cleaning Data & Creating Summary Tables
AP CSP: Finding a Data Story
Sentiment analysis algorithms and applications: A survey
Text Mining CSC 600: Data Mining Class 20.
CSE5544 Final Project Interactive Visualization Tool(s) for IEEE Vis Publication Exploration and Analysis Team Name: Publication Miner Team Members:
CSE5544 Final Project Interactive Visualization Tool(s) for IEEE Vis Publication Exploration and Analysis Team Name: Publication Miner Team Members:
Application of Classification and Clustering Methods on mVoC (Medical Voice of Customer) data for Scientific Engagement Yingzi Xu, Department of Statistics,
A Deep Learning Technical Paper Recommender System
Personalized Social Image Recommendation
Accelerating Benchmark Testing Through Automation
MID-SEM REVIEW.
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Data Mining: Concepts and Techniques Course Outline
The Team Ernesto Cortes Kipp Dunn Sar Gregorczyk Alex Schmidt
NRV Tweets Midterm Presentation VT CS4624, Blacksburg, VA
Multi-Dimensional Data Visualization
Fabiano Ferrari Software Engineering Federal University of São Carlos
NETWORK-BASED MODEL OF LEARNING
Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.
Computational Linguistic Analysis of Earthquake Collections
Through the Fire and Flames
Chapter 7 Lexical Analysis and Stoplists
CSE 635 Multimedia Information Retrieval
CS5984:Big Data Text Summarization
Hybrid Summarization of Dakota Access Pipeline Protests (NoDAPL)
Abstractive Text Summarization of the Parkland Shooting Collection
Big Data Text Summarization Westminster Attack
Text Mining & Natural Language Processing
Automatic Summarization of News Articles about Hurricane Florence
Introduction to Text Analysis
Text Mining CSC 576: Data Mining.
CS4984/CS598: Big Data Text Summarization
Word embeddings (continued)
MississaugaTalks! Saif Shaikh March 5, 2016 Code and the City
Welcome! Knowledge Discovery and Data Mining
From Unstructured Text to StructureD Data
Presented By: Grant Glass
Neal Kurande, WinaGodwin Anyanwu Jr., Adam Chau
Presentation transcript:

Team 7 → Final Presentation CS4984/CS5984 Big Data Text Summarization November 29, 2018 Blacksburg, Va 24061 Anuj Arora Jixiang Fan Yi Han Shuai Liu Chreston Miller

Roadmap Introduction Project Management Motivation Golden Standard Article Collection Pre-Processing Data Deep Learning for Summarization Post-Processing Lessons Learned Conclusion give highlights

Introduction

Project Management …and Weekly Team Meetings Slack Team Drive Code Github image from: https://goo.gl/images/fsZaaq Slack Team Drive Code …and Weekly Team Meetings

Motivation How do we summarize multiple articles, when manually reading through each of them becomes impractical?

Golden Standard Create outline Use Solr and Google search related material Update outline Built-in summary tools in Mac OS Finish first draft Read feedback Update summary

Article Collection - NeverAgain Collection of articles concerned with and focused around school shootings and gun policy The following table shows the article counts before and after the cleanup Dataset Before Noise Removal After Duplicate Removal After noise filtering Duplicate Percentage Total Reduction Percentage NeverAgain 12,111 7,631 3,669 37% 69%

Pre-Processing Raw Data Conversion and Indexing (WARC/CDX to JSON) Duplicated Entries Removal Noise Filtering Iteration 1: Word Filtering Iteration 2: jusText and Word Filtering Iteration 3: jusText and more restrictive Word Filtering Stopwords, Punctuation Removal and Part of Speech Tagging Lemmatization and Tokenization

Exploratory Results After tokenization, we came up with a list of the most frequently occuring words in the cleaned corpus. The top words are shown below Word Frequency gun 25,009 school 21,726 people 12,853 student 12,234 shoot 10,838

(’ORGANIZATION’, ’NRA’) Exploratory Results NLTK’s Named Entity Chunker was used to generate named entities, the most frequent of which are shown below Named Entities Frequency (’ORGANIZATION’, ’NRA’) 4,510 (’GPE’, ’Parkland’) 4,265 (’GPE’, ’Florida’) 2,470 (’PERSON’, ’Trump’) 2,234 (’GPE’, ’U.S.’) 2,233

Topic 1 (General Shootings) Topic Modelling The tokenized sentences were vectorized using the Bag of Words (BOW) approach Gensim was used to implement the LDA (Latent Dirichlet Allocation) algorithm Topic Keywords Topic 0 (Santa Fe) 0.024*"school" + 0.021*"police" + 0.019*"shoot" + 0.012*"people" + 0.012*"kill" + 0.011*"2018" + 0.011*"high" + 0.010*"santa" + 0.010*"fe" + 0.010*"shot" Topic 1 (General Shootings) 0.030*"school" + 0.021*"student" + 0.014*"gun" + 0.011*"people" + 0.010*"shoot" + 0.010*"high" + 0.009*"parkland" + 0.008*"life" + 0.008*"like" + 0.008*"violence" Topic 2 (Policy) 0.025*"gun" + 0.012*"state" + 0.008*"nra" + 0.008*"law" + 0.007*"trump" + 0.006*"people" + 0.006*"year" + 0.005*"use" + 0.005*"right" + 0.005*"firearm"

Deep Learning Method Exploration Separate summaries based on LDA identified topics Pointer-Generator Network Hyperparameter tuning Training data - CNN/Daily Mail Used a pre-trained model - saved much time Processing Local Ubuntu server with a high power graphics card (GPU)

Deep Learning Visualization Generated Summary Highlighted = High Generation Probability

Post-Processing Removing redundancy Fixing flow issues Rearranging sentences Given the separate topics, arranging them in a logical order and creating transitions between them

Lessons Learned GPU acceleration - surprisingly did not need a cluster Software versions are very important Ubuntu Virtual Machine would not work with GitHub This was using Parallels virtualization software How to use Docker How to use Solr Pre-processing large datasets to extract meaningful information Deep Learning architectures

Conclusion Abstractive summarization is challenging Careful pre-processing is necessary for filtering articles Deep learning solutions/approaches are available via open source repositories Provides a working foundation Current methods for abstractive summarization are not a complete solution, however, techniques continue to improve

Questions! Anuj Arora: aarora23@vt.edu Jixiang Fan: jfan12@vt.edu Yi Han: hy@vt.edu Shuai Liu: shuail8@vt.edu Chreston Miller: chmille3@vt.edu