Team 7 → Final Presentation

Slides:

Advertisements

Similar presentations

ThemeInformation Extraction for World Wide Web PaperUnsupervised Learning of Soft Patterns for Generating Definitions from Online News Author Cui, H.,

Advertisements

Entity-Centric Topic-Oriented Opinion Summarization in Twitter Date : 2013/09/03 Author : Xinfan Meng, Furu Wei, Xiaohua, Liu, Ming Zhou, Sujian Li and.

1 Multi-topic based Query-oriented Summarization Jie Tang *, Limin Yao #, and Dewei Chen * * Dept. of Computer Science and Technology Tsinghua University.

Topic Extraction From Turkish News Articles Anıl Armağan Fuat Basık Fatih Çalışır Arif Usta.

Sarah Reonomy OSCON 2014 ANALYZING DATA WITH PYTHON.

Presented by Zeehasham Rasheed

Chapter 5: Information Retrieval and Web Search

From Words to Meaning to Insight Julia Cretchley & Mike Neal.

Utilising software to enhance your research Eamonn Hynes 5 th November, 2012.

The SEASR project and its Meandre infrastructure are sponsored by The Andrew W. Mellon Foundation SEASR Overview Loretta Auvil and Bernie Acs National.

DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.

Virtual Mechanics Fall Semester 2009

1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.

GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)

Eric H. Huang, Richard Socher, Christopher D. Manning, Andrew Y. Ng Computer Science Department, Stanford University, Stanford, CA 94305, USA ImprovingWord.

Topic Modelling: Beyond Bag of Words By Hanna M. Wallach ICML 2006 Presented by Eric Wang, April 25 th 2008.

Chapter 6: Information Retrieval and Web Search

LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.

Text Analytics in Action: Using Text Analytics as a Toolset TBC 4:15 p.m. - 5:00 p.m. Marjorie Hlava Semantic enrichment / Semantic Fingerprinting.

What Is Text Mining? Also known as Text Data Mining Process of examining large collections of unstructured textual resources in order to generate new.

 Frequent Word Combinations Mining and Indexing on HBase Hemanth Gokavarapu Santhosh Kumar Saminathan.

CSC 594 Topics in AI – Text Mining and Analytics

ITCS 6265 Details on Project & Paper Presentation.

Nuhi BESIMI, Adrian BESIMI, Visar SHEHU

TRANS: T ransportation R esearch A nalysis using N LP Technique S Hyoungtae Cho, Melissa Egan, Ferhan Ture Final Presentation December 9, 2009.

Understanding unstructured texts via Latent Dirichlet Allocation Raphael Cohen DSaaS, EMC IT June 2015.

Making Sense of Large Volumes of Unstructured Responses K. M. P. N. Jayathilaka Department of Statistics University of Colombo.

Topic Modeling for Short Texts with Auxiliary Word Embeddings

VisIt Project Overview

Big data classification using neural network

AP CSP: Cleaning Data & Creating Summary Tables

AP CSP: Finding a Data Story

Sentiment analysis algorithms and applications: A survey

Text Mining CSC 600: Data Mining Class 20.

CSE5544 Final Project Interactive Visualization Tool(s) for IEEE Vis Publication Exploration and Analysis Team Name: Publication Miner Team Members:

CSE5544 Final Project Interactive Visualization Tool(s) for IEEE Vis Publication Exploration and Analysis Team Name: Publication Miner Team Members:

Application of Classification and Clustering Methods on mVoC (Medical Voice of Customer) data for Scientific Engagement Yingzi Xu, Department of Statistics,

A Deep Learning Technical Paper Recommender System

Personalized Social Image Recommendation

Accelerating Benchmark Testing Through Automation

MID-SEM REVIEW.

Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.

Data Mining: Concepts and Techniques Course Outline

The Team Ernesto Cortes Kipp Dunn Sar Gregorczyk Alex Schmidt

NRV Tweets Midterm Presentation VT CS4624, Blacksburg, VA

Multi-Dimensional Data Visualization

Fabiano Ferrari Software Engineering Federal University of São Carlos

NETWORK-BASED MODEL OF LEARNING

Event Trend Detector Ryan Ward, Skylar Edwards, Jun Lee, Stuart Beard, Spencer Su CS 4624 Multimedia, Hypertext, and Information Access Instructor: Edward.

Computational Linguistic Analysis of Earthquake Collections

Through the Fire and Flames

Chapter 7 Lexical Analysis and Stoplists

CSE 635 Multimedia Information Retrieval

CS5984:Big Data Text Summarization

Hybrid Summarization of Dakota Access Pipeline Protests (NoDAPL)

Abstractive Text Summarization of the Parkland Shooting Collection

Big Data Text Summarization Westminster Attack

Text Mining & Natural Language Processing

Automatic Summarization of News Articles about Hurricane Florence

Introduction to Text Analysis

Text Mining CSC 576: Data Mining.

CS4984/CS598: Big Data Text Summarization

Word embeddings (continued)

MississaugaTalks! Saif Shaikh March 5, 2016 Code and the City

Welcome! Knowledge Discovery and Data Mining

From Unstructured Text to StructureD Data

Presented By: Grant Glass

Neal Kurande, WinaGodwin Anyanwu Jr., Adam Chau

Presentation transcript:

Team 7 → Final Presentation CS4984/CS5984 Big Data Text Summarization November 29, 2018 Blacksburg, Va 24061 Anuj Arora Jixiang Fan Yi Han Shuai Liu Chreston Miller

Roadmap Introduction Project Management Motivation Golden Standard Article Collection Pre-Processing Data Deep Learning for Summarization Post-Processing Lessons Learned Conclusion give highlights

Introduction

Project Management …and Weekly Team Meetings Slack Team Drive Code Github image from: https://goo.gl/images/fsZaaq Slack Team Drive Code …and Weekly Team Meetings

Motivation How do we summarize multiple articles, when manually reading through each of them becomes impractical?

Golden Standard Create outline Use Solr and Google search related material Update outline Built-in summary tools in Mac OS Finish first draft Read feedback Update summary

Article Collection - NeverAgain Collection of articles concerned with and focused around school shootings and gun policy The following table shows the article counts before and after the cleanup Dataset Before Noise Removal After Duplicate Removal After noise filtering Duplicate Percentage Total Reduction Percentage NeverAgain 12,111 7,631 3,669 37% 69%

Pre-Processing Raw Data Conversion and Indexing (WARC/CDX to JSON) Duplicated Entries Removal Noise Filtering Iteration 1: Word Filtering Iteration 2: jusText and Word Filtering Iteration 3: jusText and more restrictive Word Filtering Stopwords, Punctuation Removal and Part of Speech Tagging Lemmatization and Tokenization

Exploratory Results After tokenization, we came up with a list of the most frequently occuring words in the cleaned corpus. The top words are shown below Word Frequency gun 25,009 school 21,726 people 12,853 student 12,234 shoot 10,838

(’ORGANIZATION’, ’NRA’) Exploratory Results NLTK’s Named Entity Chunker was used to generate named entities, the most frequent of which are shown below Named Entities Frequency (’ORGANIZATION’, ’NRA’) 4,510 (’GPE’, ’Parkland’) 4,265 (’GPE’, ’Florida’) 2,470 (’PERSON’, ’Trump’) 2,234 (’GPE’, ’U.S.’) 2,233

Topic 1 (General Shootings) Topic Modelling The tokenized sentences were vectorized using the Bag of Words (BOW) approach Gensim was used to implement the LDA (Latent Dirichlet Allocation) algorithm Topic Keywords Topic 0 (Santa Fe) 0.024*"school" + 0.021*"police" + 0.019*"shoot" + 0.012*"people" + 0.012*"kill" + 0.011*"2018" + 0.011*"high" + 0.010*"santa" + 0.010*"fe" + 0.010*"shot" Topic 1 (General Shootings) 0.030*"school" + 0.021*"student" + 0.014*"gun" + 0.011*"people" + 0.010*"shoot" + 0.010*"high" + 0.009*"parkland" + 0.008*"life" + 0.008*"like" + 0.008*"violence" Topic 2 (Policy) 0.025*"gun" + 0.012*"state" + 0.008*"nra" + 0.008*"law" + 0.007*"trump" + 0.006*"people" + 0.006*"year" + 0.005*"use" + 0.005*"right" + 0.005*"firearm"

Deep Learning Method Exploration Separate summaries based on LDA identified topics Pointer-Generator Network Hyperparameter tuning Training data - CNN/Daily Mail Used a pre-trained model - saved much time Processing Local Ubuntu server with a high power graphics card (GPU)

Deep Learning Visualization Generated Summary Highlighted = High Generation Probability

Post-Processing Removing redundancy Fixing flow issues Rearranging sentences Given the separate topics, arranging them in a logical order and creating transitions between them

Lessons Learned GPU acceleration - surprisingly did not need a cluster Software versions are very important Ubuntu Virtual Machine would not work with GitHub This was using Parallels virtualization software How to use Docker How to use Solr Pre-processing large datasets to extract meaningful information Deep Learning architectures

Conclusion Abstractive summarization is challenging Careful pre-processing is necessary for filtering articles Deep learning solutions/approaches are available via open source repositories Provides a working foundation Current methods for abstractive summarization are not a complete solution, however, techniques continue to improve

Questions! Anuj Arora: aarora23@vt.edu Jixiang Fan: jfan12@vt.edu Yi Han: hy@vt.edu Shuai Liu: shuail8@vt.edu Chreston Miller: chmille3@vt.edu