Rdoc2vec Jake Clark, Austin Cooke, Steven Rolph, Stephen Sherrard

Presentation transcript:

Rdoc2vec
Jake Clark, Austin Cooke, Steven Rolph, Stephen Sherrard
CS4624: Multimedia, Hypertext, and Information Access
Final Presentation
Dr. Edward A. Fox
Virginia Tech, Blacksburg, VA 24061
Surge 109, 5/2/2017

Parsing and Neural Network
Added functionality:
- Allows for custom stop words
- Can trim the vocabulary of words below a certain frequency
Reduces the size of the problem space:
- Fewer weights to keep track of, which reduces the size of the neural network [1]
- Fewer weights to train, which reduces the time spent training [1]
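A minimal sketch in R of the vocabulary-trimming idea, assuming a simple character vector of tokens; the helper name build_vocab and its thresholds are illustrative, not the project's actual API:

```r
# Minimal sketch of vocabulary trimming with custom stop words and a
# minimum frequency threshold (build_vocab is a hypothetical helper).
build_vocab <- function(tokens, stop_words = character(0), min_count = 2) {
  counts <- table(tokens)                              # word frequencies
  counts <- counts[!(names(counts) %in% stop_words)]   # drop custom stop words
  counts <- counts[counts >= min_count]                # drop infrequent words
  names(counts)                                        # remaining vocabulary
}

tokens <- c("the", "cat", "sat", "on", "the", "mat", "cat", "cat")
build_vocab(tokens, stop_words = c("the", "on"), min_count = 2)
# [1] "cat"
```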

Building Our Own Neural Network
The basic structure of a neural network:
Hidden = Weight1 * Input1 + Weight2 * Input2 + ...
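The same weighted sum written out in R as a small sketch; the sizes and random weights are placeholders:

```r
# Each hidden unit is a weighted sum of the inputs:
# Hidden = Weight1 * Input1 + Weight2 * Input2 + ...
inputs  <- c(1, 0, 0, 0)                              # example input (vocabulary size 4)
weights <- matrix(runif(4 * 3), nrow = 4, ncol = 3)   # 4 inputs -> 3 hidden units
hidden  <- as.vector(t(weights) %*% inputs)           # weighted sum per hidden unit
hidden
```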

Building Our Own Neural Network
These operations can be viewed as matrix operations, with each input word encoded as a "one-hot" vector.
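A short sketch of why the one-hot encoding matters: the matrix product simply selects one row of the weight matrix, so no full multiplication is needed. Sizes here are illustrative:

```r
vocab_size  <- 5
hidden_size <- 3
W <- matrix(rnorm(vocab_size * hidden_size), vocab_size, hidden_size)

one_hot <- c(0, 0, 1, 0, 0)       # "one-hot" vector for the third word
as.vector(one_hot %*% W)          # full matrix multiplication
W[3, ]                            # equivalent: just look up the third row
```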

Building Our Own Neural Network
Back propagation is the training phase; it is still incomplete and untested.
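Since the project's training code is unfinished, here is only a hedged, minimal sketch of a single back-propagation step for one sigmoid unit, to illustrate the idea rather than the actual implementation:

```r
# One gradient-descent update for a single sigmoid output unit
# (illustrative only; not the project's training code).
sigmoid <- function(x) 1 / (1 + exp(-x))

x      <- c(1, 0, 1)           # input vector
w      <- c(0.2, -0.1, 0.4)    # current weights
target <- 1                    # desired output
lr     <- 0.1                  # learning rate

out  <- sigmoid(sum(w * x))                     # forward pass
grad <- (out - target) * out * (1 - out) * x    # gradient of squared error w.r.t. w
w    <- w - lr * grad                           # weight update against the gradient
w
```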

Sample Documents
- Eastman was unable to provide their document sets due to proprietary concerns
- Scraped Wikipedia instead, using the Beautiful Soup library and the Wikipedia Python library
- Ran into bluescreens from running out of memory
- Collected new, smaller document sets

Testing Results (Virginia)
- Achieved decent results (similarity < .9)
- Ran into memory issues: when Windows runs low on memory it begins to kill key processes, and the R error message we saw typically indicates that this has occurred

Testing Results (Virginia2)
- Similarity of .931 (original) vs. .307 (truncated)
Moving forward:
- Acquire additional computing resources
- Home in on the size and quantity of documents that can be tested (get more information from the client)
- Compare our Rdoc2vec results against the existing gensim doc2vec implementation once meaningful data has been collected and stored

Saving Results
- Results saved to a .csv file
- Can save a new file, or append to an existing file and return the full data set
- Built using the 'readr' library for increased performance: 'readr' write_csv is twice as fast as base R 'write.csv' [5]
- Future work: extend to handle other file types
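A small sketch of this saving step with 'readr'; the file name and the contents of the data frame are examples, not the script's actual output format:

```r
library(readr)

results <- data.frame(doc1 = "cat.txt", doc2 = "dog.txt", similarity = 0.93)

write_csv(results, "results.csv")                  # save a new file
write_csv(results, "results.csv", append = TRUE)   # or append to an existing one
read_csv("results.csv")                            # return the full data set
```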

Plotting Results
t-SNE algorithm and reducing dimensions:
- Allows visualization of high-dimensional data in 2D and 3D
- The 'Rtsne' package uses the Barnes-Hut-SNE algorithm
- Barnes-Hut: O(n log n) [6]; baseline t-SNE: O(n²) [6]
- Future work: implement using the 'Rtsne' R package
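A hedged sketch of what this planned plotting step could look like with 'Rtsne'; the random matrix below stands in for real document vectors:

```r
library(Rtsne)

doc_vectors <- matrix(rnorm(50 * 100), nrow = 50)   # 50 documents x 100 dimensions (dummy data)
tsne <- Rtsne(doc_vectors, dims = 2, perplexity = 10, check_duplicates = FALSE)
plot(tsne$Y, xlab = "t-SNE 1", ylab = "t-SNE 2",
     main = "Document vectors reduced to two dimensions")
```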

t-SNE Example: the 2,000 most common English words, reduced from 300 dimensions to 2 [6]

Lessons Learned
- Timeline / schedule: attempting both DBOW and Distributed Memory
- Research: finding a balance between research and decision making
- Better defining goals: be more realistic about scope

Demo: Running the Script
- First we build a shared vocabulary
- Then we create a document vector for each document
- Finally we get a list containing the cosine similarity between two documents
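A minimal sketch of that final similarity step in R; the document vectors here are made-up examples:

```r
# Cosine similarity between two document vectors.
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

doc_vec_1 <- c(0.12, 0.84, -0.33, 0.51)
doc_vec_2 <- c(0.10, 0.80, -0.20, 0.47)
cosine_similarity(doc_vec_1, doc_vec_2)
```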

References
[1] https://amsterdam.luminis.eu/2017/01/30/implementing-doc2vec/
[2] https://en.wikipedia.org/wiki/Cat
[3] https://en.wikipedia.org/wiki/Dog
[4] https://en.wikipedia.org/wiki/Education
[5] http://www.sthda.com/english/wiki/fast-writing-of-data-from-r-to-txt-csv-files-readr-package
[6] http://learningaboutdata.blogspot.com/2014/06/plotting-word-embedding-using-tsne-with.html
[7] https://tex.stackexchange.com/questions/162326/drawing-back-propagation-neural-network

Acknowledgements
Dr. Edward Fox, CS 4624 Professor, fox@vt.edu
Don Sanderson, Service Manager, Marketing Solutions, Eastman Chemical Company, donsanderson@eastman.com
Adam Spannbauer, R Programmer and Data Scientist, adamspannbauer@eastman.com