1
A Deep Learning Technical Paper Recommender System
By Janelle Blankenburg
2
Outline
Introduction: Problem Description, Major Objectives
Methodology: Semantic Analysis, Data Preprocessing, Model Training, Network Generation, Three Network Versions, Recommendation Mechanism, Finding and Ranking Similar Papers
Experimental Evaluation: Dataset, Evaluation Metrics
Conclusion 1 min
3
Introduction
4
Problem Description Technical paper recommendation: the general process
1. Identify keywords related to research interests
2. Input keywords into an online textual search service (Google Scholar, CiteSeer, arXiv)
3. Modify the keyword list and repeat 1 min
5
Problem Description Cont.
Simple keyword search is not sufficient
Collaborative Filtering (CF) methods [1]: recommend based on like-minded users
Content-Based Filtering (CBF) methods [2]: recommend based on previous “purchases”
Content-Based Filtering issues:
Use only the title and abstract of a paper
Use word counts as the basis for similarity models 2 min
6
Major Objectives Verify that using content from the body of a paper can result in better recommendations than using only the title and abstract
Develop a novel recommender approach which utilizes deep learning and network science fundamentals:
Uses semantic analysis of text instead of word counts
Consists of meaningful relations between recommended papers
Visualizes relations through generated networks 1 min
7
Methodology
8
Semantic Analysis Simple word counts are not sufficient
Use semantic analysis to extract meaning from the full text of a paper
Compare content across various papers to get similarity
9
Data Pre-Processing Given the source code of a set of papers written in LaTeX [3]
Parse the source code to extract the following: Title, Abstract, Other sections via “\section” 1 min
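The “\section” parsing step can be sketched with a regular expression. The snippet below is a minimal stdlib illustration (the LaTeX source and section bodies are made up, and a real parser would also pull out the title and abstract):

```python
import re

# Hypothetical minimal LaTeX source for one paper in the dataset.
latex_src = r"""
\section{Introduction}
Intro text.
\section{Related Work}
Prior art.
\section{Conclusion}
Closing text.
"""

def extract_sections(src):
    """Split LaTeX source on section commands; return {title: body}."""
    parts = re.split(r"\\section\{([^}]*)\}", src)
    # parts alternates: [preamble, title1, body1, title2, body2, ...]
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts), 2)}

sections = extract_sections(latex_src)
print(sorted(sections))  # section titles found via "\section"
```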
10
Data Pre-Processing Separate the other sections into the following categories:
Introduction (trivial to identify)
Related Work/Background (if both exist, combine)
Conclusion/Future Work
Methodology (remaining sections assumed to be methodology) 1 min
11
Data Pre-Processing Final categories:
Title, Abstract, Introduction, Related Work, Methodology, Conclusion
Pre-processing through gensim [4]:
Phrase extraction: “machine” and “learning” ⇒ “machine_learning”; output: list of keywords for each paper
Text cleaning: tokenize text into words, remove punctuation, lowercase letters, etc.; output: list of words in sequential order from each category of each paper 2 min
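A rough stdlib sketch of the two pre-processing steps (the deck’s actual pipeline uses gensim’s tokenizer and Phrases model; the sample text and the `min_count` threshold here are invented for illustration):

```python
import re
from collections import Counter

def clean(text):
    """Lowercase, strip punctuation, and tokenize into words in order."""
    return re.findall(r"[a-z]+", text.lower())

def join_phrases(tokens, min_count=2):
    """Join adjacent word pairs seen at least min_count times,
    e.g. 'machine' + 'learning' -> 'machine_learning'."""
    pairs = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and pairs[(tokens[i], tokens[i + 1])] >= min_count:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

toks = clean("Machine learning, and machine learning again!")
print(join_phrases(toks))  # frequent bigrams become single tokens
```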
12
Semantic Analysis Analyze aspects of the text through two approaches:
Word2vec [5] Doc2vec [6] Should be around 15 minutes here!!!
13
Word2Vec Natural Language Processing (NLP) tool
Computes numerical vector representations of words
Allows us to use numerical metrics to perform similarity comparisons between sets of words
Example: king − man + woman = queen 2 min
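The analogy works because word vectors can be compared numerically, typically with cosine similarity. A toy illustration with hand-picked 3-d vectors (real word2vec embeddings are learned from text and have hundreds of dimensions):

```python
import math

# Toy embeddings chosen by hand so the analogy holds; entirely illustrative.
vecs = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.5, 0.1, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "queen": [0.9, 0.8, 0.9],
    "apple": [0.1, 0.0, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# king - man + woman, component-wise.
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
best = max((w for w in vecs if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, vecs[w]))
print(best)  # nearest remaining word to the analogy vector
```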
14
Doc2Vec Also used for NLP; an extension of word2vec
Computes numerical vector representations of documents instead of words
Documents can be: a short 140-character tweet, a single paragraph such as an abstract, or a full article 1 min
15
Semantic Analysis Word2vec model: 1 model
Input: list of keywords from each paper in the dataset
Doc2vec models: 7 models
Title, Abstract, Introduction, Related Work, Methodology, Conclusion, Full text (all 6 categories concatenated together)
Input: list of words from the respective sections for each paper in the dataset 1 min
16
Network Generation Want to verify that including content from the body produces better recommendations
Given pre-processed data and trained models, generate three networks using different combinations of models:
1. Title and abstract models
2. All 7 category-based models
3. All 8 models: the 7 category-based models plus the keyword model 1 min
17
Network Generation Iterate through the dataset, comparing papers pairwise
Use trained models to generate similarity scores between pairs of papers (cosine similarity)
Use scores to create edges in the three networks: if score > threshold, create an edge between the pair of papers
First network: score = average of scores from the title and abstract models
Second network: score = average of scores from the 7 category-based models
Third network: score = average of scores from all 8 models 2 min
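The pairwise edge-creation step might look like the following sketch, assuming the averaged similarity scores have already been computed; the paper names, scores, and the 0.5 threshold are all hypothetical:

```python
from itertools import combinations

# Hypothetical per-pair similarity scores, already averaged over the
# chosen model combination (title+abstract, 7 categories, or all 8).
scores = {
    ("paper_a", "paper_b"): 0.82,
    ("paper_a", "paper_c"): 0.31,
    ("paper_b", "paper_c"): 0.67,
}

def build_network(papers, score, threshold=0.5):
    """Add an undirected edge whenever the averaged score passes threshold."""
    adj = {p: set() for p in papers}
    for u, v in combinations(papers, 2):
        if score.get((u, v), score.get((v, u), 0.0)) > threshold:
            adj[u].add(v)
            adj[v].add(u)
    return adj

net = build_network(["paper_a", "paper_b", "paper_c"], scores)
print(sorted(net["paper_a"]))  # neighbors above the threshold
```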
18
Network Generation After all pairs have been checked, network generation is complete Result is three different network representations of similarities between papers in dataset 1 min
19
Recommendation Mechanism
Goal: Return top m papers from dataset most similar to input paper Given: Input paper from the user Trained similarity models Generated networks 1 min
20
Recommendation Mechanism
Use gensim [4] to obtain the paper most similar to the input paper; this paper becomes the top recommendation
Gather friends (network neighbors) of this paper as candidate recommendations
If there are fewer than m candidates, gather friends of friends
Continue gathering layers of friends until there are at least m candidates 1 min
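The layered gathering of friends is essentially a breadth-first expansion outward from the top recommendation. A sketch over a toy adjacency map (node names and edges are invented):

```python
def gather_candidates(adj, top_paper, m):
    """Collect friends, then friends of friends, and so on, of the top
    recommendation until at least m candidate papers are found."""
    seen = {top_paper}
    candidates = []
    frontier = [top_paper]
    while frontier and len(candidates) < m:
        # Next layer: all unseen neighbors of the current frontier.
        layer = sorted({n for p in frontier for n in adj[p]} - seen)
        seen.update(layer)
        candidates.extend(layer)
        frontier = layer
    return candidates

# Toy similarity network: "t" is the top recommendation.
adj = {"t": {"a", "b"}, "a": {"t", "c"}, "b": {"t"}, "c": {"a"}}
print(gather_candidates(adj, "t", 3))  # friends, then a friend of a friend
```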
21
Recommendation Mechanism
2 min
22
Recommendation Mechanism
Get similarity scores for candidate papers Similarity scoring same as pairwise process used to generate network edges: First network: score = average scores from title and abstract models Second network: score = average scores from 7 category-based models Third network: score = average scores from all 8 models Order candidate papers based on highest similarity Return top m papers as the list of recommendations to user 2 min
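Ranking and truncating the candidate list is then a simple sort on the averaged scores; a sketch with made-up papers and scores:

```python
def recommend(candidates, score, m):
    """Sort candidate papers by averaged similarity score, highest first,
    and return the top m as the recommendation list."""
    ranked = sorted(candidates, key=lambda p: score[p], reverse=True)
    return ranked[:m]

# Hypothetical averaged model scores for the gathered candidates.
scores = {"p1": 0.42, "p2": 0.88, "p3": 0.75, "p4": 0.60}
print(recommend(list(scores), scores, 3))  # top 3 by score
```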
23
Recommendation Mechanism
1 min
24
Experimental Evaluation
Aim to test approach in small, preliminary experiment Manually verify logical recommendations Serves as proof of concept Full-scale experiment (time permitting) Can perform further numerical analysis of recommender system Unable to verify recommendations manually 2 min
25
Dataset Preliminary experiment: Source code from 100 papers from arXiv
10 subareas within computer science Computer vision Robotics Machine learning Graphics Networking Computer security Operating systems Parallel computing Compiler theory Software engineering 10 papers per subarea 1 min
26
Dataset Full-scale experiment:
Specialized dataset: the hep-th portion of arXiv, 29,000 papers
General dataset: crawl arXiv to get papers from all major areas (physics, mathematics, computer science, quantitative finance, electrical engineering); 50,000 papers, 10,000 per area 1 min
27
Evaluation Metrics Want to determine which of the three generated networks provides the best recommendations
Preliminary experiment:
Manually examine the ranking from each network; recommendations should be in the same subarea
Compare the underlying scores for the top m papers from each network; more similar papers should produce better scores
Full-scale experiment:
Average scores for the top m papers
Average number of recommendations in the same area 2 min
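The two full-scale metrics can be sketched as follows (the recommended papers, their scores, and their subareas are invented for illustration):

```python
def eval_metrics(recs, scores, areas, query_area):
    """Average similarity of the top-m recommendations, and the fraction
    that fall in the same subarea as the query paper."""
    avg_score = sum(scores[p] for p in recs) / len(recs)
    same_area = sum(areas[p] == query_area for p in recs) / len(recs)
    return avg_score, same_area

recs = ["p2", "p3", "p4"]
scores = {"p2": 0.88, "p3": 0.75, "p4": 0.60}
areas = {"p2": "robotics", "p3": "robotics", "p4": "graphics"}
print(eval_metrics(recs, scores, areas, "robotics"))
```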
28
Conclusion
29
Conclusion Want to verify that utilizing content from the body of a paper generates better recommendations than only using the abstract and title
Proposed a novel technical paper recommender system:
Utilizes the full content of the paper
Combines deep learning methods with network science foundations
Generalizable, and consists of meaningful relations between papers
Connections between papers can be easily visualized
Uses semantic analysis to generate similarity models instead of relying on word counts 1 min
30
References
[1] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, “GroupLens: an open architecture for collaborative filtering of netnews,” in Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work, pp. 175–186, ACM, 1994.
[2] F. Ricci, L. Rokach, and B. Shapira, “Introduction to recommender systems handbook,” in Recommender Systems Handbook, pp. 1–35, Springer, 2011.
[3] L. Lamport, LaTeX: A Document Preparation System: User’s Guide and Reference Manual. Addison-Wesley, 1994.
[4] R. Řehůřek and P. Sojka, “Software framework for topic modelling with large corpora,” in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50, ELRA, May 2010.
[5] T. Mikolov, K. Chen, G. Corrado, J. Dean, L. Sutskever, and G. Zweig, “word2vec,” 2014.
[6] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1188–1196, 2014.
31
Questions?